I experience an R crash ('stack imbalance') when I run the following with verbose=FALSE. Note that I was able to run the code below successfully on an older version of data.table a month or two ago, so I believe this is a recent bug. (Sorry, I don't remember the exact dev version where it was working.)
The problem does not reproduce on a much smaller file. Link to the zip file (the CSV is 350 MB): https://github.com/HughParsonage/ABS-data/blob/master/inbox/SA2-by-DJZ-2011.zip
I occasionally experience different errors. For example,
Error in get(name, envir = ns, inherits = FALSE) : invalid first argument
or
Warning: stack imbalance in '$', 16 then 15
Error: R_Reprotect: only 1 protected item, can't reprotect index -2
# Minimal reproducible example
library(data.table)
#> data.table 1.10.5 IN DEVELOPMENT built 2017-11-13 02:46:28 UTC; appveyor
#> The fastest way to learn (by data.table authors): https://www.datacamp.com/courses/data-analysis-the-data-table-way
#> Documentation: ?data.table, example(data.table) and browseVignettes("data.table")
#> Release notes, videos and slides: http://r-datatable.com
fread("SA2-by-DJZ-2011.csv", header = FALSE, na.strings = "", verbose = TRUE)
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
Using 12 threads (omp_get_max_threads()=12, nth=12)
NAstrings = [<<>>]
None of the NAstrings look like numbers.
show progress = 1
0/1 column will be read as boolean
[02] Opening the file
Opening file SA2-by-DJZ-2011.csv
File opened, size = 349.4MB (366418725 bytes).
Memory mapping ... ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
\r-only line endings are not allowed because \n is found in the data
[05] Skipping initial rows if needed
Positioned on line 1 starting: <<Australian Bureau of Statistic>>
[06] Detect separator, quoting rule, and ncolumns
Detecting sep ...
sep=',' with 89 lines of 4 fields using quote rule 0
Detected 4 columns on line 12. This line is either column names or first data row. Line starts as: <<"Goulburn","110018063",3499,>>
Quote rule picked = 0
fill=false and the most number of columns found is 4
[07] Detect column types, good nrow estimate and whether first row is column names
'header' changed by user from 'auto' to false
Number of sampling jump points = 101 because (366418375 bytes from row 1 to eof) / (2 * 1457 jump0size) == 125744
Type codes (jump 000) : 1551 Quote rule 0
Type codes (jump 100) : 11051 Quote rule 0
=====
Sampled 10027 rows (handled \n inside quoted fields) at 101 jump points
Bytes from first data row on line 12 to the end of last row: 366418143
Line length: mean=16.02 sd=0.21 min=16 max=29
Estimated number of rows: 366418143 / 16.02 = 22877178
Initial alloc = 25164895 rows (22877178 + 9%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
=====
[08] Assign column names
[09] Apply user overrides on column types
After 0 type and 0 drop user overrides : 11051
[10] Allocate memory for the datatable
Allocating 4 column slots (4 - 0 dropped) with 25164895 rows
[11] Read the data
jumps=[0..360), chunk_size=1017828, total_size=366418143
[12] Finalizing the datatable
Read 22885380 rows x 4 columns from 349.4MB (366418725 bytes) file in 00:02.550 wall clock time
Thread buffers were grown 0 times (if all 12 threads each grew once, this figure would be 12)
Final type counts
0 : drop
1 : bool8
0 : bool8
0 : bool8
0 : bool8
1 : int32
0 : int64
0 : float64
0 : float64
0 : float64
2 : string
Rereading 1 columns due to out-of-sample type exceptions.
Column 1 ("") bumped from 'bool8' to 'string' due to <<"Goulburn">> on row 0
[11] Read the data
jumps=[0..360), chunk_size=1017828, total_size=366418143
Read 94%. ETA 00:00 Warning: stack imbalance in '$', 27 then 28
Read 98%. ETA 00:00
[12] Finalizing the datatable
Reread 22885380 rows x 1 columns in 00:00.991
Read 22885380 rows. Exactly what was estimated and allocated up front
=============================
0.006s ( 0%) Memory map 0.341GB file
0.011s ( 0%) sep=',' ncol=4 and header detection
0.002s ( 0%) Column type detection using 10027 sample rows
0.328s ( 9%) Allocation of 22885380 rows x 4 cols (0.469GB)
3.194s ( 90%) Reading 360 chunks of 0.971MB (63547 rows) using 12 threads
= 0.001s ( 0%) Finding first non-embedded \n after each jump
+ 0.362s ( 10%) Parse to row-major thread buffers
+ 1.963s ( 55%) Transpose
+ 0.868s ( 25%) Waiting
0.991s ( 28%) Rereading 1 columns due to out-of-sample type exceptions
3.541s Total
Warning: stack imbalance in 'withVisible', 3 then 5
Warning messages:
1: In fread("SA2-by-DJZ-2011.csv", header = FALSE, na.strings = "", :
Starting data input on line 12 <<"Goulburn","110018063",3499,>> with 4 fields and discarding line 11 <<"Main Statistical Area Structu>> before it because it has a different number of fields (3).
2: In fread("SA2-by-DJZ-2011.csv", header = FALSE, na.strings = "", :
Found the last consistent line but text exists afterwards. Consider fill=TRUE and/or blank.lines.skip=TRUE. First 200 characters of discarded line: <<"Dataset: 2011 Census of Population and Housing">>
# Output of sessionInfo()
R version 3.4.2 (2017-09-28)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)
Matrix products: default
locale:
[1] LC_COLLATE=English_Australia.1252 LC_CTYPE=English_Australia.1252
[3] LC_MONETARY=English_Australia.1252 LC_NUMERIC=C
[5] LC_TIME=English_Australia.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] data.table_1.10.5 RevoUtils_10.0.6 RevoUtilsMath_10.0.1
loaded via a namespace (and not attached):
[1] compiler_3.4.2 tools_3.4.2 yaml_2.1.14
@HughParsonage, this looks similar to #2457. Perhaps try passing showProgress=FALSE and see whether it completes.
@mattdowle could this be a regression since 2017-11-09?
Running with showProgress=FALSE did indeed return the result (with only the expected warnings).
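For readers who want to try the same workaround, here is a minimal sketch of the call with progress output switched off. The two-row temporary file is a hypothetical stand-in so the snippet runs anywhere; it is far too small to trigger the problem, which was only reported on the 350 MB CSV.

```r
library(data.table)

# Hypothetical stand-in file; the report uses "SA2-by-DJZ-2011.csv" instead.
path <- tempfile(fileext = ".csv")
writeLines(c('"Goulburn","110018063",3499,1',
             '"Goulburn","110018063",3500,0'), path)

# Same arguments as in the report, plus showProgress = FALSE so that no
# "Read xx%. ETA" lines are printed while the file is being read.
dt <- fread(path, header = FALSE, na.strings = "", showProgress = FALSE)
print(dim(dt))
```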
Thanks for all the detailed info. I doubt there is a regression since 2017-11-09, but perhaps the lengthy verbose=TRUE output is having the same effect as the ETA output. The file needs a reread, which means more output is generated. I suspect @HughParsonage would report that showProgress = TRUE works for him and that the problem would occur if it were run 5-10 times with verbose = TRUE.
No verbose messages are printed from within the parallel region (other than the progress ETA, which has already been fixed). However, there are verbose messages after the first read and before the 2nd (reread) pass starts, which is what happens for this file. I think it is possible, although odd, that if those prints trigger the 100th CheckUserInterrupt (see #2457), the 2nd parallel region could then fail. To rule that out anyway, I have changed all verbose messages to use REprintf instead of Rprintf (the same fix as in #2457 for the ETA). It is currently failing because the tests are not looking for the output on stderr; will fix. Once it is passing, the Windows .zip will be built automatically and then you could please try again. I'll update here when it is ready.
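The stdout/stderr point can be illustrated at the R level. This is only an analogy, not data.table's test code: `cat()` stands in for output written with Rprintf (stdout) and `message()` for output written with REprintf (stderr), and the two helper functions are hypothetical.

```r
# cat() writes to stdout, message() writes to stderr.
to_stdout <- function() cat("verbose line\n")
to_stderr <- function() message("verbose line")

# capture.output() collects only stdout by default (type = "output"),
# so a check written against it no longer sees text moved to stderr.
seen_stdout <- capture.output(to_stdout())
seen_stderr <- capture.output(to_stderr())  # the message still goes to the console

print(seen_stdout)
print(length(seen_stderr))
```

A check that compares captured stdout against expected verbose text would therefore start failing once the same text is routed to stderr, which is consistent with the failure described above.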
OK, the second attempt is passing the checks and the Windows .zip is available. @HughParsonage would you like to try again? I have added a call to R_FlushConsole() after the messages in verbose mode, just before the reread. This flush is only ever needed on Windows. My guess is that without the flush, the console sometimes updates a little later, while the parallel reread is running, and that is known to cause problems. Please repeat 10 times, with both verbose=TRUE and showProgress=TRUE. If you see 10 clean runs then we'll say that was it. Otherwise I'll have to think again.
Unfortunately, not fixed:
> library(data.table)
data.table 1.10.5 IN DEVELOPMENT built 2017-11-15 00:36:41 UTC; appveyor
The fastest way to learn (by data.table authors): https://www.datacamp.com/courses/data-analysis-the-data-table-way
Documentation: ?data.table, example(data.table) and browseVignettes("data.table")
Release notes, videos and slides: http://r-datatable.com
> fread("SA2-by-DJZ-2011.csv", header = FALSE, na.strings = "", verbose = FALSE)
Read 26%. ETA 00:00 Warning: stack imbalance in '$', 20 then 22
Read 52%. ETA 00:00 Warning: stack imbalance in '$', 36 then 35
Warning: stack imbalance in '$', 21 then 22
Read 59%. ETA 00:00 Error in fread("SA2-by-DJZ-2011.csv", header = FALSE, na.strings = "", :
unprotect_ptr: pointer not found
In addition: Warning: stack imbalance in '$', 26 then 28
Warning messages:
1: Warning: stack imbalance in '$', 26 then 27
In fread("SA2-by-DJZ-2011.csv", header = FALSE, na.strings = "", :
Starting data input on line 12 <<"Goulburn","110018063",3499,>> with 4 fields and discarding line 11 <<"Main Statistical Area Structu>> before it because it has a different number of fields (3).
2: In fread("SA2-by-DJZ-2011.csv", header = FALSE, na.strings = "", :
Found the last consistent line but text exists afterwards. Consider fill=TRUE and/or blank.lines.skip=TRUE. First 200 characters of discarded line: <<"Dataset: 2011 Census of Population and Housing">>
Warning: stack imbalance in 'lapply', 31 then 30
Warning: stack imbalance in '$', 14 then 15
Even after 10 runs with verbose=TRUE, showProgress=TRUE I get no errors. Here is the output of the 10th run:
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
Using 12 threads (omp_get_max_threads()=12, nth=12)
NAstrings = [<<>>]
None of the NAstrings look like numbers.
show progress = 1
0/1 column will be read as boolean
[02] Opening the file
Opening file SA2-by-DJZ-2011.csv
File opened, size = 349.4MB (366418725 bytes).
Memory mapping ... ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
\r-only line endings are not allowed because \n is found in the data
[05] Skipping initial rows if needed
Positioned on line 1 starting: <<Australian Bureau of Statistic>>
[06] Detect separator, quoting rule, and ncolumns
Detecting sep ...
sep=',' with 89 lines of 4 fields using quote rule 0
Detected 4 columns on line 12. This line is either column names or first data row. Line starts as: <<"Goulburn","110018063",3499,>>
Quote rule picked = 0
fill=false and the most number of columns found is 4
[07] Detect column types, good nrow estimate and whether first row is column names
'header' changed by user from 'auto' to false
Number of sampling jump points = 101 because (366418375 bytes from row 1 to eof) / (2 * 1457 jump0size) == 125744
Type codes (jump 000) : 1551 Quote rule 0
Type codes (jump 100) : 11051 Quote rule 0
=====
Sampled 10027 rows (handled \n inside quoted fields) at 101 jump points
Bytes from first data row on line 12 to the end of last row: 366418143
Line length: mean=16.02 sd=0.21 min=16 max=29
Estimated number of rows: 366418143 / 16.02 = 22877178
Initial alloc = 25164895 rows (22877178 + 9%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
=====
[08] Assign column names
[09] Apply user overrides on column types
After 0 type and 0 drop user overrides : 11051
[10] Allocate memory for the datatable
Allocating 4 column slots (4 - 0 dropped) with 25164895 rows
[11] Read the data
jumps=[0..360), chunk_size=1017828, total_size=366418143
[12] Finalizing the datatable
Read 22885380 rows x 4 columns from 349.4MB (366418725 bytes) file in 00:02.094 wall clock time
Thread buffers were grown 0 times (if all 1 threads each grew once, this figure would be 1)
Final type counts
0 : drop
1 : bool8
0 : bool8
0 : bool8
0 : bool8
1 : int32
0 : int64
0 : float64
0 : float64
0 : float64
2 : string
Rereading 1 columns due to out-of-sample type exceptions.
Column 1 ("") bumped from 'bool8' to 'string' due to <<"Goulburn">> on row 0
[11] Read the data
jumps=[0..360), chunk_size=1017828, total_size=366418143
Read 99%. ETA 00:00
[12] Finalizing the datatable
Reread 22885380 rows x 1 columns in 00:01.752
Read 22885380 rows. Exactly what was estimated and allocated up front
=============================
0.004s ( 0%) Memory map 0.341GB file
0.008s ( 0%) sep=',' ncol=4 and header detection
0.001s ( 0%) Column type detection using 10027 sample rows
0.173s ( 4%) Allocation of 22885380 rows x 4 cols (0.469GB)
3.660s ( 95%) Reading 360 chunks of 0.971MB (63547 rows) using 1 threads
= 0.009s ( 0%) Finding first non-embedded \n after each jump
+ 1.946s ( 51%) Parse to row-major thread buffers
+ 1.098s ( 29%) Transpose
+ 0.608s ( 16%) Waiting
1.752s ( 46%) Rereading 1 columns due to out-of-sample type exceptions
3.846s Total
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
Using 12 threads (omp_get_max_threads()=12, nth=12)
NAstrings = [<<>>]
None of the NAstrings look like numbers.
show progress = 1
0/1 column will be read as boolean
[02] Opening the file
Opening file SA2-by-DJZ-2011.csv
File opened, size = 349.4MB (366418725 bytes).
Memory mapping ... ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
\r-only line endings are not allowed because \n is found in the data
[05] Skipping initial rows if needed
Positioned on line 1 starting: <<Australian Bureau of Statistic>>
[06] Detect separator, quoting rule, and ncolumns
Detecting sep ...
sep=',' with 89 lines of 4 fields using quote rule 0
Detected 4 columns on line 12. This line is either column names or first data row. Line starts as: <<"Goulburn","110018063",3499,>>
Quote rule picked = 0
fill=false and the most number of columns found is 4
[07] Detect column types, good nrow estimate and whether first row is column names
'header' changed by user from 'auto' to false
Number of sampling jump points = 101 because (366418375 bytes from row 1 to eof) / (2 * 1457 jump0size) == 125744
Type codes (jump 000) : 1551 Quote rule 0
Type codes (jump 100) : 11051 Quote rule 0
=====
Sampled 10027 rows (handled \n inside quoted fields) at 101 jump points
Bytes from first data row on line 12 to the end of last row: 366418143
Line length: mean=16.02 sd=0.21 min=16 max=29
Estimated number of rows: 366418143 / 16.02 = 22877178
Initial alloc = 25164895 rows (22877178 + 9%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
=====
[08] Assign column names
[09] Apply user overrides on column types
After 0 type and 0 drop user overrides : 11051
[10] Allocate memory for the datatable
Allocating 4 column slots (4 - 0 dropped) with 25164895 rows
[11] Read the data
jumps=[0..360), chunk_size=1017828, total_size=366418143
[12] Finalizing the datatable
Read 22885380 rows x 4 columns from 349.4MB (366418725 bytes) file in 00:02.589 wall clock time
Thread buffers were grown 0 times (if all 1 threads each grew once, this figure would be 1)
Final type counts
0 : drop
1 : bool8
0 : bool8
0 : bool8
0 : bool8
1 : int32
0 : int64
0 : float64
0 : float64
0 : float64
2 : string
Rereading 1 columns due to out-of-sample type exceptions.
Column 1 ("") bumped from 'bool8' to 'string' due to <<"Goulburn">> on row 0
[11] Read the data
jumps=[0..360), chunk_size=1017828, total_size=366418143
Read 99%. ETA 00:00
[12] Finalizing the datatable
Reread 22885380 rows x 1 columns in 00:01.418
Read 22885380 rows. Exactly what was estimated and allocated up front
=============================
0.001s ( 0%) Memory map 0.341GB file
0.003s ( 0%) sep=',' ncol=4 and header detection
0.001s ( 0%) Column type detection using 10027 sample rows
0.574s ( 14%) Allocation of 22885380 rows x 4 cols (0.469GB)
3.428s ( 86%) Reading 360 chunks of 0.971MB (63547 rows) using 1 threads
= 0.010s ( 0%) Finding first non-embedded \n after each jump
+ 1.988s ( 50%) Parse to row-major thread buffers
+ 1.137s ( 28%) Transpose
+ 0.292s ( 7%) Waiting
1.418s ( 35%) Rereading 1 columns due to out-of-sample type exceptions
4.007s Total
There were 20 warnings (use warnings() to see them)
@HughParsonage Thanks! I'm confused though. You're saying it works fine with verbose=TRUE, showProgress=TRUE, which is what we hoped for, yay! Didn't it fail before? The default for showProgress is TRUE anyway, but when you run with the default FALSE for verbose, it doesn't work and you see the stack imbalance? It's strange that _less_ output makes it fail. Please confirm. If so, then perhaps I'm barking up the wrong tree. It works fine for me here on Linux, so I'm relying on your testing on Windows. Thanks.
(Also, at the bottom of the 10th run's output it says there were 20 warnings. I assume those are the 2 warnings shown higher up, repeated 10 times. If so, that makes sense.)
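One way to check the "2 warnings times 10 runs" arithmetic rather than assume it is to collect the warning messages over the loop and tabulate them. The sketch below uses a hypothetical `read_once()` that merely raises two warnings with abbreviated text; in practice it would be replaced by the real fread() call on the file.

```r
# Hypothetical stand-in for one fread() call on this file: it raises the two
# ordinary warnings quoted in the thread (text abbreviated).
read_once <- function() {
  warning("Starting data input on line 12", call. = FALSE)
  warning("Found the last consistent line but text exists afterwards", call. = FALSE)
  invisible(NULL)
}

msgs <- character(0)
for (i in 1:10) {
  withCallingHandlers(
    read_once(),
    warning = function(w) {
      msgs <<- c(msgs, conditionMessage(w))  # record the message
      invokeRestart("muffleWarning")         # and carry on with the call
    }
  )
}

# Two distinct messages, each seen 10 times, gives 20 warnings in total.
print(table(msgs))
```

This only sees ordinary R warnings. It makes no attempt to detect the stack imbalance lines, which would still need to be checked by eye in the console output.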
Hi Matt, sorry for the confusion.
You're right that the original problem no longer results in a crash; that is, the following works as expected:
fread("SA2-by-DJZ-2011.csv", header = FALSE, na.strings = "")
To clarify: originally, with verbose = FALSE (the default), I got a crash. Before filing the issue I ran it with verbose = TRUE and noticed the 'stack imbalance' warning, but did not experience a crash. With the latest version, I get no crash (or indeed any problem) with verbose = FALSE.
The reason I said 'not fixed' was that I had seen these warning messages:
Warning messages:
Warning: stack imbalance in '$', 26 then 27
Warning: stack imbalance in 'lapply', 31 then 30
Warning: stack imbalance in '$', 14 then 15
which looked odd, and I thought they might point to a closely related problem, if not the identical one. That said, this morning in Australia I can't reproduce the warning messages.
OK, I see. The warning messages about stack imbalance are essentially errors, yes. We can't ignore them. I treat a stack imbalance warning as a crash, even if it hasn't actually crashed yet. (After seeing that warning it's just a matter of time until it crashes.)
When you do 10 runs in a fresh R session with verbose=TRUE, showProgress=TRUE, are any of the 20 warnings about stack imbalance, or are all 20 just the following regular warnings?
1: In fread("SA2-by-DJZ-2011.csv", header = FALSE, na.strings = "", :
Starting data input on line 12 <<"Goulburn","110018063",3499,>> with 4 fields and discarding line 11 <<"Main Statistical Area Structu>> before it because it has a different number of fields (3).
2: In fread("SA2-by-DJZ-2011.csv", header = FALSE, na.strings = "", :
Found the last consistent line but text exists afterwards. Consider fill=TRUE and/or blank.lines.skip=TRUE. First 200 characters of discarded line: <<"Dataset: 2011 Census of Population and Housing">>
Once a stack imbalance warning has occurred, please start a new R session. We can't trust anything from R once that has happened even once.
I did manage to get a crash when I ran with verbose=TRUE, showProgress=TRUE. The message mentioned const char and SEXP. I'm trying to reproduce it from the command line (unfortunately it happened in RStudio, and RStudio closed before I could read the full message).
Can't reproduce the crash. Here is the result after rebooting. There was a stack imbalance warning:
> library(data.table)
data.table 1.10.5 IN DEVELOPMENT built 2017-11-15 00:36:41 UTC; appveyor
The fastest way to learn (by data.table authors): https://www.datacamp.com/courses/data-analysis-the-data-table-way
Documentation: ?data.table, example(data.table) and browseVignettes("data.table")
Release notes, videos and slides: http://r-datatable.com
> for (i in 1:10) fread("SA2-by-DJZ-2011.csv", header = FALSE, na.strings = "", verbose = TRUE, showProgress = TRUE)
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
Using 12 threads (omp_get_max_threads()=12, nth=12)
NAstrings = [<<>>]
None of the NAstrings look like numbers.
show progress = 1
0/1 column will be read as boolean
[02] Opening the file
Opening file SA2-by-DJZ-2011.csv
File opened, size = 349.4MB (366418725 bytes).
Memory mapping ... ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
\r-only line endings are not allowed because \n is found in the data
[05] Skipping initial rows if needed
Positioned on line 1 starting: <<Australian Bureau of Statistic>>
[06] Detect separator, quoting rule, and ncolumns
Detecting sep ...
sep=',' with 89 lines of 4 fields using quote rule 0
Detected 4 columns on line 12. This line is either column names or first data row. Line starts as: <<"Goulburn","110018063",3499,>>
Quote rule picked = 0
fill=false and the most number of columns found is 4
[07] Detect column types, good nrow estimate and whether first row is column names
'header' changed by user from 'auto' to false
Number of sampling jump points = 101 because (366418375 bytes from row 1 to eof) / (2 * 1457 jump0size) == 125744
Type codes (jump 000) : 1551 Quote rule 0
Type codes (jump 100) : 11051 Quote rule 0
=====
Sampled 10027 rows (handled \n inside quoted fields) at 101 jump points
Bytes from first data row on line 12 to the end of last row: 366418143
Line length: mean=16.02 sd=0.21 min=16 max=29
Estimated number of rows: 366418143 / 16.02 = 22877178
Initial alloc = 25164895 rows (22877178 + 9%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
=====
[08] Assign column names
[09] Apply user overrides on column types
After 0 type and 0 drop user overrides : 11051
[10] Allocate memory for the datatable
Allocating 4 column slots (4 - 0 dropped) with 25164895 rows
[11] Read the data
jumps=[0..360), chunk_size=1017828, total_size=366418143
[12] Finalizing the datatable
Read 22885380 rows x 4 columns from 349.4MB (366418725 bytes) file in 00:01.212 wall clock time
Thread buffers were grown 0 times (if all 12 threads each grew once, this figure would be 12)
Final type counts
0 : drop
1 : bool8
0 : bool8
0 : bool8
0 : bool8
1 : int32
0 : int64
0 : float64
0 : float64
0 : float64
2 : string
Rereading 1 columns due to out-of-sample type exceptions.
Column 1 ("") bumped from 'bool8' to 'string' due to <<"Goulburn">> on row 0
[11] Read the data
jumps=[0..360), chunk_size=1017828, total_size=366418143
Read 31%. ETA 00:00 Warning: stack imbalance in '$', 24 then 23
Read 91%. ETA 00:00 Warning: stack imbalance in '$', 27 then 26
Read 95%. ETA 00:00 Warning: stack imbalance in '$', 28 then 29
Read 98%. ETA 00:00
[12] Finalizing the datatable
Reread 22885380 rows x 1 columns in 00:00.895
Read 22885380 rows. Exactly what was estimated and allocated up front
=============================
0.002s ( 0%) Memory map 0.341GB file
0.029s ( 1%) sep=',' ncol=4 and header detection
0.001s ( 0%) Column type detection using 10027 sample rows
0.314s ( 15%) Allocation of 22885380 rows x 4 cols (0.469GB)
1.761s ( 84%) Reading 360 chunks of 0.971MB (63547 rows) using 12 threads
= 0.015s ( 1%) Finding first non-embedded \n after each jump
+ 0.599s ( 28%) Parse to row-major thread buffers
+ 0.400s ( 19%) Transpose
+ 0.746s ( 35%) Waiting
0.895s ( 42%) Rereading 1 columns due to out-of-sample type exceptions
2.107s Total
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
Using 12 threads (omp_get_max_threads()=12, nth=12)
NAstrings = [<<>>]
None of the NAstrings look like numbers.
show progress = 1
0/1 column will be read as boolean
[02] Opening the file
Opening file SA2-by-DJZ-2011.csv
File opened, size = 349.4MB (366418725 bytes).
Memory mapping ... ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
\r-only line endings are not allowed because \n is found in the data
[05] Skipping initial rows if needed
Positioned on line 1 starting: <<Australian Bureau of Statistic>>
[06] Detect separator, quoting rule, and ncolumns
Detecting sep ...
sep=',' with 89 lines of 4 fields using quote rule 0
Detected 4 columns on line 12. This line is either column names or first data row. Line starts as: <<"Goulburn","110018063",3499,>>
Quote rule picked = 0
fill=false and the most number of columns found is 4
[07] Detect column types, good nrow estimate and whether first row is column names
'header' changed by user from 'auto' to false
Number of sampling jump points = 101 because (366418375 bytes from row 1 to eof) / (2 * 1457 jump0size) == 125744
Type codes (jump 000) : 1551 Quote rule 0
Type codes (jump 100) : 11051 Quote rule 0
=====
Sampled 10027 rows (handled \n inside quoted fields) at 101 jump points
Bytes from first data row on line 12 to the end of last row: 366418143
Line length: mean=16.02 sd=0.21 min=16 max=29
Estimated number of rows: 366418143 / 16.02 = 22877178
Initial alloc = 25164895 rows (22877178 + 9%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
=====
[08] Assign column names
[09] Apply user overrides on column types
After 0 type and 0 drop user overrides : 11051
[10] Allocate memory for the datatable
Allocating 4 column slots (4 - 0 dropped) with 25164895 rows
[11] Read the data
jumps=[0..360), chunk_size=1017828, total_size=366418143
[12] Finalizing the datatable
Read 22885380 rows x 4 columns from 349.4MB (366418725 bytes) file in 00:01.335 wall clock time
Thread buffers were grown 0 times (if all 12 threads each grew once, this figure would be 12)
Final type counts
0 : drop
1 : bool8
0 : bool8
0 : bool8
0 : bool8
1 : int32
0 : int64
0 : float64
0 : float64
0 : float64
2 : string
Rereading 1 columns due to out-of-sample type exceptions.
Column 1 ("") bumped from 'bool8' to 'string' due to <<"Goulburn">> on row 0
[11] Read the data
jumps=[0..360), chunk_size=1017828, total_size=366418143
Read 98%. ETA 00:00
[12] Finalizing the datatable
Reread 22885380 rows x 1 columns in 00:01.049
Read 22885380 rows. Exactly what was estimated and allocated up front
=============================
0.002s ( 0%) Memory map 0.341GB file
0.006s ( 0%) sep=',' ncol=4 and header detection
0.001s ( 0%) Column type detection using 10027 sample rows
0.402s ( 17%) Allocation of 22885380 rows x 4 cols (0.469GB)
1.974s ( 83%) Reading 360 chunks of 0.971MB (63547 rows) using 12 threads
= 0.001s ( 0%) Finding first non-embedded \n after each jump
+ 0.209s ( 9%) Parse to row-major thread buffers
+ 0.864s ( 36%) Transpose
+ 0.900s ( 38%) Waiting
1.049s ( 44%) Rereading 1 columns due to out-of-sample type exceptions
2.385s Total
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
Using 12 threads (omp_get_max_threads()=12, nth=12)
NAstrings = [<<>>]
None of the NAstrings look like numbers.
show progress = 1
0/1 column will be read as boolean
[02] Opening the file
Opening file SA2-by-DJZ-2011.csv
File opened, size = 349.4MB (366418725 bytes).
Memory mapping ... ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
\r-only line endings are not allowed because \n is found in the data
[05] Skipping initial rows if needed
Positioned on line 1 starting: <<Australian Bureau of Statistic>>
[06] Detect separator, quoting rule, and ncolumns
Detecting sep ...
sep=',' with 89 lines of 4 fields using quote rule 0
Detected 4 columns on line 12. This line is either column names or first data row. Line starts as: <<"Goulburn","110018063",3499,>>
Quote rule picked = 0
fill=false and the most number of columns found is 4
[07] Detect column types, good nrow estimate and whether first row is column names
'header' changed by user from 'auto' to false
Number of sampling jump points = 101 because (366418375 bytes from row 1 to eof) / (2 * 1457 jump0size) == 125744
Type codes (jump 000) : 1551 Quote rule 0
Type codes (jump 100) : 11051 Quote rule 0
=====
Sampled 10027 rows (handled \n inside quoted fields) at 101 jump points
Bytes from first data row on line 12 to the end of last row: 366418143
Line length: mean=16.02 sd=0.21 min=16 max=29
Estimated number of rows: 366418143 / 16.02 = 22877178
Initial alloc = 25164895 rows (22877178 + 9%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
=====
[08] Assign column names
[09] Apply user overrides on column types
After 0 type and 0 drop user overrides : 11051
[10] Allocate memory for the datatable
Allocating 4 column slots (4 - 0 dropped) with 25164895 rows
[11] Read the data
jumps=[0..360), chunk_size=1017828, total_size=366418143
[12] Finalizing the datatable
Read 22885380 rows x 4 columns from 349.4MB (366418725 bytes) file in 00:01.212 wall clock time
Thread buffers were grown 0 times (if all 12 threads each grew once, this figure would be 12)
Final type counts
0 : drop
1 : bool8
0 : bool8
0 : bool8
0 : bool8
1 : int32
0 : int64
0 : float64
0 : float64
0 : float64
2 : string
Rereading 1 columns due to out-of-sample type exceptions.
Column 1 ("") bumped from 'bool8' to 'string' due to <<"Goulburn">> on row 0
[11] Read the data
jumps=[0..360), chunk_size=1017828, total_size=366418143
Read 96%. ETA 00:00
[12] Finalizing the datatable
Reread 22885380 rows x 1 columns in 00:00.414
Read 22885380 rows. Exactly what was estimated and allocated up front
=============================
0.002s ( 0%) Memory map 0.341GB file
0.007s ( 0%) sep=',' ncol=4 and header detection
0.002s ( 0%) Column type detection using 10027 sample rows
0.293s ( 18%) Allocation of 22885380 rows x 4 cols (0.469GB)
1.322s ( 81%) Reading 360 chunks of 0.971MB (63547 rows) using 12 threads
= 0.001s ( 0%) Finding first non-embedded \n after each jump
+ 0.199s ( 12%) Parse to row-major thread buffers
+ 0.822s ( 51%) Transpose
+ 0.301s ( 19%) Waiting
0.414s ( 25%) Rereading 1 columns due to out-of-sample type exceptions
1.626s Total
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
Using 12 threads (omp_get_max_threads()=12, nth=12)
NAstrings = [<<>>]
None of the NAstrings look like numbers.
show progress = 1
0/1 column will be read as boolean
[02] Opening the file
Opening file SA2-by-DJZ-2011.csv
File opened, size = 349.4MB (366418725 bytes).
Memory mapping ... ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
\r-only line endings are not allowed because \n is found in the data
[05] Skipping initial rows if needed
Positioned on line 1 starting: <<Australian Bureau of Statistic>>
[06] Detect separator, quoting rule, and ncolumns
Detecting sep ...
sep=',' with 89 lines of 4 fields using quote rule 0
Detected 4 columns on line 12. This line is either column names or first data row. Line starts as: <<"Goulburn","110018063",3499,>>
Quote rule picked = 0
fill=false and the most number of columns found is 4
[07] Detect column types, good nrow estimate and whether first row is column names
'header' changed by user from 'auto' to false
Number of sampling jump points = 101 because (366418375 bytes from row 1 to eof) / (2 * 1457 jump0size) == 125744
Type codes (jump 000) : 1551 Quote rule 0
Type codes (jump 100) : 11051 Quote rule 0
=====
Sampled 10027 rows (handled \n inside quoted fields) at 101 jump points
Bytes from first data row on line 12 to the end of last row: 366418143
Line length: mean=16.02 sd=0.21 min=16 max=29
Estimated number of rows: 366418143 / 16.02 = 22877178
Initial alloc = 25164895 rows (22877178 + 9%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
=====
[08] Assign column names
[09] Apply user overrides on column types
After 0 type and 0 drop user overrides : 11051
[10] Allocate memory for the datatable
Allocating 4 column slots (4 - 0 dropped) with 25164895 rows
[11] Read the data
jumps=[0..360), chunk_size=1017828, total_size=366418143
[12] Finalizing the datatable
Read 22885380 rows x 4 columns from 349.4MB (366418725 bytes) file in 00:01.451 wall clock time
Thread buffers were grown 0 times (if all 12 threads each grew once, this figure would be 12)
Final type counts
0 : drop
1 : bool8
0 : bool8
0 : bool8
0 : bool8
1 : int32
0 : int64
0 : float64
0 : float64
0 : float64
2 : string
Rereading 1 columns due to out-of-sample type exceptions.
Column 1 ("") bumped from 'bool8' to 'string' due to <<"Goulburn">> on row 0
[11] Read the data
jumps=[0..360), chunk_size=1017828, total_size=366418143
Read 97%. ETA 00:00
[12] Finalizing the datatable
Reread 22885380 rows x 1 columns in 00:00.409
Read 22885380 rows. Exactly what was estimated and allocated up front
=============================
0.002s ( 0%) Memory map 0.341GB file
0.006s ( 0%) sep=',' ncol=4 and header detection
0.001s ( 0%) Column type detection using 10027 sample rows
0.403s ( 22%) Allocation of 22885380 rows x 4 cols (0.469GB)
1.448s ( 78%) Reading 360 chunks of 0.971MB (63547 rows) using 12 threads
= 0.001s ( 0%) Finding first non-embedded \n after each jump
+ 0.194s ( 10%) Parse to row-major thread buffers
+ 0.974s ( 52%) Transpose
+ 0.279s ( 15%) Waiting
0.409s ( 22%) Rereading 1 columns due to out-of-sample type exceptions
1.860s Total
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
Using 12 threads (omp_get_max_threads()=12, nth=12)
NAstrings = [<<>>]
None of the NAstrings look like numbers.
show progress = 1
0/1 column will be read as boolean
[02] Opening the file
Opening file SA2-by-DJZ-2011.csv
File opened, size = 349.4MB (366418725 bytes).
Memory mapping ... ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
\r-only line endings are not allowed because \n is found in the data
[05] Skipping initial rows if needed
Positioned on line 1 starting: <<Australian Bureau of Statistic>>
[06] Detect separator, quoting rule, and ncolumns
Detecting sep ...
sep=',' with 89 lines of 4 fields using quote rule 0
Detected 4 columns on line 12. This line is either column names or first data row. Line starts as: <<"Goulburn","110018063",3499,>>
Quote rule picked = 0
fill=false and the most number of columns found is 4
[07] Detect column types, good nrow estimate and whether first row is column names
'header' changed by user from 'auto' to false
Number of sampling jump points = 101 because (366418375 bytes from row 1 to eof) / (2 * 1457 jump0size) == 125744
Type codes (jump 000) : 1551 Quote rule 0
Type codes (jump 100) : 11051 Quote rule 0
=====
Sampled 10027 rows (handled \n inside quoted fields) at 101 jump points
Bytes from first data row on line 12 to the end of last row: 366418143
Line length: mean=16.02 sd=0.21 min=16 max=29
Estimated number of rows: 366418143 / 16.02 = 22877178
Initial alloc = 25164895 rows (22877178 + 9%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
=====
[08] Assign column names
[09] Apply user overrides on column types
After 0 type and 0 drop user overrides : 11051
[10] Allocate memory for the datatable
Allocating 4 column slots (4 - 0 dropped) with 25164895 rows
[11] Read the data
jumps=[0..360), chunk_size=1017828, total_size=366418143
[12] Finalizing the datatable
Read 22885380 rows x 4 columns from 349.4MB (366418725 bytes) file in 00:01.480 wall clock time
Thread buffers were grown 0 times (if all 12 threads each grew once, this figure would be 12)
Final type counts
0 : drop
1 : bool8
0 : bool8
0 : bool8
0 : bool8
1 : int32
0 : int64
0 : float64
0 : float64
0 : float64
2 : string
Rereading 1 columns due to out-of-sample type exceptions.
Column 1 ("") bumped from 'bool8' to 'string' due to <<"Goulburn">> on row 0
[11] Read the data
jumps=[0..360), chunk_size=1017828, total_size=366418143
Read 99%. ETA 00:00
[12] Finalizing the datatable
Reread 22885380 rows x 1 columns in 00:00.412
Read 22885380 rows. Exactly what was estimated and allocated up front
=============================
0.002s ( 0%) Memory map 0.341GB file
0.007s ( 0%) sep=',' ncol=4 and header detection
0.001s ( 0%) Column type detection using 10027 sample rows
0.459s ( 24%) Allocation of 22885380 rows x 4 cols (0.469GB)
1.424s ( 75%) Reading 360 chunks of 0.971MB (63547 rows) using 12 threads
= 0.001s ( 0%) Finding first non-embedded \n after each jump
+ 0.197s ( 10%) Parse to row-major thread buffers
+ 0.938s ( 50%) Transpose
+ 0.288s ( 15%) Waiting
0.412s ( 22%) Rereading 1 columns due to out-of-sample type exceptions
1.892s Total
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
Using 12 threads (omp_get_max_threads()=12, nth=12)
NAstrings = [<<>>]
None of the NAstrings look like numbers.
show progress = 1
0/1 column will be read as boolean
[02] Opening the file
Opening file SA2-by-DJZ-2011.csv
File opened, size = 349.4MB (366418725 bytes).
Memory mapping ... ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
\r-only line endings are not allowed because \n is found in the data
[05] Skipping initial rows if needed
Positioned on line 1 starting: <<Australian Bureau of Statistic>>
[06] Detect separator, quoting rule, and ncolumns
Detecting sep ...
sep=',' with 89 lines of 4 fields using quote rule 0
Detected 4 columns on line 12. This line is either column names or first data row. Line starts as: <<"Goulburn","110018063",3499,>>
Quote rule picked = 0
fill=false and the most number of columns found is 4
[07] Detect column types, good nrow estimate and whether first row is column names
'header' changed by user from 'auto' to false
Number of sampling jump points = 101 because (366418375 bytes from row 1 to eof) / (2 * 1457 jump0size) == 125744
Type codes (jump 000) : 1551 Quote rule 0
Type codes (jump 100) : 11051 Quote rule 0
=====
Sampled 10027 rows (handled \n inside quoted fields) at 101 jump points
Bytes from first data row on line 12 to the end of last row: 366418143
Line length: mean=16.02 sd=0.21 min=16 max=29
Estimated number of rows: 366418143 / 16.02 = 22877178
Initial alloc = 25164895 rows (22877178 + 9%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
=====
[08] Assign column names
[09] Apply user overrides on column types
After 0 type and 0 drop user overrides : 11051
[10] Allocate memory for the datatable
Allocating 4 column slots (4 - 0 dropped) with 25164895 rows
[11] Read the data
jumps=[0..360), chunk_size=1017828, total_size=366418143
[12] Finalizing the datatable
Read 22885380 rows x 4 columns from 349.4MB (366418725 bytes) file in 00:01.381 wall clock time
Thread buffers were grown 0 times (if all 12 threads each grew once, this figure would be 12)
Final type counts
0 : drop
1 : bool8
0 : bool8
0 : bool8
0 : bool8
1 : int32
0 : int64
0 : float64
0 : float64
0 : float64
2 : string
Rereading 1 columns due to out-of-sample type exceptions.
Column 1 ("") bumped from 'bool8' to 'string' due to <<"Goulburn">> on row 0
[11] Read the data
jumps=[0..360), chunk_size=1017828, total_size=366418143
Read 97%. ETA 00:00
[12] Finalizing the datatable
Reread 22885380 rows x 1 columns in 00:00.401
Read 22885380 rows. Exactly what was estimated and allocated up front
=============================
0.002s ( 0%) Memory map 0.341GB file
0.005s ( 0%) sep=',' ncol=4 and header detection
0.001s ( 0%) Column type detection using 10027 sample rows
0.384s ( 22%) Allocation of 22885380 rows x 4 cols (0.469GB)
1.389s ( 78%) Reading 360 chunks of 0.971MB (63547 rows) using 12 threads
= 0.001s ( 0%) Finding first non-embedded \n after each jump
+ 0.196s ( 11%) Parse to row-major thread buffers
+ 0.911s ( 51%) Transpose
+ 0.281s ( 16%) Waiting
0.401s ( 22%) Rereading 1 columns due to out-of-sample type exceptions
1.781s Total
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
Using 12 threads (omp_get_max_threads()=12, nth=12)
NAstrings = [<<>>]
None of the NAstrings look like numbers.
show progress = 1
0/1 column will be read as boolean
[02] Opening the file
Opening file SA2-by-DJZ-2011.csv
File opened, size = 349.4MB (366418725 bytes).
Memory mapping ... ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
\r-only line endings are not allowed because \n is found in the data
[05] Skipping initial rows if needed
Positioned on line 1 starting: <<Australian Bureau of Statistic>>
[06] Detect separator, quoting rule, and ncolumns
Detecting sep ...
sep=',' with 89 lines of 4 fields using quote rule 0
Detected 4 columns on line 12. This line is either column names or first data row. Line starts as: <<"Goulburn","110018063",3499,>>
Quote rule picked = 0
fill=false and the most number of columns found is 4
[07] Detect column types, good nrow estimate and whether first row is column names
'header' changed by user from 'auto' to false
Number of sampling jump points = 101 because (366418375 bytes from row 1 to eof) / (2 * 1457 jump0size) == 125744
Type codes (jump 000) : 1551 Quote rule 0
Type codes (jump 100) : 11051 Quote rule 0
=====
Sampled 10027 rows (handled \n inside quoted fields) at 101 jump points
Bytes from first data row on line 12 to the end of last row: 366418143
Line length: mean=16.02 sd=0.21 min=16 max=29
Estimated number of rows: 366418143 / 16.02 = 22877178
Initial alloc = 25164895 rows (22877178 + 9%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
=====
[08] Assign column names
[09] Apply user overrides on column types
After 0 type and 0 drop user overrides : 11051
[10] Allocate memory for the datatable
Allocating 4 column slots (4 - 0 dropped) with 25164895 rows
[11] Read the data
jumps=[0..360), chunk_size=1017828, total_size=366418143
[12] Finalizing the datatable
Read 22885380 rows x 4 columns from 349.4MB (366418725 bytes) file in 00:01.384 wall clock time
Thread buffers were grown 0 times (if all 12 threads each grew once, this figure would be 12)
Final type counts
0 : drop
1 : bool8
0 : bool8
0 : bool8
0 : bool8
1 : int32
0 : int64
0 : float64
0 : float64
0 : float64
2 : string
Rereading 1 columns due to out-of-sample type exceptions.
Column 1 ("") bumped from 'bool8' to 'string' due to <<"Goulburn">> on row 0
[11] Read the data
jumps=[0..360), chunk_size=1017828, total_size=366418143
Read 98%. ETA 00:00
[12] Finalizing the datatable
Reread 22885380 rows x 1 columns in 00:00.480
Read 22885380 rows. Exactly what was estimated and allocated up front
=============================
0.002s ( 0%) Memory map 0.341GB file
0.006s ( 0%) sep=',' ncol=4 and header detection
0.002s ( 0%) Column type detection using 10027 sample rows
0.476s ( 26%) Allocation of 22885380 rows x 4 cols (0.469GB)
1.378s ( 74%) Reading 360 chunks of 0.971MB (63547 rows) using 12 threads
= 0.001s ( 0%) Finding first non-embedded \n after each jump
+ 0.192s ( 10%) Parse to row-major thread buffers
+ 0.833s ( 45%) Transpose
+ 0.352s ( 19%) Waiting
0.480s ( 26%) Rereading 1 columns due to out-of-sample type exceptions
1.864s Total
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
Using 12 threads (omp_get_max_threads()=12, nth=12)
NAstrings = [<<>>]
None of the NAstrings look like numbers.
show progress = 1
0/1 column will be read as boolean
[02] Opening the file
Opening file SA2-by-DJZ-2011.csv
File opened, size = 349.4MB (366418725 bytes).
Memory mapping ... ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
\r-only line endings are not allowed because \n is found in the data
[05] Skipping initial rows if needed
Positioned on line 1 starting: <<Australian Bureau of Statistic>>
[06] Detect separator, quoting rule, and ncolumns
Detecting sep ...
sep=',' with 89 lines of 4 fields using quote rule 0
Detected 4 columns on line 12. This line is either column names or first data row. Line starts as: <<"Goulburn","110018063",3499,>>
Quote rule picked = 0
fill=false and the most number of columns found is 4
[07] Detect column types, good nrow estimate and whether first row is column names
'header' changed by user from 'auto' to false
Number of sampling jump points = 101 because (366418375 bytes from row 1 to eof) / (2 * 1457 jump0size) == 125744
Type codes (jump 000) : 1551 Quote rule 0
Type codes (jump 100) : 11051 Quote rule 0
=====
Sampled 10027 rows (handled \n inside quoted fields) at 101 jump points
Bytes from first data row on line 12 to the end of last row: 366418143
Line length: mean=16.02 sd=0.21 min=16 max=29
Estimated number of rows: 366418143 / 16.02 = 22877178
Initial alloc = 25164895 rows (22877178 + 9%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
=====
[08] Assign column names
[09] Apply user overrides on column types
After 0 type and 0 drop user overrides : 11051
[10] Allocate memory for the datatable
Allocating 4 column slots (4 - 0 dropped) with 25164895 rows
[11] Read the data
jumps=[0..360), chunk_size=1017828, total_size=366418143
[12] Finalizing the datatable
Read 22885380 rows x 4 columns from 349.4MB (366418725 bytes) file in 00:01.374 wall clock time
Thread buffers were grown 0 times (if all 12 threads each grew once, this figure would be 12)
Final type counts
0 : drop
1 : bool8
0 : bool8
0 : bool8
0 : bool8
1 : int32
0 : int64
0 : float64
0 : float64
0 : float64
2 : string
Rereading 1 columns due to out-of-sample type exceptions.
Column 1 ("") bumped from 'bool8' to 'string' due to <<"Goulburn">> on row 0
[11] Read the data
jumps=[0..360), chunk_size=1017828, total_size=366418143
Read 98%. ETA 00:00
[12] Finalizing the datatable
Reread 22885380 rows x 1 columns in 00:00.507
Read 22885380 rows. Exactly what was estimated and allocated up front
=============================
0.002s ( 0%) Memory map 0.341GB file
0.006s ( 0%) sep=',' ncol=4 and header detection
0.001s ( 0%) Column type detection using 10027 sample rows
0.311s ( 17%) Allocation of 22885380 rows x 4 cols (0.469GB)
1.562s ( 83%) Reading 360 chunks of 0.971MB (63547 rows) using 12 threads
= 0.001s ( 0%) Finding first non-embedded \n after each jump
+ 0.193s ( 10%) Parse to row-major thread buffers
+ 0.988s ( 52%) Transpose
+ 0.381s ( 20%) Waiting
0.507s ( 27%) Rereading 1 columns due to out-of-sample type exceptions
1.881s Total
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
Using 12 threads (omp_get_max_threads()=12, nth=12)
NAstrings = [<<>>]
None of the NAstrings look like numbers.
show progress = 1
0/1 column will be read as boolean
[02] Opening the file
Opening file SA2-by-DJZ-2011.csv
File opened, size = 349.4MB (366418725 bytes).
Memory mapping ... ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
\r-only line endings are not allowed because \n is found in the data
[05] Skipping initial rows if needed
Positioned on line 1 starting: <<Australian Bureau of Statistic>>
[06] Detect separator, quoting rule, and ncolumns
Detecting sep ...
sep=',' with 89 lines of 4 fields using quote rule 0
Detected 4 columns on line 12. This line is either column names or first data row. Line starts as: <<"Goulburn","110018063",3499,>>
Quote rule picked = 0
fill=false and the most number of columns found is 4
[07] Detect column types, good nrow estimate and whether first row is column names
'header' changed by user from 'auto' to false
Number of sampling jump points = 101 because (366418375 bytes from row 1 to eof) / (2 * 1457 jump0size) == 125744
Type codes (jump 000) : 1551 Quote rule 0
Type codes (jump 100) : 11051 Quote rule 0
=====
Sampled 10027 rows (handled \n inside quoted fields) at 101 jump points
Bytes from first data row on line 12 to the end of last row: 366418143
Line length: mean=16.02 sd=0.21 min=16 max=29
Estimated number of rows: 366418143 / 16.02 = 22877178
Initial alloc = 25164895 rows (22877178 + 9%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
=====
[08] Assign column names
[09] Apply user overrides on column types
After 0 type and 0 drop user overrides : 11051
[10] Allocate memory for the datatable
Allocating 4 column slots (4 - 0 dropped) with 25164895 rows
[11] Read the data
jumps=[0..360), chunk_size=1017828, total_size=366418143
[12] Finalizing the datatable
Read 22885380 rows x 4 columns from 349.4MB (366418725 bytes) file in 00:01.318 wall clock time
Thread buffers were grown 0 times (if all 12 threads each grew once, this figure would be 12)
Final type counts
0 : drop
1 : bool8
0 : bool8
0 : bool8
0 : bool8
1 : int32
0 : int64
0 : float64
0 : float64
0 : float64
2 : string
Rereading 1 columns due to out-of-sample type exceptions.
Column 1 ("") bumped from 'bool8' to 'string' due to <<"Goulburn">> on row 0
[11] Read the data
jumps=[0..360), chunk_size=1017828, total_size=366418143
Read 96%. ETA 00:00
[12] Finalizing the datatable
Reread 22885380 rows x 1 columns in 00:00.493
Read 22885380 rows. Exactly what was estimated and allocated up front
=============================
0.002s ( 0%) Memory map 0.341GB file
0.006s ( 0%) sep=',' ncol=4 and header detection
0.001s ( 0%) Column type detection using 10027 sample rows
0.306s ( 17%) Allocation of 22885380 rows x 4 cols (0.469GB)
1.496s ( 83%) Reading 360 chunks of 0.971MB (63547 rows) using 12 threads
= 0.001s ( 0%) Finding first non-embedded \n after each jump
+ 0.193s ( 11%) Parse to row-major thread buffers
+ 0.935s ( 52%) Transpose
+ 0.367s ( 20%) Waiting
0.493s ( 27%) Rereading 1 columns due to out-of-sample type exceptions
1.811s Total
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
Using 12 threads (omp_get_max_threads()=12, nth=12)
NAstrings = [<<>>]
None of the NAstrings look like numbers.
show progress = 1
0/1 column will be read as boolean
[02] Opening the file
Opening file SA2-by-DJZ-2011.csv
File opened, size = 349.4MB (366418725 bytes).
Memory mapping ... ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
\r-only line endings are not allowed because \n is found in the data
[05] Skipping initial rows if needed
Positioned on line 1 starting: <<Australian Bureau of Statistic>>
[06] Detect separator, quoting rule, and ncolumns
Detecting sep ...
sep=',' with 89 lines of 4 fields using quote rule 0
Detected 4 columns on line 12. This line is either column names or first data row. Line starts as: <<"Goulburn","110018063",3499,>>
Quote rule picked = 0
fill=false and the most number of columns found is 4
[07] Detect column types, good nrow estimate and whether first row is column names
'header' changed by user from 'auto' to false
Number of sampling jump points = 101 because (366418375 bytes from row 1 to eof) / (2 * 1457 jump0size) == 125744
Type codes (jump 000) : 1551 Quote rule 0
Type codes (jump 100) : 11051 Quote rule 0
=====
Sampled 10027 rows (handled \n inside quoted fields) at 101 jump points
Bytes from first data row on line 12 to the end of last row: 366418143
Line length: mean=16.02 sd=0.21 min=16 max=29
Estimated number of rows: 366418143 / 16.02 = 22877178
Initial alloc = 25164895 rows (22877178 + 9%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
=====
[08] Assign column names
[09] Apply user overrides on column types
After 0 type and 0 drop user overrides : 11051
[10] Allocate memory for the datatable
Allocating 4 column slots (4 - 0 dropped) with 25164895 rows
[11] Read the data
jumps=[0..360), chunk_size=1017828, total_size=366418143
[12] Finalizing the datatable
Read 22885380 rows x 4 columns from 349.4MB (366418725 bytes) file in 00:01.141 wall clock time
Thread buffers were grown 0 times (if all 12 threads each grew once, this figure would be 12)
Final type counts
0 : drop
1 : bool8
0 : bool8
0 : bool8
0 : bool8
1 : int32
0 : int64
0 : float64
0 : float64
0 : float64
2 : string
Rereading 1 columns due to out-of-sample type exceptions.
Column 1 ("") bumped from 'bool8' to 'string' due to <<"Goulburn">> on row 0
[11] Read the data
jumps=[0..360), chunk_size=1017828, total_size=366418143
Read 98%. ETA 00:00
[12] Finalizing the datatable
Reread 22885380 rows x 1 columns in 00:00.506
Read 22885380 rows. Exactly what was estimated and allocated up front
=============================
0.002s ( 0%) Memory map 0.341GB file
0.007s ( 0%) sep=',' ncol=4 and header detection
0.001s ( 0%) Column type detection using 10027 sample rows
0.132s ( 8%) Allocation of 22885380 rows x 4 cols (0.469GB)
1.506s ( 91%) Reading 360 chunks of 0.971MB (63547 rows) using 12 threads
= 0.001s ( 0%) Finding first non-embedded \n after each jump
+ 0.195s ( 12%) Parse to row-major thread buffers
+ 0.938s ( 57%) Transpose
+ 0.371s ( 23%) Waiting
0.506s ( 31%) Rereading 1 columns due to out-of-sample type exceptions
1.647s Total
Warning: stack imbalance in 'for', 2 then 8
There were 20 warnings (use warnings() to see them)
Weirdly, that determinism is great. Thanks. It means the flush doesn't work and I'll have to find a way to avoid Rprintf after all. It works robustly with verbose=FALSE, showProgress=FALSE (you wrote that near the top of this issue, so I'm relying on it). "Robustly" meaning it runs 10 times straight with just the two expected warnings and no sign of the stack imbalance warning.
Leave it with me. Thanks again.
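One way to keep console output out of worker threads can be sketched as follows. This is only an illustration under stated assumptions, not necessarily what data.table does: `report` is a hypothetical stand-in for `Rprintf`, `process_chunks` and `report_calls` are made-up names, and the chunk parsing is omitted. The workers only bump a shared counter, and the calling thread prints once the parallel region has finished.

```c
#include <stdio.h>

/* Counts how often report() ran; lets a caller confirm it printed once. */
static int report_calls = 0;

/* Hypothetical stand-in for Rprintf. Only the calling thread uses it. */
static void report(int done, int total) {
    int pct = total > 0 ? (int)(100.0 * done / total) : 100;
    report_calls++;
    printf("Read %d%%\n", pct);
}

/* Process nchunks chunks, in parallel when OpenMP is enabled.
   Worker threads never print; they only update a shared counter. */
int process_chunks(int nchunks) {
    int done = 0;
#ifdef _OPENMP
    #pragma omp parallel for
#endif
    for (int i = 0; i < nchunks; i++) {
        /* ... parse chunk i here (omitted in this sketch) ... */
#ifdef _OPENMP
        #pragma omp atomic
#endif
        done++;
    }
    /* Back on the calling thread, outside the parallel region. */
    report(done, nchunks);
    return done;
}
```

The trade-off in this sketch is that progress is reported only at the end rather than during the read; a live progress meter would need a different arrangement.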
@HughParsonage OK, please try again with that recent second attempt. It hasn't been merged to master yet, so please take care to fetch the Windows.zip from the branch here. As before, please provide the full output so I can check it. Thanks!
The first attempt at the following resulted in a crash (something about a pointer).
The second attempt (after rebooting) produces stack imbalance in '$', 16 then 15 as a warning.
# Assert that `data.table` is not installed:
stopifnot(!requireNamespace("data.table", quietly = TRUE))
install.packages("https://ci.appveyor.com/api/buildjobs/bpsehtwybbbgbyy3/artifacts/data.table_1.10.5.zip",
repos = NULL)
# Installing package into ‘C:/Users/hughp/Documents/R/win-library/3.4’
# (as ‘lib’ is unspecified)
# trying URL 'https://ci.appveyor.com/api/buildjobs/bpsehtwybbbgbyy3/artifacts/data.table_1.10.5.zip'
# Content type 'application/octet-stream' length 1557502 bytes (1.5 MB)
# downloaded 1.5 MB
library(data.table)
# data.table 1.10.5 IN DEVELOPMENT built 2017-11-17 01:38:17 UTC; appveyor
# The fastest way to learn (by data.table authors): https://www.datacamp.com/courses/data-analysis-the-data-table-way
# Documentation: ?data.table, example(data.table) and browseVignettes("data.table")
# Release notes, videos and slides: http://r-datatable.com
setwd("~/ABS-data/inbox/SA2-by-DJZ-2011/")
fread("SA2-by-DJZ-2011.csv", verbose = TRUE, na.strings = "", header = FALSE)
# Input contains no \n. Taking this to be a filename to open
# [01] Check arguments
# Using 12 threads (omp_get_max_threads()=12, nth=12)
# NAstrings = [<<>>]
# None of the NAstrings look like numbers.
# show progress = 1
# 0/1 column will be read as boolean
# [02] Opening the file
# Opening file SA2-by-DJZ-2011.csv
# File opened, size = 349.4MB (366418725 bytes).
# Memory mapping ... ok
# [03] Detect and skip BOM
# [04] Arrange mmap to be \0 terminated
# \r-only line endings are not allowed because \n is found in the data
# [05] Skipping initial rows if needed
# Positioned on line 1 starting: <<Australian Bureau of Statistic>>
# [06] Detect separator, quoting rule, and ncolumns
# Detecting sep ...
# sep=',' with 89 lines of 4 fields using quote rule 0
# Detected 4 columns on line 12. This line is either column names or first data row. Line starts as: <<"Goulburn","110018063",3499,>>
# Quote rule picked = 0
# fill=false and the most number of columns found is 4
# [07] Detect column types, good nrow estimate and whether first row is column names
# 'header' changed by user from 'auto' to false
# Number of sampling jump points = 101 because (366418375 bytes from row 1 to eof) / (2 * 1457 jump0size) == 125744
# Type codes (jump 000) : 1551 Quote rule 0
# Type codes (jump 100) : 1A51 Quote rule 0
# =====
# Sampled 10027 rows (handled \n inside quoted fields) at 101 jump points
# Bytes from first data row on line 12 to the end of last row: 366418143
# Line length: mean=16.02 sd=0.21 min=16 max=29
# Estimated number of rows: 366418143 / 16.02 = 22877178
# Initial alloc = 25164895 rows (22877178 + 9%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
# =====
# [08] Assign column names
# [09] Apply user overrides on column types
# After 0 type and 0 drop user overrides : 1A51
# [10] Allocate memory for the datatable
# Allocating 4 column slots (4 - 0 dropped) with 25164895 rows
# [11] Read the data
# jumps=[0..360), chunk_size=1017828, total_size=366418143
# Read 78%. ETA 00:00 Warning: stack imbalance in '$', 16 then 15
# Read 22885380 rows x 4 columns from 349.4MB (366418725 bytes) file in 00:02.677 wall clock time
# [12] Finalizing the datatable
# Type counts:
# 1 : bool8 '1'
# 1 : int32 '5'
# 2 : string 'A'
# =============================
# 0.002s ( 0%) Memory map 0.341GB file
# 0.007s ( 0%) sep=',' ncol=4 and header detection
# 0.001s ( 0%) Column type detection using 10027 sample rows
# 0.297s ( 11%) Allocation of 25164895 rows x 4 cols (0.469GB) of which 22885380 ( 91%) rows used
# 2.369s ( 89%) Reading 360 chunks of 0.971MB (63547 rows) using 12 threads
# = 0.003s ( 0%) Finding first non-embedded \n after each jump
# + 0.273s ( 10%) Parse to row-major thread buffers (grown 0 times)
# + 1.313s ( 49%) Transpose
# + 0.780s ( 29%) Waiting
# 0.893s ( 33%) Rereading 1 columns due to out-of-sample type exceptions
# 2.677s Total
# Column 1 ("") bumped from 'bool8' to 'string' due to <<"Goulburn">> on row 0
# V1 V2 V3 V4
# 1: Goulburn 110018063 3499 NA
# 2: NA 110018064 812 NA
# 3: NA 110018065 2158 NA
# 4: NA 110019999 402 NA
# 5: NA 110028068 10 NA
# ---
# 22885376: NA 997999799 0 NA
# 22885377: NA 998999899 64 NA
# 22885378: NA 994999499 34 NA
# 22885379: NA 0&&&&&&&& 250796 NA
# 22885380: NA 0@@@@@@@@ 7305367 NA
# Warning messages:
# 1: In fread("SA2-by-DJZ-2011.csv", verbose = TRUE, na.strings = "", :
# Starting data input on line 12 <<"Goulburn","110018063",3499,>> with 4 fields and discarding line 11 <<"Main Statistical Area Structu>> before it because it has a different number of fields (3).
# 2: In fread("SA2-by-DJZ-2011.csv", verbose = TRUE, na.strings = "", :
# Found the last consistent line but text exists afterwards. Consider fill=TRUE and/or blank.lines.skip=TRUE. First 200 characters of discarded line: <<"Dataset: 2011 Census of Population and Housing">>
Hi, @mattdowle. There are versions of gcc still in use whose OpenMP support is at best 3.1, not 4.0. I ran into that problem in one of my packages on CRAN (Delaporte), where I tried to use a SIMD directive (OpenMP 4.0). It compiled fine with Rtools for Windows (based on gcc 4.9.3) but threw an error on someone's Linux machine still using gcc 4.8.0. Even Windows can only use 4.0 calls and not 4.5 calls, if I remember correctly. Could that be contributing to this issue?
@HughParsonage Thanks for testing so quickly! OK, I'll keep thinking!
@aadler That's a good thought - anything is possible.
@HughParsonage Just to confirm, please: does the same command with only one change (`verbose=FALSE`) work fine? That is, `fread("SA2-by-DJZ-2011.csv", verbose = FALSE, na.strings = "", header = FALSE)`. The progress meter will still be displayed.
Yes, running that command (ten times) returned the expected result (i.e. the data.table with only two warnings, because the file is badly formatted). No stack imbalance warnings.
Thanks. So it appears to be related to the console output. A few more things to try ...
In verbose mode, there are some branches inside the parallel region that call `wallclock()`. To rule that out, I've short-circuited it to always return 0.0 and so avoid the system call. I thought it was thread safe, but perhaps not. Please try the new Windows.zip from the rebuilt branch here.
First attempt:
install.packages("https://ci.appveyor.com/api/buildjobs/o0pn9ttkrbqgqw2k/artifacts/data.table_1.10.5.zip",
repos = NULL)
# Installing package into ‘C:/Users/hughp/Documents/R/win-library/3.4’
# (as ‘lib’ is unspecified)
# trying URL 'https://ci.appveyor.com/api/buildjobs/o0pn9ttkrbqgqw2k/artifacts/data.table_1.10.5.zip'
# Content type 'application/octet-stream' length 1556972 bytes (1.5 MB)
# downloaded 1.5 MB
#
# package ‘data.table’ successfully unpacked and MD5 sums checked
library(data.table)
# data.table 1.10.5 IN DEVELOPMENT built 2017-11-17 03:49:20 UTC; appveyor
# The fastest way to learn (by data.table authors): https://www.datacamp.com/courses/data-analysis-the-data-table-way
# Documentation: ?data.table, example(data.table) and browseVignettes("data.table")
# Release notes, videos and slides: http://r-datatable.com
fread("SA2-by-DJZ-2011.csv", verbose = TRUE, na.strings = "", header = FALSE)
Second attempt, I get the following warning:
Read 22%. ETA 00:00 Error in fread("SA2-by-DJZ-2011.csv", verbose = TRUE, na.strings = "", :
unprotect_ptr: pointer not found
In addition: Warning messages:
1: In fread("SA2-by-DJZ-2011.csv", verbose = TRUE, na.strings = "", :
Starting data input on line 12 <<"Goulburn","110018063",3499,>> with 4 fields and discarding line 11 <<"Main Statistical Area Structu>> before it because it has a different number of fields (3).
2: In fread("SA2-by-DJZ-2011.csv", verbose = TRUE, na.strings = "", :
Warning: stack imbalance in '$', 29 then 28
Found the last consistent line but text exists afterwards. Consider fill=TRUE and/or blank.lines.skip=TRUE. First 200 characters of discarded line: <<"Dataset: 2011 Census of Population and Housing">>
Warning: stack imbalance in 'lapply', 125 then 126
Warning: stack imbalance in 'lapply', 55 then 53
Warning: stack imbalance in 'lapply', 30 then 34
Warning: stack imbalance in '<-', 28 then 31
Warning: stack imbalance in '{', 24 then 27
Warning: stack imbalance in '{', 18 then 21
Just a thought: could this be a problem with RStudio? Running the script from the terminal doesn't reproduce as easily. I'm running from RStudio because it makes copying the console output easier.
When you say it doesn't reproduce _as easily_ outside RStudio, does it reproduce _at all_? Even if it only happens inside RStudio, it's still something I'd aim to fix on the data.table side. I'm just asking as another avenue to confirm that it is definitely "just" console output and not some other genuine stack imbalance in the fread logic.
I have yet to reproduce it outside RStudio, and I can reliably reproduce it inside RStudio (i.e., I can reproduce some warning or crash). I've tried the Windows command prompt and Git shell (on Windows).
I'm using RStudio version 1.1.383 on Windows. Would it help you if I raised this issue with them too, or would you prefer me to wait?
Thanks. It's really useful to know that it's only inside RStudio. No need to raise it with them. It just means it has something to do with console output buffering (or similar). I've gone ahead with a workaround and am about to push.
I don't see why Windows isn't compiling that change:
fread.c:1054:3: warning: too many arguments for format [-Wformat-extra-args]
Works fine here on Linux and on Travis. A Windows.zip is being built for you to test this workaround. I'll have to sleep on it.
(It's complaining about line 1054 but not the very next line, 1055, which looks just the same as far as I can see. There must be some difference. A problem with `__VA_ARGS__` and `%llu` on Windows - surely not.)
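A format warning of this kind means the compiler thinks the format string and its arguments disagree. One commonly cited cause on MinGW builds is that `%llu` may be checked against a printf implementation that spells 64-bit conversions differently, though nothing in this thread confirms that this is what happened at fread.c:1054. The sketch below shows a portable way to print a 64-bit count through a variadic macro, using `PRIu64` from `<inttypes.h>`. The names `LOG_SKETCH` and `format_bytes` are hypothetical and are not taken from fread.c.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical logging macro (not from fread.c): forwards a format string
   and its arguments to snprintf via __VA_ARGS__. */
#define LOG_SKETCH(buf, size, ...) snprintf((buf), (size), __VA_ARGS__)

/* Format a byte count into buf. PRIu64 expands to the conversion specifier
   that matches uint64_t on the current platform, so the format string and
   the argument type always agree. Returns the number of characters that
   the full message needs. */
static int format_bytes(char *buf, size_t size, uint64_t nbytes) {
    assert(buf != NULL);
    return LOG_SKETCH(buf, size, "size = %" PRIu64 " bytes", nbytes);
}
```

Calling `format_bytes(buf, sizeof buf, UINT64_C(366418725))` fills `buf` with `size = 366418725 bytes`, using the file size from the log above.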
OK, finally a windows.zip is ready for you to try again here.
There are now several workarounds in this branch. If it works, I'll start taking the workarounds out to establish which one it was. The `%llu` compiler warnings look the most promising, because the explanation found here by @st-pasha would give rise to a stack imbalance in verbose output, which is consistent. Maybe the `Rprintf` layer was hiding that from the compiler, which can now see it because `fprintf` is being used directly.
On the second attempt (after rebooting):
stopifnot(!requireNamespace("data.table", quietly = TRUE))
install.packages("https://ci.appveyor.com/api/buildjobs/1pi0ae5iuyj9rhj8/artifacts/data.table_1.10.5.zip",
repos = NULL)
# Installing package into ‘C:/Users/hughp/Documents/R/win-library/3.4’
# (as ‘lib’ is unspecified)
# trying URL 'https://ci.appveyor.com/api/buildjobs/1pi0ae5iuyj9rhj8/artifacts/data.table_1.10.5.zip'
# Content type 'application/octet-stream' length 1559167 bytes (1.5 MB)
# downloaded 1.5 MB
#
# package ‘data.table’ successfully unpacked and MD5 sums checked
library(data.table)
# data.table 1.10.5 IN DEVELOPMENT built 2017-11-18 04:58:23 UTC; appveyor
# The fastest way to learn (by data.table authors): https://www.datacamp.com/courses/data-analysis-the-data-table-way
# Documentation: ?data.table, example(data.table) and browseVignettes("data.table")
# Release notes, videos and slides: http://r-datatable.com
fread("SA2-by-DJZ-2011.csv", verbose = TRUE, na.strings = "", header = FALSE)
Log file: C:\Users\hughp\AppData\Local\Temp\RtmpIT9H0D/fread.out
Input contains no \n. Taking this to be a filename to open
Read 11%. ETA 00:00 Warning: stack imbalance in '$', 19 then 20
Read 28%. ETA 00:00 Warning: stack imbalance in '$', 19 then 20
Read 48%. ETA 00:00 Warning: stack imbalance in '$', 20 then 19
Read 98%. ETA 00:00 [01] Check arguments
Using 12 threads (omp_get_max_threads()=12, nth=12)
NAstrings = [<<>>]
None of the NAstrings look like numbers.
show progress = 1
0/1 column will be read as boolean
[02] Opening the file
Opening file SA2-by-DJZ-2011.csv
File opened, size = 349.4MB (366418725 bytes).
Memory mapped ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
\n has been found in the data so any mixture of line endings is allowed other than \r-only line endings. This is common and ideal.
[05] Skipping initial rows if needed
Positioned on line 1 starting: <<Australian Bureau of Statistic>>
[06] Detect separator, quoting rule, and ncolumns
Detecting sep ...
sep=',' with 89 lines of 4 fields using quote rule 0
Detected 4 columns on line 12. This line is either column names or first data row. Line starts as: <<"Goulburn","110018063",3499,>>
Quote rule picked = 0
fill=false and the most number of columns found is 4
[07] Detect column types, good nrow estimate and whether first row is column names
'header' changed by user from 'auto' to false
Number of sampling jump points = 101 because (366418375 bytes from row 1 to eof) / (2 * 1457 jump0size) == 125744
Type codes (jump 000) : 1551 Quote rule 0
Type codes (jump 100) : 1A51 Quote rule 0
=====
Sampled 10027 rows (handled \n inside quoted fields) at 101 jump points
Bytes from first data row on line 12 to the end of last row: 366418143
Line length: mean=16.02 sd=0.21 min=16 max=29
Estimated number of rows: 366418143 / 16.02 = 22877178
Initial alloc = 25164895 rows (22877178 + 9%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
=====
[08] Assign column names
[09] Apply user overrides on column types
After 0 type and 0 drop user overrides : 1A51
[10] Allocate memory for the datatable
Allocating 4 column slots (4 - 0 dropped) with 25164895 rows
[11] Read the data
jumps=[0..360), chunk_size=1017828, total_size=366418143
[11] Read the data
jumps=[0..360), chunk_size=1017828, total_size=366418143
Read 22885380 rows x 4 columns from 349.4MB (366418725 bytes) file in 00:02.822 wall clock time
[12] Finalizing the datatable
Type counts:
1 : bool8 '1'
1 : int32 '5'
2 : string 'A'
=============================
0.000s ( 0%) Memory map 0.341GB file
0.001s ( 0%) sep=',' ncol=4 and header detection
0.000s ( 0%) Column type detection using 10027 sample rows
0.291s ( 10%) Allocation of 25164895 rows x 4 cols (0.469GB) of which 22885380 ( 91%) rows used
2.531s ( 90%) Reading 360 chunks of 0.971MB (63547 rows) using 12 threads
= 0.002s ( 0%) Finding first non-embedded \n after each jump
+ 0.282s ( 10%) Parse to row-major thread buffers (grown 0 times)
+ 1.537s ( 54%) Transpose
+ 0.710s ( 25%) Waiting
0.842s ( 30%) Rereading 1 columns due to out-of-sample type exceptions
2.822s Total
Column 1 ("") bumped from 'bool8' to 'string' due to <<"Goulburn">> on row 0
V1 V2 V3 V4
1: Goulburn 110018063 3499 NA
2: NA 110018064 812 NA
3: NA 110018065 2158 NA
4: NA 110019999 402 NA
5: NA 110028068 10 NA
---
22885376: NA 997999799 0 NA
22885377: NA 998999899 64 NA
22885378: NA 994999499 34 NA
22885379: NA 0&&&&&&&& 250796 NA
22885380: NA 0@@@@@@@@ 7305367 NA
Warning messages:
1: In fread("SA2-by-DJZ-2011.csv", verbose = TRUE, na.strings = "", :
Starting data input on line 12 <<"Goulburn","110018063",3499,>> with 4 fields and discarding line 11 <<"Main Statistical Area Structu>> before it because it has a different number of fields (3).
2: In fread("SA2-by-DJZ-2011.csv", verbose = TRUE, na.strings = "", :
Found the last consistent line but text exists afterwards. Consider fill=TRUE and/or blank.lines.skip=TRUE. First 200 characters of discarded line: <<"Dataset: 2011 Census of Population and Housing">>
Again, not reproducible outside RStudio.
Thanks for testing so quickly. Well, that has certainly ruled a lot out! Two ideas left. I've pushed the first one and it passed. Please try the new Windows.zip here (https://ci.appveyor.com/project/Rdatatable/data-table/build/1.0.1363/job/fo02vnbu5ebhwy3w/artifacts). It concerns an `alloca` on the stack, which has to do with `na.strings`, which you happen to be setting. It's definitely in the right area (stack imbalance) and worth a try.
No problem - I'll be away for the next 12 hours or so, so I can't test until then.
OK, no problem. Thanks! I've now pushed the second idea too. I seem to recall problems with `\r` on Windows in the past, although I don't remember a stack imbalance. Anyway, to rule that out, I've removed the `\r` from the progress meter. The stack imbalance message tends to be printed where the ETA lines are. It's possible that the console catches `\r` and treats it differently so that the last line gets replaced. You should now see a new line every time the ETA updates. This is just temporary, to rule that out. A new Windows.zip has been built and is passing here.
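To make the two progress-meter modes concrete, here is a minimal sketch of a formatter that either redraws one line in place with `\r` or emits a fresh line per update. The function `format_progress` and its format strings are assumptions that only imitate the `Read 55%. ETA 00:00` lines in the logs; this is not the code in fread.c.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical progress formatter (not from fread.c).
   overwrite != 0: the text starts with '\r', so a terminal redraws the
                   same line in place.
   overwrite == 0: the text ends with '\n', so every update lands on its
                   own line, as in the temporary diagnostic build.
   Returns the number of characters the full message needs. */
static int format_progress(char *buf, size_t size, int pct, int eta_secs,
                           int overwrite) {
    assert(buf != NULL && pct >= 0 && pct <= 100 && eta_secs >= 0);
    if (overwrite) {
        return snprintf(buf, size, "\rRead %d%%. ETA %02d:%02d ",
                        pct, eta_secs / 60, eta_secs % 60);
    }
    return snprintf(buf, size, "Read %d%%. ETA %02d:%02d\n",
                    pct, eta_secs / 60, eta_secs % 60);
}
```

With `overwrite` set to 0, any message that another component prints between two updates shows up on its own line between them, which makes it easier to see where the stack imbalance warnings fall relative to the ETA updates.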
fread("SA2-by-DJZ-2011.csv", verbose = TRUE, na.strings = "", header = FALSE)
Log file: C:\Users\hughp\AppData\Local\Temp\RtmpcVjZ1f/fread.out
Input contains no \n. Taking this to be a filename to open
Read 5%. ETA 00:00
Read 8%. ETA 00:00
Read 11%. ETA 00:00
Read 15%. ETA 00:00
Read 18%. ETA 00:00
Read 21%. ETA 00:00
Read 25%. ETA 00:00
Read 28%. ETA 00:00
Read 31%. ETA 00:00
Read 35%. ETA 00:00
Read 38%. ETA 00:00
Read 41%. ETA 00:00
Read 45%. ETA 00:00
Read 48%. ETA 00:00
Read 51%. ETA 00:00
Read 55%. ETA 00:00
Warning: stack imbalance in '$', 30 then 31
Warning: stack imbalance in '$', 17 then 16
Read 58%. ETA 00:00
Read 61%. ETA 00:00
Read 65%. ETA 00:00
Read 68%. ETA 00:00
Read 71%. ETA 00:00
Read 75%. ETA 00:00
Read 78%. ETA 00:00
Read 81%. ETA 00:00
Read 85%. ETA 00:00
Read 88%. ETA 00:00
Read 91%. ETA 00:00
Read 95%. ETA 00:00
Read 98%. ETA 00:00
[01] Check arguments
Using 12 threads (omp_get_max_threads()=12, nth=12)
NAstrings = [<<>>]
None of the NAstrings look like numbers.
show progress = 1
0/1 column will be read as boolean
[02] Opening the file
Opening file SA2-by-DJZ-2011.csv
File opened, size = 349.4MB (366418725 bytes).
Memory mapped ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
\n has been found in the data so any mixture of line endings is allowed other than \r-only line endings. This is common and ideal.
[05] Skipping initial rows if needed
Positioned on line 1 starting: <<Australian Bureau of Statistic>>
[06] Detect separator, quoting rule, and ncolumns
Detecting sep ...
sep=',' with 89 lines of 4 fields using quote rule 0
Detected 4 columns on line 12. This line is either column names or first data row. Line starts as: <<"Goulburn","110018063",3499,>>
Quote rule picked = 0
fill=false and the most number of columns found is 4
[07] Detect column types, good nrow estimate and whether first row is column names
'header' changed by user from 'auto' to false
Number of sampling jump points = 101 because (366418375 bytes from row 1 to eof) / (2 * 1457 jump0size) == 125744
Type codes (jump 000) : 1551 Quote rule 0
Type codes (jump 100) : 1A51 Quote rule 0
=====
Sampled 10027 rows (handled \n inside quoted fields) at 101 jump points
Bytes from first data row on line 12 to the end of last row: 366418143
Line length: mean=16.02 sd=0.21 min=16 max=29
Estimated number of rows: 366418143 / 16.02 = 22877178
Initial alloc = 25164895 rows (22877178 + 9%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
=====
[08] Assign column names
[09] Apply user overrides on column types
After 0 type and 0 drop user overrides : 1A51
[10] Allocate memory for the datatable
Allocating 4 column slots (4 - 0 dropped) with 25164895 rows
[11] Read the data
jumps=[0..360), chunk_size=1017828, total_size=366418143
[11] Read the data
jumps=[0..360), chunk_size=1017828, total_size=366418143
Read 22885380 rows x 4 columns from 349.4MB (366418725 bytes) file in 00:02.894 wall clock time
[12] Finalizing the datatable
Type counts:
1 : bool8 '1'
1 : int32 '5'
2 : string 'A'
=============================
0.001s ( 0%) Memory map 0.341GB file
0.003s ( 0%) sep=',' ncol=4 and header detection
0.000s ( 0%) Column type detection using 10027 sample rows
0.316s ( 11%) Allocation of 25164895 rows x 4 cols (0.469GB) of which 22885380 ( 91%) rows used
2.574s ( 89%) Reading 360 chunks of 0.971MB (63547 rows) using 12 threads
= 0.004s ( 0%) Finding first non-embedded \n after each jump
+ 0.284s ( 10%) Parse to row-major thread buffers (grown 0 times)
+ 1.450s ( 50%) Transpose
+ 0.837s ( 29%) Waiting
0.953s ( 33%) Rereading 1 columns due to out-of-sample type exceptions
2.894s Total
Column 1 ("") bumped from 'bool8' to 'string' due to <<"Goulburn">> on row 0
V1 V2 V3 V4
1: Goulburn 110018063 3499 NA
2: NA 110018064 812 NA
3: NA 110018065 2158 NA
4: NA 110019999 402 NA
5: NA 110028068 10 NA
---
22885376: NA 997999799 0 NA
22885377: NA 998999899 64 NA
22885378: NA 994999499 34 NA
22885379: NA 0&&&&&&&& 250796 NA
22885380: NA 0@@@@@@@@ 7305367 NA
Warning messages:
1: In fread("SA2-by-DJZ-2011.csv", verbose = TRUE, na.strings = "", :
Starting data input on line 12 <<"Goulburn","110018063",3499,>> with 4 fields and discarding line 11 <<"Main Statistical Area Structu>> before it because it has a different number of fields (3).
2: In fread("SA2-by-DJZ-2011.csv", verbose = TRUE, na.strings = "", :
Found the last consistent line but text exists afterwards. Consider fill=TRUE and/or blank.lines.skip=TRUE. First 200 characters of discarded line: <<"Dataset: 2011 Census of Population and Housing">>
FYI: I could not reproduce this stack imbalance error on a different Windows machine with a slightly older version of RStudio.
In that case, it seems it's time to ask RStudio support, as you suggested. I've had another look through the fread code and I'm out of ideas on my side. Please give them the two RStudio version numbers. It isn't necessarily RStudio; it could be a fault on the data.table side that just happens to show up on one version of RStudio. But it's odd that it seems related to console output, which is something that differs and is RStudio-specific. I've searched for "RStudio stack imbalance", but many of the issues that come up are about package faults, not RStudio per se. It's a tricky problem to search for. Let's keep the issue open here and see what they say.
I doubt the final attempt will help, but for completeness, please give it a go here. Perhaps the MinGW compiler used on Windows does something strange with those two ints. One of them is a constant 0, which may be optimized away and then somehow cause the stack imbalance.
However, that particular stack imbalance message comes from eval.c:491 in R. Some thread must be running that line, but I don't think it is `fread` or `data.table`. That `check_stack_balance()` is only called from 5 places in R internals:
- names.c, at the end of `do_internal()`
- objects.c, twice in `applyMethod()`
- eval.c, twice in `eval()`
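For readers unfamiliar with the message, a warning such as `stack imbalance in '$', 16 then 15` reports the depth of R's pointer protection stack before and after a call was evaluated. The following toy model illustrates that kind of check. It is a sketch for illustration only, not R's implementation, and every name in it (`toy_depth`, `toy_protect`, `toy_check_balance`, and so on) is hypothetical.

```c
#include <assert.h>
#include <stdio.h>

/* Toy stand-in for a protection stack: only its depth matters here. */
static int toy_depth = 0;

static void toy_protect(void)    { toy_depth++; }
static void toy_unprotect(int n) { toy_depth -= n; }

/* Run fn, then compare the stack depth before and after the call.
   Returns 1 if balanced; otherwise writes a message into msg, returns 0. */
static int toy_check_balance(void (*fn)(void), const char *name,
                             char *msg, size_t size) {
    int before = toy_depth;
    fn();
    if (toy_depth == before) {
        if (size > 0) msg[0] = '\0';
        return 1;
    }
    snprintf(msg, size, "stack imbalance in '%s', %d then %d",
             name, before, toy_depth);
    return 0;
}

/* Two protects, two unprotects: depth is unchanged afterwards. */
static void balanced_call(void)   { toy_protect(); toy_protect(); toy_unprotect(2); }

/* One protect, two unprotects: depth ends one lower than it started. */
static void unbalanced_call(void) { toy_protect(); toy_unprotect(2); }
```

A check like this only shows that the depth changed while the call was running. It cannot tell which code changed it, which is why a second thread touching the same stack during the call would produce the same message.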
I don't see how any of those can be reached while `fread.c` is running. The only entry point being called is `REprintf`, and I don't see how that can reach `check_stack_balance()`. All I can think of is that RStudio currently has a thread doing something in the background that perhaps interacts with console output, maybe differently on Windows.
Finally, for completeness, using `REprintf` does seem to be the right approach, because R itself uses it (rather than Rprintf) for its progress meters in libcurl.c:354 and internet.c:409. It's a shame that R's C-level progress bar isn't available in R's API (it also seems to be implemented twice at C level in R).
@mattdowle, would this be helpful? https://github.com/r-lib/progress
@aadler Yes - thanks! Its source has this comment:
// In R Studio we should print to stdout, because printing a \r
// to stderr is buggy (reported)
But I had already removed the `\r` and the stack imbalance still occurs. I wonder where it was reported.
The last build didn't work either:
Reported at https://community.rstudio.com/t/stack-imbalance-possibly-in-stderb//9
A timely question on R-devel: [Rd] Are Rprintf and REprintf thread-safe?
Upshot: "Rprintf and REprintf are not thread-safe."
Yikes!
Thanks to everyone for the links, and to Hugh for raising this issue with RStudio.
data.table::fwrite() and data.table::fread() do use Rprintf and REprintf, but they only ever call them from the master thread. Not only do two data.table threads never call that R entry point at the same time, but only the master thread ever calls it, and it is the only R entry point called at any time within the threads' parallel region. However, Rprintf calls R_CheckUserInterrupt every 100 prints. I suspect that part may not be safe even from the master thread alone. That is the reason for now using REprintf, which does not call R_CheckUserInterrupt. R's internals use REprintf for their progress meters, so the switch to REprintf is in the spirit of doing what core does; that is, the choice has nothing to do with stderr versus stdout.
@kevinushey Would you take a look at this thread and see if you can tell me anything more? It may be related to RStudio in some way, perhaps to a background thread? If RStudio has a background thread, then Rprintf / REprintf could be called from two threads at the same time. But if that were happening, we would have seen many more problems before now, so it seems very unlikely. Perhaps RStudio replaces the ptr_* callbacks described in R-exts, which relate to console output and interaction. However, that section starts with "For Unix-alikes", so I don't know where Windows comes in. Section 8.1.5, Threading issues, may be relevant too. Both are sub-sections of section 8: "Linking GUIs and other front-ends to R".
I'm going to be away until early December, so unfortunately I won't get a chance to look until then. However, RStudio runs almost everything on the main thread using the R event loop; the only exceptions are things like project-level file indexing, and those background threads generally don't touch any R API.
RStudio does take over various ptr_* callbacks to handle console input and output; I can't immediately think how those could be involved here, but I'll try to take a deeper look when I'm back.
OK, please try this one. Previously, it was updating the progress status every 2%. In your case the file takes just under 3 seconds, so there was a new progress update to the RStudio console every 0.06 seconds. Perhaps that was too much for RStudio. So this attempt prints each piece of progress only once and does not use \r at all. That should be better anyway for reports and log files, where \r would fill up the output.
Since your 3-second time is quite fast, I've lowered the progress bar to start from 1 second if there is an ETA of at least 1 second from there. Otherwise it wouldn't display at all, and it would work for your file only because it wasn't being displayed. After you've tested, I'll do what fwrite does: start from 2 seconds if the ETA is 2 seconds from there.
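The throttling rule described above can be sketched in a few lines of R. This is only an illustration under stated assumptions: the function name should_show_progress, its arguments, and the linear ETA estimate are hypothetical and are not taken from data.table's C source.

```r
# Hypothetical helper (not part of data.table): decide whether a progress
# bar is worth displaying, following the rule described above.
#   elapsed   : seconds since the read started
#   fraction  : share of the work completed so far, in (0, 1]
#   threshold : seconds; 1 in the build under test, 2 for the fwrite-style rule
should_show_progress <- function(elapsed, fraction, threshold = 1) {
  stopifnot(fraction > 0, fraction <= 1)
  # Assumed ETA: the remaining work proceeds at the average rate seen so far.
  eta <- elapsed * (1 - fraction) / fraction
  elapsed >= threshold && eta >= threshold
}

# A read that is 40% done after 1 second has about 1.5 seconds left: show it.
should_show_progress(elapsed = 1.0, fraction = 0.4)
# A read that is 90% done after 1 second is nearly finished: stay silent.
should_show_progress(elapsed = 1.0, fraction = 0.9)
# The same 40% case under the 2-second rule: too early to show anything.
should_show_progress(elapsed = 1.0, fraction = 0.4, threshold = 2)
```

With a rule like this, a file that reads in under a couple of seconds never triggers the meter, which is why a fast read can appear to work simply because nothing is printed.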
Hi, @mattdowle. My last comment in #2503 may also be related to this issue.
Looks good! No warnings (after 5 runs). First run below (note that leading spaces look different in the actual output):
stopifnot(!requireNamespace("data.table", quietly = TRUE))
install.packages("https://ci.appveyor.com/api/buildjobs/1o9s06o31v8i3ljr/artifacts/data.table_1.10.5.zip",
repos = NULL)
# Installing package into ‘C:/Users/hughp/Documents/R/win-library/3.4’
# (as ‘lib’ is unspecified)
# trying URL 'https://ci.appveyor.com/api/buildjobs/1o9s06o31v8i3ljr/artifacts/data.table_1.10.5.zip'
# Content type 'application/octet-stream' length 1557423 bytes (1.5 MB)
# downloaded 1.5 MB
#
# package ‘data.table’ successfully unpacked and MD5 sums checked
library(data.table)
# data.table 1.10.5 IN DEVELOPMENT built 2017-11-30 00:21:00 UTC; appveyor
# The fastest way to learn (by data.table authors): https://www.datacamp.com/courses/data-analysis-the-data-table-way
# Documentation: ?data.table, example(data.table) and browseVignettes("data.table")
# Release notes, videos and slides: http://r-datatable.com
setwd("~/ABS-data/inbox/SA2-by-DJZ-2011/")
fread("SA2-by-DJZ-2011.csv", verbose = TRUE, na.strings = "", header = FALSE)
# Input contains no \n. Taking this to be a filename to open
# [01] Check arguments
# Using 12 threads (omp_get_max_threads()=12, nth=12)
# NAstrings = [<<>>]
# None of the NAstrings look like numbers.
# show progress = 1
# 0/1 column will be read as boolean
# [02] Opening the file
# Opening file SA2-by-DJZ-2011.csv
# File opened, size = 349.4MB (366418725 bytes).
# Memory mapped ok
# [03] Detect and skip BOM
# [04] Arrange mmap to be \0 terminated
# \n has been found in the data so any mixture of line endings is allowed other than \r-only line endings. This is common and ideal.
# [05] Skipping initial rows if needed
# Positioned on line 1 starting: <<Australian Bureau of Statistic>>
# [06] Detect separator, quoting rule, and ncolumns
# Detecting sep ...
# sep=',' with 89 lines of 4 fields using quote rule 0
# Detected 4 columns on line 12. This line is either column names or first data row. Line starts as: <<"Goulburn","110018063",3499,>>
# Quote rule picked = 0
# fill=false and the most number of columns found is 4
# [07] Detect column types, good nrow estimate and whether first row is column names
# 'header' changed by user from 'auto' to false
# Number of sampling jump points = 101 because (366418375 bytes from row 1 to eof) / (2 * 1457 jump0size) == 125744
# Type codes (jump 000) : 1551 Quote rule 0
# Type codes (jump 100) : 1A51 Quote rule 0
# =====
# Sampled 10027 rows (handled \n inside quoted fields) at 101 jump points
# Bytes from first data row on line 12 to the end of last row: 366418143
# Line length: mean=16.02 sd=0.21 min=16 max=29
# Estimated number of rows: 366418143 / 16.02 = 22877178
# Initial alloc = 25164895 rows (22877178 + 9%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
# =====
# [08] Assign column names
# [09] Apply user overrides on column types
# After 0 type and 0 drop user overrides : 1A51
# [10] Allocate memory for the datatable
# Allocating 4 column slots (4 - 0 dropped) with 25164895 rows
# [11] Read the data
# jumps=[0..360), chunk_size=1017828, total_size=366418143
# jumps=[0..360), chunk_size=1017828, total_size=366418143
# |--------------------------------------------------|
# |==================================================|
# Read 22885380 rows x 4 columns from 349.4MB (366418725 bytes) file in 00:02.280 wall clock time
# [12] Finalizing the datatable
# Type counts:
# 1 : bool8 '1'
# 1 : int32 '5'
# 2 : string 'A'
# =============================
# 0.005s ( 0%) Memory map 0.341GB file
# 0.037s ( 2%) sep=',' ncol=4 and header detection
# 0.000s ( 0%) Column type detection using 10027 sample rows
# 0.321s ( 14%) Allocation of 25164895 rows x 4 cols (0.469GB) of which 22885380 ( 91%) rows used
# 1.917s ( 84%) Reading 360 chunks of 0.971MB (63547 rows) using 12 threads
# = 0.011s ( 0%) Finding first non-embedded \n after each jump
# + 0.560s ( 25%) Parse to row-major thread buffers (grown 0 times)
# + 0.488s ( 21%) Transpose
# + 0.858s ( 38%) Waiting
# 0.999s ( 44%) Rereading 1 columns due to out-of-sample type exceptions
# 2.280s Total
# Column 1 ("") bumped from 'bool8' to 'string' due to <<"Goulburn">> on row 0
# V1 V2 V3 V4
# 1: Goulburn 110018063 3499 NA
# 2: NA 110018064 812 NA
# 3: NA 110018065 2158 NA
# 4: NA 110019999 402 NA
# 5: NA 110028068 10 NA
# ---
# 22885376: NA 997999799 0 NA
# 22885377: NA 998999899 64 NA
# 22885378: NA 994999499 34 NA
# 22885379: NA 0&&&&&&&& 250796 NA
# 22885380: NA 0@@@@@@@@ 7305367 NA
# Warning messages:
# 1: In fread("SA2-by-DJZ-2011.csv", verbose = TRUE, na.strings = "", :
# Starting data input on line 12 <<"Goulburn","110018063",3499,>> with 4 fields and discarding line 11 <<"Main Statistical Area Structu>> before it because it has a different number of fields (3).
# 2: In fread("SA2-by-DJZ-2011.csv", verbose = TRUE, na.strings = "", :
# Found the last consistent line but text exists afterwards. Consider fill=TRUE and/or blank.lines.skip=TRUE. First 200 characters of discarded line: <<"Dataset: 2011 Census of Population and Housing">>
@HughParsonage Relief! I think that's a win. I'll tidy up, merge and move on. Many thanks for testing.
@aadler Yes, your comment in issue #2503 looks like just the same thing. Could you test the latest from dev and confirm that it is now fixed, please? Hopefully the problem with as.IDate that you got there was actually caused by an earlier stack imbalance.
No good :(
> library(data.table)
data.table 1.10.5 IN DEVELOPMENT built 2017-11-30 00:21:00 UTC; appveyor
The fastest way to learn (by data.table authors): https://www.datacamp.com/courses/data-analysis-the-data-table-way
Documentation: ?data.table, example(data.table) and browseVignettes("data.table")
Release notes, videos and slides: http://r-datatable.com
> sessionInfo()
R version 3.4.2 beta (2017-09-17 r73296)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows Server >= 2012 x64 (build 9200)
Matrix products: default
locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] data.table_1.10.5
loaded via a namespace (and not attached):
[1] compiler_3.4.2 tools_3.4.2
> DT <- fread('2017-11-22_1999_Performance.csv', header = TRUE, colClasses = CLS, select = SEL, key = c('LoanID', 'Month'), verbose = TRUE)
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
Using 40 threads (omp_get_max_threads()=40, nth=40)
NAstrings = [<<NA>>]
None of the NAstrings look like numbers.
show progress = 1
0/1 column will be read as boolean
[02] Opening the file
Opening file 2017-11-22_1999_Performance.csv
File opened, size = 6.355GB (6823372783 bytes).
Memory mapped ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
\n has been found in the data so any mixture of line endings is allowed other than \r-only line endings. This is common and ideal.
[05] Skipping initial rows if needed
Positioned on line 1 starting: <<Orig_Year,Orig_Qtr,LoanID,Mont>>
[06] Detect separator, quoting rule, and ncolumns
Detecting sep ...
sep=',' with 100 lines of 37 fields using quote rule 0
Detected 37 columns on line 1. This line is either column names or first data row. Line starts as: <<Orig_Year,Orig_Qtr,LoanID,Mont>>
Quote rule picked = 0
fill=false and the most number of columns found is 37
[07] Detect column types, good nrow estimate and whether first row is column names
'header' changed by user from 'auto' to true
Number of sampling jump points = 101 because (6823372781 bytes from row 1 to eof) / (2 * 12905 jump0size) == 264369
Type codes (jump 000) : 51AA7155A15A7111111111111771111177715 Quote rule 0
Type codes (jump 001) : 51AA7155A15A711111111111177111117771A Quote rule 0
Type codes (jump 008) : 51AA7555A15A711111111111177111117771A Quote rule 0
Type codes (jump 009) : 51AA7555AA5A71A155555557177111117775A Quote rule 0
Type codes (jump 042) : 55AA7555AA5A71A155555557177111117775A Quote rule 0
Type codes (jump 064) : 55AA7555AA5A71A1A5555557177111117775A Quote rule 0
Type codes (jump 100) : 55AA7555AA5A71A1A5555557177111117775A Quote rule 0
=====
Sampled 10049 rows (handled \n inside quoted fields) at 101 jump points
Bytes from first data row on line 1 to the end of last row: 6823372781
Line length: mean=126.15 sd=8.30 min=100 max=359
Estimated number of rows: 6823372781 / 126.15 = 54088821
Initial alloc = 62279495 rows (54088821 + 15%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
=====
[08] Assign column names
[09] Apply user overrides on column types
After 4 type and 23 drop user overrides : 00AA700000000000000000070775555077750
[10] Allocate memory for the datatable
Allocating 14 column slots (37 - 23 dropped) with 62279495 rows
[11] Read the data
jumps=[0..6520), chunk_size=1046529, total_size=6823372422
|--------------------------------------------------|
|=======Warning: stack imbalance in '$', 27 then 26
===Warning: stack imbalance in '$', 26 then 27
================Error in fread("2017-11-22_1999_Performance.csv", header = TRUE, colClasses = CLS, :
unprotect_ptr: pointer not found
@aadler Thanks for that report. I went through freadR and made the protection local. There's a 30% chance it now works in your case, since you are overriding types and there was quite a lot of protection in that part of the code. Please try again using this build.
@aadler If you haven't tried the last build yet, please go straight to this one. Also, if it's possible to get me a copy of your file, I can try it myself on Windows RStudio.
:(
> library(data.table)
data.table 1.10.5 IN DEVELOPMENT built 2017-12-01 01:54:04 UTC; appveyor
The fastest way to learn (by data.table authors): https://www.datacamp.com/courses/data-analysis-the-data-table-way
Documentation: ?data.table, example(data.table) and browseVignettes("data.table")
Release notes, videos and slides: http://r-datatable.com
> sessionInfo()
R version 3.4.2 beta (2017-09-17 r73296)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows Server >= 2012 x64 (build 9200)
Matrix products: default
locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] data.table_1.10.5
loaded via a namespace (and not attached):
[1] compiler_3.4.2 tools_3.4.2
> ColCLASS <- c(rep('integer', 2L), 'character', 'Date', 'numeric',
+ rep('integer', 3L), rep('character', 2L),
+ 'integer', 'Date', rep('numeric', 2L), 'Date',
+ rep('numeric', 12L), rep('integer', 5),
+ rep('numeric', 3L), 'integer', 'character')
> SELCOL <- c(WHATEVER)
> DT <- fread('LargeFile.csv', header = TRUE, colClasses = ColCLASS, select = SELCOL, key = c('LoanID', 'Month'), verbose = TRUE)
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
Using 40 threads (omp_get_max_threads()=40, nth=40)
NAstrings = [<<NA>>]
None of the NAstrings look like numbers.
show progress = 1
0/1 column will be read as boolean
[02] Opening the file
Opening file LargeFile.csv
File opened, size = 6.355GB (6823372783 bytes).
Memory mapped ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
\n has been found in the data so any mixture of line endings is allowed other than \r-only line endings. This is common and ideal.
[05] Skipping initial rows if needed
Positioned on line 1 starting: <<Orig_Year,Orig_Qtr,LoanID,Mont>>
[06] Detect separator, quoting rule, and ncolumns
Detecting sep ...
sep=',' with 100 lines of 37 fields using quote rule 0
Detected 37 columns on line 1. This line is either column names or first data row. Line starts as: <<Orig_Year,Orig_Qtr,LoanID,Mont>>
Quote rule picked = 0
fill=false and the most number of columns found is 37
[07] Detect column types, good nrow estimate and whether first row is column names
'header' changed by user from 'auto' to true
Number of sampling jump points = 101 because (6823372781 bytes from row 1 to eof) / (2 * 12905 jump0size) == 264369
Type codes (jump 000) : 51AA7155A15A7111111111111771111177715 Quote rule 0
Type codes (jump 001) : 51AA7155A15A711111111111177111117771A Quote rule 0
Type codes (jump 008) : 51AA7555A15A711111111111177111117771A Quote rule 0
Type codes (jump 009) : 51AA7555AA5A71A155555557177111117775A Quote rule 0
Type codes (jump 042) : 55AA7555AA5A71A155555557177111117775A Quote rule 0
Type codes (jump 064) : 55AA7555AA5A71A1A5555557177111117775A Quote rule 0
Type codes (jump 100) : 55AA7555AA5A71A1A5555557177111117775A Quote rule 0
=====
Sampled 10049 rows (handled \n inside quoted fields) at 101 jump points
Bytes from first data row on line 1 to the end of last row: 6823372781
Line length: mean=126.15 sd=8.30 min=100 max=359
Estimated number of rows: 6823372781 / 126.15 = 54088821
Initial alloc = 62279495 rows (54088821 + 15%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
=====
[08] Assign column names
[09] Apply user overrides on column types
After 4 type and 23 drop user overrides : 00AA700000000000000000070775555077750
[10] Allocate memory for the datatable
Allocating 14 column slots (37 - 23 dropped) with 62279495 rows
[11] Read the data
jumps=[0..6520), chunk_size=1046529, total_size=6823372422
|--------------------------------------------------|
|Error in fread("LargeFile.csv", header = TRUE, colClasses = ColCLASS, :
unprotect_ptr: pointer not found
Thanks to @aadler over email, I can now reproduce. R 3.4.2, latest RStudio 1.1.383 and Windows 10 Pro 10.0.16299 build 16299.
I'm seeing strange behaviour in RStudio, recorded here:
https://www.youtube.com/watch?v=tl2x2vmZxMU
It seems that RStudio is generating GCs just from typing. Why is that, and is there any way to turn it off? Could it be that while fread() is printing its progress bar, RStudio's separate event loop thinks the output to the console is the user typing and calls R, which gives rise to a GC and trips everything up? Perhaps RStudio users here know and can point me in the right direction, or perhaps @kevinushey is back (you did say early December, Kevin, and it's the 1st today :-))
I can reliably reproduce the stack imbalance in the RStudio console. Using the RStudio Terminal tab, I can't reproduce it, even with gcinfo(TRUE). Interestingly, the GCs happen while the progress bar is printing and it appears fine, as it is fine on Linux too. Looking at the behaviour in that video of the RStudio console, I'm coming to the conclusion that this is an RStudio console bug. I was unable to copy text from the RStudio terminal window (Edit->Copy doesn't work and neither does Ctrl-C), so I took a screenshot of the Terminal tab to show that a GC during the progress bar is fine. I'd expect it to be fine, because only the master thread is calling REprintf and the other threads are not calling any R API.
Works fine in the RStudio terminal:
Note that there are GCs while the progress bar is printing the first time, and it works fine in the RStudio terminal. The progress bar prints a second time because this test file has an out-of-sample type exception, which triggers an automatic reread of those columns.
But in the RStudio console, it is either stack imbalance or unprotect_ptr: pointer not found:
R version 3.4.2 (2017-09-28) -- "Short Summer"
> gcinfo(TRUE)
[1] FALSE
Garbage collection 22 = 16+3+3 (level 0) ...
25.5 Mbytes of cons cells used (80%)
6.6 Mbytes of vectors used (66%)
Garbage collection 23 = 16+4+3 (level 1) ...
24.9 Mbytes of cons cells used (79%)
6.5 Mbytes of vectors used (64%)
Garbage collection 24 = 17+4+3 (level 0) ...
25.1 Mbytes of cons cells used (79%)
6.5 Mbytes of vectors used (64%)
Garbage collection 25 = 18+4+3 (level 0) ...
25.1 Mbytes of cons cells used (79%)
6.5 Mbytes of vectors used (64%)
Garbage collection 26 = 19+4+3 (level 0) ...
25.1 Mbytes of cons cells used (79%)
6.5 Mbytes of vectors used (65%)
Garbage collection 27 = 20+4+3 (level 0) ...
25.2 Mbytes of cons cells used (80%)
6.5 Mbytes of vectors used (65%)
Garbage collection 28 = 20+5+3 (level 1) ...
25.1 Mbytes of cons cells used (79%)
6.5 Mbytes of vectors used (65%)
Garbage collection 29 = 21+5+3 (level 0) ...
25.1 Mbytes of cons cells used (79%)
6.5 Mbytes of vectors used (65%)
Garbage collection 30 = 22+5+3 (level 0) ...
25.2 Mbytes of cons cells used (80%)
6.5 Mbytes of vectors used (65%)
Garbage collection 31 = 23+5+3 (level 0) ...
25.2 Mbytes of cons cells used (80%)
6.6 Mbytes of vectors used (66%)
Garbage collection 32 = 24+5+3 (level 0) ...
25.3 Mbytes of cons cells used (80%)
6.6 Mbytes of vectors used (66%)
Garbage collection 33 = 25+5+3 (level 0) ...
25.4 Mbytes of cons cells used (80%)
6.7 Mbytes of vectors used (66%)
Garbage collection 34 = 25+5+4 (level 2) ...
24.6 Mbytes of cons cells used (61%)
6.4 Mbytes of vectors used (50%)
Garbage collection 35 = 26+5+4 (level 0) ...
25.0 Mbytes of cons cells used (62%)
6.5 Mbytes of vectors used (52%)
> require(data.table)
Loading required package: data.table
Garbage collection 36 = 27+5+4 (level 0) ...
27.2 Mbytes of cons cells used (68%)
7.1 Mbytes of vectors used (56%)
data.table 1.10.5 IN DEVELOPMENT built 2017-12-02 01:04:34 UTC; appveyor
The fastest way to learn (by data.table authors): https://www.datacamp.com/courses/data-analysis-the-data-table-way
Documentation: ?data.table, example(data.table) and browseVignettes("data.table")
Release notes, videos and slides: http://r-datatable.com
Garbage collection 37 = 28+5+4 (level 0) ...
27.7 Mbytes of cons cells used (69%)
7.3 Mbytes of vectors used (58%)
Garbage collection 38 = 29+5+4 (level 0) ...
28.0 Mbytes of cons cells used (70%)
7.4 Mbytes of vectors used (58%)
Garbage collection 39 = 30+5+4 (level 0) ...
28.1 Mbytes of cons cells used (70%)
7.4 Mbytes of vectors used (59%)
Garbage collection 40 = 31+5+4 (level 0) ...
28.2 Mbytes of cons cells used (70%)
7.5 Mbytes of vectors used (59%)
Garbage collection 41 = 32+5+4 (level 0) ...
28.4 Mbytes of cons cells used (71%)
7.5 Mbytes of vectors used (59%)
> DT = fread("/Users/pasha/Downloads/LargeFile.csv")
Garbage collection 42 = 32+5+5 (level 2) ...
27.4 Mbytes of cons cells used (54%)
7.1 Mbytes of vectors used (2%)
Garbage collection 43 = 32+5+6 (level 2) ...
27.4 Mbytes of cons cells used (54%)
244.7 Mbytes of vectors used (42%)
Garbage collection 44 = 32+5+7 (level 2) ...
27.4 Mbytes of cons cells used (54%)
482.3 Mbytes of vectors used (42%)
Garbage collection 45 = 32+5+8 (level 2) ...
27.4 Mbytes of cons cells used (54%)
957.4 Mbytes of vectors used (56%)
Garbage collection 46 = 32+5+9 (level 2) ...
27.4 Mbytes of cons cells used (54%)
1432.6 Mbytes of vectors used (63%)
Garbage collection 47 = 32+5+10 (level 2) ...
27.4 Mbytes of cons cells used (54%)
2145.3 Mbytes of vectors used (75%)
Garbage collection 48 = 32+5+11 (level 2) ...
27.4 Mbytes of cons cells used (54%)
2620.4 Mbytes of vectors used (71%)
Garbage collection 49 = 32+5+12 (level 2) ...
27.4 Mbytes of cons cells used (54%)
3570.8 Mbytes of vectors used (78%)
Garbage collection 50 = 32+5+13 (level 2) ...
27.4 Mbytes of cons cells used (54%)
4283.5 Mbytes of vectors used (75%)
Garbage collection 51 = 32+5+14 (level 2) ...
27.4 Mbytes of cons cells used (54%)
5709.0 Mbytes of vectors used (77%)
Garbage collection 52 = 32+5+15 (level 2) ...
27.4 Mbytes of cons cells used (54%)
7372.0 Mbytes of vectors used (81%)
Garbage collection 53 = 32+5+16 (level 2) ...
27.4 Mbytes of cons cells used (54%)
8797.5 Mbytes of vectors used (79%)
Garbage collection 54 = 32+5+17 (level 2) ...
27.4 Mbytes of cons cells used (54%)
10935.7 Mbytes of vectors used (80%)
|--------------------------------------------------|
|=====Error in fread("LargeFile.csv") :
unprotect_ptr: pointer not found
>
showProgress=FALSE solves it robustly in the RStudio console. To reproduce, it has to be the very first run in a fresh RStudio console with showProgress=TRUE (i.e. the default). It seems to be related to whether or not there is a gc during the progress meter; in a fresh session there is one in the first part. The file just needs to be big enough for the progress meter to be displayed. Nothing else needs to be passed to fread. If the first run in a fresh RStudio console is with showProgress=FALSE, that run expands R's heap, and a later showProgress=TRUE in the same session then works too, but only because the heap was already expanded by the first run, so there is no gc during the progress meter.
Why a gc on the master thread during the progress meter is fine on Linux and in the Windows RStudio terminal, but not in the RStudio console, is still outstanding.
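For anyone who wants to check the workaround on their own machine, here is a minimal sketch of it in R. It is only an illustration under stated assumptions: the temporary CSV and its columns (id, label) are hypothetical stand-ins, and a file this small will not show the progress meter at all, so it demonstrates the calls rather than the crash. Reproducing the problem needs a large file read as the first call in a fresh RStudio console.

```r
library(data.table)

# Hypothetical stand-in for the large CSV discussed in the thread:
# a small temporary file so that the sketch runs quickly.
tmp <- tempfile(fileext = ".csv")
fwrite(data.table(id = 1:1000, label = sprintf("row%04d", 1:1000)), tmp)

# Report every garbage collection; gcinfo() returns the previous setting.
old <- gcinfo(TRUE)

# Workaround from the comment above: read with the progress meter disabled.
DT <- fread(tmp, showProgress = FALSE)

# Restore the previous gc reporting setting and clean up.
invisible(gcinfo(old))
unlink(tmp)

print(dim(DT))
```

With gcinfo(TRUE) active, any "Garbage collection ..." lines printed while fread is running show whether a gc happened during the read, which is the condition the comment above is describing.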
OK, this fixes it. The problem was in data.table, not RStudio. It now works robustly for me in the RStudio console on Windows. It was a problem that could have happened on Linux and Mac too; it is just that the memory patterns there were not triggering it. Other threads had an entry point into R (when pushing their buffers with character columns), which could happen at the same time as the master thread was printing progress using REprintf. That is why it only happened on the first run in a fresh session. From the 2nd run on, all the strings in the file had been seen before, so the cache lookups (thread-safe) were hitting and nothing was being allocated (not thread-safe).
So, @aadler and @HughParsonage, please try this one. 95% chance it works now!
No warnings; not sure whether you are looking for anything else:
> gcinfo(TRUE)
[1] FALSE
> fread("SA2-by-DJZ-2011.csv", verbose = TRUE, na.strings = "", header = FALSE)
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
Using 12 threads (omp_get_max_threads()=12, nth=12)
NAstrings = [<<>>]
None of the NAstrings look like numbers.
show progress = 1
0/1 column will be read as boolean
[02] Opening the file
Opening file SA2-by-DJZ-2011.csv
File opened, size = 349.4MB (366418725 bytes).
Memory mapped ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
\n has been found in the data so any mixture of line endings is allowed other than \r-only line endings. This is common and ideal.
[05] Skipping initial rows if needed
Positioned on line 1 starting: <<Australian Bureau of Statistic>>
[06] Detect separator, quoting rule, and ncolumns
Detecting sep ...
sep=',' with 89 lines of 4 fields using quote rule 0
Detected 4 columns on line 12. This line is either column names or first data row. Line starts as: <<"Goulburn","110018063",3499,>>
Quote rule picked = 0
fill=false and the most number of columns found is 4
[07] Detect column types, good nrow estimate and whether first row is column names
'header' changed by user from 'auto' to false
Number of sampling jump points = 101 because (366418375 bytes from row 1 to eof) / (2 * 1457 jump0size) == 125744
Type codes (jump 000) : 1551 Quote rule 0
Type codes (jump 100) : 1A51 Quote rule 0
=====
Sampled 10027 rows (handled \n inside quoted fields) at 101 jump points
Bytes from first data row on line 12 to the end of last row: 366418143
Line length: mean=16.02 sd=0.21 min=16 max=29
Estimated number of rows: 366418143 / 16.02 = 22877178
Initial alloc = 25164895 rows (22877178 + 9%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
=====
[08] Assign column names
[09] Apply user overrides on column types
After 0 type and 0 drop user overrides : 1A51
[10] Allocate memory for the datatable
Allocating 4 column slots (4 - 0 dropped) with 25164895 rows
Garbage collection 53 = 36+5+12 (level 2) ...
30.3 Mbytes of cons cells used (60%)
7.9 Mbytes of vectors used (1%)
[11] Read the data
jumps=[0..360), chunk_size=1017828, total_size=366418143
|--------------------------------------------------|
|==================================================|
Garbage collection 54 = 37+5+12 (level 0) ...
30.8 Mbytes of cons cells used (61%)
566.6 Mbytes of vectors used (74%)
Garbage collection 55 = 37+6+12 (level 1) ...
30.8 Mbytes of cons cells used (61%)
549.2 Mbytes of vectors used (72%)
jumps=[0..360), chunk_size=1017828, total_size=366418143
|--------------------------------------------------|
|==================================================|
Read 22885380 rows x 4 columns from 349.4MB (366418725 bytes) file in 00:02.626 wall clock time
[12] Finalizing the datatable
Type counts:
1 : bool8 '1'
1 : int32 '5'
2 : string 'A'
=============================
0.002s ( 0%) Memory map 0.341GB file
0.005s ( 0%) sep=',' ncol=4 and header detection
0.000s ( 0%) Column type detection using 10027 sample rows
0.469s ( 18%) Allocation of 25164895 rows x 4 cols (0.469GB) of which 22885380 ( 91%) rows used
2.150s ( 82%) Reading 360 chunks of 0.971MB (63547 rows) using 12 threads
= 0.103s ( 4%) Finding first non-embedded \n after each jump
+ 0.230s ( 9%) Parse to row-major thread buffers (grown 0 times)
+ 0.718s ( 27%) Transpose
+ 1.099s ( 42%) Waiting
0.745s ( 28%) Rereading 1 columns due to out-of-sample type exceptions
2.626s Total
Column 1 ("") bumped from 'bool8' to 'string' due to <<"Goulburn">> on row 0
Garbage collection 56 = 37+6+13 (level 2) ...
31.1 Mbytes of cons cells used (62%)
531.9 Mbytes of vectors used (70%)
Garbage collection 57 = 38+6+13 (level 0) ...
31.1 Mbytes of cons cells used (62%)
532.0 Mbytes of vectors used (70%)
V1 V2 V3 V4
1: Goulburn 110018063 3499 NA
2: NA 110018064 812 NA
3: NA 110018065 2158 NA
4: NA 110019999 402 NA
5: NA 110028068 10 NA
---
22885376: NA 997999799 0 NA
22885377: NA 998999899 64 NA
22885378: NA 994999499 34 NA
22885379: NA 0&&&&&&&& 250796 NA
22885380: NA 0@@@@@@@@ 7305367 NA
Warning messages:
1: In fread("SA2-by-DJZ-2011.csv", verbose = TRUE, na.strings = "", :
Starting data input on line 12 <<"Goulburn","110018063",3499,>> with 4 fields and discarding line 11 <<"Main Statistical Area Structu>> before it because it has a different number of fields (3).
2: In fread("SA2-by-DJZ-2011.csv", verbose = TRUE, na.strings = "", :
Found the last consistent line but text exists afterwards. Consider fill=TRUE and/or blank.lines.skip=TRUE. First 200 characters of discarded line: <<"Dataset: 2011 Census of Population and Housing">>
Thanks Hugh. Yes, that is a clean run, assuming it was in a fresh RStudio console session. There is no sign of the stack imbalance or "unprotect_ptr: pointer not found" messages, and the progress meter is running correctly (twice in this case, because one pass is a reread). Now just @aadler to confirm.
Success.
First run, fresh instance of RStudio.
> library(data.table)
data.table 1.10.5 IN DEVELOPMENT built 2017-12-02 12:05:42 UTC; appveyor
The fastest way to learn (by data.table authors): https://www.datacamp.com/courses/data-analysis-the-data-table-way
Documentation: ?data.table, example(data.table) and browseVignettes("data.table")
Release notes, videos and slides: http://r-datatable.com
> sessionInfo()
R version 3.4.2 beta (2017-09-17 r73296)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows Server >= 2012 x64 (build 9200)
Matrix products: default
locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] data.table_1.10.5
loaded via a namespace (and not attached):
[1] compiler_3.4.2 tools_3.4.2
> DT <- fread('LargeFile.csv', colClasses = colCLASS, select = colSEL, header = TRUE, key = c('LoanID', 'Month'), verbose = TRUE)
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
Using 40 threads (omp_get_max_threads()=40, nth=40)
NAstrings = [<<NA>>]
None of the NAstrings look like numbers.
show progress = 1
0/1 column will be read as boolean
[02] Opening the file
Opening file LargeFile.csv
File opened, size = 6.355GB (6823372783 bytes).
Memory mapped ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
\n has been found in the data so any mixture of line endings is allowed other than \r-only line endings. This is common and ideal.
[05] Skipping initial rows if needed
Positioned on line 1 starting: <<Orig_Year,Orig_Qtr,LoanID,Mont>>
[06] Detect separator, quoting rule, and ncolumns
Detecting sep ...
sep=',' with 100 lines of 37 fields using quote rule 0
Detected 37 columns on line 1. This line is either column names or first data row. Line starts as: <<Orig_Year,Orig_Qtr,LoanID,Mont>>
Quote rule picked = 0
fill=false and the most number of columns found is 37
[07] Detect column types, good nrow estimate and whether first row is column names
'header' changed by user from 'auto' to true
Number of sampling jump points = 101 because (6823372781 bytes from row 1 to eof) / (2 * 12905 jump0size) == 264369
Type codes (jump 000) : 51AA7155A15A7111111111111771111177715 Quote rule 0
Type codes (jump 001) : 51AA7155A15A711111111111177111117771A Quote rule 0
Type codes (jump 008) : 51AA7555A15A711111111111177111117771A Quote rule 0
Type codes (jump 009) : 51AA7555AA5A71A155555557177111117775A Quote rule 0
Type codes (jump 042) : 55AA7555AA5A71A155555557177111117775A Quote rule 0
Type codes (jump 064) : 55AA7555AA5A71A1A5555557177111117775A Quote rule 0
Type codes (jump 100) : 55AA7555AA5A71A1A5555557177111117775A Quote rule 0
=====
Sampled 10049 rows (handled \n inside quoted fields) at 101 jump points
Bytes from first data row on line 1 to the end of last row: 6823372781
Line length: mean=126.15 sd=8.30 min=100 max=359
Estimated number of rows: 6823372781 / 126.15 = 54088821
Initial alloc = 62279495 rows (54088821 + 15%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
=====
[08] Assign column names
[09] Apply user overrides on column types
After 4 type and 23 drop user overrides : 00AA700000000000000000070775555077750
[10] Allocate memory for the datatable
Allocating 14 column slots (37 - 23 dropped) with 62279495 rows
[11] Read the data
jumps=[0..6520), chunk_size=1046529, total_size=6823372422
|--------------------------------------------------|
|==================================================|
Read 53945186 rows x 14 columns from 6.355GB (6823372783 bytes) file in 00:25.938 wall clock time
[12] Finalizing the datatable
Type counts:
23 : drop '0'
5 : int32 '5'
7 : float64 '7'
2 : string 'A'
=============================
0.005s ( 0%) Memory map 6.355GB file
0.025s ( 0%) sep=',' ncol=37 and header detection
0.001s ( 0%) Column type detection using 10049 sample rows
4.681s ( 18%) Allocation of 62279495 rows x 37 cols (5.336GB) of which 53945186 ( 87%) rows used
21.226s ( 82%) Reading 6520 chunks of 0.998MB (8295 rows) using 40 threads
= 0.485s ( 2%) Finding first non-embedded \n after each jump
+ 1.465s ( 6%) Parse to row-major thread buffers (grown 0 times)
+ 9.095s ( 35%) Transpose
+ 10.181s ( 39%) Waiting
0.000s ( 0%) Rereading 0 columns due to out-of-sample type exceptions
25.938s Total
Closed RStudio and reopened it to prevent the string caching from being active, then ran it again with gcinfo(TRUE). Added bonus: the conversion to IDate completed (it took over 40 seconds, though :)).
> colCLASS <- c(rep('integer', 2L), 'character', 'Date', 'numeric',
+ rep('integer', 3L), rep('character', 2L),
+ 'integer', 'Date', rep('numeric', 2L), 'Date',
+ rep('numeric', 12L), rep('integer', 5),
+ rep('numeric', 3L), 'integer', 'character')
> library(data.table)
data.table 1.10.5 IN DEVELOPMENT built 2017-12-02 12:05:42 UTC; appveyor
The fastest way to learn (by data.table authors): https://www.datacamp.com/courses/data-analysis-the-data-table-way
Documentation: ?data.table, example(data.table) and browseVignettes("data.table")
Release notes, videos and slides: http://r-datatable.com
> gcinfo(TRUE)
[1] FALSE
> sessionInfo()
R version 3.4.2 beta (2017-09-17 r73296)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows Server >= 2012 x64 (build 9200)
Matrix products: default
locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] data.table_1.10.5
loaded via a namespace (and not attached):
[1] compiler_3.4.2 tools_3.4.2
Garbage collection 46 = 36+5+5 (level 0) ...
38.6 Mbytes of cons cells used (77%)
11.2 Mbytes of vectors used (71%)
Garbage collection 47 = 37+5+5 (level 0) ...
38.7 Mbytes of cons cells used (77%)
11.2 Mbytes of vectors used (71%)
Garbage collection 48 = 38+5+5 (level 0) ...
38.8 Mbytes of cons cells used (77%)
11.2 Mbytes of vectors used (71%)
Garbage collection 49 = 39+5+5 (level 0) ...
39.0 Mbytes of cons cells used (78%)
11.2 Mbytes of vectors used (71%)
Garbage collection 50 = 40+5+5 (level 0) ...
39.1 Mbytes of cons cells used (78%)
11.3 Mbytes of vectors used (71%)
Garbage collection 51 = 40+6+5 (level 1) ...
38.8 Mbytes of cons cells used (77%)
11.3 Mbytes of vectors used (71%)
Garbage collection 52 = 41+6+5 (level 0) ...
38.9 Mbytes of cons cells used (77%)
11.3 Mbytes of vectors used (71%)
Garbage collection 53 = 42+6+5 (level 0) ...
41.5 Mbytes of cons cells used (83%)
12.2 Mbytes of vectors used (77%)
Garbage collection 54 = 42+7+5 (level 1) ...
43.4 Mbytes of cons cells used (86%)
12.8 Mbytes of vectors used (81%)
Garbage collection 55 = 42+7+6 (level 2) ...
44.7 Mbytes of cons cells used (72%)
13.0 Mbytes of vectors used (67%)
Garbage collection 56 = 43+7+6 (level 0) ...
46.5 Mbytes of cons cells used (74%)
13.6 Mbytes of vectors used (70%)
Garbage collection 57 = 44+7+6 (level 0) ...
47.0 Mbytes of cons cells used (75%)
13.8 Mbytes of vectors used (71%)
Garbage collection 58 = 45+7+6 (level 0) ...
47.4 Mbytes of cons cells used (76%)
13.9 Mbytes of vectors used (71%)
Garbage collection 59 = 46+7+6 (level 0) ...
47.7 Mbytes of cons cells used (76%)
14.2 Mbytes of vectors used (73%)
Garbage collection 60 = 47+7+6 (level 0) ...
48.0 Mbytes of cons cells used (77%)
14.2 Mbytes of vectors used (73%)
Garbage collection 61 = 48+7+6 (level 0) ...
48.1 Mbytes of cons cells used (77%)
14.3 Mbytes of vectors used (73%)
> DT <- fread('LargeFile.csv', header = TRUE, colClasses = colCLASS, select = colSEL, key = c('LoanID', 'Month'), verbose = TRUE)
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
Using 40 threads (omp_get_max_threads()=40, nth=40)
NAstrings = [<<NA>>]
None of the NAstrings look like numbers.
show progress = 1
0/1 column will be read as boolean
[02] Opening the file
Opening file LargeFile.csv
File opened, size = 6.355GB (6823372783 bytes).
Memory mapped ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
\n has been found in the data so any mixture of line endings is allowed other than \r-only line endings. This is common and ideal.
[05] Skipping initial rows if needed
Positioned on line 1 starting: <<Orig_Year,Orig_Qtr,LoanID,Mont>>
[06] Detect separator, quoting rule, and ncolumns
Detecting sep ...
sep=',' with 100 lines of 37 fields using quote rule 0
Detected 37 columns on line 1. This line is either column names or first data row. Line starts as: <<Orig_Year,Orig_Qtr,LoanID,Mont>>
Quote rule picked = 0
fill=false and the most number of columns found is 37
[07] Detect column types, good nrow estimate and whether first row is column names
'header' changed by user from 'auto' to true
Number of sampling jump points = 101 because (6823372781 bytes from row 1 to eof) / (2 * 12905 jump0size) == 264369
Type codes (jump 000) : 51AA7155A15A7111111111111771111177715 Quote rule 0
Type codes (jump 001) : 51AA7155A15A711111111111177111117771A Quote rule 0
Type codes (jump 008) : 51AA7555A15A711111111111177111117771A Quote rule 0
Type codes (jump 009) : 51AA7555AA5A71A155555557177111117775A Quote rule 0
Type codes (jump 042) : 55AA7555AA5A71A155555557177111117775A Quote rule 0
Type codes (jump 064) : 55AA7555AA5A71A1A5555557177111117775A Quote rule 0
Type codes (jump 100) : 55AA7555AA5A71A1A5555557177111117775A Quote rule 0
=====
Sampled 10049 rows (handled \n inside quoted fields) at 101 jump points
Bytes from first data row on line 1 to the end of last row: 6823372781
Line length: mean=126.15 sd=8.30 min=100 max=359
Estimated number of rows: 6823372781 / 126.15 = 54088821
Initial alloc = 62279495 rows (54088821 + 15%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
=====
[08] Assign column names
[09] Apply user overrides on column types
After 4 type and 23 drop user overrides : 00AA700000000000000000070775555077750
[10] Allocate memory for the datatable
Allocating 14 column slots (37 - 23 dropped) with 62279495 rows
Garbage collection 62 = 48+7+7 (level 2) ...
46.5 Mbytes of cons cells used (60%)
13.6 Mbytes of vectors used (2%)
Garbage collection 63 = 48+7+8 (level 2) ...
46.5 Mbytes of cons cells used (60%)
488.7 Mbytes of vectors used (42%)
Garbage collection 64 = 48+7+9 (level 2) ...
46.5 Mbytes of cons cells used (60%)
963.9 Mbytes of vectors used (56%)
Garbage collection 65 = 48+7+10 (level 2) ...
46.5 Mbytes of cons cells used (60%)
1439.1 Mbytes of vectors used (63%)
Garbage collection 66 = 48+7+11 (level 2) ...
46.5 Mbytes of cons cells used (60%)
1914.2 Mbytes of vectors used (67%)
Garbage collection 67 = 48+7+12 (level 2) ...
46.5 Mbytes of cons cells used (60%)
2864.5 Mbytes of vectors used (77%)
Garbage collection 68 = 48+7+13 (level 2) ...
46.5 Mbytes of cons cells used (60%)
3577.3 Mbytes of vectors used (78%)
Garbage collection 69 = 48+7+14 (level 2) ...
46.5 Mbytes of cons cells used (60%)
4290.0 Mbytes of vectors used (75%)
[11] Read the data
jumps=[0..6520), chunk_size=1046529, total_size=6823372422
|--------------------------------------------------|
|============================Garbage collection 70 = 49+7+14 (level 0) ...
76.5 Mbytes of cons cells used (99%)
5487.5 Mbytes of vectors used (96%)
=Garbage collection 71 = 49+8+14 (level 1) ...
77.0 Mbytes of cons cells used (100%)
5487.6 Mbytes of vectors used (96%)
Garbage collection 72 = 49+8+15 (level 2) ...
77.0 Mbytes of cons cells used (81%)
5487.1 Mbytes of vectors used (80%)
==============Garbage collection 73 = 50+8+15 (level 0) ...
94.3 Mbytes of cons cells used (100%)
5494.0 Mbytes of vectors used (80%)
Garbage collection 74 = 50+9+15 (level 1) ...
94.5 Mbytes of cons cells used (100%)
5494.1 Mbytes of vectors used (80%)
Garbage collection 75 = 50+9+16 (level 2) ...
94.5 Mbytes of cons cells used (82%)
5493.1 Mbytes of vectors used (67%)
=======|
Read 53945186 rows x 14 columns from 6.355GB (6823372783 bytes) file in 00:24.772 wall clock time
[12] Finalizing the datatable
Type counts:
23 : drop '0'
5 : int32 '5'
7 : float64 '7'
2 : string 'A'
=============================
0.005s ( 0%) Memory map 6.355GB file
0.018s ( 0%) sep=',' ncol=37 and header detection
0.000s ( 0%) Column type detection using 10049 sample rows
5.496s ( 22%) Allocation of 62279495 rows x 37 cols (5.336GB) of which 53945186 ( 87%) rows used
19.253s ( 78%) Reading 6520 chunks of 0.998MB (8295 rows) using 40 threads
= 0.433s ( 2%) Finding first non-embedded \n after each jump
+ 1.482s ( 6%) Parse to row-major thread buffers (grown 0 times)
+ 9.515s ( 38%) Transpose
+ 7.822s ( 32%) Waiting
0.000s ( 0%) Rereading 0 columns due to out-of-sample type exceptions
24.772s Total
Garbage collection 76 = 51+9+16 (level 0) ...
105.3 Mbytes of cons cells used (91%)
5500.3 Mbytes of vectors used (67%)
Garbage collection 77 = 51+10+16 (level 1) ...
105.4 Mbytes of cons cells used (91%)
5500.2 Mbytes of vectors used (67%)
> DT[, Month := as.IDate(Month, format = "%Y-%m-%d")]
Garbage collection 78 = 51+10+17 (level 2) ...
107.5 Mbytes of cons cells used (76%)
8174.1 Mbytes of vectors used (81%)
Garbage collection 79 = 51+11+17 (level 1) ...
107.5 Mbytes of cons cells used (76%)
5910.4 Mbytes of vectors used (59%)
> gcinfo(FALSE)
[1] TRUE
Great! :tada: Great work by everyone involved, especially @mattdowle, who must have a little less hair by now because of this one :)
It looks like my strategy of 'stay on vacation until the problem is fixed' has worked here :-)
Is there anything else I should try to look at, or should this issue be considered resolved?
Thanks @aadler and @HughParsonage! What a relief.
@kevinushey Ha ha. Yes, it was on the data.table side and it is now resolved (PR #2488). Thanks.