data.table: stack imbalance in fread

Created on 14 Nov 2017  ·  61 comments  ·  Source: Rdatatable/data.table

When I run the following with verbose=FALSE, I experience an R crash ('stack imbalance'). Note that I was able to run the code below successfully on an older version of data.table a month or two ago, so I believe this is a recent bug. (Sorry, I don't remember the exact dev version where it was working.)

The problem does not reproduce on a much smaller file. Link to the zip file (the CSV is 350 MB): https://github.com/HughParsonage/ABS-data/blob/master/inbox/SA2-by-DJZ-2011.zip

I occasionally experience different errors. For example,

Error in get(name, envir = ns, inherits = FALSE) : invalid first argument

or

Warning: stack imbalance in '$', 16 then 15
Error: R_Reprotect: only 1 protected item, can't reprotect index -2

# Minimal reproducible example

library(data.table)

#> data.table 1.10.5 IN DEVELOPMENT built 2017-11-13 02:46:28 UTC; appveyor
#>   The fastest way to learn (by data.table authors): https://www.datacamp.com/courses/data-analysis-the-data-table-way
#>   Documentation: ?data.table, example(data.table) and browseVignettes("data.table")
#>   Release notes, videos and slides: http://r-datatable.com


fread("SA2-by-DJZ-2011.csv", header = FALSE, na.strings = "", verbose = TRUE)

Input contains no \n. Taking this to be a filename to open
[01] Check arguments
  Using 12 threads (omp_get_max_threads()=12, nth=12)
  NAstrings = [<<>>]
  None of the NAstrings look like numbers.
  show progress = 1
  0/1 column will be read as boolean
[02] Opening the file
  Opening file SA2-by-DJZ-2011.csv
  File opened, size = 349.4MB (366418725 bytes).
  Memory mapping ... ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
  \r-only line endings are not allowed because \n is found in the data
[05] Skipping initial rows if needed
  Positioned on line 1 starting: <<Australian Bureau of Statistic>>
[06] Detect separator, quoting rule, and ncolumns
  Detecting sep ...
  sep=','  with 89 lines of 4 fields using quote rule 0
  Detected 4 columns on line 12. This line is either column names or first data row. Line starts as: <<"Goulburn","110018063",3499,>>
  Quote rule picked = 0
  fill=false and the most number of columns found is 4
[07] Detect column types, good nrow estimate and whether first row is column names
  'header' changed by user from 'auto' to false
  Number of sampling jump points = 101 because (366418375 bytes from row 1 to eof) / (2 * 1457 jump0size) == 125744
  Type codes (jump 000)    : 1551  Quote rule 0
  Type codes (jump 100)    : 11051  Quote rule 0
  =====
  Sampled 10027 rows (handled \n inside quoted fields) at 101 jump points
  Bytes from first data row on line 12 to the end of last row: 366418143
  Line length: mean=16.02 sd=0.21 min=16 max=29
  Estimated number of rows: 366418143 / 16.02 = 22877178
  Initial alloc = 25164895 rows (22877178 + 9%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
  =====
[08] Assign column names
[09] Apply user overrides on column types
  After 0 type and 0 drop user overrides : 11051
[10] Allocate memory for the datatable
  Allocating 4 column slots (4 - 0 dropped) with 25164895 rows
[11] Read the data
  jumps=[0..360), chunk_size=1017828, total_size=366418143
[12] Finalizing the datatable
Read 22885380 rows x 4 columns from 349.4MB (366418725 bytes) file in 00:02.550 wall clock time
Thread buffers were grown 0 times (if all 12 threads each grew once, this figure would be 12)
Final type counts
         0 : drop     
         1 : bool8    
         0 : bool8    
         0 : bool8    
         0 : bool8    
         1 : int32    
         0 : int64    
         0 : float64  
         0 : float64  
         0 : float64  
         2 : string   
Rereading 1 columns due to out-of-sample type exceptions.
Column 1 ("") bumped from 'bool8' to 'string' due to <<"Goulburn">> on row 0
[11] Read the data
  jumps=[0..360), chunk_size=1017828, total_size=366418143
Read 94%. ETA 00:00 Warning: stack imbalance in '$', 27 then 28
Read 98%. ETA 00:00 
[12] Finalizing the datatable
Reread 22885380 rows x 1 columns in 00:00.991
Read 22885380 rows. Exactly what was estimated and allocated up front
=============================
   0.006s (  0%) Memory map 0.341GB file
   0.011s (  0%) sep=',' ncol=4 and header detection
   0.002s (  0%) Column type detection using 10027 sample rows
   0.328s (  9%) Allocation of 22885380 rows x 4 cols (0.469GB)
   3.194s ( 90%) Reading 360 chunks of 0.971MB (63547 rows) using 12 threads
   =    0.001s (  0%) Finding first non-embedded \n after each jump
   +    0.362s ( 10%) Parse to row-major thread buffers
   +    1.963s ( 55%) Transpose
   +    0.868s ( 25%) Waiting
   0.991s ( 28%) Rereading 1 columns due to out-of-sample type exceptions
   3.541s        Total
Warning: stack imbalance in 'withVisible', 3 then 5
Warning messages:
1: In fread("SA2-by-DJZ-2011.csv", header = FALSE, na.strings = "",  :
  Starting data input on line 12 <<"Goulburn","110018063",3499,>> with 4 fields and discarding line 11 <<"Main Statistical Area Structu>> before it because it has a different number of fields (3).
2: In fread("SA2-by-DJZ-2011.csv", header = FALSE, na.strings = "",  :
  Found the last consistent line but text exists afterwards. Consider fill=TRUE and/or blank.lines.skip=TRUE. First 200 characters of discarded line: <<"Dataset: 2011 Census of Population and Housing">>

# Output of sessionInfo()

R version 3.4.2 (2017-09-28)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

Matrix products: default

locale:
[1] LC_COLLATE=English_Australia.1252  LC_CTYPE=English_Australia.1252   
[3] LC_MONETARY=English_Australia.1252 LC_NUMERIC=C                      
[5] LC_TIME=English_Australia.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] data.table_1.10.5    RevoUtils_10.0.6     RevoUtilsMath_10.0.1

loaded via a namespace (and not attached):
[1] compiler_3.4.2 tools_3.4.2    yaml_2.1.14 
bug fread idatitime platform-specific

Most helpful comment

Looks like my 'stay on vacation until the problem is fixed' strategy has worked here :-)

Is there anything else I should try to look at, or should this issue be considered resolved?

All 61 comments

@HughParsonage, this looks similar to #2457. Maybe try passing showProgress=FALSE and see whether it completes.
@mattdowle could there have been a regression since 2017-11-09?

Running with showProgress=FALSE did indeed return the result (with only the expected warnings).
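For readers who want to try the same workaround, here is a minimal sketch of the call with the progress meter switched off. The tiny temporary file is only a stand-in so that the snippet runs anywhere; the report says the problem does not reproduce on small files, so this shows the call shape rather than the bug.

```r
library(data.table)

# Stand-in input: a tiny CSV written to a temporary file.
# (Assumption for illustration; the original report used a 350 MB CSV.)
tmp <- tempfile(fileext = ".csv")
writeLines(c("a,1,2", "b,3,4"), tmp)

# Same arguments as in the report, plus showProgress = FALSE so that
# no "Read xx%. ETA" lines are printed while the file is being read.
dt <- fread(tmp, header = FALSE, na.strings = "", showProgress = FALSE)
print(dim(dt))
```

With the real file, only the path changes; the remaining arguments are the ones used throughout this thread.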

Thanks for all the detailed info. I doubt there's a regression since 2017-11-09, but perhaps the longer verbose=TRUE output is having a similar effect to the ETA output. The file needs a reread, which means more output is generated. I suspect @HughParsonage would report that showProgress = TRUE works for him and that the problem appears when it is run 5-10 times with verbose = TRUE.

No verbose messages are printed from within the parallel region (other than the progress ETA, which has already been fixed). However, there are verbose messages after the first read and before the 2nd reread starts (which is what happens for this file). I think it's possible that if those prints trigger the 100th CheckUserInterrupt (see #2457), that could cause the 2nd parallel region to fail (odd though that would be). To rule it out anyway, I've changed all verbose messages to use REprintf rather than Rprintf (the same fix as #2457 for the ETA). It's currently failing because the tests aren't finding the output on stderr - will fix. Once it's passing, the Windows .zip will be built automatically, and then you can please try again. I'll update here when it's ready.

OK, the second attempt is passing checks and the Windows .zip is available. @HughParsonage would you like to try again? I've added a call to R_FlushConsole() after the messages in verbose mode, just before the reread. This flush is only ever needed on Windows. I'm guessing that, without the flush, the console sometimes updates a little later while the parallel reread is happening, and that is known to cause problems. Please repeat 10 times, with both verbose=TRUE and showProgress=TRUE. If you see 10 clean runs then we'll say that was it. Otherwise I'll have to think again.

Unfortunately, not fixed:

> library(data.table)
data.table 1.10.5 IN DEVELOPMENT built 2017-11-15 00:36:41 UTC; appveyor
  The fastest way to learn (by data.table authors): https://www.datacamp.com/courses/data-analysis-the-data-table-way
  Documentation: ?data.table, example(data.table) and browseVignettes("data.table")
  Release notes, videos and slides: http://r-datatable.com
> fread("SA2-by-DJZ-2011.csv", header = FALSE, na.strings = "", verbose = FALSE)
Read 26%. ETA 00:00 Warning: stack imbalance in '$', 20 then 22
Read 52%. ETA 00:00 Warning: stack imbalance in '$', 36 then 35
Warning: stack imbalance in '$', 21 then 22
Read 59%. ETA 00:00 Error in fread("SA2-by-DJZ-2011.csv", header = FALSE, na.strings = "",  : 
  unprotect_ptr: pointer not found
In addition: Warning: stack imbalance in '$', 26 then 28
Warning messages:
1: Warning: stack imbalance in '$', 26 then 27
In fread("SA2-by-DJZ-2011.csv", header = FALSE, na.strings = "",  :
  Starting data input on line 12 <<"Goulburn","110018063",3499,>> with 4 fields and discarding line 11 <<"Main Statistical Area Structu>> before it because it has a different number of fields (3).
2: In fread("SA2-by-DJZ-2011.csv", header = FALSE, na.strings = "",  :
  Found the last consistent line but text exists afterwards. Consider fill=TRUE and/or blank.lines.skip=TRUE. First 200 characters of discarded line: <<"Dataset: 2011 Census of Population and Housing">>
Warning: stack imbalance in 'lapply', 31 then 30
Warning: stack imbalance in '$', 14 then 15

Even after 10 runs with verbose=TRUE, showProgress=TRUE I don't get any error. Here is the output of the 10th run:

Input contains no \n. Taking this to be a filename to open
[01] Check arguments
  Using 12 threads (omp_get_max_threads()=12, nth=12)
  NAstrings = [<<>>]
  None of the NAstrings look like numbers.
  show progress = 1
  0/1 column will be read as boolean
[02] Opening the file
  Opening file SA2-by-DJZ-2011.csv
  File opened, size = 349.4MB (366418725 bytes).
  Memory mapping ... ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
  \r-only line endings are not allowed because \n is found in the data
[05] Skipping initial rows if needed
  Positioned on line 1 starting: <<Australian Bureau of Statistic>>
[06] Detect separator, quoting rule, and ncolumns
  Detecting sep ...
  sep=','  with 89 lines of 4 fields using quote rule 0
  Detected 4 columns on line 12. This line is either column names or first data row. Line starts as: <<"Goulburn","110018063",3499,>>
  Quote rule picked = 0
  fill=false and the most number of columns found is 4
[07] Detect column types, good nrow estimate and whether first row is column names
  'header' changed by user from 'auto' to false
  Number of sampling jump points = 101 because (366418375 bytes from row 1 to eof) / (2 * 1457 jump0size) == 125744
  Type codes (jump 000)    : 1551  Quote rule 0
  Type codes (jump 100)    : 11051  Quote rule 0
  =====
  Sampled 10027 rows (handled \n inside quoted fields) at 101 jump points
  Bytes from first data row on line 12 to the end of last row: 366418143
  Line length: mean=16.02 sd=0.21 min=16 max=29
  Estimated number of rows: 366418143 / 16.02 = 22877178
  Initial alloc = 25164895 rows (22877178 + 9%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
  =====
[08] Assign column names
[09] Apply user overrides on column types
  After 0 type and 0 drop user overrides : 11051
[10] Allocate memory for the datatable
  Allocating 4 column slots (4 - 0 dropped) with 25164895 rows
[11] Read the data
  jumps=[0..360), chunk_size=1017828, total_size=366418143
[12] Finalizing the datatable
Read 22885380 rows x 4 columns from 349.4MB (366418725 bytes) file in 00:02.094 wall clock time
Thread buffers were grown 0 times (if all 1 threads each grew once, this figure would be 1)
Final type counts
         0 : drop     
         1 : bool8    
         0 : bool8    
         0 : bool8    
         0 : bool8    
         1 : int32    
         0 : int64    
         0 : float64  
         0 : float64  
         0 : float64  
         2 : string   
Rereading 1 columns due to out-of-sample type exceptions.
Column 1 ("") bumped from 'bool8' to 'string' due to <<"Goulburn">> on row 0
[11] Read the data
  jumps=[0..360), chunk_size=1017828, total_size=366418143
Read 99%. ETA 00:00 
[12] Finalizing the datatable
Reread 22885380 rows x 1 columns in 00:01.752
Read 22885380 rows. Exactly what was estimated and allocated up front
=============================
   0.004s (  0%) Memory map 0.341GB file
   0.008s (  0%) sep=',' ncol=4 and header detection
   0.001s (  0%) Column type detection using 10027 sample rows
   0.173s (  4%) Allocation of 22885380 rows x 4 cols (0.469GB)
   3.660s ( 95%) Reading 360 chunks of 0.971MB (63547 rows) using 1 threads
   =    0.009s (  0%) Finding first non-embedded \n after each jump
   +    1.946s ( 51%) Parse to row-major thread buffers
   +    1.098s ( 29%) Transpose
   +    0.608s ( 16%) Waiting
   1.752s ( 46%) Rereading 1 columns due to out-of-sample type exceptions
   3.846s        Total
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
  Using 12 threads (omp_get_max_threads()=12, nth=12)
  NAstrings = [<<>>]
  None of the NAstrings look like numbers.
  show progress = 1
  0/1 column will be read as boolean
[02] Opening the file
  Opening file SA2-by-DJZ-2011.csv
  File opened, size = 349.4MB (366418725 bytes).
  Memory mapping ... ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
  \r-only line endings are not allowed because \n is found in the data
[05] Skipping initial rows if needed
  Positioned on line 1 starting: <<Australian Bureau of Statistic>>
[06] Detect separator, quoting rule, and ncolumns
  Detecting sep ...
  sep=','  with 89 lines of 4 fields using quote rule 0
  Detected 4 columns on line 12. This line is either column names or first data row. Line starts as: <<"Goulburn","110018063",3499,>>
  Quote rule picked = 0
  fill=false and the most number of columns found is 4
[07] Detect column types, good nrow estimate and whether first row is column names
  'header' changed by user from 'auto' to false
  Number of sampling jump points = 101 because (366418375 bytes from row 1 to eof) / (2 * 1457 jump0size) == 125744
  Type codes (jump 000)    : 1551  Quote rule 0
  Type codes (jump 100)    : 11051  Quote rule 0
  =====
  Sampled 10027 rows (handled \n inside quoted fields) at 101 jump points
  Bytes from first data row on line 12 to the end of last row: 366418143
  Line length: mean=16.02 sd=0.21 min=16 max=29
  Estimated number of rows: 366418143 / 16.02 = 22877178
  Initial alloc = 25164895 rows (22877178 + 9%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
  =====
[08] Assign column names
[09] Apply user overrides on column types
  After 0 type and 0 drop user overrides : 11051
[10] Allocate memory for the datatable
  Allocating 4 column slots (4 - 0 dropped) with 25164895 rows
[11] Read the data
  jumps=[0..360), chunk_size=1017828, total_size=366418143
[12] Finalizing the datatable
Read 22885380 rows x 4 columns from 349.4MB (366418725 bytes) file in 00:02.589 wall clock time
Thread buffers were grown 0 times (if all 1 threads each grew once, this figure would be 1)
Final type counts
         0 : drop     
         1 : bool8    
         0 : bool8    
         0 : bool8    
         0 : bool8    
         1 : int32    
         0 : int64    
         0 : float64  
         0 : float64  
         0 : float64  
         2 : string   
Rereading 1 columns due to out-of-sample type exceptions.
Column 1 ("") bumped from 'bool8' to 'string' due to <<"Goulburn">> on row 0
[11] Read the data
  jumps=[0..360), chunk_size=1017828, total_size=366418143
Read 99%. ETA 00:00 
[12] Finalizing the datatable
Reread 22885380 rows x 1 columns in 00:01.418
Read 22885380 rows. Exactly what was estimated and allocated up front
=============================
   0.001s (  0%) Memory map 0.341GB file
   0.003s (  0%) sep=',' ncol=4 and header detection
   0.001s (  0%) Column type detection using 10027 sample rows
   0.574s ( 14%) Allocation of 22885380 rows x 4 cols (0.469GB)
   3.428s ( 86%) Reading 360 chunks of 0.971MB (63547 rows) using 1 threads
   =    0.010s (  0%) Finding first non-embedded \n after each jump
   +    1.988s ( 50%) Parse to row-major thread buffers
   +    1.137s ( 28%) Transpose
   +    0.292s (  7%) Waiting
   1.418s ( 35%) Rereading 1 columns due to out-of-sample type exceptions
   4.007s        Total
There were 20 warnings (use warnings() to see them)

@HughParsonage Thanks! I'm confused though. You're saying that it works fine with verbose=TRUE, showProgress=TRUE, which is what we hoped for - yay! Didn't it fail before? The default for showProgress is TRUE anyway, but when you run with the default FALSE for verbose, it doesn't work and you see the stack imbalance? It's strange that _less_ output makes it fail. Please confirm. If so then perhaps I'm barking up the wrong tree. It works fine for me here on Linux, so I'm relying on your testing on Windows. Thanks.
(Also, at the bottom of the 10th run's output, it says there were 20 warnings. I assume those are the 2 warnings shown further up, repeated 10 times. If so, that makes sense.)

Hi Matt, sorry for the confusion.

You're right that the original problem no longer results in a crash, i.e. the following works as expected:

fread("SA2-by-DJZ-2011.csv", header = FALSE, na.strings = "")

To clarify: originally, with verbose = FALSE (the default) I got a crash. Before filing this issue I ran it with verbose = TRUE and noticed the 'stack imbalance' warning, but did not encounter a crash. With the latest version, I get no crash (or indeed any problem) with verbose = FALSE.

The reason I said 'not fixed' was that I had seen these warning messages:

Warning messages:
Warning: stack imbalance in '$', 26 then 27
Warning: stack imbalance in 'lapply', 31 then 30
Warning: stack imbalance in '$', 14 then 15

which looked odd, and I thought they might point to a closely related problem, if not the identical one. Having said that, this morning in Australia I can't reproduce the warning messages.

OK, I see. The warning messages about stack imbalance are essentially errors, yes. We can't leave them. I treat a stack imbalance warning as a crash, even if it hasn't actually crashed yet. (Once you've seen that warning, it's just a matter of time until it crashes.)

When you do 10 runs in a fresh R session with verbose=TRUE, showProgress=TRUE, are any of the 20 warnings about stack imbalance, or are all 20 just the following regular warnings?

1: In fread("SA2-by-DJZ-2011.csv", header = FALSE, na.strings = "",  :
  Starting data input on line 12 <<"Goulburn","110018063",3499,>> with 4 fields and discarding line 11 <<"Main Statistical Area Structu>> before it because it has a different number of fields (3).
2: In fread("SA2-by-DJZ-2011.csv", header = FALSE, na.strings = "",  :
  Found the last consistent line but text exists afterwards. Consider fill=TRUE and/or blank.lines.skip=TRUE. First 200 characters of discarded line: <<"Dataset: 2011 Census of Population and Housing">>
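One way to answer that question without reading 20 warnings by eye is to collect the warning messages from each run and tally them. The sketch below is an illustration only: collect_warnings() and fake_read() are hypothetical names, and fake_read() is a stand-in that raises two ordinary warnings per call so the snippet runs without the 350 MB file. A caveat, as far as I understand R's behaviour: the "stack imbalance" lines in the logs above appear inline with the progress output, which suggests they are printed directly to the console rather than signalled as warnings, so a handler like this may never see them. It is mainly useful for confirming that the 20 collected warnings are the two regular ones repeated.

```r
# Hypothetical helper (not part of data.table): call f() n times and
# collect the message of every warning that R signals along the way.
collect_warnings <- function(f, n = 10L) {
  msgs <- character(0)
  for (i in seq_len(n)) {
    withCallingHandlers(
      f(),
      warning = function(w) {
        msgs <<- c(msgs, conditionMessage(w))  # record the message
        invokeRestart("muffleWarning")         # then carry on with f()
      }
    )
  }
  msgs
}

# Stand-in for the real fread() call: two ordinary warnings per run.
fake_read <- function() {
  warning("Starting data input on line 12")
  warning("Found the last consistent line but text exists afterwards")
  invisible(NULL)
}

msgs <- collect_warnings(fake_read, n = 10L)
n_total <- length(msgs)
n_imbalance <- sum(grepl("stack imbalance", msgs, fixed = TRUE))

# Tally of distinct warning messages across all runs.
print(table(msgs))
```

To use it on the real file, replace fake_read with a function that wraps the fread() call from this thread.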

Once a stack imbalance warning has occurred, please start a new R session. We can't trust anything from R once that has happened even once.

When I ran with verbose=TRUE, showProgress=TRUE I did manage to get a crash. The message included something like const char SEXP. I'm trying to reproduce it from the command line (unfortunately it happened in RStudio, and RStudio closed before I could read the full message).

I can't reproduce the crash. Here is the result after rebooting. There was a stack imbalance warning:

> library(data.table)
data.table 1.10.5 IN DEVELOPMENT built 2017-11-15 00:36:41 UTC; appveyor
  The fastest way to learn (by data.table authors): https://www.datacamp.com/courses/data-analysis-the-data-table-way
  Documentation: ?data.table, example(data.table) and browseVignettes("data.table")
  Release notes, videos and slides: http://r-datatable.com
> for (i in 1:10) fread("SA2-by-DJZ-2011.csv", header = FALSE, na.strings = "", verbose = TRUE, showProgress = TRUE)
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
  Using 12 threads (omp_get_max_threads()=12, nth=12)
  NAstrings = [<<>>]
  None of the NAstrings look like numbers.
  show progress = 1
  0/1 column will be read as boolean
[02] Opening the file
  Opening file SA2-by-DJZ-2011.csv
  File opened, size = 349.4MB (366418725 bytes).
  Memory mapping ... ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
  \r-only line endings are not allowed because \n is found in the data
[05] Skipping initial rows if needed
  Positioned on line 1 starting: <<Australian Bureau of Statistic>>
[06] Detect separator, quoting rule, and ncolumns
  Detecting sep ...
  sep=','  with 89 lines of 4 fields using quote rule 0
  Detected 4 columns on line 12. This line is either column names or first data row. Line starts as: <<"Goulburn","110018063",3499,>>
  Quote rule picked = 0
  fill=false and the most number of columns found is 4
[07] Detect column types, good nrow estimate and whether first row is column names
  'header' changed by user from 'auto' to false
  Number of sampling jump points = 101 because (366418375 bytes from row 1 to eof) / (2 * 1457 jump0size) == 125744
  Type codes (jump 000)    : 1551  Quote rule 0
  Type codes (jump 100)    : 11051  Quote rule 0
  =====
  Sampled 10027 rows (handled \n inside quoted fields) at 101 jump points
  Bytes from first data row on line 12 to the end of last row: 366418143
  Line length: mean=16.02 sd=0.21 min=16 max=29
  Estimated number of rows: 366418143 / 16.02 = 22877178
  Initial alloc = 25164895 rows (22877178 + 9%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
  =====
[08] Assign column names
[09] Apply user overrides on column types
  After 0 type and 0 drop user overrides : 11051
[10] Allocate memory for the datatable
  Allocating 4 column slots (4 - 0 dropped) with 25164895 rows
[11] Read the data
  jumps=[0..360), chunk_size=1017828, total_size=366418143
[12] Finalizing the datatable
Read 22885380 rows x 4 columns from 349.4MB (366418725 bytes) file in 00:01.212 wall clock time
Thread buffers were grown 0 times (if all 12 threads each grew once, this figure would be 12)
Final type counts
         0 : drop     
         1 : bool8    
         0 : bool8    
         0 : bool8    
         0 : bool8    
         1 : int32    
         0 : int64    
         0 : float64  
         0 : float64  
         0 : float64  
         2 : string   
Rereading 1 columns due to out-of-sample type exceptions.
Column 1 ("") bumped from 'bool8' to 'string' due to <<"Goulburn">> on row 0
[11] Read the data
  jumps=[0..360), chunk_size=1017828, total_size=366418143
Read 31%. ETA 00:00 Warning: stack imbalance in '$', 24 then 23
Read 91%. ETA 00:00 Warning: stack imbalance in '$', 27 then 26
Read 95%. ETA 00:00 Warning: stack imbalance in '$', 28 then 29
Read 98%. ETA 00:00 
[12] Finalizing the datatable
Reread 22885380 rows x 1 columns in 00:00.895
Read 22885380 rows. Exactly what was estimated and allocated up front
=============================
   0.002s (  0%) Memory map 0.341GB file
   0.029s (  1%) sep=',' ncol=4 and header detection
   0.001s (  0%) Column type detection using 10027 sample rows
   0.314s ( 15%) Allocation of 22885380 rows x 4 cols (0.469GB)
   1.761s ( 84%) Reading 360 chunks of 0.971MB (63547 rows) using 12 threads
   =    0.015s (  1%) Finding first non-embedded \n after each jump
   +    0.599s ( 28%) Parse to row-major thread buffers
   +    0.400s ( 19%) Transpose
   +    0.746s ( 35%) Waiting
   0.895s ( 42%) Rereading 1 columns due to out-of-sample type exceptions
   2.107s        Total
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
  Using 12 threads (omp_get_max_threads()=12, nth=12)
  NAstrings = [<<>>]
  None of the NAstrings look like numbers.
  show progress = 1
  0/1 column will be read as boolean
[02] Opening the file
  Opening file SA2-by-DJZ-2011.csv
  File opened, size = 349.4MB (366418725 bytes).
  Memory mapping ... ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
  \r-only line endings are not allowed because \n is found in the data
[05] Skipping initial rows if needed
  Positioned on line 1 starting: <<Australian Bureau of Statistic>>
[06] Detect separator, quoting rule, and ncolumns
  Detecting sep ...
  sep=','  with 89 lines of 4 fields using quote rule 0
  Detected 4 columns on line 12. This line is either column names or first data row. Line starts as: <<"Goulburn","110018063",3499,>>
  Quote rule picked = 0
  fill=false and the most number of columns found is 4
[07] Detect column types, good nrow estimate and whether first row is column names
  'header' changed by user from 'auto' to false
  Number of sampling jump points = 101 because (366418375 bytes from row 1 to eof) / (2 * 1457 jump0size) == 125744
  Type codes (jump 000)    : 1551  Quote rule 0
  Type codes (jump 100)    : 11051  Quote rule 0
  =====
  Sampled 10027 rows (handled \n inside quoted fields) at 101 jump points
  Bytes from first data row on line 12 to the end of last row: 366418143
  Line length: mean=16.02 sd=0.21 min=16 max=29
  Estimated number of rows: 366418143 / 16.02 = 22877178
  Initial alloc = 25164895 rows (22877178 + 9%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
  =====
[08] Assign column names
[09] Apply user overrides on column types
  After 0 type and 0 drop user overrides : 11051
[10] Allocate memory for the datatable
  Allocating 4 column slots (4 - 0 dropped) with 25164895 rows
[11] Read the data
  jumps=[0..360), chunk_size=1017828, total_size=366418143
[12] Finalizing the datatable
Read 22885380 rows x 4 columns from 349.4MB (366418725 bytes) file in 00:01.335 wall clock time
Thread buffers were grown 0 times (if all 12 threads each grew once, this figure would be 12)
Final type counts
         0 : drop     
         1 : bool8    
         0 : bool8    
         0 : bool8    
         0 : bool8    
         1 : int32    
         0 : int64    
         0 : float64  
         0 : float64  
         0 : float64  
         2 : string   
Rereading 1 columns due to out-of-sample type exceptions.
Column 1 ("") bumped from 'bool8' to 'string' due to <<"Goulburn">> on row 0
[11] Read the data
  jumps=[0..360), chunk_size=1017828, total_size=366418143
Read 98%. ETA 00:00 
[12] Finalizing the datatable
Reread 22885380 rows x 1 columns in 00:01.049
Read 22885380 rows. Exactly what was estimated and allocated up front
=============================
   0.002s (  0%) Memory map 0.341GB file
   0.006s (  0%) sep=',' ncol=4 and header detection
   0.001s (  0%) Column type detection using 10027 sample rows
   0.402s ( 17%) Allocation of 22885380 rows x 4 cols (0.469GB)
   1.974s ( 83%) Reading 360 chunks of 0.971MB (63547 rows) using 12 threads
   =    0.001s (  0%) Finding first non-embedded \n after each jump
   +    0.209s (  9%) Parse to row-major thread buffers
   +    0.864s ( 36%) Transpose
   +    0.900s ( 38%) Waiting
   1.049s ( 44%) Rereading 1 columns due to out-of-sample type exceptions
   2.385s        Total
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
  Using 12 threads (omp_get_max_threads()=12, nth=12)
  NAstrings = [<<>>]
  None of the NAstrings look like numbers.
  show progress = 1
  0/1 column will be read as boolean
[02] Opening the file
  Opening file SA2-by-DJZ-2011.csv
  File opened, size = 349.4MB (366418725 bytes).
  Memory mapping ... ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
  \r-only line endings are not allowed because \n is found in the data
[05] Skipping initial rows if needed
  Positioned on line 1 starting: <<Australian Bureau of Statistic>>
[06] Detect separator, quoting rule, and ncolumns
  Detecting sep ...
  sep=','  with 89 lines of 4 fields using quote rule 0
  Detected 4 columns on line 12. This line is either column names or first data row. Line starts as: <<"Goulburn","110018063",3499,>>
  Quote rule picked = 0
  fill=false and the most number of columns found is 4
[07] Detect column types, good nrow estimate and whether first row is column names
  'header' changed by user from 'auto' to false
  Number of sampling jump points = 101 because (366418375 bytes from row 1 to eof) / (2 * 1457 jump0size) == 125744
  Type codes (jump 000)    : 1551  Quote rule 0
  Type codes (jump 100)    : 11051  Quote rule 0
  =====
  Sampled 10027 rows (handled \n inside quoted fields) at 101 jump points
  Bytes from first data row on line 12 to the end of last row: 366418143
  Line length: mean=16.02 sd=0.21 min=16 max=29
  Estimated number of rows: 366418143 / 16.02 = 22877178
  Initial alloc = 25164895 rows (22877178 + 9%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
  =====
[08] Assign column names
[09] Apply user overrides on column types
  After 0 type and 0 drop user overrides : 11051
[10] Allocate memory for the datatable
  Allocating 4 column slots (4 - 0 dropped) with 25164895 rows
[11] Read the data
  jumps=[0..360), chunk_size=1017828, total_size=366418143
[12] Finalizing the datatable
Read 22885380 rows x 4 columns from 349.4MB (366418725 bytes) file in 00:01.212 wall clock time
Thread buffers were grown 0 times (if all 12 threads each grew once, this figure would be 12)
Final type counts
         0 : drop     
         1 : bool8    
         0 : bool8    
         0 : bool8    
         0 : bool8    
         1 : int32    
         0 : int64    
         0 : float64  
         0 : float64  
         0 : float64  
         2 : string   
Rereading 1 columns due to out-of-sample type exceptions.
Column 1 ("") bumped from 'bool8' to 'string' due to <<"Goulburn">> on row 0
[11] Read the data
  jumps=[0..360), chunk_size=1017828, total_size=366418143
Read 96%. ETA 00:00 
[12] Finalizing the datatable
Reread 22885380 rows x 1 columns in 00:00.414
Read 22885380 rows. Exactly what was estimated and allocated up front
=============================
   0.002s (  0%) Memory map 0.341GB file
   0.007s (  0%) sep=',' ncol=4 and header detection
   0.002s (  0%) Column type detection using 10027 sample rows
   0.293s ( 18%) Allocation of 22885380 rows x 4 cols (0.469GB)
   1.322s ( 81%) Reading 360 chunks of 0.971MB (63547 rows) using 12 threads
   =    0.001s (  0%) Finding first non-embedded \n after each jump
   +    0.199s ( 12%) Parse to row-major thread buffers
   +    0.822s ( 51%) Transpose
   +    0.301s ( 19%) Waiting
   0.414s ( 25%) Rereading 1 columns due to out-of-sample type exceptions
   1.626s        Total
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
  Using 12 threads (omp_get_max_threads()=12, nth=12)
  NAstrings = [<<>>]
  None of the NAstrings look like numbers.
  show progress = 1
  0/1 column will be read as boolean
[02] Opening the file
  Opening file SA2-by-DJZ-2011.csv
  File opened, size = 349.4MB (366418725 bytes).
  Memory mapping ... ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
  \r-only line endings are not allowed because \n is found in the data
[05] Skipping initial rows if needed
  Positioned on line 1 starting: <<Australian Bureau of Statistic>>
[06] Detect separator, quoting rule, and ncolumns
  Detecting sep ...
  sep=','  with 89 lines of 4 fields using quote rule 0
  Detected 4 columns on line 12. This line is either column names or first data row. Line starts as: <<"Goulburn","110018063",3499,>>
  Quote rule picked = 0
  fill=false and the most number of columns found is 4
[07] Detect column types, good nrow estimate and whether first row is column names
  'header' changed by user from 'auto' to false
  Number of sampling jump points = 101 because (366418375 bytes from row 1 to eof) / (2 * 1457 jump0size) == 125744
  Type codes (jump 000)    : 1551  Quote rule 0
  Type codes (jump 100)    : 11051  Quote rule 0
  =====
  Sampled 10027 rows (handled \n inside quoted fields) at 101 jump points
  Bytes from first data row on line 12 to the end of last row: 366418143
  Line length: mean=16.02 sd=0.21 min=16 max=29
  Estimated number of rows: 366418143 / 16.02 = 22877178
  Initial alloc = 25164895 rows (22877178 + 9%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
  =====
[08] Assign column names
[09] Apply user overrides on column types
  After 0 type and 0 drop user overrides : 11051
[10] Allocate memory for the datatable
  Allocating 4 column slots (4 - 0 dropped) with 25164895 rows
[11] Read the data
  jumps=[0..360), chunk_size=1017828, total_size=366418143
[12] Finalizing the datatable
Read 22885380 rows x 4 columns from 349.4MB (366418725 bytes) file in 00:01.451 wall clock time
Thread buffers were grown 0 times (if all 12 threads each grew once, this figure would be 12)
Final type counts
         0 : drop     
         1 : bool8    
         0 : bool8    
         0 : bool8    
         0 : bool8    
         1 : int32    
         0 : int64    
         0 : float64  
         0 : float64  
         0 : float64  
         2 : string   
Rereading 1 columns due to out-of-sample type exceptions.
Column 1 ("") bumped from 'bool8' to 'string' due to <<"Goulburn">> on row 0
[11] Read the data
  jumps=[0..360), chunk_size=1017828, total_size=366418143
Read 97%. ETA 00:00 
[12] Finalizing the datatable
Reread 22885380 rows x 1 columns in 00:00.409
Read 22885380 rows. Exactly what was estimated and allocated up front
=============================
   0.002s (  0%) Memory map 0.341GB file
   0.006s (  0%) sep=',' ncol=4 and header detection
   0.001s (  0%) Column type detection using 10027 sample rows
   0.403s ( 22%) Allocation of 22885380 rows x 4 cols (0.469GB)
   1.448s ( 78%) Reading 360 chunks of 0.971MB (63547 rows) using 12 threads
   =    0.001s (  0%) Finding first non-embedded \n after each jump
   +    0.194s ( 10%) Parse to row-major thread buffers
   +    0.974s ( 52%) Transpose
   +    0.279s ( 15%) Waiting
   0.409s ( 22%) Rereading 1 columns due to out-of-sample type exceptions
   1.860s        Total
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
  Using 12 threads (omp_get_max_threads()=12, nth=12)
  NAstrings = [<<>>]
  None of the NAstrings look like numbers.
  show progress = 1
  0/1 column will be read as boolean
[02] Opening the file
  Opening file SA2-by-DJZ-2011.csv
  File opened, size = 349.4MB (366418725 bytes).
  Memory mapping ... ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
  \r-only line endings are not allowed because \n is found in the data
[05] Skipping initial rows if needed
  Positioned on line 1 starting: <<Australian Bureau of Statistic>>
[06] Detect separator, quoting rule, and ncolumns
  Detecting sep ...
  sep=','  with 89 lines of 4 fields using quote rule 0
  Detected 4 columns on line 12. This line is either column names or first data row. Line starts as: <<"Goulburn","110018063",3499,>>
  Quote rule picked = 0
  fill=false and the most number of columns found is 4
[07] Detect column types, good nrow estimate and whether first row is column names
  'header' changed by user from 'auto' to false
  Number of sampling jump points = 101 because (366418375 bytes from row 1 to eof) / (2 * 1457 jump0size) == 125744
  Type codes (jump 000)    : 1551  Quote rule 0
  Type codes (jump 100)    : 11051  Quote rule 0
  =====
  Sampled 10027 rows (handled \n inside quoted fields) at 101 jump points
  Bytes from first data row on line 12 to the end of last row: 366418143
  Line length: mean=16.02 sd=0.21 min=16 max=29
  Estimated number of rows: 366418143 / 16.02 = 22877178
  Initial alloc = 25164895 rows (22877178 + 9%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
  =====
[08] Assign column names
[09] Apply user overrides on column types
  After 0 type and 0 drop user overrides : 11051
[10] Allocate memory for the datatable
  Allocating 4 column slots (4 - 0 dropped) with 25164895 rows
[11] Read the data
  jumps=[0..360), chunk_size=1017828, total_size=366418143
[12] Finalizing the datatable
Read 22885380 rows x 4 columns from 349.4MB (366418725 bytes) file in 00:01.480 wall clock time
Thread buffers were grown 0 times (if all 12 threads each grew once, this figure would be 12)
Final type counts
         0 : drop     
         1 : bool8    
         0 : bool8    
         0 : bool8    
         0 : bool8    
         1 : int32    
         0 : int64    
         0 : float64  
         0 : float64  
         0 : float64  
         2 : string   
Rereading 1 columns due to out-of-sample type exceptions.
Column 1 ("") bumped from 'bool8' to 'string' due to <<"Goulburn">> on row 0
[11] Read the data
  jumps=[0..360), chunk_size=1017828, total_size=366418143
Read 99%. ETA 00:00 
[12] Finalizing the datatable
Reread 22885380 rows x 1 columns in 00:00.412
Read 22885380 rows. Exactly what was estimated and allocated up front
=============================
   0.002s (  0%) Memory map 0.341GB file
   0.007s (  0%) sep=',' ncol=4 and header detection
   0.001s (  0%) Column type detection using 10027 sample rows
   0.459s ( 24%) Allocation of 22885380 rows x 4 cols (0.469GB)
   1.424s ( 75%) Reading 360 chunks of 0.971MB (63547 rows) using 12 threads
   =    0.001s (  0%) Finding first non-embedded \n after each jump
   +    0.197s ( 10%) Parse to row-major thread buffers
   +    0.938s ( 50%) Transpose
   +    0.288s ( 15%) Waiting
   0.412s ( 22%) Rereading 1 columns due to out-of-sample type exceptions
   1.892s        Total
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
  Using 12 threads (omp_get_max_threads()=12, nth=12)
  NAstrings = [<<>>]
  None of the NAstrings look like numbers.
  show progress = 1
  0/1 column will be read as boolean
[02] Opening the file
  Opening file SA2-by-DJZ-2011.csv
  File opened, size = 349.4MB (366418725 bytes).
  Memory mapping ... ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
  \r-only line endings are not allowed because \n is found in the data
[05] Skipping initial rows if needed
  Positioned on line 1 starting: <<Australian Bureau of Statistic>>
[06] Detect separator, quoting rule, and ncolumns
  Detecting sep ...
  sep=','  with 89 lines of 4 fields using quote rule 0
  Detected 4 columns on line 12. This line is either column names or first data row. Line starts as: <<"Goulburn","110018063",3499,>>
  Quote rule picked = 0
  fill=false and the most number of columns found is 4
[07] Detect column types, good nrow estimate and whether first row is column names
  'header' changed by user from 'auto' to false
  Number of sampling jump points = 101 because (366418375 bytes from row 1 to eof) / (2 * 1457 jump0size) == 125744
  Type codes (jump 000)    : 1551  Quote rule 0
  Type codes (jump 100)    : 11051  Quote rule 0
  =====
  Sampled 10027 rows (handled \n inside quoted fields) at 101 jump points
  Bytes from first data row on line 12 to the end of last row: 366418143
  Line length: mean=16.02 sd=0.21 min=16 max=29
  Estimated number of rows: 366418143 / 16.02 = 22877178
  Initial alloc = 25164895 rows (22877178 + 9%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
  =====
[08] Assign column names
[09] Apply user overrides on column types
  After 0 type and 0 drop user overrides : 11051
[10] Allocate memory for the datatable
  Allocating 4 column slots (4 - 0 dropped) with 25164895 rows
[11] Read the data
  jumps=[0..360), chunk_size=1017828, total_size=366418143
[12] Finalizing the datatable
Read 22885380 rows x 4 columns from 349.4MB (366418725 bytes) file in 00:01.381 wall clock time
Thread buffers were grown 0 times (if all 12 threads each grew once, this figure would be 12)
Final type counts
         0 : drop     
         1 : bool8    
         0 : bool8    
         0 : bool8    
         0 : bool8    
         1 : int32    
         0 : int64    
         0 : float64  
         0 : float64  
         0 : float64  
         2 : string   
Rereading 1 columns due to out-of-sample type exceptions.
Column 1 ("") bumped from 'bool8' to 'string' due to <<"Goulburn">> on row 0
[11] Read the data
  jumps=[0..360), chunk_size=1017828, total_size=366418143
Read 97%. ETA 00:00 
[12] Finalizing the datatable
Reread 22885380 rows x 1 columns in 00:00.401
Read 22885380 rows. Exactly what was estimated and allocated up front
=============================
   0.002s (  0%) Memory map 0.341GB file
   0.005s (  0%) sep=',' ncol=4 and header detection
   0.001s (  0%) Column type detection using 10027 sample rows
   0.384s ( 22%) Allocation of 22885380 rows x 4 cols (0.469GB)
   1.389s ( 78%) Reading 360 chunks of 0.971MB (63547 rows) using 12 threads
   =    0.001s (  0%) Finding first non-embedded \n after each jump
   +    0.196s ( 11%) Parse to row-major thread buffers
   +    0.911s ( 51%) Transpose
   +    0.281s ( 16%) Waiting
   0.401s ( 22%) Rereading 1 columns due to out-of-sample type exceptions
   1.781s        Total
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
  Using 12 threads (omp_get_max_threads()=12, nth=12)
  NAstrings = [<<>>]
  None of the NAstrings look like numbers.
  show progress = 1
  0/1 column will be read as boolean
[02] Opening the file
  Opening file SA2-by-DJZ-2011.csv
  File opened, size = 349.4MB (366418725 bytes).
  Memory mapping ... ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
  \r-only line endings are not allowed because \n is found in the data
[05] Skipping initial rows if needed
  Positioned on line 1 starting: <<Australian Bureau of Statistic>>
[06] Detect separator, quoting rule, and ncolumns
  Detecting sep ...
  sep=','  with 89 lines of 4 fields using quote rule 0
  Detected 4 columns on line 12. This line is either column names or first data row. Line starts as: <<"Goulburn","110018063",3499,>>
  Quote rule picked = 0
  fill=false and the most number of columns found is 4
[07] Detect column types, good nrow estimate and whether first row is column names
  'header' changed by user from 'auto' to false
  Number of sampling jump points = 101 because (366418375 bytes from row 1 to eof) / (2 * 1457 jump0size) == 125744
  Type codes (jump 000)    : 1551  Quote rule 0
  Type codes (jump 100)    : 11051  Quote rule 0
  =====
  Sampled 10027 rows (handled \n inside quoted fields) at 101 jump points
  Bytes from first data row on line 12 to the end of last row: 366418143
  Line length: mean=16.02 sd=0.21 min=16 max=29
  Estimated number of rows: 366418143 / 16.02 = 22877178
  Initial alloc = 25164895 rows (22877178 + 9%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
  =====
[08] Assign column names
[09] Apply user overrides on column types
  After 0 type and 0 drop user overrides : 11051
[10] Allocate memory for the datatable
  Allocating 4 column slots (4 - 0 dropped) with 25164895 rows
[11] Read the data
  jumps=[0..360), chunk_size=1017828, total_size=366418143
[12] Finalizing the datatable
Read 22885380 rows x 4 columns from 349.4MB (366418725 bytes) file in 00:01.384 wall clock time
Thread buffers were grown 0 times (if all 12 threads each grew once, this figure would be 12)
Final type counts
         0 : drop     
         1 : bool8    
         0 : bool8    
         0 : bool8    
         0 : bool8    
         1 : int32    
         0 : int64    
         0 : float64  
         0 : float64  
         0 : float64  
         2 : string   
Rereading 1 columns due to out-of-sample type exceptions.
Column 1 ("") bumped from 'bool8' to 'string' due to <<"Goulburn">> on row 0
[11] Read the data
  jumps=[0..360), chunk_size=1017828, total_size=366418143
Read 98%. ETA 00:00 
[12] Finalizing the datatable
Reread 22885380 rows x 1 columns in 00:00.480
Read 22885380 rows. Exactly what was estimated and allocated up front
=============================
   0.002s (  0%) Memory map 0.341GB file
   0.006s (  0%) sep=',' ncol=4 and header detection
   0.002s (  0%) Column type detection using 10027 sample rows
   0.476s ( 26%) Allocation of 22885380 rows x 4 cols (0.469GB)
   1.378s ( 74%) Reading 360 chunks of 0.971MB (63547 rows) using 12 threads
   =    0.001s (  0%) Finding first non-embedded \n after each jump
   +    0.192s ( 10%) Parse to row-major thread buffers
   +    0.833s ( 45%) Transpose
   +    0.352s ( 19%) Waiting
   0.480s ( 26%) Rereading 1 columns due to out-of-sample type exceptions
   1.864s        Total
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
  Using 12 threads (omp_get_max_threads()=12, nth=12)
  NAstrings = [<<>>]
  None of the NAstrings look like numbers.
  show progress = 1
  0/1 column will be read as boolean
[02] Opening the file
  Opening file SA2-by-DJZ-2011.csv
  File opened, size = 349.4MB (366418725 bytes).
  Memory mapping ... ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
  \r-only line endings are not allowed because \n is found in the data
[05] Skipping initial rows if needed
  Positioned on line 1 starting: <<Australian Bureau of Statistic>>
[06] Detect separator, quoting rule, and ncolumns
  Detecting sep ...
  sep=','  with 89 lines of 4 fields using quote rule 0
  Detected 4 columns on line 12. This line is either column names or first data row. Line starts as: <<"Goulburn","110018063",3499,>>
  Quote rule picked = 0
  fill=false and the most number of columns found is 4
[07] Detect column types, good nrow estimate and whether first row is column names
  'header' changed by user from 'auto' to false
  Number of sampling jump points = 101 because (366418375 bytes from row 1 to eof) / (2 * 1457 jump0size) == 125744
  Type codes (jump 000)    : 1551  Quote rule 0
  Type codes (jump 100)    : 11051  Quote rule 0
  =====
  Sampled 10027 rows (handled \n inside quoted fields) at 101 jump points
  Bytes from first data row on line 12 to the end of last row: 366418143
  Line length: mean=16.02 sd=0.21 min=16 max=29
  Estimated number of rows: 366418143 / 16.02 = 22877178
  Initial alloc = 25164895 rows (22877178 + 9%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
  =====
[08] Assign column names
[09] Apply user overrides on column types
  After 0 type and 0 drop user overrides : 11051
[10] Allocate memory for the datatable
  Allocating 4 column slots (4 - 0 dropped) with 25164895 rows
[11] Read the data
  jumps=[0..360), chunk_size=1017828, total_size=366418143
[12] Finalizing the datatable
Read 22885380 rows x 4 columns from 349.4MB (366418725 bytes) file in 00:01.374 wall clock time
Thread buffers were grown 0 times (if all 12 threads each grew once, this figure would be 12)
Final type counts
         0 : drop     
         1 : bool8    
         0 : bool8    
         0 : bool8    
         0 : bool8    
         1 : int32    
         0 : int64    
         0 : float64  
         0 : float64  
         0 : float64  
         2 : string   
Rereading 1 columns due to out-of-sample type exceptions.
Column 1 ("") bumped from 'bool8' to 'string' due to <<"Goulburn">> on row 0
[11] Read the data
  jumps=[0..360), chunk_size=1017828, total_size=366418143
Read 98%. ETA 00:00 
[12] Finalizing the datatable
Reread 22885380 rows x 1 columns in 00:00.507
Read 22885380 rows. Exactly what was estimated and allocated up front
=============================
   0.002s (  0%) Memory map 0.341GB file
   0.006s (  0%) sep=',' ncol=4 and header detection
   0.001s (  0%) Column type detection using 10027 sample rows
   0.311s ( 17%) Allocation of 22885380 rows x 4 cols (0.469GB)
   1.562s ( 83%) Reading 360 chunks of 0.971MB (63547 rows) using 12 threads
   =    0.001s (  0%) Finding first non-embedded \n after each jump
   +    0.193s ( 10%) Parse to row-major thread buffers
   +    0.988s ( 52%) Transpose
   +    0.381s ( 20%) Waiting
   0.507s ( 27%) Rereading 1 columns due to out-of-sample type exceptions
   1.881s        Total
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
  Using 12 threads (omp_get_max_threads()=12, nth=12)
  NAstrings = [<<>>]
  None of the NAstrings look like numbers.
  show progress = 1
  0/1 column will be read as boolean
[02] Opening the file
  Opening file SA2-by-DJZ-2011.csv
  File opened, size = 349.4MB (366418725 bytes).
  Memory mapping ... ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
  \r-only line endings are not allowed because \n is found in the data
[05] Skipping initial rows if needed
  Positioned on line 1 starting: <<Australian Bureau of Statistic>>
[06] Detect separator, quoting rule, and ncolumns
  Detecting sep ...
  sep=','  with 89 lines of 4 fields using quote rule 0
  Detected 4 columns on line 12. This line is either column names or first data row. Line starts as: <<"Goulburn","110018063",3499,>>
  Quote rule picked = 0
  fill=false and the most number of columns found is 4
[07] Detect column types, good nrow estimate and whether first row is column names
  'header' changed by user from 'auto' to false
  Number of sampling jump points = 101 because (366418375 bytes from row 1 to eof) / (2 * 1457 jump0size) == 125744
  Type codes (jump 000)    : 1551  Quote rule 0
  Type codes (jump 100)    : 11051  Quote rule 0
  =====
  Sampled 10027 rows (handled \n inside quoted fields) at 101 jump points
  Bytes from first data row on line 12 to the end of last row: 366418143
  Line length: mean=16.02 sd=0.21 min=16 max=29
  Estimated number of rows: 366418143 / 16.02 = 22877178
  Initial alloc = 25164895 rows (22877178 + 9%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
  =====
[08] Assign column names
[09] Apply user overrides on column types
  After 0 type and 0 drop user overrides : 11051
[10] Allocate memory for the datatable
  Allocating 4 column slots (4 - 0 dropped) with 25164895 rows
[11] Read the data
  jumps=[0..360), chunk_size=1017828, total_size=366418143
[12] Finalizing the datatable
Read 22885380 rows x 4 columns from 349.4MB (366418725 bytes) file in 00:01.318 wall clock time
Thread buffers were grown 0 times (if all 12 threads each grew once, this figure would be 12)
Final type counts
         0 : drop     
         1 : bool8    
         0 : bool8    
         0 : bool8    
         0 : bool8    
         1 : int32    
         0 : int64    
         0 : float64  
         0 : float64  
         0 : float64  
         2 : string   
Rereading 1 columns due to out-of-sample type exceptions.
Column 1 ("") bumped from 'bool8' to 'string' due to <<"Goulburn">> on row 0
[11] Read the data
  jumps=[0..360), chunk_size=1017828, total_size=366418143
Read 96%. ETA 00:00 
[12] Finalizing the datatable
Reread 22885380 rows x 1 columns in 00:00.493
Read 22885380 rows. Exactly what was estimated and allocated up front
=============================
   0.002s (  0%) Memory map 0.341GB file
   0.006s (  0%) sep=',' ncol=4 and header detection
   0.001s (  0%) Column type detection using 10027 sample rows
   0.306s ( 17%) Allocation of 22885380 rows x 4 cols (0.469GB)
   1.496s ( 83%) Reading 360 chunks of 0.971MB (63547 rows) using 12 threads
   =    0.001s (  0%) Finding first non-embedded \n after each jump
   +    0.193s ( 11%) Parse to row-major thread buffers
   +    0.935s ( 52%) Transpose
   +    0.367s ( 20%) Waiting
   0.493s ( 27%) Rereading 1 columns due to out-of-sample type exceptions
   1.811s        Total
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
  Using 12 threads (omp_get_max_threads()=12, nth=12)
  NAstrings = [<<>>]
  None of the NAstrings look like numbers.
  show progress = 1
  0/1 column will be read as boolean
[02] Opening the file
  Opening file SA2-by-DJZ-2011.csv
  File opened, size = 349.4MB (366418725 bytes).
  Memory mapping ... ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
  \r-only line endings are not allowed because \n is found in the data
[05] Skipping initial rows if needed
  Positioned on line 1 starting: <<Australian Bureau of Statistic>>
[06] Detect separator, quoting rule, and ncolumns
  Detecting sep ...
  sep=','  with 89 lines of 4 fields using quote rule 0
  Detected 4 columns on line 12. This line is either column names or first data row. Line starts as: <<"Goulburn","110018063",3499,>>
  Quote rule picked = 0
  fill=false and the most number of columns found is 4
[07] Detect column types, good nrow estimate and whether first row is column names
  'header' changed by user from 'auto' to false
  Number of sampling jump points = 101 because (366418375 bytes from row 1 to eof) / (2 * 1457 jump0size) == 125744
  Type codes (jump 000)    : 1551  Quote rule 0
  Type codes (jump 100)    : 11051  Quote rule 0
  =====
  Sampled 10027 rows (handled \n inside quoted fields) at 101 jump points
  Bytes from first data row on line 12 to the end of last row: 366418143
  Line length: mean=16.02 sd=0.21 min=16 max=29
  Estimated number of rows: 366418143 / 16.02 = 22877178
  Initial alloc = 25164895 rows (22877178 + 9%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
  =====
[08] Assign column names
[09] Apply user overrides on column types
  After 0 type and 0 drop user overrides : 11051
[10] Allocate memory for the datatable
  Allocating 4 column slots (4 - 0 dropped) with 25164895 rows
[11] Read the data
  jumps=[0..360), chunk_size=1017828, total_size=366418143
[12] Finalizing the datatable
Read 22885380 rows x 4 columns from 349.4MB (366418725 bytes) file in 00:01.141 wall clock time
Thread buffers were grown 0 times (if all 12 threads each grew once, this figure would be 12)
Final type counts
         0 : drop     
         1 : bool8    
         0 : bool8    
         0 : bool8    
         0 : bool8    
         1 : int32    
         0 : int64    
         0 : float64  
         0 : float64  
         0 : float64  
         2 : string   
Rereading 1 columns due to out-of-sample type exceptions.
Column 1 ("") bumped from 'bool8' to 'string' due to <<"Goulburn">> on row 0
[11] Read the data
  jumps=[0..360), chunk_size=1017828, total_size=366418143
Read 98%. ETA 00:00 
[12] Finalizing the datatable
Reread 22885380 rows x 1 columns in 00:00.506
Read 22885380 rows. Exactly what was estimated and allocated up front
=============================
   0.002s (  0%) Memory map 0.341GB file
   0.007s (  0%) sep=',' ncol=4 and header detection
   0.001s (  0%) Column type detection using 10027 sample rows
   0.132s (  8%) Allocation of 22885380 rows x 4 cols (0.469GB)
   1.506s ( 91%) Reading 360 chunks of 0.971MB (63547 rows) using 12 threads
   =    0.001s (  0%) Finding first non-embedded \n after each jump
   +    0.195s ( 12%) Parse to row-major thread buffers
   +    0.938s ( 57%) Transpose
   +    0.371s ( 23%) Waiting
   0.506s ( 31%) Rereading 1 columns due to out-of-sample type exceptions
   1.647s        Total
Warning: stack imbalance in 'for', 2 then 8
There were 20 warnings (use warnings() to see them)
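
The "Estimated number of rows" and "Initial alloc" lines repeated in each run above can be checked by hand. The sketch below only re-does the arithmetic as the log message words it (bytes/max(mean-2*sd,min), clamped between 1.1*estn and 2.0*estn); it is not taken from the fread source, and the variable names are made up for illustration.

```r
# Figures copied from the verbose output above.
bytes    <- 366418143   # bytes from first data row to end of last row
estn     <- 22877178    # estimated number of rows reported by fread
mean_len <- 16.02       # mean line length
sd_len   <- 0.21        # standard deviation of line length
min_len  <- 16          # minimum line length

# Unclamped allocation as the log message describes it (assumption:
# this mirrors the message text, not necessarily the C implementation).
raw <- bytes / max(mean_len - 2 * sd_len, min_len)

# Clamp to [1.1 * estn, 2.0 * estn] and truncate to whole rows.
alloc <- floor(min(max(raw, 1.1 * estn), 2.0 * estn))

print(raw)    # below 1.1 * estn, so the lower clamp applies
print(alloc)  # compare with "Initial alloc" in the log
```

Under these assumptions the lower clamp is the binding one, which gives the 25164895 rows shown in the log. The "+ 9%" is presumably the 10% margin after integer truncation.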

Oddly enough, that certainty is great. Thanks. It means the flush doesn't work, and I'll have to find a way to avoid Rprintf after all. It works reliably with verbose=FALSE, showProgress=FALSE (you wrote that near the top of this issue, so I'm relying on it). By "reliably" I mean 10 consecutive runs with only the two expected warnings and no sign of a stack imbalance warning.
Leave it with me. Thanks again.

@HughParsonage рдареАрдХ рд╣реИ, рдХреГрдкрдпрд╛ рдЙрд╕ рд╣рд╛рд▓ рдХреЗ рджреВрд╕рд░реЗ рдкреНрд░рдпрд╛рд╕ рдХреЗ рд╕рд╛рде рдлрд┐рд░ рд╕реЗ рдХреЛрд╢рд┐рд╢ рдХрд░реЗрдВред рдЗрд╕реЗ рдЕрднреА рддрдХ рдорд╛рд╕реНрдЯрд░ рдореЗрдВ рд╡рд┐рд▓рдп рдирд╣реАрдВ рдХрд┐рдпрд╛ рдЧрдпрд╛ рд╣реИ, рдЗрд╕рд▓рд┐рдП рдХреГрдкрдпрд╛ рдпрд╣рд╛рдВ рд╢рд╛рдЦрд╛ рд╕реЗ Windows.zip рд▓рд╛рдиреЗ рдХреЗ рд▓рд┐рдП рд╕рд╛рд╡рдзрд╛рди рд░рд╣реЗрдВред рдкрд╣рд▓реЗ рдХреА рддрд░рд╣, рдХреГрдкрдпрд╛ рдкреВрд░реНрдг рдЖрдЙрдЯрдкреБрдЯ рдкреНрд░рджрд╛рди рдХрд░реЗрдВ рддрд╛рдХрд┐ рдореИрдВ рдЗрд╕реЗ рдЬрд╛рдВрдЪ рд╕рдХреВрдВред рдзрдиреНрдпрд╡рд╛рдж!

рдирд┐рдореНрдирд▓рд┐рдЦрд┐рдд рдХреЗ рдкрд╣рд▓реЗ рдкреНрд░рдпрд╛рд╕ рдореЗрдВ рдПрдХ рджреБрд░реНрдШрдЯрдирд╛ (рдПрдХ рдкреЙрдЗрдВрдЯрд░ рдХреЗ рдмрд╛рд░реЗ рдореЗрдВ рдХреБрдЫ) рд╣реБрдИред

рджреВрд╕рд░рд╛ рдкреНрд░рдпрд╛рд╕ (рд░рд┐рдмреВрдЯ рдХрд░рдиреЗ рдХреЗ рдмрд╛рдж) stack imbalance in '$', 16 then 15 рдЪреЗрддрд╛рд╡рдиреА рдХреЗ рд░реВрдк рдореЗрдВ рд╣реЛрддрд╛ рд╣реИред

# Assert that `data.table` is not installed:
stopifnot(!requireNamespace("data.table", quietly = TRUE))

install.packages("https://ci.appveyor.com/api/buildjobs/bpsehtwybbbgbyy3/artifacts/data.table_1.10.5.zip",
                 repos = NULL)
# Installing package into ‘C:/Users/hughp/Documents/R/win-library/3.4’
# (as ‘lib’ is unspecified)
# trying URL 'https://ci.appveyor.com/api/buildjobs/bpsehtwybbbgbyy3/artifacts/data.table_1.10.5.zip'
# Content type 'application/octet-stream' length 1557502 bytes (1.5 MB)
# downloaded 1.5 MB

library(data.table)
# data.table 1.10.5 IN DEVELOPMENT built 2017-11-17 01:38:17 UTC; appveyor
# The fastest way to learn (by data.table authors): https://www.datacamp.com/courses/data-analysis-the-data-table-way
# Documentation: ?data.table, example(data.table) and browseVignettes("data.table")
# Release notes, videos and slides: http://r-datatable.com

setwd("~/ABS-data/inbox/SA2-by-DJZ-2011/")
fread("SA2-by-DJZ-2011.csv", verbose = TRUE, na.strings = "", header = FALSE)
# Input contains no \n. Taking this to be a filename to open
# [01] Check arguments
# Using 12 threads (omp_get_max_threads()=12, nth=12)
# NAstrings = [<<>>]
# None of the NAstrings look like numbers.
# show progress = 1
# 0/1 column will be read as boolean
# [02] Opening the file
# Opening file SA2-by-DJZ-2011.csv
# File opened, size = 349.4MB (366418725 bytes).
# Memory mapping ... ok
# [03] Detect and skip BOM
# [04] Arrange mmap to be \0 terminated
# \r-only line endings are not allowed because \n is found in the data
# [05] Skipping initial rows if needed
# Positioned on line 1 starting: <<Australian Bureau of Statistic>>
#   [06] Detect separator, quoting rule, and ncolumns
# Detecting sep ...
# sep=','  with 89 lines of 4 fields using quote rule 0
# Detected 4 columns on line 12. This line is either column names or first data row. Line starts as: <<"Goulburn","110018063",3499,>>
#   Quote rule picked = 0
# fill=false and the most number of columns found is 4
# [07] Detect column types, good nrow estimate and whether first row is column names
# 'header' changed by user from 'auto' to false
# Number of sampling jump points = 101 because (366418375 bytes from row 1 to eof) / (2 * 1457 jump0size) == 125744
# Type codes (jump 000)    : 1551  Quote rule 0
# Type codes (jump 100)    : 1A51  Quote rule 0
# =====
#   Sampled 10027 rows (handled \n inside quoted fields) at 101 jump points
# Bytes from first data row on line 12 to the end of last row: 366418143
# Line length: mean=16.02 sd=0.21 min=16 max=29
# Estimated number of rows: 366418143 / 16.02 = 22877178
# Initial alloc = 25164895 rows (22877178 + 9%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
# =====
#   [08] Assign column names
# [09] Apply user overrides on column types
# After 0 type and 0 drop user overrides : 1A51
# [10] Allocate memory for the datatable
# Allocating 4 column slots (4 - 0 dropped) with 25164895 rows
# [11] Read the data
# jumps=[0..360), chunk_size=1017828, total_size=366418143
# Read 78%. ETA 00:00 Warning: stack imbalance in '$', 16 then 15
# Read 22885380 rows x 4 columns from 349.4MB (366418725 bytes) file in 00:02.677 wall clock time
# [12] Finalizing the datatable
# Type counts:
#   1 : bool8     '1'
# 1 : int32     '5'
# 2 : string    'A'
# =============================
#   0.002s (  0%) Memory map 0.341GB file
# 0.007s (  0%) sep=',' ncol=4 and header detection
# 0.001s (  0%) Column type detection using 10027 sample rows
# 0.297s ( 11%) Allocation of 25164895 rows x 4 cols (0.469GB) of which 22885380 ( 91%) rows used
# 2.369s ( 89%) Reading 360 chunks of 0.971MB (63547 rows) using 12 threads
# =    0.003s (  0%) Finding first non-embedded \n after each jump
# +    0.273s ( 10%) Parse to row-major thread buffers (grown 0 times)
# +    1.313s ( 49%) Transpose
# +    0.780s ( 29%) Waiting
# 0.893s ( 33%) Rereading 1 columns due to out-of-sample type exceptions
# 2.677s        Total
# Column 1 ("") bumped from 'bool8' to 'string' due to <<"Goulburn">> on row 0
# V1        V2      V3 V4
# 1: Goulburn 110018063    3499 NA
# 2:       NA 110018064     812 NA
# 3:       NA 110018065    2158 NA
# 4:       NA 110019999     402 NA
# 5:       NA 110028068      10 NA
# ---                              
#   22885376:       NA 997999799       0 NA
# 22885377:       NA 998999899      64 NA
# 22885378:       NA 994999499      34 NA
# 22885379:       NA 0&&&&&&&&  250796 NA
# 22885380:       NA 0@@@@@@@@ 7305367 NA
# Warning messages:
#   1: In fread("SA2-by-DJZ-2011.csv", verbose = TRUE, na.strings = "",  :
#                 Starting data input on line 12 <<"Goulburn","110018063",3499,>> with 4 fields and discarding line 11 <<"Main Statistical Area Structu>> before it because it has a different number of fields (3).
#               2: In fread("SA2-by-DJZ-2011.csv", verbose = TRUE, na.strings = "",  :
#               Found the last consistent line but text exists afterwards. Consider fill=TRUE and/or blank.lines.skip=TRUE. First 200 characters of discarded line: <<"Dataset: 2011 Census of Population and Housing">>

Hi, @mattdowle. There are versions of gcc still in use whose OpenMP support is at best 3.1, not 4.0. I ran into that problem in one of my packages on CRAN (Delaporte), where I tried to use a SIMD directive (OpenMP 4.0). It compiled with Rtools for Windows (based on 4.9.3), but threw an error on someone's Linux machine that was still using gcc 4.8.0. Even Windows can only use 4.0 and not 4.5 calls, if I remember correctly. Could that be contributing to this issue?

@HughParsonage Thanks for testing so quickly! OK, I'll keep thinking!
@aadler That's a good thought - anything is possible.

@HughParsonage Could you please confirm that the same command with just one change ( verbose=FALSE ) works fine? That is, fread("SA2-by-DJZ-2011.csv", verbose = FALSE, na.strings = "", header = FALSE) . The progress meter will still be displayed.

Yes, running that command (ten times) returned the expected result (i.e. the data.table, with only the two warnings because the file is badly formatted). No stack imbalance warnings.

Thanks. So it does seem to be related to console output. Some more things to try ...

In verbose mode, there are some branches inside the parallel region that call wallclock() . I've short-circuited that to always return 0.0 and avoid the system call, to rule that out. I thought it was thread-safe but maybe not. Please try the new Windows.zip from the rebuilt branch here.

First attempt:

install.packages("https://ci.appveyor.com/api/buildjobs/o0pn9ttkrbqgqw2k/artifacts/data.table_1.10.5.zip",
                 repos = NULL)
# Installing package into ‘C:/Users/hughp/Documents/R/win-library/3.4’
# (as ‘lib’ is unspecified)
# trying URL 'https://ci.appveyor.com/api/buildjobs/o0pn9ttkrbqgqw2k/artifacts/data.table_1.10.5.zip'
# Content type 'application/octet-stream' length 1556972 bytes (1.5 MB)
# downloaded 1.5 MB
# 
# package ‘data.table’ successfully unpacked and MD5 sums checked

library(data.table)
# data.table 1.10.5 IN DEVELOPMENT built 2017-11-17 03:49:20 UTC; appveyor
# The fastest way to learn (by data.table authors): https://www.datacamp.com/courses/data-analysis-the-data-table-way
# Documentation: ?data.table, example(data.table) and browseVignettes("data.table")
# Release notes, videos and slides: http://r-datatable.com
fread("SA2-by-DJZ-2011.csv", verbose = TRUE, na.strings = "", header = FALSE)

image

On the second attempt, I get the following warning:

Read 22%. ETA 00:00 Error in fread("SA2-by-DJZ-2011.csv", verbose = TRUE, na.strings = "",  : 
  unprotect_ptr: pointer not found
In addition: Warning messages:
1: In fread("SA2-by-DJZ-2011.csv", verbose = TRUE, na.strings = "",  :
  Starting data input on line 12 <<"Goulburn","110018063",3499,>> with 4 fields and discarding line 11 <<"Main Statistical Area Structu>> before it because it has a different number of fields (3).
2: In fread("SA2-by-DJZ-2011.csv", verbose = TRUE, na.strings = "",  :
 Warning: stack imbalance in '$', 29 then 28
 Found the last consistent line but text exists afterwards. Consider fill=TRUE and/or blank.lines.skip=TRUE. First 200 characters of discarded line: <<"Dataset: 2011 Census of Population and Housing">>
Warning: stack imbalance in 'lapply', 125 then 126
Warning: stack imbalance in 'lapply', 55 then 53
Warning: stack imbalance in 'lapply', 30 then 34
Warning: stack imbalance in '<-', 28 then 31
Warning: stack imbalance in '{', 24 then 27
Warning: stack imbalance in '{', 18 then 21

Just a thought: could this be a problem with RStudio? Running the script from the terminal doesn't reproduce it as easily. I'm running from RStudio because it makes it easy to copy the console output.

When you say it doesn't reproduce _as easily_ outside RStudio, does it reproduce _at all_? Even if it only happens inside RStudio, it's still something I would aim to fix on the data.table side. I'm just asking as another avenue to confirm that it's definitely "just" the console output and not some other genuine stack imbalance in the fread logic.

I have yet to reproduce it outside RStudio, and I can reproduce it reliably inside (i.e., I can reproduce some warning or crash). I have tried the Windows Command Prompt and Git Shell (on Windows).

I'm using RStudio version 1.1.383 on Windows. Would it be helpful to you if I raised this issue with them as well, or would you prefer me to wait?

Thanks. It's really useful to know that it's only inside RStudio. No need to raise it with them. It just means that it has something to do with output console buffering (or similar). I've gone ahead with a workaround and am about to push it.

I don't see why Windows isn't compiling that change:
fread.c:1054:3: warning: too many arguments for format [-Wformat-extra-args]
It works fine here on Linux and on Travis. The Windows.zip is what you need to test this workaround. I'll have to sleep on it.
(It's complaining about line 1054 but not the very next line, 1055, which is just the same afaics. There must be some difference. A problem with __VA_ARGS__ and %llu on Windows - surely not.)

OK, finally the windows.zip is ready for you to try again here.

There are several workarounds in this branch now. If it works, I'll start removing workarounds to establish which one it was. The llu compiler warnings are looking the most promising because, according to the explanation @st-pasha found here, that would give rise to a stack imbalance in the verbose output. Maybe the Rprintf layer was hiding that from the compiler, which it can now see because it's using fprintf directly.
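
To illustrate the format-specifier concern, here is a minimal C sketch. It is not the actual fread.c code: `format_file_size` is a hypothetical helper, and the assumption is that the value being printed is a `uint64_t`. It uses `PRIu64` from `<inttypes.h>`, so the conversion in the format string matches the argument width on whichever C runtime is in use. When a conversion and its argument disagree in width, a variadic function may read its arguments incorrectly, which is undefined behaviour in C.

```c
#include <assert.h>
#include <inttypes.h>  /* PRIu64 */
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper (not from fread.c): format a file-size message.
 * PRIu64 expands to the conversion that matches uint64_t on the current
 * C runtime, so the format string and the argument always agree. */
int format_file_size(char *buf, size_t n, uint64_t bytes)
{
    assert(buf != NULL && n > 0);
    return snprintf(buf, n, "File opened, size = %" PRIu64 " bytes", bytes);
}
```

Whether this kind of mismatch is what triggers the warnings in this thread is not established here; the sketch only shows one portable way to avoid it.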

image

On the second attempt (after rebooting):

stopifnot(!requireNamespace("data.table", quietly = TRUE))
install.packages("https://ci.appveyor.com/api/buildjobs/1pi0ae5iuyj9rhj8/artifacts/data.table_1.10.5.zip",
                 repos = NULL)
# Installing package into ‘C:/Users/hughp/Documents/R/win-library/3.4’
# (as ‘lib’ is unspecified)
# trying URL 'https://ci.appveyor.com/api/buildjobs/1pi0ae5iuyj9rhj8/artifacts/data.table_1.10.5.zip'
# Content type 'application/octet-stream' length 1559167 bytes (1.5 MB)
# downloaded 1.5 MB
# 
# package ‘data.table’ successfully unpacked and MD5 sums checked

library(data.table)
# data.table 1.10.5 IN DEVELOPMENT built 2017-11-18 04:58:23 UTC; appveyor
# The fastest way to learn (by data.table authors): https://www.datacamp.com/courses/data-analysis-the-data-table-way
# Documentation: ?data.table, example(data.table) and browseVignettes("data.table")
# Release notes, videos and slides: http://r-datatable.com

fread("SA2-by-DJZ-2011.csv", verbose = TRUE, na.strings = "", header = FALSE)
Log file:  C:\Users\hughp\AppData\Local\Temp\RtmpIT9H0D/fread.out 
Input contains no \n. Taking this to be a filename to open
Read 11%. ETA 00:00 Warning: stack imbalance in '$', 19 then 20
Read 28%. ETA 00:00 Warning: stack imbalance in '$', 19 then 20
Read 48%. ETA 00:00 Warning: stack imbalance in '$', 20 then 19
Read 98%. ETA 00:00 [01] Check arguments
  Using 12 threads (omp_get_max_threads()=12, nth=12)
  NAstrings = [<<>>]
  None of the NAstrings look like numbers.
  show progress = 1
  0/1 column will be read as boolean
[02] Opening the file
  Opening file SA2-by-DJZ-2011.csv
  File opened, size = 349.4MB (366418725 bytes).
  Memory mapped ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
  \n has been found in the data so any mixture of line endings is allowed other than \r-only line endings. This is common and ideal.
[05] Skipping initial rows if needed
  Positioned on line 1 starting: <<Australian Bureau of Statistic>>
[06] Detect separator, quoting rule, and ncolumns
  Detecting sep ...
  sep=','  with 89 lines of 4 fields using quote rule 0
  Detected 4 columns on line 12. This line is either column names or first data row. Line starts as: <<"Goulburn","110018063",3499,>>
  Quote rule picked = 0
  fill=false and the most number of columns found is 4
[07] Detect column types, good nrow estimate and whether first row is column names
  'header' changed by user from 'auto' to false
  Number of sampling jump points = 101 because (366418375 bytes from row 1 to eof) / (2 * 1457 jump0size) == 125744
  Type codes (jump 000)    : 1551  Quote rule 0
  Type codes (jump 100)    : 1A51  Quote rule 0
  =====
  Sampled 10027 rows (handled \n inside quoted fields) at 101 jump points
  Bytes from first data row on line 12 to the end of last row: 366418143
  Line length: mean=16.02 sd=0.21 min=16 max=29
  Estimated number of rows: 366418143 / 16.02 = 22877178
  Initial alloc = 25164895 rows (22877178 + 9%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
  =====
[08] Assign column names
[09] Apply user overrides on column types
  After 0 type and 0 drop user overrides : 1A51
[10] Allocate memory for the datatable
  Allocating 4 column slots (4 - 0 dropped) with 25164895 rows
[11] Read the data
  jumps=[0..360), chunk_size=1017828, total_size=366418143
[11] Read the data
  jumps=[0..360), chunk_size=1017828, total_size=366418143
Read 22885380 rows x 4 columns from 349.4MB (366418725 bytes) file in 00:02.822 wall clock time
[12] Finalizing the datatable
  Type counts:
         1 : bool8     '1'
         1 : int32     '5'
         2 : string    'A'
=============================
   0.000s (  0%) Memory map 0.341GB file
   0.001s (  0%) sep=',' ncol=4 and header detection
   0.000s (  0%) Column type detection using 10027 sample rows
   0.291s ( 10%) Allocation of 25164895 rows x 4 cols (0.469GB) of which 22885380 ( 91%) rows used
   2.531s ( 90%) Reading 360 chunks of 0.971MB (63547 rows) using 12 threads
   =    0.002s (  0%) Finding first non-embedded \n after each jump
   +    0.282s ( 10%) Parse to row-major thread buffers (grown 0 times)
   +    1.537s ( 54%) Transpose
   +    0.710s ( 25%) Waiting
   0.842s ( 30%) Rereading 1 columns due to out-of-sample type exceptions
   2.822s        Total
Column 1 ("") bumped from 'bool8' to 'string' due to <<"Goulburn">> on row 0
                V1        V2      V3 V4
       1: Goulburn 110018063    3499 NA
       2:       NA 110018064     812 NA
       3:       NA 110018065    2158 NA
       4:       NA 110019999     402 NA
       5:       NA 110028068      10 NA
      ---                              
22885376:       NA 997999799       0 NA
22885377:       NA 998999899      64 NA
22885378:       NA 994999499      34 NA
22885379:       NA 0&&&&&&&&  250796 NA
22885380:       NA 0@@@@@@@@ 7305367 NA
Warning messages:
1: In fread("SA2-by-DJZ-2011.csv", verbose = TRUE, na.strings = "",  :
  Starting data input on line 12 <<"Goulburn","110018063",3499,>> with 4 fields and discarding line 11 <<"Main Statistical Area Structu>> before it because it has a different number of fields (3).
2: In fread("SA2-by-DJZ-2011.csv", verbose = TRUE, na.strings = "",  :
  Found the last consistent line but text exists afterwards. Consider fill=TRUE and/or blank.lines.skip=TRUE. First 200 characters of discarded line: <<"Dataset: 2011 Census of Population and Housing">>

Again, not reproducible outside RStudio.

Thanks for testing so quickly. Well, that certainly rules a lot out! Two ideas left. The first one is pushed and passed. Please try the new Windows.zip here. It concerns an alloca on the stack, and it has to do with na.strings, which you are setting as it happens. Definitely in the right area (stack imbalance) and worth a try.

No problem - I'll be away for the next 12 hours or so, so I can't test until then.
On Sat, 18 Nov 2017 at 5:20 pm, Matt Dowle notifications@github.com wrote:

Thanks for testing so quickly. Well, that certainly rules a lot out!
Two ideas left. The first one is pushed and passed. Please try the new Windows.zip
here
https://ci.appveyor.com/project/Rdatatable/data-table/build/1.0.1363/job/fo02vnbu5ebhwy3w/artifacts .
It concerns an alloca on the stack, and it has to do with na.strings,
which you are setting as it happens. Definitely in the right area (stack
imbalance) and worth a try.

OK, no problem. Thanks! I've now pushed the second idea too. I seem to remember \r being a problem on Windows in the past, but I don't recall a stack imbalance. Anyway, to rule that out, I've removed the \r from the progress meter. The stack imbalance message does tend to get printed where the ETA lines are. It's possible that the console catches \r and treats it differently so that the last line is replaced. You should now see a new line each time the ETA updates. This is just temporary, to rule that out. New Windows.zip built and passing here.
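
The diagnostic change described above can be sketched as follows. This is a hypothetical helper, not the progress-meter code in fread.c; the name `format_progress` and its parameters are assumptions made for illustration. With `overwrite` set, the line starts with `\r` so a terminal redraws it in place; without it, each update ends in `\n` and becomes its own line, which makes it easier to see exactly where other messages land between updates.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: build one progress line into buf.
 * overwrite != 0: start with '\r' so a terminal redraws the same line.
 * overwrite == 0: end with '\n' so every update is a separate line. */
int format_progress(char *buf, size_t n, int pct, int eta_secs, int overwrite)
{
    assert(buf != NULL && n > 0);
    int mm = eta_secs / 60;
    int ss = eta_secs % 60;
    return overwrite
        ? snprintf(buf, n, "\rRead %d%%. ETA %02d:%02d ", pct, mm, ss)
        : snprintf(buf, n, "Read %d%%. ETA %02d:%02d\n", pct, mm, ss);
}
```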

fread("SA2-by-DJZ-2011.csv", verbose = TRUE, na.strings = "", header = FALSE)
Log file:  C:\Users\hughp\AppData\Local\Temp\RtmpcVjZ1f/fread.out 
Input contains no \n. Taking this to be a filename to open
Read 5%. ETA 00:00
Read 8%. ETA 00:00
Read 11%. ETA 00:00
Read 15%. ETA 00:00
Read 18%. ETA 00:00
Read 21%. ETA 00:00
Read 25%. ETA 00:00
Read 28%. ETA 00:00
Read 31%. ETA 00:00
Read 35%. ETA 00:00
Read 38%. ETA 00:00
Read 41%. ETA 00:00
Read 45%. ETA 00:00
Read 48%. ETA 00:00
Read 51%. ETA 00:00
Read 55%. ETA 00:00
Warning: stack imbalance in '$', 30 then 31
Warning: stack imbalance in '$', 17 then 16
Read 58%. ETA 00:00
Read 61%. ETA 00:00
Read 65%. ETA 00:00
Read 68%. ETA 00:00
Read 71%. ETA 00:00
Read 75%. ETA 00:00
Read 78%. ETA 00:00
Read 81%. ETA 00:00
Read 85%. ETA 00:00
Read 88%. ETA 00:00
Read 91%. ETA 00:00
Read 95%. ETA 00:00
Read 98%. ETA 00:00
[01] Check arguments
  Using 12 threads (omp_get_max_threads()=12, nth=12)
  NAstrings = [<<>>]
  None of the NAstrings look like numbers.
  show progress = 1
  0/1 column will be read as boolean
[02] Opening the file
  Opening file SA2-by-DJZ-2011.csv
  File opened, size = 349.4MB (366418725 bytes).
  Memory mapped ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
  \n has been found in the data so any mixture of line endings is allowed other than \r-only line endings. This is common and ideal.
[05] Skipping initial rows if needed
  Positioned on line 1 starting: <<Australian Bureau of Statistic>>
[06] Detect separator, quoting rule, and ncolumns
  Detecting sep ...
  sep=','  with 89 lines of 4 fields using quote rule 0
  Detected 4 columns on line 12. This line is either column names or first data row. Line starts as: <<"Goulburn","110018063",3499,>>
  Quote rule picked = 0
  fill=false and the most number of columns found is 4
[07] Detect column types, good nrow estimate and whether first row is column names
  'header' changed by user from 'auto' to false
  Number of sampling jump points = 101 because (366418375 bytes from row 1 to eof) / (2 * 1457 jump0size) == 125744
  Type codes (jump 000)    : 1551  Quote rule 0
  Type codes (jump 100)    : 1A51  Quote rule 0
  =====
  Sampled 10027 rows (handled \n inside quoted fields) at 101 jump points
  Bytes from first data row on line 12 to the end of last row: 366418143
  Line length: mean=16.02 sd=0.21 min=16 max=29
  Estimated number of rows: 366418143 / 16.02 = 22877178
  Initial alloc = 25164895 rows (22877178 + 9%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
  =====
[08] Assign column names
[09] Apply user overrides on column types
  After 0 type and 0 drop user overrides : 1A51
[10] Allocate memory for the datatable
  Allocating 4 column slots (4 - 0 dropped) with 25164895 rows
[11] Read the data
  jumps=[0..360), chunk_size=1017828, total_size=366418143
[11] Read the data
  jumps=[0..360), chunk_size=1017828, total_size=366418143
Read 22885380 rows x 4 columns from 349.4MB (366418725 bytes) file in 00:02.894 wall clock time
[12] Finalizing the datatable
  Type counts:
         1 : bool8     '1'
         1 : int32     '5'
         2 : string    'A'
=============================
   0.001s (  0%) Memory map 0.341GB file
   0.003s (  0%) sep=',' ncol=4 and header detection
   0.000s (  0%) Column type detection using 10027 sample rows
   0.316s ( 11%) Allocation of 25164895 rows x 4 cols (0.469GB) of which 22885380 ( 91%) rows used
   2.574s ( 89%) Reading 360 chunks of 0.971MB (63547 rows) using 12 threads
   =    0.004s (  0%) Finding first non-embedded \n after each jump
   +    0.284s ( 10%) Parse to row-major thread buffers (grown 0 times)
   +    1.450s ( 50%) Transpose
   +    0.837s ( 29%) Waiting
   0.953s ( 33%) Rereading 1 columns due to out-of-sample type exceptions
   2.894s        Total
Column 1 ("") bumped from 'bool8' to 'string' due to <<"Goulburn">> on row 0
                V1        V2      V3 V4
       1: Goulburn 110018063    3499 NA
       2:       NA 110018064     812 NA
       3:       NA 110018065    2158 NA
       4:       NA 110019999     402 NA
       5:       NA 110028068      10 NA
      ---                              
22885376:       NA 997999799       0 NA
22885377:       NA 998999899      64 NA
22885378:       NA 994999499      34 NA
22885379:       NA 0&&&&&&&&  250796 NA
22885380:       NA 0@@@@@@@@ 7305367 NA
Warning messages:
1: In fread("SA2-by-DJZ-2011.csv", verbose = TRUE, na.strings = "",  :
  Starting data input on line 12 <<"Goulburn","110018063",3499,>> with 4 fields and discarding line 11 <<"Main Statistical Area Structu>> before it because it has a different number of fields (3).
2: In fread("SA2-by-DJZ-2011.csv", verbose = TRUE, na.strings = "",  :
  Found the last consistent line but text exists afterwards. Consider fill=TRUE and/or blank.lines.skip=TRUE. First 200 characters of discarded line: <<"Dataset: 2011 Census of Population and Housing">>

FYI рдХрд░реЗрдВ: рдореИрдВ RStudio рдХреЗ рдереЛрдбрд╝реЗ рдкреБрд░рд╛рдиреЗ рд╕рдВрд╕реНрдХрд░рдг рдХреЗ рд╕рд╛рде рдПрдХ рдЕрд▓рдЧ рд╡рд┐рдВрдбреЛрдЬ рдорд╢реАрди рдкрд░ рдЗрд╕ рд╕реНрдЯреИрдХ рдЕрд╕рдВрддреБрд▓рди рддреНрд░реБрдЯрд┐ рдХреЛ рдкреБрди: рдЙрддреНрдкрдиреНрди рдирд╣реАрдВ рдХрд░ рд╕рдХрд╛ред

рдЙрд╕ рд╕реНрдерд┐рддрд┐ рдореЗрдВ, рдРрд╕рд╛ рд▓рдЧрддрд╛ рд╣реИ рдХрд┐ рдпрд╣ RStudio рд╕рдорд░реНрдерди рд╕реЗ рдкреВрдЫрдиреЗ рдХрд╛ рд╕рдордп рд╣реИ рдЬреИрд╕рд╛ рдЖрдкрдиреЗ рд╕реБрдЭрд╛рд╡ рджрд┐рдпрд╛ рдерд╛ред рдореБрдЭреЗ рдлрд┐рд░ рд╕реЗ рдлрд╝реНрд░реЗрдб рдХреЛрдб рдХреЗ рдорд╛рдзреНрдпрдо рд╕реЗ рдПрдХ рдирдЬрд╝рд░ рдорд┐рд▓реА рд╣реИ рдФрд░ рдореИрдВ рдЕрдкрдиреА рддрд░рдл рд╕реЗ рд╡рд┐рдЪрд╛рд░реЛрдВ рд╕реЗ рдмрд╛рд╣рд░ рд╣реВрдВред рдХреГрдкрдпрд╛ рдЙрдиреНрд╣реЗрдВ RStudio рдХреЗ рджреЛ рд╕рдВрд╕реНрдХрд░рдг рдирдВрдмрд░ рдмрддрд╛рдПрдВред рдпрд╣ рдЬрд░реВрд░реА рдирд╣реАрдВ рд╣реИ рдХрд┐ рдпрд╣ RStudio рд╣реИ, рдпрд╣ data.table рдХреЗ рдкрдХреНрд╖ рдореЗрдВ рдПрдХ рджреЛрд╖ рд╣реЛ рд╕рдХрддрд╛ рд╣реИ рдЬреЛ рдХрд┐ RStudio рдХреЗ рдПрдХ рд╕рдВрд╕реНрдХрд░рдг рдкрд░ рджрд┐рдЦрд╛рдИ рджреЗрдиреЗ рдХреЗ рд▓рд┐рдП рд╣реЛрддрд╛ рд╣реИред рд▓реЗрдХрд┐рди рдпрд╣ рдЕрдЬреАрдм рд╣реИ рдХрд┐ рдпрд╣ рдХрдВрд╕реЛрд▓ рдЖрдЙрдЯрдкреБрдЯ рд╕реЗ рд╕рдВрдмрдВрдзрд┐рдд рд▓рдЧрддрд╛ рд╣реИ рдФрд░ рдпрд╣ рдХреБрдЫ рдРрд╕рд╛ рд╣реИ рдЬреЛ рдЕрд▓рдЧ рд╣реИ рдФрд░ RStudio- рд╡рд┐рд╢рд┐рд╖реНрдЯ рд╣реИред рдореИрдВрдиреЗ "RStudio рд╕реНрдЯреИрдХ рдЗрдореНрдмреИрд▓реЗрдВрд╕" рдХреЗ рд▓рд┐рдП рдЦреЛрдЬ рдХреА рд╣реИ, рд▓реЗрдХрд┐рди рдмрд╣реБрдд рд╕рд╛рд░реЗ рдореБрджреНрджреЗ рдкреИрдХреЗрдЬ рджреЛрд╖реЛрдВ рдХреЗ рдмрд╛рд░реЗ рдореЗрдВ рдЖрддреЗ рд╣реИрдВ, рдкреНрд░рддрд┐ RStudio рдирд╣реАрдВред рдЦреЛрдЬ рдХрд░рдиреЗ рдХреЗ рд▓рд┐рдП рдореБрд╢реНрдХрд┐рд▓ рд╕рдорд╕реНрдпрд╛ред рдЖрдЗрдП рдореБрджреНрджреЗ рдХреЛ рдпрд╣рд╛рдВ рдЦреБрд▓рд╛ рд░рдЦреЗрдВ рдФрд░ рджреЗрдЦреЗрдВ рдХрд┐ рд╡реЗ рдХреНрдпрд╛ рдХрд╣рддреЗ рд╣реИрдВред

I doubt the last attempt will help but, for completeness, please give it a try here. Perhaps the MinGW compiler used on Windows does something strange with those two ints. One of them is a constant 0, which is perhaps optimized away and then somehow causes the stack imbalance.

However, that particular stack imbalance message comes from eval.c:491 in R. Some thread must be running that line, but I don't think it is fread or data.table. check_stack_balance() is called from only 5 places in R's internals:
names.c, at the end of do_internal()
objects.c, twice in applyMethod()
eval.c, twice in eval()
I don't see how any of those could be reached while fread.c is running. The only R entry point being called is REprintf, and I don't see how that could reach check_stack_balance(). All I can think of at the moment is that RStudio has a thread doing something in the background that perhaps interacts with console output, perhaps differently on Windows.
Finally, for completeness, using REprintf does seem to be the right approach, because R itself uses it (rather than Rprintf) for its progress meters at libcurl.c:354 and internet.c:409. It's a shame that R's C-level progress bar isn't available in R's API (it also appears to be implemented twice in R's own C code).
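To make the warning text concrete, here is a toy model of a balance check. It is only a sketch: `toy_protect`, `toy_unprotect` and `toy_call` are hypothetical names invented for illustration, not R's API, and the reset after the warning is this sketch's own choice so that the demo can continue. The idea is that a caller records the protect-stack depth before a call and compares it afterwards; a callee that pops more than it pushed yields "N then N-1", the same shape as the "16 then 15" reported earlier in this thread.

```c
#include <stdio.h>

#define TOY_STACK_MAX 64

/* Toy stand-in for a pointer protection stack (hypothetical, not R's code). */
static const void *toy_stack[TOY_STACK_MAX];
static int toy_top = 0;

static void toy_protect(const void *p) {
    if (toy_top < TOY_STACK_MAX) toy_stack[toy_top++] = p;
}

static void toy_unprotect(int n) {
    toy_top = (n > toy_top) ? 0 : toy_top - n;
}

static int item_a, item_b;

/* Well-behaved callee: pops exactly what it pushed. */
static void balanced_callee(void) {
    toy_protect(&item_a);
    toy_protect(&item_b);
    toy_unprotect(2);
}

/* Buggy callee: pushes one item but pops two. */
static void overpop_callee(void) {
    toy_protect(&item_a);
    toy_unprotect(2);
}

/* Runs the callee and compares the stack depth before and after.
   Returns 1 if balanced, 0 otherwise (after printing a warning). */
static int toy_call(const char *name, void (*callee)(void)) {
    int before = toy_top;
    callee();
    if (toy_top == before) return 1;
    fprintf(stderr, "Warning: stack imbalance in '%s', %d then %d\n",
            name, before, toy_top);
    toy_top = before; /* sketch-only reset so later calls start clean */
    return 0;
}
```

Calling `toy_call("overpop", overpop_callee)` with one item already protected reports an imbalance of "1 then 0". In the real report the puzzle is that no code in fread.c appears to be on a path that performs such a check.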

@mattdowle, would this be helpful? https://github.com/r-lib/progress

@aadler Yes - thanks! Its source contains this comment:
// In R Studio we should print to stdout, because printing a \r
// to stderr is buggy (reported)
But I had already removed the \r and the stack imbalance still occurs. I wonder where it was reported.

The latest build didn't work either:

image

Reported at https://community.rstudio.com/t/stack-imbalance-possibly-in-stderb//9

Upshot: "Rprintf and REprintf are not thread-safe."

Yikes!

Thanks to everyone for the links, and to Hugh for raising the issue with RStudio.

data.table::fwrite() and data.table::fread() do call Rprintf / REprintf, but only from the master thread. Not only do two data.table threads never call that R entry point at the same time, but only the master thread ever calls it, and it is the only R entry point called at any time by any thread in the parallel region. However, Rprintf calls R_CheckUserInterrupt every 100 prints. I suspect that part may not be safe even from the master thread alone. That is the reason to use REprintf now, since it does not call R_CheckUserInterrupt. R's internals use REprintf for their progress meters, so the switch to REprintf is in the spirit of doing what core does; the choice has nothing to do with stderr vs stdout.
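The "only the master thread prints" pattern described above can be sketched roughly as follows. This is not data.table's actual code: `parse_chunk` and `read_all_chunks` are hypothetical names, and plain `fprintf` stands in for the R output call so that the example builds on its own. Worker threads only bump a shared counter; thread 0 is the only one that ever touches the output stream. The serial fallback for `omp_get_thread_num` lets the sketch compile without OpenMP, in which case the pragmas are ignored and the loop runs on a single thread.

```c
#include <stdio.h>

#ifdef _OPENMP
#include <omp.h>
#else
/* Serial fallback so the sketch also builds without OpenMP. */
static int omp_get_thread_num(void) { return 0; }
#endif

/* Hypothetical per-chunk work; stands in for parsing one chunk of a file. */
static long parse_chunk(int chunk) { return (long)chunk * 2; }

/* Processes nchunks (in parallel when OpenMP is enabled), stores the combined
   result in *total and returns the number of progress lines printed.
   Only thread 0 writes to `out`. */
static int read_all_chunks(int nchunks, FILE *out, long *total) {
    int done = 0;    /* chunks finished so far, updated atomically */
    int printed = 0; /* written by thread 0 only */
    long sum = 0;

    #pragma omp parallel for schedule(dynamic) reduction(+:sum)
    for (int i = 0; i < nchunks; i++) {
        sum += parse_chunk(i);

        int now;
        #pragma omp atomic capture
        now = ++done;

        if (omp_get_thread_num() == 0) {
            fprintf(out, "chunks done: %d/%d\n", now, nchunks);
            printed++;
        }
    }
    *total = sum;
    return printed;
}
```

With this layout the other threads never call into the output layer at all, which is why two simultaneous calls from within the parallel region can be ruled out; it says nothing about what a front-end might do on its own threads.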

@kevinushey Would you mind taking a look at this thread and letting me know if you can tell me anything more? It could be RStudio-related somehow, perhaps through a background thread? If RStudio has a background thread, then Rprintf / REprintf could be called from two threads at the same time. But if that were the case, we would have seen many more problems before now, so it seems very unlikely. Perhaps RStudio replaces the ptr_* callbacks described in R-exts, which relate to console output and interaction. However, that section starts with "For Unix-alikes", so I don't know how Windows comes into it. Section 8.1.5, Threading issues, may be relevant too. Both are sub-sections of Section 8: "Linking GUIs and other front-ends to R".

I'm going to be out until early December, so unfortunately I won't get a chance to look until then. However, RStudio runs almost everything on the main thread, using the R event loop; the only exceptions are things like project-level file indexing, and those background threads generally don't touch any R API.

RStudio does take over the various ptr_* callbacks to handle console input and output; I can't immediately think of how they could be at fault here, but I'll try to take a deeper look when I'm back.

OK, please try this one here. Previously, it updated the progress status every 2%. In your case the file takes just under 3 seconds, so there was a new progress update to the RStudio console every 0.06 seconds. Perhaps that was too much for RStudio. So this attempt prints each update only once and does not use \r at all. That should be better anyway for reports and log files, where \r output could fill the file.

Since your time of 3 seconds is quite fast, I have also changed the progress bar to start after 1 second, and only if the ETA from that point is at least 1 second. Otherwise it would not be displayed at all, and it would then work for your file only because nothing was being displayed. After you've tested, I'll do fwrite; that one starts after 2 seconds if the ETA from there is 2 seconds.
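The arithmetic and the start rule in the two comments above can be written down as a small sketch. The function names are hypothetical, the ETA is assumed to be a simple linear extrapolation from the fraction completed so far, and the use of "at least" (rather than "strictly more than") the threshold is also an assumption; the actual fread code may differ on all three points.

```c
#include <stdbool.h>

/* Seconds between updates when the bar advances every `step_pct` percent
   of a job that lasts `total_secs` seconds. */
static double update_interval_secs(double total_secs, double step_pct) {
    return total_secs * step_pct / 100.0;
}

/* Start showing the bar only once `threshold_secs` have elapsed AND the
   estimated remaining time is still at least `threshold_secs`.
   ETA is a linear extrapolation (an assumption of this sketch). */
static bool should_start_progress(double elapsed_secs, double fraction_done,
                                  double threshold_secs) {
    if (fraction_done <= 0.0 || elapsed_secs < threshold_secs) return false;
    double eta_secs = elapsed_secs * (1.0 - fraction_done) / fraction_done;
    return eta_secs >= threshold_secs;
}
```

For a 3 second read updated every 2%, `update_interval_secs(3.0, 2.0)` gives the 0.06 seconds quoted above. With a 1 second threshold, a read that is 40% done after 1 second would show the bar (about 1.5 seconds remaining), while one that is 90% done would not.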

Hi, @mattdowle. My last comment in #2503 may also be related to this issue.

Looks good! No warnings (after 5 runs). The first run is below (note that the leading spaces look different in the actual output):

stopifnot(!requireNamespace("data.table", quietly = TRUE))
install.packages("https://ci.appveyor.com/api/buildjobs/1o9s06o31v8i3ljr/artifacts/data.table_1.10.5.zip",
                 repos = NULL)
# Installing package into ‘C:/Users/hughp/Documents/R/win-library/3.4’
# (as ‘lib’ is unspecified)
# trying URL 'https://ci.appveyor.com/api/buildjobs/1o9s06o31v8i3ljr/artifacts/data.table_1.10.5.zip'
# Content type 'application/octet-stream' length 1557423 bytes (1.5 MB)
# downloaded 1.5 MB
# 
# package ‘data.table’ successfully unpacked and MD5 sums checked

library(data.table)
# data.table 1.10.5 IN DEVELOPMENT built 2017-11-30 00:21:00 UTC; appveyor
# The fastest way to learn (by data.table authors): https://www.datacamp.com/courses/data-analysis-the-data-table-way
# Documentation: ?data.table, example(data.table) and browseVignettes("data.table")
# Release notes, videos and slides: http://r-datatable.com
setwd("~/ABS-data/inbox/SA2-by-DJZ-2011/")
fread("SA2-by-DJZ-2011.csv", verbose = TRUE, na.strings = "", header = FALSE)
# Input contains no \n. Taking this to be a filename to open
# [01] Check arguments
# Using 12 threads (omp_get_max_threads()=12, nth=12)
# NAstrings = [<<>>]
# None of the NAstrings look like numbers.
# show progress = 1
# 0/1 column will be read as boolean
# [02] Opening the file
# Opening file SA2-by-DJZ-2011.csv
# File opened, size = 349.4MB (366418725 bytes).
# Memory mapped ok
# [03] Detect and skip BOM
# [04] Arrange mmap to be \0 terminated
# \n has been found in the data so any mixture of line endings is allowed other than \r-only line endings. This is common and ideal.
# [05] Skipping initial rows if needed
# Positioned on line 1 starting: <<Australian Bureau of Statistic>>
#   [06] Detect separator, quoting rule, and ncolumns
# Detecting sep ...
# sep=','  with 89 lines of 4 fields using quote rule 0
# Detected 4 columns on line 12. This line is either column names or first data row. Line starts as: <<"Goulburn","110018063",3499,>>
#   Quote rule picked = 0
# fill=false and the most number of columns found is 4
# [07] Detect column types, good nrow estimate and whether first row is column names
# 'header' changed by user from 'auto' to false
# Number of sampling jump points = 101 because (366418375 bytes from row 1 to eof) / (2 * 1457 jump0size) == 125744
# Type codes (jump 000)    : 1551  Quote rule 0
# Type codes (jump 100)    : 1A51  Quote rule 0
# =====
#   Sampled 10027 rows (handled \n inside quoted fields) at 101 jump points
# Bytes from first data row on line 12 to the end of last row: 366418143
# Line length: mean=16.02 sd=0.21 min=16 max=29
# Estimated number of rows: 366418143 / 16.02 = 22877178
# Initial alloc = 25164895 rows (22877178 + 9%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
# =====
#   [08] Assign column names
# [09] Apply user overrides on column types
# After 0 type and 0 drop user overrides : 1A51
# [10] Allocate memory for the datatable
# Allocating 4 column slots (4 - 0 dropped) with 25164895 rows
# [11] Read the data
# jumps=[0..360), chunk_size=1017828, total_size=366418143
# jumps=[0..360), chunk_size=1017828, total_size=366418143
# |--------------------------------------------------|
#   |==================================================|
#   Read 22885380 rows x 4 columns from 349.4MB (366418725 bytes) file in 00:02.280 wall clock time
# [12] Finalizing the datatable
# Type counts:
#   1 : bool8     '1'
# 1 : int32     '5'
# 2 : string    'A'
# =============================
#   0.005s (  0%) Memory map 0.341GB file
# 0.037s (  2%) sep=',' ncol=4 and header detection
# 0.000s (  0%) Column type detection using 10027 sample rows
# 0.321s ( 14%) Allocation of 25164895 rows x 4 cols (0.469GB) of which 22885380 ( 91%) rows used
# 1.917s ( 84%) Reading 360 chunks of 0.971MB (63547 rows) using 12 threads
# =    0.011s (  0%) Finding first non-embedded \n after each jump
# +    0.560s ( 25%) Parse to row-major thread buffers (grown 0 times)
# +    0.488s ( 21%) Transpose
# +    0.858s ( 38%) Waiting
# 0.999s ( 44%) Rereading 1 columns due to out-of-sample type exceptions
# 2.280s        Total
# Column 1 ("") bumped from 'bool8' to 'string' due to <<"Goulburn">> on row 0
# V1        V2      V3 V4
# 1: Goulburn 110018063    3499 NA
# 2:       NA 110018064     812 NA
# 3:       NA 110018065    2158 NA
# 4:       NA 110019999     402 NA
# 5:       NA 110028068      10 NA
# ---                              
#   22885376:       NA 997999799       0 NA
# 22885377:       NA 998999899      64 NA
# 22885378:       NA 994999499      34 NA
# 22885379:       NA 0&&&&&&&&  250796 NA
# 22885380:       NA 0@@@@@@@@ 7305367 NA
# Warning messages:
#   1: In fread("SA2-by-DJZ-2011.csv", verbose = TRUE, na.strings = "",  :
#                 Starting data input on line 12 <<"Goulburn","110018063",3499,>> with 4 fields and discarding line 11 <<"Main Statistical Area Structu>> before it because it has a different number of fields (3).
#               2: In fread("SA2-by-DJZ-2011.csv", verbose = TRUE, na.strings = "",  :
#               Found the last consistent line but text exists afterwards. Consider fill=TRUE and/or blank.lines.skip=TRUE. First 200 characters of discarded line: <<"Dataset: 2011 Census of Population and Housing">>

@HughParsonage What a relief! I think that's a win. I'll tidy up, merge and move on. Many thanks for testing.

@aadler Yes, your comment in issue #2503 does look like the same thing. Could you test the latest from dev and confirm that it is now fixed, please? Hopefully the as.IDate problem you hit there was actually caused by an earlier stack imbalance.

Not good :(

> library(data.table)
data.table 1.10.5 IN DEVELOPMENT built 2017-11-30 00:21:00 UTC; appveyor
  The fastest way to learn (by data.table authors): https://www.datacamp.com/courses/data-analysis-the-data-table-way
  Documentation: ?data.table, example(data.table) and browseVignettes("data.table")
  Release notes, videos and slides: http://r-datatable.com
> sessionInfo()
R version 3.4.2 beta (2017-09-17 r73296)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows Server >= 2012 x64 (build 9200)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252    LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                           LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] data.table_1.10.5

loaded via a namespace (and not attached):
[1] compiler_3.4.2 tools_3.4.2   
> DT <- fread('2017-11-22_1999_Performance.csv', header = TRUE, colClasses = CLS, select = SEL, key = c('LoanID', 'Month'), verbose = TRUE)
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
  Using 40 threads (omp_get_max_threads()=40, nth=40)
  NAstrings = [<<NA>>]
  None of the NAstrings look like numbers.
  show progress = 1
  0/1 column will be read as boolean
[02] Opening the file
  Opening file 2017-11-22_1999_Performance.csv
  File opened, size = 6.355GB (6823372783 bytes).
  Memory mapped ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
  \n has been found in the data so any mixture of line endings is allowed other than \r-only line endings. This is common and ideal.
[05] Skipping initial rows if needed
  Positioned on line 1 starting: <<Orig_Year,Orig_Qtr,LoanID,Mont>>
[06] Detect separator, quoting rule, and ncolumns
  Detecting sep ...
  sep=','  with 100 lines of 37 fields using quote rule 0
  Detected 37 columns on line 1. This line is either column names or first data row. Line starts as: <<Orig_Year,Orig_Qtr,LoanID,Mont>>
  Quote rule picked = 0
  fill=false and the most number of columns found is 37
[07] Detect column types, good nrow estimate and whether first row is column names
  'header' changed by user from 'auto' to true
  Number of sampling jump points = 101 because (6823372781 bytes from row 1 to eof) / (2 * 12905 jump0size) == 264369
  Type codes (jump 000)    : 51AA7155A15A7111111111111771111177715  Quote rule 0
  Type codes (jump 001)    : 51AA7155A15A711111111111177111117771A  Quote rule 0
  Type codes (jump 008)    : 51AA7555A15A711111111111177111117771A  Quote rule 0
  Type codes (jump 009)    : 51AA7555AA5A71A155555557177111117775A  Quote rule 0
  Type codes (jump 042)    : 55AA7555AA5A71A155555557177111117775A  Quote rule 0
  Type codes (jump 064)    : 55AA7555AA5A71A1A5555557177111117775A  Quote rule 0
  Type codes (jump 100)    : 55AA7555AA5A71A1A5555557177111117775A  Quote rule 0
  =====
  Sampled 10049 rows (handled \n inside quoted fields) at 101 jump points
  Bytes from first data row on line 1 to the end of last row: 6823372781
  Line length: mean=126.15 sd=8.30 min=100 max=359
  Estimated number of rows: 6823372781 / 126.15 = 54088821
  Initial alloc = 62279495 rows (54088821 + 15%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
  =====
[08] Assign column names
[09] Apply user overrides on column types
  After 4 type and 23 drop user overrides : 00AA700000000000000000070775555077750
[10] Allocate memory for the datatable
  Allocating 14 column slots (37 - 23 dropped) with 62279495 rows
[11] Read the data
  jumps=[0..6520), chunk_size=1046529, total_size=6823372422
|--------------------------------------------------|
|=======Warning: stack imbalance in '$', 27 then 26
===Warning: stack imbalance in '$', 26 then 27
================Error in fread("2017-11-22_1999_Performance.csv", header = TRUE, colClasses = CLS,  : 
  unprotect_ptr: pointer not found

@aadler Thanks for that report. I went through freadR and made the protection local. There is perhaps a 30% chance that this works in your case, since you are overriding column types and there was quite a lot of protection in that part of the code. Please try again using this build.

@aadler If you haven't tried the last build yet, please try this one directly instead. Also, if it is possible to get a copy of your file to me, I can try it myself on Windows RStudio.

:(

> library(data.table)
data.table 1.10.5 IN DEVELOPMENT built 2017-12-01 01:54:04 UTC; appveyor
  The fastest way to learn (by data.table authors): https://www.datacamp.com/courses/data-analysis-the-data-table-way
  Documentation: ?data.table, example(data.table) and browseVignettes("data.table")
  Release notes, videos and slides: http://r-datatable.com
> sessionInfo()
R version 3.4.2 beta (2017-09-17 r73296)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows Server >= 2012 x64 (build 9200)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252    LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                           LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] data.table_1.10.5

loaded via a namespace (and not attached):
[1] compiler_3.4.2 tools_3.4.2   
> ColCLASS <- c(rep('integer', 2L), 'character', 'Date', 'numeric',
+               rep('integer', 3L), rep('character', 2L),
+               'integer', 'Date', rep('numeric', 2L), 'Date',
+               rep('numeric', 12L), rep('integer', 5),
+               rep('numeric', 3L), 'integer', 'character')
> SELCOL <- c(WHATEVER)
> DT <- fread('LargeFile.csv', header = TRUE, colClasses = ColCLASS, select = SELCOL, key = c('LoanID', 'Month'), verbose = TRUE)
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
  Using 40 threads (omp_get_max_threads()=40, nth=40)
  NAstrings = [<<NA>>]
  None of the NAstrings look like numbers.
  show progress = 1
  0/1 column will be read as boolean
[02] Opening the file
  Opening file LargeFile.csv
  File opened, size = 6.355GB (6823372783 bytes).
  Memory mapped ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
  \n has been found in the data so any mixture of line endings is allowed other than \r-only line endings. This is common and ideal.
[05] Skipping initial rows if needed
  Positioned on line 1 starting: <<Orig_Year,Orig_Qtr,LoanID,Mont>>
[06] Detect separator, quoting rule, and ncolumns
  Detecting sep ...
  sep=','  with 100 lines of 37 fields using quote rule 0
  Detected 37 columns on line 1. This line is either column names or first data row. Line starts as: <<Orig_Year,Orig_Qtr,LoanID,Mont>>
  Quote rule picked = 0
  fill=false and the most number of columns found is 37
[07] Detect column types, good nrow estimate and whether first row is column names
  'header' changed by user from 'auto' to true
  Number of sampling jump points = 101 because (6823372781 bytes from row 1 to eof) / (2 * 12905 jump0size) == 264369
  Type codes (jump 000)    : 51AA7155A15A7111111111111771111177715  Quote rule 0
  Type codes (jump 001)    : 51AA7155A15A711111111111177111117771A  Quote rule 0
  Type codes (jump 008)    : 51AA7555A15A711111111111177111117771A  Quote rule 0
  Type codes (jump 009)    : 51AA7555AA5A71A155555557177111117775A  Quote rule 0
  Type codes (jump 042)    : 55AA7555AA5A71A155555557177111117775A  Quote rule 0
  Type codes (jump 064)    : 55AA7555AA5A71A1A5555557177111117775A  Quote rule 0
  Type codes (jump 100)    : 55AA7555AA5A71A1A5555557177111117775A  Quote rule 0
  =====
  Sampled 10049 rows (handled \n inside quoted fields) at 101 jump points
  Bytes from first data row on line 1 to the end of last row: 6823372781
  Line length: mean=126.15 sd=8.30 min=100 max=359
  Estimated number of rows: 6823372781 / 126.15 = 54088821
  Initial alloc = 62279495 rows (54088821 + 15%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
  =====
[08] Assign column names
[09] Apply user overrides on column types
  After 4 type and 23 drop user overrides : 00AA700000000000000000070775555077750
[10] Allocate memory for the datatable
  Allocating 14 column slots (37 - 23 dropped) with 62279495 rows
[11] Read the data
  jumps=[0..6520), chunk_size=1046529, total_size=6823372422
|--------------------------------------------------|
|Error in fread("LargeFile.csv", header = TRUE, colClasses = ColCLASS,  : 
  unprotect_ptr: pointer not found

Thanks to @aadler over email, I can now reproduce it. R 3.4.2, latest RStudio 1.1.383 and Windows 10 Pro 10.0.16299 build 16299.

I'm seeing strange behaviour in RStudio, recorded here:
https://www.youtube.com/watch?v=tl2x2vmZxMU
It looks as though RStudio generates GCs just from typing. Why is that, and is there any way to turn it off? Could it be that, while fread() is printing its progress bar, RStudio's separate event loop thinks the output to the console is the user typing and calls into R, which triggers a GC and trips everything up? Perhaps RStudio users here know and can point me in the right direction, or perhaps @kevinushey is back (you did say early December, Kevin, and today is the 1st :-))

I can reliably reproduce the stack imbalance in the RStudio Console. Using the RStudio Terminal tab, I cannot reproduce it, even with gcinfo(TRUE). Interestingly, GCs do happen while the progress bar is printing and that is fine, just as it is fine on Linux. Given the behaviour of the RStudio Console in that video, I am coming to the conclusion that this is an RStudio Console bug. I was unable to copy text from the RStudio Terminal window (Edit->Copy doesn't work and neither does Ctrl-C), so I took a screenshot of the Terminal tab to show that GC during the progress bar is fine. I would expect it to be fine, because only the master thread calls REprintf and the other threads do not call any R API.

Works fine in the RStudio Terminal:
selection_014
Note that there are GCs while the progress bar is printing the first time, and it works fine in the RStudio Terminal. The progress bar prints a second time because this test file contains an out-of-sample type exception, which triggers an automatic reread of those columns.

But in the RStudio Console it is either stack imbalance or unprotect_ptr: pointer not found:

R version 3.4.2 (2017-09-28) -- "Short Summer"
> gcinfo(TRUE)
[1] FALSE
Garbage collection 22 = 16+3+3 (level 0) ... 
25.5 Mbytes of cons cells used (80%)
6.6 Mbytes of vectors used (66%)
Garbage collection 23 = 16+4+3 (level 1) ... 
24.9 Mbytes of cons cells used (79%)
6.5 Mbytes of vectors used (64%)
Garbage collection 24 = 17+4+3 (level 0) ... 
25.1 Mbytes of cons cells used (79%)
6.5 Mbytes of vectors used (64%)
Garbage collection 25 = 18+4+3 (level 0) ... 
25.1 Mbytes of cons cells used (79%)
6.5 Mbytes of vectors used (64%)
Garbage collection 26 = 19+4+3 (level 0) ... 
25.1 Mbytes of cons cells used (79%)
6.5 Mbytes of vectors used (65%)
Garbage collection 27 = 20+4+3 (level 0) ... 
25.2 Mbytes of cons cells used (80%)
6.5 Mbytes of vectors used (65%)
Garbage collection 28 = 20+5+3 (level 1) ... 
25.1 Mbytes of cons cells used (79%)
6.5 Mbytes of vectors used (65%)
Garbage collection 29 = 21+5+3 (level 0) ... 
25.1 Mbytes of cons cells used (79%)
6.5 Mbytes of vectors used (65%)
Garbage collection 30 = 22+5+3 (level 0) ... 
25.2 Mbytes of cons cells used (80%)
6.5 Mbytes of vectors used (65%)
Garbage collection 31 = 23+5+3 (level 0) ... 
25.2 Mbytes of cons cells used (80%)
6.6 Mbytes of vectors used (66%)
Garbage collection 32 = 24+5+3 (level 0) ... 
25.3 Mbytes of cons cells used (80%)
6.6 Mbytes of vectors used (66%)
Garbage collection 33 = 25+5+3 (level 0) ... 
25.4 Mbytes of cons cells used (80%)
6.7 Mbytes of vectors used (66%)
Garbage collection 34 = 25+5+4 (level 2) ... 
24.6 Mbytes of cons cells used (61%)
6.4 Mbytes of vectors used (50%)
Garbage collection 35 = 26+5+4 (level 0) ... 
25.0 Mbytes of cons cells used (62%)
6.5 Mbytes of vectors used (52%)
> require(data.table)
Loading required package: data.table
Garbage collection 36 = 27+5+4 (level 0) ... 
27.2 Mbytes of cons cells used (68%)
7.1 Mbytes of vectors used (56%)
data.table 1.10.5 IN DEVELOPMENT built 2017-12-02 01:04:34 UTC; appveyor
  The fastest way to learn (by data.table authors): https://www.datacamp.com/courses/data-analysis-the-data-table-way
  Documentation: ?data.table, example(data.table) and browseVignettes("data.table")
  Release notes, videos and slides: http://r-datatable.com
Garbage collection 37 = 28+5+4 (level 0) ... 
27.7 Mbytes of cons cells used (69%)
7.3 Mbytes of vectors used (58%)
Garbage collection 38 = 29+5+4 (level 0) ... 
28.0 Mbytes of cons cells used (70%)
7.4 Mbytes of vectors used (58%)
Garbage collection 39 = 30+5+4 (level 0) ... 
28.1 Mbytes of cons cells used (70%)
7.4 Mbytes of vectors used (59%)
Garbage collection 40 = 31+5+4 (level 0) ... 
28.2 Mbytes of cons cells used (70%)
7.5 Mbytes of vectors used (59%)
Garbage collection 41 = 32+5+4 (level 0) ... 
28.4 Mbytes of cons cells used (71%)
7.5 Mbytes of vectors used (59%)
> DT = fread("/Users/pasha/Downloads/LargeFile.csv")
Garbage collection 42 = 32+5+5 (level 2) ... 
27.4 Mbytes of cons cells used (54%)
7.1 Mbytes of vectors used (2%)
Garbage collection 43 = 32+5+6 (level 2) ... 
27.4 Mbytes of cons cells used (54%)
244.7 Mbytes of vectors used (42%)
Garbage collection 44 = 32+5+7 (level 2) ... 
27.4 Mbytes of cons cells used (54%)
482.3 Mbytes of vectors used (42%)
Garbage collection 45 = 32+5+8 (level 2) ... 
27.4 Mbytes of cons cells used (54%)
957.4 Mbytes of vectors used (56%)
Garbage collection 46 = 32+5+9 (level 2) ... 
27.4 Mbytes of cons cells used (54%)
1432.6 Mbytes of vectors used (63%)
Garbage collection 47 = 32+5+10 (level 2) ... 
27.4 Mbytes of cons cells used (54%)
2145.3 Mbytes of vectors used (75%)
Garbage collection 48 = 32+5+11 (level 2) ... 
27.4 Mbytes of cons cells used (54%)
2620.4 Mbytes of vectors used (71%)
Garbage collection 49 = 32+5+12 (level 2) ... 
27.4 Mbytes of cons cells used (54%)
3570.8 Mbytes of vectors used (78%)
Garbage collection 50 = 32+5+13 (level 2) ... 
27.4 Mbytes of cons cells used (54%)
4283.5 Mbytes of vectors used (75%)
Garbage collection 51 = 32+5+14 (level 2) ... 
27.4 Mbytes of cons cells used (54%)
5709.0 Mbytes of vectors used (77%)
Garbage collection 52 = 32+5+15 (level 2) ... 
27.4 Mbytes of cons cells used (54%)
7372.0 Mbytes of vectors used (81%)
Garbage collection 53 = 32+5+16 (level 2) ... 
27.4 Mbytes of cons cells used (54%)
8797.5 Mbytes of vectors used (79%)
Garbage collection 54 = 32+5+17 (level 2) ... 
27.4 Mbytes of cons cells used (54%)
10935.7 Mbytes of vectors used (80%)
|--------------------------------------------------|
|=====Error in fread("LargeFile.csv") : 
  unprotect_ptr: pointer not found
> 

showProgress=FALSE solves it robustly in the RStudio console. To reproduce, it has to be the very first run in a fresh RStudio console with showProgress=TRUE (i.e. the default). It seems related to whether or not there is a GC during the progress meter; in a fresh session there is one on the first run. The file just needs to be large enough for the progress meter to be displayed. Nothing else needs to be passed to fread. If the first run in a fresh RStudio console is with showProgress=FALSE, that run grows R's heap, and showProgress=TRUE then also works later in the same session. But that is only because, with the heap already grown, no GC happens during the progress meter.
Why a GC on the master thread during the progress meter is fine on Linux and in the Windows RStudio terminal, but not in the RStudio console, is still outstanding.
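
The workaround described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the temporary file `tmp` and its contents are invented stand-ins, and a file this small would never display the progress meter, so it shows the shape of the call rather than reproducing the crash.

```r
library(data.table)

# Hypothetical stand-in for a large file; a real reproduction needs a file
# big enough for fread to display its progress meter.
tmp <- tempfile(fileext = ".csv")
fwrite(data.table(id = 1:1000, label = sprintf("row%04d", 1:1000)), tmp)

old <- gcinfo(TRUE)                      # report each garbage collection; keep the previous setting
DT <- fread(tmp, showProgress = FALSE)   # workaround: never print the progress meter
invisible(gcinfo(old))                   # restore the previous gcinfo setting

print(dim(DT))
unlink(tmp)
```

With `gcinfo(TRUE)` active, any "Garbage collection ..." lines printed during the `fread` call show whether a collection happened while the file was being read, which is the condition the comment above links to the failure.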

Ok, this fixes it. The problem was on the data.table side, not RStudio. It now works robustly for me in the RStudio console on Windows. It was a problem that could have happened on Linux and Mac too; it is just that the memory patterns there were not triggering it. The other threads used to have an entry point into R (when pushing their buffers to the columns), and that could happen at the same time as the master thread was using REprintf to print progress. That is why it only happened on the first run in a fresh session. From the 2nd run on, all the strings in the file had been seen before, so the cache lookups were hitting (thread-safe) and not allocating (not thread-safe).

So, @aadler and @HughParsonage, please try this one. 95% chance it works now!

No warnings. Not sure whether you're looking for anything else:

> gcinfo(TRUE)
[1] FALSE
> fread("SA2-by-DJZ-2011.csv", verbose = TRUE, na.strings = "", header = FALSE)
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
  Using 12 threads (omp_get_max_threads()=12, nth=12)
  NAstrings = [<<>>]
  None of the NAstrings look like numbers.
  show progress = 1
  0/1 column will be read as boolean
[02] Opening the file
  Opening file SA2-by-DJZ-2011.csv
  File opened, size = 349.4MB (366418725 bytes).
  Memory mapped ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
  \n has been found in the data so any mixture of line endings is allowed other than \r-only line endings. This is common and ideal.
[05] Skipping initial rows if needed
  Positioned on line 1 starting: <<Australian Bureau of Statistic>>
[06] Detect separator, quoting rule, and ncolumns
  Detecting sep ...
  sep=','  with 89 lines of 4 fields using quote rule 0
  Detected 4 columns on line 12. This line is either column names or first data row. Line starts as: <<"Goulburn","110018063",3499,>>
  Quote rule picked = 0
  fill=false and the most number of columns found is 4
[07] Detect column types, good nrow estimate and whether first row is column names
  'header' changed by user from 'auto' to false
  Number of sampling jump points = 101 because (366418375 bytes from row 1 to eof) / (2 * 1457 jump0size) == 125744
  Type codes (jump 000)    : 1551  Quote rule 0
  Type codes (jump 100)    : 1A51  Quote rule 0
  =====
  Sampled 10027 rows (handled \n inside quoted fields) at 101 jump points
  Bytes from first data row on line 12 to the end of last row: 366418143
  Line length: mean=16.02 sd=0.21 min=16 max=29
  Estimated number of rows: 366418143 / 16.02 = 22877178
  Initial alloc = 25164895 rows (22877178 + 9%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
  =====
[08] Assign column names
[09] Apply user overrides on column types
  After 0 type and 0 drop user overrides : 1A51
[10] Allocate memory for the datatable
  Allocating 4 column slots (4 - 0 dropped) with 25164895 rows
Garbage collection 53 = 36+5+12 (level 2) ... 
30.3 Mbytes of cons cells used (60%)
7.9 Mbytes of vectors used (1%)
[11] Read the data
  jumps=[0..360), chunk_size=1017828, total_size=366418143
|--------------------------------------------------|
|==================================================|
Garbage collection 54 = 37+5+12 (level 0) ... 
30.8 Mbytes of cons cells used (61%)
566.6 Mbytes of vectors used (74%)
Garbage collection 55 = 37+6+12 (level 1) ... 
30.8 Mbytes of cons cells used (61%)
549.2 Mbytes of vectors used (72%)
  jumps=[0..360), chunk_size=1017828, total_size=366418143
|--------------------------------------------------|
|==================================================|
Read 22885380 rows x 4 columns from 349.4MB (366418725 bytes) file in 00:02.626 wall clock time
[12] Finalizing the datatable
  Type counts:
         1 : bool8     '1'
         1 : int32     '5'
         2 : string    'A'
=============================
   0.002s (  0%) Memory map 0.341GB file
   0.005s (  0%) sep=',' ncol=4 and header detection
   0.000s (  0%) Column type detection using 10027 sample rows
   0.469s ( 18%) Allocation of 25164895 rows x 4 cols (0.469GB) of which 22885380 ( 91%) rows used
   2.150s ( 82%) Reading 360 chunks of 0.971MB (63547 rows) using 12 threads
   =    0.103s (  4%) Finding first non-embedded \n after each jump
   +    0.230s (  9%) Parse to row-major thread buffers (grown 0 times)
   +    0.718s ( 27%) Transpose
   +    1.099s ( 42%) Waiting
   0.745s ( 28%) Rereading 1 columns due to out-of-sample type exceptions
   2.626s        Total
Column 1 ("") bumped from 'bool8' to 'string' due to <<"Goulburn">> on row 0
Garbage collection 56 = 37+6+13 (level 2) ... 
31.1 Mbytes of cons cells used (62%)
531.9 Mbytes of vectors used (70%)
Garbage collection 57 = 38+6+13 (level 0) ... 
31.1 Mbytes of cons cells used (62%)
532.0 Mbytes of vectors used (70%)
                V1        V2      V3 V4
       1: Goulburn 110018063    3499 NA
       2:       NA 110018064     812 NA
       3:       NA 110018065    2158 NA
       4:       NA 110019999     402 NA
       5:       NA 110028068      10 NA
      ---                              
22885376:       NA 997999799       0 NA
22885377:       NA 998999899      64 NA
22885378:       NA 994999499      34 NA
22885379:       NA 0&&&&&&&&  250796 NA
22885380:       NA 0@@@@@@@@ 7305367 NA
Warning messages:
1: In fread("SA2-by-DJZ-2011.csv", verbose = TRUE, na.strings = "",  :
  Starting data input on line 12 <<"Goulburn","110018063",3499,>> with 4 fields and discarding line 11 <<"Main Statistical Area Structu>> before it because it has a different number of fields (3).
2: In fread("SA2-by-DJZ-2011.csv", verbose = TRUE, na.strings = "",  :
  Found the last consistent line but text exists afterwards. Consider fill=TRUE and/or blank.lines.skip=TRUE. First 200 characters of discarded line: <<"Dataset: 2011 Census of Population and Housing">>

Thanks Hugh. Yes, that is a clean run, assuming it was in a fresh RStudio console session. There is no sign of the stack imbalance or "unprotect_ptr: pointer not found" messages, and the progress meter runs correctly (twice in this case, since there is a reread). Now just @aadler to confirm.

Success.

First run, fresh instance of RStudio.

> library(data.table)
data.table 1.10.5 IN DEVELOPMENT built 2017-12-02 12:05:42 UTC; appveyor
  The fastest way to learn (by data.table authors): https://www.datacamp.com/courses/data-analysis-the-data-table-way
  Documentation: ?data.table, example(data.table) and browseVignettes("data.table")
  Release notes, videos and slides: http://r-datatable.com
> sessionInfo()
R version 3.4.2 beta (2017-09-17 r73296)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows Server >= 2012 x64 (build 9200)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252    LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                           LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] data.table_1.10.5

loaded via a namespace (and not attached):
[1] compiler_3.4.2 tools_3.4.2   
> DT <- fread('LargeFile.csv', colClasses = colCLASS, select = colSEL, header = TRUE, key = c('LoanID', 'Month'), verbose = TRUE)
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
  Using 40 threads (omp_get_max_threads()=40, nth=40)
  NAstrings = [<<NA>>]
  None of the NAstrings look like numbers.
  show progress = 1
  0/1 column will be read as boolean
[02] Opening the file
  Opening file LargeFile.csv
  File opened, size = 6.355GB (6823372783 bytes).
  Memory mapped ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
  \n has been found in the data so any mixture of line endings is allowed other than \r-only line endings. This is common and ideal.
[05] Skipping initial rows if needed
  Positioned on line 1 starting: <<Orig_Year,Orig_Qtr,LoanID,Mont>>
[06] Detect separator, quoting rule, and ncolumns
  Detecting sep ...
  sep=','  with 100 lines of 37 fields using quote rule 0
  Detected 37 columns on line 1. This line is either column names or first data row. Line starts as: <<Orig_Year,Orig_Qtr,LoanID,Mont>>
  Quote rule picked = 0
  fill=false and the most number of columns found is 37
[07] Detect column types, good nrow estimate and whether first row is column names
  'header' changed by user from 'auto' to true
  Number of sampling jump points = 101 because (6823372781 bytes from row 1 to eof) / (2 * 12905 jump0size) == 264369
  Type codes (jump 000)    : 51AA7155A15A7111111111111771111177715  Quote rule 0
  Type codes (jump 001)    : 51AA7155A15A711111111111177111117771A  Quote rule 0
  Type codes (jump 008)    : 51AA7555A15A711111111111177111117771A  Quote rule 0
  Type codes (jump 009)    : 51AA7555AA5A71A155555557177111117775A  Quote rule 0
  Type codes (jump 042)    : 55AA7555AA5A71A155555557177111117775A  Quote rule 0
  Type codes (jump 064)    : 55AA7555AA5A71A1A5555557177111117775A  Quote rule 0
  Type codes (jump 100)    : 55AA7555AA5A71A1A5555557177111117775A  Quote rule 0
  =====
  Sampled 10049 rows (handled \n inside quoted fields) at 101 jump points
  Bytes from first data row on line 1 to the end of last row: 6823372781
  Line length: mean=126.15 sd=8.30 min=100 max=359
  Estimated number of rows: 6823372781 / 126.15 = 54088821
  Initial alloc = 62279495 rows (54088821 + 15%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
  =====
[08] Assign column names
[09] Apply user overrides on column types
  After 4 type and 23 drop user overrides : 00AA700000000000000000070775555077750
[10] Allocate memory for the datatable
  Allocating 14 column slots (37 - 23 dropped) with 62279495 rows
[11] Read the data
  jumps=[0..6520), chunk_size=1046529, total_size=6823372422
|--------------------------------------------------|
|==================================================|
Read 53945186 rows x 14 columns from 6.355GB (6823372783 bytes) file in 00:25.938 wall clock time
[12] Finalizing the datatable
  Type counts:
        23 : drop      '0'
         5 : int32     '5'
         7 : float64   '7'
         2 : string    'A'
=============================
   0.005s (  0%) Memory map 6.355GB file
   0.025s (  0%) sep=',' ncol=37 and header detection
   0.001s (  0%) Column type detection using 10049 sample rows
   4.681s ( 18%) Allocation of 62279495 rows x 37 cols (5.336GB) of which 53945186 ( 87%) rows used
  21.226s ( 82%) Reading 6520 chunks of 0.998MB (8295 rows) using 40 threads
   =    0.485s (  2%) Finding first non-embedded \n after each jump
   +    1.465s (  6%) Parse to row-major thread buffers (grown 0 times)
   +    9.095s ( 35%) Transpose
   +   10.181s ( 39%) Waiting
   0.000s (  0%) Rereading 0 columns due to out-of-sample type exceptions
  25.938s        Total

Closed RStudio and reopened it to keep string caching from kicking in, then reran with gcinfo(TRUE). As an added bonus, the conversion to IDate completed (it took over 40 seconds, though :)).

> colCLASS <- c(rep('integer', 2L), 'character', 'Date', 'numeric',
+               rep('integer', 3L), rep('character', 2L),
+               'integer', 'Date', rep('numeric', 2L), 'Date',
+               rep('numeric', 12L), rep('integer', 5),
+               rep('numeric', 3L), 'integer', 'character')
> library(data.table)
data.table 1.10.5 IN DEVELOPMENT built 2017-12-02 12:05:42 UTC; appveyor
  The fastest way to learn (by data.table authors): https://www.datacamp.com/courses/data-analysis-the-data-table-way
  Documentation: ?data.table, example(data.table) and browseVignettes("data.table")
  Release notes, videos and slides: http://r-datatable.com
> gcinfo(TRUE)
[1] FALSE
> sessionInfo()
R version 3.4.2 beta (2017-09-17 r73296)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows Server >= 2012 x64 (build 9200)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252    LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                           LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] data.table_1.10.5

loaded via a namespace (and not attached):
[1] compiler_3.4.2 tools_3.4.2   
Garbage collection 46 = 36+5+5 (level 0) ... 
38.6 Mbytes of cons cells used (77%)
11.2 Mbytes of vectors used (71%)
Garbage collection 47 = 37+5+5 (level 0) ... 
38.7 Mbytes of cons cells used (77%)
11.2 Mbytes of vectors used (71%)
Garbage collection 48 = 38+5+5 (level 0) ... 
38.8 Mbytes of cons cells used (77%)
11.2 Mbytes of vectors used (71%)
Garbage collection 49 = 39+5+5 (level 0) ... 
39.0 Mbytes of cons cells used (78%)
11.2 Mbytes of vectors used (71%)
Garbage collection 50 = 40+5+5 (level 0) ... 
39.1 Mbytes of cons cells used (78%)
11.3 Mbytes of vectors used (71%)
Garbage collection 51 = 40+6+5 (level 1) ... 
38.8 Mbytes of cons cells used (77%)
11.3 Mbytes of vectors used (71%)
Garbage collection 52 = 41+6+5 (level 0) ... 
38.9 Mbytes of cons cells used (77%)
11.3 Mbytes of vectors used (71%)
Garbage collection 53 = 42+6+5 (level 0) ... 
41.5 Mbytes of cons cells used (83%)
12.2 Mbytes of vectors used (77%)
Garbage collection 54 = 42+7+5 (level 1) ... 
43.4 Mbytes of cons cells used (86%)
12.8 Mbytes of vectors used (81%)
Garbage collection 55 = 42+7+6 (level 2) ... 
44.7 Mbytes of cons cells used (72%)
13.0 Mbytes of vectors used (67%)
Garbage collection 56 = 43+7+6 (level 0) ... 
46.5 Mbytes of cons cells used (74%)
13.6 Mbytes of vectors used (70%)
Garbage collection 57 = 44+7+6 (level 0) ... 
47.0 Mbytes of cons cells used (75%)
13.8 Mbytes of vectors used (71%)
Garbage collection 58 = 45+7+6 (level 0) ... 
47.4 Mbytes of cons cells used (76%)
13.9 Mbytes of vectors used (71%)
Garbage collection 59 = 46+7+6 (level 0) ... 
47.7 Mbytes of cons cells used (76%)
14.2 Mbytes of vectors used (73%)
Garbage collection 60 = 47+7+6 (level 0) ... 
48.0 Mbytes of cons cells used (77%)
14.2 Mbytes of vectors used (73%)
Garbage collection 61 = 48+7+6 (level 0) ... 
48.1 Mbytes of cons cells used (77%)
14.3 Mbytes of vectors used (73%)
> DT <- fread('LargeFile.csv', header = TRUE, colClasses = colCLASS, select = colSEL, key = c('LoanID', 'Month'), verbose = TRUE)
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
  Using 40 threads (omp_get_max_threads()=40, nth=40)
  NAstrings = [<<NA>>]
  None of the NAstrings look like numbers.
  show progress = 1
  0/1 column will be read as boolean
[02] Opening the file
  Opening file LargeFile.csv
  File opened, size = 6.355GB (6823372783 bytes).
  Memory mapped ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
  \n has been found in the data so any mixture of line endings is allowed other than \r-only line endings. This is common and ideal.
[05] Skipping initial rows if needed
  Positioned on line 1 starting: <<Orig_Year,Orig_Qtr,LoanID,Mont>>
[06] Detect separator, quoting rule, and ncolumns
  Detecting sep ...
  sep=','  with 100 lines of 37 fields using quote rule 0
  Detected 37 columns on line 1. This line is either column names or first data row. Line starts as: <<Orig_Year,Orig_Qtr,LoanID,Mont>>
  Quote rule picked = 0
  fill=false and the most number of columns found is 37
[07] Detect column types, good nrow estimate and whether first row is column names
  'header' changed by user from 'auto' to true
  Number of sampling jump points = 101 because (6823372781 bytes from row 1 to eof) / (2 * 12905 jump0size) == 264369
  Type codes (jump 000)    : 51AA7155A15A7111111111111771111177715  Quote rule 0
  Type codes (jump 001)    : 51AA7155A15A711111111111177111117771A  Quote rule 0
  Type codes (jump 008)    : 51AA7555A15A711111111111177111117771A  Quote rule 0
  Type codes (jump 009)    : 51AA7555AA5A71A155555557177111117775A  Quote rule 0
  Type codes (jump 042)    : 55AA7555AA5A71A155555557177111117775A  Quote rule 0
  Type codes (jump 064)    : 55AA7555AA5A71A1A5555557177111117775A  Quote rule 0
  Type codes (jump 100)    : 55AA7555AA5A71A1A5555557177111117775A  Quote rule 0
  =====
  Sampled 10049 rows (handled \n inside quoted fields) at 101 jump points
  Bytes from first data row on line 1 to the end of last row: 6823372781
  Line length: mean=126.15 sd=8.30 min=100 max=359
  Estimated number of rows: 6823372781 / 126.15 = 54088821
  Initial alloc = 62279495 rows (54088821 + 15%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
  =====
[08] Assign column names
[09] Apply user overrides on column types
  After 4 type and 23 drop user overrides : 00AA700000000000000000070775555077750
[10] Allocate memory for the datatable
  Allocating 14 column slots (37 - 23 dropped) with 62279495 rows
Garbage collection 62 = 48+7+7 (level 2) ... 
46.5 Mbytes of cons cells used (60%)
13.6 Mbytes of vectors used (2%)
Garbage collection 63 = 48+7+8 (level 2) ... 
46.5 Mbytes of cons cells used (60%)
488.7 Mbytes of vectors used (42%)
Garbage collection 64 = 48+7+9 (level 2) ... 
46.5 Mbytes of cons cells used (60%)
963.9 Mbytes of vectors used (56%)
Garbage collection 65 = 48+7+10 (level 2) ... 
46.5 Mbytes of cons cells used (60%)
1439.1 Mbytes of vectors used (63%)
Garbage collection 66 = 48+7+11 (level 2) ... 
46.5 Mbytes of cons cells used (60%)
1914.2 Mbytes of vectors used (67%)
Garbage collection 67 = 48+7+12 (level 2) ... 
46.5 Mbytes of cons cells used (60%)
2864.5 Mbytes of vectors used (77%)
Garbage collection 68 = 48+7+13 (level 2) ... 
46.5 Mbytes of cons cells used (60%)
3577.3 Mbytes of vectors used (78%)
Garbage collection 69 = 48+7+14 (level 2) ... 
46.5 Mbytes of cons cells used (60%)
4290.0 Mbytes of vectors used (75%)
[11] Read the data
  jumps=[0..6520), chunk_size=1046529, total_size=6823372422
|--------------------------------------------------|
|============================Garbage collection 70 = 49+7+14 (level 0) ... 
76.5 Mbytes of cons cells used (99%)
5487.5 Mbytes of vectors used (96%)
=Garbage collection 71 = 49+8+14 (level 1) ... 
77.0 Mbytes of cons cells used (100%)
5487.6 Mbytes of vectors used (96%)
Garbage collection 72 = 49+8+15 (level 2) ... 
77.0 Mbytes of cons cells used (81%)
5487.1 Mbytes of vectors used (80%)
==============Garbage collection 73 = 50+8+15 (level 0) ... 
94.3 Mbytes of cons cells used (100%)
5494.0 Mbytes of vectors used (80%)
Garbage collection 74 = 50+9+15 (level 1) ... 
94.5 Mbytes of cons cells used (100%)
5494.1 Mbytes of vectors used (80%)
Garbage collection 75 = 50+9+16 (level 2) ... 
94.5 Mbytes of cons cells used (82%)
5493.1 Mbytes of vectors used (67%)
=======|
Read 53945186 rows x 14 columns from 6.355GB (6823372783 bytes) file in 00:24.772 wall clock time
[12] Finalizing the datatable
  Type counts:
        23 : drop      '0'
         5 : int32     '5'
         7 : float64   '7'
         2 : string    'A'
=============================
   0.005s (  0%) Memory map 6.355GB file
   0.018s (  0%) sep=',' ncol=37 and header detection
   0.000s (  0%) Column type detection using 10049 sample rows
   5.496s ( 22%) Allocation of 62279495 rows x 37 cols (5.336GB) of which 53945186 ( 87%) rows used
  19.253s ( 78%) Reading 6520 chunks of 0.998MB (8295 rows) using 40 threads
   =    0.433s (  2%) Finding first non-embedded \n after each jump
   +    1.482s (  6%) Parse to row-major thread buffers (grown 0 times)
   +    9.515s ( 38%) Transpose
   +    7.822s ( 32%) Waiting
   0.000s (  0%) Rereading 0 columns due to out-of-sample type exceptions
  24.772s        Total
Garbage collection 76 = 51+9+16 (level 0) ... 
105.3 Mbytes of cons cells used (91%)
5500.3 Mbytes of vectors used (67%)
Garbage collection 77 = 51+10+16 (level 1) ... 
105.4 Mbytes of cons cells used (91%)
5500.2 Mbytes of vectors used (67%)
> DT[, Month := as.IDate(Month, format = "%Y-%m-%d")]
Garbage collection 78 = 51+10+17 (level 2) ... 
107.5 Mbytes of cons cells used (76%)
8174.1 Mbytes of vectors used (81%)
Garbage collection 79 = 51+11+17 (level 1) ... 
107.5 Mbytes of cons cells used (76%)
5910.4 Mbytes of vectors used (59%)
> gcinfo(FALSE)
[1] TRUE
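
The read-then-convert pattern in the session above can be tried on a tiny file. The sketch below is an assumption-laden miniature: the column names LoanID and Month come from the thread, while the Balance column, the values, and the temporary file `tmp` are invented for illustration.

```r
library(data.table)

# Hypothetical miniature of the loan file (invented rows and Balance column).
tmp <- tempfile(fileext = ".csv")
writeLines(c("LoanID,Month,Balance",
             "B2,2017-02-01,90.5",
             "A1,2017-02-01,100.0",
             "A1,2017-01-01,110.0"), tmp)

# Read Month as plain text first, then convert it by reference afterwards.
DT <- fread(tmp, header = TRUE,
            colClasses = c("character", "character", "numeric"),
            showProgress = FALSE)
DT[, Month := as.IDate(Month, format = "%Y-%m-%d")]
setkey(DT, LoanID, Month)   # sort and key by loan, then by month

print(DT)
unlink(tmp)
```

Here the key is set after the conversion rather than through fread's `key` argument, so the example does not rely on what happens to an existing key when one of its columns is reassigned.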

Awesome! :tada: Great work by everyone involved, especially @mattdowle, who must have little hair left by now after this one :)

Looks like my "stay on vacation until the problem gets fixed" strategy worked here :-)

Is there anything else I should try looking at, or should this issue be considered resolved?

Thanks @aadler and @HughParsonage! Relief.
@kevinushey Ha ha. Yes, it was on the data.table side and is now resolved (PR #2488). Thanks.
