Data.table: Stack imbalance in fread

Created on 14 Nov 2017  ·  61 comments  ·  Source: Rdatatable/data.table

R crashes (with 'stack imbalance') when running the following with verbose=FALSE. Note: a month or two ago I could run the code below successfully on an earlier development version of data.table, so I think this is a fairly recent bug. (Sorry, I can't remember the exact development version that worked.)

The problem does not reproduce on much smaller files. Link to the zip file (the csv is 350MB): https://github.com/HughParsonage/ABS-data/blob/master/inbox/SA2-by-DJZ-2011.zip

Occasionally a different error occurs, for example

Error in get(name, envir = ns, inherits = FALSE) : invalid first argument

or

Warning: stack imbalance in '$', 16 then 15
Error: R_Reprotect: only 1 protected item, can't reprotect index -2

# Minimal reproducible example

library(data.table)

#> data.table 1.10.5 IN DEVELOPMENT built 2017-11-13 02:46:28 UTC; appveyor
#>   The fastest way to learn (by data.table authors): https://www.datacamp.com/courses/data-analysis-the-data-table-way
#>   Documentation: ?data.table, example(data.table) and browseVignettes("data.table")
#>   Release notes, videos and slides: http://r-datatable.com


fread("SA2-by-DJZ-2011.csv", header = FALSE, na.strings = "", verbose = TRUE)

Input contains no \n. Taking this to be a filename to open
[01] Check arguments
  Using 12 threads (omp_get_max_threads()=12, nth=12)
  NAstrings = [<<>>]
  None of the NAstrings look like numbers.
  show progress = 1
  0/1 column will be read as boolean
[02] Opening the file
  Opening file SA2-by-DJZ-2011.csv
  File opened, size = 349.4MB (366418725 bytes).
  Memory mapping ... ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
  \r-only line endings are not allowed because \n is found in the data
[05] Skipping initial rows if needed
  Positioned on line 1 starting: <<Australian Bureau of Statistic>>
[06] Detect separator, quoting rule, and ncolumns
  Detecting sep ...
  sep=','  with 89 lines of 4 fields using quote rule 0
  Detected 4 columns on line 12. This line is either column names or first data row. Line starts as: <<"Goulburn","110018063",3499,>>
  Quote rule picked = 0
  fill=false and the most number of columns found is 4
[07] Detect column types, good nrow estimate and whether first row is column names
  'header' changed by user from 'auto' to false
  Number of sampling jump points = 101 because (366418375 bytes from row 1 to eof) / (2 * 1457 jump0size) == 125744
  Type codes (jump 000)    : 1551  Quote rule 0
  Type codes (jump 100)    : 11051  Quote rule 0
  =====
  Sampled 10027 rows (handled \n inside quoted fields) at 101 jump points
  Bytes from first data row on line 12 to the end of last row: 366418143
  Line length: mean=16.02 sd=0.21 min=16 max=29
  Estimated number of rows: 366418143 / 16.02 = 22877178
  Initial alloc = 25164895 rows (22877178 + 9%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
  =====
[08] Assign column names
[09] Apply user overrides on column types
  After 0 type and 0 drop user overrides : 11051
[10] Allocate memory for the datatable
  Allocating 4 column slots (4 - 0 dropped) with 25164895 rows
[11] Read the data
  jumps=[0..360), chunk_size=1017828, total_size=366418143
[12] Finalizing the datatable
Read 22885380 rows x 4 columns from 349.4MB (366418725 bytes) file in 00:02.550 wall clock time
Thread buffers were grown 0 times (if all 12 threads each grew once, this figure would be 12)
Final type counts
         0 : drop     
         1 : bool8    
         0 : bool8    
         0 : bool8    
         0 : bool8    
         1 : int32    
         0 : int64    
         0 : float64  
         0 : float64  
         0 : float64  
         2 : string   
Rereading 1 columns due to out-of-sample type exceptions.
Column 1 ("") bumped from 'bool8' to 'string' due to <<"Goulburn">> on row 0
[11] Read the data
  jumps=[0..360), chunk_size=1017828, total_size=366418143
Read 94%. ETA 00:00 Warning: stack imbalance in '$', 27 then 28
Read 98%. ETA 00:00 
[12] Finalizing the datatable
Reread 22885380 rows x 1 columns in 00:00.991
Read 22885380 rows. Exactly what was estimated and allocated up front
=============================
   0.006s (  0%) Memory map 0.341GB file
   0.011s (  0%) sep=',' ncol=4 and header detection
   0.002s (  0%) Column type detection using 10027 sample rows
   0.328s (  9%) Allocation of 22885380 rows x 4 cols (0.469GB)
   3.194s ( 90%) Reading 360 chunks of 0.971MB (63547 rows) using 12 threads
   =    0.001s (  0%) Finding first non-embedded \n after each jump
   +    0.362s ( 10%) Parse to row-major thread buffers
   +    1.963s ( 55%) Transpose
   +    0.868s ( 25%) Waiting
   0.991s ( 28%) Rereading 1 columns due to out-of-sample type exceptions
   3.541s        Total
Warning: stack imbalance in 'withVisible', 3 then 5
Warning messages:
1: In fread("SA2-by-DJZ-2011.csv", header = FALSE, na.strings = "",  :
  Starting data input on line 12 <<"Goulburn","110018063",3499,>> with 4 fields and discarding line 11 <<"Main Statistical Area Structu>> before it because it has a different number of fields (3).
2: In fread("SA2-by-DJZ-2011.csv", header = FALSE, na.strings = "",  :
  Found the last consistent line but text exists afterwards. Consider fill=TRUE and/or blank.lines.skip=TRUE. First 200 characters of discarded line: <<"Dataset: 2011 Census of Population and Housing">>

# Output of sessionInfo()

R version 3.4.2 (2017-09-28)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

Matrix products: default

locale:
[1] LC_COLLATE=English_Australia.1252  LC_CTYPE=English_Australia.1252   
[3] LC_MONETARY=English_Australia.1252 LC_NUMERIC=C                      
[5] LC_TIME=English_Australia.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] data.table_1.10.5    RevoUtils_10.0.6     RevoUtilsMath_10.0.1

loaded via a namespace (and not attached):
[1] compiler_3.4.2 tools_3.4.2    yaml_2.1.14 
bug fread idatitime platform-specific

Most helpful comment

My strategy of 'stay on holiday until the problem is fixed' seems to have worked here. :-)

Is there anything else I should check, or is this issue considered resolved?

All 61 comments

@HughParsonage, this looks similar to #2457. Please pass showProgress=FALSE and check whether it completes.
@mattdowle could there have been a regression since 2017-11-09?

Running with showProgress=FALSE did indeed return a result (with only the expected warnings).

Thanks for all the detailed information. I doubt there has been a regression since 2017-11-09, but the long verbose=TRUE output could be having a similar effect to the ETA output. More output is generated here because the file has to be reread. I fear that @HughParsonage's report of it working with that showProgress setting may be a fluke, and that the problem would still occur if it were run 5 to 10 times with verbose=TRUE.

No verbose messages are printed inside the parallel section (apart from the progress ETA, which has already been fixed). However, there is a verbose message after the first read and before the second reread starts (a reread does happen for this file). If that print triggers the 100th CheckUserInterrupt (see #2457), the second parallel region could fail (oddly enough). Anyway, to rule that out, I changed all verbose messages to use REprintf instead of Rprintf (the same change #2457 made for the ETA). The tests failed because they could not find the output, which now goes to stderr. Once they pass, the Windows .zip will be built automatically; please try again then. I'll update here when it is ready.
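The test failure described above comes down to which stream the text lands on. The following is only an R-level analogy, not data.table's C code: `cat()` stands in for `Rprintf` (stdout) and `message()` for `REprintf` (stderr), and the helper names `emit_stdout` and `emit_stderr` are made up for illustration.

```r
# Hypothetical helpers: same text, different streams.
emit_stdout <- function() cat("Read 50%\n")     # stdout, analogous to Rprintf
emit_stderr <- function() message("Read 50%")   # stderr, analogous to REprintf

# capture.output() captures stdout by default, so it sees the cat() text ...
captured_out <- capture.output(emit_stdout())

# ... but it misses text sent to stderr (which still reaches the console).
missed <- capture.output(emit_stderr())

# Capturing the message stream explicitly finds the text again.
captured_err <- capture.output(emit_stderr(), type = "message")

print(captured_out)
print(length(missed))
print(captured_err)
```

A test that only inspects stdout would therefore need to capture the message stream instead once the output moves to stderr.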

OK, the second attempt passed the checks and the Windows .zip is available. @HughParsonage could you try again, please? I added a call to R_FlushConsole() after the verbose-mode message just before the reread. That flush is only needed on Windows. My guess is that without the flush the console sometimes updates a little later, while the parallel reread is already under way, which is known to cause problems. Please always use both verbose=TRUE and showProgress=TRUE, and repeat 10 times. If we see 10 clean runs, I'd say that was it. Otherwise we'll have to think again.

Unfortunately, it is not fixed.

> library(data.table)
data.table 1.10.5 IN DEVELOPMENT built 2017-11-15 00:36:41 UTC; appveyor
  The fastest way to learn (by data.table authors): https://www.datacamp.com/courses/data-analysis-the-data-table-way
  Documentation: ?data.table, example(data.table) and browseVignettes("data.table")
  Release notes, videos and slides: http://r-datatable.com
> fread("SA2-by-DJZ-2011.csv", header = FALSE, na.strings = "", verbose = FALSE)
Read 26%. ETA 00:00 Warning: stack imbalance in '$', 20 then 22
Read 52%. ETA 00:00 Warning: stack imbalance in '$', 36 then 35
Warning: stack imbalance in '$', 21 then 22
Read 59%. ETA 00:00 Error in fread("SA2-by-DJZ-2011.csv", header = FALSE, na.strings = "",  : 
  unprotect_ptr: pointer not found
In addition: Warning: stack imbalance in '$', 26 then 28
Warning messages:
1: Warning: stack imbalance in '$', 26 then 27
In fread("SA2-by-DJZ-2011.csv", header = FALSE, na.strings = "",  :
  Starting data input on line 12 <<"Goulburn","110018063",3499,>> with 4 fields and discarding line 11 <<"Main Statistical Area Structu>> before it because it has a different number of fields (3).
2: In fread("SA2-by-DJZ-2011.csv", header = FALSE, na.strings = "",  :
  Found the last consistent line but text exists afterwards. Consider fill=TRUE and/or blank.lines.skip=TRUE. First 200 characters of discarded line: <<"Dataset: 2011 Census of Population and Housing">>
Warning: stack imbalance in 'lapply', 31 then 30
Warning: stack imbalance in '$', 14 then 15

With verbose=TRUE, showProgress=TRUE there is no error, even after 10 runs. Here is the output of the 10th run.

Input contains no \n. Taking this to be a filename to open
[01] Check arguments
  Using 12 threads (omp_get_max_threads()=12, nth=12)
  NAstrings = [<<>>]
  None of the NAstrings look like numbers.
  show progress = 1
  0/1 column will be read as boolean
[02] Opening the file
  Opening file SA2-by-DJZ-2011.csv
  File opened, size = 349.4MB (366418725 bytes).
  Memory mapping ... ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
  \r-only line endings are not allowed because \n is found in the data
[05] Skipping initial rows if needed
  Positioned on line 1 starting: <<Australian Bureau of Statistic>>
[06] Detect separator, quoting rule, and ncolumns
  Detecting sep ...
  sep=','  with 89 lines of 4 fields using quote rule 0
  Detected 4 columns on line 12. This line is either column names or first data row. Line starts as: <<"Goulburn","110018063",3499,>>
  Quote rule picked = 0
  fill=false and the most number of columns found is 4
[07] Detect column types, good nrow estimate and whether first row is column names
  'header' changed by user from 'auto' to false
  Number of sampling jump points = 101 because (366418375 bytes from row 1 to eof) / (2 * 1457 jump0size) == 125744
  Type codes (jump 000)    : 1551  Quote rule 0
  Type codes (jump 100)    : 11051  Quote rule 0
  =====
  Sampled 10027 rows (handled \n inside quoted fields) at 101 jump points
  Bytes from first data row on line 12 to the end of last row: 366418143
  Line length: mean=16.02 sd=0.21 min=16 max=29
  Estimated number of rows: 366418143 / 16.02 = 22877178
  Initial alloc = 25164895 rows (22877178 + 9%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
  =====
[08] Assign column names
[09] Apply user overrides on column types
  After 0 type and 0 drop user overrides : 11051
[10] Allocate memory for the datatable
  Allocating 4 column slots (4 - 0 dropped) with 25164895 rows
[11] Read the data
  jumps=[0..360), chunk_size=1017828, total_size=366418143
[12] Finalizing the datatable
Read 22885380 rows x 4 columns from 349.4MB (366418725 bytes) file in 00:02.094 wall clock time
Thread buffers were grown 0 times (if all 1 threads each grew once, this figure would be 1)
Final type counts
         0 : drop     
         1 : bool8    
         0 : bool8    
         0 : bool8    
         0 : bool8    
         1 : int32    
         0 : int64    
         0 : float64  
         0 : float64  
         0 : float64  
         2 : string   
Rereading 1 columns due to out-of-sample type exceptions.
Column 1 ("") bumped from 'bool8' to 'string' due to <<"Goulburn">> on row 0
[11] Read the data
  jumps=[0..360), chunk_size=1017828, total_size=366418143
Read 99%. ETA 00:00 
[12] Finalizing the datatable
Reread 22885380 rows x 1 columns in 00:01.752
Read 22885380 rows. Exactly what was estimated and allocated up front
=============================
   0.004s (  0%) Memory map 0.341GB file
   0.008s (  0%) sep=',' ncol=4 and header detection
   0.001s (  0%) Column type detection using 10027 sample rows
   0.173s (  4%) Allocation of 22885380 rows x 4 cols (0.469GB)
   3.660s ( 95%) Reading 360 chunks of 0.971MB (63547 rows) using 1 threads
   =    0.009s (  0%) Finding first non-embedded \n after each jump
   +    1.946s ( 51%) Parse to row-major thread buffers
   +    1.098s ( 29%) Transpose
   +    0.608s ( 16%) Waiting
   1.752s ( 46%) Rereading 1 columns due to out-of-sample type exceptions
   3.846s        Total
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
  Using 12 threads (omp_get_max_threads()=12, nth=12)
  NAstrings = [<<>>]
  None of the NAstrings look like numbers.
  show progress = 1
  0/1 column will be read as boolean
[02] Opening the file
  Opening file SA2-by-DJZ-2011.csv
  File opened, size = 349.4MB (366418725 bytes).
  Memory mapping ... ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
  \r-only line endings are not allowed because \n is found in the data
[05] Skipping initial rows if needed
  Positioned on line 1 starting: <<Australian Bureau of Statistic>>
[06] Detect separator, quoting rule, and ncolumns
  Detecting sep ...
  sep=','  with 89 lines of 4 fields using quote rule 0
  Detected 4 columns on line 12. This line is either column names or first data row. Line starts as: <<"Goulburn","110018063",3499,>>
  Quote rule picked = 0
  fill=false and the most number of columns found is 4
[07] Detect column types, good nrow estimate and whether first row is column names
  'header' changed by user from 'auto' to false
  Number of sampling jump points = 101 because (366418375 bytes from row 1 to eof) / (2 * 1457 jump0size) == 125744
  Type codes (jump 000)    : 1551  Quote rule 0
  Type codes (jump 100)    : 11051  Quote rule 0
  =====
  Sampled 10027 rows (handled \n inside quoted fields) at 101 jump points
  Bytes from first data row on line 12 to the end of last row: 366418143
  Line length: mean=16.02 sd=0.21 min=16 max=29
  Estimated number of rows: 366418143 / 16.02 = 22877178
  Initial alloc = 25164895 rows (22877178 + 9%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
  =====
[08] Assign column names
[09] Apply user overrides on column types
  After 0 type and 0 drop user overrides : 11051
[10] Allocate memory for the datatable
  Allocating 4 column slots (4 - 0 dropped) with 25164895 rows
[11] Read the data
  jumps=[0..360), chunk_size=1017828, total_size=366418143
[12] Finalizing the datatable
Read 22885380 rows x 4 columns from 349.4MB (366418725 bytes) file in 00:02.589 wall clock time
Thread buffers were grown 0 times (if all 1 threads each grew once, this figure would be 1)
Final type counts
         0 : drop     
         1 : bool8    
         0 : bool8    
         0 : bool8    
         0 : bool8    
         1 : int32    
         0 : int64    
         0 : float64  
         0 : float64  
         0 : float64  
         2 : string   
Rereading 1 columns due to out-of-sample type exceptions.
Column 1 ("") bumped from 'bool8' to 'string' due to <<"Goulburn">> on row 0
[11] Read the data
  jumps=[0..360), chunk_size=1017828, total_size=366418143
Read 99%. ETA 00:00 
[12] Finalizing the datatable
Reread 22885380 rows x 1 columns in 00:01.418
Read 22885380 rows. Exactly what was estimated and allocated up front
=============================
   0.001s (  0%) Memory map 0.341GB file
   0.003s (  0%) sep=',' ncol=4 and header detection
   0.001s (  0%) Column type detection using 10027 sample rows
   0.574s ( 14%) Allocation of 22885380 rows x 4 cols (0.469GB)
   3.428s ( 86%) Reading 360 chunks of 0.971MB (63547 rows) using 1 threads
   =    0.010s (  0%) Finding first non-embedded \n after each jump
   +    1.988s ( 50%) Parse to row-major thread buffers
   +    1.137s ( 28%) Transpose
   +    0.292s (  7%) Waiting
   1.418s ( 35%) Rereading 1 columns due to out-of-sample type exceptions
   4.007s        Total
There were 20 warnings (use warnings() to see them)

@HughParsonage Thanks! I'm confused, though. You're saying it works fine with verbose=TRUE, showProgress=TRUE, which is what we hoped for. Yay! Didn't that fail before? The default for showProgress is TRUE anyway, but when you run with the default verbose=FALSE, _then_ it doesn't work and shows the stack imbalance? It is odd that it would fail with _less_ output. Please confirm. If so, I'm probably barking up the wrong tree. It works fine for me on Linux, so I'm relying on you to test on Windows. Thanks.
(Also, the bottom of the 10th run's output says there were 20 warnings. I assume those are the 2 warnings shown further up, repeated 10 times. If so, that makes sense.)

Sorry for the confusion, Matt.

It is correct that the original problem no longer causes a crash. That is, the following works as expected:

fread("SA2-by-DJZ-2011.csv", header = FALSE, na.strings = "")

To clarify: originally, the crash occurred with verbose = FALSE (the default). Before filing the issue I ran with verbose = TRUE; the 'stack imbalance' warnings appeared, but there was no crash. With the latest version there is no crash (or indeed any problem) with verbose = FALSE.

The reason I said 'not fixed' is that I noticed these warning messages:

Warning messages:
Warning: stack imbalance in '$', 26 then 27
Warning: stack imbalance in 'lapply', 31 then 30
Warning: stack imbalance in '$', 14 then 15

These looked odd, and I thought they were closely related, if not identical, to the original problem. But this morning in Australia I can no longer reproduce the warning messages.

OK, got it. Warning messages about stack imbalance are essentially errors. We can't skip over them. I call a stack imbalance warning a crash even though R hasn't actually crashed yet. (Once you see the warning, it is only a matter of time before it crashes.)

When you run it 10 times in a fresh R session with verbose=TRUE, showProgress=TRUE, are any of the 20 warnings about stack imbalance, or are they all just the regular warnings like the following?

1: In fread("SA2-by-DJZ-2011.csv", header = FALSE, na.strings = "",  :
  Starting data input on line 12 <<"Goulburn","110018063",3499,>> with 4 fields and discarding line 11 <<"Main Statistical Area Structu>> before it because it has a different number of fields (3).
2: In fread("SA2-by-DJZ-2011.csv", header = FALSE, na.strings = "",  :
  Found the last consistent line but text exists afterwards. Consider fill=TRUE and/or blank.lines.skip=TRUE. First 200 characters of discarded line: <<"Dataset: 2011 Census of Population and Housing">>
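One way to answer that question without reading the console by eye is to collect the warning conditions from each run and tabulate them. This is a minimal sketch in base R: `read_once` is a hypothetical stand-in for the `fread()` call, and its warning texts are abbreviated versions of the two regular warnings above. It only sees warnings signalled as R conditions; lines that R prints directly to the console would not show up here and would still need to be checked in the console output.

```r
# Hypothetical stand-in for one fread() run: emits the two regular warnings
# (texts abbreviated for illustration).
read_once <- function() {
  warning("Starting data input on line 12")
  warning("Found the last consistent line but text exists afterwards")
  invisible(NULL)
}

collected <- character(0)
for (i in 1:10) {
  withCallingHandlers(
    read_once(),
    warning = function(w) {
      collected <<- c(collected, conditionMessage(w))  # record the text
      invokeRestart("muffleWarning")                   # keep the console quiet
    }
  )
}

# How often each distinct warning occurred across the 10 runs.
counts <- table(collected)
# Number of collected warnings that mention a stack imbalance.
n_imbalance <- sum(grepl("stack imbalance", collected, fixed = TRUE))

print(counts)
print(n_imbalance)
```

If the 20 warnings really are the 2 regular ones repeated 10 times, `counts` shows two entries of 10 each and `n_imbalance` is zero.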

Whenever a stack imbalance warning occurs, please start a fresh R session. Once it has happened even once, nothing in that R session can be trusted.
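One way to guarantee a fresh session for every attempt is to launch each run as a separate Rscript process. The sketch below assumes Rscript is available under `R.home("bin")`. `run_fresh` is a hypothetical helper, and the expression passed to it is a trivial stand-in for the real `fread()` call.

```r
# Path to the Rscript executable that belongs to the running R installation.
rscript <- file.path(R.home("bin"), "Rscript")

# Hypothetical helper: evaluate one expression in a brand-new R process and
# return everything it printed (stdout and stderr) as a character vector.
run_fresh <- function(expr_text) {
  system2(rscript,
          args = c("--vanilla", "-e", shQuote(expr_text)),
          stdout = TRUE, stderr = TRUE)
}

# Stand-in expression; in practice this would be the library() and fread() calls.
out <- run_fresh("cat(1 + 1)")

# Flag any run whose output mentions a stack imbalance.
had_imbalance <- any(grepl("stack imbalance", out, fixed = TRUE))
print(out)
print(had_imbalance)
```

Because every call starts from a clean process, an imbalance in one run cannot contaminate the next.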

I got a crash when running with verbose=TRUE, showProgress=TRUE. It was something about const char and SEXP. I'll try to reproduce it from the command line (unfortunately it happened in RStudio, and RStudio closed before I could read the full message).

I can't reproduce the crash. Here is the result after rebooting. There are stack imbalance warnings.

> library(data.table)
data.table 1.10.5 IN DEVELOPMENT built 2017-11-15 00:36:41 UTC; appveyor
  The fastest way to learn (by data.table authors): https://www.datacamp.com/courses/data-analysis-the-data-table-way
  Documentation: ?data.table, example(data.table) and browseVignettes("data.table")
  Release notes, videos and slides: http://r-datatable.com
> for (i in 1:10) fread("SA2-by-DJZ-2011.csv", header = FALSE, na.strings = "", verbose = TRUE, showProgress = TRUE)
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
  Using 12 threads (omp_get_max_threads()=12, nth=12)
  NAstrings = [<<>>]
  None of the NAstrings look like numbers.
  show progress = 1
  0/1 column will be read as boolean
[02] Opening the file
  Opening file SA2-by-DJZ-2011.csv
  File opened, size = 349.4MB (366418725 bytes).
  Memory mapping ... ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
  \r-only line endings are not allowed because \n is found in the data
[05] Skipping initial rows if needed
  Positioned on line 1 starting: <<Australian Bureau of Statistic>>
[06] Detect separator, quoting rule, and ncolumns
  Detecting sep ...
  sep=','  with 89 lines of 4 fields using quote rule 0
  Detected 4 columns on line 12. This line is either column names or first data row. Line starts as: <<"Goulburn","110018063",3499,>>
  Quote rule picked = 0
  fill=false and the most number of columns found is 4
[07] Detect column types, good nrow estimate and whether first row is column names
  'header' changed by user from 'auto' to false
  Number of sampling jump points = 101 because (366418375 bytes from row 1 to eof) / (2 * 1457 jump0size) == 125744
  Type codes (jump 000)    : 1551  Quote rule 0
  Type codes (jump 100)    : 11051  Quote rule 0
  =====
  Sampled 10027 rows (handled \n inside quoted fields) at 101 jump points
  Bytes from first data row on line 12 to the end of last row: 366418143
  Line length: mean=16.02 sd=0.21 min=16 max=29
  Estimated number of rows: 366418143 / 16.02 = 22877178
  Initial alloc = 25164895 rows (22877178 + 9%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
  =====
[08] Assign column names
[09] Apply user overrides on column types
  After 0 type and 0 drop user overrides : 11051
[10] Allocate memory for the datatable
  Allocating 4 column slots (4 - 0 dropped) with 25164895 rows
[11] Read the data
  jumps=[0..360), chunk_size=1017828, total_size=366418143
[12] Finalizing the datatable
Read 22885380 rows x 4 columns from 349.4MB (366418725 bytes) file in 00:01.212 wall clock time
Thread buffers were grown 0 times (if all 12 threads each grew once, this figure would be 12)
Final type counts
         0 : drop     
         1 : bool8    
         0 : bool8    
         0 : bool8    
         0 : bool8    
         1 : int32    
         0 : int64    
         0 : float64  
         0 : float64  
         0 : float64  
         2 : string   
Rereading 1 columns due to out-of-sample type exceptions.
Column 1 ("") bumped from 'bool8' to 'string' due to <<"Goulburn">> on row 0
[11] Read the data
  jumps=[0..360), chunk_size=1017828, total_size=366418143
Read 31%. ETA 00:00 Warning: stack imbalance in '$', 24 then 23
Read 91%. ETA 00:00 Warning: stack imbalance in '$', 27 then 26
Read 95%. ETA 00:00 Warning: stack imbalance in '$', 28 then 29
Read 98%. ETA 00:00 
[12] Finalizing the datatable
Reread 22885380 rows x 1 columns in 00:00.895
Read 22885380 rows. Exactly what was estimated and allocated up front
=============================
   0.002s (  0%) Memory map 0.341GB file
   0.029s (  1%) sep=',' ncol=4 and header detection
   0.001s (  0%) Column type detection using 10027 sample rows
   0.314s ( 15%) Allocation of 22885380 rows x 4 cols (0.469GB)
   1.761s ( 84%) Reading 360 chunks of 0.971MB (63547 rows) using 12 threads
   =    0.015s (  1%) Finding first non-embedded \n after each jump
   +    0.599s ( 28%) Parse to row-major thread buffers
   +    0.400s ( 19%) Transpose
   +    0.746s ( 35%) Waiting
   0.895s ( 42%) Rereading 1 columns due to out-of-sample type exceptions
   2.107s        Total
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
  Using 12 threads (omp_get_max_threads()=12, nth=12)
  NAstrings = [<<>>]
  None of the NAstrings look like numbers.
  show progress = 1
  0/1 column will be read as boolean
[02] Opening the file
  Opening file SA2-by-DJZ-2011.csv
  File opened, size = 349.4MB (366418725 bytes).
  Memory mapping ... ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
  \r-only line endings are not allowed because \n is found in the data
[05] Skipping initial rows if needed
  Positioned on line 1 starting: <<Australian Bureau of Statistic>>
[06] Detect separator, quoting rule, and ncolumns
  Detecting sep ...
  sep=','  with 89 lines of 4 fields using quote rule 0
  Detected 4 columns on line 12. This line is either column names or first data row. Line starts as: <<"Goulburn","110018063",3499,>>
  Quote rule picked = 0
  fill=false and the most number of columns found is 4
[07] Detect column types, good nrow estimate and whether first row is column names
  'header' changed by user from 'auto' to false
  Number of sampling jump points = 101 because (366418375 bytes from row 1 to eof) / (2 * 1457 jump0size) == 125744
  Type codes (jump 000)    : 1551  Quote rule 0
  Type codes (jump 100)    : 11051  Quote rule 0
  =====
  Sampled 10027 rows (handled \n inside quoted fields) at 101 jump points
  Bytes from first data row on line 12 to the end of last row: 366418143
  Line length: mean=16.02 sd=0.21 min=16 max=29
  Estimated number of rows: 366418143 / 16.02 = 22877178
  Initial alloc = 25164895 rows (22877178 + 9%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
  =====
[08] Assign column names
[09] Apply user overrides on column types
  After 0 type and 0 drop user overrides : 11051
[10] Allocate memory for the datatable
  Allocating 4 column slots (4 - 0 dropped) with 25164895 rows
[11] Read the data
  jumps=[0..360), chunk_size=1017828, total_size=366418143
[12] Finalizing the datatable
Read 22885380 rows x 4 columns from 349.4MB (366418725 bytes) file in 00:01.335 wall clock time
Thread buffers were grown 0 times (if all 12 threads each grew once, this figure would be 12)
Final type counts
         0 : drop     
         1 : bool8    
         0 : bool8    
         0 : bool8    
         0 : bool8    
         1 : int32    
         0 : int64    
         0 : float64  
         0 : float64  
         0 : float64  
         2 : string   
Rereading 1 columns due to out-of-sample type exceptions.
Column 1 ("") bumped from 'bool8' to 'string' due to <<"Goulburn">> on row 0
[11] Read the data
  jumps=[0..360), chunk_size=1017828, total_size=366418143
Read 98%. ETA 00:00 
[12] Finalizing the datatable
Reread 22885380 rows x 1 columns in 00:01.049
Read 22885380 rows. Exactly what was estimated and allocated up front
=============================
   0.002s (  0%) Memory map 0.341GB file
   0.006s (  0%) sep=',' ncol=4 and header detection
   0.001s (  0%) Column type detection using 10027 sample rows
   0.402s ( 17%) Allocation of 22885380 rows x 4 cols (0.469GB)
   1.974s ( 83%) Reading 360 chunks of 0.971MB (63547 rows) using 12 threads
   =    0.001s (  0%) Finding first non-embedded \n after each jump
   +    0.209s (  9%) Parse to row-major thread buffers
   +    0.864s ( 36%) Transpose
   +    0.900s ( 38%) Waiting
   1.049s ( 44%) Rereading 1 columns due to out-of-sample type exceptions
   2.385s        Total
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
  Using 12 threads (omp_get_max_threads()=12, nth=12)
  NAstrings = [<<>>]
  None of the NAstrings look like numbers.
  show progress = 1
  0/1 column will be read as boolean
[02] Opening the file
  Opening file SA2-by-DJZ-2011.csv
  File opened, size = 349.4MB (366418725 bytes).
  Memory mapping ... ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
  \r-only line endings are not allowed because \n is found in the data
[05] Skipping initial rows if needed
  Positioned on line 1 starting: <<Australian Bureau of Statistic>>
[06] Detect separator, quoting rule, and ncolumns
  Detecting sep ...
  sep=','  with 89 lines of 4 fields using quote rule 0
  Detected 4 columns on line 12. This line is either column names or first data row. Line starts as: <<"Goulburn","110018063",3499,>>
  Quote rule picked = 0
  fill=false and the most number of columns found is 4
[07] Detect column types, good nrow estimate and whether first row is column names
  'header' changed by user from 'auto' to false
  Number of sampling jump points = 101 because (366418375 bytes from row 1 to eof) / (2 * 1457 jump0size) == 125744
  Type codes (jump 000)    : 1551  Quote rule 0
  Type codes (jump 100)    : 11051  Quote rule 0
  =====
  Sampled 10027 rows (handled \n inside quoted fields) at 101 jump points
  Bytes from first data row on line 12 to the end of last row: 366418143
  Line length: mean=16.02 sd=0.21 min=16 max=29
  Estimated number of rows: 366418143 / 16.02 = 22877178
  Initial alloc = 25164895 rows (22877178 + 9%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
  =====
[08] Assign column names
[09] Apply user overrides on column types
  After 0 type and 0 drop user overrides : 11051
[10] Allocate memory for the datatable
  Allocating 4 column slots (4 - 0 dropped) with 25164895 rows
[11] Read the data
  jumps=[0..360), chunk_size=1017828, total_size=366418143
[12] Finalizing the datatable
Read 22885380 rows x 4 columns from 349.4MB (366418725 bytes) file in 00:01.212 wall clock time
Thread buffers were grown 0 times (if all 12 threads each grew once, this figure would be 12)
Final type counts
         0 : drop     
         1 : bool8    
         0 : bool8    
         0 : bool8    
         0 : bool8    
         1 : int32    
         0 : int64    
         0 : float64  
         0 : float64  
         0 : float64  
         2 : string   
Rereading 1 columns due to out-of-sample type exceptions.
Column 1 ("") bumped from 'bool8' to 'string' due to <<"Goulburn">> on row 0
[11] Read the data
  jumps=[0..360), chunk_size=1017828, total_size=366418143
Read 96%. ETA 00:00 
[12] Finalizing the datatable
Reread 22885380 rows x 1 columns in 00:00.414
Read 22885380 rows. Exactly what was estimated and allocated up front
=============================
   0.002s (  0%) Memory map 0.341GB file
   0.007s (  0%) sep=',' ncol=4 and header detection
   0.002s (  0%) Column type detection using 10027 sample rows
   0.293s ( 18%) Allocation of 22885380 rows x 4 cols (0.469GB)
   1.322s ( 81%) Reading 360 chunks of 0.971MB (63547 rows) using 12 threads
   =    0.001s (  0%) Finding first non-embedded \n after each jump
   +    0.199s ( 12%) Parse to row-major thread buffers
   +    0.822s ( 51%) Transpose
   +    0.301s ( 19%) Waiting
   0.414s ( 25%) Rereading 1 columns due to out-of-sample type exceptions
   1.626s        Total
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
  Using 12 threads (omp_get_max_threads()=12, nth=12)
  NAstrings = [<<>>]
  None of the NAstrings look like numbers.
  show progress = 1
  0/1 column will be read as boolean
[02] Opening the file
  Opening file SA2-by-DJZ-2011.csv
  File opened, size = 349.4MB (366418725 bytes).
  Memory mapping ... ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
  \r-only line endings are not allowed because \n is found in the data
[05] Skipping initial rows if needed
  Positioned on line 1 starting: <<Australian Bureau of Statistic>>
[06] Detect separator, quoting rule, and ncolumns
  Detecting sep ...
  sep=','  with 89 lines of 4 fields using quote rule 0
  Detected 4 columns on line 12. This line is either column names or first data row. Line starts as: <<"Goulburn","110018063",3499,>>
  Quote rule picked = 0
  fill=false and the most number of columns found is 4
[07] Detect column types, good nrow estimate and whether first row is column names
  'header' changed by user from 'auto' to false
  Number of sampling jump points = 101 because (366418375 bytes from row 1 to eof) / (2 * 1457 jump0size) == 125744
  Type codes (jump 000)    : 1551  Quote rule 0
  Type codes (jump 100)    : 11051  Quote rule 0
  =====
  Sampled 10027 rows (handled \n inside quoted fields) at 101 jump points
  Bytes from first data row on line 12 to the end of last row: 366418143
  Line length: mean=16.02 sd=0.21 min=16 max=29
  Estimated number of rows: 366418143 / 16.02 = 22877178
  Initial alloc = 25164895 rows (22877178 + 9%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
  =====
[08] Assign column names
[09] Apply user overrides on column types
  After 0 type and 0 drop user overrides : 11051
[10] Allocate memory for the datatable
  Allocating 4 column slots (4 - 0 dropped) with 25164895 rows
[11] Read the data
  jumps=[0..360), chunk_size=1017828, total_size=366418143
[12] Finalizing the datatable
Read 22885380 rows x 4 columns from 349.4MB (366418725 bytes) file in 00:01.451 wall clock time
Thread buffers were grown 0 times (if all 12 threads each grew once, this figure would be 12)
Final type counts
         0 : drop     
         1 : bool8    
         0 : bool8    
         0 : bool8    
         0 : bool8    
         1 : int32    
         0 : int64    
         0 : float64  
         0 : float64  
         0 : float64  
         2 : string   
Rereading 1 columns due to out-of-sample type exceptions.
Column 1 ("") bumped from 'bool8' to 'string' due to <<"Goulburn">> on row 0
[11] Read the data
  jumps=[0..360), chunk_size=1017828, total_size=366418143
Read 97%. ETA 00:00 
[12] Finalizing the datatable
Reread 22885380 rows x 1 columns in 00:00.409
Read 22885380 rows. Exactly what was estimated and allocated up front
=============================
   0.002s (  0%) Memory map 0.341GB file
   0.006s (  0%) sep=',' ncol=4 and header detection
   0.001s (  0%) Column type detection using 10027 sample rows
   0.403s ( 22%) Allocation of 22885380 rows x 4 cols (0.469GB)
   1.448s ( 78%) Reading 360 chunks of 0.971MB (63547 rows) using 12 threads
   =    0.001s (  0%) Finding first non-embedded \n after each jump
   +    0.194s ( 10%) Parse to row-major thread buffers
   +    0.974s ( 52%) Transpose
   +    0.279s ( 15%) Waiting
   0.409s ( 22%) Rereading 1 columns due to out-of-sample type exceptions
   1.860s        Total
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
  Using 12 threads (omp_get_max_threads()=12, nth=12)
  NAstrings = [<<>>]
  None of the NAstrings look like numbers.
  show progress = 1
  0/1 column will be read as boolean
[02] Opening the file
  Opening file SA2-by-DJZ-2011.csv
  File opened, size = 349.4MB (366418725 bytes).
  Memory mapping ... ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
  \r-only line endings are not allowed because \n is found in the data
[05] Skipping initial rows if needed
  Positioned on line 1 starting: <<Australian Bureau of Statistic>>
[06] Detect separator, quoting rule, and ncolumns
  Detecting sep ...
  sep=','  with 89 lines of 4 fields using quote rule 0
  Detected 4 columns on line 12. This line is either column names or first data row. Line starts as: <<"Goulburn","110018063",3499,>>
  Quote rule picked = 0
  fill=false and the most number of columns found is 4
[07] Detect column types, good nrow estimate and whether first row is column names
  'header' changed by user from 'auto' to false
  Number of sampling jump points = 101 because (366418375 bytes from row 1 to eof) / (2 * 1457 jump0size) == 125744
  Type codes (jump 000)    : 1551  Quote rule 0
  Type codes (jump 100)    : 11051  Quote rule 0
  =====
  Sampled 10027 rows (handled \n inside quoted fields) at 101 jump points
  Bytes from first data row on line 12 to the end of last row: 366418143
  Line length: mean=16.02 sd=0.21 min=16 max=29
  Estimated number of rows: 366418143 / 16.02 = 22877178
  Initial alloc = 25164895 rows (22877178 + 9%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
  =====
[08] Assign column names
[09] Apply user overrides on column types
  After 0 type and 0 drop user overrides : 11051
[10] Allocate memory for the datatable
  Allocating 4 column slots (4 - 0 dropped) with 25164895 rows
[11] Read the data
  jumps=[0..360), chunk_size=1017828, total_size=366418143
[12] Finalizing the datatable
Read 22885380 rows x 4 columns from 349.4MB (366418725 bytes) file in 00:01.480 wall clock time
Thread buffers were grown 0 times (if all 12 threads each grew once, this figure would be 12)
Final type counts
         0 : drop     
         1 : bool8    
         0 : bool8    
         0 : bool8    
         0 : bool8    
         1 : int32    
         0 : int64    
         0 : float64  
         0 : float64  
         0 : float64  
         2 : string   
Rereading 1 columns due to out-of-sample type exceptions.
Column 1 ("") bumped from 'bool8' to 'string' due to <<"Goulburn">> on row 0
[11] Read the data
  jumps=[0..360), chunk_size=1017828, total_size=366418143
Read 99%. ETA 00:00 
[12] Finalizing the datatable
Reread 22885380 rows x 1 columns in 00:00.412
Read 22885380 rows. Exactly what was estimated and allocated up front
=============================
   0.002s (  0%) Memory map 0.341GB file
   0.007s (  0%) sep=',' ncol=4 and header detection
   0.001s (  0%) Column type detection using 10027 sample rows
   0.459s ( 24%) Allocation of 22885380 rows x 4 cols (0.469GB)
   1.424s ( 75%) Reading 360 chunks of 0.971MB (63547 rows) using 12 threads
   =    0.001s (  0%) Finding first non-embedded \n after each jump
   +    0.197s ( 10%) Parse to row-major thread buffers
   +    0.938s ( 50%) Transpose
   +    0.288s ( 15%) Waiting
   0.412s ( 22%) Rereading 1 columns due to out-of-sample type exceptions
   1.892s        Total
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
  Using 12 threads (omp_get_max_threads()=12, nth=12)
  NAstrings = [<<>>]
  None of the NAstrings look like numbers.
  show progress = 1
  0/1 column will be read as boolean
[02] Opening the file
  Opening file SA2-by-DJZ-2011.csv
  File opened, size = 349.4MB (366418725 bytes).
  Memory mapping ... ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
  \r-only line endings are not allowed because \n is found in the data
[05] Skipping initial rows if needed
  Positioned on line 1 starting: <<Australian Bureau of Statistic>>
[06] Detect separator, quoting rule, and ncolumns
  Detecting sep ...
  sep=','  with 89 lines of 4 fields using quote rule 0
  Detected 4 columns on line 12. This line is either column names or first data row. Line starts as: <<"Goulburn","110018063",3499,>>
  Quote rule picked = 0
  fill=false and the most number of columns found is 4
[07] Detect column types, good nrow estimate and whether first row is column names
  'header' changed by user from 'auto' to false
  Number of sampling jump points = 101 because (366418375 bytes from row 1 to eof) / (2 * 1457 jump0size) == 125744
  Type codes (jump 000)    : 1551  Quote rule 0
  Type codes (jump 100)    : 11051  Quote rule 0
  =====
  Sampled 10027 rows (handled \n inside quoted fields) at 101 jump points
  Bytes from first data row on line 12 to the end of last row: 366418143
  Line length: mean=16.02 sd=0.21 min=16 max=29
  Estimated number of rows: 366418143 / 16.02 = 22877178
  Initial alloc = 25164895 rows (22877178 + 9%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
  =====
[08] Assign column names
[09] Apply user overrides on column types
  After 0 type and 0 drop user overrides : 11051
[10] Allocate memory for the datatable
  Allocating 4 column slots (4 - 0 dropped) with 25164895 rows
[11] Read the data
  jumps=[0..360), chunk_size=1017828, total_size=366418143
[12] Finalizing the datatable
Read 22885380 rows x 4 columns from 349.4MB (366418725 bytes) file in 00:01.381 wall clock time
Thread buffers were grown 0 times (if all 12 threads each grew once, this figure would be 12)
Final type counts
         0 : drop     
         1 : bool8    
         0 : bool8    
         0 : bool8    
         0 : bool8    
         1 : int32    
         0 : int64    
         0 : float64  
         0 : float64  
         0 : float64  
         2 : string   
Rereading 1 columns due to out-of-sample type exceptions.
Column 1 ("") bumped from 'bool8' to 'string' due to <<"Goulburn">> on row 0
[11] Read the data
  jumps=[0..360), chunk_size=1017828, total_size=366418143
Read 97%. ETA 00:00 
[12] Finalizing the datatable
Reread 22885380 rows x 1 columns in 00:00.401
Read 22885380 rows. Exactly what was estimated and allocated up front
=============================
   0.002s (  0%) Memory map 0.341GB file
   0.005s (  0%) sep=',' ncol=4 and header detection
   0.001s (  0%) Column type detection using 10027 sample rows
   0.384s ( 22%) Allocation of 22885380 rows x 4 cols (0.469GB)
   1.389s ( 78%) Reading 360 chunks of 0.971MB (63547 rows) using 12 threads
   =    0.001s (  0%) Finding first non-embedded \n after each jump
   +    0.196s ( 11%) Parse to row-major thread buffers
   +    0.911s ( 51%) Transpose
   +    0.281s ( 16%) Waiting
   0.401s ( 22%) Rereading 1 columns due to out-of-sample type exceptions
   1.781s        Total
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
  Using 12 threads (omp_get_max_threads()=12, nth=12)
  NAstrings = [<<>>]
  None of the NAstrings look like numbers.
  show progress = 1
  0/1 column will be read as boolean
[02] Opening the file
  Opening file SA2-by-DJZ-2011.csv
  File opened, size = 349.4MB (366418725 bytes).
  Memory mapping ... ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
  \r-only line endings are not allowed because \n is found in the data
[05] Skipping initial rows if needed
  Positioned on line 1 starting: <<Australian Bureau of Statistic>>
[06] Detect separator, quoting rule, and ncolumns
  Detecting sep ...
  sep=','  with 89 lines of 4 fields using quote rule 0
  Detected 4 columns on line 12. This line is either column names or first data row. Line starts as: <<"Goulburn","110018063",3499,>>
  Quote rule picked = 0
  fill=false and the most number of columns found is 4
[07] Detect column types, good nrow estimate and whether first row is column names
  'header' changed by user from 'auto' to false
  Number of sampling jump points = 101 because (366418375 bytes from row 1 to eof) / (2 * 1457 jump0size) == 125744
  Type codes (jump 000)    : 1551  Quote rule 0
  Type codes (jump 100)    : 11051  Quote rule 0
  =====
  Sampled 10027 rows (handled \n inside quoted fields) at 101 jump points
  Bytes from first data row on line 12 to the end of last row: 366418143
  Line length: mean=16.02 sd=0.21 min=16 max=29
  Estimated number of rows: 366418143 / 16.02 = 22877178
  Initial alloc = 25164895 rows (22877178 + 9%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
  =====
[08] Assign column names
[09] Apply user overrides on column types
  After 0 type and 0 drop user overrides : 11051
[10] Allocate memory for the datatable
  Allocating 4 column slots (4 - 0 dropped) with 25164895 rows
[11] Read the data
  jumps=[0..360), chunk_size=1017828, total_size=366418143
[12] Finalizing the datatable
Read 22885380 rows x 4 columns from 349.4MB (366418725 bytes) file in 00:01.384 wall clock time
Thread buffers were grown 0 times (if all 12 threads each grew once, this figure would be 12)
Final type counts
         0 : drop     
         1 : bool8    
         0 : bool8    
         0 : bool8    
         0 : bool8    
         1 : int32    
         0 : int64    
         0 : float64  
         0 : float64  
         0 : float64  
         2 : string   
Rereading 1 columns due to out-of-sample type exceptions.
Column 1 ("") bumped from 'bool8' to 'string' due to <<"Goulburn">> on row 0
[11] Read the data
  jumps=[0..360), chunk_size=1017828, total_size=366418143
Read 98%. ETA 00:00 
[12] Finalizing the datatable
Reread 22885380 rows x 1 columns in 00:00.480
Read 22885380 rows. Exactly what was estimated and allocated up front
=============================
   0.002s (  0%) Memory map 0.341GB file
   0.006s (  0%) sep=',' ncol=4 and header detection
   0.002s (  0%) Column type detection using 10027 sample rows
   0.476s ( 26%) Allocation of 22885380 rows x 4 cols (0.469GB)
   1.378s ( 74%) Reading 360 chunks of 0.971MB (63547 rows) using 12 threads
   =    0.001s (  0%) Finding first non-embedded \n after each jump
   +    0.192s ( 10%) Parse to row-major thread buffers
   +    0.833s ( 45%) Transpose
   +    0.352s ( 19%) Waiting
   0.480s ( 26%) Rereading 1 columns due to out-of-sample type exceptions
   1.864s        Total
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
  Using 12 threads (omp_get_max_threads()=12, nth=12)
  NAstrings = [<<>>]
  None of the NAstrings look like numbers.
  show progress = 1
  0/1 column will be read as boolean
[02] Opening the file
  Opening file SA2-by-DJZ-2011.csv
  File opened, size = 349.4MB (366418725 bytes).
  Memory mapping ... ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
  \r-only line endings are not allowed because \n is found in the data
[05] Skipping initial rows if needed
  Positioned on line 1 starting: <<Australian Bureau of Statistic>>
[06] Detect separator, quoting rule, and ncolumns
  Detecting sep ...
  sep=','  with 89 lines of 4 fields using quote rule 0
  Detected 4 columns on line 12. This line is either column names or first data row. Line starts as: <<"Goulburn","110018063",3499,>>
  Quote rule picked = 0
  fill=false and the most number of columns found is 4
[07] Detect column types, good nrow estimate and whether first row is column names
  'header' changed by user from 'auto' to false
  Number of sampling jump points = 101 because (366418375 bytes from row 1 to eof) / (2 * 1457 jump0size) == 125744
  Type codes (jump 000)    : 1551  Quote rule 0
  Type codes (jump 100)    : 11051  Quote rule 0
  =====
  Sampled 10027 rows (handled \n inside quoted fields) at 101 jump points
  Bytes from first data row on line 12 to the end of last row: 366418143
  Line length: mean=16.02 sd=0.21 min=16 max=29
  Estimated number of rows: 366418143 / 16.02 = 22877178
  Initial alloc = 25164895 rows (22877178 + 9%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
  =====
[08] Assign column names
[09] Apply user overrides on column types
  After 0 type and 0 drop user overrides : 11051
[10] Allocate memory for the datatable
  Allocating 4 column slots (4 - 0 dropped) with 25164895 rows
[11] Read the data
  jumps=[0..360), chunk_size=1017828, total_size=366418143
[12] Finalizing the datatable
Read 22885380 rows x 4 columns from 349.4MB (366418725 bytes) file in 00:01.374 wall clock time
Thread buffers were grown 0 times (if all 12 threads each grew once, this figure would be 12)
Final type counts
         0 : drop     
         1 : bool8    
         0 : bool8    
         0 : bool8    
         0 : bool8    
         1 : int32    
         0 : int64    
         0 : float64  
         0 : float64  
         0 : float64  
         2 : string   
Rereading 1 columns due to out-of-sample type exceptions.
Column 1 ("") bumped from 'bool8' to 'string' due to <<"Goulburn">> on row 0
[11] Read the data
  jumps=[0..360), chunk_size=1017828, total_size=366418143
Read 98%. ETA 00:00 
[12] Finalizing the datatable
Reread 22885380 rows x 1 columns in 00:00.507
Read 22885380 rows. Exactly what was estimated and allocated up front
=============================
   0.002s (  0%) Memory map 0.341GB file
   0.006s (  0%) sep=',' ncol=4 and header detection
   0.001s (  0%) Column type detection using 10027 sample rows
   0.311s ( 17%) Allocation of 22885380 rows x 4 cols (0.469GB)
   1.562s ( 83%) Reading 360 chunks of 0.971MB (63547 rows) using 12 threads
   =    0.001s (  0%) Finding first non-embedded \n after each jump
   +    0.193s ( 10%) Parse to row-major thread buffers
   +    0.988s ( 52%) Transpose
   +    0.381s ( 20%) Waiting
   0.507s ( 27%) Rereading 1 columns due to out-of-sample type exceptions
   1.881s        Total
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
  Using 12 threads (omp_get_max_threads()=12, nth=12)
  NAstrings = [<<>>]
  None of the NAstrings look like numbers.
  show progress = 1
  0/1 column will be read as boolean
[02] Opening the file
  Opening file SA2-by-DJZ-2011.csv
  File opened, size = 349.4MB (366418725 bytes).
  Memory mapping ... ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
  \r-only line endings are not allowed because \n is found in the data
[05] Skipping initial rows if needed
  Positioned on line 1 starting: <<Australian Bureau of Statistic>>
[06] Detect separator, quoting rule, and ncolumns
  Detecting sep ...
  sep=','  with 89 lines of 4 fields using quote rule 0
  Detected 4 columns on line 12. This line is either column names or first data row. Line starts as: <<"Goulburn","110018063",3499,>>
  Quote rule picked = 0
  fill=false and the most number of columns found is 4
[07] Detect column types, good nrow estimate and whether first row is column names
  'header' changed by user from 'auto' to false
  Number of sampling jump points = 101 because (366418375 bytes from row 1 to eof) / (2 * 1457 jump0size) == 125744
  Type codes (jump 000)    : 1551  Quote rule 0
  Type codes (jump 100)    : 11051  Quote rule 0
  =====
  Sampled 10027 rows (handled \n inside quoted fields) at 101 jump points
  Bytes from first data row on line 12 to the end of last row: 366418143
  Line length: mean=16.02 sd=0.21 min=16 max=29
  Estimated number of rows: 366418143 / 16.02 = 22877178
  Initial alloc = 25164895 rows (22877178 + 9%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
  =====
[08] Assign column names
[09] Apply user overrides on column types
  After 0 type and 0 drop user overrides : 11051
[10] Allocate memory for the datatable
  Allocating 4 column slots (4 - 0 dropped) with 25164895 rows
[11] Read the data
  jumps=[0..360), chunk_size=1017828, total_size=366418143
[12] Finalizing the datatable
Read 22885380 rows x 4 columns from 349.4MB (366418725 bytes) file in 00:01.318 wall clock time
Thread buffers were grown 0 times (if all 12 threads each grew once, this figure would be 12)
Final type counts
         0 : drop     
         1 : bool8    
         0 : bool8    
         0 : bool8    
         0 : bool8    
         1 : int32    
         0 : int64    
         0 : float64  
         0 : float64  
         0 : float64  
         2 : string   
Rereading 1 columns due to out-of-sample type exceptions.
Column 1 ("") bumped from 'bool8' to 'string' due to <<"Goulburn">> on row 0
[11] Read the data
  jumps=[0..360), chunk_size=1017828, total_size=366418143
Read 96%. ETA 00:00 
[12] Finalizing the datatable
Reread 22885380 rows x 1 columns in 00:00.493
Read 22885380 rows. Exactly what was estimated and allocated up front
=============================
   0.002s (  0%) Memory map 0.341GB file
   0.006s (  0%) sep=',' ncol=4 and header detection
   0.001s (  0%) Column type detection using 10027 sample rows
   0.306s ( 17%) Allocation of 22885380 rows x 4 cols (0.469GB)
   1.496s ( 83%) Reading 360 chunks of 0.971MB (63547 rows) using 12 threads
   =    0.001s (  0%) Finding first non-embedded \n after each jump
   +    0.193s ( 11%) Parse to row-major thread buffers
   +    0.935s ( 52%) Transpose
   +    0.367s ( 20%) Waiting
   0.493s ( 27%) Rereading 1 columns due to out-of-sample type exceptions
   1.811s        Total
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
  Using 12 threads (omp_get_max_threads()=12, nth=12)
  NAstrings = [<<>>]
  None of the NAstrings look like numbers.
  show progress = 1
  0/1 column will be read as boolean
[02] Opening the file
  Opening file SA2-by-DJZ-2011.csv
  File opened, size = 349.4MB (366418725 bytes).
  Memory mapping ... ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
  \r-only line endings are not allowed because \n is found in the data
[05] Skipping initial rows if needed
  Positioned on line 1 starting: <<Australian Bureau of Statistic>>
[06] Detect separator, quoting rule, and ncolumns
  Detecting sep ...
  sep=','  with 89 lines of 4 fields using quote rule 0
  Detected 4 columns on line 12. This line is either column names or first data row. Line starts as: <<"Goulburn","110018063",3499,>>
  Quote rule picked = 0
  fill=false and the most number of columns found is 4
[07] Detect column types, good nrow estimate and whether first row is column names
  'header' changed by user from 'auto' to false
  Number of sampling jump points = 101 because (366418375 bytes from row 1 to eof) / (2 * 1457 jump0size) == 125744
  Type codes (jump 000)    : 1551  Quote rule 0
  Type codes (jump 100)    : 11051  Quote rule 0
  =====
  Sampled 10027 rows (handled \n inside quoted fields) at 101 jump points
  Bytes from first data row on line 12 to the end of last row: 366418143
  Line length: mean=16.02 sd=0.21 min=16 max=29
  Estimated number of rows: 366418143 / 16.02 = 22877178
  Initial alloc = 25164895 rows (22877178 + 9%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
  =====
[08] Assign column names
[09] Apply user overrides on column types
  After 0 type and 0 drop user overrides : 11051
[10] Allocate memory for the datatable
  Allocating 4 column slots (4 - 0 dropped) with 25164895 rows
[11] Read the data
  jumps=[0..360), chunk_size=1017828, total_size=366418143
[12] Finalizing the datatable
Read 22885380 rows x 4 columns from 349.4MB (366418725 bytes) file in 00:01.141 wall clock time
Thread buffers were grown 0 times (if all 12 threads each grew once, this figure would be 12)
Final type counts
         0 : drop     
         1 : bool8    
         0 : bool8    
         0 : bool8    
         0 : bool8    
         1 : int32    
         0 : int64    
         0 : float64  
         0 : float64  
         0 : float64  
         2 : string   
Rereading 1 columns due to out-of-sample type exceptions.
Column 1 ("") bumped from 'bool8' to 'string' due to <<"Goulburn">> on row 0
[11] Read the data
  jumps=[0..360), chunk_size=1017828, total_size=366418143
Read 98%. ETA 00:00 
[12] Finalizing the datatable
Reread 22885380 rows x 1 columns in 00:00.506
Read 22885380 rows. Exactly what was estimated and allocated up front
=============================
   0.002s (  0%) Memory map 0.341GB file
   0.007s (  0%) sep=',' ncol=4 and header detection
   0.001s (  0%) Column type detection using 10027 sample rows
   0.132s (  8%) Allocation of 22885380 rows x 4 cols (0.469GB)
   1.506s ( 91%) Reading 360 chunks of 0.971MB (63547 rows) using 12 threads
   =    0.001s (  0%) Finding first non-embedded \n after each jump
   +    0.195s ( 12%) Parse to row-major thread buffers
   +    0.938s ( 57%) Transpose
   +    0.371s ( 23%) Waiting
   0.506s ( 31%) Rereading 1 columns due to out-of-sample type exceptions
   1.647s        Total
Warning: stack imbalance in 'for', 2 then 8
There were 20 warnings (use warnings() to see them)

Oddly enough, the certainty is great. Thanks. So the flush didn't work, and ultimately we'll have to find a way to avoid Rprintf. I'm assuming that verbose=FALSE, showProgress=FALSE works reliably (you wrote that near the top of this issue, so I'm relying on it). By "reliably" I mean 10 runs in a row with only the 2 expected warnings and no sign of the stack imbalance warning.
Leave it with me then. Thanks again.
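
The "10 runs in a row" check described above could be scripted roughly as follows. This is only a sketch: `check_runs` is a hypothetical helper, not part of data.table, and it assumes the stack imbalance notice reaches R's message stream so that `capture.output(type = "message")` can see it. A hard crash of the R session cannot be detected this way.

```r
# Hypothetical helper: call `read_fun` n times and summarise what each run emitted.
check_runs <- function(read_fun, n = 10L) {
  runs <- lapply(seq_len(n), function(i) {
    warns <- character()
    # Capture the message stream so console-only notices can be inspected too.
    msgs <- capture.output(
      withCallingHandlers(
        invisible(read_fun()),
        warning = function(w) {
          # Record the warning text, then suppress it and carry on.
          warns <<- c(warns, conditionMessage(w))
          invokeRestart("muffleWarning")
        }
      ),
      type = "message"
    )
    list(
      n_warnings = length(warns),
      imbalance  = any(grepl("stack imbalance", c(warns, msgs), fixed = TRUE))
    )
  })
  data.frame(
    run        = seq_len(n),
    n_warnings = vapply(runs, `[[`, integer(1), "n_warnings"),
    imbalance  = vapply(runs, `[[`, logical(1), "imbalance")
  )
}

# Intended use (assumes data.table is installed and the CSV is in the working directory):
# check_runs(function() data.table::fread("SA2-by-DJZ-2011.csv", header = FALSE,
#                                         na.strings = "", verbose = FALSE,
#                                         showProgress = FALSE))
```

Under the definition above, a run counts as reliable when `n_warnings` is 2 and `imbalance` is `FALSE` in every row.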

@HughParsonage OK, please try again with the latest second attempt. It isn't merged to master yet, so please use the branch build here. As before, please provide the full output either way so I can check. Thanks!

The first attempt at the following resulted in a crash (something about a pointer).

The second attempt (after a reboot) produces the warning stack imbalance in '$', 16 then 15.

# Assert that `data.table` is not installed:
stopifnot(!requireNamespace("data.table", quietly = TRUE))

install.packages("https://ci.appveyor.com/api/buildjobs/bpsehtwybbbgbyy3/artifacts/data.table_1.10.5.zip",
                 repos = NULL)
# Installing package into ‘C:/Users/hughp/Documents/R/win-library/3.4’
# (as ‘lib’ is unspecified)
# trying URL 'https://ci.appveyor.com/api/buildjobs/bpsehtwybbbgbyy3/artifacts/data.table_1.10.5.zip'
# Content type 'application/octet-stream' length 1557502 bytes (1.5 MB)
# downloaded 1.5 MB

library(data.table)
# data.table 1.10.5 IN DEVELOPMENT built 2017-11-17 01:38:17 UTC; appveyor
# The fastest way to learn (by data.table authors): https://www.datacamp.com/courses/data-analysis-the-data-table-way
# Documentation: ?data.table, example(data.table) and browseVignettes("data.table")
# Release notes, videos and slides: http://r-datatable.com

setwd("~/ABS-data/inbox/SA2-by-DJZ-2011/")
fread("SA2-by-DJZ-2011.csv", verbose = TRUE, na.strings = "", header = FALSE)
# Input contains no \n. Taking this to be a filename to open
# [01] Check arguments
# Using 12 threads (omp_get_max_threads()=12, nth=12)
# NAstrings = [<<>>]
# None of the NAstrings look like numbers.
# show progress = 1
# 0/1 column will be read as boolean
# [02] Opening the file
# Opening file SA2-by-DJZ-2011.csv
# File opened, size = 349.4MB (366418725 bytes).
# Memory mapping ... ok
# [03] Detect and skip BOM
# [04] Arrange mmap to be \0 terminated
# \r-only line endings are not allowed because \n is found in the data
# [05] Skipping initial rows if needed
# Positioned on line 1 starting: <<Australian Bureau of Statistic>>
#   [06] Detect separator, quoting rule, and ncolumns
# Detecting sep ...
# sep=','  with 89 lines of 4 fields using quote rule 0
# Detected 4 columns on line 12. This line is either column names or first data row. Line starts as: <<"Goulburn","110018063",3499,>>
#   Quote rule picked = 0
# fill=false and the most number of columns found is 4
# [07] Detect column types, good nrow estimate and whether first row is column names
# 'header' changed by user from 'auto' to false
# Number of sampling jump points = 101 because (366418375 bytes from row 1 to eof) / (2 * 1457 jump0size) == 125744
# Type codes (jump 000)    : 1551  Quote rule 0
# Type codes (jump 100)    : 1A51  Quote rule 0
# =====
#   Sampled 10027 rows (handled \n inside quoted fields) at 101 jump points
# Bytes from first data row on line 12 to the end of last row: 366418143
# Line length: mean=16.02 sd=0.21 min=16 max=29
# Estimated number of rows: 366418143 / 16.02 = 22877178
# Initial alloc = 25164895 rows (22877178 + 9%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
# =====
#   [08] Assign column names
# [09] Apply user overrides on column types
# After 0 type and 0 drop user overrides : 1A51
# [10] Allocate memory for the datatable
# Allocating 4 column slots (4 - 0 dropped) with 25164895 rows
# [11] Read the data
# jumps=[0..360), chunk_size=1017828, total_size=366418143
# Read 78%. ETA 00:00 Warning: stack imbalance in '$', 16 then 15
# Read 22885380 rows x 4 columns from 349.4MB (366418725 bytes) file in 00:02.677 wall clock time
# [12] Finalizing the datatable
# Type counts:
#   1 : bool8     '1'
# 1 : int32     '5'
# 2 : string    'A'
# =============================
#   0.002s (  0%) Memory map 0.341GB file
# 0.007s (  0%) sep=',' ncol=4 and header detection
# 0.001s (  0%) Column type detection using 10027 sample rows
# 0.297s ( 11%) Allocation of 25164895 rows x 4 cols (0.469GB) of which 22885380 ( 91%) rows used
# 2.369s ( 89%) Reading 360 chunks of 0.971MB (63547 rows) using 12 threads
# =    0.003s (  0%) Finding first non-embedded \n after each jump
# +    0.273s ( 10%) Parse to row-major thread buffers (grown 0 times)
# +    1.313s ( 49%) Transpose
# +    0.780s ( 29%) Waiting
# 0.893s ( 33%) Rereading 1 columns due to out-of-sample type exceptions
# 2.677s        Total
# Column 1 ("") bumped from 'bool8' to 'string' due to <<"Goulburn">> on row 0
# V1        V2      V3 V4
# 1: Goulburn 110018063    3499 NA
# 2:       NA 110018064     812 NA
# 3:       NA 110018065    2158 NA
# 4:       NA 110019999     402 NA
# 5:       NA 110028068      10 NA
# ---                              
#   22885376:       NA 997999799       0 NA
# 22885377:       NA 998999899      64 NA
# 22885378:       NA 994999499      34 NA
# 22885379:       NA 0&&&&&&&&  250796 NA
# 22885380:       NA 0@@@@@@@@ 7305367 NA
# Warning messages:
#   1: In fread("SA2-by-DJZ-2011.csv", verbose = TRUE, na.strings = "",  :
#                 Starting data input on line 12 <<"Goulburn","110018063",3499,>> with 4 fields and discarding line 11 <<"Main Statistical Area Structu>> before it because it has a different number of fields (3).
#               2: In fread("SA2-by-DJZ-2011.csv", verbose = TRUE, na.strings = "",  :
#               Found the last consistent line but text exists afterwards. Consider fill=TRUE and/or blank.lines.skip=TRUE. First 200 characters of discarded line: <<"Dataset: 2011 Census of Population and Housing">>

Hello, @mattdowle. There are GCC versions still in use whose OpenMP is not 4.0 but at most 3.1. I ran into that problem with one of my packages on CRAN (Delaporte): I tried using SIMD directives (OpenMP 4.0), but it threw an error on someone's Linux system still using GCC 4.8.0. Rtools for Windows compiles with GCC (as of 4.9.3), so if I remember correctly even Windows can only use 4.0 calls, not 4.5. Could that be the cause of the problem?

@HughParsonage Thanks for testing so quickly! OK, I'll keep thinking then!
@aadler Good thought. Anything is possible.

@HughParsonage Just to confirm that the same command with a single change (verbose=FALSE) works properly, i.e. fread("SA2-by-DJZ-2011.csv", verbose = FALSE, na.strings = "", header = FALSE). The progress meter will still be displayed.

Yes, running that command (10 times) returned the expected result (i.e. a data.table with only the two warnings, which are due to the malformed file). No stack imbalance warnings.

Thanks. So it seems to be related to console output. A few more things to try ...

In verbose mode there are a few branches inside the parallel region that call wallclock(). To rule that out, I've short-circuited it to always return 0.0 and avoid the system call. I thought it was thread safe, but maybe it isn't. Please try the new Windows .zip from the branch, rebuilt here.

First attempt:

install.packages("https://ci.appveyor.com/api/buildjobs/o0pn9ttkrbqgqw2k/artifacts/data.table_1.10.5.zip",
                 repos = NULL)
# Installing package into ‘C:/Users/hughp/Documents/R/win-library/3.4’
# (as ‘lib’ is unspecified)
# trying URL 'https://ci.appveyor.com/api/buildjobs/o0pn9ttkrbqgqw2k/artifacts/data.table_1.10.5.zip'
# Content type 'application/octet-stream' length 1556972 bytes (1.5 MB)
# downloaded 1.5 MB
# 
# package ‘data.table’ successfully unpacked and MD5 sums checked

library(data.table)
# data.table 1.10.5 IN DEVELOPMENT built 2017-11-17 03:49:20 UTC; appveyor
# The fastest way to learn (by data.table authors): https://www.datacamp.com/courses/data-analysis-the-data-table-way
# Documentation: ?data.table, example(data.table) and browseVignettes("data.table")
# Release notes, videos and slides: http://r-datatable.com
fread("SA2-by-DJZ-2011.csv", verbose = TRUE, na.strings = "", header = FALSE)

image

On the second attempt, I get the following warnings.

Read 22%. ETA 00:00 Error in fread("SA2-by-DJZ-2011.csv", verbose = TRUE, na.strings = "",  : 
  unprotect_ptr: pointer not found
In addition: Warning messages:
1: In fread("SA2-by-DJZ-2011.csv", verbose = TRUE, na.strings = "",  :
  Starting data input on line 12 <<"Goulburn","110018063",3499,>> with 4 fields and discarding line 11 <<"Main Statistical Area Structu>> before it because it has a different number of fields (3).
2: In fread("SA2-by-DJZ-2011.csv", verbose = TRUE, na.strings = "",  :
 Warning: stack imbalance in '$', 29 then 28
 Found the last consistent line but text exists afterwards. Consider fill=TRUE and/or blank.lines.skip=TRUE. First 200 characters of discarded line: <<"Dataset: 2011 Census of Population and Housing">>
Warning: stack imbalance in 'lapply', 125 then 126
Warning: stack imbalance in 'lapply', 55 then 53
Warning: stack imbalance in 'lapply', 30 then 34
Warning: stack imbalance in '<-', 28 then 31
Warning: stack imbalance in '{', 24 then 27
Warning: stack imbalance in '{', 18 then 21

Thought: could this be an issue with RStudio? Running the script from the terminal does not seem to reproduce it as easily. I've been running it in RStudio because it is easier to copy the console output.

When you say it doesn't reproduce _as easily_ outside RStudio, does it reproduce _at all_? Even if it only happens within RStudio, I'd still want to fix it on the data.table side. I'm asking as another route to confirm that it definitely is "just" the console output and not some other genuine stack imbalance in the fread logic.

I have not yet reproduced it outside RStudio at all, whereas inside RStudio I can reproduce it reliably (that is, I can reproduce some warning or crash). I tried the Windows command prompt and git shell (Windows).

I'm using RStudio version 1.1.383 on Windows. Would it help if I raised this issue with them, or would you prefer that I wait?

Thanks. It's really useful to know that it happens inside RStudio. No need to raise it with them. This means it is to do with output console buffering (or similar). I've made progress and will push soon.

I can't see why Windows isn't compiling the change.
fread.c:1054:3: warning: too many arguments for format [-Wformat-extra-args]
It works fine on Linux and Travis. Because of this, the Windows.zip needed to test this workaround isn't being created. I need to get some sleep.
(It complains about line 1054 but not about the very next line, 1055, so the difference between them must be slight. %llu on Windows? A __VA_ARGS__ problem? Surely not.)

OK, finally the windows.zip is ready to try again, here.

There are currently several workarounds in this branch. If it works, I'll back the workarounds out one by one to see which one it was. The llu compiler warning looks the most promising, since it would cause a stack imbalance in the verbose output, which matches the description @st-pasha found here. Probably the Rprintf layer had been hiding it from the compiler. Now that fprintf is used directly, the compiler can see it.
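The `-Wformat-extra-args` warning and the `%llu` suspicion above both concern printf-style format strings whose conversions do not match the arguments passed. As a minimal sketch of one portable way to print a 64-bit count in C, the `PRIu64` macro from `<inttypes.h>` can be used instead of hard-coding `%llu`. The helper `format_bytes` is hypothetical and is not taken from fread.c; whether the actual fix took this form is not stated in the thread.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper: writes "size = N bytes" into buf and returns the
   number of characters snprintf reports. PRIu64 expands to the correct
   conversion for uint64_t on the platform being compiled for. */
static int format_bytes(char *buf, size_t buflen, uint64_t nbytes)
{
    return snprintf(buf, buflen, "size = %" PRIu64 " bytes", nbytes);
}
```

Called with the file size from the log, `format_bytes(buf, sizeof buf, 366418725)` fills `buf` with `size = 366418725 bytes`.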

image

On the second attempt (after a reboot):

stopifnot(!requireNamespace("data.table", quietly = TRUE))
install.packages("https://ci.appveyor.com/api/buildjobs/1pi0ae5iuyj9rhj8/artifacts/data.table_1.10.5.zip",
                 repos = NULL)
# Installing package into ‘C:/Users/hughp/Documents/R/win-library/3.4’
# (as ‘lib’ is unspecified)
# trying URL 'https://ci.appveyor.com/api/buildjobs/1pi0ae5iuyj9rhj8/artifacts/data.table_1.10.5.zip'
# Content type 'application/octet-stream' length 1559167 bytes (1.5 MB)
# downloaded 1.5 MB
# 
# package ‘data.table’ successfully unpacked and MD5 sums checked

library(data.table)
# data.table 1.10.5 IN DEVELOPMENT built 2017-11-18 04:58:23 UTC; appveyor
# The fastest way to learn (by data.table authors): https://www.datacamp.com/courses/data-analysis-the-data-table-way
# Documentation: ?data.table, example(data.table) and browseVignettes("data.table")
# Release notes, videos and slides: http://r-datatable.com

fread("SA2-by-DJZ-2011.csv", verbose = TRUE, na.strings = "", header = FALSE)
Log file:  C:\Users\hughp\AppData\Local\Temp\RtmpIT9H0D/fread.out 
Input contains no \n. Taking this to be a filename to open
Read 11%. ETA 00:00 Warning: stack imbalance in '$', 19 then 20
Read 28%. ETA 00:00 Warning: stack imbalance in '$', 19 then 20
Read 48%. ETA 00:00 Warning: stack imbalance in '$', 20 then 19
Read 98%. ETA 00:00 [01] Check arguments
  Using 12 threads (omp_get_max_threads()=12, nth=12)
  NAstrings = [<<>>]
  None of the NAstrings look like numbers.
  show progress = 1
  0/1 column will be read as boolean
[02] Opening the file
  Opening file SA2-by-DJZ-2011.csv
  File opened, size = 349.4MB (366418725 bytes).
  Memory mapped ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
  \n has been found in the data so any mixture of line endings is allowed other than \r-only line endings. This is common and ideal.
[05] Skipping initial rows if needed
  Positioned on line 1 starting: <<Australian Bureau of Statistic>>
[06] Detect separator, quoting rule, and ncolumns
  Detecting sep ...
  sep=','  with 89 lines of 4 fields using quote rule 0
  Detected 4 columns on line 12. This line is either column names or first data row. Line starts as: <<"Goulburn","110018063",3499,>>
  Quote rule picked = 0
  fill=false and the most number of columns found is 4
[07] Detect column types, good nrow estimate and whether first row is column names
  'header' changed by user from 'auto' to false
  Number of sampling jump points = 101 because (366418375 bytes from row 1 to eof) / (2 * 1457 jump0size) == 125744
  Type codes (jump 000)    : 1551  Quote rule 0
  Type codes (jump 100)    : 1A51  Quote rule 0
  =====
  Sampled 10027 rows (handled \n inside quoted fields) at 101 jump points
  Bytes from first data row on line 12 to the end of last row: 366418143
  Line length: mean=16.02 sd=0.21 min=16 max=29
  Estimated number of rows: 366418143 / 16.02 = 22877178
  Initial alloc = 25164895 rows (22877178 + 9%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
  =====
[08] Assign column names
[09] Apply user overrides on column types
  After 0 type and 0 drop user overrides : 1A51
[10] Allocate memory for the datatable
  Allocating 4 column slots (4 - 0 dropped) with 25164895 rows
[11] Read the data
  jumps=[0..360), chunk_size=1017828, total_size=366418143
[11] Read the data
  jumps=[0..360), chunk_size=1017828, total_size=366418143
Read 22885380 rows x 4 columns from 349.4MB (366418725 bytes) file in 00:02.822 wall clock time
[12] Finalizing the datatable
  Type counts:
         1 : bool8     '1'
         1 : int32     '5'
         2 : string    'A'
=============================
   0.000s (  0%) Memory map 0.341GB file
   0.001s (  0%) sep=',' ncol=4 and header detection
   0.000s (  0%) Column type detection using 10027 sample rows
   0.291s ( 10%) Allocation of 25164895 rows x 4 cols (0.469GB) of which 22885380 ( 91%) rows used
   2.531s ( 90%) Reading 360 chunks of 0.971MB (63547 rows) using 12 threads
   =    0.002s (  0%) Finding first non-embedded \n after each jump
   +    0.282s ( 10%) Parse to row-major thread buffers (grown 0 times)
   +    1.537s ( 54%) Transpose
   +    0.710s ( 25%) Waiting
   0.842s ( 30%) Rereading 1 columns due to out-of-sample type exceptions
   2.822s        Total
Column 1 ("") bumped from 'bool8' to 'string' due to <<"Goulburn">> on row 0
                V1        V2      V3 V4
       1: Goulburn 110018063    3499 NA
       2:       NA 110018064     812 NA
       3:       NA 110018065    2158 NA
       4:       NA 110019999     402 NA
       5:       NA 110028068      10 NA
      ---                              
22885376:       NA 997999799       0 NA
22885377:       NA 998999899      64 NA
22885378:       NA 994999499      34 NA
22885379:       NA 0&&&&&&&&  250796 NA
22885380:       NA 0@@@@@@@@ 7305367 NA
Warning messages:
1: In fread("SA2-by-DJZ-2011.csv", verbose = TRUE, na.strings = "",  :
  Starting data input on line 12 <<"Goulburn","110018063",3499,>> with 4 fields and discarding line 11 <<"Main Statistical Area Structu>> before it because it has a different number of fields (3).
2: In fread("SA2-by-DJZ-2011.csv", verbose = TRUE, na.strings = "",  :
  Found the last consistent line but text exists afterwards. Consider fill=TRUE and/or blank.lines.skip=TRUE. First 200 characters of discarded line: <<"Dataset: 2011 Census of Population and Housing">>

Again, I can't reproduce it outside RStudio.

Thanks for testing so quickly. Well, that certainly rules out a lot then! Two ideas left. The first is pushed and has passed. Please try the new Windows.zip here. That alloca is on the stack and it is to do with na.strings, which you happen to be setting. It is certainly in the right area (stack imbalance) and worth a try.

No worries. I'll be away for the next 12 hours, so I won't be able to test until then.
On Sat, 18 Nov 2017 at 5:20 pm, Matt Dowle [email protected] wrote:

Thanks for testing so quickly. Well, that certainly rules out a lot then!
Two ideas left. The first is pushed and has passed. Please try the new Windows.zip
here
https://ci.appveyor.com/project/Rdatatable/data-table/build/1.0.1363/job/fo02vnbu5ebhwy3w/artifacts .
That alloca is on the stack and it is to do with na.strings,
which you happen to be setting. It is certainly in the right area (stack
imbalance) and worth a try.


OK, no worries. Thanks! I've now pushed the second idea too. I seem to remember \r causing problems on Windows in the past, although not a stack imbalance. Anyway, to rule that out I've removed the \r from the progress meter. The stack imbalance messages seem to be printed where the ETA lines occur. It's possible that the console catches the \r and treats it differently so that the last line gets replaced. Now there will be a new line each time the ETA is updated. That's temporary, just to rule it out. The new Windows.zip has built and passed here.

fread("SA2-by-DJZ-2011.csv", verbose = TRUE, na.strings = "", header = FALSE)
Log file:  C:\Users\hughp\AppData\Local\Temp\RtmpcVjZ1f/fread.out 
Input contains no \n. Taking this to be a filename to open
Read 5%. ETA 00:00
Read 8%. ETA 00:00
Read 11%. ETA 00:00
Read 15%. ETA 00:00
Read 18%. ETA 00:00
Read 21%. ETA 00:00
Read 25%. ETA 00:00
Read 28%. ETA 00:00
Read 31%. ETA 00:00
Read 35%. ETA 00:00
Read 38%. ETA 00:00
Read 41%. ETA 00:00
Read 45%. ETA 00:00
Read 48%. ETA 00:00
Read 51%. ETA 00:00
Read 55%. ETA 00:00
Warning: stack imbalance in '$', 30 then 31
Warning: stack imbalance in '$', 17 then 16
Read 58%. ETA 00:00
Read 61%. ETA 00:00
Read 65%. ETA 00:00
Read 68%. ETA 00:00
Read 71%. ETA 00:00
Read 75%. ETA 00:00
Read 78%. ETA 00:00
Read 81%. ETA 00:00
Read 85%. ETA 00:00
Read 88%. ETA 00:00
Read 91%. ETA 00:00
Read 95%. ETA 00:00
Read 98%. ETA 00:00
[01] Check arguments
  Using 12 threads (omp_get_max_threads()=12, nth=12)
  NAstrings = [<<>>]
  None of the NAstrings look like numbers.
  show progress = 1
  0/1 column will be read as boolean
[02] Opening the file
  Opening file SA2-by-DJZ-2011.csv
  File opened, size = 349.4MB (366418725 bytes).
  Memory mapped ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
  \n has been found in the data so any mixture of line endings is allowed other than \r-only line endings. This is common and ideal.
[05] Skipping initial rows if needed
  Positioned on line 1 starting: <<Australian Bureau of Statistic>>
[06] Detect separator, quoting rule, and ncolumns
  Detecting sep ...
  sep=','  with 89 lines of 4 fields using quote rule 0
  Detected 4 columns on line 12. This line is either column names or first data row. Line starts as: <<"Goulburn","110018063",3499,>>
  Quote rule picked = 0
  fill=false and the most number of columns found is 4
[07] Detect column types, good nrow estimate and whether first row is column names
  'header' changed by user from 'auto' to false
  Number of sampling jump points = 101 because (366418375 bytes from row 1 to eof) / (2 * 1457 jump0size) == 125744
  Type codes (jump 000)    : 1551  Quote rule 0
  Type codes (jump 100)    : 1A51  Quote rule 0
  =====
  Sampled 10027 rows (handled \n inside quoted fields) at 101 jump points
  Bytes from first data row on line 12 to the end of last row: 366418143
  Line length: mean=16.02 sd=0.21 min=16 max=29
  Estimated number of rows: 366418143 / 16.02 = 22877178
  Initial alloc = 25164895 rows (22877178 + 9%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
  =====
[08] Assign column names
[09] Apply user overrides on column types
  After 0 type and 0 drop user overrides : 1A51
[10] Allocate memory for the datatable
  Allocating 4 column slots (4 - 0 dropped) with 25164895 rows
[11] Read the data
  jumps=[0..360), chunk_size=1017828, total_size=366418143
[11] Read the data
  jumps=[0..360), chunk_size=1017828, total_size=366418143
Read 22885380 rows x 4 columns from 349.4MB (366418725 bytes) file in 00:02.894 wall clock time
[12] Finalizing the datatable
  Type counts:
         1 : bool8     '1'
         1 : int32     '5'
         2 : string    'A'
=============================
   0.001s (  0%) Memory map 0.341GB file
   0.003s (  0%) sep=',' ncol=4 and header detection
   0.000s (  0%) Column type detection using 10027 sample rows
   0.316s ( 11%) Allocation of 25164895 rows x 4 cols (0.469GB) of which 22885380 ( 91%) rows used
   2.574s ( 89%) Reading 360 chunks of 0.971MB (63547 rows) using 12 threads
   =    0.004s (  0%) Finding first non-embedded \n after each jump
   +    0.284s ( 10%) Parse to row-major thread buffers (grown 0 times)
   +    1.450s ( 50%) Transpose
   +    0.837s ( 29%) Waiting
   0.953s ( 33%) Rereading 1 columns due to out-of-sample type exceptions
   2.894s        Total
Column 1 ("") bumped from 'bool8' to 'string' due to <<"Goulburn">> on row 0
                V1        V2      V3 V4
       1: Goulburn 110018063    3499 NA
       2:       NA 110018064     812 NA
       3:       NA 110018065    2158 NA
       4:       NA 110019999     402 NA
       5:       NA 110028068      10 NA
      ---                              
22885376:       NA 997999799       0 NA
22885377:       NA 998999899      64 NA
22885378:       NA 994999499      34 NA
22885379:       NA 0&&&&&&&&  250796 NA
22885380:       NA 0@@@@@@@@ 7305367 NA
Warning messages:
1: In fread("SA2-by-DJZ-2011.csv", verbose = TRUE, na.strings = "",  :
  Starting data input on line 12 <<"Goulburn","110018063",3499,>> with 4 fields and discarding line 11 <<"Main Statistical Area Structu>> before it because it has a different number of fields (3).
2: In fread("SA2-by-DJZ-2011.csv", verbose = TRUE, na.strings = "",  :
  Found the last consistent line but text exists afterwards. Consider fill=TRUE and/or blank.lines.skip=TRUE. First 200 characters of discarded line: <<"Dataset: 2011 Census of Population and Housing">>

Note: I cannot reproduce this stack imbalance error on a different Windows system that uses a slightly older version of RStudio.

In that case I think it's time to ask RStudio support, as you suggested. I've looked through the fread code again and I'm running out of ideas. Please give them the two RStudio version numbers. It doesn't necessarily mean it's RStudio; it could be a fault on the data.table side that only shows up in one version of RStudio. But it does seem to be related to console output, and it's odd that anything else would be specific to RStudio. I searched for "RStudio stack imbalance", but that turns up lots of issues about package faults rather than RStudio itself. It's a hard problem to search for. Let's leave the issue open here and see what they say.

I doubt this last attempt will help, but for completeness please try it here. Perhaps the MinGW compiler used on Windows does something strange with these two ints, one of which is optimized to the constant 0, and that causes the stack imbalance.

However, the specific stack imbalance message comes from R's own eval.c:491, and I don't think it is fread or data.table. check_stack_balance() is only called from 5 places inside R:
at the end of do_internal() in names.c
twice in applyMethod() in objects.c
twice in eval() in eval.c
I can't see how any of these could be reached while fread.c is in its parallel section. The only entry point being called is REprintf, and I can't see how that could reach check_stack_balance(). All I can think of at the moment is that RStudio has a thread doing something in the background that interacts with console output, perhaps differently on Windows.
Finally, for completeness: base R uses REprintf rather than Rprintf in its progress meters at libcurl.c:354 and internet.c:409, so REprintf seems to be the right approach. R's C-level progress bar is not available in R's API (and it appears to be implemented twice in R at C level, too).
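For readers unfamiliar with the message: the check compares the depth of R's pointer-protection stack saved before a call with the depth after it, and warns when the two differ, so the numbers in a message such as "19 then 20" are those two depths. The following toy model in C illustrates the idea only. It is not R's source, and every name in it (`toy_top`, `toy_protect`, `toy_unprotect`, `toy_balanced`, `toy_warning`) is hypothetical.

```c
#include <stdio.h>

/* Toy model only: a counter standing in for the depth of a protection stack. */
static int toy_top = 0;

static void toy_protect(void)    { toy_top++; }
static void toy_unprotect(int n) { toy_top -= n; }

/* Returns 1 when the depth now equals the depth saved before the call. */
static int toy_balanced(int saved)
{
    return saved == toy_top;
}

/* Formats a warning in the same shape as the messages quoted in this thread. */
static int toy_warning(char *buf, size_t len, const char *fn, int saved)
{
    return snprintf(buf, len, "Warning: stack imbalance in '%s', %d then %d",
                    fn, saved, toy_top);
}
```

In this model, anything that changes the counter between the save and the check would trigger the warning, even if the function being evaluated is itself balanced. That is consistent with the suspicion that something other than the evaluated call is touching the stack.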

@mattdowle, would this help? https://github.com/r-lib/progress

@aadler Yes - thanks! Its source contains the following comment:
// In R Studio we should print to stdout, because printing a \r
// to stderr is buggy (reported)
But I've already removed the \r and the stack imbalance still occurs. I wonder where it was reported.

The last build didn't work either.

image

Reported at https://community.rstudio.com/t/stack-imbalance-possibly-in-stderr/3009

A timely question on R-devel: [Rd] Are Rprintf and REprintf thread-safe?

Upshot: "Rprintf and REprintf are not thread-safe."

Yoiks!

Thanks for the link, and thanks to Hugh for raising the issue with RStudio.

Because Rprintf and REprintf are not thread-safe, data.table::fwrite() and data.table::fread() only call them from the master thread, for the progress meter. Not only do two data.table threads never call those R entry points at the same time, but only the master thread can ever call them, and they are the only R entry points ever called from any thread in the parallel section. However, Rprintf calls R_CheckUserInterrupt every 100 prints. I think that is the part that is unsafe even from the master thread alone. That is the reason for using REprintf, which does not call R_CheckUserInterrupt. R's internals use REprintf for progress meters, so switching to REprintf is also good for consistency with core R. In other words, the choice has nothing to do with stderr versus stdout as such.
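The pattern described above, where worker threads do the work and only the master thread writes progress, can be sketched in C with OpenMP as follows. This is an illustration under stated assumptions, not the fread.c implementation: `sum_with_progress` is a hypothetical function, the summation is a stand-in for per-chunk parsing, and `fprintf(stderr, ...)` stands in for REprintf so that the example compiles without R's headers.

```c
#include <stdio.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Sums 0..n-1 in parallel. Progress lines are written by thread 0 only;
   *messages receives how many lines that thread wrote. */
static long sum_with_progress(int n, int *messages)
{
    long total = 0;
    int written = 0;                 /* modified by thread 0 only */

    #pragma omp parallel for reduction(+:total)
    for (int i = 0; i < n; i++) {
        total += i;                  /* stand-in for real per-chunk work */

        int me = 0;                  /* stays 0 when built without OpenMP */
#ifdef _OPENMP
        me = omp_get_thread_num();
#endif
        if (me == 0 && i % 100 == 0) {
            /* fprintf stands in for REprintf (assumption for this sketch) */
            fprintf(stderr, "progress: item %d of %d\n", i, n);
            written++;
        }
    }
    *messages = written;
    return total;
}
```

How many progress lines appear depends on which iterations the runtime assigns to thread 0, so the message count varies between runs while the sum does not.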

@kevinushey Could you take a look through this thread and let me know of anything else I could try? Could it be something to do with RStudio? Is it related to background threads in any way? If RStudio has a background thread, Rprintf / REprintf could be called from two threads at the same time. But if that were the case we would have seen many more problems by now, so it seems unlikely. RStudio may replace the ptr_* callbacks described in R-exts, which relate to console output and interaction. However, that section starts "For Unix-alikes", so I don't see how Windows comes into it. Section 8.1.5, Threading issues, may also be relevant. Both are subsections of section 8: "Linking GUIs and other front-ends to R".

I'll be out until early December, so unfortunately I won't have a chance to look until then. However, RStudio runs almost everything on the main thread, using the R event loop. The only exceptions are things like project-level file indexing, and those background threads generally do not touch the R API.

RStudio does take over the various ptr_* callbacks for handling console input and output. I can't immediately think how they could be the culprit here, but I'll try to take a deeper look when I'm back.

OK, please try here. Previously the progress was updated every 2%. In your case the file takes just under 3 seconds, so that was a new progress update to the RStudio console every 0.06 seconds. That may have been too much for RStudio. So this attempt prints a bar instead. It does not use \r at all, which is also better for reports and log files, where \r can clutter the output.

Because the 3-second timing is so quick, I've reduced the threshold so that the progress bar starts after 1 second if there is an ETA of 1 second. Otherwise your file would have run with no bar displayed at all. Once you've tested it, I'll increase the threshold to match fwrite, i.e. start at 2 seconds if the ETA from there is 2 seconds.
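A bar of the kind described above can be drawn by only ever appending characters, so no \r is needed. The sketch below assumes a 50-character bar, matching the bar that appears in the output later in this thread. The function names are hypothetical and this is not the code in fread.c.

```c
#include <stdio.h>

#define BAR_WIDTH 50  /* assumed width, matching the 50-character bar in the log */

/* Returns how many '=' characters should be appended so that the bar
   reflects `fraction` (0.0 to 1.0), given `printed` characters so far. */
static int bar_chars_due(double fraction, int printed)
{
    if (fraction < 0.0) fraction = 0.0;
    if (fraction > 1.0) fraction = 1.0;
    int target = (int)(fraction * BAR_WIDTH);
    return target > printed ? target - printed : 0;
}

/* Appends the due characters to `out` and returns the new printed count.
   No '\r' is used, so the output only ever grows to the right. */
static int bar_advance(FILE *out, double fraction, int printed)
{
    int due = bar_chars_due(fraction, printed);
    for (int i = 0; i < due; i++) fputc('=', out);
    fflush(out);
    return printed + due;
}
```

A caller would print the `|----...----|` header line once, then call `bar_advance` whenever progress is checked. At most 50 characters are written in total, however often it is called.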

Hello, @mattdowle. The last comment in #2503 may be related to this issue too.

Looks good! No warnings (after 5 runs). The first run is below (the leading whitespace looks different in the actual output).

stopifnot(!requireNamespace("data.table", quietly = TRUE))
install.packages("https://ci.appveyor.com/api/buildjobs/1o9s06o31v8i3ljr/artifacts/data.table_1.10.5.zip",
                 repos = NULL)
# Installing package into ‘C:/Users/hughp/Documents/R/win-library/3.4’
# (as ‘lib’ is unspecified)
# trying URL 'https://ci.appveyor.com/api/buildjobs/1o9s06o31v8i3ljr/artifacts/data.table_1.10.5.zip'
# Content type 'application/octet-stream' length 1557423 bytes (1.5 MB)
# downloaded 1.5 MB
# 
# package ‘data.table’ successfully unpacked and MD5 sums checked

library(data.table)
# data.table 1.10.5 IN DEVELOPMENT built 2017-11-30 00:21:00 UTC; appveyor
# The fastest way to learn (by data.table authors): https://www.datacamp.com/courses/data-analysis-the-data-table-way
# Documentation: ?data.table, example(data.table) and browseVignettes("data.table")
# Release notes, videos and slides: http://r-datatable.com
setwd("~/ABS-data/inbox/SA2-by-DJZ-2011/")
fread("SA2-by-DJZ-2011.csv", verbose = TRUE, na.strings = "", header = FALSE)
# Input contains no \n. Taking this to be a filename to open
# [01] Check arguments
# Using 12 threads (omp_get_max_threads()=12, nth=12)
# NAstrings = [<<>>]
# None of the NAstrings look like numbers.
# show progress = 1
# 0/1 column will be read as boolean
# [02] Opening the file
# Opening file SA2-by-DJZ-2011.csv
# File opened, size = 349.4MB (366418725 bytes).
# Memory mapped ok
# [03] Detect and skip BOM
# [04] Arrange mmap to be \0 terminated
# \n has been found in the data so any mixture of line endings is allowed other than \r-only line endings. This is common and ideal.
# [05] Skipping initial rows if needed
# Positioned on line 1 starting: <<Australian Bureau of Statistic>>
#   [06] Detect separator, quoting rule, and ncolumns
# Detecting sep ...
# sep=','  with 89 lines of 4 fields using quote rule 0
# Detected 4 columns on line 12. This line is either column names or first data row. Line starts as: <<"Goulburn","110018063",3499,>>
#   Quote rule picked = 0
# fill=false and the most number of columns found is 4
# [07] Detect column types, good nrow estimate and whether first row is column names
# 'header' changed by user from 'auto' to false
# Number of sampling jump points = 101 because (366418375 bytes from row 1 to eof) / (2 * 1457 jump0size) == 125744
# Type codes (jump 000)    : 1551  Quote rule 0
# Type codes (jump 100)    : 1A51  Quote rule 0
# =====
#   Sampled 10027 rows (handled \n inside quoted fields) at 101 jump points
# Bytes from first data row on line 12 to the end of last row: 366418143
# Line length: mean=16.02 sd=0.21 min=16 max=29
# Estimated number of rows: 366418143 / 16.02 = 22877178
# Initial alloc = 25164895 rows (22877178 + 9%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
# =====
#   [08] Assign column names
# [09] Apply user overrides on column types
# After 0 type and 0 drop user overrides : 1A51
# [10] Allocate memory for the datatable
# Allocating 4 column slots (4 - 0 dropped) with 25164895 rows
# [11] Read the data
# jumps=[0..360), chunk_size=1017828, total_size=366418143
# jumps=[0..360), chunk_size=1017828, total_size=366418143
# |--------------------------------------------------|
#   |==================================================|
#   Read 22885380 rows x 4 columns from 349.4MB (366418725 bytes) file in 00:02.280 wall clock time
# [12] Finalizing the datatable
# Type counts:
#   1 : bool8     '1'
# 1 : int32     '5'
# 2 : string    'A'
# =============================
#   0.005s (  0%) Memory map 0.341GB file
# 0.037s (  2%) sep=',' ncol=4 and header detection
# 0.000s (  0%) Column type detection using 10027 sample rows
# 0.321s ( 14%) Allocation of 25164895 rows x 4 cols (0.469GB) of which 22885380 ( 91%) rows used
# 1.917s ( 84%) Reading 360 chunks of 0.971MB (63547 rows) using 12 threads
# =    0.011s (  0%) Finding first non-embedded \n after each jump
# +    0.560s ( 25%) Parse to row-major thread buffers (grown 0 times)
# +    0.488s ( 21%) Transpose
# +    0.858s ( 38%) Waiting
# 0.999s ( 44%) Rereading 1 columns due to out-of-sample type exceptions
# 2.280s        Total
# Column 1 ("") bumped from 'bool8' to 'string' due to <<"Goulburn">> on row 0
# V1        V2      V3 V4
# 1: Goulburn 110018063    3499 NA
# 2:       NA 110018064     812 NA
# 3:       NA 110018065    2158 NA
# 4:       NA 110019999     402 NA
# 5:       NA 110028068      10 NA
# ---                              
#   22885376:       NA 997999799       0 NA
# 22885377:       NA 998999899      64 NA
# 22885378:       NA 994999499      34 NA
# 22885379:       NA 0&&&&&&&&  250796 NA
# 22885380:       NA 0@@@@@@@@ 7305367 NA
# Warning messages:
#   1: In fread("SA2-by-DJZ-2011.csv", verbose = TRUE, na.strings = "",  :
#                 Starting data input on line 12 <<"Goulburn","110018063",3499,>> with 4 fields and discarding line 11 <<"Main Statistical Area Structu>> before it because it has a different number of fields (3).
#               2: In fread("SA2-by-DJZ-2011.csv", verbose = TRUE, na.strings = "",  :
#               Found the last consistent line but text exists afterwards. Consider fill=TRUE and/or blank.lines.skip=TRUE. First 200 characters of discarded line: <<"Dataset: 2011 Census of Population and Housing">>

@HughParsonage Relief! I think that's a win. I'll tidy up, merge and move on. Thanks very much for testing.

@aadler Yes, agreed, your comment in issue #2503 looks like the same thing. Could you test the latest dev version too and confirm whether it is now fixed? I'm hoping that the as.IDate problem you found there was actually caused by the earlier stack imbalance.

Not good :(

> library(data.table)
data.table 1.10.5 IN DEVELOPMENT built 2017-11-30 00:21:00 UTC; appveyor
  The fastest way to learn (by data.table authors): https://www.datacamp.com/courses/data-analysis-the-data-table-way
  Documentation: ?data.table, example(data.table) and browseVignettes("data.table")
  Release notes, videos and slides: http://r-datatable.com
> sessionInfo()
R version 3.4.2 beta (2017-09-17 r73296)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows Server >= 2012 x64 (build 9200)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252    LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                           LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] data.table_1.10.5

loaded via a namespace (and not attached):
[1] compiler_3.4.2 tools_3.4.2   
> DT <- fread('2017-11-22_1999_Performance.csv', header = TRUE, colClasses = CLS, select = SEL, key = c('LoanID', 'Month'), verbose = TRUE)
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
  Using 40 threads (omp_get_max_threads()=40, nth=40)
  NAstrings = [<<NA>>]
  None of the NAstrings look like numbers.
  show progress = 1
  0/1 column will be read as boolean
[02] Opening the file
  Opening file 2017-11-22_1999_Performance.csv
  File opened, size = 6.355GB (6823372783 bytes).
  Memory mapped ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
  \n has been found in the data so any mixture of line endings is allowed other than \r-only line endings. This is common and ideal.
[05] Skipping initial rows if needed
  Positioned on line 1 starting: <<Orig_Year,Orig_Qtr,LoanID,Mont>>
[06] Detect separator, quoting rule, and ncolumns
  Detecting sep ...
  sep=','  with 100 lines of 37 fields using quote rule 0
  Detected 37 columns on line 1. This line is either column names or first data row. Line starts as: <<Orig_Year,Orig_Qtr,LoanID,Mont>>
  Quote rule picked = 0
  fill=false and the most number of columns found is 37
[07] Detect column types, good nrow estimate and whether first row is column names
  'header' changed by user from 'auto' to true
  Number of sampling jump points = 101 because (6823372781 bytes from row 1 to eof) / (2 * 12905 jump0size) == 264369
  Type codes (jump 000)    : 51AA7155A15A7111111111111771111177715  Quote rule 0
  Type codes (jump 001)    : 51AA7155A15A711111111111177111117771A  Quote rule 0
  Type codes (jump 008)    : 51AA7555A15A711111111111177111117771A  Quote rule 0
  Type codes (jump 009)    : 51AA7555AA5A71A155555557177111117775A  Quote rule 0
  Type codes (jump 042)    : 55AA7555AA5A71A155555557177111117775A  Quote rule 0
  Type codes (jump 064)    : 55AA7555AA5A71A1A5555557177111117775A  Quote rule 0
  Type codes (jump 100)    : 55AA7555AA5A71A1A5555557177111117775A  Quote rule 0
  =====
  Sampled 10049 rows (handled \n inside quoted fields) at 101 jump points
  Bytes from first data row on line 1 to the end of last row: 6823372781
  Line length: mean=126.15 sd=8.30 min=100 max=359
  Estimated number of rows: 6823372781 / 126.15 = 54088821
  Initial alloc = 62279495 rows (54088821 + 15%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
  =====
[08] Assign column names
[09] Apply user overrides on column types
  After 4 type and 23 drop user overrides : 00AA700000000000000000070775555077750
[10] Allocate memory for the datatable
  Allocating 14 column slots (37 - 23 dropped) with 62279495 rows
[11] Read the data
  jumps=[0..6520), chunk_size=1046529, total_size=6823372422
|--------------------------------------------------|
|=======Warning: stack imbalance in '$', 27 then 26
===Warning: stack imbalance in '$', 26 then 27
================Error in fread("2017-11-22_1999_Performance.csv", header = TRUE, colClasses = CLS,  : 
  unprotect_ptr: pointer not found

@aadler Regarding your report: I have localized the protection throughout freadR. In your case you are overriding column types, and that part of the code contains a fair number of protects, so there is perhaps a 30% chance this works. Please try again with this build.

@aadler If you haven't tried the last build yet, please skip straight to trying this one. Also, if I could get a copy of the file, I could try it myself in RStudio on Windows.

:(

> library(data.table)
data.table 1.10.5 IN DEVELOPMENT built 2017-12-01 01:54:04 UTC; appveyor
  The fastest way to learn (by data.table authors): https://www.datacamp.com/courses/data-analysis-the-data-table-way
  Documentation: ?data.table, example(data.table) and browseVignettes("data.table")
  Release notes, videos and slides: http://r-datatable.com
> sessionInfo()
R version 3.4.2 beta (2017-09-17 r73296)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows Server >= 2012 x64 (build 9200)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252    LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                           LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] data.table_1.10.5

loaded via a namespace (and not attached):
[1] compiler_3.4.2 tools_3.4.2   
> ColCLASS <- c(rep('integer', 2L), 'character', 'Date', 'numeric',
+               rep('integer', 3L), rep('character', 2L),
+               'integer', 'Date', rep('numeric', 2L), 'Date',
+               rep('numeric', 12L), rep('integer', 5),
+               rep('numeric', 3L), 'integer', 'character')
> SELCOL <- c(WHATEVER)
> DT <- fread('LargeFile.csv', header = TRUE, colClasses = ColCLASS, select = SELCOL, key = c('LoanID', 'Month'), verbose = TRUE)
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
  Using 40 threads (omp_get_max_threads()=40, nth=40)
  NAstrings = [<<NA>>]
  None of the NAstrings look like numbers.
  show progress = 1
  0/1 column will be read as boolean
[02] Opening the file
  Opening file LargeFile.csv
  File opened, size = 6.355GB (6823372783 bytes).
  Memory mapped ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
  \n has been found in the data so any mixture of line endings is allowed other than \r-only line endings. This is common and ideal.
[05] Skipping initial rows if needed
  Positioned on line 1 starting: <<Orig_Year,Orig_Qtr,LoanID,Mont>>
[06] Detect separator, quoting rule, and ncolumns
  Detecting sep ...
  sep=','  with 100 lines of 37 fields using quote rule 0
  Detected 37 columns on line 1. This line is either column names or first data row. Line starts as: <<Orig_Year,Orig_Qtr,LoanID,Mont>>
  Quote rule picked = 0
  fill=false and the most number of columns found is 37
[07] Detect column types, good nrow estimate and whether first row is column names
  'header' changed by user from 'auto' to true
  Number of sampling jump points = 101 because (6823372781 bytes from row 1 to eof) / (2 * 12905 jump0size) == 264369
  Type codes (jump 000)    : 51AA7155A15A7111111111111771111177715  Quote rule 0
  Type codes (jump 001)    : 51AA7155A15A711111111111177111117771A  Quote rule 0
  Type codes (jump 008)    : 51AA7555A15A711111111111177111117771A  Quote rule 0
  Type codes (jump 009)    : 51AA7555AA5A71A155555557177111117775A  Quote rule 0
  Type codes (jump 042)    : 55AA7555AA5A71A155555557177111117775A  Quote rule 0
  Type codes (jump 064)    : 55AA7555AA5A71A1A5555557177111117775A  Quote rule 0
  Type codes (jump 100)    : 55AA7555AA5A71A1A5555557177111117775A  Quote rule 0
  =====
  Sampled 10049 rows (handled \n inside quoted fields) at 101 jump points
  Bytes from first data row on line 1 to the end of last row: 6823372781
  Line length: mean=126.15 sd=8.30 min=100 max=359
  Estimated number of rows: 6823372781 / 126.15 = 54088821
  Initial alloc = 62279495 rows (54088821 + 15%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
  =====
[08] Assign column names
[09] Apply user overrides on column types
  After 4 type and 23 drop user overrides : 00AA700000000000000000070775555077750
[10] Allocate memory for the datatable
  Allocating 14 column slots (37 - 23 dropped) with 62279495 rows
[11] Read the data
  jumps=[0..6520), chunk_size=1046529, total_size=6823372422
|--------------------------------------------------|
|Error in fread("LargeFile.csv", header = TRUE, colClasses = ColCLASS,  : 
  unprotect_ptr: pointer not found

Thanks to @aadler by email, I can now reproduce this: R 3.4.2, the latest RStudio 1.1.383, and Windows 10 Pro 10.0.16299 Build 16299.

I see strange behavior in RStudio, recorded here:
https://www.youtube.com/watch?v=tl2x2vmZxMU
RStudio appears to trigger GCs just from typing. Why is that, and is there a way to turn it off? When fread() prints its progress bar, RStudio's separate event loop may think that the output to the console is the user typing and call into R, causing a GC and setting everything off. Perhaps an RStudio user here can point me in the right direction, or @kevinushey, whom I turned to and spoke with for the first time today.

I can reliably reproduce the stack imbalance in the RStudio console. In the RStudio terminal tab I cannot reproduce it at all, even with gcinfo(TRUE). Interestingly, GCs do occur while the progress bar is printing and that appears to be fine, just as it is fine on Linux. Given the behavior in the RStudio console video, I have come to the conclusion that this is an RStudio console bug. I could not copy text out of the RStudio terminal window (neither Edit -> Copy nor Ctrl-C works), so I took a screenshot of the terminal tab to show that GCs during the progress bar are fine there. I expect that to be fine because only the master thread calls REprintf and the other threads do not call the R API at all.

It works fine in the RStudio terminal:
[Screenshot: selection_014]
There is a GC while the progress bar prints the first time, and it works correctly in the RStudio terminal. The progress bar prints a second time because this test file contains out-of-sample type exceptions, which trigger an automatic reread of just those columns.

But in the RStudio console I get either stack imbalance or unprotect_ptr: pointer not found.

R version 3.4.2 (2017-09-28) -- "Short Summer"
> gcinfo(TRUE)
[1] FALSE
Garbage collection 22 = 16+3+3 (level 0) ... 
25.5 Mbytes of cons cells used (80%)
6.6 Mbytes of vectors used (66%)
Garbage collection 23 = 16+4+3 (level 1) ... 
24.9 Mbytes of cons cells used (79%)
6.5 Mbytes of vectors used (64%)
Garbage collection 24 = 17+4+3 (level 0) ... 
25.1 Mbytes of cons cells used (79%)
6.5 Mbytes of vectors used (64%)
Garbage collection 25 = 18+4+3 (level 0) ... 
25.1 Mbytes of cons cells used (79%)
6.5 Mbytes of vectors used (64%)
Garbage collection 26 = 19+4+3 (level 0) ... 
25.1 Mbytes of cons cells used (79%)
6.5 Mbytes of vectors used (65%)
Garbage collection 27 = 20+4+3 (level 0) ... 
25.2 Mbytes of cons cells used (80%)
6.5 Mbytes of vectors used (65%)
Garbage collection 28 = 20+5+3 (level 1) ... 
25.1 Mbytes of cons cells used (79%)
6.5 Mbytes of vectors used (65%)
Garbage collection 29 = 21+5+3 (level 0) ... 
25.1 Mbytes of cons cells used (79%)
6.5 Mbytes of vectors used (65%)
Garbage collection 30 = 22+5+3 (level 0) ... 
25.2 Mbytes of cons cells used (80%)
6.5 Mbytes of vectors used (65%)
Garbage collection 31 = 23+5+3 (level 0) ... 
25.2 Mbytes of cons cells used (80%)
6.6 Mbytes of vectors used (66%)
Garbage collection 32 = 24+5+3 (level 0) ... 
25.3 Mbytes of cons cells used (80%)
6.6 Mbytes of vectors used (66%)
Garbage collection 33 = 25+5+3 (level 0) ... 
25.4 Mbytes of cons cells used (80%)
6.7 Mbytes of vectors used (66%)
Garbage collection 34 = 25+5+4 (level 2) ... 
24.6 Mbytes of cons cells used (61%)
6.4 Mbytes of vectors used (50%)
Garbage collection 35 = 26+5+4 (level 0) ... 
25.0 Mbytes of cons cells used (62%)
6.5 Mbytes of vectors used (52%)
> require(data.table)
Loading required package: data.table
Garbage collection 36 = 27+5+4 (level 0) ... 
27.2 Mbytes of cons cells used (68%)
7.1 Mbytes of vectors used (56%)
data.table 1.10.5 IN DEVELOPMENT built 2017-12-02 01:04:34 UTC; appveyor
  The fastest way to learn (by data.table authors): https://www.datacamp.com/courses/data-analysis-the-data-table-way
  Documentation: ?data.table, example(data.table) and browseVignettes("data.table")
  Release notes, videos and slides: http://r-datatable.com
Garbage collection 37 = 28+5+4 (level 0) ... 
27.7 Mbytes of cons cells used (69%)
7.3 Mbytes of vectors used (58%)
Garbage collection 38 = 29+5+4 (level 0) ... 
28.0 Mbytes of cons cells used (70%)
7.4 Mbytes of vectors used (58%)
Garbage collection 39 = 30+5+4 (level 0) ... 
28.1 Mbytes of cons cells used (70%)
7.4 Mbytes of vectors used (59%)
Garbage collection 40 = 31+5+4 (level 0) ... 
28.2 Mbytes of cons cells used (70%)
7.5 Mbytes of vectors used (59%)
Garbage collection 41 = 32+5+4 (level 0) ... 
28.4 Mbytes of cons cells used (71%)
7.5 Mbytes of vectors used (59%)
> DT = fread("/Users/pasha/Downloads/LargeFile.csv")
Garbage collection 42 = 32+5+5 (level 2) ... 
27.4 Mbytes of cons cells used (54%)
7.1 Mbytes of vectors used (2%)
Garbage collection 43 = 32+5+6 (level 2) ... 
27.4 Mbytes of cons cells used (54%)
244.7 Mbytes of vectors used (42%)
Garbage collection 44 = 32+5+7 (level 2) ... 
27.4 Mbytes of cons cells used (54%)
482.3 Mbytes of vectors used (42%)
Garbage collection 45 = 32+5+8 (level 2) ... 
27.4 Mbytes of cons cells used (54%)
957.4 Mbytes of vectors used (56%)
Garbage collection 46 = 32+5+9 (level 2) ... 
27.4 Mbytes of cons cells used (54%)
1432.6 Mbytes of vectors used (63%)
Garbage collection 47 = 32+5+10 (level 2) ... 
27.4 Mbytes of cons cells used (54%)
2145.3 Mbytes of vectors used (75%)
Garbage collection 48 = 32+5+11 (level 2) ... 
27.4 Mbytes of cons cells used (54%)
2620.4 Mbytes of vectors used (71%)
Garbage collection 49 = 32+5+12 (level 2) ... 
27.4 Mbytes of cons cells used (54%)
3570.8 Mbytes of vectors used (78%)
Garbage collection 50 = 32+5+13 (level 2) ... 
27.4 Mbytes of cons cells used (54%)
4283.5 Mbytes of vectors used (75%)
Garbage collection 51 = 32+5+14 (level 2) ... 
27.4 Mbytes of cons cells used (54%)
5709.0 Mbytes of vectors used (77%)
Garbage collection 52 = 32+5+15 (level 2) ... 
27.4 Mbytes of cons cells used (54%)
7372.0 Mbytes of vectors used (81%)
Garbage collection 53 = 32+5+16 (level 2) ... 
27.4 Mbytes of cons cells used (54%)
8797.5 Mbytes of vectors used (79%)
Garbage collection 54 = 32+5+17 (level 2) ... 
27.4 Mbytes of cons cells used (54%)
10935.7 Mbytes of vectors used (80%)
|--------------------------------------------------|
|=====Error in fread("LargeFile.csv") : 
  unprotect_ptr: pointer not found
> 

showProgress=FALSE reliably fixes it in the RStudio console. To reproduce the problem, it has to be the first run in a fresh RStudio console, using showProgress=TRUE (i.e. the default). It seems to depend on whether a GC happens during the progress meter, which is the case on the first run in a fresh session. All that is needed is a file large enough for the progress meter to be displayed. It is not related to the reread or to any of the arguments passed to fread. If the first run in a fresh RStudio console works with showProgress=FALSE, then that run grows R's heap and subsequent runs in the same session also work, even with showProgress=TRUE. But that is only because there is no GC during the progress meter, since the first run already grew the heap.
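
The workaround described above can be sketched as follows. This is only an illustration of the call, not a fix: the CSV written here is a tiny hypothetical stand-in (the name `demo.csv` and its contents are assumptions), far too small to trigger a progress meter, and the only point is passing `showProgress = FALSE` to `fread`.

```r
# Workaround sketch: read with the progress meter disabled.
# The temp file below is a hypothetical stand-in for a large CSV.
csv_path <- file.path(tempdir(), "demo.csv")
writeLines(c("LoanID,Month,Balance", "A1,1,100.5", "A1,2,99.0"), csv_path)

has_dt <- requireNamespace("data.table", quietly = TRUE)
if (has_dt) {
  # showProgress = FALSE suppresses the progress meter entirely
  DT <- data.table::fread(csv_path, showProgress = FALSE)
  print(dim(DT))
}
```

On an affected build this avoids printing progress during the read; it does not address the underlying cause.
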
The outstanding question is why a GC on the master thread during the progress meter is fine on Linux and in the RStudio terminal on Windows, but not in the RStudio console.

OK, this fixes it. The problem was on the data.table side, not RStudio's. It now works reliably in the RStudio console on Windows. It was a problem that could have occurred on Linux and Mac too; the memory patterns there just had not provoked it. The other threads do have an entry point into R (when pushing buffers to string columns), and that can happen at the same time as the master thread is printing progress with REprintf. That is also why it only happened on the first run in a fresh session: on the second and subsequent runs, all the strings in the file had been seen before, so the cache lookup hit (thread-safe) and did not allocate (not thread-safe).

So @aadler and @HughParsonage, please try this.
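
The hit-versus-miss asymmetry described above can be illustrated with a toy model. This is a single-threaded sketch in plain R, not R's internal string cache and not the data.table code: `make_cache` and `lookup` are hypothetical helpers, and the strings other than "Goulburn" are made up. It only shows why a second pass over the same strings never reaches the allocating path.

```r
# Toy model of a string cache (NOT R's internal string cache).
# A lookup that hits only reads; a miss has to allocate a new entry.
make_cache <- function() {
  store <- new.env(hash = TRUE)
  misses <- 0L
  lookup <- function(s) {
    if (!exists(s, envir = store, inherits = FALSE)) {
      assign(s, TRUE, envir = store)   # the allocating path
      misses <<- misses + 1L
    }
    invisible(s)
  }
  list(lookup = lookup, misses = function() misses)
}

cache <- make_cache()
strings <- c("Goulburn", "NSW", "Goulburn", "VIC")

for (s in strings) cache$lookup(s)
first_run <- cache$misses()               # distinct strings allocated

for (s in strings) cache$lookup(s)
second_run <- cache$misses() - first_run  # every lookup now hits

cat("allocations on first run: ", first_run, "\n")
cat("allocations on second run:", second_run, "\n")
```

A fresh session corresponds to an empty cache, which is consistent with the failure appearing only on the first read of a file.
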

No warnings. I'm not sure whether you're looking for anything else.

> gcinfo(TRUE)
[1] FALSE
> fread("SA2-by-DJZ-2011.csv", verbose = TRUE, na.strings = "", header = FALSE)
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
  Using 12 threads (omp_get_max_threads()=12, nth=12)
  NAstrings = [<<>>]
  None of the NAstrings look like numbers.
  show progress = 1
  0/1 column will be read as boolean
[02] Opening the file
  Opening file SA2-by-DJZ-2011.csv
  File opened, size = 349.4MB (366418725 bytes).
  Memory mapped ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
  \n has been found in the data so any mixture of line endings is allowed other than \r-only line endings. This is common and ideal.
[05] Skipping initial rows if needed
  Positioned on line 1 starting: <<Australian Bureau of Statistic>>
[06] Detect separator, quoting rule, and ncolumns
  Detecting sep ...
  sep=','  with 89 lines of 4 fields using quote rule 0
  Detected 4 columns on line 12. This line is either column names or first data row. Line starts as: <<"Goulburn","110018063",3499,>>
  Quote rule picked = 0
  fill=false and the most number of columns found is 4
[07] Detect column types, good nrow estimate and whether first row is column names
  'header' changed by user from 'auto' to false
  Number of sampling jump points = 101 because (366418375 bytes from row 1 to eof) / (2 * 1457 jump0size) == 125744
  Type codes (jump 000)    : 1551  Quote rule 0
  Type codes (jump 100)    : 1A51  Quote rule 0
  =====
  Sampled 10027 rows (handled \n inside quoted fields) at 101 jump points
  Bytes from first data row on line 12 to the end of last row: 366418143
  Line length: mean=16.02 sd=0.21 min=16 max=29
  Estimated number of rows: 366418143 / 16.02 = 22877178
  Initial alloc = 25164895 rows (22877178 + 9%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
  =====
[08] Assign column names
[09] Apply user overrides on column types
  After 0 type and 0 drop user overrides : 1A51
[10] Allocate memory for the datatable
  Allocating 4 column slots (4 - 0 dropped) with 25164895 rows
Garbage collection 53 = 36+5+12 (level 2) ... 
30.3 Mbytes of cons cells used (60%)
7.9 Mbytes of vectors used (1%)
[11] Read the data
  jumps=[0..360), chunk_size=1017828, total_size=366418143
|--------------------------------------------------|
|==================================================|
Garbage collection 54 = 37+5+12 (level 0) ... 
30.8 Mbytes of cons cells used (61%)
566.6 Mbytes of vectors used (74%)
Garbage collection 55 = 37+6+12 (level 1) ... 
30.8 Mbytes of cons cells used (61%)
549.2 Mbytes of vectors used (72%)
  jumps=[0..360), chunk_size=1017828, total_size=366418143
|--------------------------------------------------|
|==================================================|
Read 22885380 rows x 4 columns from 349.4MB (366418725 bytes) file in 00:02.626 wall clock time
[12] Finalizing the datatable
  Type counts:
         1 : bool8     '1'
         1 : int32     '5'
         2 : string    'A'
=============================
   0.002s (  0%) Memory map 0.341GB file
   0.005s (  0%) sep=',' ncol=4 and header detection
   0.000s (  0%) Column type detection using 10027 sample rows
   0.469s ( 18%) Allocation of 25164895 rows x 4 cols (0.469GB) of which 22885380 ( 91%) rows used
   2.150s ( 82%) Reading 360 chunks of 0.971MB (63547 rows) using 12 threads
   =    0.103s (  4%) Finding first non-embedded \n after each jump
   +    0.230s (  9%) Parse to row-major thread buffers (grown 0 times)
   +    0.718s ( 27%) Transpose
   +    1.099s ( 42%) Waiting
   0.745s ( 28%) Rereading 1 columns due to out-of-sample type exceptions
   2.626s        Total
Column 1 ("") bumped from 'bool8' to 'string' due to <<"Goulburn">> on row 0
Garbage collection 56 = 37+6+13 (level 2) ... 
31.1 Mbytes of cons cells used (62%)
531.9 Mbytes of vectors used (70%)
Garbage collection 57 = 38+6+13 (level 0) ... 
31.1 Mbytes of cons cells used (62%)
532.0 Mbytes of vectors used (70%)
                V1        V2      V3 V4
       1: Goulburn 110018063    3499 NA
       2:       NA 110018064     812 NA
       3:       NA 110018065    2158 NA
       4:       NA 110019999     402 NA
       5:       NA 110028068      10 NA
      ---                              
22885376:       NA 997999799       0 NA
22885377:       NA 998999899      64 NA
22885378:       NA 994999499      34 NA
22885379:       NA 0&&&&&&&&  250796 NA
22885380:       NA 0@@@@@@@@ 7305367 NA
Warning messages:
1: In fread("SA2-by-DJZ-2011.csv", verbose = TRUE, na.strings = "",  :
  Starting data input on line 12 <<"Goulburn","110018063",3499,>> with 4 fields and discarding line 11 <<"Main Statistical Area Structu>> before it because it has a different number of fields (3).
2: In fread("SA2-by-DJZ-2011.csv", verbose = TRUE, na.strings = "",  :
  Found the last consistent line but text exists afterwards. Consider fill=TRUE and/or blank.lines.skip=TRUE. First 200 characters of discarded line: <<"Dataset: 2011 Census of Population and Housing">>
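
The one-character type codes in the verbose output above can be decoded with a small lookup. The mapping below is taken from the "Type counts" sections of the logs in this thread ('0' drop, '1' bool8, '5' int32, '7' float64, 'A' string); `decode_types` is a hypothetical helper written for this illustration, not part of data.table.

```r
# Code-to-type mapping as printed in the verbose logs in this thread.
type_names <- c("0" = "drop", "1" = "bool8", "5" = "int32",
                "7" = "float64", "A" = "string")

# Split a type-code string into one type name per column.
decode_types <- function(codes) {
  unname(type_names[strsplit(codes, "")[[1]]])
}

types <- decode_types("1A51")   # types after user overrides, 4 columns
types[1] <- "string"            # column 1 was bumped from bool8 to string
print(table(types))
```

After the bump of column 1, the tally agrees with the log's type counts: one bool8, one int32, and two string columns.
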

Thanks, Hugh. Yes, that is a clean run, assuming it was in a fresh RStudio console session. No stack imbalance or "unprotect_ptr: pointer not found" messages appear, and the progress meter runs correctly (twice in this case, because of the reread). Now to confirm with @aadler.

Success.

First run, in a fresh instance of RStudio:

> library(data.table)
data.table 1.10.5 IN DEVELOPMENT built 2017-12-02 12:05:42 UTC; appveyor
  The fastest way to learn (by data.table authors): https://www.datacamp.com/courses/data-analysis-the-data-table-way
  Documentation: ?data.table, example(data.table) and browseVignettes("data.table")
  Release notes, videos and slides: http://r-datatable.com
> sessionInfo()
R version 3.4.2 beta (2017-09-17 r73296)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows Server >= 2012 x64 (build 9200)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252    LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                           LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] data.table_1.10.5

loaded via a namespace (and not attached):
[1] compiler_3.4.2 tools_3.4.2   
> DT <- fread('LargeFile.csv', colClasses = colCLASS, select = colSEL, header = TRUE, key = c('LoanID', 'Month'), verbose = TRUE)
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
  Using 40 threads (omp_get_max_threads()=40, nth=40)
  NAstrings = [<<NA>>]
  None of the NAstrings look like numbers.
  show progress = 1
  0/1 column will be read as boolean
[02] Opening the file
  Opening file LargeFile.csv
  File opened, size = 6.355GB (6823372783 bytes).
  Memory mapped ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
  \n has been found in the data so any mixture of line endings is allowed other than \r-only line endings. This is common and ideal.
[05] Skipping initial rows if needed
  Positioned on line 1 starting: <<Orig_Year,Orig_Qtr,LoanID,Mont>>
[06] Detect separator, quoting rule, and ncolumns
  Detecting sep ...
  sep=','  with 100 lines of 37 fields using quote rule 0
  Detected 37 columns on line 1. This line is either column names or first data row. Line starts as: <<Orig_Year,Orig_Qtr,LoanID,Mont>>
  Quote rule picked = 0
  fill=false and the most number of columns found is 37
[07] Detect column types, good nrow estimate and whether first row is column names
  'header' changed by user from 'auto' to true
  Number of sampling jump points = 101 because (6823372781 bytes from row 1 to eof) / (2 * 12905 jump0size) == 264369
  Type codes (jump 000)    : 51AA7155A15A7111111111111771111177715  Quote rule 0
  Type codes (jump 001)    : 51AA7155A15A711111111111177111117771A  Quote rule 0
  Type codes (jump 008)    : 51AA7555A15A711111111111177111117771A  Quote rule 0
  Type codes (jump 009)    : 51AA7555AA5A71A155555557177111117775A  Quote rule 0
  Type codes (jump 042)    : 55AA7555AA5A71A155555557177111117775A  Quote rule 0
  Type codes (jump 064)    : 55AA7555AA5A71A1A5555557177111117775A  Quote rule 0
  Type codes (jump 100)    : 55AA7555AA5A71A1A5555557177111117775A  Quote rule 0
  =====
  Sampled 10049 rows (handled \n inside quoted fields) at 101 jump points
  Bytes from first data row on line 1 to the end of last row: 6823372781
  Line length: mean=126.15 sd=8.30 min=100 max=359
  Estimated number of rows: 6823372781 / 126.15 = 54088821
  Initial alloc = 62279495 rows (54088821 + 15%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
  =====
[08] Assign column names
[09] Apply user overrides on column types
  After 4 type and 23 drop user overrides : 00AA700000000000000000070775555077750
[10] Allocate memory for the datatable
  Allocating 14 column slots (37 - 23 dropped) with 62279495 rows
[11] Read the data
  jumps=[0..6520), chunk_size=1046529, total_size=6823372422
|--------------------------------------------------|
|==================================================|
Read 53945186 rows x 14 columns from 6.355GB (6823372783 bytes) file in 00:25.938 wall clock time
[12] Finalizing the datatable
  Type counts:
        23 : drop      '0'
         5 : int32     '5'
         7 : float64   '7'
         2 : string    'A'
=============================
   0.005s (  0%) Memory map 6.355GB file
   0.025s (  0%) sep=',' ncol=37 and header detection
   0.001s (  0%) Column type detection using 10049 sample rows
   4.681s ( 18%) Allocation of 62279495 rows x 37 cols (5.336GB) of which 53945186 ( 87%) rows used
  21.226s ( 82%) Reading 6520 chunks of 0.998MB (8295 rows) using 40 threads
   =    0.485s (  2%) Finding first non-embedded \n after each jump
   +    1.465s (  6%) Parse to row-major thread buffers (grown 0 times)
   +    9.095s ( 35%) Transpose
   +   10.181s ( 39%) Waiting
   0.000s (  0%) Rereading 0 columns due to out-of-sample type exceptions
  25.938s        Total
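
The row estimate and initial allocation reported in step [07] of the verbose output can be approximated from the numbers the log prints. The sketch below follows the rule as worded in the log ("bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]"); `initial_alloc` is a hypothetical helper, not the data.table implementation. Because the log shows the mean and sd rounded to two decimals, the results only match the logged values approximately.

```r
# Sketch of the initial-allocation rule as worded in the verbose output:
#   bytes / max(mean - 2*sd, min), clamped to [1.1*estn, 2.0*estn]
initial_alloc <- function(bytes, mean_len, sd_len, min_len) {
  estn  <- bytes / mean_len                              # estimated rows
  alloc <- bytes / max(mean_len - 2 * sd_len, min_len)   # pessimistic row count
  alloc <- min(max(alloc, 1.1 * estn), 2.0 * estn)       # clamp
  c(estimated_rows = estn, initial_alloc = alloc)
}

# Inputs copied from the two verbose logs in this thread.
big   <- initial_alloc(6823372781, mean_len = 126.15, sd_len = 8.30, min_len = 100)
small <- initial_alloc(366418143,  mean_len = 16.02,  sd_len = 0.21, min_len = 16)
print(round(big))
print(round(small))
```

For the 6.355GB file the unclamped value already falls inside the bounds; for SA2-by-DJZ-2011.csv the lower bound of 1.1 times the estimate applies, which is consistent with the allocations reported in the logs (62279495 and 25164895 rows).
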

I closed and reopened RStudio to make sure no string caching was active, then re-ran with gcinfo(TRUE). As an added bonus, the conversion to IDate went through (although it took more than 40 seconds :)).

> colCLASS <- c(rep('integer', 2L), 'character', 'Date', 'numeric',
+               rep('integer', 3L), rep('character', 2L),
+               'integer', 'Date', rep('numeric', 2L), 'Date',
+               rep('numeric', 12L), rep('integer', 5),
+               rep('numeric', 3L), 'integer', 'character')
> library(data.table)
data.table 1.10.5 IN DEVELOPMENT built 2017-12-02 12:05:42 UTC; appveyor
  The fastest way to learn (by data.table authors): https://www.datacamp.com/courses/data-analysis-the-data-table-way
  Documentation: ?data.table, example(data.table) and browseVignettes("data.table")
  Release notes, videos and slides: http://r-datatable.com
> gcinfo(TRUE)
[1] FALSE
> sessionInfo()
R version 3.4.2 beta (2017-09-17 r73296)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows Server >= 2012 x64 (build 9200)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252    LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                           LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] data.table_1.10.5

loaded via a namespace (and not attached):
[1] compiler_3.4.2 tools_3.4.2   
Garbage collection 46 = 36+5+5 (level 0) ... 
38.6 Mbytes of cons cells used (77%)
11.2 Mbytes of vectors used (71%)
Garbage collection 47 = 37+5+5 (level 0) ... 
38.7 Mbytes of cons cells used (77%)
11.2 Mbytes of vectors used (71%)
Garbage collection 48 = 38+5+5 (level 0) ... 
38.8 Mbytes of cons cells used (77%)
11.2 Mbytes of vectors used (71%)
Garbage collection 49 = 39+5+5 (level 0) ... 
39.0 Mbytes of cons cells used (78%)
11.2 Mbytes of vectors used (71%)
Garbage collection 50 = 40+5+5 (level 0) ... 
39.1 Mbytes of cons cells used (78%)
11.3 Mbytes of vectors used (71%)
Garbage collection 51 = 40+6+5 (level 1) ... 
38.8 Mbytes of cons cells used (77%)
11.3 Mbytes of vectors used (71%)
Garbage collection 52 = 41+6+5 (level 0) ... 
38.9 Mbytes of cons cells used (77%)
11.3 Mbytes of vectors used (71%)
Garbage collection 53 = 42+6+5 (level 0) ... 
41.5 Mbytes of cons cells used (83%)
12.2 Mbytes of vectors used (77%)
Garbage collection 54 = 42+7+5 (level 1) ... 
43.4 Mbytes of cons cells used (86%)
12.8 Mbytes of vectors used (81%)
Garbage collection 55 = 42+7+6 (level 2) ... 
44.7 Mbytes of cons cells used (72%)
13.0 Mbytes of vectors used (67%)
Garbage collection 56 = 43+7+6 (level 0) ... 
46.5 Mbytes of cons cells used (74%)
13.6 Mbytes of vectors used (70%)
Garbage collection 57 = 44+7+6 (level 0) ... 
47.0 Mbytes of cons cells used (75%)
13.8 Mbytes of vectors used (71%)
Garbage collection 58 = 45+7+6 (level 0) ... 
47.4 Mbytes of cons cells used (76%)
13.9 Mbytes of vectors used (71%)
Garbage collection 59 = 46+7+6 (level 0) ... 
47.7 Mbytes of cons cells used (76%)
14.2 Mbytes of vectors used (73%)
Garbage collection 60 = 47+7+6 (level 0) ... 
48.0 Mbytes of cons cells used (77%)
14.2 Mbytes of vectors used (73%)
Garbage collection 61 = 48+7+6 (level 0) ... 
48.1 Mbytes of cons cells used (77%)
14.3 Mbytes of vectors used (73%)
> DT <- fread('LargeFile.csv', header = TRUE, colClasses = colCLASS, select = colSEL, key = c('LoanID', 'Month'), verbose = TRUE)
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
  Using 40 threads (omp_get_max_threads()=40, nth=40)
  NAstrings = [<<NA>>]
  None of the NAstrings look like numbers.
  show progress = 1
  0/1 column will be read as boolean
[02] Opening the file
  Opening file LargeFile.csv
  File opened, size = 6.355GB (6823372783 bytes).
  Memory mapped ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
  \n has been found in the data so any mixture of line endings is allowed other than \r-only line endings. This is common and ideal.
[05] Skipping initial rows if needed
  Positioned on line 1 starting: <<Orig_Year,Orig_Qtr,LoanID,Mont>>
[06] Detect separator, quoting rule, and ncolumns
  Detecting sep ...
  sep=','  with 100 lines of 37 fields using quote rule 0
  Detected 37 columns on line 1. This line is either column names or first data row. Line starts as: <<Orig_Year,Orig_Qtr,LoanID,Mont>>
  Quote rule picked = 0
  fill=false and the most number of columns found is 37
[07] Detect column types, good nrow estimate and whether first row is column names
  'header' changed by user from 'auto' to true
  Number of sampling jump points = 101 because (6823372781 bytes from row 1 to eof) / (2 * 12905 jump0size) == 264369
  Type codes (jump 000)    : 51AA7155A15A7111111111111771111177715  Quote rule 0
  Type codes (jump 001)    : 51AA7155A15A711111111111177111117771A  Quote rule 0
  Type codes (jump 008)    : 51AA7555A15A711111111111177111117771A  Quote rule 0
  Type codes (jump 009)    : 51AA7555AA5A71A155555557177111117775A  Quote rule 0
  Type codes (jump 042)    : 55AA7555AA5A71A155555557177111117775A  Quote rule 0
  Type codes (jump 064)    : 55AA7555AA5A71A1A5555557177111117775A  Quote rule 0
  Type codes (jump 100)    : 55AA7555AA5A71A1A5555557177111117775A  Quote rule 0
  =====
  Sampled 10049 rows (handled \n inside quoted fields) at 101 jump points
  Bytes from first data row on line 1 to the end of last row: 6823372781
  Line length: mean=126.15 sd=8.30 min=100 max=359
  Estimated number of rows: 6823372781 / 126.15 = 54088821
  Initial alloc = 62279495 rows (54088821 + 15%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
  =====
[08] Assign column names
[09] Apply user overrides on column types
  After 4 type and 23 drop user overrides : 00AA700000000000000000070775555077750
[10] Allocate memory for the datatable
  Allocating 14 column slots (37 - 23 dropped) with 62279495 rows
Garbage collection 62 = 48+7+7 (level 2) ... 
46.5 Mbytes of cons cells used (60%)
13.6 Mbytes of vectors used (2%)
Garbage collection 63 = 48+7+8 (level 2) ... 
46.5 Mbytes of cons cells used (60%)
488.7 Mbytes of vectors used (42%)
Garbage collection 64 = 48+7+9 (level 2) ... 
46.5 Mbytes of cons cells used (60%)
963.9 Mbytes of vectors used (56%)
Garbage collection 65 = 48+7+10 (level 2) ... 
46.5 Mbytes of cons cells used (60%)
1439.1 Mbytes of vectors used (63%)
Garbage collection 66 = 48+7+11 (level 2) ... 
46.5 Mbytes of cons cells used (60%)
1914.2 Mbytes of vectors used (67%)
Garbage collection 67 = 48+7+12 (level 2) ... 
46.5 Mbytes of cons cells used (60%)
2864.5 Mbytes of vectors used (77%)
Garbage collection 68 = 48+7+13 (level 2) ... 
46.5 Mbytes of cons cells used (60%)
3577.3 Mbytes of vectors used (78%)
Garbage collection 69 = 48+7+14 (level 2) ... 
46.5 Mbytes of cons cells used (60%)
4290.0 Mbytes of vectors used (75%)
[11] Read the data
  jumps=[0..6520), chunk_size=1046529, total_size=6823372422
|--------------------------------------------------|
|============================Garbage collection 70 = 49+7+14 (level 0) ... 
76.5 Mbytes of cons cells used (99%)
5487.5 Mbytes of vectors used (96%)
=Garbage collection 71 = 49+8+14 (level 1) ... 
77.0 Mbytes of cons cells used (100%)
5487.6 Mbytes of vectors used (96%)
Garbage collection 72 = 49+8+15 (level 2) ... 
77.0 Mbytes of cons cells used (81%)
5487.1 Mbytes of vectors used (80%)
==============Garbage collection 73 = 50+8+15 (level 0) ... 
94.3 Mbytes of cons cells used (100%)
5494.0 Mbytes of vectors used (80%)
Garbage collection 74 = 50+9+15 (level 1) ... 
94.5 Mbytes of cons cells used (100%)
5494.1 Mbytes of vectors used (80%)
Garbage collection 75 = 50+9+16 (level 2) ... 
94.5 Mbytes of cons cells used (82%)
5493.1 Mbytes of vectors used (67%)
=======|
Read 53945186 rows x 14 columns from 6.355GB (6823372783 bytes) file in 00:24.772 wall clock time
[12] Finalizing the datatable
  Type counts:
        23 : drop      '0'
         5 : int32     '5'
         7 : float64   '7'
         2 : string    'A'
=============================
   0.005s (  0%) Memory map 6.355GB file
   0.018s (  0%) sep=',' ncol=37 and header detection
   0.000s (  0%) Column type detection using 10049 sample rows
   5.496s ( 22%) Allocation of 62279495 rows x 37 cols (5.336GB) of which 53945186 ( 87%) rows used
  19.253s ( 78%) Reading 6520 chunks of 0.998MB (8295 rows) using 40 threads
   =    0.433s (  2%) Finding first non-embedded \n after each jump
   +    1.482s (  6%) Parse to row-major thread buffers (grown 0 times)
   +    9.515s ( 38%) Transpose
   +    7.822s ( 32%) Waiting
   0.000s (  0%) Rereading 0 columns due to out-of-sample type exceptions
  24.772s        Total
Garbage collection 76 = 51+9+16 (level 0) ... 
105.3 Mbytes of cons cells used (91%)
5500.3 Mbytes of vectors used (67%)
Garbage collection 77 = 51+10+16 (level 1) ... 
105.4 Mbytes of cons cells used (91%)
5500.2 Mbytes of vectors used (67%)
> DT[, Month := as.IDate(Month, format = "%Y-%m-%d")]
Garbage collection 78 = 51+10+17 (level 2) ... 
107.5 Mbytes of cons cells used (76%)
8174.1 Mbytes of vectors used (81%)
Garbage collection 79 = 51+11+17 (level 1) ... 
107.5 Mbytes of cons cells used (76%)
5910.4 Mbytes of vectors used (59%)
> gcinfo(FALSE)
[1] TRUE
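
The row estimate and allocation figures in the verbose output above can be roughly re-derived from the numbers the log itself prints. The following is a minimal base-R sketch, not data.table's actual implementation: it only replays the formulas quoted in the log lines. The per-column byte sizes (4 for int32, 8 for float64, 8 for a string pointer on a 64-bit build) are an assumption, and because the log rounds the mean and sd, the results only approximately match the logged values.

```r
# Figures copied from the verbose fread() log above.
bytes    <- 6823372781  # bytes from first data row to end of last row
mean_len <- 126.15      # mean sampled line length (rounded in the log)
sd_len   <- 8.30        # sd of sampled line length (rounded in the log)
min_len  <- 100         # shortest sampled line

# Estimated number of rows: total bytes over mean line length.
est_rows <- bytes / mean_len

# Initial allocation: bytes / max(mean - 2*sd, min),
# clamped between 1.1 and 2.0 times the estimate.
alloc_rows <- bytes / max(mean_len - 2 * sd_len, min_len)
alloc_rows <- min(max(alloc_rows, 1.1 * est_rows), 2.0 * est_rows)

# ASSUMPTION: 4 bytes per int32, 8 per float64, 8 per string pointer (64-bit R).
# The log reports 5 int32, 7 float64 and 2 string columns kept.
bytes_per_row <- 5 * 4 + 7 * 8 + 2 * 8
alloc_gb <- 62279495 * bytes_per_row / 1024^3

# Share of the allocated rows that were actually filled.
used_frac <- 53945186 / 62279495

cat(sprintf("estimated rows   : %.0f (log: 54088821)\n", est_rows))
cat(sprintf("initial alloc    : %.0f (log: 62279495)\n", alloc_rows))
cat(sprintf("allocation in GB : %.3f (log: 5.336)\n", alloc_gb))
cat(sprintf("rows used        : %.0f%% (log: 87%%)\n", 100 * used_frac))
```

Under these assumptions the 14 retained columns account for the logged 5.336GB allocation, which suggests the 23 dropped columns were not allocated even though the timing summary says "x 37 cols".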

Awesome! :tada: Everyone involved, especially @mattdowle, must have a good deal less hair by now. :)

Looks like my "stay on vacation until the problem is solved" strategy worked out here. :-)

ํ™•์ธํ•ด์•ผ ํ•  ๋‹ค๋ฅธ ์‚ฌํ•ญ์ด ์žˆ๊ฑฐ๋‚˜์ด ๋ฌธ์ œ๊ฐ€ ํ•ด๊ฒฐ ๋œ ๊ฒƒ์œผ๋กœ ๊ฐ„์ฃผ๋ฉ๋‹ˆ๊นŒ?

Thanks, @aadler and @HughParsonage! What a relief.
@kevinushey Haha. Yes, it was on the data.table side and it is now fixed (PR #2488). Thanks.

์ด ํŽ˜์ด์ง€๊ฐ€ ๋„์›€์ด ๋˜์—ˆ๋‚˜์š”?
0 / 5 - 0 ๋“ฑ๊ธ‰