Running the following with verbose=FALSE crashes R (or reports a "stack imbalance"), after first loading data.table.
The problem does not reproduce on much smaller files. Link to the zip file (the csv is 350 MB): https:
I occasionally get different errors. For example,
Error in get(name, envir = ns, inherits = FALSE) : invalid first argument
or
Warning: stack imbalance in '$', 16 then 15
Error: R_Reprotect: only 1 protected item, can't reprotect index -2
# Minimal reproducible example
library(data.table)
#> data.table 1.10.5 IN DEVELOPMENT built 2017-11-13 02:46:28 UTC; appveyor
#> The fastest way to learn (by data.table authors): https://www.datacamp.com/courses/data-analysis-the-data-table-way
#> Documentation: ?data.table, example(data.table) and browseVignettes("data.table")
#> Release notes, videos and slides: http://r-datatable.com
fread("SA2-by-DJZ-2011.csv", header = FALSE, na.strings = "", verbose = TRUE)
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
Using 12 threads (omp_get_max_threads()=12, nth=12)
NAstrings = [<<>>]
None of the NAstrings look like numbers.
show progress = 1
0/1 column will be read as boolean
[02] Opening the file
Opening file SA2-by-DJZ-2011.csv
File opened, size = 349.4MB (366418725 bytes).
Memory mapping ... ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
\r-only line endings are not allowed because \n is found in the data
[05] Skipping initial rows if needed
Positioned on line 1 starting: <<Australian Bureau of Statistic>>
[06] Detect separator, quoting rule, and ncolumns
Detecting sep ...
sep=',' with 89 lines of 4 fields using quote rule 0
Detected 4 columns on line 12. This line is either column names or first data row. Line starts as: <<"Goulburn","110018063",3499,>>
Quote rule picked = 0
fill=false and the most number of columns found is 4
[07] Detect column types, good nrow estimate and whether first row is column names
'header' changed by user from 'auto' to false
Number of sampling jump points = 101 because (366418375 bytes from row 1 to eof) / (2 * 1457 jump0size) == 125744
Type codes (jump 000) : 1551 Quote rule 0
Type codes (jump 100) : 11051 Quote rule 0
=====
Sampled 10027 rows (handled \n inside quoted fields) at 101 jump points
Bytes from first data row on line 12 to the end of last row: 366418143
Line length: mean=16.02 sd=0.21 min=16 max=29
Estimated number of rows: 366418143 / 16.02 = 22877178
Initial alloc = 25164895 rows (22877178 + 9%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
=====
[08] Assign column names
[09] Apply user overrides on column types
After 0 type and 0 drop user overrides : 11051
[10] Allocate memory for the datatable
Allocating 4 column slots (4 - 0 dropped) with 25164895 rows
[11] Read the data
jumps=[0..360), chunk_size=1017828, total_size=366418143
[12] Finalizing the datatable
Read 22885380 rows x 4 columns from 349.4MB (366418725 bytes) file in 00:02.550 wall clock time
Thread buffers were grown 0 times (if all 12 threads each grew once, this figure would be 12)
Final type counts
0 : drop
1 : bool8
0 : bool8
0 : bool8
0 : bool8
1 : int32
0 : int64
0 : float64
0 : float64
0 : float64
2 : string
Rereading 1 columns due to out-of-sample type exceptions.
Column 1 ("") bumped from 'bool8' to 'string' due to <<"Goulburn">> on row 0
[11] Read the data
jumps=[0..360), chunk_size=1017828, total_size=366418143
Read 94%. ETA 00:00 Warning: stack imbalance in '$', 27 then 28
Read 98%. ETA 00:00
[12] Finalizing the datatable
Reread 22885380 rows x 1 columns in 00:00.991
Read 22885380 rows. Exactly what was estimated and allocated up front
=============================
0.006s ( 0%) Memory map 0.341GB file
0.011s ( 0%) sep=',' ncol=4 and header detection
0.002s ( 0%) Column type detection using 10027 sample rows
0.328s ( 9%) Allocation of 22885380 rows x 4 cols (0.469GB)
3.194s ( 90%) Reading 360 chunks of 0.971MB (63547 rows) using 12 threads
= 0.001s ( 0%) Finding first non-embedded \n after each jump
+ 0.362s ( 10%) Parse to row-major thread buffers
+ 1.963s ( 55%) Transpose
+ 0.868s ( 25%) Waiting
0.991s ( 28%) Rereading 1 columns due to out-of-sample type exceptions
3.541s Total
Warning: stack imbalance in 'withVisible', 3 then 5
Warning messages:
1: In fread("SA2-by-DJZ-2011.csv", header = FALSE, na.strings = "", :
Starting data input on line 12 <<"Goulburn","110018063",3499,>> with 4 fields and discarding line 11 <<"Main Statistical Area Structu>> before it because it has a different number of fields (3).
2: In fread("SA2-by-DJZ-2011.csv", header = FALSE, na.strings = "", :
Found the last consistent line but text exists afterwards. Consider fill=TRUE and/or blank.lines.skip=TRUE. First 200 characters of discarded line: <<"Dataset: 2011 Census of Population and Housing">>
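The row estimate and initial allocation printed in step [07] of the verbose output can be checked by hand. The sketch below is only a reading of the log text ("bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]"), not a copy of fread's internal code; the variable names are made up for illustration. The printed mean (16.02) is rounded, so the estimated row count is taken from the log rather than recomputed.

```r
# Values copied from step [07] of the verbose output above.
bytes    <- 366418143   # bytes from the first data row to the end of the last row
est_rows <- 22877178    # "Estimated number of rows" as printed in the log
mean_len <- 16.02       # mean sampled line length (rounded in the log)
sd_len   <- 0.21        # standard deviation of sampled line length
min_len  <- 16          # shortest sampled line

# Conservative estimate: assume every line is as short as max(mean - 2*sd, min).
raw_alloc <- bytes / max(mean_len - 2 * sd_len, min_len)

# Clamp to [1.1 * est_rows, 2.0 * est_rows], as the log message describes.
alloc <- floor(min(max(raw_alloc, 1.1 * est_rows), 2.0 * est_rows))

print(alloc)  # 25164895, the "Initial alloc" figure in the log
```

Here the unclamped figure (about 22.9 million rows) falls below the lower bound, so the allocation is simply 1.1 times the estimate.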
# Output of sessionInfo()
R version 3.4.2 (2017-09-28)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)
Matrix products: default
locale:
[1] LC_COLLATE=English_Australia.1252 LC_CTYPE=English_Australia.1252
[3] LC_MONETARY=English_Australia.1252 LC_NUMERIC=C
[5] LC_TIME=English_Australia.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] data.table_1.10.5 RevoUtils_10.0.6 RevoUtilsMath_10.0.1
loaded via a namespace (and not attached):
[1] compiler_3.4.2 tools_3.4.2 yaml_2.1.14
@HughParsonage This looks similar to #2457. Could you try passing showProgress=FALSE and see whether it completes?
@mattdowle Regression since 2017-11-09.
Running with showProgress=FALSE does indeed return the result (with only the expected warnings).
Thanks for all the details. I don't think there has been a regression since 9 November 2017, but the long verbose=TRUE output may be having a similar effect to the ETA output. The file needs to be reread, which means more output is generated. My guess is that @HughParsonage's report that showProgress=FALSE works for him is a red herring, and that the problem would probably still occur if it were run 5-10 times with verbose=TRUE.
No verbose messages are printed from inside the parallel section (other than the progress ETA, which has already been fixed). However, there are verbose messages after the first read and before the second, rereading pass starts (which happens for this file). If that printing triggers the 100th CheckUserInterrupt (see #2457), then I suppose the second parallel region could fail (strange as that would be). Anyway, to rule that out, I've changed all verbose messages to use REprintf rather than Rprintf (the same fix as #2457 for the ETA). That failed at first because the tests could not find the output now that it is on stderr; this is now fixed. Once it passes, the Windows .zip will be built automatically, so please try again then. I'll update here when it's ready.
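The test failure mentioned above, where the tests could no longer find output once it moved to stderr, has a simple R-level analogue. The sketch below is not data.table's test code; say_stdout and say_stderr are hypothetical stand-ins for the C-level Rprintf and REprintf calls, used only to show that capture.output() sees stdout by default and needs type = "message" to see stderr.

```r
# Hypothetical stand-ins for C-level Rprintf() and REprintf():
say_stdout <- function(msg) cat(msg, "\n", sep = "")  # writes to stdout
say_stderr <- function(msg) message(msg)              # writes to stderr

# capture.output() diverts stdout by default, so stderr text is not captured
# (it still appears on the console).
seen_default <- capture.output(say_stderr("[11] Read the data"))

# Asking for the message stream picks the same text up again.
seen_message <- capture.output(say_stderr("[11] Read the data"), type = "message")

# The stdout writer is captured without any extra argument.
seen_stdout <- capture.output(say_stdout("[11] Read the data"))

print(length(seen_default))  # 0
print(seen_message)
print(seen_stdout)
```

A test that compares captured stdout against expected verbose text would therefore start failing after such a switch until it also captures the message stream.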
OK, the second attempt passed the checks and the Windows .zip is available. @HughParsonage could you please try again? I added a call to R_FlushConsole() just before the rereading message in verbose mode. This flush is only needed on Windows. My guess is that, without the flush, the console can be updated slightly late, while the parallel reread is already running, and we know that causes problems. Please repeat 10 times, always with both verbose=TRUE and showProgress=TRUE. If you see 10 clean runs, that should be it. If not, I'll have to think again.
Unfortunately, it is not fixed:
> library(data.table)
data.table 1.10.5 IN DEVELOPMENT built 2017-11-15 00:36:41 UTC; appveyor
The fastest way to learn (by data.table authors): https://www.datacamp.com/courses/data-analysis-the-data-table-way
Documentation: ?data.table, example(data.table) and browseVignettes("data.table")
Release notes, videos and slides: http://r-datatable.com
> fread("SA2-by-DJZ-2011.csv", header = FALSE, na.strings = "", verbose = FALSE)
Read 26%. ETA 00:00 Warning: stack imbalance in '$', 20 then 22
Read 52%. ETA 00:00 Warning: stack imbalance in '$', 36 then 35
Warning: stack imbalance in '$', 21 then 22
Read 59%. ETA 00:00 Error in fread("SA2-by-DJZ-2011.csv", header = FALSE, na.strings = "", :
unprotect_ptr: pointer not found
In addition: Warning: stack imbalance in '$', 26 then 28
Warning messages:
1: Warning: stack imbalance in '$', 26 then 27
In fread("SA2-by-DJZ-2011.csv", header = FALSE, na.strings = "", :
Starting data input on line 12 <<"Goulburn","110018063",3499,>> with 4 fields and discarding line 11 <<"Main Statistical Area Structu>> before it because it has a different number of fields (3).
2: In fread("SA2-by-DJZ-2011.csv", header = FALSE, na.strings = "", :
Found the last consistent line but text exists afterwards. Consider fill=TRUE and/or blank.lines.skip=TRUE. First 200 characters of discarded line: <<"Dataset: 2011 Census of Population and Housing">>
Warning: stack imbalance in 'lapply', 31 then 30
Warning: stack imbalance in '$', 14 then 15
With verbose=TRUE, showProgress=TRUE there are no errors, even after running it 10 times. The output of the 10th run is as follows.
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
Using 12 threads (omp_get_max_threads()=12, nth=12)
NAstrings = [<<>>]
None of the NAstrings look like numbers.
show progress = 1
0/1 column will be read as boolean
[02] Opening the file
Opening file SA2-by-DJZ-2011.csv
File opened, size = 349.4MB (366418725 bytes).
Memory mapping ... ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
\r-only line endings are not allowed because \n is found in the data
[05] Skipping initial rows if needed
Positioned on line 1 starting: <<Australian Bureau of Statistic>>
[06] Detect separator, quoting rule, and ncolumns
Detecting sep ...
sep=',' with 89 lines of 4 fields using quote rule 0
Detected 4 columns on line 12. This line is either column names or first data row. Line starts as: <<"Goulburn","110018063",3499,>>
Quote rule picked = 0
fill=false and the most number of columns found is 4
[07] Detect column types, good nrow estimate and whether first row is column names
'header' changed by user from 'auto' to false
Number of sampling jump points = 101 because (366418375 bytes from row 1 to eof) / (2 * 1457 jump0size) == 125744
Type codes (jump 000) : 1551 Quote rule 0
Type codes (jump 100) : 11051 Quote rule 0
=====
Sampled 10027 rows (handled \n inside quoted fields) at 101 jump points
Bytes from first data row on line 12 to the end of last row: 366418143
Line length: mean=16.02 sd=0.21 min=16 max=29
Estimated number of rows: 366418143 / 16.02 = 22877178
Initial alloc = 25164895 rows (22877178 + 9%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
=====
[08] Assign column names
[09] Apply user overrides on column types
After 0 type and 0 drop user overrides : 11051
[10] Allocate memory for the datatable
Allocating 4 column slots (4 - 0 dropped) with 25164895 rows
[11] Read the data
jumps=[0..360), chunk_size=1017828, total_size=366418143
[12] Finalizing the datatable
Read 22885380 rows x 4 columns from 349.4MB (366418725 bytes) file in 00:02.094 wall clock time
Thread buffers were grown 0 times (if all 1 threads each grew once, this figure would be 1)
Final type counts
0 : drop
1 : bool8
0 : bool8
0 : bool8
0 : bool8
1 : int32
0 : int64
0 : float64
0 : float64
0 : float64
2 : string
Rereading 1 columns due to out-of-sample type exceptions.
Column 1 ("") bumped from 'bool8' to 'string' due to <<"Goulburn">> on row 0
[11] Read the data
jumps=[0..360), chunk_size=1017828, total_size=366418143
Read 99%. ETA 00:00
[12] Finalizing the datatable
Reread 22885380 rows x 1 columns in 00:01.752
Read 22885380 rows. Exactly what was estimated and allocated up front
=============================
0.004s ( 0%) Memory map 0.341GB file
0.008s ( 0%) sep=',' ncol=4 and header detection
0.001s ( 0%) Column type detection using 10027 sample rows
0.173s ( 4%) Allocation of 22885380 rows x 4 cols (0.469GB)
3.660s ( 95%) Reading 360 chunks of 0.971MB (63547 rows) using 1 threads
= 0.009s ( 0%) Finding first non-embedded \n after each jump
+ 1.946s ( 51%) Parse to row-major thread buffers
+ 1.098s ( 29%) Transpose
+ 0.608s ( 16%) Waiting
1.752s ( 46%) Rereading 1 columns due to out-of-sample type exceptions
3.846s Total
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
Using 12 threads (omp_get_max_threads()=12, nth=12)
NAstrings = [<<>>]
None of the NAstrings look like numbers.
show progress = 1
0/1 column will be read as boolean
[02] Opening the file
Opening file SA2-by-DJZ-2011.csv
File opened, size = 349.4MB (366418725 bytes).
Memory mapping ... ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
\r-only line endings are not allowed because \n is found in the data
[05] Skipping initial rows if needed
Positioned on line 1 starting: <<Australian Bureau of Statistic>>
[06] Detect separator, quoting rule, and ncolumns
Detecting sep ...
sep=',' with 89 lines of 4 fields using quote rule 0
Detected 4 columns on line 12. This line is either column names or first data row. Line starts as: <<"Goulburn","110018063",3499,>>
Quote rule picked = 0
fill=false and the most number of columns found is 4
[07] Detect column types, good nrow estimate and whether first row is column names
'header' changed by user from 'auto' to false
Number of sampling jump points = 101 because (366418375 bytes from row 1 to eof) / (2 * 1457 jump0size) == 125744
Type codes (jump 000) : 1551 Quote rule 0
Type codes (jump 100) : 11051 Quote rule 0
=====
Sampled 10027 rows (handled \n inside quoted fields) at 101 jump points
Bytes from first data row on line 12 to the end of last row: 366418143
Line length: mean=16.02 sd=0.21 min=16 max=29
Estimated number of rows: 366418143 / 16.02 = 22877178
Initial alloc = 25164895 rows (22877178 + 9%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
=====
[08] Assign column names
[09] Apply user overrides on column types
After 0 type and 0 drop user overrides : 11051
[10] Allocate memory for the datatable
Allocating 4 column slots (4 - 0 dropped) with 25164895 rows
[11] Read the data
jumps=[0..360), chunk_size=1017828, total_size=366418143
[12] Finalizing the datatable
Read 22885380 rows x 4 columns from 349.4MB (366418725 bytes) file in 00:02.589 wall clock time
Thread buffers were grown 0 times (if all 1 threads each grew once, this figure would be 1)
Final type counts
0 : drop
1 : bool8
0 : bool8
0 : bool8
0 : bool8
1 : int32
0 : int64
0 : float64
0 : float64
0 : float64
2 : string
Rereading 1 columns due to out-of-sample type exceptions.
Column 1 ("") bumped from 'bool8' to 'string' due to <<"Goulburn">> on row 0
[11] Read the data
jumps=[0..360), chunk_size=1017828, total_size=366418143
Read 99%. ETA 00:00
[12] Finalizing the datatable
Reread 22885380 rows x 1 columns in 00:01.418
Read 22885380 rows. Exactly what was estimated and allocated up front
=============================
0.001s ( 0%) Memory map 0.341GB file
0.003s ( 0%) sep=',' ncol=4 and header detection
0.001s ( 0%) Column type detection using 10027 sample rows
0.574s ( 14%) Allocation of 22885380 rows x 4 cols (0.469GB)
3.428s ( 86%) Reading 360 chunks of 0.971MB (63547 rows) using 1 threads
= 0.010s ( 0%) Finding first non-embedded \n after each jump
+ 1.988s ( 50%) Parse to row-major thread buffers
+ 1.137s ( 28%) Transpose
+ 0.292s ( 7%) Waiting
1.418s ( 35%) Rereading 1 columns due to out-of-sample type exceptions
4.007s Total
There were 20 warnings (use warnings() to see them)
@HughParsonage Thanks! I'm confused. You're saying it works with verbose=TRUE, showProgress=TRUE, which is what we were hoping for - yay! Did that fail before? And since showProgress defaults to TRUE, when you run with verbose at its default of FALSE it doesn't work and you see the stack imbalance? It's strange that _less_ output would fail. Please confirm. If so, maybe I'm barking up the wrong tree. It works fine on Linux here, so I'm relying on your testing on Windows. Thanks.
(Also, the bottom of the 10th run's output says there were 20 warnings. I assume those are the 2 warnings shown above, repeated 10 times. If so, that makes sense.)
Hi, sorry for the confusion...
The original problem definitely no longer crashes. That is, the following works:
fread("SA2-by-DJZ-2011.csv", header = FALSE, na.strings = "")
To be clear, in the original report it crashed with verbose = FALSE (the default). Before filing the issue I ran with verbose = TRUE and noticed the "stack imbalance" warnings, but there was no crash. With the latest version, verbose = FALSE does not crash (or, indeed, cause any problem).
I said "not fixed" because I noticed these warning messages:
Warning messages:
Warning: stack imbalance in '$', 26 then 27
Warning: stack imbalance in 'lapply', 31 then 30
Warning: stack imbalance in '$', 14 then 15
These looked odd and, although not the same problem, I thought they might indicate something closely related. That said, as of this morning in Australia I can no longer reproduce the warning messages.
OK, I see. Those stack imbalance warnings are essentially errors; we can't let them pass. It hasn't actually crashed yet, but I'm calling the stack imbalance warning a crash. (Once you've seen that warning, a crash is only a matter of time.)
When you run it 10 times in a fresh R session with verbose=TRUE, showProgress=TRUE, are any of the 20 warnings about stack imbalance, or are all 20 the following normal warnings?
1: In fread("SA2-by-DJZ-2011.csv", header = FALSE, na.strings = "", :
Starting data input on line 12 <<"Goulburn","110018063",3499,>> with 4 fields and discarding line 11 <<"Main Statistical Area Structu>> before it because it has a different number of fields (3).
2: In fread("SA2-by-DJZ-2011.csv", header = FALSE, na.strings = "", :
Found the last consistent line but text exists afterwards. Consider fill=TRUE and/or blank.lines.skip=TRUE. First 200 characters of discarded line: <<"Dataset: 2011 Census of Population and Housing">>
Once a stack imbalance warning has occurred, please start a fresh R session. After it has happened even once, we can't trust anything from R.
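One way to follow the fresh-session advice while still repeating the test is to launch each attempt in its own Rscript process and scan what it printed. This is a minimal sketch under several assumptions: the helper names are hypothetical, data.table is installed, the CSV sits in the working directory, and the problem shows up under Rscript at all (it was observed in interactive sessions on Windows).

```r
# Hypothetical helper: TRUE if any console line reports a stack imbalance.
has_stack_imbalance <- function(lines) {
  any(grepl("stack imbalance", lines, fixed = TRUE))
}

# Hypothetical helper: run one fread() call in a brand-new Rscript process and
# return everything it printed (stdout and stderr combined).
run_in_fresh_session <- function(csv = "SA2-by-DJZ-2011.csv") {
  script <- tempfile(fileext = ".R")
  on.exit(unlink(script))
  writeLines(c(
    "library(data.table)",
    sprintf(
      "invisible(fread(%s, header = FALSE, na.strings = '', verbose = TRUE, showProgress = TRUE))",
      deparse(csv)  # turns the path into a quoted R string literal
    )
  ), script)
  rscript <- file.path(R.home("bin"), "Rscript")
  # system2() warns when the child exits with a non-zero status; the captured
  # text is still returned, so the warning is suppressed here.
  suppressWarnings(system2(rscript, shQuote(script), stdout = TRUE, stderr = TRUE))
}

# Usage (needs data.table and the CSV, so it is left commented out):
# results <- vapply(1:10, function(i) has_stack_imbalance(run_in_fresh_session()), logical(1))
# table(results)
```

Scanning the raw console text, rather than relying on warnings(), is a deliberate choice: in the logs above the stack imbalance lines are interleaved with the progress output, while the numbered warnings listed at the end are the ordinary fread ones.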
Running with verbose=TRUE, showProgress=TRUE, I did somehow get a crash. Something about SEXP const char. I'm trying to reproduce it from the command line (unfortunately it happened in RStudio, and RStudio closed before I could read the whole message).
I can't reproduce the crash. Here is the result after a restart. There are stack imbalance warnings:
> library(data.table)
data.table 1.10.5 IN DEVELOPMENT built 2017-11-15 00:36:41 UTC; appveyor
The fastest way to learn (by data.table authors): https://www.datacamp.com/courses/data-analysis-the-data-table-way
Documentation: ?data.table, example(data.table) and browseVignettes("data.table")
Release notes, videos and slides: http://r-datatable.com
> for (i in 1:10) fread("SA2-by-DJZ-2011.csv", header = FALSE, na.strings = "", verbose = TRUE, showProgress = TRUE)
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
Using 12 threads (omp_get_max_threads()=12, nth=12)
NAstrings = [<<>>]
None of the NAstrings look like numbers.
show progress = 1
0/1 column will be read as boolean
[02] Opening the file
Opening file SA2-by-DJZ-2011.csv
File opened, size = 349.4MB (366418725 bytes).
Memory mapping ... ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
\r-only line endings are not allowed because \n is found in the data
[05] Skipping initial rows if needed
Positioned on line 1 starting: <<Australian Bureau of Statistic>>
[06] Detect separator, quoting rule, and ncolumns
Detecting sep ...
sep=',' with 89 lines of 4 fields using quote rule 0
Detected 4 columns on line 12. This line is either column names or first data row. Line starts as: <<"Goulburn","110018063",3499,>>
Quote rule picked = 0
fill=false and the most number of columns found is 4
[07] Detect column types, good nrow estimate and whether first row is column names
'header' changed by user from 'auto' to false
Number of sampling jump points = 101 because (366418375 bytes from row 1 to eof) / (2 * 1457 jump0size) == 125744
Type codes (jump 000) : 1551 Quote rule 0
Type codes (jump 100) : 11051 Quote rule 0
=====
Sampled 10027 rows (handled \n inside quoted fields) at 101 jump points
Bytes from first data row on line 12 to the end of last row: 366418143
Line length: mean=16.02 sd=0.21 min=16 max=29
Estimated number of rows: 366418143 / 16.02 = 22877178
Initial alloc = 25164895 rows (22877178 + 9%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
=====
[08] Assign column names
[09] Apply user overrides on column types
After 0 type and 0 drop user overrides : 11051
[10] Allocate memory for the datatable
Allocating 4 column slots (4 - 0 dropped) with 25164895 rows
[11] Read the data
jumps=[0..360), chunk_size=1017828, total_size=366418143
[12] Finalizing the datatable
Read 22885380 rows x 4 columns from 349.4MB (366418725 bytes) file in 00:01.212 wall clock time
Thread buffers were grown 0 times (if all 12 threads each grew once, this figure would be 12)
Final type counts
0 : drop
1 : bool8
0 : bool8
0 : bool8
0 : bool8
1 : int32
0 : int64
0 : float64
0 : float64
0 : float64
2 : string
Rereading 1 columns due to out-of-sample type exceptions.
Column 1 ("") bumped from 'bool8' to 'string' due to <<"Goulburn">> on row 0
[11] Read the data
jumps=[0..360), chunk_size=1017828, total_size=366418143
Read 31%. ETA 00:00 Warning: stack imbalance in '$', 24 then 23
Read 91%. ETA 00:00 Warning: stack imbalance in '$', 27 then 26
Read 95%. ETA 00:00 Warning: stack imbalance in '$', 28 then 29
Read 98%. ETA 00:00
[12] Finalizing the datatable
Reread 22885380 rows x 1 columns in 00:00.895
Read 22885380 rows. Exactly what was estimated and allocated up front
=============================
0.002s ( 0%) Memory map 0.341GB file
0.029s ( 1%) sep=',' ncol=4 and header detection
0.001s ( 0%) Column type detection using 10027 sample rows
0.314s ( 15%) Allocation of 22885380 rows x 4 cols (0.469GB)
1.761s ( 84%) Reading 360 chunks of 0.971MB (63547 rows) using 12 threads
= 0.015s ( 1%) Finding first non-embedded \n after each jump
+ 0.599s ( 28%) Parse to row-major thread buffers
+ 0.400s ( 19%) Transpose
+ 0.746s ( 35%) Waiting
0.895s ( 42%) Rereading 1 columns due to out-of-sample type exceptions
2.107s Total
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
Using 12 threads (omp_get_max_threads()=12, nth=12)
NAstrings = [<<>>]
None of the NAstrings look like numbers.
show progress = 1
0/1 column will be read as boolean
[02] Opening the file
Opening file SA2-by-DJZ-2011.csv
File opened, size = 349.4MB (366418725 bytes).
Memory mapping ... ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
\r-only line endings are not allowed because \n is found in the data
[05] Skipping initial rows if needed
Positioned on line 1 starting: <<Australian Bureau of Statistic>>
[06] Detect separator, quoting rule, and ncolumns
Detecting sep ...
sep=',' with 89 lines of 4 fields using quote rule 0
Detected 4 columns on line 12. This line is either column names or first data row. Line starts as: <<"Goulburn","110018063",3499,>>
Quote rule picked = 0
fill=false and the most number of columns found is 4
[07] Detect column types, good nrow estimate and whether first row is column names
'header' changed by user from 'auto' to false
Number of sampling jump points = 101 because (366418375 bytes from row 1 to eof) / (2 * 1457 jump0size) == 125744
Type codes (jump 000) : 1551 Quote rule 0
Type codes (jump 100) : 11051 Quote rule 0
=====
Sampled 10027 rows (handled \n inside quoted fields) at 101 jump points
Bytes from first data row on line 12 to the end of last row: 366418143
Line length: mean=16.02 sd=0.21 min=16 max=29
Estimated number of rows: 366418143 / 16.02 = 22877178
Initial alloc = 25164895 rows (22877178 + 9%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
=====
[08] Assign column names
[09] Apply user overrides on column types
After 0 type and 0 drop user overrides : 11051
[10] Allocate memory for the datatable
Allocating 4 column slots (4 - 0 dropped) with 25164895 rows
[11] Read the data
jumps=[0..360), chunk_size=1017828, total_size=366418143
[12] Finalizing the datatable
Read 22885380 rows x 4 columns from 349.4MB (366418725 bytes) file in 00:01.335 wall clock time
Thread buffers were grown 0 times (if all 12 threads each grew once, this figure would be 12)
Final type counts
0 : drop
1 : bool8
0 : bool8
0 : bool8
0 : bool8
1 : int32
0 : int64
0 : float64
0 : float64
0 : float64
2 : string
Rereading 1 columns due to out-of-sample type exceptions.
Column 1 ("") bumped from 'bool8' to 'string' due to <<"Goulburn">> on row 0
[11] Read the data
jumps=[0..360), chunk_size=1017828, total_size=366418143
Read 98%. ETA 00:00
[12] Finalizing the datatable
Reread 22885380 rows x 1 columns in 00:01.049
Read 22885380 rows. Exactly what was estimated and allocated up front
=============================
0.002s ( 0%) Memory map 0.341GB file
0.006s ( 0%) sep=',' ncol=4 and header detection
0.001s ( 0%) Column type detection using 10027 sample rows
0.402s ( 17%) Allocation of 22885380 rows x 4 cols (0.469GB)
1.974s ( 83%) Reading 360 chunks of 0.971MB (63547 rows) using 12 threads
= 0.001s ( 0%) Finding first non-embedded \n after each jump
+ 0.209s ( 9%) Parse to row-major thread buffers
+ 0.864s ( 36%) Transpose
+ 0.900s ( 38%) Waiting
1.049s ( 44%) Rereading 1 columns due to out-of-sample type exceptions
2.385s Total
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
Using 12 threads (omp_get_max_threads()=12, nth=12)
NAstrings = [<<>>]
None of the NAstrings look like numbers.
show progress = 1
0/1 column will be read as boolean
[02] Opening the file
Opening file SA2-by-DJZ-2011.csv
File opened, size = 349.4MB (366418725 bytes).
Memory mapping ... ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
\r-only line endings are not allowed because \n is found in the data
[05] Skipping initial rows if needed
Positioned on line 1 starting: <<Australian Bureau of Statistic>>
[06] Detect separator, quoting rule, and ncolumns
Detecting sep ...
sep=',' with 89 lines of 4 fields using quote rule 0
Detected 4 columns on line 12. This line is either column names or first data row. Line starts as: <<"Goulburn","110018063",3499,>>
Quote rule picked = 0
fill=false and the most number of columns found is 4
[07] Detect column types, good nrow estimate and whether first row is column names
'header' changed by user from 'auto' to false
Number of sampling jump points = 101 because (366418375 bytes from row 1 to eof) / (2 * 1457 jump0size) == 125744
Type codes (jump 000) : 1551 Quote rule 0
Type codes (jump 100) : 11051 Quote rule 0
=====
Sampled 10027 rows (handled \n inside quoted fields) at 101 jump points
Bytes from first data row on line 12 to the end of last row: 366418143
Line length: mean=16.02 sd=0.21 min=16 max=29
Estimated number of rows: 366418143 / 16.02 = 22877178
Initial alloc = 25164895 rows (22877178 + 9%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
=====
[08] Assign column names
[09] Apply user overrides on column types
After 0 type and 0 drop user overrides : 11051
[10] Allocate memory for the datatable
Allocating 4 column slots (4 - 0 dropped) with 25164895 rows
[11] Read the data
jumps=[0..360), chunk_size=1017828, total_size=366418143
[12] Finalizing the datatable
Read 22885380 rows x 4 columns from 349.4MB (366418725 bytes) file in 00:01.212 wall clock time
Thread buffers were grown 0 times (if all 12 threads each grew once, this figure would be 12)
Final type counts
0 : drop
1 : bool8
0 : bool8
0 : bool8
0 : bool8
1 : int32
0 : int64
0 : float64
0 : float64
0 : float64
2 : string
Rereading 1 columns due to out-of-sample type exceptions.
Column 1 ("") bumped from 'bool8' to 'string' due to <<"Goulburn">> on row 0
[11] Read the data
jumps=[0..360), chunk_size=1017828, total_size=366418143
Read 96%. ETA 00:00
[12] Finalizing the datatable
Reread 22885380 rows x 1 columns in 00:00.414
Read 22885380 rows. Exactly what was estimated and allocated up front
=============================
0.002s ( 0%) Memory map 0.341GB file
0.007s ( 0%) sep=',' ncol=4 and header detection
0.002s ( 0%) Column type detection using 10027 sample rows
0.293s ( 18%) Allocation of 22885380 rows x 4 cols (0.469GB)
1.322s ( 81%) Reading 360 chunks of 0.971MB (63547 rows) using 12 threads
= 0.001s ( 0%) Finding first non-embedded \n after each jump
+ 0.199s ( 12%) Parse to row-major thread buffers
+ 0.822s ( 51%) Transpose
+ 0.301s ( 19%) Waiting
0.414s ( 25%) Rereading 1 columns due to out-of-sample type exceptions
1.626s Total
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
Using 12 threads (omp_get_max_threads()=12, nth=12)
NAstrings = [<<>>]
None of the NAstrings look like numbers.
show progress = 1
0/1 column will be read as boolean
[02] Opening the file
Opening file SA2-by-DJZ-2011.csv
File opened, size = 349.4MB (366418725 bytes).
Memory mapping ... ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
\r-only line endings are not allowed because \n is found in the data
[05] Skipping initial rows if needed
Positioned on line 1 starting: <<Australian Bureau of Statistic>>
[06] Detect separator, quoting rule, and ncolumns
Detecting sep ...
sep=',' with 89 lines of 4 fields using quote rule 0
Detected 4 columns on line 12. This line is either column names or first data row. Line starts as: <<"Goulburn","110018063",3499,>>
Quote rule picked = 0
fill=false and the most number of columns found is 4
[07] Detect column types, good nrow estimate and whether first row is column names
'header' changed by user from 'auto' to false
Number of sampling jump points = 101 because (366418375 bytes from row 1 to eof) / (2 * 1457 jump0size) == 125744
Type codes (jump 000) : 1551 Quote rule 0
Type codes (jump 100) : 11051 Quote rule 0
=====
Sampled 10027 rows (handled \n inside quoted fields) at 101 jump points
Bytes from first data row on line 12 to the end of last row: 366418143
Line length: mean=16.02 sd=0.21 min=16 max=29
Estimated number of rows: 366418143 / 16.02 = 22877178
Initial alloc = 25164895 rows (22877178 + 9%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
=====
[08] Assign column names
[09] Apply user overrides on column types
After 0 type and 0 drop user overrides : 11051
[10] Allocate memory for the datatable
Allocating 4 column slots (4 - 0 dropped) with 25164895 rows
[11] Read the data
jumps=[0..360), chunk_size=1017828, total_size=366418143
[12] Finalizing the datatable
Read 22885380 rows x 4 columns from 349.4MB (366418725 bytes) file in 00:01.451 wall clock time
Thread buffers were grown 0 times (if all 12 threads each grew once, this figure would be 12)
Final type counts
0 : drop
1 : bool8
0 : bool8
0 : bool8
0 : bool8
1 : int32
0 : int64
0 : float64
0 : float64
0 : float64
2 : string
Rereading 1 columns due to out-of-sample type exceptions.
Column 1 ("") bumped from 'bool8' to 'string' due to <<"Goulburn">> on row 0
[11] Read the data
jumps=[0..360), chunk_size=1017828, total_size=366418143
Read 97%. ETA 00:00
[12] Finalizing the datatable
Reread 22885380 rows x 1 columns in 00:00.409
Read 22885380 rows. Exactly what was estimated and allocated up front
=============================
0.002s ( 0%) Memory map 0.341GB file
0.006s ( 0%) sep=',' ncol=4 and header detection
0.001s ( 0%) Column type detection using 10027 sample rows
0.403s ( 22%) Allocation of 22885380 rows x 4 cols (0.469GB)
1.448s ( 78%) Reading 360 chunks of 0.971MB (63547 rows) using 12 threads
= 0.001s ( 0%) Finding first non-embedded \n after each jump
+ 0.194s ( 10%) Parse to row-major thread buffers
+ 0.974s ( 52%) Transpose
+ 0.279s ( 15%) Waiting
0.409s ( 22%) Rereading 1 columns due to out-of-sample type exceptions
1.860s Total
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
Using 12 threads (omp_get_max_threads()=12, nth=12)
NAstrings = [<<>>]
None of the NAstrings look like numbers.
show progress = 1
0/1 column will be read as boolean
[02] Opening the file
Opening file SA2-by-DJZ-2011.csv
File opened, size = 349.4MB (366418725 bytes).
Memory mapping ... ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
\r-only line endings are not allowed because \n is found in the data
[05] Skipping initial rows if needed
Positioned on line 1 starting: <<Australian Bureau of Statistic>>
[06] Detect separator, quoting rule, and ncolumns
Detecting sep ...
sep=',' with 89 lines of 4 fields using quote rule 0
Detected 4 columns on line 12. This line is either column names or first data row. Line starts as: <<"Goulburn","110018063",3499,>>
Quote rule picked = 0
fill=false and the most number of columns found is 4
[07] Detect column types, good nrow estimate and whether first row is column names
'header' changed by user from 'auto' to false
Number of sampling jump points = 101 because (366418375 bytes from row 1 to eof) / (2 * 1457 jump0size) == 125744
Type codes (jump 000) : 1551 Quote rule 0
Type codes (jump 100) : 11051 Quote rule 0
=====
Sampled 10027 rows (handled \n inside quoted fields) at 101 jump points
Bytes from first data row on line 12 to the end of last row: 366418143
Line length: mean=16.02 sd=0.21 min=16 max=29
Estimated number of rows: 366418143 / 16.02 = 22877178
Initial alloc = 25164895 rows (22877178 + 9%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
=====
[08] Assign column names
[09] Apply user overrides on column types
After 0 type and 0 drop user overrides : 11051
[10] Allocate memory for the datatable
Allocating 4 column slots (4 - 0 dropped) with 25164895 rows
[11] Read the data
jumps=[0..360), chunk_size=1017828, total_size=366418143
[12] Finalizing the datatable
Read 22885380 rows x 4 columns from 349.4MB (366418725 bytes) file in 00:01.480 wall clock time
Thread buffers were grown 0 times (if all 12 threads each grew once, this figure would be 12)
Final type counts
0 : drop
1 : bool8
0 : bool8
0 : bool8
0 : bool8
1 : int32
0 : int64
0 : float64
0 : float64
0 : float64
2 : string
Rereading 1 columns due to out-of-sample type exceptions.
Column 1 ("") bumped from 'bool8' to 'string' due to <<"Goulburn">> on row 0
[11] Read the data
jumps=[0..360), chunk_size=1017828, total_size=366418143
Read 99%. ETA 00:00
[12] Finalizing the datatable
Reread 22885380 rows x 1 columns in 00:00.412
Read 22885380 rows. Exactly what was estimated and allocated up front
=============================
0.002s ( 0%) Memory map 0.341GB file
0.007s ( 0%) sep=',' ncol=4 and header detection
0.001s ( 0%) Column type detection using 10027 sample rows
0.459s ( 24%) Allocation of 22885380 rows x 4 cols (0.469GB)
1.424s ( 75%) Reading 360 chunks of 0.971MB (63547 rows) using 12 threads
= 0.001s ( 0%) Finding first non-embedded \n after each jump
+ 0.197s ( 10%) Parse to row-major thread buffers
+ 0.938s ( 50%) Transpose
+ 0.288s ( 15%) Waiting
0.412s ( 22%) Rereading 1 columns due to out-of-sample type exceptions
1.892s Total
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
Using 12 threads (omp_get_max_threads()=12, nth=12)
NAstrings = [<<>>]
None of the NAstrings look like numbers.
show progress = 1
0/1 column will be read as boolean
[02] Opening the file
Opening file SA2-by-DJZ-2011.csv
File opened, size = 349.4MB (366418725 bytes).
Memory mapping ... ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
\r-only line endings are not allowed because \n is found in the data
[05] Skipping initial rows if needed
Positioned on line 1 starting: <<Australian Bureau of Statistic>>
[06] Detect separator, quoting rule, and ncolumns
Detecting sep ...
sep=',' with 89 lines of 4 fields using quote rule 0
Detected 4 columns on line 12. This line is either column names or first data row. Line starts as: <<"Goulburn","110018063",3499,>>
Quote rule picked = 0
fill=false and the most number of columns found is 4
[07] Detect column types, good nrow estimate and whether first row is column names
'header' changed by user from 'auto' to false
Number of sampling jump points = 101 because (366418375 bytes from row 1 to eof) / (2 * 1457 jump0size) == 125744
Type codes (jump 000) : 1551 Quote rule 0
Type codes (jump 100) : 11051 Quote rule 0
=====
Sampled 10027 rows (handled \n inside quoted fields) at 101 jump points
Bytes from first data row on line 12 to the end of last row: 366418143
Line length: mean=16.02 sd=0.21 min=16 max=29
Estimated number of rows: 366418143 / 16.02 = 22877178
Initial alloc = 25164895 rows (22877178 + 9%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
=====
[08] Assign column names
[09] Apply user overrides on column types
After 0 type and 0 drop user overrides : 11051
[10] Allocate memory for the datatable
Allocating 4 column slots (4 - 0 dropped) with 25164895 rows
[11] Read the data
jumps=[0..360), chunk_size=1017828, total_size=366418143
[12] Finalizing the datatable
Read 22885380 rows x 4 columns from 349.4MB (366418725 bytes) file in 00:01.381 wall clock time
Thread buffers were grown 0 times (if all 12 threads each grew once, this figure would be 12)
Final type counts
0 : drop
1 : bool8
0 : bool8
0 : bool8
0 : bool8
1 : int32
0 : int64
0 : float64
0 : float64
0 : float64
2 : string
Rereading 1 columns due to out-of-sample type exceptions.
Column 1 ("") bumped from 'bool8' to 'string' due to <<"Goulburn">> on row 0
[11] Read the data
jumps=[0..360), chunk_size=1017828, total_size=366418143
Read 97%. ETA 00:00
[12] Finalizing the datatable
Reread 22885380 rows x 1 columns in 00:00.401
Read 22885380 rows. Exactly what was estimated and allocated up front
=============================
0.002s ( 0%) Memory map 0.341GB file
0.005s ( 0%) sep=',' ncol=4 and header detection
0.001s ( 0%) Column type detection using 10027 sample rows
0.384s ( 22%) Allocation of 22885380 rows x 4 cols (0.469GB)
1.389s ( 78%) Reading 360 chunks of 0.971MB (63547 rows) using 12 threads
= 0.001s ( 0%) Finding first non-embedded \n after each jump
+ 0.196s ( 11%) Parse to row-major thread buffers
+ 0.911s ( 51%) Transpose
+ 0.281s ( 16%) Waiting
0.401s ( 22%) Rereading 1 columns due to out-of-sample type exceptions
1.781s Total
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
Using 12 threads (omp_get_max_threads()=12, nth=12)
NAstrings = [<<>>]
None of the NAstrings look like numbers.
show progress = 1
0/1 column will be read as boolean
[02] Opening the file
Opening file SA2-by-DJZ-2011.csv
File opened, size = 349.4MB (366418725 bytes).
Memory mapping ... ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
\r-only line endings are not allowed because \n is found in the data
[05] Skipping initial rows if needed
Positioned on line 1 starting: <<Australian Bureau of Statistic>>
[06] Detect separator, quoting rule, and ncolumns
Detecting sep ...
sep=',' with 89 lines of 4 fields using quote rule 0
Detected 4 columns on line 12. This line is either column names or first data row. Line starts as: <<"Goulburn","110018063",3499,>>
Quote rule picked = 0
fill=false and the most number of columns found is 4
[07] Detect column types, good nrow estimate and whether first row is column names
'header' changed by user from 'auto' to false
Number of sampling jump points = 101 because (366418375 bytes from row 1 to eof) / (2 * 1457 jump0size) == 125744
Type codes (jump 000) : 1551 Quote rule 0
Type codes (jump 100) : 11051 Quote rule 0
=====
Sampled 10027 rows (handled \n inside quoted fields) at 101 jump points
Bytes from first data row on line 12 to the end of last row: 366418143
Line length: mean=16.02 sd=0.21 min=16 max=29
Estimated number of rows: 366418143 / 16.02 = 22877178
Initial alloc = 25164895 rows (22877178 + 9%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
=====
[08] Assign column names
[09] Apply user overrides on column types
After 0 type and 0 drop user overrides : 11051
[10] Allocate memory for the datatable
Allocating 4 column slots (4 - 0 dropped) with 25164895 rows
[11] Read the data
jumps=[0..360), chunk_size=1017828, total_size=366418143
[12] Finalizing the datatable
Read 22885380 rows x 4 columns from 349.4MB (366418725 bytes) file in 00:01.384 wall clock time
Thread buffers were grown 0 times (if all 12 threads each grew once, this figure would be 12)
Final type counts
0 : drop
1 : bool8
0 : bool8
0 : bool8
0 : bool8
1 : int32
0 : int64
0 : float64
0 : float64
0 : float64
2 : string
Rereading 1 columns due to out-of-sample type exceptions.
Column 1 ("") bumped from 'bool8' to 'string' due to <<"Goulburn">> on row 0
[11] Read the data
jumps=[0..360), chunk_size=1017828, total_size=366418143
Read 98%. ETA 00:00
[12] Finalizing the datatable
Reread 22885380 rows x 1 columns in 00:00.480
Read 22885380 rows. Exactly what was estimated and allocated up front
=============================
0.002s ( 0%) Memory map 0.341GB file
0.006s ( 0%) sep=',' ncol=4 and header detection
0.002s ( 0%) Column type detection using 10027 sample rows
0.476s ( 26%) Allocation of 22885380 rows x 4 cols (0.469GB)
1.378s ( 74%) Reading 360 chunks of 0.971MB (63547 rows) using 12 threads
= 0.001s ( 0%) Finding first non-embedded \n after each jump
+ 0.192s ( 10%) Parse to row-major thread buffers
+ 0.833s ( 45%) Transpose
+ 0.352s ( 19%) Waiting
0.480s ( 26%) Rereading 1 columns due to out-of-sample type exceptions
1.864s Total
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
Using 12 threads (omp_get_max_threads()=12, nth=12)
NAstrings = [<<>>]
None of the NAstrings look like numbers.
show progress = 1
0/1 column will be read as boolean
[02] Opening the file
Opening file SA2-by-DJZ-2011.csv
File opened, size = 349.4MB (366418725 bytes).
Memory mapping ... ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
\r-only line endings are not allowed because \n is found in the data
[05] Skipping initial rows if needed
Positioned on line 1 starting: <<Australian Bureau of Statistic>>
[06] Detect separator, quoting rule, and ncolumns
Detecting sep ...
sep=',' with 89 lines of 4 fields using quote rule 0
Detected 4 columns on line 12. This line is either column names or first data row. Line starts as: <<"Goulburn","110018063",3499,>>
Quote rule picked = 0
fill=false and the most number of columns found is 4
[07] Detect column types, good nrow estimate and whether first row is column names
'header' changed by user from 'auto' to false
Number of sampling jump points = 101 because (366418375 bytes from row 1 to eof) / (2 * 1457 jump0size) == 125744
Type codes (jump 000) : 1551 Quote rule 0
Type codes (jump 100) : 11051 Quote rule 0
=====
Sampled 10027 rows (handled \n inside quoted fields) at 101 jump points
Bytes from first data row on line 12 to the end of last row: 366418143
Line length: mean=16.02 sd=0.21 min=16 max=29
Estimated number of rows: 366418143 / 16.02 = 22877178
Initial alloc = 25164895 rows (22877178 + 9%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
=====
[08] Assign column names
[09] Apply user overrides on column types
After 0 type and 0 drop user overrides : 11051
[10] Allocate memory for the datatable
Allocating 4 column slots (4 - 0 dropped) with 25164895 rows
[11] Read the data
jumps=[0..360), chunk_size=1017828, total_size=366418143
[12] Finalizing the datatable
Read 22885380 rows x 4 columns from 349.4MB (366418725 bytes) file in 00:01.374 wall clock time
Thread buffers were grown 0 times (if all 12 threads each grew once, this figure would be 12)
Final type counts
0 : drop
1 : bool8
0 : bool8
0 : bool8
0 : bool8
1 : int32
0 : int64
0 : float64
0 : float64
0 : float64
2 : string
Rereading 1 columns due to out-of-sample type exceptions.
Column 1 ("") bumped from 'bool8' to 'string' due to <<"Goulburn">> on row 0
[11] Read the data
jumps=[0..360), chunk_size=1017828, total_size=366418143
Read 98%. ETA 00:00
[12] Finalizing the datatable
Reread 22885380 rows x 1 columns in 00:00.507
Read 22885380 rows. Exactly what was estimated and allocated up front
=============================
0.002s ( 0%) Memory map 0.341GB file
0.006s ( 0%) sep=',' ncol=4 and header detection
0.001s ( 0%) Column type detection using 10027 sample rows
0.311s ( 17%) Allocation of 22885380 rows x 4 cols (0.469GB)
1.562s ( 83%) Reading 360 chunks of 0.971MB (63547 rows) using 12 threads
= 0.001s ( 0%) Finding first non-embedded \n after each jump
+ 0.193s ( 10%) Parse to row-major thread buffers
+ 0.988s ( 52%) Transpose
+ 0.381s ( 20%) Waiting
0.507s ( 27%) Rereading 1 columns due to out-of-sample type exceptions
1.881s Total
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
Using 12 threads (omp_get_max_threads()=12, nth=12)
NAstrings = [<<>>]
None of the NAstrings look like numbers.
show progress = 1
0/1 column will be read as boolean
[02] Opening the file
Opening file SA2-by-DJZ-2011.csv
File opened, size = 349.4MB (366418725 bytes).
Memory mapping ... ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
\r-only line endings are not allowed because \n is found in the data
[05] Skipping initial rows if needed
Positioned on line 1 starting: <<Australian Bureau of Statistic>>
[06] Detect separator, quoting rule, and ncolumns
Detecting sep ...
sep=',' with 89 lines of 4 fields using quote rule 0
Detected 4 columns on line 12. This line is either column names or first data row. Line starts as: <<"Goulburn","110018063",3499,>>
Quote rule picked = 0
fill=false and the most number of columns found is 4
[07] Detect column types, good nrow estimate and whether first row is column names
'header' changed by user from 'auto' to false
Number of sampling jump points = 101 because (366418375 bytes from row 1 to eof) / (2 * 1457 jump0size) == 125744
Type codes (jump 000) : 1551 Quote rule 0
Type codes (jump 100) : 11051 Quote rule 0
=====
Sampled 10027 rows (handled \n inside quoted fields) at 101 jump points
Bytes from first data row on line 12 to the end of last row: 366418143
Line length: mean=16.02 sd=0.21 min=16 max=29
Estimated number of rows: 366418143 / 16.02 = 22877178
Initial alloc = 25164895 rows (22877178 + 9%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
=====
[08] Assign column names
[09] Apply user overrides on column types
After 0 type and 0 drop user overrides : 11051
[10] Allocate memory for the datatable
Allocating 4 column slots (4 - 0 dropped) with 25164895 rows
[11] Read the data
jumps=[0..360), chunk_size=1017828, total_size=366418143
[12] Finalizing the datatable
Read 22885380 rows x 4 columns from 349.4MB (366418725 bytes) file in 00:01.318 wall clock time
Thread buffers were grown 0 times (if all 12 threads each grew once, this figure would be 12)
Final type counts
0 : drop
1 : bool8
0 : bool8
0 : bool8
0 : bool8
1 : int32
0 : int64
0 : float64
0 : float64
0 : float64
2 : string
Rereading 1 columns due to out-of-sample type exceptions.
Column 1 ("") bumped from 'bool8' to 'string' due to <<"Goulburn">> on row 0
[11] Read the data
jumps=[0..360), chunk_size=1017828, total_size=366418143
Read 96%. ETA 00:00
[12] Finalizing the datatable
Reread 22885380 rows x 1 columns in 00:00.493
Read 22885380 rows. Exactly what was estimated and allocated up front
=============================
0.002s ( 0%) Memory map 0.341GB file
0.006s ( 0%) sep=',' ncol=4 and header detection
0.001s ( 0%) Column type detection using 10027 sample rows
0.306s ( 17%) Allocation of 22885380 rows x 4 cols (0.469GB)
1.496s ( 83%) Reading 360 chunks of 0.971MB (63547 rows) using 12 threads
= 0.001s ( 0%) Finding first non-embedded \n after each jump
+ 0.193s ( 11%) Parse to row-major thread buffers
+ 0.935s ( 52%) Transpose
+ 0.367s ( 20%) Waiting
0.493s ( 27%) Rereading 1 columns due to out-of-sample type exceptions
1.811s Total
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
Using 12 threads (omp_get_max_threads()=12, nth=12)
NAstrings = [<<>>]
None of the NAstrings look like numbers.
show progress = 1
0/1 column will be read as boolean
[02] Opening the file
Opening file SA2-by-DJZ-2011.csv
File opened, size = 349.4MB (366418725 bytes).
Memory mapping ... ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
\r-only line endings are not allowed because \n is found in the data
[05] Skipping initial rows if needed
Positioned on line 1 starting: <<Australian Bureau of Statistic>>
[06] Detect separator, quoting rule, and ncolumns
Detecting sep ...
sep=',' with 89 lines of 4 fields using quote rule 0
Detected 4 columns on line 12. This line is either column names or first data row. Line starts as: <<"Goulburn","110018063",3499,>>
Quote rule picked = 0
fill=false and the most number of columns found is 4
[07] Detect column types, good nrow estimate and whether first row is column names
'header' changed by user from 'auto' to false
Number of sampling jump points = 101 because (366418375 bytes from row 1 to eof) / (2 * 1457 jump0size) == 125744
Type codes (jump 000) : 1551 Quote rule 0
Type codes (jump 100) : 11051 Quote rule 0
=====
Sampled 10027 rows (handled \n inside quoted fields) at 101 jump points
Bytes from first data row on line 12 to the end of last row: 366418143
Line length: mean=16.02 sd=0.21 min=16 max=29
Estimated number of rows: 366418143 / 16.02 = 22877178
Initial alloc = 25164895 rows (22877178 + 9%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
=====
[08] Assign column names
[09] Apply user overrides on column types
After 0 type and 0 drop user overrides : 11051
[10] Allocate memory for the datatable
Allocating 4 column slots (4 - 0 dropped) with 25164895 rows
[11] Read the data
jumps=[0..360), chunk_size=1017828, total_size=366418143
[12] Finalizing the datatable
Read 22885380 rows x 4 columns from 349.4MB (366418725 bytes) file in 00:01.141 wall clock time
Thread buffers were grown 0 times (if all 12 threads each grew once, this figure would be 12)
Final type counts
0 : drop
1 : bool8
0 : bool8
0 : bool8
0 : bool8
1 : int32
0 : int64
0 : float64
0 : float64
0 : float64
2 : string
Rereading 1 columns due to out-of-sample type exceptions.
Column 1 ("") bumped from 'bool8' to 'string' due to <<"Goulburn">> on row 0
[11] Read the data
jumps=[0..360), chunk_size=1017828, total_size=366418143
Read 98%. ETA 00:00
[12] Finalizing the datatable
Reread 22885380 rows x 1 columns in 00:00.506
Read 22885380 rows. Exactly what was estimated and allocated up front
=============================
0.002s ( 0%) Memory map 0.341GB file
0.007s ( 0%) sep=',' ncol=4 and header detection
0.001s ( 0%) Column type detection using 10027 sample rows
0.132s ( 8%) Allocation of 22885380 rows x 4 cols (0.469GB)
1.506s ( 91%) Reading 360 chunks of 0.971MB (63547 rows) using 12 threads
= 0.001s ( 0%) Finding first non-embedded \n after each jump
+ 0.195s ( 12%) Parse to row-major thread buffers
+ 0.938s ( 57%) Transpose
+ 0.371s ( 23%) Waiting
0.506s ( 31%) Rereading 1 columns due to out-of-sample type exceptions
1.647s Total
Warning: stack imbalance in 'for', 2 then 8
There were 20 warnings (use warnings() to see them)
Strangely enough, that certainty is great. Thanks. So the flush didn't work, and I'll need to find a way to avoid Rprintf after all. I guess that verbose=FALSE, showProgress=FALSE works fine.
Leave it with me then. Thanks again.
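One general way to keep console output out of a parallel region is to have each worker write its message into its own buffer and to print everything from a single thread afterwards. The sketch below only illustrates that pattern and is not the code in fread.c; process_chunk, run_chunks, NCHUNK and MSGLEN are made-up names.

```c
#include <stdio.h>

#define NCHUNK 8   /* hypothetical number of chunks */
#define MSGLEN 64  /* hypothetical per-chunk message size */

/* Hypothetical stand-in for the per-chunk work: it records a message
   in the caller-supplied buffer instead of printing it. */
static void process_chunk(int chunk, char *msg, size_t len) {
    snprintf(msg, len, "chunk %d done", chunk);
}

/* Processes all chunks (in parallel when built with OpenMP), then prints
   the collected messages serially. Returns the number of lines printed. */
static int run_chunks(FILE *out) {
    char msgs[NCHUNK][MSGLEN];
#ifdef _OPENMP
    #pragma omp parallel for
#endif
    for (int i = 0; i < NCHUNK; i++) {
        /* No console I/O here: each iteration touches only its own buffer. */
        process_chunk(i, msgs[i], MSGLEN);
    }
    int printed = 0;
    for (int i = 0; i < NCHUNK; i++) {
        /* Serial section: the only place where output is written. */
        fprintf(out, "%s\n", msgs[i]);
        printed++;
    }
    return printed;
}
```

Calling run_chunks(stdout) prints one line per chunk in chunk order, whichever thread did the work.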
@HughParsonage OK, please try again with my recent 2nd attempt. It isn't merged to master yet, so please be sure to fetch the Windows.zip from this branch. As before, please provide the full output either way so that I can check it. Thanks!
The first attempt with the following resulted in a crash (something about a pointer).
On the second attempt (after restarting), I get the warning: stack imbalance in '$', 16 then 15
# Assert that `data.table` is not installed:
stopifnot(!requireNamespace("data.table", quietly = TRUE))
install.packages("https://ci.appveyor.com/api/buildjobs/bpsehtwybbbgbyy3/artifacts/data.table_1.10.5.zip",
repos = NULL)
# Installing package into ‘C:/Users/hughp/Documents/R/win-library/3.4’
# (as ‘lib’ is unspecified)
# trying URL 'https://ci.appveyor.com/api/buildjobs/bpsehtwybbbgbyy3/artifacts/data.table_1.10.5.zip'
# Content type 'application/octet-stream' length 1557502 bytes (1.5 MB)
# downloaded 1.5 MB
library(data.table)
# data.table 1.10.5 IN DEVELOPMENT built 2017-11-17 01:38:17 UTC; appveyor
# The fastest way to learn (by data.table authors): https://www.datacamp.com/courses/data-analysis-the-data-table-way
# Documentation: ?data.table, example(data.table) and browseVignettes("data.table")
# Release notes, videos and slides: http://r-datatable.com
setwd("~/ABS-data/inbox/SA2-by-DJZ-2011/")
fread("SA2-by-DJZ-2011.csv", verbose = TRUE, na.strings = "", header = FALSE)
# Input contains no \n. Taking this to be a filename to open
# [01] Check arguments
# Using 12 threads (omp_get_max_threads()=12, nth=12)
# NAstrings = [<<>>]
# None of the NAstrings look like numbers.
# show progress = 1
# 0/1 column will be read as boolean
# [02] Opening the file
# Opening file SA2-by-DJZ-2011.csv
# File opened, size = 349.4MB (366418725 bytes).
# Memory mapping ... ok
# [03] Detect and skip BOM
# [04] Arrange mmap to be \0 terminated
# \r-only line endings are not allowed because \n is found in the data
# [05] Skipping initial rows if needed
# Positioned on line 1 starting: <<Australian Bureau of Statistic>>
# [06] Detect separator, quoting rule, and ncolumns
# Detecting sep ...
# sep=',' with 89 lines of 4 fields using quote rule 0
# Detected 4 columns on line 12. This line is either column names or first data row. Line starts as: <<"Goulburn","110018063",3499,>>
# Quote rule picked = 0
# fill=false and the most number of columns found is 4
# [07] Detect column types, good nrow estimate and whether first row is column names
# 'header' changed by user from 'auto' to false
# Number of sampling jump points = 101 because (366418375 bytes from row 1 to eof) / (2 * 1457 jump0size) == 125744
# Type codes (jump 000) : 1551 Quote rule 0
# Type codes (jump 100) : 1A51 Quote rule 0
# =====
# Sampled 10027 rows (handled \n inside quoted fields) at 101 jump points
# Bytes from first data row on line 12 to the end of last row: 366418143
# Line length: mean=16.02 sd=0.21 min=16 max=29
# Estimated number of rows: 366418143 / 16.02 = 22877178
# Initial alloc = 25164895 rows (22877178 + 9%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
# =====
# [08] Assign column names
# [09] Apply user overrides on column types
# After 0 type and 0 drop user overrides : 1A51
# [10] Allocate memory for the datatable
# Allocating 4 column slots (4 - 0 dropped) with 25164895 rows
# [11] Read the data
# jumps=[0..360), chunk_size=1017828, total_size=366418143
# Read 78%. ETA 00:00 Warning: stack imbalance in '$', 16 then 15
# Read 22885380 rows x 4 columns from 349.4MB (366418725 bytes) file in 00:02.677 wall clock time
# [12] Finalizing the datatable
# Type counts:
# 1 : bool8 '1'
# 1 : int32 '5'
# 2 : string 'A'
# =============================
# 0.002s ( 0%) Memory map 0.341GB file
# 0.007s ( 0%) sep=',' ncol=4 and header detection
# 0.001s ( 0%) Column type detection using 10027 sample rows
# 0.297s ( 11%) Allocation of 25164895 rows x 4 cols (0.469GB) of which 22885380 ( 91%) rows used
# 2.369s ( 89%) Reading 360 chunks of 0.971MB (63547 rows) using 12 threads
# = 0.003s ( 0%) Finding first non-embedded \n after each jump
# + 0.273s ( 10%) Parse to row-major thread buffers (grown 0 times)
# + 1.313s ( 49%) Transpose
# + 0.780s ( 29%) Waiting
# 0.893s ( 33%) Rereading 1 columns due to out-of-sample type exceptions
# 2.677s Total
# Column 1 ("") bumped from 'bool8' to 'string' due to <<"Goulburn">> on row 0
# V1 V2 V3 V4
# 1: Goulburn 110018063 3499 NA
# 2: NA 110018064 812 NA
# 3: NA 110018065 2158 NA
# 4: NA 110019999 402 NA
# 5: NA 110028068 10 NA
# ---
# 22885376: NA 997999799 0 NA
# 22885377: NA 998999899 64 NA
# 22885378: NA 994999499 34 NA
# 22885379: NA 0&&&&&&&& 250796 NA
# 22885380: NA 0@@@@@@@@ 7305367 NA
# Warning messages:
# 1: In fread("SA2-by-DJZ-2011.csv", verbose = TRUE, na.strings = "", :
# Starting data input on line 12 <<"Goulburn","110018063",3499,>> with 4 fields and discarding line 11 <<"Main Statistical Area Structu>> before it because it has a different number of fields (3).
# 2: In fread("SA2-by-DJZ-2011.csv", verbose = TRUE, na.strings = "", :
# Found the last consistent line but text exists afterwards. Consider fill=TRUE and/or blank.lines.skip=TRUE. First 200 characters of discarded line: <<"Dataset: 2011 Census of Population and Housing">>
Hello @mattdowle. There are versions of GCC still in use where OpenMP is not 4.0 but probably 3.1. I ran into this with one of my packages on CRAN (Delaporte): it compiled under Rtools for Windows (based on 4.9.3), where I tried to use SIMD directives (OpenMP 4.0), but it threw errors on someone's Linux machine that was still using gcc 4.8.0. If my memory is correct, even on Windows only 4.0 calls can be used, not 4.5 calls. Perhaps this is contributing to the problem?
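For anyone wanting to check this on their own toolchain, the sketch below reports which OpenMP specification a compiler claims to implement. It relies only on the standard _OPENMP macro, which expands to the release date of the specification as yyyymm (201107 for 3.1, 201307 for 4.0, 201511 for 4.5). The function names are made up for illustration and are not part of data.table or Delaporte.

```c
#include <stdio.h>

/* Returns the OpenMP spec date (yyyymm) the compiler reports,
   or 0 when this file was compiled without OpenMP enabled. */
static long openmp_spec_date(void) {
#ifdef _OPENMP
    return (long)_OPENMP;
#else
    return 0L;
#endif
}

/* Maps a spec date to a human-readable version label. */
static const char *openmp_version_name(long date) {
    if (date >= 201511L) return "4.5 or later";
    if (date >= 201307L) return "4.0";
    if (date >= 201107L) return "3.1";
    if (date >= 200805L) return "3.0";
    if (date > 0L)       return "older than 3.0";
    return "not enabled";
}

/* Prints the detected OpenMP level to the given stream. */
static void report_openmp(FILE *out) {
    long date = openmp_spec_date();
    fprintf(out, "OpenMP: %s (_OPENMP=%ld)\n", openmp_version_name(date), date);
}
```

Compiling with and without -fopenmp and calling report_openmp(stdout) shows whether directives such as simd (4.0) can be expected to work with that compiler.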
@HughParsonage Thanks for testing so quickly! OK, I'll keep thinking then!
@aadler That's a good thought - anything is possible.
@HughParsonage Please confirm that the same command works fine with one change (verbose=FALSE), i.e.
fread("SA2-by-DJZ-2011.csv", verbose = FALSE, na.strings = "", header = FALSE)
The progress meter will still be displayed.
Yes, running that command (10 times) returned the expected result (i.e. a data.table with only the two warnings, which arise because the data is malformed). There were no stack imbalance warnings.
Thanks. So it does seem to be related to console output. I have a few more things to try...
In verbose mode there are a few places that call wallclock() inside the parallel region. To rule that out, I've short-circuited it so that it always returns 0.0 and avoids the system call. I thought it was thread-safe, but maybe it isn't. Please try the new Windows.zip from the branch, which has been rebuilt here.
First attempt:
install.packages("https://ci.appveyor.com/api/buildjobs/o0pn9ttkrbqgqw2k/artifacts/data.table_1.10.5.zip",
repos = NULL)
# Installing package into ‘C:/Users/hughp/Documents/R/win-library/3.4’
# (as ‘lib’ is unspecified)
# trying URL 'https://ci.appveyor.com/api/buildjobs/o0pn9ttkrbqgqw2k/artifacts/data.table_1.10.5.zip'
# Content type 'application/octet-stream' length 1556972 bytes (1.5 MB)
# downloaded 1.5 MB
#
# package ‘data.table’ successfully unpacked and MD5 sums checked
library(data.table)
# data.table 1.10.5 IN DEVELOPMENT built 2017-11-17 03:49:20 UTC; appveyor
# The fastest way to learn (by data.table authors): https://www.datacamp.com/courses/data-analysis-the-data-table-way
# Documentation: ?data.table, example(data.table) and browseVignettes("data.table")
# Release notes, videos and slides: http://r-datatable.com
fread("SA2-by-DJZ-2011.csv", verbose = TRUE, na.strings = "", header = FALSE)
On the second attempt, I get the following warnings:
Read 22%. ETA 00:00 Error in fread("SA2-by-DJZ-2011.csv", verbose = TRUE, na.strings = "", :
unprotect_ptr: pointer not found
In addition: Warning messages:
1: In fread("SA2-by-DJZ-2011.csv", verbose = TRUE, na.strings = "", :
Starting data input on line 12 <<"Goulburn","110018063",3499,>> with 4 fields and discarding line 11 <<"Main Statistical Area Structu>> before it because it has a different number of fields (3).
2: In fread("SA2-by-DJZ-2011.csv", verbose = TRUE, na.strings = "", :
Warning: stack imbalance in '$', 29 then 28
Found the last consistent line but text exists afterwards. Consider fill=TRUE and/or blank.lines.skip=TRUE. First 200 characters of discarded line: <<"Dataset: 2011 Census of Population and Housing">>
Warning: stack imbalance in 'lapply', 125 then 126
Warning: stack imbalance in 'lapply', 55 then 53
Warning: stack imbalance in 'lapply', 30 then 34
Warning: stack imbalance in '<-', 28 then 31
Warning: stack imbalance in '{', 24 then 27
Warning: stack imbalance in '{', 18 then 21
Come to think of it, could this be an RStudio issue? When I run the script from the terminal, I can't seem to reproduce it easily. I've been running from RStudio because it makes copying the console output easier.
When you say you can't _easily_ reproduce it outside RStudio, does it reproduce _at all_? Even if it only happens inside RStudio, I'm aiming to fix it on the data.table side. I'm asking as another route to confirm that it is definitely "just" the console output and not some other genuine stack imbalance in the fread logic.
I can't reproduce it outside RStudio at all. Inside RStudio it reproduces reliably (that is, I can always reproduce some warning or a crash). I tried the Windows command prompt and the git shell (for Windows).
I'm using RStudio version 1.1.383 on Windows. Would it help you if I raised this issue with them, or would you rather I wait?
Thanks. It's very useful to know that it only happens inside RStudio. There's no need to raise it with them. It means it has something to do with buffering of the output console (or similar). I have a workaround in progress and am about to push it.
I don't know why Windows won't compile this change:
fread.c:1054:3: warning: too many arguments for format [-Wformat-extra-args]
It works fine here on Linux and on Travis. This prevents the Windows.zip from being built, so the workaround can't be tested yet. I'll have to sleep on it.
(It complains about line 1054 but not about the next line, 1055, which is the same afaics. There must be some difference.)
A %llu problem with __VA_ARGS__ on Windows - surely not!?
OK, the windows.zip is finally ready to retry.
There are now several workarounds in this branch. If it works, I'll remove the workarounds to see which one it was. The %llu compiler warning looks the most promising because, consistent with @st-pasha's explanation here, it could cause a stack imbalance in the verbose output. The Rprintf layer had been hiding it from the compiler; it only surfaced now that fprintf is used directly.
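As a general illustration of the two points above (and not the actual fread.c code), the sketch below shows a variadic logging wrapper that the compiler can still check, plus a portable way to print a 64-bit unsigned value. The names CHECK_PRINTF, log_to, LOGF and print_file_size are hypothetical. The format attribute is a GCC/Clang extension, and PRIu64 comes from the standard inttypes.h header, which avoids hard-coding %llu on a C runtime that may not understand it.

```c
#include <inttypes.h>
#include <stdarg.h>
#include <stdint.h>
#include <stdio.h>

/* Ask GCC/Clang to check the format string against the arguments,
   just as they do for printf itself. A no-op on other compilers. */
#if defined(__GNUC__)
#  define CHECK_PRINTF(fmt_idx, arg_idx) \
       __attribute__((format(printf, fmt_idx, arg_idx)))
#else
#  define CHECK_PRINTF(fmt_idx, arg_idx)
#endif

/* Hypothetical logging layer: forwards to vfprintf. */
static int log_to(FILE *out, const char *fmt, ...) CHECK_PRINTF(2, 3);

static int log_to(FILE *out, const char *fmt, ...) {
    va_list ap;
    va_start(ap, fmt);
    int n = vfprintf(out, fmt, ap);
    va_end(ap);
    return n;
}

/* Variadic macro on top of the checked function. */
#define LOGF(...) log_to(stdout, __VA_ARGS__)

/* Prints a 64-bit size using the PRIu64 macro instead of a literal %llu.
   Returns the number of characters written. */
static int print_file_size(uint64_t bytes) {
    return LOGF("File size = %" PRIu64 " bytes\n", bytes);
}
```

Without the attribute, a mismatch between a conversion specifier and its argument inside a wrapper like log_to would compile silently, which is the kind of hiding described above.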
On the second attempt (after restarting):
stopifnot(!requireNamespace("data.table", quietly = TRUE))
install.packages("https://ci.appveyor.com/api/buildjobs/1pi0ae5iuyj9rhj8/artifacts/data.table_1.10.5.zip",
repos = NULL)
# Installing package into ‘C:/Users/hughp/Documents/R/win-library/3.4’
# (as ‘lib’ is unspecified)
# trying URL 'https://ci.appveyor.com/api/buildjobs/1pi0ae5iuyj9rhj8/artifacts/data.table_1.10.5.zip'
# Content type 'application/octet-stream' length 1559167 bytes (1.5 MB)
# downloaded 1.5 MB
#
# package ‘data.table’ successfully unpacked and MD5 sums checked
library(data.table)
# data.table 1.10.5 IN DEVELOPMENT built 2017-11-18 04:58:23 UTC; appveyor
# The fastest way to learn (by data.table authors): https://www.datacamp.com/courses/data-analysis-the-data-table-way
# Documentation: ?data.table, example(data.table) and browseVignettes("data.table")
# Release notes, videos and slides: http://r-datatable.com
fread("SA2-by-DJZ-2011.csv", verbose = TRUE, na.strings = "", header = FALSE)
Log file: C:\Users\hughp\AppData\Local\Temp\RtmpIT9H0D/fread.out
Input contains no \n. Taking this to be a filename to open
Read 11%. ETA 00:00 Warning: stack imbalance in '$', 19 then 20
Read 28%. ETA 00:00 Warning: stack imbalance in '$', 19 then 20
Read 48%. ETA 00:00 Warning: stack imbalance in '$', 20 then 19
Read 98%. ETA 00:00 [01] Check arguments
Using 12 threads (omp_get_max_threads()=12, nth=12)
NAstrings = [<<>>]
None of the NAstrings look like numbers.
show progress = 1
0/1 column will be read as boolean
[02] Opening the file
Opening file SA2-by-DJZ-2011.csv
File opened, size = 349.4MB (366418725 bytes).
Memory mapped ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
\n has been found in the data so any mixture of line endings is allowed other than \r-only line endings. This is common and ideal.
[05] Skipping initial rows if needed
Positioned on line 1 starting: <<Australian Bureau of Statistic>>
[06] Detect separator, quoting rule, and ncolumns
Detecting sep ...
sep=',' with 89 lines of 4 fields using quote rule 0
Detected 4 columns on line 12. This line is either column names or first data row. Line starts as: <<"Goulburn","110018063",3499,>>
Quote rule picked = 0
fill=false and the most number of columns found is 4
[07] Detect column types, good nrow estimate and whether first row is column names
'header' changed by user from 'auto' to false
Number of sampling jump points = 101 because (366418375 bytes from row 1 to eof) / (2 * 1457 jump0size) == 125744
Type codes (jump 000) : 1551 Quote rule 0
Type codes (jump 100) : 1A51 Quote rule 0
=====
Sampled 10027 rows (handled \n inside quoted fields) at 101 jump points
Bytes from first data row on line 12 to the end of last row: 366418143
Line length: mean=16.02 sd=0.21 min=16 max=29
Estimated number of rows: 366418143 / 16.02 = 22877178
Initial alloc = 25164895 rows (22877178 + 9%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
=====
[08] Assign column names
[09] Apply user overrides on column types
After 0 type and 0 drop user overrides : 1A51
[10] Allocate memory for the datatable
Allocating 4 column slots (4 - 0 dropped) with 25164895 rows
[11] Read the data
jumps=[0..360), chunk_size=1017828, total_size=366418143
[11] Read the data
jumps=[0..360), chunk_size=1017828, total_size=366418143
Read 22885380 rows x 4 columns from 349.4MB (366418725 bytes) file in 00:02.822 wall clock time
[12] Finalizing the datatable
Type counts:
1 : bool8 '1'
1 : int32 '5'
2 : string 'A'
=============================
0.000s ( 0%) Memory map 0.341GB file
0.001s ( 0%) sep=',' ncol=4 and header detection
0.000s ( 0%) Column type detection using 10027 sample rows
0.291s ( 10%) Allocation of 25164895 rows x 4 cols (0.469GB) of which 22885380 ( 91%) rows used
2.531s ( 90%) Reading 360 chunks of 0.971MB (63547 rows) using 12 threads
= 0.002s ( 0%) Finding first non-embedded \n after each jump
+ 0.282s ( 10%) Parse to row-major thread buffers (grown 0 times)
+ 1.537s ( 54%) Transpose
+ 0.710s ( 25%) Waiting
0.842s ( 30%) Rereading 1 columns due to out-of-sample type exceptions
2.822s Total
Column 1 ("") bumped from 'bool8' to 'string' due to <<"Goulburn">> on row 0
V1 V2 V3 V4
1: Goulburn 110018063 3499 NA
2: NA 110018064 812 NA
3: NA 110018065 2158 NA
4: NA 110019999 402 NA
5: NA 110028068 10 NA
---
22885376: NA 997999799 0 NA
22885377: NA 998999899 64 NA
22885378: NA 994999499 34 NA
22885379: NA 0&&&&&&&& 250796 NA
22885380: NA 0@@@@@@@@ 7305367 NA
Warning messages:
1: In fread("SA2-by-DJZ-2011.csv", verbose = TRUE, na.strings = "", :
Starting data input on line 12 <<"Goulburn","110018063",3499,>> with 4 fields and discarding line 11 <<"Main Statistical Area Structu>> before it because it has a different number of fields (3).
2: In fread("SA2-by-DJZ-2011.csv", verbose = TRUE, na.strings = "", :
Found the last consistent line but text exists afterwards. Consider fill=TRUE and/or blank.lines.skip=TRUE. First 200 characters of discarded line: <<"Dataset: 2011 Census of Population and Housing">>
Again, I can't reproduce this outside RStudio.
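For readers following the verbose log above, the "Initial alloc" line states how the row allocation is derived: bytes/max(mean-2*sd,min), clamped between 1.1*estn and 2.0*estn. The C sketch below only reproduces that printed arithmetic. The function name and signature are hypothetical and are not taken from fread.c, and because the log prints mean and sd rounded, results for other files may differ slightly from what fread reports.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not from fread.c): initial row allocation using the
 * formula printed in the verbose log. */
size_t initial_alloc_rows(double bytes, double mean, double sd,
                          double min_len, double est_rows)
{
    assert(bytes > 0 && mean > 0 && min_len > 0 && est_rows > 0);

    /* Allow for lines shorter than average, but never shorter than the
     * shortest sampled line. */
    double line_len = mean - 2.0 * sd;
    if (line_len < min_len) line_len = min_len;

    double rows = bytes / line_len;

    /* Clamp to [1.1 * estimate, 2.0 * estimate]. */
    double lo = 1.1 * est_rows;
    double hi = 2.0 * est_rows;
    if (rows < lo) rows = lo;
    if (rows > hi) rows = hi;

    return (size_t)rows;  /* truncate towards zero */
}
```

With the figures from the log (366418143 bytes, mean 16.02, sd 0.21, min 16, estimate 22877178), the unclamped guess falls below 1.1 times the estimate, so the lower clamp applies and the result is 25164895, which matches the "Initial alloc" line.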
Thanks for testing so quickly. Well, that definitely rules that idea out! I have two ideas left. I've pushed the first one. Please try the new Windows.zip here. That alloca is on the stack and is related to na.strings, which you are setting when the problem occurs. It is definitely in the right area (stack imbalance) and worth a try.
No problem. I'll be away for the next 12 hours or so, so I can't test until then.
On 18 November 2017 at 17:20, Matt Dowle [email protected] wrote:
Thanks for testing so quickly. Well, that definitely rules that idea out!
I have two ideas left. I've pushed the first one. Please try the new Windows.zip
here:
https://ci.appveyor.com/project/Rdatatable/data-table/build/1.0.1363/job/fo02vnbu5ebhwy3w/artifacts
That alloca is on the stack and is related to na.strings, which
you are setting when the problem occurs. It is definitely in the right area
(stack imbalance) and worth a try.
OK, thanks! I've pushed the second idea. I seem to remember \r causing problems on Windows in the past, but I don't recall a stack imbalance. Anyway, to rule it out, I've removed the \r from the progress meter. The stack imbalance message seems to be printed where the ETA line occurs. It's possible that the console handles \r differently, perhaps to move back and allow the last line to be replaced. Now a new line is printed each time the ETA is updated. This is just temporary, to rule it out. Please try the new Windows.zip here.
fread("SA2-by-DJZ-2011.csv", verbose = TRUE, na.strings = "", header = FALSE)
Log file: C:\Users\hughp\AppData\Local\Temp\RtmpcVjZ1f/fread.out
Input contains no \n. Taking this to be a filename to open
Read 5%. ETA 00:00
Read 8%. ETA 00:00
Read 11%. ETA 00:00
Read 15%. ETA 00:00
Read 18%. ETA 00:00
Read 21%. ETA 00:00
Read 25%. ETA 00:00
Read 28%. ETA 00:00
Read 31%. ETA 00:00
Read 35%. ETA 00:00
Read 38%. ETA 00:00
Read 41%. ETA 00:00
Read 45%. ETA 00:00
Read 48%. ETA 00:00
Read 51%. ETA 00:00
Read 55%. ETA 00:00
Warning: stack imbalance in '$', 30 then 31
Warning: stack imbalance in '$', 17 then 16
Read 58%. ETA 00:00
Read 61%. ETA 00:00
Read 65%. ETA 00:00
Read 68%. ETA 00:00
Read 71%. ETA 00:00
Read 75%. ETA 00:00
Read 78%. ETA 00:00
Read 81%. ETA 00:00
Read 85%. ETA 00:00
Read 88%. ETA 00:00
Read 91%. ETA 00:00
Read 95%. ETA 00:00
Read 98%. ETA 00:00
[01] Check arguments
Using 12 threads (omp_get_max_threads()=12, nth=12)
NAstrings = [<<>>]
None of the NAstrings look like numbers.
show progress = 1
0/1 column will be read as boolean
[02] Opening the file
Opening file SA2-by-DJZ-2011.csv
File opened, size = 349.4MB (366418725 bytes).
Memory mapped ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
\n has been found in the data so any mixture of line endings is allowed other than \r-only line endings. This is common and ideal.
[05] Skipping initial rows if needed
Positioned on line 1 starting: <<Australian Bureau of Statistic>>
[06] Detect separator, quoting rule, and ncolumns
Detecting sep ...
sep=',' with 89 lines of 4 fields using quote rule 0
Detected 4 columns on line 12. This line is either column names or first data row. Line starts as: <<"Goulburn","110018063",3499,>>
Quote rule picked = 0
fill=false and the most number of columns found is 4
[07] Detect column types, good nrow estimate and whether first row is column names
'header' changed by user from 'auto' to false
Number of sampling jump points = 101 because (366418375 bytes from row 1 to eof) / (2 * 1457 jump0size) == 125744
Type codes (jump 000) : 1551 Quote rule 0
Type codes (jump 100) : 1A51 Quote rule 0
=====
Sampled 10027 rows (handled \n inside quoted fields) at 101 jump points
Bytes from first data row on line 12 to the end of last row: 366418143
Line length: mean=16.02 sd=0.21 min=16 max=29
Estimated number of rows: 366418143 / 16.02 = 22877178
Initial alloc = 25164895 rows (22877178 + 9%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
=====
[08] Assign column names
[09] Apply user overrides on column types
After 0 type and 0 drop user overrides : 1A51
[10] Allocate memory for the datatable
Allocating 4 column slots (4 - 0 dropped) with 25164895 rows
[11] Read the data
jumps=[0..360), chunk_size=1017828, total_size=366418143
[11] Read the data
jumps=[0..360), chunk_size=1017828, total_size=366418143
Read 22885380 rows x 4 columns from 349.4MB (366418725 bytes) file in 00:02.894 wall clock time
[12] Finalizing the datatable
Type counts:
1 : bool8 '1'
1 : int32 '5'
2 : string 'A'
=============================
0.001s ( 0%) Memory map 0.341GB file
0.003s ( 0%) sep=',' ncol=4 and header detection
0.000s ( 0%) Column type detection using 10027 sample rows
0.316s ( 11%) Allocation of 25164895 rows x 4 cols (0.469GB) of which 22885380 ( 91%) rows used
2.574s ( 89%) Reading 360 chunks of 0.971MB (63547 rows) using 12 threads
= 0.004s ( 0%) Finding first non-embedded \n after each jump
+ 0.284s ( 10%) Parse to row-major thread buffers (grown 0 times)
+ 1.450s ( 50%) Transpose
+ 0.837s ( 29%) Waiting
0.953s ( 33%) Rereading 1 columns due to out-of-sample type exceptions
2.894s Total
Column 1 ("") bumped from 'bool8' to 'string' due to <<"Goulburn">> on row 0
V1 V2 V3 V4
1: Goulburn 110018063 3499 NA
2: NA 110018064 812 NA
3: NA 110018065 2158 NA
4: NA 110019999 402 NA
5: NA 110028068 10 NA
---
22885376: NA 997999799 0 NA
22885377: NA 998999899 64 NA
22885378: NA 994999499 34 NA
22885379: NA 0&&&&&&&& 250796 NA
22885380: NA 0@@@@@@@@ 7305367 NA
Warning messages:
1: In fread("SA2-by-DJZ-2011.csv", verbose = TRUE, na.strings = "", :
Starting data input on line 12 <<"Goulburn","110018063",3499,>> with 4 fields and discarding line 11 <<"Main Statistical Area Structu>> before it because it has a different number of fields (3).
2: In fread("SA2-by-DJZ-2011.csv", verbose = TRUE, na.strings = "", :
Found the last consistent line but text exists afterwards. Consider fill=TRUE and/or blank.lines.skip=TRUE. First 200 characters of discarded line: <<"Dataset: 2011 Census of Population and Housing">>
FYI: I could not reproduce this stack imbalance error on another Windows machine running a slightly older version of RStudio.
In that case it seems like time to contact RStudio support, as suggested. I've been through the fread code again and I'm out of ideas on my side. Please tell them the two RStudio version numbers. It's not necessarily RStudio; it could be a fault on the data.table side that just happens to show up in one version of RStudio. But it is odd that it seems related to console output, which is something specific to RStudio. I searched for "RStudio stack imbalance" and many issues come up, but they are about faults in packages rather than RStudio itself. It's a hard problem to search for. Let's leave the issue open here and see what they say.
I don't think this last attempt will help, but to be safe please try it here. I think that particular stack imbalance message comes from R itself at eval.c:491, not from fread or data.table. That check_stack_balance() is called from only 5 places in R's internals:
at the end of do_internal() in names.c
twice in applyMethod() in objects.c
twice in eval() in eval.c
I don't see how fread.c could reach any of those from within its parallel section. The only entry point it calls there is REprintf, and I can't see how that would reach check_stack_balance(). All I can think of at the moment is that RStudio has a thread doing something in the background that interacts with console output, perhaps differently on Windows.
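To make the warning itself concrete: R keeps a stack of protected pointers, and the message reports the depth of that stack before and after a call. The toy model below is not R's implementation. Every name in it is hypothetical, and it only mimics the before/after comparison and the message format seen in the logs in this thread.

```c
#include <assert.h>
#include <stdio.h>

#define PP_STACK_SIZE 64

/* Toy protect stack: a stand-in for the interpreter's pointer-protection stack. */
static const void *pp_stack[PP_STACK_SIZE];
static int pp_top = 0;

static void toy_protect(const void *p)
{
    assert(pp_top < PP_STACK_SIZE);
    pp_stack[pp_top++] = p;
}

static void toy_unprotect(int n)
{
    assert(n >= 0 && n <= pp_top);
    pp_top -= n;
}

/* Run fn and compare the stack depth before and after.
 * Returns 1 when balanced, 0 when the depth changed. */
int call_with_balance_check(const char *name, void (*fn)(void))
{
    int save = pp_top;
    fn();
    if (save == pp_top) return 1;
    fprintf(stderr, "Warning: stack imbalance in '%s', %d then %d\n",
            name, save, pp_top);
    pp_top = save;  /* toy-only choice: reset so later checks start clean */
    return 0;
}

static int x, y;

/* Protects two objects and releases both: balanced. */
static void balanced_call(void)
{
    toy_protect(&x);
    toy_protect(&y);
    toy_unprotect(2);
}

/* Protects two objects but releases only one: leaves the stack one deeper. */
static void leaky_call(void)
{
    toy_protect(&x);
    toy_protect(&y);
    toy_unprotect(1);
}
```

Calling call_with_balance_check("leaky", leaky_call) prints a warning of the same shape as the ones in the logs. The check only sees the depth, so anything that changes the stack while the call is running, including code on another thread, would be reported against that call.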
Finally, to be safe: base R uses REprintf (rather than Rprintf) for its progress meters at libcurl.c:354 and internet.c:409, so using REprintf seems to be the right way. It's a shame that R's C-level progress bar isn't available in R's API (it also seems to be implemented twice in R at C level).
@mattdowle Would this help? https://github.com/r-lib/progress
@aadler Yes, thanks! That source contains this comment:
// In R Studio we should print to stdout, because printing a \r
// to stderr is buggy (reported)
But I have already removed the \r and the stack imbalance still occurs. I'm not sure where it was reported.
Did the last build not work?
Reported at https://community.rstudio.com/t/stack-imbalance-possibly-in-stderr/3009
A timely question on R-devel: [Rd] Are Rprintf and REprintf thread-safe?
Conclusion: "Rprintf and REprintf are not thread-safe."
ãšã€ã¯ïŒ
Thanks for the link, and thanks to Hugh for raising the issue with RStudio. data.table::fwrite() and data.table::fread() are aware that Rprintf and REprintf are not thread-safe, so for the progress meter they call them only from the master thread. Not only are R entry points never called from two data.table threads at the same time; only the master thread ever calls them, and that is the only R entry point called by any thread in the parallel section. However, Rprintf calls R_CheckUserInterrupt every 100 prints. I think that is the part that is unsafe even from the master thread alone. That is the reason for using REprintf, which does not call R_CheckUserInterrupt. R's internals use REprintf for progress meters, so switching to REprintf also makes sense for consistency with core R. In other words, this choice has nothing to do with stderr versus stdout as such.
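The "only the master thread prints" design described above can be sketched in C as follows. This is a minimal illustration under stated assumptions, not the code in fread.c: fprintf(stderr, ...) stands in for REprintf so that the block compiles without R headers, read_chunks and report_progress are hypothetical names, and OpenMP is assumed for the threading (without it the pragmas are ignored and the loop runs serially).

```c
#include <assert.h>
#include <stdio.h>

#ifdef _OPENMP
#include <omp.h>
#else
static int omp_get_thread_num(void) { return 0; }  /* serial fallback */
#endif

/* Counts progress lines; only ever touched by thread 0. */
static int progress_lines = 0;

/* Stand-in for REprintf (assumption: keeps the sketch free of R headers). */
static void report_progress(int pct)
{
    fprintf(stderr, "Read %d%%\n", pct);
    progress_lines++;
}

/* Process nchunks chunks in parallel. Worker threads never print;
 * only thread 0 (the master) reports progress. Returns chunks processed. */
int read_chunks(int nchunks)
{
    assert(nchunks > 0);
    int done = 0;

    #pragma omp parallel for schedule(dynamic)
    for (int i = 0; i < nchunks; i++) {
        /* ... parse chunk i into a thread-local buffer here ... */

        int snapshot;
        #pragma omp atomic capture
        snapshot = ++done;  /* atomically bump the shared counter */

        if (omp_get_thread_num() == 0) {
            report_progress(100 * snapshot / nchunks);
        }
    }
    return done;
}
```

The shared counter is updated atomically by every thread, but the print call is reached only by thread 0, so the non-thread-safe output routine is never entered by two threads at once. This says nothing about other work that might touch the runtime from the worker threads.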
@kevinushey Could you take a look at this thread and let us know if there is anything else we could try? Is it related to RStudio, and perhaps to a background thread? If RStudio has a background thread, then Rprintf / REprintf could be called from two threads at the same time. But if that were so, there would have been many more problems before now, so it seems very unlikely. Perhaps RStudio replaces the ptr_* callbacks described in R-exts, which relate to console output and interaction. However, that section starts "For Unix-alikes", so I don't know how Windows comes into it. Perhaps the threading issues in section 8.1.5 are relevant. Both are subsections of section 8, "Linking GUIs and other front-ends to R".
I'm away until mid-December, so unfortunately I won't have a chance to look until then. However, RStudio does almost everything on the main thread, using the R event loop. The only exceptions are things like project-level file indexing, and those background threads don't normally touch the R API.
RStudio does take over the various ptr_* callbacks for handling console input and output. I can't immediately think of how those could be the cause, but I'll take a closer look once I'm back.
OK, please try again here. Previously the progress was updated every 2%. In your case the file takes just under 3 seconds, so that was a new progress update to the RStudio console every 0.06 seconds. Maybe that is too much for RStudio. So this attempt prints a bar and doesn't use \r at all. That is better for reports and log files, where \r can fill up the output.
The 3-second timing is very fast, so I've reduced the threshold: the progress bar now starts from 1 second if the ETA is about 1 second or more. Otherwise the test could pass on your file simply because the progress bar was never displayed. Once testing is over, I'll increase it back to the fwrite value, i.e. start from 2 seconds if the ETA is about 2 seconds or more.
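A bar that never uses \r and only appears when enough work remains can be sketched as below. This is an illustration of the behaviour described above, not the code that was pushed: progress_update, the progress_bar struct, the 50-character width and the exact ETA formula are all assumptions made for the example, and output goes to stderr via fputc as a stand-in for REprintf.

```c
#include <assert.h>
#include <stdio.h>

#define BAR_WIDTH 50  /* assumed bar width */

typedef struct {
    int started;  /* 1 once the header line has been printed */
    int printed;  /* number of '=' characters emitted so far */
} progress_bar;

/* Hypothetical helper. Emits any '=' characters that are due and returns
 * how many were emitted by this call. The bar is started only once
 * min_secs have elapsed and the estimated time remaining is at least
 * min_secs, so fast reads print nothing at all. */
int progress_update(progress_bar *pb, double elapsed_s,
                    double fraction_done, double min_secs)
{
    assert(pb != NULL);
    assert(fraction_done >= 0.0 && fraction_done <= 1.0);

    if (!pb->started) {
        if (fraction_done <= 0.0 || elapsed_s < min_secs) return 0;
        double eta = elapsed_s * (1.0 - fraction_done) / fraction_done;
        if (eta < min_secs) return 0;  /* nearly finished: stay silent */

        /* Header line, then open the bar on the next line. */
        fputc('|', stderr);
        for (int i = 0; i < BAR_WIDTH; i++) fputc('-', stderr);
        fputs("|\n|", stderr);
        pb->started = 1;
    }

    /* Append only the new '=' characters; nothing is ever overwritten. */
    int target = (int)(fraction_done * BAR_WIDTH);
    int emitted = 0;
    while (pb->printed < target) {
        fputc('=', stderr);
        pb->printed++;
        emitted++;
    }
    if (emitted > 0 && pb->printed == BAR_WIDTH) fputs("|\n", stderr);
    return emitted;
}
```

Because the bar is append-only, at most BAR_WIDTH characters are written no matter how often the function is called, which keeps the console traffic low and leaves log files readable.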
Hi @mattdowle. My last comment in #2503 may be related to this issue.
Hooray! No warnings (after 5 runs). The first run is below (note that the leading spaces look different in the actual output):
stopifnot(!requireNamespace("data.table", quietly = TRUE))
install.packages("https://ci.appveyor.com/api/buildjobs/1o9s06o31v8i3ljr/artifacts/data.table_1.10.5.zip",
repos = NULL)
# Installing package into ‘C:/Users/hughp/Documents/R/win-library/3.4’
# (as ‘lib’ is unspecified)
# trying URL 'https://ci.appveyor.com/api/buildjobs/1o9s06o31v8i3ljr/artifacts/data.table_1.10.5.zip'
# Content type 'application/octet-stream' length 1557423 bytes (1.5 MB)
# downloaded 1.5 MB
#
# package ‘data.table’ successfully unpacked and MD5 sums checked
library(data.table)
# data.table 1.10.5 IN DEVELOPMENT built 2017-11-30 00:21:00 UTC; appveyor
# The fastest way to learn (by data.table authors): https://www.datacamp.com/courses/data-analysis-the-data-table-way
# Documentation: ?data.table, example(data.table) and browseVignettes("data.table")
# Release notes, videos and slides: http://r-datatable.com
setwd("~/ABS-data/inbox/SA2-by-DJZ-2011/")
fread("SA2-by-DJZ-2011.csv", verbose = TRUE, na.strings = "", header = FALSE)
# Input contains no \n. Taking this to be a filename to open
# [01] Check arguments
# Using 12 threads (omp_get_max_threads()=12, nth=12)
# NAstrings = [<<>>]
# None of the NAstrings look like numbers.
# show progress = 1
# 0/1 column will be read as boolean
# [02] Opening the file
# Opening file SA2-by-DJZ-2011.csv
# File opened, size = 349.4MB (366418725 bytes).
# Memory mapped ok
# [03] Detect and skip BOM
# [04] Arrange mmap to be \0 terminated
# \n has been found in the data so any mixture of line endings is allowed other than \r-only line endings. This is common and ideal.
# [05] Skipping initial rows if needed
# Positioned on line 1 starting: <<Australian Bureau of Statistic>>
# [06] Detect separator, quoting rule, and ncolumns
# Detecting sep ...
# sep=',' with 89 lines of 4 fields using quote rule 0
# Detected 4 columns on line 12. This line is either column names or first data row. Line starts as: <<"Goulburn","110018063",3499,>>
# Quote rule picked = 0
# fill=false and the most number of columns found is 4
# [07] Detect column types, good nrow estimate and whether first row is column names
# 'header' changed by user from 'auto' to false
# Number of sampling jump points = 101 because (366418375 bytes from row 1 to eof) / (2 * 1457 jump0size) == 125744
# Type codes (jump 000) : 1551 Quote rule 0
# Type codes (jump 100) : 1A51 Quote rule 0
# =====
# Sampled 10027 rows (handled \n inside quoted fields) at 101 jump points
# Bytes from first data row on line 12 to the end of last row: 366418143
# Line length: mean=16.02 sd=0.21 min=16 max=29
# Estimated number of rows: 366418143 / 16.02 = 22877178
# Initial alloc = 25164895 rows (22877178 + 9%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
# =====
# [08] Assign column names
# [09] Apply user overrides on column types
# After 0 type and 0 drop user overrides : 1A51
# [10] Allocate memory for the datatable
# Allocating 4 column slots (4 - 0 dropped) with 25164895 rows
# [11] Read the data
# jumps=[0..360), chunk_size=1017828, total_size=366418143
# jumps=[0..360), chunk_size=1017828, total_size=366418143
# |--------------------------------------------------|
# |==================================================|
# Read 22885380 rows x 4 columns from 349.4MB (366418725 bytes) file in 00:02.280 wall clock time
# [12] Finalizing the datatable
# Type counts:
# 1 : bool8 '1'
# 1 : int32 '5'
# 2 : string 'A'
# =============================
# 0.005s ( 0%) Memory map 0.341GB file
# 0.037s ( 2%) sep=',' ncol=4 and header detection
# 0.000s ( 0%) Column type detection using 10027 sample rows
# 0.321s ( 14%) Allocation of 25164895 rows x 4 cols (0.469GB) of which 22885380 ( 91%) rows used
# 1.917s ( 84%) Reading 360 chunks of 0.971MB (63547 rows) using 12 threads
# = 0.011s ( 0%) Finding first non-embedded \n after each jump
# + 0.560s ( 25%) Parse to row-major thread buffers (grown 0 times)
# + 0.488s ( 21%) Transpose
# + 0.858s ( 38%) Waiting
# 0.999s ( 44%) Rereading 1 columns due to out-of-sample type exceptions
# 2.280s Total
# Column 1 ("") bumped from 'bool8' to 'string' due to <<"Goulburn">> on row 0
# V1 V2 V3 V4
# 1: Goulburn 110018063 3499 NA
# 2: NA 110018064 812 NA
# 3: NA 110018065 2158 NA
# 4: NA 110019999 402 NA
# 5: NA 110028068 10 NA
# ---
# 22885376: NA 997999799 0 NA
# 22885377: NA 998999899 64 NA
# 22885378: NA 994999499 34 NA
# 22885379: NA 0&&&&&&&& 250796 NA
# 22885380: NA 0@@@@@@@@ 7305367 NA
# Warning messages:
# 1: In fread("SA2-by-DJZ-2011.csv", verbose = TRUE, na.strings = "", :
# Starting data input on line 12 <<"Goulburn","110018063",3499,>> with 4 fields and discarding line 11 <<"Main Statistical Area Structu>> before it because it has a different number of fields (3).
# 2: In fread("SA2-by-DJZ-2011.csv", verbose = TRUE, na.strings = "", :
# Found the last consistent line but text exists afterwards. Consider fill=TRUE and/or blank.lines.skip=TRUE. First 200 characters of discarded line: <<"Dataset: 2011 Census of Population and Housing">>
@HughParsonage Relief! I think that counts as a win. I'll tidy up, merge and move on. Thanks for testing.
@aadler Yes, I agree that your comment in issue #2503 looks exactly the same. Please test the latest from dev to confirm that it is fixed. I'm hoping the as.IDate problem you found was actually caused by the earlier stack imbalance.
No good :(
> library(data.table)
data.table 1.10.5 IN DEVELOPMENT built 2017-11-30 00:21:00 UTC; appveyor
The fastest way to learn (by data.table authors): https://www.datacamp.com/courses/data-analysis-the-data-table-way
Documentation: ?data.table, example(data.table) and browseVignettes("data.table")
Release notes, videos and slides: http://r-datatable.com
> sessionInfo()
R version 3.4.2 beta (2017-09-17 r73296)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows Server >= 2012 x64 (build 9200)
Matrix products: default
locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] data.table_1.10.5
loaded via a namespace (and not attached):
[1] compiler_3.4.2 tools_3.4.2
> DT <- fread('2017-11-22_1999_Performance.csv', header = TRUE, colClasses = CLS, select = SEL, key = c('LoanID', 'Month'), verbose = TRUE)
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
Using 40 threads (omp_get_max_threads()=40, nth=40)
NAstrings = [<<NA>>]
None of the NAstrings look like numbers.
show progress = 1
0/1 column will be read as boolean
[02] Opening the file
Opening file 2017-11-22_1999_Performance.csv
File opened, size = 6.355GB (6823372783 bytes).
Memory mapped ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
\n has been found in the data so any mixture of line endings is allowed other than \r-only line endings. This is common and ideal.
[05] Skipping initial rows if needed
Positioned on line 1 starting: <<Orig_Year,Orig_Qtr,LoanID,Mont>>
[06] Detect separator, quoting rule, and ncolumns
Detecting sep ...
sep=',' with 100 lines of 37 fields using quote rule 0
Detected 37 columns on line 1. This line is either column names or first data row. Line starts as: <<Orig_Year,Orig_Qtr,LoanID,Mont>>
Quote rule picked = 0
fill=false and the most number of columns found is 37
[07] Detect column types, good nrow estimate and whether first row is column names
'header' changed by user from 'auto' to true
Number of sampling jump points = 101 because (6823372781 bytes from row 1 to eof) / (2 * 12905 jump0size) == 264369
Type codes (jump 000) : 51AA7155A15A7111111111111771111177715 Quote rule 0
Type codes (jump 001) : 51AA7155A15A711111111111177111117771A Quote rule 0
Type codes (jump 008) : 51AA7555A15A711111111111177111117771A Quote rule 0
Type codes (jump 009) : 51AA7555AA5A71A155555557177111117775A Quote rule 0
Type codes (jump 042) : 55AA7555AA5A71A155555557177111117775A Quote rule 0
Type codes (jump 064) : 55AA7555AA5A71A1A5555557177111117775A Quote rule 0
Type codes (jump 100) : 55AA7555AA5A71A1A5555557177111117775A Quote rule 0
=====
Sampled 10049 rows (handled \n inside quoted fields) at 101 jump points
Bytes from first data row on line 1 to the end of last row: 6823372781
Line length: mean=126.15 sd=8.30 min=100 max=359
Estimated number of rows: 6823372781 / 126.15 = 54088821
Initial alloc = 62279495 rows (54088821 + 15%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
=====
[08] Assign column names
[09] Apply user overrides on column types
After 4 type and 23 drop user overrides : 00AA700000000000000000070775555077750
[10] Allocate memory for the datatable
Allocating 14 column slots (37 - 23 dropped) with 62279495 rows
[11] Read the data
jumps=[0..6520), chunk_size=1046529, total_size=6823372422
|--------------------------------------------------|
|=======Warning: stack imbalance in '$', 27 then 26
===Warning: stack imbalance in '$', 26 then 27
================Error in fread("2017-11-22_1999_Performance.csv", header = TRUE, colClasses = CLS, :
unprotect_ptr: pointer not found
@aadler Thanks for this report. I've localized an incorrect protect in freadR. In your case you are overriding column types, and there were quite a few protects in that part of the code, so there is perhaps a 30% chance that this works. Please use this build.
@aadler If you haven't tried the last build yet, please go straight to this one. Alternatively, if I could get a copy of the file, I could try it myself in RStudio on Windows.
:(
> library(data.table)
data.table 1.10.5 IN DEVELOPMENT built 2017-12-01 01:54:04 UTC; appveyor
The fastest way to learn (by data.table authors): https://www.datacamp.com/courses/data-analysis-the-data-table-way
Documentation: ?data.table, example(data.table) and browseVignettes("data.table")
Release notes, videos and slides: http://r-datatable.com
> sessionInfo()
R version 3.4.2 beta (2017-09-17 r73296)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows Server >= 2012 x64 (build 9200)
Matrix products: default
locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] data.table_1.10.5
loaded via a namespace (and not attached):
[1] compiler_3.4.2 tools_3.4.2
> ColCLASS <- c(rep('integer', 2L), 'character', 'Date', 'numeric',
+ rep('integer', 3L), rep('character', 2L),
+ 'integer', 'Date', rep('numeric', 2L), 'Date',
+ rep('numeric', 12L), rep('integer', 5),
+ rep('numeric', 3L), 'integer', 'character')
> SELCOL <- c(WHATEVER)
> DT <- fread('LargeFile.csv', header = TRUE, colClasses = ColCLASS, select = SELCOL, key = c('LoanID', 'Month'), verbose = TRUE)
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
Using 40 threads (omp_get_max_threads()=40, nth=40)
NAstrings = [<<NA>>]
None of the NAstrings look like numbers.
show progress = 1
0/1 column will be read as boolean
[02] Opening the file
Opening file LargeFile.csv
File opened, size = 6.355GB (6823372783 bytes).
Memory mapped ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
\n has been found in the data so any mixture of line endings is allowed other than \r-only line endings. This is common and ideal.
[05] Skipping initial rows if needed
Positioned on line 1 starting: <<Orig_Year,Orig_Qtr,LoanID,Mont>>
[06] Detect separator, quoting rule, and ncolumns
Detecting sep ...
sep=',' with 100 lines of 37 fields using quote rule 0
Detected 37 columns on line 1. This line is either column names or first data row. Line starts as: <<Orig_Year,Orig_Qtr,LoanID,Mont>>
Quote rule picked = 0
fill=false and the most number of columns found is 37
[07] Detect column types, good nrow estimate and whether first row is column names
'header' changed by user from 'auto' to true
Number of sampling jump points = 101 because (6823372781 bytes from row 1 to eof) / (2 * 12905 jump0size) == 264369
Type codes (jump 000) : 51AA7155A15A7111111111111771111177715 Quote rule 0
Type codes (jump 001) : 51AA7155A15A711111111111177111117771A Quote rule 0
Type codes (jump 008) : 51AA7555A15A711111111111177111117771A Quote rule 0
Type codes (jump 009) : 51AA7555AA5A71A155555557177111117775A Quote rule 0
Type codes (jump 042) : 55AA7555AA5A71A155555557177111117775A Quote rule 0
Type codes (jump 064) : 55AA7555AA5A71A1A5555557177111117775A Quote rule 0
Type codes (jump 100) : 55AA7555AA5A71A1A5555557177111117775A Quote rule 0
=====
Sampled 10049 rows (handled \n inside quoted fields) at 101 jump points
Bytes from first data row on line 1 to the end of last row: 6823372781
Line length: mean=126.15 sd=8.30 min=100 max=359
Estimated number of rows: 6823372781 / 126.15 = 54088821
Initial alloc = 62279495 rows (54088821 + 15%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
=====
[08] Assign column names
[09] Apply user overrides on column types
After 4 type and 23 drop user overrides : 00AA700000000000000000070775555077750
[10] Allocate memory for the datatable
Allocating 14 column slots (37 - 23 dropped) with 62279495 rows
[11] Read the data
jumps=[0..6520), chunk_size=1046529, total_size=6823372422
|--------------------------------------------------|
|Error in fread("LargeFile.csv", header = TRUE, colClasses = ColCLASS, :
unprotect_ptr: pointer not found
Thanks to @aadler by email, I can now reproduce it. R 3.4.2, the latest RStudio 1.1.383, and Windows 10 Pro 10.0.16299 build 16299.
I see strange behaviour in RStudio, recorded here:
https://www.youtube.com/watch?v=tl2x2vmZxMU
RStudio seems to be generating GCs just from typing. Why is that, and in any case, is there a way to turn it off? While fread() is printing the progress bar, could RStudio's separate event loop think the output to the console is user input and call into R, triggering a GC and messing everything up? Perhaps some RStudio users can point me in the right direction, or perhaps @kevinushey is back (I know Kevin said early December, and today is the 1st).
I can reliably reproduce the stack imbalance in the RStudio console. Using the RStudio Terminal tab I can't reproduce it at all, even with gcinfo(TRUE). Interestingly, GCs do occur while the progress bar is being printed, but that seems fine, since it is fine on Linux too. Given the behaviour of the RStudio Console in that video, I'm coming to the conclusion that this is a bug in the RStudio Console. I couldn't copy text from the RStudio terminal window (Edit -> Copy doesn't work and neither does Ctrl-C), so I took a screenshot of the Terminal tab to show that GCs during the progress bar cause no problem. Only the master thread calls REprintf and the other threads don't call the R API at all, so I don't think there is a problem there.
Works fine in the RStudio terminal:
Note that GCs occur while the progress bar is first printing, and it works fine in the RStudio terminal. This test file has an out-of-sample type exception, which triggers an automatic reread of just that column, so the progress bar is printed twice.
But in the RStudio console, either stack imbalance or unprotect_ptr: pointer not found occurs:
R version 3.4.2 (2017-09-28) -- "Short Summer"
> gcinfo(TRUE)
[1] FALSE
Garbage collection 22 = 16+3+3 (level 0) ...
25.5 Mbytes of cons cells used (80%)
6.6 Mbytes of vectors used (66%)
Garbage collection 23 = 16+4+3 (level 1) ...
24.9 Mbytes of cons cells used (79%)
6.5 Mbytes of vectors used (64%)
Garbage collection 24 = 17+4+3 (level 0) ...
25.1 Mbytes of cons cells used (79%)
6.5 Mbytes of vectors used (64%)
Garbage collection 25 = 18+4+3 (level 0) ...
25.1 Mbytes of cons cells used (79%)
6.5 Mbytes of vectors used (64%)
Garbage collection 26 = 19+4+3 (level 0) ...
25.1 Mbytes of cons cells used (79%)
6.5 Mbytes of vectors used (65%)
Garbage collection 27 = 20+4+3 (level 0) ...
25.2 Mbytes of cons cells used (80%)
6.5 Mbytes of vectors used (65%)
Garbage collection 28 = 20+5+3 (level 1) ...
25.1 Mbytes of cons cells used (79%)
6.5 Mbytes of vectors used (65%)
Garbage collection 29 = 21+5+3 (level 0) ...
25.1 Mbytes of cons cells used (79%)
6.5 Mbytes of vectors used (65%)
Garbage collection 30 = 22+5+3 (level 0) ...
25.2 Mbytes of cons cells used (80%)
6.5 Mbytes of vectors used (65%)
Garbage collection 31 = 23+5+3 (level 0) ...
25.2 Mbytes of cons cells used (80%)
6.6 Mbytes of vectors used (66%)
Garbage collection 32 = 24+5+3 (level 0) ...
25.3 Mbytes of cons cells used (80%)
6.6 Mbytes of vectors used (66%)
Garbage collection 33 = 25+5+3 (level 0) ...
25.4 Mbytes of cons cells used (80%)
6.7 Mbytes of vectors used (66%)
Garbage collection 34 = 25+5+4 (level 2) ...
24.6 Mbytes of cons cells used (61%)
6.4 Mbytes of vectors used (50%)
Garbage collection 35 = 26+5+4 (level 0) ...
25.0 Mbytes of cons cells used (62%)
6.5 Mbytes of vectors used (52%)
> require(data.table)
Loading required package: data.table
Garbage collection 36 = 27+5+4 (level 0) ...
27.2 Mbytes of cons cells used (68%)
7.1 Mbytes of vectors used (56%)
data.table 1.10.5 IN DEVELOPMENT built 2017-12-02 01:04:34 UTC; appveyor
The fastest way to learn (by data.table authors): https://www.datacamp.com/courses/data-analysis-the-data-table-way
Documentation: ?data.table, example(data.table) and browseVignettes("data.table")
Release notes, videos and slides: http://r-datatable.com
Garbage collection 37 = 28+5+4 (level 0) ...
27.7 Mbytes of cons cells used (69%)
7.3 Mbytes of vectors used (58%)
Garbage collection 38 = 29+5+4 (level 0) ...
28.0 Mbytes of cons cells used (70%)
7.4 Mbytes of vectors used (58%)
Garbage collection 39 = 30+5+4 (level 0) ...
28.1 Mbytes of cons cells used (70%)
7.4 Mbytes of vectors used (59%)
Garbage collection 40 = 31+5+4 (level 0) ...
28.2 Mbytes of cons cells used (70%)
7.5 Mbytes of vectors used (59%)
Garbage collection 41 = 32+5+4 (level 0) ...
28.4 Mbytes of cons cells used (71%)
7.5 Mbytes of vectors used (59%)
> DT = fread("/Users/pasha/Downloads/LargeFile.csv")
Garbage collection 42 = 32+5+5 (level 2) ...
27.4 Mbytes of cons cells used (54%)
7.1 Mbytes of vectors used (2%)
Garbage collection 43 = 32+5+6 (level 2) ...
27.4 Mbytes of cons cells used (54%)
244.7 Mbytes of vectors used (42%)
Garbage collection 44 = 32+5+7 (level 2) ...
27.4 Mbytes of cons cells used (54%)
482.3 Mbytes of vectors used (42%)
Garbage collection 45 = 32+5+8 (level 2) ...
27.4 Mbytes of cons cells used (54%)
957.4 Mbytes of vectors used (56%)
Garbage collection 46 = 32+5+9 (level 2) ...
27.4 Mbytes of cons cells used (54%)
1432.6 Mbytes of vectors used (63%)
Garbage collection 47 = 32+5+10 (level 2) ...
27.4 Mbytes of cons cells used (54%)
2145.3 Mbytes of vectors used (75%)
Garbage collection 48 = 32+5+11 (level 2) ...
27.4 Mbytes of cons cells used (54%)
2620.4 Mbytes of vectors used (71%)
Garbage collection 49 = 32+5+12 (level 2) ...
27.4 Mbytes of cons cells used (54%)
3570.8 Mbytes of vectors used (78%)
Garbage collection 50 = 32+5+13 (level 2) ...
27.4 Mbytes of cons cells used (54%)
4283.5 Mbytes of vectors used (75%)
Garbage collection 51 = 32+5+14 (level 2) ...
27.4 Mbytes of cons cells used (54%)
5709.0 Mbytes of vectors used (77%)
Garbage collection 52 = 32+5+15 (level 2) ...
27.4 Mbytes of cons cells used (54%)
7372.0 Mbytes of vectors used (81%)
Garbage collection 53 = 32+5+16 (level 2) ...
27.4 Mbytes of cons cells used (54%)
8797.5 Mbytes of vectors used (79%)
Garbage collection 54 = 32+5+17 (level 2) ...
27.4 Mbytes of cons cells used (54%)
10935.7 Mbytes of vectors used (80%)
|--------------------------------------------------|
|=====Error in fread("LargeFile.csv") :
unprotect_ptr: pointer not found
>
showProgress=FALSE reliably fixes it in the RStudio console. To reproduce, you have to use showProgress=TRUE (i.e. the default) and it has to be the first run in a fresh RStudio console. It seems to depend on whether a GC happens during the progress meter. It comes down to the first run in a fresh session. The file needs to be large enough for the progress meter to be displayed. It has nothing to do with rereading or with the arguments passed to fread. If the first run in a fresh RStudio console is done with showProgress=FALSE and works, that run grows R's heap, and subsequent runs with showProgress=TRUE in the same session then work. But only because the first run has already grown the heap, so no GC happens during the progress meter.
Why is a GC on the master thread during the progress meter fine on Linux and in the Windows RStudio terminal, but not in the RStudio console?
OK, I think it is fixed now. The problem was on the data.table side, not in RStudio. It now works reliably for me in the RStudio console on Windows. This was a problem that could have happened on Linux and Mac too; the memory patterns there just never triggered it. The other threads do have an entry point into R (when pushing buffers for string columns), and that could happen at the same time as the master thread was printing progress using REprintf. That is why it only occurred on the first run in a fresh session. On the second and subsequent runs, all the strings in the file had been seen before, so they were lookups in the cache (thread-safe) rather than allocations (not thread-safe).
ã ããã @ aadlerãš@HughParsonage ããããè©ŠããŠã¿
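Since the explanation above is about several threads entering R at once, one way to probe a crash of this kind might be to rerun the read with data.table restricted to a single thread. This is only a diagnostic sketch and not something reported as tried in this thread; the temporary file is a hypothetical stand-in, and the assumption that a single-threaded run sidesteps the race is mine.

```r
library(data.table)

# Hypothetical small stand-in file with one string column.
tmp <- tempfile(fileext = ".csv")
fwrite(data.table(id = 1:1000, name = sprintf("row%04d", 1:1000)), tmp)

old_threads <- getDTthreads()  # remember the current thread setting
setDTthreads(1L)               # assumption: with one thread, no other thread
                               # can enter R while progress is being printed
DT <- fread(tmp)
setDTthreads(old_threads)      # restore the previous setting

unlink(tmp)
```

If a crash disappears with one thread and returns with many, that points toward a threading problem and away from the file contents.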
No warnings, but I'm not sure whether you're looking for anything else.
> gcinfo(TRUE)
[1] FALSE
> fread("SA2-by-DJZ-2011.csv", verbose = TRUE, na.strings = "", header = FALSE)
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
Using 12 threads (omp_get_max_threads()=12, nth=12)
NAstrings = [<<>>]
None of the NAstrings look like numbers.
show progress = 1
0/1 column will be read as boolean
[02] Opening the file
Opening file SA2-by-DJZ-2011.csv
File opened, size = 349.4MB (366418725 bytes).
Memory mapped ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
\n has been found in the data so any mixture of line endings is allowed other than \r-only line endings. This is common and ideal.
[05] Skipping initial rows if needed
Positioned on line 1 starting: <<Australian Bureau of Statistic>>
[06] Detect separator, quoting rule, and ncolumns
Detecting sep ...
sep=',' with 89 lines of 4 fields using quote rule 0
Detected 4 columns on line 12. This line is either column names or first data row. Line starts as: <<"Goulburn","110018063",3499,>>
Quote rule picked = 0
fill=false and the most number of columns found is 4
[07] Detect column types, good nrow estimate and whether first row is column names
'header' changed by user from 'auto' to false
Number of sampling jump points = 101 because (366418375 bytes from row 1 to eof) / (2 * 1457 jump0size) == 125744
Type codes (jump 000) : 1551 Quote rule 0
Type codes (jump 100) : 1A51 Quote rule 0
=====
Sampled 10027 rows (handled \n inside quoted fields) at 101 jump points
Bytes from first data row on line 12 to the end of last row: 366418143
Line length: mean=16.02 sd=0.21 min=16 max=29
Estimated number of rows: 366418143 / 16.02 = 22877178
Initial alloc = 25164895 rows (22877178 + 9%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
=====
[08] Assign column names
[09] Apply user overrides on column types
After 0 type and 0 drop user overrides : 1A51
[10] Allocate memory for the datatable
Allocating 4 column slots (4 - 0 dropped) with 25164895 rows
Garbage collection 53 = 36+5+12 (level 2) ...
30.3 Mbytes of cons cells used (60%)
7.9 Mbytes of vectors used (1%)
[11] Read the data
jumps=[0..360), chunk_size=1017828, total_size=366418143
|--------------------------------------------------|
|==================================================|
Garbage collection 54 = 37+5+12 (level 0) ...
30.8 Mbytes of cons cells used (61%)
566.6 Mbytes of vectors used (74%)
Garbage collection 55 = 37+6+12 (level 1) ...
30.8 Mbytes of cons cells used (61%)
549.2 Mbytes of vectors used (72%)
jumps=[0..360), chunk_size=1017828, total_size=366418143
|--------------------------------------------------|
|==================================================|
Read 22885380 rows x 4 columns from 349.4MB (366418725 bytes) file in 00:02.626 wall clock time
[12] Finalizing the datatable
Type counts:
1 : bool8 '1'
1 : int32 '5'
2 : string 'A'
=============================
0.002s ( 0%) Memory map 0.341GB file
0.005s ( 0%) sep=',' ncol=4 and header detection
0.000s ( 0%) Column type detection using 10027 sample rows
0.469s ( 18%) Allocation of 25164895 rows x 4 cols (0.469GB) of which 22885380 ( 91%) rows used
2.150s ( 82%) Reading 360 chunks of 0.971MB (63547 rows) using 12 threads
= 0.103s ( 4%) Finding first non-embedded \n after each jump
+ 0.230s ( 9%) Parse to row-major thread buffers (grown 0 times)
+ 0.718s ( 27%) Transpose
+ 1.099s ( 42%) Waiting
0.745s ( 28%) Rereading 1 columns due to out-of-sample type exceptions
2.626s Total
Column 1 ("") bumped from 'bool8' to 'string' due to <<"Goulburn">> on row 0
Garbage collection 56 = 37+6+13 (level 2) ...
31.1 Mbytes of cons cells used (62%)
531.9 Mbytes of vectors used (70%)
Garbage collection 57 = 38+6+13 (level 0) ...
31.1 Mbytes of cons cells used (62%)
532.0 Mbytes of vectors used (70%)
V1 V2 V3 V4
1: Goulburn 110018063 3499 NA
2: NA 110018064 812 NA
3: NA 110018065 2158 NA
4: NA 110019999 402 NA
5: NA 110028068 10 NA
---
22885376: NA 997999799 0 NA
22885377: NA 998999899 64 NA
22885378: NA 994999499 34 NA
22885379: NA 0&&&&&&&& 250796 NA
22885380: NA 0@@@@@@@@ 7305367 NA
Warning messages:
1: In fread("SA2-by-DJZ-2011.csv", verbose = TRUE, na.strings = "", :
Starting data input on line 12 <<"Goulburn","110018063",3499,>> with 4 fields and discarding line 11 <<"Main Statistical Area Structu>> before it because it has a different number of fields (3).
2: In fread("SA2-by-DJZ-2011.csv", verbose = TRUE, na.strings = "", :
Found the last consistent line but text exists afterwards. Consider fill=TRUE and/or blank.lines.skip=TRUE. First 200 characters of discarded line: <<"Dataset: 2011 Census of Population and Housing">>
Thanks, Hugh. Assuming that was in a fresh RStudio console session, that is a clean run: no stack imbalance, no "unprotect_ptr: pointer not found" message, and the progress meter runs correctly (twice in this case, because of the reread). Just waiting for @aadler to confirm.
Success.
First run, fresh instance of RStudio.
> library(data.table)
data.table 1.10.5 IN DEVELOPMENT built 2017-12-02 12:05:42 UTC; appveyor
The fastest way to learn (by data.table authors): https://www.datacamp.com/courses/data-analysis-the-data-table-way
Documentation: ?data.table, example(data.table) and browseVignettes("data.table")
Release notes, videos and slides: http://r-datatable.com
> sessionInfo()
R version 3.4.2 beta (2017-09-17 r73296)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows Server >= 2012 x64 (build 9200)
Matrix products: default
locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] data.table_1.10.5
loaded via a namespace (and not attached):
[1] compiler_3.4.2 tools_3.4.2
> DT <- fread('LargeFile.csv', colClasses = colCLASS, select = colSEL, header = TRUE, key = c('LoanID', 'Month'), verbose = TRUE)
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
Using 40 threads (omp_get_max_threads()=40, nth=40)
NAstrings = [<<NA>>]
None of the NAstrings look like numbers.
show progress = 1
0/1 column will be read as boolean
[02] Opening the file
Opening file LargeFile.csv
File opened, size = 6.355GB (6823372783 bytes).
Memory mapped ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
\n has been found in the data so any mixture of line endings is allowed other than \r-only line endings. This is common and ideal.
[05] Skipping initial rows if needed
Positioned on line 1 starting: <<Orig_Year,Orig_Qtr,LoanID,Mont>>
[06] Detect separator, quoting rule, and ncolumns
Detecting sep ...
sep=',' with 100 lines of 37 fields using quote rule 0
Detected 37 columns on line 1. This line is either column names or first data row. Line starts as: <<Orig_Year,Orig_Qtr,LoanID,Mont>>
Quote rule picked = 0
fill=false and the most number of columns found is 37
[07] Detect column types, good nrow estimate and whether first row is column names
'header' changed by user from 'auto' to true
Number of sampling jump points = 101 because (6823372781 bytes from row 1 to eof) / (2 * 12905 jump0size) == 264369
Type codes (jump 000) : 51AA7155A15A7111111111111771111177715 Quote rule 0
Type codes (jump 001) : 51AA7155A15A711111111111177111117771A Quote rule 0
Type codes (jump 008) : 51AA7555A15A711111111111177111117771A Quote rule 0
Type codes (jump 009) : 51AA7555AA5A71A155555557177111117775A Quote rule 0
Type codes (jump 042) : 55AA7555AA5A71A155555557177111117775A Quote rule 0
Type codes (jump 064) : 55AA7555AA5A71A1A5555557177111117775A Quote rule 0
Type codes (jump 100) : 55AA7555AA5A71A1A5555557177111117775A Quote rule 0
=====
Sampled 10049 rows (handled \n inside quoted fields) at 101 jump points
Bytes from first data row on line 1 to the end of last row: 6823372781
Line length: mean=126.15 sd=8.30 min=100 max=359
Estimated number of rows: 6823372781 / 126.15 = 54088821
Initial alloc = 62279495 rows (54088821 + 15%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
=====
[08] Assign column names
[09] Apply user overrides on column types
After 4 type and 23 drop user overrides : 00AA700000000000000000070775555077750
[10] Allocate memory for the datatable
Allocating 14 column slots (37 - 23 dropped) with 62279495 rows
[11] Read the data
jumps=[0..6520), chunk_size=1046529, total_size=6823372422
|--------------------------------------------------|
|==================================================|
Read 53945186 rows x 14 columns from 6.355GB (6823372783 bytes) file in 00:25.938 wall clock time
[12] Finalizing the datatable
Type counts:
23 : drop '0'
5 : int32 '5'
7 : float64 '7'
2 : string 'A'
=============================
0.005s ( 0%) Memory map 6.355GB file
0.025s ( 0%) sep=',' ncol=37 and header detection
0.001s ( 0%) Column type detection using 10049 sample rows
4.681s ( 18%) Allocation of 62279495 rows x 37 cols (5.336GB) of which 53945186 ( 87%) rows used
21.226s ( 82%) Reading 6520 chunks of 0.998MB (8295 rows) using 40 threads
= 0.485s ( 2%) Finding first non-embedded \n after each jump
+ 1.465s ( 6%) Parse to row-major thread buffers (grown 0 times)
+ 9.095s ( 35%) Transpose
+ 10.181s ( 39%) Waiting
0.000s ( 0%) Rereading 0 columns due to out-of-sample type exceptions
25.938s Total
Closed and reopened RStudio so that the string cache would not be active, and ran again with gcinfo(TRUE). The key was added and the conversion to IDate completed (although it took over 40 seconds :)).
> colCLASS <- c(rep('integer', 2L), 'character', 'Date', 'numeric',
+ rep('integer', 3L), rep('character', 2L),
+ 'integer', 'Date', rep('numeric', 2L), 'Date',
+ rep('numeric', 12L), rep('integer', 5),
+ rep('numeric', 3L), 'integer', 'character')
> library(data.table)
data.table 1.10.5 IN DEVELOPMENT built 2017-12-02 12:05:42 UTC; appveyor
The fastest way to learn (by data.table authors): https://www.datacamp.com/courses/data-analysis-the-data-table-way
Documentation: ?data.table, example(data.table) and browseVignettes("data.table")
Release notes, videos and slides: http://r-datatable.com
> gcinfo(TRUE)
[1] FALSE
> sessionInfo()
R version 3.4.2 beta (2017-09-17 r73296)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows Server >= 2012 x64 (build 9200)
Matrix products: default
locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] data.table_1.10.5
loaded via a namespace (and not attached):
[1] compiler_3.4.2 tools_3.4.2
Garbage collection 46 = 36+5+5 (level 0) ...
38.6 Mbytes of cons cells used (77%)
11.2 Mbytes of vectors used (71%)
Garbage collection 47 = 37+5+5 (level 0) ...
38.7 Mbytes of cons cells used (77%)
11.2 Mbytes of vectors used (71%)
Garbage collection 48 = 38+5+5 (level 0) ...
38.8 Mbytes of cons cells used (77%)
11.2 Mbytes of vectors used (71%)
Garbage collection 49 = 39+5+5 (level 0) ...
39.0 Mbytes of cons cells used (78%)
11.2 Mbytes of vectors used (71%)
Garbage collection 50 = 40+5+5 (level 0) ...
39.1 Mbytes of cons cells used (78%)
11.3 Mbytes of vectors used (71%)
Garbage collection 51 = 40+6+5 (level 1) ...
38.8 Mbytes of cons cells used (77%)
11.3 Mbytes of vectors used (71%)
Garbage collection 52 = 41+6+5 (level 0) ...
38.9 Mbytes of cons cells used (77%)
11.3 Mbytes of vectors used (71%)
Garbage collection 53 = 42+6+5 (level 0) ...
41.5 Mbytes of cons cells used (83%)
12.2 Mbytes of vectors used (77%)
Garbage collection 54 = 42+7+5 (level 1) ...
43.4 Mbytes of cons cells used (86%)
12.8 Mbytes of vectors used (81%)
Garbage collection 55 = 42+7+6 (level 2) ...
44.7 Mbytes of cons cells used (72%)
13.0 Mbytes of vectors used (67%)
Garbage collection 56 = 43+7+6 (level 0) ...
46.5 Mbytes of cons cells used (74%)
13.6 Mbytes of vectors used (70%)
Garbage collection 57 = 44+7+6 (level 0) ...
47.0 Mbytes of cons cells used (75%)
13.8 Mbytes of vectors used (71%)
Garbage collection 58 = 45+7+6 (level 0) ...
47.4 Mbytes of cons cells used (76%)
13.9 Mbytes of vectors used (71%)
Garbage collection 59 = 46+7+6 (level 0) ...
47.7 Mbytes of cons cells used (76%)
14.2 Mbytes of vectors used (73%)
Garbage collection 60 = 47+7+6 (level 0) ...
48.0 Mbytes of cons cells used (77%)
14.2 Mbytes of vectors used (73%)
Garbage collection 61 = 48+7+6 (level 0) ...
48.1 Mbytes of cons cells used (77%)
14.3 Mbytes of vectors used (73%)
> DT <- fread('LargeFile.csv', header = TRUE, colClasses = colCLASS, select = colSEL, key = c('LoanID', 'Month'), verbose = TRUE)
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
Using 40 threads (omp_get_max_threads()=40, nth=40)
NAstrings = [<<NA>>]
None of the NAstrings look like numbers.
show progress = 1
0/1 column will be read as boolean
[02] Opening the file
Opening file LargeFile.csv
File opened, size = 6.355GB (6823372783 bytes).
Memory mapped ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
\n has been found in the data so any mixture of line endings is allowed other than \r-only line endings. This is common and ideal.
[05] Skipping initial rows if needed
Positioned on line 1 starting: <<Orig_Year,Orig_Qtr,LoanID,Mont>>
[06] Detect separator, quoting rule, and ncolumns
Detecting sep ...
sep=',' with 100 lines of 37 fields using quote rule 0
Detected 37 columns on line 1. This line is either column names or first data row. Line starts as: <<Orig_Year,Orig_Qtr,LoanID,Mont>>
Quote rule picked = 0
fill=false and the most number of columns found is 37
[07] Detect column types, good nrow estimate and whether first row is column names
'header' changed by user from 'auto' to true
Number of sampling jump points = 101 because (6823372781 bytes from row 1 to eof) / (2 * 12905 jump0size) == 264369
Type codes (jump 000) : 51AA7155A15A7111111111111771111177715 Quote rule 0
Type codes (jump 001) : 51AA7155A15A711111111111177111117771A Quote rule 0
Type codes (jump 008) : 51AA7555A15A711111111111177111117771A Quote rule 0
Type codes (jump 009) : 51AA7555AA5A71A155555557177111117775A Quote rule 0
Type codes (jump 042) : 55AA7555AA5A71A155555557177111117775A Quote rule 0
Type codes (jump 064) : 55AA7555AA5A71A1A5555557177111117775A Quote rule 0
Type codes (jump 100) : 55AA7555AA5A71A1A5555557177111117775A Quote rule 0
=====
Sampled 10049 rows (handled \n inside quoted fields) at 101 jump points
Bytes from first data row on line 1 to the end of last row: 6823372781
Line length: mean=126.15 sd=8.30 min=100 max=359
Estimated number of rows: 6823372781 / 126.15 = 54088821
Initial alloc = 62279495 rows (54088821 + 15%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
=====
[08] Assign column names
[09] Apply user overrides on column types
After 4 type and 23 drop user overrides : 00AA700000000000000000070775555077750
[10] Allocate memory for the datatable
Allocating 14 column slots (37 - 23 dropped) with 62279495 rows
Garbage collection 62 = 48+7+7 (level 2) ...
46.5 Mbytes of cons cells used (60%)
13.6 Mbytes of vectors used (2%)
Garbage collection 63 = 48+7+8 (level 2) ...
46.5 Mbytes of cons cells used (60%)
488.7 Mbytes of vectors used (42%)
Garbage collection 64 = 48+7+9 (level 2) ...
46.5 Mbytes of cons cells used (60%)
963.9 Mbytes of vectors used (56%)
Garbage collection 65 = 48+7+10 (level 2) ...
46.5 Mbytes of cons cells used (60%)
1439.1 Mbytes of vectors used (63%)
Garbage collection 66 = 48+7+11 (level 2) ...
46.5 Mbytes of cons cells used (60%)
1914.2 Mbytes of vectors used (67%)
Garbage collection 67 = 48+7+12 (level 2) ...
46.5 Mbytes of cons cells used (60%)
2864.5 Mbytes of vectors used (77%)
Garbage collection 68 = 48+7+13 (level 2) ...
46.5 Mbytes of cons cells used (60%)
3577.3 Mbytes of vectors used (78%)
Garbage collection 69 = 48+7+14 (level 2) ...
46.5 Mbytes of cons cells used (60%)
4290.0 Mbytes of vectors used (75%)
[11] Read the data
jumps=[0..6520), chunk_size=1046529, total_size=6823372422
|--------------------------------------------------|
|============================Garbage collection 70 = 49+7+14 (level 0) ...
76.5 Mbytes of cons cells used (99%)
5487.5 Mbytes of vectors used (96%)
=Garbage collection 71 = 49+8+14 (level 1) ...
77.0 Mbytes of cons cells used (100%)
5487.6 Mbytes of vectors used (96%)
Garbage collection 72 = 49+8+15 (level 2) ...
77.0 Mbytes of cons cells used (81%)
5487.1 Mbytes of vectors used (80%)
==============Garbage collection 73 = 50+8+15 (level 0) ...
94.3 Mbytes of cons cells used (100%)
5494.0 Mbytes of vectors used (80%)
Garbage collection 74 = 50+9+15 (level 1) ...
94.5 Mbytes of cons cells used (100%)
5494.1 Mbytes of vectors used (80%)
Garbage collection 75 = 50+9+16 (level 2) ...
94.5 Mbytes of cons cells used (82%)
5493.1 Mbytes of vectors used (67%)
=======|
Read 53945186 rows x 14 columns from 6.355GB (6823372783 bytes) file in 00:24.772 wall clock time
[12] Finalizing the datatable
Type counts:
23 : drop '0'
5 : int32 '5'
7 : float64 '7'
2 : string 'A'
=============================
0.005s ( 0%) Memory map 6.355GB file
0.018s ( 0%) sep=',' ncol=37 and header detection
0.000s ( 0%) Column type detection using 10049 sample rows
5.496s ( 22%) Allocation of 62279495 rows x 37 cols (5.336GB) of which 53945186 ( 87%) rows used
19.253s ( 78%) Reading 6520 chunks of 0.998MB (8295 rows) using 40 threads
= 0.433s ( 2%) Finding first non-embedded \n after each jump
+ 1.482s ( 6%) Parse to row-major thread buffers (grown 0 times)
+ 9.515s ( 38%) Transpose
+ 7.822s ( 32%) Waiting
0.000s ( 0%) Rereading 0 columns due to out-of-sample type exceptions
24.772s Total
Garbage collection 76 = 51+9+16 (level 0) ...
105.3 Mbytes of cons cells used (91%)
5500.3 Mbytes of vectors used (67%)
Garbage collection 77 = 51+10+16 (level 1) ...
105.4 Mbytes of cons cells used (91%)
5500.2 Mbytes of vectors used (67%)
> DT[, Month := as.IDate(Month, format = "%Y-%m-%d")]
Garbage collection 78 = 51+10+17 (level 2) ...
107.5 Mbytes of cons cells used (76%)
8174.1 Mbytes of vectors used (81%)
Garbage collection 79 = 51+11+17 (level 1) ...
107.5 Mbytes of cons cells used (76%)
5910.4 Mbytes of vectors used (59%)
> gcinfo(FALSE)
[1] TRUE
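The steps in the transcript above (GC reporting switched on, a key on LoanID and Month, Month converted to IDate) can be sketched on toy data as follows. Only the column names come from the transcript; the values are made up. This sketch converts Month before setting the key so that the key is built on the final column type, which is a choice of the sketch and not the order used above.

```r
library(data.table)

# Hypothetical toy data standing in for LargeFile.csv.
DT <- data.table(
  LoanID = c("B", "A", "A"),
  Month  = c("2017-02-01", "2017-03-01", "2017-01-01")
)

old <- gcinfo(TRUE)   # report garbage collections; returns the previous setting

# Convert the character column to IDate by reference.
DT[, Month := as.IDate(Month, format = "%Y-%m-%d")]

# Sort by LoanID, then Month, and mark those columns as the key.
setkey(DT, LoanID, Month)

gcinfo(old)           # restore the previous GC reporting setting
print(DT)
```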
Brilliant! :tada: Great work by everyone involved, especially @mattdowle.
My strategy of "take a vacation until the problem is solved" seems to have worked out well :-)
Is there anything else that needs checking, or can this issue be considered resolved?
Thanks to @aadler and @HughParsonage! What a relief.
@kevinushey Yes, it was on the data.table side and is now fixed (PR #2488). Thanks.