verbose=FALSE
R crashes ('stack imbalance') when running the following. Note: I was able to run the code below successfully on an earlier development version of data.table a month or two ago, so I think this is a fairly recent bug. (Sorry, I don't remember the exact dev version that worked.)
The problem does not reproduce with a much smaller file. Link to the zip file (the csv is 350MB): https://github.com/HughParsonage/ABS-data/blob/master/inbox/SA2-by-DJZ-2011.zip
Sometimes I get a different error. For example,
Error in get(name, envir = ns, inherits = FALSE) : invalid first argument
or
Warning: stack imbalance in '$', 16 then 15
Error: R_Reprotect: only 1 protected item, can't reprotect index -2
# Minimal reproducible example
library(data.table)
#> data.table 1.10.5 IN DEVELOPMENT built 2017-11-13 02:46:28 UTC; appveyor
#> The fastest way to learn (by data.table authors): https://www.datacamp.com/courses/data-analysis-the-data-table-way
#> Documentation: ?data.table, example(data.table) and browseVignettes("data.table")
#> Release notes, videos and slides: http://r-datatable.com
fread("SA2-by-DJZ-2011.csv", header = FALSE, na.strings = "", verbose = TRUE)
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
Using 12 threads (omp_get_max_threads()=12, nth=12)
NAstrings = [<<>>]
None of the NAstrings look like numbers.
show progress = 1
0/1 column will be read as boolean
[02] Opening the file
Opening file SA2-by-DJZ-2011.csv
File opened, size = 349.4MB (366418725 bytes).
Memory mapping ... ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
\r-only line endings are not allowed because \n is found in the data
[05] Skipping initial rows if needed
Positioned on line 1 starting: <<Australian Bureau of Statistic>>
[06] Detect separator, quoting rule, and ncolumns
Detecting sep ...
sep=',' with 89 lines of 4 fields using quote rule 0
Detected 4 columns on line 12. This line is either column names or first data row. Line starts as: <<"Goulburn","110018063",3499,>>
Quote rule picked = 0
fill=false and the most number of columns found is 4
[07] Detect column types, good nrow estimate and whether first row is column names
'header' changed by user from 'auto' to false
Number of sampling jump points = 101 because (366418375 bytes from row 1 to eof) / (2 * 1457 jump0size) == 125744
Type codes (jump 000) : 1551 Quote rule 0
Type codes (jump 100) : 11051 Quote rule 0
=====
Sampled 10027 rows (handled \n inside quoted fields) at 101 jump points
Bytes from first data row on line 12 to the end of last row: 366418143
Line length: mean=16.02 sd=0.21 min=16 max=29
Estimated number of rows: 366418143 / 16.02 = 22877178
Initial alloc = 25164895 rows (22877178 + 9%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
=====
[08] Assign column names
[09] Apply user overrides on column types
After 0 type and 0 drop user overrides : 11051
[10] Allocate memory for the datatable
Allocating 4 column slots (4 - 0 dropped) with 25164895 rows
[11] Read the data
jumps=[0..360), chunk_size=1017828, total_size=366418143
[12] Finalizing the datatable
Read 22885380 rows x 4 columns from 349.4MB (366418725 bytes) file in 00:02.550 wall clock time
Thread buffers were grown 0 times (if all 12 threads each grew once, this figure would be 12)
Final type counts
0 : drop
1 : bool8
0 : bool8
0 : bool8
0 : bool8
1 : int32
0 : int64
0 : float64
0 : float64
0 : float64
2 : string
Rereading 1 columns due to out-of-sample type exceptions.
Column 1 ("") bumped from 'bool8' to 'string' due to <<"Goulburn">> on row 0
[11] Read the data
jumps=[0..360), chunk_size=1017828, total_size=366418143
Read 94%. ETA 00:00 Warning: stack imbalance in '$', 27 then 28
Read 98%. ETA 00:00
[12] Finalizing the datatable
Reread 22885380 rows x 1 columns in 00:00.991
Read 22885380 rows. Exactly what was estimated and allocated up front
=============================
0.006s ( 0%) Memory map 0.341GB file
0.011s ( 0%) sep=',' ncol=4 and header detection
0.002s ( 0%) Column type detection using 10027 sample rows
0.328s ( 9%) Allocation of 22885380 rows x 4 cols (0.469GB)
3.194s ( 90%) Reading 360 chunks of 0.971MB (63547 rows) using 12 threads
= 0.001s ( 0%) Finding first non-embedded \n after each jump
+ 0.362s ( 10%) Parse to row-major thread buffers
+ 1.963s ( 55%) Transpose
+ 0.868s ( 25%) Waiting
0.991s ( 28%) Rereading 1 columns due to out-of-sample type exceptions
3.541s Total
Warning: stack imbalance in 'withVisible', 3 then 5
Warning messages:
1: In fread("SA2-by-DJZ-2011.csv", header = FALSE, na.strings = "", :
Starting data input on line 12 <<"Goulburn","110018063",3499,>> with 4 fields and discarding line 11 <<"Main Statistical Area Structu>> before it because it has a different number of fields (3).
2: In fread("SA2-by-DJZ-2011.csv", header = FALSE, na.strings = "", :
Found the last consistent line but text exists afterwards. Consider fill=TRUE and/or blank.lines.skip=TRUE. First 200 characters of discarded line: <<"Dataset: 2011 Census of Population and Housing">>
# Output of sessionInfo()
R version 3.4.2 (2017-09-28)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)
Matrix products: default
locale:
[1] LC_COLLATE=English_Australia.1252 LC_CTYPE=English_Australia.1252
[3] LC_MONETARY=English_Australia.1252 LC_NUMERIC=C
[5] LC_TIME=English_Australia.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] data.table_1.10.5 RevoUtils_10.0.6 RevoUtilsMath_10.0.1
loaded via a namespace (and not attached):
[1] compiler_3.4.2 tools_3.4.2 yaml_2.1.14
@HughParsonage, this looks similar to #2457. Please pass showProgress=FALSE and check whether it completes.
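For reference, a minimal sketch of the suggested call. The small temporary file is a stand-in invented for illustration and is far too small to reproduce the problem; the only point is where the showProgress = FALSE argument goes.

```r
library(data.table)

# Stand-in for SA2-by-DJZ-2011.csv: a tiny headerless CSV in a temp file.
# (Hypothetical data; the real 350MB file is needed to reproduce the issue.)
tmp <- tempfile(fileext = ".csv")
writeLines(c('"Goulburn","110018063",3499',
             '"Goulburn","110018064",12'), tmp)

# Same arguments as in the report, but with the progress/ETA output disabled.
dt <- fread(tmp, header = FALSE, na.strings = "", showProgress = FALSE)

unlink(tmp)
```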
@mattdowle Could there have been a regression since 2017-11-09?
Running with showProgress=FALSE did in fact return the result (only the expected warnings were shown).
Thanks for all the detail. I doubt there has been a regression since 2017-11-09, but the long verbose=TRUE output could be having a similar effect to the ETA output. The file has to be reread, so more output is generated. Or showProgress=TRUE is what matters in @HughParsonage's report, and with verbose=TRUE the problem would happen if it were run 5 to 10 times.
No verbose messages are printed from within the parallel section (other than the progress ETA, which has already been fixed). However, there are verbose messages after the first read and before the second reread starts (which happens for this file). If that printing triggers the 100th CheckUserInterrupt (see #2457), the second parallel region could fail (oddly). Anyway, to rule that out, I have changed all verbose messages to use REprintf instead of Rprintf (the same as #2457 did for the ETA). The tests failed because they don't find the output on stderr. Once they pass, the Windows .zip will be built automatically; please try again then. I'll update here when it's ready.
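As a rough illustration of why tests can stop seeing output once it moves to stderr, here is a base-R sketch. It does not use data.table's test harness; emit_verbose() is a hypothetical stand-in for a verbose message written to stderr.

```r
# Hypothetical stand-in for a verbose message that is written to stderr.
emit_verbose <- function() message("verbose detail")

# The inner capture.output() listens on stdout (the default) and sees nothing;
# the outer one listens on the message stream (stderr) and receives the text.
seen_on_stderr <- capture.output(
  seen_on_stdout <- capture.output(emit_verbose(), type = "output"),
  type = "message"
)

length(seen_on_stdout)  # nothing arrived on stdout
seen_on_stderr          # the text arrived on stderr instead
```

A test that compares captured stdout against an expected string would therefore need to capture the message stream as well.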
OK, the second attempt passed the checks and the Windows .zip is available. @HughParsonage, could you try again please? I added a call to R_FlushConsole() after the verbose-mode messages just before the reread. That flush is only needed on Windows. My guess is that without the flush, the console sometimes updates a little later, while the parallel reread is happening, which is known to cause problems. Please always use both verbose=TRUE and showProgress=TRUE, and repeat 10 times. If you see 10 clean runs, I'd say that's it. Otherwise I'll have to think again.
Unfortunately, it is not fixed.
> library(data.table)
data.table 1.10.5 IN DEVELOPMENT built 2017-11-15 00:36:41 UTC; appveyor
The fastest way to learn (by data.table authors): https://www.datacamp.com/courses/data-analysis-the-data-table-way
Documentation: ?data.table, example(data.table) and browseVignettes("data.table")
Release notes, videos and slides: http://r-datatable.com
> fread("SA2-by-DJZ-2011.csv", header = FALSE, na.strings = "", verbose = FALSE)
Read 26%. ETA 00:00 Warning: stack imbalance in '$', 20 then 22
Read 52%. ETA 00:00 Warning: stack imbalance in '$', 36 then 35
Warning: stack imbalance in '$', 21 then 22
Read 59%. ETA 00:00 Error in fread("SA2-by-DJZ-2011.csv", header = FALSE, na.strings = "", :
unprotect_ptr: pointer not found
In addition: Warning: stack imbalance in '$', 26 then 28
Warning messages:
1: Warning: stack imbalance in '$', 26 then 27
In fread("SA2-by-DJZ-2011.csv", header = FALSE, na.strings = "", :
Starting data input on line 12 <<"Goulburn","110018063",3499,>> with 4 fields and discarding line 11 <<"Main Statistical Area Structu>> before it because it has a different number of fields (3).
2: In fread("SA2-by-DJZ-2011.csv", header = FALSE, na.strings = "", :
Found the last consistent line but text exists afterwards. Consider fill=TRUE and/or blank.lines.skip=TRUE. First 200 characters of discarded line: <<"Dataset: 2011 Census of Population and Housing">>
Warning: stack imbalance in 'lapply', 31 then 30
Warning: stack imbalance in '$', 14 then 15
After running it 10 times with verbose=TRUE, showProgress=TRUE, no errors occur. The following is the output of the 10th run.
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
Using 12 threads (omp_get_max_threads()=12, nth=12)
NAstrings = [<<>>]
None of the NAstrings look like numbers.
show progress = 1
0/1 column will be read as boolean
[02] Opening the file
Opening file SA2-by-DJZ-2011.csv
File opened, size = 349.4MB (366418725 bytes).
Memory mapping ... ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
\r-only line endings are not allowed because \n is found in the data
[05] Skipping initial rows if needed
Positioned on line 1 starting: <<Australian Bureau of Statistic>>
[06] Detect separator, quoting rule, and ncolumns
Detecting sep ...
sep=',' with 89 lines of 4 fields using quote rule 0
Detected 4 columns on line 12. This line is either column names or first data row. Line starts as: <<"Goulburn","110018063",3499,>>
Quote rule picked = 0
fill=false and the most number of columns found is 4
[07] Detect column types, good nrow estimate and whether first row is column names
'header' changed by user from 'auto' to false
Number of sampling jump points = 101 because (366418375 bytes from row 1 to eof) / (2 * 1457 jump0size) == 125744
Type codes (jump 000) : 1551 Quote rule 0
Type codes (jump 100) : 11051 Quote rule 0
=====
Sampled 10027 rows (handled \n inside quoted fields) at 101 jump points
Bytes from first data row on line 12 to the end of last row: 366418143
Line length: mean=16.02 sd=0.21 min=16 max=29
Estimated number of rows: 366418143 / 16.02 = 22877178
Initial alloc = 25164895 rows (22877178 + 9%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
=====
[08] Assign column names
[09] Apply user overrides on column types
After 0 type and 0 drop user overrides : 11051
[10] Allocate memory for the datatable
Allocating 4 column slots (4 - 0 dropped) with 25164895 rows
[11] Read the data
jumps=[0..360), chunk_size=1017828, total_size=366418143
[12] Finalizing the datatable
Read 22885380 rows x 4 columns from 349.4MB (366418725 bytes) file in 00:02.094 wall clock time
Thread buffers were grown 0 times (if all 1 threads each grew once, this figure would be 1)
Final type counts
0 : drop
1 : bool8
0 : bool8
0 : bool8
0 : bool8
1 : int32
0 : int64
0 : float64
0 : float64
0 : float64
2 : string
Rereading 1 columns due to out-of-sample type exceptions.
Column 1 ("") bumped from 'bool8' to 'string' due to <<"Goulburn">> on row 0
[11] Read the data
jumps=[0..360), chunk_size=1017828, total_size=366418143
Read 99%. ETA 00:00
[12] Finalizing the datatable
Reread 22885380 rows x 1 columns in 00:01.752
Read 22885380 rows. Exactly what was estimated and allocated up front
=============================
0.004s ( 0%) Memory map 0.341GB file
0.008s ( 0%) sep=',' ncol=4 and header detection
0.001s ( 0%) Column type detection using 10027 sample rows
0.173s ( 4%) Allocation of 22885380 rows x 4 cols (0.469GB)
3.660s ( 95%) Reading 360 chunks of 0.971MB (63547 rows) using 1 threads
= 0.009s ( 0%) Finding first non-embedded \n after each jump
+ 1.946s ( 51%) Parse to row-major thread buffers
+ 1.098s ( 29%) Transpose
+ 0.608s ( 16%) Waiting
1.752s ( 46%) Rereading 1 columns due to out-of-sample type exceptions
3.846s Total
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
Using 12 threads (omp_get_max_threads()=12, nth=12)
NAstrings = [<<>>]
None of the NAstrings look like numbers.
show progress = 1
0/1 column will be read as boolean
[02] Opening the file
Opening file SA2-by-DJZ-2011.csv
File opened, size = 349.4MB (366418725 bytes).
Memory mapping ... ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
\r-only line endings are not allowed because \n is found in the data
[05] Skipping initial rows if needed
Positioned on line 1 starting: <<Australian Bureau of Statistic>>
[06] Detect separator, quoting rule, and ncolumns
Detecting sep ...
sep=',' with 89 lines of 4 fields using quote rule 0
Detected 4 columns on line 12. This line is either column names or first data row. Line starts as: <<"Goulburn","110018063",3499,>>
Quote rule picked = 0
fill=false and the most number of columns found is 4
[07] Detect column types, good nrow estimate and whether first row is column names
'header' changed by user from 'auto' to false
Number of sampling jump points = 101 because (366418375 bytes from row 1 to eof) / (2 * 1457 jump0size) == 125744
Type codes (jump 000) : 1551 Quote rule 0
Type codes (jump 100) : 11051 Quote rule 0
=====
Sampled 10027 rows (handled \n inside quoted fields) at 101 jump points
Bytes from first data row on line 12 to the end of last row: 366418143
Line length: mean=16.02 sd=0.21 min=16 max=29
Estimated number of rows: 366418143 / 16.02 = 22877178
Initial alloc = 25164895 rows (22877178 + 9%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
=====
[08] Assign column names
[09] Apply user overrides on column types
After 0 type and 0 drop user overrides : 11051
[10] Allocate memory for the datatable
Allocating 4 column slots (4 - 0 dropped) with 25164895 rows
[11] Read the data
jumps=[0..360), chunk_size=1017828, total_size=366418143
[12] Finalizing the datatable
Read 22885380 rows x 4 columns from 349.4MB (366418725 bytes) file in 00:02.589 wall clock time
Thread buffers were grown 0 times (if all 1 threads each grew once, this figure would be 1)
Final type counts
0 : drop
1 : bool8
0 : bool8
0 : bool8
0 : bool8
1 : int32
0 : int64
0 : float64
0 : float64
0 : float64
2 : string
Rereading 1 columns due to out-of-sample type exceptions.
Column 1 ("") bumped from 'bool8' to 'string' due to <<"Goulburn">> on row 0
[11] Read the data
jumps=[0..360), chunk_size=1017828, total_size=366418143
Read 99%. ETA 00:00
[12] Finalizing the datatable
Reread 22885380 rows x 1 columns in 00:01.418
Read 22885380 rows. Exactly what was estimated and allocated up front
=============================
0.001s ( 0%) Memory map 0.341GB file
0.003s ( 0%) sep=',' ncol=4 and header detection
0.001s ( 0%) Column type detection using 10027 sample rows
0.574s ( 14%) Allocation of 22885380 rows x 4 cols (0.469GB)
3.428s ( 86%) Reading 360 chunks of 0.971MB (63547 rows) using 1 threads
= 0.010s ( 0%) Finding first non-embedded \n after each jump
+ 1.988s ( 50%) Parse to row-major thread buffers
+ 1.137s ( 28%) Transpose
+ 0.292s ( 7%) Waiting
1.418s ( 35%) Rereading 1 columns due to out-of-sample type exceptions
4.007s Total
There were 20 warnings (use warnings() to see them)
@HughParsonage Thanks! I'm confused, though. You're saying it works with verbose=TRUE, showProgress=TRUE, which is what we hoped for. Yay! Didn't that fail before? The default for showProgress is TRUE anyway, but when you run with the default verbose=FALSE, _then_ it doesn't work and shows the stack imbalance? It is strange for it to fail with _less_ output. Please confirm. If so, I'm probably barking up the wrong tree. It works fine for me on Linux, so I'm relying on you to test on Windows. Thanks.
(Also, the bottom of the 10th run's output says there were 20 warnings. I assume those are the 2 warnings shown above repeated 10 times. If so, that makes sense.)
Sorry for the confusion, Matt.
You're right that the original problem no longer causes a crash. That is, the following works as expected:
fread("SA2-by-DJZ-2011.csv", header = FALSE, na.strings = "")
To clarify: originally, verbose = FALSE (the default) crashed. Before filing the issue I tried verbose = TRUE, which produced the 'stack imbalance' warnings but did not crash. With the latest version, verbose = FALSE does not crash (or indeed cause any problem).
The reason I said 'not fixed' is that I saw these warning messages:
Warning messages:
Warning: stack imbalance in '$', 26 then 27
Warning: stack imbalance in 'lapply', 31 then 30
Warning: stack imbalance in '$', 14 then 15
They looked odd, and I thought they were closely related, if not identical. But this morning in Australia I can no longer reproduce the warning messages.
OK, got it. Warning messages about stack imbalance are essentially errors. We can't skip over them. I call a stack imbalance warning a crash even though it hasn't actually crashed yet. (Once you see the warning, it is only a matter of time before it crashes.)
When you run it 10 times in a fresh R session with verbose=TRUE, showProgress=TRUE, are any of the 20 warnings about stack imbalance, or are they all just the regular warnings like the following?
1: In fread("SA2-by-DJZ-2011.csv", header = FALSE, na.strings = "", :
Starting data input on line 12 <<"Goulburn","110018063",3499,>> with 4 fields and discarding line 11 <<"Main Statistical Area Structu>> before it because it has a different number of fields (3).
2: In fread("SA2-by-DJZ-2011.csv", header = FALSE, na.strings = "", :
Found the last consistent line but text exists afterwards. Consider fill=TRUE and/or blank.lines.skip=TRUE. First 200 characters of discarded line: <<"Dataset: 2011 Census of Population and Housing">>
If a stack imbalance warning occurs, please start a new R session. Once it has happened even once, nothing in that R session can be trusted.
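One way to get a fresh session for every run is to launch each attempt in its own Rscript process and scan what it prints. This is only a sketch under assumptions: run_in_fresh_session() is a hypothetical helper, and the placeholder code passed to it would be replaced by the fread() call under test.

```r
# Run a snippet of R code in a brand-new R process and return everything it
# printed (stdout and stderr), so state from an earlier run cannot leak in.
run_in_fresh_session <- function(code) {
  script <- tempfile(fileext = ".R")
  on.exit(unlink(script))
  writeLines(code, script)
  rscript <- file.path(R.home("bin"), "Rscript")
  # system2() warns when the child exits with a non-zero status; silence that.
  suppressWarnings(
    system2(rscript, c("--vanilla", shQuote(script)),
            stdout = TRUE, stderr = TRUE)
  )
}

# Harmless placeholder; in practice this would be the fread() call under test.
out <- run_in_fresh_session('cat("ok\\n")')

# TRUE if the child process reported a stack imbalance anywhere in its output.
any(grepl("stack imbalance", out, fixed = TRUE))
```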
I got a crash when I ran with verbose=TRUE, showProgress=TRUE. Something about a const char to a SEXP. I'll try to reproduce it at the command line (unfortunately it happened in RStudio, and RStudio closed before I could read the whole message).
I can't reproduce the crash. These are the results after a reboot. There are stack imbalance warnings.
> library(data.table)
data.table 1.10.5 IN DEVELOPMENT built 2017-11-15 00:36:41 UTC; appveyor
The fastest way to learn (by data.table authors): https://www.datacamp.com/courses/data-analysis-the-data-table-way
Documentation: ?data.table, example(data.table) and browseVignettes("data.table")
Release notes, videos and slides: http://r-datatable.com
> for (i in 1:10) fread("SA2-by-DJZ-2011.csv", header = FALSE, na.strings = "", verbose = TRUE, showProgress = TRUE)
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
Using 12 threads (omp_get_max_threads()=12, nth=12)
NAstrings = [<<>>]
None of the NAstrings look like numbers.
show progress = 1
0/1 column will be read as boolean
[02] Opening the file
Opening file SA2-by-DJZ-2011.csv
File opened, size = 349.4MB (366418725 bytes).
Memory mapping ... ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
\r-only line endings are not allowed because \n is found in the data
[05] Skipping initial rows if needed
Positioned on line 1 starting: <<Australian Bureau of Statistic>>
[06] Detect separator, quoting rule, and ncolumns
Detecting sep ...
sep=',' with 89 lines of 4 fields using quote rule 0
Detected 4 columns on line 12. This line is either column names or first data row. Line starts as: <<"Goulburn","110018063",3499,>>
Quote rule picked = 0
fill=false and the most number of columns found is 4
[07] Detect column types, good nrow estimate and whether first row is column names
'header' changed by user from 'auto' to false
Number of sampling jump points = 101 because (366418375 bytes from row 1 to eof) / (2 * 1457 jump0size) == 125744
Type codes (jump 000) : 1551 Quote rule 0
Type codes (jump 100) : 11051 Quote rule 0
=====
Sampled 10027 rows (handled \n inside quoted fields) at 101 jump points
Bytes from first data row on line 12 to the end of last row: 366418143
Line length: mean=16.02 sd=0.21 min=16 max=29
Estimated number of rows: 366418143 / 16.02 = 22877178
Initial alloc = 25164895 rows (22877178 + 9%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
=====
[08] Assign column names
[09] Apply user overrides on column types
After 0 type and 0 drop user overrides : 11051
[10] Allocate memory for the datatable
Allocating 4 column slots (4 - 0 dropped) with 25164895 rows
[11] Read the data
jumps=[0..360), chunk_size=1017828, total_size=366418143
[12] Finalizing the datatable
Read 22885380 rows x 4 columns from 349.4MB (366418725 bytes) file in 00:01.212 wall clock time
Thread buffers were grown 0 times (if all 12 threads each grew once, this figure would be 12)
Final type counts
0 : drop
1 : bool8
0 : bool8
0 : bool8
0 : bool8
1 : int32
0 : int64
0 : float64
0 : float64
0 : float64
2 : string
Rereading 1 columns due to out-of-sample type exceptions.
Column 1 ("") bumped from 'bool8' to 'string' due to <<"Goulburn">> on row 0
[11] Read the data
jumps=[0..360), chunk_size=1017828, total_size=366418143
Read 31%. ETA 00:00 Warning: stack imbalance in '$', 24 then 23
Read 91%. ETA 00:00 Warning: stack imbalance in '$', 27 then 26
Read 95%. ETA 00:00 Warning: stack imbalance in '$', 28 then 29
Read 98%. ETA 00:00
[12] Finalizing the datatable
Reread 22885380 rows x 1 columns in 00:00.895
Read 22885380 rows. Exactly what was estimated and allocated up front
=============================
0.002s ( 0%) Memory map 0.341GB file
0.029s ( 1%) sep=',' ncol=4 and header detection
0.001s ( 0%) Column type detection using 10027 sample rows
0.314s ( 15%) Allocation of 22885380 rows x 4 cols (0.469GB)
1.761s ( 84%) Reading 360 chunks of 0.971MB (63547 rows) using 12 threads
= 0.015s ( 1%) Finding first non-embedded \n after each jump
+ 0.599s ( 28%) Parse to row-major thread buffers
+ 0.400s ( 19%) Transpose
+ 0.746s ( 35%) Waiting
0.895s ( 42%) Rereading 1 columns due to out-of-sample type exceptions
2.107s Total
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
Using 12 threads (omp_get_max_threads()=12, nth=12)
NAstrings = [<<>>]
None of the NAstrings look like numbers.
show progress = 1
0/1 column will be read as boolean
[02] Opening the file
Opening file SA2-by-DJZ-2011.csv
File opened, size = 349.4MB (366418725 bytes).
Memory mapping ... ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
\r-only line endings are not allowed because \n is found in the data
[05] Skipping initial rows if needed
Positioned on line 1 starting: <<Australian Bureau of Statistic>>
[06] Detect separator, quoting rule, and ncolumns
Detecting sep ...
sep=',' with 89 lines of 4 fields using quote rule 0
Detected 4 columns on line 12. This line is either column names or first data row. Line starts as: <<"Goulburn","110018063",3499,>>
Quote rule picked = 0
fill=false and the most number of columns found is 4
[07] Detect column types, good nrow estimate and whether first row is column names
'header' changed by user from 'auto' to false
Number of sampling jump points = 101 because (366418375 bytes from row 1 to eof) / (2 * 1457 jump0size) == 125744
Type codes (jump 000) : 1551 Quote rule 0
Type codes (jump 100) : 11051 Quote rule 0
=====
Sampled 10027 rows (handled \n inside quoted fields) at 101 jump points
Bytes from first data row on line 12 to the end of last row: 366418143
Line length: mean=16.02 sd=0.21 min=16 max=29
Estimated number of rows: 366418143 / 16.02 = 22877178
Initial alloc = 25164895 rows (22877178 + 9%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
=====
[08] Assign column names
[09] Apply user overrides on column types
After 0 type and 0 drop user overrides : 11051
[10] Allocate memory for the datatable
Allocating 4 column slots (4 - 0 dropped) with 25164895 rows
[11] Read the data
jumps=[0..360), chunk_size=1017828, total_size=366418143
[12] Finalizing the datatable
Read 22885380 rows x 4 columns from 349.4MB (366418725 bytes) file in 00:01.335 wall clock time
Thread buffers were grown 0 times (if all 12 threads each grew once, this figure would be 12)
Final type counts
0 : drop
1 : bool8
0 : bool8
0 : bool8
0 : bool8
1 : int32
0 : int64
0 : float64
0 : float64
0 : float64
2 : string
Rereading 1 columns due to out-of-sample type exceptions.
Column 1 ("") bumped from 'bool8' to 'string' due to <<"Goulburn">> on row 0
[11] Read the data
jumps=[0..360), chunk_size=1017828, total_size=366418143
Read 98%. ETA 00:00
[12] Finalizing the datatable
Reread 22885380 rows x 1 columns in 00:01.049
Read 22885380 rows. Exactly what was estimated and allocated up front
=============================
0.002s ( 0%) Memory map 0.341GB file
0.006s ( 0%) sep=',' ncol=4 and header detection
0.001s ( 0%) Column type detection using 10027 sample rows
0.402s ( 17%) Allocation of 22885380 rows x 4 cols (0.469GB)
1.974s ( 83%) Reading 360 chunks of 0.971MB (63547 rows) using 12 threads
= 0.001s ( 0%) Finding first non-embedded \n after each jump
+ 0.209s ( 9%) Parse to row-major thread buffers
+ 0.864s ( 36%) Transpose
+ 0.900s ( 38%) Waiting
1.049s ( 44%) Rereading 1 columns due to out-of-sample type exceptions
2.385s Total
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
Using 12 threads (omp_get_max_threads()=12, nth=12)
NAstrings = [<<>>]
None of the NAstrings look like numbers.
show progress = 1
0/1 column will be read as boolean
[02] Opening the file
Opening file SA2-by-DJZ-2011.csv
File opened, size = 349.4MB (366418725 bytes).
Memory mapping ... ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
\r-only line endings are not allowed because \n is found in the data
[05] Skipping initial rows if needed
Positioned on line 1 starting: <<Australian Bureau of Statistic>>
[06] Detect separator, quoting rule, and ncolumns
Detecting sep ...
sep=',' with 89 lines of 4 fields using quote rule 0
Detected 4 columns on line 12. This line is either column names or first data row. Line starts as: <<"Goulburn","110018063",3499,>>
Quote rule picked = 0
fill=false and the most number of columns found is 4
[07] Detect column types, good nrow estimate and whether first row is column names
'header' changed by user from 'auto' to false
Number of sampling jump points = 101 because (366418375 bytes from row 1 to eof) / (2 * 1457 jump0size) == 125744
Type codes (jump 000) : 1551 Quote rule 0
Type codes (jump 100) : 11051 Quote rule 0
=====
Sampled 10027 rows (handled \n inside quoted fields) at 101 jump points
Bytes from first data row on line 12 to the end of last row: 366418143
Line length: mean=16.02 sd=0.21 min=16 max=29
Estimated number of rows: 366418143 / 16.02 = 22877178
Initial alloc = 25164895 rows (22877178 + 9%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
=====
[08] Assign column names
[09] Apply user overrides on column types
After 0 type and 0 drop user overrides : 11051
[10] Allocate memory for the datatable
Allocating 4 column slots (4 - 0 dropped) with 25164895 rows
[11] Read the data
jumps=[0..360), chunk_size=1017828, total_size=366418143
[12] Finalizing the datatable
Read 22885380 rows x 4 columns from 349.4MB (366418725 bytes) file in 00:01.212 wall clock time
Thread buffers were grown 0 times (if all 12 threads each grew once, this figure would be 12)
Final type counts
0 : drop
1 : bool8
0 : bool8
0 : bool8
0 : bool8
1 : int32
0 : int64
0 : float64
0 : float64
0 : float64
2 : string
Rereading 1 columns due to out-of-sample type exceptions.
Column 1 ("") bumped from 'bool8' to 'string' due to <<"Goulburn">> on row 0
[11] Read the data
jumps=[0..360), chunk_size=1017828, total_size=366418143
Read 96%. ETA 00:00
[12] Finalizing the datatable
Reread 22885380 rows x 1 columns in 00:00.414
Read 22885380 rows. Exactly what was estimated and allocated up front
=============================
0.002s ( 0%) Memory map 0.341GB file
0.007s ( 0%) sep=',' ncol=4 and header detection
0.002s ( 0%) Column type detection using 10027 sample rows
0.293s ( 18%) Allocation of 22885380 rows x 4 cols (0.469GB)
1.322s ( 81%) Reading 360 chunks of 0.971MB (63547 rows) using 12 threads
= 0.001s ( 0%) Finding first non-embedded \n after each jump
+ 0.199s ( 12%) Parse to row-major thread buffers
+ 0.822s ( 51%) Transpose
+ 0.301s ( 19%) Waiting
0.414s ( 25%) Rereading 1 columns due to out-of-sample type exceptions
1.626s Total
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
Using 12 threads (omp_get_max_threads()=12, nth=12)
NAstrings = [<<>>]
None of the NAstrings look like numbers.
show progress = 1
0/1 column will be read as boolean
[02] Opening the file
Opening file SA2-by-DJZ-2011.csv
File opened, size = 349.4MB (366418725 bytes).
Memory mapping ... ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
\r-only line endings are not allowed because \n is found in the data
[05] Skipping initial rows if needed
Positioned on line 1 starting: <<Australian Bureau of Statistic>>
[06] Detect separator, quoting rule, and ncolumns
Detecting sep ...
sep=',' with 89 lines of 4 fields using quote rule 0
Detected 4 columns on line 12. This line is either column names or first data row. Line starts as: <<"Goulburn","110018063",3499,>>
Quote rule picked = 0
fill=false and the most number of columns found is 4
[07] Detect column types, good nrow estimate and whether first row is column names
'header' changed by user from 'auto' to false
Number of sampling jump points = 101 because (366418375 bytes from row 1 to eof) / (2 * 1457 jump0size) == 125744
Type codes (jump 000) : 1551 Quote rule 0
Type codes (jump 100) : 11051 Quote rule 0
=====
Sampled 10027 rows (handled \n inside quoted fields) at 101 jump points
Bytes from first data row on line 12 to the end of last row: 366418143
Line length: mean=16.02 sd=0.21 min=16 max=29
Estimated number of rows: 366418143 / 16.02 = 22877178
Initial alloc = 25164895 rows (22877178 + 9%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
=====
[08] Assign column names
[09] Apply user overrides on column types
After 0 type and 0 drop user overrides : 11051
[10] Allocate memory for the datatable
Allocating 4 column slots (4 - 0 dropped) with 25164895 rows
[11] Read the data
jumps=[0..360), chunk_size=1017828, total_size=366418143
[12] Finalizing the datatable
Read 22885380 rows x 4 columns from 349.4MB (366418725 bytes) file in 00:01.451 wall clock time
Thread buffers were grown 0 times (if all 12 threads each grew once, this figure would be 12)
Final type counts
0 : drop
1 : bool8
0 : bool8
0 : bool8
0 : bool8
1 : int32
0 : int64
0 : float64
0 : float64
0 : float64
2 : string
Rereading 1 columns due to out-of-sample type exceptions.
Column 1 ("") bumped from 'bool8' to 'string' due to <<"Goulburn">> on row 0
[11] Read the data
jumps=[0..360), chunk_size=1017828, total_size=366418143
Read 97%. ETA 00:00
[12] Finalizing the datatable
Reread 22885380 rows x 1 columns in 00:00.409
Read 22885380 rows. Exactly what was estimated and allocated up front
=============================
0.002s ( 0%) Memory map 0.341GB file
0.006s ( 0%) sep=',' ncol=4 and header detection
0.001s ( 0%) Column type detection using 10027 sample rows
0.403s ( 22%) Allocation of 22885380 rows x 4 cols (0.469GB)
1.448s ( 78%) Reading 360 chunks of 0.971MB (63547 rows) using 12 threads
= 0.001s ( 0%) Finding first non-embedded \n after each jump
+ 0.194s ( 10%) Parse to row-major thread buffers
+ 0.974s ( 52%) Transpose
+ 0.279s ( 15%) Waiting
0.409s ( 22%) Rereading 1 columns due to out-of-sample type exceptions
1.860s Total
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
Using 12 threads (omp_get_max_threads()=12, nth=12)
NAstrings = [<<>>]
None of the NAstrings look like numbers.
show progress = 1
0/1 column will be read as boolean
[02] Opening the file
Opening file SA2-by-DJZ-2011.csv
File opened, size = 349.4MB (366418725 bytes).
Memory mapping ... ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
\r-only line endings are not allowed because \n is found in the data
[05] Skipping initial rows if needed
Positioned on line 1 starting: <<Australian Bureau of Statistic>>
[06] Detect separator, quoting rule, and ncolumns
Detecting sep ...
sep=',' with 89 lines of 4 fields using quote rule 0
Detected 4 columns on line 12. This line is either column names or first data row. Line starts as: <<"Goulburn","110018063",3499,>>
Quote rule picked = 0
fill=false and the most number of columns found is 4
[07] Detect column types, good nrow estimate and whether first row is column names
'header' changed by user from 'auto' to false
Number of sampling jump points = 101 because (366418375 bytes from row 1 to eof) / (2 * 1457 jump0size) == 125744
Type codes (jump 000) : 1551 Quote rule 0
Type codes (jump 100) : 11051 Quote rule 0
=====
Sampled 10027 rows (handled \n inside quoted fields) at 101 jump points
Bytes from first data row on line 12 to the end of last row: 366418143
Line length: mean=16.02 sd=0.21 min=16 max=29
Estimated number of rows: 366418143 / 16.02 = 22877178
Initial alloc = 25164895 rows (22877178 + 9%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
=====
[08] Assign column names
[09] Apply user overrides on column types
After 0 type and 0 drop user overrides : 11051
[10] Allocate memory for the datatable
Allocating 4 column slots (4 - 0 dropped) with 25164895 rows
[11] Read the data
jumps=[0..360), chunk_size=1017828, total_size=366418143
[12] Finalizing the datatable
Read 22885380 rows x 4 columns from 349.4MB (366418725 bytes) file in 00:01.480 wall clock time
Thread buffers were grown 0 times (if all 12 threads each grew once, this figure would be 12)
Final type counts
0 : drop
1 : bool8
0 : bool8
0 : bool8
0 : bool8
1 : int32
0 : int64
0 : float64
0 : float64
0 : float64
2 : string
Rereading 1 columns due to out-of-sample type exceptions.
Column 1 ("") bumped from 'bool8' to 'string' due to <<"Goulburn">> on row 0
[11] Read the data
jumps=[0..360), chunk_size=1017828, total_size=366418143
Read 99%. ETA 00:00
[12] Finalizing the datatable
Reread 22885380 rows x 1 columns in 00:00.412
Read 22885380 rows. Exactly what was estimated and allocated up front
=============================
0.002s ( 0%) Memory map 0.341GB file
0.007s ( 0%) sep=',' ncol=4 and header detection
0.001s ( 0%) Column type detection using 10027 sample rows
0.459s ( 24%) Allocation of 22885380 rows x 4 cols (0.469GB)
1.424s ( 75%) Reading 360 chunks of 0.971MB (63547 rows) using 12 threads
= 0.001s ( 0%) Finding first non-embedded \n after each jump
+ 0.197s ( 10%) Parse to row-major thread buffers
+ 0.938s ( 50%) Transpose
+ 0.288s ( 15%) Waiting
0.412s ( 22%) Rereading 1 columns due to out-of-sample type exceptions
1.892s Total
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
Using 12 threads (omp_get_max_threads()=12, nth=12)
NAstrings = [<<>>]
None of the NAstrings look like numbers.
show progress = 1
0/1 column will be read as boolean
[02] Opening the file
Opening file SA2-by-DJZ-2011.csv
File opened, size = 349.4MB (366418725 bytes).
Memory mapping ... ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
\r-only line endings are not allowed because \n is found in the data
[05] Skipping initial rows if needed
Positioned on line 1 starting: <<Australian Bureau of Statistic>>
[06] Detect separator, quoting rule, and ncolumns
Detecting sep ...
sep=',' with 89 lines of 4 fields using quote rule 0
Detected 4 columns on line 12. This line is either column names or first data row. Line starts as: <<"Goulburn","110018063",3499,>>
Quote rule picked = 0
fill=false and the most number of columns found is 4
[07] Detect column types, good nrow estimate and whether first row is column names
'header' changed by user from 'auto' to false
Number of sampling jump points = 101 because (366418375 bytes from row 1 to eof) / (2 * 1457 jump0size) == 125744
Type codes (jump 000) : 1551 Quote rule 0
Type codes (jump 100) : 11051 Quote rule 0
=====
Sampled 10027 rows (handled \n inside quoted fields) at 101 jump points
Bytes from first data row on line 12 to the end of last row: 366418143
Line length: mean=16.02 sd=0.21 min=16 max=29
Estimated number of rows: 366418143 / 16.02 = 22877178
Initial alloc = 25164895 rows (22877178 + 9%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
=====
[08] Assign column names
[09] Apply user overrides on column types
After 0 type and 0 drop user overrides : 11051
[10] Allocate memory for the datatable
Allocating 4 column slots (4 - 0 dropped) with 25164895 rows
[11] Read the data
jumps=[0..360), chunk_size=1017828, total_size=366418143
[12] Finalizing the datatable
Read 22885380 rows x 4 columns from 349.4MB (366418725 bytes) file in 00:01.381 wall clock time
Thread buffers were grown 0 times (if all 12 threads each grew once, this figure would be 12)
Final type counts
0 : drop
1 : bool8
0 : bool8
0 : bool8
0 : bool8
1 : int32
0 : int64
0 : float64
0 : float64
0 : float64
2 : string
Rereading 1 columns due to out-of-sample type exceptions.
Column 1 ("") bumped from 'bool8' to 'string' due to <<"Goulburn">> on row 0
[11] Read the data
jumps=[0..360), chunk_size=1017828, total_size=366418143
Read 97%. ETA 00:00
[12] Finalizing the datatable
Reread 22885380 rows x 1 columns in 00:00.401
Read 22885380 rows. Exactly what was estimated and allocated up front
=============================
0.002s ( 0%) Memory map 0.341GB file
0.005s ( 0%) sep=',' ncol=4 and header detection
0.001s ( 0%) Column type detection using 10027 sample rows
0.384s ( 22%) Allocation of 22885380 rows x 4 cols (0.469GB)
1.389s ( 78%) Reading 360 chunks of 0.971MB (63547 rows) using 12 threads
= 0.001s ( 0%) Finding first non-embedded \n after each jump
+ 0.196s ( 11%) Parse to row-major thread buffers
+ 0.911s ( 51%) Transpose
+ 0.281s ( 16%) Waiting
0.401s ( 22%) Rereading 1 columns due to out-of-sample type exceptions
1.781s Total
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
Using 12 threads (omp_get_max_threads()=12, nth=12)
NAstrings = [<<>>]
None of the NAstrings look like numbers.
show progress = 1
0/1 column will be read as boolean
[02] Opening the file
Opening file SA2-by-DJZ-2011.csv
File opened, size = 349.4MB (366418725 bytes).
Memory mapping ... ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
\r-only line endings are not allowed because \n is found in the data
[05] Skipping initial rows if needed
Positioned on line 1 starting: <<Australian Bureau of Statistic>>
[06] Detect separator, quoting rule, and ncolumns
Detecting sep ...
sep=',' with 89 lines of 4 fields using quote rule 0
Detected 4 columns on line 12. This line is either column names or first data row. Line starts as: <<"Goulburn","110018063",3499,>>
Quote rule picked = 0
fill=false and the most number of columns found is 4
[07] Detect column types, good nrow estimate and whether first row is column names
'header' changed by user from 'auto' to false
Number of sampling jump points = 101 because (366418375 bytes from row 1 to eof) / (2 * 1457 jump0size) == 125744
Type codes (jump 000) : 1551 Quote rule 0
Type codes (jump 100) : 11051 Quote rule 0
=====
Sampled 10027 rows (handled \n inside quoted fields) at 101 jump points
Bytes from first data row on line 12 to the end of last row: 366418143
Line length: mean=16.02 sd=0.21 min=16 max=29
Estimated number of rows: 366418143 / 16.02 = 22877178
Initial alloc = 25164895 rows (22877178 + 9%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
=====
[08] Assign column names
[09] Apply user overrides on column types
After 0 type and 0 drop user overrides : 11051
[10] Allocate memory for the datatable
Allocating 4 column slots (4 - 0 dropped) with 25164895 rows
[11] Read the data
jumps=[0..360), chunk_size=1017828, total_size=366418143
[12] Finalizing the datatable
Read 22885380 rows x 4 columns from 349.4MB (366418725 bytes) file in 00:01.384 wall clock time
Thread buffers were grown 0 times (if all 12 threads each grew once, this figure would be 12)
Final type counts
0 : drop
1 : bool8
0 : bool8
0 : bool8
0 : bool8
1 : int32
0 : int64
0 : float64
0 : float64
0 : float64
2 : string
Rereading 1 columns due to out-of-sample type exceptions.
Column 1 ("") bumped from 'bool8' to 'string' due to <<"Goulburn">> on row 0
[11] Read the data
jumps=[0..360), chunk_size=1017828, total_size=366418143
Read 98%. ETA 00:00
[12] Finalizing the datatable
Reread 22885380 rows x 1 columns in 00:00.480
Read 22885380 rows. Exactly what was estimated and allocated up front
=============================
0.002s ( 0%) Memory map 0.341GB file
0.006s ( 0%) sep=',' ncol=4 and header detection
0.002s ( 0%) Column type detection using 10027 sample rows
0.476s ( 26%) Allocation of 22885380 rows x 4 cols (0.469GB)
1.378s ( 74%) Reading 360 chunks of 0.971MB (63547 rows) using 12 threads
= 0.001s ( 0%) Finding first non-embedded \n after each jump
+ 0.192s ( 10%) Parse to row-major thread buffers
+ 0.833s ( 45%) Transpose
+ 0.352s ( 19%) Waiting
0.480s ( 26%) Rereading 1 columns due to out-of-sample type exceptions
1.864s Total
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
Using 12 threads (omp_get_max_threads()=12, nth=12)
NAstrings = [<<>>]
None of the NAstrings look like numbers.
show progress = 1
0/1 column will be read as boolean
[02] Opening the file
Opening file SA2-by-DJZ-2011.csv
File opened, size = 349.4MB (366418725 bytes).
Memory mapping ... ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
\r-only line endings are not allowed because \n is found in the data
[05] Skipping initial rows if needed
Positioned on line 1 starting: <<Australian Bureau of Statistic>>
[06] Detect separator, quoting rule, and ncolumns
Detecting sep ...
sep=',' with 89 lines of 4 fields using quote rule 0
Detected 4 columns on line 12. This line is either column names or first data row. Line starts as: <<"Goulburn","110018063",3499,>>
Quote rule picked = 0
fill=false and the most number of columns found is 4
[07] Detect column types, good nrow estimate and whether first row is column names
'header' changed by user from 'auto' to false
Number of sampling jump points = 101 because (366418375 bytes from row 1 to eof) / (2 * 1457 jump0size) == 125744
Type codes (jump 000) : 1551 Quote rule 0
Type codes (jump 100) : 11051 Quote rule 0
=====
Sampled 10027 rows (handled \n inside quoted fields) at 101 jump points
Bytes from first data row on line 12 to the end of last row: 366418143
Line length: mean=16.02 sd=0.21 min=16 max=29
Estimated number of rows: 366418143 / 16.02 = 22877178
Initial alloc = 25164895 rows (22877178 + 9%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
=====
[08] Assign column names
[09] Apply user overrides on column types
After 0 type and 0 drop user overrides : 11051
[10] Allocate memory for the datatable
Allocating 4 column slots (4 - 0 dropped) with 25164895 rows
[11] Read the data
jumps=[0..360), chunk_size=1017828, total_size=366418143
[12] Finalizing the datatable
Read 22885380 rows x 4 columns from 349.4MB (366418725 bytes) file in 00:01.374 wall clock time
Thread buffers were grown 0 times (if all 12 threads each grew once, this figure would be 12)
Final type counts
0 : drop
1 : bool8
0 : bool8
0 : bool8
0 : bool8
1 : int32
0 : int64
0 : float64
0 : float64
0 : float64
2 : string
Rereading 1 columns due to out-of-sample type exceptions.
Column 1 ("") bumped from 'bool8' to 'string' due to <<"Goulburn">> on row 0
[11] Read the data
jumps=[0..360), chunk_size=1017828, total_size=366418143
Read 98%. ETA 00:00
[12] Finalizing the datatable
Reread 22885380 rows x 1 columns in 00:00.507
Read 22885380 rows. Exactly what was estimated and allocated up front
=============================
0.002s ( 0%) Memory map 0.341GB file
0.006s ( 0%) sep=',' ncol=4 and header detection
0.001s ( 0%) Column type detection using 10027 sample rows
0.311s ( 17%) Allocation of 22885380 rows x 4 cols (0.469GB)
1.562s ( 83%) Reading 360 chunks of 0.971MB (63547 rows) using 12 threads
= 0.001s ( 0%) Finding first non-embedded \n after each jump
+ 0.193s ( 10%) Parse to row-major thread buffers
+ 0.988s ( 52%) Transpose
+ 0.381s ( 20%) Waiting
0.507s ( 27%) Rereading 1 columns due to out-of-sample type exceptions
1.881s Total
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
Using 12 threads (omp_get_max_threads()=12, nth=12)
NAstrings = [<<>>]
None of the NAstrings look like numbers.
show progress = 1
0/1 column will be read as boolean
[02] Opening the file
Opening file SA2-by-DJZ-2011.csv
File opened, size = 349.4MB (366418725 bytes).
Memory mapping ... ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
\r-only line endings are not allowed because \n is found in the data
[05] Skipping initial rows if needed
Positioned on line 1 starting: <<Australian Bureau of Statistic>>
[06] Detect separator, quoting rule, and ncolumns
Detecting sep ...
sep=',' with 89 lines of 4 fields using quote rule 0
Detected 4 columns on line 12. This line is either column names or first data row. Line starts as: <<"Goulburn","110018063",3499,>>
Quote rule picked = 0
fill=false and the most number of columns found is 4
[07] Detect column types, good nrow estimate and whether first row is column names
'header' changed by user from 'auto' to false
Number of sampling jump points = 101 because (366418375 bytes from row 1 to eof) / (2 * 1457 jump0size) == 125744
Type codes (jump 000) : 1551 Quote rule 0
Type codes (jump 100) : 11051 Quote rule 0
=====
Sampled 10027 rows (handled \n inside quoted fields) at 101 jump points
Bytes from first data row on line 12 to the end of last row: 366418143
Line length: mean=16.02 sd=0.21 min=16 max=29
Estimated number of rows: 366418143 / 16.02 = 22877178
Initial alloc = 25164895 rows (22877178 + 9%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
=====
[08] Assign column names
[09] Apply user overrides on column types
After 0 type and 0 drop user overrides : 11051
[10] Allocate memory for the datatable
Allocating 4 column slots (4 - 0 dropped) with 25164895 rows
[11] Read the data
jumps=[0..360), chunk_size=1017828, total_size=366418143
[12] Finalizing the datatable
Read 22885380 rows x 4 columns from 349.4MB (366418725 bytes) file in 00:01.318 wall clock time
Thread buffers were grown 0 times (if all 12 threads each grew once, this figure would be 12)
Final type counts
0 : drop
1 : bool8
0 : bool8
0 : bool8
0 : bool8
1 : int32
0 : int64
0 : float64
0 : float64
0 : float64
2 : string
Rereading 1 columns due to out-of-sample type exceptions.
Column 1 ("") bumped from 'bool8' to 'string' due to <<"Goulburn">> on row 0
[11] Read the data
jumps=[0..360), chunk_size=1017828, total_size=366418143
Read 96%. ETA 00:00
[12] Finalizing the datatable
Reread 22885380 rows x 1 columns in 00:00.493
Read 22885380 rows. Exactly what was estimated and allocated up front
=============================
0.002s ( 0%) Memory map 0.341GB file
0.006s ( 0%) sep=',' ncol=4 and header detection
0.001s ( 0%) Column type detection using 10027 sample rows
0.306s ( 17%) Allocation of 22885380 rows x 4 cols (0.469GB)
1.496s ( 83%) Reading 360 chunks of 0.971MB (63547 rows) using 12 threads
= 0.001s ( 0%) Finding first non-embedded \n after each jump
+ 0.193s ( 11%) Parse to row-major thread buffers
+ 0.935s ( 52%) Transpose
+ 0.367s ( 20%) Waiting
0.493s ( 27%) Rereading 1 columns due to out-of-sample type exceptions
1.811s Total
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
Using 12 threads (omp_get_max_threads()=12, nth=12)
NAstrings = [<<>>]
None of the NAstrings look like numbers.
show progress = 1
0/1 column will be read as boolean
[02] Opening the file
Opening file SA2-by-DJZ-2011.csv
File opened, size = 349.4MB (366418725 bytes).
Memory mapping ... ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
\r-only line endings are not allowed because \n is found in the data
[05] Skipping initial rows if needed
Positioned on line 1 starting: <<Australian Bureau of Statistic>>
[06] Detect separator, quoting rule, and ncolumns
Detecting sep ...
sep=',' with 89 lines of 4 fields using quote rule 0
Detected 4 columns on line 12. This line is either column names or first data row. Line starts as: <<"Goulburn","110018063",3499,>>
Quote rule picked = 0
fill=false and the most number of columns found is 4
[07] Detect column types, good nrow estimate and whether first row is column names
'header' changed by user from 'auto' to false
Number of sampling jump points = 101 because (366418375 bytes from row 1 to eof) / (2 * 1457 jump0size) == 125744
Type codes (jump 000) : 1551 Quote rule 0
Type codes (jump 100) : 11051 Quote rule 0
=====
Sampled 10027 rows (handled \n inside quoted fields) at 101 jump points
Bytes from first data row on line 12 to the end of last row: 366418143
Line length: mean=16.02 sd=0.21 min=16 max=29
Estimated number of rows: 366418143 / 16.02 = 22877178
Initial alloc = 25164895 rows (22877178 + 9%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
=====
[08] Assign column names
[09] Apply user overrides on column types
After 0 type and 0 drop user overrides : 11051
[10] Allocate memory for the datatable
Allocating 4 column slots (4 - 0 dropped) with 25164895 rows
[11] Read the data
jumps=[0..360), chunk_size=1017828, total_size=366418143
[12] Finalizing the datatable
Read 22885380 rows x 4 columns from 349.4MB (366418725 bytes) file in 00:01.141 wall clock time
Thread buffers were grown 0 times (if all 12 threads each grew once, this figure would be 12)
Final type counts
0 : drop
1 : bool8
0 : bool8
0 : bool8
0 : bool8
1 : int32
0 : int64
0 : float64
0 : float64
0 : float64
2 : string
Rereading 1 columns due to out-of-sample type exceptions.
Column 1 ("") bumped from 'bool8' to 'string' due to <<"Goulburn">> on row 0
[11] Read the data
jumps=[0..360), chunk_size=1017828, total_size=366418143
Read 98%. ETA 00:00
[12] Finalizing the datatable
Reread 22885380 rows x 1 columns in 00:00.506
Read 22885380 rows. Exactly what was estimated and allocated up front
=============================
0.002s ( 0%) Memory map 0.341GB file
0.007s ( 0%) sep=',' ncol=4 and header detection
0.001s ( 0%) Column type detection using 10027 sample rows
0.132s ( 8%) Allocation of 22885380 rows x 4 cols (0.469GB)
1.506s ( 91%) Reading 360 chunks of 0.971MB (63547 rows) using 12 threads
= 0.001s ( 0%) Finding first non-embedded \n after each jump
+ 0.195s ( 12%) Parse to row-major thread buffers
+ 0.938s ( 57%) Transpose
+ 0.371s ( 23%) Waiting
0.506s ( 31%) Rereading 1 columns due to out-of-sample type exceptions
1.647s Total
Warning: stack imbalance in 'for', 2 then 8
There were 20 warnings (use warnings() to see them)
Oddly enough, the certainty is great. Thanks. So the flushes didn't work, which means I'll have to find a way to avoid `Rprintf` after all. I'm assuming `verbose=FALSE, showProgress=FALSE` works reliably (you wrote that near the top of this issue, so I'm relying on it). By "reliably" I mean 10 consecutive runs with only the 2 expected warnings and no stack imbalance warnings.
Leave it with me then. Thanks again.
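For anyone following along, one generic way to keep console printing out of worker threads is to let each worker write its message into its own buffer and have the calling thread print everything after the parallel loop has finished. The C sketch below only illustrates that idea. `process_chunk`, `run_chunks`, `NCHUNK` and `MSGLEN` are hypothetical names, and this is not claimed to be the change that was made to fread.

```c
#include <assert.h>
#include <stdio.h>

#define NCHUNK 8   /* hypothetical number of chunks */
#define MSGLEN 64  /* hypothetical per-chunk message capacity */

/* Hypothetical stand-in for per-chunk work: it records a message in a
 * caller-supplied buffer instead of printing from the worker thread. */
static void process_chunk(int chunk, char *msg, size_t len) {
    snprintf(msg, len, "chunk %d done", chunk);
}

/* Runs all chunks (in parallel when compiled with OpenMP, serially
 * otherwise), then prints the collected messages from the calling thread
 * only. Returns the number of messages printed. */
static int run_chunks(FILE *out) {
    assert(out != NULL);
    char msgs[NCHUNK][MSGLEN];

    #pragma omp parallel for
    for (int i = 0; i < NCHUNK; i++) {
        /* Each iteration touches only its own row, so no locking is needed. */
        process_chunk(i, msgs[i], MSGLEN);
    }

    /* All console output happens here, outside the parallel region. */
    int printed = 0;
    for (int i = 0; i < NCHUNK; i++) {
        fprintf(out, "%s\n", msgs[i]);
        printed++;
    }
    return printed;
}
```

Calling `run_chunks(stdout)` prints one line per chunk in chunk order, whichever thread processed it. The trade-off is that progress only becomes visible after the loop ends, so a live progress meter would need a different approach.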
@HughParsonage OK, please try again with the latest, second attempt. It isn't merged to master yet, so please get it directly from here. As before, please provide the full output either way so I can check. Thanks!
The first attempt at the following crashed (something about a pointer).
The second attempt (after a reboot) produced a `stack imbalance in '$', 16 then 15` warning.
# Assert that `data.table` is not installed:
stopifnot(!requireNamespace("data.table", quietly = TRUE))
install.packages("https://ci.appveyor.com/api/buildjobs/bpsehtwybbbgbyy3/artifacts/data.table_1.10.5.zip",
repos = NULL)
# Installing package into ‘C:/Users/hughp/Documents/R/win-library/3.4’
# (as ‘lib’ is unspecified)
# trying URL 'https://ci.appveyor.com/api/buildjobs/bpsehtwybbbgbyy3/artifacts/data.table_1.10.5.zip'
# Content type 'application/octet-stream' length 1557502 bytes (1.5 MB)
# downloaded 1.5 MB
library(data.table)
# data.table 1.10.5 IN DEVELOPMENT built 2017-11-17 01:38:17 UTC; appveyor
# The fastest way to learn (by data.table authors): https://www.datacamp.com/courses/data-analysis-the-data-table-way
# Documentation: ?data.table, example(data.table) and browseVignettes("data.table")
# Release notes, videos and slides: http://r-datatable.com
setwd("~/ABS-data/inbox/SA2-by-DJZ-2011/")
fread("SA2-by-DJZ-2011.csv", verbose = TRUE, na.strings = "", header = FALSE)
# Input contains no \n. Taking this to be a filename to open
# [01] Check arguments
# Using 12 threads (omp_get_max_threads()=12, nth=12)
# NAstrings = [<<>>]
# None of the NAstrings look like numbers.
# show progress = 1
# 0/1 column will be read as boolean
# [02] Opening the file
# Opening file SA2-by-DJZ-2011.csv
# File opened, size = 349.4MB (366418725 bytes).
# Memory mapping ... ok
# [03] Detect and skip BOM
# [04] Arrange mmap to be \0 terminated
# \r-only line endings are not allowed because \n is found in the data
# [05] Skipping initial rows if needed
# Positioned on line 1 starting: <<Australian Bureau of Statistic>>
# [06] Detect separator, quoting rule, and ncolumns
# Detecting sep ...
# sep=',' with 89 lines of 4 fields using quote rule 0
# Detected 4 columns on line 12. This line is either column names or first data row. Line starts as: <<"Goulburn","110018063",3499,>>
# Quote rule picked = 0
# fill=false and the most number of columns found is 4
# [07] Detect column types, good nrow estimate and whether first row is column names
# 'header' changed by user from 'auto' to false
# Number of sampling jump points = 101 because (366418375 bytes from row 1 to eof) / (2 * 1457 jump0size) == 125744
# Type codes (jump 000) : 1551 Quote rule 0
# Type codes (jump 100) : 1A51 Quote rule 0
# =====
# Sampled 10027 rows (handled \n inside quoted fields) at 101 jump points
# Bytes from first data row on line 12 to the end of last row: 366418143
# Line length: mean=16.02 sd=0.21 min=16 max=29
# Estimated number of rows: 366418143 / 16.02 = 22877178
# Initial alloc = 25164895 rows (22877178 + 9%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
# =====
# [08] Assign column names
# [09] Apply user overrides on column types
# After 0 type and 0 drop user overrides : 1A51
# [10] Allocate memory for the datatable
# Allocating 4 column slots (4 - 0 dropped) with 25164895 rows
# [11] Read the data
# jumps=[0..360), chunk_size=1017828, total_size=366418143
# Read 78%. ETA 00:00 Warning: stack imbalance in '$', 16 then 15
# Read 22885380 rows x 4 columns from 349.4MB (366418725 bytes) file in 00:02.677 wall clock time
# [12] Finalizing the datatable
# Type counts:
# 1 : bool8 '1'
# 1 : int32 '5'
# 2 : string 'A'
# =============================
# 0.002s ( 0%) Memory map 0.341GB file
# 0.007s ( 0%) sep=',' ncol=4 and header detection
# 0.001s ( 0%) Column type detection using 10027 sample rows
# 0.297s ( 11%) Allocation of 25164895 rows x 4 cols (0.469GB) of which 22885380 ( 91%) rows used
# 2.369s ( 89%) Reading 360 chunks of 0.971MB (63547 rows) using 12 threads
# = 0.003s ( 0%) Finding first non-embedded \n after each jump
# + 0.273s ( 10%) Parse to row-major thread buffers (grown 0 times)
# + 1.313s ( 49%) Transpose
# + 0.780s ( 29%) Waiting
# 0.893s ( 33%) Rereading 1 columns due to out-of-sample type exceptions
# 2.677s Total
# Column 1 ("") bumped from 'bool8' to 'string' due to <<"Goulburn">> on row 0
# V1 V2 V3 V4
# 1: Goulburn 110018063 3499 NA
# 2: NA 110018064 812 NA
# 3: NA 110018065 2158 NA
# 4: NA 110019999 402 NA
# 5: NA 110028068 10 NA
# ---
# 22885376: NA 997999799 0 NA
# 22885377: NA 998999899 64 NA
# 22885378: NA 994999499 34 NA
# 22885379: NA 0&&&&&&&& 250796 NA
# 22885380: NA 0@@@@@@@@ 7305367 NA
# Warning messages:
# 1: In fread("SA2-by-DJZ-2011.csv", verbose = TRUE, na.strings = "", :
# Starting data input on line 12 <<"Goulburn","110018063",3499,>> with 4 fields and discarding line 11 <<"Main Statistical Area Structu>> before it because it has a different number of fields (3).
# 2: In fread("SA2-by-DJZ-2011.csv", verbose = TRUE, na.strings = "", :
# Found the last consistent line but text exists afterwards. Consider fill=TRUE and/or blank.lines.skip=TRUE. First 200 characters of discarded line: <<"Dataset: 2011 Census of Population and Housing">>
Hello, @mattdowle. There are still versions of GCC in use whose OpenMP support is at best 3.1, not 4.0. I ran into that problem with one of my packages on CRAN (Delaporte): I tried using SIMD directives (OpenMP 4.0), which compiled with Rtools for Windows (based on GCC 4.9.3) but errored on someone's Linux system that was still using GCC 4.8.0. If I remember correctly, even Windows can only use 4.0 calls, not 4.5. Could that be the source of the problem?
@HughParsonage Thanks for testing so quickly! OK, I'll keep thinking then!
@aadler Good thought. Anything is possible.
@HughParsonage Just to confirm, does the same command with one change (`verbose=FALSE`) work OK? i.e. `fread("SA2-by-DJZ-2011.csv", verbose = FALSE, na.strings = "", header = FALSE)`. The progress meter will still be displayed.
Yes, running that command (10 times) returned the expected result (i.e. a data.table with only the two warnings caused by the malformed file). There were no stack imbalance warnings.
Thanks. So it does seem to be related to console output. A few more things to try ...
In verbose mode there are a few branches inside the parallel region that call `wallclock()`. To rule that out, I've short-circuited it so that it always returns 0.0 and avoids the system call. I thought it was thread-safe, but maybe it isn't. Please try the new Windows .zip from the rebuilt branch here.
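To make that diagnostic concrete, here is a minimal C sketch of short-circuiting a timing helper so that no clock call happens while it is switched off. `wallclock_sketch` and `wallclock_disabled` are hypothetical names chosen for illustration, and the real `wallclock()` in fread may look quite different.

```c
#include <assert.h>
#include <time.h>

/* Hypothetical diagnostic switch: when non-zero, the helper below makes no
 * system call at all and simply reports 0.0. */
static int wallclock_disabled = 1;

/* Hypothetical stand-in for a wallclock() helper: returns the current time
 * in seconds as a double, or 0.0 while the diagnostic switch is on. */
static double wallclock_sketch(void) {
    if (wallclock_disabled) {
        return 0.0;  /* short-circuit: rules the clock call out as a suspect */
    }
    struct timespec ts;
    int rc = timespec_get(&ts, TIME_UTC);  /* C11; returns TIME_UTC on success */
    assert(rc == TIME_UTC);
    (void)rc;  /* keeps builds with NDEBUG free of unused-variable warnings */
    return (double)ts.tv_sec + 1e-9 * (double)ts.tv_nsec;
}
```

With the switch on, every timing in the verbose report would read as zero, which is the price of the experiment: if the stack imbalance persists anyway, the clock call can be crossed off the list.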
First attempt:
install.packages("https://ci.appveyor.com/api/buildjobs/o0pn9ttkrbqgqw2k/artifacts/data.table_1.10.5.zip",
repos = NULL)
# Installing package into ‘C:/Users/hughp/Documents/R/win-library/3.4’
# (as ‘lib’ is unspecified)
# trying URL 'https://ci.appveyor.com/api/buildjobs/o0pn9ttkrbqgqw2k/artifacts/data.table_1.10.5.zip'
# Content type 'application/octet-stream' length 1556972 bytes (1.5 MB)
# downloaded 1.5 MB
#
# package ‘data.table’ successfully unpacked and MD5 sums checked
library(data.table)
# data.table 1.10.5 IN DEVELOPMENT built 2017-11-17 03:49:20 UTC; appveyor
# The fastest way to learn (by data.table authors): https://www.datacamp.com/courses/data-analysis-the-data-table-way
# Documentation: ?data.table, example(data.table) and browseVignettes("data.table")
# Release notes, videos and slides: http://r-datatable.com
fread("SA2-by-DJZ-2011.csv", verbose = TRUE, na.strings = "", header = FALSE)
On the second attempt, the following warnings were displayed:
Read 22%. ETA 00:00 Error in fread("SA2-by-DJZ-2011.csv", verbose = TRUE, na.strings = "", :
unprotect_ptr: pointer not found
In addition: Warning messages:
1: In fread("SA2-by-DJZ-2011.csv", verbose = TRUE, na.strings = "", :
Starting data input on line 12 <<"Goulburn","110018063",3499,>> with 4 fields and discarding line 11 <<"Main Statistical Area Structu>> before it because it has a different number of fields (3).
2: In fread("SA2-by-DJZ-2011.csv", verbose = TRUE, na.strings = "", :
Warning: stack imbalance in '$', 29 then 28
Found the last consistent line but text exists afterwards. Consider fill=TRUE and/or blank.lines.skip=TRUE. First 200 characters of discarded line: <<"Dataset: 2011 Census of Population and Housing">>
Warning: stack imbalance in 'lapply', 125 then 126
Warning: stack imbalance in 'lapply', 55 then 53
Warning: stack imbalance in 'lapply', 30 then 34
Warning: stack imbalance in '<-', 28 then 31
Warning: stack imbalance in '{', 24 then 27
Warning: stack imbalance in '{', 18 then 21
A thought: could this be an RStudio issue? It doesn't seem to reproduce easily when I run the script from a terminal. I've been running it in RStudio because it is easier to copy the console output from there.
When you say it doesn't reproduce _easily_ outside RStudio, does it reproduce _at all_? Even if it only happens within RStudio, I'd still want to fix it on the data.table side. I'm asking as another route to confirming that this is definitely "just" console output and not some other genuine stack imbalance in the fread logic.
I can't get it to reproduce outside RStudio at all, reliably or otherwise (i.e. I can't reproduce any warnings or crashes there). I tried the Windows command prompt and git shell (Windows).
I'm using RStudio version 1.1.383 on Windows. Would it help if I raised this issue with them, or would you prefer that I wait?
Thanks. Knowing that it only happens inside RStudio is really useful. There's no need to raise it with them. It means this is related to buffering of console output (or similar). I'm working on it and will push soon.
I can't work out why Windows won't compile the changes:
fread.c:1054:3: warning: too many arguments for format [-Wformat-extra-args]
It works fine on Linux and on Travis. This is what is stopping the Windows .zip from being built so that you can test the workaround. I'll have to dig further.
(It complains about line 1054 but not about the very next line, 1055, which is only slightly different. Could it be a `%llu` problem on Windows with `__VA_ARGS__`? Surely not.)
OK, the windows.zip is finally ready for you to try again, here.
There are currently several workarounds in this branch. If it works, I'll remove the workarounds one by one to find out which of them did it. The `llu` compiler warning looks the most promising, because it would cause a stack imbalance in the verbose output, which matches the explanation from @st-pasha found here. The `Rprintf` layer was probably hiding it from the compiler. It shows up now that I'm using `fprintf` directly.
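As background on the `%llu` suspicion: a hard-coded `%llu` is not understood by every C runtime's printf family, and a format/argument mismatch in a variadic call is undefined behaviour. A common portable alternative is to use a fixed-width type together with the matching `<inttypes.h>` macro. The sketch below assumes that this class of problem is what the compiler warning points at, which the thread only suspects. `format_bytes` is a hypothetical helper, not a function in fread.c.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper: formats a byte count into buf using the PRIu64 macro
 * from <inttypes.h> instead of a hard-coded "%llu", so the conversion
 * specifier always matches the uint64_t argument on the target toolchain.
 * Returns the number of characters snprintf would have written. */
static int format_bytes(char *buf, size_t len, uint64_t nbytes) {
    assert(buf != NULL && len > 0);
    return snprintf(buf, len, "File size = %" PRIu64 " bytes", nbytes);
}
```

For example, `format_bytes(buf, sizeof buf, UINT64_C(366418725))` fills `buf` with `File size = 366418725 bytes`. Keeping the format string and the argument type tied together this way also lets `-Wformat` check the call on every platform.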
On the second attempt (after a reboot):
stopifnot(!requireNamespace("data.table", quietly = TRUE))
install.packages("https://ci.appveyor.com/api/buildjobs/1pi0ae5iuyj9rhj8/artifacts/data.table_1.10.5.zip",
repos = NULL)
# Installing package into ‘C:/Users/hughp/Documents/R/win-library/3.4’
# (as ‘lib’ is unspecified)
# trying URL 'https://ci.appveyor.com/api/buildjobs/1pi0ae5iuyj9rhj8/artifacts/data.table_1.10.5.zip'
# Content type 'application/octet-stream' length 1559167 bytes (1.5 MB)
# downloaded 1.5 MB
#
# package ‘data.table’ successfully unpacked and MD5 sums checked
library(data.table)
# data.table 1.10.5 IN DEVELOPMENT built 2017-11-18 04:58:23 UTC; appveyor
# The fastest way to learn (by data.table authors): https://www.datacamp.com/courses/data-analysis-the-data-table-way
# Documentation: ?data.table, example(data.table) and browseVignettes("data.table")
# Release notes, videos and slides: http://r-datatable.com
fread("SA2-by-DJZ-2011.csv", verbose = TRUE, na.strings = "", header = FALSE)
Log file: C:\Users\hughp\AppData\Local\Temp\RtmpIT9H0D/fread.out
Input contains no \n. Taking this to be a filename to open
Read 11%. ETA 00:00 Warning: stack imbalance in '$', 19 then 20
Read 28%. ETA 00:00 Warning: stack imbalance in '$', 19 then 20
Read 48%. ETA 00:00 Warning: stack imbalance in '$', 20 then 19
Read 98%. ETA 00:00 [01] Check arguments
Using 12 threads (omp_get_max_threads()=12, nth=12)
NAstrings = [<<>>]
None of the NAstrings look like numbers.
show progress = 1
0/1 column will be read as boolean
[02] Opening the file
Opening file SA2-by-DJZ-2011.csv
File opened, size = 349.4MB (366418725 bytes).
Memory mapped ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
\n has been found in the data so any mixture of line endings is allowed other than \r-only line endings. This is common and ideal.
[05] Skipping initial rows if needed
Positioned on line 1 starting: <<Australian Bureau of Statistic>>
[06] Detect separator, quoting rule, and ncolumns
Detecting sep ...
sep=',' with 89 lines of 4 fields using quote rule 0
Detected 4 columns on line 12. This line is either column names or first data row. Line starts as: <<"Goulburn","110018063",3499,>>
Quote rule picked = 0
fill=false and the most number of columns found is 4
[07] Detect column types, good nrow estimate and whether first row is column names
'header' changed by user from 'auto' to false
Number of sampling jump points = 101 because (366418375 bytes from row 1 to eof) / (2 * 1457 jump0size) == 125744
Type codes (jump 000) : 1551 Quote rule 0
Type codes (jump 100) : 1A51 Quote rule 0
=====
Sampled 10027 rows (handled \n inside quoted fields) at 101 jump points
Bytes from first data row on line 12 to the end of last row: 366418143
Line length: mean=16.02 sd=0.21 min=16 max=29
Estimated number of rows: 366418143 / 16.02 = 22877178
Initial alloc = 25164895 rows (22877178 + 9%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
=====
[08] Assign column names
[09] Apply user overrides on column types
After 0 type and 0 drop user overrides : 1A51
[10] Allocate memory for the datatable
Allocating 4 column slots (4 - 0 dropped) with 25164895 rows
[11] Read the data
jumps=[0..360), chunk_size=1017828, total_size=366418143
[11] Read the data
jumps=[0..360), chunk_size=1017828, total_size=366418143
Read 22885380 rows x 4 columns from 349.4MB (366418725 bytes) file in 00:02.822 wall clock time
[12] Finalizing the datatable
Type counts:
1 : bool8 '1'
1 : int32 '5'
2 : string 'A'
=============================
0.000s ( 0%) Memory map 0.341GB file
0.001s ( 0%) sep=',' ncol=4 and header detection
0.000s ( 0%) Column type detection using 10027 sample rows
0.291s ( 10%) Allocation of 25164895 rows x 4 cols (0.469GB) of which 22885380 ( 91%) rows used
2.531s ( 90%) Reading 360 chunks of 0.971MB (63547 rows) using 12 threads
= 0.002s ( 0%) Finding first non-embedded \n after each jump
+ 0.282s ( 10%) Parse to row-major thread buffers (grown 0 times)
+ 1.537s ( 54%) Transpose
+ 0.710s ( 25%) Waiting
0.842s ( 30%) Rereading 1 columns due to out-of-sample type exceptions
2.822s Total
Column 1 ("") bumped from 'bool8' to 'string' due to <<"Goulburn">> on row 0
V1 V2 V3 V4
1: Goulburn 110018063 3499 NA
2: NA 110018064 812 NA
3: NA 110018065 2158 NA
4: NA 110019999 402 NA
5: NA 110028068 10 NA
---
22885376: NA 997999799 0 NA
22885377: NA 998999899 64 NA
22885378: NA 994999499 34 NA
22885379: NA 0&&&&&&&& 250796 NA
22885380: NA 0@@@@@@@@ 7305367 NA
Warning messages:
1: In fread("SA2-by-DJZ-2011.csv", verbose = TRUE, na.strings = "", :
Starting data input on line 12 <<"Goulburn","110018063",3499,>> with 4 fields and discarding line 11 <<"Main Statistical Area Structu>> before it because it has a different number of fields (3).
2: In fread("SA2-by-DJZ-2011.csv", verbose = TRUE, na.strings = "", :
Found the last consistent line but text exists afterwards. Consider fill=TRUE and/or blank.lines.skip=TRUE. First 200 characters of discarded line: <<"Dataset: 2011 Census of Population and Housing">>
Again, I cannot reproduce this outside RStudio.
Thanks for testing so quickly. Well, that certainly rules that out! Two ideas are left. The first is in and has passed. Please try the new Windows.zip from here. That alloca is on the stack and is related to na.strings, which you are setting. It is definitely in the right area (stack imbalance) and worth a try.
No problem. I will be away for the next 12 hours, so I cannot test until then.
On Saturday, 18 November 2017 at 5:20 PM, Matt Dowle wrote:
Thanks for testing so quickly. Well, that certainly rules that out!
Two ideas are left. The first is in and has passed. Please try the new Windows.zip from here:
https://ci.appveyor.com/project/Rdatatable/data-table/build/1.0.1363/job/fo02vnbu5ebhwy3w/artifacts
That alloca is on the stack and is related to na.strings, which you are setting. It is definitely in the right area (stack imbalance) and worth a try.
Ok, no worries. Thanks! I have now pushed the second idea as well. I seem to remember \r causing problems on Windows in the past, although not a stack imbalance. Anyway, to rule it out, I have removed \r from the progress meter. The stack imbalance messages seem to be printed where the ETA line occurs. It is possible that the console treats \r differently in order to replace the last line. Now a new line is printed each time the ETA updates. This is only temporary, to rule that out. The new Windows.zip has built and passed here.
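To make the \r point concrete, here is a minimal C sketch of the two styles of progress line: one ending in \r, which asks the console to overwrite the line in place, and one ending in \n, which gives every update its own line. The line format mirrors the "Read 11%. ETA 00:00" output seen in this thread. The name format_progress and its parameters are hypothetical, chosen for illustration; this is not data.table's actual code.

```c
#include <assert.h>
#include <stdio.h>

/* Write "Read NN%. ETA MM:SS" into buf, terminated by '\r' (overwrite the
   line in place) or '\n' (one line per update).
   Returns the number of characters written, excluding the trailing NUL.
   Hypothetical helper for illustration only. */
int format_progress(char *buf, size_t size, int percent, int eta_seconds,
                    int overwrite)
{
    assert(percent >= 0 && percent <= 100 && eta_seconds >= 0);
    int n = snprintf(buf, size, "Read %d%%. ETA %02d:%02d%c",
                     percent, eta_seconds / 60, eta_seconds % 60,
                     overwrite ? '\r' : '\n');
    assert(n >= 0 && (size_t)n < size);  /* output must fit in buf */
    return n;
}
```

With overwrite set to 0, each update lands on its own line, which is the temporary behaviour described above; with overwrite set to 1, how the line is displayed depends on how the console handles \r.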
fread("SA2-by-DJZ-2011.csv", verbose = TRUE, na.strings = "", header = FALSE)
Log file: C:\Users\hughp\AppData\Local\Temp\RtmpcVjZ1f/fread.out
Input contains no \n. Taking this to be a filename to open
Read 5%. ETA 00:00
Read 8%. ETA 00:00
Read 11%. ETA 00:00
Read 15%. ETA 00:00
Read 18%. ETA 00:00
Read 21%. ETA 00:00
Read 25%. ETA 00:00
Read 28%. ETA 00:00
Read 31%. ETA 00:00
Read 35%. ETA 00:00
Read 38%. ETA 00:00
Read 41%. ETA 00:00
Read 45%. ETA 00:00
Read 48%. ETA 00:00
Read 51%. ETA 00:00
Read 55%. ETA 00:00
Warning: stack imbalance in '$', 30 then 31
Warning: stack imbalance in '$', 17 then 16
Read 58%. ETA 00:00
Read 61%. ETA 00:00
Read 65%. ETA 00:00
Read 68%. ETA 00:00
Read 71%. ETA 00:00
Read 75%. ETA 00:00
Read 78%. ETA 00:00
Read 81%. ETA 00:00
Read 85%. ETA 00:00
Read 88%. ETA 00:00
Read 91%. ETA 00:00
Read 95%. ETA 00:00
Read 98%. ETA 00:00
[01] Check arguments
Using 12 threads (omp_get_max_threads()=12, nth=12)
NAstrings = [<<>>]
None of the NAstrings look like numbers.
show progress = 1
0/1 column will be read as boolean
[02] Opening the file
Opening file SA2-by-DJZ-2011.csv
File opened, size = 349.4MB (366418725 bytes).
Memory mapped ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
\n has been found in the data so any mixture of line endings is allowed other than \r-only line endings. This is common and ideal.
[05] Skipping initial rows if needed
Positioned on line 1 starting: <<Australian Bureau of Statistic>>
[06] Detect separator, quoting rule, and ncolumns
Detecting sep ...
sep=',' with 89 lines of 4 fields using quote rule 0
Detected 4 columns on line 12. This line is either column names or first data row. Line starts as: <<"Goulburn","110018063",3499,>>
Quote rule picked = 0
fill=false and the most number of columns found is 4
[07] Detect column types, good nrow estimate and whether first row is column names
'header' changed by user from 'auto' to false
Number of sampling jump points = 101 because (366418375 bytes from row 1 to eof) / (2 * 1457 jump0size) == 125744
Type codes (jump 000) : 1551 Quote rule 0
Type codes (jump 100) : 1A51 Quote rule 0
=====
Sampled 10027 rows (handled \n inside quoted fields) at 101 jump points
Bytes from first data row on line 12 to the end of last row: 366418143
Line length: mean=16.02 sd=0.21 min=16 max=29
Estimated number of rows: 366418143 / 16.02 = 22877178
Initial alloc = 25164895 rows (22877178 + 9%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
=====
[08] Assign column names
[09] Apply user overrides on column types
After 0 type and 0 drop user overrides : 1A51
[10] Allocate memory for the datatable
Allocating 4 column slots (4 - 0 dropped) with 25164895 rows
[11] Read the data
jumps=[0..360), chunk_size=1017828, total_size=366418143
[11] Read the data
jumps=[0..360), chunk_size=1017828, total_size=366418143
Read 22885380 rows x 4 columns from 349.4MB (366418725 bytes) file in 00:02.894 wall clock time
[12] Finalizing the datatable
Type counts:
1 : bool8 '1'
1 : int32 '5'
2 : string 'A'
=============================
0.001s ( 0%) Memory map 0.341GB file
0.003s ( 0%) sep=',' ncol=4 and header detection
0.000s ( 0%) Column type detection using 10027 sample rows
0.316s ( 11%) Allocation of 25164895 rows x 4 cols (0.469GB) of which 22885380 ( 91%) rows used
2.574s ( 89%) Reading 360 chunks of 0.971MB (63547 rows) using 12 threads
= 0.004s ( 0%) Finding first non-embedded \n after each jump
+ 0.284s ( 10%) Parse to row-major thread buffers (grown 0 times)
+ 1.450s ( 50%) Transpose
+ 0.837s ( 29%) Waiting
0.953s ( 33%) Rereading 1 columns due to out-of-sample type exceptions
2.894s Total
Column 1 ("") bumped from 'bool8' to 'string' due to <<"Goulburn">> on row 0
V1 V2 V3 V4
1: Goulburn 110018063 3499 NA
2: NA 110018064 812 NA
3: NA 110018065 2158 NA
4: NA 110019999 402 NA
5: NA 110028068 10 NA
---
22885376: NA 997999799 0 NA
22885377: NA 998999899 64 NA
22885378: NA 994999499 34 NA
22885379: NA 0&&&&&&&& 250796 NA
22885380: NA 0@@@@@@@@ 7305367 NA
Warning messages:
1: In fread("SA2-by-DJZ-2011.csv", verbose = TRUE, na.strings = "", :
Starting data input on line 12 <<"Goulburn","110018063",3499,>> with 4 fields and discarding line 11 <<"Main Statistical Area Structu>> before it because it has a different number of fields (3).
2: In fread("SA2-by-DJZ-2011.csv", verbose = TRUE, na.strings = "", :
Found the last consistent line but text exists afterwards. Consider fill=TRUE and/or blank.lines.skip=TRUE. First 200 characters of discarded line: <<"Dataset: 2011 Census of Population and Housing">>
Note: I cannot reproduce this stack imbalance error on a different Windows system that runs a slightly older version of RStudio.
In that case, I think it is time to ask RStudio support, as you suggested. I have looked through the fread code again and I am out of ideas. Please include the two RStudio version numbers. This does not necessarily mean it is RStudio; it could be a fault on the data.table side that only shows up in one version of RStudio. But it does seem to be related to console output, and it is odd that the difference is specific to RStudio. I searched for "RStudio stack imbalance", but that returns many issues about package faults rather than RStudio itself. It is a hard problem to search for. Let's leave the issue open here and see what they say.
One last attempt. I doubt it will help, but to be safe, please try it from here. Perhaps the MinGW compiler used on Windows does something strange with these two ints, one of which is optimized to a constant 0, and that causes the stack imbalance.
However, I don't think the particular stack imbalance message comes from fread or data.table; it comes from R itself, at eval.c:491. check_stack_balance() is called in only 5 places inside R: once in names.c, do_internal(); twice in objects.c, applyMethod(); and twice in eval.c, eval(). I cannot see how any of these could be reached while fread.c is in its parallel section. The only entry point called there is REprintf, and I cannot see how that could reach check_stack_balance(). The only thing I can think of at the moment is that RStudio has a thread doing something in the background that interacts with console output, perhaps differently on Windows.
Finally, to be safe, I have switched from Rprintf to REprintf, since base R uses REprintf for its progress meters at libcurl.c:354 and internet.c:409, so that seems to be the right way to do it. R's own C-level progress bar is not available through R's API (and it appears to be implemented twice inside R at C level).
@mattdowle, would this help? https://github.com/r-lib/progress
@aadler Ah, thanks! The source contains the following comment:
// In R Studio we should print to stdout, because printing a \r
// to stderr is buggy (reported)
But I have already removed \r and the stack imbalance still occurs. I wonder where it was reported.
The last build did not work.
Reported at https://community.rstudio.com/t/stack-imbalance-possibly-in-stderr/3009
A question asked earlier on R-devel: [Rd] Are Rprintf and REprintf thread-safe?
Upshot: "Rprintf and REprintf are not thread-safe."
Yoiks!
Thanks, Hugh, for the link and for raising the issue with RStudio.
Because Rprintf and REprintf are not thread-safe, data.table::fwrite() and data.table::fread() call them only from the master thread, for the progress meter. Not only do two data.table threads never call that R entry point at the same time, but only the master thread can call it, and it is the only R entry point that is ever called by any thread inside the parallel section. However, Rprintf calls R_CheckUserInterrupt once every 100 prints. I think that is the part that is not safe even from the master thread alone. That is the reason for using REprintf: it does not call R_CheckUserInterrupt. R internals use REprintf for progress meters, so switching to REprintf is also good for consistency with core R. In other words, the switch in itself has nothing to do with stderr versus stdout.
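A rough sketch of the master-thread-only pattern described above, assuming OpenMP. Worker threads do the parsing, and only thread 0 ever touches the output stream. Here fprintf stands in for REprintf so that the example builds without R headers, and process_chunks is a hypothetical name, not a function in fread.c. The sketch also compiles without OpenMP, in which case the loop simply runs on one thread.

```c
#include <assert.h>
#include <stdio.h>
#ifdef _OPENMP
#include <omp.h>
#else
/* Serial fallback so the sketch still builds without OpenMP support. */
static int omp_get_thread_num(void) { return 0; }
#endif

/* Process n_chunks units of work in parallel and return a checksum.
   Only the master thread (thread 0) writes progress; all other threads
   never touch the stream. fprintf is a stand-in for REprintf here. */
long process_chunks(int n_chunks, FILE *progress)
{
    assert(n_chunks >= 0);
    long total = 0;
    int done = 0;

    #pragma omp parallel for reduction(+:total) schedule(dynamic)
    for (int i = 0; i < n_chunks; i++) {
        total += i;                 /* stand-in for parsing chunk i */

        #pragma omp atomic
        done++;

        if (omp_get_thread_num() == 0 && progress != NULL) {
            int snapshot;
            #pragma omp atomic read
            snapshot = done;
            fprintf(progress, "chunk %d of %d\n", snapshot, n_chunks);
        }
    }
    return total;
}
```

For example, process_chunks(100, stderr) returns 4950 however many threads are used; only the number of progress lines printed by thread 0 varies between runs.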
@kevinushey Could you take a look at this thread and let me know anything else I could try? Could it be something related to RStudio? Could it have anything to do with background threads? If RStudio has a background thread, then Rprintf / REprintf could be called from two threads at the same time. But if that were the case, we would have seen many more problems by now, so it seems unlikely. RStudio may replace the ptr_* callbacks described in R-exts; these callbacks relate to console output and interaction. But that section begins by saying it is for Unix-alikes only, so I don't know how Windows comes into it. Section 8.1.5, Threading issues, may be relevant too. Both are subsections of section 8: "Linking GUIs and other front-ends to R".
I am out until early December, so unfortunately I won't have a chance to look until then. However, RStudio runs almost everything on the main thread, using the R event loop. The only exceptions are things like indexing of project source files, and those background threads generally don't touch the R API.
RStudio does take over various ptr_* callbacks to handle console input and output. I can't immediately think of how they would be the cause here, but I will try to take a deeper look when I am back.
Ok. Please try from here. Previously, the progress was updated every 2%. In your case the file takes under 3 seconds, so that was a new progress update to the RStudio console every 0.06 seconds. That may have been too much for RStudio. So this attempt prints a bar instead. It does not use \r at all, which is also better for reports and log files, where \r can fill up the output.
Because your 3-second timing is so fast, I have reduced the progress bar to start at 1 second if there is an ETA of 1 second. Otherwise it would not be displayed at all for your file, and it would appear to work only because nothing was displayed. Once testing is done, I will increase it to match fwrite; that is, it starts at 2 seconds if the ETA is 2 seconds from there.
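The arithmetic above (an update every 2% of a read that takes about 3 seconds is one update every 0.06 seconds) and the idea of an append-only bar can be sketched in C as follows. This is only an illustration under stated assumptions: the function names are hypothetical, and the 50-character width is taken from the bar that appears in the logs in this thread, not from data.table's source.

```c
#include <assert.h>
#include <stdio.h>

/* Seconds between progress updates when one update is emitted every
   step_percent of a read that takes total_seconds overall. */
double update_interval_seconds(double total_seconds, double step_percent)
{
    assert(total_seconds >= 0.0 && step_percent > 0.0);
    return total_seconds * step_percent / 100.0;
}

/* Number of '=' characters still to append so that a bar of `width`
   characters reflects fraction_done, given `printed` already emitted. */
int bar_ticks_due(double fraction_done, int width, int printed)
{
    assert(width > 0 && printed >= 0 && printed <= width);
    if (fraction_done < 0.0) fraction_done = 0.0;
    if (fraction_done > 1.0) fraction_done = 1.0;
    int target = (int)(fraction_done * width);
    return target > printed ? target - printed : 0;
}

/* Append-only progress bar: characters are only ever added, never
   rewritten with '\r'. Returns the new count of printed characters. */
int advance_bar(FILE *out, double fraction_done, int width, int printed)
{
    int due = bar_ticks_due(fraction_done, width, printed);
    for (int i = 0; i < due; i++)
        fputc('=', out);
    fflush(out);
    return printed + due;
}
```

Calling update_interval_seconds(3.0, 2.0) gives roughly 0.06, matching the figure quoted above. Because advance_bar only appends, a log file that captures the output holds a single bar rather than one rewritten line per update.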
Hello @mattdowle. My last comment on #2503 may also be related to this issue.
Looks good! No warnings (after 5 runs). The first run is below (leading whitespace looks different in the actual output).
stopifnot(!requireNamespace("data.table", quietly = TRUE))
install.packages("https://ci.appveyor.com/api/buildjobs/1o9s06o31v8i3ljr/artifacts/data.table_1.10.5.zip",
repos = NULL)
# Installing package into ‘C:/Users/hughp/Documents/R/win-library/3.4’
# (as ‘lib’ is unspecified)
# trying URL 'https://ci.appveyor.com/api/buildjobs/1o9s06o31v8i3ljr/artifacts/data.table_1.10.5.zip'
# Content type 'application/octet-stream' length 1557423 bytes (1.5 MB)
# downloaded 1.5 MB
#
# package ‘data.table’ successfully unpacked and MD5 sums checked
library(data.table)
# data.table 1.10.5 IN DEVELOPMENT built 2017-11-30 00:21:00 UTC; appveyor
# The fastest way to learn (by data.table authors): https://www.datacamp.com/courses/data-analysis-the-data-table-way
# Documentation: ?data.table, example(data.table) and browseVignettes("data.table")
# Release notes, videos and slides: http://r-datatable.com
setwd("~/ABS-data/inbox/SA2-by-DJZ-2011/")
fread("SA2-by-DJZ-2011.csv", verbose = TRUE, na.strings = "", header = FALSE)
# Input contains no \n. Taking this to be a filename to open
# [01] Check arguments
# Using 12 threads (omp_get_max_threads()=12, nth=12)
# NAstrings = [<<>>]
# None of the NAstrings look like numbers.
# show progress = 1
# 0/1 column will be read as boolean
# [02] Opening the file
# Opening file SA2-by-DJZ-2011.csv
# File opened, size = 349.4MB (366418725 bytes).
# Memory mapped ok
# [03] Detect and skip BOM
# [04] Arrange mmap to be \0 terminated
# \n has been found in the data so any mixture of line endings is allowed other than \r-only line endings. This is common and ideal.
# [05] Skipping initial rows if needed
# Positioned on line 1 starting: <<Australian Bureau of Statistic>>
# [06] Detect separator, quoting rule, and ncolumns
# Detecting sep ...
# sep=',' with 89 lines of 4 fields using quote rule 0
# Detected 4 columns on line 12. This line is either column names or first data row. Line starts as: <<"Goulburn","110018063",3499,>>
# Quote rule picked = 0
# fill=false and the most number of columns found is 4
# [07] Detect column types, good nrow estimate and whether first row is column names
# 'header' changed by user from 'auto' to false
# Number of sampling jump points = 101 because (366418375 bytes from row 1 to eof) / (2 * 1457 jump0size) == 125744
# Type codes (jump 000) : 1551 Quote rule 0
# Type codes (jump 100) : 1A51 Quote rule 0
# =====
# Sampled 10027 rows (handled \n inside quoted fields) at 101 jump points
# Bytes from first data row on line 12 to the end of last row: 366418143
# Line length: mean=16.02 sd=0.21 min=16 max=29
# Estimated number of rows: 366418143 / 16.02 = 22877178
# Initial alloc = 25164895 rows (22877178 + 9%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
# =====
# [08] Assign column names
# [09] Apply user overrides on column types
# After 0 type and 0 drop user overrides : 1A51
# [10] Allocate memory for the datatable
# Allocating 4 column slots (4 - 0 dropped) with 25164895 rows
# [11] Read the data
# jumps=[0..360), chunk_size=1017828, total_size=366418143
# jumps=[0..360), chunk_size=1017828, total_size=366418143
# |--------------------------------------------------|
# |==================================================|
# Read 22885380 rows x 4 columns from 349.4MB (366418725 bytes) file in 00:02.280 wall clock time
# [12] Finalizing the datatable
# Type counts:
# 1 : bool8 '1'
# 1 : int32 '5'
# 2 : string 'A'
# =============================
# 0.005s ( 0%) Memory map 0.341GB file
# 0.037s ( 2%) sep=',' ncol=4 and header detection
# 0.000s ( 0%) Column type detection using 10027 sample rows
# 0.321s ( 14%) Allocation of 25164895 rows x 4 cols (0.469GB) of which 22885380 ( 91%) rows used
# 1.917s ( 84%) Reading 360 chunks of 0.971MB (63547 rows) using 12 threads
# = 0.011s ( 0%) Finding first non-embedded \n after each jump
# + 0.560s ( 25%) Parse to row-major thread buffers (grown 0 times)
# + 0.488s ( 21%) Transpose
# + 0.858s ( 38%) Waiting
# 0.999s ( 44%) Rereading 1 columns due to out-of-sample type exceptions
# 2.280s Total
# Column 1 ("") bumped from 'bool8' to 'string' due to <<"Goulburn">> on row 0
# V1 V2 V3 V4
# 1: Goulburn 110018063 3499 NA
# 2: NA 110018064 812 NA
# 3: NA 110018065 2158 NA
# 4: NA 110019999 402 NA
# 5: NA 110028068 10 NA
# ---
# 22885376: NA 997999799 0 NA
# 22885377: NA 998999899 64 NA
# 22885378: NA 994999499 34 NA
# 22885379: NA 0&&&&&&&& 250796 NA
# 22885380: NA 0@@@@@@@@ 7305367 NA
# Warning messages:
# 1: In fread("SA2-by-DJZ-2011.csv", verbose = TRUE, na.strings = "", :
# Starting data input on line 12 <<"Goulburn","110018063",3499,>> with 4 fields and discarding line 11 <<"Main Statistical Area Structu>> before it because it has a different number of fields (3).
# 2: In fread("SA2-by-DJZ-2011.csv", verbose = TRUE, na.strings = "", :
# Found the last consistent line but text exists afterwards. Consider fill=TRUE and/or blank.lines.skip=TRUE. First 200 characters of discarded line: <<"Dataset: 2011 Census of Population and Housing">>
@HughParsonage Relief! I think that is a win. I will tidy up, merge and move on. Thanks a lot for testing.
@aadler Yes, agreed, your comment on issue #2503 looks like the same thing. Could you test the latest dev version as well and confirm whether it is now fixed? I am hoping that the as.IDate problem you found there was actually caused by the earlier stack imbalance.
Not good :(
> library(data.table)
data.table 1.10.5 IN DEVELOPMENT built 2017-11-30 00:21:00 UTC; appveyor
The fastest way to learn (by data.table authors): https://www.datacamp.com/courses/data-analysis-the-data-table-way
Documentation: ?data.table, example(data.table) and browseVignettes("data.table")
Release notes, videos and slides: http://r-datatable.com
> sessionInfo()
R version 3.4.2 beta (2017-09-17 r73296)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows Server >= 2012 x64 (build 9200)
Matrix products: default
locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] data.table_1.10.5
loaded via a namespace (and not attached):
[1] compiler_3.4.2 tools_3.4.2
> DT <- fread('2017-11-22_1999_Performance.csv', header = TRUE, colClasses = CLS, select = SEL, key = c('LoanID', 'Month'), verbose = TRUE)
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
Using 40 threads (omp_get_max_threads()=40, nth=40)
NAstrings = [<<NA>>]
None of the NAstrings look like numbers.
show progress = 1
0/1 column will be read as boolean
[02] Opening the file
Opening file 2017-11-22_1999_Performance.csv
File opened, size = 6.355GB (6823372783 bytes).
Memory mapped ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
\n has been found in the data so any mixture of line endings is allowed other than \r-only line endings. This is common and ideal.
[05] Skipping initial rows if needed
Positioned on line 1 starting: <<Orig_Year,Orig_Qtr,LoanID,Mont>>
[06] Detect separator, quoting rule, and ncolumns
Detecting sep ...
sep=',' with 100 lines of 37 fields using quote rule 0
Detected 37 columns on line 1. This line is either column names or first data row. Line starts as: <<Orig_Year,Orig_Qtr,LoanID,Mont>>
Quote rule picked = 0
fill=false and the most number of columns found is 37
[07] Detect column types, good nrow estimate and whether first row is column names
'header' changed by user from 'auto' to true
Number of sampling jump points = 101 because (6823372781 bytes from row 1 to eof) / (2 * 12905 jump0size) == 264369
Type codes (jump 000) : 51AA7155A15A7111111111111771111177715 Quote rule 0
Type codes (jump 001) : 51AA7155A15A711111111111177111117771A Quote rule 0
Type codes (jump 008) : 51AA7555A15A711111111111177111117771A Quote rule 0
Type codes (jump 009) : 51AA7555AA5A71A155555557177111117775A Quote rule 0
Type codes (jump 042) : 55AA7555AA5A71A155555557177111117775A Quote rule 0
Type codes (jump 064) : 55AA7555AA5A71A1A5555557177111117775A Quote rule 0
Type codes (jump 100) : 55AA7555AA5A71A1A5555557177111117775A Quote rule 0
=====
Sampled 10049 rows (handled \n inside quoted fields) at 101 jump points
Bytes from first data row on line 1 to the end of last row: 6823372781
Line length: mean=126.15 sd=8.30 min=100 max=359
Estimated number of rows: 6823372781 / 126.15 = 54088821
Initial alloc = 62279495 rows (54088821 + 15%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
=====
[08] Assign column names
[09] Apply user overrides on column types
After 4 type and 23 drop user overrides : 00AA700000000000000000070775555077750
[10] Allocate memory for the datatable
Allocating 14 column slots (37 - 23 dropped) with 62279495 rows
[11] Read the data
jumps=[0..6520), chunk_size=1046529, total_size=6823372422
|--------------------------------------------------|
|=======Warning: stack imbalance in '$', 27 then 26
===Warning: stack imbalance in '$', 26 then 27
================Error in fread("2017-11-22_1999_Performance.csv", header = TRUE, colClasses = CLS, :
unprotect_ptr: pointer not found
@aadler I have localized the protection throughout freadR. In your case you are overriding types, and there are quite a few protects in that part of the code, so there is a 30% chance this works. Please try again using this build.
@aadler If you have not tried the last build yet, please skip straight to this one. Also, if I could get a copy of the file, I could try it myself in RStudio on Windows.
:(
> library(data.table)
data.table 1.10.5 IN DEVELOPMENT built 2017-12-01 01:54:04 UTC; appveyor
The fastest way to learn (by data.table authors): https://www.datacamp.com/courses/data-analysis-the-data-table-way
Documentation: ?data.table, example(data.table) and browseVignettes("data.table")
Release notes, videos and slides: http://r-datatable.com
> sessionInfo()
R version 3.4.2 beta (2017-09-17 r73296)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows Server >= 2012 x64 (build 9200)
Matrix products: default
locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] data.table_1.10.5
loaded via a namespace (and not attached):
[1] compiler_3.4.2 tools_3.4.2
> ColCLASS <- c(rep('integer', 2L), 'character', 'Date', 'numeric',
+ rep('integer', 3L), rep('character', 2L),
+ 'integer', 'Date', rep('numeric', 2L), 'Date',
+ rep('numeric', 12L), rep('integer', 5),
+ rep('numeric', 3L), 'integer', 'character')
> SELCOL <- c(WHATEVER)
> DT <- fread('LargeFile.csv', header = TRUE, colClasses = ColCLASS, select = SELCOL, key = c('LoanID', 'Month'), verbose = TRUE)
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
Using 40 threads (omp_get_max_threads()=40, nth=40)
NAstrings = [<<NA>>]
None of the NAstrings look like numbers.
show progress = 1
0/1 column will be read as boolean
[02] Opening the file
Opening file LargeFile.csv
File opened, size = 6.355GB (6823372783 bytes).
Memory mapped ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
\n has been found in the data so any mixture of line endings is allowed other than \r-only line endings. This is common and ideal.
[05] Skipping initial rows if needed
Positioned on line 1 starting: <<Orig_Year,Orig_Qtr,LoanID,Mont>>
[06] Detect separator, quoting rule, and ncolumns
Detecting sep ...
sep=',' with 100 lines of 37 fields using quote rule 0
Detected 37 columns on line 1. This line is either column names or first data row. Line starts as: <<Orig_Year,Orig_Qtr,LoanID,Mont>>
Quote rule picked = 0
fill=false and the most number of columns found is 37
[07] Detect column types, good nrow estimate and whether first row is column names
'header' changed by user from 'auto' to true
Number of sampling jump points = 101 because (6823372781 bytes from row 1 to eof) / (2 * 12905 jump0size) == 264369
Type codes (jump 000) : 51AA7155A15A7111111111111771111177715 Quote rule 0
Type codes (jump 001) : 51AA7155A15A711111111111177111117771A Quote rule 0
Type codes (jump 008) : 51AA7555A15A711111111111177111117771A Quote rule 0
Type codes (jump 009) : 51AA7555AA5A71A155555557177111117775A Quote rule 0
Type codes (jump 042) : 55AA7555AA5A71A155555557177111117775A Quote rule 0
Type codes (jump 064) : 55AA7555AA5A71A1A5555557177111117775A Quote rule 0
Type codes (jump 100) : 55AA7555AA5A71A1A5555557177111117775A Quote rule 0
=====
Sampled 10049 rows (handled \n inside quoted fields) at 101 jump points
Bytes from first data row on line 1 to the end of last row: 6823372781
Line length: mean=126.15 sd=8.30 min=100 max=359
Estimated number of rows: 6823372781 / 126.15 = 54088821
Initial alloc = 62279495 rows (54088821 + 15%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
=====
[08] Assign column names
[09] Apply user overrides on column types
After 4 type and 23 drop user overrides : 00AA700000000000000000070775555077750
[10] Allocate memory for the datatable
Allocating 14 column slots (37 - 23 dropped) with 62279495 rows
[11] Read the data
jumps=[0..6520), chunk_size=1046529, total_size=6823372422
|--------------------------------------------------|
|Error in fread("LargeFile.csv", header = TRUE, colClasses = ColCLASS, :
unprotect_ptr: pointer not found
Thanks to @aadler by email, I can now reproduce it: R 3.4.2, latest RStudio 1.1.383 and Windows 10 Pro 10.0.16299 build 16299.
I see strange behaviour in RStudio, recorded here:
https://www.youtube.com/watch?v=tl2x2vmZxMU
RStudio seems to generate GCs just from typing. Why is that, and is there a way to turn it off? When fread() prints the progress bar, perhaps RStudio's separate event loop thinks the output to the console is the user typing, and calls into R, which triggers a GC and sets everything off. Perhaps an RStudio user here can point me in the right direction, or @kevinushey, who said he would be back, and today is the first.
I can reliably reproduce the stack imbalance in the RStudio console. Using the RStudio Terminal tab, I cannot reproduce it at all, even with gcinfo(TRUE). Interestingly, GCs do occur while the progress bar is being printed, and that appears to be fine, as it is on Linux too. Given the behaviour in the RStudio console video, I have come to the conclusion that this is an RStudio console bug. I could not copy text from the RStudio terminal window (Edit->Copy does not work, nor does Ctrl-C), so I took screenshots of the Terminal tab to show that GCs during the progress bar are fine. I would expect that to be fine, because only the master thread calls REprintf and the other threads never call the R API at all.
It works fine in the RStudio terminal.
There are GCs while the progress bar prints for the first time, and it works correctly in the RStudio terminal. The progress bar is printed a second time because this test file has out-of-sample type exceptions, which trigger an automatic reread of just those columns.
But in the RStudio Console there is either stack imbalance or unprotect_ptr: pointer not found.
R version 3.4.2 (2017-09-28) -- "Short Summer"
> gcinfo(TRUE)
[1] FALSE
Garbage collection 22 = 16+3+3 (level 0) ...
25.5 Mbytes of cons cells used (80%)
6.6 Mbytes of vectors used (66%)
Garbage collection 23 = 16+4+3 (level 1) ...
24.9 Mbytes of cons cells used (79%)
6.5 Mbytes of vectors used (64%)
Garbage collection 24 = 17+4+3 (level 0) ...
25.1 Mbytes of cons cells used (79%)
6.5 Mbytes of vectors used (64%)
Garbage collection 25 = 18+4+3 (level 0) ...
25.1 Mbytes of cons cells used (79%)
6.5 Mbytes of vectors used (64%)
Garbage collection 26 = 19+4+3 (level 0) ...
25.1 Mbytes of cons cells used (79%)
6.5 Mbytes of vectors used (65%)
Garbage collection 27 = 20+4+3 (level 0) ...
25.2 Mbytes of cons cells used (80%)
6.5 Mbytes of vectors used (65%)
Garbage collection 28 = 20+5+3 (level 1) ...
25.1 Mbytes of cons cells used (79%)
6.5 Mbytes of vectors used (65%)
Garbage collection 29 = 21+5+3 (level 0) ...
25.1 Mbytes of cons cells used (79%)
6.5 Mbytes of vectors used (65%)
Garbage collection 30 = 22+5+3 (level 0) ...
25.2 Mbytes of cons cells used (80%)
6.5 Mbytes of vectors used (65%)
Garbage collection 31 = 23+5+3 (level 0) ...
25.2 Mbytes of cons cells used (80%)
6.6 Mbytes of vectors used (66%)
Garbage collection 32 = 24+5+3 (level 0) ...
25.3 Mbytes of cons cells used (80%)
6.6 Mbytes of vectors used (66%)
Garbage collection 33 = 25+5+3 (level 0) ...
25.4 Mbytes of cons cells used (80%)
6.7 Mbytes of vectors used (66%)
Garbage collection 34 = 25+5+4 (level 2) ...
24.6 Mbytes of cons cells used (61%)
6.4 Mbytes of vectors used (50%)
Garbage collection 35 = 26+5+4 (level 0) ...
25.0 Mbytes of cons cells used (62%)
6.5 Mbytes of vectors used (52%)
> require(data.table)
Loading required package: data.table
Garbage collection 36 = 27+5+4 (level 0) ...
27.2 Mbytes of cons cells used (68%)
7.1 Mbytes of vectors used (56%)
data.table 1.10.5 IN DEVELOPMENT built 2017-12-02 01:04:34 UTC; appveyor
The fastest way to learn (by data.table authors): https://www.datacamp.com/courses/data-analysis-the-data-table-way
Documentation: ?data.table, example(data.table) and browseVignettes("data.table")
Release notes, videos and slides: http://r-datatable.com
Garbage collection 37 = 28+5+4 (level 0) ...
27.7 Mbytes of cons cells used (69%)
7.3 Mbytes of vectors used (58%)
Garbage collection 38 = 29+5+4 (level 0) ...
28.0 Mbytes of cons cells used (70%)
7.4 Mbytes of vectors used (58%)
Garbage collection 39 = 30+5+4 (level 0) ...
28.1 Mbytes of cons cells used (70%)
7.4 Mbytes of vectors used (59%)
Garbage collection 40 = 31+5+4 (level 0) ...
28.2 Mbytes of cons cells used (70%)
7.5 Mbytes of vectors used (59%)
Garbage collection 41 = 32+5+4 (level 0) ...
28.4 Mbytes of cons cells used (71%)
7.5 Mbytes of vectors used (59%)
> DT = fread("/Users/pasha/Downloads/LargeFile.csv")
Garbage collection 42 = 32+5+5 (level 2) ...
27.4 Mbytes of cons cells used (54%)
7.1 Mbytes of vectors used (2%)
Garbage collection 43 = 32+5+6 (level 2) ...
27.4 Mbytes of cons cells used (54%)
244.7 Mbytes of vectors used (42%)
Garbage collection 44 = 32+5+7 (level 2) ...
27.4 Mbytes of cons cells used (54%)
482.3 Mbytes of vectors used (42%)
Garbage collection 45 = 32+5+8 (level 2) ...
27.4 Mbytes of cons cells used (54%)
957.4 Mbytes of vectors used (56%)
Garbage collection 46 = 32+5+9 (level 2) ...
27.4 Mbytes of cons cells used (54%)
1432.6 Mbytes of vectors used (63%)
Garbage collection 47 = 32+5+10 (level 2) ...
27.4 Mbytes of cons cells used (54%)
2145.3 Mbytes of vectors used (75%)
Garbage collection 48 = 32+5+11 (level 2) ...
27.4 Mbytes of cons cells used (54%)
2620.4 Mbytes of vectors used (71%)
Garbage collection 49 = 32+5+12 (level 2) ...
27.4 Mbytes of cons cells used (54%)
3570.8 Mbytes of vectors used (78%)
Garbage collection 50 = 32+5+13 (level 2) ...
27.4 Mbytes of cons cells used (54%)
4283.5 Mbytes of vectors used (75%)
Garbage collection 51 = 32+5+14 (level 2) ...
27.4 Mbytes of cons cells used (54%)
5709.0 Mbytes of vectors used (77%)
Garbage collection 52 = 32+5+15 (level 2) ...
27.4 Mbytes of cons cells used (54%)
7372.0 Mbytes of vectors used (81%)
Garbage collection 53 = 32+5+16 (level 2) ...
27.4 Mbytes of cons cells used (54%)
8797.5 Mbytes of vectors used (79%)
Garbage collection 54 = 32+5+17 (level 2) ...
27.4 Mbytes of cons cells used (54%)
10935.7 Mbytes of vectors used (80%)
|--------------------------------------------------|
|=====Error in fread("LargeFile.csv") :
unprotect_ptr: pointer not found
>
Setting showProgress=FALSE reliably works around this in the RStudio console. To reproduce the crash, the first run in a fresh RStudio console must use showProgress=TRUE (the default). It appears to depend on whether a GC happens while the progress meter is being displayed. It has to be the first run in a new session, and the file only needs to be large enough for the progress meter to appear. It has nothing to do with the reread or with the arguments passed to fread. If the first run in a fresh RStudio console uses showProgress=FALSE, that run works and grows R's heap, and later runs in the same session with showProgress=TRUE then work too. But that is only because the first run already grew the heap, so no GC happens during the progress meter.
Why a GC on the master thread during the progress meter is fine on Linux and in the Windows RStudio terminal, but not in the RStudio console, is the next question.
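The workaround described above can be sketched as follows. This is only a minimal illustration: the temporary file and its two columns are hypothetical stand-ins for the large CSV in this thread, and a file this small would not show the progress meter (or trigger the crash) in the first place. The only point is where the showProgress argument goes.

```r
library(data.table)

# Hypothetical stand-in for the large CSV discussed in this thread.
tmp <- tempfile(fileext = ".csv")
fwrite(data.table(id = 1:1000, label = sprintf("row%04d", 1:1000)), tmp)

# Workaround: switch the progress meter off for this call.
DT <- fread(tmp, showProgress = FALSE)

print(dim(DT))
unlink(tmp)
```

On a data.table build that includes the fix reported later in this thread, the argument should no longer be needed for this purpose.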
OK, this fixes it. The problem was on the data.table side, not in RStudio. It now works reliably in the RStudio console on Windows. The same problem could have occurred on Linux and Mac too; the memory pattern there just did not trigger it. There was an entry point into R, reached on another thread when pushing the string column buffers, that could run at the same time as the master thread printing progress with REprintf. That is why it only happened on the first run in a new session. From the second run onward, every string in the file had already been seen, so the cache lookup hit (thread-safe) and did not allocate (not thread-safe).
So, @aadler and @HughParsonage: no warnings on this run. I'm not sure whether you are looking for anything else.
> gcinfo(TRUE)
[1] FALSE
> fread("SA2-by-DJZ-2011.csv", verbose = TRUE, na.strings = "", header = FALSE)
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
Using 12 threads (omp_get_max_threads()=12, nth=12)
NAstrings = [<<>>]
None of the NAstrings look like numbers.
show progress = 1
0/1 column will be read as boolean
[02] Opening the file
Opening file SA2-by-DJZ-2011.csv
File opened, size = 349.4MB (366418725 bytes).
Memory mapped ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
\n has been found in the data so any mixture of line endings is allowed other than \r-only line endings. This is common and ideal.
[05] Skipping initial rows if needed
Positioned on line 1 starting: <<Australian Bureau of Statistic>>
[06] Detect separator, quoting rule, and ncolumns
Detecting sep ...
sep=',' with 89 lines of 4 fields using quote rule 0
Detected 4 columns on line 12. This line is either column names or first data row. Line starts as: <<"Goulburn","110018063",3499,>>
Quote rule picked = 0
fill=false and the most number of columns found is 4
[07] Detect column types, good nrow estimate and whether first row is column names
'header' changed by user from 'auto' to false
Number of sampling jump points = 101 because (366418375 bytes from row 1 to eof) / (2 * 1457 jump0size) == 125744
Type codes (jump 000) : 1551 Quote rule 0
Type codes (jump 100) : 1A51 Quote rule 0
=====
Sampled 10027 rows (handled \n inside quoted fields) at 101 jump points
Bytes from first data row on line 12 to the end of last row: 366418143
Line length: mean=16.02 sd=0.21 min=16 max=29
Estimated number of rows: 366418143 / 16.02 = 22877178
Initial alloc = 25164895 rows (22877178 + 9%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
=====
[08] Assign column names
[09] Apply user overrides on column types
After 0 type and 0 drop user overrides : 1A51
[10] Allocate memory for the datatable
Allocating 4 column slots (4 - 0 dropped) with 25164895 rows
Garbage collection 53 = 36+5+12 (level 2) ...
30.3 Mbytes of cons cells used (60%)
7.9 Mbytes of vectors used (1%)
[11] Read the data
jumps=[0..360), chunk_size=1017828, total_size=366418143
|--------------------------------------------------|
|==================================================|
Garbage collection 54 = 37+5+12 (level 0) ...
30.8 Mbytes of cons cells used (61%)
566.6 Mbytes of vectors used (74%)
Garbage collection 55 = 37+6+12 (level 1) ...
30.8 Mbytes of cons cells used (61%)
549.2 Mbytes of vectors used (72%)
jumps=[0..360), chunk_size=1017828, total_size=366418143
|--------------------------------------------------|
|==================================================|
Read 22885380 rows x 4 columns from 349.4MB (366418725 bytes) file in 00:02.626 wall clock time
[12] Finalizing the datatable
Type counts:
1 : bool8 '1'
1 : int32 '5'
2 : string 'A'
=============================
0.002s ( 0%) Memory map 0.341GB file
0.005s ( 0%) sep=',' ncol=4 and header detection
0.000s ( 0%) Column type detection using 10027 sample rows
0.469s ( 18%) Allocation of 25164895 rows x 4 cols (0.469GB) of which 22885380 ( 91%) rows used
2.150s ( 82%) Reading 360 chunks of 0.971MB (63547 rows) using 12 threads
= 0.103s ( 4%) Finding first non-embedded \n after each jump
+ 0.230s ( 9%) Parse to row-major thread buffers (grown 0 times)
+ 0.718s ( 27%) Transpose
+ 1.099s ( 42%) Waiting
0.745s ( 28%) Rereading 1 columns due to out-of-sample type exceptions
2.626s Total
Column 1 ("") bumped from 'bool8' to 'string' due to <<"Goulburn">> on row 0
Garbage collection 56 = 37+6+13 (level 2) ...
31.1 Mbytes of cons cells used (62%)
531.9 Mbytes of vectors used (70%)
Garbage collection 57 = 38+6+13 (level 0) ...
31.1 Mbytes of cons cells used (62%)
532.0 Mbytes of vectors used (70%)
V1 V2 V3 V4
1: Goulburn 110018063 3499 NA
2: NA 110018064 812 NA
3: NA 110018065 2158 NA
4: NA 110019999 402 NA
5: NA 110028068 10 NA
---
22885376: NA 997999799 0 NA
22885377: NA 998999899 64 NA
22885378: NA 994999499 34 NA
22885379: NA 0&&&&&&&& 250796 NA
22885380: NA 0@@@@@@@@ 7305367 NA
Warning messages:
1: In fread("SA2-by-DJZ-2011.csv", verbose = TRUE, na.strings = "", :
Starting data input on line 12 <<"Goulburn","110018063",3499,>> with 4 fields and discarding line 11 <<"Main Statistical Area Structu>> before it because it has a different number of fields (3).
2: In fread("SA2-by-DJZ-2011.csv", verbose = TRUE, na.strings = "", :
Found the last consistent line but text exists afterwards. Consider fill=TRUE and/or blank.lines.skip=TRUE. First 200 characters of discarded line: <<"Dataset: 2011 Census of Population and Housing">>
Thanks, Hugh. Yes, that is a clean run, assuming it was a fresh RStudio console session. There is no stack imbalance or "unprotect_ptr: pointer not found" message, and the progress meter displays correctly (twice in this case, because of the reread). Now over to @aadler to confirm.
Success.
First, starting a fresh instance of RStudio:
> library(data.table)
data.table 1.10.5 IN DEVELOPMENT built 2017-12-02 12:05:42 UTC; appveyor
The fastest way to learn (by data.table authors): https://www.datacamp.com/courses/data-analysis-the-data-table-way
Documentation: ?data.table, example(data.table) and browseVignettes("data.table")
Release notes, videos and slides: http://r-datatable.com
> sessionInfo()
R version 3.4.2 beta (2017-09-17 r73296)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows Server >= 2012 x64 (build 9200)
Matrix products: default
locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] data.table_1.10.5
loaded via a namespace (and not attached):
[1] compiler_3.4.2 tools_3.4.2
> DT <- fread('LargeFile.csv', colClasses = colCLASS, select = colSEL, header = TRUE, key = c('LoanID', 'Month'), verbose = TRUE)
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
Using 40 threads (omp_get_max_threads()=40, nth=40)
NAstrings = [<<NA>>]
None of the NAstrings look like numbers.
show progress = 1
0/1 column will be read as boolean
[02] Opening the file
Opening file LargeFile.csv
File opened, size = 6.355GB (6823372783 bytes).
Memory mapped ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
\n has been found in the data so any mixture of line endings is allowed other than \r-only line endings. This is common and ideal.
[05] Skipping initial rows if needed
Positioned on line 1 starting: <<Orig_Year,Orig_Qtr,LoanID,Mont>>
[06] Detect separator, quoting rule, and ncolumns
Detecting sep ...
sep=',' with 100 lines of 37 fields using quote rule 0
Detected 37 columns on line 1. This line is either column names or first data row. Line starts as: <<Orig_Year,Orig_Qtr,LoanID,Mont>>
Quote rule picked = 0
fill=false and the most number of columns found is 37
[07] Detect column types, good nrow estimate and whether first row is column names
'header' changed by user from 'auto' to true
Number of sampling jump points = 101 because (6823372781 bytes from row 1 to eof) / (2 * 12905 jump0size) == 264369
Type codes (jump 000) : 51AA7155A15A7111111111111771111177715 Quote rule 0
Type codes (jump 001) : 51AA7155A15A711111111111177111117771A Quote rule 0
Type codes (jump 008) : 51AA7555A15A711111111111177111117771A Quote rule 0
Type codes (jump 009) : 51AA7555AA5A71A155555557177111117775A Quote rule 0
Type codes (jump 042) : 55AA7555AA5A71A155555557177111117775A Quote rule 0
Type codes (jump 064) : 55AA7555AA5A71A1A5555557177111117775A Quote rule 0
Type codes (jump 100) : 55AA7555AA5A71A1A5555557177111117775A Quote rule 0
=====
Sampled 10049 rows (handled \n inside quoted fields) at 101 jump points
Bytes from first data row on line 1 to the end of last row: 6823372781
Line length: mean=126.15 sd=8.30 min=100 max=359
Estimated number of rows: 6823372781 / 126.15 = 54088821
Initial alloc = 62279495 rows (54088821 + 15%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
=====
[08] Assign column names
[09] Apply user overrides on column types
After 4 type and 23 drop user overrides : 00AA700000000000000000070775555077750
[10] Allocate memory for the datatable
Allocating 14 column slots (37 - 23 dropped) with 62279495 rows
[11] Read the data
jumps=[0..6520), chunk_size=1046529, total_size=6823372422
|--------------------------------------------------|
|==================================================|
Read 53945186 rows x 14 columns from 6.355GB (6823372783 bytes) file in 00:25.938 wall clock time
[12] Finalizing the datatable
Type counts:
23 : drop '0'
5 : int32 '5'
7 : float64 '7'
2 : string 'A'
=============================
0.005s ( 0%) Memory map 6.355GB file
0.025s ( 0%) sep=',' ncol=37 and header detection
0.001s ( 0%) Column type detection using 10049 sample rows
4.681s ( 18%) Allocation of 62279495 rows x 37 cols (5.336GB) of which 53945186 ( 87%) rows used
21.226s ( 82%) Reading 6520 chunks of 0.998MB (8295 rows) using 40 threads
= 0.485s ( 2%) Finding first non-embedded \n after each jump
+ 1.465s ( 6%) Parse to row-major thread buffers (grown 0 times)
+ 9.095s ( 35%) Transpose
+ 10.181s ( 39%) Waiting
0.000s ( 0%) Rereading 0 columns due to out-of-sample type exceptions
25.938s Total
I closed and reopened RStudio to make sure no string caching was in effect, then reran with gcinfo(TRUE). As an added bonus, the conversion to IDate completed (although it took more than 40 seconds :)).
> colCLASS <- c(rep('integer', 2L), 'character', 'Date', 'numeric',
+ rep('integer', 3L), rep('character', 2L),
+ 'integer', 'Date', rep('numeric', 2L), 'Date',
+ rep('numeric', 12L), rep('integer', 5),
+ rep('numeric', 3L), 'integer', 'character')
> library(data.table)
data.table 1.10.5 IN DEVELOPMENT built 2017-12-02 12:05:42 UTC; appveyor
The fastest way to learn (by data.table authors): https://www.datacamp.com/courses/data-analysis-the-data-table-way
Documentation: ?data.table, example(data.table) and browseVignettes("data.table")
Release notes, videos and slides: http://r-datatable.com
> gcinfo(TRUE)
[1] FALSE
> sessionInfo()
R version 3.4.2 beta (2017-09-17 r73296)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows Server >= 2012 x64 (build 9200)
Matrix products: default
locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] data.table_1.10.5
loaded via a namespace (and not attached):
[1] compiler_3.4.2 tools_3.4.2
Garbage collection 46 = 36+5+5 (level 0) ...
38.6 Mbytes of cons cells used (77%)
11.2 Mbytes of vectors used (71%)
Garbage collection 47 = 37+5+5 (level 0) ...
38.7 Mbytes of cons cells used (77%)
11.2 Mbytes of vectors used (71%)
Garbage collection 48 = 38+5+5 (level 0) ...
38.8 Mbytes of cons cells used (77%)
11.2 Mbytes of vectors used (71%)
Garbage collection 49 = 39+5+5 (level 0) ...
39.0 Mbytes of cons cells used (78%)
11.2 Mbytes of vectors used (71%)
Garbage collection 50 = 40+5+5 (level 0) ...
39.1 Mbytes of cons cells used (78%)
11.3 Mbytes of vectors used (71%)
Garbage collection 51 = 40+6+5 (level 1) ...
38.8 Mbytes of cons cells used (77%)
11.3 Mbytes of vectors used (71%)
Garbage collection 52 = 41+6+5 (level 0) ...
38.9 Mbytes of cons cells used (77%)
11.3 Mbytes of vectors used (71%)
Garbage collection 53 = 42+6+5 (level 0) ...
41.5 Mbytes of cons cells used (83%)
12.2 Mbytes of vectors used (77%)
Garbage collection 54 = 42+7+5 (level 1) ...
43.4 Mbytes of cons cells used (86%)
12.8 Mbytes of vectors used (81%)
Garbage collection 55 = 42+7+6 (level 2) ...
44.7 Mbytes of cons cells used (72%)
13.0 Mbytes of vectors used (67%)
Garbage collection 56 = 43+7+6 (level 0) ...
46.5 Mbytes of cons cells used (74%)
13.6 Mbytes of vectors used (70%)
Garbage collection 57 = 44+7+6 (level 0) ...
47.0 Mbytes of cons cells used (75%)
13.8 Mbytes of vectors used (71%)
Garbage collection 58 = 45+7+6 (level 0) ...
47.4 Mbytes of cons cells used (76%)
13.9 Mbytes of vectors used (71%)
Garbage collection 59 = 46+7+6 (level 0) ...
47.7 Mbytes of cons cells used (76%)
14.2 Mbytes of vectors used (73%)
Garbage collection 60 = 47+7+6 (level 0) ...
48.0 Mbytes of cons cells used (77%)
14.2 Mbytes of vectors used (73%)
Garbage collection 61 = 48+7+6 (level 0) ...
48.1 Mbytes of cons cells used (77%)
14.3 Mbytes of vectors used (73%)
> DT <- fread('LargeFile.csv', header = TRUE, colClasses = colCLASS, select = colSEL, key = c('LoanID', 'Month'), verbose = TRUE)
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
Using 40 threads (omp_get_max_threads()=40, nth=40)
NAstrings = [<<NA>>]
None of the NAstrings look like numbers.
show progress = 1
0/1 column will be read as boolean
[02] Opening the file
Opening file LargeFile.csv
File opened, size = 6.355GB (6823372783 bytes).
Memory mapped ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
\n has been found in the data so any mixture of line endings is allowed other than \r-only line endings. This is common and ideal.
[05] Skipping initial rows if needed
Positioned on line 1 starting: <<Orig_Year,Orig_Qtr,LoanID,Mont>>
[06] Detect separator, quoting rule, and ncolumns
Detecting sep ...
sep=',' with 100 lines of 37 fields using quote rule 0
Detected 37 columns on line 1. This line is either column names or first data row. Line starts as: <<Orig_Year,Orig_Qtr,LoanID,Mont>>
Quote rule picked = 0
fill=false and the most number of columns found is 37
[07] Detect column types, good nrow estimate and whether first row is column names
'header' changed by user from 'auto' to true
Number of sampling jump points = 101 because (6823372781 bytes from row 1 to eof) / (2 * 12905 jump0size) == 264369
Type codes (jump 000) : 51AA7155A15A7111111111111771111177715 Quote rule 0
Type codes (jump 001) : 51AA7155A15A711111111111177111117771A Quote rule 0
Type codes (jump 008) : 51AA7555A15A711111111111177111117771A Quote rule 0
Type codes (jump 009) : 51AA7555AA5A71A155555557177111117775A Quote rule 0
Type codes (jump 042) : 55AA7555AA5A71A155555557177111117775A Quote rule 0
Type codes (jump 064) : 55AA7555AA5A71A1A5555557177111117775A Quote rule 0
Type codes (jump 100) : 55AA7555AA5A71A1A5555557177111117775A Quote rule 0
=====
Sampled 10049 rows (handled \n inside quoted fields) at 101 jump points
Bytes from first data row on line 1 to the end of last row: 6823372781
Line length: mean=126.15 sd=8.30 min=100 max=359
Estimated number of rows: 6823372781 / 126.15 = 54088821
Initial alloc = 62279495 rows (54088821 + 15%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
=====
[08] Assign column names
[09] Apply user overrides on column types
After 4 type and 23 drop user overrides : 00AA700000000000000000070775555077750
[10] Allocate memory for the datatable
Allocating 14 column slots (37 - 23 dropped) with 62279495 rows
Garbage collection 62 = 48+7+7 (level 2) ...
46.5 Mbytes of cons cells used (60%)
13.6 Mbytes of vectors used (2%)
Garbage collection 63 = 48+7+8 (level 2) ...
46.5 Mbytes of cons cells used (60%)
488.7 Mbytes of vectors used (42%)
Garbage collection 64 = 48+7+9 (level 2) ...
46.5 Mbytes of cons cells used (60%)
963.9 Mbytes of vectors used (56%)
Garbage collection 65 = 48+7+10 (level 2) ...
46.5 Mbytes of cons cells used (60%)
1439.1 Mbytes of vectors used (63%)
Garbage collection 66 = 48+7+11 (level 2) ...
46.5 Mbytes of cons cells used (60%)
1914.2 Mbytes of vectors used (67%)
Garbage collection 67 = 48+7+12 (level 2) ...
46.5 Mbytes of cons cells used (60%)
2864.5 Mbytes of vectors used (77%)
Garbage collection 68 = 48+7+13 (level 2) ...
46.5 Mbytes of cons cells used (60%)
3577.3 Mbytes of vectors used (78%)
Garbage collection 69 = 48+7+14 (level 2) ...
46.5 Mbytes of cons cells used (60%)
4290.0 Mbytes of vectors used (75%)
[11] Read the data
jumps=[0..6520), chunk_size=1046529, total_size=6823372422
|--------------------------------------------------|
|============================Garbage collection 70 = 49+7+14 (level 0) ...
76.5 Mbytes of cons cells used (99%)
5487.5 Mbytes of vectors used (96%)
=Garbage collection 71 = 49+8+14 (level 1) ...
77.0 Mbytes of cons cells used (100%)
5487.6 Mbytes of vectors used (96%)
Garbage collection 72 = 49+8+15 (level 2) ...
77.0 Mbytes of cons cells used (81%)
5487.1 Mbytes of vectors used (80%)
==============Garbage collection 73 = 50+8+15 (level 0) ...
94.3 Mbytes of cons cells used (100%)
5494.0 Mbytes of vectors used (80%)
Garbage collection 74 = 50+9+15 (level 1) ...
94.5 Mbytes of cons cells used (100%)
5494.1 Mbytes of vectors used (80%)
Garbage collection 75 = 50+9+16 (level 2) ...
94.5 Mbytes of cons cells used (82%)
5493.1 Mbytes of vectors used (67%)
=======|
Read 53945186 rows x 14 columns from 6.355GB (6823372783 bytes) file in 00:24.772 wall clock time
[12] Finalizing the datatable
Type counts:
23 : drop '0'
5 : int32 '5'
7 : float64 '7'
2 : string 'A'
=============================
0.005s ( 0%) Memory map 6.355GB file
0.018s ( 0%) sep=',' ncol=37 and header detection
0.000s ( 0%) Column type detection using 10049 sample rows
5.496s ( 22%) Allocation of 62279495 rows x 37 cols (5.336GB) of which 53945186 ( 87%) rows used
19.253s ( 78%) Reading 6520 chunks of 0.998MB (8295 rows) using 40 threads
= 0.433s ( 2%) Finding first non-embedded \n after each jump
+ 1.482s ( 6%) Parse to row-major thread buffers (grown 0 times)
+ 9.515s ( 38%) Transpose
+ 7.822s ( 32%) Waiting
0.000s ( 0%) Rereading 0 columns due to out-of-sample type exceptions
24.772s Total
Garbage collection 76 = 51+9+16 (level 0) ...
105.3 Mbytes of cons cells used (91%)
5500.3 Mbytes of vectors used (67%)
Garbage collection 77 = 51+10+16 (level 1) ...
105.4 Mbytes of cons cells used (91%)
5500.2 Mbytes of vectors used (67%)
> DT[, Month := as.IDate(Month, format = "%Y-%m-%d")]
Garbage collection 78 = 51+10+17 (level 2) ...
107.5 Mbytes of cons cells used (76%)
8174.1 Mbytes of vectors used (81%)
Garbage collection 79 = 51+11+17 (level 1) ...
107.5 Mbytes of cons cells used (76%)
5910.4 Mbytes of vectors used (59%)
> gcinfo(FALSE)
[1] TRUE
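For anyone retracing the session above, the gcinfo(TRUE) ... gcinfo(FALSE) bracketing can be sketched in base R as below. gcinfo() returns the previous setting, which is why the log shows [1] FALSE after switching reporting on and [1] TRUE after switching it off. The vector x is only an illustrative allocation and is not part of the original run.

```r
# gcinfo() returns the previous setting, so keep it to restore later.
old <- gcinfo(TRUE)

# Allocate something, then force a collection so a report is printed.
x <- numeric(1e6)
invisible(gc())

# Put the reporting flag back to whatever it was before.
gcinfo(old)
```

Restoring the saved value rather than hard-coding FALSE leaves the session as it was found, which matters if GC reporting was already enabled.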
Awesome! :tada: Everyone involved, especially @mattdowle, must have a lot less hair by now. :)
It looks like my strategy of 'stay on vacation until the problem is fixed' has paid off here. :-)
Is there anything else that needs checking, or can this issue be considered resolved?
Thanks, @aadler and @HughParsonage! Phew.
@kevinushey Haha. Yes, it was on the data.table side and it is now fixed (PR #2488). Thanks.