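When I run the command below with `verbose=FALSE`, R crashes ("stack imbalance"). Note that I was able to run the same code successfully on an older development version of data.table a month or two ago, so I think this is a fairly recent bug. (Sorry, I can't remember exactly which development version worked.)
The problem does not reproduce on smaller files. Link to the zip file (the csv is 350 MB): https :
I occasionally get different errors. For example,
Error in get(name, envir = ns, inherits = FALSE) : invalid first argument
or
Warning: stack imbalance in '$', 16 then 15
Error: R_Reprotect: only 1 protected item, can't reprotect index -2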
# Minimal reproducible example
library(data.table)
#> data.table 1.10.5 IN DEVELOPMENT built 2017-11-13 02:46:28 UTC; appveyor
#> The fastest way to learn (by data.table authors): https://www.datacamp.com/courses/data-analysis-the-data-table-way
#> Documentation: ?data.table, example(data.table) and browseVignettes("data.table")
#> Release notes, videos and slides: http://r-datatable.com
fread("SA2-by-DJZ-2011.csv", header = FALSE, na.strings = "", verbose = TRUE)
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
Using 12 threads (omp_get_max_threads()=12, nth=12)
NAstrings = [<<>>]
None of the NAstrings look like numbers.
show progress = 1
0/1 column will be read as boolean
[02] Opening the file
Opening file SA2-by-DJZ-2011.csv
File opened, size = 349.4MB (366418725 bytes).
Memory mapping ... ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
\r-only line endings are not allowed because \n is found in the data
[05] Skipping initial rows if needed
Positioned on line 1 starting: <<Australian Bureau of Statistic>>
[06] Detect separator, quoting rule, and ncolumns
Detecting sep ...
sep=',' with 89 lines of 4 fields using quote rule 0
Detected 4 columns on line 12. This line is either column names or first data row. Line starts as: <<"Goulburn","110018063",3499,>>
Quote rule picked = 0
fill=false and the most number of columns found is 4
[07] Detect column types, good nrow estimate and whether first row is column names
'header' changed by user from 'auto' to false
Number of sampling jump points = 101 because (366418375 bytes from row 1 to eof) / (2 * 1457 jump0size) == 125744
Type codes (jump 000) : 1551 Quote rule 0
Type codes (jump 100) : 11051 Quote rule 0
=====
Sampled 10027 rows (handled \n inside quoted fields) at 101 jump points
Bytes from first data row on line 12 to the end of last row: 366418143
Line length: mean=16.02 sd=0.21 min=16 max=29
Estimated number of rows: 366418143 / 16.02 = 22877178
Initial alloc = 25164895 rows (22877178 + 9%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
=====
[08] Assign column names
[09] Apply user overrides on column types
After 0 type and 0 drop user overrides : 11051
[10] Allocate memory for the datatable
Allocating 4 column slots (4 - 0 dropped) with 25164895 rows
[11] Read the data
jumps=[0..360), chunk_size=1017828, total_size=366418143
[12] Finalizing the datatable
Read 22885380 rows x 4 columns from 349.4MB (366418725 bytes) file in 00:02.550 wall clock time
Thread buffers were grown 0 times (if all 12 threads each grew once, this figure would be 12)
Final type counts
0 : drop
1 : bool8
0 : bool8
0 : bool8
0 : bool8
1 : int32
0 : int64
0 : float64
0 : float64
0 : float64
2 : string
Rereading 1 columns due to out-of-sample type exceptions.
Column 1 ("") bumped from 'bool8' to 'string' due to <<"Goulburn">> on row 0
[11] Read the data
jumps=[0..360), chunk_size=1017828, total_size=366418143
Read 94%. ETA 00:00 Warning: stack imbalance in '$', 27 then 28
Read 98%. ETA 00:00
[12] Finalizing the datatable
Reread 22885380 rows x 1 columns in 00:00.991
Read 22885380 rows. Exactly what was estimated and allocated up front
=============================
0.006s ( 0%) Memory map 0.341GB file
0.011s ( 0%) sep=',' ncol=4 and header detection
0.002s ( 0%) Column type detection using 10027 sample rows
0.328s ( 9%) Allocation of 22885380 rows x 4 cols (0.469GB)
3.194s ( 90%) Reading 360 chunks of 0.971MB (63547 rows) using 12 threads
= 0.001s ( 0%) Finding first non-embedded \n after each jump
+ 0.362s ( 10%) Parse to row-major thread buffers
+ 1.963s ( 55%) Transpose
+ 0.868s ( 25%) Waiting
0.991s ( 28%) Rereading 1 columns due to out-of-sample type exceptions
3.541s Total
Warning: stack imbalance in 'withVisible', 3 then 5
Warning messages:
1: In fread("SA2-by-DJZ-2011.csv", header = FALSE, na.strings = "", :
Starting data input on line 12 <<"Goulburn","110018063",3499,>> with 4 fields and discarding line 11 <<"Main Statistical Area Structu>> before it because it has a different number of fields (3).
2: In fread("SA2-by-DJZ-2011.csv", header = FALSE, na.strings = "", :
Found the last consistent line but text exists afterwards. Consider fill=TRUE and/or blank.lines.skip=TRUE. First 200 characters of discarded line: <<"Dataset: 2011 Census of Population and Housing">>
# Output of sessionInfo()
R version 3.4.2 (2017-09-28)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)
Matrix products: default
locale:
[1] LC_COLLATE=English_Australia.1252 LC_CTYPE=English_Australia.1252
[3] LC_MONETARY=English_Australia.1252 LC_NUMERIC=C
[5] LC_TIME=English_Australia.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] data.table_1.10.5 RevoUtils_10.0.6 RevoUtilsMath_10.0.1
loaded via a namespace (and not attached):
[1] compiler_3.4.2 tools_3.4.2 yaml_2.1.14
@HughParsonage, this looks similar to #2457. Maybe try passing `showProgress=FALSE` and see whether it completes.
@mattdowle Could there have been a regression since 2017-11-09?
Running with `showProgress=FALSE` did return a result (with only the expected warnings).
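A minimal sketch of the shape of that workaround call follows. The small temporary CSV is a hypothetical stand-in created only so the snippet is self-contained; the report used the 350 MB `SA2-by-DJZ-2011.csv`, and the problem did not reproduce on small files, so this is not expected to trigger the bug.

```r
library(data.table)

# Hypothetical stand-in file (assumption); the original report used a 350 MB CSV.
tmp <- tempfile(fileext = ".csv")
writeLines(c('"Goulburn","110018063",3499,1',
             '"Goulburn","110018064",3500,2'), tmp)

# showProgress = FALSE turns off the "Read xx%. ETA" meter while reading.
dt <- fread(tmp, header = FALSE, na.strings = "", showProgress = FALSE)
print(dim(dt))

unlink(tmp)
```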
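Thanks for all the details. I doubt there has been a regression since 2017-11-09, but perhaps the longer `verbose=TRUE` output has a similar effect to the ETA output. This file needs a reread, which means more output is produced. I worry that the `showProgress` result @HughParsonage reported is spurious, and that the problem would still occur if it were run 5-10 times with `verbose=TRUE`.
No verbose messages are printed inside the parallel region (the progress ETA was, and that has been fixed). However, there are verbose messages after the first read and before the second read (the reread) starts, which is what happens for this file. I suppose that if those prints trigger the 100th CheckUserInterrupt (see #2457), that could cause the second parallel region to fail (odd, though). To rule this out, I have just changed all verbose messages to use REprintf instead of Rprintf (the same fix as for the ETA in #2457). That change is currently failing the checks because the tests do not find the output on stderr; that can be fixed. Once it passes, the Windows .zip will be created automatically and you can retry. I'll update here when it is ready.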
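OK, the second attempt passed the checks and the Windows .zip is available. Could you try again please, @HughParsonage? I added a call to R_FlushConsole() after the verbose-mode messages, before the reread. The flush is only needed on Windows. My guess is that, without the flush, the console is sometimes still being updated while the parallel reread is under way, and that causes the problem. Please repeat 10 times, always with both `verbose=TRUE` and `showProgress=TRUE`. If you see 10 clean runs, we'll say that was it. Otherwise, I'll have to think again.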
Unfortunately, not fixed:
> library(data.table)
data.table 1.10.5 IN DEVELOPMENT built 2017-11-15 00:36:41 UTC; appveyor
The fastest way to learn (by data.table authors): https://www.datacamp.com/courses/data-analysis-the-data-table-way
Documentation: ?data.table, example(data.table) and browseVignettes("data.table")
Release notes, videos and slides: http://r-datatable.com
> fread("SA2-by-DJZ-2011.csv", header = FALSE, na.strings = "", verbose = FALSE)
Read 26%. ETA 00:00 Warning: stack imbalance in '$', 20 then 22
Read 52%. ETA 00:00 Warning: stack imbalance in '$', 36 then 35
Warning: stack imbalance in '$', 21 then 22
Read 59%. ETA 00:00 Error in fread("SA2-by-DJZ-2011.csv", header = FALSE, na.strings = "", :
unprotect_ptr: pointer not found
In addition: Warning: stack imbalance in '$', 26 then 28
Warning messages:
1: Warning: stack imbalance in '$', 26 then 27
In fread("SA2-by-DJZ-2011.csv", header = FALSE, na.strings = "", :
Starting data input on line 12 <<"Goulburn","110018063",3499,>> with 4 fields and discarding line 11 <<"Main Statistical Area Structu>> before it because it has a different number of fields (3).
2: In fread("SA2-by-DJZ-2011.csv", header = FALSE, na.strings = "", :
Found the last consistent line but text exists afterwards. Consider fill=TRUE and/or blank.lines.skip=TRUE. First 200 characters of discarded line: <<"Dataset: 2011 Census of Population and Housing">>
Warning: stack imbalance in 'lapply', 31 then 30
Warning: stack imbalance in '$', 14 then 15
With `verbose=TRUE, showProgress=TRUE` there is no error, even when run 10 times. Here is the output of the tenth run:
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
Using 12 threads (omp_get_max_threads()=12, nth=12)
NAstrings = [<<>>]
None of the NAstrings look like numbers.
show progress = 1
0/1 column will be read as boolean
[02] Opening the file
Opening file SA2-by-DJZ-2011.csv
File opened, size = 349.4MB (366418725 bytes).
Memory mapping ... ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
\r-only line endings are not allowed because \n is found in the data
[05] Skipping initial rows if needed
Positioned on line 1 starting: <<Australian Bureau of Statistic>>
[06] Detect separator, quoting rule, and ncolumns
Detecting sep ...
sep=',' with 89 lines of 4 fields using quote rule 0
Detected 4 columns on line 12. This line is either column names or first data row. Line starts as: <<"Goulburn","110018063",3499,>>
Quote rule picked = 0
fill=false and the most number of columns found is 4
[07] Detect column types, good nrow estimate and whether first row is column names
'header' changed by user from 'auto' to false
Number of sampling jump points = 101 because (366418375 bytes from row 1 to eof) / (2 * 1457 jump0size) == 125744
Type codes (jump 000) : 1551 Quote rule 0
Type codes (jump 100) : 11051 Quote rule 0
=====
Sampled 10027 rows (handled \n inside quoted fields) at 101 jump points
Bytes from first data row on line 12 to the end of last row: 366418143
Line length: mean=16.02 sd=0.21 min=16 max=29
Estimated number of rows: 366418143 / 16.02 = 22877178
Initial alloc = 25164895 rows (22877178 + 9%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
=====
[08] Assign column names
[09] Apply user overrides on column types
After 0 type and 0 drop user overrides : 11051
[10] Allocate memory for the datatable
Allocating 4 column slots (4 - 0 dropped) with 25164895 rows
[11] Read the data
jumps=[0..360), chunk_size=1017828, total_size=366418143
[12] Finalizing the datatable
Read 22885380 rows x 4 columns from 349.4MB (366418725 bytes) file in 00:02.094 wall clock time
Thread buffers were grown 0 times (if all 1 threads each grew once, this figure would be 1)
Final type counts
0 : drop
1 : bool8
0 : bool8
0 : bool8
0 : bool8
1 : int32
0 : int64
0 : float64
0 : float64
0 : float64
2 : string
Rereading 1 columns due to out-of-sample type exceptions.
Column 1 ("") bumped from 'bool8' to 'string' due to <<"Goulburn">> on row 0
[11] Read the data
jumps=[0..360), chunk_size=1017828, total_size=366418143
Read 99%. ETA 00:00
[12] Finalizing the datatable
Reread 22885380 rows x 1 columns in 00:01.752
Read 22885380 rows. Exactly what was estimated and allocated up front
=============================
0.004s ( 0%) Memory map 0.341GB file
0.008s ( 0%) sep=',' ncol=4 and header detection
0.001s ( 0%) Column type detection using 10027 sample rows
0.173s ( 4%) Allocation of 22885380 rows x 4 cols (0.469GB)
3.660s ( 95%) Reading 360 chunks of 0.971MB (63547 rows) using 1 threads
= 0.009s ( 0%) Finding first non-embedded \n after each jump
+ 1.946s ( 51%) Parse to row-major thread buffers
+ 1.098s ( 29%) Transpose
+ 0.608s ( 16%) Waiting
1.752s ( 46%) Rereading 1 columns due to out-of-sample type exceptions
3.846s Total
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
Using 12 threads (omp_get_max_threads()=12, nth=12)
NAstrings = [<<>>]
None of the NAstrings look like numbers.
show progress = 1
0/1 column will be read as boolean
[02] Opening the file
Opening file SA2-by-DJZ-2011.csv
File opened, size = 349.4MB (366418725 bytes).
Memory mapping ... ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
\r-only line endings are not allowed because \n is found in the data
[05] Skipping initial rows if needed
Positioned on line 1 starting: <<Australian Bureau of Statistic>>
[06] Detect separator, quoting rule, and ncolumns
Detecting sep ...
sep=',' with 89 lines of 4 fields using quote rule 0
Detected 4 columns on line 12. This line is either column names or first data row. Line starts as: <<"Goulburn","110018063",3499,>>
Quote rule picked = 0
fill=false and the most number of columns found is 4
[07] Detect column types, good nrow estimate and whether first row is column names
'header' changed by user from 'auto' to false
Number of sampling jump points = 101 because (366418375 bytes from row 1 to eof) / (2 * 1457 jump0size) == 125744
Type codes (jump 000) : 1551 Quote rule 0
Type codes (jump 100) : 11051 Quote rule 0
=====
Sampled 10027 rows (handled \n inside quoted fields) at 101 jump points
Bytes from first data row on line 12 to the end of last row: 366418143
Line length: mean=16.02 sd=0.21 min=16 max=29
Estimated number of rows: 366418143 / 16.02 = 22877178
Initial alloc = 25164895 rows (22877178 + 9%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
=====
[08] Assign column names
[09] Apply user overrides on column types
After 0 type and 0 drop user overrides : 11051
[10] Allocate memory for the datatable
Allocating 4 column slots (4 - 0 dropped) with 25164895 rows
[11] Read the data
jumps=[0..360), chunk_size=1017828, total_size=366418143
[12] Finalizing the datatable
Read 22885380 rows x 4 columns from 349.4MB (366418725 bytes) file in 00:02.589 wall clock time
Thread buffers were grown 0 times (if all 1 threads each grew once, this figure would be 1)
Final type counts
0 : drop
1 : bool8
0 : bool8
0 : bool8
0 : bool8
1 : int32
0 : int64
0 : float64
0 : float64
0 : float64
2 : string
Rereading 1 columns due to out-of-sample type exceptions.
Column 1 ("") bumped from 'bool8' to 'string' due to <<"Goulburn">> on row 0
[11] Read the data
jumps=[0..360), chunk_size=1017828, total_size=366418143
Read 99%. ETA 00:00
[12] Finalizing the datatable
Reread 22885380 rows x 1 columns in 00:01.418
Read 22885380 rows. Exactly what was estimated and allocated up front
=============================
0.001s ( 0%) Memory map 0.341GB file
0.003s ( 0%) sep=',' ncol=4 and header detection
0.001s ( 0%) Column type detection using 10027 sample rows
0.574s ( 14%) Allocation of 22885380 rows x 4 cols (0.469GB)
3.428s ( 86%) Reading 360 chunks of 0.971MB (63547 rows) using 1 threads
= 0.010s ( 0%) Finding first non-embedded \n after each jump
+ 1.988s ( 50%) Parse to row-major thread buffers
+ 1.137s ( 28%) Transpose
+ 0.292s ( 7%) Waiting
1.418s ( 35%) Rereading 1 columns due to out-of-sample type exceptions
4.007s Total
There were 20 warnings (use warnings() to see them)
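@HughParsonage Thanks! I'm confused. Are you saying it works with `verbose=TRUE, showProgress=TRUE`? That is what we were hoping for, so yay! That was the case that was failing, wasn't it? In any case, `showProgress` defaults to TRUE, but when you run with `verbose` at its default of FALSE, _then_ it doesn't work and you see the stack imbalance? It is odd that _less_ output makes it fail. Please confirm. If that really is the case, then maybe I'm barking up the wrong tree. It works fine for me on Linux, so I'm relying on you to test on Windows. Thanks.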
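(Also, at the bottom of the output of the 10th run, it says there were 20 warnings. I assume those are the 2 warnings shown further up, repeated 10 times. If so, that makes sense.)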
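One way to check that assumption is to collect the warning messages over the repeated runs and tabulate them. The sketch below is illustrative only: `collect_warnings` and `two_warnings` are hypothetical helpers, and `two_warnings` stands in for the `fread()` call so the snippet runs without the 350 MB file. As far as I know, R prints "stack imbalance" lines directly to the console rather than signalling them as warning conditions, so a handler like this may never see them and they would not be counted here.

```r
# Evaluate `fun` n times and record every warning message that is signalled.
collect_warnings <- function(fun, n = 10L) {
  msgs <- character(0)
  for (i in seq_len(n)) {
    withCallingHandlers(
      fun(),
      warning = function(w) {
        msgs <<- c(msgs, conditionMessage(w))  # remember the message
        invokeRestart("muffleWarning")         # and carry on with the run
      }
    )
  }
  msgs
}

# Hypothetical stand-in for the fread() call: signals two warnings per run.
two_warnings <- function() {
  warning("regular warning A")
  warning("regular warning B")
  invisible(NULL)
}

msgs <- collect_warnings(two_warnings, n = 10L)
print(length(msgs))   # total number of warnings collected
print(table(msgs))    # how often each distinct message occurred
print(any(grepl("stack imbalance", msgs, fixed = TRUE)))
```

With the real call, `fun` would be something like `function() fread("SA2-by-DJZ-2011.csv", header = FALSE, na.strings = "", verbose = TRUE, showProgress = TRUE)`.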
Hi Matt, sorry about that.
You're right: the original problem no longer causes a crash. That is, the following works as expected:
fread("SA2-by-DJZ-2011.csv", header = FALSE, na.strings = "")
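To clarify: in the original case, I got the crash when `verbose = FALSE` (the default). Before filing the issue, I ran it with `verbose = TRUE` and noticed the "stack imbalance" warnings, but did not get a crash. In the latest version, `verbose = FALSE` does not cause a crash (or even any problems).
The reason I said "not fixed" is that I noticed these warning messages: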
Warning messages:
Warning: stack imbalance in '$', 26 then 27
Warning: stack imbalance in 'lapply', 31 then 30
Warning: stack imbalance in '$', 14 then 15
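They looked odd, and I thought they might indicate a closely related, but not identical, problem. That said, this morning in Australia I can no longer reproduce the warning messages.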
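OK, I see. Those stack imbalance warnings are essentially errors, yes. We can't skip over them. I refer to a stack imbalance warning as a crash even though R hasn't actually crashed yet. (Once that warning has appeared, it is only a matter of time before it crashes.)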
When you ran the 10 runs in a fresh R session with `verbose=TRUE, showProgress=TRUE`, were any of the 20 warnings about stack imbalance, or were all 20 just the following regular warnings?
1: In fread("SA2-by-DJZ-2011.csv", header = FALSE, na.strings = "", :
Starting data input on line 12 <<"Goulburn","110018063",3499,>> with 4 fields and discarding line 11 <<"Main Statistical Area Structu>> before it because it has a different number of fields (3).
2: In fread("SA2-by-DJZ-2011.csv", header = FALSE, na.strings = "", :
Found the last consistent line but text exists afterwards. Consider fill=TRUE and/or blank.lines.skip=TRUE. First 200 characters of discarded line: <<"Dataset: 2011 Census of Population and Housing">>
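As soon as a stack imbalance warning occurs, please start again in a fresh R session. Once it has happened even once, we can't trust anything in that R session.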
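To make "a fresh R session per run" easy to repeat, each attempt can be launched in its own R process and its combined output scanned for the text "stack imbalance". This is only a sketch under stated assumptions: `run_in_fresh_session` is a hypothetical helper, it assumes `Rscript` is available in `R.home("bin")`, and the second call below merely simulates the message with `message()` rather than reproducing the real bug.

```r
# Run `code` in a brand-new R process and report whether its combined
# stdout/stderr output mentions a stack imbalance.
run_in_fresh_session <- function(code) {
  script <- tempfile(fileext = ".R")
  on.exit(unlink(script), add = TRUE)
  writeLines(code, script)
  rscript <- file.path(R.home("bin"), "Rscript")
  out <- suppressWarnings(
    system2(rscript, c("--vanilla", shQuote(script)),
            stdout = TRUE, stderr = TRUE)
  )
  list(output = out,
       imbalance = any(grepl("stack imbalance", out, fixed = TRUE)))
}

# A clean run: nothing suspicious in the output.
clean <- run_in_fresh_session("print(1 + 1)")
print(clean$imbalance)

# A simulated bad run: the child process writes the message text to stderr.
bad <- run_in_fresh_session(
  "message(\"Warning: stack imbalance in '$', 27 then 28\")"
)
print(bad$imbalance)
```

For the real test, `code` would load data.table and call `fread()` on `SA2-by-DJZ-2011.csv` with `verbose = TRUE, showProgress = TRUE`, repeated 10 times, stopping at the first run where `imbalance` is TRUE.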
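When I ran with `verbose=TRUE, showProgress=TRUE`, I managed to get a crash whose message mentioned `const char` and `SEXP`. I'm trying to reproduce it from the command line (unfortunately it happened in RStudio, and RStudio closed before I could read the whole message).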
I couldn't reproduce the crash. Here is the result after a restart. There are stack imbalance warnings:
> library(data.table)
data.table 1.10.5 IN DEVELOPMENT built 2017-11-15 00:36:41 UTC; appveyor
The fastest way to learn (by data.table authors): https://www.datacamp.com/courses/data-analysis-the-data-table-way
Documentation: ?data.table, example(data.table) and browseVignettes("data.table")
Release notes, videos and slides: http://r-datatable.com
> for (i in 1:10) fread("SA2-by-DJZ-2011.csv", header = FALSE, na.strings = "", verbose = TRUE, showProgress = TRUE)
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
Using 12 threads (omp_get_max_threads()=12, nth=12)
NAstrings = [<<>>]
None of the NAstrings look like numbers.
show progress = 1
0/1 column will be read as boolean
[02] Opening the file
Opening file SA2-by-DJZ-2011.csv
File opened, size = 349.4MB (366418725 bytes).
Memory mapping ... ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
\r-only line endings are not allowed because \n is found in the data
[05] Skipping initial rows if needed
Positioned on line 1 starting: <<Australian Bureau of Statistic>>
[06] Detect separator, quoting rule, and ncolumns
Detecting sep ...
sep=',' with 89 lines of 4 fields using quote rule 0
Detected 4 columns on line 12. This line is either column names or first data row. Line starts as: <<"Goulburn","110018063",3499,>>
Quote rule picked = 0
fill=false and the most number of columns found is 4
[07] Detect column types, good nrow estimate and whether first row is column names
'header' changed by user from 'auto' to false
Number of sampling jump points = 101 because (366418375 bytes from row 1 to eof) / (2 * 1457 jump0size) == 125744
Type codes (jump 000) : 1551 Quote rule 0
Type codes (jump 100) : 11051 Quote rule 0
=====
Sampled 10027 rows (handled \n inside quoted fields) at 101 jump points
Bytes from first data row on line 12 to the end of last row: 366418143
Line length: mean=16.02 sd=0.21 min=16 max=29
Estimated number of rows: 366418143 / 16.02 = 22877178
Initial alloc = 25164895 rows (22877178 + 9%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
=====
[08] Assign column names
[09] Apply user overrides on column types
After 0 type and 0 drop user overrides : 11051
[10] Allocate memory for the datatable
Allocating 4 column slots (4 - 0 dropped) with 25164895 rows
[11] Read the data
jumps=[0..360), chunk_size=1017828, total_size=366418143
[12] Finalizing the datatable
Read 22885380 rows x 4 columns from 349.4MB (366418725 bytes) file in 00:01.212 wall clock time
Thread buffers were grown 0 times (if all 12 threads each grew once, this figure would be 12)
Final type counts
0 : drop
1 : bool8
0 : bool8
0 : bool8
0 : bool8
1 : int32
0 : int64
0 : float64
0 : float64
0 : float64
2 : string
Rereading 1 columns due to out-of-sample type exceptions.
Column 1 ("") bumped from 'bool8' to 'string' due to <<"Goulburn">> on row 0
[11] Read the data
jumps=[0..360), chunk_size=1017828, total_size=366418143
Read 31%. ETA 00:00 Warning: stack imbalance in '$', 24 then 23
Read 91%. ETA 00:00 Warning: stack imbalance in '$', 27 then 26
Read 95%. ETA 00:00 Warning: stack imbalance in '$', 28 then 29
Read 98%. ETA 00:00
[12] Finalizing the datatable
Reread 22885380 rows x 1 columns in 00:00.895
Read 22885380 rows. Exactly what was estimated and allocated up front
=============================
0.002s ( 0%) Memory map 0.341GB file
0.029s ( 1%) sep=',' ncol=4 and header detection
0.001s ( 0%) Column type detection using 10027 sample rows
0.314s ( 15%) Allocation of 22885380 rows x 4 cols (0.469GB)
1.761s ( 84%) Reading 360 chunks of 0.971MB (63547 rows) using 12 threads
= 0.015s ( 1%) Finding first non-embedded \n after each jump
+ 0.599s ( 28%) Parse to row-major thread buffers
+ 0.400s ( 19%) Transpose
+ 0.746s ( 35%) Waiting
0.895s ( 42%) Rereading 1 columns due to out-of-sample type exceptions
2.107s Total
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
Using 12 threads (omp_get_max_threads()=12, nth=12)
NAstrings = [<<>>]
None of the NAstrings look like numbers.
show progress = 1
0/1 column will be read as boolean
[02] Opening the file
Opening file SA2-by-DJZ-2011.csv
File opened, size = 349.4MB (366418725 bytes).
Memory mapping ... ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
\r-only line endings are not allowed because \n is found in the data
[05] Skipping initial rows if needed
Positioned on line 1 starting: <<Australian Bureau of Statistic>>
[06] Detect separator, quoting rule, and ncolumns
Detecting sep ...
sep=',' with 89 lines of 4 fields using quote rule 0
Detected 4 columns on line 12. This line is either column names or first data row. Line starts as: <<"Goulburn","110018063",3499,>>
Quote rule picked = 0
fill=false and the most number of columns found is 4
[07] Detect column types, good nrow estimate and whether first row is column names
'header' changed by user from 'auto' to false
Number of sampling jump points = 101 because (366418375 bytes from row 1 to eof) / (2 * 1457 jump0size) == 125744
Type codes (jump 000) : 1551 Quote rule 0
Type codes (jump 100) : 11051 Quote rule 0
=====
Sampled 10027 rows (handled \n inside quoted fields) at 101 jump points
Bytes from first data row on line 12 to the end of last row: 366418143
Line length: mean=16.02 sd=0.21 min=16 max=29
Estimated number of rows: 366418143 / 16.02 = 22877178
Initial alloc = 25164895 rows (22877178 + 9%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
=====
[08] Assign column names
[09] Apply user overrides on column types
After 0 type and 0 drop user overrides : 11051
[10] Allocate memory for the datatable
Allocating 4 column slots (4 - 0 dropped) with 25164895 rows
[11] Read the data
jumps=[0..360), chunk_size=1017828, total_size=366418143
[12] Finalizing the datatable
Read 22885380 rows x 4 columns from 349.4MB (366418725 bytes) file in 00:01.335 wall clock time
Thread buffers were grown 0 times (if all 12 threads each grew once, this figure would be 12)
Final type counts
0 : drop
1 : bool8
0 : bool8
0 : bool8
0 : bool8
1 : int32
0 : int64
0 : float64
0 : float64
0 : float64
2 : string
Rereading 1 columns due to out-of-sample type exceptions.
Column 1 ("") bumped from 'bool8' to 'string' due to <<"Goulburn">> on row 0
[11] Read the data
jumps=[0..360), chunk_size=1017828, total_size=366418143
Read 98%. ETA 00:00
[12] Finalizing the datatable
Reread 22885380 rows x 1 columns in 00:01.049
Read 22885380 rows. Exactly what was estimated and allocated up front
=============================
0.002s ( 0%) Memory map 0.341GB file
0.006s ( 0%) sep=',' ncol=4 and header detection
0.001s ( 0%) Column type detection using 10027 sample rows
0.402s ( 17%) Allocation of 22885380 rows x 4 cols (0.469GB)
1.974s ( 83%) Reading 360 chunks of 0.971MB (63547 rows) using 12 threads
= 0.001s ( 0%) Finding first non-embedded \n after each jump
+ 0.209s ( 9%) Parse to row-major thread buffers
+ 0.864s ( 36%) Transpose
+ 0.900s ( 38%) Waiting
1.049s ( 44%) Rereading 1 columns due to out-of-sample type exceptions
2.385s Total
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
Using 12 threads (omp_get_max_threads()=12, nth=12)
NAstrings = [<<>>]
None of the NAstrings look like numbers.
show progress = 1
0/1 column will be read as boolean
[02] Opening the file
Opening file SA2-by-DJZ-2011.csv
File opened, size = 349.4MB (366418725 bytes).
Memory mapping ... ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
\r-only line endings are not allowed because \n is found in the data
[05] Skipping initial rows if needed
Positioned on line 1 starting: <<Australian Bureau of Statistic>>
[06] Detect separator, quoting rule, and ncolumns
Detecting sep ...
sep=',' with 89 lines of 4 fields using quote rule 0
Detected 4 columns on line 12. This line is either column names or first data row. Line starts as: <<"Goulburn","110018063",3499,>>
Quote rule picked = 0
fill=false and the most number of columns found is 4
[07] Detect column types, good nrow estimate and whether first row is column names
'header' changed by user from 'auto' to false
Number of sampling jump points = 101 because (366418375 bytes from row 1 to eof) / (2 * 1457 jump0size) == 125744
Type codes (jump 000) : 1551 Quote rule 0
Type codes (jump 100) : 11051 Quote rule 0
=====
Sampled 10027 rows (handled \n inside quoted fields) at 101 jump points
Bytes from first data row on line 12 to the end of last row: 366418143
Line length: mean=16.02 sd=0.21 min=16 max=29
Estimated number of rows: 366418143 / 16.02 = 22877178
Initial alloc = 25164895 rows (22877178 + 9%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
=====
[08] Assign column names
[09] Apply user overrides on column types
After 0 type and 0 drop user overrides : 11051
[10] Allocate memory for the datatable
Allocating 4 column slots (4 - 0 dropped) with 25164895 rows
[11] Read the data
jumps=[0..360), chunk_size=1017828, total_size=366418143
[12] Finalizing the datatable
Read 22885380 rows x 4 columns from 349.4MB (366418725 bytes) file in 00:01.212 wall clock time
Thread buffers were grown 0 times (if all 12 threads each grew once, this figure would be 12)
Final type counts
0 : drop
1 : bool8
0 : bool8
0 : bool8
0 : bool8
1 : int32
0 : int64
0 : float64
0 : float64
0 : float64
2 : string
Rereading 1 columns due to out-of-sample type exceptions.
Column 1 ("") bumped from 'bool8' to 'string' due to <<"Goulburn">> on row 0
[11] Read the data
jumps=[0..360), chunk_size=1017828, total_size=366418143
Read 96%. ETA 00:00
[12] Finalizing the datatable
Reread 22885380 rows x 1 columns in 00:00.414
Read 22885380 rows. Exactly what was estimated and allocated up front
=============================
0.002s ( 0%) Memory map 0.341GB file
0.007s ( 0%) sep=',' ncol=4 and header detection
0.002s ( 0%) Column type detection using 10027 sample rows
0.293s ( 18%) Allocation of 22885380 rows x 4 cols (0.469GB)
1.322s ( 81%) Reading 360 chunks of 0.971MB (63547 rows) using 12 threads
= 0.001s ( 0%) Finding first non-embedded \n after each jump
+ 0.199s ( 12%) Parse to row-major thread buffers
+ 0.822s ( 51%) Transpose
+ 0.301s ( 19%) Waiting
0.414s ( 25%) Rereading 1 columns due to out-of-sample type exceptions
1.626s Total
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
Using 12 threads (omp_get_max_threads()=12, nth=12)
NAstrings = [<<>>]
None of the NAstrings look like numbers.
show progress = 1
0/1 column will be read as boolean
[02] Opening the file
Opening file SA2-by-DJZ-2011.csv
File opened, size = 349.4MB (366418725 bytes).
Memory mapping ... ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
\r-only line endings are not allowed because \n is found in the data
[05] Skipping initial rows if needed
Positioned on line 1 starting: <<Australian Bureau of Statistic>>
[06] Detect separator, quoting rule, and ncolumns
Detecting sep ...
sep=',' with 89 lines of 4 fields using quote rule 0
Detected 4 columns on line 12. This line is either column names or first data row. Line starts as: <<"Goulburn","110018063",3499,>>
Quote rule picked = 0
fill=false and the most number of columns found is 4
[07] Detect column types, good nrow estimate and whether first row is column names
'header' changed by user from 'auto' to false
Number of sampling jump points = 101 because (366418375 bytes from row 1 to eof) / (2 * 1457 jump0size) == 125744
Type codes (jump 000) : 1551 Quote rule 0
Type codes (jump 100) : 11051 Quote rule 0
=====
Sampled 10027 rows (handled \n inside quoted fields) at 101 jump points
Bytes from first data row on line 12 to the end of last row: 366418143
Line length: mean=16.02 sd=0.21 min=16 max=29
Estimated number of rows: 366418143 / 16.02 = 22877178
Initial alloc = 25164895 rows (22877178 + 9%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
=====
[08] Assign column names
[09] Apply user overrides on column types
After 0 type and 0 drop user overrides : 11051
[10] Allocate memory for the datatable
Allocating 4 column slots (4 - 0 dropped) with 25164895 rows
[11] Read the data
jumps=[0..360), chunk_size=1017828, total_size=366418143
[12] Finalizing the datatable
Read 22885380 rows x 4 columns from 349.4MB (366418725 bytes) file in 00:01.451 wall clock time
Thread buffers were grown 0 times (if all 12 threads each grew once, this figure would be 12)
Final type counts
0 : drop
1 : bool8
0 : bool8
0 : bool8
0 : bool8
1 : int32
0 : int64
0 : float64
0 : float64
0 : float64
2 : string
Rereading 1 columns due to out-of-sample type exceptions.
Column 1 ("") bumped from 'bool8' to 'string' due to <<"Goulburn">> on row 0
[11] Read the data
jumps=[0..360), chunk_size=1017828, total_size=366418143
Read 97%. ETA 00:00
[12] Finalizing the datatable
Reread 22885380 rows x 1 columns in 00:00.409
Read 22885380 rows. Exactly what was estimated and allocated up front
=============================
0.002s ( 0%) Memory map 0.341GB file
0.006s ( 0%) sep=',' ncol=4 and header detection
0.001s ( 0%) Column type detection using 10027 sample rows
0.403s ( 22%) Allocation of 22885380 rows x 4 cols (0.469GB)
1.448s ( 78%) Reading 360 chunks of 0.971MB (63547 rows) using 12 threads
= 0.001s ( 0%) Finding first non-embedded \n after each jump
+ 0.194s ( 10%) Parse to row-major thread buffers
+ 0.974s ( 52%) Transpose
+ 0.279s ( 15%) Waiting
0.409s ( 22%) Rereading 1 columns due to out-of-sample type exceptions
1.860s Total
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
Using 12 threads (omp_get_max_threads()=12, nth=12)
NAstrings = [<<>>]
None of the NAstrings look like numbers.
show progress = 1
0/1 column will be read as boolean
[02] Opening the file
Opening file SA2-by-DJZ-2011.csv
File opened, size = 349.4MB (366418725 bytes).
Memory mapping ... ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
\r-only line endings are not allowed because \n is found in the data
[05] Skipping initial rows if needed
Positioned on line 1 starting: <<Australian Bureau of Statistic>>
[06] Detect separator, quoting rule, and ncolumns
Detecting sep ...
sep=',' with 89 lines of 4 fields using quote rule 0
Detected 4 columns on line 12. This line is either column names or first data row. Line starts as: <<"Goulburn","110018063",3499,>>
Quote rule picked = 0
fill=false and the most number of columns found is 4
[07] Detect column types, good nrow estimate and whether first row is column names
'header' changed by user from 'auto' to false
Number of sampling jump points = 101 because (366418375 bytes from row 1 to eof) / (2 * 1457 jump0size) == 125744
Type codes (jump 000) : 1551 Quote rule 0
Type codes (jump 100) : 11051 Quote rule 0
=====
Sampled 10027 rows (handled \n inside quoted fields) at 101 jump points
Bytes from first data row on line 12 to the end of last row: 366418143
Line length: mean=16.02 sd=0.21 min=16 max=29
Estimated number of rows: 366418143 / 16.02 = 22877178
Initial alloc = 25164895 rows (22877178 + 9%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
=====
[08] Assign column names
[09] Apply user overrides on column types
After 0 type and 0 drop user overrides : 11051
[10] Allocate memory for the datatable
Allocating 4 column slots (4 - 0 dropped) with 25164895 rows
[11] Read the data
jumps=[0..360), chunk_size=1017828, total_size=366418143
[12] Finalizing the datatable
Read 22885380 rows x 4 columns from 349.4MB (366418725 bytes) file in 00:01.480 wall clock time
Thread buffers were grown 0 times (if all 12 threads each grew once, this figure would be 12)
Final type counts
0 : drop
1 : bool8
0 : bool8
0 : bool8
0 : bool8
1 : int32
0 : int64
0 : float64
0 : float64
0 : float64
2 : string
Rereading 1 columns due to out-of-sample type exceptions.
Column 1 ("") bumped from 'bool8' to 'string' due to <<"Goulburn">> on row 0
[11] Read the data
jumps=[0..360), chunk_size=1017828, total_size=366418143
Read 99%. ETA 00:00
[12] Finalizing the datatable
Reread 22885380 rows x 1 columns in 00:00.412
Read 22885380 rows. Exactly what was estimated and allocated up front
=============================
0.002s ( 0%) Memory map 0.341GB file
0.007s ( 0%) sep=',' ncol=4 and header detection
0.001s ( 0%) Column type detection using 10027 sample rows
0.459s ( 24%) Allocation of 22885380 rows x 4 cols (0.469GB)
1.424s ( 75%) Reading 360 chunks of 0.971MB (63547 rows) using 12 threads
= 0.001s ( 0%) Finding first non-embedded \n after each jump
+ 0.197s ( 10%) Parse to row-major thread buffers
+ 0.938s ( 50%) Transpose
+ 0.288s ( 15%) Waiting
0.412s ( 22%) Rereading 1 columns due to out-of-sample type exceptions
1.892s Total
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
Using 12 threads (omp_get_max_threads()=12, nth=12)
NAstrings = [<<>>]
None of the NAstrings look like numbers.
show progress = 1
0/1 column will be read as boolean
[02] Opening the file
Opening file SA2-by-DJZ-2011.csv
File opened, size = 349.4MB (366418725 bytes).
Memory mapping ... ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
\r-only line endings are not allowed because \n is found in the data
[05] Skipping initial rows if needed
Positioned on line 1 starting: <<Australian Bureau of Statistic>>
[06] Detect separator, quoting rule, and ncolumns
Detecting sep ...
sep=',' with 89 lines of 4 fields using quote rule 0
Detected 4 columns on line 12. This line is either column names or first data row. Line starts as: <<"Goulburn","110018063",3499,>>
Quote rule picked = 0
fill=false and the most number of columns found is 4
[07] Detect column types, good nrow estimate and whether first row is column names
'header' changed by user from 'auto' to false
Number of sampling jump points = 101 because (366418375 bytes from row 1 to eof) / (2 * 1457 jump0size) == 125744
Type codes (jump 000) : 1551 Quote rule 0
Type codes (jump 100) : 11051 Quote rule 0
=====
Sampled 10027 rows (handled \n inside quoted fields) at 101 jump points
Bytes from first data row on line 12 to the end of last row: 366418143
Line length: mean=16.02 sd=0.21 min=16 max=29
Estimated number of rows: 366418143 / 16.02 = 22877178
Initial alloc = 25164895 rows (22877178 + 9%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
=====
[08] Assign column names
[09] Apply user overrides on column types
After 0 type and 0 drop user overrides : 11051
[10] Allocate memory for the datatable
Allocating 4 column slots (4 - 0 dropped) with 25164895 rows
[11] Read the data
jumps=[0..360), chunk_size=1017828, total_size=366418143
[12] Finalizing the datatable
Read 22885380 rows x 4 columns from 349.4MB (366418725 bytes) file in 00:01.381 wall clock time
Thread buffers were grown 0 times (if all 12 threads each grew once, this figure would be 12)
Final type counts
0 : drop
1 : bool8
0 : bool8
0 : bool8
0 : bool8
1 : int32
0 : int64
0 : float64
0 : float64
0 : float64
2 : string
Rereading 1 columns due to out-of-sample type exceptions.
Column 1 ("") bumped from 'bool8' to 'string' due to <<"Goulburn">> on row 0
[11] Read the data
jumps=[0..360), chunk_size=1017828, total_size=366418143
Read 97%. ETA 00:00
[12] Finalizing the datatable
Reread 22885380 rows x 1 columns in 00:00.401
Read 22885380 rows. Exactly what was estimated and allocated up front
=============================
0.002s ( 0%) Memory map 0.341GB file
0.005s ( 0%) sep=',' ncol=4 and header detection
0.001s ( 0%) Column type detection using 10027 sample rows
0.384s ( 22%) Allocation of 22885380 rows x 4 cols (0.469GB)
1.389s ( 78%) Reading 360 chunks of 0.971MB (63547 rows) using 12 threads
= 0.001s ( 0%) Finding first non-embedded \n after each jump
+ 0.196s ( 11%) Parse to row-major thread buffers
+ 0.911s ( 51%) Transpose
+ 0.281s ( 16%) Waiting
0.401s ( 22%) Rereading 1 columns due to out-of-sample type exceptions
1.781s Total
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
Using 12 threads (omp_get_max_threads()=12, nth=12)
NAstrings = [<<>>]
None of the NAstrings look like numbers.
show progress = 1
0/1 column will be read as boolean
[02] Opening the file
Opening file SA2-by-DJZ-2011.csv
File opened, size = 349.4MB (366418725 bytes).
Memory mapping ... ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
\r-only line endings are not allowed because \n is found in the data
[05] Skipping initial rows if needed
Positioned on line 1 starting: <<Australian Bureau of Statistic>>
[06] Detect separator, quoting rule, and ncolumns
Detecting sep ...
sep=',' with 89 lines of 4 fields using quote rule 0
Detected 4 columns on line 12. This line is either column names or first data row. Line starts as: <<"Goulburn","110018063",3499,>>
Quote rule picked = 0
fill=false and the most number of columns found is 4
[07] Detect column types, good nrow estimate and whether first row is column names
'header' changed by user from 'auto' to false
Number of sampling jump points = 101 because (366418375 bytes from row 1 to eof) / (2 * 1457 jump0size) == 125744
Type codes (jump 000) : 1551 Quote rule 0
Type codes (jump 100) : 11051 Quote rule 0
=====
Sampled 10027 rows (handled \n inside quoted fields) at 101 jump points
Bytes from first data row on line 12 to the end of last row: 366418143
Line length: mean=16.02 sd=0.21 min=16 max=29
Estimated number of rows: 366418143 / 16.02 = 22877178
Initial alloc = 25164895 rows (22877178 + 9%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
=====
[08] Assign column names
[09] Apply user overrides on column types
After 0 type and 0 drop user overrides : 11051
[10] Allocate memory for the datatable
Allocating 4 column slots (4 - 0 dropped) with 25164895 rows
[11] Read the data
jumps=[0..360), chunk_size=1017828, total_size=366418143
[12] Finalizing the datatable
Read 22885380 rows x 4 columns from 349.4MB (366418725 bytes) file in 00:01.384 wall clock time
Thread buffers were grown 0 times (if all 12 threads each grew once, this figure would be 12)
Final type counts
0 : drop
1 : bool8
0 : bool8
0 : bool8
0 : bool8
1 : int32
0 : int64
0 : float64
0 : float64
0 : float64
2 : string
Rereading 1 columns due to out-of-sample type exceptions.
Column 1 ("") bumped from 'bool8' to 'string' due to <<"Goulburn">> on row 0
[11] Read the data
jumps=[0..360), chunk_size=1017828, total_size=366418143
Read 98%. ETA 00:00
[12] Finalizing the datatable
Reread 22885380 rows x 1 columns in 00:00.480
Read 22885380 rows. Exactly what was estimated and allocated up front
=============================
0.002s ( 0%) Memory map 0.341GB file
0.006s ( 0%) sep=',' ncol=4 and header detection
0.002s ( 0%) Column type detection using 10027 sample rows
0.476s ( 26%) Allocation of 22885380 rows x 4 cols (0.469GB)
1.378s ( 74%) Reading 360 chunks of 0.971MB (63547 rows) using 12 threads
= 0.001s ( 0%) Finding first non-embedded \n after each jump
+ 0.192s ( 10%) Parse to row-major thread buffers
+ 0.833s ( 45%) Transpose
+ 0.352s ( 19%) Waiting
0.480s ( 26%) Rereading 1 columns due to out-of-sample type exceptions
1.864s Total
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
Using 12 threads (omp_get_max_threads()=12, nth=12)
NAstrings = [<<>>]
None of the NAstrings look like numbers.
show progress = 1
0/1 column will be read as boolean
[02] Opening the file
Opening file SA2-by-DJZ-2011.csv
File opened, size = 349.4MB (366418725 bytes).
Memory mapping ... ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
\r-only line endings are not allowed because \n is found in the data
[05] Skipping initial rows if needed
Positioned on line 1 starting: <<Australian Bureau of Statistic>>
[06] Detect separator, quoting rule, and ncolumns
Detecting sep ...
sep=',' with 89 lines of 4 fields using quote rule 0
Detected 4 columns on line 12. This line is either column names or first data row. Line starts as: <<"Goulburn","110018063",3499,>>
Quote rule picked = 0
fill=false and the most number of columns found is 4
[07] Detect column types, good nrow estimate and whether first row is column names
'header' changed by user from 'auto' to false
Number of sampling jump points = 101 because (366418375 bytes from row 1 to eof) / (2 * 1457 jump0size) == 125744
Type codes (jump 000) : 1551 Quote rule 0
Type codes (jump 100) : 11051 Quote rule 0
=====
Sampled 10027 rows (handled \n inside quoted fields) at 101 jump points
Bytes from first data row on line 12 to the end of last row: 366418143
Line length: mean=16.02 sd=0.21 min=16 max=29
Estimated number of rows: 366418143 / 16.02 = 22877178
Initial alloc = 25164895 rows (22877178 + 9%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
=====
[08] Assign column names
[09] Apply user overrides on column types
After 0 type and 0 drop user overrides : 11051
[10] Allocate memory for the datatable
Allocating 4 column slots (4 - 0 dropped) with 25164895 rows
[11] Read the data
jumps=[0..360), chunk_size=1017828, total_size=366418143
[12] Finalizing the datatable
Read 22885380 rows x 4 columns from 349.4MB (366418725 bytes) file in 00:01.374 wall clock time
Thread buffers were grown 0 times (if all 12 threads each grew once, this figure would be 12)
Final type counts
0 : drop
1 : bool8
0 : bool8
0 : bool8
0 : bool8
1 : int32
0 : int64
0 : float64
0 : float64
0 : float64
2 : string
Rereading 1 columns due to out-of-sample type exceptions.
Column 1 ("") bumped from 'bool8' to 'string' due to <<"Goulburn">> on row 0
[11] Read the data
jumps=[0..360), chunk_size=1017828, total_size=366418143
Read 98%. ETA 00:00
[12] Finalizing the datatable
Reread 22885380 rows x 1 columns in 00:00.507
Read 22885380 rows. Exactly what was estimated and allocated up front
=============================
0.002s ( 0%) Memory map 0.341GB file
0.006s ( 0%) sep=',' ncol=4 and header detection
0.001s ( 0%) Column type detection using 10027 sample rows
0.311s ( 17%) Allocation of 22885380 rows x 4 cols (0.469GB)
1.562s ( 83%) Reading 360 chunks of 0.971MB (63547 rows) using 12 threads
= 0.001s ( 0%) Finding first non-embedded \n after each jump
+ 0.193s ( 10%) Parse to row-major thread buffers
+ 0.988s ( 52%) Transpose
+ 0.381s ( 20%) Waiting
0.507s ( 27%) Rereading 1 columns due to out-of-sample type exceptions
1.881s Total
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
Using 12 threads (omp_get_max_threads()=12, nth=12)
NAstrings = [<<>>]
None of the NAstrings look like numbers.
show progress = 1
0/1 column will be read as boolean
[02] Opening the file
Opening file SA2-by-DJZ-2011.csv
File opened, size = 349.4MB (366418725 bytes).
Memory mapping ... ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
\r-only line endings are not allowed because \n is found in the data
[05] Skipping initial rows if needed
Positioned on line 1 starting: <<Australian Bureau of Statistic>>
[06] Detect separator, quoting rule, and ncolumns
Detecting sep ...
sep=',' with 89 lines of 4 fields using quote rule 0
Detected 4 columns on line 12. This line is either column names or first data row. Line starts as: <<"Goulburn","110018063",3499,>>
Quote rule picked = 0
fill=false and the most number of columns found is 4
[07] Detect column types, good nrow estimate and whether first row is column names
'header' changed by user from 'auto' to false
Number of sampling jump points = 101 because (366418375 bytes from row 1 to eof) / (2 * 1457 jump0size) == 125744
Type codes (jump 000) : 1551 Quote rule 0
Type codes (jump 100) : 11051 Quote rule 0
=====
Sampled 10027 rows (handled \n inside quoted fields) at 101 jump points
Bytes from first data row on line 12 to the end of last row: 366418143
Line length: mean=16.02 sd=0.21 min=16 max=29
Estimated number of rows: 366418143 / 16.02 = 22877178
Initial alloc = 25164895 rows (22877178 + 9%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
=====
[08] Assign column names
[09] Apply user overrides on column types
After 0 type and 0 drop user overrides : 11051
[10] Allocate memory for the datatable
Allocating 4 column slots (4 - 0 dropped) with 25164895 rows
[11] Read the data
jumps=[0..360), chunk_size=1017828, total_size=366418143
[12] Finalizing the datatable
Read 22885380 rows x 4 columns from 349.4MB (366418725 bytes) file in 00:01.318 wall clock time
Thread buffers were grown 0 times (if all 12 threads each grew once, this figure would be 12)
Final type counts
0 : drop
1 : bool8
0 : bool8
0 : bool8
0 : bool8
1 : int32
0 : int64
0 : float64
0 : float64
0 : float64
2 : string
Rereading 1 columns due to out-of-sample type exceptions.
Column 1 ("") bumped from 'bool8' to 'string' due to <<"Goulburn">> on row 0
[11] Read the data
jumps=[0..360), chunk_size=1017828, total_size=366418143
Read 96%. ETA 00:00
[12] Finalizing the datatable
Reread 22885380 rows x 1 columns in 00:00.493
Read 22885380 rows. Exactly what was estimated and allocated up front
=============================
0.002s ( 0%) Memory map 0.341GB file
0.006s ( 0%) sep=',' ncol=4 and header detection
0.001s ( 0%) Column type detection using 10027 sample rows
0.306s ( 17%) Allocation of 22885380 rows x 4 cols (0.469GB)
1.496s ( 83%) Reading 360 chunks of 0.971MB (63547 rows) using 12 threads
= 0.001s ( 0%) Finding first non-embedded \n after each jump
+ 0.193s ( 11%) Parse to row-major thread buffers
+ 0.935s ( 52%) Transpose
+ 0.367s ( 20%) Waiting
0.493s ( 27%) Rereading 1 columns due to out-of-sample type exceptions
1.811s Total
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
Using 12 threads (omp_get_max_threads()=12, nth=12)
NAstrings = [<<>>]
None of the NAstrings look like numbers.
show progress = 1
0/1 column will be read as boolean
[02] Opening the file
Opening file SA2-by-DJZ-2011.csv
File opened, size = 349.4MB (366418725 bytes).
Memory mapping ... ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
\r-only line endings are not allowed because \n is found in the data
[05] Skipping initial rows if needed
Positioned on line 1 starting: <<Australian Bureau of Statistic>>
[06] Detect separator, quoting rule, and ncolumns
Detecting sep ...
sep=',' with 89 lines of 4 fields using quote rule 0
Detected 4 columns on line 12. This line is either column names or first data row. Line starts as: <<"Goulburn","110018063",3499,>>
Quote rule picked = 0
fill=false and the most number of columns found is 4
[07] Detect column types, good nrow estimate and whether first row is column names
'header' changed by user from 'auto' to false
Number of sampling jump points = 101 because (366418375 bytes from row 1 to eof) / (2 * 1457 jump0size) == 125744
Type codes (jump 000) : 1551 Quote rule 0
Type codes (jump 100) : 11051 Quote rule 0
=====
Sampled 10027 rows (handled \n inside quoted fields) at 101 jump points
Bytes from first data row on line 12 to the end of last row: 366418143
Line length: mean=16.02 sd=0.21 min=16 max=29
Estimated number of rows: 366418143 / 16.02 = 22877178
Initial alloc = 25164895 rows (22877178 + 9%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
=====
[08] Assign column names
[09] Apply user overrides on column types
After 0 type and 0 drop user overrides : 11051
[10] Allocate memory for the datatable
Allocating 4 column slots (4 - 0 dropped) with 25164895 rows
[11] Read the data
jumps=[0..360), chunk_size=1017828, total_size=366418143
[12] Finalizing the datatable
Read 22885380 rows x 4 columns from 349.4MB (366418725 bytes) file in 00:01.141 wall clock time
Thread buffers were grown 0 times (if all 12 threads each grew once, this figure would be 12)
Final type counts
0 : drop
1 : bool8
0 : bool8
0 : bool8
0 : bool8
1 : int32
0 : int64
0 : float64
0 : float64
0 : float64
2 : string
Rereading 1 columns due to out-of-sample type exceptions.
Column 1 ("") bumped from 'bool8' to 'string' due to <<"Goulburn">> on row 0
[11] Read the data
jumps=[0..360), chunk_size=1017828, total_size=366418143
Read 98%. ETA 00:00
[12] Finalizing the datatable
Reread 22885380 rows x 1 columns in 00:00.506
Read 22885380 rows. Exactly what was estimated and allocated up front
=============================
0.002s ( 0%) Memory map 0.341GB file
0.007s ( 0%) sep=',' ncol=4 and header detection
0.001s ( 0%) Column type detection using 10027 sample rows
0.132s ( 8%) Allocation of 22885380 rows x 4 cols (0.469GB)
1.506s ( 91%) Reading 360 chunks of 0.971MB (63547 rows) using 12 threads
= 0.001s ( 0%) Finding first non-embedded \n after each jump
+ 0.195s ( 12%) Parse to row-major thread buffers
+ 0.938s ( 57%) Transpose
+ 0.371s ( 23%) Waiting
0.506s ( 31%) Rereading 1 columns due to out-of-sample type exceptions
1.647s Total
Warning: stack imbalance in 'for', 2 then 8
There were 20 warnings (use warnings() to see them)
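Odd, but it's good that it is deterministic. Thanks. That means the flush didn't work, and I'll have to find a way to avoid Rprintf. It works reliably with verbose=FALSE, showProgress=FALSE (you wrote that near the top of this issue, so I'm relying on it). By "reliably" I mean 10 consecutive runs with only the two expected warnings and no stack imbalance warnings.
Leave it with me. Thanks again.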
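One way to avoid printing from worker threads is to have each thread append its message to a shared buffer under a lock, and let only the main thread print once the parallel region has ended. The sketch below illustrates that idea only; it is not the change made in fread.c. The names `msg_buffer`, `msg_append` and `msg_flush` are hypothetical, and plain `fputs` stands in for Rprintf.

```c
#include <stdio.h>
#include <string.h>

#define MSG_CAPACITY 4096

/* Hypothetical fixed-size buffer shared by all worker threads. */
typedef struct {
    char text[MSG_CAPACITY];
    size_t used;
} msg_buffer;

/* Append a message; intended to be callable from inside an OpenMP
   parallel region. Returns 1 on success, 0 if the message did not fit. */
int msg_append(msg_buffer *buf, const char *msg)
{
    int ok = 0;
    size_t len = strlen(msg);
#ifdef _OPENMP
    #pragma omp critical(msg_buffer_lock)
#endif
    {
        if (buf->used + len < MSG_CAPACITY) {
            /* Copy the text together with its terminating '\0'. */
            memcpy(buf->text + buf->used, msg, len + 1);
            buf->used += len;
            ok = 1;
        }
    }
    return ok;
}

/* Print everything collected so far and reset the buffer. Call this from
   the main thread only, after the parallel region has finished.
   Returns the number of bytes that were printed. */
size_t msg_flush(msg_buffer *buf, FILE *out)
{
    size_t n = buf->used;
    if (n > 0) fputs(buf->text, out);
    buf->used = 0;
    buf->text[0] = '\0';
    return n;
}
```

With this arrangement the console is only ever written to by one thread, at the cost of messages appearing after the parallel work rather than during it.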
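@HughParsonage OK, please try the latest second attempt. It has not been merged into master yet, so please take care to get the Windows.zip from the branch here. As before, please provide the full output either way so that I can check it. Thanks!
The first attempt at the following resulted in a crash (something about pointers).
The second attempt (after rebooting) resulted in a stack imbalance in '$', 16 then 15 warning.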
# Assert that `data.table` is not installed:
stopifnot(!requireNamespace("data.table", quietly = TRUE))
install.packages("https://ci.appveyor.com/api/buildjobs/bpsehtwybbbgbyy3/artifacts/data.table_1.10.5.zip",
repos = NULL)
# Installing package into ‘C:/Users/hughp/Documents/R/win-library/3.4’
# (as ‘lib’ is unspecified)
# trying URL 'https://ci.appveyor.com/api/buildjobs/bpsehtwybbbgbyy3/artifacts/data.table_1.10.5.zip'
# Content type 'application/octet-stream' length 1557502 bytes (1.5 MB)
# downloaded 1.5 MB
library(data.table)
# data.table 1.10.5 IN DEVELOPMENT built 2017-11-17 01:38:17 UTC; appveyor
# The fastest way to learn (by data.table authors): https://www.datacamp.com/courses/data-analysis-the-data-table-way
# Documentation: ?data.table, example(data.table) and browseVignettes("data.table")
# Release notes, videos and slides: http://r-datatable.com
setwd("~/ABS-data/inbox/SA2-by-DJZ-2011/")
fread("SA2-by-DJZ-2011.csv", verbose = TRUE, na.strings = "", header = FALSE)
# Input contains no \n. Taking this to be a filename to open
# [01] Check arguments
# Using 12 threads (omp_get_max_threads()=12, nth=12)
# NAstrings = [<<>>]
# None of the NAstrings look like numbers.
# show progress = 1
# 0/1 column will be read as boolean
# [02] Opening the file
# Opening file SA2-by-DJZ-2011.csv
# File opened, size = 349.4MB (366418725 bytes).
# Memory mapping ... ok
# [03] Detect and skip BOM
# [04] Arrange mmap to be \0 terminated
# \r-only line endings are not allowed because \n is found in the data
# [05] Skipping initial rows if needed
# Positioned on line 1 starting: <<Australian Bureau of Statistic>>
# [06] Detect separator, quoting rule, and ncolumns
# Detecting sep ...
# sep=',' with 89 lines of 4 fields using quote rule 0
# Detected 4 columns on line 12. This line is either column names or first data row. Line starts as: <<"Goulburn","110018063",3499,>>
# Quote rule picked = 0
# fill=false and the most number of columns found is 4
# [07] Detect column types, good nrow estimate and whether first row is column names
# 'header' changed by user from 'auto' to false
# Number of sampling jump points = 101 because (366418375 bytes from row 1 to eof) / (2 * 1457 jump0size) == 125744
# Type codes (jump 000) : 1551 Quote rule 0
# Type codes (jump 100) : 1A51 Quote rule 0
# =====
# Sampled 10027 rows (handled \n inside quoted fields) at 101 jump points
# Bytes from first data row on line 12 to the end of last row: 366418143
# Line length: mean=16.02 sd=0.21 min=16 max=29
# Estimated number of rows: 366418143 / 16.02 = 22877178
# Initial alloc = 25164895 rows (22877178 + 9%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
# =====
# [08] Assign column names
# [09] Apply user overrides on column types
# After 0 type and 0 drop user overrides : 1A51
# [10] Allocate memory for the datatable
# Allocating 4 column slots (4 - 0 dropped) with 25164895 rows
# [11] Read the data
# jumps=[0..360), chunk_size=1017828, total_size=366418143
# Read 78%. ETA 00:00 Warning: stack imbalance in '$', 16 then 15
# Read 22885380 rows x 4 columns from 349.4MB (366418725 bytes) file in 00:02.677 wall clock time
# [12] Finalizing the datatable
# Type counts:
# 1 : bool8 '1'
# 1 : int32 '5'
# 2 : string 'A'
# =============================
# 0.002s ( 0%) Memory map 0.341GB file
# 0.007s ( 0%) sep=',' ncol=4 and header detection
# 0.001s ( 0%) Column type detection using 10027 sample rows
# 0.297s ( 11%) Allocation of 25164895 rows x 4 cols (0.469GB) of which 22885380 ( 91%) rows used
# 2.369s ( 89%) Reading 360 chunks of 0.971MB (63547 rows) using 12 threads
# = 0.003s ( 0%) Finding first non-embedded \n after each jump
# + 0.273s ( 10%) Parse to row-major thread buffers (grown 0 times)
# + 1.313s ( 49%) Transpose
# + 0.780s ( 29%) Waiting
# 0.893s ( 33%) Rereading 1 columns due to out-of-sample type exceptions
# 2.677s Total
# Column 1 ("") bumped from 'bool8' to 'string' due to <<"Goulburn">> on row 0
# V1 V2 V3 V4
# 1: Goulburn 110018063 3499 NA
# 2: NA 110018064 812 NA
# 3: NA 110018065 2158 NA
# 4: NA 110019999 402 NA
# 5: NA 110028068 10 NA
# ---
# 22885376: NA 997999799 0 NA
# 22885377: NA 998999899 64 NA
# 22885378: NA 994999499 34 NA
# 22885379: NA 0&&&&&&&& 250796 NA
# 22885380: NA 0@@@@@@@@ 7305367 NA
# Warning messages:
# 1: In fread("SA2-by-DJZ-2011.csv", verbose = TRUE, na.strings = "", :
# Starting data input on line 12 <<"Goulburn","110018063",3499,>> with 4 fields and discarding line 11 <<"Main Statistical Area Structu>> before it because it has a different number of fields (3).
# 2: In fread("SA2-by-DJZ-2011.csv", verbose = TRUE, na.strings = "", :
# Found the last consistent line but text exists afterwards. Consider fill=TRUE and/or blank.lines.skip=TRUE. First 200 characters of discarded line: <<"Dataset: 2011 Census of Population and Housing">>
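Hi @mattdowle. Some GCC versions still in use support OpenMP 3.1 at best, not 4.0. I ran into this with one of my packages on CRAN (Delaporte), where I tried to use SIMD directives (OpenMP 4.0): it compiled with Rtools for Windows (based on 4.9.3) but threw errors on Linux machines still using gcc 4.8.0. If I remember correctly, even Windows can only use 4.0 calls, not 4.5. Perhaps this is what is causing the problem?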
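To check which OpenMP release a given compiler targets, the standard `_OPENMP` macro can be inspected; it expands to a yyyymm date (201107 for 3.1, 201307 for 4.0, 201511 for 4.5). The following is a small sketch with hypothetical helper names, `openmp_date` and `openmp_version_label`, and is not taken from data.table.

```c
/* Returns the OpenMP release date (yyyymm) reported by the compiler,
   or 0 when this file is compiled without OpenMP support. */
long openmp_date(void)
{
#ifdef _OPENMP
    return (long)_OPENMP;
#else
    return 0L;
#endif
}

/* Map the yyyymm date to a label for the releases discussed above. */
const char *openmp_version_label(long date)
{
    if (date >= 201511L) return "4.5 or later";
    if (date >= 201307L) return "4.0";
    if (date >= 201107L) return "3.1";
    if (date > 0L)       return "older than 3.1";
    return "not enabled";
}
```

Calling `openmp_version_label(openmp_date())` on each build machine would show whether directives from 4.0 or later can be expected to compile there.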
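@HughParsonage Thanks for testing so quickly! OK, I'll keep thinking!
@aadler That's a good idea - everything
@HughParsonage Just to confirm, does the same command with only one change (verbose=FALSE) work fine? That is, fread("SA2-by-DJZ-2011.csv", verbose = FALSE, na.strings = "", header = FALSE). The progress meter will still be displayed.
Yes, running that command (ten times) returned the expected result (i.e. a data.table with only the two warnings, which are due to the malformed file). No stack imbalance warnings.
Thanks. So it does seem to be related to console output. A couple more things to try...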
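In verbose mode, some branches inside the parallel region call wallclock(). I have short-circuited it so that it always returns 0.0 and avoids the system call, to rule that out. I believed it was thread-safe, but perhaps it isn't. Please try the new Windows.zip from the rebuilt branch here.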
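A short-circuit of this kind could look roughly like the sketch below. It is an illustration under assumptions, not the actual fread.c code: `wallclock_sketch`, `elapsed_since` and the `WALLCLOCK_DISABLED` switch are hypothetical, and the C11 function `timespec_get` is used here as the clock source.

```c
#include <time.h>

/* Hypothetical compile-time switch: define WALLCLOCK_DISABLED to turn the
   timer into a no-op that never touches the system clock. */
double wallclock_sketch(void)
{
#ifdef WALLCLOCK_DISABLED
    return 0.0;
#else
    struct timespec ts;
    /* timespec_get returns the requested base on success, 0 on failure. */
    if (timespec_get(&ts, TIME_UTC) != TIME_UTC) return 0.0;
    return (double)ts.tv_sec + 1e-9 * (double)ts.tv_nsec;
#endif
}

/* Seconds elapsed since `start`, never negative. */
double elapsed_since(double start)
{
    double now = wallclock_sketch();
    return now > start ? now - start : 0.0;
}
```

With the switch defined, all verbose timings read as 0.000s, which is harmless for a diagnostic build whose only purpose is to rule the clock call in or out.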
First attempt:
install.packages("https://ci.appveyor.com/api/buildjobs/o0pn9ttkrbqgqw2k/artifacts/data.table_1.10.5.zip",
repos = NULL)
# Installing package into ‘C:/Users/hughp/Documents/R/win-library/3.4’
# (as ‘lib’ is unspecified)
# trying URL 'https://ci.appveyor.com/api/buildjobs/o0pn9ttkrbqgqw2k/artifacts/data.table_1.10.5.zip'
# Content type 'application/octet-stream' length 1556972 bytes (1.5 MB)
# downloaded 1.5 MB
#
# package ‘data.table’ successfully unpacked and MD5 sums checked
library(data.table)
# data.table 1.10.5 IN DEVELOPMENT built 2017-11-17 03:49:20 UTC; appveyor
# The fastest way to learn (by data.table authors): https://www.datacamp.com/courses/data-analysis-the-data-table-way
# Documentation: ?data.table, example(data.table) and browseVignettes("data.table")
# Release notes, videos and slides: http://r-datatable.com
fread("SA2-by-DJZ-2011.csv", verbose = TRUE, na.strings = "", header = FALSE)
On the second attempt, I got the following warnings:
Read 22%. ETA 00:00 Error in fread("SA2-by-DJZ-2011.csv", verbose = TRUE, na.strings = "", :
unprotect_ptr: pointer not found
In addition: Warning messages:
1: In fread("SA2-by-DJZ-2011.csv", verbose = TRUE, na.strings = "", :
Starting data input on line 12 <<"Goulburn","110018063",3499,>> with 4 fields and discarding line 11 <<"Main Statistical Area Structu>> before it because it has a different number of fields (3).
2: In fread("SA2-by-DJZ-2011.csv", verbose = TRUE, na.strings = "", :
Warning: stack imbalance in '$', 29 then 28
Found the last consistent line but text exists afterwards. Consider fill=TRUE and/or blank.lines.skip=TRUE. First 200 characters of discarded line: <<"Dataset: 2011 Census of Population and Housing">>
Warning: stack imbalance in 'lapply', 125 then 126
Warning: stack imbalance in 'lapply', 55 then 53
Warning: stack imbalance in 'lapply', 30 then 34
Warning: stack imbalance in '<-', 28 then 31
Warning: stack imbalance in '{', 24 then 27
Warning: stack imbalance in '{', 18 then 21
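Just a thought: could this be an RStudio problem? It seems harder to reproduce when the script is run from a terminal. I am running in RStudio because it makes copying the console output easier.
When you say it is not easily reproducible outside RStudio, is it reproducible at all? Even if it only happens in RStudio, it is still something I intend to fix on the data.table side. I am just trying, as another avenue, to confirm that it is definitely "just" the console output and not some other genuine stack imbalance in the fread logic.
I have not reproduced it at all outside RStudio, and I can reproduce it reliably in RStudio (that is, I can reproduce some warnings or a crash). I have tried the Windows command prompt and the git shell (on Windows).
I am using RStudio version 1.1.383 on Windows. Would it help if I raised this with them as well, or would you prefer that I wait?
Thanks. Knowing that it only happens in RStudio is really useful. No need to raise it with them. It just means that it is related to output console buffering (or similar). I have a change in progress that I am about to push.
I don't understand why Windows won't compile that change: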
fread.c:1054:3: warning: too many arguments for format [-Wformat-extra-args]
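It works fine on Linux and Travis. This is blocking the Windows.zip from being created for you to test this workaround. I need to sleep.
(It complains about line 1054 but not about the next line, 1055, which is just the same. There must be some difference. %llu? A __VA_ARGS__ problem on Windows? Surely not.)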
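If the %llu specifier does turn out to be what the Windows toolchain objects to (this is only a guess in the comment above, not a confirmed cause), one portable alternative is the `PRIu64` macro from `<inttypes.h>`, which expands to whichever conversion specifier the C library in use expects for `uint64_t`. A minimal sketch, with `format_bytes` as a hypothetical helper:

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Format a byte count into `out` using PRIu64 rather than a hard-coded
   %llu. Returns the number of characters written (excluding the
   terminator), or -1 if the buffer was too small or formatting failed. */
int format_bytes(char *out, size_t out_size, uint64_t nbytes)
{
    int n = snprintf(out, out_size, "%" PRIu64 " bytes", nbytes);
    return (n < 0 || (size_t)n >= out_size) ? -1 : n;
}
```

For example, `format_bytes(buf, sizeof buf, 366418725)` fills `buf` with the file size reported in the logs above, followed by the word "bytes".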
Second attempt (after restarting)
stopifnot(!requireNamespace("data.table", quietly = TRUE))
install.packages("https://ci.appveyor.com/api/buildjobs/1pi0ae5iuyj9rhj8/artifacts/data.table_1.10.5.zip",
repos = NULL)
# Installing package into ‘C:/Users/hughp/Documents/R/win-library/3.4’
# (as ‘lib’ is unspecified)
# trying URL 'https://ci.appveyor.com/api/buildjobs/1pi0ae5iuyj9rhj8/artifacts/data.table_1.10.5.zip'
# Content type 'application/octet-stream' length 1559167 bytes (1.5 MB)
# downloaded 1.5 MB
#
# package ‘data.table’ successfully unpacked and MD5 sums checked
library(data.table)
# data.table 1.10.5 IN DEVELOPMENT built 2017-11-18 04:58:23 UTC; appveyor
# The fastest way to learn (by data.table authors): https://www.datacamp.com/courses/data-analysis-the-data-table-way
# Documentation: ?data.table, example(data.table) and browseVignettes("data.table")
# Release notes, videos and slides: http://r-datatable.com
fread("SA2-by-DJZ-2011.csv", verbose = TRUE, na.strings = "", header = FALSE)
Log file: C:\Users\hughp\AppData\Local\Temp\RtmpIT9H0D/fread.out
Input contains no \n. Taking this to be a filename to open
Read 11%. ETA 00:00 Warning: stack imbalance in '$', 19 then 20
Read 28%. ETA 00:00 Warning: stack imbalance in '$', 19 then 20
Read 48%. ETA 00:00 Warning: stack imbalance in '$', 20 then 19
Read 98%. ETA 00:00 [01] Check arguments
Using 12 threads (omp_get_max_threads()=12, nth=12)
NAstrings = [<<>>]
None of the NAstrings look like numbers.
show progress = 1
0/1 column will be read as boolean
[02] Opening the file
Opening file SA2-by-DJZ-2011.csv
File opened, size = 349.4MB (366418725 bytes).
Memory mapped ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
\n has been found in the data so any mixture of line endings is allowed other than \r-only line endings. This is common and ideal.
[05] Skipping initial rows if needed
Positioned on line 1 starting: <<Australian Bureau of Statistic>>
[06] Detect separator, quoting rule, and ncolumns
Detecting sep ...
sep=',' with 89 lines of 4 fields using quote rule 0
Detected 4 columns on line 12. This line is either column names or first data row. Line starts as: <<"Goulburn","110018063",3499,>>
Quote rule picked = 0
fill=false and the most number of columns found is 4
[07] Detect column types, good nrow estimate and whether first row is column names
'header' changed by user from 'auto' to false
Number of sampling jump points = 101 because (366418375 bytes from row 1 to eof) / (2 * 1457 jump0size) == 125744
Type codes (jump 000) : 1551 Quote rule 0
Type codes (jump 100) : 1A51 Quote rule 0
=====
Sampled 10027 rows (handled \n inside quoted fields) at 101 jump points
Bytes from first data row on line 12 to the end of last row: 366418143
Line length: mean=16.02 sd=0.21 min=16 max=29
Estimated number of rows: 366418143 / 16.02 = 22877178
Initial alloc = 25164895 rows (22877178 + 9%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
=====
[08] Assign column names
[09] Apply user overrides on column types
After 0 type and 0 drop user overrides : 1A51
[10] Allocate memory for the datatable
Allocating 4 column slots (4 - 0 dropped) with 25164895 rows
[11] Read the data
jumps=[0..360), chunk_size=1017828, total_size=366418143
[11] Read the data
jumps=[0..360), chunk_size=1017828, total_size=366418143
Read 22885380 rows x 4 columns from 349.4MB (366418725 bytes) file in 00:02.822 wall clock time
[12] Finalizing the datatable
Type counts:
1 : bool8 '1'
1 : int32 '5'
2 : string 'A'
=============================
0.000s ( 0%) Memory map 0.341GB file
0.001s ( 0%) sep=',' ncol=4 and header detection
0.000s ( 0%) Column type detection using 10027 sample rows
0.291s ( 10%) Allocation of 25164895 rows x 4 cols (0.469GB) of which 22885380 ( 91%) rows used
2.531s ( 90%) Reading 360 chunks of 0.971MB (63547 rows) using 12 threads
= 0.002s ( 0%) Finding first non-embedded \n after each jump
+ 0.282s ( 10%) Parse to row-major thread buffers (grown 0 times)
+ 1.537s ( 54%) Transpose
+ 0.710s ( 25%) Waiting
0.842s ( 30%) Rereading 1 columns due to out-of-sample type exceptions
2.822s Total
Column 1 ("") bumped from 'bool8' to 'string' due to <<"Goulburn">> on row 0
V1 V2 V3 V4
1: Goulburn 110018063 3499 NA
2: NA 110018064 812 NA
3: NA 110018065 2158 NA
4: NA 110019999 402 NA
5: NA 110028068 10 NA
---
22885376: NA 997999799 0 NA
22885377: NA 998999899 64 NA
22885378: NA 994999499 34 NA
22885379: NA 0&&&&&&&& 250796 NA
22885380: NA 0@@@@@@@@ 7305367 NA
Warning messages:
1: In fread("SA2-by-DJZ-2011.csv", verbose = TRUE, na.strings = "", :
Starting data input on line 12 <<"Goulburn","110018063",3499,>> with 4 fields and discarding line 11 <<"Main Statistical Area Structu>> before it because it has a different number of fields (3).
2: In fread("SA2-by-DJZ-2011.csv", verbose = TRUE, na.strings = "", :
Found the last consistent line but text exists afterwards. Consider fill=TRUE and/or blank.lines.skip=TRUE. First 200 characters of discarded line: <<"Dataset: 2011 Census of Population and Housing">>
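Again, not reproducible outside RStudio.
Thanks for testing so quickly. Well, that certainly rules out a lot! Two more ideas. The first has been pushed and passes. Please try the new Windows.zip here. The alloca is on the stack, and it is related to the na.strings that you are setting. Definitely in the right area (stack imbalance) and worth a try.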
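The general shape of moving such a scratch array from the stack to the heap is sketched below. This is only an illustration of the technique; the thread does not show how fread.c actually uses alloca, and `na_string_lengths` is a hypothetical function.

```c
#include <stdlib.h>
#include <string.h>

/* Compute the length of each NA string into a heap-allocated array
   instead of an alloca() buffer on the stack. The caller owns the result
   and must free() it. Returns NULL when n == 0 or the allocation fails. */
size_t *na_string_lengths(const char *const *na_strings, size_t n)
{
    if (n == 0) return NULL;
    size_t *lens = malloc(n * sizeof *lens);
    if (lens == NULL) return NULL;
    for (size_t i = 0; i < n; i++) {
        lens[i] = strlen(na_strings[i]);
    }
    return lens;
}
```

Unlike alloca, a failed malloc can be detected and reported, and the memory does not depend on the size of the calling thread's stack.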
No problem. I will be away for the next 12 hours, so I won't be able to test until then.
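On Sat, 18 Nov 2017 at 5:20 pm, Matt Dowle [email protected] wrote:
Thanks for testing so quickly. Well, that certainly rules out a lot!
Two more ideas. The first has been pushed and passes. Please try the new Windows.zip here:
https://ci.appveyor.com/project/Rdatatable/data-table/build/1.0.1363/job/fo02vnbu5ebhwy3w/artifacts
The alloca is on the stack, and it is related to the na.strings that you are setting. Definitely in the right area (stack imbalance) and worth a try.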
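OK, no worries. Thanks! I have now pushed the second idea too. I seem to remember \r causing problems on Windows in the past, although I don't recall a stack imbalance. Anyway, to rule it out, I have removed the \r from the progress meter. The stack imbalance messages do seem to be printed where the ETA lines occur. The console catches \r and treats it differently so that replacing the last line works. You should now see a new line each time the ETA updates. This is only temporary, to rule it out. The new Windows.zip has built and passed here.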
fread("SA2-by-DJZ-2011.csv", verbose = TRUE, na.strings = "", header = FALSE)
Log file: C:\Users\hughp\AppData\Local\Temp\RtmpcVjZ1f/fread.out
Input contains no \n. Taking this to be a filename to open
Read 5%. ETA 00:00
Read 8%. ETA 00:00
Read 11%. ETA 00:00
Read 15%. ETA 00:00
Read 18%. ETA 00:00
Read 21%. ETA 00:00
Read 25%. ETA 00:00
Read 28%. ETA 00:00
Read 31%. ETA 00:00
Read 35%. ETA 00:00
Read 38%. ETA 00:00
Read 41%. ETA 00:00
Read 45%. ETA 00:00
Read 48%. ETA 00:00
Read 51%. ETA 00:00
Read 55%. ETA 00:00
Warning: stack imbalance in '$', 30 then 31
Warning: stack imbalance in '$', 17 then 16
Read 58%. ETA 00:00
Read 61%. ETA 00:00
Read 65%. ETA 00:00
Read 68%. ETA 00:00
Read 71%. ETA 00:00
Read 75%. ETA 00:00
Read 78%. ETA 00:00
Read 81%. ETA 00:00
Read 85%. ETA 00:00
Read 88%. ETA 00:00
Read 91%. ETA 00:00
Read 95%. ETA 00:00
Read 98%. ETA 00:00
[01] Check arguments
Using 12 threads (omp_get_max_threads()=12, nth=12)
NAstrings = [<<>>]
None of the NAstrings look like numbers.
show progress = 1
0/1 column will be read as boolean
[02] Opening the file
Opening file SA2-by-DJZ-2011.csv
File opened, size = 349.4MB (366418725 bytes).
Memory mapped ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
\n has been found in the data so any mixture of line endings is allowed other than \r-only line endings. This is common and ideal.
[05] Skipping initial rows if needed
Positioned on line 1 starting: <<Australian Bureau of Statistic>>
[06] Detect separator, quoting rule, and ncolumns
Detecting sep ...
sep=',' with 89 lines of 4 fields using quote rule 0
Detected 4 columns on line 12. This line is either column names or first data row. Line starts as: <<"Goulburn","110018063",3499,>>
Quote rule picked = 0
fill=false and the most number of columns found is 4
[07] Detect column types, good nrow estimate and whether first row is column names
'header' changed by user from 'auto' to false
Number of sampling jump points = 101 because (366418375 bytes from row 1 to eof) / (2 * 1457 jump0size) == 125744
Type codes (jump 000) : 1551 Quote rule 0
Type codes (jump 100) : 1A51 Quote rule 0
=====
Sampled 10027 rows (handled \n inside quoted fields) at 101 jump points
Bytes from first data row on line 12 to the end of last row: 366418143
Line length: mean=16.02 sd=0.21 min=16 max=29
Estimated number of rows: 366418143 / 16.02 = 22877178
Initial alloc = 25164895 rows (22877178 + 9%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
=====
[08] Assign column names
[09] Apply user overrides on column types
After 0 type and 0 drop user overrides : 1A51
[10] Allocate memory for the datatable
Allocating 4 column slots (4 - 0 dropped) with 25164895 rows
[11] Read the data
jumps=[0..360), chunk_size=1017828, total_size=366418143
[11] Read the data
jumps=[0..360), chunk_size=1017828, total_size=366418143
Read 22885380 rows x 4 columns from 349.4MB (366418725 bytes) file in 00:02.894 wall clock time
[12] Finalizing the datatable
Type counts:
1 : bool8 '1'
1 : int32 '5'
2 : string 'A'
=============================
0.001s ( 0%) Memory map 0.341GB file
0.003s ( 0%) sep=',' ncol=4 and header detection
0.000s ( 0%) Column type detection using 10027 sample rows
0.316s ( 11%) Allocation of 25164895 rows x 4 cols (0.469GB) of which 22885380 ( 91%) rows used
2.574s ( 89%) Reading 360 chunks of 0.971MB (63547 rows) using 12 threads
= 0.004s ( 0%) Finding first non-embedded \n after each jump
+ 0.284s ( 10%) Parse to row-major thread buffers (grown 0 times)
+ 1.450s ( 50%) Transpose
+ 0.837s ( 29%) Waiting
0.953s ( 33%) Rereading 1 columns due to out-of-sample type exceptions
2.894s Total
Column 1 ("") bumped from 'bool8' to 'string' due to <<"Goulburn">> on row 0
V1 V2 V3 V4
1: Goulburn 110018063 3499 NA
2: NA 110018064 812 NA
3: NA 110018065 2158 NA
4: NA 110019999 402 NA
5: NA 110028068 10 NA
---
22885376: NA 997999799 0 NA
22885377: NA 998999899 64 NA
22885378: NA 994999499 34 NA
22885379: NA 0&&&&&&&& 250796 NA
22885380: NA 0@@@@@@@@ 7305367 NA
Warning messages:
1: In fread("SA2-by-DJZ-2011.csv", verbose = TRUE, na.strings = "", :
Starting data input on line 12 <<"Goulburn","110018063",3499,>> with 4 fields and discarding line 11 <<"Main Statistical Area Structu>> before it because it has a different number of fields (3).
2: In fread("SA2-by-DJZ-2011.csv", verbose = TRUE, na.strings = "", :
Found the last consistent line but text exists afterwards. Consider fill=TRUE and/or blank.lines.skip=TRUE. First 200 characters of discarded line: <<"Dataset: 2011 Census of Population and Housing">>
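FYI: I cannot reproduce this stack imbalance error on another Windows machine with an older version of RStudio.
In that case, it looks like you could go ahead and ask RStudio support, as you suggested. I went through the fread code again but I am out of ideas. Please tell them both RStudio version numbers. This does not necessarily mean it is RStudio; it could be a fault on the data.table side that just happens to show up in one version of RStudio. But it is odd that it appears to be related to console output and varies with RStudio. I searched for "RStudio stack imbalance", but many of the hits are about package bugs rather than RStudio itself. A hard one to track down. Let's keep this open here and see what they say.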
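I doubt this last attempt will help but, for completeness, please try it here.
However, that particular stack imbalance message comes from eval.c:491 in R itself. Some thread must be running on that line, but I don't think it is fread or data.table. check_stack_balance() is only called from 5 places inside R:
at the end of do_internal() in names.c
twice in applyMethod() in objects.c
twice in eval() in eval.c
I can't see how fread.c could reach any of these in its parallel section. The only entry point called is REprintf, and I don't see how that could reach check_stack_balance(). All I can think of at the moment is that one of RStudio's threads is doing something in the background that interacts with console output, which might be different on Windows.
Finally, for completeness, using REprintf appears to be the right approach, since base R uses it (rather than Rprintf) for its progress meters at libcurl.c:354 and internet.c:409. It is a pity there is no C-level progress bar in R's API (it also seems to be implemented twice in R's own C code).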
@mattdowle, would this help? https://github.com/r-lib/progress
@aadler Yes - thanks! Its source contains this comment:
// In R Studio we should print to stdout, because printing a \r
// to stderr is buggy (reported)
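But I have already removed \r and the stack imbalance still occurs. I don't know where it was reported.
A timely question on R-devel: [Rd] Are Rprintf and REprintf thread-safe?
The upshot: "Rprintf and REprintf are not thread-safe."
Eek!
Thanks for all the links, and thanks to Hugh for raising the issue with RStudio.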
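data.table::fwrite() and data.table::fread() are aware that Rprintf and REprintf are not thread-safe, so for their progress meters they only call them from the main thread. Not only do no two data.table threads ever call that R entry point at the same time, but only the main thread ever calls it, and it is the only R entry point called by any thread at any time during the parallel section. However, Rprintf calls R_CheckUserInterrupt every 100 prints, so I switched to REprintf because it does not call R_CheckUserInterrupt. R internally uses REprintf for its progress meters, so the switch to REprintf was also for consistency with core R; i.e., the choice itself has nothing to do with stderr vs stdout.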
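@kevinushey, would you mind taking a look at this thread and letting me know of anything else I could try? Could it be something to do with RStudio, or with a background thread? If RStudio has a background thread, then it may be possible for Rprintf / REprintf to be called from two threads at the same time. But if that were the case we would have seen more problems before now, so it seems unlikely. Perhaps RStudio replaces the ptr_* callbacks described in R-exts - these relate to console output and interaction. But that section begins "For unix-alikes", so I don't know how Windows comes into it. Perhaps in section 8.1.5 ...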
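I'm away until early December, so unfortunately I won't have a chance to look until then. However, RStudio does run almost everything on the main thread, using the R event loop. The only exceptions are things like project-level file indexing, and those background threads generally do not touch any R API.
RStudio does take over the various ptr_* callbacks, which are used to handle console input and output. I can't immediately think of how they could be the cause here, but I'll try to dig deeper when I get back.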
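OK, please try this one here. Previously it updated the progress status every 2%. In your case the file takes just under 3 seconds, so that is a new progress update to the RStudio console every 0.06 seconds. Perhaps that is too much for RStudio. So this attempt prints a bar instead. It does not use \r at all. That should be better anyway for reports and log files, where \r can clutter the output.
Because your 3-second timing is so fast, I reduced the delay so the progress bar starts at 1 second, provided the ETA is more than 1 second from that point. Otherwise it would not display at all, and it would work on your file only because nothing was shown. Once this is tested, I'll increase it to match fwrite; that is, start at 2 seconds if the ETA is more than 2 seconds from there.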
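As a rough illustration of the bar described above, here is a small R sketch. It is an assumption-laden stand-in, not the C implementation in fread: `make_bar` and `should_show_bar` are hypothetical names, the 50-character width is taken from the bar shown in the later logs, and the start rule is one reading of "start at 1 second if the ETA is more than 1 second from there".

```r
# Hypothetical sketch of a fixed-width progress bar that never uses "\r".
# It prints a ruler once, then appends "=" characters as work completes.
make_bar <- function(width = 50L, con = stderr()) {
  printed <- 0L          # number of "=" characters already written
  started <- FALSE
  function(fraction) {
    if (!started) {
      cat("|", strrep("-", width), "|\n|", sep = "", file = con)
      started <<- TRUE
    }
    target <- as.integer(floor(min(max(fraction, 0), 1) * width))
    if (target > printed) {
      cat(strrep("=", target - printed), file = con)
      printed <<- target
    }
    if (printed == width) {
      cat("|\n", file = con)
      printed <<- width + 1L   # guard so the closing "|" is written only once
    }
    invisible(min(printed, width))
  }
}

# Assumed start rule: show the bar only once `wait` seconds have elapsed
# and the estimated remaining time is still more than `wait` seconds.
should_show_bar <- function(elapsed, fraction, wait = 1) {
  if (elapsed < wait || fraction <= 0) return(FALSE)
  eta <- elapsed * (1 - fraction) / fraction
  eta > wait
}

bar <- make_bar()
for (f in c(0.1, 0.5, 1)) bar(f)
```

Because the bar only ever appends characters, the output stays readable when it is redirected to a log file.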
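Hi @mattdowle. My last comment in #2503 may also be relevant to this issue.
Looks good! No warnings (after 5 runs). First run below (note that the leading whitespace looks different in the actual output):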
stopifnot(!requireNamespace("data.table", quietly = TRUE))
install.packages("https://ci.appveyor.com/api/buildjobs/1o9s06o31v8i3ljr/artifacts/data.table_1.10.5.zip",
repos = NULL)
# Installing package into ‘C:/Users/hughp/Documents/R/win-library/3.4’
# (as ‘lib’ is unspecified)
# trying URL 'https://ci.appveyor.com/api/buildjobs/1o9s06o31v8i3ljr/artifacts/data.table_1.10.5.zip'
# Content type 'application/octet-stream' length 1557423 bytes (1.5 MB)
# downloaded 1.5 MB
#
# package ‘data.table’ successfully unpacked and MD5 sums checked
library(data.table)
# data.table 1.10.5 IN DEVELOPMENT built 2017-11-30 00:21:00 UTC; appveyor
# The fastest way to learn (by data.table authors): https://www.datacamp.com/courses/data-analysis-the-data-table-way
# Documentation: ?data.table, example(data.table) and browseVignettes("data.table")
# Release notes, videos and slides: http://r-datatable.com
setwd("~/ABS-data/inbox/SA2-by-DJZ-2011/")
fread("SA2-by-DJZ-2011.csv", verbose = TRUE, na.strings = "", header = FALSE)
# Input contains no \n. Taking this to be a filename to open
# [01] Check arguments
# Using 12 threads (omp_get_max_threads()=12, nth=12)
# NAstrings = [<<>>]
# None of the NAstrings look like numbers.
# show progress = 1
# 0/1 column will be read as boolean
# [02] Opening the file
# Opening file SA2-by-DJZ-2011.csv
# File opened, size = 349.4MB (366418725 bytes).
# Memory mapped ok
# [03] Detect and skip BOM
# [04] Arrange mmap to be \0 terminated
# \n has been found in the data so any mixture of line endings is allowed other than \r-only line endings. This is common and ideal.
# [05] Skipping initial rows if needed
# Positioned on line 1 starting: <<Australian Bureau of Statistic>>
# [06] Detect separator, quoting rule, and ncolumns
# Detecting sep ...
# sep=',' with 89 lines of 4 fields using quote rule 0
# Detected 4 columns on line 12. This line is either column names or first data row. Line starts as: <<"Goulburn","110018063",3499,>>
# Quote rule picked = 0
# fill=false and the most number of columns found is 4
# [07] Detect column types, good nrow estimate and whether first row is column names
# 'header' changed by user from 'auto' to false
# Number of sampling jump points = 101 because (366418375 bytes from row 1 to eof) / (2 * 1457 jump0size) == 125744
# Type codes (jump 000) : 1551 Quote rule 0
# Type codes (jump 100) : 1A51 Quote rule 0
# =====
# Sampled 10027 rows (handled \n inside quoted fields) at 101 jump points
# Bytes from first data row on line 12 to the end of last row: 366418143
# Line length: mean=16.02 sd=0.21 min=16 max=29
# Estimated number of rows: 366418143 / 16.02 = 22877178
# Initial alloc = 25164895 rows (22877178 + 9%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
# =====
# [08] Assign column names
# [09] Apply user overrides on column types
# After 0 type and 0 drop user overrides : 1A51
# [10] Allocate memory for the datatable
# Allocating 4 column slots (4 - 0 dropped) with 25164895 rows
# [11] Read the data
# jumps=[0..360), chunk_size=1017828, total_size=366418143
# jumps=[0..360), chunk_size=1017828, total_size=366418143
# |--------------------------------------------------|
# |==================================================|
# Read 22885380 rows x 4 columns from 349.4MB (366418725 bytes) file in 00:02.280 wall clock time
# [12] Finalizing the datatable
# Type counts:
# 1 : bool8 '1'
# 1 : int32 '5'
# 2 : string 'A'
# =============================
# 0.005s ( 0%) Memory map 0.341GB file
# 0.037s ( 2%) sep=',' ncol=4 and header detection
# 0.000s ( 0%) Column type detection using 10027 sample rows
# 0.321s ( 14%) Allocation of 25164895 rows x 4 cols (0.469GB) of which 22885380 ( 91%) rows used
# 1.917s ( 84%) Reading 360 chunks of 0.971MB (63547 rows) using 12 threads
# = 0.011s ( 0%) Finding first non-embedded \n after each jump
# + 0.560s ( 25%) Parse to row-major thread buffers (grown 0 times)
# + 0.488s ( 21%) Transpose
# + 0.858s ( 38%) Waiting
# 0.999s ( 44%) Rereading 1 columns due to out-of-sample type exceptions
# 2.280s Total
# Column 1 ("") bumped from 'bool8' to 'string' due to <<"Goulburn">> on row 0
# V1 V2 V3 V4
# 1: Goulburn 110018063 3499 NA
# 2: NA 110018064 812 NA
# 3: NA 110018065 2158 NA
# 4: NA 110019999 402 NA
# 5: NA 110028068 10 NA
# ---
# 22885376: NA 997999799 0 NA
# 22885377: NA 998999899 64 NA
# 22885378: NA 994999499 34 NA
# 22885379: NA 0&&&&&&&& 250796 NA
# 22885380: NA 0@@@@@@@@ 7305367 NA
# Warning messages:
# 1: In fread("SA2-by-DJZ-2011.csv", verbose = TRUE, na.strings = "", :
# Starting data input on line 12 <<"Goulburn","110018063",3499,>> with 4 fields and discarding line 11 <<"Main Statistical Area Structu>> before it because it has a different number of fields (3).
# 2: In fread("SA2-by-DJZ-2011.csv", verbose = TRUE, na.strings = "", :
# Found the last consistent line but text exists afterwards. Consider fill=TRUE and/or blank.lines.skip=TRUE. First 200 characters of discarded line: <<"Dataset: 2011 Census of Population and Housing">>
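@HughParsonage Relief! I think that's a win. I'll tidy up, merge and move on. Many thanks for testing.
@aadler Yes, agreed: your comment in issue #2503 looks like the same thing. Could you also test the latest dev build and confirm that it is fixed? Hopefully the as.IDate problem you found was indeed caused by an earlier stack imbalance.
No good :(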
> library(data.table)
data.table 1.10.5 IN DEVELOPMENT built 2017-11-30 00:21:00 UTC; appveyor
The fastest way to learn (by data.table authors): https://www.datacamp.com/courses/data-analysis-the-data-table-way
Documentation: ?data.table, example(data.table) and browseVignettes("data.table")
Release notes, videos and slides: http://r-datatable.com
> sessionInfo()
R version 3.4.2 beta (2017-09-17 r73296)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows Server >= 2012 x64 (build 9200)
Matrix products: default
locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] data.table_1.10.5
loaded via a namespace (and not attached):
[1] compiler_3.4.2 tools_3.4.2
> DT <- fread('2017-11-22_1999_Performance.csv', header = TRUE, colClasses = CLS, select = SEL, key = c('LoanID', 'Month'), verbose = TRUE)
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
Using 40 threads (omp_get_max_threads()=40, nth=40)
NAstrings = [<<NA>>]
None of the NAstrings look like numbers.
show progress = 1
0/1 column will be read as boolean
[02] Opening the file
Opening file 2017-11-22_1999_Performance.csv
File opened, size = 6.355GB (6823372783 bytes).
Memory mapped ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
\n has been found in the data so any mixture of line endings is allowed other than \r-only line endings. This is common and ideal.
[05] Skipping initial rows if needed
Positioned on line 1 starting: <<Orig_Year,Orig_Qtr,LoanID,Mont>>
[06] Detect separator, quoting rule, and ncolumns
Detecting sep ...
sep=',' with 100 lines of 37 fields using quote rule 0
Detected 37 columns on line 1. This line is either column names or first data row. Line starts as: <<Orig_Year,Orig_Qtr,LoanID,Mont>>
Quote rule picked = 0
fill=false and the most number of columns found is 37
[07] Detect column types, good nrow estimate and whether first row is column names
'header' changed by user from 'auto' to true
Number of sampling jump points = 101 because (6823372781 bytes from row 1 to eof) / (2 * 12905 jump0size) == 264369
Type codes (jump 000) : 51AA7155A15A7111111111111771111177715 Quote rule 0
Type codes (jump 001) : 51AA7155A15A711111111111177111117771A Quote rule 0
Type codes (jump 008) : 51AA7555A15A711111111111177111117771A Quote rule 0
Type codes (jump 009) : 51AA7555AA5A71A155555557177111117775A Quote rule 0
Type codes (jump 042) : 55AA7555AA5A71A155555557177111117775A Quote rule 0
Type codes (jump 064) : 55AA7555AA5A71A1A5555557177111117775A Quote rule 0
Type codes (jump 100) : 55AA7555AA5A71A1A5555557177111117775A Quote rule 0
=====
Sampled 10049 rows (handled \n inside quoted fields) at 101 jump points
Bytes from first data row on line 1 to the end of last row: 6823372781
Line length: mean=126.15 sd=8.30 min=100 max=359
Estimated number of rows: 6823372781 / 126.15 = 54088821
Initial alloc = 62279495 rows (54088821 + 15%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
=====
[08] Assign column names
[09] Apply user overrides on column types
After 4 type and 23 drop user overrides : 00AA700000000000000000070775555077750
[10] Allocate memory for the datatable
Allocating 14 column slots (37 - 23 dropped) with 62279495 rows
[11] Read the data
jumps=[0..6520), chunk_size=1046529, total_size=6823372422
|--------------------------------------------------|
|=======Warning: stack imbalance in '$', 27 then 26
===Warning: stack imbalance in '$', 26 then 27
================Error in fread("2017-11-22_1999_Performance.csv", header = TRUE, colClasses = CLS, :
unprotect_ptr: pointer not found
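@aadler Thanks for the report. I went through freadR and localized the protection. There is maybe a 30% chance this will work, since in your case you are overriding types and there is a lot of protection in that part of the code. Please try again with this version.
@aadler If you haven't tried that last version yet, please go straight to this one. Also, if you can get a copy of the file to me, I may be able to try it on Windows RStudio.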
:(
> library(data.table)
data.table 1.10.5 IN DEVELOPMENT built 2017-12-01 01:54:04 UTC; appveyor
The fastest way to learn (by data.table authors): https://www.datacamp.com/courses/data-analysis-the-data-table-way
Documentation: ?data.table, example(data.table) and browseVignettes("data.table")
Release notes, videos and slides: http://r-datatable.com
> sessionInfo()
R version 3.4.2 beta (2017-09-17 r73296)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows Server >= 2012 x64 (build 9200)
Matrix products: default
locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] data.table_1.10.5
loaded via a namespace (and not attached):
[1] compiler_3.4.2 tools_3.4.2
> ColCLASS <- c(rep('integer', 2L), 'character', 'Date', 'numeric',
+ rep('integer', 3L), rep('character', 2L),
+ 'integer', 'Date', rep('numeric', 2L), 'Date',
+ rep('numeric', 12L), rep('integer', 5),
+ rep('numeric', 3L), 'integer', 'character')
> SELCOL <- c(WHATEVER)
> DT <- fread('LargeFile.csv', header = TRUE, colClasses = ColCLASS, select = SELCOL, key = c('LoanID', 'Month'), verbose = TRUE)
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
Using 40 threads (omp_get_max_threads()=40, nth=40)
NAstrings = [<<NA>>]
None of the NAstrings look like numbers.
show progress = 1
0/1 column will be read as boolean
[02] Opening the file
Opening file LargeFile.csv
File opened, size = 6.355GB (6823372783 bytes).
Memory mapped ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
\n has been found in the data so any mixture of line endings is allowed other than \r-only line endings. This is common and ideal.
[05] Skipping initial rows if needed
Positioned on line 1 starting: <<Orig_Year,Orig_Qtr,LoanID,Mont>>
[06] Detect separator, quoting rule, and ncolumns
Detecting sep ...
sep=',' with 100 lines of 37 fields using quote rule 0
Detected 37 columns on line 1. This line is either column names or first data row. Line starts as: <<Orig_Year,Orig_Qtr,LoanID,Mont>>
Quote rule picked = 0
fill=false and the most number of columns found is 37
[07] Detect column types, good nrow estimate and whether first row is column names
'header' changed by user from 'auto' to true
Number of sampling jump points = 101 because (6823372781 bytes from row 1 to eof) / (2 * 12905 jump0size) == 264369
Type codes (jump 000) : 51AA7155A15A7111111111111771111177715 Quote rule 0
Type codes (jump 001) : 51AA7155A15A711111111111177111117771A Quote rule 0
Type codes (jump 008) : 51AA7555A15A711111111111177111117771A Quote rule 0
Type codes (jump 009) : 51AA7555AA5A71A155555557177111117775A Quote rule 0
Type codes (jump 042) : 55AA7555AA5A71A155555557177111117775A Quote rule 0
Type codes (jump 064) : 55AA7555AA5A71A1A5555557177111117775A Quote rule 0
Type codes (jump 100) : 55AA7555AA5A71A1A5555557177111117775A Quote rule 0
=====
Sampled 10049 rows (handled \n inside quoted fields) at 101 jump points
Bytes from first data row on line 1 to the end of last row: 6823372781
Line length: mean=126.15 sd=8.30 min=100 max=359
Estimated number of rows: 6823372781 / 126.15 = 54088821
Initial alloc = 62279495 rows (54088821 + 15%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
=====
[08] Assign column names
[09] Apply user overrides on column types
After 4 type and 23 drop user overrides : 00AA700000000000000000070775555077750
[10] Allocate memory for the datatable
Allocating 14 column slots (37 - 23 dropped) with 62279495 rows
[11] Read the data
jumps=[0..6520), chunk_size=1046529, total_size=6823372422
|--------------------------------------------------|
|Error in fread("LargeFile.csv", header = TRUE, colClasses = ColCLASS, :
unprotect_ptr: pointer not found
Thanks to @aadler by email, I can now reproduce. R 3.4.2, latest RStudio 1.1.383, and Windows 10 Pro 10.0.16299 Build 16299.
I see strange behaviour in RStudio, recorded here:
https://www.youtube.com/watch?v=tl2x2vmZxMU
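It seems RStudio generates GCs merely by typing. Why is that, and is there a way to turn it off? Could it be that when fread() prints its progress bar, RStudio's separate event loop thinks the output to the console is the user typing and calls into R, thereby causing a GC and tripping everything up? Perhaps an RStudio user here knows and can point me in the right direction, or @kevinushey is back (Kevin, you did say early December, and today is the 1st).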
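I can reliably reproduce the stack imbalance in the RStudio Console. Using RStudio's Terminal tab, I cannot reproduce it, even with gcinfo(TRUE). Interestingly, GCs do occur while the progress bar is printing and that appears to be fine, as it is on Linux too. Given the behaviour in the RStudio Console video, I conclude this is an RStudio Console bug. I couldn't copy text out of the RStudio Terminal window (Edit -> Copy doesn't work, nor does Ctrl-C), so I took a screenshot of the Terminal tab to show that GC during the progress bar is fine. I would expect it to be fine, since only the main thread is calling REprintf and the other threads do not call any R API at all.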
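For anyone wanting to repeat this kind of diagnosis, the following base R sketch shows the gcinfo() pattern used in the sessions in this thread: switch GC reporting on, run the code of interest, then restore the previous setting. The explicit gc() call is only a stand-in for the real workload (the fread call), so that at least one report is printed.

```r
# Sketch: switch on GC reporting around a block of work, then restore it.
old <- gcinfo(TRUE)       # gcinfo() returns the previous setting
invisible(gc())           # stand-in workload: force one collection
invisible(gcinfo(old))    # restore whatever was set before
```

Each "Garbage collection N = ..." line printed while reporting is on marks one collection, which is how the logs here show GCs landing in the middle of the progress bar.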
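Works fine in RStudio Terminal:
Note that there is a GC the first time the progress bar prints, and it works fine in RStudio Terminal. The progress bar prints a second time because there are out-of-sample type exceptions in this test file, which trigger an automatic reread of just those columns.
But in the RStudio Console there is either stack imbalance or unprotect_ptr: pointer not found: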
R version 3.4.2 (2017-09-28) -- "Short Summer"
> gcinfo(TRUE)
[1] FALSE
Garbage collection 22 = 16+3+3 (level 0) ...
25.5 Mbytes of cons cells used (80%)
6.6 Mbytes of vectors used (66%)
Garbage collection 23 = 16+4+3 (level 1) ...
24.9 Mbytes of cons cells used (79%)
6.5 Mbytes of vectors used (64%)
Garbage collection 24 = 17+4+3 (level 0) ...
25.1 Mbytes of cons cells used (79%)
6.5 Mbytes of vectors used (64%)
Garbage collection 25 = 18+4+3 (level 0) ...
25.1 Mbytes of cons cells used (79%)
6.5 Mbytes of vectors used (64%)
Garbage collection 26 = 19+4+3 (level 0) ...
25.1 Mbytes of cons cells used (79%)
6.5 Mbytes of vectors used (65%)
Garbage collection 27 = 20+4+3 (level 0) ...
25.2 Mbytes of cons cells used (80%)
6.5 Mbytes of vectors used (65%)
Garbage collection 28 = 20+5+3 (level 1) ...
25.1 Mbytes of cons cells used (79%)
6.5 Mbytes of vectors used (65%)
Garbage collection 29 = 21+5+3 (level 0) ...
25.1 Mbytes of cons cells used (79%)
6.5 Mbytes of vectors used (65%)
Garbage collection 30 = 22+5+3 (level 0) ...
25.2 Mbytes of cons cells used (80%)
6.5 Mbytes of vectors used (65%)
Garbage collection 31 = 23+5+3 (level 0) ...
25.2 Mbytes of cons cells used (80%)
6.6 Mbytes of vectors used (66%)
Garbage collection 32 = 24+5+3 (level 0) ...
25.3 Mbytes of cons cells used (80%)
6.6 Mbytes of vectors used (66%)
Garbage collection 33 = 25+5+3 (level 0) ...
25.4 Mbytes of cons cells used (80%)
6.7 Mbytes of vectors used (66%)
Garbage collection 34 = 25+5+4 (level 2) ...
24.6 Mbytes of cons cells used (61%)
6.4 Mbytes of vectors used (50%)
Garbage collection 35 = 26+5+4 (level 0) ...
25.0 Mbytes of cons cells used (62%)
6.5 Mbytes of vectors used (52%)
> require(data.table)
Loading required package: data.table
Garbage collection 36 = 27+5+4 (level 0) ...
27.2 Mbytes of cons cells used (68%)
7.1 Mbytes of vectors used (56%)
data.table 1.10.5 IN DEVELOPMENT built 2017-12-02 01:04:34 UTC; appveyor
The fastest way to learn (by data.table authors): https://www.datacamp.com/courses/data-analysis-the-data-table-way
Documentation: ?data.table, example(data.table) and browseVignettes("data.table")
Release notes, videos and slides: http://r-datatable.com
Garbage collection 37 = 28+5+4 (level 0) ...
27.7 Mbytes of cons cells used (69%)
7.3 Mbytes of vectors used (58%)
Garbage collection 38 = 29+5+4 (level 0) ...
28.0 Mbytes of cons cells used (70%)
7.4 Mbytes of vectors used (58%)
Garbage collection 39 = 30+5+4 (level 0) ...
28.1 Mbytes of cons cells used (70%)
7.4 Mbytes of vectors used (59%)
Garbage collection 40 = 31+5+4 (level 0) ...
28.2 Mbytes of cons cells used (70%)
7.5 Mbytes of vectors used (59%)
Garbage collection 41 = 32+5+4 (level 0) ...
28.4 Mbytes of cons cells used (71%)
7.5 Mbytes of vectors used (59%)
> DT = fread("/Users/pasha/Downloads/LargeFile.csv")
Garbage collection 42 = 32+5+5 (level 2) ...
27.4 Mbytes of cons cells used (54%)
7.1 Mbytes of vectors used (2%)
Garbage collection 43 = 32+5+6 (level 2) ...
27.4 Mbytes of cons cells used (54%)
244.7 Mbytes of vectors used (42%)
Garbage collection 44 = 32+5+7 (level 2) ...
27.4 Mbytes of cons cells used (54%)
482.3 Mbytes of vectors used (42%)
Garbage collection 45 = 32+5+8 (level 2) ...
27.4 Mbytes of cons cells used (54%)
957.4 Mbytes of vectors used (56%)
Garbage collection 46 = 32+5+9 (level 2) ...
27.4 Mbytes of cons cells used (54%)
1432.6 Mbytes of vectors used (63%)
Garbage collection 47 = 32+5+10 (level 2) ...
27.4 Mbytes of cons cells used (54%)
2145.3 Mbytes of vectors used (75%)
Garbage collection 48 = 32+5+11 (level 2) ...
27.4 Mbytes of cons cells used (54%)
2620.4 Mbytes of vectors used (71%)
Garbage collection 49 = 32+5+12 (level 2) ...
27.4 Mbytes of cons cells used (54%)
3570.8 Mbytes of vectors used (78%)
Garbage collection 50 = 32+5+13 (level 2) ...
27.4 Mbytes of cons cells used (54%)
4283.5 Mbytes of vectors used (75%)
Garbage collection 51 = 32+5+14 (level 2) ...
27.4 Mbytes of cons cells used (54%)
5709.0 Mbytes of vectors used (77%)
Garbage collection 52 = 32+5+15 (level 2) ...
27.4 Mbytes of cons cells used (54%)
7372.0 Mbytes of vectors used (81%)
Garbage collection 53 = 32+5+16 (level 2) ...
27.4 Mbytes of cons cells used (54%)
8797.5 Mbytes of vectors used (79%)
Garbage collection 54 = 32+5+17 (level 2) ...
27.4 Mbytes of cons cells used (54%)
10935.7 Mbytes of vectors used (80%)
|--------------------------------------------------|
|=====Error in fread("LargeFile.csv") :
unprotect_ptr: pointer not found
>
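showProgress=FALSE reliably resolves the problem in the RStudio Console. To reproduce, it has to be the first run in a fresh RStudio Console with showProgress=TRUE (i.e. the default). It seems to depend on whether a GC occurs during the progress meter; the first run is in a fresh session. It just needs a file large enough for the progress meter to display. It has nothing to do with the reread or with the arguments passed to fread. If the first run in a fresh RStudio Console uses showProgress=FALSE so that it works, that run grows R's heap, and subsequent runs in the same session with showProgress=TRUE then work too. But that is only because there is no GC during the progress meter, since the first run has already grown the heap.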
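The workaround above amounts to passing showProgress = FALSE to fread(). A minimal sketch follows; the tiny temporary CSV is a made-up stand-in for the large files in this thread (a file this small would not display a progress meter anyway), and the requireNamespace() guard is only there so the snippet runs even where data.table is not installed.

```r
# Workaround sketch: disable fread's progress meter.
# The file is a tiny stand-in created for illustration only.
tmp <- tempfile(fileext = ".csv")
writeLines(c("a,b", "1,x", "2,y", "3,z"), tmp)

DT <- NULL
if (requireNamespace("data.table", quietly = TRUE)) {
  DT <- data.table::fread(tmp, showProgress = FALSE)
  print(DT)
}
unlink(tmp)
```

This only avoids the symptom by never printing from the main thread during the parallel read; it does not address whatever is racing with the progress output.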
Why is a GC on the main thread during the progress meter fine on Linux and in the Windows RStudio Terminal, but not in the RStudio Console?
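OK, this should fix it. The problem was in data.table, not RStudio. It now works reliably for me in the RStudio Console on Windows. It is a problem that could also occur on Linux and Mac; it is just that the memory patterns there did not trigger it. The other threads did have an R entry point after all (when pushing their buffers for string columns), and that could happen at the same time as the main thread printing progress with REprintf. That is also why it only happened on the first run in a fresh session: from the second run onwards, every string in the file has already been seen, so the cache lookup is a hit (thread-safe) rather than an allocation (not thread-safe).
So, @aadler and @HughParsonage, please try this one. There is now a 95% chance it works!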
No warnings; not sure whether you were looking for anything else:
> gcinfo(TRUE)
[1] FALSE
> fread("SA2-by-DJZ-2011.csv", verbose = TRUE, na.strings = "", header = FALSE)
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
Using 12 threads (omp_get_max_threads()=12, nth=12)
NAstrings = [<<>>]
None of the NAstrings look like numbers.
show progress = 1
0/1 column will be read as boolean
[02] Opening the file
Opening file SA2-by-DJZ-2011.csv
File opened, size = 349.4MB (366418725 bytes).
Memory mapped ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
\n has been found in the data so any mixture of line endings is allowed other than \r-only line endings. This is common and ideal.
[05] Skipping initial rows if needed
Positioned on line 1 starting: <<Australian Bureau of Statistic>>
[06] Detect separator, quoting rule, and ncolumns
Detecting sep ...
sep=',' with 89 lines of 4 fields using quote rule 0
Detected 4 columns on line 12. This line is either column names or first data row. Line starts as: <<"Goulburn","110018063",3499,>>
Quote rule picked = 0
fill=false and the most number of columns found is 4
[07] Detect column types, good nrow estimate and whether first row is column names
'header' changed by user from 'auto' to false
Number of sampling jump points = 101 because (366418375 bytes from row 1 to eof) / (2 * 1457 jump0size) == 125744
Type codes (jump 000) : 1551 Quote rule 0
Type codes (jump 100) : 1A51 Quote rule 0
=====
Sampled 10027 rows (handled \n inside quoted fields) at 101 jump points
Bytes from first data row on line 12 to the end of last row: 366418143
Line length: mean=16.02 sd=0.21 min=16 max=29
Estimated number of rows: 366418143 / 16.02 = 22877178
Initial alloc = 25164895 rows (22877178 + 9%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
=====
[08] Assign column names
[09] Apply user overrides on column types
After 0 type and 0 drop user overrides : 1A51
[10] Allocate memory for the datatable
Allocating 4 column slots (4 - 0 dropped) with 25164895 rows
Garbage collection 53 = 36+5+12 (level 2) ...
30.3 Mbytes of cons cells used (60%)
7.9 Mbytes of vectors used (1%)
[11] Read the data
jumps=[0..360), chunk_size=1017828, total_size=366418143
|--------------------------------------------------|
|==================================================|
Garbage collection 54 = 37+5+12 (level 0) ...
30.8 Mbytes of cons cells used (61%)
566.6 Mbytes of vectors used (74%)
Garbage collection 55 = 37+6+12 (level 1) ...
30.8 Mbytes of cons cells used (61%)
549.2 Mbytes of vectors used (72%)
jumps=[0..360), chunk_size=1017828, total_size=366418143
|--------------------------------------------------|
|==================================================|
Read 22885380 rows x 4 columns from 349.4MB (366418725 bytes) file in 00:02.626 wall clock time
[12] Finalizing the datatable
Type counts:
1 : bool8 '1'
1 : int32 '5'
2 : string 'A'
=============================
0.002s ( 0%) Memory map 0.341GB file
0.005s ( 0%) sep=',' ncol=4 and header detection
0.000s ( 0%) Column type detection using 10027 sample rows
0.469s ( 18%) Allocation of 25164895 rows x 4 cols (0.469GB) of which 22885380 ( 91%) rows used
2.150s ( 82%) Reading 360 chunks of 0.971MB (63547 rows) using 12 threads
= 0.103s ( 4%) Finding first non-embedded \n after each jump
+ 0.230s ( 9%) Parse to row-major thread buffers (grown 0 times)
+ 0.718s ( 27%) Transpose
+ 1.099s ( 42%) Waiting
0.745s ( 28%) Rereading 1 columns due to out-of-sample type exceptions
2.626s Total
Column 1 ("") bumped from 'bool8' to 'string' due to <<"Goulburn">> on row 0
Garbage collection 56 = 37+6+13 (level 2) ...
31.1 Mbytes of cons cells used (62%)
531.9 Mbytes of vectors used (70%)
Garbage collection 57 = 38+6+13 (level 0) ...
31.1 Mbytes of cons cells used (62%)
532.0 Mbytes of vectors used (70%)
V1 V2 V3 V4
1: Goulburn 110018063 3499 NA
2: NA 110018064 812 NA
3: NA 110018065 2158 NA
4: NA 110019999 402 NA
5: NA 110028068 10 NA
---
22885376: NA 997999799 0 NA
22885377: NA 998999899 64 NA
22885378: NA 994999499 34 NA
22885379: NA 0&&&&&&&& 250796 NA
22885380: NA 0@@@@@@@@ 7305367 NA
Warning messages:
1: In fread("SA2-by-DJZ-2011.csv", verbose = TRUE, na.strings = "", :
Starting data input on line 12 <<"Goulburn","110018063",3499,>> with 4 fields and discarding line 11 <<"Main Statistical Area Structu>> before it because it has a different number of fields (3).
2: In fread("SA2-by-DJZ-2011.csv", verbose = TRUE, na.strings = "", :
Found the last consistent line but text exists afterwards. Consider fill=TRUE and/or blank.lines.skip=TRUE. First 200 characters of discarded line: <<"Dataset: 2011 Census of Population and Housing">>
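Thanks, Hugh. Yes, if that was in a fresh RStudio Console session, that is a clean run. No sign of stack imbalance or "unprotect_ptr: pointer not found" messages, and the progress bar ran fine (twice in this case, because of the reread). Now we just need @aadler to confirm.
Success.
First run, fresh instance of RStudio.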
> library(data.table)
data.table 1.10.5 IN DEVELOPMENT built 2017-12-02 12:05:42 UTC; appveyor
The fastest way to learn (by data.table authors): https://www.datacamp.com/courses/data-analysis-the-data-table-way
Documentation: ?data.table, example(data.table) and browseVignettes("data.table")
Release notes, videos and slides: http://r-datatable.com
> sessionInfo()
R version 3.4.2 beta (2017-09-17 r73296)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows Server >= 2012 x64 (build 9200)
Matrix products: default
locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] data.table_1.10.5
loaded via a namespace (and not attached):
[1] compiler_3.4.2 tools_3.4.2
> DT <- fread('LargeFile.csv', colClasses = colCLASS, select = colSEL, header = TRUE, key = c('LoanID', 'Month'), verbose = TRUE)
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
Using 40 threads (omp_get_max_threads()=40, nth=40)
NAstrings = [<<NA>>]
None of the NAstrings look like numbers.
show progress = 1
0/1 column will be read as boolean
[02] Opening the file
Opening file LargeFile.csv
File opened, size = 6.355GB (6823372783 bytes).
Memory mapped ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
\n has been found in the data so any mixture of line endings is allowed other than \r-only line endings. This is common and ideal.
[05] Skipping initial rows if needed
Positioned on line 1 starting: <<Orig_Year,Orig_Qtr,LoanID,Mont>>
[06] Detect separator, quoting rule, and ncolumns
Detecting sep ...
sep=',' with 100 lines of 37 fields using quote rule 0
Detected 37 columns on line 1. This line is either column names or first data row. Line starts as: <<Orig_Year,Orig_Qtr,LoanID,Mont>>
Quote rule picked = 0
fill=false and the most number of columns found is 37
[07] Detect column types, good nrow estimate and whether first row is column names
'header' changed by user from 'auto' to true
Number of sampling jump points = 101 because (6823372781 bytes from row 1 to eof) / (2 * 12905 jump0size) == 264369
Type codes (jump 000) : 51AA7155A15A7111111111111771111177715 Quote rule 0
Type codes (jump 001) : 51AA7155A15A711111111111177111117771A Quote rule 0
Type codes (jump 008) : 51AA7555A15A711111111111177111117771A Quote rule 0
Type codes (jump 009) : 51AA7555AA5A71A155555557177111117775A Quote rule 0
Type codes (jump 042) : 55AA7555AA5A71A155555557177111117775A Quote rule 0
Type codes (jump 064) : 55AA7555AA5A71A1A5555557177111117775A Quote rule 0
Type codes (jump 100) : 55AA7555AA5A71A1A5555557177111117775A Quote rule 0
=====
Sampled 10049 rows (handled \n inside quoted fields) at 101 jump points
Bytes from first data row on line 1 to the end of last row: 6823372781
Line length: mean=126.15 sd=8.30 min=100 max=359
Estimated number of rows: 6823372781 / 126.15 = 54088821
Initial alloc = 62279495 rows (54088821 + 15%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
=====
[08] Assign column names
[09] Apply user overrides on column types
After 4 type and 23 drop user overrides : 00AA700000000000000000070775555077750
[10] Allocate memory for the datatable
Allocating 14 column slots (37 - 23 dropped) with 62279495 rows
[11] Read the data
jumps=[0..6520), chunk_size=1046529, total_size=6823372422
|--------------------------------------------------|
|==================================================|
Read 53945186 rows x 14 columns from 6.355GB (6823372783 bytes) file in 00:25.938 wall clock time
[12] Finalizing the datatable
Type counts:
23 : drop '0'
5 : int32 '5'
7 : float64 '7'
2 : string 'A'
=============================
0.005s ( 0%) Memory map 6.355GB file
0.025s ( 0%) sep=',' ncol=37 and header detection
0.001s ( 0%) Column type detection using 10049 sample rows
4.681s ( 18%) Allocation of 62279495 rows x 37 cols (5.336GB) of which 53945186 ( 87%) rows used
21.226s ( 82%) Reading 6520 chunks of 0.998MB (8295 rows) using 40 threads
= 0.485s ( 2%) Finding first non-embedded \n after each jump
+ 1.465s ( 6%) Parse to row-major thread buffers (grown 0 times)
+ 9.095s ( 35%) Transpose
+ 10.181s ( 39%) Waiting
0.000s ( 0%) Rereading 0 columns due to out-of-sample type exceptions
25.938s Total
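The row estimate and initial allocation reported in step [07] can be reproduced approximately from the logged figures. This is a minimal sketch, assuming the formula is exactly what the log message describes (bytes/max(mean-2*sd,min), clamped between [1.1*estn, 2.0*estn]). The variable names are illustrative, and because the log prints mean and sd rounded to two decimals, the results differ from the logged values in the last few digits.

```r
# Figures copied from the verbose output above; mean and sd are rounded there,
# so the results only approximate the logged 54088821 and 62279495.
bytes    <- 6823372781  # bytes from the first data row to the end of the last row
mean_len <- 126.15      # mean sampled line length
sd_len   <- 8.30        # standard deviation of sampled line length
min_len  <- 100         # shortest sampled line

# Estimated row count: total bytes divided by the mean line length.
est_rows <- bytes / mean_len

# Initial allocation: assume lines are shorter than average (mean - 2*sd,
# but no shorter than the minimum seen), then clamp to [1.1, 2.0] * est_rows.
alloc_rows <- bytes / max(mean_len - 2 * sd_len, min_len)
alloc_rows <- min(max(alloc_rows, 1.1 * est_rows), 2.0 * est_rows)

cat(sprintf("estimated rows:     %.0f\n", est_rows))
cat(sprintf("initial allocation: %.0f (%.1f%% above the estimate)\n",
            alloc_rows, 100 * (alloc_rows / est_rows - 1)))
```

The read finished with 53945186 rows, which is about 87% of the allocated 62279495, matching the "rows used" figure in the timing summary.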
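Closed RStudio and reopened it so that the string cache would not be active, then ran it again with gcinfo(TRUE). As an added bonus, the conversion to IDate completed (although it took more than 40 seconds) :).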
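For anyone who wants to repeat this kind of check, here is a minimal base-R sketch of turning GC reporting on around a single expression and restoring the previous setting afterwards. `with_gcinfo` is a hypothetical helper name, not part of data.table or base R; it relies only on `gcinfo()` returning the previous setting, which the session below also shows.

```r
# Hypothetical helper: evaluate `expr` with garbage-collection reporting
# enabled, then restore whatever setting was active before the call.
with_gcinfo <- function(expr) {
  old <- gcinfo(TRUE)               # gcinfo() returns the previous setting
  on.exit(gcinfo(old), add = TRUE)  # restore it even if expr fails
  force(expr)                       # the promise is evaluated here, under gcinfo(TRUE)
}

# Usage: allocate a moderately large vector so that a collection is likely to
# be reported; whether and how many GC lines appear depends on the session.
x <- with_gcinfo(numeric(5e6))
length(x)
```

Wrapping the call this way avoids leaving GC reporting switched on if the read fails partway through.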
> colCLASS <- c(rep('integer', 2L), 'character', 'Date', 'numeric',
+ rep('integer', 3L), rep('character', 2L),
+ 'integer', 'Date', rep('numeric', 2L), 'Date',
+ rep('numeric', 12L), rep('integer', 5),
+ rep('numeric', 3L), 'integer', 'character')
> library(data.table)
data.table 1.10.5 IN DEVELOPMENT built 2017-12-02 12:05:42 UTC; appveyor
The fastest way to learn (by data.table authors): https://www.datacamp.com/courses/data-analysis-the-data-table-way
Documentation: ?data.table, example(data.table) and browseVignettes("data.table")
Release notes, videos and slides: http://r-datatable.com
> gcinfo(TRUE)
[1] FALSE
> sessionInfo()
R version 3.4.2 beta (2017-09-17 r73296)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows Server >= 2012 x64 (build 9200)
Matrix products: default
locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] data.table_1.10.5
loaded via a namespace (and not attached):
[1] compiler_3.4.2 tools_3.4.2
Garbage collection 46 = 36+5+5 (level 0) ...
38.6 Mbytes of cons cells used (77%)
11.2 Mbytes of vectors used (71%)
Garbage collection 47 = 37+5+5 (level 0) ...
38.7 Mbytes of cons cells used (77%)
11.2 Mbytes of vectors used (71%)
Garbage collection 48 = 38+5+5 (level 0) ...
38.8 Mbytes of cons cells used (77%)
11.2 Mbytes of vectors used (71%)
Garbage collection 49 = 39+5+5 (level 0) ...
39.0 Mbytes of cons cells used (78%)
11.2 Mbytes of vectors used (71%)
Garbage collection 50 = 40+5+5 (level 0) ...
39.1 Mbytes of cons cells used (78%)
11.3 Mbytes of vectors used (71%)
Garbage collection 51 = 40+6+5 (level 1) ...
38.8 Mbytes of cons cells used (77%)
11.3 Mbytes of vectors used (71%)
Garbage collection 52 = 41+6+5 (level 0) ...
38.9 Mbytes of cons cells used (77%)
11.3 Mbytes of vectors used (71%)
Garbage collection 53 = 42+6+5 (level 0) ...
41.5 Mbytes of cons cells used (83%)
12.2 Mbytes of vectors used (77%)
Garbage collection 54 = 42+7+5 (level 1) ...
43.4 Mbytes of cons cells used (86%)
12.8 Mbytes of vectors used (81%)
Garbage collection 55 = 42+7+6 (level 2) ...
44.7 Mbytes of cons cells used (72%)
13.0 Mbytes of vectors used (67%)
Garbage collection 56 = 43+7+6 (level 0) ...
46.5 Mbytes of cons cells used (74%)
13.6 Mbytes of vectors used (70%)
Garbage collection 57 = 44+7+6 (level 0) ...
47.0 Mbytes of cons cells used (75%)
13.8 Mbytes of vectors used (71%)
Garbage collection 58 = 45+7+6 (level 0) ...
47.4 Mbytes of cons cells used (76%)
13.9 Mbytes of vectors used (71%)
Garbage collection 59 = 46+7+6 (level 0) ...
47.7 Mbytes of cons cells used (76%)
14.2 Mbytes of vectors used (73%)
Garbage collection 60 = 47+7+6 (level 0) ...
48.0 Mbytes of cons cells used (77%)
14.2 Mbytes of vectors used (73%)
Garbage collection 61 = 48+7+6 (level 0) ...
48.1 Mbytes of cons cells used (77%)
14.3 Mbytes of vectors used (73%)
> DT <- fread('LargeFile.csv', header = TRUE, colClasses = colCLASS, select = colSEL, key = c('LoanID', 'Month'), verbose = TRUE)
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
Using 40 threads (omp_get_max_threads()=40, nth=40)
NAstrings = [<<NA>>]
None of the NAstrings look like numbers.
show progress = 1
0/1 column will be read as boolean
[02] Opening the file
Opening file LargeFile.csv
File opened, size = 6.355GB (6823372783 bytes).
Memory mapped ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
\n has been found in the data so any mixture of line endings is allowed other than \r-only line endings. This is common and ideal.
[05] Skipping initial rows if needed
Positioned on line 1 starting: <<Orig_Year,Orig_Qtr,LoanID,Mont>>
[06] Detect separator, quoting rule, and ncolumns
Detecting sep ...
sep=',' with 100 lines of 37 fields using quote rule 0
Detected 37 columns on line 1. This line is either column names or first data row. Line starts as: <<Orig_Year,Orig_Qtr,LoanID,Mont>>
Quote rule picked = 0
fill=false and the most number of columns found is 37
[07] Detect column types, good nrow estimate and whether first row is column names
'header' changed by user from 'auto' to true
Number of sampling jump points = 101 because (6823372781 bytes from row 1 to eof) / (2 * 12905 jump0size) == 264369
Type codes (jump 000) : 51AA7155A15A7111111111111771111177715 Quote rule 0
Type codes (jump 001) : 51AA7155A15A711111111111177111117771A Quote rule 0
Type codes (jump 008) : 51AA7555A15A711111111111177111117771A Quote rule 0
Type codes (jump 009) : 51AA7555AA5A71A155555557177111117775A Quote rule 0
Type codes (jump 042) : 55AA7555AA5A71A155555557177111117775A Quote rule 0
Type codes (jump 064) : 55AA7555AA5A71A1A5555557177111117775A Quote rule 0
Type codes (jump 100) : 55AA7555AA5A71A1A5555557177111117775A Quote rule 0
=====
Sampled 10049 rows (handled \n inside quoted fields) at 101 jump points
Bytes from first data row on line 1 to the end of last row: 6823372781
Line length: mean=126.15 sd=8.30 min=100 max=359
Estimated number of rows: 6823372781 / 126.15 = 54088821
Initial alloc = 62279495 rows (54088821 + 15%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
=====
[08] Assign column names
[09] Apply user overrides on column types
After 4 type and 23 drop user overrides : 00AA700000000000000000070775555077750
[10] Allocate memory for the datatable
Allocating 14 column slots (37 - 23 dropped) with 62279495 rows
Garbage collection 62 = 48+7+7 (level 2) ...
46.5 Mbytes of cons cells used (60%)
13.6 Mbytes of vectors used (2%)
Garbage collection 63 = 48+7+8 (level 2) ...
46.5 Mbytes of cons cells used (60%)
488.7 Mbytes of vectors used (42%)
Garbage collection 64 = 48+7+9 (level 2) ...
46.5 Mbytes of cons cells used (60%)
963.9 Mbytes of vectors used (56%)
Garbage collection 65 = 48+7+10 (level 2) ...
46.5 Mbytes of cons cells used (60%)
1439.1 Mbytes of vectors used (63%)
Garbage collection 66 = 48+7+11 (level 2) ...
46.5 Mbytes of cons cells used (60%)
1914.2 Mbytes of vectors used (67%)
Garbage collection 67 = 48+7+12 (level 2) ...
46.5 Mbytes of cons cells used (60%)
2864.5 Mbytes of vectors used (77%)
Garbage collection 68 = 48+7+13 (level 2) ...
46.5 Mbytes of cons cells used (60%)
3577.3 Mbytes of vectors used (78%)
Garbage collection 69 = 48+7+14 (level 2) ...
46.5 Mbytes of cons cells used (60%)
4290.0 Mbytes of vectors used (75%)
[11] Read the data
jumps=[0..6520), chunk_size=1046529, total_size=6823372422
|--------------------------------------------------|
|============================Garbage collection 70 = 49+7+14 (level 0) ...
76.5 Mbytes of cons cells used (99%)
5487.5 Mbytes of vectors used (96%)
=Garbage collection 71 = 49+8+14 (level 1) ...
77.0 Mbytes of cons cells used (100%)
5487.6 Mbytes of vectors used (96%)
Garbage collection 72 = 49+8+15 (level 2) ...
77.0 Mbytes of cons cells used (81%)
5487.1 Mbytes of vectors used (80%)
==============Garbage collection 73 = 50+8+15 (level 0) ...
94.3 Mbytes of cons cells used (100%)
5494.0 Mbytes of vectors used (80%)
Garbage collection 74 = 50+9+15 (level 1) ...
94.5 Mbytes of cons cells used (100%)
5494.1 Mbytes of vectors used (80%)
Garbage collection 75 = 50+9+16 (level 2) ...
94.5 Mbytes of cons cells used (82%)
5493.1 Mbytes of vectors used (67%)
=======|
Read 53945186 rows x 14 columns from 6.355GB (6823372783 bytes) file in 00:24.772 wall clock time
[12] Finalizing the datatable
Type counts:
23 : drop '0'
5 : int32 '5'
7 : float64 '7'
2 : string 'A'
=============================
0.005s ( 0%) Memory map 6.355GB file
0.018s ( 0%) sep=',' ncol=37 and header detection
0.000s ( 0%) Column type detection using 10049 sample rows
5.496s ( 22%) Allocation of 62279495 rows x 37 cols (5.336GB) of which 53945186 ( 87%) rows used
19.253s ( 78%) Reading 6520 chunks of 0.998MB (8295 rows) using 40 threads
= 0.433s ( 2%) Finding first non-embedded \n after each jump
+ 1.482s ( 6%) Parse to row-major thread buffers (grown 0 times)
+ 9.515s ( 38%) Transpose
+ 7.822s ( 32%) Waiting
0.000s ( 0%) Rereading 0 columns due to out-of-sample type exceptions
24.772s Total
Garbage collection 76 = 51+9+16 (level 0) ...
105.3 Mbytes of cons cells used (91%)
5500.3 Mbytes of vectors used (67%)
Garbage collection 77 = 51+10+16 (level 1) ...
105.4 Mbytes of cons cells used (91%)
5500.2 Mbytes of vectors used (67%)
> DT[, Month := as.IDate(Month, format = "%Y-%m-%d")]
Garbage collection 78 = 51+10+17 (level 2) ...
107.5 Mbytes of cons cells used (76%)
8174.1 Mbytes of vectors used (81%)
Garbage collection 79 = 51+11+17 (level 1) ...
107.5 Mbytes of cons cells used (76%)
5910.4 Mbytes of vectors used (59%)
> gcinfo(FALSE)
[1] TRUE
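The 5.336GB allocation in the timing summary and the jump in "vectors used" during step [10] are consistent with simple per-column arithmetic. This is a sketch under stated assumptions: 4 bytes per int32 cell, 8 per float64 cell and 8 per string cell (a pointer on a 64-bit build), "GB" in the log meaning 2^30 bytes, and R's per-vector header overhead ignored. These widths are assumptions for illustration, not taken from the fread source.

```r
# Type string after user overrides, copied from step [09] of the log.
types  <- "00AA700000000000000000070775555077750"
counts <- table(strsplit(types, "")[[1]])

# Assumed bytes per cell for each type code; '0' means the column is dropped.
width <- c("0" = 0, "5" = 4, "7" = 8, "A" = 8)

rows          <- 62279495  # initial allocation from step [07]
bytes_per_row <- sum(as.vector(counts) * width[names(counts)])
total_bytes   <- rows * bytes_per_row

cat(sprintf("%.0f bytes per row, %.3f GiB in total\n",
            bytes_per_row, total_bytes / 2^30))
```

Under these assumptions the total comes to about 5.336 GiB (roughly 5464 Mbytes), in line with "vectors used" rising from 13.6 to about 5487 Mbytes between collections 62 and 70. The figure matches only when the 14 retained columns are counted, which suggests that the "x 37 cols" in the summary line refers to the columns in the file and not to the columns allocated.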
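Awesome! :tada: Great work by everyone involved, especially @mattdowle, who must be short on hair by now after this one :)
My "go on vacation until the problem is solved" strategy seems to have worked out here :-)
Is there anything else I should try or look at, or is this issue resolved?
Thanks @aadler and @HughParsonage! What a relief.
@kevinushey Haha. Yes, it was on the data.table side and is now fixed (PR #2488). Thanks.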