Data.table: Stapeln Sie das Ungleichgewicht im Kopf

Erstellt am 14. Nov. 2017  ·  61Kommentare  ·  Quelle: Rdatatable/data.table

Ich erlebe einen R-Absturz (ein 'Stapelungleichgewicht'), wenn ich Folgendes mit verbose=FALSE ausführe. Hinweis: Ich konnte den folgenden Code vor ein oder zwei Monaten erfolgreich auf einer älteren Entwicklungsversion von data.table ausführen, daher glaube ich, dass dies ein relativ neuer Fehler ist. (Entschuldigung, ich erinnere mich nicht an die genaue Entwicklungsversion, in der es funktioniert hat.)

Das Problem wird nicht in einer wesentlich kleineren Datei reproduziert. Link zur Zip-Datei (die CSV-Datei ist 350 MB groß): https://github.com/HughParsonage/ABS-data/blob/master/inbox/SA2-by-DJZ-2011.zip

Ich habe gelegentlich verschiedene Fehler. Beispielsweise,

Fehler in get (Name, Umgebung = ns, erbt = FALSE): ungültiges erstes Argument

oder

Warnung: Stapelungleichgewicht in '$', 16 dann 15
Fehler: R_Reprotect: Nur 1 geschütztes Element, Index -2 kann nicht erneut geschützt werden

# Minimal reproducible example

library(data.table)

#> data.table 1.10.5 IN DEVELOPMENT built 2017-11-13 02:46:28 UTC; appveyor
#>   The fastest way to learn (by data.table authors): https://www.datacamp.com/courses/data-analysis-the-data-table-way
#>   Documentation: ?data.table, example(data.table) and browseVignettes("data.table")
#>   Release notes, videos and slides: http://r-datatable.com


fread("SA2-by-DJZ-2011.csv", header = FALSE, na.strings = "", verbose = TRUE)

Input contains no \n. Taking this to be a filename to open
[01] Check arguments
  Using 12 threads (omp_get_max_threads()=12, nth=12)
  NAstrings = [<<>>]
  None of the NAstrings look like numbers.
  show progress = 1
  0/1 column will be read as boolean
[02] Opening the file
  Opening file SA2-by-DJZ-2011.csv
  File opened, size = 349.4MB (366418725 bytes).
  Memory mapping ... ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
  \r-only line endings are not allowed because \n is found in the data
[05] Skipping initial rows if needed
  Positioned on line 1 starting: <<Australian Bureau of Statistic>>
[06] Detect separator, quoting rule, and ncolumns
  Detecting sep ...
  sep=','  with 89 lines of 4 fields using quote rule 0
  Detected 4 columns on line 12. This line is either column names or first data row. Line starts as: <<"Goulburn","110018063",3499,>>
  Quote rule picked = 0
  fill=false and the most number of columns found is 4
[07] Detect column types, good nrow estimate and whether first row is column names
  'header' changed by user from 'auto' to false
  Number of sampling jump points = 101 because (366418375 bytes from row 1 to eof) / (2 * 1457 jump0size) == 125744
  Type codes (jump 000)    : 1551  Quote rule 0
  Type codes (jump 100)    : 11051  Quote rule 0
  =====
  Sampled 10027 rows (handled \n inside quoted fields) at 101 jump points
  Bytes from first data row on line 12 to the end of last row: 366418143
  Line length: mean=16.02 sd=0.21 min=16 max=29
  Estimated number of rows: 366418143 / 16.02 = 22877178
  Initial alloc = 25164895 rows (22877178 + 9%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
  =====
[08] Assign column names
[09] Apply user overrides on column types
  After 0 type and 0 drop user overrides : 11051
[10] Allocate memory for the datatable
  Allocating 4 column slots (4 - 0 dropped) with 25164895 rows
[11] Read the data
  jumps=[0..360), chunk_size=1017828, total_size=366418143
[12] Finalizing the datatable
Read 22885380 rows x 4 columns from 349.4MB (366418725 bytes) file in 00:02.550 wall clock time
Thread buffers were grown 0 times (if all 12 threads each grew once, this figure would be 12)
Final type counts
         0 : drop     
         1 : bool8    
         0 : bool8    
         0 : bool8    
         0 : bool8    
         1 : int32    
         0 : int64    
         0 : float64  
         0 : float64  
         0 : float64  
         2 : string   
Rereading 1 columns due to out-of-sample type exceptions.
Column 1 ("") bumped from 'bool8' to 'string' due to <<"Goulburn">> on row 0
[11] Read the data
  jumps=[0..360), chunk_size=1017828, total_size=366418143
Read 94%. ETA 00:00 Warning: stack imbalance in '$', 27 then 28
Read 98%. ETA 00:00 
[12] Finalizing the datatable
Reread 22885380 rows x 1 columns in 00:00.991
Read 22885380 rows. Exactly what was estimated and allocated up front
=============================
   0.006s (  0%) Memory map 0.341GB file
   0.011s (  0%) sep=',' ncol=4 and header detection
   0.002s (  0%) Column type detection using 10027 sample rows
   0.328s (  9%) Allocation of 22885380 rows x 4 cols (0.469GB)
   3.194s ( 90%) Reading 360 chunks of 0.971MB (63547 rows) using 12 threads
   =    0.001s (  0%) Finding first non-embedded \n after each jump
   +    0.362s ( 10%) Parse to row-major thread buffers
   +    1.963s ( 55%) Transpose
   +    0.868s ( 25%) Waiting
   0.991s ( 28%) Rereading 1 columns due to out-of-sample type exceptions
   3.541s        Total
Warning: stack imbalance in 'withVisible', 3 then 5
Warning messages:
1: In fread("SA2-by-DJZ-2011.csv", header = FALSE, na.strings = "",  :
  Starting data input on line 12 <<"Goulburn","110018063",3499,>> with 4 fields and discarding line 11 <<"Main Statistical Area Structu>> before it because it has a different number of fields (3).
2: In fread("SA2-by-DJZ-2011.csv", header = FALSE, na.strings = "",  :
  Found the last consistent line but text exists afterwards. Consider fill=TRUE and/or blank.lines.skip=TRUE. First 200 characters of discarded line: <<"Dataset: 2011 Census of Population and Housing">>

# Output of sessionInfo()

R version 3.4.2 (2017-09-28)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

Matrix products: default

locale:
[1] LC_COLLATE=English_Australia.1252  LC_CTYPE=English_Australia.1252   
[3] LC_MONETARY=English_Australia.1252 LC_NUMERIC=C                      
[5] LC_TIME=English_Australia.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] data.table_1.10.5    RevoUtils_10.0.6     RevoUtilsMath_10.0.1

loaded via a namespace (and not attached):
[1] compiler_3.4.2 tools_3.4.2    yaml_2.1.14 
bug fread idatitime platform-specific

Hilfreichster Kommentar

Es scheint, als hätte sich meine Strategie, im Urlaub zu bleiben, bis das Problem behoben ist, hier bewährt :-)

Gibt es noch etwas, das ich mir ansehen sollte, oder wird dieses Problem als gelöst angesehen?

Alle 61 Kommentare

@ HughParsonage , das sieht ähnlich aus wie # 2457. Versuchen Sie vielleicht, showProgress=FALSE und prüfen Sie, ob der Vorgang abgeschlossen ist.
@mattdowle könnte es seit dem 09.11.2017 eine Regression gegeben haben?

Das Ausführen mit showProgress=FALSE hat tatsächlich das Ergebnis zurückgegeben (nur mit den erwarteten Warnungen).

Vielen Dank für alle detaillierten Informationen. Ich bezweifle, dass es seit dem 09.11.2017 eine Regression gegeben hat, aber vielleicht hat die lange Ausgabe von verbose=TRUE einen ähnlichen Einfluss wie die ETA-Ausgabe. Die Datei muss erneut gelesen werden, was bedeutet, dass mehr Ausgabe generiert wird. Ich befürchte, dass @HughParsonages Bericht, dass showProgress = TRUE für ihn funktioniert, falsch ist und dass das Problem auftreten wird, wenn es 5-10 Mal mit verbose = TRUE ausgeführt wird.

Es werden keine ausführlichen Nachrichten aus dem parallelen Abschnitt gedruckt (außer der bereits festgelegten Fortschritts-ETA). Es gibt jedoch ausführliche Nachrichten nach dem ersten Lesen und vor dem zweiten erneuten Lesen (was für diese Datei geschieht). Ich nehme an, es ist möglich, dass wenn diese Ausdrucke den 100. CheckUserInterrupt auslösen (siehe # 2457), der 2. parallele Bereich fehlschlägt (allerdings ungerade). Um dies trotzdem auszuschließen, habe ich gerade alle ausführlichen Nachrichten so geändert, dass REprintf anstelle von Rprintf verwendet wird (gleicher Fix wie # 2457 für die ETA). Das ist fehlgeschlagen, weil die Tests die Ausgabe auf stderr nicht finden - wird behoben. Sobald es vorbei ist, wird die Windows-ZIP-Datei automatisch erstellt, und Sie können es bitte erneut versuchen. Ich werde hier aktualisieren, wenn ich fertig bin.

Ok, der zweite Versuch besteht darin, Prüfungen zu bestehen und Windows.zip ist verfügbar. @ HughParsonage würde es Ihnen etwas verbose=TRUE und showProgress=TRUE . Wenn Sie 10 klare Läufe sehen, werden wir sagen, dass es das war. Sonst muss ich nochmal nachdenken.

Leider nicht behoben:

> library(data.table)
data.table 1.10.5 IN DEVELOPMENT built 2017-11-15 00:36:41 UTC; appveyor
  The fastest way to learn (by data.table authors): https://www.datacamp.com/courses/data-analysis-the-data-table-way
  Documentation: ?data.table, example(data.table) and browseVignettes("data.table")
  Release notes, videos and slides: http://r-datatable.com
> fread("SA2-by-DJZ-2011.csv", header = FALSE, na.strings = "", verbose = FALSE)
Read 26%. ETA 00:00 Warning: stack imbalance in '$', 20 then 22
Read 52%. ETA 00:00 Warning: stack imbalance in '$', 36 then 35
Warning: stack imbalance in '$', 21 then 22
Read 59%. ETA 00:00 Error in fread("SA2-by-DJZ-2011.csv", header = FALSE, na.strings = "",  : 
  unprotect_ptr: pointer not found
In addition: Warning: stack imbalance in '$', 26 then 28
Warning messages:
1: Warning: stack imbalance in '$', 26 then 27
In fread("SA2-by-DJZ-2011.csv", header = FALSE, na.strings = "",  :
  Starting data input on line 12 <<"Goulburn","110018063",3499,>> with 4 fields and discarding line 11 <<"Main Statistical Area Structu>> before it because it has a different number of fields (3).
2: In fread("SA2-by-DJZ-2011.csv", header = FALSE, na.strings = "",  :
  Found the last consistent line but text exists afterwards. Consider fill=TRUE and/or blank.lines.skip=TRUE. First 200 characters of discarded line: <<"Dataset: 2011 Census of Population and Housing">>
Warning: stack imbalance in 'lapply', 31 then 30
Warning: stack imbalance in '$', 14 then 15

Wenn ich verbose=TRUE, showProgress=TRUE auch nach 10 Läufen verwende, wird kein Fehler angezeigt. Hier ist das Ergebnis der 10. Ausgabe:

Input contains no \n. Taking this to be a filename to open
[01] Check arguments
  Using 12 threads (omp_get_max_threads()=12, nth=12)
  NAstrings = [<<>>]
  None of the NAstrings look like numbers.
  show progress = 1
  0/1 column will be read as boolean
[02] Opening the file
  Opening file SA2-by-DJZ-2011.csv
  File opened, size = 349.4MB (366418725 bytes).
  Memory mapping ... ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
  \r-only line endings are not allowed because \n is found in the data
[05] Skipping initial rows if needed
  Positioned on line 1 starting: <<Australian Bureau of Statistic>>
[06] Detect separator, quoting rule, and ncolumns
  Detecting sep ...
  sep=','  with 89 lines of 4 fields using quote rule 0
  Detected 4 columns on line 12. This line is either column names or first data row. Line starts as: <<"Goulburn","110018063",3499,>>
  Quote rule picked = 0
  fill=false and the most number of columns found is 4
[07] Detect column types, good nrow estimate and whether first row is column names
  'header' changed by user from 'auto' to false
  Number of sampling jump points = 101 because (366418375 bytes from row 1 to eof) / (2 * 1457 jump0size) == 125744
  Type codes (jump 000)    : 1551  Quote rule 0
  Type codes (jump 100)    : 11051  Quote rule 0
  =====
  Sampled 10027 rows (handled \n inside quoted fields) at 101 jump points
  Bytes from first data row on line 12 to the end of last row: 366418143
  Line length: mean=16.02 sd=0.21 min=16 max=29
  Estimated number of rows: 366418143 / 16.02 = 22877178
  Initial alloc = 25164895 rows (22877178 + 9%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
  =====
[08] Assign column names
[09] Apply user overrides on column types
  After 0 type and 0 drop user overrides : 11051
[10] Allocate memory for the datatable
  Allocating 4 column slots (4 - 0 dropped) with 25164895 rows
[11] Read the data
  jumps=[0..360), chunk_size=1017828, total_size=366418143
[12] Finalizing the datatable
Read 22885380 rows x 4 columns from 349.4MB (366418725 bytes) file in 00:02.094 wall clock time
Thread buffers were grown 0 times (if all 1 threads each grew once, this figure would be 1)
Final type counts
         0 : drop     
         1 : bool8    
         0 : bool8    
         0 : bool8    
         0 : bool8    
         1 : int32    
         0 : int64    
         0 : float64  
         0 : float64  
         0 : float64  
         2 : string   
Rereading 1 columns due to out-of-sample type exceptions.
Column 1 ("") bumped from 'bool8' to 'string' due to <<"Goulburn">> on row 0
[11] Read the data
  jumps=[0..360), chunk_size=1017828, total_size=366418143
Read 99%. ETA 00:00 
[12] Finalizing the datatable
Reread 22885380 rows x 1 columns in 00:01.752
Read 22885380 rows. Exactly what was estimated and allocated up front
=============================
   0.004s (  0%) Memory map 0.341GB file
   0.008s (  0%) sep=',' ncol=4 and header detection
   0.001s (  0%) Column type detection using 10027 sample rows
   0.173s (  4%) Allocation of 22885380 rows x 4 cols (0.469GB)
   3.660s ( 95%) Reading 360 chunks of 0.971MB (63547 rows) using 1 threads
   =    0.009s (  0%) Finding first non-embedded \n after each jump
   +    1.946s ( 51%) Parse to row-major thread buffers
   +    1.098s ( 29%) Transpose
   +    0.608s ( 16%) Waiting
   1.752s ( 46%) Rereading 1 columns due to out-of-sample type exceptions
   3.846s        Total
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
  Using 12 threads (omp_get_max_threads()=12, nth=12)
  NAstrings = [<<>>]
  None of the NAstrings look like numbers.
  show progress = 1
  0/1 column will be read as boolean
[02] Opening the file
  Opening file SA2-by-DJZ-2011.csv
  File opened, size = 349.4MB (366418725 bytes).
  Memory mapping ... ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
  \r-only line endings are not allowed because \n is found in the data
[05] Skipping initial rows if needed
  Positioned on line 1 starting: <<Australian Bureau of Statistic>>
[06] Detect separator, quoting rule, and ncolumns
  Detecting sep ...
  sep=','  with 89 lines of 4 fields using quote rule 0
  Detected 4 columns on line 12. This line is either column names or first data row. Line starts as: <<"Goulburn","110018063",3499,>>
  Quote rule picked = 0
  fill=false and the most number of columns found is 4
[07] Detect column types, good nrow estimate and whether first row is column names
  'header' changed by user from 'auto' to false
  Number of sampling jump points = 101 because (366418375 bytes from row 1 to eof) / (2 * 1457 jump0size) == 125744
  Type codes (jump 000)    : 1551  Quote rule 0
  Type codes (jump 100)    : 11051  Quote rule 0
  =====
  Sampled 10027 rows (handled \n inside quoted fields) at 101 jump points
  Bytes from first data row on line 12 to the end of last row: 366418143
  Line length: mean=16.02 sd=0.21 min=16 max=29
  Estimated number of rows: 366418143 / 16.02 = 22877178
  Initial alloc = 25164895 rows (22877178 + 9%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
  =====
[08] Assign column names
[09] Apply user overrides on column types
  After 0 type and 0 drop user overrides : 11051
[10] Allocate memory for the datatable
  Allocating 4 column slots (4 - 0 dropped) with 25164895 rows
[11] Read the data
  jumps=[0..360), chunk_size=1017828, total_size=366418143
[12] Finalizing the datatable
Read 22885380 rows x 4 columns from 349.4MB (366418725 bytes) file in 00:02.589 wall clock time
Thread buffers were grown 0 times (if all 1 threads each grew once, this figure would be 1)
Final type counts
         0 : drop     
         1 : bool8    
         0 : bool8    
         0 : bool8    
         0 : bool8    
         1 : int32    
         0 : int64    
         0 : float64  
         0 : float64  
         0 : float64  
         2 : string   
Rereading 1 columns due to out-of-sample type exceptions.
Column 1 ("") bumped from 'bool8' to 'string' due to <<"Goulburn">> on row 0
[11] Read the data
  jumps=[0..360), chunk_size=1017828, total_size=366418143
Read 99%. ETA 00:00 
[12] Finalizing the datatable
Reread 22885380 rows x 1 columns in 00:01.418
Read 22885380 rows. Exactly what was estimated and allocated up front
=============================
   0.001s (  0%) Memory map 0.341GB file
   0.003s (  0%) sep=',' ncol=4 and header detection
   0.001s (  0%) Column type detection using 10027 sample rows
   0.574s ( 14%) Allocation of 22885380 rows x 4 cols (0.469GB)
   3.428s ( 86%) Reading 360 chunks of 0.971MB (63547 rows) using 1 threads
   =    0.010s (  0%) Finding first non-embedded \n after each jump
   +    1.988s ( 50%) Parse to row-major thread buffers
   +    1.137s ( 28%) Transpose
   +    0.292s (  7%) Waiting
   1.418s ( 35%) Rereading 1 columns due to out-of-sample type exceptions
   4.007s        Total
There were 20 warnings (use warnings() to see them)

@ HughParsonage Danke! Ich bin allerdings verwirrt. Sie sagen, es funktioniert gut mit verbose=TRUE, showProgress=TRUE , was wir uns erhofft haben - yay! Das ist doch vorher gescheitert, oder? Der Standardwert für showProgress ist ohnehin WAHR, aber wenn Sie mit dem Standardwert FALSE für verbose , funktioniert es dann nicht und Sie sehen das Stapelungleichgewicht? Das ist seltsam, dass die _less_-Ausgabe fehlschlägt. Bitte bestätigen. Wenn das der Fall ist, belle ich vielleicht den falschen Baum an. Unter Linux funktioniert es hier einwandfrei, daher bin ich darauf angewiesen, dass Sie unter Windows testen. Vielen Dank.
(Außerdem steht am Ende der Ausgabe des 10. Laufs, dass es 20 Warnungen gab. Ich gehe davon aus, dass dies die 2 Warnungen sind, die oben angezeigt werden und 10 Mal wiederholt werden. Wenn ja, ist dies sinnvoll.)

Hallo, entschuldige die Verwirrung, Matt.

Sie haben Recht, dass das ursprüngliche Problem nicht mehr zu einem Absturz führt, nämlich Folgendes funktioniert wie erwartet:

fread("SA2-by-DJZ-2011.csv", header = FALSE, na.strings = "")

Um im Original zu verdeutlichen, dass ich beim Absturz von verbose =FALSE (Standardeinstellung) einen Absturz hatte. Ich habe es mit verbose = TRUE bevor ich das Problem eingereicht habe, und eine Warnung "Stapelungleichgewicht" festgestellt, aber keinen Absturz festgestellt. Mit der neuesten Version bekomme ich keinen Absturz (oder irgendwelche Probleme) mit verbose = FALSE .

Der Grund, warum ich "nicht behoben" sagte, war, dass ich die Warnmeldungen bemerkte:

Warning messages:
Warning: stack imbalance in '$', 26 then 27
Warning: stack imbalance in 'lapply', 31 then 30
Warning: stack imbalance in '$', 14 then 15

Das sah seltsam aus und ich dachte, es könnte auf ein eng verwandtes, wenn auch nicht identisches Problem hinweisen. Trotzdem kann ich heute Morgen in Australien die Warnmeldungen nicht mehr reproduzieren.

OK, ich verstehe. Diese Warnmeldungen über das Ungleichgewicht des Stapels sind im Wesentlichen Fehler, ja. Wir können sie nicht überspringen. Ich nenne diese Warnung vor einem Stapelungleichgewicht einen Absturz, obwohl sie noch nicht abgestürzt ist. (Es ist nur eine Frage der Zeit, bis es abstürzt, nachdem diese Warnung angezeigt wurde.)

Wenn Sie die 10 Läufe in einer neuen R-Sitzung mit verbose=TRUE, showProgress=TRUE , handelt es sich um eine der 20 Warnungen bezüglich eines Stapelungleichgewichts oder um alle 20 nur um die folgenden regulären Warnungen.

1: In fread("SA2-by-DJZ-2011.csv", header = FALSE, na.strings = "",  :
  Starting data input on line 12 <<"Goulburn","110018063",3499,>> with 4 fields and discarding line 11 <<"Main Statistical Area Structu>> before it because it has a different number of fields (3).
2: In fread("SA2-by-DJZ-2011.csv", header = FALSE, na.strings = "",  :
  Found the last consistent line but text exists afterwards. Consider fill=TRUE and/or blank.lines.skip=TRUE. First 200 characters of discarded line: <<"Dataset: 2011 Census of Population and Housing">>

Sobald eine Warnung vor einem Stapelungleichgewicht aufgetreten ist, starten Sie bitte eine neue R-Sitzung. Wir können nichts von R vertrauen, nachdem dies einmal vorgekommen ist.

Ich habe einen Absturz bekommen, als ich mit verbose=TRUE, showProgress=TRUE . Etwas über ein const char mit einem SEXP . Ich versuche dies über die Befehlszeile zu reproduzieren (leider ist es in RStudio aufgetreten und RStudio wurde geschlossen, bevor ich die gesamte Nachricht lesen konnte).

Der Absturz kann nicht reproduziert werden. Hier ist das Ergebnis nach dem Neustart. Es gab eine Warnung vor einem Stapelungleichgewicht:

> library(data.table)
data.table 1.10.5 IN DEVELOPMENT built 2017-11-15 00:36:41 UTC; appveyor
  The fastest way to learn (by data.table authors): https://www.datacamp.com/courses/data-analysis-the-data-table-way
  Documentation: ?data.table, example(data.table) and browseVignettes("data.table")
  Release notes, videos and slides: http://r-datatable.com
> for (i in 1:10) fread("SA2-by-DJZ-2011.csv", header = FALSE, na.strings = "", verbose = TRUE, showProgress = TRUE)
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
  Using 12 threads (omp_get_max_threads()=12, nth=12)
  NAstrings = [<<>>]
  None of the NAstrings look like numbers.
  show progress = 1
  0/1 column will be read as boolean
[02] Opening the file
  Opening file SA2-by-DJZ-2011.csv
  File opened, size = 349.4MB (366418725 bytes).
  Memory mapping ... ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
  \r-only line endings are not allowed because \n is found in the data
[05] Skipping initial rows if needed
  Positioned on line 1 starting: <<Australian Bureau of Statistic>>
[06] Detect separator, quoting rule, and ncolumns
  Detecting sep ...
  sep=','  with 89 lines of 4 fields using quote rule 0
  Detected 4 columns on line 12. This line is either column names or first data row. Line starts as: <<"Goulburn","110018063",3499,>>
  Quote rule picked = 0
  fill=false and the most number of columns found is 4
[07] Detect column types, good nrow estimate and whether first row is column names
  'header' changed by user from 'auto' to false
  Number of sampling jump points = 101 because (366418375 bytes from row 1 to eof) / (2 * 1457 jump0size) == 125744
  Type codes (jump 000)    : 1551  Quote rule 0
  Type codes (jump 100)    : 11051  Quote rule 0
  =====
  Sampled 10027 rows (handled \n inside quoted fields) at 101 jump points
  Bytes from first data row on line 12 to the end of last row: 366418143
  Line length: mean=16.02 sd=0.21 min=16 max=29
  Estimated number of rows: 366418143 / 16.02 = 22877178
  Initial alloc = 25164895 rows (22877178 + 9%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
  =====
[08] Assign column names
[09] Apply user overrides on column types
  After 0 type and 0 drop user overrides : 11051
[10] Allocate memory for the datatable
  Allocating 4 column slots (4 - 0 dropped) with 25164895 rows
[11] Read the data
  jumps=[0..360), chunk_size=1017828, total_size=366418143
[12] Finalizing the datatable
Read 22885380 rows x 4 columns from 349.4MB (366418725 bytes) file in 00:01.212 wall clock time
Thread buffers were grown 0 times (if all 12 threads each grew once, this figure would be 12)
Final type counts
         0 : drop     
         1 : bool8    
         0 : bool8    
         0 : bool8    
         0 : bool8    
         1 : int32    
         0 : int64    
         0 : float64  
         0 : float64  
         0 : float64  
         2 : string   
Rereading 1 columns due to out-of-sample type exceptions.
Column 1 ("") bumped from 'bool8' to 'string' due to <<"Goulburn">> on row 0
[11] Read the data
  jumps=[0..360), chunk_size=1017828, total_size=366418143
Read 31%. ETA 00:00 Warning: stack imbalance in '$', 24 then 23
Read 91%. ETA 00:00 Warning: stack imbalance in '$', 27 then 26
Read 95%. ETA 00:00 Warning: stack imbalance in '$', 28 then 29
Read 98%. ETA 00:00 
[12] Finalizing the datatable
Reread 22885380 rows x 1 columns in 00:00.895
Read 22885380 rows. Exactly what was estimated and allocated up front
=============================
   0.002s (  0%) Memory map 0.341GB file
   0.029s (  1%) sep=',' ncol=4 and header detection
   0.001s (  0%) Column type detection using 10027 sample rows
   0.314s ( 15%) Allocation of 22885380 rows x 4 cols (0.469GB)
   1.761s ( 84%) Reading 360 chunks of 0.971MB (63547 rows) using 12 threads
   =    0.015s (  1%) Finding first non-embedded \n after each jump
   +    0.599s ( 28%) Parse to row-major thread buffers
   +    0.400s ( 19%) Transpose
   +    0.746s ( 35%) Waiting
   0.895s ( 42%) Rereading 1 columns due to out-of-sample type exceptions
   2.107s        Total
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
  Using 12 threads (omp_get_max_threads()=12, nth=12)
  NAstrings = [<<>>]
  None of the NAstrings look like numbers.
  show progress = 1
  0/1 column will be read as boolean
[02] Opening the file
  Opening file SA2-by-DJZ-2011.csv
  File opened, size = 349.4MB (366418725 bytes).
  Memory mapping ... ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
  \r-only line endings are not allowed because \n is found in the data
[05] Skipping initial rows if needed
  Positioned on line 1 starting: <<Australian Bureau of Statistic>>
[06] Detect separator, quoting rule, and ncolumns
  Detecting sep ...
  sep=','  with 89 lines of 4 fields using quote rule 0
  Detected 4 columns on line 12. This line is either column names or first data row. Line starts as: <<"Goulburn","110018063",3499,>>
  Quote rule picked = 0
  fill=false and the most number of columns found is 4
[07] Detect column types, good nrow estimate and whether first row is column names
  'header' changed by user from 'auto' to false
  Number of sampling jump points = 101 because (366418375 bytes from row 1 to eof) / (2 * 1457 jump0size) == 125744
  Type codes (jump 000)    : 1551  Quote rule 0
  Type codes (jump 100)    : 11051  Quote rule 0
  =====
  Sampled 10027 rows (handled \n inside quoted fields) at 101 jump points
  Bytes from first data row on line 12 to the end of last row: 366418143
  Line length: mean=16.02 sd=0.21 min=16 max=29
  Estimated number of rows: 366418143 / 16.02 = 22877178
  Initial alloc = 25164895 rows (22877178 + 9%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
  =====
[08] Assign column names
[09] Apply user overrides on column types
  After 0 type and 0 drop user overrides : 11051
[10] Allocate memory for the datatable
  Allocating 4 column slots (4 - 0 dropped) with 25164895 rows
[11] Read the data
  jumps=[0..360), chunk_size=1017828, total_size=366418143
[12] Finalizing the datatable
Read 22885380 rows x 4 columns from 349.4MB (366418725 bytes) file in 00:01.335 wall clock time
Thread buffers were grown 0 times (if all 12 threads each grew once, this figure would be 12)
Final type counts
         0 : drop     
         1 : bool8    
         0 : bool8    
         0 : bool8    
         0 : bool8    
         1 : int32    
         0 : int64    
         0 : float64  
         0 : float64  
         0 : float64  
         2 : string   
Rereading 1 columns due to out-of-sample type exceptions.
Column 1 ("") bumped from 'bool8' to 'string' due to <<"Goulburn">> on row 0
[11] Read the data
  jumps=[0..360), chunk_size=1017828, total_size=366418143
Read 98%. ETA 00:00 
[12] Finalizing the datatable
Reread 22885380 rows x 1 columns in 00:01.049
Read 22885380 rows. Exactly what was estimated and allocated up front
=============================
   0.002s (  0%) Memory map 0.341GB file
   0.006s (  0%) sep=',' ncol=4 and header detection
   0.001s (  0%) Column type detection using 10027 sample rows
   0.402s ( 17%) Allocation of 22885380 rows x 4 cols (0.469GB)
   1.974s ( 83%) Reading 360 chunks of 0.971MB (63547 rows) using 12 threads
   =    0.001s (  0%) Finding first non-embedded \n after each jump
   +    0.209s (  9%) Parse to row-major thread buffers
   +    0.864s ( 36%) Transpose
   +    0.900s ( 38%) Waiting
   1.049s ( 44%) Rereading 1 columns due to out-of-sample type exceptions
   2.385s        Total
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
  Using 12 threads (omp_get_max_threads()=12, nth=12)
  NAstrings = [<<>>]
  None of the NAstrings look like numbers.
  show progress = 1
  0/1 column will be read as boolean
[02] Opening the file
  Opening file SA2-by-DJZ-2011.csv
  File opened, size = 349.4MB (366418725 bytes).
  Memory mapping ... ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
  \r-only line endings are not allowed because \n is found in the data
[05] Skipping initial rows if needed
  Positioned on line 1 starting: <<Australian Bureau of Statistic>>
[06] Detect separator, quoting rule, and ncolumns
  Detecting sep ...
  sep=','  with 89 lines of 4 fields using quote rule 0
  Detected 4 columns on line 12. This line is either column names or first data row. Line starts as: <<"Goulburn","110018063",3499,>>
  Quote rule picked = 0
  fill=false and the most number of columns found is 4
[07] Detect column types, good nrow estimate and whether first row is column names
  'header' changed by user from 'auto' to false
  Number of sampling jump points = 101 because (366418375 bytes from row 1 to eof) / (2 * 1457 jump0size) == 125744
  Type codes (jump 000)    : 1551  Quote rule 0
  Type codes (jump 100)    : 11051  Quote rule 0
  =====
  Sampled 10027 rows (handled \n inside quoted fields) at 101 jump points
  Bytes from first data row on line 12 to the end of last row: 366418143
  Line length: mean=16.02 sd=0.21 min=16 max=29
  Estimated number of rows: 366418143 / 16.02 = 22877178
  Initial alloc = 25164895 rows (22877178 + 9%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
  =====
[08] Assign column names
[09] Apply user overrides on column types
  After 0 type and 0 drop user overrides : 11051
[10] Allocate memory for the datatable
  Allocating 4 column slots (4 - 0 dropped) with 25164895 rows
[11] Read the data
  jumps=[0..360), chunk_size=1017828, total_size=366418143
[12] Finalizing the datatable
Read 22885380 rows x 4 columns from 349.4MB (366418725 bytes) file in 00:01.212 wall clock time
Thread buffers were grown 0 times (if all 12 threads each grew once, this figure would be 12)
Final type counts
         0 : drop     
         1 : bool8    
         0 : bool8    
         0 : bool8    
         0 : bool8    
         1 : int32    
         0 : int64    
         0 : float64  
         0 : float64  
         0 : float64  
         2 : string   
Rereading 1 columns due to out-of-sample type exceptions.
Column 1 ("") bumped from 'bool8' to 'string' due to <<"Goulburn">> on row 0
[11] Read the data
  jumps=[0..360), chunk_size=1017828, total_size=366418143
Read 96%. ETA 00:00 
[12] Finalizing the datatable
Reread 22885380 rows x 1 columns in 00:00.414
Read 22885380 rows. Exactly what was estimated and allocated up front
=============================
   0.002s (  0%) Memory map 0.341GB file
   0.007s (  0%) sep=',' ncol=4 and header detection
   0.002s (  0%) Column type detection using 10027 sample rows
   0.293s ( 18%) Allocation of 22885380 rows x 4 cols (0.469GB)
   1.322s ( 81%) Reading 360 chunks of 0.971MB (63547 rows) using 12 threads
   =    0.001s (  0%) Finding first non-embedded \n after each jump
   +    0.199s ( 12%) Parse to row-major thread buffers
   +    0.822s ( 51%) Transpose
   +    0.301s ( 19%) Waiting
   0.414s ( 25%) Rereading 1 columns due to out-of-sample type exceptions
   1.626s        Total
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
  Using 12 threads (omp_get_max_threads()=12, nth=12)
  NAstrings = [<<>>]
  None of the NAstrings look like numbers.
  show progress = 1
  0/1 column will be read as boolean
[02] Opening the file
  Opening file SA2-by-DJZ-2011.csv
  File opened, size = 349.4MB (366418725 bytes).
  Memory mapping ... ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
  \r-only line endings are not allowed because \n is found in the data
[05] Skipping initial rows if needed
  Positioned on line 1 starting: <<Australian Bureau of Statistic>>
[06] Detect separator, quoting rule, and ncolumns
  Detecting sep ...
  sep=','  with 89 lines of 4 fields using quote rule 0
  Detected 4 columns on line 12. This line is either column names or first data row. Line starts as: <<"Goulburn","110018063",3499,>>
  Quote rule picked = 0
  fill=false and the most number of columns found is 4
[07] Detect column types, good nrow estimate and whether first row is column names
  'header' changed by user from 'auto' to false
  Number of sampling jump points = 101 because (366418375 bytes from row 1 to eof) / (2 * 1457 jump0size) == 125744
  Type codes (jump 000)    : 1551  Quote rule 0
  Type codes (jump 100)    : 11051  Quote rule 0
  =====
  Sampled 10027 rows (handled \n inside quoted fields) at 101 jump points
  Bytes from first data row on line 12 to the end of last row: 366418143
  Line length: mean=16.02 sd=0.21 min=16 max=29
  Estimated number of rows: 366418143 / 16.02 = 22877178
  Initial alloc = 25164895 rows (22877178 + 9%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
  =====
[08] Assign column names
[09] Apply user overrides on column types
  After 0 type and 0 drop user overrides : 11051
[10] Allocate memory for the datatable
  Allocating 4 column slots (4 - 0 dropped) with 25164895 rows
[11] Read the data
  jumps=[0..360), chunk_size=1017828, total_size=366418143
[12] Finalizing the datatable
Read 22885380 rows x 4 columns from 349.4MB (366418725 bytes) file in 00:01.451 wall clock time
Thread buffers were grown 0 times (if all 12 threads each grew once, this figure would be 12)
Final type counts
         0 : drop     
         1 : bool8    
         0 : bool8    
         0 : bool8    
         0 : bool8    
         1 : int32    
         0 : int64    
         0 : float64  
         0 : float64  
         0 : float64  
         2 : string   
Rereading 1 columns due to out-of-sample type exceptions.
Column 1 ("") bumped from 'bool8' to 'string' due to <<"Goulburn">> on row 0
[11] Read the data
  jumps=[0..360), chunk_size=1017828, total_size=366418143
Read 97%. ETA 00:00 
[12] Finalizing the datatable
Reread 22885380 rows x 1 columns in 00:00.409
Read 22885380 rows. Exactly what was estimated and allocated up front
=============================
   0.002s (  0%) Memory map 0.341GB file
   0.006s (  0%) sep=',' ncol=4 and header detection
   0.001s (  0%) Column type detection using 10027 sample rows
   0.403s ( 22%) Allocation of 22885380 rows x 4 cols (0.469GB)
   1.448s ( 78%) Reading 360 chunks of 0.971MB (63547 rows) using 12 threads
   =    0.001s (  0%) Finding first non-embedded \n after each jump
   +    0.194s ( 10%) Parse to row-major thread buffers
   +    0.974s ( 52%) Transpose
   +    0.279s ( 15%) Waiting
   0.409s ( 22%) Rereading 1 columns due to out-of-sample type exceptions
   1.860s        Total
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
  Using 12 threads (omp_get_max_threads()=12, nth=12)
  NAstrings = [<<>>]
  None of the NAstrings look like numbers.
  show progress = 1
  0/1 column will be read as boolean
[02] Opening the file
  Opening file SA2-by-DJZ-2011.csv
  File opened, size = 349.4MB (366418725 bytes).
  Memory mapping ... ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
  \r-only line endings are not allowed because \n is found in the data
[05] Skipping initial rows if needed
  Positioned on line 1 starting: <<Australian Bureau of Statistic>>
[06] Detect separator, quoting rule, and ncolumns
  Detecting sep ...
  sep=','  with 89 lines of 4 fields using quote rule 0
  Detected 4 columns on line 12. This line is either column names or first data row. Line starts as: <<"Goulburn","110018063",3499,>>
  Quote rule picked = 0
  fill=false and the most number of columns found is 4
[07] Detect column types, good nrow estimate and whether first row is column names
  'header' changed by user from 'auto' to false
  Number of sampling jump points = 101 because (366418375 bytes from row 1 to eof) / (2 * 1457 jump0size) == 125744
  Type codes (jump 000)    : 1551  Quote rule 0
  Type codes (jump 100)    : 11051  Quote rule 0
  =====
  Sampled 10027 rows (handled \n inside quoted fields) at 101 jump points
  Bytes from first data row on line 12 to the end of last row: 366418143
  Line length: mean=16.02 sd=0.21 min=16 max=29
  Estimated number of rows: 366418143 / 16.02 = 22877178
  Initial alloc = 25164895 rows (22877178 + 9%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
  =====
[08] Assign column names
[09] Apply user overrides on column types
  After 0 type and 0 drop user overrides : 11051
[10] Allocate memory for the datatable
  Allocating 4 column slots (4 - 0 dropped) with 25164895 rows
[11] Read the data
  jumps=[0..360), chunk_size=1017828, total_size=366418143
[12] Finalizing the datatable
Read 22885380 rows x 4 columns from 349.4MB (366418725 bytes) file in 00:01.480 wall clock time
Thread buffers were grown 0 times (if all 12 threads each grew once, this figure would be 12)
Final type counts
         0 : drop     
         1 : bool8    
         0 : bool8    
         0 : bool8    
         0 : bool8    
         1 : int32    
         0 : int64    
         0 : float64  
         0 : float64  
         0 : float64  
         2 : string   
Rereading 1 columns due to out-of-sample type exceptions.
Column 1 ("") bumped from 'bool8' to 'string' due to <<"Goulburn">> on row 0
[11] Read the data
  jumps=[0..360), chunk_size=1017828, total_size=366418143
Read 99%. ETA 00:00 
[12] Finalizing the datatable
Reread 22885380 rows x 1 columns in 00:00.412
Read 22885380 rows. Exactly what was estimated and allocated up front
=============================
   0.002s (  0%) Memory map 0.341GB file
   0.007s (  0%) sep=',' ncol=4 and header detection
   0.001s (  0%) Column type detection using 10027 sample rows
   0.459s ( 24%) Allocation of 22885380 rows x 4 cols (0.469GB)
   1.424s ( 75%) Reading 360 chunks of 0.971MB (63547 rows) using 12 threads
   =    0.001s (  0%) Finding first non-embedded \n after each jump
   +    0.197s ( 10%) Parse to row-major thread buffers
   +    0.938s ( 50%) Transpose
   +    0.288s ( 15%) Waiting
   0.412s ( 22%) Rereading 1 columns due to out-of-sample type exceptions
   1.892s        Total
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
  Using 12 threads (omp_get_max_threads()=12, nth=12)
  NAstrings = [<<>>]
  None of the NAstrings look like numbers.
  show progress = 1
  0/1 column will be read as boolean
[02] Opening the file
  Opening file SA2-by-DJZ-2011.csv
  File opened, size = 349.4MB (366418725 bytes).
  Memory mapping ... ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
  \r-only line endings are not allowed because \n is found in the data
[05] Skipping initial rows if needed
  Positioned on line 1 starting: <<Australian Bureau of Statistic>>
[06] Detect separator, quoting rule, and ncolumns
  Detecting sep ...
  sep=','  with 89 lines of 4 fields using quote rule 0
  Detected 4 columns on line 12. This line is either column names or first data row. Line starts as: <<"Goulburn","110018063",3499,>>
  Quote rule picked = 0
  fill=false and the most number of columns found is 4
[07] Detect column types, good nrow estimate and whether first row is column names
  'header' changed by user from 'auto' to false
  Number of sampling jump points = 101 because (366418375 bytes from row 1 to eof) / (2 * 1457 jump0size) == 125744
  Type codes (jump 000)    : 1551  Quote rule 0
  Type codes (jump 100)    : 11051  Quote rule 0
  =====
  Sampled 10027 rows (handled \n inside quoted fields) at 101 jump points
  Bytes from first data row on line 12 to the end of last row: 366418143
  Line length: mean=16.02 sd=0.21 min=16 max=29
  Estimated number of rows: 366418143 / 16.02 = 22877178
  Initial alloc = 25164895 rows (22877178 + 9%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
  =====
[08] Assign column names
[09] Apply user overrides on column types
  After 0 type and 0 drop user overrides : 11051
[10] Allocate memory for the datatable
  Allocating 4 column slots (4 - 0 dropped) with 25164895 rows
[11] Read the data
  jumps=[0..360), chunk_size=1017828, total_size=366418143
[12] Finalizing the datatable
Read 22885380 rows x 4 columns from 349.4MB (366418725 bytes) file in 00:01.381 wall clock time
Thread buffers were grown 0 times (if all 12 threads each grew once, this figure would be 12)
Final type counts
         0 : drop     
         1 : bool8    
         0 : bool8    
         0 : bool8    
         0 : bool8    
         1 : int32    
         0 : int64    
         0 : float64  
         0 : float64  
         0 : float64  
         2 : string   
Rereading 1 columns due to out-of-sample type exceptions.
Column 1 ("") bumped from 'bool8' to 'string' due to <<"Goulburn">> on row 0
[11] Read the data
  jumps=[0..360), chunk_size=1017828, total_size=366418143
Read 97%. ETA 00:00 
[12] Finalizing the datatable
Reread 22885380 rows x 1 columns in 00:00.401
Read 22885380 rows. Exactly what was estimated and allocated up front
=============================
   0.002s (  0%) Memory map 0.341GB file
   0.005s (  0%) sep=',' ncol=4 and header detection
   0.001s (  0%) Column type detection using 10027 sample rows
   0.384s ( 22%) Allocation of 22885380 rows x 4 cols (0.469GB)
   1.389s ( 78%) Reading 360 chunks of 0.971MB (63547 rows) using 12 threads
   =    0.001s (  0%) Finding first non-embedded \n after each jump
   +    0.196s ( 11%) Parse to row-major thread buffers
   +    0.911s ( 51%) Transpose
   +    0.281s ( 16%) Waiting
   0.401s ( 22%) Rereading 1 columns due to out-of-sample type exceptions
   1.781s        Total
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
  Using 12 threads (omp_get_max_threads()=12, nth=12)
  NAstrings = [<<>>]
  None of the NAstrings look like numbers.
  show progress = 1
  0/1 column will be read as boolean
[02] Opening the file
  Opening file SA2-by-DJZ-2011.csv
  File opened, size = 349.4MB (366418725 bytes).
  Memory mapping ... ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
  \r-only line endings are not allowed because \n is found in the data
[05] Skipping initial rows if needed
  Positioned on line 1 starting: <<Australian Bureau of Statistic>>
[06] Detect separator, quoting rule, and ncolumns
  Detecting sep ...
  sep=','  with 89 lines of 4 fields using quote rule 0
  Detected 4 columns on line 12. This line is either column names or first data row. Line starts as: <<"Goulburn","110018063",3499,>>
  Quote rule picked = 0
  fill=false and the most number of columns found is 4
[07] Detect column types, good nrow estimate and whether first row is column names
  'header' changed by user from 'auto' to false
  Number of sampling jump points = 101 because (366418375 bytes from row 1 to eof) / (2 * 1457 jump0size) == 125744
  Type codes (jump 000)    : 1551  Quote rule 0
  Type codes (jump 100)    : 11051  Quote rule 0
  =====
  Sampled 10027 rows (handled \n inside quoted fields) at 101 jump points
  Bytes from first data row on line 12 to the end of last row: 366418143
  Line length: mean=16.02 sd=0.21 min=16 max=29
  Estimated number of rows: 366418143 / 16.02 = 22877178
  Initial alloc = 25164895 rows (22877178 + 9%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
  =====
[08] Assign column names
[09] Apply user overrides on column types
  After 0 type and 0 drop user overrides : 11051
[10] Allocate memory for the datatable
  Allocating 4 column slots (4 - 0 dropped) with 25164895 rows
[11] Read the data
  jumps=[0..360), chunk_size=1017828, total_size=366418143
[12] Finalizing the datatable
Read 22885380 rows x 4 columns from 349.4MB (366418725 bytes) file in 00:01.384 wall clock time
Thread buffers were grown 0 times (if all 12 threads each grew once, this figure would be 12)
Final type counts
         0 : drop     
         1 : bool8    
         0 : bool8    
         0 : bool8    
         0 : bool8    
         1 : int32    
         0 : int64    
         0 : float64  
         0 : float64  
         0 : float64  
         2 : string   
Rereading 1 columns due to out-of-sample type exceptions.
Column 1 ("") bumped from 'bool8' to 'string' due to <<"Goulburn">> on row 0
[11] Read the data
  jumps=[0..360), chunk_size=1017828, total_size=366418143
Read 98%. ETA 00:00 
[12] Finalizing the datatable
Reread 22885380 rows x 1 columns in 00:00.480
Read 22885380 rows. Exactly what was estimated and allocated up front
=============================
   0.002s (  0%) Memory map 0.341GB file
   0.006s (  0%) sep=',' ncol=4 and header detection
   0.002s (  0%) Column type detection using 10027 sample rows
   0.476s ( 26%) Allocation of 22885380 rows x 4 cols (0.469GB)
   1.378s ( 74%) Reading 360 chunks of 0.971MB (63547 rows) using 12 threads
   =    0.001s (  0%) Finding first non-embedded \n after each jump
   +    0.192s ( 10%) Parse to row-major thread buffers
   +    0.833s ( 45%) Transpose
   +    0.352s ( 19%) Waiting
   0.480s ( 26%) Rereading 1 columns due to out-of-sample type exceptions
   1.864s        Total
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
  Using 12 threads (omp_get_max_threads()=12, nth=12)
  NAstrings = [<<>>]
  None of the NAstrings look like numbers.
  show progress = 1
  0/1 column will be read as boolean
[02] Opening the file
  Opening file SA2-by-DJZ-2011.csv
  File opened, size = 349.4MB (366418725 bytes).
  Memory mapping ... ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
  \r-only line endings are not allowed because \n is found in the data
[05] Skipping initial rows if needed
  Positioned on line 1 starting: <<Australian Bureau of Statistic>>
[06] Detect separator, quoting rule, and ncolumns
  Detecting sep ...
  sep=','  with 89 lines of 4 fields using quote rule 0
  Detected 4 columns on line 12. This line is either column names or first data row. Line starts as: <<"Goulburn","110018063",3499,>>
  Quote rule picked = 0
  fill=false and the most number of columns found is 4
[07] Detect column types, good nrow estimate and whether first row is column names
  'header' changed by user from 'auto' to false
  Number of sampling jump points = 101 because (366418375 bytes from row 1 to eof) / (2 * 1457 jump0size) == 125744
  Type codes (jump 000)    : 1551  Quote rule 0
  Type codes (jump 100)    : 11051  Quote rule 0
  =====
  Sampled 10027 rows (handled \n inside quoted fields) at 101 jump points
  Bytes from first data row on line 12 to the end of last row: 366418143
  Line length: mean=16.02 sd=0.21 min=16 max=29
  Estimated number of rows: 366418143 / 16.02 = 22877178
  Initial alloc = 25164895 rows (22877178 + 9%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
  =====
[08] Assign column names
[09] Apply user overrides on column types
  After 0 type and 0 drop user overrides : 11051
[10] Allocate memory for the datatable
  Allocating 4 column slots (4 - 0 dropped) with 25164895 rows
[11] Read the data
  jumps=[0..360), chunk_size=1017828, total_size=366418143
[12] Finalizing the datatable
Read 22885380 rows x 4 columns from 349.4MB (366418725 bytes) file in 00:01.374 wall clock time
Thread buffers were grown 0 times (if all 12 threads each grew once, this figure would be 12)
Final type counts
         0 : drop     
         1 : bool8    
         0 : bool8    
         0 : bool8    
         0 : bool8    
         1 : int32    
         0 : int64    
         0 : float64  
         0 : float64  
         0 : float64  
         2 : string   
Rereading 1 columns due to out-of-sample type exceptions.
Column 1 ("") bumped from 'bool8' to 'string' due to <<"Goulburn">> on row 0
[11] Read the data
  jumps=[0..360), chunk_size=1017828, total_size=366418143
Read 98%. ETA 00:00 
[12] Finalizing the datatable
Reread 22885380 rows x 1 columns in 00:00.507
Read 22885380 rows. Exactly what was estimated and allocated up front
=============================
   0.002s (  0%) Memory map 0.341GB file
   0.006s (  0%) sep=',' ncol=4 and header detection
   0.001s (  0%) Column type detection using 10027 sample rows
   0.311s ( 17%) Allocation of 22885380 rows x 4 cols (0.469GB)
   1.562s ( 83%) Reading 360 chunks of 0.971MB (63547 rows) using 12 threads
   =    0.001s (  0%) Finding first non-embedded \n after each jump
   +    0.193s ( 10%) Parse to row-major thread buffers
   +    0.988s ( 52%) Transpose
   +    0.381s ( 20%) Waiting
   0.507s ( 27%) Rereading 1 columns due to out-of-sample type exceptions
   1.881s        Total
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
  Using 12 threads (omp_get_max_threads()=12, nth=12)
  NAstrings = [<<>>]
  None of the NAstrings look like numbers.
  show progress = 1
  0/1 column will be read as boolean
[02] Opening the file
  Opening file SA2-by-DJZ-2011.csv
  File opened, size = 349.4MB (366418725 bytes).
  Memory mapping ... ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
  \r-only line endings are not allowed because \n is found in the data
[05] Skipping initial rows if needed
  Positioned on line 1 starting: <<Australian Bureau of Statistic>>
[06] Detect separator, quoting rule, and ncolumns
  Detecting sep ...
  sep=','  with 89 lines of 4 fields using quote rule 0
  Detected 4 columns on line 12. This line is either column names or first data row. Line starts as: <<"Goulburn","110018063",3499,>>
  Quote rule picked = 0
  fill=false and the most number of columns found is 4
[07] Detect column types, good nrow estimate and whether first row is column names
  'header' changed by user from 'auto' to false
  Number of sampling jump points = 101 because (366418375 bytes from row 1 to eof) / (2 * 1457 jump0size) == 125744
  Type codes (jump 000)    : 1551  Quote rule 0
  Type codes (jump 100)    : 11051  Quote rule 0
  =====
  Sampled 10027 rows (handled \n inside quoted fields) at 101 jump points
  Bytes from first data row on line 12 to the end of last row: 366418143
  Line length: mean=16.02 sd=0.21 min=16 max=29
  Estimated number of rows: 366418143 / 16.02 = 22877178
  Initial alloc = 25164895 rows (22877178 + 9%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
  =====
[08] Assign column names
[09] Apply user overrides on column types
  After 0 type and 0 drop user overrides : 11051
[10] Allocate memory for the datatable
  Allocating 4 column slots (4 - 0 dropped) with 25164895 rows
[11] Read the data
  jumps=[0..360), chunk_size=1017828, total_size=366418143
[12] Finalizing the datatable
Read 22885380 rows x 4 columns from 349.4MB (366418725 bytes) file in 00:01.318 wall clock time
Thread buffers were grown 0 times (if all 12 threads each grew once, this figure would be 12)
Final type counts
         0 : drop     
         1 : bool8    
         0 : bool8    
         0 : bool8    
         0 : bool8    
         1 : int32    
         0 : int64    
         0 : float64  
         0 : float64  
         0 : float64  
         2 : string   
Rereading 1 columns due to out-of-sample type exceptions.
Column 1 ("") bumped from 'bool8' to 'string' due to <<"Goulburn">> on row 0
[11] Read the data
  jumps=[0..360), chunk_size=1017828, total_size=366418143
Read 96%. ETA 00:00 
[12] Finalizing the datatable
Reread 22885380 rows x 1 columns in 00:00.493
Read 22885380 rows. Exactly what was estimated and allocated up front
=============================
   0.002s (  0%) Memory map 0.341GB file
   0.006s (  0%) sep=',' ncol=4 and header detection
   0.001s (  0%) Column type detection using 10027 sample rows
   0.306s ( 17%) Allocation of 22885380 rows x 4 cols (0.469GB)
   1.496s ( 83%) Reading 360 chunks of 0.971MB (63547 rows) using 12 threads
   =    0.001s (  0%) Finding first non-embedded \n after each jump
   +    0.193s ( 11%) Parse to row-major thread buffers
   +    0.935s ( 52%) Transpose
   +    0.367s ( 20%) Waiting
   0.493s ( 27%) Rereading 1 columns due to out-of-sample type exceptions
   1.811s        Total
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
  Using 12 threads (omp_get_max_threads()=12, nth=12)
  NAstrings = [<<>>]
  None of the NAstrings look like numbers.
  show progress = 1
  0/1 column will be read as boolean
[02] Opening the file
  Opening file SA2-by-DJZ-2011.csv
  File opened, size = 349.4MB (366418725 bytes).
  Memory mapping ... ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
  \r-only line endings are not allowed because \n is found in the data
[05] Skipping initial rows if needed
  Positioned on line 1 starting: <<Australian Bureau of Statistic>>
[06] Detect separator, quoting rule, and ncolumns
  Detecting sep ...
  sep=','  with 89 lines of 4 fields using quote rule 0
  Detected 4 columns on line 12. This line is either column names or first data row. Line starts as: <<"Goulburn","110018063",3499,>>
  Quote rule picked = 0
  fill=false and the most number of columns found is 4
[07] Detect column types, good nrow estimate and whether first row is column names
  'header' changed by user from 'auto' to false
  Number of sampling jump points = 101 because (366418375 bytes from row 1 to eof) / (2 * 1457 jump0size) == 125744
  Type codes (jump 000)    : 1551  Quote rule 0
  Type codes (jump 100)    : 11051  Quote rule 0
  =====
  Sampled 10027 rows (handled \n inside quoted fields) at 101 jump points
  Bytes from first data row on line 12 to the end of last row: 366418143
  Line length: mean=16.02 sd=0.21 min=16 max=29
  Estimated number of rows: 366418143 / 16.02 = 22877178
  Initial alloc = 25164895 rows (22877178 + 9%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
  =====
[08] Assign column names
[09] Apply user overrides on column types
  After 0 type and 0 drop user overrides : 11051
[10] Allocate memory for the datatable
  Allocating 4 column slots (4 - 0 dropped) with 25164895 rows
[11] Read the data
  jumps=[0..360), chunk_size=1017828, total_size=366418143
[12] Finalizing the datatable
Read 22885380 rows x 4 columns from 349.4MB (366418725 bytes) file in 00:01.141 wall clock time
Thread buffers were grown 0 times (if all 12 threads each grew once, this figure would be 12)
Final type counts
         0 : drop     
         1 : bool8    
         0 : bool8    
         0 : bool8    
         0 : bool8    
         1 : int32    
         0 : int64    
         0 : float64  
         0 : float64  
         0 : float64  
         2 : string   
Rereading 1 columns due to out-of-sample type exceptions.
Column 1 ("") bumped from 'bool8' to 'string' due to <<"Goulburn">> on row 0
[11] Read the data
  jumps=[0..360), chunk_size=1017828, total_size=366418143
Read 98%. ETA 00:00 
[12] Finalizing the datatable
Reread 22885380 rows x 1 columns in 00:00.506
Read 22885380 rows. Exactly what was estimated and allocated up front
=============================
   0.002s (  0%) Memory map 0.341GB file
   0.007s (  0%) sep=',' ncol=4 and header detection
   0.001s (  0%) Column type detection using 10027 sample rows
   0.132s (  8%) Allocation of 22885380 rows x 4 cols (0.469GB)
   1.506s ( 91%) Reading 360 chunks of 0.971MB (63547 rows) using 12 threads
   =    0.001s (  0%) Finding first non-embedded \n after each jump
   +    0.195s ( 12%) Parse to row-major thread buffers
   +    0.938s ( 57%) Transpose
   +    0.371s ( 23%) Waiting
   0.506s ( 31%) Rereading 1 columns due to out-of-sample type exceptions
   1.647s        Total
Warning: stack imbalance in 'for', 2 then 8
There were 20 warnings (use warnings() to see them)

Seltsamerweise ist diese Gewissheit groß. Vielen Dank. Das heißt, der Flush hat nicht funktioniert und ich muss einen Weg finden, um Rprintf zu vermeiden. Es funktioniert zuverlässig mit verbose=FALSE, showProgress=FALSE Ich nehme an (Sie haben das oben in dieser Ausgabe geschrieben, also verlasse ich mich darauf.) "Zuverlässig" bedeutet 10 gerade Läufe mit nur den zwei erwarteten Warnungen und ohne Sicht auf den Stapel Ungleichgewichtswarnung.
Lass es dann bei mir. Danke noch einmal.

@ HughParsonage Ok, bitte versuchen Sie es noch einmal mit dem letzten zweiten Versuch. Es ist noch nicht mit dem Master zusammengeführt. Achten Sie daher darauf, die Windows.zip-Datei aus dem Zweig hier abzurufen. Bitte geben Sie nach wie vor die vollständige Ausgabe an, damit ich sie überprüfen kann. Vielen Dank!

Der erste der folgenden Versuche führte zu einem Absturz (etwas über einen Zeiger).

Der zweite Versuch (nach dem Neustart) führt zu einer Warnung stack imbalance in '$', 16 then 15 .

# Assert that `data.table` is not installed:
stopifnot(!requireNamespace("data.table", quietly = TRUE))

install.packages("https://ci.appveyor.com/api/buildjobs/bpsehtwybbbgbyy3/artifacts/data.table_1.10.5.zip",
                 repos = NULL)
# Installing package into ‘C:/Users/hughp/Documents/R/win-library/3.4’
# (as ‘lib’ is unspecified)
# trying URL 'https://ci.appveyor.com/api/buildjobs/bpsehtwybbbgbyy3/artifacts/data.table_1.10.5.zip'
# Content type 'application/octet-stream' length 1557502 bytes (1.5 MB)
# downloaded 1.5 MB

library(data.table)
# data.table 1.10.5 IN DEVELOPMENT built 2017-11-17 01:38:17 UTC; appveyor
# The fastest way to learn (by data.table authors): https://www.datacamp.com/courses/data-analysis-the-data-table-way
# Documentation: ?data.table, example(data.table) and browseVignettes("data.table")
# Release notes, videos and slides: http://r-datatable.com

setwd("~/ABS-data/inbox/SA2-by-DJZ-2011/")
fread("SA2-by-DJZ-2011.csv", verbose = TRUE, na.strings = "", header = FALSE)
# Input contains no \n. Taking this to be a filename to open
# [01] Check arguments
# Using 12 threads (omp_get_max_threads()=12, nth=12)
# NAstrings = [<<>>]
# None of the NAstrings look like numbers.
# show progress = 1
# 0/1 column will be read as boolean
# [02] Opening the file
# Opening file SA2-by-DJZ-2011.csv
# File opened, size = 349.4MB (366418725 bytes).
# Memory mapping ... ok
# [03] Detect and skip BOM
# [04] Arrange mmap to be \0 terminated
# \r-only line endings are not allowed because \n is found in the data
# [05] Skipping initial rows if needed
# Positioned on line 1 starting: <<Australian Bureau of Statistic>>
#   [06] Detect separator, quoting rule, and ncolumns
# Detecting sep ...
# sep=','  with 89 lines of 4 fields using quote rule 0
# Detected 4 columns on line 12. This line is either column names or first data row. Line starts as: <<"Goulburn","110018063",3499,>>
#   Quote rule picked = 0
# fill=false and the most number of columns found is 4
# [07] Detect column types, good nrow estimate and whether first row is column names
# 'header' changed by user from 'auto' to false
# Number of sampling jump points = 101 because (366418375 bytes from row 1 to eof) / (2 * 1457 jump0size) == 125744
# Type codes (jump 000)    : 1551  Quote rule 0
# Type codes (jump 100)    : 1A51  Quote rule 0
# =====
#   Sampled 10027 rows (handled \n inside quoted fields) at 101 jump points
# Bytes from first data row on line 12 to the end of last row: 366418143
# Line length: mean=16.02 sd=0.21 min=16 max=29
# Estimated number of rows: 366418143 / 16.02 = 22877178
# Initial alloc = 25164895 rows (22877178 + 9%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
# =====
#   [08] Assign column names
# [09] Apply user overrides on column types
# After 0 type and 0 drop user overrides : 1A51
# [10] Allocate memory for the datatable
# Allocating 4 column slots (4 - 0 dropped) with 25164895 rows
# [11] Read the data
# jumps=[0..360), chunk_size=1017828, total_size=366418143
# Read 78%. ETA 00:00 Warning: stack imbalance in '$', 16 then 15
# Read 22885380 rows x 4 columns from 349.4MB (366418725 bytes) file in 00:02.677 wall clock time
# [12] Finalizing the datatable
# Type counts:
#   1 : bool8     '1'
# 1 : int32     '5'
# 2 : string    'A'
# =============================
#   0.002s (  0%) Memory map 0.341GB file
# 0.007s (  0%) sep=',' ncol=4 and header detection
# 0.001s (  0%) Column type detection using 10027 sample rows
# 0.297s ( 11%) Allocation of 25164895 rows x 4 cols (0.469GB) of which 22885380 ( 91%) rows used
# 2.369s ( 89%) Reading 360 chunks of 0.971MB (63547 rows) using 12 threads
# =    0.003s (  0%) Finding first non-embedded \n after each jump
# +    0.273s ( 10%) Parse to row-major thread buffers (grown 0 times)
# +    1.313s ( 49%) Transpose
# +    0.780s ( 29%) Waiting
# 0.893s ( 33%) Rereading 1 columns due to out-of-sample type exceptions
# 2.677s        Total
# Column 1 ("") bumped from 'bool8' to 'string' due to <<"Goulburn">> on row 0
# V1        V2      V3 V4
# 1: Goulburn 110018063    3499 NA
# 2:       NA 110018064     812 NA
# 3:       NA 110018065    2158 NA
# 4:       NA 110019999     402 NA
# 5:       NA 110028068      10 NA
# ---                              
#   22885376:       NA 997999799       0 NA
# 22885377:       NA 998999899      64 NA
# 22885378:       NA 994999499      34 NA
# 22885379:       NA 0&&&&&&&&  250796 NA
# 22885380:       NA 0@@@@@@@@ 7305367 NA
# Warning messages:
#   1: In fread("SA2-by-DJZ-2011.csv", verbose = TRUE, na.strings = "",  :
#                 Starting data input on line 12 <<"Goulburn","110018063",3499,>> with 4 fields and discarding line 11 <<"Main Statistical Area Structu>> before it because it has a different number of fields (3).
#               2: In fread("SA2-by-DJZ-2011.csv", verbose = TRUE, na.strings = "",  :
#               Found the last consistent line but text exists afterwards. Consider fill=TRUE and/or blank.lines.skip=TRUE. First 200 characters of discarded line: <<"Dataset: 2011 Census of Population and Housing">>

Hallo, @mattdowle. Es werden noch Versionen von GCC verwendet, deren OpenMP bestenfalls 3.1 und nicht 4.0 ist. Ich bin auf dieses Problem in einem meiner Pakete auf CRAN ( Delaporte ) gestoßen, wo ich versucht habe, eine SIMD-Direktive (OpenMP 4.0) zu verwenden, die mit Rtools für Windows (basierend auf 4.9.3) kompiliert wurde, aber auf einem Linux-Computer von jemandem, der noch gcc verwendet, einen Fehler ausgelöst hat 4.8.0. Selbst Windows kann nur 4.0- und nicht 4.5-Aufrufe verwenden, wenn ich mich richtig erinnere. Vielleicht trägt das zum Problem bei?

@ HughParsonage Danke, dass hast ! Ok, dann werde ich weiter nachdenken!
@aadler Das ist ein guter Gedanke - alles ist möglich.

@HughParsonage Nur um zu bestätigen, dass der gleiche Befehl mit nur einer Änderung ( verbose=FALSE ) gut funktioniert? dh fread("SA2-by-DJZ-2011.csv", verbose = FALSE, na.strings = "", header = FALSE) . Die Fortschrittsanzeige wird weiterhin angezeigt.

Ja, wenn Sie diesen Befehl (zehnmal) ausführen, wird das erwartete Ergebnis zurückgegeben (dh die Datentabelle mit nur zwei Warnungen, da sie schlecht formatiert ist). Keine Warnungen vor Stapelungleichgewichten.

Vielen Dank. Es scheint also mit der Konsolenausgabe zu zusammenhängen. Noch ein paar Dinge zum Ausprobieren ...

Im ausführlichen Modus gibt es einige Zweige innerhalb der parallelen Region, die wallclock() aufrufen. Ich habe das kurzgeschlossen, um immer 0.0 zurückzugeben und den Systemaufruf zu vermeiden, um das auszuschließen. Ich dachte, es sei threadsicher, aber vielleicht auch nicht. Bitte probieren Sie hier die neue Windows.zip-Datei aus dem neu erstellten Zweig aus .

Erster Versuch:

install.packages("https://ci.appveyor.com/api/buildjobs/o0pn9ttkrbqgqw2k/artifacts/data.table_1.10.5.zip",
                 repos = NULL)
# Installing package into ‘C:/Users/hughp/Documents/R/win-library/3.4’
# (as ‘lib’ is unspecified)
# trying URL 'https://ci.appveyor.com/api/buildjobs/o0pn9ttkrbqgqw2k/artifacts/data.table_1.10.5.zip'
# Content type 'application/octet-stream' length 1556972 bytes (1.5 MB)
# downloaded 1.5 MB
# 
# package ‘data.table’ successfully unpacked and MD5 sums checked

library(data.table)
# data.table 1.10.5 IN DEVELOPMENT built 2017-11-17 03:49:20 UTC; appveyor
# The fastest way to learn (by data.table authors): https://www.datacamp.com/courses/data-analysis-the-data-table-way
# Documentation: ?data.table, example(data.table) and browseVignettes("data.table")
# Release notes, videos and slides: http://r-datatable.com
fread("SA2-by-DJZ-2011.csv", verbose = TRUE, na.strings = "", header = FALSE)

image

Beim zweiten Versuch erhalte ich folgende Warnungen:

Read 22%. ETA 00:00 Error in fread("SA2-by-DJZ-2011.csv", verbose = TRUE, na.strings = "",  : 
  unprotect_ptr: pointer not found
In addition: Warning messages:
1: In fread("SA2-by-DJZ-2011.csv", verbose = TRUE, na.strings = "",  :
  Starting data input on line 12 <<"Goulburn","110018063",3499,>> with 4 fields and discarding line 11 <<"Main Statistical Area Structu>> before it because it has a different number of fields (3).
2: In fread("SA2-by-DJZ-2011.csv", verbose = TRUE, na.strings = "",  :
 Warning: stack imbalance in '$', 29 then 28
 Found the last consistent line but text exists afterwards. Consider fill=TRUE and/or blank.lines.skip=TRUE. First 200 characters of discarded line: <<"Dataset: 2011 Census of Population and Housing">>
Warning: stack imbalance in 'lapply', 125 then 126
Warning: stack imbalance in 'lapply', 55 then 53
Warning: stack imbalance in 'lapply', 30 then 34
Warning: stack imbalance in '<-', 28 then 31
Warning: stack imbalance in '{', 24 then 27
Warning: stack imbalance in '{', 18 then 21

Nur ein Gedanke: Könnte dies ein Problem mit RStudio sein? Das Ausführen des Skripts vom Terminal aus scheint sich nicht so einfach zu reproduzieren. Ich verwende RStudio, weil es das Kopieren der Konsolenausgabe erleichtert.

Wenn Sie sagen, dass es außerhalb von RStudio nicht so einfach reproduziert werden kann, wird es überhaupt reproduziert? Selbst wenn es nur in RStudio passiert, möchte ich es dennoch auf der Seite data.table beheben. Ich frage nur als eine andere Route, um zu bestätigen, dass es sich definitiv um "nur" Konsolenausgabe handelt und nicht um ein anderes echtes Stack-Ungleichgewicht in der Fread-Logik.

Ich muss noch außerhalb von RStudio reproduzieren und kann es zuverlässig innerhalb von RStudio reproduzieren (dh ich kann eine Warnung oder einen Absturz reproduzieren). Ich habe Windows-Eingabeaufforderung und Git-Shell (in Windows) versucht.

Ich verwende RStudio Version 1.1.383 unter Windows. Wäre es für Sie hilfreich, wenn ich dieses Problem auch bei ihnen ansprechen würde, oder möchten Sie, dass ich warte?

Vielen Dank. Das ist wirklich nützlich zu wissen, dass es sich nur um RStudio handelt. Keine Notwendigkeit, es mit ihnen zu erheben. Es bedeutet nur, dass es etwas mit der Pufferung der Ausgabekonsole (oder ähnlichem) zu tun hat. Ich habe eine Arbeit gemacht und bin dabei zu pushen.

Ich verstehe nicht, warum Windows diese Änderung nicht kompiliert:
fread.c:1054:3: warning: too many arguments for format [-Wformat-extra-args]
Funktioniert hier unter Linux und Travis einwandfrei. Dadurch wird verhindert, dass Windows.zip erstellt wird, damit Sie diese Problemumgehung testen können. Ich muss darauf schlafen.
(Es beschwert sich über Zeile 1054, aber nicht über die nächste Zeile 1055, die genau das gleiche ist. Muss ein Unterschied sein.% Llu ein Problem mit __VA_ARGS__ unter Windows - sicherlich nicht.)

Ok, schließlich ist windows.zip für Sie bereit , noch einmal versuchen bitte hier .

In diesem Zweig gibt es jetzt mehrere Problemumgehungen. Wenn es funktioniert, entferne ich die Problemumgehungen, um festzustellen, um welche es sich handelt. Die llu Compiler - Warnungen suchen die vielversprechendste als dass zu einer Stapel Ungleichgewicht in ausführlicher Ausgabe, im Einklang mit der Erklärung @ st-Pascha gefunden geben würde hier . Vielleicht hat die Rprintf -Ebene das vor dem Compiler versteckt, den sie jetzt sehen kann, da sie fprintf direkt verwendet.

image

Beim zweiten Versuch (nach dem Neustart)

stopifnot(!requireNamespace("data.table", quietly = TRUE))
install.packages("https://ci.appveyor.com/api/buildjobs/1pi0ae5iuyj9rhj8/artifacts/data.table_1.10.5.zip",
                 repos = NULL)
# Installing package into ‘C:/Users/hughp/Documents/R/win-library/3.4’
# (as ‘lib’ is unspecified)
# trying URL 'https://ci.appveyor.com/api/buildjobs/1pi0ae5iuyj9rhj8/artifacts/data.table_1.10.5.zip'
# Content type 'application/octet-stream' length 1559167 bytes (1.5 MB)
# downloaded 1.5 MB
# 
# package ‘data.table’ successfully unpacked and MD5 sums checked

library(data.table)
# data.table 1.10.5 IN DEVELOPMENT built 2017-11-18 04:58:23 UTC; appveyor
# The fastest way to learn (by data.table authors): https://www.datacamp.com/courses/data-analysis-the-data-table-way
# Documentation: ?data.table, example(data.table) and browseVignettes("data.table")
# Release notes, videos and slides: http://r-datatable.com

fread("SA2-by-DJZ-2011.csv", verbose = TRUE, na.strings = "", header = FALSE)
Log file:  C:\Users\hughp\AppData\Local\Temp\RtmpIT9H0D/fread.out 
Input contains no \n. Taking this to be a filename to open
Read 11%. ETA 00:00 Warning: stack imbalance in '$', 19 then 20
Read 28%. ETA 00:00 Warning: stack imbalance in '$', 19 then 20
Read 48%. ETA 00:00 Warning: stack imbalance in '$', 20 then 19
Read 98%. ETA 00:00 [01] Check arguments
  Using 12 threads (omp_get_max_threads()=12, nth=12)
  NAstrings = [<<>>]
  None of the NAstrings look like numbers.
  show progress = 1
  0/1 column will be read as boolean
[02] Opening the file
  Opening file SA2-by-DJZ-2011.csv
  File opened, size = 349.4MB (366418725 bytes).
  Memory mapped ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
  \n has been found in the data so any mixture of line endings is allowed other than \r-only line endings. This is common and ideal.
[05] Skipping initial rows if needed
  Positioned on line 1 starting: <<Australian Bureau of Statistic>>
[06] Detect separator, quoting rule, and ncolumns
  Detecting sep ...
  sep=','  with 89 lines of 4 fields using quote rule 0
  Detected 4 columns on line 12. This line is either column names or first data row. Line starts as: <<"Goulburn","110018063",3499,>>
  Quote rule picked = 0
  fill=false and the most number of columns found is 4
[07] Detect column types, good nrow estimate and whether first row is column names
  'header' changed by user from 'auto' to false
  Number of sampling jump points = 101 because (366418375 bytes from row 1 to eof) / (2 * 1457 jump0size) == 125744
  Type codes (jump 000)    : 1551  Quote rule 0
  Type codes (jump 100)    : 1A51  Quote rule 0
  =====
  Sampled 10027 rows (handled \n inside quoted fields) at 101 jump points
  Bytes from first data row on line 12 to the end of last row: 366418143
  Line length: mean=16.02 sd=0.21 min=16 max=29
  Estimated number of rows: 366418143 / 16.02 = 22877178
  Initial alloc = 25164895 rows (22877178 + 9%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
  =====
[08] Assign column names
[09] Apply user overrides on column types
  After 0 type and 0 drop user overrides : 1A51
[10] Allocate memory for the datatable
  Allocating 4 column slots (4 - 0 dropped) with 25164895 rows
[11] Read the data
  jumps=[0..360), chunk_size=1017828, total_size=366418143
[11] Read the data
  jumps=[0..360), chunk_size=1017828, total_size=366418143
Read 22885380 rows x 4 columns from 349.4MB (366418725 bytes) file in 00:02.822 wall clock time
[12] Finalizing the datatable
  Type counts:
         1 : bool8     '1'
         1 : int32     '5'
         2 : string    'A'
=============================
   0.000s (  0%) Memory map 0.341GB file
   0.001s (  0%) sep=',' ncol=4 and header detection
   0.000s (  0%) Column type detection using 10027 sample rows
   0.291s ( 10%) Allocation of 25164895 rows x 4 cols (0.469GB) of which 22885380 ( 91%) rows used
   2.531s ( 90%) Reading 360 chunks of 0.971MB (63547 rows) using 12 threads
   =    0.002s (  0%) Finding first non-embedded \n after each jump
   +    0.282s ( 10%) Parse to row-major thread buffers (grown 0 times)
   +    1.537s ( 54%) Transpose
   +    0.710s ( 25%) Waiting
   0.842s ( 30%) Rereading 1 columns due to out-of-sample type exceptions
   2.822s        Total
Column 1 ("") bumped from 'bool8' to 'string' due to <<"Goulburn">> on row 0
                V1        V2      V3 V4
       1: Goulburn 110018063    3499 NA
       2:       NA 110018064     812 NA
       3:       NA 110018065    2158 NA
       4:       NA 110019999     402 NA
       5:       NA 110028068      10 NA
      ---                              
22885376:       NA 997999799       0 NA
22885377:       NA 998999899      64 NA
22885378:       NA 994999499      34 NA
22885379:       NA 0&&&&&&&&  250796 NA
22885380:       NA 0@@@@@@@@ 7305367 NA
Warning messages:
1: In fread("SA2-by-DJZ-2011.csv", verbose = TRUE, na.strings = "",  :
  Starting data input on line 12 <<"Goulburn","110018063",3499,>> with 4 fields and discarding line 11 <<"Main Statistical Area Structu>> before it because it has a different number of fields (3).
2: In fread("SA2-by-DJZ-2011.csv", verbose = TRUE, na.strings = "",  :
  Found the last consistent line but text exists afterwards. Consider fill=TRUE and/or blank.lines.skip=TRUE. First 200 characters of discarded line: <<"Dataset: 2011 Census of Population and Housing">>

Auch außerhalb von RStudio nicht reproduzierbar.

Danke, dass du so schnell getestet hast. Nun, das schließt dann sicher viel aus! Zwei Ideen übrig. Zuerst drückte und ging. Bitte probieren Sie hier die neue Windows.zip alloca befindet sich auf dem Stapel und hat mit na.strings zu tun, die Sie gerade einstellen. Auf jeden Fall im richtigen Bereich (Stack-Ungleichgewicht) und einen Versuch wert.

Kein Problem - ich werde die nächsten 12 Stunden weg sein und kann bis dahin nicht testen.
Am Samstag, den 18. November 2017 um 17:20 Uhr schrieb Matt Dowle [email protected] :

Danke, dass du so schnell getestet hast. Nun, das schließt dann sicher viel aus!
Zwei Ideen übrig. Zuerst drückte und ging. Bitte versuchen Sie es mit neuer Windows.zip
Hier
https://ci.appveyor.com/project/Rdatatable/data-table/build/1.0.1363/job/fo02vnbu5ebhwy3w/artifacts .
Diese Zuordnung ist auf dem Stapel zugeordnet und hat mit na.strings zu tun, die
Sie stellen ein, wie es passiert. Auf jeden Fall im richtigen Bereich (Stapel
Ungleichgewicht) und einen Versuch wert.

- -
Sie erhalten dies, weil Sie erwähnt wurden.

Antworte direkt auf diese E-Mail und sieh sie dir auf GitHub an
https://github.com/Rdatatable/data.table/issues/2481#issuecomment-345421856 ,
oder schalten Sie den Thread stumm
https://github.com/notifications/unsubscribe-auth/AHvGDGa5Qnls5eSFBMaQO5s8DElfrpKSks5s3ncqgaJpZM4QcuPc
.

Ok, keine Sorge. Vielen Dank! Ich habe jetzt auch die 2. Idee vorangetrieben. Ich erinnere mich an \r , das in der Vergangenheit unter Windows ein Problem verursacht hat, aber ich erinnere mich nicht an ein Stapelungleichgewicht. Um dies auszuschließen, habe ich die \r von der Fortschrittsanzeige entfernt. Die Meldung "Stapelungleichgewicht" scheint dort gedruckt zu werden, wo die ETA-Zeilen auftreten. Es ist möglich, dass die Konsole \r abfängt und anders behandelt, sodass die letzte Zeile ersetzt wird. Sie sollten jetzt jedes Mal eine neue Zeile sehen, wenn die ETA aktualisiert wird. Nur vorübergehend, um das auszuschließen. Neue Windows.zip gebaut und hier übergeben .

fread("SA2-by-DJZ-2011.csv", verbose = TRUE, na.strings = "", header = FALSE)
Log file:  C:\Users\hughp\AppData\Local\Temp\RtmpcVjZ1f/fread.out 
Input contains no \n. Taking this to be a filename to open
Read 5%. ETA 00:00
Read 8%. ETA 00:00
Read 11%. ETA 00:00
Read 15%. ETA 00:00
Read 18%. ETA 00:00
Read 21%. ETA 00:00
Read 25%. ETA 00:00
Read 28%. ETA 00:00
Read 31%. ETA 00:00
Read 35%. ETA 00:00
Read 38%. ETA 00:00
Read 41%. ETA 00:00
Read 45%. ETA 00:00
Read 48%. ETA 00:00
Read 51%. ETA 00:00
Read 55%. ETA 00:00
Warning: stack imbalance in '$', 30 then 31
Warning: stack imbalance in '$', 17 then 16
Read 58%. ETA 00:00
Read 61%. ETA 00:00
Read 65%. ETA 00:00
Read 68%. ETA 00:00
Read 71%. ETA 00:00
Read 75%. ETA 00:00
Read 78%. ETA 00:00
Read 81%. ETA 00:00
Read 85%. ETA 00:00
Read 88%. ETA 00:00
Read 91%. ETA 00:00
Read 95%. ETA 00:00
Read 98%. ETA 00:00
[01] Check arguments
  Using 12 threads (omp_get_max_threads()=12, nth=12)
  NAstrings = [<<>>]
  None of the NAstrings look like numbers.
  show progress = 1
  0/1 column will be read as boolean
[02] Opening the file
  Opening file SA2-by-DJZ-2011.csv
  File opened, size = 349.4MB (366418725 bytes).
  Memory mapped ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
  \n has been found in the data so any mixture of line endings is allowed other than \r-only line endings. This is common and ideal.
[05] Skipping initial rows if needed
  Positioned on line 1 starting: <<Australian Bureau of Statistic>>
[06] Detect separator, quoting rule, and ncolumns
  Detecting sep ...
  sep=','  with 89 lines of 4 fields using quote rule 0
  Detected 4 columns on line 12. This line is either column names or first data row. Line starts as: <<"Goulburn","110018063",3499,>>
  Quote rule picked = 0
  fill=false and the most number of columns found is 4
[07] Detect column types, good nrow estimate and whether first row is column names
  'header' changed by user from 'auto' to false
  Number of sampling jump points = 101 because (366418375 bytes from row 1 to eof) / (2 * 1457 jump0size) == 125744
  Type codes (jump 000)    : 1551  Quote rule 0
  Type codes (jump 100)    : 1A51  Quote rule 0
  =====
  Sampled 10027 rows (handled \n inside quoted fields) at 101 jump points
  Bytes from first data row on line 12 to the end of last row: 366418143
  Line length: mean=16.02 sd=0.21 min=16 max=29
  Estimated number of rows: 366418143 / 16.02 = 22877178
  Initial alloc = 25164895 rows (22877178 + 9%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
  =====
[08] Assign column names
[09] Apply user overrides on column types
  After 0 type and 0 drop user overrides : 1A51
[10] Allocate memory for the datatable
  Allocating 4 column slots (4 - 0 dropped) with 25164895 rows
[11] Read the data
  jumps=[0..360), chunk_size=1017828, total_size=366418143
[11] Read the data
  jumps=[0..360), chunk_size=1017828, total_size=366418143
Read 22885380 rows x 4 columns from 349.4MB (366418725 bytes) file in 00:02.894 wall clock time
[12] Finalizing the datatable
  Type counts:
         1 : bool8     '1'
         1 : int32     '5'
         2 : string    'A'
=============================
   0.001s (  0%) Memory map 0.341GB file
   0.003s (  0%) sep=',' ncol=4 and header detection
   0.000s (  0%) Column type detection using 10027 sample rows
   0.316s ( 11%) Allocation of 25164895 rows x 4 cols (0.469GB) of which 22885380 ( 91%) rows used
   2.574s ( 89%) Reading 360 chunks of 0.971MB (63547 rows) using 12 threads
   =    0.004s (  0%) Finding first non-embedded \n after each jump
   +    0.284s ( 10%) Parse to row-major thread buffers (grown 0 times)
   +    1.450s ( 50%) Transpose
   +    0.837s ( 29%) Waiting
   0.953s ( 33%) Rereading 1 columns due to out-of-sample type exceptions
   2.894s        Total
Column 1 ("") bumped from 'bool8' to 'string' due to <<"Goulburn">> on row 0
                V1        V2      V3 V4
       1: Goulburn 110018063    3499 NA
       2:       NA 110018064     812 NA
       3:       NA 110018065    2158 NA
       4:       NA 110019999     402 NA
       5:       NA 110028068      10 NA
      ---                              
22885376:       NA 997999799       0 NA
22885377:       NA 998999899      64 NA
22885378:       NA 994999499      34 NA
22885379:       NA 0&&&&&&&&  250796 NA
22885380:       NA 0@@@@@@@@ 7305367 NA
Warning messages:
1: In fread("SA2-by-DJZ-2011.csv", verbose = TRUE, na.strings = "",  :
  Starting data input on line 12 <<"Goulburn","110018063",3499,>> with 4 fields and discarding line 11 <<"Main Statistical Area Structu>> before it because it has a different number of fields (3).
2: In fread("SA2-by-DJZ-2011.csv", verbose = TRUE, na.strings = "",  :
  Found the last consistent line but text exists afterwards. Consider fill=TRUE and/or blank.lines.skip=TRUE. First 200 characters of discarded line: <<"Dataset: 2011 Census of Population and Housing">>

Zu Ihrer Information: Ich konnte diesen Stapelungleichgewichtsfehler auf einem anderen Windows-Computer mit einer etwas älteren Version von RStudio nicht reproduzieren.

In diesem Fall scheint es an der Zeit zu sein, den RStudio-Support zu fragen, wie Sie es dann vorgeschlagen haben. Ich habe den Fread-Code noch einmal durchgesehen und habe keine Ideen mehr auf meiner Seite. Bitte teilen Sie ihnen die beiden Versionsnummern von RStudio mit. Dies bedeutet nicht unbedingt, dass es sich um RStudio handelt. Es kann sich um einen Fehler auf der Seite von data.table handeln, der zufällig in einer Version von RStudio angezeigt wird. Aber es ist seltsam, dass es mit der Konsolenausgabe zusammenhängt und dass dies etwas anderes und RStudio-spezifisches ist. Ich habe nach "RStudio-Stack-Imabalance" gesucht, aber es treten viele Probleme bei Paketfehlern auf, nicht bei RStudio an sich. Schwieriges Problem zu suchen. Lassen Sie uns das Thema hier offen halten und sehen, was sie sagen.

Ich bezweifle, dass der letzte Versuch helfen wird, aber der Vollständigkeit halber versuchen Sie es bitte hier . Vielleicht macht der MinGW-Compiler, der unter Windows verwendet wird, etwas Seltsames mit diesen beiden Ints. Eine davon ist die Konstante 0, die möglicherweise wegoptimiert wird und dann irgendwie ein Stapelungleichgewicht verursacht.

Diese bestimmte Stapelungleichgewichtsnachricht kommt jedoch von eval.c: 491 in R selbst. Ein Thread muss diese Zeile ausführen, aber ich glaube nicht, dass es fread oder data.table . Das check_stack_balance() wird nur von 5 Stellen in R-Interna aufgerufen:
in names.c am Ende von do_internal()
in objects.c , zweimal in applyMethod()
in eval.c , zweimal in eval()
Ich sehe nicht, wie eines davon erreicht werden kann, während sich fread.c in seinem parallelen Abschnitt befindet. Der einzige Einstiegspunkt, der aufgerufen wird, ist REprintf und ich sehe nicht, wie das check_stack_balance() . Alles, was ich derzeit denken kann, ist, dass RStudio einen Thread hat, der etwas im Hintergrund tut, das möglicherweise mit der Konsolenausgabe interagiert, möglicherweise anders unter Windows.
Der Vollständigkeit halber scheint die Verwendung von REprintf der richtige Weg zu sein, da Basis R dies (anstelle von Rprintf) in seiner Fortschrittsanzeige in libcurl.c: 354 und internet.c: 409 verwendet . Es ist eine Schande, dass der Fortschrittsbalken von R auf C-Ebene in der API von R nicht verfügbar ist (er scheint auch zweimal in R auf C-Ebene implementiert zu sein).

@ Mattdowle , wäre das hilfreich? https://github.com/r-lib/progress

@aader Ja - danke! Seine Quelle enthält diesen Kommentar :
// In R Studio we should print to stdout, because printing a \r
// to stderr is buggy (reported)
Aber ich habe bereits \r und das Stack-Ungleichgewicht tritt immer noch auf. Ich frage mich, wo es gemeldet wurde.

Letzter Build hat auch nicht funktioniert:

image

Berichtet unter https://community.rstudio.com/t/stack-imbalance-possibly-in-stderr/3009

Rechtzeitige Frage zu R-devel: [Rd] Sind Rprintf und REprintf threadsicher?

Fazit "Rprintf und REprintf sind nicht threadsicher."

Yoiks!

Vielen Dank für die Links und Hugh, dass er das Problem mit RStudio angesprochen hat.

data.table::fwrite() und data.table::fread() sind sich bewusst, dass Rprintf und REprintf nicht threadsicher sind. Für ihre Fortschrittsanzeigen rufen sie sie daher nur vom Master-Thread auf. Nicht nur, dass zwei data.table-Threads diesen R-Einstiegspunkt niemals gleichzeitig aufrufen, sondern auch nur der Master-Thread, und dies ist der einzige R-Einstiegspunkt, der von einem der Threads zu irgendeinem Zeitpunkt während des aufgerufen wird Parallelschnitt. Rprintf ruft jedoch alle 100 Drucke R_CheckUserInterrupt . Ich denke, das ist der Teil, der wahrscheinlich nicht einmal vor dem Master-Thread sicher ist. Das ist der Grund für die Verwendung von REprintf da dies nicht R_CheckUserInterrupt nennt. R-Interna verwenden REprintf für Fortschrittsanzeigen. Daher ist es sinnvoll, auf REprintf zu wechseln, um die Konsistenz mit Kern R zu gewährleisten. dh diese Wahl hat per se nichts mit stderr vs stdout zu tun.

@kevinushey Würde es Ihnen etwas Rprintf / REprintf von zwei Threads gleichzeitig aufgerufen werden. Aber wenn das der Fall wäre, hätten wir vorher viel mehr Probleme gesehen. Das scheint also sehr unwahrscheinlich. Vielleicht ersetzt RStudio die in Abschnitt 8.1.2 von R-exts erwähnten ptr_* Rückrufe - diese beziehen sich auf die Konsolenausgabe und -interaktion . Dieser Abschnitt beginnt jedoch mit "Für Unix-Alikes", daher weiß ich nicht, wie Windows hereinkommt. Vielleicht sind auch Abschnitt 8.1.5-Threading-Probleme relevant. Beides sind Unterabschnitte von Abschnitt 8: "Verknüpfen von GUIs und anderen Frontends mit R".

Ich werde bis Anfang Dezember unterwegs sein, daher habe ich bis dahin leider keine Gelegenheit, einen Blick darauf zu werfen. RStudio führt jedoch fast alles im Hauptthread mithilfe der R-Ereignisschleife aus. Die einzigen Ausnahmen sind beispielsweise die Indizierung von Dateien auf Projektebene, und diese Hintergrundthreads berühren normalerweise keine R-APIs.

RStudio übernimmt die verschiedenen ptr_* Rückrufe für die Verarbeitung der Konsoleneingabe und -ausgabe. Ich kann mir nicht sofort vorstellen, wie sie hier eine Ursache sein könnten, aber ich werde versuchen, einen tieferen Blick darauf zu werfen, wenn ich wieder da bin.

Ok, bitte versuchen Sie es hier . Zuvor wurde der Fortschrittsstatus alle 2% aktualisiert. In Ihrem Fall dauert Ihre Datei nur knapp 3 Sekunden, sodass alle 0,06 Sekunden ein neues Fortschrittsupdate für die RStudio-Konsole erstellt wurde. Vielleicht war das zu viel für RStudio. Dieser Versuch druckt also einen Balken. \r überhaupt nicht verwendet. Dies sollte ohnehin besser für Berichte und Protokolldateien sein, bei denen \r die Ausgabe füllen könnte.

Da Ihr Timing von 3 Sekunden ziemlich schnell ist, habe ich den Fortschrittsbalken auf 1 Sekunde reduziert, wenn von dort aus eine ETA von 1 Sekunde vorliegt. Andernfalls würde es überhaupt nicht angezeigt und funktioniert für Ihre Datei, nur weil es nicht angezeigt wurde. Nachdem Sie getestet haben, werde ich auf das erhöhen, was fwrite hat. dh beginnen Sie bei 2 Sekunden, wenn ETA 2 Sekunden von dort entfernt ist.

Hallo, @mattdowle. Mein letzter Kommentar in # 2503 könnte auch mit diesem Problem zusammenhängen.

Sieht gut aus! Keine Warnungen (nach 5 Läufen). Führen Sie zuerst unten aus (beachten Sie, dass führende Leerzeichen in der tatsächlichen Ausgabe anders aussehen):

stopifnot(!requireNamespace("data.table", quietly = TRUE))
install.packages("https://ci.appveyor.com/api/buildjobs/1o9s06o31v8i3ljr/artifacts/data.table_1.10.5.zip",
                 repos = NULL)
# Installing package into ‘C:/Users/hughp/Documents/R/win-library/3.4’
# (as ‘lib’ is unspecified)
# trying URL 'https://ci.appveyor.com/api/buildjobs/1o9s06o31v8i3ljr/artifacts/data.table_1.10.5.zip'
# Content type 'application/octet-stream' length 1557423 bytes (1.5 MB)
# downloaded 1.5 MB
# 
# package ‘data.table’ successfully unpacked and MD5 sums checked

library(data.table)
# data.table 1.10.5 IN DEVELOPMENT built 2017-11-30 00:21:00 UTC; appveyor
# The fastest way to learn (by data.table authors): https://www.datacamp.com/courses/data-analysis-the-data-table-way
# Documentation: ?data.table, example(data.table) and browseVignettes("data.table")
# Release notes, videos and slides: http://r-datatable.com
setwd("~/ABS-data/inbox/SA2-by-DJZ-2011/")
fread("SA2-by-DJZ-2011.csv", verbose = TRUE, na.strings = "", header = FALSE)
# Input contains no \n. Taking this to be a filename to open
# [01] Check arguments
# Using 12 threads (omp_get_max_threads()=12, nth=12)
# NAstrings = [<<>>]
# None of the NAstrings look like numbers.
# show progress = 1
# 0/1 column will be read as boolean
# [02] Opening the file
# Opening file SA2-by-DJZ-2011.csv
# File opened, size = 349.4MB (366418725 bytes).
# Memory mapped ok
# [03] Detect and skip BOM
# [04] Arrange mmap to be \0 terminated
# \n has been found in the data so any mixture of line endings is allowed other than \r-only line endings. This is common and ideal.
# [05] Skipping initial rows if needed
# Positioned on line 1 starting: <<Australian Bureau of Statistic>>
#   [06] Detect separator, quoting rule, and ncolumns
# Detecting sep ...
# sep=','  with 89 lines of 4 fields using quote rule 0
# Detected 4 columns on line 12. This line is either column names or first data row. Line starts as: <<"Goulburn","110018063",3499,>>
#   Quote rule picked = 0
# fill=false and the most number of columns found is 4
# [07] Detect column types, good nrow estimate and whether first row is column names
# 'header' changed by user from 'auto' to false
# Number of sampling jump points = 101 because (366418375 bytes from row 1 to eof) / (2 * 1457 jump0size) == 125744
# Type codes (jump 000)    : 1551  Quote rule 0
# Type codes (jump 100)    : 1A51  Quote rule 0
# =====
#   Sampled 10027 rows (handled \n inside quoted fields) at 101 jump points
# Bytes from first data row on line 12 to the end of last row: 366418143
# Line length: mean=16.02 sd=0.21 min=16 max=29
# Estimated number of rows: 366418143 / 16.02 = 22877178
# Initial alloc = 25164895 rows (22877178 + 9%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
# =====
#   [08] Assign column names
# [09] Apply user overrides on column types
# After 0 type and 0 drop user overrides : 1A51
# [10] Allocate memory for the datatable
# Allocating 4 column slots (4 - 0 dropped) with 25164895 rows
# [11] Read the data
# jumps=[0..360), chunk_size=1017828, total_size=366418143
# jumps=[0..360), chunk_size=1017828, total_size=366418143
# |--------------------------------------------------|
#   |==================================================|
#   Read 22885380 rows x 4 columns from 349.4MB (366418725 bytes) file in 00:02.280 wall clock time
# [12] Finalizing the datatable
# Type counts:
#   1 : bool8     '1'
# 1 : int32     '5'
# 2 : string    'A'
# =============================
#   0.005s (  0%) Memory map 0.341GB file
# 0.037s (  2%) sep=',' ncol=4 and header detection
# 0.000s (  0%) Column type detection using 10027 sample rows
# 0.321s ( 14%) Allocation of 25164895 rows x 4 cols (0.469GB) of which 22885380 ( 91%) rows used
# 1.917s ( 84%) Reading 360 chunks of 0.971MB (63547 rows) using 12 threads
# =    0.011s (  0%) Finding first non-embedded \n after each jump
# +    0.560s ( 25%) Parse to row-major thread buffers (grown 0 times)
# +    0.488s ( 21%) Transpose
# +    0.858s ( 38%) Waiting
# 0.999s ( 44%) Rereading 1 columns due to out-of-sample type exceptions
# 2.280s        Total
# Column 1 ("") bumped from 'bool8' to 'string' due to <<"Goulburn">> on row 0
# V1        V2      V3 V4
# 1: Goulburn 110018063    3499 NA
# 2:       NA 110018064     812 NA
# 3:       NA 110018065    2158 NA
# 4:       NA 110019999     402 NA
# 5:       NA 110028068      10 NA
# ---                              
#   22885376:       NA 997999799       0 NA
# 22885377:       NA 998999899      64 NA
# 22885378:       NA 994999499      34 NA
# 22885379:       NA 0&&&&&&&&  250796 NA
# 22885380:       NA 0@@@@@@@@ 7305367 NA
# Warning messages:
#   1: In fread("SA2-by-DJZ-2011.csv", verbose = TRUE, na.strings = "",  :
#                 Starting data input on line 12 <<"Goulburn","110018063",3499,>> with 4 fields and discarding line 11 <<"Main Statistical Area Structu>> before it because it has a different number of fields (3).
#               2: In fread("SA2-by-DJZ-2011.csv", verbose = TRUE, na.strings = "",  :
#               Found the last consistent line but text exists afterwards. Consider fill=TRUE and/or blank.lines.skip=TRUE. First 200 characters of discarded line: <<"Dataset: 2011 Census of Population and Housing">>

@ HughParsonage Relief! Ich denke, das ist dann ein Gewinn. Ich werde aufräumen, zusammenführen und weitermachen. Vielen Dank an Sie für das Testen.

@aadler Ja, stimmte zu, dass Ihr Kommentar in Ausgabe Nr. 2503 genauso aussieht. Könnten Sie bitte auch das Neueste von dev testen und bestätigen, dass es jetzt behoben ist? Wir hoffen, dass das Problem mit as.IDate Sie gefunden haben, tatsächlich durch das frühere Stack-Ungleichgewicht verursacht wurde.

Nicht gut :(

> library(data.table)
data.table 1.10.5 IN DEVELOPMENT built 2017-11-30 00:21:00 UTC; appveyor
  The fastest way to learn (by data.table authors): https://www.datacamp.com/courses/data-analysis-the-data-table-way
  Documentation: ?data.table, example(data.table) and browseVignettes("data.table")
  Release notes, videos and slides: http://r-datatable.com
> sessionInfo()
R version 3.4.2 beta (2017-09-17 r73296)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows Server >= 2012 x64 (build 9200)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252    LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                           LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] data.table_1.10.5

loaded via a namespace (and not attached):
[1] compiler_3.4.2 tools_3.4.2   
> DT <- fread('2017-11-22_1999_Performance.csv', header = TRUE, colClasses = CLS, select = SEL, key = c('LoanID', 'Month'), verbose = TRUE)
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
  Using 40 threads (omp_get_max_threads()=40, nth=40)
  NAstrings = [<<NA>>]
  None of the NAstrings look like numbers.
  show progress = 1
  0/1 column will be read as boolean
[02] Opening the file
  Opening file 2017-11-22_1999_Performance.csv
  File opened, size = 6.355GB (6823372783 bytes).
  Memory mapped ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
  \n has been found in the data so any mixture of line endings is allowed other than \r-only line endings. This is common and ideal.
[05] Skipping initial rows if needed
  Positioned on line 1 starting: <<Orig_Year,Orig_Qtr,LoanID,Mont>>
[06] Detect separator, quoting rule, and ncolumns
  Detecting sep ...
  sep=','  with 100 lines of 37 fields using quote rule 0
  Detected 37 columns on line 1. This line is either column names or first data row. Line starts as: <<Orig_Year,Orig_Qtr,LoanID,Mont>>
  Quote rule picked = 0
  fill=false and the most number of columns found is 37
[07] Detect column types, good nrow estimate and whether first row is column names
  'header' changed by user from 'auto' to true
  Number of sampling jump points = 101 because (6823372781 bytes from row 1 to eof) / (2 * 12905 jump0size) == 264369
  Type codes (jump 000)    : 51AA7155A15A7111111111111771111177715  Quote rule 0
  Type codes (jump 001)    : 51AA7155A15A711111111111177111117771A  Quote rule 0
  Type codes (jump 008)    : 51AA7555A15A711111111111177111117771A  Quote rule 0
  Type codes (jump 009)    : 51AA7555AA5A71A155555557177111117775A  Quote rule 0
  Type codes (jump 042)    : 55AA7555AA5A71A155555557177111117775A  Quote rule 0
  Type codes (jump 064)    : 55AA7555AA5A71A1A5555557177111117775A  Quote rule 0
  Type codes (jump 100)    : 55AA7555AA5A71A1A5555557177111117775A  Quote rule 0
  =====
  Sampled 10049 rows (handled \n inside quoted fields) at 101 jump points
  Bytes from first data row on line 1 to the end of last row: 6823372781
  Line length: mean=126.15 sd=8.30 min=100 max=359
  Estimated number of rows: 6823372781 / 126.15 = 54088821
  Initial alloc = 62279495 rows (54088821 + 15%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
  =====
[08] Assign column names
[09] Apply user overrides on column types
  After 4 type and 23 drop user overrides : 00AA700000000000000000070775555077750
[10] Allocate memory for the datatable
  Allocating 14 column slots (37 - 23 dropped) with 62279495 rows
[11] Read the data
  jumps=[0..6520), chunk_size=1046529, total_size=6823372422
|--------------------------------------------------|
|=======Warning: stack imbalance in '$', 27 then 26
===Warning: stack imbalance in '$', 26 then 27
================Error in fread("2017-11-22_1999_Performance.csv", header = TRUE, colClasses = CLS,  : 
  unprotect_ptr: pointer not found

@aadler Danke für diesen Bericht. Ich habe freadR durchlaufen und den Schutz lokalisiert. Es besteht eine Wahrscheinlichkeit von 30%, dass dies funktioniert, da Sie in Ihrem Fall Typen überschreiben und in diesem Teil des Codes einige Schutzmaßnahmen vorhanden waren. Bitte versuchen Sie es erneut mit diesem Build .

@aadler Wenn Sie den letzten Build noch nicht ausprobiert haben, versuchen Sie es gleich mit diesem . Wenn es mir möglich ist, eine Kopie Ihrer Datei an mich zu senden, kann ich mich möglicherweise unter Windows RStudio versuchen.

:(

> library(data.table)
data.table 1.10.5 IN DEVELOPMENT built 2017-12-01 01:54:04 UTC; appveyor
  The fastest way to learn (by data.table authors): https://www.datacamp.com/courses/data-analysis-the-data-table-way
  Documentation: ?data.table, example(data.table) and browseVignettes("data.table")
  Release notes, videos and slides: http://r-datatable.com
> sessionInfo()
R version 3.4.2 beta (2017-09-17 r73296)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows Server >= 2012 x64 (build 9200)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252    LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                           LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] data.table_1.10.5

loaded via a namespace (and not attached):
[1] compiler_3.4.2 tools_3.4.2   
> ColCLASS <- c(rep('integer', 2L), 'character', 'Date', 'numeric',
+               rep('integer', 3L), rep('character', 2L),
+               'integer', 'Date', rep('numeric', 2L), 'Date',
+               rep('numeric', 12L), rep('integer', 5),
+               rep('numeric', 3L), 'integer', 'character')
> SELCOL <- c(WHATEVER)
> DT <- fread('LargeFile.csv', header = TRUE, colClasses = ColCLASS, select = SELCOL, key = c('LoanID', 'Month'), verbose = TRUE)
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
  Using 40 threads (omp_get_max_threads()=40, nth=40)
  NAstrings = [<<NA>>]
  None of the NAstrings look like numbers.
  show progress = 1
  0/1 column will be read as boolean
[02] Opening the file
  Opening file LargeFile.csv
  File opened, size = 6.355GB (6823372783 bytes).
  Memory mapped ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
  \n has been found in the data so any mixture of line endings is allowed other than \r-only line endings. This is common and ideal.
[05] Skipping initial rows if needed
  Positioned on line 1 starting: <<Orig_Year,Orig_Qtr,LoanID,Mont>>
[06] Detect separator, quoting rule, and ncolumns
  Detecting sep ...
  sep=','  with 100 lines of 37 fields using quote rule 0
  Detected 37 columns on line 1. This line is either column names or first data row. Line starts as: <<Orig_Year,Orig_Qtr,LoanID,Mont>>
  Quote rule picked = 0
  fill=false and the most number of columns found is 37
[07] Detect column types, good nrow estimate and whether first row is column names
  'header' changed by user from 'auto' to true
  Number of sampling jump points = 101 because (6823372781 bytes from row 1 to eof) / (2 * 12905 jump0size) == 264369
  Type codes (jump 000)    : 51AA7155A15A7111111111111771111177715  Quote rule 0
  Type codes (jump 001)    : 51AA7155A15A711111111111177111117771A  Quote rule 0
  Type codes (jump 008)    : 51AA7555A15A711111111111177111117771A  Quote rule 0
  Type codes (jump 009)    : 51AA7555AA5A71A155555557177111117775A  Quote rule 0
  Type codes (jump 042)    : 55AA7555AA5A71A155555557177111117775A  Quote rule 0
  Type codes (jump 064)    : 55AA7555AA5A71A1A5555557177111117775A  Quote rule 0
  Type codes (jump 100)    : 55AA7555AA5A71A1A5555557177111117775A  Quote rule 0
  =====
  Sampled 10049 rows (handled \n inside quoted fields) at 101 jump points
  Bytes from first data row on line 1 to the end of last row: 6823372781
  Line length: mean=126.15 sd=8.30 min=100 max=359
  Estimated number of rows: 6823372781 / 126.15 = 54088821
  Initial alloc = 62279495 rows (54088821 + 15%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
  =====
[08] Assign column names
[09] Apply user overrides on column types
  After 4 type and 23 drop user overrides : 00AA700000000000000000070775555077750
[10] Allocate memory for the datatable
  Allocating 14 column slots (37 - 23 dropped) with 62279495 rows
[11] Read the data
  jumps=[0..6520), chunk_size=1046529, total_size=6823372422
|--------------------------------------------------|
|Error in fread("LargeFile.csv", header = TRUE, colClasses = ColCLASS,  : 
  unprotect_ptr: pointer not found

Dank @aadler per E-Mail kann ich jetzt reproduzieren. R 3.4.2, neueste Version von RStudio 1.1.383 und Windows 10 Pro 10.0.16299 Build 16299.

Ich sehe seltsames Verhalten in RStudio, das hier aufgezeichnet wurde:
https://www.youtube.com/watch?v=tl2x2vmZxMU
Es scheint, dass RStudio GCs nur durch Eingabe generiert. Warum ist das so und gibt es überhaupt eine Möglichkeit, es auszuschalten? Wenn fread() seinen Fortschrittsbalken druckt, denkt die separate Ereignisschleife von RStudio möglicherweise, dass die Ausgabe an die Konsole vom Benutzer eingegeben wird, und ruft R auf, wodurch der GC ausgelöst wird und alles ausgelöst wird. Vielleicht wissen RStudio-Benutzer hier, könnten mich in die richtige Richtung weisen, oder vielleicht ist @kevinushey zurück (Sie haben Anfang Dezember Kevin gesagt, und es ist heute der 1. heute: -))

Ich kann das Ungleichgewicht des Stapels in der RStudio-Konsole zuverlässig reproduzieren. Wenn ich die Registerkarte RStudio-Terminal verwende, kann ich sie selbst mit gcinfo(TRUE) überhaupt nicht reproduzieren. Interessanterweise treten GCs auf, wenn der Fortschrittsbalken gedruckt wird, und das scheint in Ordnung zu sein, da es auch unter Linux in Ordnung ist. Angesichts des Verhaltens in diesem Video der RStudio-Konsole komme ich zu dem Schluss, dass dies ein Fehler der RStudio-Konsole ist. Ich konnte den Text nicht aus dem RStudio Terminal-Fenster kopieren (Bearbeiten-> Kopieren funktioniert nicht und Strg-C auch nicht). Daher habe ich einen Screenshot der Registerkarte Terminal erstellt, um zu zeigen, dass GC während des Fortschrittsbalkens in Ordnung ist. Ich würde erwarten, dass es in Ordnung ist, da nur der Master-Thread REprintf aufruft und die anderen Threads überhaupt keine R-API aufrufen.

Funktioniert gut im RStudio Terminal:
selection_014
Beachten Sie, dass es GCs gibt, während der Fortschrittsbalken zum ersten Mal gedruckt wird, und dass dies im RStudio-Terminal einwandfrei funktioniert. Der Fortschrittsbalken wird ein zweites Mal gedruckt, da diese Testdatei eine Ausnahme vom Typ "Out-of-Sample" enthält, die ein automatisches erneutes Lesen nur für diese Spalten auslöst.

In der RStudio-Konsole gibt es jedoch entweder stack imbalance oder unprotect_ptr: pointer not found :

R version 3.4.2 (2017-09-28) -- "Short Summer"
> gcinfo(TRUE)
[1] FALSE
Garbage collection 22 = 16+3+3 (level 0) ... 
25.5 Mbytes of cons cells used (80%)
6.6 Mbytes of vectors used (66%)
Garbage collection 23 = 16+4+3 (level 1) ... 
24.9 Mbytes of cons cells used (79%)
6.5 Mbytes of vectors used (64%)
Garbage collection 24 = 17+4+3 (level 0) ... 
25.1 Mbytes of cons cells used (79%)
6.5 Mbytes of vectors used (64%)
Garbage collection 25 = 18+4+3 (level 0) ... 
25.1 Mbytes of cons cells used (79%)
6.5 Mbytes of vectors used (64%)
Garbage collection 26 = 19+4+3 (level 0) ... 
25.1 Mbytes of cons cells used (79%)
6.5 Mbytes of vectors used (65%)
Garbage collection 27 = 20+4+3 (level 0) ... 
25.2 Mbytes of cons cells used (80%)
6.5 Mbytes of vectors used (65%)
Garbage collection 28 = 20+5+3 (level 1) ... 
25.1 Mbytes of cons cells used (79%)
6.5 Mbytes of vectors used (65%)
Garbage collection 29 = 21+5+3 (level 0) ... 
25.1 Mbytes of cons cells used (79%)
6.5 Mbytes of vectors used (65%)
Garbage collection 30 = 22+5+3 (level 0) ... 
25.2 Mbytes of cons cells used (80%)
6.5 Mbytes of vectors used (65%)
Garbage collection 31 = 23+5+3 (level 0) ... 
25.2 Mbytes of cons cells used (80%)
6.6 Mbytes of vectors used (66%)
Garbage collection 32 = 24+5+3 (level 0) ... 
25.3 Mbytes of cons cells used (80%)
6.6 Mbytes of vectors used (66%)
Garbage collection 33 = 25+5+3 (level 0) ... 
25.4 Mbytes of cons cells used (80%)
6.7 Mbytes of vectors used (66%)
Garbage collection 34 = 25+5+4 (level 2) ... 
24.6 Mbytes of cons cells used (61%)
6.4 Mbytes of vectors used (50%)
Garbage collection 35 = 26+5+4 (level 0) ... 
25.0 Mbytes of cons cells used (62%)
6.5 Mbytes of vectors used (52%)
> require(data.table)
Loading required package: data.table
Garbage collection 36 = 27+5+4 (level 0) ... 
27.2 Mbytes of cons cells used (68%)
7.1 Mbytes of vectors used (56%)
data.table 1.10.5 IN DEVELOPMENT built 2017-12-02 01:04:34 UTC; appveyor
  The fastest way to learn (by data.table authors): https://www.datacamp.com/courses/data-analysis-the-data-table-way
  Documentation: ?data.table, example(data.table) and browseVignettes("data.table")
  Release notes, videos and slides: http://r-datatable.com
Garbage collection 37 = 28+5+4 (level 0) ... 
27.7 Mbytes of cons cells used (69%)
7.3 Mbytes of vectors used (58%)
Garbage collection 38 = 29+5+4 (level 0) ... 
28.0 Mbytes of cons cells used (70%)
7.4 Mbytes of vectors used (58%)
Garbage collection 39 = 30+5+4 (level 0) ... 
28.1 Mbytes of cons cells used (70%)
7.4 Mbytes of vectors used (59%)
Garbage collection 40 = 31+5+4 (level 0) ... 
28.2 Mbytes of cons cells used (70%)
7.5 Mbytes of vectors used (59%)
Garbage collection 41 = 32+5+4 (level 0) ... 
28.4 Mbytes of cons cells used (71%)
7.5 Mbytes of vectors used (59%)
> DT = fread("/Users/pasha/Downloads/LargeFile.csv")
Garbage collection 42 = 32+5+5 (level 2) ... 
27.4 Mbytes of cons cells used (54%)
7.1 Mbytes of vectors used (2%)
Garbage collection 43 = 32+5+6 (level 2) ... 
27.4 Mbytes of cons cells used (54%)
244.7 Mbytes of vectors used (42%)
Garbage collection 44 = 32+5+7 (level 2) ... 
27.4 Mbytes of cons cells used (54%)
482.3 Mbytes of vectors used (42%)
Garbage collection 45 = 32+5+8 (level 2) ... 
27.4 Mbytes of cons cells used (54%)
957.4 Mbytes of vectors used (56%)
Garbage collection 46 = 32+5+9 (level 2) ... 
27.4 Mbytes of cons cells used (54%)
1432.6 Mbytes of vectors used (63%)
Garbage collection 47 = 32+5+10 (level 2) ... 
27.4 Mbytes of cons cells used (54%)
2145.3 Mbytes of vectors used (75%)
Garbage collection 48 = 32+5+11 (level 2) ... 
27.4 Mbytes of cons cells used (54%)
2620.4 Mbytes of vectors used (71%)
Garbage collection 49 = 32+5+12 (level 2) ... 
27.4 Mbytes of cons cells used (54%)
3570.8 Mbytes of vectors used (78%)
Garbage collection 50 = 32+5+13 (level 2) ... 
27.4 Mbytes of cons cells used (54%)
4283.5 Mbytes of vectors used (75%)
Garbage collection 51 = 32+5+14 (level 2) ... 
27.4 Mbytes of cons cells used (54%)
5709.0 Mbytes of vectors used (77%)
Garbage collection 52 = 32+5+15 (level 2) ... 
27.4 Mbytes of cons cells used (54%)
7372.0 Mbytes of vectors used (81%)
Garbage collection 53 = 32+5+16 (level 2) ... 
27.4 Mbytes of cons cells used (54%)
8797.5 Mbytes of vectors used (79%)
Garbage collection 54 = 32+5+17 (level 2) ... 
27.4 Mbytes of cons cells used (54%)
10935.7 Mbytes of vectors used (80%)
|--------------------------------------------------|
|=====Error in fread("LargeFile.csv") : 
  unprotect_ptr: pointer not found
> 

showProgress=FALSE löst es zuverlässig in der RStudio-Konsole auf. Um zu reproduzieren, muss es das allererste Mal in einer neuen RStudio-Konsole mit showProgress=TRUE (dh Standard) ausgeführt werden. Scheint damit zu tun zu haben, ob während der Fortschrittsanzeige ein GC vorhanden ist. Es gibt im ersten Lauf eine neue Sitzung. Es muss nur eine große Datei sein, damit die Fortschrittsanzeige angezeigt wird. Nichts mit erneutem Lesen oder Argumenten zu tun, die an fread . Wenn der erste Lauf in einer neuen RStudio-Konsole mit showProgress=FALSE , damit er funktioniert, erweitert dieser Lauf Rs Heap, und ein nachfolgender Lauf in derselben Sitzung mit showProgress=TRUE funktioniert dann ebenfalls. Aber nur weil es dann während der Fortschrittsanzeige keinen GC gibt, weil der erste Lauf den Heap bereits erweitert.
Warum ein GC im Master-Thread während der Fortschrittsanzeige unter Linux und im Windows RStudio Terminal, aber nicht in der RStudio Console in Ordnung ist, ist die offene Frage.

Ok, das behebt es. Das Problem lag auf der Seite data.table und nicht auf RStudio. Funktioniert jetzt für mich zuverlässig in der RStudio Console unter Windows. Es war ein Problem, das auch unter Linux und Max auftreten konnte. Es ist nur so, dass die Speichermuster es nicht auslösten. Die anderen Threads hatten einen Einstiegspunkt in R (wenn ihre Puffer mit Zeichenfolgenspalten verschoben wurden), der gleichzeitig mit dem Druckvorgang des Master-Threads mit REprintf . Deshalb passierte es nur im ersten Lauf einer neuen Sitzung. Ab dem zweiten Durchlauf wurden alle Zeichenfolgen in der Datei zuvor gesehen, sodass die Cache-Suchvorgänge erfolgreich waren (threadsicher) und nicht zugeordnet wurden (nicht threadsicher).

Also, @aadler und @HughParsonage , bitte versuchen Sie es mit diesem . 95% Chance, dass dies jetzt funktioniert!

Keine Warnung, nicht sicher, ob Sie etwas anderes suchen:

> gcinfo(TRUE)
[1] FALSE
> fread("SA2-by-DJZ-2011.csv", verbose = TRUE, na.strings = "", header = FALSE)
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
  Using 12 threads (omp_get_max_threads()=12, nth=12)
  NAstrings = [<<>>]
  None of the NAstrings look like numbers.
  show progress = 1
  0/1 column will be read as boolean
[02] Opening the file
  Opening file SA2-by-DJZ-2011.csv
  File opened, size = 349.4MB (366418725 bytes).
  Memory mapped ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
  \n has been found in the data so any mixture of line endings is allowed other than \r-only line endings. This is common and ideal.
[05] Skipping initial rows if needed
  Positioned on line 1 starting: <<Australian Bureau of Statistic>>
[06] Detect separator, quoting rule, and ncolumns
  Detecting sep ...
  sep=','  with 89 lines of 4 fields using quote rule 0
  Detected 4 columns on line 12. This line is either column names or first data row. Line starts as: <<"Goulburn","110018063",3499,>>
  Quote rule picked = 0
  fill=false and the most number of columns found is 4
[07] Detect column types, good nrow estimate and whether first row is column names
  'header' changed by user from 'auto' to false
  Number of sampling jump points = 101 because (366418375 bytes from row 1 to eof) / (2 * 1457 jump0size) == 125744
  Type codes (jump 000)    : 1551  Quote rule 0
  Type codes (jump 100)    : 1A51  Quote rule 0
  =====
  Sampled 10027 rows (handled \n inside quoted fields) at 101 jump points
  Bytes from first data row on line 12 to the end of last row: 366418143
  Line length: mean=16.02 sd=0.21 min=16 max=29
  Estimated number of rows: 366418143 / 16.02 = 22877178
  Initial alloc = 25164895 rows (22877178 + 9%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
  =====
[08] Assign column names
[09] Apply user overrides on column types
  After 0 type and 0 drop user overrides : 1A51
[10] Allocate memory for the datatable
  Allocating 4 column slots (4 - 0 dropped) with 25164895 rows
Garbage collection 53 = 36+5+12 (level 2) ... 
30.3 Mbytes of cons cells used (60%)
7.9 Mbytes of vectors used (1%)
[11] Read the data
  jumps=[0..360), chunk_size=1017828, total_size=366418143
|--------------------------------------------------|
|==================================================|
Garbage collection 54 = 37+5+12 (level 0) ... 
30.8 Mbytes of cons cells used (61%)
566.6 Mbytes of vectors used (74%)
Garbage collection 55 = 37+6+12 (level 1) ... 
30.8 Mbytes of cons cells used (61%)
549.2 Mbytes of vectors used (72%)
  jumps=[0..360), chunk_size=1017828, total_size=366418143
|--------------------------------------------------|
|==================================================|
Read 22885380 rows x 4 columns from 349.4MB (366418725 bytes) file in 00:02.626 wall clock time
[12] Finalizing the datatable
  Type counts:
         1 : bool8     '1'
         1 : int32     '5'
         2 : string    'A'
=============================
   0.002s (  0%) Memory map 0.341GB file
   0.005s (  0%) sep=',' ncol=4 and header detection
   0.000s (  0%) Column type detection using 10027 sample rows
   0.469s ( 18%) Allocation of 25164895 rows x 4 cols (0.469GB) of which 22885380 ( 91%) rows used
   2.150s ( 82%) Reading 360 chunks of 0.971MB (63547 rows) using 12 threads
   =    0.103s (  4%) Finding first non-embedded \n after each jump
   +    0.230s (  9%) Parse to row-major thread buffers (grown 0 times)
   +    0.718s ( 27%) Transpose
   +    1.099s ( 42%) Waiting
   0.745s ( 28%) Rereading 1 columns due to out-of-sample type exceptions
   2.626s        Total
Column 1 ("") bumped from 'bool8' to 'string' due to <<"Goulburn">> on row 0
Garbage collection 56 = 37+6+13 (level 2) ... 
31.1 Mbytes of cons cells used (62%)
531.9 Mbytes of vectors used (70%)
Garbage collection 57 = 38+6+13 (level 0) ... 
31.1 Mbytes of cons cells used (62%)
532.0 Mbytes of vectors used (70%)
                V1        V2      V3 V4
       1: Goulburn 110018063    3499 NA
       2:       NA 110018064     812 NA
       3:       NA 110018065    2158 NA
       4:       NA 110019999     402 NA
       5:       NA 110028068      10 NA
      ---                              
22885376:       NA 997999799       0 NA
22885377:       NA 998999899      64 NA
22885378:       NA 994999499      34 NA
22885379:       NA 0&&&&&&&&  250796 NA
22885380:       NA 0@@@@@@@@ 7305367 NA
Warning messages:
1: In fread("SA2-by-DJZ-2011.csv", verbose = TRUE, na.strings = "",  :
  Starting data input on line 12 <<"Goulburn","110018063",3499,>> with 4 fields and discarding line 11 <<"Main Statistical Area Structu>> before it because it has a different number of fields (3).
2: In fread("SA2-by-DJZ-2011.csv", verbose = TRUE, na.strings = "",  :
  Found the last consistent line but text exists afterwards. Consider fill=TRUE and/or blank.lines.skip=TRUE. First 200 characters of discarded line: <<"Dataset: 2011 Census of Population and Housing">>

Danke Hugh. Ja, das ist ein sauberer Lauf, vorausgesetzt, das war in einer neuen RStudio-Konsolensitzung. Keine Anzeichen für das Stack-Ungleichgewicht oder die Meldung "unprotect_ptr: Zeiger nicht gefunden", und die Fortschrittsanzeige wird ordnungsgemäß ausgeführt (in diesem Fall zweimal, da erneut gelesen wird). Jetzt nur noch @aadler zur Bestätigung.

ERFOLG.

Erster Lauf, neue Instanz von RStudio.

> library(data.table)
data.table 1.10.5 IN DEVELOPMENT built 2017-12-02 12:05:42 UTC; appveyor
  The fastest way to learn (by data.table authors): https://www.datacamp.com/courses/data-analysis-the-data-table-way
  Documentation: ?data.table, example(data.table) and browseVignettes("data.table")
  Release notes, videos and slides: http://r-datatable.com
> sessionInfo()
R version 3.4.2 beta (2017-09-17 r73296)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows Server >= 2012 x64 (build 9200)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252    LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                           LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] data.table_1.10.5

loaded via a namespace (and not attached):
[1] compiler_3.4.2 tools_3.4.2   
> DT <- fread('LargeFile.csv', colClasses = colCLASS, select = colSEL, header = TRUE, key = c('LoanID', 'Month'), verbose = TRUE)
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
  Using 40 threads (omp_get_max_threads()=40, nth=40)
  NAstrings = [<<NA>>]
  None of the NAstrings look like numbers.
  show progress = 1
  0/1 column will be read as boolean
[02] Opening the file
  Opening file LargeFile.csv
  File opened, size = 6.355GB (6823372783 bytes).
  Memory mapped ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
  \n has been found in the data so any mixture of line endings is allowed other than \r-only line endings. This is common and ideal.
[05] Skipping initial rows if needed
  Positioned on line 1 starting: <<Orig_Year,Orig_Qtr,LoanID,Mont>>
[06] Detect separator, quoting rule, and ncolumns
  Detecting sep ...
  sep=','  with 100 lines of 37 fields using quote rule 0
  Detected 37 columns on line 1. This line is either column names or first data row. Line starts as: <<Orig_Year,Orig_Qtr,LoanID,Mont>>
  Quote rule picked = 0
  fill=false and the most number of columns found is 37
[07] Detect column types, good nrow estimate and whether first row is column names
  'header' changed by user from 'auto' to true
  Number of sampling jump points = 101 because (6823372781 bytes from row 1 to eof) / (2 * 12905 jump0size) == 264369
  Type codes (jump 000)    : 51AA7155A15A7111111111111771111177715  Quote rule 0
  Type codes (jump 001)    : 51AA7155A15A711111111111177111117771A  Quote rule 0
  Type codes (jump 008)    : 51AA7555A15A711111111111177111117771A  Quote rule 0
  Type codes (jump 009)    : 51AA7555AA5A71A155555557177111117775A  Quote rule 0
  Type codes (jump 042)    : 55AA7555AA5A71A155555557177111117775A  Quote rule 0
  Type codes (jump 064)    : 55AA7555AA5A71A1A5555557177111117775A  Quote rule 0
  Type codes (jump 100)    : 55AA7555AA5A71A1A5555557177111117775A  Quote rule 0
  =====
  Sampled 10049 rows (handled \n inside quoted fields) at 101 jump points
  Bytes from first data row on line 1 to the end of last row: 6823372781
  Line length: mean=126.15 sd=8.30 min=100 max=359
  Estimated number of rows: 6823372781 / 126.15 = 54088821
  Initial alloc = 62279495 rows (54088821 + 15%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
  =====
[08] Assign column names
[09] Apply user overrides on column types
  After 4 type and 23 drop user overrides : 00AA700000000000000000070775555077750
[10] Allocate memory for the datatable
  Allocating 14 column slots (37 - 23 dropped) with 62279495 rows
[11] Read the data
  jumps=[0..6520), chunk_size=1046529, total_size=6823372422
|--------------------------------------------------|
|==================================================|
Read 53945186 rows x 14 columns from 6.355GB (6823372783 bytes) file in 00:25.938 wall clock time
[12] Finalizing the datatable
  Type counts:
        23 : drop      '0'
         5 : int32     '5'
         7 : float64   '7'
         2 : string    'A'
=============================
   0.005s (  0%) Memory map 6.355GB file
   0.025s (  0%) sep=',' ncol=37 and header detection
   0.001s (  0%) Column type detection using 10049 sample rows
   4.681s ( 18%) Allocation of 62279495 rows x 37 cols (5.336GB) of which 53945186 ( 87%) rows used
  21.226s ( 82%) Reading 6520 chunks of 0.998MB (8295 rows) using 40 threads
   =    0.485s (  2%) Finding first non-embedded \n after each jump
   +    1.465s (  6%) Parse to row-major thread buffers (grown 0 times)
   +    9.095s ( 35%) Transpose
   +   10.181s ( 39%) Waiting
   0.000s (  0%) Rereading 0 columns due to out-of-sample type exceptions
  25.938s        Total

RStudio wurde geschlossen und erneut geöffnet, um zu verhindern, dass das String-Caching aktiviert wird, und erneut mit gcinfo(TRUE) . Zusätzlicher Bonus, die Konvertierung zu IDate ist abgeschlossen (hat allerdings über 40 Sekunden gedauert :)).

> colCLASS <- c(rep('integer', 2L), 'character', 'Date', 'numeric',
+               rep('integer', 3L), rep('character', 2L),
+               'integer', 'Date', rep('numeric', 2L), 'Date',
+               rep('numeric', 12L), rep('integer', 5),
+               rep('numeric', 3L), 'integer', 'character')
> library(data.table)
data.table 1.10.5 IN DEVELOPMENT built 2017-12-02 12:05:42 UTC; appveyor
  The fastest way to learn (by data.table authors): https://www.datacamp.com/courses/data-analysis-the-data-table-way
  Documentation: ?data.table, example(data.table) and browseVignettes("data.table")
  Release notes, videos and slides: http://r-datatable.com
> gcinfo(TRUE)
[1] FALSE
> sessionInfo()
R version 3.4.2 beta (2017-09-17 r73296)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows Server >= 2012 x64 (build 9200)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252    LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                           LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] data.table_1.10.5

loaded via a namespace (and not attached):
[1] compiler_3.4.2 tools_3.4.2   
Garbage collection 46 = 36+5+5 (level 0) ... 
38.6 Mbytes of cons cells used (77%)
11.2 Mbytes of vectors used (71%)
Garbage collection 47 = 37+5+5 (level 0) ... 
38.7 Mbytes of cons cells used (77%)
11.2 Mbytes of vectors used (71%)
Garbage collection 48 = 38+5+5 (level 0) ... 
38.8 Mbytes of cons cells used (77%)
11.2 Mbytes of vectors used (71%)
Garbage collection 49 = 39+5+5 (level 0) ... 
39.0 Mbytes of cons cells used (78%)
11.2 Mbytes of vectors used (71%)
Garbage collection 50 = 40+5+5 (level 0) ... 
39.1 Mbytes of cons cells used (78%)
11.3 Mbytes of vectors used (71%)
Garbage collection 51 = 40+6+5 (level 1) ... 
38.8 Mbytes of cons cells used (77%)
11.3 Mbytes of vectors used (71%)
Garbage collection 52 = 41+6+5 (level 0) ... 
38.9 Mbytes of cons cells used (77%)
11.3 Mbytes of vectors used (71%)
Garbage collection 53 = 42+6+5 (level 0) ... 
41.5 Mbytes of cons cells used (83%)
12.2 Mbytes of vectors used (77%)
Garbage collection 54 = 42+7+5 (level 1) ... 
43.4 Mbytes of cons cells used (86%)
12.8 Mbytes of vectors used (81%)
Garbage collection 55 = 42+7+6 (level 2) ... 
44.7 Mbytes of cons cells used (72%)
13.0 Mbytes of vectors used (67%)
Garbage collection 56 = 43+7+6 (level 0) ... 
46.5 Mbytes of cons cells used (74%)
13.6 Mbytes of vectors used (70%)
Garbage collection 57 = 44+7+6 (level 0) ... 
47.0 Mbytes of cons cells used (75%)
13.8 Mbytes of vectors used (71%)
Garbage collection 58 = 45+7+6 (level 0) ... 
47.4 Mbytes of cons cells used (76%)
13.9 Mbytes of vectors used (71%)
Garbage collection 59 = 46+7+6 (level 0) ... 
47.7 Mbytes of cons cells used (76%)
14.2 Mbytes of vectors used (73%)
Garbage collection 60 = 47+7+6 (level 0) ... 
48.0 Mbytes of cons cells used (77%)
14.2 Mbytes of vectors used (73%)
Garbage collection 61 = 48+7+6 (level 0) ... 
48.1 Mbytes of cons cells used (77%)
14.3 Mbytes of vectors used (73%)
> DT <- fread('LargeFile.csv', header = TRUE, colClasses = colCLASS, select = colSEL, key = c('LoanID', 'Month'), verbose = TRUE)
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
  Using 40 threads (omp_get_max_threads()=40, nth=40)
  NAstrings = [<<NA>>]
  None of the NAstrings look like numbers.
  show progress = 1
  0/1 column will be read as boolean
[02] Opening the file
  Opening file LargeFile.csv
  File opened, size = 6.355GB (6823372783 bytes).
  Memory mapped ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
  \n has been found in the data so any mixture of line endings is allowed other than \r-only line endings. This is common and ideal.
[05] Skipping initial rows if needed
  Positioned on line 1 starting: <<Orig_Year,Orig_Qtr,LoanID,Mont>>
[06] Detect separator, quoting rule, and ncolumns
  Detecting sep ...
  sep=','  with 100 lines of 37 fields using quote rule 0
  Detected 37 columns on line 1. This line is either column names or first data row. Line starts as: <<Orig_Year,Orig_Qtr,LoanID,Mont>>
  Quote rule picked = 0
  fill=false and the most number of columns found is 37
[07] Detect column types, good nrow estimate and whether first row is column names
  'header' changed by user from 'auto' to true
  Number of sampling jump points = 101 because (6823372781 bytes from row 1 to eof) / (2 * 12905 jump0size) == 264369
  Type codes (jump 000)    : 51AA7155A15A7111111111111771111177715  Quote rule 0
  Type codes (jump 001)    : 51AA7155A15A711111111111177111117771A  Quote rule 0
  Type codes (jump 008)    : 51AA7555A15A711111111111177111117771A  Quote rule 0
  Type codes (jump 009)    : 51AA7555AA5A71A155555557177111117775A  Quote rule 0
  Type codes (jump 042)    : 55AA7555AA5A71A155555557177111117775A  Quote rule 0
  Type codes (jump 064)    : 55AA7555AA5A71A1A5555557177111117775A  Quote rule 0
  Type codes (jump 100)    : 55AA7555AA5A71A1A5555557177111117775A  Quote rule 0
  =====
  Sampled 10049 rows (handled \n inside quoted fields) at 101 jump points
  Bytes from first data row on line 1 to the end of last row: 6823372781
  Line length: mean=126.15 sd=8.30 min=100 max=359
  Estimated number of rows: 6823372781 / 126.15 = 54088821
  Initial alloc = 62279495 rows (54088821 + 15%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
  =====
[08] Assign column names
[09] Apply user overrides on column types
  After 4 type and 23 drop user overrides : 00AA700000000000000000070775555077750
[10] Allocate memory for the datatable
  Allocating 14 column slots (37 - 23 dropped) with 62279495 rows
Garbage collection 62 = 48+7+7 (level 2) ... 
46.5 Mbytes of cons cells used (60%)
13.6 Mbytes of vectors used (2%)
Garbage collection 63 = 48+7+8 (level 2) ... 
46.5 Mbytes of cons cells used (60%)
488.7 Mbytes of vectors used (42%)
Garbage collection 64 = 48+7+9 (level 2) ... 
46.5 Mbytes of cons cells used (60%)
963.9 Mbytes of vectors used (56%)
Garbage collection 65 = 48+7+10 (level 2) ... 
46.5 Mbytes of cons cells used (60%)
1439.1 Mbytes of vectors used (63%)
Garbage collection 66 = 48+7+11 (level 2) ... 
46.5 Mbytes of cons cells used (60%)
1914.2 Mbytes of vectors used (67%)
Garbage collection 67 = 48+7+12 (level 2) ... 
46.5 Mbytes of cons cells used (60%)
2864.5 Mbytes of vectors used (77%)
Garbage collection 68 = 48+7+13 (level 2) ... 
46.5 Mbytes of cons cells used (60%)
3577.3 Mbytes of vectors used (78%)
Garbage collection 69 = 48+7+14 (level 2) ... 
46.5 Mbytes of cons cells used (60%)
4290.0 Mbytes of vectors used (75%)
[11] Read the data
  jumps=[0..6520), chunk_size=1046529, total_size=6823372422
|--------------------------------------------------|
|============================Garbage collection 70 = 49+7+14 (level 0) ... 
76.5 Mbytes of cons cells used (99%)
5487.5 Mbytes of vectors used (96%)
=Garbage collection 71 = 49+8+14 (level 1) ... 
77.0 Mbytes of cons cells used (100%)
5487.6 Mbytes of vectors used (96%)
Garbage collection 72 = 49+8+15 (level 2) ... 
77.0 Mbytes of cons cells used (81%)
5487.1 Mbytes of vectors used (80%)
==============Garbage collection 73 = 50+8+15 (level 0) ... 
94.3 Mbytes of cons cells used (100%)
5494.0 Mbytes of vectors used (80%)
Garbage collection 74 = 50+9+15 (level 1) ... 
94.5 Mbytes of cons cells used (100%)
5494.1 Mbytes of vectors used (80%)
Garbage collection 75 = 50+9+16 (level 2) ... 
94.5 Mbytes of cons cells used (82%)
5493.1 Mbytes of vectors used (67%)
=======|
Read 53945186 rows x 14 columns from 6.355GB (6823372783 bytes) file in 00:24.772 wall clock time
[12] Finalizing the datatable
  Type counts:
        23 : drop      '0'
         5 : int32     '5'
         7 : float64   '7'
         2 : string    'A'
=============================
   0.005s (  0%) Memory map 6.355GB file
   0.018s (  0%) sep=',' ncol=37 and header detection
   0.000s (  0%) Column type detection using 10049 sample rows
   5.496s ( 22%) Allocation of 62279495 rows x 37 cols (5.336GB) of which 53945186 ( 87%) rows used
  19.253s ( 78%) Reading 6520 chunks of 0.998MB (8295 rows) using 40 threads
   =    0.433s (  2%) Finding first non-embedded \n after each jump
   +    1.482s (  6%) Parse to row-major thread buffers (grown 0 times)
   +    9.515s ( 38%) Transpose
   +    7.822s ( 32%) Waiting
   0.000s (  0%) Rereading 0 columns due to out-of-sample type exceptions
  24.772s        Total
Garbage collection 76 = 51+9+16 (level 0) ... 
105.3 Mbytes of cons cells used (91%)
5500.3 Mbytes of vectors used (67%)
Garbage collection 77 = 51+10+16 (level 1) ... 
105.4 Mbytes of cons cells used (91%)
5500.2 Mbytes of vectors used (67%)
> DT[, Month := as.IDate(Month, format = "%Y-%m-%d")]
Garbage collection 78 = 51+10+17 (level 2) ... 
107.5 Mbytes of cons cells used (76%)
8174.1 Mbytes of vectors used (81%)
Garbage collection 79 = 51+11+17 (level 1) ... 
107.5 Mbytes of cons cells used (76%)
5910.4 Mbytes of vectors used (59%)
> gcinfo(FALSE)
[1] TRUE

genial! : tada: großartige Arbeit an alle Beteiligten, besonders an @mattdowle, der jetzt ein paar Haare kurz sein muss :)

Es scheint, als hätte sich meine Strategie, im Urlaub zu bleiben, bis das Problem behoben ist, hier bewährt :-)

Gibt es noch etwas, das ich mir ansehen sollte, oder wird dieses Problem als gelöst angesehen?

Danke @aadler und @HughParsonage! Linderung.
@ kevinushey Ha ha. Ja, es war data.table-Seite und jetzt behoben (PR # 2488). Vielen Dank.

War diese Seite hilfreich?
0 / 5 - 0 Bewertungen