Data.table: Tumpuk ketidakseimbangan dalam ketakutan

Dibuat pada 14 Nov 2017  ·  61Komentar  ·  Sumber: Rdatatable/data.table

Saya mengalami crash R ('stack imbalance') ketika saya menjalankan perintah berikut dengan verbose=FALSE . Catatan Saya berhasil menjalankan kode di bawah ini pada versi dev yang lebih lama dari data.table sebulan atau dua bulan yang lalu, jadi saya yakin ini adalah bug yang cukup baru. (Maaf, saya tidak ingat versi dev yang tepat di mana ia berfungsi.)

Masalah tidak mereproduksi pada file yang jauh lebih kecil. Tautan ke file zip (csv berukuran 350 MB): https://github.com/HughParsonage/ABS-data/blob/master/inbox/SA2-by-DJZ-2011.zip

Saya terkadang mengalami kesalahan yang berbeda. Sebagai contoh,

Kesalahan dalam get (name, envir = ns, inherits = FALSE): argumen pertama tidak valid

atau

Peringatan: tumpukan ketidakseimbangan di '$', 16 lalu 15
Kesalahan: R_Reprotect: hanya 1 item yang dilindungi, tidak dapat melindungi indeks -2

# Minimal reproducible example

library(data.table)

#> data.table 1.10.5 IN DEVELOPMENT built 2017-11-13 02:46:28 UTC; appveyor
#>   The fastest way to learn (by data.table authors): https://www.datacamp.com/courses/data-analysis-the-data-table-way
#>   Documentation: ?data.table, example(data.table) and browseVignettes("data.table")
#>   Release notes, videos and slides: http://r-datatable.com


fread("SA2-by-DJZ-2011.csv", header = FALSE, na.strings = "", verbose = TRUE)

Input contains no \n. Taking this to be a filename to open
[01] Check arguments
  Using 12 threads (omp_get_max_threads()=12, nth=12)
  NAstrings = [<<>>]
  None of the NAstrings look like numbers.
  show progress = 1
  0/1 column will be read as boolean
[02] Opening the file
  Opening file SA2-by-DJZ-2011.csv
  File opened, size = 349.4MB (366418725 bytes).
  Memory mapping ... ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
  \r-only line endings are not allowed because \n is found in the data
[05] Skipping initial rows if needed
  Positioned on line 1 starting: <<Australian Bureau of Statistic>>
[06] Detect separator, quoting rule, and ncolumns
  Detecting sep ...
  sep=','  with 89 lines of 4 fields using quote rule 0
  Detected 4 columns on line 12. This line is either column names or first data row. Line starts as: <<"Goulburn","110018063",3499,>>
  Quote rule picked = 0
  fill=false and the most number of columns found is 4
[07] Detect column types, good nrow estimate and whether first row is column names
  'header' changed by user from 'auto' to false
  Number of sampling jump points = 101 because (366418375 bytes from row 1 to eof) / (2 * 1457 jump0size) == 125744
  Type codes (jump 000)    : 1551  Quote rule 0
  Type codes (jump 100)    : 11051  Quote rule 0
  =====
  Sampled 10027 rows (handled \n inside quoted fields) at 101 jump points
  Bytes from first data row on line 12 to the end of last row: 366418143
  Line length: mean=16.02 sd=0.21 min=16 max=29
  Estimated number of rows: 366418143 / 16.02 = 22877178
  Initial alloc = 25164895 rows (22877178 + 9%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
  =====
[08] Assign column names
[09] Apply user overrides on column types
  After 0 type and 0 drop user overrides : 11051
[10] Allocate memory for the datatable
  Allocating 4 column slots (4 - 0 dropped) with 25164895 rows
[11] Read the data
  jumps=[0..360), chunk_size=1017828, total_size=366418143
[12] Finalizing the datatable
Read 22885380 rows x 4 columns from 349.4MB (366418725 bytes) file in 00:02.550 wall clock time
Thread buffers were grown 0 times (if all 12 threads each grew once, this figure would be 12)
Final type counts
         0 : drop     
         1 : bool8    
         0 : bool8    
         0 : bool8    
         0 : bool8    
         1 : int32    
         0 : int64    
         0 : float64  
         0 : float64  
         0 : float64  
         2 : string   
Rereading 1 columns due to out-of-sample type exceptions.
Column 1 ("") bumped from 'bool8' to 'string' due to <<"Goulburn">> on row 0
[11] Read the data
  jumps=[0..360), chunk_size=1017828, total_size=366418143
Read 94%. ETA 00:00 Warning: stack imbalance in '$', 27 then 28
Read 98%. ETA 00:00 
[12] Finalizing the datatable
Reread 22885380 rows x 1 columns in 00:00.991
Read 22885380 rows. Exactly what was estimated and allocated up front
=============================
   0.006s (  0%) Memory map 0.341GB file
   0.011s (  0%) sep=',' ncol=4 and header detection
   0.002s (  0%) Column type detection using 10027 sample rows
   0.328s (  9%) Allocation of 22885380 rows x 4 cols (0.469GB)
   3.194s ( 90%) Reading 360 chunks of 0.971MB (63547 rows) using 12 threads
   =    0.001s (  0%) Finding first non-embedded \n after each jump
   +    0.362s ( 10%) Parse to row-major thread buffers
   +    1.963s ( 55%) Transpose
   +    0.868s ( 25%) Waiting
   0.991s ( 28%) Rereading 1 columns due to out-of-sample type exceptions
   3.541s        Total
Warning: stack imbalance in 'withVisible', 3 then 5
Warning messages:
1: In fread("SA2-by-DJZ-2011.csv", header = FALSE, na.strings = "",  :
  Starting data input on line 12 <<"Goulburn","110018063",3499,>> with 4 fields and discarding line 11 <<"Main Statistical Area Structu>> before it because it has a different number of fields (3).
2: In fread("SA2-by-DJZ-2011.csv", header = FALSE, na.strings = "",  :
  Found the last consistent line but text exists afterwards. Consider fill=TRUE and/or blank.lines.skip=TRUE. First 200 characters of discarded line: <<"Dataset: 2011 Census of Population and Housing">>

# Output of sessionInfo()

R version 3.4.2 (2017-09-28)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

Matrix products: default

locale:
[1] LC_COLLATE=English_Australia.1252  LC_CTYPE=English_Australia.1252   
[3] LC_MONETARY=English_Australia.1252 LC_NUMERIC=C                      
[5] LC_TIME=English_Australia.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] data.table_1.10.5    RevoUtils_10.0.6     RevoUtilsMath_10.0.1

loaded via a namespace (and not attached):
[1] compiler_3.4.2 tools_3.4.2    yaml_2.1.14 
bug fread idatitime platform-specific

Komentar yang paling membantu

Sepertinya strategi saya untuk 'tetap berlibur sampai masalah teratasi' tampaknya berhasil di sini :-)

Apakah ada hal lain yang harus saya coba lihat atau apakah masalah ini dianggap sudah terselesaikan?

Semua 61 komentar

@HughParsonage , ini terlihat mirip dengan # 2457. Mungkin mencoba melewati showProgress=FALSE dan lihat apakah itu selesai.
@ mattebayahkah telah ada regresi sejak 2017-11-09?

Berjalan dengan showProgress=FALSE memang mengembalikan hasil (hanya dengan peringatan yang diharapkan).

Terima kasih untuk semua info detailnya. Saya ragu telah ada regresi sejak 2017-11-09 tetapi mungkin output verbose=TRUE yang panjang memiliki dampak yang mirip dengan output ETA. File tersebut perlu dibaca ulang yang berarti lebih banyak output yang dihasilkan. Saya takut bahwa @HughParsonage 's laporan yang showProgress = karya BENAR baginya adalah palsu dan bahwa masalah akan terjadi jika dijalankan 5-10 kali dengan verbose = TRUE.

Tidak ada pesan panjang yang dicetak dari dalam bagian paralel (selain kemajuan ETA yang sudah diperbaiki.) Namun, ada pesan panjang setelah pembacaan pertama dan sebelum pembacaan ulang ke-2 dimulai (yang terjadi pada file ini). Saya kira itu mungkin bahwa jika cetakan itu memicu CheckUserInterrupt ke-100 (lihat # 2457), itu dapat menyebabkan wilayah paralel ke-2 gagal (meskipun aneh). Untuk mengesampingkan hal itu, saya baru saja mengubah semua pesan panjang untuk menggunakan REprintf daripada Rprintf (perbaikan yang sama seperti # 2457 untuk ETA). Itu gagal karena tes tidak menemukan output pada stderr - akan diperbaiki. Setelah lolos, Windows .zip akan dibuat secara otomatis, lalu Anda dapat mencoba lagi. Saya akan perbarui di sini jika sudah siap.

Ok, upaya kedua adalah melewati pemeriksaan dan Windows.zip tersedia. @HughParsonage, maukah Anda mencoba lagi? Saya telah menambahkan panggilan ke R_FlushConsole () setelah pesan dalam mode verbose sebelum membaca ulang. Pembilasan itu hanya dibutuhkan di Windows. Saya menebak bahwa tanpa flush, konsol terkadang memperbarui sedikit kemudian ketika pembacaan ulang paralel terjadi dan diketahui menyebabkan masalah. Ulangi 10 kali, selalu dengan verbose=TRUE dan showProgress=TRUE . Jika Anda melihat 10 langkah yang jelas maka kami akan mengatakan itu saja. Kalau tidak, saya harus berpikir ulang.

Sayangnya, tidak diperbaiki:

> library(data.table)
data.table 1.10.5 IN DEVELOPMENT built 2017-11-15 00:36:41 UTC; appveyor
  The fastest way to learn (by data.table authors): https://www.datacamp.com/courses/data-analysis-the-data-table-way
  Documentation: ?data.table, example(data.table) and browseVignettes("data.table")
  Release notes, videos and slides: http://r-datatable.com
> fread("SA2-by-DJZ-2011.csv", header = FALSE, na.strings = "", verbose = FALSE)
Read 26%. ETA 00:00 Warning: stack imbalance in '$', 20 then 22
Read 52%. ETA 00:00 Warning: stack imbalance in '$', 36 then 35
Warning: stack imbalance in '$', 21 then 22
Read 59%. ETA 00:00 Error in fread("SA2-by-DJZ-2011.csv", header = FALSE, na.strings = "",  : 
  unprotect_ptr: pointer not found
In addition: Warning: stack imbalance in '$', 26 then 28
Warning messages:
1: Warning: stack imbalance in '$', 26 then 27
In fread("SA2-by-DJZ-2011.csv", header = FALSE, na.strings = "",  :
  Starting data input on line 12 <<"Goulburn","110018063",3499,>> with 4 fields and discarding line 11 <<"Main Statistical Area Structu>> before it because it has a different number of fields (3).
2: In fread("SA2-by-DJZ-2011.csv", header = FALSE, na.strings = "",  :
  Found the last consistent line but text exists afterwards. Consider fill=TRUE and/or blank.lines.skip=TRUE. First 200 characters of discarded line: <<"Dataset: 2011 Census of Population and Housing">>
Warning: stack imbalance in 'lapply', 31 then 30
Warning: stack imbalance in '$', 14 then 15

Menggunakan verbose=TRUE, showProgress=TRUE bahkan setelah 10 kali berjalan saya tidak mendapatkan kesalahan. Inilah hasil dari keluaran ke-10:

Input contains no \n. Taking this to be a filename to open
[01] Check arguments
  Using 12 threads (omp_get_max_threads()=12, nth=12)
  NAstrings = [<<>>]
  None of the NAstrings look like numbers.
  show progress = 1
  0/1 column will be read as boolean
[02] Opening the file
  Opening file SA2-by-DJZ-2011.csv
  File opened, size = 349.4MB (366418725 bytes).
  Memory mapping ... ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
  \r-only line endings are not allowed because \n is found in the data
[05] Skipping initial rows if needed
  Positioned on line 1 starting: <<Australian Bureau of Statistic>>
[06] Detect separator, quoting rule, and ncolumns
  Detecting sep ...
  sep=','  with 89 lines of 4 fields using quote rule 0
  Detected 4 columns on line 12. This line is either column names or first data row. Line starts as: <<"Goulburn","110018063",3499,>>
  Quote rule picked = 0
  fill=false and the most number of columns found is 4
[07] Detect column types, good nrow estimate and whether first row is column names
  'header' changed by user from 'auto' to false
  Number of sampling jump points = 101 because (366418375 bytes from row 1 to eof) / (2 * 1457 jump0size) == 125744
  Type codes (jump 000)    : 1551  Quote rule 0
  Type codes (jump 100)    : 11051  Quote rule 0
  =====
  Sampled 10027 rows (handled \n inside quoted fields) at 101 jump points
  Bytes from first data row on line 12 to the end of last row: 366418143
  Line length: mean=16.02 sd=0.21 min=16 max=29
  Estimated number of rows: 366418143 / 16.02 = 22877178
  Initial alloc = 25164895 rows (22877178 + 9%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
  =====
[08] Assign column names
[09] Apply user overrides on column types
  After 0 type and 0 drop user overrides : 11051
[10] Allocate memory for the datatable
  Allocating 4 column slots (4 - 0 dropped) with 25164895 rows
[11] Read the data
  jumps=[0..360), chunk_size=1017828, total_size=366418143
[12] Finalizing the datatable
Read 22885380 rows x 4 columns from 349.4MB (366418725 bytes) file in 00:02.094 wall clock time
Thread buffers were grown 0 times (if all 1 threads each grew once, this figure would be 1)
Final type counts
         0 : drop     
         1 : bool8    
         0 : bool8    
         0 : bool8    
         0 : bool8    
         1 : int32    
         0 : int64    
         0 : float64  
         0 : float64  
         0 : float64  
         2 : string   
Rereading 1 columns due to out-of-sample type exceptions.
Column 1 ("") bumped from 'bool8' to 'string' due to <<"Goulburn">> on row 0
[11] Read the data
  jumps=[0..360), chunk_size=1017828, total_size=366418143
Read 99%. ETA 00:00 
[12] Finalizing the datatable
Reread 22885380 rows x 1 columns in 00:01.752
Read 22885380 rows. Exactly what was estimated and allocated up front
=============================
   0.004s (  0%) Memory map 0.341GB file
   0.008s (  0%) sep=',' ncol=4 and header detection
   0.001s (  0%) Column type detection using 10027 sample rows
   0.173s (  4%) Allocation of 22885380 rows x 4 cols (0.469GB)
   3.660s ( 95%) Reading 360 chunks of 0.971MB (63547 rows) using 1 threads
   =    0.009s (  0%) Finding first non-embedded \n after each jump
   +    1.946s ( 51%) Parse to row-major thread buffers
   +    1.098s ( 29%) Transpose
   +    0.608s ( 16%) Waiting
   1.752s ( 46%) Rereading 1 columns due to out-of-sample type exceptions
   3.846s        Total
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
  Using 12 threads (omp_get_max_threads()=12, nth=12)
  NAstrings = [<<>>]
  None of the NAstrings look like numbers.
  show progress = 1
  0/1 column will be read as boolean
[02] Opening the file
  Opening file SA2-by-DJZ-2011.csv
  File opened, size = 349.4MB (366418725 bytes).
  Memory mapping ... ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
  \r-only line endings are not allowed because \n is found in the data
[05] Skipping initial rows if needed
  Positioned on line 1 starting: <<Australian Bureau of Statistic>>
[06] Detect separator, quoting rule, and ncolumns
  Detecting sep ...
  sep=','  with 89 lines of 4 fields using quote rule 0
  Detected 4 columns on line 12. This line is either column names or first data row. Line starts as: <<"Goulburn","110018063",3499,>>
  Quote rule picked = 0
  fill=false and the most number of columns found is 4
[07] Detect column types, good nrow estimate and whether first row is column names
  'header' changed by user from 'auto' to false
  Number of sampling jump points = 101 because (366418375 bytes from row 1 to eof) / (2 * 1457 jump0size) == 125744
  Type codes (jump 000)    : 1551  Quote rule 0
  Type codes (jump 100)    : 11051  Quote rule 0
  =====
  Sampled 10027 rows (handled \n inside quoted fields) at 101 jump points
  Bytes from first data row on line 12 to the end of last row: 366418143
  Line length: mean=16.02 sd=0.21 min=16 max=29
  Estimated number of rows: 366418143 / 16.02 = 22877178
  Initial alloc = 25164895 rows (22877178 + 9%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
  =====
[08] Assign column names
[09] Apply user overrides on column types
  After 0 type and 0 drop user overrides : 11051
[10] Allocate memory for the datatable
  Allocating 4 column slots (4 - 0 dropped) with 25164895 rows
[11] Read the data
  jumps=[0..360), chunk_size=1017828, total_size=366418143
[12] Finalizing the datatable
Read 22885380 rows x 4 columns from 349.4MB (366418725 bytes) file in 00:02.589 wall clock time
Thread buffers were grown 0 times (if all 1 threads each grew once, this figure would be 1)
Final type counts
         0 : drop     
         1 : bool8    
         0 : bool8    
         0 : bool8    
         0 : bool8    
         1 : int32    
         0 : int64    
         0 : float64  
         0 : float64  
         0 : float64  
         2 : string   
Rereading 1 columns due to out-of-sample type exceptions.
Column 1 ("") bumped from 'bool8' to 'string' due to <<"Goulburn">> on row 0
[11] Read the data
  jumps=[0..360), chunk_size=1017828, total_size=366418143
Read 99%. ETA 00:00 
[12] Finalizing the datatable
Reread 22885380 rows x 1 columns in 00:01.418
Read 22885380 rows. Exactly what was estimated and allocated up front
=============================
   0.001s (  0%) Memory map 0.341GB file
   0.003s (  0%) sep=',' ncol=4 and header detection
   0.001s (  0%) Column type detection using 10027 sample rows
   0.574s ( 14%) Allocation of 22885380 rows x 4 cols (0.469GB)
   3.428s ( 86%) Reading 360 chunks of 0.971MB (63547 rows) using 1 threads
   =    0.010s (  0%) Finding first non-embedded \n after each jump
   +    1.988s ( 50%) Parse to row-major thread buffers
   +    1.137s ( 28%) Transpose
   +    0.292s (  7%) Waiting
   1.418s ( 35%) Rereading 1 columns due to out-of-sample type exceptions
   4.007s        Total
There were 20 warnings (use warnings() to see them)

@HughParsonage Terima kasih! Saya bingung. Anda mengatakan itu berfungsi dengan baik dengan verbose=TRUE, showProgress=TRUE yang kami harapkan - hore! Itu gagal sebelumnya bukan? Default untuk showProgress adalah BENAR tetapi ketika Anda menjalankan dengan FALSE default untuk verbose , _then_ itu tidak berfungsi dan Anda melihat tumpukan tidak seimbang? Aneh bahwa keluaran _less_ membuatnya gagal. Mohon konfirmasi. Jika itu masalahnya maka mungkin saya menggonggong pohon yang salah. Ini berfungsi dengan baik di sini untuk saya di Linux jadi saya bergantung pada Anda pengujian di Windows. Terima kasih.
(Juga, di bagian bawah dari hasil putaran ke-10, dikatakan ada 20 peringatan. Saya berasumsi itu adalah 2 peringatan yang ditampilkan lebih tinggi, diulang 10 kali. Jika demikian, masuk akal.)

Hai, maaf atas kebingungannya, Matt.

Anda benar bahwa masalah asli tidak lagi mengakibatkan crash, yaitu yang berikut berfungsi seperti yang diharapkan:

fread("SA2-by-DJZ-2011.csv", header = FALSE, na.strings = "")

Untuk memperjelas, aslinya, ketika verbose =FALSE (default) saya mengalami crash. Saya menjalankannya dengan verbose = TRUE sebelum mengajukan masalah, dan melihat peringatan 'stack imbalance' tetapi tidak mengalami crash. Dengan versi terbaru, saya tidak mengalami crash (atau memang ada masalah) dengan verbose = FALSE .

Alasan saya mengatakan 'tidak diperbaiki' adalah karena saya memperhatikan pesan peringatan:

Warning messages:
Warning: stack imbalance in '$', 26 then 27
Warning: stack imbalance in 'lapply', 31 then 30
Warning: stack imbalance in '$', 14 then 15

yang tampak aneh dan saya pikir mungkin menunjukkan masalah yang terkait erat meskipun tidak identik. Karena itu, pagi ini di Australia saya tidak dapat lagi mereproduksi pesan peringatan.

Ok aku paham. Pesan peringatan tentang ketidakseimbangan tumpukan pada dasarnya adalah kesalahan, ya. Kita tidak bisa melewatkannya. Saya menyebut peringatan tentang ketidakseimbangan tumpukan itu sebagai crash, meskipun itu belum benar-benar macet. (Ini hanya masalah waktu sampai macet setelah melihat peringatan itu.)

Ketika Anda melakukan 10 proses dalam sesi R baru dengan verbose=TRUE, showProgress=TRUE , apakah salah satu dari 20 peringatan tentang ketidakseimbangan tumpukan atau semua 20 peringatan tersebut hanyalah peringatan reguler berikut.

1: In fread("SA2-by-DJZ-2011.csv", header = FALSE, na.strings = "",  :
  Starting data input on line 12 <<"Goulburn","110018063",3499,>> with 4 fields and discarding line 11 <<"Main Statistical Area Structu>> before it because it has a different number of fields (3).
2: In fread("SA2-by-DJZ-2011.csv", header = FALSE, na.strings = "",  :
  Found the last consistent line but text exists afterwards. Consider fill=TRUE and/or blank.lines.skip=TRUE. First 200 characters of discarded line: <<"Dataset: 2011 Census of Population and Housing">>

Setelah peringatan ketidakseimbangan tumpukan terjadi, mulailah sesi R baru yang baru. Kami tidak dapat mempercayai apa pun dari R setelah itu terjadi bahkan sekali.

Saya berhasil mendapatkan crash ketika saya menjalankan dengan verbose=TRUE, showProgress=TRUE . Sesuatu tentang const char dengan SEXP . Saya mencoba mereproduksi ini dari baris perintah (sayangnya itu terjadi di RStudio dan RStudio ditutup sebelum saya dapat membaca seluruh pesan).

Tidak dapat mereproduksi crash. Inilah hasil setelah reboot. Ada peringatan ketidakseimbangan tumpukan:

> library(data.table)
data.table 1.10.5 IN DEVELOPMENT built 2017-11-15 00:36:41 UTC; appveyor
  The fastest way to learn (by data.table authors): https://www.datacamp.com/courses/data-analysis-the-data-table-way
  Documentation: ?data.table, example(data.table) and browseVignettes("data.table")
  Release notes, videos and slides: http://r-datatable.com
> for (i in 1:10) fread("SA2-by-DJZ-2011.csv", header = FALSE, na.strings = "", verbose = TRUE, showProgress = TRUE)
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
  Using 12 threads (omp_get_max_threads()=12, nth=12)
  NAstrings = [<<>>]
  None of the NAstrings look like numbers.
  show progress = 1
  0/1 column will be read as boolean
[02] Opening the file
  Opening file SA2-by-DJZ-2011.csv
  File opened, size = 349.4MB (366418725 bytes).
  Memory mapping ... ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
  \r-only line endings are not allowed because \n is found in the data
[05] Skipping initial rows if needed
  Positioned on line 1 starting: <<Australian Bureau of Statistic>>
[06] Detect separator, quoting rule, and ncolumns
  Detecting sep ...
  sep=','  with 89 lines of 4 fields using quote rule 0
  Detected 4 columns on line 12. This line is either column names or first data row. Line starts as: <<"Goulburn","110018063",3499,>>
  Quote rule picked = 0
  fill=false and the most number of columns found is 4
[07] Detect column types, good nrow estimate and whether first row is column names
  'header' changed by user from 'auto' to false
  Number of sampling jump points = 101 because (366418375 bytes from row 1 to eof) / (2 * 1457 jump0size) == 125744
  Type codes (jump 000)    : 1551  Quote rule 0
  Type codes (jump 100)    : 11051  Quote rule 0
  =====
  Sampled 10027 rows (handled \n inside quoted fields) at 101 jump points
  Bytes from first data row on line 12 to the end of last row: 366418143
  Line length: mean=16.02 sd=0.21 min=16 max=29
  Estimated number of rows: 366418143 / 16.02 = 22877178
  Initial alloc = 25164895 rows (22877178 + 9%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
  =====
[08] Assign column names
[09] Apply user overrides on column types
  After 0 type and 0 drop user overrides : 11051
[10] Allocate memory for the datatable
  Allocating 4 column slots (4 - 0 dropped) with 25164895 rows
[11] Read the data
  jumps=[0..360), chunk_size=1017828, total_size=366418143
[12] Finalizing the datatable
Read 22885380 rows x 4 columns from 349.4MB (366418725 bytes) file in 00:01.212 wall clock time
Thread buffers were grown 0 times (if all 12 threads each grew once, this figure would be 12)
Final type counts
         0 : drop     
         1 : bool8    
         0 : bool8    
         0 : bool8    
         0 : bool8    
         1 : int32    
         0 : int64    
         0 : float64  
         0 : float64  
         0 : float64  
         2 : string   
Rereading 1 columns due to out-of-sample type exceptions.
Column 1 ("") bumped from 'bool8' to 'string' due to <<"Goulburn">> on row 0
[11] Read the data
  jumps=[0..360), chunk_size=1017828, total_size=366418143
Read 31%. ETA 00:00 Warning: stack imbalance in '$', 24 then 23
Read 91%. ETA 00:00 Warning: stack imbalance in '$', 27 then 26
Read 95%. ETA 00:00 Warning: stack imbalance in '$', 28 then 29
Read 98%. ETA 00:00 
[12] Finalizing the datatable
Reread 22885380 rows x 1 columns in 00:00.895
Read 22885380 rows. Exactly what was estimated and allocated up front
=============================
   0.002s (  0%) Memory map 0.341GB file
   0.029s (  1%) sep=',' ncol=4 and header detection
   0.001s (  0%) Column type detection using 10027 sample rows
   0.314s ( 15%) Allocation of 22885380 rows x 4 cols (0.469GB)
   1.761s ( 84%) Reading 360 chunks of 0.971MB (63547 rows) using 12 threads
   =    0.015s (  1%) Finding first non-embedded \n after each jump
   +    0.599s ( 28%) Parse to row-major thread buffers
   +    0.400s ( 19%) Transpose
   +    0.746s ( 35%) Waiting
   0.895s ( 42%) Rereading 1 columns due to out-of-sample type exceptions
   2.107s        Total
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
  Using 12 threads (omp_get_max_threads()=12, nth=12)
  NAstrings = [<<>>]
  None of the NAstrings look like numbers.
  show progress = 1
  0/1 column will be read as boolean
[02] Opening the file
  Opening file SA2-by-DJZ-2011.csv
  File opened, size = 349.4MB (366418725 bytes).
  Memory mapping ... ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
  \r-only line endings are not allowed because \n is found in the data
[05] Skipping initial rows if needed
  Positioned on line 1 starting: <<Australian Bureau of Statistic>>
[06] Detect separator, quoting rule, and ncolumns
  Detecting sep ...
  sep=','  with 89 lines of 4 fields using quote rule 0
  Detected 4 columns on line 12. This line is either column names or first data row. Line starts as: <<"Goulburn","110018063",3499,>>
  Quote rule picked = 0
  fill=false and the most number of columns found is 4
[07] Detect column types, good nrow estimate and whether first row is column names
  'header' changed by user from 'auto' to false
  Number of sampling jump points = 101 because (366418375 bytes from row 1 to eof) / (2 * 1457 jump0size) == 125744
  Type codes (jump 000)    : 1551  Quote rule 0
  Type codes (jump 100)    : 11051  Quote rule 0
  =====
  Sampled 10027 rows (handled \n inside quoted fields) at 101 jump points
  Bytes from first data row on line 12 to the end of last row: 366418143
  Line length: mean=16.02 sd=0.21 min=16 max=29
  Estimated number of rows: 366418143 / 16.02 = 22877178
  Initial alloc = 25164895 rows (22877178 + 9%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
  =====
[08] Assign column names
[09] Apply user overrides on column types
  After 0 type and 0 drop user overrides : 11051
[10] Allocate memory for the datatable
  Allocating 4 column slots (4 - 0 dropped) with 25164895 rows
[11] Read the data
  jumps=[0..360), chunk_size=1017828, total_size=366418143
[12] Finalizing the datatable
Read 22885380 rows x 4 columns from 349.4MB (366418725 bytes) file in 00:01.335 wall clock time
Thread buffers were grown 0 times (if all 12 threads each grew once, this figure would be 12)
Final type counts
         0 : drop     
         1 : bool8    
         0 : bool8    
         0 : bool8    
         0 : bool8    
         1 : int32    
         0 : int64    
         0 : float64  
         0 : float64  
         0 : float64  
         2 : string   
Rereading 1 columns due to out-of-sample type exceptions.
Column 1 ("") bumped from 'bool8' to 'string' due to <<"Goulburn">> on row 0
[11] Read the data
  jumps=[0..360), chunk_size=1017828, total_size=366418143
Read 98%. ETA 00:00 
[12] Finalizing the datatable
Reread 22885380 rows x 1 columns in 00:01.049
Read 22885380 rows. Exactly what was estimated and allocated up front
=============================
   0.002s (  0%) Memory map 0.341GB file
   0.006s (  0%) sep=',' ncol=4 and header detection
   0.001s (  0%) Column type detection using 10027 sample rows
   0.402s ( 17%) Allocation of 22885380 rows x 4 cols (0.469GB)
   1.974s ( 83%) Reading 360 chunks of 0.971MB (63547 rows) using 12 threads
   =    0.001s (  0%) Finding first non-embedded \n after each jump
   +    0.209s (  9%) Parse to row-major thread buffers
   +    0.864s ( 36%) Transpose
   +    0.900s ( 38%) Waiting
   1.049s ( 44%) Rereading 1 columns due to out-of-sample type exceptions
   2.385s        Total
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
  Using 12 threads (omp_get_max_threads()=12, nth=12)
  NAstrings = [<<>>]
  None of the NAstrings look like numbers.
  show progress = 1
  0/1 column will be read as boolean
[02] Opening the file
  Opening file SA2-by-DJZ-2011.csv
  File opened, size = 349.4MB (366418725 bytes).
  Memory mapping ... ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
  \r-only line endings are not allowed because \n is found in the data
[05] Skipping initial rows if needed
  Positioned on line 1 starting: <<Australian Bureau of Statistic>>
[06] Detect separator, quoting rule, and ncolumns
  Detecting sep ...
  sep=','  with 89 lines of 4 fields using quote rule 0
  Detected 4 columns on line 12. This line is either column names or first data row. Line starts as: <<"Goulburn","110018063",3499,>>
  Quote rule picked = 0
  fill=false and the most number of columns found is 4
[07] Detect column types, good nrow estimate and whether first row is column names
  'header' changed by user from 'auto' to false
  Number of sampling jump points = 101 because (366418375 bytes from row 1 to eof) / (2 * 1457 jump0size) == 125744
  Type codes (jump 000)    : 1551  Quote rule 0
  Type codes (jump 100)    : 11051  Quote rule 0
  =====
  Sampled 10027 rows (handled \n inside quoted fields) at 101 jump points
  Bytes from first data row on line 12 to the end of last row: 366418143
  Line length: mean=16.02 sd=0.21 min=16 max=29
  Estimated number of rows: 366418143 / 16.02 = 22877178
  Initial alloc = 25164895 rows (22877178 + 9%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
  =====
[08] Assign column names
[09] Apply user overrides on column types
  After 0 type and 0 drop user overrides : 11051
[10] Allocate memory for the datatable
  Allocating 4 column slots (4 - 0 dropped) with 25164895 rows
[11] Read the data
  jumps=[0..360), chunk_size=1017828, total_size=366418143
[12] Finalizing the datatable
Read 22885380 rows x 4 columns from 349.4MB (366418725 bytes) file in 00:01.212 wall clock time
Thread buffers were grown 0 times (if all 12 threads each grew once, this figure would be 12)
Final type counts
         0 : drop     
         1 : bool8    
         0 : bool8    
         0 : bool8    
         0 : bool8    
         1 : int32    
         0 : int64    
         0 : float64  
         0 : float64  
         0 : float64  
         2 : string   
Rereading 1 columns due to out-of-sample type exceptions.
Column 1 ("") bumped from 'bool8' to 'string' due to <<"Goulburn">> on row 0
[11] Read the data
  jumps=[0..360), chunk_size=1017828, total_size=366418143
Read 96%. ETA 00:00 
[12] Finalizing the datatable
Reread 22885380 rows x 1 columns in 00:00.414
Read 22885380 rows. Exactly what was estimated and allocated up front
=============================
   0.002s (  0%) Memory map 0.341GB file
   0.007s (  0%) sep=',' ncol=4 and header detection
   0.002s (  0%) Column type detection using 10027 sample rows
   0.293s ( 18%) Allocation of 22885380 rows x 4 cols (0.469GB)
   1.322s ( 81%) Reading 360 chunks of 0.971MB (63547 rows) using 12 threads
   =    0.001s (  0%) Finding first non-embedded \n after each jump
   +    0.199s ( 12%) Parse to row-major thread buffers
   +    0.822s ( 51%) Transpose
   +    0.301s ( 19%) Waiting
   0.414s ( 25%) Rereading 1 columns due to out-of-sample type exceptions
   1.626s        Total
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
  Using 12 threads (omp_get_max_threads()=12, nth=12)
  NAstrings = [<<>>]
  None of the NAstrings look like numbers.
  show progress = 1
  0/1 column will be read as boolean
[02] Opening the file
  Opening file SA2-by-DJZ-2011.csv
  File opened, size = 349.4MB (366418725 bytes).
  Memory mapping ... ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
  \r-only line endings are not allowed because \n is found in the data
[05] Skipping initial rows if needed
  Positioned on line 1 starting: <<Australian Bureau of Statistic>>
[06] Detect separator, quoting rule, and ncolumns
  Detecting sep ...
  sep=','  with 89 lines of 4 fields using quote rule 0
  Detected 4 columns on line 12. This line is either column names or first data row. Line starts as: <<"Goulburn","110018063",3499,>>
  Quote rule picked = 0
  fill=false and the most number of columns found is 4
[07] Detect column types, good nrow estimate and whether first row is column names
  'header' changed by user from 'auto' to false
  Number of sampling jump points = 101 because (366418375 bytes from row 1 to eof) / (2 * 1457 jump0size) == 125744
  Type codes (jump 000)    : 1551  Quote rule 0
  Type codes (jump 100)    : 11051  Quote rule 0
  =====
  Sampled 10027 rows (handled \n inside quoted fields) at 101 jump points
  Bytes from first data row on line 12 to the end of last row: 366418143
  Line length: mean=16.02 sd=0.21 min=16 max=29
  Estimated number of rows: 366418143 / 16.02 = 22877178
  Initial alloc = 25164895 rows (22877178 + 9%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
  =====
[08] Assign column names
[09] Apply user overrides on column types
  After 0 type and 0 drop user overrides : 11051
[10] Allocate memory for the datatable
  Allocating 4 column slots (4 - 0 dropped) with 25164895 rows
[11] Read the data
  jumps=[0..360), chunk_size=1017828, total_size=366418143
[12] Finalizing the datatable
Read 22885380 rows x 4 columns from 349.4MB (366418725 bytes) file in 00:01.451 wall clock time
Thread buffers were grown 0 times (if all 12 threads each grew once, this figure would be 12)
Final type counts
         0 : drop     
         1 : bool8    
         0 : bool8    
         0 : bool8    
         0 : bool8    
         1 : int32    
         0 : int64    
         0 : float64  
         0 : float64  
         0 : float64  
         2 : string   
Rereading 1 columns due to out-of-sample type exceptions.
Column 1 ("") bumped from 'bool8' to 'string' due to <<"Goulburn">> on row 0
[11] Read the data
  jumps=[0..360), chunk_size=1017828, total_size=366418143
Read 97%. ETA 00:00 
[12] Finalizing the datatable
Reread 22885380 rows x 1 columns in 00:00.409
Read 22885380 rows. Exactly what was estimated and allocated up front
=============================
   0.002s (  0%) Memory map 0.341GB file
   0.006s (  0%) sep=',' ncol=4 and header detection
   0.001s (  0%) Column type detection using 10027 sample rows
   0.403s ( 22%) Allocation of 22885380 rows x 4 cols (0.469GB)
   1.448s ( 78%) Reading 360 chunks of 0.971MB (63547 rows) using 12 threads
   =    0.001s (  0%) Finding first non-embedded \n after each jump
   +    0.194s ( 10%) Parse to row-major thread buffers
   +    0.974s ( 52%) Transpose
   +    0.279s ( 15%) Waiting
   0.409s ( 22%) Rereading 1 columns due to out-of-sample type exceptions
   1.860s        Total
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
  Using 12 threads (omp_get_max_threads()=12, nth=12)
  NAstrings = [<<>>]
  None of the NAstrings look like numbers.
  show progress = 1
  0/1 column will be read as boolean
[02] Opening the file
  Opening file SA2-by-DJZ-2011.csv
  File opened, size = 349.4MB (366418725 bytes).
  Memory mapping ... ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
  \r-only line endings are not allowed because \n is found in the data
[05] Skipping initial rows if needed
  Positioned on line 1 starting: <<Australian Bureau of Statistic>>
[06] Detect separator, quoting rule, and ncolumns
  Detecting sep ...
  sep=','  with 89 lines of 4 fields using quote rule 0
  Detected 4 columns on line 12. This line is either column names or first data row. Line starts as: <<"Goulburn","110018063",3499,>>
  Quote rule picked = 0
  fill=false and the most number of columns found is 4
[07] Detect column types, good nrow estimate and whether first row is column names
  'header' changed by user from 'auto' to false
  Number of sampling jump points = 101 because (366418375 bytes from row 1 to eof) / (2 * 1457 jump0size) == 125744
  Type codes (jump 000)    : 1551  Quote rule 0
  Type codes (jump 100)    : 11051  Quote rule 0
  =====
  Sampled 10027 rows (handled \n inside quoted fields) at 101 jump points
  Bytes from first data row on line 12 to the end of last row: 366418143
  Line length: mean=16.02 sd=0.21 min=16 max=29
  Estimated number of rows: 366418143 / 16.02 = 22877178
  Initial alloc = 25164895 rows (22877178 + 9%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
  =====
[08] Assign column names
[09] Apply user overrides on column types
  After 0 type and 0 drop user overrides : 11051
[10] Allocate memory for the datatable
  Allocating 4 column slots (4 - 0 dropped) with 25164895 rows
[11] Read the data
  jumps=[0..360), chunk_size=1017828, total_size=366418143
[12] Finalizing the datatable
Read 22885380 rows x 4 columns from 349.4MB (366418725 bytes) file in 00:01.480 wall clock time
Thread buffers were grown 0 times (if all 12 threads each grew once, this figure would be 12)
Final type counts
         0 : drop     
         1 : bool8    
         0 : bool8    
         0 : bool8    
         0 : bool8    
         1 : int32    
         0 : int64    
         0 : float64  
         0 : float64  
         0 : float64  
         2 : string   
Rereading 1 columns due to out-of-sample type exceptions.
Column 1 ("") bumped from 'bool8' to 'string' due to <<"Goulburn">> on row 0
[11] Read the data
  jumps=[0..360), chunk_size=1017828, total_size=366418143
Read 99%. ETA 00:00 
[12] Finalizing the datatable
Reread 22885380 rows x 1 columns in 00:00.412
Read 22885380 rows. Exactly what was estimated and allocated up front
=============================
   0.002s (  0%) Memory map 0.341GB file
   0.007s (  0%) sep=',' ncol=4 and header detection
   0.001s (  0%) Column type detection using 10027 sample rows
   0.459s ( 24%) Allocation of 22885380 rows x 4 cols (0.469GB)
   1.424s ( 75%) Reading 360 chunks of 0.971MB (63547 rows) using 12 threads
   =    0.001s (  0%) Finding first non-embedded \n after each jump
   +    0.197s ( 10%) Parse to row-major thread buffers
   +    0.938s ( 50%) Transpose
   +    0.288s ( 15%) Waiting
   0.412s ( 22%) Rereading 1 columns due to out-of-sample type exceptions
   1.892s        Total
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
  Using 12 threads (omp_get_max_threads()=12, nth=12)
  NAstrings = [<<>>]
  None of the NAstrings look like numbers.
  show progress = 1
  0/1 column will be read as boolean
[02] Opening the file
  Opening file SA2-by-DJZ-2011.csv
  File opened, size = 349.4MB (366418725 bytes).
  Memory mapping ... ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
  \r-only line endings are not allowed because \n is found in the data
[05] Skipping initial rows if needed
  Positioned on line 1 starting: <<Australian Bureau of Statistic>>
[06] Detect separator, quoting rule, and ncolumns
  Detecting sep ...
  sep=','  with 89 lines of 4 fields using quote rule 0
  Detected 4 columns on line 12. This line is either column names or first data row. Line starts as: <<"Goulburn","110018063",3499,>>
  Quote rule picked = 0
  fill=false and the most number of columns found is 4
[07] Detect column types, good nrow estimate and whether first row is column names
  'header' changed by user from 'auto' to false
  Number of sampling jump points = 101 because (366418375 bytes from row 1 to eof) / (2 * 1457 jump0size) == 125744
  Type codes (jump 000)    : 1551  Quote rule 0
  Type codes (jump 100)    : 11051  Quote rule 0
  =====
  Sampled 10027 rows (handled \n inside quoted fields) at 101 jump points
  Bytes from first data row on line 12 to the end of last row: 366418143
  Line length: mean=16.02 sd=0.21 min=16 max=29
  Estimated number of rows: 366418143 / 16.02 = 22877178
  Initial alloc = 25164895 rows (22877178 + 9%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
  =====
[08] Assign column names
[09] Apply user overrides on column types
  After 0 type and 0 drop user overrides : 11051
[10] Allocate memory for the datatable
  Allocating 4 column slots (4 - 0 dropped) with 25164895 rows
[11] Read the data
  jumps=[0..360), chunk_size=1017828, total_size=366418143
[12] Finalizing the datatable
Read 22885380 rows x 4 columns from 349.4MB (366418725 bytes) file in 00:01.381 wall clock time
Thread buffers were grown 0 times (if all 12 threads each grew once, this figure would be 12)
Final type counts
         0 : drop     
         1 : bool8    
         0 : bool8    
         0 : bool8    
         0 : bool8    
         1 : int32    
         0 : int64    
         0 : float64  
         0 : float64  
         0 : float64  
         2 : string   
Rereading 1 columns due to out-of-sample type exceptions.
Column 1 ("") bumped from 'bool8' to 'string' due to <<"Goulburn">> on row 0
[11] Read the data
  jumps=[0..360), chunk_size=1017828, total_size=366418143
Read 97%. ETA 00:00 
[12] Finalizing the datatable
Reread 22885380 rows x 1 columns in 00:00.401
Read 22885380 rows. Exactly what was estimated and allocated up front
=============================
   0.002s (  0%) Memory map 0.341GB file
   0.005s (  0%) sep=',' ncol=4 and header detection
   0.001s (  0%) Column type detection using 10027 sample rows
   0.384s ( 22%) Allocation of 22885380 rows x 4 cols (0.469GB)
   1.389s ( 78%) Reading 360 chunks of 0.971MB (63547 rows) using 12 threads
   =    0.001s (  0%) Finding first non-embedded \n after each jump
   +    0.196s ( 11%) Parse to row-major thread buffers
   +    0.911s ( 51%) Transpose
   +    0.281s ( 16%) Waiting
   0.401s ( 22%) Rereading 1 columns due to out-of-sample type exceptions
   1.781s        Total
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
  Using 12 threads (omp_get_max_threads()=12, nth=12)
  NAstrings = [<<>>]
  None of the NAstrings look like numbers.
  show progress = 1
  0/1 column will be read as boolean
[02] Opening the file
  Opening file SA2-by-DJZ-2011.csv
  File opened, size = 349.4MB (366418725 bytes).
  Memory mapping ... ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
  \r-only line endings are not allowed because \n is found in the data
[05] Skipping initial rows if needed
  Positioned on line 1 starting: <<Australian Bureau of Statistic>>
[06] Detect separator, quoting rule, and ncolumns
  Detecting sep ...
  sep=','  with 89 lines of 4 fields using quote rule 0
  Detected 4 columns on line 12. This line is either column names or first data row. Line starts as: <<"Goulburn","110018063",3499,>>
  Quote rule picked = 0
  fill=false and the most number of columns found is 4
[07] Detect column types, good nrow estimate and whether first row is column names
  'header' changed by user from 'auto' to false
  Number of sampling jump points = 101 because (366418375 bytes from row 1 to eof) / (2 * 1457 jump0size) == 125744
  Type codes (jump 000)    : 1551  Quote rule 0
  Type codes (jump 100)    : 11051  Quote rule 0
  =====
  Sampled 10027 rows (handled \n inside quoted fields) at 101 jump points
  Bytes from first data row on line 12 to the end of last row: 366418143
  Line length: mean=16.02 sd=0.21 min=16 max=29
  Estimated number of rows: 366418143 / 16.02 = 22877178
  Initial alloc = 25164895 rows (22877178 + 9%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
  =====
[08] Assign column names
[09] Apply user overrides on column types
  After 0 type and 0 drop user overrides : 11051
[10] Allocate memory for the datatable
  Allocating 4 column slots (4 - 0 dropped) with 25164895 rows
[11] Read the data
  jumps=[0..360), chunk_size=1017828, total_size=366418143
[12] Finalizing the datatable
Read 22885380 rows x 4 columns from 349.4MB (366418725 bytes) file in 00:01.384 wall clock time
Thread buffers were grown 0 times (if all 12 threads each grew once, this figure would be 12)
Final type counts
         0 : drop     
         1 : bool8    
         0 : bool8    
         0 : bool8    
         0 : bool8    
         1 : int32    
         0 : int64    
         0 : float64  
         0 : float64  
         0 : float64  
         2 : string   
Rereading 1 columns due to out-of-sample type exceptions.
Column 1 ("") bumped from 'bool8' to 'string' due to <<"Goulburn">> on row 0
[11] Read the data
  jumps=[0..360), chunk_size=1017828, total_size=366418143
Read 98%. ETA 00:00 
[12] Finalizing the datatable
Reread 22885380 rows x 1 columns in 00:00.480
Read 22885380 rows. Exactly what was estimated and allocated up front
=============================
   0.002s (  0%) Memory map 0.341GB file
   0.006s (  0%) sep=',' ncol=4 and header detection
   0.002s (  0%) Column type detection using 10027 sample rows
   0.476s ( 26%) Allocation of 22885380 rows x 4 cols (0.469GB)
   1.378s ( 74%) Reading 360 chunks of 0.971MB (63547 rows) using 12 threads
   =    0.001s (  0%) Finding first non-embedded \n after each jump
   +    0.192s ( 10%) Parse to row-major thread buffers
   +    0.833s ( 45%) Transpose
   +    0.352s ( 19%) Waiting
   0.480s ( 26%) Rereading 1 columns due to out-of-sample type exceptions
   1.864s        Total
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
  Using 12 threads (omp_get_max_threads()=12, nth=12)
  NAstrings = [<<>>]
  None of the NAstrings look like numbers.
  show progress = 1
  0/1 column will be read as boolean
[02] Opening the file
  Opening file SA2-by-DJZ-2011.csv
  File opened, size = 349.4MB (366418725 bytes).
  Memory mapping ... ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
  \r-only line endings are not allowed because \n is found in the data
[05] Skipping initial rows if needed
  Positioned on line 1 starting: <<Australian Bureau of Statistic>>
[06] Detect separator, quoting rule, and ncolumns
  Detecting sep ...
  sep=','  with 89 lines of 4 fields using quote rule 0
  Detected 4 columns on line 12. This line is either column names or first data row. Line starts as: <<"Goulburn","110018063",3499,>>
  Quote rule picked = 0
  fill=false and the most number of columns found is 4
[07] Detect column types, good nrow estimate and whether first row is column names
  'header' changed by user from 'auto' to false
  Number of sampling jump points = 101 because (366418375 bytes from row 1 to eof) / (2 * 1457 jump0size) == 125744
  Type codes (jump 000)    : 1551  Quote rule 0
  Type codes (jump 100)    : 11051  Quote rule 0
  =====
  Sampled 10027 rows (handled \n inside quoted fields) at 101 jump points
  Bytes from first data row on line 12 to the end of last row: 366418143
  Line length: mean=16.02 sd=0.21 min=16 max=29
  Estimated number of rows: 366418143 / 16.02 = 22877178
  Initial alloc = 25164895 rows (22877178 + 9%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
  =====
[08] Assign column names
[09] Apply user overrides on column types
  After 0 type and 0 drop user overrides : 11051
[10] Allocate memory for the datatable
  Allocating 4 column slots (4 - 0 dropped) with 25164895 rows
[11] Read the data
  jumps=[0..360), chunk_size=1017828, total_size=366418143
[12] Finalizing the datatable
Read 22885380 rows x 4 columns from 349.4MB (366418725 bytes) file in 00:01.374 wall clock time
Thread buffers were grown 0 times (if all 12 threads each grew once, this figure would be 12)
Final type counts
         0 : drop     
         1 : bool8    
         0 : bool8    
         0 : bool8    
         0 : bool8    
         1 : int32    
         0 : int64    
         0 : float64  
         0 : float64  
         0 : float64  
         2 : string   
Rereading 1 columns due to out-of-sample type exceptions.
Column 1 ("") bumped from 'bool8' to 'string' due to <<"Goulburn">> on row 0
[11] Read the data
  jumps=[0..360), chunk_size=1017828, total_size=366418143
Read 98%. ETA 00:00 
[12] Finalizing the datatable
Reread 22885380 rows x 1 columns in 00:00.507
Read 22885380 rows. Exactly what was estimated and allocated up front
=============================
   0.002s (  0%) Memory map 0.341GB file
   0.006s (  0%) sep=',' ncol=4 and header detection
   0.001s (  0%) Column type detection using 10027 sample rows
   0.311s ( 17%) Allocation of 22885380 rows x 4 cols (0.469GB)
   1.562s ( 83%) Reading 360 chunks of 0.971MB (63547 rows) using 12 threads
   =    0.001s (  0%) Finding first non-embedded \n after each jump
   +    0.193s ( 10%) Parse to row-major thread buffers
   +    0.988s ( 52%) Transpose
   +    0.381s ( 20%) Waiting
   0.507s ( 27%) Rereading 1 columns due to out-of-sample type exceptions
   1.881s        Total
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
  Using 12 threads (omp_get_max_threads()=12, nth=12)
  NAstrings = [<<>>]
  None of the NAstrings look like numbers.
  show progress = 1
  0/1 column will be read as boolean
[02] Opening the file
  Opening file SA2-by-DJZ-2011.csv
  File opened, size = 349.4MB (366418725 bytes).
  Memory mapping ... ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
  \r-only line endings are not allowed because \n is found in the data
[05] Skipping initial rows if needed
  Positioned on line 1 starting: <<Australian Bureau of Statistic>>
[06] Detect separator, quoting rule, and ncolumns
  Detecting sep ...
  sep=','  with 89 lines of 4 fields using quote rule 0
  Detected 4 columns on line 12. This line is either column names or first data row. Line starts as: <<"Goulburn","110018063",3499,>>
  Quote rule picked = 0
  fill=false and the most number of columns found is 4
[07] Detect column types, good nrow estimate and whether first row is column names
  'header' changed by user from 'auto' to false
  Number of sampling jump points = 101 because (366418375 bytes from row 1 to eof) / (2 * 1457 jump0size) == 125744
  Type codes (jump 000)    : 1551  Quote rule 0
  Type codes (jump 100)    : 11051  Quote rule 0
  =====
  Sampled 10027 rows (handled \n inside quoted fields) at 101 jump points
  Bytes from first data row on line 12 to the end of last row: 366418143
  Line length: mean=16.02 sd=0.21 min=16 max=29
  Estimated number of rows: 366418143 / 16.02 = 22877178
  Initial alloc = 25164895 rows (22877178 + 9%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
  =====
[08] Assign column names
[09] Apply user overrides on column types
  After 0 type and 0 drop user overrides : 11051
[10] Allocate memory for the datatable
  Allocating 4 column slots (4 - 0 dropped) with 25164895 rows
[11] Read the data
  jumps=[0..360), chunk_size=1017828, total_size=366418143
[12] Finalizing the datatable
Read 22885380 rows x 4 columns from 349.4MB (366418725 bytes) file in 00:01.318 wall clock time
Thread buffers were grown 0 times (if all 12 threads each grew once, this figure would be 12)
Final type counts
         0 : drop     
         1 : bool8    
         0 : bool8    
         0 : bool8    
         0 : bool8    
         1 : int32    
         0 : int64    
         0 : float64  
         0 : float64  
         0 : float64  
         2 : string   
Rereading 1 columns due to out-of-sample type exceptions.
Column 1 ("") bumped from 'bool8' to 'string' due to <<"Goulburn">> on row 0
[11] Read the data
  jumps=[0..360), chunk_size=1017828, total_size=366418143
Read 96%. ETA 00:00 
[12] Finalizing the datatable
Reread 22885380 rows x 1 columns in 00:00.493
Read 22885380 rows. Exactly what was estimated and allocated up front
=============================
   0.002s (  0%) Memory map 0.341GB file
   0.006s (  0%) sep=',' ncol=4 and header detection
   0.001s (  0%) Column type detection using 10027 sample rows
   0.306s ( 17%) Allocation of 22885380 rows x 4 cols (0.469GB)
   1.496s ( 83%) Reading 360 chunks of 0.971MB (63547 rows) using 12 threads
   =    0.001s (  0%) Finding first non-embedded \n after each jump
   +    0.193s ( 11%) Parse to row-major thread buffers
   +    0.935s ( 52%) Transpose
   +    0.367s ( 20%) Waiting
   0.493s ( 27%) Rereading 1 columns due to out-of-sample type exceptions
   1.811s        Total
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
  Using 12 threads (omp_get_max_threads()=12, nth=12)
  NAstrings = [<<>>]
  None of the NAstrings look like numbers.
  show progress = 1
  0/1 column will be read as boolean
[02] Opening the file
  Opening file SA2-by-DJZ-2011.csv
  File opened, size = 349.4MB (366418725 bytes).
  Memory mapping ... ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
  \r-only line endings are not allowed because \n is found in the data
[05] Skipping initial rows if needed
  Positioned on line 1 starting: <<Australian Bureau of Statistic>>
[06] Detect separator, quoting rule, and ncolumns
  Detecting sep ...
  sep=','  with 89 lines of 4 fields using quote rule 0
  Detected 4 columns on line 12. This line is either column names or first data row. Line starts as: <<"Goulburn","110018063",3499,>>
  Quote rule picked = 0
  fill=false and the most number of columns found is 4
[07] Detect column types, good nrow estimate and whether first row is column names
  'header' changed by user from 'auto' to false
  Number of sampling jump points = 101 because (366418375 bytes from row 1 to eof) / (2 * 1457 jump0size) == 125744
  Type codes (jump 000)    : 1551  Quote rule 0
  Type codes (jump 100)    : 11051  Quote rule 0
  =====
  Sampled 10027 rows (handled \n inside quoted fields) at 101 jump points
  Bytes from first data row on line 12 to the end of last row: 366418143
  Line length: mean=16.02 sd=0.21 min=16 max=29
  Estimated number of rows: 366418143 / 16.02 = 22877178
  Initial alloc = 25164895 rows (22877178 + 9%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
  =====
[08] Assign column names
[09] Apply user overrides on column types
  After 0 type and 0 drop user overrides : 11051
[10] Allocate memory for the datatable
  Allocating 4 column slots (4 - 0 dropped) with 25164895 rows
[11] Read the data
  jumps=[0..360), chunk_size=1017828, total_size=366418143
[12] Finalizing the datatable
Read 22885380 rows x 4 columns from 349.4MB (366418725 bytes) file in 00:01.141 wall clock time
Thread buffers were grown 0 times (if all 12 threads each grew once, this figure would be 12)
Final type counts
         0 : drop     
         1 : bool8    
         0 : bool8    
         0 : bool8    
         0 : bool8    
         1 : int32    
         0 : int64    
         0 : float64  
         0 : float64  
         0 : float64  
         2 : string   
Rereading 1 columns due to out-of-sample type exceptions.
Column 1 ("") bumped from 'bool8' to 'string' due to <<"Goulburn">> on row 0
[11] Read the data
  jumps=[0..360), chunk_size=1017828, total_size=366418143
Read 98%. ETA 00:00 
[12] Finalizing the datatable
Reread 22885380 rows x 1 columns in 00:00.506
Read 22885380 rows. Exactly what was estimated and allocated up front
=============================
   0.002s (  0%) Memory map 0.341GB file
   0.007s (  0%) sep=',' ncol=4 and header detection
   0.001s (  0%) Column type detection using 10027 sample rows
   0.132s (  8%) Allocation of 22885380 rows x 4 cols (0.469GB)
   1.506s ( 91%) Reading 360 chunks of 0.971MB (63547 rows) using 12 threads
   =    0.001s (  0%) Finding first non-embedded \n after each jump
   +    0.195s ( 12%) Parse to row-major thread buffers
   +    0.938s ( 57%) Transpose
   +    0.371s ( 23%) Waiting
   0.506s ( 31%) Rereading 1 columns due to out-of-sample type exceptions
   1.647s        Total
Warning: stack imbalance in 'for', 2 then 8
There were 20 warnings (use warnings() to see them)

Anehnya, kepastian itu bagus. Terima kasih. Itu berarti flush tidak berhasil dan saya harus menemukan cara untuk menghindari Rprintf . Ini bekerja dengan andal dengan verbose=FALSE, showProgress=FALSE Saya asumsikan (Anda menulis itu di dekat bagian atas masalah ini jadi saya mengandalkan itu.) "Reliably" berarti 10 jalan lurus dengan hanya dua peringatan yang diharapkan dan tidak ada tampilan tumpukan peringatan ketidakseimbangan.
Serahkan padaku. Terima kasih lagi.

@HughParsonage Oke, coba lagi dengan percobaan kedua baru-baru ini. Itu belum digabungkan menjadi master jadi harap berhati-hati untuk mengambil Windows.zip dari cabang di sini . Seperti sebelumnya, berikan output lengkap dengan cara apa pun sehingga saya dapat memeriksanya. Terima kasih!

Upaya pertama dari tindakan berikut ini mengakibatkan crash (sesuatu tentang pointer).

Upaya kedua (setelah reboot) menghasilkan peringatan stack imbalance in '$', 16 then 15 .

# Assert that `data.table` is not installed:
stopifnot(!requireNamespace("data.table", quietly = TRUE))

install.packages("https://ci.appveyor.com/api/buildjobs/bpsehtwybbbgbyy3/artifacts/data.table_1.10.5.zip",
                 repos = NULL)
# Installing package into ‘C:/Users/hughp/Documents/R/win-library/3.4’
# (as ‘lib’ is unspecified)
# trying URL 'https://ci.appveyor.com/api/buildjobs/bpsehtwybbbgbyy3/artifacts/data.table_1.10.5.zip'
# Content type 'application/octet-stream' length 1557502 bytes (1.5 MB)
# downloaded 1.5 MB

library(data.table)
# data.table 1.10.5 IN DEVELOPMENT built 2017-11-17 01:38:17 UTC; appveyor
# The fastest way to learn (by data.table authors): https://www.datacamp.com/courses/data-analysis-the-data-table-way
# Documentation: ?data.table, example(data.table) and browseVignettes("data.table")
# Release notes, videos and slides: http://r-datatable.com

setwd("~/ABS-data/inbox/SA2-by-DJZ-2011/")
fread("SA2-by-DJZ-2011.csv", verbose = TRUE, na.strings = "", header = FALSE)
# Input contains no \n. Taking this to be a filename to open
# [01] Check arguments
# Using 12 threads (omp_get_max_threads()=12, nth=12)
# NAstrings = [<<>>]
# None of the NAstrings look like numbers.
# show progress = 1
# 0/1 column will be read as boolean
# [02] Opening the file
# Opening file SA2-by-DJZ-2011.csv
# File opened, size = 349.4MB (366418725 bytes).
# Memory mapping ... ok
# [03] Detect and skip BOM
# [04] Arrange mmap to be \0 terminated
# \r-only line endings are not allowed because \n is found in the data
# [05] Skipping initial rows if needed
# Positioned on line 1 starting: <<Australian Bureau of Statistic>>
#   [06] Detect separator, quoting rule, and ncolumns
# Detecting sep ...
# sep=','  with 89 lines of 4 fields using quote rule 0
# Detected 4 columns on line 12. This line is either column names or first data row. Line starts as: <<"Goulburn","110018063",3499,>>
#   Quote rule picked = 0
# fill=false and the most number of columns found is 4
# [07] Detect column types, good nrow estimate and whether first row is column names
# 'header' changed by user from 'auto' to false
# Number of sampling jump points = 101 because (366418375 bytes from row 1 to eof) / (2 * 1457 jump0size) == 125744
# Type codes (jump 000)    : 1551  Quote rule 0
# Type codes (jump 100)    : 1A51  Quote rule 0
# =====
#   Sampled 10027 rows (handled \n inside quoted fields) at 101 jump points
# Bytes from first data row on line 12 to the end of last row: 366418143
# Line length: mean=16.02 sd=0.21 min=16 max=29
# Estimated number of rows: 366418143 / 16.02 = 22877178
# Initial alloc = 25164895 rows (22877178 + 9%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
# =====
#   [08] Assign column names
# [09] Apply user overrides on column types
# After 0 type and 0 drop user overrides : 1A51
# [10] Allocate memory for the datatable
# Allocating 4 column slots (4 - 0 dropped) with 25164895 rows
# [11] Read the data
# jumps=[0..360), chunk_size=1017828, total_size=366418143
# Read 78%. ETA 00:00 Warning: stack imbalance in '$', 16 then 15
# Read 22885380 rows x 4 columns from 349.4MB (366418725 bytes) file in 00:02.677 wall clock time
# [12] Finalizing the datatable
# Type counts:
#   1 : bool8     '1'
# 1 : int32     '5'
# 2 : string    'A'
# =============================
#   0.002s (  0%) Memory map 0.341GB file
# 0.007s (  0%) sep=',' ncol=4 and header detection
# 0.001s (  0%) Column type detection using 10027 sample rows
# 0.297s ( 11%) Allocation of 25164895 rows x 4 cols (0.469GB) of which 22885380 ( 91%) rows used
# 2.369s ( 89%) Reading 360 chunks of 0.971MB (63547 rows) using 12 threads
# =    0.003s (  0%) Finding first non-embedded \n after each jump
# +    0.273s ( 10%) Parse to row-major thread buffers (grown 0 times)
# +    1.313s ( 49%) Transpose
# +    0.780s ( 29%) Waiting
# 0.893s ( 33%) Rereading 1 columns due to out-of-sample type exceptions
# 2.677s        Total
# Column 1 ("") bumped from 'bool8' to 'string' due to <<"Goulburn">> on row 0
# V1        V2      V3 V4
# 1: Goulburn 110018063    3499 NA
# 2:       NA 110018064     812 NA
# 3:       NA 110018065    2158 NA
# 4:       NA 110019999     402 NA
# 5:       NA 110028068      10 NA
# ---                              
#   22885376:       NA 997999799       0 NA
# 22885377:       NA 998999899      64 NA
# 22885378:       NA 994999499      34 NA
# 22885379:       NA 0&&&&&&&&  250796 NA
# 22885380:       NA 0@@@@@@@@ 7305367 NA
# Warning messages:
#   1: In fread("SA2-by-DJZ-2011.csv", verbose = TRUE, na.strings = "",  :
#                 Starting data input on line 12 <<"Goulburn","110018063",3499,>> with 4 fields and discarding line 11 <<"Main Statistical Area Structu>> before it because it has a different number of fields (3).
#               2: In fread("SA2-by-DJZ-2011.csv", verbose = TRUE, na.strings = "",  :
#               Found the last consistent line but text exists afterwards. Consider fill=TRUE and/or blank.lines.skip=TRUE. First 200 characters of discarded line: <<"Dataset: 2011 Census of Population and Housing">>

Hai, @mattdowle. Ada versi GCC yang masih digunakan dengan OpenMP terbaik 3.1, bukan 4.0. Saya mengalami masalah itu di salah satu paket saya di CRAN ( Delaporte ) di mana saya mencoba menggunakan direktif SIMD (OpenMP 4.0) yang dikompilasi dengan Rtools untuk Windows (berdasarkan 4.9.3) tetapi melemparkan dan kesalahan pada mesin Linux seseorang yang masih menggunakan gcc 4.8.0. Bahkan Windows hanya dapat menggunakan panggilan 4.0 dan bukan 4.5, jika saya ingat dengan benar. Mungkin itu yang menyebabkan masalah?

@HughParsonage Terima kasih telah menguji begitu cepat! Oke, saya akan terus berpikir kalau begitu!
@aadler Itu pemikiran yang bagus - segalanya mungkin.

@HughParsonage Hanya untuk mengonfirmasi, harap bahwa perintah yang sama hanya dengan satu perubahan ( verbose=FALSE ) berfungsi dengan baik? yaitu fread("SA2-by-DJZ-2011.csv", verbose = FALSE, na.strings = "", header = FALSE) . Pengukur kemajuan masih akan ditampilkan.

Ya, menjalankan perintah itu (sepuluh kali) mengembalikan hasil yang diharapkan (yaitu data.table dengan hanya dua peringatan karena diformat dengan buruk). Tidak ada peringatan ketidakseimbangan tumpukan.

Terima kasih. Jadi sepertinya terkait dengan keluaran konsolnya saja. Beberapa hal lagi untuk dicoba ...

Dalam mode verbose, ada beberapa cabang di dalam wilayah paralel yang memanggil wallclock() . Saya telah menyingkatnya untuk selalu mengembalikan 0,0 dan menghindari pemanggilan sistem, untuk mengesampingkan hal itu. Saya pikir itu aman untuk benang tetapi mungkin tidak. Silakan coba Windows.zip baru dari cabang yang dibangun kembali di sini .

Percobaan pertama:

install.packages("https://ci.appveyor.com/api/buildjobs/o0pn9ttkrbqgqw2k/artifacts/data.table_1.10.5.zip",
                 repos = NULL)
# Installing package into ‘C:/Users/hughp/Documents/R/win-library/3.4’
# (as ‘lib’ is unspecified)
# trying URL 'https://ci.appveyor.com/api/buildjobs/o0pn9ttkrbqgqw2k/artifacts/data.table_1.10.5.zip'
# Content type 'application/octet-stream' length 1556972 bytes (1.5 MB)
# downloaded 1.5 MB
# 
# package ‘data.table’ successfully unpacked and MD5 sums checked

library(data.table)
# data.table 1.10.5 IN DEVELOPMENT built 2017-11-17 03:49:20 UTC; appveyor
# The fastest way to learn (by data.table authors): https://www.datacamp.com/courses/data-analysis-the-data-table-way
# Documentation: ?data.table, example(data.table) and browseVignettes("data.table")
# Release notes, videos and slides: http://r-datatable.com
fread("SA2-by-DJZ-2011.csv", verbose = TRUE, na.strings = "", header = FALSE)

image

Upaya kedua, saya mendapatkan peringatan berikut:

Read 22%. ETA 00:00 Error in fread("SA2-by-DJZ-2011.csv", verbose = TRUE, na.strings = "",  : 
  unprotect_ptr: pointer not found
In addition: Warning messages:
1: In fread("SA2-by-DJZ-2011.csv", verbose = TRUE, na.strings = "",  :
  Starting data input on line 12 <<"Goulburn","110018063",3499,>> with 4 fields and discarding line 11 <<"Main Statistical Area Structu>> before it because it has a different number of fields (3).
2: In fread("SA2-by-DJZ-2011.csv", verbose = TRUE, na.strings = "",  :
 Warning: stack imbalance in '$', 29 then 28
 Found the last consistent line but text exists afterwards. Consider fill=TRUE and/or blank.lines.skip=TRUE. First 200 characters of discarded line: <<"Dataset: 2011 Census of Population and Housing">>
Warning: stack imbalance in 'lapply', 125 then 126
Warning: stack imbalance in 'lapply', 55 then 53
Warning: stack imbalance in 'lapply', 30 then 34
Warning: stack imbalance in '<-', 28 then 31
Warning: stack imbalance in '{', 24 then 27
Warning: stack imbalance in '{', 18 then 21

Hanya berpikir: mungkinkah ini menjadi masalah dengan RStudio? Menjalankan skrip dari terminal tampaknya tidak mereproduksi dengan mudah. Saya menjalankan dari RStudio karena membuat penyalinan output konsol lebih mudah.

Ketika Anda mengatakan itu tidak mereproduksi _seperti dengan mudah_ di luar RStudio, apakah itu mereproduksi _at all_? Bahkan jika itu hanya terjadi di dalam RStudio, itu masih sesuatu yang ingin saya perbaiki di sisi data.table. Saya hanya bertanya sebagai cara lain untuk mengonfirmasi bahwa ini pasti adalah "hanya" keluaran konsol dan bukan ketidakseimbangan tumpukan sebenarnya dalam logika fread.

Saya belum mereproduksi di luar RStudio sama sekali, dan dapat mereproduksi dengan andal di dalamnya (yaitu, saya dapat mereproduksi beberapa peringatan atau crash). Saya sudah mencoba command prompt Windows dan git shell (di Windows).

Saya menggunakan RStudio Versi 1.1.383 di Windows. Apakah akan membantu Anda jika saya juga mengangkat masalah ini dengan mereka, atau Anda ingin saya menunggu?

Terima kasih. Itu sangat berguna untuk mengetahui bahwa itu hanya di dalam RStudio. Tidak perlu membicarakannya dengan mereka. Ini hanya berarti itu ada hubungannya dengan buffering konsol keluaran (atau serupa). Saya sudah maju dengan pekerjaan di sekitar dan akan mendorong.

Saya tidak mengerti mengapa Windows tidak mengkompilasi perubahan itu:
fread.c:1054:3: warning: too many arguments for format [-Wformat-extra-args]
Bekerja dengan baik di sini di Linux dan di Travis. Itu mencegah Windows.zip dibuat bagi Anda untuk menguji solusi ini. Aku harus tidur di atasnya.
(Ini mengeluh tentang baris 1054 tetapi bukan baris berikutnya 1055 yang hanya afaics yang sama. Pasti ada perbedaan.% Llu masalah dengan __VA_ARGS__ di Windows - tentu saja tidak.)

Ok akhirnya windows.zip sudah siap untuk anda coba lagi silahkan disini .

Ada beberapa solusi di cabang ini sekarang. Jika berhasil, saya akan mulai menghapus solusi untuk menetapkan yang mana. Peringatan kompilator llu terlihat paling menjanjikan karena akan menimbulkan ketidakseimbangan tumpukan dalam keluaran verbose, sesuai dengan penjelasan @ st-pasha yang ditemukan di sini . Mungkin lapisan Rprintf menyembunyikannya dari kompiler yang sekarang dapat dilihatnya karena menggunakan fprintf secara langsung.

image

Pada upaya kedua (setelah reboot)

stopifnot(!requireNamespace("data.table", quietly = TRUE))
install.packages("https://ci.appveyor.com/api/buildjobs/1pi0ae5iuyj9rhj8/artifacts/data.table_1.10.5.zip",
                 repos = NULL)
# Installing package into ‘C:/Users/hughp/Documents/R/win-library/3.4’
# (as ‘lib’ is unspecified)
# trying URL 'https://ci.appveyor.com/api/buildjobs/1pi0ae5iuyj9rhj8/artifacts/data.table_1.10.5.zip'
# Content type 'application/octet-stream' length 1559167 bytes (1.5 MB)
# downloaded 1.5 MB
# 
# package ‘data.table’ successfully unpacked and MD5 sums checked

library(data.table)
# data.table 1.10.5 IN DEVELOPMENT built 2017-11-18 04:58:23 UTC; appveyor
# The fastest way to learn (by data.table authors): https://www.datacamp.com/courses/data-analysis-the-data-table-way
# Documentation: ?data.table, example(data.table) and browseVignettes("data.table")
# Release notes, videos and slides: http://r-datatable.com

fread("SA2-by-DJZ-2011.csv", verbose = TRUE, na.strings = "", header = FALSE)
Log file:  C:\Users\hughp\AppData\Local\Temp\RtmpIT9H0D/fread.out 
Input contains no \n. Taking this to be a filename to open
Read 11%. ETA 00:00 Warning: stack imbalance in '$', 19 then 20
Read 28%. ETA 00:00 Warning: stack imbalance in '$', 19 then 20
Read 48%. ETA 00:00 Warning: stack imbalance in '$', 20 then 19
Read 98%. ETA 00:00 [01] Check arguments
  Using 12 threads (omp_get_max_threads()=12, nth=12)
  NAstrings = [<<>>]
  None of the NAstrings look like numbers.
  show progress = 1
  0/1 column will be read as boolean
[02] Opening the file
  Opening file SA2-by-DJZ-2011.csv
  File opened, size = 349.4MB (366418725 bytes).
  Memory mapped ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
  \n has been found in the data so any mixture of line endings is allowed other than \r-only line endings. This is common and ideal.
[05] Skipping initial rows if needed
  Positioned on line 1 starting: <<Australian Bureau of Statistic>>
[06] Detect separator, quoting rule, and ncolumns
  Detecting sep ...
  sep=','  with 89 lines of 4 fields using quote rule 0
  Detected 4 columns on line 12. This line is either column names or first data row. Line starts as: <<"Goulburn","110018063",3499,>>
  Quote rule picked = 0
  fill=false and the most number of columns found is 4
[07] Detect column types, good nrow estimate and whether first row is column names
  'header' changed by user from 'auto' to false
  Number of sampling jump points = 101 because (366418375 bytes from row 1 to eof) / (2 * 1457 jump0size) == 125744
  Type codes (jump 000)    : 1551  Quote rule 0
  Type codes (jump 100)    : 1A51  Quote rule 0
  =====
  Sampled 10027 rows (handled \n inside quoted fields) at 101 jump points
  Bytes from first data row on line 12 to the end of last row: 366418143
  Line length: mean=16.02 sd=0.21 min=16 max=29
  Estimated number of rows: 366418143 / 16.02 = 22877178
  Initial alloc = 25164895 rows (22877178 + 9%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
  =====
[08] Assign column names
[09] Apply user overrides on column types
  After 0 type and 0 drop user overrides : 1A51
[10] Allocate memory for the datatable
  Allocating 4 column slots (4 - 0 dropped) with 25164895 rows
[11] Read the data
  jumps=[0..360), chunk_size=1017828, total_size=366418143
[11] Read the data
  jumps=[0..360), chunk_size=1017828, total_size=366418143
Read 22885380 rows x 4 columns from 349.4MB (366418725 bytes) file in 00:02.822 wall clock time
[12] Finalizing the datatable
  Type counts:
         1 : bool8     '1'
         1 : int32     '5'
         2 : string    'A'
=============================
   0.000s (  0%) Memory map 0.341GB file
   0.001s (  0%) sep=',' ncol=4 and header detection
   0.000s (  0%) Column type detection using 10027 sample rows
   0.291s ( 10%) Allocation of 25164895 rows x 4 cols (0.469GB) of which 22885380 ( 91%) rows used
   2.531s ( 90%) Reading 360 chunks of 0.971MB (63547 rows) using 12 threads
   =    0.002s (  0%) Finding first non-embedded \n after each jump
   +    0.282s ( 10%) Parse to row-major thread buffers (grown 0 times)
   +    1.537s ( 54%) Transpose
   +    0.710s ( 25%) Waiting
   0.842s ( 30%) Rereading 1 columns due to out-of-sample type exceptions
   2.822s        Total
Column 1 ("") bumped from 'bool8' to 'string' due to <<"Goulburn">> on row 0
                V1        V2      V3 V4
       1: Goulburn 110018063    3499 NA
       2:       NA 110018064     812 NA
       3:       NA 110018065    2158 NA
       4:       NA 110019999     402 NA
       5:       NA 110028068      10 NA
      ---                              
22885376:       NA 997999799       0 NA
22885377:       NA 998999899      64 NA
22885378:       NA 994999499      34 NA
22885379:       NA 0&&&&&&&&  250796 NA
22885380:       NA 0@@@@@@@@ 7305367 NA
Warning messages:
1: In fread("SA2-by-DJZ-2011.csv", verbose = TRUE, na.strings = "",  :
  Starting data input on line 12 <<"Goulburn","110018063",3499,>> with 4 fields and discarding line 11 <<"Main Statistical Area Structu>> before it because it has a different number of fields (3).
2: In fread("SA2-by-DJZ-2011.csv", verbose = TRUE, na.strings = "",  :
  Found the last consistent line but text exists afterwards. Consider fill=TRUE and/or blank.lines.skip=TRUE. First 200 characters of discarded line: <<"Dataset: 2011 Census of Population and Housing">>

Sekali lagi, tidak dapat direproduksi di luar RStudio.

Terima kasih telah menguji begitu cepat. Nah, itu pasti banyak aturannya! Dua ide tersisa. Yang pertama mendorong dan mengoper. Silakan coba Windows.zip baru di sini . alloca ada di tumpukan dan ada hubungannya dengan na.strings yang Anda atur saat terjadi. Pasti di area yang tepat (stack imbalance) dan pantas untuk dicoba.

Tidak masalah - Saya akan pergi selama 12 jam ke depan atau lebih tidak dapat menguji sampai saat itu.
Pada Sabtu, 18 Nov 2017 pukul 17:20, Matt Dowle [email protected] menulis:

Terima kasih telah menguji begitu cepat. Nah, itu pasti banyak aturannya!
Dua ide tersisa. Yang pertama mendorong dan mengoper. Silakan coba Windows.zip baru
sini
https://ci.appveyor.com/project/Rdatatable/data-table/build/1.0.1363/job/fo02vnbu5ebhwy3w/artifacts .
Alokasi itu dialokasikan pada stack dan ada hubungannya dengan na.strings yang
Anda menyetel saat itu terjadi. Pasti di area yang tepat (tumpukan
ketidakseimbangan) dan pantas untuk dicoba.

-
Anda menerima ini karena Anda disebutkan.

Balas email ini secara langsung, lihat di GitHub
https://github.com/Rdatatable/data.table/issues/2481#issuecomment-345421856 ,
atau nonaktifkan utasnya
https://github.com/notifications/unsubscribe-auth/AHvGDGa5Qnls5eSFBMaQO5s8DElfrpKSks5s3ncqgaJpZM4QcuPc
.

OK tidak masalah. Terima kasih! Saya telah mendorong ide kedua sekarang juga. Sepertinya saya ingat \r menyebabkan masalah pada Windows di masa lalu, tetapi saya tidak mengingat ketidakseimbangan tumpukan. Bagaimanapun, untuk mengesampingkan itu, saya telah menghapus \r dari pengukur kemajuan. Pesan stack imbalance tampaknya dicetak di tempat munculnya garis ETA. Kemungkinan konsol menangkap \r dan memperlakukannya secara berbeda sehingga baris terakhir diganti. Anda sekarang akan melihat baris baru setiap kali ETA diperbarui. Hanya sementara untuk mengesampingkan itu. Windows.zip baru dibangun dan lewat di sini .

fread("SA2-by-DJZ-2011.csv", verbose = TRUE, na.strings = "", header = FALSE)
Log file:  C:\Users\hughp\AppData\Local\Temp\RtmpcVjZ1f/fread.out 
Input contains no \n. Taking this to be a filename to open
Read 5%. ETA 00:00
Read 8%. ETA 00:00
Read 11%. ETA 00:00
Read 15%. ETA 00:00
Read 18%. ETA 00:00
Read 21%. ETA 00:00
Read 25%. ETA 00:00
Read 28%. ETA 00:00
Read 31%. ETA 00:00
Read 35%. ETA 00:00
Read 38%. ETA 00:00
Read 41%. ETA 00:00
Read 45%. ETA 00:00
Read 48%. ETA 00:00
Read 51%. ETA 00:00
Read 55%. ETA 00:00
Warning: stack imbalance in '$', 30 then 31
Warning: stack imbalance in '$', 17 then 16
Read 58%. ETA 00:00
Read 61%. ETA 00:00
Read 65%. ETA 00:00
Read 68%. ETA 00:00
Read 71%. ETA 00:00
Read 75%. ETA 00:00
Read 78%. ETA 00:00
Read 81%. ETA 00:00
Read 85%. ETA 00:00
Read 88%. ETA 00:00
Read 91%. ETA 00:00
Read 95%. ETA 00:00
Read 98%. ETA 00:00
[01] Check arguments
  Using 12 threads (omp_get_max_threads()=12, nth=12)
  NAstrings = [<<>>]
  None of the NAstrings look like numbers.
  show progress = 1
  0/1 column will be read as boolean
[02] Opening the file
  Opening file SA2-by-DJZ-2011.csv
  File opened, size = 349.4MB (366418725 bytes).
  Memory mapped ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
  \n has been found in the data so any mixture of line endings is allowed other than \r-only line endings. This is common and ideal.
[05] Skipping initial rows if needed
  Positioned on line 1 starting: <<Australian Bureau of Statistic>>
[06] Detect separator, quoting rule, and ncolumns
  Detecting sep ...
  sep=','  with 89 lines of 4 fields using quote rule 0
  Detected 4 columns on line 12. This line is either column names or first data row. Line starts as: <<"Goulburn","110018063",3499,>>
  Quote rule picked = 0
  fill=false and the most number of columns found is 4
[07] Detect column types, good nrow estimate and whether first row is column names
  'header' changed by user from 'auto' to false
  Number of sampling jump points = 101 because (366418375 bytes from row 1 to eof) / (2 * 1457 jump0size) == 125744
  Type codes (jump 000)    : 1551  Quote rule 0
  Type codes (jump 100)    : 1A51  Quote rule 0
  =====
  Sampled 10027 rows (handled \n inside quoted fields) at 101 jump points
  Bytes from first data row on line 12 to the end of last row: 366418143
  Line length: mean=16.02 sd=0.21 min=16 max=29
  Estimated number of rows: 366418143 / 16.02 = 22877178
  Initial alloc = 25164895 rows (22877178 + 9%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
  =====
[08] Assign column names
[09] Apply user overrides on column types
  After 0 type and 0 drop user overrides : 1A51
[10] Allocate memory for the datatable
  Allocating 4 column slots (4 - 0 dropped) with 25164895 rows
[11] Read the data
  jumps=[0..360), chunk_size=1017828, total_size=366418143
[11] Read the data
  jumps=[0..360), chunk_size=1017828, total_size=366418143
Read 22885380 rows x 4 columns from 349.4MB (366418725 bytes) file in 00:02.894 wall clock time
[12] Finalizing the datatable
  Type counts:
         1 : bool8     '1'
         1 : int32     '5'
         2 : string    'A'
=============================
   0.001s (  0%) Memory map 0.341GB file
   0.003s (  0%) sep=',' ncol=4 and header detection
   0.000s (  0%) Column type detection using 10027 sample rows
   0.316s ( 11%) Allocation of 25164895 rows x 4 cols (0.469GB) of which 22885380 ( 91%) rows used
   2.574s ( 89%) Reading 360 chunks of 0.971MB (63547 rows) using 12 threads
   =    0.004s (  0%) Finding first non-embedded \n after each jump
   +    0.284s ( 10%) Parse to row-major thread buffers (grown 0 times)
   +    1.450s ( 50%) Transpose
   +    0.837s ( 29%) Waiting
   0.953s ( 33%) Rereading 1 columns due to out-of-sample type exceptions
   2.894s        Total
Column 1 ("") bumped from 'bool8' to 'string' due to <<"Goulburn">> on row 0
                V1        V2      V3 V4
       1: Goulburn 110018063    3499 NA
       2:       NA 110018064     812 NA
       3:       NA 110018065    2158 NA
       4:       NA 110019999     402 NA
       5:       NA 110028068      10 NA
      ---                              
22885376:       NA 997999799       0 NA
22885377:       NA 998999899      64 NA
22885378:       NA 994999499      34 NA
22885379:       NA 0&&&&&&&&  250796 NA
22885380:       NA 0@@@@@@@@ 7305367 NA
Warning messages:
1: In fread("SA2-by-DJZ-2011.csv", verbose = TRUE, na.strings = "",  :
  Starting data input on line 12 <<"Goulburn","110018063",3499,>> with 4 fields and discarding line 11 <<"Main Statistical Area Structu>> before it because it has a different number of fields (3).
2: In fread("SA2-by-DJZ-2011.csv", verbose = TRUE, na.strings = "",  :
  Found the last consistent line but text exists afterwards. Consider fill=TRUE and/or blank.lines.skip=TRUE. First 200 characters of discarded line: <<"Dataset: 2011 Census of Population and Housing">>

FYI: Saya tidak dapat mereproduksi kesalahan ketidakseimbangan tumpukan ini pada mesin Windows lain dengan versi RStudio yang sedikit lebih lama.

Dalam hal ini, sepertinya sudah waktunya untuk meminta dukungan RStudio seperti yang Anda sarankan. Saya telah melihat-lihat kode ketakutan lagi dan saya kehabisan ide. Tolong beri tahu mereka dua nomor versi RStudio. Itu tidak selalu berarti itu RStudio, itu bisa jadi kesalahan pada sisi data.table yang kebetulan muncul di satu versi RStudio. Tetapi aneh bahwa tampaknya terkait dengan keluaran konsol dan itu adalah sesuatu yang berbeda dan khusus RStudio. Saya telah mencari "RStudio stack imabalance" tetapi banyak masalah yang muncul tentang kesalahan paket, bukan RStudio itu sendiri. Masalah yang sulit dicari. Mari kita biarkan masalah tetap terbuka di sini dan lihat apa yang mereka katakan.

Saya ragu upaya terakhir akan membantu, tetapi untuk kelengkapan, silakan mencobanya di sini . Mungkin kompiler MinGW yang digunakan pada Windows melakukan sesuatu yang aneh dengan kedua int tersebut. Salah satunya adalah konstanta 0 yang mungkin dioptimalkan dan kemudian menyebabkan ketidakseimbangan tumpukan.

Namun, pesan ketidakseimbangan tumpukan tersebut berasal dari eval.c: 491 di R itu sendiri. Beberapa utas harus menjalankan baris itu tetapi menurut saya itu bukan fread atau data.table . check_stack_balance() itu hanya dipanggil dari 5 tempat di internal R:
di names.c di akhir do_internal()
di objects.c , dua kali di applyMethod()
di eval.c , dua kali di eval()
Saya tidak melihat bagaimana salah satu dari itu dapat dicapai sementara fread.c ada di bagian paralelnya. Satu-satunya titik masuk yang dipanggil adalah REprintf dan saya tidak melihat bagaimana itu bisa mencapai check_stack_balance() . Yang dapat saya pikirkan saat ini adalah bahwa RStudio memiliki utas yang melakukan sesuatu di latar belakang yang mungkin berinteraksi dengan output konsol, mungkin berbeda pada Windows.
Akhirnya, untuk kelengkapan, tampaknya menggunakan REprintf adalah cara yang tepat untuk digunakan karena basis R menggunakan itu (bukan Rprintf) dalam pengukur kemajuannya di libcurl.c: 354 dan internet.c: 409 . Sayangnya, bilah kemajuan R di tingkat C tidak tersedia di API R (tampaknya juga diterapkan dua kali di R di tingkat C).

@mattdowle , apakah ini berguna? https://github.com/r-lib/progress

@aader Ya - terima kasih! Sumbernya berisi komentar ini :
// In R Studio we should print to stdout, because printing a \r
// to stderr is buggy (reported)
Tapi saya sudah menghapus \r dan ketidakseimbangan tumpukan masih terjadi. Saya ingin tahu di mana itu dilaporkan.

Build terakhir juga tidak berfungsi:

image

Dilaporkan di https://community.rstudio.com/t/stack-imbalance-possibly-in-stderr/3009

Pertanyaan tepat waktu tentang R-devel: [Rd] Apakah Rprintf dan REprintf thread-safe?

Hasilnya "Rprintf dan REprintf tidak aman untuk thread."

Yoiks!

Terima kasih atas tautannya dan Hugh telah mengangkat masalah dengan RStudio.

data.table::fwrite() dan data.table::fread() sadar bahwa Rprintf dan REprintf tidak aman untuk thread jadi untuk pengukur kemajuan mereka, mereka hanya memanggilnya dari utas utama. Tidak hanya dua utas data.table yang pernah memanggil titik masuk R itu pada saat yang sama, tetapi hanya utas master yang pernah memanggilnya, juga, dan itulah satu-satunya titik masuk R yang dipanggil oleh utas mana pun kapan saja selama bagian paralel. Namun, Rprintf memanggil R_CheckUserInterrupt setiap 100 cetakan. Saya pikir itu bagian yang sepertinya tidak aman bahkan dari utas utama saja. Itulah alasannya sekarang menggunakan REprintf karena itu tidak memanggil R_CheckUserInterrupt . R internal menggunakan REprintf untuk pengukur kemajuan, jadi beralih ke REprintf untuk konsistensi dengan inti R masuk akal; yaitu pilihan itu tidak ada hubungannya dengan stderr vs stdout, per se.

@kevinushey Maukah Anda melihat utas ini dan memberi tahu saya hal lain yang bisa saya coba? Mungkinkah RStudio terkait, entah bagaimana, mungkin terkait dengan utas latar belakang? Jika RStudio memiliki utas latar belakang maka bisa jadi Rprintf / REprintf dapat dipanggil dari dua utas pada saat yang bersamaan. Tetapi, jika itu masalahnya, maka kita akan melihat lebih banyak masalah sebelum sekarang. Jadi sepertinya sangat tidak mungkin. Mungkin RStudio menggantikan callback ptr_* disebutkan di bagian 8.1.2 dari R-exts - yang terkait dengan output konsol dan interaksi. Namun bagian itu dimulai dengan "Untuk unix-alikes", jadi saya tidak tahu bagaimana Windows masuk. Mungkin masalah threading bagian

Saya akan keluar sampai awal Desember, jadi sayangnya saya tidak akan punya kesempatan untuk melihatnya sampai saat itu. Namun, RStudio menjalankan hampir semua yang ada di thread utama menggunakan R event loop; satu-satunya pengecualian adalah untuk pengindeksan file tingkat proyek dan thread latar belakang tersebut biasanya tidak menyentuh R API.

RStudio memang mengambil alih berbagai callback ptr_* untuk menangani input dan output konsol; Saya tidak bisa langsung memikirkan bagaimana mereka mungkin menjadi penyebab di sini, tetapi saya akan mencoba untuk melihat lebih dalam ketika saya kembali.

Oke, coba yang ini di sini . Sebelumnya, itu memperbarui status kemajuan setiap 2%. Dalam kasus Anda, file Anda hanya membutuhkan waktu kurang dari 3 detik jadi itu adalah pembaruan kemajuan baru ke konsol RStudio setiap 0,06 detik. Mungkin itu terlalu berlebihan untuk RStudio. Jadi upaya ini mencetak bar. Ini tidak menggunakan \r sama sekali. Ini akan lebih baik untuk laporan dan file log dimana \r dapat mengisi keluaran.

Karena waktu 3 detik Anda cukup cepat, saya telah mengurangi bilah kemajuan untuk mulai dari 1 detik jika ada ETA 1 detik dari sana. Jika tidak, itu tidak akan ditampilkan sama sekali dan berfungsi untuk file Anda hanya karena tidak ditampilkan. Setelah Anda mengujinya, saya akan meningkatkan apa yang dimiliki fwrite ; yaitu mulai dari 2 detik jika ETA berjarak 2 detik dari sana.

Halo, @mattdowle. Komentar terakhir saya di # 2503 mungkin terkait dengan masalah ini juga.

Kelihatan bagus! Tidak ada peringatan (setelah 5 kali berjalan). Jalankan pertama di bawah (perhatikan bahwa spasi utama terlihat berbeda dalam keluaran sebenarnya):

stopifnot(!requireNamespace("data.table", quietly = TRUE))
install.packages("https://ci.appveyor.com/api/buildjobs/1o9s06o31v8i3ljr/artifacts/data.table_1.10.5.zip",
                 repos = NULL)
# Installing package into ‘C:/Users/hughp/Documents/R/win-library/3.4’
# (as ‘lib’ is unspecified)
# trying URL 'https://ci.appveyor.com/api/buildjobs/1o9s06o31v8i3ljr/artifacts/data.table_1.10.5.zip'
# Content type 'application/octet-stream' length 1557423 bytes (1.5 MB)
# downloaded 1.5 MB
# 
# package ‘data.table’ successfully unpacked and MD5 sums checked

library(data.table)
# data.table 1.10.5 IN DEVELOPMENT built 2017-11-30 00:21:00 UTC; appveyor
# The fastest way to learn (by data.table authors): https://www.datacamp.com/courses/data-analysis-the-data-table-way
# Documentation: ?data.table, example(data.table) and browseVignettes("data.table")
# Release notes, videos and slides: http://r-datatable.com
setwd("~/ABS-data/inbox/SA2-by-DJZ-2011/")
fread("SA2-by-DJZ-2011.csv", verbose = TRUE, na.strings = "", header = FALSE)
# Input contains no \n. Taking this to be a filename to open
# [01] Check arguments
# Using 12 threads (omp_get_max_threads()=12, nth=12)
# NAstrings = [<<>>]
# None of the NAstrings look like numbers.
# show progress = 1
# 0/1 column will be read as boolean
# [02] Opening the file
# Opening file SA2-by-DJZ-2011.csv
# File opened, size = 349.4MB (366418725 bytes).
# Memory mapped ok
# [03] Detect and skip BOM
# [04] Arrange mmap to be \0 terminated
# \n has been found in the data so any mixture of line endings is allowed other than \r-only line endings. This is common and ideal.
# [05] Skipping initial rows if needed
# Positioned on line 1 starting: <<Australian Bureau of Statistic>>
#   [06] Detect separator, quoting rule, and ncolumns
# Detecting sep ...
# sep=','  with 89 lines of 4 fields using quote rule 0
# Detected 4 columns on line 12. This line is either column names or first data row. Line starts as: <<"Goulburn","110018063",3499,>>
#   Quote rule picked = 0
# fill=false and the most number of columns found is 4
# [07] Detect column types, good nrow estimate and whether first row is column names
# 'header' changed by user from 'auto' to false
# Number of sampling jump points = 101 because (366418375 bytes from row 1 to eof) / (2 * 1457 jump0size) == 125744
# Type codes (jump 000)    : 1551  Quote rule 0
# Type codes (jump 100)    : 1A51  Quote rule 0
# =====
#   Sampled 10027 rows (handled \n inside quoted fields) at 101 jump points
# Bytes from first data row on line 12 to the end of last row: 366418143
# Line length: mean=16.02 sd=0.21 min=16 max=29
# Estimated number of rows: 366418143 / 16.02 = 22877178
# Initial alloc = 25164895 rows (22877178 + 9%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
# =====
#   [08] Assign column names
# [09] Apply user overrides on column types
# After 0 type and 0 drop user overrides : 1A51
# [10] Allocate memory for the datatable
# Allocating 4 column slots (4 - 0 dropped) with 25164895 rows
# [11] Read the data
# jumps=[0..360), chunk_size=1017828, total_size=366418143
# jumps=[0..360), chunk_size=1017828, total_size=366418143
# |--------------------------------------------------|
#   |==================================================|
#   Read 22885380 rows x 4 columns from 349.4MB (366418725 bytes) file in 00:02.280 wall clock time
# [12] Finalizing the datatable
# Type counts:
#   1 : bool8     '1'
# 1 : int32     '5'
# 2 : string    'A'
# =============================
#   0.005s (  0%) Memory map 0.341GB file
# 0.037s (  2%) sep=',' ncol=4 and header detection
# 0.000s (  0%) Column type detection using 10027 sample rows
# 0.321s ( 14%) Allocation of 25164895 rows x 4 cols (0.469GB) of which 22885380 ( 91%) rows used
# 1.917s ( 84%) Reading 360 chunks of 0.971MB (63547 rows) using 12 threads
# =    0.011s (  0%) Finding first non-embedded \n after each jump
# +    0.560s ( 25%) Parse to row-major thread buffers (grown 0 times)
# +    0.488s ( 21%) Transpose
# +    0.858s ( 38%) Waiting
# 0.999s ( 44%) Rereading 1 columns due to out-of-sample type exceptions
# 2.280s        Total
# Column 1 ("") bumped from 'bool8' to 'string' due to <<"Goulburn">> on row 0
# V1        V2      V3 V4
# 1: Goulburn 110018063    3499 NA
# 2:       NA 110018064     812 NA
# 3:       NA 110018065    2158 NA
# 4:       NA 110019999     402 NA
# 5:       NA 110028068      10 NA
# ---                              
#   22885376:       NA 997999799       0 NA
# 22885377:       NA 998999899      64 NA
# 22885378:       NA 994999499      34 NA
# 22885379:       NA 0&&&&&&&&  250796 NA
# 22885380:       NA 0@@@@@@@@ 7305367 NA
# Warning messages:
#   1: In fread("SA2-by-DJZ-2011.csv", verbose = TRUE, na.strings = "",  :
#                 Starting data input on line 12 <<"Goulburn","110018063",3499,>> with 4 fields and discarding line 11 <<"Main Statistical Area Structu>> before it because it has a different number of fields (3).
#               2: In fread("SA2-by-DJZ-2011.csv", verbose = TRUE, na.strings = "",  :
#               Found the last consistent line but text exists afterwards. Consider fill=TRUE and/or blank.lines.skip=TRUE. First 200 characters of discarded line: <<"Dataset: 2011 Census of Population and Housing">>

@HughParsonage Relief! Saya pikir itu kemenangan. Saya akan membereskan, menggabungkan, dan melanjutkan. Terima kasih banyak untuk pengujian Anda.

@aadler Ya setuju komentar Anda di edisi # 2503 terlihat sama. Bisakah Anda menguji terbaru dari dev juga dan mengonfirmasi bahwa sekarang sudah diperbaiki? Kami berharap masalah dengan as.IDate Anda temukan memang disebabkan oleh ketidakseimbangan tumpukan sebelumnya.

Tidak baik :(

> library(data.table)
data.table 1.10.5 IN DEVELOPMENT built 2017-11-30 00:21:00 UTC; appveyor
  The fastest way to learn (by data.table authors): https://www.datacamp.com/courses/data-analysis-the-data-table-way
  Documentation: ?data.table, example(data.table) and browseVignettes("data.table")
  Release notes, videos and slides: http://r-datatable.com
> sessionInfo()
R version 3.4.2 beta (2017-09-17 r73296)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows Server >= 2012 x64 (build 9200)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252    LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                           LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] data.table_1.10.5

loaded via a namespace (and not attached):
[1] compiler_3.4.2 tools_3.4.2   
> DT <- fread('2017-11-22_1999_Performance.csv', header = TRUE, colClasses = CLS, select = SEL, key = c('LoanID', 'Month'), verbose = TRUE)
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
  Using 40 threads (omp_get_max_threads()=40, nth=40)
  NAstrings = [<<NA>>]
  None of the NAstrings look like numbers.
  show progress = 1
  0/1 column will be read as boolean
[02] Opening the file
  Opening file 2017-11-22_1999_Performance.csv
  File opened, size = 6.355GB (6823372783 bytes).
  Memory mapped ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
  \n has been found in the data so any mixture of line endings is allowed other than \r-only line endings. This is common and ideal.
[05] Skipping initial rows if needed
  Positioned on line 1 starting: <<Orig_Year,Orig_Qtr,LoanID,Mont>>
[06] Detect separator, quoting rule, and ncolumns
  Detecting sep ...
  sep=','  with 100 lines of 37 fields using quote rule 0
  Detected 37 columns on line 1. This line is either column names or first data row. Line starts as: <<Orig_Year,Orig_Qtr,LoanID,Mont>>
  Quote rule picked = 0
  fill=false and the most number of columns found is 37
[07] Detect column types, good nrow estimate and whether first row is column names
  'header' changed by user from 'auto' to true
  Number of sampling jump points = 101 because (6823372781 bytes from row 1 to eof) / (2 * 12905 jump0size) == 264369
  Type codes (jump 000)    : 51AA7155A15A7111111111111771111177715  Quote rule 0
  Type codes (jump 001)    : 51AA7155A15A711111111111177111117771A  Quote rule 0
  Type codes (jump 008)    : 51AA7555A15A711111111111177111117771A  Quote rule 0
  Type codes (jump 009)    : 51AA7555AA5A71A155555557177111117775A  Quote rule 0
  Type codes (jump 042)    : 55AA7555AA5A71A155555557177111117775A  Quote rule 0
  Type codes (jump 064)    : 55AA7555AA5A71A1A5555557177111117775A  Quote rule 0
  Type codes (jump 100)    : 55AA7555AA5A71A1A5555557177111117775A  Quote rule 0
  =====
  Sampled 10049 rows (handled \n inside quoted fields) at 101 jump points
  Bytes from first data row on line 1 to the end of last row: 6823372781
  Line length: mean=126.15 sd=8.30 min=100 max=359
  Estimated number of rows: 6823372781 / 126.15 = 54088821
  Initial alloc = 62279495 rows (54088821 + 15%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
  =====
[08] Assign column names
[09] Apply user overrides on column types
  After 4 type and 23 drop user overrides : 00AA700000000000000000070775555077750
[10] Allocate memory for the datatable
  Allocating 14 column slots (37 - 23 dropped) with 62279495 rows
[11] Read the data
  jumps=[0..6520), chunk_size=1046529, total_size=6823372422
|--------------------------------------------------|
|=======Warning: stack imbalance in '$', 27 then 26
===Warning: stack imbalance in '$', 26 then 27
================Error in fread("2017-11-22_1999_Performance.csv", header = TRUE, colClasses = CLS,  : 
  unprotect_ptr: pointer not found

@aadler Terima kasih atas laporannya. Saya memeriksa freadR dan melokalkan perlindungan. Ada kemungkinan 30% yang mungkin berfungsi karena dalam kasus Anda, Anda mengganti jenis dan ada beberapa perlindungan di bagian kode itu. Silakan coba lagi menggunakan build ini .

@aadler Jika Anda belum mencoba versi terakhir, silakan langsung mencoba yang ini . Selain itu, jika memungkinkan untuk mendapatkan salinan file Anda kepada saya, saya mungkin dapat mencoba sendiri di Windows RStudio.

:(

> library(data.table)
data.table 1.10.5 IN DEVELOPMENT built 2017-12-01 01:54:04 UTC; appveyor
  The fastest way to learn (by data.table authors): https://www.datacamp.com/courses/data-analysis-the-data-table-way
  Documentation: ?data.table, example(data.table) and browseVignettes("data.table")
  Release notes, videos and slides: http://r-datatable.com
> sessionInfo()
R version 3.4.2 beta (2017-09-17 r73296)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows Server >= 2012 x64 (build 9200)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252    LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                           LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] data.table_1.10.5

loaded via a namespace (and not attached):
[1] compiler_3.4.2 tools_3.4.2   
> ColCLASS <- c(rep('integer', 2L), 'character', 'Date', 'numeric',
+               rep('integer', 3L), rep('character', 2L),
+               'integer', 'Date', rep('numeric', 2L), 'Date',
+               rep('numeric', 12L), rep('integer', 5),
+               rep('numeric', 3L), 'integer', 'character')
> SELCOL <- c(WHATEVER)
> DT <- fread('LargeFile.csv', header = TRUE, colClasses = ColCLASS, select = SELCOL, key = c('LoanID', 'Month'), verbose = TRUE)
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
  Using 40 threads (omp_get_max_threads()=40, nth=40)
  NAstrings = [<<NA>>]
  None of the NAstrings look like numbers.
  show progress = 1
  0/1 column will be read as boolean
[02] Opening the file
  Opening file LargeFile.csv
  File opened, size = 6.355GB (6823372783 bytes).
  Memory mapped ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
  \n has been found in the data so any mixture of line endings is allowed other than \r-only line endings. This is common and ideal.
[05] Skipping initial rows if needed
  Positioned on line 1 starting: <<Orig_Year,Orig_Qtr,LoanID,Mont>>
[06] Detect separator, quoting rule, and ncolumns
  Detecting sep ...
  sep=','  with 100 lines of 37 fields using quote rule 0
  Detected 37 columns on line 1. This line is either column names or first data row. Line starts as: <<Orig_Year,Orig_Qtr,LoanID,Mont>>
  Quote rule picked = 0
  fill=false and the most number of columns found is 37
[07] Detect column types, good nrow estimate and whether first row is column names
  'header' changed by user from 'auto' to true
  Number of sampling jump points = 101 because (6823372781 bytes from row 1 to eof) / (2 * 12905 jump0size) == 264369
  Type codes (jump 000)    : 51AA7155A15A7111111111111771111177715  Quote rule 0
  Type codes (jump 001)    : 51AA7155A15A711111111111177111117771A  Quote rule 0
  Type codes (jump 008)    : 51AA7555A15A711111111111177111117771A  Quote rule 0
  Type codes (jump 009)    : 51AA7555AA5A71A155555557177111117775A  Quote rule 0
  Type codes (jump 042)    : 55AA7555AA5A71A155555557177111117775A  Quote rule 0
  Type codes (jump 064)    : 55AA7555AA5A71A1A5555557177111117775A  Quote rule 0
  Type codes (jump 100)    : 55AA7555AA5A71A1A5555557177111117775A  Quote rule 0
  =====
  Sampled 10049 rows (handled \n inside quoted fields) at 101 jump points
  Bytes from first data row on line 1 to the end of last row: 6823372781
  Line length: mean=126.15 sd=8.30 min=100 max=359
  Estimated number of rows: 6823372781 / 126.15 = 54088821
  Initial alloc = 62279495 rows (54088821 + 15%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
  =====
[08] Assign column names
[09] Apply user overrides on column types
  After 4 type and 23 drop user overrides : 00AA700000000000000000070775555077750
[10] Allocate memory for the datatable
  Allocating 14 column slots (37 - 23 dropped) with 62279495 rows
[11] Read the data
  jumps=[0..6520), chunk_size=1046529, total_size=6823372422
|--------------------------------------------------|
|Error in fread("LargeFile.csv", header = TRUE, colClasses = ColCLASS,  : 
  unprotect_ptr: pointer not found

Terima kasih kepada @aadler di email, sekarang saya dapat mereproduksi. R 3.4.2, RStudio 1.1.383 terbaru dan Windows 10 Pro 10.0.16299 Build 16299.

Saya melihat perilaku aneh di RStudio, direkam di sini:
https://www.youtube.com/watch?v=tl2x2vmZxMU
Tampaknya RStudio menghasilkan GC hanya dengan mengetik. Mengapa begitu dan apakah ada cara untuk mematikannya? Mungkin ketika fread() sedang mencetak bilah kemajuannya, loop peristiwa terpisah RStudio berpikir bahwa output ke konsol adalah pengguna mengetik dan memanggil R yang memunculkan GC dan menaikkan semuanya? Mungkin pengguna RStudio di sini tahu, bisa mengarahkan saya ke arah yang benar, atau mungkin @kevinushey sudah kembali (Anda memang mengatakan awal Desember Kevin, dan ini tanggal 1 hari ini: -))

Saya dapat mereproduksi ketidakseimbangan tumpukan dengan andal di Konsol RStudio. Menggunakan tab RStudio Terminal, saya tidak dapat mereproduksinya sama sekali, bahkan dengan gcinfo(TRUE) . Menariknya, GC benar-benar terjadi saat bilah kemajuan sedang dicetak dan itu tampak baik-baik saja, karena itu juga baik-baik saja di Linux. Mengingat perilaku dalam video Konsol RStudio, saya mencapai kesimpulan bahwa ini adalah bug Konsol RStudio. Saya tidak dapat menyalin teks dari jendela RStudio Terminal (Edit-> Salin tidak berfungsi dan begitu pula Ctrl-C) jadi saya mengambil tangkapan layar dari tab Terminal untuk menunjukkan bahwa GC selama bilah kemajuan baik-baik saja. Saya berharap ini akan baik-baik saja karena hanya utas utama yang memanggil REprintf dan utas lainnya tidak memanggil R API sama sekali.

Bekerja dengan baik di Terminal RStudio:
selection_014
Perhatikan bahwa ada GC saat bilah kemajuan mencetak pertama kali dan itu berfungsi dengan baik di Terminal RStudio. Bilah kemajuan mencetak untuk kedua kalinya karena ada pengecualian jenis sampel di luar dalam file pengujian ini yang memicu pembacaan ulang otomatis untuk kolom tersebut saja.

Tetapi di RStudio Console ada stack imbalance atau unprotect_ptr: pointer not found :

R version 3.4.2 (2017-09-28) -- "Short Summer"
> gcinfo(TRUE)
[1] FALSE
Garbage collection 22 = 16+3+3 (level 0) ... 
25.5 Mbytes of cons cells used (80%)
6.6 Mbytes of vectors used (66%)
Garbage collection 23 = 16+4+3 (level 1) ... 
24.9 Mbytes of cons cells used (79%)
6.5 Mbytes of vectors used (64%)
Garbage collection 24 = 17+4+3 (level 0) ... 
25.1 Mbytes of cons cells used (79%)
6.5 Mbytes of vectors used (64%)
Garbage collection 25 = 18+4+3 (level 0) ... 
25.1 Mbytes of cons cells used (79%)
6.5 Mbytes of vectors used (64%)
Garbage collection 26 = 19+4+3 (level 0) ... 
25.1 Mbytes of cons cells used (79%)
6.5 Mbytes of vectors used (65%)
Garbage collection 27 = 20+4+3 (level 0) ... 
25.2 Mbytes of cons cells used (80%)
6.5 Mbytes of vectors used (65%)
Garbage collection 28 = 20+5+3 (level 1) ... 
25.1 Mbytes of cons cells used (79%)
6.5 Mbytes of vectors used (65%)
Garbage collection 29 = 21+5+3 (level 0) ... 
25.1 Mbytes of cons cells used (79%)
6.5 Mbytes of vectors used (65%)
Garbage collection 30 = 22+5+3 (level 0) ... 
25.2 Mbytes of cons cells used (80%)
6.5 Mbytes of vectors used (65%)
Garbage collection 31 = 23+5+3 (level 0) ... 
25.2 Mbytes of cons cells used (80%)
6.6 Mbytes of vectors used (66%)
Garbage collection 32 = 24+5+3 (level 0) ... 
25.3 Mbytes of cons cells used (80%)
6.6 Mbytes of vectors used (66%)
Garbage collection 33 = 25+5+3 (level 0) ... 
25.4 Mbytes of cons cells used (80%)
6.7 Mbytes of vectors used (66%)
Garbage collection 34 = 25+5+4 (level 2) ... 
24.6 Mbytes of cons cells used (61%)
6.4 Mbytes of vectors used (50%)
Garbage collection 35 = 26+5+4 (level 0) ... 
25.0 Mbytes of cons cells used (62%)
6.5 Mbytes of vectors used (52%)
> require(data.table)
Loading required package: data.table
Garbage collection 36 = 27+5+4 (level 0) ... 
27.2 Mbytes of cons cells used (68%)
7.1 Mbytes of vectors used (56%)
data.table 1.10.5 IN DEVELOPMENT built 2017-12-02 01:04:34 UTC; appveyor
  The fastest way to learn (by data.table authors): https://www.datacamp.com/courses/data-analysis-the-data-table-way
  Documentation: ?data.table, example(data.table) and browseVignettes("data.table")
  Release notes, videos and slides: http://r-datatable.com
Garbage collection 37 = 28+5+4 (level 0) ... 
27.7 Mbytes of cons cells used (69%)
7.3 Mbytes of vectors used (58%)
Garbage collection 38 = 29+5+4 (level 0) ... 
28.0 Mbytes of cons cells used (70%)
7.4 Mbytes of vectors used (58%)
Garbage collection 39 = 30+5+4 (level 0) ... 
28.1 Mbytes of cons cells used (70%)
7.4 Mbytes of vectors used (59%)
Garbage collection 40 = 31+5+4 (level 0) ... 
28.2 Mbytes of cons cells used (70%)
7.5 Mbytes of vectors used (59%)
Garbage collection 41 = 32+5+4 (level 0) ... 
28.4 Mbytes of cons cells used (71%)
7.5 Mbytes of vectors used (59%)
> DT = fread("/Users/pasha/Downloads/LargeFile.csv")
Garbage collection 42 = 32+5+5 (level 2) ... 
27.4 Mbytes of cons cells used (54%)
7.1 Mbytes of vectors used (2%)
Garbage collection 43 = 32+5+6 (level 2) ... 
27.4 Mbytes of cons cells used (54%)
244.7 Mbytes of vectors used (42%)
Garbage collection 44 = 32+5+7 (level 2) ... 
27.4 Mbytes of cons cells used (54%)
482.3 Mbytes of vectors used (42%)
Garbage collection 45 = 32+5+8 (level 2) ... 
27.4 Mbytes of cons cells used (54%)
957.4 Mbytes of vectors used (56%)
Garbage collection 46 = 32+5+9 (level 2) ... 
27.4 Mbytes of cons cells used (54%)
1432.6 Mbytes of vectors used (63%)
Garbage collection 47 = 32+5+10 (level 2) ... 
27.4 Mbytes of cons cells used (54%)
2145.3 Mbytes of vectors used (75%)
Garbage collection 48 = 32+5+11 (level 2) ... 
27.4 Mbytes of cons cells used (54%)
2620.4 Mbytes of vectors used (71%)
Garbage collection 49 = 32+5+12 (level 2) ... 
27.4 Mbytes of cons cells used (54%)
3570.8 Mbytes of vectors used (78%)
Garbage collection 50 = 32+5+13 (level 2) ... 
27.4 Mbytes of cons cells used (54%)
4283.5 Mbytes of vectors used (75%)
Garbage collection 51 = 32+5+14 (level 2) ... 
27.4 Mbytes of cons cells used (54%)
5709.0 Mbytes of vectors used (77%)
Garbage collection 52 = 32+5+15 (level 2) ... 
27.4 Mbytes of cons cells used (54%)
7372.0 Mbytes of vectors used (81%)
Garbage collection 53 = 32+5+16 (level 2) ... 
27.4 Mbytes of cons cells used (54%)
8797.5 Mbytes of vectors used (79%)
Garbage collection 54 = 32+5+17 (level 2) ... 
27.4 Mbytes of cons cells used (54%)
10935.7 Mbytes of vectors used (80%)
|--------------------------------------------------|
|=====Error in fread("LargeFile.csv") : 
  unprotect_ptr: pointer not found
> 

showProgress=FALSE menyelesaikannya dengan andal di Konsol RStudio. Untuk mereproduksi, ini harus dijalankan pertama kali di Konsol RStudio baru dengan showProgress=TRUE (yaitu default). Sepertinya terkait dengan apakah ada GC selama pengukur kemajuan; ada sesi pertama di sesi baru. Ini hanya perlu file besar agar pengukur kemajuan ditampilkan. Tidak ada hubungannya dengan membaca ulang atau argumen diteruskan ke fread . Jika proses pertama di Konsol RStudio baru adalah dengan showProgress=FALSE untuk membuatnya berfungsi, proses tersebut memperluas heap R kemudian menjalankan berikutnya di sesi yang sama dengan showProgress=TRUE juga kemudian berfungsi. Tetapi hanya karena tidak ada GC selama pengukur kemajuan karena proses pertama sudah memperluas heap.
Mengapa GC pada utas master selama pengukur kemajuan baik-baik saja di Linux dan di Terminal Windows RStudio tetapi tidak di Konsol RStudio adalah pertanyaan yang belum terjawab.

Oke, ini memperbaikinya. Masalah ada di sisi data.table dan bukan RStudio. Bekerja untuk saya dengan andal di RStudio Console di Windows. Itu adalah masalah yang dapat terjadi di Linux dan Max juga, hanya saja pola memori tidak memicunya. Utas lain memang memiliki titik masuk ke R (saat mendorong buffer mereka dengan kolom string) yang dapat terjadi pada saat yang sama dengan kemajuan pencetakan utas utama menggunakan REprintf . Itulah mengapa itu hanya terjadi pada putaran pertama dalam sesi baru. Pada proses ke-2 dan seterusnya, semua string dalam file telah dilihat sebelumnya sehingga pencarian cache mengenai (aman untuk thread) dan tidak mengalokasikan (bukan thread-aman).

Jadi, @aadler dan @HughParsonage , silakan coba yang ini . 95% kemungkinan ini berhasil sekarang!

Tanpa peringatan, tidak yakin jika Anda mencari yang lain:

> gcinfo(TRUE)
[1] FALSE
> fread("SA2-by-DJZ-2011.csv", verbose = TRUE, na.strings = "", header = FALSE)
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
  Using 12 threads (omp_get_max_threads()=12, nth=12)
  NAstrings = [<<>>]
  None of the NAstrings look like numbers.
  show progress = 1
  0/1 column will be read as boolean
[02] Opening the file
  Opening file SA2-by-DJZ-2011.csv
  File opened, size = 349.4MB (366418725 bytes).
  Memory mapped ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
  \n has been found in the data so any mixture of line endings is allowed other than \r-only line endings. This is common and ideal.
[05] Skipping initial rows if needed
  Positioned on line 1 starting: <<Australian Bureau of Statistic>>
[06] Detect separator, quoting rule, and ncolumns
  Detecting sep ...
  sep=','  with 89 lines of 4 fields using quote rule 0
  Detected 4 columns on line 12. This line is either column names or first data row. Line starts as: <<"Goulburn","110018063",3499,>>
  Quote rule picked = 0
  fill=false and the most number of columns found is 4
[07] Detect column types, good nrow estimate and whether first row is column names
  'header' changed by user from 'auto' to false
  Number of sampling jump points = 101 because (366418375 bytes from row 1 to eof) / (2 * 1457 jump0size) == 125744
  Type codes (jump 000)    : 1551  Quote rule 0
  Type codes (jump 100)    : 1A51  Quote rule 0
  =====
  Sampled 10027 rows (handled \n inside quoted fields) at 101 jump points
  Bytes from first data row on line 12 to the end of last row: 366418143
  Line length: mean=16.02 sd=0.21 min=16 max=29
  Estimated number of rows: 366418143 / 16.02 = 22877178
  Initial alloc = 25164895 rows (22877178 + 9%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
  =====
[08] Assign column names
[09] Apply user overrides on column types
  After 0 type and 0 drop user overrides : 1A51
[10] Allocate memory for the datatable
  Allocating 4 column slots (4 - 0 dropped) with 25164895 rows
Garbage collection 53 = 36+5+12 (level 2) ... 
30.3 Mbytes of cons cells used (60%)
7.9 Mbytes of vectors used (1%)
[11] Read the data
  jumps=[0..360), chunk_size=1017828, total_size=366418143
|--------------------------------------------------|
|==================================================|
Garbage collection 54 = 37+5+12 (level 0) ... 
30.8 Mbytes of cons cells used (61%)
566.6 Mbytes of vectors used (74%)
Garbage collection 55 = 37+6+12 (level 1) ... 
30.8 Mbytes of cons cells used (61%)
549.2 Mbytes of vectors used (72%)
  jumps=[0..360), chunk_size=1017828, total_size=366418143
|--------------------------------------------------|
|==================================================|
Read 22885380 rows x 4 columns from 349.4MB (366418725 bytes) file in 00:02.626 wall clock time
[12] Finalizing the datatable
  Type counts:
         1 : bool8     '1'
         1 : int32     '5'
         2 : string    'A'
=============================
   0.002s (  0%) Memory map 0.341GB file
   0.005s (  0%) sep=',' ncol=4 and header detection
   0.000s (  0%) Column type detection using 10027 sample rows
   0.469s ( 18%) Allocation of 25164895 rows x 4 cols (0.469GB) of which 22885380 ( 91%) rows used
   2.150s ( 82%) Reading 360 chunks of 0.971MB (63547 rows) using 12 threads
   =    0.103s (  4%) Finding first non-embedded \n after each jump
   +    0.230s (  9%) Parse to row-major thread buffers (grown 0 times)
   +    0.718s ( 27%) Transpose
   +    1.099s ( 42%) Waiting
   0.745s ( 28%) Rereading 1 columns due to out-of-sample type exceptions
   2.626s        Total
Column 1 ("") bumped from 'bool8' to 'string' due to <<"Goulburn">> on row 0
Garbage collection 56 = 37+6+13 (level 2) ... 
31.1 Mbytes of cons cells used (62%)
531.9 Mbytes of vectors used (70%)
Garbage collection 57 = 38+6+13 (level 0) ... 
31.1 Mbytes of cons cells used (62%)
532.0 Mbytes of vectors used (70%)
                V1        V2      V3 V4
       1: Goulburn 110018063    3499 NA
       2:       NA 110018064     812 NA
       3:       NA 110018065    2158 NA
       4:       NA 110019999     402 NA
       5:       NA 110028068      10 NA
      ---                              
22885376:       NA 997999799       0 NA
22885377:       NA 998999899      64 NA
22885378:       NA 994999499      34 NA
22885379:       NA 0&&&&&&&&  250796 NA
22885380:       NA 0@@@@@@@@ 7305367 NA
Warning messages:
1: In fread("SA2-by-DJZ-2011.csv", verbose = TRUE, na.strings = "",  :
  Starting data input on line 12 <<"Goulburn","110018063",3499,>> with 4 fields and discarding line 11 <<"Main Statistical Area Structu>> before it because it has a different number of fields (3).
2: In fread("SA2-by-DJZ-2011.csv", verbose = TRUE, na.strings = "",  :
  Found the last consistent line but text exists afterwards. Consider fill=TRUE and/or blank.lines.skip=TRUE. First 200 characters of discarded line: <<"Dataset: 2011 Census of Population and Housing">>

Terima kasih Hugh. Ya, itu berjalan bersih, dengan asumsi itu dalam sesi konsol RStudio baru. Tidak ada tanda baik pesan ketidakseimbangan tumpukan atau "protect_ptr: pointer tidak ditemukan", dan pengukur kemajuan berjalan dengan benar (dua kali dalam kasus ini karena ada pembacaan ulang). Sekarang tinggal @aadler untuk mengonfirmasi.

KEBERHASILAN.

Pertama dijalankan, instance baru RStudio.

> library(data.table)
data.table 1.10.5 IN DEVELOPMENT built 2017-12-02 12:05:42 UTC; appveyor
  The fastest way to learn (by data.table authors): https://www.datacamp.com/courses/data-analysis-the-data-table-way
  Documentation: ?data.table, example(data.table) and browseVignettes("data.table")
  Release notes, videos and slides: http://r-datatable.com
> sessionInfo()
R version 3.4.2 beta (2017-09-17 r73296)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows Server >= 2012 x64 (build 9200)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252    LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                           LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] data.table_1.10.5

loaded via a namespace (and not attached):
[1] compiler_3.4.2 tools_3.4.2   
> DT <- fread('LargeFile.csv', colClasses = colCLASS, select = colSEL, header = TRUE, key = c('LoanID', 'Month'), verbose = TRUE)
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
  Using 40 threads (omp_get_max_threads()=40, nth=40)
  NAstrings = [<<NA>>]
  None of the NAstrings look like numbers.
  show progress = 1
  0/1 column will be read as boolean
[02] Opening the file
  Opening file LargeFile.csv
  File opened, size = 6.355GB (6823372783 bytes).
  Memory mapped ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
  \n has been found in the data so any mixture of line endings is allowed other than \r-only line endings. This is common and ideal.
[05] Skipping initial rows if needed
  Positioned on line 1 starting: <<Orig_Year,Orig_Qtr,LoanID,Mont>>
[06] Detect separator, quoting rule, and ncolumns
  Detecting sep ...
  sep=','  with 100 lines of 37 fields using quote rule 0
  Detected 37 columns on line 1. This line is either column names or first data row. Line starts as: <<Orig_Year,Orig_Qtr,LoanID,Mont>>
  Quote rule picked = 0
  fill=false and the most number of columns found is 37
[07] Detect column types, good nrow estimate and whether first row is column names
  'header' changed by user from 'auto' to true
  Number of sampling jump points = 101 because (6823372781 bytes from row 1 to eof) / (2 * 12905 jump0size) == 264369
  Type codes (jump 000)    : 51AA7155A15A7111111111111771111177715  Quote rule 0
  Type codes (jump 001)    : 51AA7155A15A711111111111177111117771A  Quote rule 0
  Type codes (jump 008)    : 51AA7555A15A711111111111177111117771A  Quote rule 0
  Type codes (jump 009)    : 51AA7555AA5A71A155555557177111117775A  Quote rule 0
  Type codes (jump 042)    : 55AA7555AA5A71A155555557177111117775A  Quote rule 0
  Type codes (jump 064)    : 55AA7555AA5A71A1A5555557177111117775A  Quote rule 0
  Type codes (jump 100)    : 55AA7555AA5A71A1A5555557177111117775A  Quote rule 0
  =====
  Sampled 10049 rows (handled \n inside quoted fields) at 101 jump points
  Bytes from first data row on line 1 to the end of last row: 6823372781
  Line length: mean=126.15 sd=8.30 min=100 max=359
  Estimated number of rows: 6823372781 / 126.15 = 54088821
  Initial alloc = 62279495 rows (54088821 + 15%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
  =====
[08] Assign column names
[09] Apply user overrides on column types
  After 4 type and 23 drop user overrides : 00AA700000000000000000070775555077750
[10] Allocate memory for the datatable
  Allocating 14 column slots (37 - 23 dropped) with 62279495 rows
[11] Read the data
  jumps=[0..6520), chunk_size=1046529, total_size=6823372422
|--------------------------------------------------|
|==================================================|
Read 53945186 rows x 14 columns from 6.355GB (6823372783 bytes) file in 00:25.938 wall clock time
[12] Finalizing the datatable
  Type counts:
        23 : drop      '0'
         5 : int32     '5'
         7 : float64   '7'
         2 : string    'A'
=============================
   0.005s (  0%) Memory map 6.355GB file
   0.025s (  0%) sep=',' ncol=37 and header detection
   0.001s (  0%) Column type detection using 10049 sample rows
   4.681s ( 18%) Allocation of 62279495 rows x 37 cols (5.336GB) of which 53945186 ( 87%) rows used
  21.226s ( 82%) Reading 6520 chunks of 0.998MB (8295 rows) using 40 threads
   =    0.485s (  2%) Finding first non-embedded \n after each jump
   +    1.465s (  6%) Parse to row-major thread buffers (grown 0 times)
   +    9.095s ( 35%) Transpose
   +   10.181s ( 39%) Waiting
   0.000s (  0%) Rereading 0 columns due to out-of-sample type exceptions
  25.938s        Total

Menutup RStudio dan membukanya kembali untuk mencegah string caching aktif dan menjalankannya lagi dengan gcinfo(TRUE) . Bonus tambahan, konversi ke IDate selesai (memakan waktu lebih dari 40 detik, meskipun :)).

> colCLASS <- c(rep('integer', 2L), 'character', 'Date', 'numeric',
+               rep('integer', 3L), rep('character', 2L),
+               'integer', 'Date', rep('numeric', 2L), 'Date',
+               rep('numeric', 12L), rep('integer', 5),
+               rep('numeric', 3L), 'integer', 'character')
> library(data.table)
data.table 1.10.5 IN DEVELOPMENT built 2017-12-02 12:05:42 UTC; appveyor
  The fastest way to learn (by data.table authors): https://www.datacamp.com/courses/data-analysis-the-data-table-way
  Documentation: ?data.table, example(data.table) and browseVignettes("data.table")
  Release notes, videos and slides: http://r-datatable.com
> gcinfo(TRUE)
[1] FALSE
> sessionInfo()
R version 3.4.2 beta (2017-09-17 r73296)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows Server >= 2012 x64 (build 9200)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252    LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                           LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] data.table_1.10.5

loaded via a namespace (and not attached):
[1] compiler_3.4.2 tools_3.4.2   
Garbage collection 46 = 36+5+5 (level 0) ... 
38.6 Mbytes of cons cells used (77%)
11.2 Mbytes of vectors used (71%)
Garbage collection 47 = 37+5+5 (level 0) ... 
38.7 Mbytes of cons cells used (77%)
11.2 Mbytes of vectors used (71%)
Garbage collection 48 = 38+5+5 (level 0) ... 
38.8 Mbytes of cons cells used (77%)
11.2 Mbytes of vectors used (71%)
Garbage collection 49 = 39+5+5 (level 0) ... 
39.0 Mbytes of cons cells used (78%)
11.2 Mbytes of vectors used (71%)
Garbage collection 50 = 40+5+5 (level 0) ... 
39.1 Mbytes of cons cells used (78%)
11.3 Mbytes of vectors used (71%)
Garbage collection 51 = 40+6+5 (level 1) ... 
38.8 Mbytes of cons cells used (77%)
11.3 Mbytes of vectors used (71%)
Garbage collection 52 = 41+6+5 (level 0) ... 
38.9 Mbytes of cons cells used (77%)
11.3 Mbytes of vectors used (71%)
Garbage collection 53 = 42+6+5 (level 0) ... 
41.5 Mbytes of cons cells used (83%)
12.2 Mbytes of vectors used (77%)
Garbage collection 54 = 42+7+5 (level 1) ... 
43.4 Mbytes of cons cells used (86%)
12.8 Mbytes of vectors used (81%)
Garbage collection 55 = 42+7+6 (level 2) ... 
44.7 Mbytes of cons cells used (72%)
13.0 Mbytes of vectors used (67%)
Garbage collection 56 = 43+7+6 (level 0) ... 
46.5 Mbytes of cons cells used (74%)
13.6 Mbytes of vectors used (70%)
Garbage collection 57 = 44+7+6 (level 0) ... 
47.0 Mbytes of cons cells used (75%)
13.8 Mbytes of vectors used (71%)
Garbage collection 58 = 45+7+6 (level 0) ... 
47.4 Mbytes of cons cells used (76%)
13.9 Mbytes of vectors used (71%)
Garbage collection 59 = 46+7+6 (level 0) ... 
47.7 Mbytes of cons cells used (76%)
14.2 Mbytes of vectors used (73%)
Garbage collection 60 = 47+7+6 (level 0) ... 
48.0 Mbytes of cons cells used (77%)
14.2 Mbytes of vectors used (73%)
Garbage collection 61 = 48+7+6 (level 0) ... 
48.1 Mbytes of cons cells used (77%)
14.3 Mbytes of vectors used (73%)
> DT <- fread('LargeFile.csv', header = TRUE, colClasses = colCLASS, select = colSEL, key = c('LoanID', 'Month'), verbose = TRUE)
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
  Using 40 threads (omp_get_max_threads()=40, nth=40)
  NAstrings = [<<NA>>]
  None of the NAstrings look like numbers.
  show progress = 1
  0/1 column will be read as boolean
[02] Opening the file
  Opening file LargeFile.csv
  File opened, size = 6.355GB (6823372783 bytes).
  Memory mapped ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
  \n has been found in the data so any mixture of line endings is allowed other than \r-only line endings. This is common and ideal.
[05] Skipping initial rows if needed
  Positioned on line 1 starting: <<Orig_Year,Orig_Qtr,LoanID,Mont>>
[06] Detect separator, quoting rule, and ncolumns
  Detecting sep ...
  sep=','  with 100 lines of 37 fields using quote rule 0
  Detected 37 columns on line 1. This line is either column names or first data row. Line starts as: <<Orig_Year,Orig_Qtr,LoanID,Mont>>
  Quote rule picked = 0
  fill=false and the most number of columns found is 37
[07] Detect column types, good nrow estimate and whether first row is column names
  'header' changed by user from 'auto' to true
  Number of sampling jump points = 101 because (6823372781 bytes from row 1 to eof) / (2 * 12905 jump0size) == 264369
  Type codes (jump 000)    : 51AA7155A15A7111111111111771111177715  Quote rule 0
  Type codes (jump 001)    : 51AA7155A15A711111111111177111117771A  Quote rule 0
  Type codes (jump 008)    : 51AA7555A15A711111111111177111117771A  Quote rule 0
  Type codes (jump 009)    : 51AA7555AA5A71A155555557177111117775A  Quote rule 0
  Type codes (jump 042)    : 55AA7555AA5A71A155555557177111117775A  Quote rule 0
  Type codes (jump 064)    : 55AA7555AA5A71A1A5555557177111117775A  Quote rule 0
  Type codes (jump 100)    : 55AA7555AA5A71A1A5555557177111117775A  Quote rule 0
  =====
  Sampled 10049 rows (handled \n inside quoted fields) at 101 jump points
  Bytes from first data row on line 1 to the end of last row: 6823372781
  Line length: mean=126.15 sd=8.30 min=100 max=359
  Estimated number of rows: 6823372781 / 126.15 = 54088821
  Initial alloc = 62279495 rows (54088821 + 15%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
  =====
[08] Assign column names
[09] Apply user overrides on column types
  After 4 type and 23 drop user overrides : 00AA700000000000000000070775555077750
[10] Allocate memory for the datatable
  Allocating 14 column slots (37 - 23 dropped) with 62279495 rows
Garbage collection 62 = 48+7+7 (level 2) ... 
46.5 Mbytes of cons cells used (60%)
13.6 Mbytes of vectors used (2%)
Garbage collection 63 = 48+7+8 (level 2) ... 
46.5 Mbytes of cons cells used (60%)
488.7 Mbytes of vectors used (42%)
Garbage collection 64 = 48+7+9 (level 2) ... 
46.5 Mbytes of cons cells used (60%)
963.9 Mbytes of vectors used (56%)
Garbage collection 65 = 48+7+10 (level 2) ... 
46.5 Mbytes of cons cells used (60%)
1439.1 Mbytes of vectors used (63%)
Garbage collection 66 = 48+7+11 (level 2) ... 
46.5 Mbytes of cons cells used (60%)
1914.2 Mbytes of vectors used (67%)
Garbage collection 67 = 48+7+12 (level 2) ... 
46.5 Mbytes of cons cells used (60%)
2864.5 Mbytes of vectors used (77%)
Garbage collection 68 = 48+7+13 (level 2) ... 
46.5 Mbytes of cons cells used (60%)
3577.3 Mbytes of vectors used (78%)
Garbage collection 69 = 48+7+14 (level 2) ... 
46.5 Mbytes of cons cells used (60%)
4290.0 Mbytes of vectors used (75%)
[11] Read the data
  jumps=[0..6520), chunk_size=1046529, total_size=6823372422
|--------------------------------------------------|
|============================Garbage collection 70 = 49+7+14 (level 0) ... 
76.5 Mbytes of cons cells used (99%)
5487.5 Mbytes of vectors used (96%)
=Garbage collection 71 = 49+8+14 (level 1) ... 
77.0 Mbytes of cons cells used (100%)
5487.6 Mbytes of vectors used (96%)
Garbage collection 72 = 49+8+15 (level 2) ... 
77.0 Mbytes of cons cells used (81%)
5487.1 Mbytes of vectors used (80%)
==============Garbage collection 73 = 50+8+15 (level 0) ... 
94.3 Mbytes of cons cells used (100%)
5494.0 Mbytes of vectors used (80%)
Garbage collection 74 = 50+9+15 (level 1) ... 
94.5 Mbytes of cons cells used (100%)
5494.1 Mbytes of vectors used (80%)
Garbage collection 75 = 50+9+16 (level 2) ... 
94.5 Mbytes of cons cells used (82%)
5493.1 Mbytes of vectors used (67%)
=======|
Read 53945186 rows x 14 columns from 6.355GB (6823372783 bytes) file in 00:24.772 wall clock time
[12] Finalizing the datatable
  Type counts:
        23 : drop      '0'
         5 : int32     '5'
         7 : float64   '7'
         2 : string    'A'
=============================
   0.005s (  0%) Memory map 6.355GB file
   0.018s (  0%) sep=',' ncol=37 and header detection
   0.000s (  0%) Column type detection using 10049 sample rows
   5.496s ( 22%) Allocation of 62279495 rows x 37 cols (5.336GB) of which 53945186 ( 87%) rows used
  19.253s ( 78%) Reading 6520 chunks of 0.998MB (8295 rows) using 40 threads
   =    0.433s (  2%) Finding first non-embedded \n after each jump
   +    1.482s (  6%) Parse to row-major thread buffers (grown 0 times)
   +    9.515s ( 38%) Transpose
   +    7.822s ( 32%) Waiting
   0.000s (  0%) Rereading 0 columns due to out-of-sample type exceptions
  24.772s        Total
Garbage collection 76 = 51+9+16 (level 0) ... 
105.3 Mbytes of cons cells used (91%)
5500.3 Mbytes of vectors used (67%)
Garbage collection 77 = 51+10+16 (level 1) ... 
105.4 Mbytes of cons cells used (91%)
5500.2 Mbytes of vectors used (67%)
> DT[, Month := as.IDate(Month, format = "%Y-%m-%d")]
Garbage collection 78 = 51+10+17 (level 2) ... 
107.5 Mbytes of cons cells used (76%)
8174.1 Mbytes of vectors used (81%)
Garbage collection 79 = 51+11+17 (level 1) ... 
107.5 Mbytes of cons cells used (76%)
5910.4 Mbytes of vectors used (59%)
> gcinfo(FALSE)
[1] TRUE

luar biasa! : tada: kerja bagus untuk semua orang yang terlibat, terutama @mattdowle yang sekarang harus pendek beberapa rambut dengan ini :)

Sepertinya strategi saya untuk 'tetap berlibur sampai masalah teratasi' tampaknya berhasil di sini :-)

Apakah ada hal lain yang harus saya coba lihat atau apakah masalah ini dianggap sudah terselesaikan?

Terima kasih @aadler dan @HughParsonage! Bantuan.
@evin.ha ha. Ya, itu adalah sisi data.table dan sekarang diselesaikan (PR # 2488). Terima kasih.

Apakah halaman ini membantu?
0 / 5 - 0 peringkat