data.table: Stack imbalance in fread

Created on 14 Nov 2017  ·  61 comments  ·  Source: Rdatatable/data.table

I experience an R crash (a 'stack imbalance') when I run the following with verbose=FALSE. Note: I was able to run the code below successfully on an older dev version of data.table a month or two ago, so I believe this is a fairly recent bug. (Sorry, I don't remember the exact dev version it was working on.)

The problem does not reproduce on a substantially smaller file. Link to the zip file (the csv is 350 MB): https://github.com/HughParsonage/ABS-data/blob/master/inbox/SA2-by-DJZ-2011.zip
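For reference, a minimal sketch of preparing the file before running the repro below, assuming the zip from the link above has already been downloaded into the working directory (the csv name is taken from the link):

unzip("SA2-by-DJZ-2011.zip")        # extracts SA2-by-DJZ-2011.csv (~350 MB)
file.exists("SA2-by-DJZ-2011.csv")  # confirm it is in place before calling fread()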

Occasionally I see different errors. For example,

Error in get(name, envir = ns, inherits = FALSE) : invalid first argument

or

Warning: stack imbalance in '$', 16 then 15
Error: R_Reprotect: only 1 protected item, can't reprotect index -2

# Minimal reproducible example

library(data.table)

#> data.table 1.10.5 IN DEVELOPMENT built 2017-11-13 02:46:28 UTC; appveyor
#>   The fastest way to learn (by data.table authors): https://www.datacamp.com/courses/data-analysis-the-data-table-way
#>   Documentation: ?data.table, example(data.table) and browseVignettes("data.table")
#>   Release notes, videos and slides: http://r-datatable.com


fread("SA2-by-DJZ-2011.csv", header = FALSE, na.strings = "", verbose = TRUE)

Input contains no \n. Taking this to be a filename to open
[01] Check arguments
  Using 12 threads (omp_get_max_threads()=12, nth=12)
  NAstrings = [<<>>]
  None of the NAstrings look like numbers.
  show progress = 1
  0/1 column will be read as boolean
[02] Opening the file
  Opening file SA2-by-DJZ-2011.csv
  File opened, size = 349.4MB (366418725 bytes).
  Memory mapping ... ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
  \r-only line endings are not allowed because \n is found in the data
[05] Skipping initial rows if needed
  Positioned on line 1 starting: <<Australian Bureau of Statistic>>
[06] Detect separator, quoting rule, and ncolumns
  Detecting sep ...
  sep=','  with 89 lines of 4 fields using quote rule 0
  Detected 4 columns on line 12. This line is either column names or first data row. Line starts as: <<"Goulburn","110018063",3499,>>
  Quote rule picked = 0
  fill=false and the most number of columns found is 4
[07] Detect column types, good nrow estimate and whether first row is column names
  'header' changed by user from 'auto' to false
  Number of sampling jump points = 101 because (366418375 bytes from row 1 to eof) / (2 * 1457 jump0size) == 125744
  Type codes (jump 000)    : 1551  Quote rule 0
  Type codes (jump 100)    : 11051  Quote rule 0
  =====
  Sampled 10027 rows (handled \n inside quoted fields) at 101 jump points
  Bytes from first data row on line 12 to the end of last row: 366418143
  Line length: mean=16.02 sd=0.21 min=16 max=29
  Estimated number of rows: 366418143 / 16.02 = 22877178
  Initial alloc = 25164895 rows (22877178 + 9%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
  =====
[08] Assign column names
[09] Apply user overrides on column types
  After 0 type and 0 drop user overrides : 11051
[10] Allocate memory for the datatable
  Allocating 4 column slots (4 - 0 dropped) with 25164895 rows
[11] Read the data
  jumps=[0..360), chunk_size=1017828, total_size=366418143
[12] Finalizing the datatable
Read 22885380 rows x 4 columns from 349.4MB (366418725 bytes) file in 00:02.550 wall clock time
Thread buffers were grown 0 times (if all 12 threads each grew once, this figure would be 12)
Final type counts
         0 : drop     
         1 : bool8    
         0 : bool8    
         0 : bool8    
         0 : bool8    
         1 : int32    
         0 : int64    
         0 : float64  
         0 : float64  
         0 : float64  
         2 : string   
Rereading 1 columns due to out-of-sample type exceptions.
Column 1 ("") bumped from 'bool8' to 'string' due to <<"Goulburn">> on row 0
[11] Read the data
  jumps=[0..360), chunk_size=1017828, total_size=366418143
Read 94%. ETA 00:00 Warning: stack imbalance in '$', 27 then 28
Read 98%. ETA 00:00 
[12] Finalizing the datatable
Reread 22885380 rows x 1 columns in 00:00.991
Read 22885380 rows. Exactly what was estimated and allocated up front
=============================
   0.006s (  0%) Memory map 0.341GB file
   0.011s (  0%) sep=',' ncol=4 and header detection
   0.002s (  0%) Column type detection using 10027 sample rows
   0.328s (  9%) Allocation of 22885380 rows x 4 cols (0.469GB)
   3.194s ( 90%) Reading 360 chunks of 0.971MB (63547 rows) using 12 threads
   =    0.001s (  0%) Finding first non-embedded \n after each jump
   +    0.362s ( 10%) Parse to row-major thread buffers
   +    1.963s ( 55%) Transpose
   +    0.868s ( 25%) Waiting
   0.991s ( 28%) Rereading 1 columns due to out-of-sample type exceptions
   3.541s        Total
Warning: stack imbalance in 'withVisible', 3 then 5
Warning messages:
1: In fread("SA2-by-DJZ-2011.csv", header = FALSE, na.strings = "",  :
  Starting data input on line 12 <<"Goulburn","110018063",3499,>> with 4 fields and discarding line 11 <<"Main Statistical Area Structu>> before it because it has a different number of fields (3).
2: In fread("SA2-by-DJZ-2011.csv", header = FALSE, na.strings = "",  :
  Found the last consistent line but text exists afterwards. Consider fill=TRUE and/or blank.lines.skip=TRUE. First 200 characters of discarded line: <<"Dataset: 2011 Census of Population and Housing">>

# Output of sessionInfo()

R version 3.4.2 (2017-09-28)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

Matrix products: default

locale:
[1] LC_COLLATE=English_Australia.1252  LC_CTYPE=English_Australia.1252   
[3] LC_MONETARY=English_Australia.1252 LC_NUMERIC=C                      
[5] LC_TIME=English_Australia.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] data.table_1.10.5    RevoUtils_10.0.6     RevoUtilsMath_10.0.1

loaded via a namespace (and not attached):
[1] compiler_3.4.2 tools_3.4.2    yaml_2.1.14 
Labels: bug, fread, idatitime, platform-specific

Most helpful comment

It looks like my strategy of 'going on holiday until the problem gets fixed' has worked here :-)

Is there anything else I should try to investigate, or is this issue considered resolved?

All 61 comments

@HughParsonage, this looks similar to #2457. Maybe try passing showProgress=FALSE and see if it completes.
@mattdowle could a regression have occurred since 2017-11-09?
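For reference, the suggested workaround is simply the repro call from above with progress output turned off (a sketch, reusing the same file and arguments):

fread("SA2-by-DJZ-2011.csv", header = FALSE, na.strings = "", showProgress = FALSE)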

Running with showProgress=FALSE did indeed return the result (with only the expected warnings).

Thanks for all the detailed information. I doubt a regression has occurred since 2017-11-09, but perhaps the long output from verbose=TRUE is having a similar impact to the ETA output. The file needs to be reread, which means more output is generated. I fear that @HughParsonage's report that showProgress=FALSE works for him is spurious, and that the problem will happen again if it is run 5-10 times with verbose=TRUE.

There are no verbose messages printed from inside the parallel section (other than the progress ETA, which has already been fixed). However, there are verbose messages after the first read and before the second reread starts (which is happening for this file). I suppose it's possible that, if those prints trigger the 100th R_CheckUserInterrupt (see #2457), that could cause the 2nd parallel region to fail (odd though that would be). To rule that out anyway, I've just changed all verbose messages to use REprintf instead of Rprintf (the same fix as #2457 for the ETA). That failed because the tests aren't finding the output on stderr - will fix. Once it passes, the Windows .zip will be built automatically and you can try again. I'll update here when it's ready.

Ok, the second attempt is passing the checks and the Windows .zip is available. @HughParsonage, would you mind trying again, please? I've added a call to R_FlushConsole() after the verbose-mode messages, before the reread. That flush is only needed on Windows. I'm guessing that without the flush the console is sometimes updated a little later, while the parallel reread is happening, and that is known to cause problems. Please repeat 10 times, always with verbose=TRUE and showProgress=TRUE. If you see 10 clean runs, then we'll say that was it. Otherwise I'll have to think again.

Unfortunately, not fixed:

> library(data.table)
data.table 1.10.5 IN DEVELOPMENT built 2017-11-15 00:36:41 UTC; appveyor
  The fastest way to learn (by data.table authors): https://www.datacamp.com/courses/data-analysis-the-data-table-way
  Documentation: ?data.table, example(data.table) and browseVignettes("data.table")
  Release notes, videos and slides: http://r-datatable.com
> fread("SA2-by-DJZ-2011.csv", header = FALSE, na.strings = "", verbose = FALSE)
Read 26%. ETA 00:00 Warning: stack imbalance in '$', 20 then 22
Read 52%. ETA 00:00 Warning: stack imbalance in '$', 36 then 35
Warning: stack imbalance in '$', 21 then 22
Read 59%. ETA 00:00 Error in fread("SA2-by-DJZ-2011.csv", header = FALSE, na.strings = "",  : 
  unprotect_ptr: pointer not found
In addition: Warning: stack imbalance in '$', 26 then 28
Warning messages:
1: Warning: stack imbalance in '$', 26 then 27
In fread("SA2-by-DJZ-2011.csv", header = FALSE, na.strings = "",  :
  Starting data input on line 12 <<"Goulburn","110018063",3499,>> with 4 fields and discarding line 11 <<"Main Statistical Area Structu>> before it because it has a different number of fields (3).
2: In fread("SA2-by-DJZ-2011.csv", header = FALSE, na.strings = "",  :
  Found the last consistent line but text exists afterwards. Consider fill=TRUE and/or blank.lines.skip=TRUE. First 200 characters of discarded line: <<"Dataset: 2011 Census of Population and Housing">>
Warning: stack imbalance in 'lapply', 31 then 30
Warning: stack imbalance in '$', 14 then 15

Using verbose=TRUE, showProgress=TRUE, even after 10 runs I get no errors. Here is the output of the 10th run:

Input contains no \n. Taking this to be a filename to open
[01] Check arguments
  Using 12 threads (omp_get_max_threads()=12, nth=12)
  NAstrings = [<<>>]
  None of the NAstrings look like numbers.
  show progress = 1
  0/1 column will be read as boolean
[02] Opening the file
  Opening file SA2-by-DJZ-2011.csv
  File opened, size = 349.4MB (366418725 bytes).
  Memory mapping ... ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
  \r-only line endings are not allowed because \n is found in the data
[05] Skipping initial rows if needed
  Positioned on line 1 starting: <<Australian Bureau of Statistic>>
[06] Detect separator, quoting rule, and ncolumns
  Detecting sep ...
  sep=','  with 89 lines of 4 fields using quote rule 0
  Detected 4 columns on line 12. This line is either column names or first data row. Line starts as: <<"Goulburn","110018063",3499,>>
  Quote rule picked = 0
  fill=false and the most number of columns found is 4
[07] Detect column types, good nrow estimate and whether first row is column names
  'header' changed by user from 'auto' to false
  Number of sampling jump points = 101 because (366418375 bytes from row 1 to eof) / (2 * 1457 jump0size) == 125744
  Type codes (jump 000)    : 1551  Quote rule 0
  Type codes (jump 100)    : 11051  Quote rule 0
  =====
  Sampled 10027 rows (handled \n inside quoted fields) at 101 jump points
  Bytes from first data row on line 12 to the end of last row: 366418143
  Line length: mean=16.02 sd=0.21 min=16 max=29
  Estimated number of rows: 366418143 / 16.02 = 22877178
  Initial alloc = 25164895 rows (22877178 + 9%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
  =====
[08] Assign column names
[09] Apply user overrides on column types
  After 0 type and 0 drop user overrides : 11051
[10] Allocate memory for the datatable
  Allocating 4 column slots (4 - 0 dropped) with 25164895 rows
[11] Read the data
  jumps=[0..360), chunk_size=1017828, total_size=366418143
[12] Finalizing the datatable
Read 22885380 rows x 4 columns from 349.4MB (366418725 bytes) file in 00:02.094 wall clock time
Thread buffers were grown 0 times (if all 1 threads each grew once, this figure would be 1)
Final type counts
         0 : drop     
         1 : bool8    
         0 : bool8    
         0 : bool8    
         0 : bool8    
         1 : int32    
         0 : int64    
         0 : float64  
         0 : float64  
         0 : float64  
         2 : string   
Rereading 1 columns due to out-of-sample type exceptions.
Column 1 ("") bumped from 'bool8' to 'string' due to <<"Goulburn">> on row 0
[11] Read the data
  jumps=[0..360), chunk_size=1017828, total_size=366418143
Read 99%. ETA 00:00 
[12] Finalizing the datatable
Reread 22885380 rows x 1 columns in 00:01.752
Read 22885380 rows. Exactly what was estimated and allocated up front
=============================
   0.004s (  0%) Memory map 0.341GB file
   0.008s (  0%) sep=',' ncol=4 and header detection
   0.001s (  0%) Column type detection using 10027 sample rows
   0.173s (  4%) Allocation of 22885380 rows x 4 cols (0.469GB)
   3.660s ( 95%) Reading 360 chunks of 0.971MB (63547 rows) using 1 threads
   =    0.009s (  0%) Finding first non-embedded \n after each jump
   +    1.946s ( 51%) Parse to row-major thread buffers
   +    1.098s ( 29%) Transpose
   +    0.608s ( 16%) Waiting
   1.752s ( 46%) Rereading 1 columns due to out-of-sample type exceptions
   3.846s        Total
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
  Using 12 threads (omp_get_max_threads()=12, nth=12)
  NAstrings = [<<>>]
  None of the NAstrings look like numbers.
  show progress = 1
  0/1 column will be read as boolean
[02] Opening the file
  Opening file SA2-by-DJZ-2011.csv
  File opened, size = 349.4MB (366418725 bytes).
  Memory mapping ... ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
  \r-only line endings are not allowed because \n is found in the data
[05] Skipping initial rows if needed
  Positioned on line 1 starting: <<Australian Bureau of Statistic>>
[06] Detect separator, quoting rule, and ncolumns
  Detecting sep ...
  sep=','  with 89 lines of 4 fields using quote rule 0
  Detected 4 columns on line 12. This line is either column names or first data row. Line starts as: <<"Goulburn","110018063",3499,>>
  Quote rule picked = 0
  fill=false and the most number of columns found is 4
[07] Detect column types, good nrow estimate and whether first row is column names
  'header' changed by user from 'auto' to false
  Number of sampling jump points = 101 because (366418375 bytes from row 1 to eof) / (2 * 1457 jump0size) == 125744
  Type codes (jump 000)    : 1551  Quote rule 0
  Type codes (jump 100)    : 11051  Quote rule 0
  =====
  Sampled 10027 rows (handled \n inside quoted fields) at 101 jump points
  Bytes from first data row on line 12 to the end of last row: 366418143
  Line length: mean=16.02 sd=0.21 min=16 max=29
  Estimated number of rows: 366418143 / 16.02 = 22877178
  Initial alloc = 25164895 rows (22877178 + 9%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
  =====
[08] Assign column names
[09] Apply user overrides on column types
  After 0 type and 0 drop user overrides : 11051
[10] Allocate memory for the datatable
  Allocating 4 column slots (4 - 0 dropped) with 25164895 rows
[11] Read the data
  jumps=[0..360), chunk_size=1017828, total_size=366418143
[12] Finalizing the datatable
Read 22885380 rows x 4 columns from 349.4MB (366418725 bytes) file in 00:02.589 wall clock time
Thread buffers were grown 0 times (if all 1 threads each grew once, this figure would be 1)
Final type counts
         0 : drop     
         1 : bool8    
         0 : bool8    
         0 : bool8    
         0 : bool8    
         1 : int32    
         0 : int64    
         0 : float64  
         0 : float64  
         0 : float64  
         2 : string   
Rereading 1 columns due to out-of-sample type exceptions.
Column 1 ("") bumped from 'bool8' to 'string' due to <<"Goulburn">> on row 0
[11] Read the data
  jumps=[0..360), chunk_size=1017828, total_size=366418143
Read 99%. ETA 00:00 
[12] Finalizing the datatable
Reread 22885380 rows x 1 columns in 00:01.418
Read 22885380 rows. Exactly what was estimated and allocated up front
=============================
   0.001s (  0%) Memory map 0.341GB file
   0.003s (  0%) sep=',' ncol=4 and header detection
   0.001s (  0%) Column type detection using 10027 sample rows
   0.574s ( 14%) Allocation of 22885380 rows x 4 cols (0.469GB)
   3.428s ( 86%) Reading 360 chunks of 0.971MB (63547 rows) using 1 threads
   =    0.010s (  0%) Finding first non-embedded \n after each jump
   +    1.988s ( 50%) Parse to row-major thread buffers
   +    1.137s ( 28%) Transpose
   +    0.292s (  7%) Waiting
   1.418s ( 35%) Rereading 1 columns due to out-of-sample type exceptions
   4.007s        Total
There were 20 warnings (use warnings() to see them)

@HughParsonage Thanks! I'm confused. You're saying it works fine with verbose=TRUE, showProgress=TRUE, which is what we hoped - yay! That was failing before, wasn't it? The default for showProgress is TRUE anyway, but when you run with the default FALSE for verbose, _then_ it doesn't work and you see the stack imbalance? It's odd that _less_ output makes it fail. Please confirm. If that's the case, maybe I'm barking up the wrong tree. It works fine for me here on Linux, so I'm relying on your tests on Windows. Thanks.
(Also, at the bottom of the output of the 10th run it says there were 20 warnings. I assume those are the 2 warnings shown above repeated 10 times. If so, that makes sense.)

Hi, sorry for the confusion, Matt.

You're right that the original problem no longer results in a crash, i.e. the following works as expected:

fread("SA2-by-DJZ-2011.csv", header = FALSE, na.strings = "")

To clarify: in the original report, with verbose=FALSE (the default) I was getting a crash. I had run it with verbose=TRUE before filing the issue and noticed a 'stack imbalance' warning, but did not hit a crash. With the latest version I can't get a crash (or even any problems) with verbose=FALSE.

The reason I said 'not fixed' was that I noticed the warning messages:

Warning messages:
Warning: stack imbalance in '$', 26 then 27
Warning: stack imbalance in 'lapply', 31 then 30
Warning: stack imbalance in '$', 14 then 15

which looked odd and, I thought, might indicate a closely related though not identical problem. That said, this morning in Australia I can no longer reproduce those warning messages.

OK I see. Those stack imbalance warning messages are essentially errors, yes. We can't ignore them. I'm calling that stack imbalance warning a crash, even though it hasn't actually crashed yet. (It's only a matter of time before it crashes once that warning has been seen.)

When you do the 10 runs in a fresh R session with verbose=TRUE, showProgress=TRUE, are any of the 20 warnings about stack imbalance, or are all 20 just the following regular warnings? (A quick way to check this is sketched at the end of this comment.)

1: In fread("SA2-by-DJZ-2011.csv", header = FALSE, na.strings = "",  :
  Starting data input on line 12 <<"Goulburn","110018063",3499,>> with 4 fields and discarding line 11 <<"Main Statistical Area Structu>> before it because it has a different number of fields (3).
2: In fread("SA2-by-DJZ-2011.csv", header = FALSE, na.strings = "",  :
  Found the last consistent line but text exists afterwards. Consider fill=TRUE and/or blank.lines.skip=TRUE. First 200 characters of discarded line: <<"Dataset: 2011 Census of Population and Housing">>

As soon as a stack imbalance warning occurs, start a brand new R session. We can't trust anything from R once that has happened even once.
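One way to check that, as a sketch run in the same session before restarting (warnings() returns the warnings deferred from the last top-level expression, up to getOption("nwarnings")):

w <- warnings()                             # the accumulated warnings from the runs
table(names(w))                             # count of each distinct warning message
any(grepl("stack imbalance", names(w)))     # TRUE if any of them is a stack imbalance warning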

I managed to get a crash when running with verbose=TRUE, showProgress=TRUE. Something about a const char with a SEXP. I'm trying to reproduce it from the command line (unfortunately it occurred in RStudio, and RStudio closed before I could read the whole message).

I can't reproduce the crash. Here is the output after restarting. There was a stack imbalance warning:

> library(data.table)
data.table 1.10.5 IN DEVELOPMENT built 2017-11-15 00:36:41 UTC; appveyor
  The fastest way to learn (by data.table authors): https://www.datacamp.com/courses/data-analysis-the-data-table-way
  Documentation: ?data.table, example(data.table) and browseVignettes("data.table")
  Release notes, videos and slides: http://r-datatable.com
> for (i in 1:10) fread("SA2-by-DJZ-2011.csv", header = FALSE, na.strings = "", verbose = TRUE, showProgress = TRUE)
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
  Using 12 threads (omp_get_max_threads()=12, nth=12)
  NAstrings = [<<>>]
  None of the NAstrings look like numbers.
  show progress = 1
  0/1 column will be read as boolean
[02] Opening the file
  Opening file SA2-by-DJZ-2011.csv
  File opened, size = 349.4MB (366418725 bytes).
  Memory mapping ... ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
  \r-only line endings are not allowed because \n is found in the data
[05] Skipping initial rows if needed
  Positioned on line 1 starting: <<Australian Bureau of Statistic>>
[06] Detect separator, quoting rule, and ncolumns
  Detecting sep ...
  sep=','  with 89 lines of 4 fields using quote rule 0
  Detected 4 columns on line 12. This line is either column names or first data row. Line starts as: <<"Goulburn","110018063",3499,>>
  Quote rule picked = 0
  fill=false and the most number of columns found is 4
[07] Detect column types, good nrow estimate and whether first row is column names
  'header' changed by user from 'auto' to false
  Number of sampling jump points = 101 because (366418375 bytes from row 1 to eof) / (2 * 1457 jump0size) == 125744
  Type codes (jump 000)    : 1551  Quote rule 0
  Type codes (jump 100)    : 11051  Quote rule 0
  =====
  Sampled 10027 rows (handled \n inside quoted fields) at 101 jump points
  Bytes from first data row on line 12 to the end of last row: 366418143
  Line length: mean=16.02 sd=0.21 min=16 max=29
  Estimated number of rows: 366418143 / 16.02 = 22877178
  Initial alloc = 25164895 rows (22877178 + 9%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
  =====
[08] Assign column names
[09] Apply user overrides on column types
  After 0 type and 0 drop user overrides : 11051
[10] Allocate memory for the datatable
  Allocating 4 column slots (4 - 0 dropped) with 25164895 rows
[11] Read the data
  jumps=[0..360), chunk_size=1017828, total_size=366418143
[12] Finalizing the datatable
Read 22885380 rows x 4 columns from 349.4MB (366418725 bytes) file in 00:01.212 wall clock time
Thread buffers were grown 0 times (if all 12 threads each grew once, this figure would be 12)
Final type counts
         0 : drop     
         1 : bool8    
         0 : bool8    
         0 : bool8    
         0 : bool8    
         1 : int32    
         0 : int64    
         0 : float64  
         0 : float64  
         0 : float64  
         2 : string   
Rereading 1 columns due to out-of-sample type exceptions.
Column 1 ("") bumped from 'bool8' to 'string' due to <<"Goulburn">> on row 0
[11] Read the data
  jumps=[0..360), chunk_size=1017828, total_size=366418143
Read 31%. ETA 00:00 Warning: stack imbalance in '$', 24 then 23
Read 91%. ETA 00:00 Warning: stack imbalance in '$', 27 then 26
Read 95%. ETA 00:00 Warning: stack imbalance in '$', 28 then 29
Read 98%. ETA 00:00 
[12] Finalizing the datatable
Reread 22885380 rows x 1 columns in 00:00.895
Read 22885380 rows. Exactly what was estimated and allocated up front
=============================
   0.002s (  0%) Memory map 0.341GB file
   0.029s (  1%) sep=',' ncol=4 and header detection
   0.001s (  0%) Column type detection using 10027 sample rows
   0.314s ( 15%) Allocation of 22885380 rows x 4 cols (0.469GB)
   1.761s ( 84%) Reading 360 chunks of 0.971MB (63547 rows) using 12 threads
   =    0.015s (  1%) Finding first non-embedded \n after each jump
   +    0.599s ( 28%) Parse to row-major thread buffers
   +    0.400s ( 19%) Transpose
   +    0.746s ( 35%) Waiting
   0.895s ( 42%) Rereading 1 columns due to out-of-sample type exceptions
   2.107s        Total
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
  Using 12 threads (omp_get_max_threads()=12, nth=12)
  NAstrings = [<<>>]
  None of the NAstrings look like numbers.
  show progress = 1
  0/1 column will be read as boolean
[02] Opening the file
  Opening file SA2-by-DJZ-2011.csv
  File opened, size = 349.4MB (366418725 bytes).
  Memory mapping ... ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
  \r-only line endings are not allowed because \n is found in the data
[05] Skipping initial rows if needed
  Positioned on line 1 starting: <<Australian Bureau of Statistic>>
[06] Detect separator, quoting rule, and ncolumns
  Detecting sep ...
  sep=','  with 89 lines of 4 fields using quote rule 0
  Detected 4 columns on line 12. This line is either column names or first data row. Line starts as: <<"Goulburn","110018063",3499,>>
  Quote rule picked = 0
  fill=false and the most number of columns found is 4
[07] Detect column types, good nrow estimate and whether first row is column names
  'header' changed by user from 'auto' to false
  Number of sampling jump points = 101 because (366418375 bytes from row 1 to eof) / (2 * 1457 jump0size) == 125744
  Type codes (jump 000)    : 1551  Quote rule 0
  Type codes (jump 100)    : 11051  Quote rule 0
  =====
  Sampled 10027 rows (handled \n inside quoted fields) at 101 jump points
  Bytes from first data row on line 12 to the end of last row: 366418143
  Line length: mean=16.02 sd=0.21 min=16 max=29
  Estimated number of rows: 366418143 / 16.02 = 22877178
  Initial alloc = 25164895 rows (22877178 + 9%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
  =====
[08] Assign column names
[09] Apply user overrides on column types
  After 0 type and 0 drop user overrides : 11051
[10] Allocate memory for the datatable
  Allocating 4 column slots (4 - 0 dropped) with 25164895 rows
[11] Read the data
  jumps=[0..360), chunk_size=1017828, total_size=366418143
[12] Finalizing the datatable
Read 22885380 rows x 4 columns from 349.4MB (366418725 bytes) file in 00:01.335 wall clock time
Thread buffers were grown 0 times (if all 12 threads each grew once, this figure would be 12)
Final type counts
         0 : drop     
         1 : bool8    
         0 : bool8    
         0 : bool8    
         0 : bool8    
         1 : int32    
         0 : int64    
         0 : float64  
         0 : float64  
         0 : float64  
         2 : string   
Rereading 1 columns due to out-of-sample type exceptions.
Column 1 ("") bumped from 'bool8' to 'string' due to <<"Goulburn">> on row 0
[11] Read the data
  jumps=[0..360), chunk_size=1017828, total_size=366418143
Read 98%. ETA 00:00 
[12] Finalizing the datatable
Reread 22885380 rows x 1 columns in 00:01.049
Read 22885380 rows. Exactly what was estimated and allocated up front
=============================
   0.002s (  0%) Memory map 0.341GB file
   0.006s (  0%) sep=',' ncol=4 and header detection
   0.001s (  0%) Column type detection using 10027 sample rows
   0.402s ( 17%) Allocation of 22885380 rows x 4 cols (0.469GB)
   1.974s ( 83%) Reading 360 chunks of 0.971MB (63547 rows) using 12 threads
   =    0.001s (  0%) Finding first non-embedded \n after each jump
   +    0.209s (  9%) Parse to row-major thread buffers
   +    0.864s ( 36%) Transpose
   +    0.900s ( 38%) Waiting
   1.049s ( 44%) Rereading 1 columns due to out-of-sample type exceptions
   2.385s        Total
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
  Using 12 threads (omp_get_max_threads()=12, nth=12)
  NAstrings = [<<>>]
  None of the NAstrings look like numbers.
  show progress = 1
  0/1 column will be read as boolean
[02] Opening the file
  Opening file SA2-by-DJZ-2011.csv
  File opened, size = 349.4MB (366418725 bytes).
  Memory mapping ... ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
  \r-only line endings are not allowed because \n is found in the data
[05] Skipping initial rows if needed
  Positioned on line 1 starting: <<Australian Bureau of Statistic>>
[06] Detect separator, quoting rule, and ncolumns
  Detecting sep ...
  sep=','  with 89 lines of 4 fields using quote rule 0
  Detected 4 columns on line 12. This line is either column names or first data row. Line starts as: <<"Goulburn","110018063",3499,>>
  Quote rule picked = 0
  fill=false and the most number of columns found is 4
[07] Detect column types, good nrow estimate and whether first row is column names
  'header' changed by user from 'auto' to false
  Number of sampling jump points = 101 because (366418375 bytes from row 1 to eof) / (2 * 1457 jump0size) == 125744
  Type codes (jump 000)    : 1551  Quote rule 0
  Type codes (jump 100)    : 11051  Quote rule 0
  =====
  Sampled 10027 rows (handled \n inside quoted fields) at 101 jump points
  Bytes from first data row on line 12 to the end of last row: 366418143
  Line length: mean=16.02 sd=0.21 min=16 max=29
  Estimated number of rows: 366418143 / 16.02 = 22877178
  Initial alloc = 25164895 rows (22877178 + 9%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
  =====
[08] Assign column names
[09] Apply user overrides on column types
  After 0 type and 0 drop user overrides : 11051
[10] Allocate memory for the datatable
  Allocating 4 column slots (4 - 0 dropped) with 25164895 rows
[11] Read the data
  jumps=[0..360), chunk_size=1017828, total_size=366418143
[12] Finalizing the datatable
Read 22885380 rows x 4 columns from 349.4MB (366418725 bytes) file in 00:01.212 wall clock time
Thread buffers were grown 0 times (if all 12 threads each grew once, this figure would be 12)
Final type counts
         0 : drop     
         1 : bool8    
         0 : bool8    
         0 : bool8    
         0 : bool8    
         1 : int32    
         0 : int64    
         0 : float64  
         0 : float64  
         0 : float64  
         2 : string   
Rereading 1 columns due to out-of-sample type exceptions.
Column 1 ("") bumped from 'bool8' to 'string' due to <<"Goulburn">> on row 0
[11] Read the data
  jumps=[0..360), chunk_size=1017828, total_size=366418143
Read 96%. ETA 00:00 
[12] Finalizing the datatable
Reread 22885380 rows x 1 columns in 00:00.414
Read 22885380 rows. Exactly what was estimated and allocated up front
=============================
   0.002s (  0%) Memory map 0.341GB file
   0.007s (  0%) sep=',' ncol=4 and header detection
   0.002s (  0%) Column type detection using 10027 sample rows
   0.293s ( 18%) Allocation of 22885380 rows x 4 cols (0.469GB)
   1.322s ( 81%) Reading 360 chunks of 0.971MB (63547 rows) using 12 threads
   =    0.001s (  0%) Finding first non-embedded \n after each jump
   +    0.199s ( 12%) Parse to row-major thread buffers
   +    0.822s ( 51%) Transpose
   +    0.301s ( 19%) Waiting
   0.414s ( 25%) Rereading 1 columns due to out-of-sample type exceptions
   1.626s        Total
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
  Using 12 threads (omp_get_max_threads()=12, nth=12)
  NAstrings = [<<>>]
  None of the NAstrings look like numbers.
  show progress = 1
  0/1 column will be read as boolean
[02] Opening the file
  Opening file SA2-by-DJZ-2011.csv
  File opened, size = 349.4MB (366418725 bytes).
  Memory mapping ... ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
  \r-only line endings are not allowed because \n is found in the data
[05] Skipping initial rows if needed
  Positioned on line 1 starting: <<Australian Bureau of Statistic>>
[06] Detect separator, quoting rule, and ncolumns
  Detecting sep ...
  sep=','  with 89 lines of 4 fields using quote rule 0
  Detected 4 columns on line 12. This line is either column names or first data row. Line starts as: <<"Goulburn","110018063",3499,>>
  Quote rule picked = 0
  fill=false and the most number of columns found is 4
[07] Detect column types, good nrow estimate and whether first row is column names
  'header' changed by user from 'auto' to false
  Number of sampling jump points = 101 because (366418375 bytes from row 1 to eof) / (2 * 1457 jump0size) == 125744
  Type codes (jump 000)    : 1551  Quote rule 0
  Type codes (jump 100)    : 11051  Quote rule 0
  =====
  Sampled 10027 rows (handled \n inside quoted fields) at 101 jump points
  Bytes from first data row on line 12 to the end of last row: 366418143
  Line length: mean=16.02 sd=0.21 min=16 max=29
  Estimated number of rows: 366418143 / 16.02 = 22877178
  Initial alloc = 25164895 rows (22877178 + 9%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
  =====
[08] Assign column names
[09] Apply user overrides on column types
  After 0 type and 0 drop user overrides : 11051
[10] Allocate memory for the datatable
  Allocating 4 column slots (4 - 0 dropped) with 25164895 rows
[11] Read the data
  jumps=[0..360), chunk_size=1017828, total_size=366418143
[12] Finalizing the datatable
Read 22885380 rows x 4 columns from 349.4MB (366418725 bytes) file in 00:01.451 wall clock time
Thread buffers were grown 0 times (if all 12 threads each grew once, this figure would be 12)
Final type counts
         0 : drop     
         1 : bool8    
         0 : bool8    
         0 : bool8    
         0 : bool8    
         1 : int32    
         0 : int64    
         0 : float64  
         0 : float64  
         0 : float64  
         2 : string   
Rereading 1 columns due to out-of-sample type exceptions.
Column 1 ("") bumped from 'bool8' to 'string' due to <<"Goulburn">> on row 0
[11] Read the data
  jumps=[0..360), chunk_size=1017828, total_size=366418143
Read 97%. ETA 00:00 
[12] Finalizing the datatable
Reread 22885380 rows x 1 columns in 00:00.409
Read 22885380 rows. Exactly what was estimated and allocated up front
=============================
   0.002s (  0%) Memory map 0.341GB file
   0.006s (  0%) sep=',' ncol=4 and header detection
   0.001s (  0%) Column type detection using 10027 sample rows
   0.403s ( 22%) Allocation of 22885380 rows x 4 cols (0.469GB)
   1.448s ( 78%) Reading 360 chunks of 0.971MB (63547 rows) using 12 threads
   =    0.001s (  0%) Finding first non-embedded \n after each jump
   +    0.194s ( 10%) Parse to row-major thread buffers
   +    0.974s ( 52%) Transpose
   +    0.279s ( 15%) Waiting
   0.409s ( 22%) Rereading 1 columns due to out-of-sample type exceptions
   1.860s        Total
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
  Using 12 threads (omp_get_max_threads()=12, nth=12)
  NAstrings = [<<>>]
  None of the NAstrings look like numbers.
  show progress = 1
  0/1 column will be read as boolean
[02] Opening the file
  Opening file SA2-by-DJZ-2011.csv
  File opened, size = 349.4MB (366418725 bytes).
  Memory mapping ... ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
  \r-only line endings are not allowed because \n is found in the data
[05] Skipping initial rows if needed
  Positioned on line 1 starting: <<Australian Bureau of Statistic>>
[06] Detect separator, quoting rule, and ncolumns
  Detecting sep ...
  sep=','  with 89 lines of 4 fields using quote rule 0
  Detected 4 columns on line 12. This line is either column names or first data row. Line starts as: <<"Goulburn","110018063",3499,>>
  Quote rule picked = 0
  fill=false and the most number of columns found is 4
[07] Detect column types, good nrow estimate and whether first row is column names
  'header' changed by user from 'auto' to false
  Number of sampling jump points = 101 because (366418375 bytes from row 1 to eof) / (2 * 1457 jump0size) == 125744
  Type codes (jump 000)    : 1551  Quote rule 0
  Type codes (jump 100)    : 11051  Quote rule 0
  =====
  Sampled 10027 rows (handled \n inside quoted fields) at 101 jump points
  Bytes from first data row on line 12 to the end of last row: 366418143
  Line length: mean=16.02 sd=0.21 min=16 max=29
  Estimated number of rows: 366418143 / 16.02 = 22877178
  Initial alloc = 25164895 rows (22877178 + 9%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
  =====
[08] Assign column names
[09] Apply user overrides on column types
  After 0 type and 0 drop user overrides : 11051
[10] Allocate memory for the datatable
  Allocating 4 column slots (4 - 0 dropped) with 25164895 rows
[11] Read the data
  jumps=[0..360), chunk_size=1017828, total_size=366418143
[12] Finalizing the datatable
Read 22885380 rows x 4 columns from 349.4MB (366418725 bytes) file in 00:01.480 wall clock time
Thread buffers were grown 0 times (if all 12 threads each grew once, this figure would be 12)
Final type counts
         0 : drop     
         1 : bool8    
         0 : bool8    
         0 : bool8    
         0 : bool8    
         1 : int32    
         0 : int64    
         0 : float64  
         0 : float64  
         0 : float64  
         2 : string   
Rereading 1 columns due to out-of-sample type exceptions.
Column 1 ("") bumped from 'bool8' to 'string' due to <<"Goulburn">> on row 0
[11] Read the data
  jumps=[0..360), chunk_size=1017828, total_size=366418143
Read 99%. ETA 00:00 
[12] Finalizing the datatable
Reread 22885380 rows x 1 columns in 00:00.412
Read 22885380 rows. Exactly what was estimated and allocated up front
=============================
   0.002s (  0%) Memory map 0.341GB file
   0.007s (  0%) sep=',' ncol=4 and header detection
   0.001s (  0%) Column type detection using 10027 sample rows
   0.459s ( 24%) Allocation of 22885380 rows x 4 cols (0.469GB)
   1.424s ( 75%) Reading 360 chunks of 0.971MB (63547 rows) using 12 threads
   =    0.001s (  0%) Finding first non-embedded \n after each jump
   +    0.197s ( 10%) Parse to row-major thread buffers
   +    0.938s ( 50%) Transpose
   +    0.288s ( 15%) Waiting
   0.412s ( 22%) Rereading 1 columns due to out-of-sample type exceptions
   1.892s        Total
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
  Using 12 threads (omp_get_max_threads()=12, nth=12)
  NAstrings = [<<>>]
  None of the NAstrings look like numbers.
  show progress = 1
  0/1 column will be read as boolean
[02] Opening the file
  Opening file SA2-by-DJZ-2011.csv
  File opened, size = 349.4MB (366418725 bytes).
  Memory mapping ... ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
  \r-only line endings are not allowed because \n is found in the data
[05] Skipping initial rows if needed
  Positioned on line 1 starting: <<Australian Bureau of Statistic>>
[06] Detect separator, quoting rule, and ncolumns
  Detecting sep ...
  sep=','  with 89 lines of 4 fields using quote rule 0
  Detected 4 columns on line 12. This line is either column names or first data row. Line starts as: <<"Goulburn","110018063",3499,>>
  Quote rule picked = 0
  fill=false and the most number of columns found is 4
[07] Detect column types, good nrow estimate and whether first row is column names
  'header' changed by user from 'auto' to false
  Number of sampling jump points = 101 because (366418375 bytes from row 1 to eof) / (2 * 1457 jump0size) == 125744
  Type codes (jump 000)    : 1551  Quote rule 0
  Type codes (jump 100)    : 11051  Quote rule 0
  =====
  Sampled 10027 rows (handled \n inside quoted fields) at 101 jump points
  Bytes from first data row on line 12 to the end of last row: 366418143
  Line length: mean=16.02 sd=0.21 min=16 max=29
  Estimated number of rows: 366418143 / 16.02 = 22877178
  Initial alloc = 25164895 rows (22877178 + 9%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
  =====
[08] Assign column names
[09] Apply user overrides on column types
  After 0 type and 0 drop user overrides : 11051
[10] Allocate memory for the datatable
  Allocating 4 column slots (4 - 0 dropped) with 25164895 rows
[11] Read the data
  jumps=[0..360), chunk_size=1017828, total_size=366418143
[12] Finalizing the datatable
Read 22885380 rows x 4 columns from 349.4MB (366418725 bytes) file in 00:01.381 wall clock time
Thread buffers were grown 0 times (if all 12 threads each grew once, this figure would be 12)
Final type counts
         0 : drop     
         1 : bool8    
         0 : bool8    
         0 : bool8    
         0 : bool8    
         1 : int32    
         0 : int64    
         0 : float64  
         0 : float64  
         0 : float64  
         2 : string   
Rereading 1 columns due to out-of-sample type exceptions.
Column 1 ("") bumped from 'bool8' to 'string' due to <<"Goulburn">> on row 0
[11] Read the data
  jumps=[0..360), chunk_size=1017828, total_size=366418143
Read 97%. ETA 00:00 
[12] Finalizing the datatable
Reread 22885380 rows x 1 columns in 00:00.401
Read 22885380 rows. Exactly what was estimated and allocated up front
=============================
   0.002s (  0%) Memory map 0.341GB file
   0.005s (  0%) sep=',' ncol=4 and header detection
   0.001s (  0%) Column type detection using 10027 sample rows
   0.384s ( 22%) Allocation of 22885380 rows x 4 cols (0.469GB)
   1.389s ( 78%) Reading 360 chunks of 0.971MB (63547 rows) using 12 threads
   =    0.001s (  0%) Finding first non-embedded \n after each jump
   +    0.196s ( 11%) Parse to row-major thread buffers
   +    0.911s ( 51%) Transpose
   +    0.281s ( 16%) Waiting
   0.401s ( 22%) Rereading 1 columns due to out-of-sample type exceptions
   1.781s        Total
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
  Using 12 threads (omp_get_max_threads()=12, nth=12)
  NAstrings = [<<>>]
  None of the NAstrings look like numbers.
  show progress = 1
  0/1 column will be read as boolean
[02] Opening the file
  Opening file SA2-by-DJZ-2011.csv
  File opened, size = 349.4MB (366418725 bytes).
  Memory mapping ... ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
  \r-only line endings are not allowed because \n is found in the data
[05] Skipping initial rows if needed
  Positioned on line 1 starting: <<Australian Bureau of Statistic>>
[06] Detect separator, quoting rule, and ncolumns
  Detecting sep ...
  sep=','  with 89 lines of 4 fields using quote rule 0
  Detected 4 columns on line 12. This line is either column names or first data row. Line starts as: <<"Goulburn","110018063",3499,>>
  Quote rule picked = 0
  fill=false and the most number of columns found is 4
[07] Detect column types, good nrow estimate and whether first row is column names
  'header' changed by user from 'auto' to false
  Number of sampling jump points = 101 because (366418375 bytes from row 1 to eof) / (2 * 1457 jump0size) == 125744
  Type codes (jump 000)    : 1551  Quote rule 0
  Type codes (jump 100)    : 11051  Quote rule 0
  =====
  Sampled 10027 rows (handled \n inside quoted fields) at 101 jump points
  Bytes from first data row on line 12 to the end of last row: 366418143
  Line length: mean=16.02 sd=0.21 min=16 max=29
  Estimated number of rows: 366418143 / 16.02 = 22877178
  Initial alloc = 25164895 rows (22877178 + 9%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
  =====
[08] Assign column names
[09] Apply user overrides on column types
  After 0 type and 0 drop user overrides : 11051
[10] Allocate memory for the datatable
  Allocating 4 column slots (4 - 0 dropped) with 25164895 rows
[11] Read the data
  jumps=[0..360), chunk_size=1017828, total_size=366418143
[12] Finalizing the datatable
Read 22885380 rows x 4 columns from 349.4MB (366418725 bytes) file in 00:01.384 wall clock time
Thread buffers were grown 0 times (if all 12 threads each grew once, this figure would be 12)
Final type counts
         0 : drop     
         1 : bool8    
         0 : bool8    
         0 : bool8    
         0 : bool8    
         1 : int32    
         0 : int64    
         0 : float64  
         0 : float64  
         0 : float64  
         2 : string   
Rereading 1 columns due to out-of-sample type exceptions.
Column 1 ("") bumped from 'bool8' to 'string' due to <<"Goulburn">> on row 0
[11] Read the data
  jumps=[0..360), chunk_size=1017828, total_size=366418143
Read 98%. ETA 00:00 
[12] Finalizing the datatable
Reread 22885380 rows x 1 columns in 00:00.480
Read 22885380 rows. Exactly what was estimated and allocated up front
=============================
   0.002s (  0%) Memory map 0.341GB file
   0.006s (  0%) sep=',' ncol=4 and header detection
   0.002s (  0%) Column type detection using 10027 sample rows
   0.476s ( 26%) Allocation of 22885380 rows x 4 cols (0.469GB)
   1.378s ( 74%) Reading 360 chunks of 0.971MB (63547 rows) using 12 threads
   =    0.001s (  0%) Finding first non-embedded \n after each jump
   +    0.192s ( 10%) Parse to row-major thread buffers
   +    0.833s ( 45%) Transpose
   +    0.352s ( 19%) Waiting
   0.480s ( 26%) Rereading 1 columns due to out-of-sample type exceptions
   1.864s        Total
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
  Using 12 threads (omp_get_max_threads()=12, nth=12)
  NAstrings = [<<>>]
  None of the NAstrings look like numbers.
  show progress = 1
  0/1 column will be read as boolean
[02] Opening the file
  Opening file SA2-by-DJZ-2011.csv
  File opened, size = 349.4MB (366418725 bytes).
  Memory mapping ... ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
  \r-only line endings are not allowed because \n is found in the data
[05] Skipping initial rows if needed
  Positioned on line 1 starting: <<Australian Bureau of Statistic>>
[06] Detect separator, quoting rule, and ncolumns
  Detecting sep ...
  sep=','  with 89 lines of 4 fields using quote rule 0
  Detected 4 columns on line 12. This line is either column names or first data row. Line starts as: <<"Goulburn","110018063",3499,>>
  Quote rule picked = 0
  fill=false and the most number of columns found is 4
[07] Detect column types, good nrow estimate and whether first row is column names
  'header' changed by user from 'auto' to false
  Number of sampling jump points = 101 because (366418375 bytes from row 1 to eof) / (2 * 1457 jump0size) == 125744
  Type codes (jump 000)    : 1551  Quote rule 0
  Type codes (jump 100)    : 11051  Quote rule 0
  =====
  Sampled 10027 rows (handled \n inside quoted fields) at 101 jump points
  Bytes from first data row on line 12 to the end of last row: 366418143
  Line length: mean=16.02 sd=0.21 min=16 max=29
  Estimated number of rows: 366418143 / 16.02 = 22877178
  Initial alloc = 25164895 rows (22877178 + 9%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
  =====
[08] Assign column names
[09] Apply user overrides on column types
  After 0 type and 0 drop user overrides : 11051
[10] Allocate memory for the datatable
  Allocating 4 column slots (4 - 0 dropped) with 25164895 rows
[11] Read the data
  jumps=[0..360), chunk_size=1017828, total_size=366418143
[12] Finalizing the datatable
Read 22885380 rows x 4 columns from 349.4MB (366418725 bytes) file in 00:01.374 wall clock time
Thread buffers were grown 0 times (if all 12 threads each grew once, this figure would be 12)
Final type counts
         0 : drop     
         1 : bool8    
         0 : bool8    
         0 : bool8    
         0 : bool8    
         1 : int32    
         0 : int64    
         0 : float64  
         0 : float64  
         0 : float64  
         2 : string   
Rereading 1 columns due to out-of-sample type exceptions.
Column 1 ("") bumped from 'bool8' to 'string' due to <<"Goulburn">> on row 0
[11] Read the data
  jumps=[0..360), chunk_size=1017828, total_size=366418143
Read 98%. ETA 00:00 
[12] Finalizing the datatable
Reread 22885380 rows x 1 columns in 00:00.507
Read 22885380 rows. Exactly what was estimated and allocated up front
=============================
   0.002s (  0%) Memory map 0.341GB file
   0.006s (  0%) sep=',' ncol=4 and header detection
   0.001s (  0%) Column type detection using 10027 sample rows
   0.311s ( 17%) Allocation of 22885380 rows x 4 cols (0.469GB)
   1.562s ( 83%) Reading 360 chunks of 0.971MB (63547 rows) using 12 threads
   =    0.001s (  0%) Finding first non-embedded \n after each jump
   +    0.193s ( 10%) Parse to row-major thread buffers
   +    0.988s ( 52%) Transpose
   +    0.381s ( 20%) Waiting
   0.507s ( 27%) Rereading 1 columns due to out-of-sample type exceptions
   1.881s        Total
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
  Using 12 threads (omp_get_max_threads()=12, nth=12)
  NAstrings = [<<>>]
  None of the NAstrings look like numbers.
  show progress = 1
  0/1 column will be read as boolean
[02] Opening the file
  Opening file SA2-by-DJZ-2011.csv
  File opened, size = 349.4MB (366418725 bytes).
  Memory mapping ... ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
  \r-only line endings are not allowed because \n is found in the data
[05] Skipping initial rows if needed
  Positioned on line 1 starting: <<Australian Bureau of Statistic>>
[06] Detect separator, quoting rule, and ncolumns
  Detecting sep ...
  sep=','  with 89 lines of 4 fields using quote rule 0
  Detected 4 columns on line 12. This line is either column names or first data row. Line starts as: <<"Goulburn","110018063",3499,>>
  Quote rule picked = 0
  fill=false and the most number of columns found is 4
[07] Detect column types, good nrow estimate and whether first row is column names
  'header' changed by user from 'auto' to false
  Number of sampling jump points = 101 because (366418375 bytes from row 1 to eof) / (2 * 1457 jump0size) == 125744
  Type codes (jump 000)    : 1551  Quote rule 0
  Type codes (jump 100)    : 11051  Quote rule 0
  =====
  Sampled 10027 rows (handled \n inside quoted fields) at 101 jump points
  Bytes from first data row on line 12 to the end of last row: 366418143
  Line length: mean=16.02 sd=0.21 min=16 max=29
  Estimated number of rows: 366418143 / 16.02 = 22877178
  Initial alloc = 25164895 rows (22877178 + 9%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
  =====
[08] Assign column names
[09] Apply user overrides on column types
  After 0 type and 0 drop user overrides : 11051
[10] Allocate memory for the datatable
  Allocating 4 column slots (4 - 0 dropped) with 25164895 rows
[11] Read the data
  jumps=[0..360), chunk_size=1017828, total_size=366418143
[12] Finalizing the datatable
Read 22885380 rows x 4 columns from 349.4MB (366418725 bytes) file in 00:01.318 wall clock time
Thread buffers were grown 0 times (if all 12 threads each grew once, this figure would be 12)
Final type counts
         0 : drop     
         1 : bool8    
         0 : bool8    
         0 : bool8    
         0 : bool8    
         1 : int32    
         0 : int64    
         0 : float64  
         0 : float64  
         0 : float64  
         2 : string   
Rereading 1 columns due to out-of-sample type exceptions.
Column 1 ("") bumped from 'bool8' to 'string' due to <<"Goulburn">> on row 0
[11] Read the data
  jumps=[0..360), chunk_size=1017828, total_size=366418143
Read 96%. ETA 00:00 
[12] Finalizing the datatable
Reread 22885380 rows x 1 columns in 00:00.493
Read 22885380 rows. Exactly what was estimated and allocated up front
=============================
   0.002s (  0%) Memory map 0.341GB file
   0.006s (  0%) sep=',' ncol=4 and header detection
   0.001s (  0%) Column type detection using 10027 sample rows
   0.306s ( 17%) Allocation of 22885380 rows x 4 cols (0.469GB)
   1.496s ( 83%) Reading 360 chunks of 0.971MB (63547 rows) using 12 threads
   =    0.001s (  0%) Finding first non-embedded \n after each jump
   +    0.193s ( 11%) Parse to row-major thread buffers
   +    0.935s ( 52%) Transpose
   +    0.367s ( 20%) Waiting
   0.493s ( 27%) Rereading 1 columns due to out-of-sample type exceptions
   1.811s        Total
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
  Using 12 threads (omp_get_max_threads()=12, nth=12)
  NAstrings = [<<>>]
  None of the NAstrings look like numbers.
  show progress = 1
  0/1 column will be read as boolean
[02] Opening the file
  Opening file SA2-by-DJZ-2011.csv
  File opened, size = 349.4MB (366418725 bytes).
  Memory mapping ... ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
  \r-only line endings are not allowed because \n is found in the data
[05] Skipping initial rows if needed
  Positioned on line 1 starting: <<Australian Bureau of Statistic>>
[06] Detect separator, quoting rule, and ncolumns
  Detecting sep ...
  sep=','  with 89 lines of 4 fields using quote rule 0
  Detected 4 columns on line 12. This line is either column names or first data row. Line starts as: <<"Goulburn","110018063",3499,>>
  Quote rule picked = 0
  fill=false and the most number of columns found is 4
[07] Detect column types, good nrow estimate and whether first row is column names
  'header' changed by user from 'auto' to false
  Number of sampling jump points = 101 because (366418375 bytes from row 1 to eof) / (2 * 1457 jump0size) == 125744
  Type codes (jump 000)    : 1551  Quote rule 0
  Type codes (jump 100)    : 11051  Quote rule 0
  =====
  Sampled 10027 rows (handled \n inside quoted fields) at 101 jump points
  Bytes from first data row on line 12 to the end of last row: 366418143
  Line length: mean=16.02 sd=0.21 min=16 max=29
  Estimated number of rows: 366418143 / 16.02 = 22877178
  Initial alloc = 25164895 rows (22877178 + 9%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
  =====
[08] Assign column names
[09] Apply user overrides on column types
  After 0 type and 0 drop user overrides : 11051
[10] Allocate memory for the datatable
  Allocating 4 column slots (4 - 0 dropped) with 25164895 rows
[11] Read the data
  jumps=[0..360), chunk_size=1017828, total_size=366418143
[12] Finalizing the datatable
Read 22885380 rows x 4 columns from 349.4MB (366418725 bytes) file in 00:01.141 wall clock time
Thread buffers were grown 0 times (if all 12 threads each grew once, this figure would be 12)
Final type counts
         0 : drop     
         1 : bool8    
         0 : bool8    
         0 : bool8    
         0 : bool8    
         1 : int32    
         0 : int64    
         0 : float64  
         0 : float64  
         0 : float64  
         2 : string   
Rereading 1 columns due to out-of-sample type exceptions.
Column 1 ("") bumped from 'bool8' to 'string' due to <<"Goulburn">> on row 0
[11] Read the data
  jumps=[0..360), chunk_size=1017828, total_size=366418143
Read 98%. ETA 00:00 
[12] Finalizing the datatable
Reread 22885380 rows x 1 columns in 00:00.506
Read 22885380 rows. Exactly what was estimated and allocated up front
=============================
   0.002s (  0%) Memory map 0.341GB file
   0.007s (  0%) sep=',' ncol=4 and header detection
   0.001s (  0%) Column type detection using 10027 sample rows
   0.132s (  8%) Allocation of 22885380 rows x 4 cols (0.469GB)
   1.506s ( 91%) Reading 360 chunks of 0.971MB (63547 rows) using 12 threads
   =    0.001s (  0%) Finding first non-embedded \n after each jump
   +    0.195s ( 12%) Parse to row-major thread buffers
   +    0.938s ( 57%) Transpose
   +    0.371s ( 23%) Waiting
   0.506s ( 31%) Rereading 1 columns due to out-of-sample type exceptions
   1.647s        Total
Warning: stack imbalance in 'for', 2 then 8
There were 20 warnings (use warnings() to see them)

Strangely enough, that certainty is great. Thanks. It means the flush didn't work and I'll have to find a way to avoid Rprintf after all. It works reliably with verbose=FALSE, showProgress=FALSE, I assume (you wrote that near the start of this issue, so I'm counting on that). 'Reliably' meaning 10 straight runs with only the two expected warnings and no sight of the stack imbalance warning.
Leave it with me, then. Thanks again.
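A sketch of that reliability check, assuming the same file as above; after the loop, only the two regular fread warnings should appear (repeated), with nothing about stack imbalance:

library(data.table)
for (i in 1:10) {
  fread("SA2-by-DJZ-2011.csv", header = FALSE, na.strings = "",
        verbose = FALSE, showProgress = FALSE)
}
warnings()   # expect only the two 'discarding line' / 'fill=TRUE' warnings, no stack imbalance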

@HughParsonage Ok, please try again with the recent second attempt. It hasn't been merged to master yet, so take care to fetch the Windows .zip from the branch here. As before, please provide the full output either way so I can check it. Thanks!

The first attempt at the following resulted in a crash (something about a pointer).

The second attempt (after restarting) results in a warning: stack imbalance in '$', 16 then 15.

# Assert that `data.table` is not installed:
stopifnot(!requireNamespace("data.table", quietly = TRUE))

install.packages("https://ci.appveyor.com/api/buildjobs/bpsehtwybbbgbyy3/artifacts/data.table_1.10.5.zip",
                 repos = NULL)
# Installing package into ‘C:/Users/hughp/Documents/R/win-library/3.4’
# (as ‘lib’ is unspecified)
# trying URL 'https://ci.appveyor.com/api/buildjobs/bpsehtwybbbgbyy3/artifacts/data.table_1.10.5.zip'
# Content type 'application/octet-stream' length 1557502 bytes (1.5 MB)
# downloaded 1.5 MB

library(data.table)
# data.table 1.10.5 IN DEVELOPMENT built 2017-11-17 01:38:17 UTC; appveyor
# The fastest way to learn (by data.table authors): https://www.datacamp.com/courses/data-analysis-the-data-table-way
# Documentation: ?data.table, example(data.table) and browseVignettes("data.table")
# Release notes, videos and slides: http://r-datatable.com

setwd("~/ABS-data/inbox/SA2-by-DJZ-2011/")
fread("SA2-by-DJZ-2011.csv", verbose = TRUE, na.strings = "", header = FALSE)
# Input contains no \n. Taking this to be a filename to open
# [01] Check arguments
# Using 12 threads (omp_get_max_threads()=12, nth=12)
# NAstrings = [<<>>]
# None of the NAstrings look like numbers.
# show progress = 1
# 0/1 column will be read as boolean
# [02] Opening the file
# Opening file SA2-by-DJZ-2011.csv
# File opened, size = 349.4MB (366418725 bytes).
# Memory mapping ... ok
# [03] Detect and skip BOM
# [04] Arrange mmap to be \0 terminated
# \r-only line endings are not allowed because \n is found in the data
# [05] Skipping initial rows if needed
# Positioned on line 1 starting: <<Australian Bureau of Statistic>>
#   [06] Detect separator, quoting rule, and ncolumns
# Detecting sep ...
# sep=','  with 89 lines of 4 fields using quote rule 0
# Detected 4 columns on line 12. This line is either column names or first data row. Line starts as: <<"Goulburn","110018063",3499,>>
#   Quote rule picked = 0
# fill=false and the most number of columns found is 4
# [07] Detect column types, good nrow estimate and whether first row is column names
# 'header' changed by user from 'auto' to false
# Number of sampling jump points = 101 because (366418375 bytes from row 1 to eof) / (2 * 1457 jump0size) == 125744
# Type codes (jump 000)    : 1551  Quote rule 0
# Type codes (jump 100)    : 1A51  Quote rule 0
# =====
#   Sampled 10027 rows (handled \n inside quoted fields) at 101 jump points
# Bytes from first data row on line 12 to the end of last row: 366418143
# Line length: mean=16.02 sd=0.21 min=16 max=29
# Estimated number of rows: 366418143 / 16.02 = 22877178
# Initial alloc = 25164895 rows (22877178 + 9%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
# =====
#   [08] Assign column names
# [09] Apply user overrides on column types
# After 0 type and 0 drop user overrides : 1A51
# [10] Allocate memory for the datatable
# Allocating 4 column slots (4 - 0 dropped) with 25164895 rows
# [11] Read the data
# jumps=[0..360), chunk_size=1017828, total_size=366418143
# Read 78%. ETA 00:00 Warning: stack imbalance in '$', 16 then 15
# Read 22885380 rows x 4 columns from 349.4MB (366418725 bytes) file in 00:02.677 wall clock time
# [12] Finalizing the datatable
# Type counts:
#   1 : bool8     '1'
# 1 : int32     '5'
# 2 : string    'A'
# =============================
#   0.002s (  0%) Memory map 0.341GB file
# 0.007s (  0%) sep=',' ncol=4 and header detection
# 0.001s (  0%) Column type detection using 10027 sample rows
# 0.297s ( 11%) Allocation of 25164895 rows x 4 cols (0.469GB) of which 22885380 ( 91%) rows used
# 2.369s ( 89%) Reading 360 chunks of 0.971MB (63547 rows) using 12 threads
# =    0.003s (  0%) Finding first non-embedded \n after each jump
# +    0.273s ( 10%) Parse to row-major thread buffers (grown 0 times)
# +    1.313s ( 49%) Transpose
# +    0.780s ( 29%) Waiting
# 0.893s ( 33%) Rereading 1 columns due to out-of-sample type exceptions
# 2.677s        Total
# Column 1 ("") bumped from 'bool8' to 'string' due to <<"Goulburn">> on row 0
# V1        V2      V3 V4
# 1: Goulburn 110018063    3499 NA
# 2:       NA 110018064     812 NA
# 3:       NA 110018065    2158 NA
# 4:       NA 110019999     402 NA
# 5:       NA 110028068      10 NA
# ---                              
#   22885376:       NA 997999799       0 NA
# 22885377:       NA 998999899      64 NA
# 22885378:       NA 994999499      34 NA
# 22885379:       NA 0&&&&&&&&  250796 NA
# 22885380:       NA 0@@@@@@@@ 7305367 NA
# Warning messages:
#   1: In fread("SA2-by-DJZ-2011.csv", verbose = TRUE, na.strings = "",  :
#                 Starting data input on line 12 <<"Goulburn","110018063",3499,>> with 4 fields and discarding line 11 <<"Main Statistical Area Structu>> before it because it has a different number of fields (3).
#               2: In fread("SA2-by-DJZ-2011.csv", verbose = TRUE, na.strings = "",  :
#               Found the last consistent line but text exists afterwards. Consider fill=TRUE and/or blank.lines.skip=TRUE. First 200 characters of discarded line: <<"Dataset: 2011 Census of Population and Housing">>

Hi @mattdowle. There are versions of GCC still in use whose OpenMP is at most 3.1, not 4.0. I had this problem in one of my CRAN packages (Delaporte) where I tried to use a SIMD directive (OpenMP 4.0) that compiled with Rtools for Windows (based on 4.9.3) but threw an error on the Linux machine of someone still using gcc 4.8.0. Even Windows can only use 4.0 calls and not 4.5, if I remember correctly. Could that be contributing to the problem?
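For reference, a minimal sketch (illustrative only, not the actual Delaporte or data.table code) of how an OpenMP 4.0 directive can be guarded with the _OPENMP version macro so that an older compiler such as gcc 4.8, which only provides OpenMP 3.1, simply skips it:

#include <stddef.h>

/* gcc 4.8 defines _OPENMP as 201107 (OpenMP 3.1); OpenMP 4.0 is 201307,
   so the simd directive below is only emitted where the compiler understands it. */
void scale(double *x, size_t n, double a) {
#if defined(_OPENMP) && _OPENMP >= 201307
  #pragma omp simd
#endif
  for (size_t i = 0; i < n; i++)
    x[i] *= a;
}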

@HughParsonage Thanks for testing so quickly! Ok, I'll keep thinking then!
@aadler That's a good idea - anything is possible.

@HughParsonage Just to confirm: the same command with only one change (verbose=FALSE) works fine? i.e. fread("SA2-by-DJZ-2011.csv", verbose = FALSE, na.strings = "", header = FALSE). The progress meter will still be displayed.

Yes, running that command (ten times) returned the expected result (i.e. the data.table with just the two warnings, because the file is badly formatted). No stack imbalance warnings.

Thanks. So it does seem related to console output then. A few more things to try ...

In verbose mode there are a few branches inside the parallel region that call wallclock(). I short-circuited that to always return 0.0 and avoid the system call, to rule that out. I thought it was thread-safe, but maybe not. Please try the new Windows.zip from the rebuilt branch here.
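(An illustrative sketch of that kind of short-circuit, using hypothetical names rather than the actual fread.c ones: the timing helper can be compiled to return 0.0 so that no clock call is made from inside the parallel region while it is being ruled out.)

#include <omp.h>

/* Hypothetical stand-in for fread's timing helper: define RULE_OUT_CLOCK_CALLS
   to take the clock/system call out of the picture entirely. */
static double wallclock_maybe(void) {
#ifdef RULE_OUT_CLOCK_CALLS
  return 0.0;
#else
  return omp_get_wtime();   /* normal path: real wall-clock seconds */
#endif
}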

First attempt:

install.packages("https://ci.appveyor.com/api/buildjobs/o0pn9ttkrbqgqw2k/artifacts/data.table_1.10.5.zip",
                 repos = NULL)
# Installing package into ‘C:/Users/hughp/Documents/R/win-library/3.4’
# (as ‘lib’ is unspecified)
# trying URL 'https://ci.appveyor.com/api/buildjobs/o0pn9ttkrbqgqw2k/artifacts/data.table_1.10.5.zip'
# Content type 'application/octet-stream' length 1556972 bytes (1.5 MB)
# downloaded 1.5 MB
# 
# package ‘data.table’ successfully unpacked and MD5 sums checked

library(data.table)
# data.table 1.10.5 IN DEVELOPMENT built 2017-11-17 03:49:20 UTC; appveyor
# The fastest way to learn (by data.table authors): https://www.datacamp.com/courses/data-analysis-the-data-table-way
# Documentation: ?data.table, example(data.table) and browseVignettes("data.table")
# Release notes, videos and slides: http://r-datatable.com
fread("SA2-by-DJZ-2011.csv", verbose = TRUE, na.strings = "", header = FALSE)

[screenshot]

Second attempt; I get the following warnings:

Read 22%. ETA 00:00 Error in fread("SA2-by-DJZ-2011.csv", verbose = TRUE, na.strings = "",  : 
  unprotect_ptr: pointer not found
In addition: Warning messages:
1: In fread("SA2-by-DJZ-2011.csv", verbose = TRUE, na.strings = "",  :
  Starting data input on line 12 <<"Goulburn","110018063",3499,>> with 4 fields and discarding line 11 <<"Main Statistical Area Structu>> before it because it has a different number of fields (3).
2: In fread("SA2-by-DJZ-2011.csv", verbose = TRUE, na.strings = "",  :
 Warning: stack imbalance in '$', 29 then 28
 Found the last consistent line but text exists afterwards. Consider fill=TRUE and/or blank.lines.skip=TRUE. First 200 characters of discarded line: <<"Dataset: 2011 Census of Population and Housing">>
Warning: stack imbalance in 'lapply', 125 then 126
Warning: stack imbalance in 'lapply', 55 then 53
Warning: stack imbalance in 'lapply', 30 then 34
Warning: stack imbalance in '<-', 28 then 31
Warning: stack imbalance in '{', 24 then 27
Warning: stack imbalance in '{', 18 then 21

Just a thought: could this be a problem with RStudio? Running the script from the terminal doesn't seem to reproduce it as easily. I'm running RStudio because it makes copying the console output easier.

When you say it doesn't reproduce _as easily_ outside RStudio, does it reproduce _at all_? Even if it only happens inside RStudio, it's still something I intend to fix on the data.table side. I'm just asking as another route to confirm it really is "just" console output and not some other genuine stack imbalance in fread's logic.

I have yet to reproduce it outside RStudio, and I can reproduce it reliably inside RStudio (i.e. I can reproduce some warning or crash). I've tried the Windows command prompt and the git shell (on Windows).

I'm using RStudio version 1.1.383 on Windows. Would it be useful to you if I raised this issue with them as well, or would you rather I wait?

Thanks. It's very useful to know it's only inside RStudio. No need to talk to them. It just means it has something to do with buffering of console output (or similar). I've gone ahead with a workaround and am about to push.

I can't see why Windows isn't compiling this change:
fread.c:1054:3: warning: too many arguments for format [-Wformat-extra-args]
It works fine here on Linux and on Travis. That's preventing the Windows.zip from being built for you to test that workaround. I'll have to sleep on it.
(It's complaining about line 1054 but not the next line 1055, which is exactly the same afaics. There must be some difference. Is %llu a problem with __VA_ARGS__ on Windows? Surely not.)

Ok, the Windows.zip is finally ready for you to try again here.

There are several workarounds in this branch now. If it works, I'll start removing the workarounds to establish which one it was. The %llu compiler warnings look the most promising, since that would give rise to a stack imbalance in the verbose output, consistent with the explanation @st-pasha found here. Perhaps the Rprintf layer was hiding it from the compiler, which can now see it since it's using fprintf directly.

[screenshot]

On the second attempt (after restarting):

stopifnot(!requireNamespace("data.table", quietly = TRUE))
install.packages("https://ci.appveyor.com/api/buildjobs/1pi0ae5iuyj9rhj8/artifacts/data.table_1.10.5.zip",
                 repos = NULL)
# Installing package into ‘C:/Users/hughp/Documents/R/win-library/3.4’
# (as ‘lib’ is unspecified)
# trying URL 'https://ci.appveyor.com/api/buildjobs/1pi0ae5iuyj9rhj8/artifacts/data.table_1.10.5.zip'
# Content type 'application/octet-stream' length 1559167 bytes (1.5 MB)
# downloaded 1.5 MB
# 
# package ‘data.table’ successfully unpacked and MD5 sums checked

library(data.table)
# data.table 1.10.5 IN DEVELOPMENT built 2017-11-18 04:58:23 UTC; appveyor
# The fastest way to learn (by data.table authors): https://www.datacamp.com/courses/data-analysis-the-data-table-way
# Documentation: ?data.table, example(data.table) and browseVignettes("data.table")
# Release notes, videos and slides: http://r-datatable.com

fread("SA2-by-DJZ-2011.csv", verbose = TRUE, na.strings = "", header = FALSE)
Log file:  C:\Users\hughp\AppData\Local\Temp\RtmpIT9H0D/fread.out 
Input contains no \n. Taking this to be a filename to open
Read 11%. ETA 00:00 Warning: stack imbalance in '$', 19 then 20
Read 28%. ETA 00:00 Warning: stack imbalance in '$', 19 then 20
Read 48%. ETA 00:00 Warning: stack imbalance in '$', 20 then 19
Read 98%. ETA 00:00 [01] Check arguments
  Using 12 threads (omp_get_max_threads()=12, nth=12)
  NAstrings = [<<>>]
  None of the NAstrings look like numbers.
  show progress = 1
  0/1 column will be read as boolean
[02] Opening the file
  Opening file SA2-by-DJZ-2011.csv
  File opened, size = 349.4MB (366418725 bytes).
  Memory mapped ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
  \n has been found in the data so any mixture of line endings is allowed other than \r-only line endings. This is common and ideal.
[05] Skipping initial rows if needed
  Positioned on line 1 starting: <<Australian Bureau of Statistic>>
[06] Detect separator, quoting rule, and ncolumns
  Detecting sep ...
  sep=','  with 89 lines of 4 fields using quote rule 0
  Detected 4 columns on line 12. This line is either column names or first data row. Line starts as: <<"Goulburn","110018063",3499,>>
  Quote rule picked = 0
  fill=false and the most number of columns found is 4
[07] Detect column types, good nrow estimate and whether first row is column names
  'header' changed by user from 'auto' to false
  Number of sampling jump points = 101 because (366418375 bytes from row 1 to eof) / (2 * 1457 jump0size) == 125744
  Type codes (jump 000)    : 1551  Quote rule 0
  Type codes (jump 100)    : 1A51  Quote rule 0
  =====
  Sampled 10027 rows (handled \n inside quoted fields) at 101 jump points
  Bytes from first data row on line 12 to the end of last row: 366418143
  Line length: mean=16.02 sd=0.21 min=16 max=29
  Estimated number of rows: 366418143 / 16.02 = 22877178
  Initial alloc = 25164895 rows (22877178 + 9%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
  =====
[08] Assign column names
[09] Apply user overrides on column types
  After 0 type and 0 drop user overrides : 1A51
[10] Allocate memory for the datatable
  Allocating 4 column slots (4 - 0 dropped) with 25164895 rows
[11] Read the data
  jumps=[0..360), chunk_size=1017828, total_size=366418143
[11] Read the data
  jumps=[0..360), chunk_size=1017828, total_size=366418143
Read 22885380 rows x 4 columns from 349.4MB (366418725 bytes) file in 00:02.822 wall clock time
[12] Finalizing the datatable
  Type counts:
         1 : bool8     '1'
         1 : int32     '5'
         2 : string    'A'
=============================
   0.000s (  0%) Memory map 0.341GB file
   0.001s (  0%) sep=',' ncol=4 and header detection
   0.000s (  0%) Column type detection using 10027 sample rows
   0.291s ( 10%) Allocation of 25164895 rows x 4 cols (0.469GB) of which 22885380 ( 91%) rows used
   2.531s ( 90%) Reading 360 chunks of 0.971MB (63547 rows) using 12 threads
   =    0.002s (  0%) Finding first non-embedded \n after each jump
   +    0.282s ( 10%) Parse to row-major thread buffers (grown 0 times)
   +    1.537s ( 54%) Transpose
   +    0.710s ( 25%) Waiting
   0.842s ( 30%) Rereading 1 columns due to out-of-sample type exceptions
   2.822s        Total
Column 1 ("") bumped from 'bool8' to 'string' due to <<"Goulburn">> on row 0
                V1        V2      V3 V4
       1: Goulburn 110018063    3499 NA
       2:       NA 110018064     812 NA
       3:       NA 110018065    2158 NA
       4:       NA 110019999     402 NA
       5:       NA 110028068      10 NA
      ---                              
22885376:       NA 997999799       0 NA
22885377:       NA 998999899      64 NA
22885378:       NA 994999499      34 NA
22885379:       NA 0&&&&&&&&  250796 NA
22885380:       NA 0@@@@@@@@ 7305367 NA
Warning messages:
1: In fread("SA2-by-DJZ-2011.csv", verbose = TRUE, na.strings = "",  :
  Starting data input on line 12 <<"Goulburn","110018063",3499,>> with 4 fields and discarding line 11 <<"Main Statistical Area Structu>> before it because it has a different number of fields (3).
2: In fread("SA2-by-DJZ-2011.csv", verbose = TRUE, na.strings = "",  :
  Found the last consistent line but text exists afterwards. Consider fill=TRUE and/or blank.lines.skip=TRUE. First 200 characters of discarded line: <<"Dataset: 2011 Census of Population and Housing">>

Again, not reproducible outside RStudio.

Thanks for testing so quickly. Well, that certainly rules out a lot then! Two ideas remain. The first has been pushed and passed. Please try the new Windows.zip here. That alloca is on the stack and has to do with the na.strings you're setting. Definitely in the right area (stack imbalance) and worth a try.

No problem - I'll be away for the next 12 hours or so, so I can't test until then.

No problem. Thanks! I've also pushed the second idea now. I seem to remember \r causing a problem on Windows in the past, but I don't recall a stack imbalance. Anyway, to rule it out, I've removed \r from the progress meter. The stack imbalance message appears to be printed where the ETA lines occur. It's possible the console catches \r and treats it differently so that the last line is overwritten. You should now see a new line each time the ETA is updated. Just temporarily, to rule this out. New Windows.zip built and passing here.

fread("SA2-by-DJZ-2011.csv", verbose = TRUE, na.strings = "", header = FALSE)
Log file:  C:\Users\hughp\AppData\Local\Temp\RtmpcVjZ1f/fread.out 
Input contains no \n. Taking this to be a filename to open
Read 5%. ETA 00:00
Read 8%. ETA 00:00
Read 11%. ETA 00:00
Read 15%. ETA 00:00
Read 18%. ETA 00:00
Read 21%. ETA 00:00
Read 25%. ETA 00:00
Read 28%. ETA 00:00
Read 31%. ETA 00:00
Read 35%. ETA 00:00
Read 38%. ETA 00:00
Read 41%. ETA 00:00
Read 45%. ETA 00:00
Read 48%. ETA 00:00
Read 51%. ETA 00:00
Read 55%. ETA 00:00
Warning: stack imbalance in '$', 30 then 31
Warning: stack imbalance in '$', 17 then 16
Read 58%. ETA 00:00
Read 61%. ETA 00:00
Read 65%. ETA 00:00
Read 68%. ETA 00:00
Read 71%. ETA 00:00
Read 75%. ETA 00:00
Read 78%. ETA 00:00
Read 81%. ETA 00:00
Read 85%. ETA 00:00
Read 88%. ETA 00:00
Read 91%. ETA 00:00
Read 95%. ETA 00:00
Read 98%. ETA 00:00
[01] Check arguments
  Using 12 threads (omp_get_max_threads()=12, nth=12)
  NAstrings = [<<>>]
  None of the NAstrings look like numbers.
  show progress = 1
  0/1 column will be read as boolean
[02] Opening the file
  Opening file SA2-by-DJZ-2011.csv
  File opened, size = 349.4MB (366418725 bytes).
  Memory mapped ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
  \n has been found in the data so any mixture of line endings is allowed other than \r-only line endings. This is common and ideal.
[05] Skipping initial rows if needed
  Positioned on line 1 starting: <<Australian Bureau of Statistic>>
[06] Detect separator, quoting rule, and ncolumns
  Detecting sep ...
  sep=','  with 89 lines of 4 fields using quote rule 0
  Detected 4 columns on line 12. This line is either column names or first data row. Line starts as: <<"Goulburn","110018063",3499,>>
  Quote rule picked = 0
  fill=false and the most number of columns found is 4
[07] Detect column types, good nrow estimate and whether first row is column names
  'header' changed by user from 'auto' to false
  Number of sampling jump points = 101 because (366418375 bytes from row 1 to eof) / (2 * 1457 jump0size) == 125744
  Type codes (jump 000)    : 1551  Quote rule 0
  Type codes (jump 100)    : 1A51  Quote rule 0
  =====
  Sampled 10027 rows (handled \n inside quoted fields) at 101 jump points
  Bytes from first data row on line 12 to the end of last row: 366418143
  Line length: mean=16.02 sd=0.21 min=16 max=29
  Estimated number of rows: 366418143 / 16.02 = 22877178
  Initial alloc = 25164895 rows (22877178 + 9%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
  =====
[08] Assign column names
[09] Apply user overrides on column types
  After 0 type and 0 drop user overrides : 1A51
[10] Allocate memory for the datatable
  Allocating 4 column slots (4 - 0 dropped) with 25164895 rows
[11] Read the data
  jumps=[0..360), chunk_size=1017828, total_size=366418143
[11] Read the data
  jumps=[0..360), chunk_size=1017828, total_size=366418143
Read 22885380 rows x 4 columns from 349.4MB (366418725 bytes) file in 00:02.894 wall clock time
[12] Finalizing the datatable
  Type counts:
         1 : bool8     '1'
         1 : int32     '5'
         2 : string    'A'
=============================
   0.001s (  0%) Memory map 0.341GB file
   0.003s (  0%) sep=',' ncol=4 and header detection
   0.000s (  0%) Column type detection using 10027 sample rows
   0.316s ( 11%) Allocation of 25164895 rows x 4 cols (0.469GB) of which 22885380 ( 91%) rows used
   2.574s ( 89%) Reading 360 chunks of 0.971MB (63547 rows) using 12 threads
   =    0.004s (  0%) Finding first non-embedded \n after each jump
   +    0.284s ( 10%) Parse to row-major thread buffers (grown 0 times)
   +    1.450s ( 50%) Transpose
   +    0.837s ( 29%) Waiting
   0.953s ( 33%) Rereading 1 columns due to out-of-sample type exceptions
   2.894s        Total
Column 1 ("") bumped from 'bool8' to 'string' due to <<"Goulburn">> on row 0
                V1        V2      V3 V4
       1: Goulburn 110018063    3499 NA
       2:       NA 110018064     812 NA
       3:       NA 110018065    2158 NA
       4:       NA 110019999     402 NA
       5:       NA 110028068      10 NA
      ---                              
22885376:       NA 997999799       0 NA
22885377:       NA 998999899      64 NA
22885378:       NA 994999499      34 NA
22885379:       NA 0&&&&&&&&  250796 NA
22885380:       NA 0@@@@@@@@ 7305367 NA
Warning messages:
1: In fread("SA2-by-DJZ-2011.csv", verbose = TRUE, na.strings = "",  :
  Starting data input on line 12 <<"Goulburn","110018063",3499,>> with 4 fields and discarding line 11 <<"Main Statistical Area Structu>> before it because it has a different number of fields (3).
2: In fread("SA2-by-DJZ-2011.csv", verbose = TRUE, na.strings = "",  :
  Found the last consistent line but text exists afterwards. Consider fill=TRUE and/or blank.lines.skip=TRUE. First 200 characters of discarded line: <<"Dataset: 2011 Census of Population and Housing">>

FYI: I was unable to reproduce this stack imbalance error on a different Windows machine with a slightly older version of RStudio.

In that case it seems it's time to ask RStudio support, as you suggested, please. I've looked over the fread code again and I'm out of ideas on my side. Tell them both RStudio version numbers. It doesn't necessarily mean it's RStudio; it could be a fault on the data.table side that only shows up in one version of RStudio. But it is odd that it seems related to console output and that that is something different and specific to RStudio. I searched for "RStudio stack imbalance" but got mostly issues about package crashes, not about RStudio itself. A hard problem to search for. Let's keep this issue open here and see what they say.

I doubt this latest attempt will help, but for completeness please try here. Maybe the MinGW compiler used on Windows does something strange with those two ints. One of them is the constant 0, which is perhaps optimized away and somehow causes a stack imbalance.

However, that particular stack imbalance message is coming from eval.c:491 in R itself. Some thread must be executing that line, but I don't think it's fread or data.table. That check_stack_balance() is only called from 5 places in R internals:
in names.c at the end of do_internal()
in objects.c, twice in applyMethod()
in eval.c, twice in eval()
I can't see how any of those can be reached while fread.c is in its parallel section. The only entry point being called is REprintf and I don't see how that could reach check_stack_balance(). All I can currently think of is that RStudio has a thread doing something in the background that perhaps interacts with console output, perhaps differently on Windows.
Finally, for completeness, it seems using REprintf is the right way since base R uses it (rather than Rprintf) in its progress meter in libcurl.c:354 and internet.c:409. It's a pity R's C-level progress bar isn't available in R's API (it also appears to be implemented twice within R at C level).

@mattdowle, would this be useful? https://github.com/r-lib/progress

@aadler Yes - thanks! Its source contains this comment:
// In R Studio we should print to stdout, because printing a \r
// to stderr is buggy (reported)
But I've already removed \r and the stack imbalance still happens. I wonder where it was reported.

The latest build didn't work either:

[screenshot]

Reported at https://community.rstudio.com/t/stack-imbalance-possibly-in-stderr/3009

Timely question on R-devel: [Rd] Are Rprintf and REprintf thread-safe?

Conclusion: "Rprintf and REprintf are not thread-safe."

Yoiks!

Thanks all for the links, and thanks Hugh for raising the issue with RStudio.

data.table::fwrite() and data.table::fread() are aware that Rprintf and REprintf are not thread-safe, so for their progress meters they only call them from the master thread. Not only do two data.table threads never call that R entry point at the same time, only the master thread calls it at all, and it is the only R entry point called by any of the threads at any time during the parallel section. However, Rprintf calls R_CheckUserInterrupt every 100 prints. I think that's the part that is probably not safe, even from just the master thread. That's the reason for now using REprintf, since it doesn't call R_CheckUserInterrupt. R internals use REprintf for progress meters, so switching to REprintf for consistency with core R makes sense; i.e. this choice has nothing to do with stderr vs stdout per se.
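A minimal sketch of that pattern (illustrative only, not fread.c itself; the function name is hypothetical): inside the parallel region only the master thread ever calls back into R, and the single R entry point it uses is REprintf.

#include <omp.h>
#include <R_ext/Print.h>   /* REprintf */

void read_chunks(int njumps) {
  #pragma omp parallel for schedule(dynamic)
  for (int jump = 0; jump < njumps; jump++) {
    /* ... parse chunk 'jump' into a thread-local buffer; no R API here ... */
    if (omp_get_thread_num() == 0) {
      /* master thread only: the single R entry point in the parallel section */
      REprintf("Read %d%%. ETA 00:00\n", (int)(100.0 * jump / njumps));
    }
  }
}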

@kevinushey, would you mind taking a look at this thread and letting me know anything else I could try? Could it be RStudio-related somehow, perhaps to do with a background thread? If RStudio has a background thread, then Rprintf/REprintf could be called from two threads at the same time. But if that were the case we would have seen many more problems before now, so that seems very unlikely. Perhaps RStudio overrides the ptr_* callbacks mentioned in section 8.1.2 of R-exts, the ones related to console output and interaction. However, that section starts with "For unix-alikes", so I don't know how Windows comes into it. Perhaps section 8.1.5 on threading issues is also relevant. Both are subsections of section 8: "Linking GUIs and other front-ends to R".

I'll be away until early December, so unfortunately I won't get a chance to look until then. However, RStudio runs almost everything on the main thread using the R event loop; the only exceptions are for e.g. project-level file indexing, and those background threads normally don't touch any R API.

RStudio does take control of the various ptr_* callbacks to handle console input and output; I can't immediately think how they could be a cause here, but I'll try to take a deeper look when I'm back.

Ok, try this one here. Before, it updated the progress status every 2%. In your case the file only takes under 3 seconds, so there was a new progress update in the RStudio console every 0.06 seconds. Maybe that was too much for RStudio. So this attempt prints a bar instead. It doesn't use \r at all. That should be better for reports and log files anyway, where \r could flood the output.

Since your 3-second time is quite fast, I lowered the progress bar to start at 1 second if there is an ETA of 1 second from that point. Otherwise it wouldn't be displayed, and it would only "work" for your file because it wasn't being displayed. Once you've tested, I'll raise it to what fwrite has; i.e. start at 2 seconds if the ETA is 2 seconds from that point.
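A rough sketch of a carriage-return-free bar of that shape (illustrative only; the names below are not fread.c's): print the 50-character scale once, then append '=' characters as progress advances, so nothing ever needs to be overwritten.

#include <R_ext/Print.h>   /* REprintf */

static int bars_printed = 0;            /* how many '=' emitted so far */

void progress_bar(double done)          /* done is a fraction in [0, 1] */
{
  if (bars_printed == 0)
    REprintf("|--------------------------------------------------|\n|");
  int target = (int)(done * 50.0);
  while (bars_printed < target) { REprintf("="); bars_printed++; }
  if (done >= 1.0) { REprintf("|\n"); bars_printed = 0; }
}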

Hi @mattdowle. My latest comment in #2503 may be related to this issue too.

Looks good! No warnings (after 5 runs). First run below (note the leading spaces look different in the actual output):

stopifnot(!requireNamespace("data.table", quietly = TRUE))
install.packages("https://ci.appveyor.com/api/buildjobs/1o9s06o31v8i3ljr/artifacts/data.table_1.10.5.zip",
                 repos = NULL)
# Installing package into ‘C:/Users/hughp/Documents/R/win-library/3.4’
# (as ‘lib’ is unspecified)
# trying URL 'https://ci.appveyor.com/api/buildjobs/1o9s06o31v8i3ljr/artifacts/data.table_1.10.5.zip'
# Content type 'application/octet-stream' length 1557423 bytes (1.5 MB)
# downloaded 1.5 MB
# 
# package ‘data.table’ successfully unpacked and MD5 sums checked

library(data.table)
# data.table 1.10.5 IN DEVELOPMENT built 2017-11-30 00:21:00 UTC; appveyor
# The fastest way to learn (by data.table authors): https://www.datacamp.com/courses/data-analysis-the-data-table-way
# Documentation: ?data.table, example(data.table) and browseVignettes("data.table")
# Release notes, videos and slides: http://r-datatable.com
setwd("~/ABS-data/inbox/SA2-by-DJZ-2011/")
fread("SA2-by-DJZ-2011.csv", verbose = TRUE, na.strings = "", header = FALSE)
# Input contains no \n. Taking this to be a filename to open
# [01] Check arguments
# Using 12 threads (omp_get_max_threads()=12, nth=12)
# NAstrings = [<<>>]
# None of the NAstrings look like numbers.
# show progress = 1
# 0/1 column will be read as boolean
# [02] Opening the file
# Opening file SA2-by-DJZ-2011.csv
# File opened, size = 349.4MB (366418725 bytes).
# Memory mapped ok
# [03] Detect and skip BOM
# [04] Arrange mmap to be \0 terminated
# \n has been found in the data so any mixture of line endings is allowed other than \r-only line endings. This is common and ideal.
# [05] Skipping initial rows if needed
# Positioned on line 1 starting: <<Australian Bureau of Statistic>>
#   [06] Detect separator, quoting rule, and ncolumns
# Detecting sep ...
# sep=','  with 89 lines of 4 fields using quote rule 0
# Detected 4 columns on line 12. This line is either column names or first data row. Line starts as: <<"Goulburn","110018063",3499,>>
#   Quote rule picked = 0
# fill=false and the most number of columns found is 4
# [07] Detect column types, good nrow estimate and whether first row is column names
# 'header' changed by user from 'auto' to false
# Number of sampling jump points = 101 because (366418375 bytes from row 1 to eof) / (2 * 1457 jump0size) == 125744
# Type codes (jump 000)    : 1551  Quote rule 0
# Type codes (jump 100)    : 1A51  Quote rule 0
# =====
#   Sampled 10027 rows (handled \n inside quoted fields) at 101 jump points
# Bytes from first data row on line 12 to the end of last row: 366418143
# Line length: mean=16.02 sd=0.21 min=16 max=29
# Estimated number of rows: 366418143 / 16.02 = 22877178
# Initial alloc = 25164895 rows (22877178 + 9%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
# =====
#   [08] Assign column names
# [09] Apply user overrides on column types
# After 0 type and 0 drop user overrides : 1A51
# [10] Allocate memory for the datatable
# Allocating 4 column slots (4 - 0 dropped) with 25164895 rows
# [11] Read the data
# jumps=[0..360), chunk_size=1017828, total_size=366418143
# jumps=[0..360), chunk_size=1017828, total_size=366418143
# |--------------------------------------------------|
#   |==================================================|
#   Read 22885380 rows x 4 columns from 349.4MB (366418725 bytes) file in 00:02.280 wall clock time
# [12] Finalizing the datatable
# Type counts:
#   1 : bool8     '1'
# 1 : int32     '5'
# 2 : string    'A'
# =============================
#   0.005s (  0%) Memory map 0.341GB file
# 0.037s (  2%) sep=',' ncol=4 and header detection
# 0.000s (  0%) Column type detection using 10027 sample rows
# 0.321s ( 14%) Allocation of 25164895 rows x 4 cols (0.469GB) of which 22885380 ( 91%) rows used
# 1.917s ( 84%) Reading 360 chunks of 0.971MB (63547 rows) using 12 threads
# =    0.011s (  0%) Finding first non-embedded \n after each jump
# +    0.560s ( 25%) Parse to row-major thread buffers (grown 0 times)
# +    0.488s ( 21%) Transpose
# +    0.858s ( 38%) Waiting
# 0.999s ( 44%) Rereading 1 columns due to out-of-sample type exceptions
# 2.280s        Total
# Column 1 ("") bumped from 'bool8' to 'string' due to <<"Goulburn">> on row 0
# V1        V2      V3 V4
# 1: Goulburn 110018063    3499 NA
# 2:       NA 110018064     812 NA
# 3:       NA 110018065    2158 NA
# 4:       NA 110019999     402 NA
# 5:       NA 110028068      10 NA
# ---                              
#   22885376:       NA 997999799       0 NA
# 22885377:       NA 998999899      64 NA
# 22885378:       NA 994999499      34 NA
# 22885379:       NA 0&&&&&&&&  250796 NA
# 22885380:       NA 0@@@@@@@@ 7305367 NA
# Warning messages:
#   1: In fread("SA2-by-DJZ-2011.csv", verbose = TRUE, na.strings = "",  :
#                 Starting data input on line 12 <<"Goulburn","110018063",3499,>> with 4 fields and discarding line 11 <<"Main Statistical Area Structu>> before it because it has a different number of fields (3).
#               2: In fread("SA2-by-DJZ-2011.csv", verbose = TRUE, na.strings = "",  :
#               Found the last consistent line but text exists afterwards. Consider fill=TRUE and/or blank.lines.skip=TRUE. First 200 characters of discarded line: <<"Dataset: 2011 Census of Population and Housing">>

@HughParsonage What a relief! I think that's a win then. I'll tidy up, merge and move on. Many thanks for testing.

@aadler Yes, agreed that your comment in issue #2503 looks like the same thing. Could you test the latest from dev too and confirm it's now fixed? Hopefully the as.IDate problem you found was caused by this earlier stack imbalance.

No good :(

> library(data.table)
data.table 1.10.5 IN DEVELOPMENT built 2017-11-30 00:21:00 UTC; appveyor
  The fastest way to learn (by data.table authors): https://www.datacamp.com/courses/data-analysis-the-data-table-way
  Documentation: ?data.table, example(data.table) and browseVignettes("data.table")
  Release notes, videos and slides: http://r-datatable.com
> sessionInfo()
R version 3.4.2 beta (2017-09-17 r73296)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows Server >= 2012 x64 (build 9200)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252    LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                           LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] data.table_1.10.5

loaded via a namespace (and not attached):
[1] compiler_3.4.2 tools_3.4.2   
> DT <- fread('2017-11-22_1999_Performance.csv', header = TRUE, colClasses = CLS, select = SEL, key = c('LoanID', 'Month'), verbose = TRUE)
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
  Using 40 threads (omp_get_max_threads()=40, nth=40)
  NAstrings = [<<NA>>]
  None of the NAstrings look like numbers.
  show progress = 1
  0/1 column will be read as boolean
[02] Opening the file
  Opening file 2017-11-22_1999_Performance.csv
  File opened, size = 6.355GB (6823372783 bytes).
  Memory mapped ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
  \n has been found in the data so any mixture of line endings is allowed other than \r-only line endings. This is common and ideal.
[05] Skipping initial rows if needed
  Positioned on line 1 starting: <<Orig_Year,Orig_Qtr,LoanID,Mont>>
[06] Detect separator, quoting rule, and ncolumns
  Detecting sep ...
  sep=','  with 100 lines of 37 fields using quote rule 0
  Detected 37 columns on line 1. This line is either column names or first data row. Line starts as: <<Orig_Year,Orig_Qtr,LoanID,Mont>>
  Quote rule picked = 0
  fill=false and the most number of columns found is 37
[07] Detect column types, good nrow estimate and whether first row is column names
  'header' changed by user from 'auto' to true
  Number of sampling jump points = 101 because (6823372781 bytes from row 1 to eof) / (2 * 12905 jump0size) == 264369
  Type codes (jump 000)    : 51AA7155A15A7111111111111771111177715  Quote rule 0
  Type codes (jump 001)    : 51AA7155A15A711111111111177111117771A  Quote rule 0
  Type codes (jump 008)    : 51AA7555A15A711111111111177111117771A  Quote rule 0
  Type codes (jump 009)    : 51AA7555AA5A71A155555557177111117775A  Quote rule 0
  Type codes (jump 042)    : 55AA7555AA5A71A155555557177111117775A  Quote rule 0
  Type codes (jump 064)    : 55AA7555AA5A71A1A5555557177111117775A  Quote rule 0
  Type codes (jump 100)    : 55AA7555AA5A71A1A5555557177111117775A  Quote rule 0
  =====
  Sampled 10049 rows (handled \n inside quoted fields) at 101 jump points
  Bytes from first data row on line 1 to the end of last row: 6823372781
  Line length: mean=126.15 sd=8.30 min=100 max=359
  Estimated number of rows: 6823372781 / 126.15 = 54088821
  Initial alloc = 62279495 rows (54088821 + 15%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
  =====
[08] Assign column names
[09] Apply user overrides on column types
  After 4 type and 23 drop user overrides : 00AA700000000000000000070775555077750
[10] Allocate memory for the datatable
  Allocating 14 column slots (37 - 23 dropped) with 62279495 rows
[11] Read the data
  jumps=[0..6520), chunk_size=1046529, total_size=6823372422
|--------------------------------------------------|
|=======Warning: stack imbalance in '$', 27 then 26
===Warning: stack imbalance in '$', 26 then 27
================Error in fread("2017-11-22_1999_Performance.csv", header = TRUE, colClasses = CLS,  : 
  unprotect_ptr: pointer not found

@aadler Thanks for that report. I went through freadR and localized the protection. There's a 30% chance it works, since in your case you're overriding types and there were some protects in that part of the code. Please try again using this build.

@aadler If you haven't tried the last build yet, skip straight to trying this one. Also, if there's any way to get a copy of your file to me, I can try it myself in RStudio on Windows.

:(

> library(data.table)
data.table 1.10.5 IN DEVELOPMENT built 2017-12-01 01:54:04 UTC; appveyor
  The fastest way to learn (by data.table authors): https://www.datacamp.com/courses/data-analysis-the-data-table-way
  Documentation: ?data.table, example(data.table) and browseVignettes("data.table")
  Release notes, videos and slides: http://r-datatable.com
> sessionInfo()
R version 3.4.2 beta (2017-09-17 r73296)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows Server >= 2012 x64 (build 9200)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252    LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                           LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] data.table_1.10.5

loaded via a namespace (and not attached):
[1] compiler_3.4.2 tools_3.4.2   
> ColCLASS <- c(rep('integer', 2L), 'character', 'Date', 'numeric',
+               rep('integer', 3L), rep('character', 2L),
+               'integer', 'Date', rep('numeric', 2L), 'Date',
+               rep('numeric', 12L), rep('integer', 5),
+               rep('numeric', 3L), 'integer', 'character')
> SELCOL <- c(WHATEVER)
> DT <- fread('LargeFile.csv', header = TRUE, colClasses = ColCLASS, select = SELCOL, key = c('LoanID', 'Month'), verbose = TRUE)
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
  Using 40 threads (omp_get_max_threads()=40, nth=40)
  NAstrings = [<<NA>>]
  None of the NAstrings look like numbers.
  show progress = 1
  0/1 column will be read as boolean
[02] Opening the file
  Opening file LargeFile.csv
  File opened, size = 6.355GB (6823372783 bytes).
  Memory mapped ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
  \n has been found in the data so any mixture of line endings is allowed other than \r-only line endings. This is common and ideal.
[05] Skipping initial rows if needed
  Positioned on line 1 starting: <<Orig_Year,Orig_Qtr,LoanID,Mont>>
[06] Detect separator, quoting rule, and ncolumns
  Detecting sep ...
  sep=','  with 100 lines of 37 fields using quote rule 0
  Detected 37 columns on line 1. This line is either column names or first data row. Line starts as: <<Orig_Year,Orig_Qtr,LoanID,Mont>>
  Quote rule picked = 0
  fill=false and the most number of columns found is 37
[07] Detect column types, good nrow estimate and whether first row is column names
  'header' changed by user from 'auto' to true
  Number of sampling jump points = 101 because (6823372781 bytes from row 1 to eof) / (2 * 12905 jump0size) == 264369
  Type codes (jump 000)    : 51AA7155A15A7111111111111771111177715  Quote rule 0
  Type codes (jump 001)    : 51AA7155A15A711111111111177111117771A  Quote rule 0
  Type codes (jump 008)    : 51AA7555A15A711111111111177111117771A  Quote rule 0
  Type codes (jump 009)    : 51AA7555AA5A71A155555557177111117775A  Quote rule 0
  Type codes (jump 042)    : 55AA7555AA5A71A155555557177111117775A  Quote rule 0
  Type codes (jump 064)    : 55AA7555AA5A71A1A5555557177111117775A  Quote rule 0
  Type codes (jump 100)    : 55AA7555AA5A71A1A5555557177111117775A  Quote rule 0
  =====
  Sampled 10049 rows (handled \n inside quoted fields) at 101 jump points
  Bytes from first data row on line 1 to the end of last row: 6823372781
  Line length: mean=126.15 sd=8.30 min=100 max=359
  Estimated number of rows: 6823372781 / 126.15 = 54088821
  Initial alloc = 62279495 rows (54088821 + 15%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
  =====
[08] Assign column names
[09] Apply user overrides on column types
  After 4 type and 23 drop user overrides : 00AA700000000000000000070775555077750
[10] Allocate memory for the datatable
  Allocating 14 column slots (37 - 23 dropped) with 62279495 rows
[11] Read the data
  jumps=[0..6520), chunk_size=1046529, total_size=6823372422
|--------------------------------------------------|
|Error in fread("LargeFile.csv", header = TRUE, colClasses = ColCLASS,  : 
  unprotect_ptr: pointer not found

Thanks to @aadler over email, I can now reproduce. R 3.4.2, latest RStudio 1.1.383, and Windows 10 Pro 10.0.16299 Build 16299.

I'm seeing some strange behaviour in RStudio, recorded here:
https://www.youtube.com/watch?v=tl2x2vmZxMU
It looks like RStudio is triggering GCs just from typing. Why is that, and is there any way to turn it off? Could it be that when fread() is printing its progress bar, RStudio's separate event loop thinks the output to the console is the user typing and calls into R, which gives rise to the GC and unsettles everything? Maybe the RStudio users here know and can point me in the right direction, or maybe @kevinushey is back (you said early December, Kevin, and today is the 1st :-))

I can reproduce the stack imbalance reliably in the RStudio Console. Using the RStudio Terminal tab I can't reproduce it at all, even with gcinfo(TRUE). Interestingly, GCs occur while the progress bar is printing and that seems fine, just as it also works on Linux. Given the behaviour in that RStudio Console video, I'm coming to the conclusion that this is an RStudio Console bug. I couldn't copy the text from the RStudio Terminal window (Edit->Copy doesn't work and neither does Ctrl-C), so I took a screenshot of the Terminal tab to show that a GC during the progress bar is ok. I expected it to be fine because only the master thread is calling REprintf and the other threads aren't calling any R API.

It works fine in the RStudio Terminal:
[screenshot of the RStudio Terminal tab]
Note there is a GC while the progress bar is printing the first time, and that works fine in the RStudio Terminal. The progress bar prints a second time because there is an out-of-sample type exception in this test file, which triggers an automatic reread of just those columns.

But in the RStudio Console there is a stack imbalance or unprotect_ptr: pointer not found:

R version 3.4.2 (2017-09-28) -- "Short Summer"
> gcinfo(TRUE)
[1] FALSE
Garbage collection 22 = 16+3+3 (level 0) ... 
25.5 Mbytes of cons cells used (80%)
6.6 Mbytes of vectors used (66%)
Garbage collection 23 = 16+4+3 (level 1) ... 
24.9 Mbytes of cons cells used (79%)
6.5 Mbytes of vectors used (64%)
Garbage collection 24 = 17+4+3 (level 0) ... 
25.1 Mbytes of cons cells used (79%)
6.5 Mbytes of vectors used (64%)
Garbage collection 25 = 18+4+3 (level 0) ... 
25.1 Mbytes of cons cells used (79%)
6.5 Mbytes of vectors used (64%)
Garbage collection 26 = 19+4+3 (level 0) ... 
25.1 Mbytes of cons cells used (79%)
6.5 Mbytes of vectors used (65%)
Garbage collection 27 = 20+4+3 (level 0) ... 
25.2 Mbytes of cons cells used (80%)
6.5 Mbytes of vectors used (65%)
Garbage collection 28 = 20+5+3 (level 1) ... 
25.1 Mbytes of cons cells used (79%)
6.5 Mbytes of vectors used (65%)
Garbage collection 29 = 21+5+3 (level 0) ... 
25.1 Mbytes of cons cells used (79%)
6.5 Mbytes of vectors used (65%)
Garbage collection 30 = 22+5+3 (level 0) ... 
25.2 Mbytes of cons cells used (80%)
6.5 Mbytes of vectors used (65%)
Garbage collection 31 = 23+5+3 (level 0) ... 
25.2 Mbytes of cons cells used (80%)
6.6 Mbytes of vectors used (66%)
Garbage collection 32 = 24+5+3 (level 0) ... 
25.3 Mbytes of cons cells used (80%)
6.6 Mbytes of vectors used (66%)
Garbage collection 33 = 25+5+3 (level 0) ... 
25.4 Mbytes of cons cells used (80%)
6.7 Mbytes of vectors used (66%)
Garbage collection 34 = 25+5+4 (level 2) ... 
24.6 Mbytes of cons cells used (61%)
6.4 Mbytes of vectors used (50%)
Garbage collection 35 = 26+5+4 (level 0) ... 
25.0 Mbytes of cons cells used (62%)
6.5 Mbytes of vectors used (52%)
> require(data.table)
Loading required package: data.table
Garbage collection 36 = 27+5+4 (level 0) ... 
27.2 Mbytes of cons cells used (68%)
7.1 Mbytes of vectors used (56%)
data.table 1.10.5 IN DEVELOPMENT built 2017-12-02 01:04:34 UTC; appveyor
  The fastest way to learn (by data.table authors): https://www.datacamp.com/courses/data-analysis-the-data-table-way
  Documentation: ?data.table, example(data.table) and browseVignettes("data.table")
  Release notes, videos and slides: http://r-datatable.com
Garbage collection 37 = 28+5+4 (level 0) ... 
27.7 Mbytes of cons cells used (69%)
7.3 Mbytes of vectors used (58%)
Garbage collection 38 = 29+5+4 (level 0) ... 
28.0 Mbytes of cons cells used (70%)
7.4 Mbytes of vectors used (58%)
Garbage collection 39 = 30+5+4 (level 0) ... 
28.1 Mbytes of cons cells used (70%)
7.4 Mbytes of vectors used (59%)
Garbage collection 40 = 31+5+4 (level 0) ... 
28.2 Mbytes of cons cells used (70%)
7.5 Mbytes of vectors used (59%)
Garbage collection 41 = 32+5+4 (level 0) ... 
28.4 Mbytes of cons cells used (71%)
7.5 Mbytes of vectors used (59%)
> DT = fread("/Users/pasha/Downloads/LargeFile.csv")
Garbage collection 42 = 32+5+5 (level 2) ... 
27.4 Mbytes of cons cells used (54%)
7.1 Mbytes of vectors used (2%)
Garbage collection 43 = 32+5+6 (level 2) ... 
27.4 Mbytes of cons cells used (54%)
244.7 Mbytes of vectors used (42%)
Garbage collection 44 = 32+5+7 (level 2) ... 
27.4 Mbytes of cons cells used (54%)
482.3 Mbytes of vectors used (42%)
Garbage collection 45 = 32+5+8 (level 2) ... 
27.4 Mbytes of cons cells used (54%)
957.4 Mbytes of vectors used (56%)
Garbage collection 46 = 32+5+9 (level 2) ... 
27.4 Mbytes of cons cells used (54%)
1432.6 Mbytes of vectors used (63%)
Garbage collection 47 = 32+5+10 (level 2) ... 
27.4 Mbytes of cons cells used (54%)
2145.3 Mbytes of vectors used (75%)
Garbage collection 48 = 32+5+11 (level 2) ... 
27.4 Mbytes of cons cells used (54%)
2620.4 Mbytes of vectors used (71%)
Garbage collection 49 = 32+5+12 (level 2) ... 
27.4 Mbytes of cons cells used (54%)
3570.8 Mbytes of vectors used (78%)
Garbage collection 50 = 32+5+13 (level 2) ... 
27.4 Mbytes of cons cells used (54%)
4283.5 Mbytes of vectors used (75%)
Garbage collection 51 = 32+5+14 (level 2) ... 
27.4 Mbytes of cons cells used (54%)
5709.0 Mbytes of vectors used (77%)
Garbage collection 52 = 32+5+15 (level 2) ... 
27.4 Mbytes of cons cells used (54%)
7372.0 Mbytes of vectors used (81%)
Garbage collection 53 = 32+5+16 (level 2) ... 
27.4 Mbytes of cons cells used (54%)
8797.5 Mbytes of vectors used (79%)
Garbage collection 54 = 32+5+17 (level 2) ... 
27.4 Mbytes of cons cells used (54%)
10935.7 Mbytes of vectors used (80%)
|--------------------------------------------------|
|=====Error in fread("LargeFile.csv") : 
  unprotect_ptr: pointer not found
> 

showProgress=FALSE reliably solves it in the RStudio Console. To reproduce, it needs to be the first run in a fresh RStudio Console with showProgress=TRUE (i.e. the default). It seems related to a GC happening during the progress meter; there is one on the first run in a fresh session. It just needs to be a large file so that the progress meter is displayed. Nothing to do with the reread or the arguments passed to fread. If the first run in a fresh RStudio Console uses showProgress=FALSE to make it work, that run grows R's heap, and a subsequent run in the same session with showProgress=TRUE then works too; but only because there is no GC during the progress meter, since the first run already grew the heap.
Why a GC on the master thread during the progress meter is fine on Linux and in the Windows RStudio Terminal, but not in the RStudio Console, is the outstanding question.

Ok, this fixes it. The problem was on the data.table side, not RStudio's. It now works reliably for me in the RStudio Console on Windows. It was a problem that could have occurred on Linux and Mac too, but the memory patterns there didn't trigger it. The other threads had an entry point into R (when pushing their buffers containing string columns) which could happen at the same time as the master thread printing progress using REprintf. That's why it only happened on the first run in a fresh session. From the second run onwards, all the strings in the file had been seen before, so cache lookups were happening (thread-safe) rather than allocations (not thread-safe).
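A heavily hedged sketch of that principle only (not the actual data.table commit; names are illustrative): the two R entry points that existed inside the parallel region, the worker threads' string push and the master thread's progress print, must never run at the same time, for example by sharing one named critical section.

#include <omp.h>
#include <Rinternals.h>     /* SEXP, mkCharLenCE, SET_STRING_ELT */
#include <R_ext/Print.h>    /* REprintf */

/* Worker threads: the only R calls they make are serialized here. */
void push_string_field(SEXP col, R_xlen_t row, const char *s, int len) {
  #pragma omp critical(R_api)
  SET_STRING_ELT(col, row, mkCharLenCE(s, len, CE_UTF8));
}

/* Master thread: the progress print shares the same critical section,
   so it can never interleave with a string allocation (and any GC it triggers). */
void print_progress(int pct) {
  if (omp_get_thread_num() == 0) {
    #pragma omp critical(R_api)
    REprintf("Read %d%%\n", pct);
  }
}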

So @aadler and @HughParsonage, please try this one. 95% chance it works now!

No warnings; I'm not sure whether you're looking for anything else:

> gcinfo(TRUE)
[1] FALSE
> fread("SA2-by-DJZ-2011.csv", verbose = TRUE, na.strings = "", header = FALSE)
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
  Using 12 threads (omp_get_max_threads()=12, nth=12)
  NAstrings = [<<>>]
  None of the NAstrings look like numbers.
  show progress = 1
  0/1 column will be read as boolean
[02] Opening the file
  Opening file SA2-by-DJZ-2011.csv
  File opened, size = 349.4MB (366418725 bytes).
  Memory mapped ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
  \n has been found in the data so any mixture of line endings is allowed other than \r-only line endings. This is common and ideal.
[05] Skipping initial rows if needed
  Positioned on line 1 starting: <<Australian Bureau of Statistic>>
[06] Detect separator, quoting rule, and ncolumns
  Detecting sep ...
  sep=','  with 89 lines of 4 fields using quote rule 0
  Detected 4 columns on line 12. This line is either column names or first data row. Line starts as: <<"Goulburn","110018063",3499,>>
  Quote rule picked = 0
  fill=false and the most number of columns found is 4
[07] Detect column types, good nrow estimate and whether first row is column names
  'header' changed by user from 'auto' to false
  Number of sampling jump points = 101 because (366418375 bytes from row 1 to eof) / (2 * 1457 jump0size) == 125744
  Type codes (jump 000)    : 1551  Quote rule 0
  Type codes (jump 100)    : 1A51  Quote rule 0
  =====
  Sampled 10027 rows (handled \n inside quoted fields) at 101 jump points
  Bytes from first data row on line 12 to the end of last row: 366418143
  Line length: mean=16.02 sd=0.21 min=16 max=29
  Estimated number of rows: 366418143 / 16.02 = 22877178
  Initial alloc = 25164895 rows (22877178 + 9%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
  =====
[08] Assign column names
[09] Apply user overrides on column types
  After 0 type and 0 drop user overrides : 1A51
[10] Allocate memory for the datatable
  Allocating 4 column slots (4 - 0 dropped) with 25164895 rows
Garbage collection 53 = 36+5+12 (level 2) ... 
30.3 Mbytes of cons cells used (60%)
7.9 Mbytes of vectors used (1%)
[11] Read the data
  jumps=[0..360), chunk_size=1017828, total_size=366418143
|--------------------------------------------------|
|==================================================|
Garbage collection 54 = 37+5+12 (level 0) ... 
30.8 Mbytes of cons cells used (61%)
566.6 Mbytes of vectors used (74%)
Garbage collection 55 = 37+6+12 (level 1) ... 
30.8 Mbytes of cons cells used (61%)
549.2 Mbytes of vectors used (72%)
  jumps=[0..360), chunk_size=1017828, total_size=366418143
|--------------------------------------------------|
|==================================================|
Read 22885380 rows x 4 columns from 349.4MB (366418725 bytes) file in 00:02.626 wall clock time
[12] Finalizing the datatable
  Type counts:
         1 : bool8     '1'
         1 : int32     '5'
         2 : string    'A'
=============================
   0.002s (  0%) Memory map 0.341GB file
   0.005s (  0%) sep=',' ncol=4 and header detection
   0.000s (  0%) Column type detection using 10027 sample rows
   0.469s ( 18%) Allocation of 25164895 rows x 4 cols (0.469GB) of which 22885380 ( 91%) rows used
   2.150s ( 82%) Reading 360 chunks of 0.971MB (63547 rows) using 12 threads
   =    0.103s (  4%) Finding first non-embedded \n after each jump
   +    0.230s (  9%) Parse to row-major thread buffers (grown 0 times)
   +    0.718s ( 27%) Transpose
   +    1.099s ( 42%) Waiting
   0.745s ( 28%) Rereading 1 columns due to out-of-sample type exceptions
   2.626s        Total
Column 1 ("") bumped from 'bool8' to 'string' due to <<"Goulburn">> on row 0
Garbage collection 56 = 37+6+13 (level 2) ... 
31.1 Mbytes of cons cells used (62%)
531.9 Mbytes of vectors used (70%)
Garbage collection 57 = 38+6+13 (level 0) ... 
31.1 Mbytes of cons cells used (62%)
532.0 Mbytes of vectors used (70%)
                V1        V2      V3 V4
       1: Goulburn 110018063    3499 NA
       2:       NA 110018064     812 NA
       3:       NA 110018065    2158 NA
       4:       NA 110019999     402 NA
       5:       NA 110028068      10 NA
      ---                              
22885376:       NA 997999799       0 NA
22885377:       NA 998999899      64 NA
22885378:       NA 994999499      34 NA
22885379:       NA 0&&&&&&&&  250796 NA
22885380:       NA 0@@@@@@@@ 7305367 NA
Warning messages:
1: In fread("SA2-by-DJZ-2011.csv", verbose = TRUE, na.strings = "",  :
  Starting data input on line 12 <<"Goulburn","110018063",3499,>> with 4 fields and discarding line 11 <<"Main Statistical Area Structu>> before it because it has a different number of fields (3).
2: In fread("SA2-by-DJZ-2011.csv", verbose = TRUE, na.strings = "",  :
  Found the last consistent line but text exists afterwards. Consider fill=TRUE and/or blank.lines.skip=TRUE. First 200 characters of discarded line: <<"Dataset: 2011 Census of Population and Housing">>

Thanks Hugh. Yes, that's a clean run, assuming it was in a fresh RStudio console session. No sign of the stack imbalance or of "unprotect_ptr: pointer not found" messages, and the progress meter is working correctly (twice in this case, since there is a reread). Now just waiting on @aadler to confirm.
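For stress-testing this kind of fix, a rough sketch (not from the thread, and not specific to data.table's internals): `gctorture(TRUE)` forces a garbage collection at every allocation, which tends to surface PROTECT/UNPROTECT problems deterministically on a small read instead of only intermittently on a 350MB file.

    # A rough sketch: gctorture() makes R collect at every allocation, so
    # unprotected pointers fail fast even on a small slice of the file.
    library(data.table)
    gctorture(TRUE)    # extremely slow; keep the read small
    DT <- fread("SA2-by-DJZ-2011.csv", nrows = 1000, header = FALSE, na.strings = "")
    gctorture(FALSE)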

SUCCESS.

First run, fresh RStudio instance.

> library(data.table)
data.table 1.10.5 IN DEVELOPMENT built 2017-12-02 12:05:42 UTC; appveyor
  The fastest way to learn (by data.table authors): https://www.datacamp.com/courses/data-analysis-the-data-table-way
  Documentation: ?data.table, example(data.table) and browseVignettes("data.table")
  Release notes, videos and slides: http://r-datatable.com
> sessionInfo()
R version 3.4.2 beta (2017-09-17 r73296)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows Server >= 2012 x64 (build 9200)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252    LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                           LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] data.table_1.10.5

loaded via a namespace (and not attached):
[1] compiler_3.4.2 tools_3.4.2   
> DT <- fread('LargeFile.csv', colClasses = colCLASS, select = colSEL, header = TRUE, key = c('LoanID', 'Month'), verbose = TRUE)
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
  Using 40 threads (omp_get_max_threads()=40, nth=40)
  NAstrings = [<<NA>>]
  None of the NAstrings look like numbers.
  show progress = 1
  0/1 column will be read as boolean
[02] Opening the file
  Opening file LargeFile.csv
  File opened, size = 6.355GB (6823372783 bytes).
  Memory mapped ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
  \n has been found in the data so any mixture of line endings is allowed other than \r-only line endings. This is common and ideal.
[05] Skipping initial rows if needed
  Positioned on line 1 starting: <<Orig_Year,Orig_Qtr,LoanID,Mont>>
[06] Detect separator, quoting rule, and ncolumns
  Detecting sep ...
  sep=','  with 100 lines of 37 fields using quote rule 0
  Detected 37 columns on line 1. This line is either column names or first data row. Line starts as: <<Orig_Year,Orig_Qtr,LoanID,Mont>>
  Quote rule picked = 0
  fill=false and the most number of columns found is 37
[07] Detect column types, good nrow estimate and whether first row is column names
  'header' changed by user from 'auto' to true
  Number of sampling jump points = 101 because (6823372781 bytes from row 1 to eof) / (2 * 12905 jump0size) == 264369
  Type codes (jump 000)    : 51AA7155A15A7111111111111771111177715  Quote rule 0
  Type codes (jump 001)    : 51AA7155A15A711111111111177111117771A  Quote rule 0
  Type codes (jump 008)    : 51AA7555A15A711111111111177111117771A  Quote rule 0
  Type codes (jump 009)    : 51AA7555AA5A71A155555557177111117775A  Quote rule 0
  Type codes (jump 042)    : 55AA7555AA5A71A155555557177111117775A  Quote rule 0
  Type codes (jump 064)    : 55AA7555AA5A71A1A5555557177111117775A  Quote rule 0
  Type codes (jump 100)    : 55AA7555AA5A71A1A5555557177111117775A  Quote rule 0
  =====
  Sampled 10049 rows (handled \n inside quoted fields) at 101 jump points
  Bytes from first data row on line 1 to the end of last row: 6823372781
  Line length: mean=126.15 sd=8.30 min=100 max=359
  Estimated number of rows: 6823372781 / 126.15 = 54088821
  Initial alloc = 62279495 rows (54088821 + 15%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
  =====
[08] Assign column names
[09] Apply user overrides on column types
  After 4 type and 23 drop user overrides : 00AA700000000000000000070775555077750
[10] Allocate memory for the datatable
  Allocating 14 column slots (37 - 23 dropped) with 62279495 rows
[11] Read the data
  jumps=[0..6520), chunk_size=1046529, total_size=6823372422
|--------------------------------------------------|
|==================================================|
Read 53945186 rows x 14 columns from 6.355GB (6823372783 bytes) file in 00:25.938 wall clock time
[12] Finalizing the datatable
  Type counts:
        23 : drop      '0'
         5 : int32     '5'
         7 : float64   '7'
         2 : string    'A'
=============================
   0.005s (  0%) Memory map 6.355GB file
   0.025s (  0%) sep=',' ncol=37 and header detection
   0.001s (  0%) Column type detection using 10049 sample rows
   4.681s ( 18%) Allocation of 62279495 rows x 37 cols (5.336GB) of which 53945186 ( 87%) rows used
  21.226s ( 82%) Reading 6520 chunks of 0.998MB (8295 rows) using 40 threads
   =    0.485s (  2%) Finding first non-embedded \n after each jump
   +    1.465s (  6%) Parse to row-major thread buffers (grown 0 times)
   +    9.095s ( 35%) Transpose
   +   10.181s ( 39%) Waiting
   0.000s (  0%) Rereading 0 columns due to out-of-sample type exceptions
  25.938s        Total

Closed RStudio and opened it again so the string cache would not already be warm, then ran it once more with gcinfo(TRUE). Added bonus: the conversion to IDate completed (it took more than 40 seconds, though :)); one possible shortcut for that conversion is sketched after this transcript.

> colCLASS <- c(rep('integer', 2L), 'character', 'Date', 'numeric',
+               rep('integer', 3L), rep('character', 2L),
+               'integer', 'Date', rep('numeric', 2L), 'Date',
+               rep('numeric', 12L), rep('integer', 5),
+               rep('numeric', 3L), 'integer', 'character')
> library(data.table)
data.table 1.10.5 IN DEVELOPMENT built 2017-12-02 12:05:42 UTC; appveyor
  The fastest way to learn (by data.table authors): https://www.datacamp.com/courses/data-analysis-the-data-table-way
  Documentation: ?data.table, example(data.table) and browseVignettes("data.table")
  Release notes, videos and slides: http://r-datatable.com
> gcinfo(TRUE)
[1] FALSE
> sessionInfo()
R version 3.4.2 beta (2017-09-17 r73296)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows Server >= 2012 x64 (build 9200)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252    LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                           LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] data.table_1.10.5

loaded via a namespace (and not attached):
[1] compiler_3.4.2 tools_3.4.2   
Garbage collection 46 = 36+5+5 (level 0) ... 
38.6 Mbytes of cons cells used (77%)
11.2 Mbytes of vectors used (71%)
Garbage collection 47 = 37+5+5 (level 0) ... 
38.7 Mbytes of cons cells used (77%)
11.2 Mbytes of vectors used (71%)
Garbage collection 48 = 38+5+5 (level 0) ... 
38.8 Mbytes of cons cells used (77%)
11.2 Mbytes of vectors used (71%)
Garbage collection 49 = 39+5+5 (level 0) ... 
39.0 Mbytes of cons cells used (78%)
11.2 Mbytes of vectors used (71%)
Garbage collection 50 = 40+5+5 (level 0) ... 
39.1 Mbytes of cons cells used (78%)
11.3 Mbytes of vectors used (71%)
Garbage collection 51 = 40+6+5 (level 1) ... 
38.8 Mbytes of cons cells used (77%)
11.3 Mbytes of vectors used (71%)
Garbage collection 52 = 41+6+5 (level 0) ... 
38.9 Mbytes of cons cells used (77%)
11.3 Mbytes of vectors used (71%)
Garbage collection 53 = 42+6+5 (level 0) ... 
41.5 Mbytes of cons cells used (83%)
12.2 Mbytes of vectors used (77%)
Garbage collection 54 = 42+7+5 (level 1) ... 
43.4 Mbytes of cons cells used (86%)
12.8 Mbytes of vectors used (81%)
Garbage collection 55 = 42+7+6 (level 2) ... 
44.7 Mbytes of cons cells used (72%)
13.0 Mbytes of vectors used (67%)
Garbage collection 56 = 43+7+6 (level 0) ... 
46.5 Mbytes of cons cells used (74%)
13.6 Mbytes of vectors used (70%)
Garbage collection 57 = 44+7+6 (level 0) ... 
47.0 Mbytes of cons cells used (75%)
13.8 Mbytes of vectors used (71%)
Garbage collection 58 = 45+7+6 (level 0) ... 
47.4 Mbytes of cons cells used (76%)
13.9 Mbytes of vectors used (71%)
Garbage collection 59 = 46+7+6 (level 0) ... 
47.7 Mbytes of cons cells used (76%)
14.2 Mbytes of vectors used (73%)
Garbage collection 60 = 47+7+6 (level 0) ... 
48.0 Mbytes of cons cells used (77%)
14.2 Mbytes of vectors used (73%)
Garbage collection 61 = 48+7+6 (level 0) ... 
48.1 Mbytes of cons cells used (77%)
14.3 Mbytes of vectors used (73%)
> DT <- fread('LargeFile.csv', header = TRUE, colClasses = colCLASS, select = colSEL, key = c('LoanID', 'Month'), verbose = TRUE)
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
  Using 40 threads (omp_get_max_threads()=40, nth=40)
  NAstrings = [<<NA>>]
  None of the NAstrings look like numbers.
  show progress = 1
  0/1 column will be read as boolean
[02] Opening the file
  Opening file LargeFile.csv
  File opened, size = 6.355GB (6823372783 bytes).
  Memory mapped ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
  \n has been found in the data so any mixture of line endings is allowed other than \r-only line endings. This is common and ideal.
[05] Skipping initial rows if needed
  Positioned on line 1 starting: <<Orig_Year,Orig_Qtr,LoanID,Mont>>
[06] Detect separator, quoting rule, and ncolumns
  Detecting sep ...
  sep=','  with 100 lines of 37 fields using quote rule 0
  Detected 37 columns on line 1. This line is either column names or first data row. Line starts as: <<Orig_Year,Orig_Qtr,LoanID,Mont>>
  Quote rule picked = 0
  fill=false and the most number of columns found is 37
[07] Detect column types, good nrow estimate and whether first row is column names
  'header' changed by user from 'auto' to true
  Number of sampling jump points = 101 because (6823372781 bytes from row 1 to eof) / (2 * 12905 jump0size) == 264369
  Type codes (jump 000)    : 51AA7155A15A7111111111111771111177715  Quote rule 0
  Type codes (jump 001)    : 51AA7155A15A711111111111177111117771A  Quote rule 0
  Type codes (jump 008)    : 51AA7555A15A711111111111177111117771A  Quote rule 0
  Type codes (jump 009)    : 51AA7555AA5A71A155555557177111117775A  Quote rule 0
  Type codes (jump 042)    : 55AA7555AA5A71A155555557177111117775A  Quote rule 0
  Type codes (jump 064)    : 55AA7555AA5A71A1A5555557177111117775A  Quote rule 0
  Type codes (jump 100)    : 55AA7555AA5A71A1A5555557177111117775A  Quote rule 0
  =====
  Sampled 10049 rows (handled \n inside quoted fields) at 101 jump points
  Bytes from first data row on line 1 to the end of last row: 6823372781
  Line length: mean=126.15 sd=8.30 min=100 max=359
  Estimated number of rows: 6823372781 / 126.15 = 54088821
  Initial alloc = 62279495 rows (54088821 + 15%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
  =====
[08] Assign column names
[09] Apply user overrides on column types
  After 4 type and 23 drop user overrides : 00AA700000000000000000070775555077750
[10] Allocate memory for the datatable
  Allocating 14 column slots (37 - 23 dropped) with 62279495 rows
Garbage collection 62 = 48+7+7 (level 2) ... 
46.5 Mbytes of cons cells used (60%)
13.6 Mbytes of vectors used (2%)
Garbage collection 63 = 48+7+8 (level 2) ... 
46.5 Mbytes of cons cells used (60%)
488.7 Mbytes of vectors used (42%)
Garbage collection 64 = 48+7+9 (level 2) ... 
46.5 Mbytes of cons cells used (60%)
963.9 Mbytes of vectors used (56%)
Garbage collection 65 = 48+7+10 (level 2) ... 
46.5 Mbytes of cons cells used (60%)
1439.1 Mbytes of vectors used (63%)
Garbage collection 66 = 48+7+11 (level 2) ... 
46.5 Mbytes of cons cells used (60%)
1914.2 Mbytes of vectors used (67%)
Garbage collection 67 = 48+7+12 (level 2) ... 
46.5 Mbytes of cons cells used (60%)
2864.5 Mbytes of vectors used (77%)
Garbage collection 68 = 48+7+13 (level 2) ... 
46.5 Mbytes of cons cells used (60%)
3577.3 Mbytes of vectors used (78%)
Garbage collection 69 = 48+7+14 (level 2) ... 
46.5 Mbytes of cons cells used (60%)
4290.0 Mbytes of vectors used (75%)
[11] Read the data
  jumps=[0..6520), chunk_size=1046529, total_size=6823372422
|--------------------------------------------------|
|============================Garbage collection 70 = 49+7+14 (level 0) ... 
76.5 Mbytes of cons cells used (99%)
5487.5 Mbytes of vectors used (96%)
=Garbage collection 71 = 49+8+14 (level 1) ... 
77.0 Mbytes of cons cells used (100%)
5487.6 Mbytes of vectors used (96%)
Garbage collection 72 = 49+8+15 (level 2) ... 
77.0 Mbytes of cons cells used (81%)
5487.1 Mbytes of vectors used (80%)
==============Garbage collection 73 = 50+8+15 (level 0) ... 
94.3 Mbytes of cons cells used (100%)
5494.0 Mbytes of vectors used (80%)
Garbage collection 74 = 50+9+15 (level 1) ... 
94.5 Mbytes of cons cells used (100%)
5494.1 Mbytes of vectors used (80%)
Garbage collection 75 = 50+9+16 (level 2) ... 
94.5 Mbytes of cons cells used (82%)
5493.1 Mbytes of vectors used (67%)
=======|
Read 53945186 rows x 14 columns from 6.355GB (6823372783 bytes) file in 00:24.772 wall clock time
[12] Finalizing the datatable
  Type counts:
        23 : drop      '0'
         5 : int32     '5'
         7 : float64   '7'
         2 : string    'A'
=============================
   0.005s (  0%) Memory map 6.355GB file
   0.018s (  0%) sep=',' ncol=37 and header detection
   0.000s (  0%) Column type detection using 10049 sample rows
   5.496s ( 22%) Allocation of 62279495 rows x 37 cols (5.336GB) of which 53945186 ( 87%) rows used
  19.253s ( 78%) Reading 6520 chunks of 0.998MB (8295 rows) using 40 threads
   =    0.433s (  2%) Finding first non-embedded \n after each jump
   +    1.482s (  6%) Parse to row-major thread buffers (grown 0 times)
   +    9.515s ( 38%) Transpose
   +    7.822s ( 32%) Waiting
   0.000s (  0%) Rereading 0 columns due to out-of-sample type exceptions
  24.772s        Total
Garbage collection 76 = 51+9+16 (level 0) ... 
105.3 Mbytes of cons cells used (91%)
5500.3 Mbytes of vectors used (67%)
Garbage collection 77 = 51+10+16 (level 1) ... 
105.4 Mbytes of cons cells used (91%)
5500.2 Mbytes of vectors used (67%)
> DT[, Month := as.IDate(Month, format = "%Y-%m-%d")]
Garbage collection 78 = 51+10+17 (level 2) ... 
107.5 Mbytes of cons cells used (76%)
8174.1 Mbytes of vectors used (81%)
Garbage collection 79 = 51+11+17 (level 1) ... 
107.5 Mbytes of cons cells used (76%)
5910.4 Mbytes of vectors used (59%)
> gcinfo(FALSE)
[1] TRUE
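On the 40+ second `as.IDate` conversion above: a rough sketch of a common shortcut, assuming the Month column holds relatively few distinct dates (monthly values repeated across the ~54M rows), is to convert the unique strings once and index the result back rather than parsing every row:

    # A rough sketch (assumes Month has few distinct values relative to row count).
    # Converting the unique strings once and indexing back is usually far cheaper
    # than calling as.IDate() on all ~54M strings directly.
    u <- unique(DT$Month)
    DT[, Month := as.IDate(u, format = "%Y-%m-%d")[match(Month, u)]]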

Awesome! :tada: Great work by everyone involved, especially @mattdowle, who must be a few hairs shorter after this one :)

Looks like my strategy of 'staying on holiday until the problem gets fixed' worked out here :-)

Is there anything else I should try to look into, or is this issue considered resolved?

Thanks @aadler and @HughParsonage! What a relief.

@kevinushey Ha ha. Yes, it was on the data.table side and is now fixed (PR #2488). Thanks.
