data.table: fread may fail to parse a valid file when dec = ','

Created on Apr 13, 2018 · 3 comments · Source: Rdatatable/data.table

```r
> fread('a,b,c,d\n1e1,1e2,1e3,"4,0001"\n1,2,3,4\n', dec=',')
       a     b     c      d
   <num> <num> <num>  <num>
1:    10   100  1000 4.0001
Warning message:
In fread("a,b,c,d\n1e1,1e2,1e3,\"4,0001\"\n1,2,3,4\n", dec = ",") :
  Discarded single-line footer: <<1,2,3,4>>
```

This happens because the float parser greedily consumes 1,2 as a single token, whereas, because the value is unquoted, it must be parsed as 2 separate fields.
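To make the ambiguity concrete, here is a small base R sketch. It does not use fread's actual parser: the regular expression is only an assumed stand-in for a number parser that accepts ',' as the decimal mark, and the variable names are made up for illustration.

```r
# Illustration only: base R, not data.table internals.
line <- "1,2,3,4"

# Intended reading when sep = ",": four separate integer fields.
fields <- strsplit(line, ",", fixed = TRUE)[[1]]

# What a greedy decimal-comma number parser takes from the start of the line:
# one or more digits, then optionally a comma followed by more digits.
greedy_token <- regmatches(line, regexpr("^[0-9]+(,[0-9]+)?", line))

# Text left over once that first token has been consumed.
rest <- substring(line, nchar(greedy_token) + 1)

print(fields)        # "1" "2" "3" "4"
print(greedy_token)  # "1,2"
print(rest)          # ",3,4"
```

With the greedy reading, the first field becomes 1.2 and the row no longer has the four fields the header promises, which is consistent with the line being discarded as a footer.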

In addition, the "Details" section of the documentation contains the following information (which has long been obsolete):

'fread' uses C function 'strtod' to read numeric data; e.g.,
'1.23' or '1,23'. 'strtod' retrieves the decimal separator ('.' or
',' usually) from the locale of the R session rather than as an
argument passed to the 'strtod' function. So for
'fread(..., dec = ",")' to work, 'fread' changes this (and only this)
R session's locale temporarily to a locale which provides the
desired decimal separator.

bug fread


st-pasha


All 3 comments

Great package! Thanks.
Coincidentally, yesterday I ran into some strange behavior that may be related to fread.
Here is a simple example (derived from a real case using lapply with fread on a bunch of files).

```r
library(data.table)

# Write 10,000 rows of "20,1" without quoting
DT = data.table(A = rep("20,1", 1e4))
fwrite(DT, "DT.csv", quote = FALSE)

# Read the same file 1,000 times and record the class of column A each time
classA = character(1e3)

for (i in seq_along(classA)) {
  DT = fread("DT.csv", sep = ";", dec = ",", colClasses = "numeric")
  classA[i] = DT[, class(A)]
}

table(classA)
```

This gives me:

```r
Warning message:
In fread("DT.csv", sep = ";", dec = ",", colClasses = "numeric",  :
  Bumped column 1 to type character on data row 387, field contains '20,1'. Coercing previously read values in this column from logical, integer or numeric back to character which may not be lossless; e.g., if '00' and '000' occurred before they will now be just '0', and there may be inconsistencies with treatment of ',,' and ',NA,' too (if they occurred in this column before the bump). If this matters please rerun and set 'colClasses' to 'character' for this column. Please note that column type detection uses a sample of 1,000 rows (100 rows at 10 points) so hopefully this message should be very rare. If reporting to datatable-help, please rerun and include the output from verbose=TRUE.

table(classA)
# classA
# character   numeric 
#        1       999 
which(classA == "character")
# [1] 9
```

So, sometimes the column is read as character, and the row index reported in the warning differs from run to run. The `which` output suggests this is more likely to happen in the early iterations. I don't understand the randomness, nor why colClasses is ignored (I set colClasses after reading the fread documentation).
I also could not reproduce the error with verbose = TRUE. The bug was observed in RStudio and was not reproduced in the R console.
Sorry if I missed or misunderstood something.

```r
sessionInfo()
R version 3.3.2 (2016-10-31)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] data.table_1.10.4-3

loaded via a namespace (and not attached):
[1] tools_3.3.2 yaml_2.1.18
```

Atrebas on Apr 14, 2018

Please make sure to update to the development version of data.table and test again. There have been a lot of improvements to fread since the last release.