data.table: fread may fail to parse a valid file when dec=','

Created on 2018-04-13  ·  3 comments  ·  Source: Rdatatable/data.table

> fread('a,b,c,d\n1e1,1e2,1e3,"4,0001"\n1,2,3,4\n', dec=',')
       a     b     c      d
   <num> <num> <num>  <num>
1:    10   100  1000 4.0001
Warning message:
In fread("a,b,c,d\n1e1,1e2,1e3,\"4,0001\"\n1,2,3,4\n", dec = ",") :
  Discarded single-line footer: <<1,2,3,4>>

This happens because the float parser greedily consumes `1,2` as a single token, whereas without quotes it must be parsed as 2 separate fields.
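
The ambiguity can be illustrated with a small base R sketch. This is only an illustration of the two possible readings of the line `1,2,3,4`, not fread's actual tokenizer; the variable names are made up for the example.

```r
# Illustration only: base R, not fread's real parsing code.
line <- "1,2,3,4"

# Field-first reading: split on the separator, then parse each field.
fields <- strsplit(line, ",", fixed = TRUE)[[1]]
field_values <- as.numeric(fields)

# Number-first (greedy) reading with dec = ",": the longest prefix that
# looks like a decimal number is taken as a single token.
greedy_token <- regmatches(line, regexpr("^[0-9]+,[0-9]+", line))
greedy_value <- as.numeric(sub(",", ".", greedy_token, fixed = TRUE))

print(field_values)  # the four separate fields
print(greedy_token)  # the prefix swallowed by the greedy reading
print(greedy_value)  # that prefix interpreted as one decimal number
```

A file that uses a field separator different from the decimal mark, such as `;`, does not have this ambiguity.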

In addition, the "Details" section of the documentation contains the following passage, which has been out of date for a long time:

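'fread' uses C function 'strtod' to read numeric data; e.g.,
'1.23' or '1,23'. 'strtod' retrieves the decimal separator ('.' or
',' usually) from the locale of the R session rather than as an
argument passed to the 'strtod' function. So for
'fread(...,dec=",")' to work, 'fread' changes this (and only this)
R session's locale temporarily to a locale which provides the
desired decimal separator.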

bug fread

All 3 comments

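Great package! Thank you.
Coincidentally, I ran into some strange behavior with fread yesterday.
Here is a simple example (taken from a real case of using lapply with fread on a bunch of files).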

library(data.table)

DT = data.table(A = rep("20,1", 1e4))
fwrite(DT, "DT.csv", quote = FALSE)

classA = character(1e3)

for (i in seq_along(classA)) {

  DT = fread("DT.csv", sep = ";", dec = ",", colClasses = "numeric")
  classA[i] = DT[, class(A)]
}

table(classA)

This gives me:

Warning message:
In fread("DT.csv", sep = ";", dec = ",", colClasses = "numeric",  :
  Bumped column 1 to type character on data row 387, field contains '20,1'. Coercing previously read values in this column from logical, integer or numeric back to character which may not be lossless; e.g., if '00' and '000' occurred before they will now be just '0', and there may be inconsistencies with treatment of ',,' and ',NA,' too (if they occurred in this column before the bump). If this matters please rerun and set 'colClasses' to 'character' for this column. Please note that column type detection uses a sample of 1,000 rows (100 rows at 10 points) so hopefully this message should be very rare. If reporting to datatable-help, please rerun and include the output from verbose=TRUE.

table(classA)
# classA
# character   numeric 
#        1       999 
which(classA == "character")
# [1] 9

So the column is sometimes read as character, and the row reported in the warning differs between runs. The `which` output suggests this is more likely to happen in the early iterations. I don't understand the randomness, or why colClasses is ignored (I set colClasses after reading the fread documentation).
Also, I could not reproduce the error with verbose = TRUE, and the bug was observed in RStudio but not reproduced in the R console.
Sorry if I missed or misunderstood something.

```r
sessionInfo()
R version 3.3.2 (2016-10-31)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] data.table_1.10.4-3

loaded via a namespace (and not attached):
[1] tools_3.3.2 yaml_2.1.18
```

Please make sure to update to the development version of data.table and test again. Many improvements have been made to fread since the last release.

Indeed... running the same example multiple times with 1.10.5 works fine.
Thank you.
