data.table: fread may fail to parse a valid file when dec = ','

Created on April 13, 2018 · 3 comments · Source: Rdatatable/data.table

```r
> fread('a,b,c,d\n1e1,1e2,1e3,"4,0001"\n1,2,3,4\n', dec=',')
       a     b     c      d
   <num> <num> <num>  <num>
1:    10   100  1000 4.0001
Warning message:
In fread("a,b,c,d\n1e1,1e2,1e3,\"4,0001\"\n1,2,3,4\n", dec = ",") :
  Discarded single-line footer: <<1,2,3,4>>
```

This happens because the float parser greedily consumes `1,2` as a single token, whereas, being unquoted, it should be parsed as 2 separate fields.
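
The ambiguity can be illustrated with a small base R sketch. This is not fread's actual parser (which is written in C); the regular expression `num_pattern` and the variable names are made up for illustration. It only shows that a number matcher treating `,` as the decimal mark will take `1,2` from the start of the line `1,2,3,4`, while splitting on the separator first yields four fields.

```r
# Illustration only: a toy model of the ambiguity, not fread's implementation.
line <- "1,2,3,4"

# Hypothetical number pattern where ',' is the decimal mark:
# digits, an optional ",digits" fraction, and an optional exponent.
num_pattern <- "^[0-9]+(,[0-9]+)?([eE][+-]?[0-9]+)?"

# Matching a number greedily at the start of the line swallows the
# first separator as if it were a decimal mark.
greedy_token <- regmatches(line, regexpr(num_pattern, line))
print(greedy_token)

# Splitting on the field separator first keeps the four fields apart.
fields <- strsplit(line, ",", fixed = TRUE)[[1]]
print(fields)
```

When `sep` and `dec` are the same character, an unquoted comma has to be read as a separator, and only a quoted field such as `"4,0001"` can carry a decimal comma.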

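Also, the "Details" section of the documentation contains the following information (which is outdated):

'fread' uses the C function 'strtod' to read numeric data; e.g.,
'1.23' or '1,23'. 'strtod' retrieves the decimal separator ('.' or
',' usually) from the locale of the R session rather than as an
argument passed to the 'strtod' function. So for
'fread(..., dec = ",")' to work, 'fread' changes the
R session's locale temporarily to a locale which provides the
desired decimal separator.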
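
For comparison, and only as a sketch: base R's `read.csv` also accepts a `dec` argument, and on the input from this report it returns both data rows. This says nothing about how fread is implemented; it just shows the result the report expects. The variable names `txt` and `df` are chosen for this example.

```r
# Same input as in the report: the quoted field "4,0001" uses a decimal comma,
# every other comma is a field separator.
txt <- 'a,b,c,d\n1e1,1e2,1e3,"4,0001"\n1,2,3,4\n'

# Base R reader with ',' as both field separator (the read.csv default)
# and decimal mark.
df <- read.csv(text = txt, dec = ",")

print(df)
print(nrow(df))   # number of data rows that were read
print(df$d)       # column d, including the quoted decimal-comma value
```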

bug fread

st-pasha

👍1

All 3 comments

Awesome package! Thank you.
Coincidentally, yesterday I experienced some strange behaviour related to fread.
Here is a simple example (derived from a real case using lapply fread on a bunch of files).

```r
library(data.table)

# One column of comma-decimal values, written unquoted
DT = data.table(A = rep("20,1", 1e4))
fwrite(DT, "DT.csv", quote = FALSE)

# Record the class of column A on each of 1000 reads
classA = character(1e3)

for (i in seq_along(classA)) {
  DT = fread("DT.csv", sep = ";", dec = ",", colClasses = "numeric")
  classA[i] = DT[, class(A)]
}

table(classA)
```

This gives me:

```r
Warning message:
In fread("DT.csv", sep = ";", dec = ",", colClasses = "numeric",  :
  Bumped column 1 to type character on data row 387, field contains '20,1'. Coercing previously read values in this column from logical, integer or numeric back to character which may not be lossless; e.g., if '00' and '000' occurred before they will now be just '0', and there may be inconsistencies with treatment of ',,' and ',NA,' too (if they occurred in this column before the bump). If this matters please rerun and set 'colClasses' to 'character' for this column. Please note that column type detection uses a sample of 1,000 rows (100 rows at 10 points) so hopefully this message should be very rare. If reporting to datatable-help, please rerun and include the output from verbose=TRUE.

table(classA)
# classA
# character   numeric 
#        1       999 
which(classA == "character")
# [1] 9
```

So, sometimes, the column is read as character, and the row index reported in the warning differs between runs. The `which` output suggests it is more likely to happen in the first iterations. I don't understand the randomness, or why colClasses is ignored (I set colClasses after reading the fread doc...).
And... I did not manage to reproduce the error when setting verbose = TRUE... Also, the bug was observed using RStudio, and not reproduced in the R console...
Sorry if I missed or misunderstood something...

```r
sessionInfo()
R version 3.3.2 (2016-10-31)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] data.table_1.10.4-3

loaded via a namespace (and not attached):
[1] tools_3.3.2 yaml_2.1.18
```

Atrebas on April 14, 2018
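
The warning quoted in this comment suggests rerunning with 'colClasses' set to 'character'. A possible workaround along those lines, sketched here under assumptions, is to read the column as text and convert the decimal commas explicitly. `parse_comma_decimal` is a hypothetical helper written for this example, not a data.table function, and the fread call is only attempted if data.table is installed and `DT.csv` from the example exists.

```r
# Hypothetical helper (not part of data.table): turn comma-decimal strings
# such as "20,1" into numeric values by swapping the decimal comma for a dot.
parse_comma_decimal <- function(x) {
  as.numeric(sub(",", ".", x, fixed = TRUE))
}

converted <- parse_comma_decimal(c("20,1", "3", NA))
print(converted)

# Applying the same idea to the file from the example, guarded so the
# sketch still runs when data.table or the file is not available.
if (requireNamespace("data.table", quietly = TRUE) && file.exists("DT.csv")) {
  DT <- data.table::fread("DT.csv", sep = ";", colClasses = "character")
  A_num <- parse_comma_decimal(DT[["A"]])
  print(class(A_num))
}
```

Reading as character sidesteps type detection for that column, at the cost of one extra conversion pass.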

Please update to the development version of data.table and test again. fread has improved a lot since the last release.

MichaelChirico on April 14, 2018

👍1

Indeed... I ran the same example a few times with 1.10.5 and it worked fine.
Thank you.

Atrebas on April 14, 2018

👍1
