Data.table: fread may fail to parse valid files when dec = ','

Created on 13 Apr 2018  ·  3 comments  ·  Source: Rdatatable/data.table

> fread('a,b,c,d\n1e1,1e2,1e3,"4,0001"\n1,2,3,4\n', dec=',')
       a     b     c      d
   <num> <num> <num>  <num>
1:    10   100  1000 4.0001
Warning message:
In fread("a,b,c,d\n1e1,1e2,1e3,\"4,0001\"\n1,2,3,4\n", dec = ",") :
  Discarded single-line footer: <<1,2,3,4>>

This happens because the float parser greedily consumes 1,2 as a single token, whereas without quotes it should be parsed as 2 separate fields.
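The ambiguity can be sketched in base R. This is only an illustration of greedy versus separator-aware tokenization, not data.table's actual parser; the regular expression and the variable names `greedy` and `fields` are assumptions made for the example.

```r
# Hypothetical illustration only; fread's real parser is written in C.
line <- "1,2,3,4"

# Greedy reading: a number may contain one decimal comma,
# so "1,2" is consumed as a single numeric token.
greedy <- regmatches(line, gregexpr("[0-9]+(,[0-9]+)?", line))[[1]]

# Separator-aware reading: every unquoted comma ends the field.
fields <- strsplit(line, ",", fixed = TRUE)[[1]]

print(greedy)  # two tokens: "1,2" and "3,4"
print(fields)  # four fields: "1", "2", "3", "4"
```

With the greedy reading the row yields two tokens instead of the four columns in the header, which is consistent with the line being discarded as a footer.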

Also, the "Details" section of the documentation contains the following (outdated) information:

'fread' uses the C function 'strtod' to read numeric data; e.g.,
'1.23' or '1,23'. 'strtod' retrieves the decimal separator ('.' or
',' usually) from the locale of the R session rather than as an
argument passed to the 'strtod' function. So for
'fread(..., dec = ",")' to work, 'fread' changes the R session's
locale temporarily to a locale which provides the desired decimal
separator.

bug fread

All 3 comments

Awesome package! Thank you.
Coincidentally, yesterday I ran into some strange behaviour related to fread.
Here is a simple example (derived from a real case that uses lapply with fread over a bunch of files).

library(data.table)

DT = data.table(A = rep("20,1", 1e4))
fwrite(DT, "DT.csv", quote = FALSE)

classA = character(1e3)

for (i in seq_along(classA)) {

  DT = fread("DT.csv", sep = ";", dec = ",", colClasses = "numeric")
  classA[i] = DT[, class(A)]
}

table(classA)

이것은 λ‚˜μ—κ²Œ μ€€λ‹€ :

```
Warning message:
In fread("DT.csv", sep = ";", dec = ",", colClasses = "numeric",  :
  Bumped column 1 to type character on data row 387, field contains '20,1'. Coercing previously read values in this column from logical, integer or numeric back to character which may not be lossless; e.g., if '00' and '000' occurred before they will now be just '0', and there may be inconsistencies with treatment of ',,' and ',NA,' too (if they occurred in this column before the bump). If this matters please rerun and set 'colClasses' to 'character' for this column. Please note that column type detection uses a sample of 1,000 rows (100 rows at 10 points) so hopefully this message should be very rare. If reporting to datatable-help, please rerun and include the output from verbose=TRUE.

table(classA)
# classA
# character   numeric 
#        1       999 
which(classA == "character")
# [1] 9
```

So, sometimes, the column is read as character with the spotted row index being different for different runs. And the ```which``` indicates it is more likely to happen in the first iterations. I don't get the randomness and the fact that colClasses is ignored (I set colClasses after reading fread doc...).
And... I did not manage to reproduce the error when setting verbose = TRUE... Also, the bug was observed using RStudio, not reproduced in the R console...
Sorry if I missed or misunderstood something...
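The warning itself suggests rerunning with 'colClasses' set to 'character'. A minimal sketch of the conversion step that would follow, assuming the column has already been read as character; the vector `raw` is made-up sample data, not taken from DT.csv:

```r
# Values as they would arrive if the column were read as character.
raw <- c("20,1", "3,75", "10")

# Swap the decimal comma for a dot, then convert explicitly.
num <- as.numeric(sub(",", ".", raw, fixed = TRUE))

print(num)
```

Converting explicitly keeps the result independent of how the reader guesses the column type.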

```r
sessionInfo()
R version 3.3.2 (2016-10-31)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] data.table_1.10.4-3

loaded via a namespace (and not attached):
[1] tools_3.3.2 yaml_2.1.18
```

Please update to the development version of data.table and test again. fread has improved a lot since the last release.

Indeed... I ran the same example a few times with 1.10.5 and it worked fine.
Thanks.

이 νŽ˜μ΄μ§€κ°€ 도움이 λ˜μ—ˆλ‚˜μš”?
0 / 5 - 0 λ“±κΈ‰