Data.table: a join gives unexpected results when an index exists on a column whose name starts with the join column name

Created on 6 Nov 2017  ·  3 comments  ·  Source: Rdatatable/data.table

This issue relates to my question on stackoverflow.

For a specific setup of two data.tables, a join does not deliver the expected result.

library(data.table)

# In the code below the join does not deliver the result I would expect
DT1 <- data.table(colname=c("test1","test2","test2","test3"), colname_with_suffix=c("other","test","includes test within","other"))
DT2 <- data.table(lookup=c("test1","test2","test3"), lookup_result=c(1,2,3))
DT1[colname_with_suffix == "not found", ]  # automatically creates index on colname_with_suffix
DT1[DT2, lookup_result := i.lookup_result, on=c("colname"="lookup")][]
# PLEASE NOTE: same result with slightly different syntax: DT1[DT2, lookup_result := i.lookup_result, on=c(colname="lookup")][]
#    colname  colname_with_suffix lookup_result
# 1:   test1                other            NA
# 2:   test2                 test            NA
# 3:   test2 includes test within            NA
# 4:   test3                other             3


# Expected result:
#    colname  colname_with_suffix lookup_result
# 1:   test1                other             1
# 2:   test2                 test             2
# 3:   test2 includes test within             2
# 4:   test3                other             3

๋‹ค์Œ ๋ณ€ํ˜•์˜ ๊ฒฝ์šฐ ์กฐ์ธ์ด ์˜ˆ์ƒ๋Œ€๋กœ ์ž‘๋™ํ•ฉ๋‹ˆ๋‹ค. ์œ„์˜ ์˜ˆ์ƒ์น˜ ๋ชปํ•œ ๋™์ž‘์€ ์—ด ์ด๋ฆ„์ด ์กฐ์ธ ์—ด ์ด๋ฆ„์˜ ์ ‘๋‘์‚ฌ์ด๊ณ  ๋‘˜ ๋‹ค ์œ ์‚ฌํ•œ ํ…์ŠคํŠธ ๋‚ด์šฉ์„ ๊ฐ–๋Š” ์—ด์— ์ธ๋ฑ์Šค๊ฐ€์žˆ๋Š” ๊ฒฝ์šฐ์—๋งŒ ๋ฐœ์ƒํ•˜๋Š” ๊ฒƒ์œผ๋กœ ๋ณด์ž…๋‹ˆ๋‹ค.

# For all following alternatives the join delivers the correct result

# (a) Same data tables as above, but no index
DT1 <- data.table(colname=c("test1","test2","test2","test3"), colname_with_suffix=c("other","test","includes test within","other"))
DT2 <- data.table(lookup=c("test1","test2","test3"), lookup_result=c(1,2,3))
DT1[DT2, lookup_result := i.lookup_result, on=c("colname"="lookup")][]

# (b) Index on DT1, but completely different values in indexed column than in join column
DT1 <- data.table(colname=c("test1","test2","test2","test3"), colname_with_suffix=c("other","other","other","other"))
DT2 <- data.table(lookup=c("test1","test2","test3"), lookup_result=c(1,2,3))
DT1[colname_with_suffix == "not found", ]  # automatically creates index on colname_with_suffix
DT1[DT2, lookup_result := i.lookup_result, on=c("colname"="lookup")][]

# (c) Index on DT1, similar values in indexed column, but join column name is not a prefix of indexed column name
DT1 <- data.table(colname=c("test1","test2","test2","test3"), x.colname_with_suffix=c("other","test","includes test within","other"))
DT2 <- data.table(lookup=c("test1","test2","test3"), lookup_result=c(1,2,3))
DT1[x.colname_with_suffix == "not found", ]  # automatically creates index on x.colname_with_suffix
DT1[DT2, lookup_result := i.lookup_result, on=c("colname"="lookup")][]
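
Since variant (a) shows that the join is correct when no index is present, one possible workaround on affected versions is to drop the index before joining. The following is only a sketch under that assumption, not a confirmed fix: it reuses the tables from the failing example, inspects the stored indices with indices(), and removes them with setindex(DT1, NULL).

```r
library(data.table)

# Same tables as in the failing example
DT1 <- data.table(
  colname = c("test1", "test2", "test2", "test3"),
  colname_with_suffix = c("other", "test", "includes test within", "other")
)
DT2 <- data.table(lookup = c("test1", "test2", "test3"), lookup_result = c(1, 2, 3))

# This subset may auto-create an index on colname_with_suffix
# (controlled by the datatable.auto.index option, TRUE by default)
DT1[colname_with_suffix == "not found", ]
indices(DT1)  # lists the names of any indices currently stored on DT1

# Workaround (assumption: the stored index is what misleads the join):
# remove all indices from DT1 before joining
setindex(DT1, NULL)

DT1[DT2, lookup_result := i.lookup_result, on = c("colname" = "lookup")][]
```

Setting options(datatable.auto.index = FALSE) before the subset should have a similar effect, because no index is created in the first place; the cost is that repeated subsets on that column no longer benefit from the index.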

Session info:

# R version 3.3.2 (2016-10-31)
# Platform: x86_64-w64-mingw32/x64 (64-bit)
# Running under: Windows 7 x64 (build 7601) Service Pack 1
# 
# locale:
#     [1] LC_COLLATE=German_Germany.1252  LC_CTYPE=German_Germany.1252    LC_MONETARY=German_Germany.1252 LC_NUMERIC=C                    LC_TIME=German_Germany.1252    
# 
# attached base packages:
#     [1] stats     graphics  grDevices utils     datasets  methods   base     
# 
# other attached packages:
#     [1] data.table_1.10.0
# 
# loaded via a namespace (and not attached):
#     [1] tools_3.3.2

The same behavior occurs with data.table 1.10.4 and R version 3.4.2 on Windows and Ubuntu Linux 14.04.

bug joins

All 3 comments

์ด๊ฒƒ์€ ๊ณ ์น  ๊ฐ€์น˜๊ฐ€์žˆ๋Š” ๋ฒ„๊ทธ ์ธ ๊ฒƒ ๊ฐ™์Šต๋‹ˆ๋‹ค. ๋‚˜๋Š” ๊ทธ๊ฒƒ์„ ํŒŒ๊ณ  ๋‚ด ์‹œ๊ฐ„์ด ํ—ˆ๋ฝํ•˜๋Š” ํ•œ ๊ณ ์น˜๋ ค๊ณ  ๋…ธ๋ ฅํ•  ๊ฒƒ์ž…๋‹ˆ๋‹ค.

์ด๊ฒƒ์€ ์‰ฌ์šด ์ˆ˜์ •์ด์—ˆ์Šต๋‹ˆ๋‹ค (pull ์š”์ฒญ ์ฐธ์กฐ). ๋ฒ„๊ทธ๋ฅผ ์‹ ๊ณ  ํ•ด ์ฃผ์…”์„œ ๊ฐ์‚ฌ ๋“œ๋ฆฌ๋ฉฐ ์ง€๊ธˆ ๋ฌธ์ œ๊ฐ€ ํ•ด๊ฒฐ๋˜๊ธฐ๋ฅผ ๋ฐ”๋ž๋‹ˆ๋‹ค.

@MarkusBonsch I just applied the commit to the most recent development version of data.table and tested it with the two examples from the linked SO question.

Both examples work as expected!

Many THX for the quick fix.
