Data.table: Error when using the setkey function

에 λ§Œλ“  2019λ…„ 04μ›” 09일  Β·  3μ½”λ©˜νŠΈ  Β·  좜처: Rdatatable/data.table

이 였λ₯˜λ₯Ό ν”Όν•˜λŠ” 방법을 μ•Œκ³  μžˆμ§€λ§Œ 이런 λ°©μ‹μœΌλ‘œ μ‹€ν–‰ν•˜λ©΄μ΄ 였λ₯˜κ°€ λ°œμƒν•˜λŠ” 이유λ₯Ό μ•Œ 수 μ—†μŠ΅λ‹ˆλ‹€.

library(data.table)

datain <- data.frame(
  chrom = c("chr17", "chr4", "chr5", "chr13"),
  map = c(81061047, 106061533, 40102442, 73791553),
  rs = c("rs75954926", "rs7679673", "rs7708610", "rs78341008"),
  start = c(79061048, 104061534, 38102443, 71791554),
  end = c(83061048, 108061534, 42102443, 75791554)
)
datain
# Duplicate chrom as chr using data.frame-style assignment
datain$chr <- datain$chrom
# Convert to a data.table by reference, then key on chr, start, end
setDT(datain)
setkey(datain, chr, start, end)
datain

감사!

Alex

All 3 comments

I can reproduce this with data.table 1.12.0 on both Windows 10 and macOS 10.13.6. After upgrading my macOS system to data.table 1.12.2, I can confirm that all the examples below still give the same results.

A somewhat simpler example:

> set.seed(1)
> d <- data.frame(x = paste0("chr",sample(17)[3:6]), y = 1:4)
> d$x2 <- d$x
> d
      x y    x2
1  chr9 1  chr9
2 chr13 2 chr13
3  chr3 3  chr3
4 chr11 4 chr11
> setDT(d)
> setkey(d,x2,y)
> d
       x y    x2
1:  chr9 4  chr9
2: chr13 2 chr13
3:  chr3 3  chr3
4: chr11 1 chr11

This also happens when the key is assigned directly in setDT (which is to be expected, because setDT uses setkeyv to set the key).

> set.seed(1)
> d <- data.frame(x = paste0("chr",sample(17)[3:6]), y = 1:4)
> d$x2 <- d$x
> d
      x y    x2
1  chr9 1  chr9
2 chr13 2 chr13
3  chr3 3  chr3
4 chr11 4 chr11
> setDT(d, key = c("x2","y"))
> d
       x y    x2
1:  chr9 4  chr9
2: chr13 2 chr13
3:  chr3 3  chr3
4: chr11 1 chr11

As you can see, the order of the y column changes, but the order of the other columns does not. Apparently this only happens when a column is copied the data.frame way, followed by setDT and then setkey. Consider the following 5 cases, which all give the intended result.

1) Not copying a data.frame column

> set.seed(1)
> d <- data.frame(x = paste0("chr",sample(17)[3:6]), y = 1:4)
> d
      x y
1  chr9 1
2 chr13 2
3  chr3 3
4 chr11 4
> setDT(d)
> setkey(d,x)
> d
       x y
1: chr11 4
2: chr13 2
3:  chr3 3
4:  chr9 1

2) Creating an arbitrary new column in the data.frame

> set.seed(1)
> d <- data.frame(x = paste0("chr",sample(17)[3:6]), y = 1:4)
> set.seed(1)
> d$x2 <- paste0("chr",sample(17)[1:4])
> d
      x y    x2
1  chr9 1  chr5
2 chr13 2  chr6
3  chr3 3  chr9
4 chr11 4 chr13
> setDT(d)
> setkey(d,x2,y)
> d
       x y    x2
1: chr11 4 chr13
2:  chr9 1  chr5
3: chr13 2  chr6
4:  chr3 3  chr9

3) Creating a new column by sampling the existing column and pasting on some extra characters

> set.seed(1)
> d <- data.frame(x = paste0("chr",sample(17)[3:6]), y = 1:4)
> set.seed(1)
> d$x2 <- paste0("new_",sample(d$x,4))
> d
      x y        x2
1  chr9 1 new_chr13
2 chr13 2 new_chr11
3  chr3 3  new_chr3
4 chr11 4  new_chr9
> setDT(d)
> setkey(d,x2,y)
> d
       x y        x2
1: chr13 2 new_chr11
2:  chr9 1 new_chr13
3:  chr3 3  new_chr3
4: chr11 4  new_chr9

4) Creating a new column by sampling the existing column

> set.seed(1)
> d <- data.frame(x = paste0("chr",sample(17)[3:6]), y = 1:4)
> set.seed(1)
> d$x2 <- sample(d$x,4)
> d
      x y    x2
1  chr9 1 chr13
2 chr13 2 chr11
3  chr3 3  chr3
4 chr11 4  chr9
> setDT(d)
> setkey(d,x2,y)
> d
       x y    x2
1: chr13 2 chr11
2:  chr9 1 chr13
3:  chr3 3  chr3
4: chr11 4  chr9

5) Copying the column the data.table way

> set.seed(1)
> d <- data.frame(x = paste0("chr",sample(17)[3:6]), y = 1:4)
> d
      x y
1  chr9 1
2 chr13 2
3  chr3 3
4 chr11 4
> setDT(d)
> d[, x2 := x][]
       x y    x2
1:  chr9 1  chr9
2: chr13 2 chr13
3:  chr3 3  chr3
4: chr11 4 chr11
> setkey(d,x2,y)
> d
       x y    x2
1: chr11 4 chr11
2: chr13 2 chr13
3:  chr3 3  chr3
4:  chr9 1  chr9

이 λ²„κ·ΈλŠ” (맀우) νŠΉμ • μ‚¬μš© μ‚¬λ‘€μ—μ„œ λ°œμƒν•˜λŠ” κ²ƒμ²˜λŸΌ λ³΄μ΄μ§€λ§Œ μ—¬λŸ¬ ν”„λ‘œλ•μ…˜ μ‹œμŠ€ν…œμ—μ„œ setkey λ₯Ό μ‚¬μš©ν•˜κΈ° λ•Œλ¬Έμ— 이것은 λ‚˜μ—κ²Œ μ€‘μš” ν•  수 μžˆμŠ΅λ‹ˆλ‹€. 이 νŠΉμ • 사둀가 λ‚΄ ν”„λ‘œλ•μ…˜ μ½”λ“œμ—μ„œ λ°œμƒν•˜λŠ”μ§€ ν™•μΈν•©λ‹ˆλ‹€.

I can confirm @jaapwalhout's observations (also with data.table_1.12.3).
In addition: (i) I thought it might be related to factors, but it is not (it also happens with character or integer columns), and (ii) it happens regardless of the key (x or y).
Below is a simpler reproducible example.
The warning on assignment may be related, but I don't understand it.

options(datatable.verbose = TRUE)

## KO
d <- data.frame(x = c(9, 1), y = c(9, 1))
d$x2 <- d$x
d
#   x y x2
# 1 9 9  9
# 2 1 1  1
setDT(d, key  = "x")[]
# forder took 0 sec
# reorder took 0 sec
#    x y x2
# 1: 9 1  9
# 2: 1 9  1

## KO
d <- data.frame(x = c("9", "1"), y = c(9, 1), stringsAsFactors = FALSE)
d$x2 <- d$x
d
#   x y x2
# 1 9 9  9
# 2 1 1  1
setDT(d, key  = "x")[]
# forder took 0 sec
# reorder took 0 sec
#    x y x2
# 1: 9 1  9
# 2: 1 9  1

## OK (with warning)
d <- data.frame(x = c("9", "1"), y = c(9, 1))
setDT(d)
d$x2 <- d$x
# Assigning to all 2 rows
# RHS for item 1 has been duplicated because NAMED is 2, but then is being plonked. length(values)==2; length(cols)==1)
setkey(d, y, verbose = TRUE)[]
# forder took 0 sec
# reorder took 0 sec
#    x y x2
# 1: 1 1  1
# 2: 9 9  9

iiuc, when a column or list element is duplicated without being modified, both share the same memory address, and setting the key then "messes things up".
It works fine when copy() is used.

# KO
l <- list(x = c(9, 1), y = c(9, 1))
l[["z"]] <- l[["x"]]
l
setDT(l, key  = "x")[]

address(l$x) == address(l$z)
# TRUE

# OK
l <- list(x = c(9, 1), y = c(9, 1))
l[["z"]] <- copy(l[["x"]])
l
setDT(l, key  = "x")[]
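
Building on that observation, here is a minimal sketch of how shared elements could be found and given their own memory before keying. `shared_columns` is a hypothetical helper written for this illustration, not a data.table function; it relies only on `address()` and `copy()`, and it assumes that shared addresses are indeed what triggers the problem.

```r
library(data.table)

# Hypothetical helper (not part of data.table): names of the elements/columns
# whose underlying vector is also used by another element (same address).
shared_columns <- function(x) {
  addrs <- vapply(x, address, character(1))
  names(x)[duplicated(addrs) | duplicated(addrs, fromLast = TRUE)]
}

l <- list(x = c(9, 1), y = c(9, 1))
l[["z"]] <- l[["x"]]           # z now points at the same vector as x
shared <- shared_columns(l)    # per the check above, x and z

# Give every affected element its own vector before keying.
for (nm in shared) l[[nm]] <- copy(l[[nm]])
still_shared <- shared_columns(l)

setDT(l, key = "x")
l
```

The same helper could be run on a data.frame right before setDT() to see whether any columns were duplicated with a plain `d$x2 <- d$x` assignment.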
이 νŽ˜μ΄μ§€κ°€ 도움이 λ˜μ—ˆλ‚˜μš”?
0 / 5 - 0 λ“±κΈ‰