Data.table: fread with a large CSV (44GB) uses a lot of RAM in the latest data.table dev version

Created on 22 Mar 2017  ·  30 comments  ·  Source: Rdatatable/data.table

Hi,

Hardware and software:
Server: Dell R930 4-Intel Xeon E7-8870 v3 2.1GHz, 45M cache, 9.6GT/s QPI, Turbo, HT, 18C/36T and 1TB of RAM
OS: Red Hat 7.1
R version: 3.3.2
data.table version: 1.10.5 built 2017-03-21

I'm loading a CSV file (44GB, 872505 rows x 12785 columns). It loads very fast, in 1.30 minutes using 144 cores (72 cores across the 4 processors, with hyperthreading enabled to make it a 144-core box).

The main issue is that once the DT is loaded, the amount of memory in use is far larger than the size of the CSV file. In this case the 44GB CSV (written with fwrite; saving it with saveRDS and compress=FALSE creates an 84GB file) is using ~356GB of RAM.

Here is the output using "verbose=TRUE":
_Allocating 12785 column slots (12785 - 0 dropped)
madvise sequential: ok
Reading data with 1440 jump points and 144 threads
Read 95.7% of 858881 estimated rows
Read 872505 rows x 12785 columns from 43.772GB file in 1 mins 33.736 secs of wall clock time (affected by other apps running)
0.000s ( 0%) Memory map
0.070s ( 0%) sep, ncol and header detection
26.227s ( 28%) Column type detection using 34832 sample rows from 1440 jump points
0.614s ( 1%) Allocation of 3683116 rows x 12785 cols (350.838GB) in RAM
0.000s ( 0%) madvise sequential
66.825s ( 71%) Reading data
93.736s Total_

This looks similar to an issue that sometimes arises when working with the parallel package, where one R session is launched per core when using functions like "mclapply". See the Rsessions created/listed in this screenshot:

image

If I do "rm(DT)", RAM goes back to its initial state and the "Rsessions" are removed.

I already tried "setDTthreads(20)" and it still uses the same amount of RAM.
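For readers who want to repeat that check, here is a minimal sketch of capping and inspecting the thread count. `setDTthreads()` and `getDTthreads()` are data.table functions; the value 2 is only an illustrative choice, and as reported above, capping threads did not change RAM use in this case.

```r
library(data.table)

old <- getDTthreads()        # remember the current setting
setDTthreads(2)              # cap data.table at 2 threads (illustrative value)
n_threads <- getDTthreads()  # confirm what is actually in effect
print(n_threads)
setDTthreads(old)            # restore the previous setting
```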

By the way, if the file is loaded with the non-parallel version of "fread", the memory allocation only goes up to ~106GB.

Guillermo

Most helpful comment

Tested on an extensive "real" table (hospital data): 30M rows × 125 columns, vs. readr '1.2.0' and read.csv 3.4.3.

image

All 30 comments

This is the output from the non-parallel fread implementation (data.table 1.10.5 IN DEVELOPMENT built 2017-02-09):

image

And I double-checked the amount of memory used: it only goes up to 84GB.

Guillermo

You're right. Thanks for the great report. The estimated nrow looks about right (858,881 vs. 872,505), but the allocation is 4.2x bigger than that (3,683,116) and way off. I've improved the calculation and added more detail to the verbose output. Hold off retesting for now, though, until a few more things are done.
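The size of that over-allocation can be checked with a little arithmetic. This is only a sketch, assuming every cell is an 8-byte double (which matches the 350.838GB figure in the verbose output); the helper name `gib` is made up for illustration.

```r
# Rough memory needed for an all-double table: rows x cols x 8 bytes.
gib <- function(nrow, ncol, bytes_per_cell = 8) {
  nrow * ncol * bytes_per_cell / 1024^3
}

ncol_file <- 12785
allocated <- gib(3683116, ncol_file)  # rows allocated up front per the verbose output
needed    <- gib(872505,  ncol_file)  # rows actually present in the file

cat(sprintf("allocated: %.1f GB, needed: %.1f GB, ratio: %.1fx\n",
            allocated, needed, allocated / needed))
```

The ratio comes out at roughly 4.2, in line with the figure quoted in the comment.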

Please retest. It should be fixed now.

I just installed data.table dev:
data.table 1.10.5 IN DEVELOPMENT built 2017-03-27 02:50:31 UTC

The first message I got when I tried to read the same 44GB file was this:

DT <- fread('dt.daily.4km.csv')
Error: protect(): protection stack overflow

Then I re-ran the same command and it started working fine. However, this version is not using multi-core mode. It takes ~25 minutes to load, the same as before the parallel version of fread was introduced.

Guillermo

I closed all R sessions and re-ran the test, and I'm getting an error:

The guessed column type was insufficient for 34711745 values in 508 columns. Use colClasses to set these column classes manually.

See below. There are a few messages saying "guessed integer but contains <<0....>>".

_Read 872505 rows x 12785 columns from 43.772GB file in 15:27.024 wall clock time (can be slowed down by any other open apps even if seemingly idle)
Column 171 ('D_19810618') guessed 'integer' but contains <<2.23000001907349>>
Column 347 ('D_19811211') guessed 'integer' but contains <<1.02999997138977>>
Column 348 ('D_19811212') guessed 'integer' but contains <<3.75>>_

Wow - your file is really testing the edge cases. Great. In future, please run with verbose=TRUE and provide the full output. In this case, though, the information you provided was enough to see what the problem is. A buffer is created for each column (over 12,000 in your case) for each thread, and each one is currently protected separately. There's a way to avoid that - will do. The message about type guessing is correct. Do the 508 columns make sense to you, and do you agree they should be numeric? You can pass a range of columns to colClasses like this: colClasses=list("numeric"=11:518)
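A minimal sketch of that colClasses form on a tiny made-up file follows. The file contents and the column range 3:4 are invented for illustration; only the `colClasses = list(numeric = ...)` pattern comes from the comment above.

```r
library(data.table)

# Tiny stand-in for the wide file: early rows of D_1 and D_2 look like
# integers, a later row contains decimals.
tmp <- tempfile(fileext = ".csv")
writeLines(c("x,y,D_1,D_2",
             "1.5,2.5,0,1",
             "3.5,4.5,2,3",
             "5.5,6.5,0.25,1.75"), tmp)

# Declare columns 3 and 4 as numeric up front rather than relying on guessing.
DT <- fread(tmp, colClasses = list(numeric = 3:4))
col_types <- sapply(DT, class)
print(col_types)
unlink(tmp)
```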

What field is this data from? Are you creating the file? It feels like it has been twisted into wide format, whereas normal best practice is to write it (and keep it in memory too) in long format. I'd normally expect to see the 508 column names such as "D_19810618" as values in a column, not as columns themselves. That's why I'm asking whether you are creating the file and whether you can create it in long format. If not, suggest to whoever creates the file that it could be done better. I'm guessing you're applying operations across columns, perhaps using .SD and .SDcols. It would be much better in long format, with keyby= on the column containing values like "D_19810618".
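To make the wide-versus-long suggestion concrete, here is a small sketch on an invented two-row table. The column names mimic the file's D_YYYYMMDD pattern, but the values and the `day`/`value` names are assumptions for illustration.

```r
library(data.table)

# Toy wide table: one row per location, one column per day.
wide <- data.table(x = c(1, 2), y = c(10, 20),
                   D_19810101 = c(0.5, NA),
                   D_19810102 = c(1.5, 2.5))

# Long format: the day becomes a value in a column instead of a column name.
long <- melt(wide, id.vars = c("x", "y"),
             variable.name = "day", value.name = "value")

# Per-day aggregation then uses keyby= instead of .SD/.SDcols over columns.
daily_mean <- long[, .(mean_value = mean(value, na.rm = TRUE)), keyby = day]
print(daily_mean)
```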

But I'll still try to make fread as good as it can be at dealing with any input, even very wide files with over 12,000 columns.

I hope others are testing on their files and finding no problems at all!

_Do the 508 columns make sense to you, and do you agree they should be numeric?_

This table contains time series by row: IDx, IDy, Time1_value, Time2_value, Time3_value... and all the TimeN_value columns contain only numeric values. If I use colClasses, I'll have to do it for all 12783 of them: list("numeric"=2:12783). I'll try this.

_What field is this data from?_
Geospatial data. I'm doing a Euclidean distance search within the DT using IDx and IDy. The more rows I have, the slower the search will be. Am I right?
It is pretty fast right now (wide format). I have a map where the user clicks on an area, and a time series is then generated from the CSV data at the location (with available data) closest to the click. I'll implement it in long format instead.

I'll be back with some results.

Ok good. list("numeric"=2:12783) shouldn't be needed, iiuc, since it only needs help with the 508 columns. Oh - I see - I guess the 508 are scattered through the columns (not a contiguous set of columns)?

No - data.table isn't faster when it's wide! Long is almost always faster and more convenient. Have you seen and tried roll="nearest"? How are you doing it now? Please show the code so we can understand. Long format is almost certainly better, but we may need some enhancements for 2D nearest. Please show timings too. When someone says "pretty fast", it turns out people have very different ideas of what "pretty fast" means.
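For reference, a minimal one-dimensional sketch of roll="nearest" on made-up data is shown below. It rolls on a single join column only, so it does not by itself solve the 2D nearest-point case mentioned above.

```r
library(data.table)

# Toy lookup table with a single coordinate.
pts <- data.table(x = c(1.0, 2.5, 4.0), value = c("a", "b", "c"))

# For each queried x, return the row of pts whose x is nearest.
query <- data.table(x = c(1.2, 3.9))
res <- pts[query, on = "x", roll = "nearest"]
print(res)
```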

If I melt this table, I hit the 2^31 limit. It fails with the error "negative length vectors are not allowed".

I'll go back to the source to see whether I can generate it in long format.

# Load packages: data.table for fread() and :=, Imap for gdist()
library(data.table)
library(Imap)

# Read Data 
DT <- fread('dt.daily.4km.csv', showProgress = FALSE)
# Add two columns with truncated values of x and y (these are geog. coords.)
DT[,y_tr:=trunc(y)]
DT[,x_tr:=trunc(x)]

# For using on plotting (x-axis values)
xaxis<-seq.Date(as.Date("1981-01-01"),as.Date("2015-12-31"), "day") 

# subset by truncated coordinates to avoid full-table search. Now searches
# will happen in a smaller subset
DT2 <- DT[y_tr==trunc(y_clicked) & x_tr==trunc(x_clicked),]
# Add distance from each point in the data.table to the provided location, "gdist" is from
# Imap package for euclidean distance. 
DT2[,DIST:=gdist(lat.1 = DT2$y,
                       lon.1 = DT2$x,
                       lat.2 = y_clicked,
                       lon.2 = x_clicked, units="miles")]
# Get the minimum distance 
minDist <- min(DT2[,DIST])

# Get the y-axis values
yt <- transpose(DT2[DIST==minDist,3:(ncol(DT2)-3)])$V1

# Ready to plot xaxis vs yt 
...
...

I don't have the application on a public server. Basically, the user clicks on a map, and I then capture those coordinates, perform the search above, get the time series, and create a plot.
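The lookup described here can be sketched on a toy table as follows. Note the swap: this uses a plain squared planar distance instead of `gdist` from Imap, so it is only an approximation for real latitude/longitude data. All coordinates and values are invented.

```r
library(data.table)

# Toy grid of locations; x and y stand in for the geographic coordinates.
DT <- data.table(x = c(-100.2, -100.6, -101.1),
                 y = c(40.1, 40.7, 40.3),
                 D_19810101 = c(0.1, 0.2, 0.3))

# Hypothetical click location.
x_clicked <- -100.5
y_clicked <- 40.6

# Squared planar distance is enough to rank candidates; no sqrt needed.
DT[, d2 := (x - x_clicked)^2 + (y - y_clicked)^2]
nearest <- DT[which.min(d2)]
print(nearest)
```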

Found and fixed another stack overflow for a very large number of columns: https://github.com/Rdatatable/data.table/commit/d0469e670961dcdea115d433c0f2dce596d65906. I forgot to tag this issue number in the commit message.

Oh. That's a point. 872505 rows * 12780 cols is 11 billion rows. So my suggestion to go to long format won't work, as that's > 2^31. Sorry - I should have spotted that. We'll just have to bite the bullet and go > 2^31. In the meantime, let's stick with the wide format you have working, and I'll nail that down.
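The row count behind that limit is easy to verify. This sketch assumes 12783 value columns (all 12785 columns except x and y); the comment above rounds this to 12780.

```r
n_rows       <- 872505
n_value_cols <- 12783            # assumed: every column except x and y
long_rows    <- n_rows * n_value_cols

cat(format(long_rows, big.mark = ",", scientific = FALSE),
    "rows in long format\n")
cat("exceeds .Machine$integer.max:",
    long_rows > .Machine$integer.max, "\n")
```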

Please try again. Memory usage should be back to normal, and it should automatically reread the 508 of the 12,785 columns that have out-of-sample type exceptions. To avoid the time taken by the automatic reread, you can set colClasses.
If it isn't fixed, please paste the full verbose output. Fingers crossed!

Ok...
Latest data.table dev: data.table 1.10.5 IN DEVELOPMENT build
Summary of results

Four main things to mention:

  1. fread read the file under discussion fine.
  2. It took ~5.5 minutes to read, whereas the previous version took ~1.3 minutes.
  3. This version doesn't inflate the RAM allocation the way the previous version I tested did.
  4. The cores seem to be "less active" (see the screenshot below).

Some comments/questions:

  1. The cores' usage percentage is not as high as it was before (see the screenshot):
image

With the previous version of fread, core activity was ~80-90% the whole time. With this version it stayed at around 2-3% per core, as shown in the image above.


  2. I don't understand why fread is bumping from integer to numeric, e.g.
Column 1489 ("D_19850126") bumped from 'integer' to 'numeric' due to <<0.949999988079071>> somewhere between row 6041 and row 24473

I double-checked those rows and they look fine. If the column is already numeric, why is it detected as integer and then 'bumped' to 'numeric' (see the summary of the indicated rows below)? Or am I misreading this line? This happens on 508 lines. Could the NAs be causing trouble?

summary(DT[6041:24473,.(D_19850126)])
   D_19850126   
 Min.   :0.750  
 1st Qu.:0.887  
 Median :0.945  
 Mean   :0.966  
 3rd Qu.:1.045  
 Max.   :1.210  
 NA's   :18393 

Below is the verbose output from the test.

DT<-fread('dt.daily.4km.ver032917.csv', verbose=TRUE)

Output

Parameter na.strings == <<NA>>
None of the 0 na.strings are numeric (such as '-9999').
Input contains no \n. Taking this to be a filename to open
File opened, filesize is 43.772296 GB.
Memory mapping ... ok
Detected eol as \n only (no \r afterwards), the UNIX and Mac standard.
Positioned on line 1 starting: <<x,y,D_19810101,D_19810102,D_19810103,D_19810104,D_19810105,D_19810106,D_19810107,D_19810108,D_19810109,D_19810110,D_19810111,D_19810112,D_19810113,
...  
...
All the fields on line 1 are character fields. Treating as the column names.
Number of sampling jump points  = 11 because 1281788 startSize * 10 NJUMPS * 2 = 25635760 <= -244636240 bytes from line 2 to eof
Type codes (jump 00)    : 441111111111111111111111111111111111111111111111111111111111111111111111111111111111111111...1111111111  Quote rule 0
Type codes (jump 01)    : 444422222222222222242444424444442222222222424444424442224424222244222222222242422222224422...4442244422  Quote rule 0
Type codes (jump 02)    : 444422222222222222242444424444442422222242424444424442224424222244222222222242424222424424...4442444442  Quote rule 0
Type codes (jump 03)    : 444422222222222224242444424444444422222244424444424442224424222244222222222244424442424424...4442444442  Quote rule 0
Type codes (jump 04)    : 444444244422222224442444424444444444242244444444444444424444442244444222222244424442444444...4444444444  Quote rule 0
Type codes (jump 05)    : 444444244422222224442444424444444444242244444444444444424444442244444222222244424442444444...4444444444  Quote rule 0
Type codes (jump 06)    : 444444444422222224442444424444444444242244444444444444424444444444444244444444424442444444...4444444444  Quote rule 0
Type codes (jump 07)    : 444444444422222224442444424444444444242244444444444444444444444444444244444444424444444444...4444444444  Quote rule 0
Type codes (jump 08)    : 444444444442222224444444424444444444242244444444444444444444444444444444444444424444444444...4444444444  Quote rule 0
Type codes (jump 09)    : 444444444442222424444444424444444444244444444444444444444444444444444444444444424444444444...4444444444  Quote rule 0
Type codes (jump 10)    : 444444444442222424444444424444444444244444444444444444444444444444444444444444424444444444...4444444444  Quote rule 0
=====
 Sampled 305 rows (handled \n inside quoted fields) at 11 jump points including middle and very end
 Bytes from first data row on line 2 to the end of last row: 47000004016
 Line length: mean=45578.20 sd=33428.37 min=12815 max=108497
 Estimated nrow: 47000004016 / 45578.20 = 1031195
 Initial alloc = 2062390 rows (1031195 + 100%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
=====
Type codes (colClasses) : 444444444442222424444444424444444444244444444444444444444444444444444444444444424444444444...4444444444
Type codes (drop|select): 444444444442222424444444424444444444244444444444444444444444444444444444444444424444444444...4444444444
Allocating 12785 column slots (12785 - 0 dropped)
Reading 44928 chunks of 0.998MB (22 rows) using 144 threads
Read 872505 rows x 12785 columns from 43.772GB file in 05:26.908 wall clock time (can be slowed down by any other open apps even if seemingly idle)
Final type counts
         0 : drop     
         0 : logical  
         0 : integer  
         0 : integer64
     12785 : numeric  
         0 : character
Rereading 508 columns due to out-of-sample type exceptions.
Column 171 ("D_19810618") bumped from 'integer' to 'numeric' due to <<2.23000001907349>> somewhere between row 6041 and row 24473
Column 347 ("D_19811211") bumped from 'integer' to 'numeric' due to <<1.02999997138977>> somewhere between row 6041 and row 24473
Column 348 ("D_19811212") bumped from 'integer' to 'numeric' due to <<3.75>> somewhere between row 6041 and row 24473
Column 643 ("D_19821003") bumped from 'integer' to 'numeric' due to <<1.04999995231628>> somewhere between row 6041 and row 24473
Column 1066 ("D_19831130") bumped from 'integer' to 'numeric' due to <<1.46000003814697>> somewhere between row 6041 and row 24473
Column 1102 ("D_19840105") bumped from 'integer' to 'numeric' due to <<0.959999978542328>> somewhere between row 6041 and row 24473
Column 1124 ("D_19840127") bumped from 'integer' to 'numeric' due to <<0.620000004768372>> somewhere between row 6041 and row 24473
Column 1130 ("D_19840202") bumped from 'integer' to 'numeric' due to <<0.540000021457672>> somewhere between row 6041 and row 24473
Column 1489 ("D_19850126") bumped from 'integer' to 'numeric' due to <<0.949999988079071>> somewhere between row 6041 and row 24473
Column 1508 ("D_19850214") bumped from 'integer' to 'numeric' due to <<0.360000014305115>> somewhere between row 6041 and row 24473

... 
...

Reread 872505 rows x 508 columns in 05:29.167
Read 872505 rows. Exactly what was estimated and allocated up front
Thread buffers were grown 0 times (if all 144 threads each grew once, this figure would be 144)
=============================
   0.000s (  0%) Memory map
   0.093s (  0%) sep, ncol and header detection
   0.186s (  0%) Column type detection using 305 sample rows from 44928 jump points
   0.600s (  0%) Allocation of 872505 rows x 12785 cols (192.552GB) plus 1.721GB of temporary buffers
 326.029s ( 50%) Reading data
 329.167s ( 50%) Rereading 508 columns due to out-of-sample type exceptions
 656.075s        Total

This last summary took a long time to complete (~6 minutes). Counting this verbose summary, the whole fread took ~11 minutes. It is rereading those 508 columns, and I also see the message "Rereading 508 columns due to out-of-sample type exceptions" without verbose=TRUE.

Checking without "verbose=TRUE":

ptm<-proc.time() 
DT<-fread('dt.daily.4km.ver032917.csv')

Read 872505 rows x 12785 columns from 43.772GB file in 05:26.647 wall clock time (can be slowed down by any other open apps even if seemingly idle)
Rereading 508 columns due to out-of-sample type exceptions.

Reread 872505 rows x 508 columns in 05:30.276
proc.time() - ptm 
    user   system  elapsed 
2113.100   85.919  657.870

Write test
It works nicely. Really fast. I had always thought that writing was slower than reading.

fwrite(DT,'dt.daily.4km.ver032917.csv', verbose=TRUE)
No list columns are present. Setting sep2='' otherwise quote='auto' would quote fields containing sep2.
maxLineLen=151187 from sample. Found in 1.890s
Writing column names ... done in 0.000s
Writing 872505 rows in 32315 batches of 27 rows (each buffer size 8MB, showProgress=1, nth=144) ... 
done (actual nth=144, anyBufferGrown=no, maxBuffUsed=46%) 

The "rereading" takes quite some time. It's like reading the file twice.

Excellent! Thanks for all the info.

  1. Yes, you're right. It shouldn't have reread. You passed colClasses=list("numeric"=1:12785), so the output line starting Type codes (colClasses) should be all 4s. Will fix and add the missing test.
  2. On 5.5m vs. 1.3m - that's interesting. Each thread's buffer is currently 1MB, the idea being that it should be small enough to fit in cache. But in your case 1MB / 12785 columns = 82 bytes. So that choice looks about 50 times cache-inefficient. Seems like it, anyway. I'd never have thought of that if it weren't for your testing. It didn't have the 1MB size when it was running at the 1.3m speed.
  3. Why it sampled only 305 lines is very strange too. It should have sampled 1,000. Once that's fixed, it will probably detect the 508 columns without you needing to. The sample takes no time (0.186s), so I could increase the sample size.
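The buffer arithmetic in point 2 can be reproduced directly from the figures in the verbose output. This is a rough sketch: using `floor()` for the rows-per-chunk figure is an assumption that happens to match the "(22 rows)" shown in the log, not a statement about how fread computes it.

```r
# Figures taken from the discussion and verbose output above.
buffer_bytes <- 1024^2      # 1MB per-thread buffer
n_cols       <- 12785

# Share of the buffer available to each column.
bytes_per_col <- buffer_bytes / n_cols

# Rows that fit in one 0.998MB chunk at the sampled mean line length.
chunk_bytes    <- 0.998 * 1024^2
mean_line_len  <- 45578.20
rows_per_chunk <- floor(chunk_bytes / mean_line_len)

cat(sprintf("%.0f bytes per column, %d rows per chunk\n",
            bytes_per_col, as.integer(rows_per_chunk)))
```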

Could you paste the output of the lscpu unix command, please? That will tell us your cache sizes and I can think from there. I could provide buffMB as a parameter to fread so you could see whether that's it. If so, I can come up with a better calculation.

I made a mistake on one point in my previous post. colClasses=list("numeric"=1:12785) works fine. If I don't specify "colClasses", it does the "rereading". Sorry for the confusion.
One thing I noticed is that if I don't specify "colClasses", the table is created with NAs, while RAM looks as if the DT was loaded normally (~106MB in RAM).

Here is the lscpu output.

Architecture:          x86_64                                               
CPU op-mode(s):        32-bit, 64-bit                                       
Byte Order:            Little Endian                                        
CPU(s):                144                                                  
On-line CPU(s) list:   0-143                                                      
Thread(s) per core:    2                                                          
Core(s) per socket:    18                                                         
Socket(s):             4                                                          
NUMA node(s):          4                                                             
Vendor ID:             GenuineIntel                                                  
CPU family:            6                                                             
Model:                 63                                                             
Model name:            Intel(R) Xeon(R) CPU E7-8870 v3 @ 2.10GHz                      
Stepping:              4                                                              
CPU MHz:               2898.328                                                          
BogoMIPS:              4195.66                                                           
Virtualization:        VT-x                                                              
L1d cache:             32K                                                                  
L1i cache:             32K                                                                  
L2 cache:              256K                                                                       
L3 cache:              46080K                                                                     
NUMA node0 CPU(s):     0,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,64,68,72,76,80,84,88,92,96,100,104,108,112,116,120,124,128,132,136,140                                                                                    
NUMA node1 CPU(s):     1,5,9,13,17,21,25,29,33,37,41,45,49,53,57,61,65,69,73,77,81,85,89,93,97,101,105,109,113,117,121,125,129,133,137,141                                                                                                        
NUMA node2 CPU(s):     2,6,10,14,18,22,26,30,34,38,42,46,50,54,58,62,66,70,74,78,82,86,90,94,98,102,106,110,114,118,122,126,130,134,138,142                                                                                                             
NUMA node3 CPU(s):     3,7,11,15,19,23,27,31,35,39,43,47,51,55,59,63,67,71,75,79,83,87,91,95,99,103,107,111,115,119,123,127,131,135,139,143   

Ok got it. Thanks.

Reading your first comment again: would it make more sense if it said "bumping from integer to double" rather than "bumping from integer to numeric"?

What do you mean by 'created with NAs'? Is the whole table full of NAs, or just the 508 columns?

What's 'DT was loaded normally (~106MB in RAM)'? It's a 44GB file, so how can 106MB be normal?

  • Well, all I have in this file are real numbers and NAs. There are no integer values. When does the message "bumping from datatypeA to datatypeB" come up?

  • That's exactly what I mean. With the version of data.table I currently have installed, if I omit colClasses the DT loads but is full of NAs. DT[!is.na(D_19821001),] returns 0 records, whereas if I load the table with colClasses and apply the same filter, I do get records.

  • Well, this file is 47GB as a CSV on disk, but once loaded into R it takes more than double that in RAM... isn't that related to the precision of the values? Once loaded, that is the actual memory allocation. Is numeric causing the increase?

Then there might be another typo: 106MB should have been 106GB.
I'm not following the NA aspect. But I'll fix a few things along the way, and then please try again fresh ...

Ok, please try again. The guessing sample has been increased to 10,000 (it'll be interesting to see how long that takes), and a minimum is now imposed on the buffer size.
You may have to wait at least 30 minutes for the drat package file to be promoted.

Sorry, yes, it was "GB" and not "MB". I hadn't had enough coffee.
I'll get the update and test it.

Tested with the latest data.table dev: data.table 1.10.5 IN DEVELOPMENT built 2017-03-30 16:31:45 UTC:

Summary:
It works fast (1.43 minutes to read the CSV) and memory allocation works well too. The RAM allocation no longer seems to balloon as before. A 44GB CSV on disk translates to ~112GB (+ 37GB of temporary buffers) in RAM once loaded. Is this related to the data type of the values in the file?

Test 1:

Without using colClasses=list("numeric"=1:12785)

DT<-fread('dt.daily.4km.csv', verbose=TRUE)

Parameter na.strings == <<NA>>
None of the 1 na.strings are numeric (such as '-9999').
Input contains no \n. Taking this to be a filename to open
File opened, filesize is 43.772296 GB.
Memory mapping ... ok
Detected eol as \n only (no \r afterwards), the UNIX and Mac standard.
Positioned on line 1 starting: <<x,y,D_19810101,D_19810102,D_19810103,D_19810104,D_19810105,D_19810106,D_19810107,
D_19810108,D_19810109,D_19810110,D_19810111,D_19810112,
...
All the fields on line 1 are character fields. Treating as the column names.
Number of sampling jump points  = 101 because 47000004016 bytes from row 1 to eof / (2 * 1281788 jump0size) == 18333
Type codes (jump 000)    : 441111111111111111111111111111111111111111111111111111111111111111111111111111111111111111...1111111111  Quote rule 0
Type codes (jump 001)    : 444222422222222224442444444444442222444444444444444444442444444444444222224444444444444444...4444444442  Quote rule 0
Type codes (jump 002)    : 444222444444222244442444444444444422444444444444444444442444444444444222224444444444444444...4444444442  Quote rule 0
...
Type codes (jump 034)    : 444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444...4444444444  Quote rule 0
Type codes (jump 100)    : 444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444...4444444444  Quote rule 0
=====
 Sampled 10049 rows (handled \n inside quoted fields) at 101 jump points including middle and very end
 Bytes from first data row on line 2 to the end of last row: 47000004016
 Line length: mean=79727.22 sd=32260.00 min=12804 max=153029
 Estimated nrow: 47000004016 / 79727.22 = 589511
 Initial alloc = 1179022 rows (589511 + 100%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
=====
Type codes (colClasses)  : 444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444...4444444444
Type codes (drop|select) : 444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444...4444444444
Allocating 12785 column slots (12785 - 0 dropped)
Reading 432 chunks of 103.756MB (1364 rows) using 144 threads
Read 872505 rows x 12785 columns from 43.772GB file in 02:17.726 wall clock time (can be slowed down by any other open apps even if seemingly idle)
Final type counts
         0 : drop     
         0 : logical  
         0 : integer  
         0 : integer64
     12785 : double   
         0 : character
Thread buffers were grown 67 times (if all 144 threads each grew once, this figure would be 144)
=============================
   0.000s (  0%) Memory map
   0.099s (  0%) sep, ncol and header detection
  11.057s (  8%) Column type detection using 10049 sample rows
   0.899s (  1%) Allocation of 872505 rows x 12785 cols (112.309GB) plus 37.433GB of temporary buffers
 125.671s ( 91%) Reading data
   0.000s (  0%) Rereading 0 columns due to out-of-sample type exceptions
 137.726s        Total

Test 2:

Now with colClasses=list("numeric"=1:12785).
The timing improved by a few seconds (137.7s to 103.0s in total) ...

DT<-fread('dt.daily.4km.csv', colClasses=list("numeric"=1:12785), verbose=TRUE)

Allocating 12785 column slots (12785 - 0 dropped)
Reading 432 chunks of 103.756MB (1364 rows) using 144 threads
Read 872505 rows x 12785 columns from 43.772GB file in 01:43.028 wall clock time (can be slowed down by any other open apps even if seemingly idle)
Final type counts
         0 : drop     
         0 : logical  
         0 : integer  
         0 : integer64
     12785 : double   
         0 : character
Thread buffers were grown 67 times (if all 144 threads each grew once, this figure would be 144)
=============================
   0.000s (  0%) Memory map
   0.092s (  0%) sep, ncol and header detection
  11.009s ( 11%) Column type detection using 10049 sample rows
   0.332s (  0%) Allocation of 872505 rows x 12785 cols (112.309GB) plus 37.433GB of temporary buffers
  91.595s ( 89%) Reading data
   0.000s (  0%) Rereading 0 columns due to out-of-sample type exceptions
 103.028s        Total

Ok great - we're getting there. The corrected sample size, plus increasing it to 100 lines at each of 100 points (10,000 sample lines), was enough to guess the types correctly. Great. Sampling took 11s, though, because the 44GB file has 12,785 columns and the lines average 80,000 characters. But that time was worth it, because it avoided a reread that would have taken an extra 90s. We'll stick with that then.

I think the second run is faster simply because it was the second run, and your operating system had warmed up and cached the file. Anything else running on the box also affects wall clock timings. The way to deal with this is to do 3 identical consecutive runs of the first test, then change one thing only and do 3 identical consecutive runs again. At 44GB you will see a lot of natural variance. 3 runs is usually enough to draw a conclusion, but it can be a black art.
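A minimal sketch of that "three identical consecutive runs" routine is shown below. The helper `time_runs` and the small temporary file are made up for illustration; in practice the timed function would wrap the actual fread call.

```r
# Time n identical consecutive runs of a zero-argument function.
time_runs <- function(f, n = 3) {
  vapply(seq_len(n), function(i) system.time(f())[["elapsed"]], numeric(1))
}

# Stand-in workload; replace with e.g. function() fread("file.csv").
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(a = runif(1e4), b = runif(1e4)), tmp, row.names = FALSE)

elapsed <- time_runs(function() read.csv(tmp))
print(elapsed)           # compare the spread across runs before concluding
print(median(elapsed))   # a single summary figure per configuration
unlink(tmp)
```

Changing one setting at a time and repeating the three runs keeps cache warm-up effects from being mistaken for a real speedup.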

Yes, the 112GB in memory vs. 44GB on disk is partly because the data is bigger in memory, since all the columns are type double. There is no in-memory compression in R, and there are quite a lot of NA values, which take no space in this CSV (just ",,") but 8 bytes each in memory. However, it should be 83GB, not 112GB (872505 rows x 12785 cols x 8-byte doubles / 1024^3 = 83GB). That 112GB is what was allocated based on the mean and standard deviation of the line lengths. It estimated 589,511 rows from the mean line length, which was too low. The variance in line length is so large that the clamp was hit at +100%: 589,511 * 2 = 1,179,022 rows, and 1,179,022 * 12785 * 8 / 1024^3 = 112GB. In the end it found 872,505 rows in the file, but it isn't freeing up the spare space. I'll fix that. (TODO1)
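The numbers in that explanation can be reproduced from the verbose output. This is a sketch only: using `ceiling()` for the row estimate is an assumption that reproduces the 589511 printed in the log, and the helper `gib` is made up for illustration.

```r
bytes    <- 47000004016   # bytes from the first data row to end of file
mean_len <- 79727.22      # mean sampled line length
n_cols   <- 12785
actual_n <- 872505        # rows actually found in the file

est_n   <- ceiling(bytes / mean_len)  # estimated row count (assumed rounding)
alloc_n <- 2 * est_n                  # allocation after the +100% clamp

# Memory for an all-double table with n rows, in GB (1024^3 bytes).
gib <- function(n) n * n_cols * 8 / 1024^3

cat(sprintf("estimated rows: %.0f, allocated rows: %.0f\n", est_n, alloc_n))
cat(sprintf("allocated: %.1f GB, needed for actual rows: %.1f GB\n",
            gib(alloc_n), gib(actual_n)))
```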

Even if you specify the column type for every column, fread still samples. It should skip sampling when the user has specified every column. (TODO2)

Because all your data is double, fread makes 11 billion calls to the C library function strtod(). The long-standing wish to specialize that function should, in theory, speed up this file significantly. (DONE)

Thanks for the good explanation. Let me know if you want me to test anything else with this file. I'm also working on another dataset that is closer to long format, with ~21 million rows x 1432 columns.

     NAME       NROW  NCOL      MB                                                                                                                                                     
[1,] DT   21,812,625 1,432 238,310  

Adding another data point here: an 89G .tsv, with peak memory usage during load of ~180G. I think that is expected, since there are a lot of NAs and doubles.

I'm also happy to test against this.

Ubuntu 16.04 64bit / Linux 4.4.0-71-generic
R version 3.3.2 (2016-10-31)
data.table 1.10.5 IN DEVELOPMENT built 2017-04-04 14:27:46 UTC

Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                64
On-line CPU(s) list:   0-63
Thread(s) per core:    2
Core(s) per socket:    16
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 79
Model name:            Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
Stepping:              1
CPU MHz:               2699.984
CPU max MHz:           3000.0000
CPU min MHz:           1200.0000
BogoMIPS:              4660.70
Hypervisor vendor:     Xen
Virtualization type:   full
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              46080K
NUMA node0 CPU(s):     0-15,32-47
NUMA node1 CPU(s):     16-31,48-63
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq monitor est ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm fsgsbase bmi1 hle avx2 smep bmi2 erms invpcid rtm xsaveopt ida
Parameter na.strings == <<NA>>
None of the 1 na.strings are numeric (such as '-9999').
Input contains no \n. Taking this to be a filename to open
File opened, filesize is 88.603947 GB.
Memory mapping ... ok
Detected eol as \n only (no \r afterwards), the UNIX and Mac standard.
Positioned on line 1 starting: <<allele prediction_uuid sample_>>
Detecting sep ...
  sep=='\t'  with 101 lines of 76 fields using quote rule 0
Detected 76 columns on line 1. This line is either column names or first data row (first 30 chars): <<allele    prediction_uuid sample_>>
All the fields on line 1 are character fields. Treating as the column names.
Number of sampling jump points  = 101 because 95137762779 bytes from row 1 to eof / (2 * 24414 jump0size) == 1948426
Type codes (jump 000)    : 5555542444111145424441111444111111111111111111111111111111111111111111111111  Quote rule 0
Type codes (jump 009)    : 5555542444114445424441144444111111111111111111111111111111111111111111111111  Quote rule 0
Type codes (jump 042)    : 5555542444444445424444444444111111111111111111111111111111111111111111111111  Quote rule 0
Type codes (jump 048)    : 5555544444444445444444444444225225522555545111111111111111111111111111111111  Quote rule 0
Type codes (jump 083)    : 5555544444444445444444444444225225522555545254452454411154454452454411154455  Quote rule 0
Type codes (jump 085)    : 5555544444444445444444444444225225522555545254452454454454454452454454454455  Quote rule 0
Type codes (jump 100)    : 5555544444444445444444444444225225522555545254452454454454454452454454454455  Quote rule 0
=====
 Sampled 10028 rows (handled \n inside quoted fields) at 101 jump points including middle and very end
 Bytes from first data row on line 2 to the end of last row: 95137762779
 Line length: mean=465.06 sd=250.27 min=198 max=929
 Estimated nrow: 95137762779 / 465.06 = 204571280
 Initial alloc = 409142560 rows (204571280 + 100%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
=====
Type codes (colClasses)  : 5555544444444445444444444444225225522555545254452454454454454452454454454455
Type codes (drop|select) : 5555544444444445444444444444225225522555545254452454454454454452454454454455
Allocating 76 column slots (76 - 0 dropped)
Reading 90752 chunks of 1.000MB (2254 rows) using 64 threads

In case it helps, here are results for a very long database: 419,124,196 x 42 (~2^34) with one header row and colClasses passed.

> library(data.table)
data.table 1.10.5 IN DEVELOPMENT built 2017-09-27 17:12:56 UTC; travis
  The fastest way to learn (by data.table authors): https://www.datacamp.com/courses/data-analysis-the-data-table-way
  Documentation: ?data.table, example(data.table) and browseVignettes("data.table")
  Release notes, videos and slides: http://r-datatable.com
> CC <- c(rep('integer', 2), rep('character', 3),
+      rep('numeric', 2), rep('integer', 3),
+      rep('character', 2), 'integer', 'character', 'integer',
+      rep('character', 4), rep('numeric', 11), 'character',
+      'numeric', 'character', rep('numeric', 2),
+      rep('integer', 3), rep('numeric', 2), 'integer',
+      'numeric')
> P <- fread('XXXX.csv', colClasses = CC, header = TRUE, verbose = TRUE)
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
  Using 40 threads (omp_get_max_threads()=40, nth=40)
  NAstrings = [<<NA>>]
  None of the NAstrings look like numbers.
  show progress = 1
  0/1 column will be read as boolean
[02] Opening the file
  Opening file XXXXcsv
  File opened, size = 51.71GB (55521868868 bytes).
  Memory mapping ... ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
  \r-only line endings are not allowed because \n is found in the data
[05] Skipping initial rows if needed
  Positioned on line 1 starting: <<X,X,X,X>>
[06] Detect separator, quoting rule, and ncolumns
  Detecting sep ...
  sep=','  with 100 lines of 42 fields using quote rule 0
  Detected 42 columns on line 1. This line is either column names or first data row. Line starts as: <<X,X,X,X>>
  Quote rule picked = 0
  fill=false and the most number of columns found is 42
[07] Detect column types, good nrow estimate and whether first row is column names
  'header' changed by user from 'auto' to true
  Number of sampling jump points = 101 because (55521868866 bytes from row 1 to eof) / (2 * 13006 jump0size) == 2134471
  Type codes (jump 000)    : 5161010775551055105101111111111111110110771117717  Quote rule 0
  Type codes (jump 022)    : 5561010775551055105101111111111111110110771117717  Quote rule 0
  Type codes (jump 030)    : 5561010775551055105101010107517171151110110771117717  Quote rule 0
  Type codes (jump 037)    : 5561010775551055105101010107517771171110110771117717  Quote rule 0
  Type codes (jump 073)    : 5561010775551055105101010107517771177110110771117717  Quote rule 0
  Type codes (jump 093)    : 5561010775551055105101010107717771177110110771117717  Quote rule 0
  Type codes (jump 100)    : 5561010775551055105101010107717771177110110771117717  Quote rule 0
  =====
  Sampled 10049 rows (handled \n inside quoted fields) at 101 jump points
  Bytes from first data row on line 1 to the end of last row: 55521868866
  Line length: mean=132.68 sd=6.00 min=118 max=425
  Estimated number of rows: 55521868866 / 132.68 = 418453923
  Initial alloc = 460299315 rows (418453923 + 9%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
  =====
[08] Assign column names
[09] Apply user overrides on column types
  After 11 type and 0 drop user overrides : 551010107755510105105101010107777777777710710775557757
[10] Allocate memory for the datatable
  Allocating 42 column slots (42 - 0 dropped) with 460299315 rows
[11] Read the data
  jumps=[0..52960), chunk_size=1048373, total_size=55521868441
Read 98%. ETA 00:00 
[12] Finalizing the datatable
Read 419124195 rows x 42 columns from 51.71GB (55521868868 bytes) file in 13:42.935 wall clock time
Thread buffers were grown 0 times (if all 40 threads each grew once, this figure would be 40)
Final type counts
         0 : drop     
         0 : bool8    
         0 : bool8    
         0 : bool8    
         0 : bool8    
        11 : int32    
         0 : int64    
        19 : float64  
         0 : float64  
         0 : float64  
        12 : string   
=============================
   0.000s (  0%) Memory map 51.709GB file
   0.016s (  0%) sep=',' ncol=42 and header detection
   0.016s (  0%) Column type detection using 10049 sample rows
 188.153s ( 23%) Allocation of 419124195 rows x 42 cols (125.177GB)
 634.751s ( 77%) Reading 52960 chunks of 1.000MB (7901 rows) using 40 threads
   =    0.121s (  0%) Finding first non-embedded \n after each jump
   +   17.036s (  2%) Parse to row-major thread buffers
   +  616.184s ( 75%) Transpose
   +    1.410s (  0%) Waiting
   0.000s (  0%) Rereading 0 columns due to out-of-sample type exceptions
 822.935s        Total
> memory.size()
[1] 134270.3
> rm(P)
> gc()
             used    (Mb)  gc trigger     (Mb)    max used     (Mb)
Ncells     585532    31.3     5489235    293.2     6461124    345.1
Vcells 1508139082 11506.2 20046000758 152938.9 25028331901 190951.1
> memory.size()
[1] 87.56
> P <- fread('XXXX.csv', colClasses = CC, header = TRUE, verbose = TRUE)
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
  Using 40 threads (omp_get_max_threads()=40, nth=40)
  NAstrings = [<<NA>>]
  None of the NAstrings look like numbers.
  show progress = 1
  0/1 column will be read as boolean
[02] Opening the file
  Opening file XXXX.csv
  File opened, size = 51.71GB (55521868868 bytes).
  Memory mapping ... ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
  \r-only line endings are not allowed because \n is found in the data
[05] Skipping initial rows if needed
  Positioned on line 1 starting: <<X,X,X,X>>
[06] Detect separator, quoting rule, and ncolumns
  Detecting sep ...
  sep=','  with 100 lines of 42 fields using quote rule 0
  Detected 42 columns on line 1. This line is either column names or first data row. Line starts as: <<X,X,X,X>>
  Quote rule picked = 0
  fill=false and the most number of columns found is 42
[07] Detect column types, good nrow estimate and whether first row is column names
  'header' changed by user from 'auto' to true
  Number of sampling jump points = 101 because (55521868866 bytes from row 1 to eof) / (2 * 13006 jump0size) == 2134471
  Type codes (jump 000)    : 5161010775551055105101111111111111110110771117717  Quote rule 0
  Type codes (jump 022)    : 5561010775551055105101111111111111110110771117717  Quote rule 0
  Type codes (jump 030)    : 5561010775551055105101010107517171151110110771117717  Quote rule 0
  Type codes (jump 037)    : 5561010775551055105101010107517771171110110771117717  Quote rule 0
  Type codes (jump 073)    : 5561010775551055105101010107517771177110110771117717  Quote rule 0
  Type codes (jump 093)    : 5561010775551055105101010107717771177110110771117717  Quote rule 0
  Type codes (jump 100)    : 5561010775551055105101010107717771177110110771117717  Quote rule 0
  =====
  Sampled 10049 rows (handled \n inside quoted fields) at 101 jump points
  Bytes from first data row on line 1 to the end of last row: 55521868866
  Line length: mean=132.68 sd=6.00 min=118 max=425
  Estimated number of rows: 55521868866 / 132.68 = 418453923
  Initial alloc = 460299315 rows (418453923 + 9%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
  =====
[08] Assign column names
[09] Apply user overrides on column types
  After 11 type and 0 drop user overrides : 551010107755510105105101010107777777777710710775557757
[10] Allocate memory for the datatable
  Allocating 42 column slots (42 - 0 dropped) with 460299315 rows
[11] Read the data
  jumps=[0..52960), chunk_size=1048373, total_size=55521868441
Read 98%. ETA 00:00 
[12] Finalizing the datatable
Read 419124195 rows x 42 columns from 51.71GB (55521868868 bytes) file in 05:04.910 wall clock time
Thread buffers were grown 0 times (if all 40 threads each grew once, this figure would be 40)
Final type counts
         0 : drop     
         0 : bool8    
         0 : bool8    
         0 : bool8    
         0 : bool8    
        11 : int32    
         0 : int64    
        19 : float64  
         0 : float64  
         0 : float64  
        12 : string   
=============================
   0.000s (  0%) Memory map 51.709GB file
   0.031s (  0%) sep=',' ncol=42 and header detection
   0.000s (  0%) Column type detection using 10049 sample rows
  28.437s (  9%) Allocation of 419124195 rows x 42 cols (125.177GB)
 276.442s ( 91%) Reading 52960 chunks of 1.000MB (7901 rows) using 40 threads
   =    0.017s (  0%) Finding first non-embedded \n after each jump
   +   12.941s (  4%) Parse to row-major thread buffers
   +  262.989s ( 86%) Transpose
   +    0.495s (  0%) Waiting
   0.000s (  0%) Rereading 0 columns due to out-of-sample type exceptions
 304.910s        Total
> memory.size()
[1] 157049.7

> sessionInfo()
R version 3.4.2 beta (2017-09-17 r73296)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows Server >= 2012 x64 (build 9200)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] data.table_1.10.5

loaded via a namespace (and not attached):
[1] compiler_3.4.2 tools_3.4.2   

A few notes. It would be better for [09] to come before [07], since there is no reason to check the types when colClasses is passed. Also, Windows showed about 160GB in use after each run; memory.size() probably does some cleanup. In case it matters, this server has 532GB of RAM, so memory caching may be related to the speed-up on the second run. Hope that helps.
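One thing to keep in mind: the log heading for [07] also mentions the row estimate, and the "Initial alloc" line prints the rule used to turn that estimate into an allocation. Below is a rough sketch of that printed rule using numbers copied from the two verbose logs above. `initial_alloc()` is a hypothetical name, not a data.table function, and truncating to whole rows is my assumption; the estimated row counts are taken from the logs rather than recomputed, because the logged mean line length is rounded.

```r
# Sketch of the rule printed in the verbose output:
#   bytes / max(mean - 2*sd, min), clamped between [1.1*estn, 2.0*estn]
# initial_alloc() is a hypothetical name used only for this illustration.
initial_alloc <- function(bytes, mean_len, sd_len, min_len, estn) {
  raw <- bytes / max(mean_len - 2 * sd_len, min_len)
  floor(min(max(raw, 1.1 * estn), 2.0 * estn))  # truncation is an assumption
}

# 89GB .tsv log: line lengths vary a lot (sd = 250.27), so the raw
# figure overshoots and the upper clamp (+100%) applies.
tsv <- initial_alloc(95137762779, 465.06, 250.27, 198, estn = 204571280)

# 51.71GB .csv log: line lengths are uniform (sd = 6.00), so the raw
# figure falls below the lower clamp (+10%), which then applies.
csv <- initial_alloc(55521868866, 132.68, 6.00, 118, estn = 418453923)

print(format(c(tsv = tsv, csv = csv), big.mark = ","))
```

Both results match the "Initial alloc" row counts in the respective logs. They also show why files with highly variable line lengths, like the original 44GB CSV, end up with the largest over-allocation.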

Testing on a wide "real" table (hospital data): 30M rows × 125 columns, vs. readr '1.2.0' and read.csv 3.4.3.

image

Any chance you could confirm whether the issue is still valid on 1.11.4? Or provide code that generates example data.

As far as I know, this is all resolved. @geponce please update if not.
TODO1 above is now filed as #3024.
TODO2 above is now filed as #3025.
