Hardware and software:
Server: Dell R930, 4 x Intel Xeon E7-8870 v3 2.1GHz, 45M cache, 9.6GT/s QPI, Turbo, HT, 18C/36T, and 1TB of RAM
OS: Redhat 7.1
R version: 3.3.2
data.table version: 1.10.5, built 2017-03-21
I'm loading a csv file (44 GB, 872505 rows x 12785 columns). It loads very fast, in 1.30 minutes, using 144 cores (72 cores across the 4 processors, with hyperthreading enabled to make it a 144-core box).
The main issue is that once the DT is loaded, the amount of memory in use is far larger than the size of the csv file. In this case the 44 GB csv (saved with fwrite; saving with saveRDS and compress = FALSE creates an 84GB file) is using about 356GB of RAM.
Here is the output using "verbose = TRUE":
Allocating 12785 column slots (12785 - 0 dropped)
madvise sequential: ok
Reading data with 1440 jump points and 144 threads
Read 95.7% of 858881 estimated rows
Read 872505 rows x 12785 columns from 43.772GB file in 1 min 33.736 secs of wall clock time (affected by other apps running)
0.000s ( 0%) Memory map
0.070s ( 0%) sep, ncol and header detection
26.227s ( 28%) Column type detection using 34832 sample rows from 1440 jump points
0.614s ( 1%) Allocation of 3683116 rows x 12785 cols (350.838GB) in RAM
0.000s ( 0%) madvise sequential
66.825s ( 71%) Reading data
93.736s Total
This looks similar to an issue that sometimes arises with the parallel package, where one rsession per core is launched when using functions such as "mclapply". See the Rsessions created/listed in this screenshot.
If I run "rm(DT)", RAM goes back to its initial state and the "Rsessions" are removed.
I have already tried e.g. "setDTthreads(20)", and it still uses the same amount of RAM.
By the way, if the file is loaded with the non-parallel version of "fread", the memory allocation only goes up to about 106GB.
Guillermo
This is the output from the non-parallel fread implementation (data.table 1.10.5 IN DEVELOPMENT built 2017-02-09).
And after double-checking the amount of memory in use, it came to about 84GB.
Guillermo
Yes, you're right. Thanks for the great report. The estimated nrow looks about right (858,881 vs 872,505), but the allocation is then 4.2 times bigger than that (3,683,116), which is way off. I've improved the calculation and added more detail to the verbose output. Please hold off retesting for now, though, until a few more things are done.
Please retest. It should be fixed now.
I just installed data.table dev:
data.table 1.10.5 IN DEVELOPMENT built 2017-03-27 02:50:31 UTC
The first thing I get when trying to load the same 44GB file is this message:
DT <- fread('dt.daily.4km.csv')
Error: protect(): protection stack overflow
I then re-ran the same command and it started to work fine. However, this version is not using multi-core mode. The load takes about 25 minutes, the same as before the parallel fread version was in place.
Guillermo
After closing all R sessions and running the test again, I get this error:
Guessed column type was insufficient for 34711745 values in 508 columns. Please set the classes for these columns manually using colClasses.
See below: I get several messages of the form "guessed integer but contains <<0....>>".
Read 872505 rows x 12785 columns from 43.772GB file in 15:27.024 wall clock time (can be slowed down by any other open apps even if seemingly idle)
Column 171 ('D_19810618') guessed 'integer' but contains <<2.23000001907349>>
Column 347 ('D_19811211') guessed 'integer' but contains <<1.02999997138977>>
Column 348 ('D_19811212') guessed 'integer' but contains <<3.75>>
Wow, your file is really testing the edge cases. Great. In future, please run with verbose=TRUE and provide the full output. In this case, though, the information you gave was enough for me to see what the problem actually is. A buffer is created per column per thread (over 12,000 columns in this case), and currently each of those is protected separately. There is a way to avoid that, which I will do. The message about the type guess is correct. Do you know what those 508 columns mean, and do you agree they should be numeric? You can pass a range of columns to colClasses like this: colClasses=list("numeric"=11:518)
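As a small illustration of the named-list form of colClasses mentioned above, here is a minimal sketch on a tiny temporary file. The file contents and the column range 3:5 are made up for the example; only the D_YYYYMMDD naming pattern is taken from the thread.

```r
library(data.table)

# Tiny stand-in for the wide file: two coordinate columns and three daily
# columns. The values are invented for illustration.
csv <- tempfile(fileext = ".csv")
writeLines(c("x,y,D_19810101,D_19810102,D_19810103",
             "1.5,2.5,0,1,2",
             "3.5,4.5,1,,3"), csv)

# Without colClasses, the daily columns in this sample look like integers.
guessed <- fread(csv)

# Force columns 3 to 5 to double by position, using the named-list form.
forced <- fread(csv, colClasses = list(numeric = 3:5))

sapply(guessed, class)
sapply(forced, class)
```

On a real file the range would be whichever columns are known to hold real numbers; passing the types up front is what avoids a type bump or reread later.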
What field is this data from? Are you creating the file? It feels as if it has been twisted into wide format, whereas normal best practice is to write it (and keep it in memory too) in long format. I would normally expect names such as "D_19810618" to appear as values in a column, not as 508 columns in their own right. That is why I ask whether you are creating the file, and whether you can create it in long format. If not, I'd suggest telling whoever creates the file that it could be produced in a better shape. I guess you are applying operations across columns, perhaps using .SD and .SDcols. Long format, with keyby= on the column holding values like "D_19810618", is far better.
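To make the wide-versus-long suggestion concrete, here is a minimal sketch that reshapes a tiny wide table to long format with melt() and then aggregates with keyby=. The table contents and the names long, day, value and daily_mean are hypothetical; only the x, y and D_YYYYMMDD column pattern comes from the thread.

```r
library(data.table)

# Hypothetical wide table: one row per location, one column per day.
wide <- data.table(x = c(-110.5, -110.4), y = c(32.1, 32.2),
                   D_19810101 = c(0.5, NA), D_19810102 = c(1.2, 0.3))

# Long format: the day names become values in a single column.
long <- melt(wide, id.vars = c("x", "y"),
             variable.name = "day", value.name = "value")

# Turn "D_19810101" into a proper Date.
long[, date := as.Date(sub("^D_", "", day), format = "%Y%m%d")]

# Grouped operation with keyby= instead of looping over columns via .SD.
daily_mean <- long[, .(mean_value = mean(value, na.rm = TRUE)), keyby = date]
daily_mean
```

This only shows the shape of the approach. Whether it is practical for a given table depends on the resulting row count, since the long table has one row per location per day.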
That said, I'll still try to make fread as good as it can be at handling any input thrown at it, even very wide files with over 12,000 columns.
I hope other people test their own files and have no problems at all!
_Do you know what those 508 columns mean, and do you agree they should be numeric?_
This table contains one time series per row: IDx, IDy, Time1_value, Time2_value, Time3_value, ... and every TimeN_value column contains only numeric values. If I use colClasses, I will need to do it for 12783 columns: list("numeric"=2:12783). I will try this.
_What field is this data from?_
Geospatial data. I'm using IDx and IDy to do a Euclidean distance search within the DT. I thought that more rows would make the search slower?
Right now it is pretty fast (wide format). When a user clicks on an area, a plot is generated of the time series from the location in the csv file (with data available) that is closest to the click. I'll try implementing it in long format instead.
I'll report back with some results.
No need for list("numeric"=2:12783), iiuc, because only the 508 columns need to be bumped. Oh, I see: the 508 are scattered through the columns, I guess? (They are not a contiguous set of columns?)
No, data.table is almost never faster when the data is wide. Long is almost always faster and more convenient. Have you seen or tried roll="nearest"? How are you doing it now? Please show the code so we can understand. Long format is almost certainly better, but it may need a 2D nearest extension. Please give timings too. When you say "pretty fast", I've found that people have wildly different ideas of what "pretty fast" means.
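For readers who have not used roll="nearest", here is a minimal one-dimensional sketch. The tables stations and clicks are hypothetical. As noted above, a real x/y lookup would need a 2D extension, which this sketch does not attempt.

```r
library(data.table)

# Hypothetical 1-D example: locations keyed by a single numeric coordinate.
stations <- data.table(x = c(1, 5, 10), id = c("a", "b", "c"), key = "x")

# Coordinates to look up (for instance, where a user clicked).
clicks <- data.table(x = c(6.2, 0.3, 8.1))

# Rolling join: each click is matched to the row whose key is nearest.
nearest <- stations[clicks, roll = "nearest"]
nearest
```

In the result, the x column holds the clicked values and id holds the matched station, so no distance column has to be computed and minimised by hand.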
When melting this table I hit the 2^31 limit. I get the error "negative length vectors are not allowed".
I'll go back to the source and see whether it can be generated in long format.
# Read Data (gdist, used below, comes from the Imap package)
library(data.table)
library(Imap)
DT <- fread('dt.daily.4km.csv', showProgress = FALSE)
# Add two columns with truncated values of x and y (these are geog. coords.)
DT[,y_tr:=trunc(y)]
DT[,x_tr:=trunc(x)]
# For using on plotting (x-axis values)
xaxis<-seq.Date(as.Date("1981-01-01"),as.Date("2015-12-31"), "day")
# subset by truncated coordinates to avoid full-table search. Now searches
# will happen in a smaller subset
DT2 <- DT[y_tr==trunc(y_clicked) & x_tr==trunc(x_clicked),]
# Add distance from each point in the data.table to the provided location, "gdist" is from
# Imap package for euclidean distance.
DT2[,DIST:=gdist(lat.1 = DT2$y,
lon.1 = DT2$x,
lat.2 = y_clicked,
lon.2 = x_clicked, units="miles")]
# Get the minimum distance
minDist <- min(DT2[,DIST])
# Get the y-axis values
yt <- transpose(DT2[DIST==minDist,3:(ncol(DT2)-3)])$V1
# Ready to plot xaxis vs yt
...
...
I have an app on a public server. Basically, the user clicks on a map, I capture the coordinates from that click, run the search above, get the time series, and create the plot.
Found and fixed another stack overflow for very large numbers of columns: https:
Oh yes, good point. 872505 rows * 12780 columns is 11 billion rows, so my suggestion to go to long format won't work, because that exceeds 2^31. Sorry, I should have spotted that. We'll just have to bite the bullet and go beyond 2^31. In the meantime, let's stick with the wide format you have working, and I'll nail that down.
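The row-count arithmetic behind the 2^31 remark can be checked directly. This is only a sketch of the calculation, using the figures quoted in the thread; the variable names are made up.

```r
# Figures quoted in the thread.
n_rows     <- 872505
n_day_cols <- 12780

# Rows the table would have in long format (computed in double precision,
# so the product itself does not overflow).
long_rows <- n_rows * n_day_cols

# Largest value an R integer can hold: 2^31 - 1.
limit <- .Machine$integer.max

long_rows            # about 11.15 billion
long_rows > limit    # the long table would exceed the limit
long_rows / limit    # by roughly a factor of five
```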
Please try again. Memory usage should be back to normal, and the 508 of the 12,785 columns with out-of-sample type exceptions should be reread automatically. To avoid the time taken by the automatic reread, you can set colClasses.
If it isn't fixed, please paste the full verbose output. Fingers crossed!
Ok...
Summary of results using the latest data.table dev: data.table 1.10.5 IN DEVELOPMENT built 2017-03-29 16:17:01 UTC
Four main points to mention:
Some comments/questions:
1. Usage on the cores is not as active as before; see the screenshot.
In previous versions of fread, activity on the cores was at about 80-90% all the time. In this version it stayed at about 2-3% on each core, as shown in the picture above.
Column 1489 ("D_19850126") bumped from 'integer' to 'numeric' due to <<0.949999988079071>> somewhere between row 6041 and row 24473
I double-checked those rows and they look fine. Why does the column need to be bumped to "numeric" when it was detected as integer but is already numeric (see the summary of the suggested rows below)? Or am I misreading that line? This happens 508 times. Are the NAs causing the problem?
summary(DT[6041:24473,.(D_19850126)])
D_19850126
Min. :0.750
1st Qu.:0.887
Median :0.945
Mean :0.966
3rd Qu.:1.045
Max. :1.210
NA's :18393
Some verbose output from the tests:
DT<-fread('dt.daily.4km.ver032917.csv', verbose=TRUE)
Output:
Parameter na.strings == <<NA>>
None of the 0 na.strings are numeric (such as '-9999').
Input contains no \n. Taking this to be a filename to open
File opened, filesize is 43.772296 GB.
Memory mapping ... ok
Detected eol as \n only (no \r afterwards), the UNIX and Mac standard.
Positioned on line 1 starting: <<x,y,D_19810101,D_19810102,D_19810103,D_19810104,D_19810105,D_19810106,D_19810107,D_19810108,D_19810109,D_19810110,D_19810111,D_19810112,D_19810113,
...
...
All the fields on line 1 are character fields. Treating as the column names.
Number of sampling jump points = 11 because 1281788 startSize * 10 NJUMPS * 2 = 25635760 <= -244636240 bytes from line 2 to eof
Type codes (jump 00) : 441111111111111111111111111111111111111111111111111111111111111111111111111111111111111111...1111111111 Quote rule 0
Type codes (jump 01) : 444422222222222222242444424444442222222222424444424442224424222244222222222242422222224422...4442244422 Quote rule 0
Type codes (jump 02) : 444422222222222222242444424444442422222242424444424442224424222244222222222242424222424424...4442444442 Quote rule 0
Type codes (jump 03) : 444422222222222224242444424444444422222244424444424442224424222244222222222244424442424424...4442444442 Quote rule 0
Type codes (jump 04) : 444444244422222224442444424444444444242244444444444444424444442244444222222244424442444444...4444444444 Quote rule 0
Type codes (jump 05) : 444444244422222224442444424444444444242244444444444444424444442244444222222244424442444444...4444444444 Quote rule 0
Type codes (jump 06) : 444444444422222224442444424444444444242244444444444444424444444444444244444444424442444444...4444444444 Quote rule 0
Type codes (jump 07) : 444444444422222224442444424444444444242244444444444444444444444444444244444444424444444444...4444444444 Quote rule 0
Type codes (jump 08) : 444444444442222224444444424444444444242244444444444444444444444444444444444444424444444444...4444444444 Quote rule 0
Type codes (jump 09) : 444444444442222424444444424444444444244444444444444444444444444444444444444444424444444444...4444444444 Quote rule 0
Type codes (jump 10) : 444444444442222424444444424444444444244444444444444444444444444444444444444444424444444444...4444444444 Quote rule 0
=====
Sampled 305 rows (handled \n inside quoted fields) at 11 jump points including middle and very end
Bytes from first data row on line 2 to the end of last row: 47000004016
Line length: mean=45578.20 sd=33428.37 min=12815 max=108497
Estimated nrow: 47000004016 / 45578.20 = 1031195
Initial alloc = 2062390 rows (1031195 + 100%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
=====
Type codes (colClasses) : 444444444442222424444444424444444444244444444444444444444444444444444444444444424444444444...4444444444
Type codes (drop|select): 444444444442222424444444424444444444244444444444444444444444444444444444444444424444444444...4444444444
Allocating 12785 column slots (12785 - 0 dropped)
Reading 44928 chunks of 0.998MB (22 rows) using 144 threads
Read 872505 rows x 12785 columns from 43.772GB file in 05:26.908 wall clock time (can be slowed down by any other open apps even if seemingly idle)
Final type counts
0 : drop
0 : logical
0 : integer
0 : integer64
12785 : numeric
0 : character
Rereading 508 columns due to out-of-sample type exceptions.
Column 171 ("D_19810618") bumped from 'integer' to 'numeric' due to <<2.23000001907349>> somewhere between row 6041 and row 24473
Column 347 ("D_19811211") bumped from 'integer' to 'numeric' due to <<1.02999997138977>> somewhere between row 6041 and row 24473
Column 348 ("D_19811212") bumped from 'integer' to 'numeric' due to <<3.75>> somewhere between row 6041 and row 24473
Column 643 ("D_19821003") bumped from 'integer' to 'numeric' due to <<1.04999995231628>> somewhere between row 6041 and row 24473
Column 1066 ("D_19831130") bumped from 'integer' to 'numeric' due to <<1.46000003814697>> somewhere between row 6041 and row 24473
Column 1102 ("D_19840105") bumped from 'integer' to 'numeric' due to <<0.959999978542328>> somewhere between row 6041 and row 24473
Column 1124 ("D_19840127") bumped from 'integer' to 'numeric' due to <<0.620000004768372>> somewhere between row 6041 and row 24473
Column 1130 ("D_19840202") bumped from 'integer' to 'numeric' due to <<0.540000021457672>> somewhere between row 6041 and row 24473
Column 1489 ("D_19850126") bumped from 'integer' to 'numeric' due to <<0.949999988079071>> somewhere between row 6041 and row 24473
Column 1508 ("D_19850214") bumped from 'integer' to 'numeric' due to <<0.360000014305115>> somewhere between row 6041 and row 24473
...
...
Reread 872505 rows x 508 columns in 05:29.167
Read 872505 rows. Exactly what was estimated and allocated up front
Thread buffers were grown 0 times (if all 144 threads each grew once, this figure would be 144)
=============================
0.000s ( 0%) Memory map
0.093s ( 0%) sep, ncol and header detection
0.186s ( 0%) Column type detection using 305 sample rows from 44928 jump points
0.600s ( 0%) Allocation of 872505 rows x 12785 cols (192.552GB) plus 1.721GB of temporary buffers
326.029s ( 50%) Reading data
329.167s ( 50%) Rereading 508 columns due to out-of-sample type exceptions
656.075s Total
This last step took a long time to complete (about 6 minutes). Counting the verbose summary, the whole run took about 11 minutes. It is rereading those 508 columns; without verbose = TRUE, I get the message "Rereading 508 columns due to out-of-sample type exceptions."
Check without "verbose = TRUE":
ptm<-proc.time()
DT<-fread('dt.daily.4km.ver032917.csv')
Read 872505 rows x 12785 columns from 43.772GB file in 05:26.647 wall clock time (can be slowed down by any other open apps even if seemingly idle)
Rereading 508 columns due to out-of-sample type exceptions.
Reread 872505 rows x 508 columns in 05:30.276
proc.time() - ptm
user system elapsed
2113.100 85.919 657.870
fwrite test
Working fine. Really fast. I had thought that writing would be slower than reading.
fwrite(DT,'dt.daily.4km.ver032917.csv', verbose=TRUE)
No list columns are present. Setting sep2='' otherwise quote='auto' would quote fields containing sep2.
maxLineLen=151187 from sample. Found in 1.890s
Writing column names ... done in 0.000s
Writing 872505 rows in 32315 batches of 27 rows (each buffer size 8MB, showProgress=1, nth=144) ...
done (actual nth=144, anyBufferGrown=no, maxBuffUsed=46%)
The "rereading" takes quite a while. It is like having to read the file twice.
Excellent! Thanks for all the info.
Since you passed colClasses=list("numeric"=1:12785), the output line starting with Type codes (colClasses) should be all 4s. I'll fix that and add the missing test. Please paste the output of the lscpu unix command. That will tell us your cache sizes, and I can work from there. I'll provide buffMB as a parameter to fread so you can check whether that is the cause. If it is, I can come up with a better calculation.
My mistake on one of the points in my previous post: colClasses=list("numeric"=1:12785) is working fine. If I don't specify "colClasses", it does the "rereading". Sorry for the confusion.
One thing I noticed is that if I don't specify "colClasses", the table gets created with NAs, and RAM looks as if the DT had loaded normally (about 106MB in RAM).
The output of lscpu follows:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 144
On-line CPU(s) list: 0-143
Thread(s) per core: 2
Core(s) per socket: 18
Socket(s): 4
NUMA node(s): 4
Vendor ID: GenuineIntel
CPU family: 6
Model: 63
Model name: Intel(R) Xeon(R) CPU E7-8870 v3 @ 2.10GHz
Stepping: 4
CPU MHz: 2898.328
BogoMIPS: 4195.66
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 46080K
NUMA node0 CPU(s): 0,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,64,68,72,76,80,84,88,92,96,100,104,108,112,116,120,124,128,132,136,140
NUMA node1 CPU(s): 1,5,9,13,17,21,25,29,33,37,41,45,49,53,57,61,65,69,73,77,81,85,89,93,97,101,105,109,113,117,121,125,129,133,137,141
NUMA node2 CPU(s): 2,6,10,14,18,22,26,30,34,38,42,46,50,54,58,62,66,70,74,78,82,86,90,94,98,102,106,110,114,118,122,126,130,134,138,142
NUMA node3 CPU(s): 3,7,11,15,19,23,27,31,35,39,43,47,51,55,59,63,67,71,75,79,83,87,91,95,99,103,107,111,115,119,123,127,131,135,139,143
OK, got it. Thanks.
Reading your first comment again: would it make more sense if it said "bumped from integer to double" rather than "bumped from integer to numeric"?
What do you mean by "created with NAs"? Is the whole table full of NAs, or just the 508 columns?
And what do you mean by "DT loaded normally (about 106MB in RAM)"? It is a 44GB file, so how can 106MB be normal?
Well, all I have in this file is real numbers and NAs. There are no integer values. Could the "bumped from datatypeA to datatypeB" message be bypassed?
Yes, that is exactly what I meant. With the version of data.table I currently have installed, omitting colClasses loads the DT, but it is full of NAs. DT[!is.na(D_19821001),] returns 0 records, whereas loading the table with colClasses and running the same filter does return records.
Well, this file is 47 GB as a csv on disk, and once loaded into R it takes more than twice that in RAM... Is that unrelated to the precision of the values? Is it the actual in-memory allocation for the numbers, once loaded, that causes the increase?
Maybe another typo: 106MB should have been 106GB.
I wasn't following the NA aspect. But I've made some fixes along the way, so please try again...
OK, please try again. The guessing sample has increased to 10,000 rows (it will be interesting to see how long that takes), and the buffer sizes now have a minimum imposed.
You may have to wait at least 30 minutes for the drat package file to be promoted.
Sorry, yes, it was "GB", not "MB". I hadn't had enough coffee.
I'll get the update and test it.
Test with the latest data.table dev:
Summary:
It runs fast (1.43 minutes to read the csv) and memory allocation is working fine too; I think RAM allocation no longer grows the way it did before. The 44GB csv on disk becomes about 112GB in RAM once loaded (plus 37GB of temporary buffers). Is this related to the data type of the values in the file?
Test 1:
Without using colClasses=list("numeric"=1:12785)
DT<-fread('dt.daily.4km.csv', verbose=TRUE)
Parameter na.strings == <<NA>>
None of the 1 na.strings are numeric (such as '-9999').
Input contains no \n. Taking this to be a filename to open
File opened, filesize is 43.772296 GB.
Memory mapping ... ok
Detected eol as \n only (no \r afterwards), the UNIX and Mac standard.
Positioned on line 1 starting: <<x,y,D_19810101,D_19810102,D_19810103,D_19810104,D_19810105,D_19810106,D_19810107,
D_19810108,D_19810109,D_19810110,D_19810111,D_19810112,
...
All the fields on line 1 are character fields. Treating as the column names.
Number of sampling jump points = 101 because 47000004016 bytes from row 1 to eof / (2 * 1281788 jump0size) == 18333
Type codes (jump 000) : 441111111111111111111111111111111111111111111111111111111111111111111111111111111111111111...1111111111 Quote rule 0
Type codes (jump 001) : 444222422222222224442444444444442222444444444444444444442444444444444222224444444444444444...4444444442 Quote rule 0
Type codes (jump 002) : 444222444444222244442444444444444422444444444444444444442444444444444222224444444444444444...4444444442 Quote rule 0
...
Type codes (jump 034) : 444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444...4444444444 Quote rule 0
Type codes (jump 100) : 444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444...4444444444 Quote rule 0
=====
Sampled 10049 rows (handled \n inside quoted fields) at 101 jump points including middle and very end
Bytes from first data row on line 2 to the end of last row: 47000004016
Line length: mean=79727.22 sd=32260.00 min=12804 max=153029
Estimated nrow: 47000004016 / 79727.22 = 589511
Initial alloc = 1179022 rows (589511 + 100%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
=====
Type codes (colClasses) : 444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444...4444444444
Type codes (drop|select) : 444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444...4444444444
Allocating 12785 column slots (12785 - 0 dropped)
Reading 432 chunks of 103.756MB (1364 rows) using 144 threads
Read 872505 rows x 12785 columns from 43.772GB file in 02:17.726 wall clock time (can be slowed down by any other open apps even if seemingly idle)
Final type counts
0 : drop
0 : logical
0 : integer
0 : integer64
12785 : double
0 : character
Thread buffers were grown 67 times (if all 144 threads each grew once, this figure would be 144)
=============================
0.000s ( 0%) Memory map
0.099s ( 0%) sep, ncol and header detection
11.057s ( 8%) Column type detection using 10049 sample rows
0.899s ( 1%) Allocation of 872505 rows x 12785 cols (112.309GB) plus 37.433GB of temporary buffers
125.671s ( 91%) Reading data
0.000s ( 0%) Rereading 0 columns due to out-of-sample type exceptions
137.726s Total
Test 2:
Now with colClasses=list("numeric"=1:12785). The timing improved by a few seconds...
DT<-fread('dt.daily.4km.csv', colClasses=list("numeric"=1:12785), verbose=TRUE)
Allocating 12785 column slots (12785 - 0 dropped)
Reading 432 chunks of 103.756MB (1364 rows) using 144 threads
Read 872505 rows x 12785 columns from 43.772GB file in 01:43.028 wall clock time (can be slowed down by any other open apps even if seemingly idle)
Final type counts
0 : drop
0 : logical
0 : integer
0 : integer64
12785 : double
0 : character
Thread buffers were grown 67 times (if all 144 threads each grew once, this figure would be 144)
=============================
0.000s ( 0%) Memory map
0.092s ( 0%) sep, ncol and header detection
11.009s ( 11%) Column type detection using 10049 sample rows
0.332s ( 0%) Allocation of 872505 rows x 12785 cols (112.309GB) plus 37.433GB of temporary buffers
91.595s ( 89%) Reading data
0.000s ( 0%) Rereading 0 columns due to out-of-sample type exceptions
103.028s Total
OK, great, we're getting there. In addition to the fixes, increasing the sample size to 100 lines at 100 points (10,000 sample lines) was enough to guess the types correctly. Sampling took 11 seconds because the 44GB file has 12,785 columns and the lines average 80,000 characters. That time was worth it, though, because it avoided a reread that would have taken an extra 90 seconds. We'll stick with that.
I think the second run was faster simply because it was the second run: your operating system had warmed up and cached the file. Anything else running on the box will also affect the wall clock timings. The way to address that is to run the first test 3 times in a row, then change one thing only and run 3 identical consecutive runs again. At 44GB you will see a lot of natural variance. Three runs are usually enough to draw a conclusion, but it can be a black art.
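The "three consecutive runs, then change one thing" procedure described above could be scripted along these lines. This is only a sketch: time_runs is a hypothetical helper, and the two workloads are small stand-ins so that the example runs anywhere.

```r
# Hypothetical helper: run a zero-argument function n times in a row and
# return the elapsed wall clock time of each run, in seconds.
time_runs <- function(f, n = 3) {
  vapply(seq_len(n), function(i) system.time(f())[["elapsed"]], numeric(1))
}

# Stand-in workloads. In the situation discussed in the thread, these would
# be two fread() calls on the same file that differ in exactly one argument.
baseline <- time_runs(function() sum(sqrt(seq_len(1e6))))
changed  <- time_runs(function() sum(sqrt(as.numeric(seq_len(1e6)))))

# Compare the runs side by side rather than trusting a single timing.
rbind(baseline = baseline, changed = changed)
```

Looking at all three values per variant makes a cold-cache first run, or interference from other processes, easier to spot than a single number would.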
Yes, the 112GB in memory versus 44GB on disk is partly because the data really is bigger in memory: all the columns are of type double, R has no in-memory compression, and there are quite a lot of NA values, which take no space in the CSV (just ",,") but 8 bytes each in memory. However, it should be 83GB, not 112GB (872505 rows x 12785 columns x 8-byte doubles / 1024^3 = 83GB). The 112GB is what was allocated based on the mean and standard deviation of the line lengths. Based on the mean line length, the file was estimated to have 589,511 rows, which seems way off. Because the variance in line length was so high, the clamp kicked in at +100%: 589,511 * 2 = 1,179,022 rows, and 1,179,022 * 12785 * 8 / 1024^3 = 112GB. In the end it found that the file contained 872,505 rows, but it does not then free the spare space. I will fix that. (TODO1)
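The memory figures in this explanation can be reproduced with a few lines of arithmetic. The sketch below uses only numbers that appear in the thread; the helper name gib is made up, and it assumes every column is an 8-byte double.

```r
# Size in GB (1024^3 bytes) of a table of 8-byte doubles.
gib <- function(rows, cols, bytes = 8) rows * cols * bytes / 1024^3

# What the final table actually needs: about 83.1
needed <- gib(872505, 12785)

# What was allocated: estimated rows doubled by the +100% clamp.
est_rows   <- 589511
alloc_rows <- 2 * est_rows           # 1179022, as in the verbose output
allocated  <- gib(alloc_rows, 12785) # about 112.3

# The allocation from the very first report (3683116 rows): about 350.8
first_report <- gib(3683116, 12785)

c(needed = needed, allocated = allocated, first_report = first_report)
```

The second and third values match the "112.309GB" and "350.838GB" allocation lines in the verbose output, which is a useful sanity check that the allocation, rather than the data itself, explains the extra memory.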
However, it still samples even when you provide the column type for every column. It should skip sampling when the user has specified all the columns. (TODO2)
Since all your data is double, it makes 11 billion calls to the C library function strtod(). The long-wished-for specialization of that function should, in theory, give this file a significant speedup. (Done)
Thanks, great explanation. Let me know if you want me to test anything else with this file. I'm working on another dataset in a longer format, about 21 million rows x 1432 columns.
NAME NROW NCOL MB
[1,] DT 21,812,625 1,432 238,310
Adding a data point here. 89G .tsv
Peak memory usage during the load is about 180G. I think that is to be expected, since there are a lot of NAs and doubles.
I'm happy to test things as well.
Ubuntu 16.04 64bit / Linux 4.4.0-71-generic
R version 3.3.2 (2016-10-31)
data.table 1.10.5 IN DEVELOPMENT built 2017-04-04 14:27:46 UTC
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 64
On-line CPU(s) list: 0-63
Thread(s) per core: 2
Core(s) per socket: 16
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 79
Model name: Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
Stepping: 1
CPU MHz: 2699.984
CPU max MHz: 3000.0000
CPU min MHz: 1200.0000
BogoMIPS: 4660.70
Hypervisor vendor: Xen
Virtualization type: full
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 46080K
NUMA node0 CPU(s): 0-15,32-47
NUMA node1 CPU(s): 16-31,48-63
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq monitor est ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm fsgsbase bmi1 hle avx2 smep bmi2 erms invpcid rtm xsaveopt ida
Parameter na.strings == <<NA>>
None of the 1 na.strings are numeric (such as '-9999').
Input contains no \n. Taking this to be a filename to open
File opened, filesize is 88.603947 GB.
Memory mapping ... ok
Detected eol as \n only (no \r afterwards), the UNIX and Mac standard.
Positioned on line 1 starting: <<allele prediction_uuid sample_>>
Detecting sep ...
sep=='\t' with 101 lines of 76 fields using quote rule 0
Detected 76 columns on line 1. This line is either column names or first data row (first 30 chars): <<allele prediction_uuid sample_>>
All the fields on line 1 are character fields. Treating as the column names.
Number of sampling jump points = 101 because 95137762779 bytes from row 1 to eof / (2 * 24414 jump0size) == 1948426
Type codes (jump 000) : 5555542444111145424441111444111111111111111111111111111111111111111111111111 Quote rule 0
Type codes (jump 009) : 5555542444114445424441144444111111111111111111111111111111111111111111111111 Quote rule 0
Type codes (jump 042) : 5555542444444445424444444444111111111111111111111111111111111111111111111111 Quote rule 0
Type codes (jump 048) : 5555544444444445444444444444225225522555545111111111111111111111111111111111 Quote rule 0
Type codes (jump 083) : 5555544444444445444444444444225225522555545254452454411154454452454411154455 Quote rule 0
Type codes (jump 085) : 5555544444444445444444444444225225522555545254452454454454454452454454454455 Quote rule 0
Type codes (jump 100) : 5555544444444445444444444444225225522555545254452454454454454452454454454455 Quote rule 0
=====
Sampled 10028 rows (handled \n inside quoted fields) at 101 jump points including middle and very end
Bytes from first data row on line 2 to the end of last row: 95137762779
Line length: mean=465.06 sd=250.27 min=198 max=929
Estimated nrow: 95137762779 / 465.06 = 204571280
Initial alloc = 409142560 rows (204571280 + 100%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
=====
Type codes (colClasses) : 5555544444444445444444444444225225522555545254452454454454454452454454454455
Type codes (drop|select) : 5555544444444445444444444444225225522555545254452454454454454452454454454455
Allocating 76 column slots (76 - 0 dropped)
Reading 90752 chunks of 1.000MB (2254 rows) using 64 threads
In case it helps, here are results on a very long database: 419,124,196 x 42 (about 2^34), with one header row and colClasses passed.
> library(data.table)
data.table 1.10.5 IN DEVELOPMENT built 2017-09-27 17:12:56 UTC; travis
The fastest way to learn (by data.table authors): https://www.datacamp.com/courses/data-analysis-the-data-table-way
Documentation: ?data.table, example(data.table) and browseVignettes("data.table")
Release notes, videos and slides: http://r-datatable.com
> CC <- c(rep('integer', 2), rep('character', 3),
+ rep('numeric', 2), rep('integer', 3),
+ rep('character', 2), 'integer', 'character', 'integer',
+ rep('character', 4), rep('numeric', 11), 'character',
+ 'numeric', 'character', rep('numeric', 2),
+ rep('integer', 3), rep('numeric', 2), 'integer',
+ 'numeric')
> P <- fread('XXXX.csv', colClasses = CC, header = TRUE, verbose = TRUE)
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
Using 40 threads (omp_get_max_threads()=40, nth=40)
NAstrings = [<<NA>>]
None of the NAstrings look like numbers.
show progress = 1
0/1 column will be read as boolean
[02] Opening the file
 Opening file XXXX.csv
File opened, size = 51.71GB (55521868868 bytes).
Memory mapping ... ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
\r-only line endings are not allowed because \n is found in the data
[05] Skipping initial rows if needed
Positioned on line 1 starting: <<X,X,X,X>>
[06] Detect separator, quoting rule, and ncolumns
Detecting sep ...
sep=',' with 100 lines of 42 fields using quote rule 0
Detected 42 columns on line 1. This line is either column names or first data row. Line starts as: <<X,X,X,X>>
Quote rule picked = 0
fill=false and the most number of columns found is 42
[07] Detect column types, good nrow estimate and whether first row is column names
'header' changed by user from 'auto' to true
Number of sampling jump points = 101 because (55521868866 bytes from row 1 to eof) / (2 * 13006 jump0size) == 2134471
Type codes (jump 000) : 5161010775551055105101111111111111110110771117717 Quote rule 0
Type codes (jump 022) : 5561010775551055105101111111111111110110771117717 Quote rule 0
Type codes (jump 030) : 5561010775551055105101010107517171151110110771117717 Quote rule 0
Type codes (jump 037) : 5561010775551055105101010107517771171110110771117717 Quote rule 0
Type codes (jump 073) : 5561010775551055105101010107517771177110110771117717 Quote rule 0
Type codes (jump 093) : 5561010775551055105101010107717771177110110771117717 Quote rule 0
Type codes (jump 100) : 5561010775551055105101010107717771177110110771117717 Quote rule 0
=====
Sampled 10049 rows (handled \n inside quoted fields) at 101 jump points
Bytes from first data row on line 1 to the end of last row: 55521868866
Line length: mean=132.68 sd=6.00 min=118 max=425
Estimated number of rows: 55521868866 / 132.68 = 418453923
Initial alloc = 460299315 rows (418453923 + 9%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
=====
[08] Assign column names
[09] Apply user overrides on column types
After 11 type and 0 drop user overrides : 551010107755510105105101010107777777777710710775557757
[10] Allocate memory for the datatable
Allocating 42 column slots (42 - 0 dropped) with 460299315 rows
[11] Read the data
jumps=[0..52960), chunk_size=1048373, total_size=55521868441
Read 98%. ETA 00:00
[12] Finalizing the datatable
Read 419124195 rows x 42 columns from 51.71GB (55521868868 bytes) file in 13:42.935 wall clock time
Thread buffers were grown 0 times (if all 40 threads each grew once, this figure would be 40)
Final type counts
0 : drop
0 : bool8
0 : bool8
0 : bool8
0 : bool8
11 : int32
0 : int64
19 : float64
0 : float64
0 : float64
12 : string
=============================
0.000s ( 0%) Memory map 51.709GB file
0.016s ( 0%) sep=',' ncol=42 and header detection
0.016s ( 0%) Column type detection using 10049 sample rows
188.153s ( 23%) Allocation of 419124195 rows x 42 cols (125.177GB)
634.751s ( 77%) Reading 52960 chunks of 1.000MB (7901 rows) using 40 threads
= 0.121s ( 0%) Finding first non-embedded \n after each jump
+ 17.036s ( 2%) Parse to row-major thread buffers
+ 616.184s ( 75%) Transpose
+ 1.410s ( 0%) Waiting
0.000s ( 0%) Rereading 0 columns due to out-of-sample type exceptions
822.935s Total
> memory.size()
[1] 134270.3
> rm(P)
> gc()
used (Mb) gc trigger (Mb) max used (Mb)
Ncells 585532 31.3 5489235 293.2 6461124 345.1
Vcells 1508139082 11506.2 20046000758 152938.9 25028331901 190951.1
> memory.size()
[1] 87.56
> P <- fread('XXXX.csv', colClasses = CC, header = TRUE, verbose = TRUE)
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
Using 40 threads (omp_get_max_threads()=40, nth=40)
NAstrings = [<<NA>>]
None of the NAstrings look like numbers.
show progress = 1
0/1 column will be read as boolean
[02] Opening the file
Opening file XXXX.csv
File opened, size = 51.71GB (55521868868 bytes).
Memory mapping ... ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
\r-only line endings are not allowed because \n is found in the data
[05] Skipping initial rows if needed
Positioned on line 1 starting: <<X,X,X,X>>
[06] Detect separator, quoting rule, and ncolumns
Detecting sep ...
sep=',' with 100 lines of 42 fields using quote rule 0
Detected 42 columns on line 1. This line is either column names or first data row. Line starts as: <<X,X,X,X>>
Quote rule picked = 0
fill=false and the most number of columns found is 42
[07] Detect column types, good nrow estimate and whether first row is column names
'header' changed by user from 'auto' to true
Number of sampling jump points = 101 because (55521868866 bytes from row 1 to eof) / (2 * 13006 jump0size) == 2134471
Type codes (jump 000) : 5161010775551055105101111111111111110110771117717 Quote rule 0
Type codes (jump 022) : 5561010775551055105101111111111111110110771117717 Quote rule 0
Type codes (jump 030) : 5561010775551055105101010107517171151110110771117717 Quote rule 0
Type codes (jump 037) : 5561010775551055105101010107517771171110110771117717 Quote rule 0
Type codes (jump 073) : 5561010775551055105101010107517771177110110771117717 Quote rule 0
Type codes (jump 093) : 5561010775551055105101010107717771177110110771117717 Quote rule 0
Type codes (jump 100) : 5561010775551055105101010107717771177110110771117717 Quote rule 0
=====
Sampled 10049 rows (handled \n inside quoted fields) at 101 jump points
Bytes from first data row on line 1 to the end of last row: 55521868866
Line length: mean=132.68 sd=6.00 min=118 max=425
Estimated number of rows: 55521868866 / 132.68 = 418453923
Initial alloc = 460299315 rows (418453923 + 9%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
=====
[08] Assign column names
[09] Apply user overrides on column types
After 11 type and 0 drop user overrides : 551010107755510105105101010107777777777710710775557757
[10] Allocate memory for the datatable
Allocating 42 column slots (42 - 0 dropped) with 460299315 rows
[11] Read the data
jumps=[0..52960), chunk_size=1048373, total_size=55521868441
Read 98%. ETA 00:00
[12] Finalizing the datatable
Read 419124195 rows x 42 columns from 51.71GB (55521868868 bytes) file in 05:04.910 wall clock time
Thread buffers were grown 0 times (if all 40 threads each grew once, this figure would be 40)
Final type counts
0 : drop
0 : bool8
0 : bool8
0 : bool8
0 : bool8
11 : int32
0 : int64
19 : float64
0 : float64
0 : float64
12 : string
=============================
0.000s ( 0%) Memory map 51.709GB file
0.031s ( 0%) sep=',' ncol=42 and header detection
0.000s ( 0%) Column type detection using 10049 sample rows
28.437s ( 9%) Allocation of 419124195 rows x 42 cols (125.177GB)
276.442s ( 91%) Reading 52960 chunks of 1.000MB (7901 rows) using 40 threads
= 0.017s ( 0%) Finding first non-embedded \n after each jump
+ 12.941s ( 4%) Parse to row-major thread buffers
+ 262.989s ( 86%) Transpose
+ 0.495s ( 0%) Waiting
0.000s ( 0%) Rereading 0 columns due to out-of-sample type exceptions
304.910s Total
> memory.size()
[1] 157049.7
> sessionInfo()
R version 3.4.2 beta (2017-09-17 r73296)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows Server >= 2012 x64 (build 9200)
Matrix products: default
locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] data.table_1.10.5
loaded via a namespace (and not attached):
[1] compiler_3.4.2 tools_3.4.2
A few notes. I would suggest putting [09] before [07], on the grounds that there is no reason to check when colClasses has been passed. Also, Windows showed about 160GB in use after each run; memory.size() probably does some cleaning up. Since this server has 532GB of RAM, memory caching may have something to do with the speedup on the second run. I hope this helps.
Tested on a wide "real world" table (hospital data), 30 million rows x 125 columns, versus readr 1.2.0 and read.csv 3.4.3.
Any chance of confirming whether the issue is still valid on 1.11.4? Or code to generate sample data?
As far as I know, this is all resolved. @geponce, please update if that is not the case.
TODO1 above is now filed as #3024
TODO2 above is now filed as #3025