Data.table: join operation almost 2 times slower

Created on 2 Oct 2019  ·  29 Comments  ·  Source: Rdatatable/data.table

Hi, I was testing
data.table_1.10.4-3 + R version 3.4.0 (2017-04-21)
vs
data.table_1.12.2 + R version 3.6.1 (2019-07-05)

and noticed that the join operation is almost 2 times slower in the new version of data.table (or R?).
I think it mostly depends on the version of data.table.

rm(list=ls())

library(data.table)
library(tictoc)


aa = data.table(a = seq(1,100), b = rep(0, 100))
bb = data.table(a = seq(1,100), b = rep(1, 100))

#aa[,  a1 := as.character(a)]
#bb[,  a1 := as.character(a)]

setindex(aa, a)
setindex(bb, a)

tic()
for(i in c(1:1000)) {
 # aa[bb, b := i.b, on=.(a, a1)] # test1
  aa[bb, b := i.b, on=.(a)] # test2
}
toc()

# test1
# 3.6.1: 5.5 sec with index
# 3.6.1: 6.87 sec without index
# 3.4.0: 3.02 sec (no index)

# test2
# 3.6.1: 3.82 sec with index
# 3.6.1: 4.44 sec without index
# 3.4.0: 2.48 sec (no index)
Labels: High · joins · performance regression

All 29 comments

Have run the following:

# run_join_time.sh
echo 'hash,time' > 'join_timings.csv'

for HASH in $(git rev-list --ancestry-path 1.10.0..1.12.4 | awk 'NR == 1 || NR % 10 == 0');
do git checkout $HASH && R CMD INSTALL . && Rscript time_join.R $HASH;
done
# time_join.R
library(data.table)
library(microbenchmark)

hash = commandArgs(trailingOnly = TRUE)

aa = data.table(a = seq(1,100), b = rep(0, 100))
bb = data.table(a = seq(1,100), b = rep(1, 100))

fwrite(data.table(
  hash = hash,
  time = microbenchmark(times = 1000L, aa[bb, b := i.b, on=.(a)])$time
), 'join_timings.csv', append = TRUE)

Then I did the following to create this plot:

[Plot: boxplots of join execution time (ms) by commit date, 2016-2019, with release tags marked in red]

library(data.table)
times = fread('join_timings.csv')

times[ , date := system(sprintf("git show --no-patch --no-notes --pretty='%%cd' %s", .BY$hash), intern = TRUE), by = hash]
times[ , date := as.POSIXct(date, format = '%a %b %d %T %Y %z', tz = 'UTC')]
times[ , date := as.IDate(date)]
setorder(times, date)

tags = system('
for TAG in $(git tag);
  do git show --no-patch --no-notes --pretty="$TAG\t%cd" $TAG;
done
', intern = TRUE)

tags = setDT(tstrsplit(tags, '\t'))
setnames(tags, c('tag', 'date'))
tags[ , date := as.POSIXct(date, format = '%a %b %d %T %Y %z', tz = 'UTC')]
tags[ , date := as.IDate(date)]

times[ , {
  par(mar = c(7.1, 4.1, 2.1, 2.1))
  y = boxplot(I(time/1e6) ~ date, log = 'y', notch = TRUE, pch = '.',
              las = 2L, xlab = '', ylab = 'Execution time (ms)', cex.axis = .8)
  x = unique(date)
  idx = findInterval(tags$date, x)
  abline(v = idx[idx>0L], lty = 2L, col = 'red')
  text(idx[idx>0L], .7*10^(par('usr')[4L]), tags$tag[idx>0L], 
       col = 'red', srt = 90)
  NULL
}]


After starting I realized I could streamline this a bit by using

echo 'hash,time,date' > 'join_timings.csv'

for HASH in $(git rev-list --ancestry-path 1.10.0..1.12.4 | awk 'NR == 1 || NR % 10 == 0');
do git checkout $HASH && R CMD INSTALL . --preclean && Rscript time_join.R $HASH $(git show --no-patch --no-notes --pretty='%cd' $HASH);
done
library(data.table)
library(microbenchmark)

args = commandArgs(trailingOnly = TRUE)
hash = args[1L]
# git's '%cd' date arrives space-split across several arguments; reassemble before parsing
date = as.POSIXct(paste(args[-1L], collapse = ' '), format = '%a %b %d %T %Y %z', tz = 'UTC')

aa = data.table(a = seq(1,100), b = rep(0, 100))
bb = data.table(a = seq(1,100), b = rep(1, 100))

fwrite(data.table(
  hash = hash,
  date = date,
  time = microbenchmark(times = 1000L, aa[bb, b := i.b, on=.(a)])$time
), 'join_timings.csv', append = TRUE)

to add the commit time directly to the benchmark

join_timings.txt

Hi MichaelChirico, I have read it twice,
but what exactly does the timeline from 2016 to 2019 mean? Is it the version of data.table?

Not just versions but specific commits. Let me try to add some demarcation of versions; a good idea I forgot to include.

@MichaelChirico your graph looks super cool; what was missing was an interpretation of it :) Guessing, I would say that the new parallel forder is slower for an input of length 100 than it was before. Which IMO is fine, because these are still microseconds. It is of course worth reducing the overhead, but the priority should be reducing resource use in the most common usage scenarios. Looping 1000 times over a length-100 input is definitely not one of those.

That was my guess too after looking at the 1.12.0 NEWS. Will try re-running single-threaded.
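
For reference, a minimal single-threaded rerun of the original repro (setDTthreads(1L) is the standard knob; this assumes the same aa/bb setup as above):

library(data.table)
library(tictoc)

setDTthreads(1L) # force single-threaded to isolate parallel-forder overhead

aa = data.table(a = seq(1,100), b = rep(0, 100))
bb = data.table(a = seq(1,100), b = rep(1, 100))

tic()
for(i in c(1:1000)) {
  aa[bb, b := i.b, on=.(a)]
}
toc()

setDTthreads(0L) # restore the default (all logical CPUs)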

@jangorecki, if you do the simple math it's thousands of microseconds, i.e. milliseconds.
The inspiration for this test was my current project, where I join tables with 40 000 rows 7-8 times.
Before it took 18 seconds; after, it takes 40 seconds.
My test also includes a character key.
I installed 3.6 + 1.12 and then rolled back to 3.4 + 1.10.
I think this is not a normal situation, which is why this issue is here.

Yes the graph is really awesome; we should do more of this and automate it.

But the original report compares R 3.4 to R 3.6, which has R 3.5 in the middle. We know that R 3.5 impacted data.table via ALTREP's change of REAL(), INTEGER() from macros to (slightly heavier) functions.

@vk111 Can you please rerun your tests with v1.12.4, now on CRAN. Any better? We have been taking R API calls outside loops to cope with R 3.5+. As a next step, it would be great to know how you find v1.12.4. Can you also include more information about your machine; e.g. operating system, RAM and number of CPUs.
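
(For reporting those machine details from within R, something like the following should do; sessionInfo() and parallel::detectCores() are base/parallel calls, and recent data.table versions let getDTthreads(verbose = TRUE) print the OpenMP details.)

library(data.table)
sessionInfo()                # R version, OS, attached package versions
parallel::detectCores()      # logical CPUs visible to R
getDTthreads(verbose = TRUE) # threads data.table will actually use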

@matt, will do

Hi @mattdowle,

I've updated to 1.12.4 (+ R 3.6.1); basically the numbers are the same for the unit test and my function.
As a benchmark I will consider here only the joins on character columns without indexing.
Unit test (as mentioned in the first post):
3.6.1: 6.87 sec without index
3.4.0: 3.02 sec (no index)

My function from the project remains the same: 20 vs 40 seconds.
It consists of only ~14 data.table joins between
two tables of 30 000 rows (dt1) and 366 000 rows (dt2),
15 columns each;
the key is 2 character columns and 1 numeric,
and 7 fields are assigned (from dt1 to dt2) during the join.

my PC:
Win 10 64 bit
AMD A8 2.2 Ghz
8 Gb RAM

cmd: wmic cpu get NumberOfCores, NumberOfLogicalProcessors/Format:List
NumberOfCores=4
NumberOfLogicalProcessors=4

Thank you for your analysis

Hi @mattdowle, so are you considering this?

According to a git bisect I just re-ran, this commit is "guilty" of the slow-down; hopefully that helps Matt pin down the root cause. That it's related to forder makes me relatively confident we've pinned it down correctly.

https://github.com/Rdatatable/data.table/commit/e59ba149f21bb35098ed0b527505d5ed6c5cb0db

More precisely:

# run_join_time.sh
R CMD INSTALL . && Rscript time_join.R
# time_join.R
library(data.table)
library(microbenchmark)

aa = data.table(a = seq(1,100), b = rep(0, 100))
bb = data.table(a = seq(1,100), b = rep(1, 100))

summary(microbenchmark(times = 1000L, aa[bb, b := i.b, on=.(a)])$time/1e6)

With these two scripts, from master/HEAD:

git checkout 1.12.0
git bisect start
git bisect bad
git bisect good 1.11.8

Then at each iteration, run sh run_join_time.sh and check the result. The bad commits were taking roughly 0.9 seconds per iteration on my machine, the good commits about 0.7.

Hi Michael, thank you, but performance has dropped dramatically, especially
for a character key; I provided the unit test in the very first post.
Anyway, I think we need to wait for confirmation from Matt on whether it's
fixed in the new version or not.


@vk111 if this issue is not closed, then you can safely assume the problem is not yet fixed. Whenever we fix something, we close the related issue.

@vk111 you may be able to mitigate this with some minor tweaks, using setkey instead of setindex. This cuts your test by more than half on my computer:

aa = data.table(a = seq(1,100), b = rep(0, 100))
bb = data.table(a = seq(1,100), b = rep(1, 100))

aa[,  a1 := as.character(a)]
bb[,  a1 := as.character(a)]

tic()
for(i in c(1:1000)) {
  aa[bb, b := i.b, on=.(a, a1)] # test1 without key
}
toc()
## 4 sec elapsed

tic()
setkey(aa, a, a1)
setkey(bb, a, a1)

for(i in c(1:1000)) {
  aa[bb, b := i.b] # test2 with key
}
toc()
## 1.46 sec elapsed

[Plot: microbenchmark boxplot comparing the five join variants benchmarked below]

library(microbenchmark)
library(data.table)
library(tictoc)

aa = data.table(a = seq(1,100), b = rep(0, 100))
bb = data.table(a = seq(1,100), b = rep(1, 100))

aa[,  a1 := as.character(a)]
bb[,  a1 := as.character(a)]

aa_nokey = copy(aa)
bb_nokey = copy(bb)

tic()
aa_int_key = setkey(copy(aa), a)
bb_int_key = setkey(copy(bb), a)
toc()

tic()
aa_int_char = setkey(copy(aa), a, a1)
bb_int_char = setkey(copy(bb), a, a1)
toc()
int_nokey = function(dt1, dt2){
  dt1[dt2, b := i.b, on=.(a)]
}

int_key = function(dt1, dt2){
  dt1[dt2, b := i.b]
}

int_char_nokey = function(dt1, dt2){
  dt1[dt2, b := i.b, on=.(a, a1)]
}

int_char_key = function(dt1, dt2){
  dt1[dt2, b := i.b]
}

int_char_key_on = function(dt1, dt2){
  dt1[dt2, b := i.b, on=.(a, a1)]
}

boxplot(
  microbenchmark(
    a_no_key = int_nokey(aa, bb),
    a_key = int_key(aa_int_key, bb_int_key),
    a_a1_no_key = int_char_nokey(aa, bb),
    a_a1_key = int_char_key(aa_int_char, bb_int_char),
    a_a1_key_on = int_char_key_on(aa_int_char, bb_int_char)
  )
)

Hi Cole, yes, I know about setkey. I gave a short description of my
project above: there is no purpose to 1000 joins; it is more like 10-20
joins. The thing is that my character strings are very dynamic, so I
need to call setkey before every join, roughly 20 setkeys and 20
joins, and therefore the gain in time is less than 2x.
I am living with 3.4.0 for a while as a general workaround for this
situation, and it is probably not a big deal to fix it in the new version,
especially now that a guilty commit has been found.
By the way, comparing apples to apples, I think that 3.4 (1.10) would
still be faster than 3.6 (1.12) if I used the same unit test with setkey for
both versions.
Happy holidays
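
(For context on the dynamic-key point above: assigning to a key column by reference drops the key, so setkey() must re-sort before every join and its cost counts against the keyed approach. A minimal illustration:)

library(data.table)

aa = data.table(a = 1:100, b = 0)
aa[, a1 := as.character(a)]

setkey(aa, a, a1)
key(aa)               # "a" "a1"

aa[1L, a1 := "id999"] # updating a key column by reference...
key(aa)               # ...drops the key (NULL); the next keyed join needs a fresh setkey()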


There is a pending PR, #4370, which is related. It can make a join with less overhead than [ normally imposes. It reduces the timing of your example by a factor of 2. The bigger the data, the smaller the improvement; on the other hand, the more iterations, the bigger the improvement. Note that there is no way to update a column during the join. You basically have to add columns during the join and then use something like fcoalesce, but that should be performant.
It should get some extra speed-up once #4386 is also merged.
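
A sketch of the add-then-coalesce pattern described above, using a plain [ join (fcoalesce has shipped with data.table since 1.12.4; mergelist itself exists only in the PR branch):

library(data.table)

aa = data.table(a = 1:100, b = 0)  # b to be updated
bb = data.table(a = 1:100, b2 = 1) # incoming values

res = bb[aa, on = "a"]             # join adds bb's column alongside aa's
res[, b := fcoalesce(b2, b)]       # prefer the joined value; keep old b where unmatched
res[, b2 := NULL]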

library(data.table)
aa = data.table(a = seq(1,100), b = rep(0, 100))
bb = data.table(a = seq(1,100), b = rep(1, 100))
setindex(aa, a)
setindex(bb, a)
p = proc.time()[[3L]]
for (i in c(1:1000)) {
  aa[bb, b := i.b, on=.(a)]
}
proc.time()[[3L]]-p
#[1] 1.1

mergelist #4370

library(data.table)
aa = data.table(a = seq(1,100), b = rep(0, 100))
bb = data.table(a = seq(1,100), b = rep(1, 100))
setindex(aa, a)
setindex(bb, a)
p = proc.time()[[3L]]
for (i in c(1:1000)) {
  mergelist(list(aa, bb), on="a")
}
proc.time()[[3L]]-p
#[1] 0.696

Hi Jan, OK, thanks.

Regarding "the bigger the data, the smaller the improvement, but on the other
hand, the more iterations, the bigger the improvement": the real issue is
related to big data and a few iterations, as I wrote above.


I checked whether #4440 (merged yesterday) resolves this issue, but it looks like it doesn't. The timings I am getting here are quite different from the timings reported by @vk111, so if you are able to test the recent development version and see whether it makes any difference for your workflow, that would be helpful.
As you are on Windows, there is another pending PR that could eventually help: #4558.


This is the common script I run in each environment:

library(data.table)
setDTthreads(2) ## vk111 has 4 cores
aa = data.table(a = seq(1,100), b = rep(0, 100))[,  a1 := as.character(a)]
bb = data.table(a = seq(1,100), b = rep(1, 100))[,  a1 := as.character(a)]
setindex(aa, a); setindex(bb, a)
cat("test1\n")
system.time({
for(i in c(1:1000)) {
  aa[bb, b := i.b, on=.(a, a1)] # test1
}})
cat("test2\n")
system.time({
for(i in c(1:1000)) {
  aa[bb, b := i.b, on=.(a)] # test2
}})

And these are the environments:

docker run -it --rm r-base:3.4.0
install.packages("https://cran.r-project.org/src/contrib/Archive/data.table/data.table_1.10.4-3.tar.gz", repos=NULL)
test1
   user  system elapsed 
  0.949   0.007   0.957 
test2
   user  system elapsed 
  0.864   0.000   0.864 
docker run -it --rm r-base:3.6.1
install.packages("https://cran.r-project.org/src/contrib/Archive/data.table/data.table_1.12.2.tar.gz", repos=NULL)
test1
   user  system elapsed 
  2.598   0.047   1.350 
test2
   user  system elapsed 
  2.092   0.016   1.062 






docker run -it --rm r-base:3.6.1
install.packages("data.table", repos="https://Rdatatable.gitlab.io/data.table")
test1
   user  system elapsed 
  1.174   0.045   1.219 
test2
   user  system elapsed 
  0.981   0.024   1.004 

Hi, I am on Windows


@vk111 Yes, exactly what I was asking about. If you could check those 3 configurations on your Windows machine and see whether there is any progress for you.

But as I understand it, the PR for Windows is not merged yet.


#4440 is already merged and helps on every OS. #4558 is still pending merge.

@vk111 #4558 has now been merged as well.

Hi Jan, I took your advice recently (#4419,
https://github.com/Rdatatable/data.table/issues/4419, related to my issue),
so I no longer have this exact piece of code with the character joins.
But anyway, thank you for this improvement; I am sure it will help a lot of
people in the future. If you believe it gives better performance then I
guess you should just merge it.
Thank you


@vk111 the mentioned PRs are already merged. The thing now is to know whether this issue got resolved, or at least improved. Those merged PRs were not attempting to resolve this issue, but they are related, so they might have helped.

It happens that the initially reported minimal example exposed the issue poorly. Based on @vk111's comment https://github.com/Rdatatable/data.table/issues/3928#issuecomment-538681316 I constructed an example which clearly shows there is an issue here.

install.packages("https://cran.r-project.org/src/contrib/Archive/data.table/data.table_1.10.4-3.tar.gz", repos=NULL)
install.packages("https://cran.r-project.org/src/contrib/Archive/data.table/data.table_1.12.2.tar.gz", repos=NULL)
install.packages("data.table", repos="https://Rdatatable.gitlab.io/data.table")
library(data.table)
setDTthreads(2L)
set.seed(108)
n2 = 1:366000
n1 = sample(n2, 30000)
d1=data.table(id1=paste0("id",sample(n1)), id2=paste0("id",sample(n1)), v1=n1)
d2=data.table(id1=paste0("id",sample(n2)), id2=paste0("id",sample(n2)), v2=n2)
system.time(d2[d1, v1 := i.v1, on=c("id1","id2")])






# 3.4.0: 1.10.4-3
   user  system elapsed 
  0.144   0.000   0.144 
# 3.4.0: 1.12.2
   user  system elapsed 
  0.611   0.003   0.451 
# 3.4.0: 1.12.9
   user  system elapsed 
  0.611   0.000   0.452 






# 4.0.0: 1.10.4-3
   user  system elapsed 
  0.203   0.008   0.211 
# 4.0.0: 1.12.9
   user  system elapsed 
  0.812   0.004   0.644 

Verbose output shows that most of the time is spent in "Calculated ad hoc index in 0.611s elapsed (0.769s cpu)", which is the call to forderv from bmerge.
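
(That diagnostic can be reproduced with the standard verbose argument, using d1/d2 as constructed above:)

d2[d1, v1 := i.v1, on=c("id1","id2"), verbose=TRUE]
# ...
# Calculated ad hoc index in 0.611s elapsed (0.769s cpu)
# ...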

Yes, exactly.
But it seems there is no improvement in these tests?


No improvement; this is still an issue.
After some digging I found the bottleneck is this call:
https://github.com/Rdatatable/data.table/blob/ba32f3cba38ec270587e395f6e6c26a80be36be6/src/forder.c#L282
Going further, this loop is responsible for 60%:
https://github.com/Rdatatable/data.table/blob/ba32f3cba38ec270587e395f6e6c26a80be36be6/src/forder.c#L242-L245
and these loops for 30%:
https://github.com/Rdatatable/data.table/blob/ba32f3cba38ec270587e395f6e6c26a80be36be6/src/forder.c#L252-L258

I tried checking whether using R internals (CHAR() as a macro rather than a function) would help here, but it does not help.


An even more minimal example, timing forderv only, on R 4.0 only:

install.packages("https://cran.r-project.org/src/contrib/Archive/data.table/data.table_1.10.4-3.tar.gz", repos=NULL)
library(data.table)
setDTthreads(1L)
set.seed(108)
n2 = 1:366000
d2=data.table(id1=paste0("id",sample(n2)), id2=paste0("id",sample(n2)), v2=n2)
system.time(data.table:::forderv(d2, c("id1","id2")))
install.packages("data.table", repos="https://Rdatatable.gitlab.io/data.table")
library(data.table)
setDTthreads(1L)
set.seed(108)
n2 = 1:366000
d2=data.table(id1=paste0("id",sample(n2)), id2=paste0("id",sample(n2)), v2=n2)
system.time(data.table:::forderv(d2, c("id1","id2")))






# 4.0.2: 1.10.4-3
   user  system elapsed 
  0.198   0.000   0.198 
# 4.0.2: 1.12.9
   user  system elapsed 
  0.393   0.000   0.392 

I tried checking whether using R internals (CHAR() as a macro rather than a function) would help here, but it does not help.

That's because R doesn't do that anymore. With USE_RINTERNALS, INTEGER, REAL and I assume CHAR too, are still inline functions. In other words, the benefit (i.e. macro instead of function) of USE_RINTERNALS was removed in recent versions of R to permit ALTREP. I don't believe it is really whether it is macro or function or inline function, per se, that makes the difference. It's what is being done by R inside that unit of work. In the old days, USE_RINTERNALS made it a macro and it was a direct memory lookup (an array dereference just taking account of the vector's header) which is why it was faster. It was a very lightweight unit of work. These days, R has at least one branch to check for ALTREP and, even with branch prediction, that extra work adds up and/or prevents prefetch. In other words, R now has a little bit more code inside these inline functions. I've no benchmark evidence, just theory based on looking at it.

The root cause might be the removal of the following branch in cradix_r:
https://github.com/Rdatatable/data.table/blob/76bb569fd7736b5f89471a35357e6a971ae1d424/src/forder.c#L780-L783
Now there is no short-circuit for an input of length 2.
It is not obvious how to reinstate that branch, because the function has changed a lot in terms of how it operates.
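
If that is the root cause, it should be visible on tiny character inputs as well; a quick probe of the hypothesis, reusing the non-exported forderv as in the timings above (illustrative only):

library(data.table)
library(microbenchmark)
setDTthreads(1L)

# a length-2 character column: small enough that cradix_r previously short-circuited
d = data.table(id = c("b", "a"))
microbenchmark(times = 1000L, data.table:::forderv(d, "id"))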
