Data.table: Regression in unique.data.table() as of data.table 1.12.0

Created on 30 Jan 2019  ·  3Comments  ·  Source: Rdatatable/data.table

# Minimal reproducible example

I believe there's been a regression in data.table in version 1.12.0 relative to 1.11.8. TL;DR

I think the implementation of unique.data.table() has changed and that the new implementation doesn't support complex types like lists in columns.

Some users of uptasticsearch have reported receiving an error message from our code that looks like this:

Error in forderv(x, by = by, sort = FALSE, retGrp = TRUE) :
Column 2 of by= (2) is type 'list', not yet supported

After investigating tonight, I found the source of the issue and can reproduce it. I believe the behavior of unique() has changed and I'd consider that change a regression.

On 1.12.0:

someDT <- data.table::data.table(
    col1 = 1:2,
    col2 = list(list(TRUE, FALSE), list(FALSE, TRUE))
)
unique(someDT)

Raises error

Error in forderv(x, by = by, sort = FALSE, retGrp = TRUE) :
Column 2 of by= (2) is type 'list', not yet supported

To downgrade to the previous release, I ran the following from the command line:

Rscript -e "remove.packages('data.table')"
wget http://cran.rstudio.com/src/contrib/Archive/data.table/data.table_1.11.8.tar.gz
R CMD INSTALL data.table_1.11.8.tar.gz

Once I had v 1.11.8 installed, I re-ran the R code above

someDT <- data.table::data.table(
    col1 = 1:2,
    col2 = list(list(TRUE, FALSE), list(FALSE, TRUE))
)
unique(someDT)

Works as expected and returns:

   col1   col2
1:    1 <list>
2:    2 <list>

So I went to the blame for unique.data.table() to see what had . changed. Looked like no substantive changes have been made between 1.11.8 and now, so then I thought to look at the blame for forderv().

I didn't see anything meaningful in the blame for forderv() either.

I decided to try one more thing...searching for the text "not yet supported" (from the error message). That led me to forder.c, whose blame led me to #3124.

As far as I can tell, this PR is the source of the problem above. There is no description on the PR so I'm not sure if this is an unintended side effect or a known regression that will be fixed in a future release of data.table.

# Output of sessionInfo()

R version 3.5.0 (2018-04-23)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS High Sierra 10.13.6

Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats graphics grDevices utils datasets methods base

loaded via a namespace (and not attached):
[1] compiler_3.5.0 tools_3.5.0 yaml_2.2.0 data.table_1.12.0

Most helpful comment

Thanks for the thorough report!

All 3 comments

Thanks for the thorough report!

Thanks @jameslamb! In 1.11.8, you also get the error if the uniqueness is not resolved before the list column :

# with data.table 1.11.8 : 
DT = data.table(
    col1 = c(1,1),
    col2 = list(list(TRUE, FALSE), list(FALSE, TRUE))
)
unique(DT)
Error in forderv(x, by = by, sort = FALSE, retGrp = TRUE) : 
  Column 2 of 'by' (2) is type 'list', not yet supported

In that example I changed col1 to contain duplicates, so that col2 is then needed to decide uniqueness.

v1.12.0+ checks types of the all columns up front before starting the algorithm. Doing this check up front makes things easier internally for parallelism.

To make it work and exclude the list columns, use the by= argument to specify (probably) the first few columns. Since unique on list columns didn't work anyway, I think it's probably better to be forced to be explicit, rather than suddenly be surprised by the error later when dups do occur in the non-list columns.

> unique(DT, by="col1")
    col1   col2
1:     1 <list>
> 

I'll add something to NEWS and a suggestion to the error message ...

@mattdowle ah ok, makes sense! Thanks for taking a look and for clarifying in the error message.

Was this page helpful?
0 / 5 - 0 ratings