#
Minimal reproducible example
I believe there's been a regression in data.table
in version 1.12.0 relative to 1.11.8. TL;DR
I think the implementation of unique.data.table()
has changed and that the new implementation doesn't support complex types like lists in columns.
Some users of uptasticsearch have reported receiving an error message from our code that looks like this:
Error in forderv(x, by = by, sort = FALSE, retGrp = TRUE) :
Column 2 of by= (2) is type 'list', not yet supported
After investigating tonight, I found the source of the issue and can reproduce it. I believe the behavior of unique()
has changed and I'd consider that change a regression.
On 1.12.0:
someDT <- data.table::data.table(
col1 = 1:2,
col2 = list(list(TRUE, FALSE), list(FALSE, TRUE))
)
unique(someDT)
Raises error
Error in forderv(x, by = by, sort = FALSE, retGrp = TRUE) :
Column 2 of by= (2) is type 'list', not yet supported
To downgrade to the previous release, I ran the following from the command line:
Rscript -e "remove.packages('data.table')"
wget http://cran.rstudio.com/src/contrib/Archive/data.table/data.table_1.11.8.tar.gz
R CMD INSTALL data.table_1.11.8.tar.gz
Once I had v 1.11.8 installed, I re-ran the R code above
someDT <- data.table::data.table(
col1 = 1:2,
col2 = list(list(TRUE, FALSE), list(FALSE, TRUE))
)
unique(someDT)
Works as expected and returns:
col1 col2
1: 1 <list>
2: 2 <list>
So I went to the blame for unique.data.table() to see what had . changed. Looked like no substantive changes have been made between 1.11.8 and now, so then I thought to look at the blame for forderv()
.
I didn't see anything meaningful in the blame for forderv() either.
I decided to try one more thing...searching for the text "not yet supported" (from the error message). That led me to forder.c
, whose blame led me to #3124.
As far as I can tell, this PR is the source of the problem above. There is no description on the PR so I'm not sure if this is an unintended side effect or a known regression that will be fixed in a future release of data.table
.
#
Output of sessionInfo()
R version 3.5.0 (2018-04-23)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS High Sierra 10.13.6
Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached):
[1] compiler_3.5.0 tools_3.5.0 yaml_2.2.0 data.table_1.12.0
Thanks for the thorough report!
Thanks @jameslamb! In 1.11.8, you also get the error if the uniqueness is not resolved before the list
column :
# with data.table 1.11.8 :
DT = data.table(
col1 = c(1,1),
col2 = list(list(TRUE, FALSE), list(FALSE, TRUE))
)
unique(DT)
Error in forderv(x, by = by, sort = FALSE, retGrp = TRUE) :
Column 2 of 'by' (2) is type 'list', not yet supported
In that example I changed col1
to contain duplicates, so that col2
is then needed to decide uniqueness.
v1.12.0+ checks types of the all columns up front before starting the algorithm. Doing this check up front makes things easier internally for parallelism.
To make it work and exclude the list
columns, use the by=
argument to specify (probably) the first few columns. Since unique on list
columns didn't work anyway, I think it's probably better to be forced to be explicit, rather than suddenly be surprised by the error later when dups do occur in the non-list
columns.
> unique(DT, by="col1")
col1 col2
1: 1 <list>
>
I'll add something to NEWS and a suggestion to the error message ...
@mattdowle ah ok, makes sense! Thanks for taking a look and for clarifying in the error message.
Most helpful comment
Thanks for the thorough report!