Data.table: [bug] %in% statement fails if the category contains both lowercase and uppercase letters

Created on 15 May 2018  ·  6 Comments  ·  Source: Rdatatable/data.table

In version 1.11.2, when %in% and & are used together in a subset, %in% fails to match a factor level that starts with a capital letter. Here is an example:

install.packages('data.table')
packageVersion("data.table")   # ‘1.11.2’
data("iris")
library(data.table)
iris <- data.table(iris)
iris$grp <- c('A', 'B')

[Issue]
After capitalizing the first letter of 'virginica', the %in% condition no longer returns both groups when combined with a & condition: 'setosa' is dropped and only 'Virginica' is returned, see below:

iris[, Species1 := factor(Species, levels = c('setosa', 'versicolor', 'virginica'), labels = c('setosa', 'versicolor', 'Virginica'))]

iris[Species1 %in% c('setosa', 'Virginica') & grp == 'B', table(Species1)]
# Species1
#     setosa versicolor  Virginica
#          0          0         25

[Examples]
I tried a few variations, shown below, and they work fine.
If I subset on levels that are all lowercase, both groups are found.

iris[Species1 %in% c('setosa', 'versicolor') & grp == 'B', table(Species1)]
# Species1
#     setosa versicolor  Virginica
#         25         25          0

Or, if I wrap either condition in parentheses, both groups are found.

iris[(Species1 %in% c('setosa', 'Virginica')) & grp == 'B', table(Species1)]
# Species1
#     setosa versicolor  Virginica
#         25          0         25
iris[Species1 %in% c('setosa', 'Virginica') & (grp == 'B'), table(Species1)]
# Species1
#     setosa versicolor  Virginica
#         25          0         25

I tried the same condition in the subset function and it works.

table(subset(iris, Species1 %in% c('setosa', 'Virginica') & grp == 'B')$Species1)
#     setosa versicolor  Virginica
#         25          0         25
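The comparison against subset can be turned into a small self-check. The following is only a sketch: it rebuilds the example data under a new name (dt, chosen here for illustration), evaluates the filter as a plain logical vector outside of the data.table call, and compares the counts with the inline form. On a version affected by this report the two would be expected to differ.

```r
library(data.table)

# Rebuild the example data from the report (dt is an illustrative name).
dt <- data.table(iris)
dt[, grp := rep(c('A', 'B'), length.out = .N)]
dt[, Species1 := factor(Species,
                        levels = c('setosa', 'versicolor', 'virginica'),
                        labels = c('setosa', 'versicolor', 'Virginica'))]

# Evaluate the filter with base R, outside of [.data.table.
keep <- dt$Species1 %in% c('setosa', 'Virginica') & dt$grp == 'B'

# Counts from the precomputed logical vector versus the inline expression.
expected <- dt[keep, table(Species1)]
observed <- dt[Species1 %in% c('setosa', 'Virginica') & grp == 'B', table(Species1)]

# TRUE when both forms agree.
print(identical(expected, observed))
```

Passing a precomputed logical vector also serves as a workaround, since the filter is then evaluated by base R before data.table sees it.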

The same code works in an older version of the data.table package (version 1.10.4-3 is used as an example here):

devtools::install_version("data.table", version = "1.10.4-3", repos = "http://cran.us.r-project.org")

packageVersion("data.table")   # ‘1.10.4.3’
data("iris")
library(data.table)
iris <- data.table(iris)
iris$grp <- c('A', 'B')

iris[, Species1 := factor(Species, levels = c('setosa', 'versicolor', 'virginica'), labels = c('setosa', 'versicolor', 'Virginica'))]

iris[Species1 %in% c('setosa', 'Virginica') & grp == 'B', table(Species1)]
# Species1
#     setosa versicolor  Virginica
#         25          0         25

[session info]

> sessionInfo()
R version 3.4.4 (2018-03-15)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] data.table_1.11.2

loaded via a namespace (and not attached):
[1] compiler_3.4.4 tools_3.4.4    yaml_2.1.18   
Labels: bug, regression

All 6 comments

@MarkusBonsch care to take a look? seems odd

Using verbose = TRUE

Optimized subsetting with index 'grp__Species1'
on= matches existing index, using index
Coercing character column i.'Species1' to factor to match type of x.'Species1'. If possible please change x.'Species1' to character. Character columns are now preferred in joins.

I suspect this should be a message at least, possibly a warning.
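For anyone wanting to reproduce the diagnostic output, here is a minimal sketch of the call. It assumes the example data from the report, rebuilt under the illustrative name dt. verbose is an argument of the data.table subset call; the exact messages depend on the installed version and may not match the lines quoted above.

```r
library(data.table)

# Rebuild the example data from the report (dt is an illustrative name).
dt <- data.table(iris)
dt[, grp := rep(c('A', 'B'), length.out = .N)]
dt[, Species1 := factor(Species,
                        levels = c('setosa', 'versicolor', 'virginica'),
                        labels = c('setosa', 'versicolor', 'Virginica'))]

# verbose = TRUE asks data.table to report what it does internally,
# for example whether an index is used and whether types are coerced.
res <- dt[Species1 %in% c('setosa', 'Virginica') & grp == 'B',
          table(Species1),
          verbose = TRUE]
print(res)
```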

@ddong63 Using %in% for mixed character and factor is definitely something to avoid; coerce to the proper data type before matching.
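One way to follow this advice, sketched under the assumption that a character copy of the column is acceptable, is to convert the factor before filtering. The names dt and Species_chr are introduced here for illustration only.

```r
library(data.table)

# Rebuild the example data from the report (dt is an illustrative name).
dt <- data.table(iris)
dt[, grp := rep(c('A', 'B'), length.out = .N)]
dt[, Species1 := factor(Species,
                        levels = c('setosa', 'versicolor', 'virginica'),
                        labels = c('setosa', 'versicolor', 'Virginica'))]

# Character copy of the factor, so both sides of %in% are character.
dt[, Species_chr := as.character(Species1)]

# Count matching rows per species.
res <- dt[Species_chr %in% c('setosa', 'Virginica') & grp == 'B',
          .N,
          by = Species_chr]
print(res)
```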

@HughParsonage it will be soon hopefully, there is https://github.com/Rdatatable/data.table/pull/2734 pending.

Very very strange. I will investigate and fix ASAP. Thanks for the report.

@jangorecki was right. When both columns have the same data type, either character or factor, it works fine.
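The factor-on-both-sides variant might look like the sketch below. It assumes the lookup values are built as a factor with the same levels as the column; dt and wanted are illustrative names, not part of the original report.

```r
library(data.table)

# Rebuild the example data from the report (dt is an illustrative name).
dt <- data.table(iris)
dt[, grp := rep(c('A', 'B'), length.out = .N)]
dt[, Species1 := factor(Species,
                        levels = c('setosa', 'versicolor', 'virginica'),
                        labels = c('setosa', 'versicolor', 'Virginica'))]

# Lookup values as a factor sharing the column's levels.
wanted <- factor(c('setosa', 'Virginica'), levels = levels(dt$Species1))

# Both sides of %in% are now factors.
res <- dt[Species1 %in% wanted & grp == 'B', .N, by = Species1]
print(res)
```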
Very much appreciate your attention @MarkusBonsch

I have created a PR that (hopefully) fixes the issue. It is a regression that was introduced by one of my own PRs.
