Data.table: Should rbindlist(..., fill=TRUE) return NA_logical_ in list columns?

Created on 25 Jan 2020 · 6 Comments · Source: Rdatatable/data.table

From #4196:

When filling a list column, rbindlist departs from the behaviour of all other column types, and returns NULL elements instead of NA:

> A = data.table(c1=0, c2=list(1:3))
> B = data.table(c1=1)
> rbind(A,B,fill=TRUE)
      c1     c2
   <num> <list>
1:     0  1,2,3
2:     1

Expected:

> A = data.table(c1=0, c2=list(1:3))
> B = data.table(c1=1)
> rbind(A,B,fill=TRUE)
      c1     c2
   <num> <list>
1:     0  1,2,3
2:     1  NA

Should we change this behaviour for list columns to fill the rows with NA values to match the behaviour of fill=TRUE for other column types?
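The difference between the two fill behaviours can be checked directly. The following is a minimal sketch, assuming the data.table package is installed; the extra integer column `c3` is an addition for illustration and is not part of the original report.

```r
library(data.table)

# A has a numeric column, a list column and (for illustration) an integer column.
# B only has c1, so c2 and c3 must be filled when binding.
A <- data.table(c1 = 0, c2 = list(1:3), c3 = 5L)
B <- data.table(c1 = 1)

res <- rbind(A, B, fill = TRUE)

# Atomic column: the missing row is filled with NA of the column's type.
is.na(res$c3[2L])       # TRUE

# List column: the missing row is filled with a NULL element, not NA.
is.null(res$c2[[2L]])   # TRUE
```

Note that `[[` is needed to inspect the element itself; `res$c2[2L]` would return a length-1 list wrapping the NULL.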

Labels: non-atomic column, question

sritchie73

Most helpful comment

What if we instead just add a sentence to the documentation noting the behaviour for a missing list column:

Current entry for the fill argument:

TRUE fills missing columns with NAs. By default FALSE. When TRUE, use.names is set to TRUE.

Proposed:

TRUE fills missing columns with NAs, or NULL for missing list columns. By default FALSE. When TRUE, use.names is set to TRUE.

sritchie73 on 18 Feb 2020

👍2

All 6 comments

I've provided a PR with a fix, in case we want to make the change.

sritchie73 on 25 Jan 2020

Current behaviour seems fine to me.

> str(as.integer(NULL)[1L])
 int NA
> str(as.list(NULL)[1L])
List of 1
 $ : NULL

IMO it should not be NA because:

- it changes the type from a missing field (undefined) to a logical vector;
- it changes the length from 0 to 1.
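Both objections can be verified in base R. This is a small sketch only; the variable names `filled_null` and `filled_na` are illustrative.

```r
# What a list yields when indexed past its end: a NULL element.
filled_null <- as.list(NULL)[1L][[1L]]
# The proposed replacement filler.
filled_na <- NA

is.null(filled_null)   # TRUE: no object at all
length(filled_null)    # 0

typeof(filled_na)      # "logical": the filler now has a concrete type
length(filled_na)      # 1
```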

jangorecki on 25 Jan 2020

👍1

I'm leaning towards Jan's point. The current behavior, an empty (NULL) element, is actually a list's way of representing missing: there isn't any object to point to. We could construct an example where each item of the list is a logical vector resulting from some computation. In that case, three different states might need to be represented: a length-0 logical, a length-1 NA logical, and a missing computation. If a length-1 NA logical were used for missing, the last two couldn't be distinguished.
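The three-state example can be sketched in base R. The list `results` and its element names are hypothetical, chosen only to mirror the three states described above.

```r
# Each element stands for the result of some computation returning a logical vector.
results <- list(
  empty   = logical(0),  # computation ran and returned nothing
  unknown = NA,          # computation ran and returned a single NA
  missing = NULL         # computation never ran
)

# With NULL as the filler, the missing computation is identifiable.
vapply(results, is.null, logical(1))   # FALSE FALSE TRUE

# With NA as the filler, "unknown" and "missing" become indistinguishable.
results_na <- results
results_na[["missing"]] <- NA
identical(results_na$unknown, results_na$missing)   # TRUE
```

Element lengths alone would not separate the states either, since `logical(0)` and `NULL` both have length 0; `is.null()` is what tells them apart.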

Would changing the print method suffice? Instead of printing nothing, how about printing NULL? Printing NA could again imply a length-1 NA logical, whereas NULL would be unambiguous, consistent with what base R prints for empty list items, and a further visual reminder that it is a list column.
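For reference, base R already prints NULL for an empty list item, and a cell formatter along the suggested lines is easy to sketch. Note that `format_cell` below is a hypothetical helper written for illustration; it is not data.table's print method, which may format list cells differently.

```r
x <- list(1:3, NULL)

# Base R prints the empty item explicitly as NULL.
out <- capture.output(print(x))
"NULL" %in% out   # TRUE

# Hypothetical formatter: show NULL for empty items, comma-separated values otherwise.
format_cell <- function(el) if (is.null(el)) "NULL" else paste(el, collapse = ",")
vapply(x, format_cell, character(1))   # "1,2,3" "NULL"
```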