Data.table: Should rbindlist(..., fill=TRUE) return NA_logical_ in list columns?

Created on 25 Jan 2020  ·  6 Comments  ·  Source: Rdatatable/data.table

From #4196:

When filling a list column, rbindlist departs from the behaviour of all other column types, and returns NULL elements instead of NA:

> A = data.table(c1=0, c2=list(1:3))
> B = data.table(c1=1)
> rbind(A,B,fill=TRUE)
      c1     c2
   <num> <list>
1:     0  1,2,3
2:     1   

Expected:

> A = data.table(c1=0, c2=list(1:3))
> B = data.table(c1=1)
> rbind(A,B,fill=TRUE)
      c1     c2
   <num> <list>
1:     0  1,2,3
2:     1  NA   

Should we change this behaviour for list columns to fill the rows with NA values to match the behaviour of fill=TRUE for other column types?
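Because the printed table shows nothing for the filled cell, it can help to inspect the element directly. The following is a minimal sketch, assuming data.table is installed and behaves as reported in this issue; the names `ans` and `C` are only illustrative.

```r
library(data.table)

A = data.table(c1 = 0, c2 = list(1:3))
B = data.table(c1 = 1)
ans = rbindlist(list(A, B), fill = TRUE)

# Per this issue, the filled list element is NULL (length 0), not NA (length 1)
is.null(ans$c2[[2L]])
lengths(ans$c2)

# For comparison, an atomic column that is missing from B is filled with NA
C = data.table(c1 = 0, c2 = 1L)
is.na(rbindlist(list(C, B), fill = TRUE)$c2[2L])
```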

non-atomic column question

All 6 comments

A PR with a fix is available if we want to make the change.

Current behaviour seems fine to me.

> str(as.integer(NULL)[1L])
 int NA
> str(as.list(NULL)[1L])
List of 1
 $ : NULL

IMO it should not be NA because:

  • it changes the type from a missing field (undefined) to a logical vector
  • it changes the length from 0 length to length 1
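The two points above can be checked in base R alone. This is a small sketch using plain lists as stand-ins for a list column; `filled_null` and `filled_na` are illustrative names, not anything from data.table.

```r
filled_null = list(1:3, NULL)  # stand-in for the current fill behaviour
filled_na   = list(1:3, NA)    # stand-in for the proposed change

# Length of each element: the fill goes from length 0 to length 1
lengths(filled_null)  # 3 0
lengths(filled_na)    # 3 1

# Type of the filled element: from NULL to a logical vector
typeof(filled_null[[2L]])  # "NULL"
typeof(filled_na[[2L]])    # "logical"
```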

I'm leaning towards Jan's point. The current behaviour, an empty element, is actually a list's way of representing a missing value (there isn't any object to point to). We could construct an example where each item of the list is a logical vector, each being the result of some computation. In such a case, 3 different states might need to be represented: a length 0 logical, a length 1 NA logical, and a missing computation. If a length 1 NA logical were used for missing, the last two couldn't be distinguished.
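The three states described above might look like this in base R. It is only an illustrative sketch; `results` is a hypothetical list of computation outputs.

```r
results = list(
  logical(0),  # the computation ran and returned an empty logical
  NA,          # the computation ran and returned a single missing value
  NULL         # the computation never ran
)

# Element lengths separate the length 1 NA from the other two states
lengths(results)                       # 0 1 0

# is.null() separates the missing computation from the empty logical
vapply(results, is.null, logical(1))   # FALSE FALSE TRUE
```

If NA were used as the fill value, the second and third elements would be identical and the missing computation could no longer be told apart.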

Would changing the print method suffice? Instead of printing nothing, how about printing NULL? Printing NA could again imply a length 1 NA logical, whereas NULL would be unambiguous, consistent with what base R prints for empty list items, and would give a further visual reminder that it is a list column.

I also agree with Jan, in particular about the change from length 0 to length 1.

I'm using lengths(x)>0 a lot to filter rows by empty list columns.
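A sketch of that filtering idiom, assuming data.table is installed and fills list columns with NULL as described in this issue. `DT` and `nonempty` are illustrative names.

```r
library(data.table)

A = data.table(c1 = 0, c2 = list(1:3))
B = data.table(c1 = 1)
DT = rbindlist(list(A, B), fill = TRUE)

# Keep only rows whose list element is non-empty; a NULL fill has length 0
nonempty = DT[lengths(c2) > 0]
nrow(nonempty)
```

With an NA fill, every element would have length of at least 1, so this filter would keep the filled rows as well.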

I guess we could put logical() there instead, but is there any advantage of logical() over NULL?

What if we just instead add a sentence to the documentation noting the behaviour in the case of a missing list column:

Current entry for the fill argument:

TRUE fills missing columns with NAs. By default FALSE. When TRUE, use.names is set to TRUE.

Proposed:

TRUE fills missing columns with NAs, or NULL for missing list columns. By default FALSE. When TRUE, use.names is set to TRUE.

Doc change looks good. Plus the print method change I suggested too?
