I found a strange behavior in data.table. I would like to know if there is a way to avoid it, or a workaround.
In my data management, I often use lapply with .SD to assign new values to columns. To assign several columns correctly, the order of the columns in the lapply output must be preserved.
I found a situation where that is not the case.
Here is the normal behavior:
library(data.table)
plouf <- data.table(x = 1, y = 2, z = 3)
cols <- c("y","x")
plouf[,.SD,.SDcols = cols ,by = z]
plouf[,lapply(.SD,function(x){x}),.SDcols = cols ,by = z]
plouf[,lapply(.SD[x == 1],function(x){x}),.SDcols = cols ,by = z]
All these lines give:
z y x
1: 3 2 1
which is the order I need, for example to reassign to c("y","x"). But if I do:
plouf[,lapply(.SD[get("x") == 1],function(x){x}),.SDcols = c("y","x"),by = z]
z x y
1: 3 1 2
Here the order of x and y changed for no apparent reason, when it should give the same result as the last "working" example. The wrong values are then assigned to c("y","x") if I assign the output of lapply to a vector of column names. It seems that the use of get in the i part of .SD triggers this bug.
Example of the effect of this on assignment:
plouf[, c(cols ) := lapply(.SD[get("x") == 1],function(x){x}),
.SDcols = cols ,by = z][]
# x y z
# 1: 2 1 3
Does anyone have a workaround? The code I am using looks more like:
plouf[, c(cols ) := lapply(.SD[get("x") >= 1 & get("x") <= 3],function(x){mean(x)}),
.SDcols = cols ,by = z]
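One stopgap I am considering is to index the list returned by lapply by name before assigning. This is only a sketch, assuming the list keeps the .SD column names (it does in the outputs above); it is not a confirmed fix. I would still like to know if there is a cleaner way.

```r
library(data.table)

plouf <- data.table(x = 1, y = 2, z = 3)
cols <- c("y", "x")

# lapply() on .SD returns a named list; subsetting it with [cols] puts the
# elements in the same order as the left-hand side of :=, whatever order
# .SD happened to deliver them in.
plouf[, c(cols) := lapply(.SD[get("x") == 1], function(v) v)[cols],
      .SDcols = cols, by = z]

print(plouf)
```

With the identity function, x and y should end up unchanged, since each value is matched to its column by name rather than by position.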
The issue on GitHub: https://github.com/Rdatatable/data.table/issues/4089