I have a dataset containing categorical variables and numeric features:
Experiment Replicate Batch Condition Cellline Feature1 Feature2 ...
<chr>      <chr>     <chr> <chr>     <chr>    <dbl>    <dbl>    ...
I am using the vtreat package in R to treat my data before modeling.
my_treatment <- vtreat::designTreatmentsZ(
  dframe = data,
  varlist = colnames(data),
  minFraction = 0.05
)
data_treated <- vtreat::prepare(my_treatment, data)
After running prepare(), I tabulate the catP column to inspect the levels of the categorical variable:
> table(data_treated$Cellline_catP)
0.0914634146341463  0.103658536585366  0.109756097560976  0.121951219512195
                15                 17                 72                 60
However, although my dataset has 9 cell lines, I see only 4 distinct values in data_treated$Cellline_catP.
> dplyr::count(data, dplyr::n_distinct(Cellline))
# A tibble: 1 x 2
  `dplyr::n_distinct(Cellline)`     n
                          <int> <int>
1                             9   164
Shouldn't there also be 9 distinct values in data_treated$Cellline_catP? I tried renaming the cell lines (the names are a mix of numbers and letters) and excluding some of them, but the result doesn't change.
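
One possible reading, sketched in base R: as far as I understand the vtreat documentation, a catP column holds the prevalence of each level (the fraction of rows with that level), not an identifier for it. If so, cell lines with the same number of rows would share a single catP value. The cell-line names and per-line counts below are hypothetical, chosen only so that they sum to 164 and reproduce the table() output above. The sketch does not call vtreat at all.

```r
# Hypothetical per-cell-line row counts (names and counts are made up for
# illustration); they sum to 164, the number of rows in the dataset.
counts <- c(A = 15, B = 17, C = 18, D = 18, E = 18, F = 18,
            G = 20, H = 20, I = 20)
cellline <- rep(names(counts), times = counts)

# Prevalence of each level: the fraction of rows that carry it.
prevalence <- table(cellline) / length(cellline)

# Map every row to the prevalence of its level, mimicking what a
# catP-style column would contain under the assumption stated above.
catP_like <- as.numeric(prevalence[cellline])

length(unique(cellline))   # 9 levels
length(unique(catP_like))  # 4 distinct prevalence values
table(catP_like)
```

Under this assumption, 72 rows at 18/164 would correspond to four cell lines with 18 rows each, and 60 rows at 20/164 to three cell lines with 20 rows each, which together with the two singletons gives 9 cell lines. Running `table(data$Cellline, data_treated$Cellline_catP)` on the real data should confirm or refute this.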