I have a list containing a few million sublists. These sublists take only a few distinct values, maybe 10 to 100, and I want to count the number of occurrences of each value.
The code below works, but it is very slow. Can we do this faster?
count_by_list <- function(lst, var_nm = as.character(substitute(lst)), count_nm = "n"){
  unique_lst <- unique(lst)
  res <- tibble::tibble(!!var_nm := unique_lst, !!count_nm := NA)
  for (i in seq_along(unique_lst)) {
    res[[count_nm]][[i]] <- sum(lst %in% res[[var_nm]][i])
  }
  res
}
x <- list(
  list(a = 1, b = 2),
  list(a = 1, b = 2),
  list(b = 3),
  list(b = 3, c = 4))
count_by_list(x)
#> # A tibble: 3 x 2
#> x n
#> <list> <int>
#> 1 <named list [2]> 2
#> 2 <named list [1]> 1
#> 3 <named list [2]> 1
Created on 2019-11-29 by the reprex package (v0.3.0)
I tried hashing with the digest library, but it was actually slower, and it got worse as n increased:
library(digest)
count_by_list2 <- function(lst, var_nm = as.character(substitute(lst)), count_nm = "n"){
  unique_lst <- unique(lst)
  digested <- vapply(lst, digest, character(1))
  res <- as.data.frame(table(digested), stringsAsFactors = FALSE)
  names(res) <- c(var_nm, count_nm)
  # table() sorts by digest, so map each digest back to its original value
  # rather than assuming unique_lst is in the same order
  res[[1]] <- unique_lst[match(res[[1]], vapply(unique_lst, digest, character(1)))]
  res
}
If you need to benchmark, you can use x_big <- unlist(replicate(10000, x, FALSE), recursive = FALSE).
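To make that concrete, here is a minimal, self-contained timing sketch using base R's system.time. The deparse-based counter below is only a hypothetical stand-in so the snippet runs on its own; substitute count_by_list or count_by_list2 from above to compare them.

```r
# hypothetical stand-in counter: serialize each sublist to a string key
# and tabulate the keys in one pass (illustration only, not the code above)
count_serialized <- function(lst) {
  keys <- vapply(lst, function(e) paste(deparse(e), collapse = ""), character(1))
  table(keys)
}

# build the larger test input from the small example
x <- list(
  list(a = 1, b = 2),
  list(a = 1, b = 2),
  list(b = 3),
  list(b = 3, c = 4))
x_big <- unlist(replicate(10000, x, FALSE), recursive = FALSE)

# time the counter on 40,000 sublists
system.time(res <- count_serialized(x_big))
```

Identical sublists deparse to identical strings, so the table has one entry per distinct value; the counts should sum to length(x_big).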
I added the rcpp and parallel-processing tags because these approaches might help, but they are not constraints on the answers.