Channel: Active questions tagged r - Stack Overflow

Count occurrences of lists efficiently


I have a list containing a few million sublists. These sublists take only a few distinct values, perhaps 10 to 100.

I want to count how many times each distinct value occurs.

The code below works, but it is very slow. Can we do this faster?

count_by_list <- function(lst, var_nm = as.character(substitute(lst)), count_nm = "n"){
  unique_lst <- unique(lst)
  res <- tibble::tibble(!!var_nm := unique_lst, !!count_nm := NA)
  for(i in seq_along(unique_lst)){
    res[[count_nm]][[i]] <- sum(lst %in% res[[var_nm]][i])
  }
  res
}

x <- list(
  list(a=1, b=2),
  list(a=1, b=2),
  list(b=3),
  list(b=3, c=4))

count_by_list(x)
#> # A tibble: 3 x 2
#>   x                    n
#>   <list>           <int>
#> 1 <named list [2]>     2
#> 2 <named list [1]>     1
#> 3 <named list [2]>     1

Created on 2019-11-29 by the reprex package (v0.3.0)

I tried hashing with the digest package, but it was actually slower, and it gets worse as n increases:

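library(digest)
count_by_list2 <- function(lst, var_nm = as.character(substitute(lst)), count_nm = "n"){
  digested   <- vapply(lst, digest, character(1))
  # table() sorts by level, so fix the levels to first-appearance order;
  # otherwise the counts would not line up with the unique values
  digested   <- factor(digested, levels = unique(digested))
  unique_lst <- lst[!duplicated(digested)]
  res        <- as.data.frame(table(digested))
  names(res) <- c(var_nm, count_nm)
  res[[1]] <- unique_lst
  res
}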

If you need to benchmark, you can use x_big <- unlist(replicate(10000, x, FALSE), recursive = FALSE).
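
For anyone timing candidates, a harness along these lines could be used. It is a minimal sketch in base R only. `time_counter` and `count_by_match` are hypothetical names introduced here, and `count_by_match` is just a stand-in built on `match()` and `tabulate()` so that the block runs without tibble or digest. It is not meant as a definitive answer.

```r
# Small example from the question and the enlarged benchmark input
x <- list(
  list(a = 1, b = 2),
  list(a = 1, b = 2),
  list(b = 3),
  list(b = 3, c = 4))
x_big <- unlist(replicate(10000, x, FALSE), recursive = FALSE)

# Hypothetical stand-in counter: a single match() pass
# instead of one %in% scan per unique value
count_by_match <- function(lst) {
  unique_lst <- unique(lst)
  # for each element, the index of its first equal entry in unique_lst
  idx <- match(lst, unique_lst)
  list(value = unique_lst, n = tabulate(idx, nbins = length(unique_lst)))
}

# Hypothetical helper: median elapsed seconds over a few runs
time_counter <- function(f, lst, times = 3) {
  elapsed <- vapply(seq_len(times),
                    function(i) system.time(f(lst))[["elapsed"]],
                    numeric(1))
  median(elapsed)
}

res <- count_by_match(x_big)
res$n                                 # counts, in first-appearance order
time_counter(count_by_match, x_big)   # seconds; varies by machine
```

Any other candidate that takes the list as its first argument, such as count_by_list or count_by_list2, can be passed to `time_counter` in the same way.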

I added the rcpp and parallel-processing tags because these might help; they are not constraints on the answers.

