I would like to aggregate a data frame while also adding in a new column (N) that counts the number of rows per value of the grouping variable, in base R.
This is trivial in dplyr:
library(dplyr)
data(iris)
combined_summary <- iris %>% group_by(Species) %>% group_by(N=n(), add=TRUE) %>% summarize_all(mean)
> combined_summary
# A tibble: 3 x 6
# Groups: Species [3]
Species N Sepal.Length Sepal.Width Petal.Length Petal.Width
<fct> <int> <dbl> <dbl> <dbl> <dbl>
1 setosa 50 5.01 3.43 1.46 0.246
2 versicolor 50 5.94 2.77 4.26 1.33
3 virginica 50 6.59 2.97 5.55 2.03
I am however in the unfortunate position of having to write this code in an environment that doesn't allow for packages to be used (don't ask; it's not my decision). So I need a way to do this in base R.
I can do it in base R in a long-winded way as follows:
# First create the aggregated tables separately
summary_means <- aggregate(. ~ Species, data=iris, FUN=mean)
summary_count <- aggregate(Sepal.Length ~ Species, data=iris[, c("Species", "Sepal.Length")], FUN=length)
> summary_means
Species Sepal.Length Sepal.Width Petal.Length Petal.Width
1 setosa 5.006 3.428 1.462 0.246
2 versicolor 5.936 2.770 4.260 1.326
3 virginica 6.588 2.974 5.552 2.026
> summary_count
Species Sepal.Length
1 setosa 50
2 versicolor 50
3 virginica 50
# Then rename the count column
colnames(summary_count)[2] <- "N"
> summary_count
Species N
1 setosa 50
2 versicolor 50
3 virginica 50
# Finally merge the two dataframes
combined_summary_baseR <- merge(x=summary_count, y=summary_means, by="Species", all.x=TRUE)
> combined_summary_baseR
Species N Sepal.Length Sepal.Width Petal.Length Petal.Width
1 setosa 50 5.006 3.428 1.462 0.246
2 versicolor 50 5.936 2.770 4.260 1.326
3 virginica 50 6.588 2.974 5.552 2.026
Is there a more efficient way to do this in base R?
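For reference, one possible shorter route is sketched below. It is a sketch under stated assumptions rather than a definitive answer: it counts rows with `table()` and binds the counts onto the output of `aggregate()`, so no second `aggregate()` call or `merge()` is needed. The names `group_counts` and `combined_summary_short` are made up for illustration. It also assumes the data has no missing values, because the formula interface of `aggregate()` drops rows containing `NA` by default while `table()` counts every row.

```r
data(iris)

# Per-group means, exactly as in the long-winded version
summary_means <- aggregate(. ~ Species, data = iris, FUN = mean)

# Row counts per group (hypothetical helper name)
group_counts <- table(iris$Species)

# Look the counts up by group name so they line up with the rows of
# summary_means, then place N between Species and the mean columns.
# Assumption: no NA values, otherwise these counts could differ from
# the number of rows aggregate() actually averaged.
combined_summary_short <- cbind(
  summary_means["Species"],
  N = as.vector(group_counts[as.character(summary_means$Species)]),
  summary_means[-1]
)

print(combined_summary_short)
```

Indexing `group_counts` by the group names, rather than relying on position, keeps the counts aligned with the aggregated rows even if the two results happen to be ordered differently.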