I'm working on an analysis of the SO data and want to check whether the badge classes (gold, silver and bronze) show significant differences between the top users within those groups, i.e. that the CIs don't overlap. To this end I compute Student's t CIs and draw them as error bars, but the mean computed in that summary doesn't match the mean drawn by stat_summary.
See below (apologies for not having a reproducible example; the data set is huge):
str(comp)
'data.frame': 4500 obs. of 10 variables:
$ userId : num 51 58 61 79 101 122 136 142 233 238 ...
$ reputation : num 35198 39731 41299 38596 38689 ...
$ creationDate : POSIXct, format: "2008-08-01 13:31:13" "2008-08-01 13:56:33" "2008-08-01 14:21:00" "2008-08-01 16:05:09" ...
$ lastAccessDate: POSIXct, format: "2019-11-30 16:40:08" "2019-10-31 15:55:12" "2019-12-01 01:41:04" "2018-04-06 01:48:22" ...
$ location : chr "Yad Binyamin, Israel" "Indianapolis, IN" "Auckland, New Zealand" "New York, NY" ...
$ views : int 3086 1825 1771 1404 1845 2936 2199 874 1655 780 ...
$ upvotes : int 2753 1049 1322 411 550 517 553 106 1734 216 ...
$ downvotes : int 44 55 219 38 64 51 98 3 211 18 ...
$ class : Factor w/ 3 levels "bronze","gold",..: 1 1 1 1 1 1 1 1 1 1 ...
$ badge : Factor w/ 91 levels "Altruist","Analytical",..: 52 52 52 52 52 52 52 52 52 52 ...
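library(dplyr)
library(ggplot2)
# Per-class mean of reputation with a t-based 95% CI half-width (ci)
summaryRep <- comp %>%
  group_by(class) %>%
  summarise(n = n(), mean = mean(reputation), sd = sd(reputation), se = sd / sqrt(n), ci = qt(.975, n - 1) * se)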
> summaryRep
# A tibble: 3 x 6
class n mean sd se ci
<fct> <int> <dbl> <dbl> <dbl> <dbl>
1 bronze 1500 37494. 5513. 142. 279.
2 gold 1500 145712. 117260. 3028. 5939.
3 silver 1500 54451. 13118. 339. 664.
colorSpec <- c("#f9a602", "#c0c0c0", "#cd7f32")
names(colorSpec) <- c("gold", "silver", "bronze")
comp %>%
  left_join(summaryRep, by = "class") %>%
  ggplot(aes(badge, reputation, colour = class, group = class)) +
  geom_boxplot(notch = TRUE) +
  stat_summary(fun.y = mean, geom = "point", shape = 20, size = 10) +
  geom_errorbar(aes(ymin = mean - ci, ymax = mean + ci), width = .3) +
  scale_y_log10() +
  scale_colour_manual(values = colorSpec) +
  geom_jitter(alpha = 0.3)
Notice that the mean at the centre of each error bar doesn't match the per-class mean drawn by stat_summary.
PS: the data is very far from normally distributed, so I'd need to use a different CI such as a bootstrapped CI (BCI), but I'm still very curious why the means don't match.
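A percentile bootstrap CI of the mean, of the kind mentioned in the PS, could be sketched roughly as follows. This is only an illustration on simulated data: `boot_ci` is a hypothetical helper, the log-normal sample merely stands in for a skewed reputation column, and the 2000 resamples are an arbitrary choice.

```r
# Hypothetical helper: percentile bootstrap CI for the mean of x.
boot_ci <- function(x, n_boot = 2000, level = 0.95) {
  n <- length(x)
  # Resample x with replacement n_boot times and record each resample's mean.
  boot_means <- replicate(n_boot, mean(x[sample.int(n, n, replace = TRUE)]))
  alpha <- 1 - level
  # The empirical quantiles of the bootstrap means form the interval.
  quantile(boot_means, probs = c(alpha / 2, 1 - alpha / 2), names = FALSE)
}

set.seed(42)
# Simulated, right-skewed stand-in for reputation (not the real data).
fake_reputation <- rlnorm(1500, meanlog = 10.5, sdlog = 0.6)
ci <- boot_ci(fake_reputation)
ci
```

Applied per class, something like `tapply(comp$reputation, comp$class, boot_ci)` would give one interval per badge class, assuming `comp` has the structure shown above.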
UPDATE: this demonstrates that whichever column ggplot groups by, badge or class, the resulting means should be the same:
identical(
  comp %>%
    group_by(class) %>%
    summarise(avgReputation = mean(reputation)) %>%
    select(avgReputation) %>%
    arrange(avgReputation),
  comp %>%
    group_by(badge) %>%
    summarise(avgReputation = mean(reputation)) %>%
    select(avgReputation) %>%
    arrange(avgReputation)
)
[1] TRUE