Channel: Active questions tagged r - Stack Overflow

Why don't the two means match when computed manually and using stat_summary?


I'm analysing Stack Overflow data for a project and want to check whether the badge classes (gold, silver and bronze) show significant differences among the top users in each group, i.e. whether their confidence intervals (CIs) don't overlap. To this end I compute Student's t CIs and draw them as error bars, but the mean computed in my summary doesn't match the mean drawn by stat_summary; see below (apologies for not having a reproducible example, the data set is huge):

str(comp)
'data.frame':   4500 obs. of  10 variables:
 $ userId        : num  51 58 61 79 101 122 136 142 233 238 ...
 $ reputation    : num  35198 39731 41299 38596 38689 ...
 $ creationDate  : POSIXct, format: "2008-08-01 13:31:13" "2008-08-01 13:56:33" "2008-08-01 14:21:00" "2008-08-01 16:05:09" ...
 $ lastAccessDate: POSIXct, format: "2019-11-30 16:40:08" "2019-10-31 15:55:12" "2019-12-01 01:41:04" "2018-04-06 01:48:22" ...
 $ location      : chr  "Yad Binyamin, Israel" "Indianapolis, IN" "Auckland, New Zealand" "New York, NY" ...
 $ views         : int  3086 1825 1771 1404 1845 2936 2199 874 1655 780 ...
 $ upvotes       : int  2753 1049 1322 411 550 517 553 106 1734 216 ...
 $ downvotes     : int  44 55 219 38 64 51 98 3 211 18 ...
 $ class         : Factor w/ 3 levels "bronze","gold",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ badge         : Factor w/ 91 levels "Altruist","Analytical",..: 52 52 52 52 52 52 52 52 52 52 ...

summaryRep <- comp %>% 
    group_by(class) %>%
    summarise(n=n(), mean=mean(reputation), sd=sd(reputation), se=sd/sqrt(n), ci=qt(.975,n-1)*se)
> summaryRep
# A tibble: 3 x 6
  class      n    mean      sd    se    ci
  <fct>  <int>   <dbl>   <dbl> <dbl> <dbl>
1 bronze  1500  37494.   5513.  142.  279.
2 gold    1500 145712. 117260. 3028. 5939.
3 silver  1500  54451.  13118.  339.  664.
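As a sanity check on the interval formula used in `summaryRep`, the sketch below compares the hand-computed half-width `qt(.975, n-1) * se` with the interval returned by base R's `t.test()`. The data are synthetic (an assumption, since the original data set is not available), so only the agreement between the two methods is meaningful, not the numbers themselves.

```r
# Synthetic, right-skewed values standing in for reputation (not the original data)
set.seed(1)
x <- rlnorm(200, meanlog = 10, sdlog = 0.5)

n  <- length(x)
se <- sd(x) / sqrt(n)
ci <- qt(0.975, df = n - 1) * se   # half-width, same formula as in summaryRep

manual  <- c(mean(x) - ci, mean(x) + ci)
builtin <- as.numeric(t.test(x, conf.level = 0.95)$conf.int)

# Both approaches should give the same 95% interval
all.equal(manual, builtin)
```

If the two intervals agree, the summary table itself is unlikely to be the source of the mismatch.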

colorSpec <- c("#f9a602", "#c0c0c0", "#cd7f32")
names(colorSpec) <- c("gold", "silver", "bronze")
comp %>% 
  left_join(summaryRep, by="class") %>%
  ggplot(aes(badge, reputation, colour=class, group=class)) +
  geom_boxplot(notch=T) +
  stat_summary(fun.y=mean, geom="point", shape=20, size=10) +
  geom_errorbar(aes(ymin=mean-ci, ymax=mean+ci), width=.3) +
  scale_y_log10() +
  scale_colour_manual(values = colorSpec) +
  geom_jitter(alpha=0.3)    

[Image: plot produced by the code above]

Note that the mean at the centre of each error bar doesn't match the per-class mean drawn by stat_summary.
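One possible explanation, not verified against the original data: ggplot2 applies scale transformations such as `scale_y_log10()` before it computes statistics, so `stat_summary` would be averaging `log10(reputation)`, while the error bars are centred on the arithmetic mean taken from `summaryRep`. The sketch below uses synthetic skewed values (an assumption) to show how far apart those two quantities can be.

```r
# Synthetic right-skewed values standing in for reputation (assumption)
set.seed(42)
reputation <- rlnorm(1500, meanlog = log(1e5), sdlog = 0.7)

# Arithmetic mean, as computed by summarise(mean = mean(reputation))
arithmetic_mean <- mean(reputation)

# Mean taken on the log10 scale and transformed back (the geometric mean)
log_scale_mean <- 10^mean(log10(reputation))

c(arithmetic = arithmetic_mean, log_scale = log_scale_mean)

# For non-constant positive data the arithmetic mean is always the larger one
arithmetic_mean > log_scale_mean
```

If this is the cause, a coordinate transformation such as `coord_trans(y = "log10")`, which is applied after the statistics are computed, may make the two means line up; that suggestion is untested here.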

PS: the data are very far from normally distributed, so I'd need a different CI, such as a bootstrapped CI (BCI), but I'm still very curious why the means don't match.
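For the bootstrapped CI mentioned above, a minimal base R sketch of a percentile bootstrap for the mean could look like the following. The helper name `boot_ci_mean` and its parameters are hypothetical, the data are synthetic, and the percentile method is only the simplest variant; the `boot` package offers others through `boot.ci()`.

```r
# Percentile bootstrap CI for the mean; boot_ci_mean is a hypothetical helper
boot_ci_mean <- function(x, n_boot = 2000, conf = 0.95) {
  # Resample with replacement and record the mean of each resample
  boot_means <- replicate(n_boot, mean(sample(x, replace = TRUE)))
  alpha <- 1 - conf
  # Lower and upper percentiles of the bootstrap distribution
  quantile(boot_means, probs = c(alpha / 2, 1 - alpha / 2), names = FALSE)
}

set.seed(123)
x <- rlnorm(500, meanlog = 10, sdlog = 1)  # synthetic skewed data (assumption)
ci <- boot_ci_mean(x)
ci
```

Applied per class, for example inside `summarise()`, this would give lower and upper bounds that can be passed directly to `geom_errorbar()` in place of `mean - ci` and `mean + ci`.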

UPDATE: this shows that whichever column ggplot groups by, badge or class, the resulting means should be the same:

identical(comp %>% 
  group_by(class) %>%
  summarise(avgReputation=mean(reputation)) %>%
  select(avgReputation) %>%
  arrange(avgReputation),
comp %>% 
  group_by(badge) %>%
  summarise(avgReputation=mean(reputation)) %>%
  select(avgReputation) %>%
  arrange(avgReputation))
[1] TRUE
