I am trying to fit a GLM to a large dataset of proportions. The data are sampled by spatial grid cell: for each cell I have counts of mature and immature fish caught, but the sample sizes are very uneven, with some cells having very low and others very high total counts. This seems to be inflating the deviance of the GLM (family = quasibinomial) I am trying to fit. Standardizing the counts by dividing them all by a constant reduces the deviance, but I am not sure whether this is an acceptable approach.
To illustrate, I provide some dummy data and the analysis steps below.
dat <- structure(list(success = c(817L, 619L, 447L, 822L, 682L, 65L,
858L, 401L, 731L, 219L, 505L, 878L, 686L, 707L, 727L, 801L, 786L,
151L, 178L, 339L, 280L, 788L, 659L, 306L, 429L, 236L, 997L, 739L,
676L, 181L, 490L, 857L, 471L, 584L, 633L, 433L, 442L, 777L, 830L,
755L, 64L, 864L, 898L, 863L, 44L, 94L, 887L, 962L, 666L, 150L,
817L, 619L, 447L, 822L, 682L, 65L, 858L, 401L, 731L, 219L, 505L,
878L, 686L, 707L, 727L, 801L, 786L, 151L, 178L, 339L, 280L, 788L,
659L, 306L, 429L, 236L, 997L, 739L, 676L, 181L, 490L, 857L, 471L,
584L, 633L, 433L, 442L, 777L, 830L, 755L, 64L, 864L, 898L, 863L,
44L, 94L, 887L, 962L, 666L, 150L), failure = c(3996L, 1821L,
7643L, 3309L, 1780L, 3197L, 9975L, 9062L, 8464L, 9183L, 3266L,
2645L, 6356L, 8188L, 8497L, 4744L, 3035L, 7443L, 9896L, 8550L,
3237L, 8766L, 7383L, 6345L, 8039L, 1527L, 9560L, 9773L, 7326L,
7340L, 9648L, 7566L, 1878L, 7764L, 6601L, 5064L, 6798L, 6634L,
2715L, 8004L, 9923L, 3825L, 7381L, 2703L, 7570L, 7174L, 2030L,
8434L, 5643L, 6527L, 3996L, 1821L, 7643L, 3309L, 1780L, 3197L,
9975L, 9062L, 8464L, 9183L, 3266L, 2645L, 6356L, 8188L, 8497L,
4744L, 3035L, 7443L, 9896L, 8550L, 3237L, 8766L, 7383L, 6345L,
8039L, 1527L, 9560L, 9773L, 7326L, 7340L, 9648L, 7566L, 1878L,
7764L, 6601L, 5064L, 6798L, 6634L, 2715L, 8004L, 9923L, 3825L,
7381L, 2703L, 7570L, 7174L, 2030L, 8434L, 5643L, 6527L), expl_varA = c(75L,
13L, 45L, 2L, 3L, 9L, 21L, 79L, 77L, 36L, 30L, 58L, 17L, 93L,
44L, 61L, 23L, 97L, 98L, 11L, 26L, 25L, 43L, 89L, 84L, 35L, 39L,
71L, 22L, 31L, 95L, 46L, 70L, 88L, 10L, 81L, 76L, 7L, 90L, 62L,
56L, 49L, 80L, 86L, 53L, 20L, 65L, 34L, 16L, 48L, 75L, 13L, 45L,
2L, 3L, 9L, 21L, 79L, 77L, 36L, 30L, 58L, 17L, 93L, 44L, 61L,
23L, 97L, 98L, 11L, 26L, 25L, 43L, 89L, 84L, 35L, 39L, 71L, 22L,
31L, 95L, 46L, 70L, 88L, 10L, 81L, 76L, 7L, 90L, 62L, 56L, 49L,
80L, 86L, 53L, 20L, 65L, 34L, 16L, 48L), expl_varB = structure(c(1L,
2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L,
2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L,
2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L,
2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L,
2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L,
2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L,
2L, 1L, 2L), .Label = c("A", "B"), class = "factor")), class = "data.frame", row.names = c(NA,
-100L))
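A quick look at the per-sample totals shows how uneven the sample sizes are:
# Total count per sample (grid cell); the totals vary severalfold across rows
n_tot <- with(dat, success + failure)
summary(n_tot)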
Here is the model run using raw success and failure counts.
# Model with raw data
mod1 <- glm(cbind(success, failure) ~ expl_varA * expl_varB,
            family = quasibinomial, data = dat)
summary(mod1)
Call:
glm(formula = cbind(success, failure) ~ expl_varA * expl_varB,
    family = quasibinomial, data = dat)

Deviance Residuals:
    Min      1Q  Median      3Q     Max
-35.633  -9.931  -0.878  11.598  35.323

Coefficients:
                      Estimate Std. Error t value Pr(>|t|)
(Intercept)          -2.068592   0.225116  -9.189 8.22e-15 ***
expl_varA            -0.005894   0.004019  -1.467    0.146
expl_varBB           -0.290050   0.323285  -0.897    0.372
expl_varA:expl_varBB  0.004341   0.005656   0.767    0.445
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for quasibinomial family taken to be 339.7348)

    Null deviance: 30497  on 99  degrees of freedom
Residual deviance: 29639  on 96  degrees of freedom
AIC: NA

Number of Fisher Scoring iterations: 5
The residual deviance is very high relative to its 96 degrees of freedom, which points to strong overdispersion.
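To quantify this directly, the deviance-to-df ratio and the Pearson-based dispersion estimate (the one summary() reports) can be computed as follows:
# Deviance-based dispersion: 29639 / 96, roughly 309
deviance(mod1) / df.residual(mod1)
# Pearson-based dispersion, matching the ~339.7 in the summary above
sum(residuals(mod1, type = "pearson")^2) / df.residual(mod1)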
When success and failure are each divided by a constant, the residual deviance decreases, and I can bring the deviance-to-degrees-of-freedom ratio down to roughly 1:1.
# Model with standardized counts
mod2 <- glm(cbind(success/300, failure/300) ~ expl_varA * expl_varB,
            family = quasibinomial, data = dat)
summary(mod2)
Call:
glm(formula = cbind(success/300, failure/300) ~ expl_varA * expl_varB,
    family = quasibinomial, data = dat)

Deviance Residuals:
     Min        1Q    Median        3Q       Max
-2.05725  -0.57334  -0.05067   0.66962   2.03939

Coefficients:
                      Estimate Std. Error t value Pr(>|t|)
(Intercept)          -2.068592   0.225116  -9.189 8.22e-15 ***
expl_varA            -0.005894   0.004019  -1.467    0.146
expl_varBB           -0.290050   0.323285  -0.897    0.372
expl_varA:expl_varBB  0.004341   0.005656   0.767    0.445
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for quasibinomial family taken to be 1.132449)

    Null deviance: 101.657  on 99  degrees of freedom
Residual deviance:  98.796  on 96  degrees of freedom
AIC: NA

Number of Fisher Scoring iterations: 5
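Note that the coefficient estimates, standard errors, and p-values are identical in the two summaries; only the deviance and the dispersion estimate change (the dispersion drops by the factor of 300). This can be checked directly:
# Rescaling the counts leaves estimates and standard errors unchanged,
# because the quasibinomial dispersion absorbs the change in prior weights;
# should return TRUE (up to numerical tolerance)
all.equal(coef(summary(mod1)), coef(summary(mod2)))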
The deviance issue seems to be addressed, though I am not sure this is an acceptable approach. Am I justified in dealing with the deviance this way, and if not, how should I address it? It may well be that unmeasured variables are driving the overdispersion, but could the total counts per sample also inflate the deviance on their own, even if every relevant variable were measured?
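One rough check I can think of (a sketch, using mod1's fitted proportions as a stand-in for the truth): simulate purely binomial data at the observed sample sizes and refit. If the deviance/df ratio comes out near 1, then large totals alone do not inflate the deviance, and the excess must come from extra-binomial variation.
# Simulate pure binomial counts at the observed sample sizes, using the
# fitted proportions from mod1, then refit the same model; under true
# binomial variation deviance/df should be near 1 even with large totals
set.seed(1)
n_tot <- dat$success + dat$failure
sim_success <- rbinom(nrow(dat), size = n_tot, prob = fitted(mod1))
sim_dat <- transform(dat, success = sim_success, failure = n_tot - sim_success)
mod_sim <- glm(cbind(success, failure) ~ expl_varA * expl_varB,
               family = quasibinomial, data = sim_dat)
deviance(mod_sim) / df.residual(mod_sim)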