Reducing deviance in glm binomial model


I am trying to fit a GLM to a large dataset of proportions. The samples are spatial grid cells for which I have counts of mature and immature fish caught, but the sample sizes are uneven: some grids have very low total counts and others very high. This seems to be inflating the deviance of the GLM (family = quasibinomial) I am fitting. Standardizing the counts by dividing them all by a constant reduces the deviance, but I am not sure whether this is an acceptable approach.

To illustrate, here are some dummy data and the steps of the analysis.

dat <- structure(list(success = c(817L, 619L, 447L, 822L, 682L, 65L, 
858L, 401L, 731L, 219L, 505L, 878L, 686L, 707L, 727L, 801L, 786L, 
151L, 178L, 339L, 280L, 788L, 659L, 306L, 429L, 236L, 997L, 739L, 
676L, 181L, 490L, 857L, 471L, 584L, 633L, 433L, 442L, 777L, 830L, 
755L, 64L, 864L, 898L, 863L, 44L, 94L, 887L, 962L, 666L, 150L, 
817L, 619L, 447L, 822L, 682L, 65L, 858L, 401L, 731L, 219L, 505L, 
878L, 686L, 707L, 727L, 801L, 786L, 151L, 178L, 339L, 280L, 788L, 
659L, 306L, 429L, 236L, 997L, 739L, 676L, 181L, 490L, 857L, 471L, 
584L, 633L, 433L, 442L, 777L, 830L, 755L, 64L, 864L, 898L, 863L, 
44L, 94L, 887L, 962L, 666L, 150L), failure = c(3996L, 1821L, 
7643L, 3309L, 1780L, 3197L, 9975L, 9062L, 8464L, 9183L, 3266L, 
2645L, 6356L, 8188L, 8497L, 4744L, 3035L, 7443L, 9896L, 8550L, 
3237L, 8766L, 7383L, 6345L, 8039L, 1527L, 9560L, 9773L, 7326L, 
7340L, 9648L, 7566L, 1878L, 7764L, 6601L, 5064L, 6798L, 6634L, 
2715L, 8004L, 9923L, 3825L, 7381L, 2703L, 7570L, 7174L, 2030L, 
8434L, 5643L, 6527L, 3996L, 1821L, 7643L, 3309L, 1780L, 3197L, 
9975L, 9062L, 8464L, 9183L, 3266L, 2645L, 6356L, 8188L, 8497L, 
4744L, 3035L, 7443L, 9896L, 8550L, 3237L, 8766L, 7383L, 6345L, 
8039L, 1527L, 9560L, 9773L, 7326L, 7340L, 9648L, 7566L, 1878L, 
7764L, 6601L, 5064L, 6798L, 6634L, 2715L, 8004L, 9923L, 3825L, 
7381L, 2703L, 7570L, 7174L, 2030L, 8434L, 5643L, 6527L), expl_varA = c(75L, 
13L, 45L, 2L, 3L, 9L, 21L, 79L, 77L, 36L, 30L, 58L, 17L, 93L, 
44L, 61L, 23L, 97L, 98L, 11L, 26L, 25L, 43L, 89L, 84L, 35L, 39L, 
71L, 22L, 31L, 95L, 46L, 70L, 88L, 10L, 81L, 76L, 7L, 90L, 62L, 
56L, 49L, 80L, 86L, 53L, 20L, 65L, 34L, 16L, 48L, 75L, 13L, 45L, 
2L, 3L, 9L, 21L, 79L, 77L, 36L, 30L, 58L, 17L, 93L, 44L, 61L, 
23L, 97L, 98L, 11L, 26L, 25L, 43L, 89L, 84L, 35L, 39L, 71L, 22L, 
31L, 95L, 46L, 70L, 88L, 10L, 81L, 76L, 7L, 90L, 62L, 56L, 49L, 
80L, 86L, 53L, 20L, 65L, 34L, 16L, 48L), expl_varB = structure(c(1L, 
2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 
2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 
2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 
2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 
2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 
2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 
2L, 1L, 2L), .Label = c("A", "B"), class = "factor")), class = "data.frame", row.names = c(NA, 
-100L))
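For context, the per-sample totals in the dummy data are quite uneven, just as in the real data. A quick check (using `dat` as defined above):

```r
# Spread of per-grid sample sizes; large differences in the totals mean
# some grids contribute far more to the deviance than others
totals <- dat$success + dat$failure
summary(totals)
```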

Here is the model run using raw success and failure counts.

# Model with raw data
mod1 <- glm(cbind(success, failure) ~ expl_varA * expl_varB, family = quasibinomial, data = dat)

summary(mod1)
Call:
glm(formula = cbind(success, failure) ~ expl_varA * expl_varB, 
    family = quasibinomial, data = dat)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-35.633   -9.931   -0.878   11.598   35.323  

Coefficients:
                      Estimate Std. Error t value Pr(>|t|)    
(Intercept)          -2.068592   0.225116  -9.189 8.22e-15 ***
expl_varA            -0.005894   0.004019  -1.467    0.146    
expl_varBB           -0.290050   0.323285  -0.897    0.372    
expl_varA:expl_varBB  0.004341   0.005656   0.767    0.445    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for quasibinomial family taken to be 339.7348)

    Null deviance: 30497  on 99  degrees of freedom
Residual deviance: 29639  on 96  degrees of freedom
AIC: NA

Number of Fisher Scoring iterations: 5

Residual deviance is quite high on 96 degrees of freedom.
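The overdispersion can also be quantified directly from the fitted model; both ratios below are far above 1 (a sketch, assuming `mod1` from above):

```r
# Deviance-based dispersion: residual deviance / residual df
deviance(mod1) / df.residual(mod1)   # ~ 29639 / 96, i.e. about 309

# Pearson-based dispersion, which is what summary.glm reports
# for the quasibinomial family (about 339.7 here)
sum(residuals(mod1, type = "pearson")^2) / df.residual(mod1)
```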

When success and failure are both divided by a constant, the residual deviance decreases, and I can bring the deviance-to-degrees-of-freedom ratio close to 1:1.

# Model with standardized counts
mod2 <- glm(cbind(success/300, failure/300) ~ expl_varA * expl_varB, family = quasibinomial, data = dat)

summary(mod2)
Call:
glm(formula = cbind(success/300, failure/300) ~ expl_varA * expl_varB, 
    family = quasibinomial, data = dat)

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-2.05725  -0.57334  -0.05067   0.66962   2.03939  

Coefficients:
                      Estimate Std. Error t value Pr(>|t|)    
(Intercept)          -2.068592   0.225116  -9.189 8.22e-15 ***
expl_varA            -0.005894   0.004019  -1.467    0.146    
expl_varBB           -0.290050   0.323285  -0.897    0.372    
expl_varA:expl_varBB  0.004341   0.005656   0.767    0.445    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for quasibinomial family taken to be 1.132449)

    Null deviance: 101.657  on 99  degrees of freedom
Residual deviance:  98.796  on 96  degrees of freedom
AIC: NA

Number of Fisher Scoring iterations: 5
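Note that the coefficient estimates and standard errors are identical to mod1's; only the deviance and the dispersion parameter shrink. The binomial deviance is linear in the counts, so dividing both columns by a constant divides the deviance by that constant without changing the fit (assuming `mod1` and `mod2` from above):

```r
# Deviance scales with the totals: dividing counts by 300 divides deviance by 300
deviance(mod1) / 300   # about 98.8
deviance(mod2)         # about 98.8, matching the summary above
```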

The deviance issue seems to be addressed, though I am not sure this is an acceptable approach. Am I justified in handling the deviance this way, and if not, how should I address it? It may well be that unmeasured variables are causing the extra deviance, but could the total counts per sample also inflate the deviance even if every relevant variable were measured?
