I'd like to apply chi-square test scipy.stats.chisquare
. And the total number of observations is different in my groups.
import pandas as pd
data={'expected':[20,13,18,21,21,29,45,37,35,32,53,38,25,21,50,62],
'observed':[19,10,15,14,15,25,25,20,26,38,50,36,30,28,59,49]}
data=pd.DataFrame(data)
print(data.expected.sum())
print(data.observed.sum())
To ignore this is incorrect - right?
Does the default behavior of scipy.stats.chisquare
takes this into account? I checked with pen and paper and looks like it doesn't. Is there a parameter for this?
from scipy.stats import chisquare
# incorrect since the number of observations is unequal
chisquare(f_obs=data.observed, f_exp=data.expected)
When I do manual adjustment I get slightly different result.
# adjust actual number of observations
data['obs_prop']=data['observed'].apply(lambda x: x/data['observed'].sum())
data['observed_new']=data['obs_prop']*data['expected'].sum()
# proper way
chisquare(f_obs=data.observed_new, f_exp=data.expected)
Please correct me if I am wrong at some point. Thanks.
ps: I tagged R for additional statistical expertise