I have a stock market dataset from GitHub:
import pandas as pd
import numpy as np
import statsmodels.api as sm
Smarket_url = 'https://raw.githubusercontent.com/selva86/datasets/master/Smarket.csv'
#Load data
Smarket = pd.read_csv(Smarket_url)
I'm fitting a logistic regression with the GLM function of the 'statsmodels' package. I ran the same regression in RStudio and it gives the same results, except that the signs are flipped: coefficients that are negative in R are positive in Python, and vice versa. In Python I initially used:
Smarket_model = sm.formula.glm('Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume',
data=Smarket,family=sm.families.Binomial()).fit()
These were the results:
Intercept 0.1260
Lag1 0.0731
Lag2 0.0423
Lag3 -0.0111
Lag4 -0.0094
Lag5 -0.0103
Volume -0.1354
I traced the problem to how statsmodels was coding the outcome variable: 'Stock Up' was treated as 0 and 'Stock Down' as 1. So I created a numpy array that reverses this coding (Stock Up = 1, Stock Down = 0) and used the sm.GLM() function instead:
#Create numpy array coding 'Up' as 1 and everything else ('Down') as 0
change = np.where(Smarket['Direction']=='Up',1,0)
#Add intercept
smarket_vars = sm.add_constant(Smarket[['Lag1','Lag2', 'Lag3', 'Lag4','Lag5','Volume']])
#Fit model
market_model = sm.GLM(change, smarket_vars,family=sm.families.Binomial() ).fit()
This gave me coefficients with the right signs:
const -0.1260
Lag1 -0.0731
Lag2 -0.0423
Lag3 0.0111
Lag4 0.0094
Lag5 0.0103
Volume 0.1354
My question is: how can I get the right values without having to create the numpy array that swaps the 0s and 1s? Why did sm.formula.glm() assume that 'Stock Up' was 0 and 'Stock Down' was 1? Thanks to everyone who read all this nonsense and is willing to help me :)