I have about 15 election studies saved as SPSS .sav files. My group and I will be recoding about 10 variables for each study to run some logistic regressions.
I have used haven to import all the files, so it looks like all the variables are of the haven_labelled class.
I have always been a little confused about how to handle this class of variables. However, I have noticed a lot of improvement as the haven and labelled packages have been updated, so I'm inclined to keep using them rather than, e.g., rio or foreign.
But I want to get a sense of what best practices should be before we start this effort so we don't look back with regret.
Each study file has about 200 variables, with a mix of factors and numeric variables. But to start, I'm wondering how I should go about recoding the sex variable so that I end up with a variable male where 1 is male and 0 is not.
One thing I want to ask about is the car::Recode() method of recoding variables as opposed to the dplyr::recode() way. I personally find the dplyr::recode() syntax very clunky and the help documentation poor. I am also not sure about the best way to set missing values.
To be specific, I have three questions.
Question 1: Is there a compelling reason to use dplyr::recode() as opposed to car::Recode()? My own answer is that car::Recode() looks sufficient and easy to use.
Question 2: Should I make a point of converting variables to factors or numeric, or will I be OK leaving variables as haven_labelled with updated value labels? I am concerned about this quote from the haven documentation about the labelled class: "This class provides few methods, as I expect you’ll coerce to a standard R class (e.g. a factor()) soon after importing"
However, maybe the haven_labelled class has been improved and is sufficiently different from the labelled class that it is no longer necessary to force conversion to other standard R classes.
Question 3: Is there any advantage to setting missing values with the labelled functions (e.g. na_range(), na_values()) rather than with the car::Recode() method?
My inclination is that there are clear disadvantages to using the labelled methods and I should stick with the car::Recode() method.
Thank you.
#FAKE DATA
library(labelled)
library(haven) #as_factor() used below comes from haven
var1<-labelled(rep(c(1,5), 100), c(male = 1, female = 5))
var2<-labelled(sample(c(1,3,5,7,8,9), size=200, replace=TRUE), c('strongly agree'=1, 'agree'=3, 'disagree'=5, 'strongly disagree'=7, 'DK'=8, 'refused'=9))
#give variable labels
var_label(var1)<-'Respondent\'s sex'
var_label(var2)<-'free trade is a good thing'
df<-data.frame(var1=var1, var2=var2)
str(df)
#This works really well, and I really like it.
look_for(df, 'sex')
look_for(df, 'free trade')
#The car way
df$male<-car::Recode(df$var1, "5=0")
#Check results
df$male
#value labels are still there, so would have to be removed or updated
as_factor(df$male)
#Remove value labels
val_labels(df$male)<-NULL
#Check
class(df$male) #left with a numeric variable
#The other car way, keeping and modifying value labels
df$male2<-car::Recode(df$var1, "5=0")
df$male2
val_label(df$male2, 0)<-c('female')
val_label(df$male2, 5)<-NULL
val_labels(df$male2)
#Check class
class(df$male2)
#Can run numeric functions on it
mean(df$male2)
#easily convert to factor
as_factor(df$male2)
#How to handle missing values
#The car way
#use car to set missing values to NA
#(car is not attached above, so call Recode() with the car:: prefix)
df$free_trade<-car::Recode(df$var2, "8=NA; 9=NA")
#Check class
class(df$free_trade)
#can still run numeric functions on haven_labelled
mean(df$free_trade, na.rm=TRUE)
#table
table(df$free_trade)
#did the na recode work?
table(is.na(df$free_trade))
#check value labels
val_labels(df$free_trade)
#set missing values the labelled way
table(df$var2)
na_values(df$var2)<-c(8,9)
#check
df$var2
#but table() does not pick up 8 and 9 as missing
table(df$var2)
#this seems to not work very well
table(to_factor(df$var2))
to_factor(df$var2)
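A minimal sketch of what I think is going on in the last step, assuming a reasonably recent version of labelled: values declared with na_values() stay in the data as ordinary values, so table() still counts them, while is.na() does flag them. As far as I can tell, user_na_to_na() from labelled converts them to regular NA. The vector x below is made up for illustration and is not from my study files.

```r
library(labelled)

# Made-up vector: 8 = DK and 9 = refused are declared as user-defined missing
x <- haven::labelled_spss(
  c(1, 3, 8, 9, 5),
  labels = c("strongly agree" = 1, "agree" = 3, "disagree" = 5,
             "DK" = 8, "refused" = 9),
  na_values = c(8, 9)
)

# is.na() treats user-defined missing values as missing
flagged <- is.na(x)

# Convert user-defined missing values to regular NA
x_clean <- user_na_to_na(x)
n_missing <- sum(is.na(x_clean))

# table() now leaves the former 8s and 9s out of the counts
table(x_clean)
```

If this is right, the labelled route keeps the information about why a value is missing until I decide to drop it, whereas car::Recode() to NA discards it immediately.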