I'm working on a dataset containing information about individuals' place of residence and occupation. In its original form, each row says that someone resided at an address from one year to another, e.g. from 1920 to 1925. If the individual moved to that address in 1920, a dummy variable (moved_in) has the value 1. Similarly, if the individual moved out of that address in 1925, another dummy (moved_out) has the value 1.
Now, the problem is that when I unnest the "from year - to year" range into one row per year, the dummies are copied to every row, so both moved_in and moved_out are 1 for every year from 1920 to 1925.
Example data:
library(tidyr)
library(dplyr)
individual <- c('John Doe','Peter Gynn','Jolie Hope', 'Jolie Hope')
occupation <- c('banker', 'butcher', 'clerk', 'clerk')
first_obs <- c(1920, 1920, 1920, 1925)
last_obs <- c(1925, 1925, 1925, 1926)
moved_in <- c(1, 0, 1, 1)
moved_out <- c(0, 0, 1, 0)
address <- c('king street', 'market street', 'montgomery road', 'princes ave')
df <- data.frame(individual, occupation, address, first_obs, last_obs, moved_in, moved_out)
df$year <- mapply(seq,df$first_obs,df$last_obs,SIMPLIFY=FALSE)
new_df <- df %>%
  unnest(year) %>%
  select(-first_obs, -last_obs)
As you can see, it looks as if Jolie Hope, for example, moved in and out of her address every year between 1920 and 1925, when she is supposed to have moved in in 1920 and moved out in 1925. Is there a solution for this?
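To make the desired result concrete, here is a rough sketch of what I'm after. It assumes that moved_in should only be kept in the row where year equals first_obs, and moved_out only where year equals last_obs. The name fixed_df is just a placeholder for illustration, and I'm not sure this is the cleanest way to do it.

```r
library(dplyr)
library(tidyr)

# Same example data as above
df <- data.frame(
  individual = c("John Doe", "Peter Gynn", "Jolie Hope", "Jolie Hope"),
  occupation = c("banker", "butcher", "clerk", "clerk"),
  address    = c("king street", "market street", "montgomery road", "princes ave"),
  first_obs  = c(1920, 1920, 1920, 1925),
  last_obs   = c(1925, 1925, 1925, 1926),
  moved_in   = c(1, 0, 1, 1),
  moved_out  = c(0, 0, 1, 0)
)

fixed_df <- df %>%
  # One list element per row holding every year of residence
  mutate(year = mapply(seq, first_obs, last_obs, SIMPLIFY = FALSE)) %>%
  unnest(year) %>%
  # Assumption: a move-in belongs to the first year, a move-out to the last year
  mutate(
    moved_in  = if_else(year == first_obs, moved_in, 0),
    moved_out = if_else(year == last_obs, moved_out, 0)
  ) %>%
  select(-first_obs, -last_obs)
```

With this, Jolie Hope's Montgomery Road rows should only carry moved_in in 1920 and moved_out in 1925, with zeros in between.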
Additionally, I have some problems with duplicated individual-year rows caused by people moving out and in within the same year. For instance, Jolie Hope moved out of Montgomery Road in 1925 and moved in at Princes Avenue in 1925, so she has two rows for 1925. I think the best solution would be to keep only the "moved in" row. Is it possible to systematically remove all the "moved out" rows wherever an individual-year combination is duplicated?
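As an illustration of the kind of de-duplication I mean, here is a minimal sketch on a hand-built excerpt of the unnested data. The names long_df and deduped_df are hypothetical, and the excerpt assumes the dummies are already limited to the actual move years. It keeps one row per individual and year, preferring the row with moved_in equal to 1.

```r
library(dplyr)

# Hand-built excerpt (hypothetical): Jolie Hope around her 1925 move
long_df <- data.frame(
  individual = rep("Jolie Hope", 4),
  address    = c("montgomery road", "montgomery road", "princes ave", "princes ave"),
  year       = c(1924, 1925, 1925, 1926),
  moved_in   = c(0, 0, 1, 0),
  moved_out  = c(0, 1, 0, 0)
)

deduped_df <- long_df %>%
  group_by(individual, year) %>%
  # Keep a single row per individual-year: the one with the highest moved_in,
  # i.e. the "moved in" row whenever there is one
  slice_max(moved_in, n = 1, with_ties = FALSE) %>%
  ungroup()
```

In this excerpt the 1925 Montgomery Road row is dropped and the 1925 Princes Avenue row is kept. If a duplicated year had no "moved in" row at all, this sketch would simply keep the first of the tied rows, which may or may not be what is wanted.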