I'm trying to filter a large data set by a few different variables. Here's a dummy dataset to show what I mean:
df <- data.frame(game_id = c(1,1,2,2,3,3,4,4,5,5,6,6),
                 team = c("a","a","a","a","a","a","b","b","b","b","b","b"),
                 play_id = c(1,2,1,2,1,2,1,2,1,2,1,2),
                 value = c(.2,.6,.9,.7,.5,.5,.4,.6,.5,.9,.2,.8),
                 play_type = c("run","pass","pass","pass","run","pass","run","run","pass","pass","run","run"),
                 qtr = c(1,1,1,1,1,1,1,1,1,1,1,1))
Where:
- game_id = unique identifier of a matchup between two teams
- team = designates which team is on offense. Two teams are assigned to each game_id, and there are over 30 teams total in the real dataset
- play_id = sequential number of individual plays in a game (each game has about 100 plays total, split among the teams)
- value = at any point in the game, the % chance the team on offense has of winning the game
- play_type = strategy used by the offense on that play
- qtr = quarter of the game (a complete game has 4 quarters)
My goal is to find all games in which either team in a matchup had a value of at least .8 at any point in qtr 1. The trick is that I want to mark all the plays leading up to that team's advantage and compare what percentage of them used the "run" strategy vs. the "pass" strategy.
I was able to isolate the teams with such an advantage here:
library(dplyr)
types <- c("run","pass")
# stored under a new name so the original df is not overwritten
adv_teams <- df %>%
  filter(play_type %in% types, qtr == 1, value >= .8) %>%
  distinct(game_id, team)
but I can't figure out how to use that result to get what I need. A for loop doesn't work because the two data sets aren't the same size.
Ideally, I'd create a new data frame containing only the games in which this .8 value occurs at any point in qtr 1 for either team, with a variable that records which team had that advantage on every play_id leading up to it.
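A minimal sketch of one way the result described above could be built with dplyr, offered as an illustration rather than a definitive answer. It rests on a few assumptions that are not stated in the question: only the offense's own value is checked (a defense sitting at .8 or more is not detected), "leading up to" includes the play on which .8 is first reached, and the names first_adv, lead_up, play_mix, adv_team and adv_play_id are made up for this example. It needs dplyr 1.0.0 or later for slice_min().

```r
library(dplyr)

# Dummy data from the question (team letters quoted so they are strings)
df <- data.frame(
  game_id   = c(1,1,2,2,3,3,4,4,5,5,6,6),
  team      = c("a","a","a","a","a","a","b","b","b","b","b","b"),
  play_id   = c(1,2,1,2,1,2,1,2,1,2,1,2),
  value     = c(.2,.6,.9,.7,.5,.5,.4,.6,.5,.9,.2,.8),
  play_type = c("run","pass","pass","pass","run","pass","run","run","pass","pass","run","run"),
  qtr       = c(1,1,1,1,1,1,1,1,1,1,1,1)
)

types <- c("run", "pass")

# Step 1: for each game, the first qtr-1 play where the offense reaches .8
# (assumption: only the offense's value is considered)
first_adv <- df %>%
  filter(play_type %in% types, qtr == 1, value >= 0.8) %>%
  group_by(game_id) %>%
  slice_min(play_id, n = 1, with_ties = FALSE) %>%
  ungroup() %>%
  select(game_id, adv_team = team, adv_play_id = play_id)

# Step 2: keep only those games, and only the qtr-1 plays up to and
# including that play (assumption: the .8 play itself is kept).
# The join replaces the for loop, so differing sizes are not a problem.
lead_up <- df %>%
  inner_join(first_adv, by = "game_id") %>%
  filter(qtr == 1, play_id <= adv_play_id)

# Step 3: share of run vs. pass among the advantaged team's own plays
play_mix <- lead_up %>%
  filter(team == adv_team, play_type %in% types) %>%
  count(adv_team, play_type) %>%
  group_by(adv_team) %>%
  mutate(pct = n / sum(n)) %>%
  ungroup()
```

On the dummy data, lead_up keeps games 2, 5 and 6, each row carrying adv_team. If the percentages are wanted per game rather than per team, adding game_id to the count() and group_by() calls in step 3 should do it.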
Hopefully this makes sense. Thank you all!