I have a data frame with partial duplicates. Rows are duplicated on two columns (dates and co.name), but the values in the remaining columns differ. I would like to "flip a coin" and keep one row from each duplicate pair at random, since there is no way to validate which one is correct.
I've thought of subsetting the data frame by dates and co.name and then merging that back onto the original, keeping only one side, but I was wondering if there is a better way.
dates <- c(rep("2019-06-17", 2), rep("2016-01-11", 2), rep("2016-04-11", 2), "2016-04-12", "2016-04-12")
co.name <- c(rep("co1", 2), rep("co2", 2), rep("co1", 2), "co1", "co2")
total <- c(10, 10, 15, 12, 10, 9, 12, 14)
new.products <- c(3, 0, 4, 0, 2, 0, 1, 4)
df <- data.frame(dates, co.name, total, new.products)
df
       dates co.name total new.products
1 2019-06-17     co1    10            3
2 2019-06-17     co1    10            0
3 2016-01-11     co2    15            4
4 2016-01-11     co2    12            0
5 2016-04-11     co1    10            2
6 2016-04-11     co1     9            0
7 2016-04-12     co1    12            1
8 2016-04-12     co2    14            4
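Listing the duplicated rows:
library(dplyr)

# Keep only rows whose (co.name, dates) pair occurs exactly twice
df %>%
  group_by(co.name, dates) %>%
  filter(n() == 2)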
# A tibble: 6 x 4
# Groups:   co.name, dates [3]
  dates      co.name total new.products
  <fct>      <fct>   <dbl>        <dbl>
1 2019-06-17 co1        10            3
2 2019-06-17 co1        10            0
3 2016-01-11 co2        15            4
4 2016-01-11 co2        12            0
5 2016-04-11 co1        10            2
6 2016-04-11 co1         9            0
Expected output:
# A tibble: 5 x 4
  dates      co.name total new.products
  <fct>      <fct>   <dbl>        <dbl>
1 2019-06-17 co1        10            0
2 2016-01-11 co2        12            0
3 2016-04-11 co1         9            0
4 2016-04-12 co1        12            1
5 2016-04-12 co2        14            4
Or
# A tibble: 5 x 4
  dates      co.name total new.products
  <fct>      <fct>   <dbl>        <dbl>
1 2019-06-17 co1        10            3
2 2016-01-11 co2        15            4
3 2016-04-11 co1        10            2
4 2016-04-12 co1        12            1
5 2016-04-12 co2        14            4
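For reference, one way the "coin flip" might be done without the subset-and-merge step is to sample one row per group. This is only a sketch, assuming dplyr 1.0.0 or later (where slice_sample() is available); the name deduped and the fixed seed are my own additions for illustration and are not part of the original data or workflow.

```r
library(dplyr)

# Same example data as in the question
dates <- c(rep("2019-06-17", 2), rep("2016-01-11", 2), rep("2016-04-11", 2),
           "2016-04-12", "2016-04-12")
co.name <- c(rep("co1", 2), rep("co2", 2), rep("co1", 2), "co1", "co2")
total <- c(10, 10, 15, 12, 10, 9, 12, 14)
new.products <- c(3, 0, 4, 0, 2, 0, 1, 4)
df <- data.frame(dates, co.name, total, new.products)

# Assumption: a fixed seed, used only so the random pick is reproducible
set.seed(42)

# "deduped" is a hypothetical name for the result
deduped <- df %>%
  group_by(dates, co.name) %>%
  slice_sample(n = 1) %>%  # pick one row at random within each (dates, co.name) group
  ungroup()

deduped
```

Groups with a single row pass through unchanged, so the result should have one row per (dates, co.name) pair, five rows for this data. Note that the rows come back ordered by the grouping columns rather than in their original order.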