I have data with duplicate entries that looks like this:
+-----+-------+-----------+-----------+--------+
| id  | first | last      | birthyear | father |
+-----+-------+-----------+-----------+--------+
| a12 | linda | john      | 1991      | NA     |
| 3n8 | max   | well      | 1915      | NA     |
| 15z | linda | NA        | 1991      | dan    |
| 1y9 | pam   | degeneres | 1855      | NA     |
| 84z | NA    | degeneres | 1950      | hank   |
| 9i5 | max   | well      | NA        | mike   |
+-----+-------+-----------+-----------+--------+
There are multiple entries for a single person, but each entry has unique data that needs to be preserved. I want to merge these entries, keeping all information. Only the "id" column does not have to match; I want to keep the first "id" in the list as the final "id". So my final data frame would look like this:
+-----+-------+-----------+-----------+--------+
| id  | first | last      | birthyear | father |
+-----+-------+-----------+-----------+--------+
| a12 | linda | john      | 1991      | dan    |
| 3n8 | max   | well      | 1915      | mike   |
| 1y9 | pam   | degeneres | 1855      | NA     |
| 84z | NA    | degeneres | 1950      | hank   |
+-----+-------+-----------+-----------+--------+
In this example, the two entries with last name "degeneres" did not get merged because their birthyear values do not match. Entries whose values all match (ignoring NAs) did get merged.
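To make the rule concrete, here is a rough sketch of how I think of it for a single pair of entries. The helper names compatible and combine are made up for illustration (they are not from any package), the "id" column is deliberately left out, and I am assuming that two entries must share at least one non-NA value before they count as a match. It only covers one pair at a time, not the whole data frame.

```r
# Hypothetical helper: two entries are "compatible" when they share at least
# one non-NA field and none of the shared fields disagree.
compatible <- function(a, b) {
  both <- !is.na(a) & !is.na(b)   # fields filled in on both sides
  any(both) && all(a[both] == b[both])
}

# Hypothetical helper: fill the NAs of the first entry from the second.
combine <- function(a, b) {
  a[is.na(a)] <- b[is.na(a)]
  a
}

# Four of the entries from the example, without the "id" column
linda1 <- c(first = "linda", last = "john", birthyear = "1991", father = NA)
linda2 <- c(first = "linda", last = NA, birthyear = "1991", father = "dan")
pam    <- c(first = "pam", last = "degeneres", birthyear = "1855", father = NA)
hank   <- c(first = NA, last = "degeneres", birthyear = "1950", father = "hank")

compatible(linda1, linda2)  # TRUE: no conflicting values
compatible(pam, hank)       # FALSE: birthyear differs
combine(linda1, linda2)     # linda, john, 1991, dan
```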
So far, the furthest I have got is generating a list grouped by matching first names:
df <- data.frame(
  id = c("a12", "3n8", "15z", "1y9", "84z", "9i5"),
  first = c("linda", "max", "linda", "pam", NA, "max"),
  last = c("john", "well", NA, "degeneres", "degeneres", "well"),
  birthyear = c("1991", "1915", "1991", "1855", "1950", NA),
  father = c(NA, NA, "dan", NA, "hank", "mike"),
  stringsAsFactors = FALSE
)
# Collect the rows that share each distinct, non-NA first name
name_list <- list()
for (n in unique(na.omit(df$first))) {
  # which() drops the NA comparisons that would otherwise yield all-NA rows
  name_list[[n]] <- df[which(df$first == n), ]
}
I also tried to apply merge in a meaningful way, but that does not give me the desired results:
merge(x = df, y = df, by = c("first", "last", "birthyear", "father"))
+---------+-----------+-----------+--------+------+------+
| first   | last      | birthyear | father | id.x | id.y |
+---------+-----------+-----------+--------+------+------+
| linda   | john      | 1991      | NA     | a12  | a12  |
| linda   | NA        | 1991      | dan    | 15z  | 15z  |
| max     | well      | 1915      | NA     | 3n8  | 3n8  |
| max     | well      | NA        | mike   | 9i5  | 9i5  |
| NA      | degeneres | 1950      | hank   | 84z  | 84z  |
| pam     | degeneres | 1855      | NA     | 1y9  | 1y9  |
+---------+-----------+-----------+--------+------+------+
How could I best proceed?