Channel: Active questions tagged r - Stack Overflow

Matching and filling in blanks of a data frame in R


I have data with duplicate entries that looks like this:

+-----+-------+-----------+-----------+--------+
| id  | first |   last    | birthyear | father |
+-----+-------+-----------+-----------+--------+
| a12 | linda | john      | 1991      | NA     |
| 3n8 | max   | well      | 1915      | NA     |
| 15z | linda | NA        | 1991      | dan    |
| 1y9 | pam   | degeneres | 1855      | NA     |
| 84z | NA    | degeneres | 1950      | hank   |
| 9i5 | max   | well      | NA        | mike   |
+-----+-------+-----------+-----------+--------+

There are multiple entries for a single person, but each entry has unique data that needs to be preserved. I want to merge these entries, keeping all information. Only the "id" column does not have to match; I want to keep the first "id" entry in the list as the final "id". So my final data frame would look like this:

+-----+-------+-----------+-----------+--------+
| id  | first |   last    | birthyear | father |
+-----+-------+-----------+-----------+--------+
| a12 | linda | john      | 1991      | dan    |
| 3n8 | max   | well      | 1915      | mike   |
| 1y9 | pam   | degeneres | 1855      | NA     |
| 84z | NA    | degeneres | 1950      | hank   |
+-----+-------+-----------+-----------+--------+

In this example, the two entries with last name "degeneres" were not merged because their birth years do not match. Entries whose known values all agree (ignoring NAs) were merged.
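The matching rule described above can be sketched in base R as follows. This is only an illustration under one assumption: two rows count as the same person when, for every column except "id", the values are equal or at least one of them is NA. The helper name rows_compatible and the variables cols, pairs and matching_pairs are made up for this sketch.

```r
# Example data from the question (character columns, NA = unknown)
df <- data.frame(
  id        = c("a12", "3n8", "15z", "1y9", "84z", "9i5"),
  first     = c("linda", "max", "linda", "pam", NA, "max"),
  last      = c("john", "well", NA, "degeneres", "degeneres", "well"),
  birthyear = c("1991", "1915", "1991", "1855", "1950", NA),
  father    = c(NA, NA, "dan", NA, "hank", "mike"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: TRUE when rows i and j never disagree on a known value
rows_compatible <- function(data, i, j, cols) {
  a <- unlist(data[i, cols])
  b <- unlist(data[j, cols])
  # a comparison involving NA is treated as "no conflict"
  all(is.na(a) | is.na(b) | a == b)
}

cols  <- c("first", "last", "birthyear", "father")   # every column except id
pairs <- t(combn(nrow(df), 2))                       # all unordered row pairs
keep  <- apply(pairs, 1, function(p) rows_compatible(df, p[1], p[2], cols))
matching_pairs <- pairs[keep, , drop = FALSE]
```

For the example data, matching_pairs holds rows 1 and 3 (the two linda entries) and rows 2 and 6 (the two max entries); the two "degeneres" rows are not paired because their birth years differ. Comparing every pair of rows grows quadratically with the number of rows, so this is meant for checking the rule rather than for large data sets.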

So far, the furthest I have got is generating a list of rows grouped by matching first names:

df <- data.frame(
  id        = c("a12", "3n8", "15z", "1y9", "84z", "9i5"),
  first     = c("linda", "max", "linda", "pam", NA, "max"),
  last      = c("john", "well", NA, "degeneres", "degeneres", "well"),
  birthyear = c("1991", "1915", "1991", "1855", "1950", NA),
  father    = c(NA, NA, "dan", NA, "hank", "mike"),
  stringsAsFactors = FALSE
)

# One list element per distinct, non-missing first name;
# which() drops the NA comparisons that would otherwise add all-NA rows
name_list <- list()
for (n in unique(na.omit(df$first))) {
  name_list[[n]] <- df[which(df$first == n), ]
}

I also tried to apply merge in a meaningful way, but that does not give me the desired results:

merge(x = df, y = df, by = c("first", "last", "birthyear", "father"))

+---------+-----------+-----------+--------+------+------+
|   first |   last    | birthyear | father | id.x | id.y |
+---------+-----------+-----------+--------+------+------+
| linda   | john      | 1991      | <NA>   | a12  | a12  |
| linda   | NA        | 1991      | dan    | 15z  | 15z  |
| max     | well      | 1915      | NA     | 3n8  | 3n8  |
| max     | well      | NA        | mike   | 9i5  | 9i5  |
| NA      | degeneres | 1950      | hank   | 84z  | 84z  |
| pam     | degeneres | 1855      | NA     | 1y9  | 1y9  |
+---------+-----------+-----------+--------+------+------+

How could I best proceed?
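For the merging step itself, one possible sketch is to take, for each column, the first non-missing value within a group of rows already judged to be the same person. That keeps the first "id" and fills the blanks from later rows. The helpers first_known and collapse_rows are hypothetical names used only here, and the sketch assumes the grouping has already been decided by some other step.

```r
# Hypothetical helper: first non-missing value of a vector (typed NA if none)
first_known <- function(x) {
  known <- x[!is.na(x)]
  if (length(known) > 0) known[1] else x[1]
}

# Hypothetical helper: collapse rows judged to be one person into a single row
collapse_rows <- function(rows) {
  as.data.frame(lapply(rows, first_known), stringsAsFactors = FALSE)
}

# The two "linda" rows from the example data
linda <- data.frame(
  id        = c("a12", "15z"),
  first     = c("linda", "linda"),
  last      = c("john", NA),
  birthyear = c("1991", "1991"),
  father    = c(NA, "dan"),
  stringsAsFactors = FALSE
)

merged <- collapse_rows(linda)
```

Here merged is a single row with id "a12", last name "john" and father "dan", which matches the first row of the desired output.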

EDIT:

Thanks for the responses so far! Just to be clear: I don't want to be conservative in determining which row describes a unique person. For example, this input:

+-----+-------+------+-----------+--------+
| id  | first | last | birthyear | father |
+-----+-------+------+-----------+--------+
| 9i5 | max   | well | NA        | mike   |
| 9i6 | dan   | well | NA        | mike   |
| 9i7 | NA    | well | NA        | NA     |
+-----+-------+------+-----------+--------+

needs to give this output:

+-----+-------+------+-----------+--------+
| id  | first | last | birthyear | father |
+-----+-------+------+-----------+--------+
| 9i5 | max   | well | NA        | mike   |
| 9i6 | dan   | well | NA        | mike   |
+-----+-------+------+-----------+--------+

EDIT2:

So I've used Adam's script on my data set. It works great; there is only a hiccup because of exactly the problem that Salix predicted/found. I have a row with very little data about a woman named Linda. It turns out there are two Lindas who are definitely distinct people, and a third entry named Linda with no further information.

The script is now trying to match the unknown Linda to both of the other two Lindas. I've traced the issue down to a collision in the merge_id object. For my data set, it looks like this:

+------+------+
| V1   | V2   |
+------+------+
|  188 |  916 |
|  188 | 1048 |
|  752 | 1048 |
|  916 | 1048 |
| 1048 | 1058 |
+------+------+

As you can see, person 1048 matches with people who do not match with each other. For example, 188 - 916 - 1048 could all be the same person, because 188 matches 916, 188 matches 1048 and 916 matches 1048. All fine.

But then person 752 also matches with 1048, but does not match with 188 or 916. Ergo, 1048 does not have enough information and needs to be deleted.

I'm trying to come up with a function that detects this collision and deletes 1048 from the dataset.
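One way such a check could look, as a rough sketch: treat merge_id as a two-column matrix of matching pairs and flag every id whose partners do not all match one another. The function name find_ambiguous and the variable merge_id_clean are hypothetical, and the matrix below is simply the table above typed in by hand; the real merge_id object produced by the script may be structured differently.

```r
# Matching pairs copied from the merge_id table above
merge_id <- matrix(
  c( 188,  916,
     188, 1048,
     752, 1048,
     916, 1048,
    1048, 1058),
  ncol = 2, byrow = TRUE
)

# Hypothetical helper: ids whose partners do not all match one another
find_ambiguous <- function(pairs) {
  # order-independent key for a pair, e.g. "188-916"
  pair_key <- function(a, b) paste(pmin(a, b), pmax(a, b), sep = "-")
  known <- pair_key(pairs[, 1], pairs[, 2])
  ids <- unique(as.vector(pairs))
  is_ambiguous <- vapply(ids, function(id) {
    # everyone this id is paired with, from either column
    partners <- c(pairs[pairs[, 1] == id, 2], pairs[pairs[, 2] == id, 1])
    if (length(partners) < 2) return(FALSE)
    partner_pairs <- combn(partners, 2)
    # ambiguous if some two partners are not themselves a matching pair
    !all(pair_key(partner_pairs[1, ], partner_pairs[2, ]) %in% known)
  }, logical(1))
  ids[is_ambiguous]
}

ambiguous <- find_ambiguous(merge_id)

# Drop every pair that involves an ambiguous id
merge_id_clean <- merge_id[!(merge_id[, 1] %in% ambiguous |
                             merge_id[, 2] %in% ambiguous), , drop = FALSE]
```

For the table above, only 1048 is flagged: its partners 188 and 752 do not match each other. After removing it, the single pair 188 - 916 remains. Whether the flagged row should then be deleted from the data set or kept as a separate, unmerged entry is a separate decision that this sketch does not make.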

