Original (see Update below)
Assume I have a data set with a low two-digit number of columns (some values NA/empty) and more than 100,000 rows, represented by the following example dataframe:
df <- data.frame(ID = c(1,2,3,4,5),
CTR1 = c("England", "England", "England", "China", "England"),
CTR2 = c("England", "China", "China", "England", NA),
CTR3 = c("England", "China", "China", "England", NA),
CTR4 = c("China", "USA", "USA", "China", NA),
CTR5 = c("USA", "England", "USA", "USA", NA),
CTR6 = c("England", "China", "USA", "England", NA))
df
ID CTR1    CTR2    CTR3    CTR4  CTR5    CTR6
1  England England England China USA     England
2  England China   China   USA   England China
3  England China   China   USA   USA     USA
4  China   England England China USA     England
5  England NA      NA      NA    NA      NA
I want to count the co-occurrences by ID/row to get a co-occurrence matrix that counts each co-occurrence at most once per ID/row. That is, within a row a combination gets a value of 1 if it exists, regardless of how often it appears in that row or in which order, and a value of 0 if it does not exist:
1 England-England-England => 1
2 England-England => 1
3 England-China => 1
4 England- => 0
Another important aspect concerns observations that appear only once in a row but in combination with others, e.g. USA in row 1. They should get a value of 1 for their own co-occurrence (they are part of a combination, even though not with themselves), so that the combination USA-USA also gets a value of 1.
1 England England England China USA England
USA-USA => 1
China-China => 1
USA-China => 1
England-England => 1
England-USA => 1
England-China => 1
Because the count for a combination must not exceed 1 per row/ID, this results in:
        China England USA
China   1     1       1
England 1     1       1
USA     1     1       1
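As a minimal sketch of this per-row rule for row 1 only: the names `row1`, `present` and `row_matrix` are purely illustrative, and the code assumes that "being in a combination" means the row holds at least two distinct countries.

```r
# Row 1 of the example dataframe (without the ID)
row1 <- c("England", "England", "England", "China", "USA", "England")

# Distinct countries in the row; repeats and NAs are ignored
present <- sort(unique(row1[!is.na(row1)]))

# Assumption: a row contributes only if it holds at least two distinct countries
contributes <- as.integer(length(present) >= 2)

# Every pair of present countries (including self-pairs) gets the same 0/1 value
row_matrix <- matrix(contributes,
                     nrow = length(present), ncol = length(present),
                     dimnames = list(present, present))
print(row_matrix)
```

For a row such as row 5, `present` would have length 1 and the resulting 1 x 1 matrix would hold 0.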
This should lead to the following result for the example dataframe, where each combination gets a value of 4 because each combination occurs in four rows (row 5 holds only a single country and is not counted) and each country is part of a combination in those rows:
        China England USA
China   4     4       4
England 4     4       4
USA     4     4       4
So there are five conditions for counting:
- Single observations without additional observations by ID/row are not considered, i.e. a row with only a single country once is not counted.
- A combination should be counted as 1.
- Observations occurring more than once do not contribute to a higher value for the interaction, i.e. several occurrences of the same country do not matter.
- Being in a combination (even in the case the same country does not appear twice in a row) results in counting as a self-combination, i.e. a value of 1 is assigned.
- There is no value over 1 assigned to a combination by row/ID.
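One way the five conditions could be expressed in base R is sketched below. This is only an illustration under my reading of the conditions, not a tested solution for the full data: the helper name `cooccurrence_by_row` and its `id_col` argument are hypothetical. The idea is to build a presence/absence matrix (one row per ID, one column per country), drop rows with fewer than two distinct countries, and let `crossprod()` count the rows in which each pair appears together.

```r
# Example data from the question
df <- data.frame(ID = c(1, 2, 3, 4, 5),
                 CTR1 = c("England", "England", "England", "China", "England"),
                 CTR2 = c("England", "China", "China", "England", NA),
                 CTR3 = c("England", "China", "China", "England", NA),
                 CTR4 = c("China", "USA", "USA", "China", NA),
                 CTR5 = c("USA", "England", "USA", "USA", NA),
                 CTR6 = c("England", "China", "USA", "England", NA),
                 stringsAsFactors = FALSE)

# Hypothetical helper: number of rows in which each pair of countries co-occurs
cooccurrence_by_row <- function(data, id_col = "ID") {
  vals <- as.matrix(data[, setdiff(names(data), id_col), drop = FALSE])
  countries <- sort(unique(vals[!is.na(vals)]))

  # Presence/absence per row: duplicates within a row collapse to a single TRUE
  inc <- matrix(FALSE, nrow = nrow(vals), ncol = length(countries),
                dimnames = list(NULL, countries))
  for (ctr in countries) {
    inc[, ctr] <- rowSums(vals == ctr, na.rm = TRUE) > 0
  }

  # Rows with a single distinct country are not counted
  inc <- inc[rowSums(inc) >= 2, , drop = FALSE]

  # t(inc) %*% inc: each remaining row adds exactly 1 to every pair it contains,
  # including the self-pairs on the diagonal
  crossprod(inc * 1L)
}

result <- cooccurrence_by_row(df)
print(result)
```

On the example dataframe this gives a 3 x 3 matrix with 4 in every cell. Because the loop runs over the distinct countries rather than over the rows, it should stay manageable for more than 100,000 rows as long as the number of distinct countries is moderate.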
I've tried to implement this using dplyr, data.table, base aggregate, or plyr, adjusting code from [1], [2], [3], [4], [5] and [6]. But since I don't care about order within a row and also don't want to sum up all combinations within a row, I haven't got the desired result so far.
I'm a novice in R. Any help is very much appreciated.
Update
Thanks to @jazzurro for his answer. It made me realize that the duplicates may complicate things. I hope that keeping only unique values per row simplifies the task, so that, after applying code from here, the example dataframe looks like this:
df <- data.frame(ID = c(1,2,3,4,5),
CTR1 = c("England", "England", "England", "China", "England"),
CTR2 = c("China", "China", "China", "England", NA),
CTR3 = c("USA", "USA", "USA", "USA", NA),
CTR4 = c(NA, NA, NA, NA, NA),
CTR5 = c(NA, NA, NA, NA, NA),
CTR6 = c(NA, NA, NA, NA, NA))
ID CTR1    CTR2    CTR3 CTR4 CTR5 CTR6
1  England China   USA  NA   NA   NA
2  England China   USA  NA   NA   NA
3  England China   USA  NA   NA   NA
4  China   England USA  NA   NA   NA
5  England NA      NA   NA   NA   NA
Leaving only three conditions to be fulfilled:
- Single observations without additional observations by ID/row are not considered, i.e. a row with only a single country once is not counted.
- A combination should be counted as 1.
- Being in a combination results in counting as a self-combination as well (USA-USA), i.e. a value of 1 is assigned.
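For the deduplicated data, the three remaining conditions might be sketched with a long-format table as follows. The intermediate names (`ctr_cols`, `long`, `inc`, `co`) are illustrative, and the data are entered to match the table above; I am assuming that the ID column is called `ID` and that all other columns hold countries.

```r
# Deduplicated example data, values as in the table above
df <- data.frame(ID = c(1, 2, 3, 4, 5),
                 CTR1 = c("England", "England", "England", "China", "England"),
                 CTR2 = c("China", "China", "China", "England", NA),
                 CTR3 = c("USA", "USA", "USA", "USA", NA),
                 CTR4 = NA, CTR5 = NA, CTR6 = NA,
                 stringsAsFactors = FALSE)

ctr_cols <- setdiff(names(df), "ID")

# Long format: one (ID, country) pair per line, stacked column by column
long <- data.frame(ID = rep(df$ID, times = length(ctr_cols)),
                   country = unlist(df[ctr_cols], use.names = FALSE),
                   stringsAsFactors = FALSE)

# Drop empty cells; unique() also guards against leftover duplicates
long <- unique(long[!is.na(long$country), ])

# ID x country presence matrix (0/1 because the pairs are unique)
inc <- unclass(table(long$ID, long$country))

# Condition 1: rows with a single country are not counted
inc <- inc[rowSums(inc) >= 2, , drop = FALSE]

# Conditions 2 and 3: each remaining row adds 1 per pair, self-pairs included
co <- crossprod(inc)
print(co)
```

With the example data, `co` again holds 4 for every pair, matching the expected result from the original question.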