Say that I have these data:
clear all
input n str6 G1 str6 G2 v desired computed
1 "B" "A" 1 5 .
2 "A" "A" 2 5.5 .
3 "C" "A" 3 4.5 .
4 "A" "B" 4 2 .
5 "B" "B" 5 2.5 .
6 "C" "B" 6 1.5 .
end
n is the observation number, G1 is group 1, G2 is group 2 (say class 1 and class 2), and v is the value. desired is the desired output, and computed will hold the attempt at the desired output.
My goal is to perform, in Stata, an operation (in this example, an average) for each observation over all observations that had no contact with it, i.e., that are in neither the same G1 nor the same G2 as that observation. This also excludes the observation itself. For example, desired for observation 1 is the average of v over observations 4 and 6. (Observations 1, 2, and 3 are excluded because they share observation 1's G2; observation 5 is excluded because it shares observation 1's G1.) So we sum the v of observations 4 and 6 to get 4+6=10 and divide by the number of observations, 2, to get 5.
I think I can get what I want with the following code:
local N = _N
forvalues i = 1/`N' {
    preserve
    * create temp, which, when equal to 1, indicates the observations to make the calculation on
    gen temp = 1
    * save locals equal to the first and second group of `i'
    local temp_G1 = G1[`i']
    local temp_G2 = G2[`i']
    * make temp = 0 for observations that were in first and/or second group as `i'
    replace temp = 0 if G1=="`temp_G1'"
    replace temp = 0 if G2=="`temp_G2'"
    * compute sum on observations that have a temp equal to 1
    egen sum = total(v) if temp==1
    * fill in the sum for all obs
    egen sum_all = max(sum)
    * compute number in group
    egen num = total(temp) if temp==1
    egen num_all = max(num)
    * show the count (num is a variable, not a local, so display the filled-in value)
    display num_all[`i']
    * save the value of the average in a local
    local calc = sum_all[`i']/num_all[`i']
    restore
    * fill in the value from the local for row `i'
    replace computed = `calc' in `i'
}
However, this approach seems very long and inelegant. Is there a better way to go about this in Stata? I thought about using bys, but I couldn't figure it out. If it were only G1 or G2, I think it would be easier, but both together seem problematic because of double counting: bys might include observations both in the G1 count and in the G2 count.
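To make the double-counting concern concrete: an observation that shares both G1 and G2 with the current one is subtracted twice if the G1 and G2 totals are removed separately, so it would have to be added back once (inclusion-exclusion). The following is only a sketch of that idea on the example data, not a confirmed solution. The helper names tot_* and n_* are hypothetical, and it assumes the operation is a plain average of v.

```stata
clear
input n str6 G1 str6 G2 v desired computed
1 "B" "A" 1 5 .
2 "A" "A" 2 5.5 .
3 "C" "A" 3 4.5 .
4 "A" "B" 4 2 .
5 "B" "B" 5 2.5 .
6 "C" "B" 6 1.5 .
end

* totals and nonmissing counts of v: overall, within G1, within G2,
* and within each G1-G2 cell (tot_* and n_* are hypothetical helper names)
egen double tot_all = total(v)
egen long n_all = count(v)
bysort G1: egen double tot_g1 = total(v)
bysort G1: egen long n_g1 = count(v)
bysort G2: egen double tot_g2 = total(v)
bysort G2: egen long n_g2 = count(v)
bysort G1 G2: egen double tot_g12 = total(v)
bysort G1 G2: egen long n_g12 = count(v)

* inclusion-exclusion: drop everything sharing G1, drop everything sharing G2,
* then add back the G1-G2 cell, which was subtracted twice
* (if no observation remains, the denominator is 0 and computed stays missing)
replace computed = (tot_all - tot_g1 - tot_g2 + tot_g12) / ///
    (n_all - n_g1 - n_g2 + n_g12)

* restore the original order and compare with the hand-computed column
sort n
assert abs(computed - desired) < 1e-6
```

For observation 1 this gives (21 - 6 - 6 + 1) / (6 - 2 - 3 + 1) = 10 / 2 = 5, matching the worked example. Whether this generalizes beyond sums and averages (for example, to a median) is unclear, since the trick relies on totals being additive.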
I guess another way to ask the question is whether there is a way to apply functions to each observation/row, like R's apply family, or whether I need to use a clumsy loop approach as I do here.