Consider the following dataset.
clear input year str3 b bx 2001 abc 32 2001 nbc 14 2001 cbs 23 2001 nbc 14 2002 abc 11 2002 nbc 21 2002 wb 19 2002 pbs 30 2003 abc 25 2003 nbc 16 2003 nbc 16 2003 cbs 19 2003 pbs 28 2003 abc 25 end
In this example year is the group variable. What we want to do is to create a dataset that contains all pairs of the variable b within each year. For example, for the year 2001, we want observations that pair nbc with abc and cbs. However, we do not want nbc to be paired with itself. Further, the pair nbc with cbs is considered to be a duplicate permutation of cbs with nbc. We want to keep only one of each duplicate permutation. Hopefully, the data for 2001 will look something like this,
2001 abc cbs 32 23 2001 abc nbc 32 14 2001 cbs nbc 23 14
There is a Stata FAQ page written by Nick Cox (http://www.stata.com/support/faqs/data/pairs.html) that takes a different approach to solving this problem.
Our approach will be to make use of the joinby command. To accomplish this we will have to save the data into a temporary file and change the variable names before using the joinby command.
Here is the code fragment that will create all the possible pairs without duplicate permutations.
/* find and remove duplicates within year */ by year b, sort: gen i=_n keep if i==1 drop i tempfile temp1 save `temp1', replace /* create all possible pairs using joinby */ rename b c /* rename prior to joinby */ rename bx cx /* rename prior to joinby */ joinby year using `temp1' drop if b==c /* drop if b and c are the same */ egen d1=concat(b c) /* create joint pairs in order 1 */ egen d2=concat(c b) /* create joint pairs in order 2 */ replace d1=d2 if b>c /* d1 has pairs in ascending order */ /* find and remove duplicate permutations within year */ by year d1, sort: gen i=_n keep if i==1 /* clean up variables an display */ drop d2 drop i order year b c bx cx clist
Running the above code yields the following output.
year b c bx cx d1 1. 2001 cbs abc 23 32 abccbs 2. 2001 nbc abc 14 32 abcnbc 3. 2001 cbs nbc 23 14 cbsnbc 4. 2002 abc nbc 11 21 abcnbc 5. 2002 pbs abc 30 11 abcpbs 6. 2002 abc wb 11 19 abcwb 7. 2002 nbc pbs 21 30 nbcpbs 8. 2002 wb nbc 19 21 nbcwb 9. 2002 pbs wb 30 19 pbswb 10. 2003 abc cbs 25 19 abccbs 11. 2003 abc nbc 25 16 abcnbc 12. 2003 pbs abc 28 25 abcpbs 13. 2003 nbc cbs 16 19 cbsnbc 14. 2003 cbs pbs 19 28 cbspbs 15. 2003 nbc pbs 16 28 nbcpbs
In the above output, bx is the x value associated with b and cx is the x value associated with c. Variable d1 contains the concatenated pair in asscending order and was used to find duplicate permutations. This approach can be extended to any number of variables associated with each of the paired values.