Generate Anonymous Keys using Mata | Stata Code Fragments

Version info: Code for this page was tested in Stata 12.1.

This code fragment page shows how to write a Mata function that generates anonymized keys for wide data, such as data in which participants rate other participants or indicate whom they know or are close to.

In this example, we have collected data on some school children. Each child in the study gets a unique ID, and their grade and sex are recorded. They are also asked to nominate their three closest friends. These friends may or may not be in the study. If a friend is in the study, we also have that friend's own record, including the friends he or she nominated. A friend who is not in the study still gets an ID (14 in the example data). Thus although the IDs in the id column are unique, they are not exhaustive. If these IDs could lead back to the kids, we may want to anonymize the data by generating delinked IDs. However, all the IDs need to be recoded consistently, both those in the id column and those in the friend columns. Also note that some children do not have three close friends, so the IDs for some friends are missing. Here is some example data.


/* some example data */
clear
input id grade sex friend1 friend2 friend3
1  5     0   2       3       14
2  6     0   1       3       14
3  6     0   1       2       14
4  5     1   1       2       3
5  7     0   1       .       .
6  6     1   2       4       5
7  5     1   4       6       .
end

/* list the data, Stata style */
list


     +------------------------------------------------+
     | id   grade   sex   friend1   friend2   friend3 |
     |------------------------------------------------|
  1. |  1       5     0         2         3        14 |
  2. |  2       6     0         1         3        14 |
  3. |  3       6     0         1         2        14 |
  4. |  4       5     1         1         2         3 |
  5. |  5       7     0         1         .         . |
     |------------------------------------------------|
  6. |  6       6     1         2         4         5 |
  7. |  7       5     1         4         6         . |
     +------------------------------------------------+

To generate and recode the IDs, we will write a short Mata function called genKey. This function takes a single argument, a string vector containing the names of all the variables that hold IDs to be recoded. Across all the variables specified, genKey builds a list of unique values, assigns an anonymous value to each, recodes the values, and then replaces the values in the Stata dataset with the new recoded IDs. Note that the function can handle missing data and leaves missing IDs as missing.
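
The stacking and de-duplication steps can be tried on their own before the full function is defined. The following is only a sketch on a hypothetical 2 x 3 matrix of IDs (the values are made up for illustration); D and longD are the same names genKey uses internally.

```stata
mata
/* hypothetical matrix of IDs with one missing value, for illustration only */
D = (1, 2, 14 \ 2, 1, .)
/* stack the matrix row by row into a single column */
longD = colshape(D, 1)
/* distinct values in sorted order; missing sorts last */
uniqrows(longD)
/* reshaping back to cols(D) columns recovers the original layout */
colshape(longD, cols(D)) == D
end
```

Because colshape() works row by row in both directions, the recoded column can be folded back into one column per variable without tracking which value came from which variable.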

First we will run the Mata code to define the function. To do this in Stata, open a do-file editor, copy and paste the code, select all of it, and run it at once. This produces no output, but it defines the function so that it is available for use.


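/* Create Mata function to generate anonymous keys */
mata
void genKey(string vector vars) {
  /* read the ID variables and stack all their values into one column */
  D = st_data(., vars)
  longD = colshape(D, 1)

  /* unique values (missing counts as one), put in random order by sorting on uniform draws */
  UID = uniqrows(longD)
  UID = (UID, uniform(rows(UID), 1))
  UID = sort(UID, 2)

  /* lookup: column 1 = original ID, column 2 = new key (position in the random order) */
  lookup = (UID[, 1], (1::rows(UID)))
  /* res: column 1 = original value, column 2 = recoded value, column 3 = row number */
  res = (longD, longD, (1::rows(longD)))

  /* recode every occurrence of each original ID to its new key */
  for (i=1;i<=rows(lookup);i++) {
    index = select(res[, 3], lookup[i, 1] :== res[, 1])
    res[index, 2] = J(rows(index), 1, lookup[i, 2])
  }

  /* missing was given a key like any other value; set those rows back to missing */
  index = select(res[, 3], res[, 1] :== .)
  res[index, 2] = res[index, 1]

  /* fold back to one column per variable, then overwrite the Stata variables */
  res = colshape(res[, 2], cols(D))

  st_store((1, rows(res)), st_varindex(vars), res)
}
end
/* end mata function creation */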

Now it is easy to use this function to recode our data to anonymous IDs. Note that because the IDs are assigned at random, the results will differ every time you run the function unless you set the random-number seed first.
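
If reproducible keys are wanted, one option is to set the seed immediately before the call to genKey. The sketch below assumes that genKey has already been defined and the example data are in memory. It also assumes that Mata's uniform(), which genKey uses, draws from the random-number stream controlled by Stata's set seed; that is worth confirming in your version of Stata. The seed 12345 is an arbitrary value chosen for illustration.

```stata
/* set the seed first (12345 is arbitrary), then generate the keys */
set seed 12345
mata genKey(("id", "friend1", "friend2", "friend3"))
```

Run on a fresh copy of the original data with the same seed, these two lines should reproduce the same keys.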


/* list original data */
list

     +------------------------------------------------+
     | id   grade   sex   friend1   friend2   friend3 |
     |------------------------------------------------|
  1. |  1       5     0         2         3        14 |
  2. |  2       6     0         1         3        14 |
  3. |  3       6     0         1         2        14 |
  4. |  4       5     1         1         2         3 |
  5. |  5       7     0         1         .         . |
     |------------------------------------------------|
  6. |  6       6     1         2         4         5 |
  7. |  7       5     1         4         6         . |
     +------------------------------------------------+

/* generate random IDs and replace in data */
mata genKey(("id", "friend1", "friend2", "friend3"))

/* list data to see updated IDs */
list

     +------------------------------------------------+
     | id   grade   sex   friend1   friend2   friend3 |
     |------------------------------------------------|
  1. |  3       5     0         2         1         8 |
  2. |  2       6     0         3         1         8 |
  3. |  1       6     0         3         2         8 |
  4. |  9       5     1         3         2         1 |
  5. |  5       7     0         3         .         . |
     |------------------------------------------------|
  6. |  6       6     1         2         9         5 |
  7. |  4       5     1         9         6         . |
     +------------------------------------------------+

What happens if we run the function again?


mata genKey(("id", "friend1", "friend2", "friend3"))

list

     +------------------------------------------------+
     | id   grade   sex   friend1   friend2   friend3 |
     |------------------------------------------------|
  1. |  2       5     0         4         5         1 |
  2. |  4       6     0         2         5         1 |
  3. |  5       6     0         2         4         1 |
  4. |  7       5     1         2         4         5 |
  5. |  6       7     0         2         .         . |
     |------------------------------------------------|
  6. |  9       6     1         4         7         6 |
  7. |  3       5     1         7         9         . |
     +------------------------------------------------+

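The values keep getting reshuffled. The function is deliberately simple: the anonymous IDs are just sequential integers, and the randomness comes from the order in which they are handed out, which is set by sorting the unique IDs on draws from a uniform distribution. Because missing is counted as one of the unique values when the integers are assigned and is then set back to missing, one integer goes unused whenever any ID is missing: here the keys range from 1 to 9 although there are only eight distinct IDs.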
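
The key-assignment step can be seen in isolation with a small sketch. It applies the same Mata functions that genKey uses to a hypothetical vector of four IDs; the vector and the names ids and keyed are made up for illustration.

```stata
mata
/* hypothetical IDs, for illustration only */
ids = (1 \ 2 \ 3 \ 14)
/* attach a uniform draw to each ID and sort the rows on the draws */
keyed = sort((ids, uniform(rows(ids), 1)), 2)
/* the row position after sorting becomes the anonymous key */
lookup = (keyed[, 1], (1::rows(keyed)))
lookup
end
```

Column 1 of lookup holds the four original IDs in whatever order the draws produced, and column 2 holds the keys 1 to 4, so rerunning the block will usually pair the IDs with different keys.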

Note that this function can handle an arbitrary number of IDs (rows in the dataset) as well as an arbitrary number of variables containing IDs. Further, the variables need not be in any sort of order nor adjacent to each other.