### Introduction

Stata has two built-in variables called **_n** and **_N**. **_n** is Stata notation for the current observation number. **_n** is 1 in the first observation, 2 in the second, 3 in the third, and so on.

**_N** is Stata notation for the total number of observations. Let’s see how **_n** and **_N** work.

    input score group
    72 1
    84 2
    76 1
    89 3
    82 2
    90 1
    85 1
    end

    generate id = _n
    generate nt = _N

    list

         score  group  id  nt
      1.    72      1   1   7
      2.    84      2   2   7
      3.    76      1   3   7
      4.    89      3   4   7
      5.    82      2   5   7
      6.    90      1   6   7
      7.    85      1   7   7

As you can see, the variable **id** contains the observation number, running from 1 to 7, and **nt** contains the total number of observations, which is 7 in every row.

### Counting with by

Using **_n** and **_N** in conjunction with the **by** command can produce some very useful results. Of course, to use the **by** command we must first sort our data on the **by** variable.

    sort group score
    by group: generate n1 = _n
    by group: generate n2 = _N

    list

         score  group  id  nt  n1  n2
      1.    72      1   1   7   1   4
      2.    76      1   3   7   2   4
      3.    85      1   7   7   3   4
      4.    90      1   6   7   4   4
      5.    82      2   5   7   1   2
      6.    84      2   2   7   2   2
      7.    89      3   4   7   1   1

Now **n1** is the observation number within each group and **n2** is the total number of observations for each group.

To **list** the lowest score for each group use the following:

    list if n1==1

         score  group  id  nt  n1  n2
      1.    72      1   1   7   1   4
      5.    82      2   5   7   1   2
      7.    89      3   4   7   1   1

To **list** the highest score for each group use the following:

    list if n1==n2

         score  group  id  nt  n1  n2
      4.    90      1   6   7   4   4
      6.    84      2   2   7   2   2
      7.    89      3   4   7   1   1
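The same lowest and highest scores can also be reached without the helper variables **n1** and **n2**, by subscripting inside each **by** group. The following is a minimal sketch, assuming the score data from above are still in memory; the variable names `lowest` and `highest` are illustrative and do not appear in the original example.

```stata
* Sort by score within group so the first row of each group holds
* the lowest score and the last row (row _N) holds the highest
sort group score

* Inside a by-group, score[1] is the first row of that group and
* score[_N] is the last row of that group
by group: generate lowest  = score[1]
by group: generate highest = score[_N]

* Same selection as "list if n1==n2", without creating n1 and n2
by group: list score group if _n == _N
```

Because **_n** and **_N** are evaluated within each group under **by**, the condition `_n == _N` picks out the last observation of every group.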

### Another use of _n

Let’s use **_n** to find out if there are duplicate **id** numbers in the following data:

    input id score
    117 72
    204 84
    311 76
    289 89
    141 82
    277 90
    465 85
    289 88
    182 84
    end

    sort id
    list if id == id[_n + 1]

          id  score
      6. 289     88

    list in 6/7

          id  score
      6. 289     88
      7. 289     89

As it turns out, observations 6 and 7 have the same **id** number but different **score** values.
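Note that the comparison `id == id[_n + 1]` flags only the first row of each duplicated pair, which is why a second **list** was needed to see both rows. As an alternative sketch, **_N** within **by id** marks every row that shares an **id**; the variable name `ndup` is an illustrative choice, not part of the original example.

```stata
* Count how many observations share each id
sort id
by id: generate ndup = _N

* List every observation whose id occurs more than once
list if ndup > 1
```

With the data above, this should list both observations with **id** 289 in a single step.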

### Finding Duplicates

Now let’s use **_N** to find duplicate observations.

    input id score x1 x2 y1 y2 z1 z2
    117 72 3 16 42 7 59 61
    204 84 6 12 44 9 51 66
    141 82 2 17 41 5 56 61
    311 76 9 14 46 1 58 62
    289 89 4 13 48 3 55 68
    141 82 2 17 41 5 56 61
    277 90 3 12 44 6 52 65
    465 85 5 19 43 2 54 64
    289 88 7 18 45 4 58 69
    182 84 1 11 47 7 52 61
    141 90 4 13 43 4 51 65
    end

    sort id score x1 x2 y1 y2 z1 z2
    by id score x1 x2 y1 y2 z1 z2: generate n = _N

    list if n>1

    Observation 2

        id 141    score 82    x1 2     x2 17
        y1 41     y2 5        z1 56    z2 61
        n 2

    Observation 3

        id 141    score 82    x1 2     x2 17
        y1 41     y2 5        z1 56    z2 61
        n 2

In this example we **sort** the observations by all of the variables. Then we use all of the variables in the **by** statement and set **n** equal to **_N**, the number of observations in each group of identical observations. Finally, we list the observations for which **n** is greater than 1, thereby identifying the duplicate observations.

If you have a lot of variables in the dataset, it could take a long time to type them all out twice. We can use the “*” wildcard to indicate that we wish to use all the variables. Further, in recent versions of Stata we can combine **sort** and **by** into a single **bysort** statement. Below is a simplified version of the code that will yield exactly the same results as above.

    bysort * : generate n = _N
    list if n>1
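Stata also provides a built-in **duplicates** command that covers the same ground as the **_N** approach. The lines below are a hedged sketch of how it might be applied to this dataset rather than part of the original example; the variable name `dupcount` is illustrative.

```stata
* Report how many observations are exact copies across all variables
duplicates report

* Tag each observation with the number of additional copies it has
* (0 means the observation is unique); dupcount is an illustrative name
duplicates tag, generate(dupcount)
list if dupcount > 0

* Keep a single copy of each set of fully identical observations
duplicates drop
```

Note that `dupcount` counts the extra copies, so it equals the **_N**-based count minus 1, and that `duplicates drop` changes the data in memory, so it is worth running only after inspecting the listed duplicates.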