How can I quickly convert many string variables to numeric variables?

There may be times that you receive a file that has many (or all) of the variables defined as strings, that is, character variables. The variables may contain numeric values, but if they are defined as type string, there are very few things you can do to analyze the data. You cannot get means, you cannot do a regression, you cannot do an ANOVA, etc… Sometimes the dataset contains numerical values that are stored as strings. We will address this scenario first. Then we will address the case where the string variables actually contain strings, and the goal is to assign each value the string takes on to a numeric value.

All of the examples on this page use the same dataset, so let’s start by examining the data. The example dataset, hsbs, is a subset of the High School and Beyond data file with all of the variables as string variables.

use https://stats.idre.ucla.edu/stat/stata/faq/hsbs, clear

As you see from the describe command below, the variables are all defined as string variables (e.g., science is str2, a string of length 2).

describe

Contains data from hsbs.dta
  obs:            20                          
 vars:             6                          25 Nov 1999 19:09
 size:           400 (99.6% of memory free)
-------------------------------------------------------------------------------
   1. id        str3   %3s                    
   2. gender    str1   %5s                    
   3. race      str5   %9s                    
   4. schtyp    str3   %5s                    
   5. read      str2   %5s                    
   6. science   str2   %5s                    
-------------------------------------------------------------------------------
Sorted by:

Now that we know the variables are string variables, we can use the list command to see what the strings stored in these variables look like. Although the variable science is defined as str2, you can see from the list below that it contains just numeric values. Even so, because the variable is defined as str2, Stata cannot perform any kind of numerical analysis of the variable science. The same is true for the variable read.

list

      id  gender       race  schtyp   read  science 
  1.  70      m          1    pub     45     47  
  2. 121      f          1    pub     68     63  
  3.  86      m          1    pub     44     58  
  4. 141      m          1    pub     63     53  
  5. 172      m          1    pub     47     53  
  6. 113      m          1    pub     44     63  
  7.  50      m          3    pub     50     53  
  8.  11      m          2    pub     34     39  
  9.  84      m          1    pub     63      .  
 10.  48      m          3    pub     57     50  
 11.  75      m          1    pub     60     53  
 12.  60      m          X    pub     57     63  
 13.  95      m          1    pub     73     61  
 14. 104      m          1    pub     54     55  
 15.  38      m          3    pub     45     31  
 16. 115      m          1    pub     42     50  
 17.  76      m          1    pub     47     50  
 18. 195      m          1    pri     57         
 19. 114      m          1    pub     68     55  
 20.  85      m          1    pub     55     53

Converting string variables with numeric values

One method of converting numbers stored as strings into numerical variables is to use a string function called real that translates numeric values stored as strings into numeric values Stata can recognize as such. The first line of syntax reads in the dataset shown above. The second generates a new variable read_n that is equal to the value of the number stored in the string variable read. The real(s) is the function that translates the values held as strings, where s is the variable containing strings.

gen read_n = real(read)
list read read_n in 1/5

     +---------------+
     | read   read_n |
     |---------------|
  1. |   45       45 |
  2. |   68       68 |
  3. |   44       44 |
  4. |   63       63 |
  5. |   47       47 |
     +---------------+

A second method of achieving the same result is the command destring. Let’s try using the destring command and see how it works. The first line of syntax loads the dataset again, so that we are starting with a dataset containing only string variables again. The second line of syntax runs the destring command.

use https://stats.idre.ucla.edu/stat/stata/faq/hsbs, clear
destring , replace

id has all characters numeric; replaced as int
gender contains non-numeric characters; no replace
race contains non-numeric characters; no replace
schtyp contains non-numeric characters; no replace
read has all characters numeric; replaced as byte
science has all characters numeric; replaced as byte
(2 missing values generated)

As you can see from the describe command below, the destring command converted all of the variables to numeric, except for race, gender and schtyp. Since these variables had characters in them, the destring command left such variables alone. If there had been any numeric variables in the dataset, they would remain unchanged.

describe

Contains data from https://stats.idre.ucla.edu/stat/stata/faq/hsbs.dta
  obs:            20                          
 vars:             6                          25 Nov 1999 19:09
 size:           340 (98.5% of memory free)
-------------------------------------------------------------------------------
              storage  display     value
variable name   type   format      label      variable label
-------------------------------------------------------------------------------
id              int    %10.0g                 
gender          str1   %5s                    
race            str5   %9s                    
schtyp          str3   %5s                    
read            byte   %10.0g                 
science         byte   %10.0g                 
-------------------------------------------------------------------------------
Sorted by:  
     Note:  dataset has changed since last saved

Both of the techniques described above have attributes that in some situations are advantages and in other situations may be disadvantages. The command destring can be run on an entire dataset in one step, the method using the function “real” requires issuing a command for each variable to be converted (although this can be done with a loop rather than typing out the syntax for each variable). One potential advantage to using the function “real” (the first method) is that if the function “real” encounters a non-numeric value, it sets the variable equal to missing in that case and moves on. To some extent destring can be made to behave similarly, but not identically. In order to convert a string variable containing any non-numeric value using destring one must list the characters that should be ignored (e.g. “,” or “.”). Additionally, rather than setting values for those cases containing non-numeric values to missing (what the function “real” does), destring removes the specified non-numeric characters. destring will extract the specified strings and then convert, meaning that “a4” can be converted to “4”. destring‘s behavior is very good if one has numeric values stored as strings that occasionally contain things like commas (e.g. 4,354), but there may be situations where this behavior is undesirable.

Converting string variables with non-numeric values into numeric values

How do we convert gender and schtyp into numeric values? We can use the encode command as shown below. These commands create gender2 and schtyp2.

encode gender, generate(gender2)
encode schtyp, generate(schtyp2)

Notice in the describe command below that gender2 and schtyp2 are numeric variables and they have labels associated with them (called gender2 and schtyp2).

describe

Contains data from https://stats.idre.ucla.edu/stat/stata/faq/hsbs.dta
  obs:            20                          
 vars:             8                          25 Nov 1999 19:09
 size:           560 (99.8% of memory free)
-------------------------------------------------------------------------------
              storage  display     value
variable name   type   format      label      variable label
-------------------------------------------------------------------------------
id              str3   %3s                    
gender          str1   %5s                    
race            str5   %9s                    
schtyp          str3   %5s                    
read            str2   %5s                    
science         str2   %5s                    
gender2         long   %8.0g       gender2    
schtyp2         long   %8.0g       schtyp2    
-------------------------------------------------------------------------------
Sorted by:  
     Note:  dataset has changed since last saved

If we list out the data, it appears that gender2 and schtyp2 are identical to gender and schtyp, however they are really numeric and what you are seeing are the value labels associated with the variables.

list

      id  gen~r       race  sch~p   read  sci~e   gender2   schtyp2
  1.  70      m          1    pub     45     47         m       pub
  2. 121      f          1    pub     68     63         f       pub
  3.  86      m          1    pub     44     58         m       pub
  4. 141      m          1    pub     63     53         m       pub
  5. 172      m          1    pub     47     53         m       pub
  6. 113      m          1    pub     44     63         m       pub
  7.  50      m          3    pub     50     53         m       pub
  8.  11      m          2    pub     34     39         m       pub
  9.  84      m          1    pub     63      .         m       pub
 10.  48      m          3    pub     57     50         m       pub
 11.  75      m          1    pub     60     53         m       pub
 12.  60      m          X    pub     57     63         m       pub
 13.  95      m          1    pub     73     61         m       pub
 14. 104      m          1    pub     54     55         m       pub
 15.  38      m          3    pub     45     31         m       pub
 16. 115      m          1    pub     42     50         m       pub
 17.  76      m          1    pub     47     50         m       pub
 18. 195      m          1    pri     57                m       pri
 19. 114      m          1    pub     68     55         m       pub
 20.  85      m          1    pub     55     53         m       pub

Below we use the nolabel option and you see that gender2 and schtyp2 are really numeric.

list, nolabel

      id  gen~r       race  sch~p   read  sci~e   gender2   schtyp2
  1.  70      m          1    pub     45     47         2         2
  2. 121      f          1    pub     68     63         1         2
  3.  86      m          1    pub     44     58         2         2
  4. 141      m          1    pub     63     53         2         2
  5. 172      m          1    pub     47     53         2         2
  6. 113      m          1    pub     44     63         2         2
  7.  50      m          3    pub     50     53         2         2
  8.  11      m          2    pub     34     39         2         2
  9.  84      m          1    pub     63      .         2         2
 10.  48      m          3    pub     57     50         2         2
 11.  75      m          1    pub     60     53         2         2
 12.  60      m          X    pub     57     63         2         2
 13.  95      m          1    pub     73     61         2         2
 14. 104      m          1    pub     54     55         2         2
 15.  38      m          3    pub     45     31         2         2
 16. 115      m          1    pub     42     50         2         2
 17.  76      m          1    pub     47     50         2         2
 18. 195      m          1    pri     57                2         1
 19. 114      m          1    pub     68     55         2         2
 20.  85      m          1    pub     55     53         2         2

What about the variable race? It is still a character variable because our prior destring command saw the X in the data and did not attempt to convert it because it had non-numeric values. Below we can convert it to numeric by include the ignore(X) option that tells destring to convert the variable to numeric and when it encounters X to convert that to a missing value. You can see the results in the list command below.

destring race, replace ignore(X)
race: characters X removed; replaced as byte
(1 missing value generated)

list

      id  gen~r        race  sch~p   read  sci~e   gender2   schtyp2
  1.  70      m           1    pub     45     47         m       pub
  2. 121      f           1    pub     68     63         f       pub
  3.  86      m           1    pub     44     58         m       pub
  4. 141      m           1    pub     63     53         m       pub
  5. 172      m           1    pub     47     53         m       pub
  6. 113      m           1    pub     44     63         m       pub
  7.  50      m           3    pub     50     53         m       pub
  8.  11      m           2    pub     34     39         m       pub
  9.  84      m           1    pub     63      .         m       pub
 10.  48      m           3    pub     57     50         m       pub
 11.  75      m           1    pub     60     53         m       pub
 12.  60      m           .    pub     57     63         m       pub
 13.  95      m           1    pub     73     61         m       pub
 14. 104      m           1    pub     54     55         m       pub
 15.  38      m           3    pub     45     31         m       pub
 16. 115      m           1    pub     42     50         m       pub
 17.  76      m           1    pub     47     50         m       pub
 18. 195      m           1    pri     57                m       pri
 19. 114      m           1    pub     68     55         m       pub
 20.  85      m           1    pub     55     53         m       pub

As you have seen, we can use destring to convert string variables that contain numbers into numeric variables, and it can handle situations where some values are stored as a character (like the X we saw with race). If you have a character variable that is stored as all characters, you can use encode to convert the character variable to numeric and it will create value labels that have the values that were stored with the character variable.

Fore more information, see the help or reference manual about the destring and encode commands.