There may be times that you receive a file that has many (or all) of the variables defined as strings, that is, character variables. The variables may contain numeric values, but if they are defined as type string, there are very few things you can do to analyze the data. You cannot get means, you cannot do a regression, you cannot do an ANOVA, etc… Sometimes the dataset contains numerical values that are stored as strings. We will address this scenario first. Then we will address the case where the string variables actually contain strings, and the goal is to assign each value the string takes on to a numeric value.
All of the examples on this page use the same dataset, so let’s start by examining the data. The example dataset, hsbs, is a subset of the High School and Beyond data file with all of the variables as string variables.
use https://stats.idre.ucla.edu/stat/stata/faq/hsbs, clear
As you see from the describe command below, the variables are all defined as string variables (e.g., science is str2, a string of length 2).
describe Contains data from hsbs.dta obs: 20 vars: 6 25 Nov 1999 19:09 size: 400 (99.6% of memory free) ------------------------------------------------------------------------------- 1. id str3 %3s 2. gender str1 %5s 3. race str5 %9s 4. schtyp str3 %5s 5. read str2 %5s 6. science str2 %5s ------------------------------------------------------------------------------- Sorted by:
Now that we know the variables are string variables, we can use the list command to see what the strings stored in these variables look like. Although the variable science is defined as str2, you can see from the list below that it contains just numeric values. Even so, because the variable is defined as str2, Stata cannot perform any kind of numerical analysis of the variable science. The same is true for the variable read.
list id gender race schtyp read science 1. 70 m 1 pub 45 47 2. 121 f 1 pub 68 63 3. 86 m 1 pub 44 58 4. 141 m 1 pub 63 53 5. 172 m 1 pub 47 53 6. 113 m 1 pub 44 63 7. 50 m 3 pub 50 53 8. 11 m 2 pub 34 39 9. 84 m 1 pub 63 . 10. 48 m 3 pub 57 50 11. 75 m 1 pub 60 53 12. 60 m X pub 57 63 13. 95 m 1 pub 73 61 14. 104 m 1 pub 54 55 15. 38 m 3 pub 45 31 16. 115 m 1 pub 42 50 17. 76 m 1 pub 47 50 18. 195 m 1 pri 57 19. 114 m 1 pub 68 55 20. 85 m 1 pub 55 53
Converting string variables with numeric values
One method of converting numbers stored as strings into numerical variables is to use a string function called real that translates numeric values stored as strings into numeric values Stata can recognize as such. The first line of syntax reads in the dataset shown above. The second generates a new variable read_n that is equal to the value of the number stored in the string variable read. The real(s) is the function that translates the values held as strings, where s is the variable containing strings.
gen read_n = real(read) list read read_n in 1/5 +---------------+ | read read_n | |---------------| 1. | 45 45 | 2. | 68 68 | 3. | 44 44 | 4. | 63 63 | 5. | 47 47 | +---------------+
A second method of achieving the same result is the command destring. Let’s try using the destring command and see how it works. The first line of syntax loads the dataset again, so that we are starting with a dataset containing only string variables again. The second line of syntax runs the destring command.
use https://stats.idre.ucla.edu/stat/stata/faq/hsbs, clear destring , replace id has all characters numeric; replaced as int gender contains non-numeric characters; no replace race contains non-numeric characters; no replace schtyp contains non-numeric characters; no replace read has all characters numeric; replaced as byte science has all characters numeric; replaced as byte (2 missing values generated)
As you can see from the describe command below, the destring command converted all of the variables to numeric, except for race, gender and schtyp. Since these variables had characters in them, the destring command left such variables alone. If there had been any numeric variables in the dataset, they would remain unchanged.
describe Contains data from https://stats.idre.ucla.edu/stat/stata/faq/hsbs.dta obs: 20 vars: 6 25 Nov 1999 19:09 size: 340 (98.5% of memory free) ------------------------------------------------------------------------------- storage display value variable name type format label variable label ------------------------------------------------------------------------------- id int %10.0g gender str1 %5s race str5 %9s schtyp str3 %5s read byte %10.0g science byte %10.0g ------------------------------------------------------------------------------- Sorted by: Note: dataset has changed since last saved
Both of the techniques described above have attributes that in some situations are advantages and in other situations may be disadvantages. The command destring can be run on an entire dataset in one step, the method using the function “real” requires issuing a command for each variable to be converted (although this can be done with a loop rather than typing out the syntax for each variable). One potential advantage to using the function “real” (the first method) is that if the function “real” encounters a non-numeric value, it sets the variable equal to missing in that case and moves on. To some extent destring can be made to behave similarly, but not identically. In order to convert a string variable containing any non-numeric value using destring one must list the characters that should be ignored (e.g. “,” or “.”). Additionally, rather than setting values for those cases containing non-numeric values to missing (what the function “real” does), destring removes the specified non-numeric characters. destring will extract the specified strings and then convert, meaning that “a4” can be converted to “4”. destring‘s behavior is very good if one has numeric values stored as strings that occasionally contain things like commas (e.g. 4,354), but there may be situations where this behavior is undesirable.
Converting string variables with non-numeric values into numeric values
How do we convert gender and schtyp into numeric values? We can use the encode command as shown below. These commands create gender2 and schtyp2.
encode gender, generate(gender2) encode schtyp, generate(schtyp2)
Notice in the describe command below that gender2 and schtyp2 are numeric variables and they have labels associated with them (called gender2 and schtyp2).
describe Contains data from https://stats.idre.ucla.edu/stat/stata/faq/hsbs.dta obs: 20 vars: 8 25 Nov 1999 19:09 size: 560 (99.8% of memory free) ------------------------------------------------------------------------------- storage display value variable name type format label variable label ------------------------------------------------------------------------------- id str3 %3s gender str1 %5s race str5 %9s schtyp str3 %5s read str2 %5s science str2 %5s gender2 long %8.0g gender2 schtyp2 long %8.0g schtyp2 ------------------------------------------------------------------------------- Sorted by: Note: dataset has changed since last saved
If we list out the data, it appears that gender2 and schtyp2 are identical to gender and schtyp, however they are really numeric and what you are seeing are the value labels associated with the variables.
list id gen~r race sch~p read sci~e gender2 schtyp2 1. 70 m 1 pub 45 47 m pub 2. 121 f 1 pub 68 63 f pub 3. 86 m 1 pub 44 58 m pub 4. 141 m 1 pub 63 53 m pub 5. 172 m 1 pub 47 53 m pub 6. 113 m 1 pub 44 63 m pub 7. 50 m 3 pub 50 53 m pub 8. 11 m 2 pub 34 39 m pub 9. 84 m 1 pub 63 . m pub 10. 48 m 3 pub 57 50 m pub 11. 75 m 1 pub 60 53 m pub 12. 60 m X pub 57 63 m pub 13. 95 m 1 pub 73 61 m pub 14. 104 m 1 pub 54 55 m pub 15. 38 m 3 pub 45 31 m pub 16. 115 m 1 pub 42 50 m pub 17. 76 m 1 pub 47 50 m pub 18. 195 m 1 pri 57 m pri 19. 114 m 1 pub 68 55 m pub 20. 85 m 1 pub 55 53 m pub
Below we use the nolabel option and you see that gender2 and schtyp2 are really numeric.
list, nolabel id gen~r race sch~p read sci~e gender2 schtyp2 1. 70 m 1 pub 45 47 2 2 2. 121 f 1 pub 68 63 1 2 3. 86 m 1 pub 44 58 2 2 4. 141 m 1 pub 63 53 2 2 5. 172 m 1 pub 47 53 2 2 6. 113 m 1 pub 44 63 2 2 7. 50 m 3 pub 50 53 2 2 8. 11 m 2 pub 34 39 2 2 9. 84 m 1 pub 63 . 2 2 10. 48 m 3 pub 57 50 2 2 11. 75 m 1 pub 60 53 2 2 12. 60 m X pub 57 63 2 2 13. 95 m 1 pub 73 61 2 2 14. 104 m 1 pub 54 55 2 2 15. 38 m 3 pub 45 31 2 2 16. 115 m 1 pub 42 50 2 2 17. 76 m 1 pub 47 50 2 2 18. 195 m 1 pri 57 2 1 19. 114 m 1 pub 68 55 2 2 20. 85 m 1 pub 55 53 2 2
What about the variable race? It is still a character variable because our prior destring command saw the X in the data and did not attempt to convert it because it had non-numeric values. Below we can convert it to numeric by include the ignore(X) option that tells destring to convert the variable to numeric and when it encounters X to convert that to a missing value. You can see the results in the list command below.
destring race, replace ignore(X) race: characters X removed; replaced as byte (1 missing value generated) list id gen~r race sch~p read sci~e gender2 schtyp2 1. 70 m 1 pub 45 47 m pub 2. 121 f 1 pub 68 63 f pub 3. 86 m 1 pub 44 58 m pub 4. 141 m 1 pub 63 53 m pub 5. 172 m 1 pub 47 53 m pub 6. 113 m 1 pub 44 63 m pub 7. 50 m 3 pub 50 53 m pub 8. 11 m 2 pub 34 39 m pub 9. 84 m 1 pub 63 . m pub 10. 48 m 3 pub 57 50 m pub 11. 75 m 1 pub 60 53 m pub 12. 60 m . pub 57 63 m pub 13. 95 m 1 pub 73 61 m pub 14. 104 m 1 pub 54 55 m pub 15. 38 m 3 pub 45 31 m pub 16. 115 m 1 pub 42 50 m pub 17. 76 m 1 pub 47 50 m pub 18. 195 m 1 pri 57 m pri 19. 114 m 1 pub 68 55 m pub 20. 85 m 1 pub 55 53 m pub
As you have seen, we can use destring to convert string variables that contain numbers into numeric variables, and it can handle situations where some values are stored as a character (like the X we saw with race). If you have a character variable that is stored as all characters, you can use encode to convert the character variable to numeric and it will create value labels that have the values that were stored with the character variable.
Fore more information, see the help or reference manual about the destring and encode commands.