String processing is fairly easy in Stata because of the many built-in string functions. Among these are three functions related to regular expressions: regexm for matching, regexr for replacing, and regexs for extracting subexpressions. We will show some examples of how to use regular expressions to extract and/or replace a portion of a string variable using these functions. At the bottom of the page is an explanation of all the regular expression operators as well as the functions that work with regular expressions.
Examples
Example 1: A researcher has addresses as a string variable and wants to create a new variable that contains just the zip codes.
Example 2: We have a variable that contains full names in the order of first name and then last name. We want to create a new variable with the full name in the order of last name and then first name, separated by a comma.
Example 3: Dates were entered as a string variable. In some cases the year was entered as a four-digit value (which is what Stata generally expects to see), but in other cases it was entered as a two-digit value. We want to create a date variable in numeric format based on this string variable. This task can actually be handled easily with regular Stata commands; see our FAQ page "My date variable is a string, how can I turn it into a date variable Stata can recognize?" for information on doing this. We have included this example here for demonstration purposes, not because regular expressions are necessarily the best way to handle this situation.
In these situations, regular expressions can be used to identify cases in which a string contains a set of values (e.g. a specific word, a number followed by a word etc.) and extract that set of values from the whole string for use elsewhere.
Example 1: Extracting zip codes from addresses
Let’s start with some fake entries of addresses.
input str60 address
"4905 Lakeway Drive, College Station, Texas 77845 USA"
"673 Jasmine Street, Los Angeles, CA 90024"
"2376 First street, San Diego, CA 90126"
"6 West Central St, Tempe AZ 80068"
"1234 Main St. Cambridge, MA 01238-1234"
end
To find the zip code we will look for a five-digit number within an address. The gen command (short for "generate") below tells Stata to generate a new variable called zip. The rest of the command is a little tricky because the "if" is evaluated first: if(regexm(address, "[0-9][0-9][0-9][0-9][0-9]")) searches the variable address for five consecutive digits, and, where it finds them, the = regexs(0) tells Stata to set zip equal to the digits that were matched. We ask for five digits by specifying "[0-9]" five times. Unless it is followed by a *, +, or ?, a bracket expression matches one and only one of the characters it contains, so stringing five of these expressions together matches five digits in a row. The pattern does not check what comes before or after those digits, so it will also match the first five digits of a longer number. Note that the 0-9 indicates that the expression should match any character 0 through 9 (i.e. 0, 1, 2, 3, 4, 5, 6, 7, 8, and 9 are all matches).
gen zip = regexs(0) if(regexm(address, "[0-9][0-9][0-9][0-9][0-9]"))
list

     +--------------------------------------------------------------+
     |                                              address     zip |
     |--------------------------------------------------------------|
  1. | 4905 Lakeway Drive, College Station, Texas 77845 USA   77845 |
  2. |            673 Jasmine Street, Los Angeles, CA 90024   90024 |
  3. |               2376 First street, San Diego, CA 90126   90126 |
  4. |                    6 West Central St, Tempe AZ 80068   80068 |
  5. |               1234 Main St. Cambridge, MA 01238-1234   01238 |
     +--------------------------------------------------------------+
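One detail worth checking is that five bracket expressions match five consecutive digits, not a number that is exactly five digits long. The short sketch below uses display with a made-up string (not part of the address data) to illustrate this.

```stata
* Made-up test string: a seven-digit number after the word "Suite".
* The pattern matches the first five digits of the longer number.
display regexm("Suite 1234567", "[0-9][0-9][0-9][0-9][0-9]")
display regexs(0)
```

The first display shows 1 (a match was found) and the second shows 12345, the first five digits of the longer number.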
Example 1, Variation Number 1
In our simplified example above, none of the addresses have five-digit street numbers. What if there are addresses with five-digit street numbers? Let’s look at another dataset of fake addresses and see what happens when we try to use the same code above.
clear
input str60 address
"4905 Lakeway Drive, College Station, Texas 77845"
"673 Jasmine Street, Los Angeles, CA 90024"
"2376 First street, San Diego, CA 90126"
"66666 West Central St, Tempe AZ 80068"
"12345 Main St. Cambridge, MA 01238"
end
gen zip = regexs(0) if(regexm(address, "[0-9][0-9][0-9][0-9][0-9]"))
list

     +----------------------------------------------------------+
     |                                          address     zip |
     |----------------------------------------------------------|
  1. | 4905 Lakeway Drive, College Station, Texas 77845   77845 |
  2. |        673 Jasmine Street, Los Angeles, CA 90024   90024 |
  3. |           2376 First street, San Diego, CA 90126   90126 |
  4. |            66666 West Central St, Tempe AZ 80068   66666 |
  5. |               12345 Main St. Cambridge, MA 01238   12345 |
     +----------------------------------------------------------+
This is not working correctly: the last two rows of the variable zip have picked up the street numbers for these addresses instead of the zip codes, because regexm returns the first match it finds in the string. In this data set, the zip code appears at the end of the address string. If we assume that this is the case for all addresses in the data, the remedy is simple. We can specify "[0-9][0-9][0-9][0-9][0-9]$", which instructs Stata to find a five-digit number at the end of the string. Because zip already exists, we drop it before generating it again.
drop zip
gen zip = regexs(0) if(regexm(address, "[0-9][0-9][0-9][0-9][0-9]$"))
list

     +----------------------------------------------------------+
     |                                          address     zip |
     |----------------------------------------------------------|
  1. | 4905 Lakeway Drive, College Station, Texas 77845   77845 |
  2. |        673 Jasmine Street, Los Angeles, CA 90024   90024 |
  3. |           2376 First street, San Diego, CA 90126   90126 |
  4. |            66666 West Central St, Tempe AZ 80068   80068 |
  5. |               12345 Main St. Cambridge, MA 01238   01238 |
     +----------------------------------------------------------+
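The "$" anchor only helps when the zip code really is the last thing in the string. As a quick check, the sketch below applies the anchored pattern to two of the addresses from the first data set, typed in as literal strings.

```stata
* Trailing country name: the string does not end in five digits.
display regexm("4905 Lakeway Drive, College Station, Texas 77845 USA", "[0-9][0-9][0-9][0-9][0-9]$")

* Four-digit extension: only four digits follow the dash at the end.
display regexm("1234 Main St. Cambridge, MA 01238-1234", "[0-9][0-9][0-9][0-9][0-9]$")
```

Both lines display 0, so with this pattern zip would be missing for addresses like these.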
Example 1, Variation Number 2
Sometimes zip codes also include the four-digit extension, and the country name may also appear at the end of the address, as in some of the addresses shown below.
clear
input str60 address
"4905 Lakeway Drive, College Station, Texas 77845 USA"
"673 Jasmine Street, Los Angeles, CA 90024"
"2376 First street, San Diego, CA 90126"
"66666 West Central St, Tempe AZ 80068"
"12345 Main St. Cambridge, MA 01238-1234"
"12345 Main St Sommerville MA 01239-2345"
"12345 Main St Watertwon MA 01239 USA"
end
In this type of more realistic situation, the code in the previous examples won’t work correctly since there are extra characters after the zip code to be extracted. Here is how we can do it using a more complicated regular expression.
gen zip = regexs(1) if regexm(address, "([0-9][0-9][0-9][0-9][0-9])[-]*[0-9]*[ a-zA-Z]*$")
list

     +--------------------------------------------------------------+
     |                                              address     zip |
     |--------------------------------------------------------------|
  1. | 4905 Lakeway Drive, College Station, Texas 77845 USA   77845 |
  2. |            673 Jasmine Street, Los Angeles, CA 90024   90024 |
  3. |               2376 First street, San Diego, CA 90126   90126 |
  4. |                66666 West Central St, Tempe AZ 80068   80068 |
  5. |              12345 Main St. Cambridge, MA 01238-1234   01238 |
     |--------------------------------------------------------------|
  6. |              12345 Main St Sommerville MA 01239-2345   01239 |
  7. |                 12345 Main St Watertwon MA 01239 USA   01239 |
     +--------------------------------------------------------------+
What we have added to the regular expression is this part: "[-]*[0-9]*[ a-zA-Z]*". It has three components.
- [-]* – matching zero or more dashes "-"
- [0-9]* – matching zero or more digits
- [ a-zA-Z]* – matching zero or more blank spaces or letters
These additions allow us to match the cases where there are trailing characters after the zip code and to extract the zip code correctly. Notice that we also used "regexs(1)" instead of "regexs(0)" as we did previously, because we are now using a subexpression, indicated by the pair of parentheses in "([0-9][0-9][0-9][0-9][0-9])". Another strategy that might work better in some cases is the regular expression
gen zip2 = regexs(1) if(regexm(address, ".*([0-9][0-9][0-9][0-9][0-9])"))
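In this example, the period (".") matches any single character, and the asterisk ("*") means zero or more of the preceding expression, so ".*" matches any run of characters, including none at all. Because the match is greedy, ".*" takes up as much of the string as it can while still leaving five digits for the subexpression. As a result, regexs(1) returns the last run of five consecutive digits in the address rather than the first.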
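To see how the leading ".*" changes which digits are picked up, the sketch below runs the pattern on one address from the data above and on a made-up string that consists of nothing but a zip code.

```stata
* The street number 12345 comes first, but ".*" consumes as much as it can,
* so the subexpression captures the last run of five digits.
display regexm("12345 Main St. Cambridge, MA 01238-1234", ".*([0-9][0-9][0-9][0-9][0-9])")
display regexs(1)

* ".*" may also match nothing, so digits at the very start still match.
display regexm("90024", ".*([0-9][0-9][0-9][0-9][0-9])")
display regexs(1)
```

The first pair displays 1 and 01238; the second pair displays 1 and 90024. One caveat with this approach is that any later five-digit run, such as part of a phone number appended to the address, would be captured instead of the zip code.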
Example 2: Extracting first name and last name and switching their order
We have a variable that contains a person’s full name in the order of first name and then last name. We want to create a new variable for full name in the order of last name and then first name separated by comma. To start, let’s make a sample data set.
clear
input str40 fullname
"John Adams"
"Adam Smiths"
"Mary Smiths"
"Charlie Wade"
end
Now we need to capture the first word and the second word and swap them. Here is the regular expression for this purpose: "([a-zA-Z]+)[ ]*([a-zA-Z]+)".
There are three parts in this regular expression:
- ([a-zA-Z]+) – subexpression capturing a string consisting of letters, both lower case and upper case. This will be the first name.
- [ ]* – matching zero or more spaces. This is the spacing between first name and last name.
- ([a-zA-Z]+) – subexpression capturing a string consisting of letters. This will be the last name.
gen n = regexs(2)+", "+regexs(1) if regexm(fullname, "([a-zA-Z]+)[ ]*([a-zA-Z]+)")
list

     +------------------------------+
     |     fullname               n |
     |------------------------------|
  1. |   John Adams     Adams, John |
  2. |  Adam Smiths    Smiths, Adam |
  3. |  Mary Smiths    Smiths, Mary |
  4. | Charlie Wade   Wade, Charlie |
     +------------------------------+
This indeed works. Let’s see how regexs works in this case. regexm actually identifies a number of sections, based on the whole expression as well as the subexpressions. The following code wraps the whole pattern in an extra pair of parentheses, so subexpression 1 is the full name and subexpressions 2 and 3 are the first and last names. It uses regexs to place these components into their own variables and then displays them.
gen n0 = regexs(0) if regexm(fullname, "(([a-zA-Z]+)[ ]*([a-zA-Z]+))")
gen n1 = regexs(2) if regexm(fullname, "(([a-zA-Z]+)[ ]*([a-zA-Z]+))")
gen n2 = regexs(3) if regexm(fullname, "(([a-zA-Z]+)[ ]*([a-zA-Z]+))")
list fullname n0 n1 n2

     +------------------------------------------------+
     |     fullname             n0        n1       n2 |
     |------------------------------------------------|
  1. |   John Adams     John Adams      John    Adams |
  2. |  Adam Smiths    Adam Smiths      Adam   Smiths |
  3. |  Mary Smiths    Mary Smiths      Mary   Smiths |
  4. | Charlie Wade   Charlie Wade   Charlie     Wade |
     +------------------------------------------------+
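Because "[ ]*" allows zero spaces, the name pattern will also match a single word and split it in an unintended place. The sketch below illustrates this with a made-up one-word name and shows one possible variation, "[ ]+", which requires at least one space; whether that is the right rule depends on your data.

```stata
* With [ ]*, a one-word name still matches: the first subexpression gives up
* its last letter so that the second subexpression has something to match.
display regexm("Madonna", "([a-zA-Z]+)[ ]*([a-zA-Z]+)")
display regexs(1)
display regexs(2)

* With [ ]+, a name without a space is not matched at all.
display regexm("Madonna", "([a-zA-Z]+)[ ]+([a-zA-Z]+)")
display regexm("John Adams", "([a-zA-Z]+)[ ]+([a-zA-Z]+)")
```

The first three lines display 1, Madonn, and a. The last two display 0 and 1.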
Example 3: Two- and four-digit values for year
In this example, we have dates entered as a string variable. Stata can handle this using standard commands (see "My date variable is a string, how can I turn it into a date variable Stata can recognize?"); we are using this as an example of what you could do with regular expressions. The goal of this process is to produce a string variable with the appropriate four-digit year for every case, which Stata can then easily convert into a date. To do this we will start by separating each element of the date (day, month, and two- or four-digit year) into its own variable. Then we will assign the correct four-digit year to cases where there are currently only two digits. Finally, we concatenate the variables to create a single string variable that contains day, month, and four-digit year.
First, input the dates:
clear
input str18 date
20jan2007
16June06
06sept1985
21june04
4july90
9jan1999
6aug99
19august2003
end
Next, we want to identify the day of the month and place it in a variable called day. To do this we instruct Stata to find the day by looking at the beginning of the string (i.e. the date), for one or more values from 0-9. (In other words, look for a number at the start of the line, since we know the first series of numbers is the day.) Generate a new variable day, and set it equal to that value.
gen day = regexs(0) if regexm(date, "^[0-9]+")
The line of syntax below finds the month by looking for one or more letters together in the string. It then generates the variable month and sets it equal to the month identified in the string.
gen month = regexs(0) if regexm(date, "[a-zA-Z]+")
The year is where things get more complex. Note that the values for assigning centuries are based on my knowledge of my "data." First, we extract all the digits for year, using the "$" operator to anchor the match at the end of the string. We keep year as a string variable so that it can be concatenated with other strings. Next we deal with two-digit years starting with "0", which correspond to recent years in the twenty-first century: to turn these into four-digit years, we concatenate (using the +) the string "20" with the two-digit year. Then we find the two-digit years 10-99 and concatenate the string "19" with them. Finally, we create the variable date2, which is our date containing only four-digit years. (We could also use the three variables day, month, and year to create a date variable using the Stata date functions.)
gen year = regexs(0) if regexm(date, "[0-9]*$")
replace year = "20"+regexs(0) if regexm(year, "^[0][0-9]$")
replace year = "19"+regexs(0) if regexm(year, "^[1-9][0-9]$")
gen date2 = day+month+year
list

     +---------------------------------------------------+
     |         date   day    month   year          date2 |
     |---------------------------------------------------|
  1. |    20jan2007    20      jan   2007      20jan2007 |
  2. |     16June06    16     June   2006     16June2006 |
  3. |   06sept1985    06     sept   1985     06sept1985 |
  4. |     21june04    21     june   2004     21june2004 |
  5. |      4july90     4     july   1990      4july1990 |
     |---------------------------------------------------|
  6. |     9jan1999     9      jan   1999       9jan1999 |
  7. |       6aug99     6      aug   1999       6aug1999 |
  8. | 19august2003    19   august   2003   19august2003 |
     +---------------------------------------------------+
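Once day, month, and year are available as strings, one possible way to finish the job is Stata's date() function. The sketch below is only an illustration under a few assumptions: it re-enters three of the rows by hand so that it runs on its own, the variable name date3 is made up, and it shortens each month name to its first three letters as a precaution, since the spellings in the data vary.

```stata
* A small stand-alone copy of three rows of day, month, and year.
clear
input str2 day str6 month str4 year
"20" "jan"  "2007"
"16" "June" "2006"
"06" "sept" "1985"
end

* Build a day-month-year string with a three-letter month and convert it.
gen date3 = date(day + substr(month, 1, 3) + year, "DMY")
format date3 %td
list
```

The new variable date3 is numeric, so it can be sorted, compared, and used in date arithmetic.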
Regular Expressions
Regular expressions are, in general, a way of searching for and in some cases replacing the occurrence of a pattern within a string based on a set of rules. These rules are defined using a set of operators. The following table shows all of the operators Stata accepts, and explains each one. Note that in Stata, regular expressions will always fall within quotation marks.
- [ ] – Square brackets indicate that one of the characters inside the brackets should be matched. For example, if I wanted to search for a single letter between f and m, I would type "[f-m]".
- a-z – A range specifies that any value within that range is acceptable. This is case sensitive, so a-z is not the same as A-Z; if either case can be counted as a match, include both: a-zA-Z. Numeric values are also acceptable as ranges (e.g. 0-9).
- . – A period matches any character.
- \ – Allows you to match characters that are usually regular expression operators. For example, if you wanted to match a "[" you would type \[ instead of just a single [.
- * – Match zero or more of the preceding expression. For example, if I wanted to match a number made up of one or more digits if there is a number, but still want to indicate a match if the rest of the expression fits, I could specify [0-9]*
- + – Match one or more of the preceding expression. For example, if I wanted to match a word containing any combination of letters, I would specify [a-zA-Z]+
- ? – Match either zero or one of the preceding expression.
- ^ – When it appears at the beginning of an expression, a "^" indicates that the following expression should appear at the beginning of the string.
- $ – When it appears at the end of an expression, a "$" indicates that the preceding expression should appear at the end of the string. For example, if I wanted to match a number that was the last thing to appear at the end of a string, I would specify "[0-9]+$"
- | – The logical operator or, indicating that either the expression preceding it or the expression following it qualifies as a match.
- ( ) – Creates a subexpression within a larger expression. Useful with the "or" operator (i.e. | ), and when extracting and replacing values.
For example, if I wanted to extract a numeric value which I know follows directly after a word or set of letters, I could use the regular expression "[a-zA-Z]+([0-9]+)". This matches the whole expression, but allows you to select the portion in the parentheses (called a subexpression). Handling subexpressions is discussed in greater detail below. These expressions can be combined to search for a wide variety of strings.
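A few of these operators can be tried out directly with display. The lines below are a small sketch; the test strings are made up for illustration and are not part of the examples above.

```stata
* ? : the "u" is optional, so both spellings match.
display regexm("color", "^colou?r$")
display regexm("colour", "^colou?r$")

* | inside parentheses: either alternative may appear at the end of the string.
display regexm("6 West Central St", "(St|Street)$")

* Backslash: "\." matches a literal period rather than any character.
display regexm("3.14", "^[0-9]+\.[0-9]+$")
display regexm("3514", "^[0-9]+\.[0-9]+$")
```

Every line displays 1 except the last, which displays 0 because "3514" contains no literal period.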
As mentioned above, there are three types of actions that can be performed with regular expressions in Stata (if you are creative, you can do any number of other things using these functions, but the basic tools are the built-in Stata functions). Stata has a separate function for each of the three:
- regexm – used to find matching strings; evaluates to one if there is a match, and zero otherwise.
- regexs – used to return the nth subexpression within an expression matched by regexm (hence, regexm must always be run before regexs; note that an "if" is evaluated first even though it appears later on the line of syntax).
- regexr – used to replace a matched expression with something else.
Each of these has a slightly different syntax. The line below shows the syntax for regexm, that is, the function that matches your regular expression. The string may be a string you type in yourself, a string from a macro, or, most commonly, the name of a variable. The regular expression is the pattern you would like to find; note that it must appear in quotation marks.
regexm(string, "regular expression")
For regexs, that is, to recall all or a portion of a string, the syntax is:
regexs(n)
Here n is the number assigned to the subexpression you want to extract. The subexpressions are actually divided when you run regexm. The entire matched string is returned in zero, and each subexpression is numbered sequentially from 1 to n. For example, regexm("907-789-3939", "([0-9]*)-([0-9]*)-([0-9]*)") returns the following:
Subexpression #    String Returned
0                  907-789-3939
1                  907
2                  789
3                  3939
Note that in subexpressions 1, 2, and 3, the dashes are dropped, since they are not included in the parentheses that mark the subexpressions.
You can take another look at how this works using the following syntax, which uses the display command to run the function.
display regexm("907-789-3939", "([0-9]*)-([0-9]*)-([0-9]*)")
display regexs(0)
display regexs(1)
display regexs(2)
display regexs(3)
Because they are functions, the regex commands work within other commands (e.g. generate), but cannot be used on their own (i.e. you cannot start a command in Stata with regexm(…)).
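The examples on this page use regexm and regexs, so here is a brief sketch of the third function, regexr. It takes three arguments: the string to search, the regular expression, and the replacement text. The strings below are typed in as literals for illustration.

```stata
* Replace "Street" at the end of the string with "St".
display regexr("673 Jasmine Street", "Street$", "St")

* Only the first match is replaced; the second dash is left in place.
display regexr("907-789-3939", "-", "")
```

The first line displays 673 Jasmine St and the second displays 907789-3939. Like the other two functions, regexr can be used inside generate or replace to work on a whole variable.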
Reference
What are regular expressions and how can I use them in Stata?