Converting dates entered as strings into numeric dates that can be used by Stata is relatively simple. The commands vary somewhat based on the format in which the data were entered and how consistently that format was applied. In most cases, Stata can handle dates in which the days were entered as one or two digit numbers, and in which the month was entered as either abbreviation (e.g. Jan, Sept, Sep), or typed out in whole (e.g. January, September). The one place you can run into trouble is with the year. In general, Stata expects to find four-digit values for year, if your data were not entered this way (or not consistently entered this way), Stata can handle this, but some additions are needed to the syntax. The following set of examples covers a number of common scenarios. In general the examples move from the simplest case (where values for year are all four-digits), to more complex cases (where the values for year were not consistently entered as four-digit values).
Case 1: Dates with four-digit values for year.
input str10 date 20jan2007 06sept2001 21june2004 04july2006 6aug2005 end list +------------+ | date | |------------| 1. | 20jan2007 | 2. | 06sept2001 | 3. | 21june2004 | 4. | 04july2006 | 5. | 6aug2005 | +------------+
Two commands, date and format, are used to convert a string date into a numeric date. Within these functions you will need to specify how the date-time information is formatted in the string. Use upper case for day, month, and year (e.g. “DMY”) but lower case if you want to specify hours, minutes or seconds (e.g. “DMYhms”). In the syntax below, the first command actually generates the new date variable and converts the string into the proper numeric code for the date in question (i.e. the number of days elapsed between January 1st, 1960 and the date in question). Our data are in the order day, month, year, so we use “dmy” (or “DMY”) within the date(…) command (see below for information about different formats). The second command formats the numeric value so that when Stata displays the date, it is in a form that is easy for humans to read. Note that while using list, and some other data display commands, produces values that look like string dates, the actual values stored by Stata are numeric.
In Stata the commands are:
gen date2 = date(date, "DMY") format date2 %td
In either version, if you list the data, you should see the following:
+------------------------+ | date date2 | |------------------------| 1. | 20jan2007 20jan2007 | 2. | 06sept2001 06sep2001 | 3. | 21june2004 21jun2004 | 4. | 04july2006 04jul2006 | 5. | 6aug2005 06aug2005 | +------------------------+
Above, our date string was formatted day, month, year, with no character separating the different components of our date, however, Stata can handle a wide variety of formats and separating characters. For example, our data could have been listed “20-jan-2007”, “20/jan/2007”, or “20.jan.2007” and the commands above would work without any changes. Stata will even handle dates the case in which some entries have separators and others do not. We can also change the order in which the components of the date are listed, all we will need to change is the order of the letters used to tell Stata which component of the date is where (e.g. year, month, day can be specified using “YMD”). Dates can also be all numeric, for example, 20-Jan-2007 can also be represented by 20-1-2007 and even 20012007. Note that if all numeric dates are entered without separators, you must make sure that for the day and month single digit values (e.g. the 1 for January above) have leading zeros.
Case 2: Dates with two-digit and four-digit values for year, with all two-digit dates in the same century.
The above example assumed that the values for year were listed in a four digit format (e.g. 2004), however, this is not always the case. The data below contains dates listed with only two digits, assuming that I know these dates came either from the 20th (e.g. 1907) or 21st century (e.g. 2007), I can tell Stata to treat the years as though they are four digit values. To do this I add either “19” or “20” in front of the Y in the part of the command that gives Stata the order in which the values appear. For example, for dates listed day, month, year, the specifcation is “DM20Y”. In the data below, notice that the last two dates both contain four-digit years (2005 and 1999); this is fine — the only restriction is that the entries with two-digit values for year all come from the same century.
input str14 date 20jan07 06sept01 21june04 04july06 6august05 6august2005 20june1999 end list +-------------+ | date | |-------------| 1. | 20jan07 | 2. | 06sept01 | 3. | 21june04 | 4. | 04july06 | 5. | 6august05 | |-------------| 6. | 6august2005 | 7. | 20june1999 | +-------------+
The command to create a new numeric date variable from a string variable and format the new variable so it displays as a date are:
gen date2 = date(date, "DM20Y") format date2 %td
Note that for dates in the 20th century (e.g. 1997) the command is “DM19Y”.
If you list the data, you should see the following:
list date date2 +-------------------------+ | date date2 | |-------------------------| 1. | 20jan07 20jan2007 | 2. | 06sept01 06sep2001 | 3. | 21june04 21jun2004 | 4. | 04july06 04jul2006 | 5. | 6august05 06aug2005 | |-------------------------| 6. | 6august2005 06aug2005 | 7. | 20june1999 20jun1999 | +-------------------------+
As with the first example, the dates can be formatted in a variety of ways, order of the components does not mater, but the 19 or 20 always precedes the Y.
Case 3: Dates with two-digit and four-digit values for year, all two-digit dates are not in the same century or not from the 20th or 21st century.
In case 2 we assumed that all the dates with two-digit values for year were from the same century, and that all of the dates with two-digit values for year fall in either the 20th or 21st centuries. In this example, we will tell Stata the most recent year in which the data could appear, and Stata will assign the centuries so that all dates occur before that year. An example of when this be useful is if I know the values below are birth dates for currently living individuals. Since they are currently living, I know that none of them was born after the current year (2007), and it is unlikely that the value “07” in the year is 1907 rather than 2007 (note that this sort of information is highly dependent on the data, so you will want to check any assumptions you make against what you actually know about your data). This would also come in handy if your 2-digit year values all come from an earlier century. For example, if all your dates are from the 1700s, giving Stata 1799 as the last possible year will result in all two digit dates begin assigned to the eighteenth century (i.e. 84 would be interpreted as 1784).
input str18 date 20jan2007 06sept1985 21june04 4july90 9jan1999 6aug99 19august2003 end list +--------------+ | date | |--------------| 1. | 20jan2007 | 2. | 06sept1985 | 3. | 21june04 | 4. | 4july90 | 5. | 9jan1999 | |--------------| 6. | 6aug99 | 7. | 19august2003 | +--------------+
The command to create a new numeric date variable from a string variable and format the new variable so it displays as a date is:
gen date2 = date(date, "DMY", 2007) format date2 %td
If you list the data you should see the following:
list date date2 +-------------------------+ | date date2 | |-------------------------| 1. | 20jan07 20jan2007 | 2. | 06sept01 06sep2001 | 3. | 21june04 21jun2004 | 4. | 04july06 04jul2006 | 5. | 6august05 06aug2005 | |-------------------------| 6. | 6august2005 06aug2005 | 7. | 20june1999 20jun1999 | +-------------------------+