This FAQ page demonstrates the use of traditional methods and introduces SAS special characters for reading in (messy) data with a character variable of varying length.
When the character variable contains only a single word
This half of the page shows how to read in a character variable with a single word with varying length when the dataset is space delimited. For our example we have a hypothetical website dataset with the following variables: age of page (age), the url (site) and the number of hits the site received (hits).
We start by reading in the dataset where our character variable, site, is read in with the default character format given by $.
data web; input age site $ hits; datalines; 12 http://www.site1.org/default.htm 123456 130 http://www.site2.com/index.htm 97654 254 http://www.site3.edu/department/index.htm 987654 ; proc print; run; Obs age site hits 1 12 http://w 123456 2 130 http://w 97654 3 254 http://w 987654
Using the default method the variable site was read only to the 8th character, the default length for character variables, which was not what we want. Next, we reformat the site variable by setting its format to be maximum length for the character variable across the observations, 41 columns wide. The format is specified by $41. after site in the input statement.
data web; input age site $41. hits; datalines; 12 http://www.site1.org/default.htm 123456 130 http://www.site2.com/index.htm 97654 254 http://www.site3.edu/department/index.htm 987654 ; proc print; run; Obs age site hits 1 12 http://www.site1.org/default.htm 123456 130 2 254 http://www.site3.edu/department/index.htm 987654
This approach didn’t work either; it read in two observations.
There are three possible methods we can try next.
Method 1: The first method requires that prior to the input statement we use a length statement where we define the format of the character variable, and then in the input statement we format site with just $.
data web; length site $41; input age site $ hits; datalines; 12 http://www.site1.org/default.htm 123456 130 http://www.site2.com/index.htm 97654 254 http://www.site3.edu/department/index.htm 987654 ; proc print; run; Obs site age hits 1 http://www.site1.org/default.htm 12 123456 2 http://www.site2.com/index.htm 130 97654 3 http://www.site3.edu/department/index.htm 254 987654
Method 2: For the second method we use the SAS special character, the colon modifier ( : ), for the site variable format, :$41.. The colon modifier tells SAS when it reads in site to do it until there is a break in the character and then stop. Note, when a character variable has more than one word, the colon modifier will take only the first word.
data web; input age site :$41. hits; datalines; 12 http://www.site1.org/default.htm 123456 130 http://www.site2.com/index.htm 97654 254 http://www.site3.edu/department/index.htm 987654 ; proc print; run; Obs age site hits 1 12 http://www.site1.org/default.htm 123456 2 130 http://www.site2.com/index.htm 97654 3 254 http://www.site3.edu/department/index.htm 987654
Method 3: The final method, similar to the first, uses a SAS special character. The special character, & (ampersand), is set up in the same fashion as the colon special character. However, the special character assumes that the character variable ends only when it encounter a blank space that is two or more spaces long. Hence, a single space used to differentiate the character variable and the adjacent variable will be treated as one variable (and the data will be incorrectly read in). When the space to differentiate two variables is greater than two spaces, SAS begins to read in the next variable. The rationale for this rule is evident when the word contains one or more words. For this example, we make a slight modification to the raw data and put two or more spaces between the entries for site and the adjacent variable hits.
data web; input age site & $41. hits; datalines; 12 http://www.site1.org/default.htm 123456 130 http://www.site2.com/index.htm 97654 254 http://www.site3.edu/department/index.htm 987654 ; proc print; run; Obs age site hits 1 12 http://www.site1.org/default.htm 123456 2 130 http://www.site2.com/index.htm 97654 3 254 http://www.site3.edu/department/index.htm 987654
When the character variable contains one or more words
The second half of this page shows how to read in a character variable when the character contains one or more words with varying length and the dataset is space delimited. For this example we create a hypothetical dataset containing the following variables; zip-code (zip), fruits produced in the zip code (produce) and pounds of fruit produced in the zip-code (pound).
The first example reads in the data from an external text file. Below is raw data used and the program used to read it in. Note that the quote marks around the character variable.
10034 "apples, grapes kiwi" 123456 92626 "oranges" 97654 25414 "pears apple" 987654data fruit; infile 'C:messy.txt' delimiter = ' '; length fruit $22; input zip fruit $ pounds; proc print; run; Obs fruit zip pounds 1 "apples, 10034 . 2 "oranges" 92626 97654 3 "pears 25414 .
Clearly, our SAS data step did not correctly read in the data. Next we add the dsd option in the infile statement. The dsd option tells SAS that our delimiter, spaces, can be embedded in our character variable.
data fruit; infile 'C:messy.txt' delimiter = ' ' dsd; length fruit $22; input zip fruit $ pounds; proc print; run; Obs fruit zip pounds 1 apples, grapes kiwi 10034 123456 2 oranges 92626 97654 3 pears apple 25414 987654
For the second example, we are going to read the data in within SAS and use the special character &. Once more, the special character assumes that the character variable ends only when it encounters a blank space that is two or more spaces long. Hence, a single space to differentiate the character variable and the adjacent variable will be ignored and the two variables will be treated as one variable. When the space to differentiate variables is greater than or equal to two spaces, SAS begins to read in the next variable. We make a slight modification to the raw data and put two or more spaces between the entries for fruit and pounds.
data fruit; input zip fruit & $22. pounds; datalines; 10034 apples, grapes kiwi 123456 92626 oranges 97654 25414 pears apple 987654 ; proc print; run; Obs zip fruit pounds 1 10034 apples, grapes kiwi 123456 2 92626 oranges 97654 3 25414 pears apple 987654