How do I create a format out of a string variable?

Sometimes, two variables in a dataset may convey the same information, except one being numeric variable and the other being a string variable. For example, in the data set below, we have a numeric variable a coded 1/0 for gender and a string variable b also for gender but with more explicit information. It is easy to use the numeric variable, but we may also want to keep the information given from the string variable. This is a case where we want to create value labels for the numeric variable based on the string variable. In SAS, we will create a format from the string variable and apply the format to the numeric variable.

Example 1: A simple example

We have a tiny data set containing the two variables a and b and two observations.

data test;
  input a b $;
datalines;
1 female
0 male
;
run;

Apparently we want to create a format for variable a so that 1 = female and 0 = male. It is easy to create a format simply using the procedure format. For example, we can do the following.

proc format;
  value gender 1 = "female"
               0 = "male";
run;
proc format;
  select gender;
run;

----------------------------------------------------------------------------
|       FORMAT NAME: GENDER   LENGTH:    6   NUMBER OF VALUES:    2        |
|   MIN LENGTH:   1  MAX LENGTH:  40  DEFAULT LENGTH   6  FUZZ: STD        |
|--------------------------------------------------------------------------|
|START           |END             |LABEL  (VER. V7|V8   20MAY2004:14:25:17)|
|----------------+----------------+----------------------------------------|
|               0|               0|male                                    |
|               1|               1|female                                  |
----------------------------------------------------------------------------

We can also do the following using the a data step. This approach does not depend on the number of categories of the string variable. The code will be exactly the same. This is definitely easier when the number of categories is large.

data fmt_dataset;
  retain fmtname "lgender";
  set test ;
  start = a;
  label = b;
run;
proc format cntlin = fmt_dataset fmtlib;
  select lgender;
  run;

----------------------------------------------------------------------------
|       FORMAT NAME: LGENDER  LENGTH:    6   NUMBER OF VALUES:    2        |
|   MIN LENGTH:   1  MAX LENGTH:  40  DEFAULT LENGTH   6  FUZZ: STD        |
|--------------------------------------------------------------------------|
|START           |END             |LABEL  (VER. V7|V8   20MAY2004:14:01:06)|
|----------------+----------------+----------------------------------------|
|               0|               0|male                                    |
|               1|               1|female                                  |
----------------------------------------------------------------------------

Example 2: Another simple (but not so simple) example

We have a dataset called test2 and it looks like the following. There are many repeated rows in the dataset. If we apply the same approach from the previous example, SAS will yield an error message saying that the range is repeated, or values overlap. So we need extract a smaller dataset with no repeats in it.

data test2;
  input group    variable $;
  datalines;
 0   group1
 0   group1
 0   group1
 0   group1
 1   group2
 1   group2
 1   group2
 1   group2
 2   group3
 2   group3
 2   group3
 2   group3
 3   group4
 3   group4
 3   group4
 3   group4
 ;
 run;

The easiest way of creating a dataset without repeats is to use proc sql.

proc sql;
  create table tofmt as
  select distinct group, variable
  from test2;
quit;

proc print data = tofmt;
run;

Obs    group    variable

 1       0       group1
 2       1       group2
 3       2       group3
 4       3       group4

Now we are ready to create the format out of the dataset tofmt.

data fmt_dataset;
  retain fmtname "cvar";
  set tofmt ;
  start = group;
  label = variable;
run;
proc format cntlin = fmt_dataset fmtlib;
  select cvar;
  run;

----------------------------------------------------------------------------
|       FORMAT NAME: CVAR     LENGTH:    6   NUMBER OF VALUES:    4        |
|   MIN LENGTH:   1  MAX LENGTH:  40  DEFAULT LENGTH   6  FUZZ: STD        |
|--------------------------------------------------------------------------|
|START           |END             |LABEL  (VER. V7|V8   09JUN2008:16:23:21)|
|----------------+----------------+----------------------------------------|
|               0|               0|group1                                  |
|               1|               1|group2                                  |
|               2|               2|group3                                  |
|               3|               3|group4                                  |
----------------------------------------------------------------------------

proc print data = test2;
  format group cvar.;
run;

Obs    group     variable

  1    group1     group1
  2    group1     group1
  3    group1     group1
  4    group1     group1
  5    group2     group2
  6    group2     group2
  7    group2     group2
  8    group2     group2
  9    group3     group3
 10    group3     group3
 11    group3     group3
 12    group3     group3
 13    group4     group4
 14    group4     group4
 15    group4     group4
 16    group4     group4