How can I count how many cases are missing in a string variable?

There are at least two ways to find out how many missing cases there are in a string variable. The first way is to use the missing values command to define a missing value for the variable. The second way is to create a new variable which is zero if the value is not missing and one if it is missing. To do that, you will need to use the length and rtrim functions with the compute command.

Consider the data set below. We have two string variables, fname (first name) and lname (last name). Because we have given fname a length of five and lname a length of eight, both variables are short string variables. (Note that if we had just typed (A), the length of the string variable would have been one, not the length of the first case, as it would have been in other statistical packages, such as SAS.)

data list list / id * fname (A5) lname (A8) age.
begin data
1 "Beth" "Jones" .
2 "Bob" "Jensen" 23
3 "     " "Andersen" 25
4 "Andy" "Smith" 26
5 "Al" "Peterson" 21
6 "Ann" "Glenn" 22
7 "Pete" "  " 29
8 "Pam" "Wright" 21
9 "   " "Brown" 29
end data.

Notice that there are two missing values for fname, and one missing value each for lname and age. If you run this code and look at the SPSS data editor, you will see that the cells for the missing names are empty, but the cell for the missing value of age contains a period. You will also notice that while SPSS issued an error message regarding the missing value for age, it did not issue one for any of the missing names. This is because a blank (i.e., a null string) is a valid value for a string variable. Now let’s look at the frequencies of each variable. We would expect seven valid cases and two missing for fname, and eight valid cases and one missing for both lname and age.

freq var = fname lname age.

**Statistics**
		FNAME	LNAME	AGE
N	Valid	9	9	8
	Missing	0	0	1

**FNAME**
		Frequency	Percent	Valid Percent	Cumulative Percent
Valid		2	22.2	22.2	22.2
	Al	1	11.1	11.1	33.3
	Andy	1	11.1	11.1	44.4
	Ann	1	11.1	11.1	55.6
	Beth	1	11.1	11.1	66.7
	Bob	1	11.1	11.1	77.8
	Pam	1	11.1	11.1	88.9
	Pete	1	11.1	11.1	100.0
	Total	9	100.0	100.0

**LNAME**
		Frequency	Percent	Valid Percent	Cumulative Percent
Valid		1	11.1	11.1	11.1
	Andersen	1	11.1	11.1	22.2
	Brown	1	11.1	11.1	33.3
	Glenn	1	11.1	11.1	44.4
	Jensen	1	11.1	11.1	55.6
	Jones	1	11.1	11.1	66.7
	Peterson	1	11.1	11.1	77.8
	Smith	1	11.1	11.1	88.9
	Wright	1	11.1	11.1	100.0
	Total	9	100.0	100.0

**AGE**
		Frequency	Percent	Valid Percent	Cumulative Percent
Valid	21.00	2	22.2	25.0	25.0
	22.00	1	11.1	12.5	37.5
	23.00	1	11.1	12.5	50.0
	25.00	1	11.1	12.5	62.5
	26.00	1	11.1	12.5	75.0
	29.00	2	22.2	25.0	100.0
	Total	8	88.9	100.0
Missing	System	1	11.1
Total		9	100.0

However, this is not what we see. Although age has one missing value, neither fname nor lname has any. Unlike missing values for numeric variables, missing values for string variables are not assigned a period (.). Rather, they are left blank, and SPSS does not consider them to be missing. To indicate a missing value in a string variable, you need to use the missing values command and assign a "value" to missing cases. This "value" can be one or more blanks, or a quoted code such as "9999". You can only define missing values for string variables whose length is eight or less (what SPSS calls "short" string variables). It is important to note that there are no system missing values for either short or long string variables. You can declare missing values for more than one variable within the same missing values command, as shown below.

missing values fname lname ("  ").
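
The command above gives both variables the same missing value. As a minimal sketch of assigning different missing values to different variables in one command, the syntax below keeps the blank for fname and adds a second, purely hypothetical code of "9999" for lname (no case in the example data actually contains that code). Running it would replace the declaration made above, so treat it as an illustration rather than a step in this example.

```spss
* Sketch only: the "9999" code for lname is hypothetical and not in the example data.
missing values fname ("  ") / lname ("  " "9999").

* Check what was declared for the two string variables.
display dictionary /variables = fname lname.
```

Each variable (or list of variables) is followed by its own set of values in parentheses; the slash between the two specifications is optional and is used here only for readability.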

Now let’s look at the frequencies.

freq var = fname lname age.

**Statistics**
		FNAME	LNAME	AGE
N	Valid	7	8	8
	Missing	2	1	1

**FNAME**
		Frequency	Percent	Valid Percent	Cumulative Percent
Valid	Al	1	11.1	14.3	14.3
	Andy	1	11.1	14.3	28.6
	Ann	1	11.1	14.3	42.9
	Beth	1	11.1	14.3	57.1
	Bob	1	11.1	14.3	71.4
	Pam	1	11.1	14.3	85.7
	Pete	1	11.1	14.3	100.0
	Total	7	77.8	100.0
Missing		2	22.2
Total		9	100.0

**LNAME**
		Frequency	Percent	Valid Percent	Cumulative Percent
Valid	Andersen	1	11.1	12.5	12.5
	Brown	1	11.1	12.5	25.0
	Glenn	1	11.1	12.5	37.5
	Jensen	1	11.1	12.5	50.0
	Jones	1	11.1	12.5	62.5
	Peterson	1	11.1	12.5	75.0
	Smith	1	11.1	12.5	87.5
	Wright	1	11.1	12.5	100.0
	Total	8	88.9	100.0
Missing		1	11.1
Total		9	100.0

**AGE**
		Frequency	Percent	Valid Percent	Cumulative Percent
Valid	21.00	2	22.2	25.0	25.0
	22.00	1	11.1	12.5	37.5
	23.00	1	11.1	12.5	50.0
	25.00	1	11.1	12.5	62.5
	26.00	1	11.1	12.5	75.0
	29.00	2	22.2	25.0	100.0
	Total	8	88.9	100.0
Missing	System	1	11.1
Total		9	100.0

Now the frequencies are as we would expect them to be.
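
If all you need is the count of valid and missing cases, the individual frequency tables are more output than necessary. A minimal sketch, assuming the missing values have already been declared as above, is to suppress the frequency tables so that only the Statistics table is printed:

```spss
* Print only the Statistics table (N Valid and N Missing) for each variable.
freq var = fname lname age
 /format = notable.
```

The format = notable subcommand leaves the valid and missing counts in place and drops the per-value tables, which can be helpful when a string variable has many distinct values.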

You can also use the display dictionary command to see that the missing values have been properly assigned.

display dictionary.

          List of variables on the working file

Name                                                                   Position

ID                                                                            1
          Measurement Level: Scale
          Column Width: 8  Alignment: Right
          Print Format: F8.2
          Write Format: F8.2

FNAME                                                                         2
          Measurement Level: Nominal
          Column Width: 8  Alignment: Left
          Print Format: A5
          Write Format: A5
          Missing Values: ''

LNAME                                                                         3
          Measurement Level: Nominal
          Column Width: 8  Alignment: Left
          Print Format: A8
          Write Format: A8
          Missing Values: ''

AGE                                                                           4
          Measurement Level: Scale
          Column Width: 8  Alignment: Right
          Print Format: F8.2
          Write Format: F8.2

Note that we did not assign any missing values to id or age; therefore, none are shown under those variables.

You can also use the missing values command to delete previously declared missing values. To do this, do not type anything in the parentheses after the variable listed on the missing values command. If you use the SPSS keyword all instead of a variable or a list of variables, you will delete all user-defined missing values for all variables, both string and numeric.
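
As a short sketch of the two forms just described (not part of the running example), the first command below clears the declarations for the two string variables only, and the second clears them for every variable:

```spss
* Remove the user-defined missing values for fname and lname only.
missing values fname lname ().

* Remove the user-defined missing values for all variables, string and numeric.
missing values all ().
```

After either command, blank names are counted as valid values again, so you would need to rerun the original missing values command before repeating the frequencies shown earlier.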

The second way to determine the number of missing values for a string variable is to create a new variable that has a value of zero if the cell in the original variable is not empty (i.e., it contains a character of some sort) and one if it is empty. Next, the descriptives command (abbreviated desc below) is used to sum the ones (i.e., to count the missing values). For our example, we will create a new variable for each string variable in our data set. The variable missf will indicate the missing values for fname, and missl will indicate the missing values for lname. We will use two functions with the compute command to create our new variables. The rtrim function trims the blank spaces from the right of the value. If there is nothing but spaces, it trims the length to zero. The length function returns the length of the trimmed value. The expression in parentheses is then evaluated for each case: if the length is zero, the expression is true and a one is placed in the new variable; if the expression is false, a zero is placed in the new variable.

compute missf = (length(rtrim(fname)) = 0).
compute missl = (length(rtrim(lname)) = 0).
execute.

desc var = missf missl
 /statistics = sum.

**Descriptive Statistics**
	N	Sum
MISSF	9	2.00
MISSL	9	1.00
Valid N (listwise)	9

list.

  ID FNAME LNAME      AGE MISSF MISSL

1.00 Beth  Jones        .   .00   .00
2.00 Bob   Jensen   23.00   .00   .00
3.00       Andersen 25.00  1.00   .00
4.00 Andy  Smith    26.00   .00   .00
5.00 Al    Peterson 21.00   .00   .00
6.00 Ann   Glenn    22.00   .00   .00
7.00 Pete           29.00   .00  1.00
8.00 Pam   Wright   21.00   .00   .00
9.00       Brown    29.00  1.00   .00

Number of cases read: 9 Number of cases listed: 9
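
A possible variation on the compute approach is to compare each string variable directly with a blank instead of measuring its trimmed length. This is only a sketch: it assumes that SPSS pads the shorter string with blanks when comparing strings of different lengths, so that an all-blank value equals " ", and the variable names missf2 and missl2 are hypothetical.

```spss
* Alternative sketch: flag all-blank values by comparing with a blank string.
* The names missf2 and missl2 are hypothetical and not part of the original example.
compute missf2 = (fname = " ").
compute missl2 = (lname = " ").
execute.

desc var = missf2 missl2
 /statistics = sum.
```

If the padding assumption holds in your version of SPSS, the sums should agree with those for missf and missl above; comparing the two sets of variables with the list command is a quick way to check.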