Data Setup for Comparing Means in SPSS
David P. Nichols
Senior Support Statistician
SPSS, Inc.
April 1994
Testing hypotheses about equality of means is one of the
most commonly used applications of statistical software. SPSS
offers a variety of procedures capable of performing mean
comparisons. Several of these procedures are fairly simple,
designed to easily handle specific problems, while others are
more general, and necessarily more complex. In order to
successfully employ any of these options, users need to be
familiar with the data structure required by SPSS. Judging by the
number of statistical support calls that involve questions of
data setup for procedures ranging from T-TEST to MANOVA, many
users are not clear on the logic of this structure.
SPSS, like most other statistical software,
primarily works with a rectangular cases-by-variables format. That
is, rows of the rectangular data matrix represent cases, while
columns denote variables. (Even though on occasion data sets are
large enough to require multiple records or lines per case, the
logic remains as if we were still using one line and simply
wrapping it around as many times as necessary.) The decisive
question when we look to compare two or more means is whether
they represent means of independent or related samples.
The independent vs. related samples distinction is usually
equivalent to the question of whether we want to compare means of
two or more groups of cases or the means of the same group of
cases under two or more conditions. For this reason the terms
between subjects and within subjects are commonly used to denote
the type of comparison(s) desired. In the T-TEST procedure these
two kinds of analysis are referred to as independent vs. related
samples tests. The generalization of the related samples (within
subjects) situation to more than two time points or conditions is
handled most generically in the MANOVA procedure via the
WSFACTORS specification, though the RELIABILITY procedure's
STATISTICS=ANOVA option also provides some tests of means of
related samples.
Setup for Independent Samples (Between Subjects Analyses)
If the desired comparison(s) involve between subjects or
independent samples data, the appropriate data structure requires
one or more grouping variables to identify which group each line
of data belongs to, together with one or more separate variables
containing the values on which we wish to compare the groups.
Thus the proper data setup for a comparison of the means of two
groups of cases would be along the lines of:
DATA LIST FREE / GROUP Y.
BEGIN DATA
1 5.2
1 4.3
...
2 7.1
2 6.9
END DATA.
In other words, SPSS needs something to tell it which group a case
belongs to (this variable--called GROUP in our example--is often
referred to as a factor variable), as well as the value of the
measured variable(s) of interest (Y). Once the data are
successfully entered in this format, any of the following procedure
commands can be used to obtain a test of the null hypothesis of
equal population means for the two groups:
T-TEST GROUPS=GROUP /VAR=Y.
MEANS Y BY GROUP /STATISTICS=ANOVA.
ONEWAY Y BY GROUP(1,2).
ANOVA Y BY GROUP(1,2).
MANOVA Y BY GROUP(1,2).
For situations in which there are three or more groups the
same structure would prevail, except that there would be more
than two values for the GROUP variable; of course, we could then
no longer use the T-TEST procedure, which compares only two means
at a time. If the data groupings are defined by more than one
type of factor, such as gender and geographical region, then we
simply have more grouping variables (such as GENDER with two
categories and REGION with several) entered in our data set. In
this case we move to either ANOVA or MANOVA, since MEANS and
ONEWAY are designed specifically for use with one grouping
factor.
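As a sketch of such a multiple-factor setup (the coding of GENDER
as 1-2 and REGION as 1-4, and the data values, are purely
illustrative assumptions), the data and syntax might look like:
* Hypothetical two-factor example; factor codes are illustrative.
DATA LIST FREE / GENDER REGION Y.
BEGIN DATA
1 1 5.2
1 3 4.3
...
2 4 6.9
END DATA.
ANOVA Y BY GENDER(1,2) REGION(1,4).
MANOVA Y BY GENDER(1,2) REGION(1,4).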
Setup for Paired or Related Samples (Within Subjects Analyses)
Suppose instead of wanting to compare the means of two or
more groups of cases, we now want to make comparisons among
measurements taken on the same cases at different times or under
different conditions. Since the repeated measures or time example
is so common, we will call the factor of interest here TIME. The
difference between this situation and that involving between
subjects analyses is that here we are concerned with comparing
related measurements on the same cases. Thus the data setup is
different. Rather than having one variable distinguish among the
cases on the basis of group membership, we simply have two
measured variables for each case. If we call these TIME1 and
TIME2, the data setup might look like:
DATA LIST FREE / TIME1 TIME2.
BEGIN DATA
1.5 3.8
2.1 4.2
...
3.2 4.7
END DATA.
The MEANS, ONEWAY and ANOVA procedures are not useful here, as
they do not handle within subjects data. Instead, we could obtain
the same results, in varying forms of presentation, from any of
the following specifications:
T-TEST PAIRS=TIME1 TIME2.
RELIABILITY VARIABLES=TIME1 TIME2
/STATISTICS=ANOVA.
MANOVA TIME1 TIME2
/WSFACTORS=TIME(2).
Should we move to a comparison involving more than two
related means, we would no longer be able to use the T-TEST
procedure. The results produced by the RELIABILITY procedure,
though presented in a format more familiar to many people than
that of MANOVA, provide only part of the information given by
MANOVA, and this information is strictly valid only under some
fairly severe assumptions (notably sphericity of the covariance
matrix of the repeated measures). For this reason users are
generally much safer working with MANOVA for within subjects
analyses. Adding more time points would produce no structural
changes in the MANOVA specification, only a longer list of
dependent variables and a change in the number of levels of the
WSFACTOR TIME. Note that this name is arbitrary; we can call this
factor anything we want as long as it is eight characters or less
and does not match any reserved words in MANOVA.
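As a sketch of this extension (the variable names TIME1 through
TIME3 and the data values are illustrative), a three-occasion
analysis might look like:
* Hypothetical three-occasion example; names and values illustrative.
DATA LIST FREE / TIME1 TIME2 TIME3.
BEGIN DATA
1.5 3.8 4.1
2.1 4.2 4.4
...
3.2 4.7 5.0
END DATA.
MANOVA TIME1 TIME2 TIME3
/WSFACTORS=TIME(3).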
Note that if we are using data in which there are both
grouping or between subjects factors and related or repeated
variables forming within subjects factors, MANOVA is the only
procedure we can use. If we had two groups measured at two time
points and wished to perform a factorial analysis of variance on
these data, testing the difference between groups averaged over
time, the change over time averaged over groups, and the
interaction of the two, we would use syntax such as:
DATA LIST FREE / GROUP TIME1 TIME2.
BEGIN DATA
1 2.1 4.2
1 3.0 3.6
...
2 2.5 2.1
2 3.1 2.6
END DATA.
MANOVA TIME1 TIME2 BY GROUP(1,2)
/WSFACTORS=TIME(2).
