Randomized block designs are relatively common in a number of fields. Balanced randomized block designs can be analyzed using traditional anova and regression methods, but unbalanced designs require maximum likelihood methods.
We will begin by analyzing a balanced design with four levels of factor a and eight subjects (denoted s) measured on response variable y, using traditional anova.
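For reference, the randomized block model being fit here can be written as
y_ij = mu + a_i + s_j + e_ij,   i = 1,...,4 (levels of a),   j = 1,...,8 (subjects)
where mu is the overall mean, a_i is the fixed effect of level i of a, s_j is the effect of subject (block) j, and e_ij is the residual error. In the mixed-model analyses later on, the subject effects s_j are treated as random.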
use https://stats.idre.ucla.edu/stat/stata/faq/randomblock, clear
describe
Contains data from https://stats.idre.ucla.edu/stat/stata/faq/randomblock.dta
obs: 32
vars: 3 1 May 2002 09:01
size: 224 (99.9% of memory free)
-------------------------------------------------------------------------------
              storage  display     value
variable name   type   format      label      variable label
-------------------------------------------------------------------------------
s               byte   %8.0g
a               byte   %8.0g
y               byte   %8.0g
-------------------------------------------------------------------------------
Sorted by:
/* balanced design -- traditional anova */
anova y a s
Number of obs = 32 R-squared = 0.7318
Root MSE = 1.18523 Adj R-squared = 0.6041
Source | Partial SS df MS F Prob > F
-----------+----------------------------------------------------
Model | 80.5 10 8.05 5.73 0.0004
|
a | 49 3 16.3333333 11.63 0.0001
s | 31.5 7 4.5 3.20 0.0180
|
Residual | 29.5 21 1.4047619
-----------+----------------------------------------------------
Total | 110 31 3.5483871
anova, regress
Source | SS df MS Number of obs = 32
-------------+------------------------------ F( 10, 21) = 5.73
Model | 80.5 10 8.05 Prob > F = 0.0004
Residual | 29.5 21 1.4047619 R-squared = 0.7318
-------------+------------------------------ Adj R-squared = 0.6041
Total | 110 31 3.5483871 Root MSE = 1.1852
------------------------------------------------------------------------------
y Coef. Std. Err. t P>|t| [95% Conf. Interval]
------------------------------------------------------------------------------
_cons 8 .6949006 11.51 0.000 6.554875 9.445125
a
1 -3.25 .5926133 -5.48 0.000 -4.482407 -2.017593
2 -2.75 .5926133 -4.64 0.000 -3.982407 -1.517593
3 -2 .5926133 -3.37 0.003 -3.232407 -.7675933
4 (dropped)
s
1 -2.5 .8380817 -2.98 0.007 -4.242886 -.7571137
2 -2.25 .8380817 -2.68 0.014 -3.992886 -.5071137
3 -2.5 .8380817 -2.98 0.007 -4.242886 -.7571137
4 -2.5 .8380817 -2.98 0.007 -4.242886 -.7571137
5 -2.5 .8380817 -2.98 0.007 -4.242886 -.7571137
6 -1.5 .8380817 -1.79 0.088 -3.242886 .2428863
7 -.25 .8380817 -0.30 0.768 -1.992886 1.492886
8 (dropped)
------------------------------------------------------------------------------
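If you are using Stata 11 or later, one way to see the estimated marginal means of y at each level of a after the anova is the margins command; a minimal sketch:
/* sketch (Stata 11+): marginal means of y for the levels of a */
margins a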
Next, we will set two of the observations to missing and create an indicator variable, complete, that flags the subjects who still have complete data.
replace y=. if a==1 & s==2
replace y=. if a==2 & s==3
generate complete = 1
replace complete = 0 if s==2 | s==3
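As a quick check, you can list the design cells that are now missing to confirm that subjects 2 and 3 are the incomplete ones; a minimal sketch:
/* sketch: confirm which cells were set to missing */
list s a if missing(y)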
tabulate a, gen(a)
a | Freq. Percent Cum.
------------+-----------------------------------
1 | 7 23.33 23.33
2 | 7 23.33 46.67
3 | 8 26.67 73.33
4 | 8 26.67 100.00
------------+-----------------------------------
Total | 30 100.00
tabulate s if complete
s | Freq. Percent Cum.
------------+-----------------------------------
1 | 4 16.67 16.67
4 | 4 16.67 33.33
5 | 4 16.67 50.00
6 | 4 16.67 66.67
7 | 4 16.67 83.33
8 | 4 16.67 100.00
------------+-----------------------------------
Total | 24 100.00
In traditional anova, particularly in procedures that work with wide data, only subjects with complete data are included in the analysis. In this case, only the six subjects with complete data would be used.
/* unbalanced design -- traditional anova */
anova y a s if complete
Number of obs = 24 R-squared = 0.7105
Root MSE = 1.33229 Adj R-squared = 0.5560
Source | Partial SS df MS F Prob > F
-----------+----------------------------------------------------
Model | 65.3333333 8 8.16666667 4.60 0.0054
|
a | 38.125 3 12.7083333 7.16 0.0033
s | 27.2083333 5 5.44166667 3.07 0.0420
|
Residual | 26.625 15 1.775
-----------+----------------------------------------------------
Total | 91.9583333 23 3.99818841
anova, regress
Source | SS df MS Number of obs = 24
-------------+------------------------------ F( 8, 15) = 4.60
Model | 65.3333333 8 8.16666667 Prob > F = 0.0054
Residual | 26.625 15 1.775 R-squared = 0.7105
-------------+------------------------------ Adj R-squared = 0.5560
Total | 91.9583333 23 3.99818841 Root MSE = 1.3323
------------------------------------------------------------------------------
y Coef. Std. Err. t P>|t| [95% Conf. Interval]
------------------------------------------------------------------------------
_cons 8.041667 .8158584 9.86 0.000 6.302706 9.780628
a
1 -3.166667 .7691987 -4.12 0.001 -4.806175 -1.527158
2 -3 .7691987 -3.90 0.001 -4.639508 -1.360492
3 -2 .7691987 -2.60 0.020 -3.639508 -.3604917
4 (dropped)
s
1 -2.5 .9420722 -2.65 0.018 -4.507979 -.4920207
4 -2.5 .9420722 -2.65 0.018 -4.507979 -.4920207
5 -2.5 .9420722 -2.65 0.018 -4.507979 -.4920207
6 -1.5 .9420722 -1.59 0.132 -3.507979 .5079793
7 -.25 .9420722 -0.27 0.794 -2.257979 1.757979
8 (dropped)
------------------------------------------------------------------------------
If you have Stata 9, you can use xtmixed. We will run it using both the default REML estimation and full maximum likelihood. With maximum likelihood methods we can run the analysis on all of the observations, not just those from subjects with complete data.
xtmixed y a1 a2 a3 || s:, variance
Performing EM optimization:
Performing gradient-based optimization:
Iteration 0: log restricted-likelihood = -49.98054
Iteration 1: log restricted-likelihood = -49.98054
Computing standard errors:
Mixed-effects REML regression Number of obs = 30
Group variable: s Number of groups = 8
Obs per group: min = 3
avg = 3.8
max = 4
Wald chi2(3) = 29.29
Log restricted-likelihood = -49.98054 Prob > chi2 = 0.0000
------------------------------------------------------------------------------
y | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
a1 | -3.135557 .6416073 -4.89 0.000 -4.393084 -1.87803
a2 | -2.753089 .6416073 -4.29 0.000 -4.010616 -1.495562
a3 | -2 .6157501 -3.25 0.001 -3.206848 -.7931519
_cons | 6.25 .5327192 11.73 0.000 5.20589 7.29411
------------------------------------------------------------------------------
------------------------------------------------------------------------------
Random-effects Parameters | Estimate Std. Err. [95% Conf. Interval]
-----------------------------+------------------------------------------------
s: Identity |
var(_cons) | .7537252 .6242815 .1486605 3.82147
-----------------------------+------------------------------------------------
var(Residual) | 1.516593 .4889405 .8062073 2.852931
------------------------------------------------------------------------------
LR test vs. linear regression: chibar2(01) = 3.43 Prob >= chibar2 = 0.0320
test a1 a2 a3
( 1) [y]a1 = 0
( 2) [y]a2 = 0
( 3) [y]a3 = 0
chi2( 3) = 29.29
Prob > chi2 = 0.0000
/* approximate F-test: Wald chi2 divided by its degrees of freedom */
display r(chi2)/r(df)
9.7619818
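If you want a rough p-value to go with this approximate F statistic, one option is to refer it to an F distribution; the denominator degrees of freedom are only approximate in the unbalanced case, and the sketch below simply borrows the 21 residual degrees of freedom from the balanced anova:
/* sketch: rough p-value for the approximate F, assuming 3 and 21 df */
display Ftail(3, 21, r(chi2)/r(df))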
/* next with full maximum likelihood */
xtmixed y a1 a2 a3 || s:, variance mle
Performing EM optimization:
Performing gradient-based optimization:
Iteration 0: log likelihood = -50.864974
Iteration 1: log likelihood = -50.864974
Computing standard errors:
Mixed-effects ML regression Number of obs = 30
Group variable: s Number of groups = 8
Obs per group: min = 3
avg = 3.8
max = 4
Wald chi2(3) = 33.86
Log likelihood = -50.864974 Prob > chi2 = 0.0000
------------------------------------------------------------------------------
y | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
a1 | -3.13568 .5967199 -5.25 0.000 -4.305229 -1.96613
a2 | -2.753432 .5967199 -4.61 0.000 -3.922982 -1.583883
a3 | -2 .572654 -3.49 0.000 -3.122381 -.8776188
_cons | 6.25 .496394 12.59 0.000 5.277086 7.222914
------------------------------------------------------------------------------
------------------------------------------------------------------------------
Random-effects Parameters | Estimate Std. Err. [95% Conf. Interval]
-----------------------------+------------------------------------------------
s: Identity |
var(_cons) | .6595255 .5088202 .1453899 2.991774
-----------------------------+------------------------------------------------
var(Residual) | 1.31173 .393547 .7285611 2.361691
------------------------------------------------------------------------------
LR test vs. linear regression: chibar2(01) = 3.99 Prob >= chibar2 = 0.0229
test a1 a2 a3
( 1) [y]a1 = 0
( 2) [y]a2 = 0
( 3) [y]a3 = 0
chi2( 3) = 33.86
Prob > chi2 = 0.0000
/* approximate F-test */
display r(chi2)/r(df)
11.287767
If you don't have Stata 9 (and therefore don't have xtmixed), you can use xtreg with the mle (maximum likelihood) option.
/* unbalanced design -- maximum likelihood using xtreg */
xtreg y a1 a2 a3, i(s) mle
Random-effects ML regression Number of obs = 30
Group variable (i): s Number of groups = 8
Random effects u_i ~ Gaussian Obs per group: min = 3
avg = 3.8
max = 4
LR chi2(3) = 20.35
Log likelihood = -50.864974 Prob > chi2 = 0.0001
------------------------------------------------------------------------------
y | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
a1 | -3.13568 .5967916 -5.25 0.000 -4.30537 -1.96599
a2 | -2.753432 .5972855 -4.61 0.000 -3.92409 -1.582774
a3 | -2 .5726538 -3.49 0.000 -3.122381 -.8776192
_cons | 6.25 .4963939 12.59 0.000 5.277086 7.222914
-------------+----------------------------------------------------------------
/sigma_u | .8121118 .3132698 2.59 0.010 .1981143 1.426109
/sigma_e | 1.145308 .1718083 6.67 0.000 .80857 1.482046
-------------+----------------------------------------------------------------
rho | .3345713 .1958727 .0692199 .7346624
------------------------------------------------------------------------------
Likelihood-ratio test of sigma_u=0: chibar2(01)= 3.99 Prob>=chibar2 = 0.023
test a1 a2 a3
( 1) [y]a1 = 0
( 2) [y]a2 = 0
( 3) [y]a3 = 0
chi2( 3) = 33.83
Prob > chi2 = 0.0000
/* approximate F-test */
display r(chi2)/r(df)
11.277428
As you can see, the maximum likelihood solutions are much closer to the balanced-design results than is the analysis using only the subjects with complete data.
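As a side note, in Stata 13 and later xtmixed was renamed mixed, and factor-variable notation removes the need for the tabulate-generated dummies. A minimal sketch, assuming the same y, a, and s variables (ib4.a makes level 4 the reference category, matching the dummy coding above):
/* sketch for Stata 13+: mixed replaces xtmixed */
mixed y ib4.a || s:, reml    /* REML; omit the reml option for full maximum likelihood */
testparm i.a                 /* joint Wald test of the a effects */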
