Hierarchical Linear Model using SAS (r)

HLM is a type of mixed model developed and refined by Bryk and Raudenbush, as well as their colleagues.  When data is hierarchically structured like education data (e.g., students sampled from the set of schools), independence of errors cannot be assumed; so classical statistical models, such as OLS, would not be good.  Standard errors would be underestimated, making it overly easy to attain statistically significant results.

Bryk and Raudenbush have their software out through Scientific Software (R).  I use SAS proc mixed because I find it easier to use than HLM software.  I have not experienced a bug with SAS PROC MIXED.

I demonstrate how HLM works using Korean and American data of 8th graders.  TIMSS or Third International Math and Science Study collected mathematics scores of these students, as well as lots of other things.  I also show SAS syntax, so it is easy to replicate what I did here--at least for SAS users.

Download SAS DATA .  Here I assume that you put it in your "my document" folder.

1. Run this program just once to create library that contains tables for variables.

2. The whole program is here.


ANOVA Model

This is an intercept-only model.  The purpose is to get a grand-mean (=intercept) and the size of variance at several levels.  In this case, the variances of math score are estimated at individual-level and school-level.

Level1: Math Score= b0+error_ij

Level2: b00=g00+error00_j

@

/*Anova model or intercept-only model*/
proc mixed data=both2 covtest noclprint;
by IDcntry;
model bsmpv01= / solution ddfm=bw ;
random intercept/ sub= IDSCHOOL ;
run;

See SAS outputs for anova models for two countries

This is where you find variance estimates for Korea.  Note how highlighted parts go to the table below.

                                 Covariance Parameter Estimates

                                                    Standard         Z
              Cov Parm      Subject     Estimate       Error     Value        Pr Z

              Intercept     IDSCHOOL     0.06161     0.01268      4.86      <.0001
              Residual                    0.9077     0.02448     37.08      <.0001
This is for fixed effects.
                                   Solution for Fixed Effects

                                         Standard
                Effect       Estimate       Error      DF    t Value    Pr > |t|

                Intercept      0.7409     0.02692     149      27.52      <.0001
KOREA HLM Anova models @ @ @ @ @ USA  HLMAnova models @ @ @
@ KOREA @ @ @ @ US @ @ @
@ Coefficient Error P Parameter Variance Error P @ Coefficient Error P Parameter Variance Error P
Intercept 0.7409 (0.03) < .0001 0.0616 (0.02)   <.0001 @ -0.3061 (0.04) < .0001 0.2141 (0.02)   <.0001
Variance @ @ @ @ @ @ @ @ @ @ @ @ @
Level 1 0.9077 (0.01)   <.0001 @ @ @ @ 0.4769 (0.01)   <.0001 @ @ @
ICC 7% @ @ @ @ @ @ 45% @ @ @ @ @
Sample Size 2901 @ @ @ @ @ @ 6944 @ @ @ @ @
N of Schools 150 @ @ @ @ @ @ 183 @ @ @ @ @
 -2 RES LOG L 8081.8 @ @ @ @ @ @ 15088.5 @ @ @ @ @
AIC 8085.8 @ @ @ @ @ @ 15092.5 @ @ @ @ @
  AICC 8085.8 @ @ @ @ @ @ 15092.5 @ @ @ @ @
 BIC 8091.8 @ @ @ @ @ @ 15098.9 @ @ @ @ @

Comparing intercepts, we learn that Koreans do a lot better than Americans in this test.  Comparison variances, we learn that in Korea students differ in scores mostly within schools, while schools are very similar to each other.  American schools are very different in average scores.

We could do this sort of graph for presentation. 


Comparison of OLS and HLM

OLS Regression HLM (Random Coefficient model)
OLS coefficients are just one set of them, while for HLM coefficients are esimated for each group unit (i.e., school)
Math Score = b0 + b1* college + b2* boy + error.

@

Level1: Math Score= b0+b1*college+b2*boy+error_ij

Level2: b00=g00+error00_j

Level2:b10=g10+error10_j

Level2:b20=g20+error20_j

@

PROC REG and PROC MIXED are very similar.  The crucial line in PROC MIXED is RANDOM statement.  You put variable names (including intercept) whose coefficients you want to estimate for each group unit.
/*OLS regression*/
/*by IDcntry--> to run a regression separately by nation*/
proc reg data=both2;
by IDcntry;
model  BSMPV01= college2 boy2;
run;
/*college and boy as predictors*/
/*Coefficients of these two predictors are set RANDOM*/
proc mixed data=both2 covtest noclprint;
by IDcntry;
class college boy;
model bsmpv01= college boy/ solution ddfm=bw ;
random intercept college boy/ sub= IDSCHOOL s;
ods output solutionR=sol;
run;

highlighted in blue are optional part of the programming.

@

SAS output for OLS results

OLS Results @ @ @ @ @ @ @
@ @ @
@ Korea @ US
@ Coefficient Error p-value Coefficient Error p-value
Intercept 0.55 (0.03) *** @ -0.435 (0.02) ***
College Educated Parent 0.53 (0.04) *** @ 0.357 (0.02) ***
BOY 0.13 (0.04) *** @ 0.033 (0.02) @
Sample Size 2900 @ @ @ 6943 @ @
R2 0.05 @ @ @ 0.039 @ @
Math Score is Z-score.  Korean and American samples are merged first and then standardized.
*** p<0.001; ** p<0.01; * p<0.05 @ @ @ @ @ @

SAS output for MIXED results

HLM Results @ @ @ @ @ @ @ @ @ @ @ @ @ @ @
@ @ @ @ @ @ @ @ @ @ @ @ @ @
@ Korea @ @ US
@ Parameters @ Parameter Variance @ @ Parameters @ Parameter Variance
@ @ Error @ @ @ Error @ @ @ @ Error @ @ @ Error @
Intercept 0.56 (0.03) *** @ 0.02 (0.01) * @ Intercept -0.38 (0.04) *** 0.20 (0.02) ***
College Educated Parent 0.49 (0.05) *** @ 0.01 (0.01) @ @ College Educated Parent 0.16 (0.02) *** 0.02 (0.01) **
BOY 0.14 (0.04) ** @ FIXED @ @ @ BOY 0.05 (0.02) ** FIXED @ @
Residual 0.88 (0.02) *** @ @ @ @ @ Residual 0.47 (0.01) *** @ @ @ @
ICC  3% @ @ @ @ @ @ @ ICC  30% @ @ @ @ @ @
*** p<0.001; ** p<0.01; * p<0.05 @ @ @ @ @ @ @ @ @ @ @ @ @

Generally speaking, standard errors are bigger for HLM results.  Also US results look different probably reflecting that fact that the ICC for US data is large.  US data suffers more from correlated errors within schools, which probably is why when the correlated error is fixed in HLM, the results look different.

Looking at coefficients obtained

It's good to look at the random effects.  You can have a feel for the data in this way.Because I requested that results for random effects be saved in a data set by saying "ods output solutionR=sol;" I get a data set "sol" that has coefficients for all schools.  I did proc print to print out the coefficients for parents' education variables.  Notice when we set coefficients RANDOM, all schools (variable subject indicates ID) get coeffcients.  For example, at school 1, the effect of parent being a college educated is .03051 (baseline = parent graduated middle school).

/*Eyeball the results*/
proc print data=sol;
where effect="college" and estimate > -9 and DF > -9;
run;
 The SAS System       21:56 Monday, December 8, 2003  63

                                                                 StdErr
   Obs  IDCNTRY  Effect     college    boy  Subject  Estimate      Pred    DF   tValue   Probt

     2    410    college  2college             1      0.03051   0.09281  2898     0.33  0.7423
     3    410    college  high school          1      0.01327   0.09111  2898     0.15  0.8842
     7    410    college  2college             2     -0.00905   0.09191  2898    -0.10  0.9215
     8    410    college  high school          2     -0.00407   0.09187  2898    -0.04  0.9646
    12    410    college  2college             3      0.01475   0.09165  2898     0.16  0.8722
    13    410    college  high school          3     -0.00559   0.09247  2898    -0.06  0.9518
    17    410    college  2college             4     -0.01486   0.09473  2898    -0.16  0.8753
    18    410    college  high school          4      0.01260   0.09036  2898     0.14  0.8891
    22    410    college  2college             5      0.01241   0.09039  2898     0.14  0.8908
    23    410    college  high school          5      0.01070   0.09429  2898     0.11  0.9097
    27    410    college  2college             6     -0.02322   0.09249  2898    -0.25  0.8018
    28    410    college  high school          6      0.02978   0.09134  2898     0.33  0.7444
    32    410    college  2college             7     0.001952   0.09430  2898     0.02  0.9835
    33    410    college  high school          7      0.02067   0.09074  2898     0.23  0.8198
    37    410    college  2college             8      0.08799   0.09122  2898     0.96  0.3349
    38    410    college  high school          8     -0.02738   0.09389  2898    -0.29  0.7706
REST OMMITTED
                     
Below I am plotting coefficients, the first one for Korea and the second one for the US.
proc univariate data=sol plot;
where effect="college" and estimate > -9 and DF > -9;
by IDcntry;
var estimate;run;
@
                                 Histogram                 #             Boxplot
                 0.085+*                                   1                0
                      .
                      .*                                   2                0
                      .*                                   2                |
                      .*****                               9                |
                      .*********                          18                |
                      .************                       24                |
                      .*********************              42             +-----+
                      .*******************************    62             *--+--*
                      .***********************            45             |     |
                      .*********************              42             +-----+
                      .***********                        21                |
                      .*********                          17                |
                      .****                                7                |
                      .*                                   2                |
                      .*                                   2                0
                      .*                                   1                0
                -0.085+**                                  3                0
                       ----+----+----+----+----+----+-
                       * may represent up to 2 counts

                                 Histogram                  #             Boxplot
                  0.19+*                                    1                0
                      .*                                    1                0
                      .
                      .**                                   4                |
                      .*****                               10                |
                      .***                                  5                |
                      .***************                     30                |
                      .***************                     29                |
                  0.03+***********************             45             +-----+
                      .***************************         53             |  +  |
                      .********************************    63             *-----*
                      .***********************             45             +-----+
                      .*****************                   34                |
                      .************                        24                |
                      .*****                               10                |
                      .****                                 7                |
                 -0.13+***                                  5                0
                       ----+----+----+----+----+----+--
                       * may represent up to 2 counts

@


I think one insight to derive here is that US has a slightly larger dispersion of college effect.  Interesting to think of what this may mean.


Modeling cross-level Interaction

Under development