# Hierarchical Linear Model using SAS®

HLM is a type of mixed model developed and refined by Bryk and Raudenbush and their colleagues.  When data are hierarchically structured, as education data often are (e.g., students sampled within schools), independence of errors cannot be assumed, so classical statistical models such as OLS regression are not appropriate.  Ignoring the clustering underestimates standard errors, which makes it too easy to obtain statistically significant results.

Bryk and Raudenbush distribute their own HLM software through Scientific Software®.  I use SAS PROC MIXED because I find it easier to use than the HLM software.  I have not run into a bug with PROC MIXED.

I demonstrate how HLM works using data on Korean and American 8th graders.  The Third International Mathematics and Science Study (TIMSS) collected these students' mathematics scores along with many other measures.  I also show the SAS syntax, so SAS users can easily replicate what I did here.

1. Run this program just once to create the library that contains the tables for variables.

2. The whole program is here.

# ANOVA Model

This is an intercept-only model.  The purpose is to get the grand mean (the intercept) and the size of the variance at each level.  In this case, the variance of the math score is estimated at the individual level and at the school level.

Level 1: Math Score_ij = b0_j + error_ij

Level 2: b0_j = g00 + error00_j

/*ANOVA model or intercept-only model*/
proc mixed data=both2 covtest noclprint;
by IDcntry;
model bsmpv01= / solution ddfm=bw;
random intercept / sub=IDSCHOOL;
run;

See the SAS outputs for the ANOVA models for the two countries.

This is where you find variance estimates for Korea.  Note how highlighted parts go to the table below.

Covariance Parameter Estimates

| Cov Parm | Subject | Estimate | Standard Error | Z Value | Pr Z |
|---|---|---|---|---|---|
| Intercept | IDSCHOOL | 0.06161 | 0.01268 | 4.86 | <.0001 |
| Residual | | 0.9077 | 0.02448 | 37.08 | <.0001 |

This is for fixed effects.

Solution for Fixed Effects

| Effect | Estimate | Standard Error | DF | t Value | Pr > \|t\| |
|---|---|---|---|---|---|
| Intercept | 0.7409 | 0.02692 | 149 | 27.52 | <.0001 |
HLM ANOVA models

| | Korea: Coefficient (Error), P | Korea: Parameter Variance (Error), P | US: Coefficient (Error), P | US: Parameter Variance (Error), P |
|---|---|---|---|---|
| Intercept | 0.7409 (0.03), <.0001 | 0.0616 (0.01), <.0001 | -0.3061 (0.04), <.0001 | 0.2141 (0.02), <.0001 |
| Variance Level 1 | | 0.9077 (0.02), <.0001 | | 0.4769 (0.01), <.0001 |
| ICC | | 7% | | 45% |
| Sample Size | 2901 | | 6944 | |
| N of Schools | 150 | | 183 | |
| -2 RES LOG L | 8081.8 | | 15088.5 | |
| AIC | 8085.8 | | 15092.5 | |
| AICC | 8085.8 | | 15092.5 | |
| BIC | 8091.8 | | 15098.9 | |

Comparing the intercepts (0.74 versus -0.31), we learn that Korean students do much better than American students on this test.  Comparing the variances, we learn that in Korea scores differ mostly within schools, while the schools themselves are very similar to one another.  American schools differ much more in their average scores.
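The ICC row can be checked by hand from the two variance estimates. Below is a minimal sketch in Python (not part of the SAS program), assuming the conventional definition of the ICC as school-level variance divided by total variance. The function name `icc` and the dictionary layout are mine, for illustration only.

```python
def icc(between_var: float, within_var: float) -> float:
    """Intraclass correlation: share of total variance that lies between schools."""
    return between_var / (between_var + within_var)

# Variance estimates copied from the ANOVA-model table
estimates = {
    "Korea": {"school": 0.0616, "residual": 0.9077},
    "US": {"school": 0.2141, "residual": 0.4769},
}

for country, v in estimates.items():
    share = icc(v["school"], v["residual"])   # between / (between + within)
    ratio = v["school"] / v["residual"]       # between / within, for comparison
    print(f"{country}: ICC = {share:.3f}, school/residual ratio = {ratio:.3f}")
```

Under this definition the ICC comes out at about 6% for Korea and about 31% for the US. The 7% and 45% shown in the table match the second quantity printed here, the ratio of school-level variance to residual variance, so they are best read as that ratio rather than as a share of total variance. Either way, the contrast between the two countries stands.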

We could do this sort of graph for presentation.

# Comparison of OLS and HLM

OLS regression estimates just one set of coefficients, while in HLM (a random coefficient model) coefficients are estimated for each group unit (i.e., school).

OLS Regression:

Math Score = b0 + b1*college + b2*boy + error

HLM (Random Coefficient model):

Level 1: Math Score_ij = b0_j + b1_j*college + b2_j*boy + error_ij

Level 2: b0_j = g00 + error00_j

Level 2: b1_j = g10 + error10_j

Level 2: b2_j = g20 + error20_j

PROC REG and PROC MIXED are very similar.  The crucial line in PROC MIXED is the RANDOM statement.  There you list the variables (including the intercept) whose coefficients you want to estimate for each group unit.

/*OLS regression*/
/*by IDcntry--> to run a regression separately by nation*/
proc reg data=both2;
by IDcntry;
model BSMPV01= college2 boy2;
run;

/*college and boy as predictors*/
/*Coefficients of these two predictors are set RANDOM*/
proc mixed data=both2 covtest noclprint;
by IDcntry;
class college boy;
model bsmpv01= college boy/ solution ddfm=bw ;
random intercept college boy/ sub= IDSCHOOL s;
ods output solutionR=sol;
run;

The parts highlighted in blue are optional parts of the program.
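Substituting the level-2 equations into the level-1 equation gives a single combined equation, which may make it easier to see how the MODEL and RANDOM statements divide the work. The symbols u_0j, u_1j, u_2j (the three level-2 errors) and e_ij (the level-1 error) are my shorthand, not the author's notation, and the equation follows the level-2 equations as written, with all three coefficients random.

```latex
\[
\text{Math}_{ij}
  = \underbrace{g_{00} + g_{10}\,\text{college}_{ij} + g_{20}\,\text{boy}_{ij}}_{\text{fixed part (MODEL statement)}}
  + \underbrace{u_{0j} + u_{1j}\,\text{college}_{ij} + u_{2j}\,\text{boy}_{ij}}_{\text{random part (RANDOM statement)}}
  + e_{ij}
\]
```

The g terms are the same for every school, which is what OLS also estimates. The u terms shift the intercept and slopes school by school, and their variances are what PROC MIXED reports as covariance parameters.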

SAS output for OLS results

OLS Results

| | Korea: Coefficient (Error), p-value | US: Coefficient (Error), p-value |
|---|---|---|
| Intercept | 0.55 (0.03) *** | -0.435 (0.02) *** |
| College Educated Parent | 0.53 (0.04) *** | 0.357 (0.02) *** |
| BOY | 0.13 (0.04) *** | 0.033 (0.02) |
| Sample Size | 2900 | 6943 |
| R2 | 0.05 | 0.039 |

Math Score is a Z-score.  Korean and American samples are merged first and then standardized.

*** p<0.001; ** p<0.01; * p<0.05

SAS output for MIXED results

HLM Results

| | Korea: Parameter (Error) | Korea: Parameter Variance (Error) | US: Parameter (Error) | US: Parameter Variance (Error) |
|---|---|---|---|---|
| Intercept | 0.56 (0.03) *** | 0.02 (0.01) * | -0.38 (0.04) *** | 0.20 (0.02) *** |
| College Educated Parent | 0.49 (0.05) *** | 0.01 (0.01) | 0.16 (0.02) *** | 0.02 (0.01) ** |
| BOY | 0.14 (0.04) ** | FIXED | 0.05 (0.02) ** | FIXED |
| Residual | | 0.88 (0.02) *** | | 0.47 (0.01) *** |
| ICC | | 3% | | 30% |

*** p<0.001; ** p<0.01; * p<0.05

Generally speaking, standard errors are larger in the HLM results.  The US results also change more between OLS and HLM (for example, the college coefficient drops from 0.357 to 0.16), probably because the ICC for the US data is large.  The US data suffer more from correlated errors within schools, so accounting for that correlation in HLM changes the results more.
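A rough way to see how much the clustering matters is Kish's design effect, 1 + (m - 1) * ICC, where m is the average number of students per school. The sketch below is in Python rather than SAS and rests on several assumptions: equal school sizes, the conventional ICC (school variance over total variance) computed from the ANOVA-model estimates, and a formula that strictly applies to the standard error of a mean, not to regression coefficients. It is a back-of-the-envelope guide, not what PROC MIXED computes, and the names are illustrative.

```python
import math

def design_effect(n_students: int, n_schools: int, icc: float) -> float:
    """Kish design effect, assuming equal-sized schools."""
    avg_cluster_size = n_students / n_schools
    return 1 + (avg_cluster_size - 1) * icc

# Sample sizes and variance estimates copied from the ANOVA-model table
samples = {
    "Korea": {"n": 2901, "schools": 150, "school_var": 0.0616, "residual_var": 0.9077},
    "US": {"n": 6944, "schools": 183, "school_var": 0.2141, "residual_var": 0.4769},
}

results = {}
for country, s in samples.items():
    icc = s["school_var"] / (s["school_var"] + s["residual_var"])
    deff = design_effect(s["n"], s["schools"], icc)
    results[country] = deff
    # sqrt(deff) is the factor by which a naive standard error is too small
    print(f"{country}: design effect = {deff:.1f}, "
          f"SE inflation = {math.sqrt(deff):.1f}x, "
          f"effective n = {s['n'] / deff:.0f}")
```

Under these assumptions the design effect is roughly 2 for Korea and roughly 12 for the US, which is in line with the US results moving more once the within-school correlation is modeled.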

# Looking at coefficients obtained

It's good to look at the random effects.  You can get a feel for the data this way.  Because I requested that the results for random effects be saved in a data set by saying "ods output solutionR=sol;" I get a data set "sol" that has coefficients for all schools.  I used PROC PRINT to print out the coefficients for the parents' education variable.  Notice that when we set coefficients RANDOM, every school (the variable Subject indicates the school ID) gets its own coefficients.  For example, at school 1, the estimated random effect for having a college-educated parent is .03051 (baseline = parent graduated middle school).  This estimate is the school's deviation from the average (fixed) college effect, not the full effect at that school.

/*Eyeball the results*/
proc print data=sol;
where effect="college" and estimate > -9 and DF > -9;
run;
The SAS System, 21:56 Monday, December 8, 2003

| Obs | IDCNTRY | Effect | college | boy | Subject | Estimate | StdErr Pred | DF | tValue | Probt |
|---|---|---|---|---|---|---|---|---|---|---|
| 2 | 410 | college | 2college | | 1 | 0.03051 | 0.09281 | 2898 | 0.33 | 0.7423 |
| 3 | 410 | college | high school | | 1 | 0.01327 | 0.09111 | 2898 | 0.15 | 0.8842 |
| 7 | 410 | college | 2college | | 2 | -0.00905 | 0.09191 | 2898 | -0.10 | 0.9215 |
| 8 | 410 | college | high school | | 2 | -0.00407 | 0.09187 | 2898 | -0.04 | 0.9646 |
| 12 | 410 | college | 2college | | 3 | 0.01475 | 0.09165 | 2898 | 0.16 | 0.8722 |
| 13 | 410 | college | high school | | 3 | -0.00559 | 0.09247 | 2898 | -0.06 | 0.9518 |
| 17 | 410 | college | 2college | | 4 | -0.01486 | 0.09473 | 2898 | -0.16 | 0.8753 |
| 18 | 410 | college | high school | | 4 | 0.01260 | 0.09036 | 2898 | 0.14 | 0.8891 |
| 22 | 410 | college | 2college | | 5 | 0.01241 | 0.09039 | 2898 | 0.14 | 0.8908 |
| 23 | 410 | college | high school | | 5 | 0.01070 | 0.09429 | 2898 | 0.11 | 0.9097 |
| 27 | 410 | college | 2college | | 6 | -0.02322 | 0.09249 | 2898 | -0.25 | 0.8018 |
| 28 | 410 | college | high school | | 6 | 0.02978 | 0.09134 | 2898 | 0.33 | 0.7444 |
| 32 | 410 | college | 2college | | 7 | 0.001952 | 0.09430 | 2898 | 0.02 | 0.9835 |
| 33 | 410 | college | high school | | 7 | 0.02067 | 0.09074 | 2898 | 0.23 | 0.8198 |
| 37 | 410 | college | 2college | | 8 | 0.08799 | 0.09122 | 2898 | 0.96 | 0.3349 |
| 38 | 410 | college | high school | | 8 | -0.02738 | 0.09389 | 2898 | -0.29 | 0.7706 |

REST OMITTED
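The Estimate column in the solutionR output holds each school's deviation from the average (fixed) coefficient, which is why the values cluster around zero. To get a school-specific coefficient, add the fixed effect to the deviation. The sketch below does this in Python for the first eight schools. It assumes that IDCNTRY 410 is Korea and that the 0.49 reported for College Educated Parent in the Korean HLM results is the fixed coefficient for the "2college" level. The function and variable names are mine.

```python
# Fixed (average) college coefficient for Korea, as reported in the HLM results table.
# Assumption: this value corresponds to the "2college" level of the class variable.
fixed_college = 0.49

# Random-effect estimates (Estimate column) for the "2college" level, schools 1 to 8
deviations = {
    1: 0.03051, 2: -0.00905, 3: 0.01475, 4: -0.01486,
    5: 0.01241, 6: -0.02322, 7: 0.001952, 8: 0.08799,
}

def school_effect(fixed: float, deviation: float) -> float:
    """School-specific coefficient = average effect + that school's deviation."""
    return fixed + deviation

school_effects = {school: school_effect(fixed_college, dev)
                  for school, dev in deviations.items()}

for school, effect in school_effects.items():
    print(f"school {school}: college effect = {effect:.3f}")
```

Because the fixed coefficient is taken from a table rounded to two decimals, the school-specific values are only approximate. In SAS the same sum could be formed by merging the fixed-effects solution with the data set "sol".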

Below I am plotting coefficients, the first one for Korea and the second one for the US.
proc univariate data=sol plot;
where effect="college" and estimate > -9 and DF > -9;
by IDcntry;
var estimate;
run;
Korea:

               Histogram                          #   Boxplot
         0.085+*                                  1      0
              .
              .*                                  2      0
              .*                                  2      |
              .*****                              9      |
              .*********                         18      |
              .************                      24      |
              .*********************             42   +-----+
              .*******************************   62   *--+--*
              .***********************           45   |     |
              .*********************             42   +-----+
              .***********                       21      |
              .*********                         17      |
              .****                               7      |
              .*                                  2      |
              .*                                  2      0
              .*                                  1      0
        -0.085+**                                 3      0
               ----+----+----+----+----+----+-
               * may represent up to 2 counts

US:

               Histogram                          #   Boxplot
          0.19+*                                  1      0
              .*                                  1      0
              .
              .**                                 4      |
              .*****                             10      |
              .***                                5      |
              .***************                   30      |
              .***************                   29      |
          0.03+***********************           45   +-----+
              .***************************       53   |  +  |
              .********************************  63   *-----*
              .***********************           45   +-----+
              .*****************                 34      |
              .************                      24      |
              .*****                             10      |
              .****                               7      |
         -0.13+***                                5      0
               ----+----+----+----+----+----+--
               * may represent up to 2 counts

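I think one insight to derive here is that the US has a larger dispersion of the college effect across schools: the estimates run from about -0.13 to 0.19, compared with about -0.085 to 0.085 in Korea.  It is interesting to think about what this may mean.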

# Modeling cross-level Interaction

Under development