Statistical joint test of categorical variables when expressed as a series of dummy variables

When I have multiple subgroup represented in a series of dummy variables (e.g.,  race groups, grade levels, etc.), I want to know if dummy variables as a system  contribute to the model with statistical significance.   This may be called a joint test because I want to know if, for example, race groups together (not separately)  make a differences to the model.

The easiest way to do this is to treat those variables as classification variables.  You will get a joint statistical test in one of the result tables.

proc glimmix ..;

class race grade_level;

....

run;

In my applications I almost always use numeric version of variables, i.e., dummy variables (coded as 0 or 1).  I like this approach because I can use PROC MEANS on them to create a descriptive statistics table.

The question is how I get joint statistical tests when  all of my predictors are numerically coded and thus I can't rely on the class statement (shown above in the syntax example).

The GLIMMIX syntax below treats race groups and grade levels as numerically coded dummy variables (if YES 1, else 0).

The parameter estimate tables will show coefficients derived for each of the numeric variables; however, I wouldn't know if race groups as a group matters to the model or grade levels as a system matters to the model.   For example, even when  the coefficient derived for subjects being black is statistically significant, that is only about how black students are different from white students (reference group in this example).  We don't know if race as a group matters and race groups jointly make a statistically significant contribution to the model.

<Again this can be done easily by using class variables instead (as shown earlier); however, I like using numeric variables in my models.>

Contrast statements will do the trick.

proc glimmix data=usethis namelen=32;
class groupunit;
model Y= treat black hispanic other grade09 grade10 grade11/
solution ddfm=kr dist=&dist link=&link ;
output out=&outcome.gmxout residual=resid;
random intercept /subject=groupunit;
CONTRAST 'Joint F-Test Race groups ' Black 1, Hispanic 1, other 1;
CONTRAST 'Joint F-Test Grade levels' grade09 1, grade10 1, grade11 1,

ods output
ParameterEstimates=_3_&outcome.result covparms=_3_&outcome.cov
Contrasts=cont&outcome;
run;

 

PROC GLIMMIX's option: lsmeans group / ilink diff;

proc glimmix data=asdf METHOD=RSPL;

class CAMPUS_14 subgroup;

model y=x1 x2 x3 subgroup

/dist=binomial link=logit s ddfm=kr;

lsmeans group / ilink diff;

ods output  ModelInfo=x1var1 ParameterEstimates=x2var1 CovParms=x3var1

Diffs=DIF_RESULT1 LSMeans=LS1;

run;

HLM: What happens if I enter level2 variables at level 1?

The goal of this exercise is to find out what happens if I intentionally use a level2 variable at level 1 in HLM.   I found that the coefficients and standard errors remain about the same.  The parameter that differed was just the degree of freedom, which was consistent with my expectation.

Using my old NELS dataset, I ran two different HLM models using Bryk and Raudenbush’s software (See model 1 and model 2 equations in the table below).

  • The URBAN as level 1 covariate model (I entered the level2 variable URBAN wrongly at level 1)
  • The URBAN as level 2 covariate model (I entered the level2 variable URBAN correctly at level 2)

The outcome variable is the achievement composite (POSTTEST), students are level 1 and schools are level 2.  When expressed as mixed models, the two models are identical, which is why I expected most parameters to come out the same.

POSTTESTij = γ00
γ10*URBANij  + u0jrij

The first model (MODEL 1; see below) included URBAN (students are in urban school) as a level 1 predictor.  Of course this is a wrong specification because urban is a school characteristic.  In the second model (MODEL 2), I used it at the expected level, which is at level 2 (school level).

These models look different, but AGAIN when expressed as mixed models, they are identical.  As the third model (MODEL 3), I replicated the same HLM model using SAS PROC GLIMMIX.  SAS requires that the equation be expressed as a mixed model.

Results showed that coefficients and standard errors are more or less the same across three models.  The only one thing that was different was degree of freedom.

Conclusion: As long as variables enter the model as fixed effects as done here, there is nothing magical about the HLM model.  HLM software or SAS PROC GLIMMIX (option ddfm=kr) adjust degree of freedom values, accounting for the fact that URBAN is a school-level variable and thus should not be awarded a value that is too large.  Notice that under the correct specification (MODEL 2 and MODEL 3), the degree of freedom for URBAN is close to the number of schools, not to the number of students.

Thanks for any comments you may have.

MODEL 1 MODEL 2 MODEL 3
URBAN as LEVEL 1 covariate URBAN as level 2 covariate SAS PROC GLIMMIX
Level-1 Model

POSTTESTij = β0j + β1j*(URBANij) + rij

Level-2 Model

β0j = γ00 + u0j
β1j = γ10

Mixed Model

POSTTESTij = γ00
γ10*URBANij  + u0jrij

Level-1 Model

POSTTESTij = β0j + rij

Level-2 Model

β0j = γ00 + γ01*(URBAN_LEj) + u0j

Mixed Model

POSTTESTij = γ00 + γ01*URBAN_LEj  + u0jrij

proc glimmix data=kaz.level1;

class schoolID;

model

posttest =

urban

/solution ddfm=kr dist=normal link=identity s ;

random schoolID;

run;

 

Final estimation of fixed effects
(with robust standard errors)

Fixed Effect  Coefficient  Standard
error
 t-ratio  Approx.
d.f.
 p-value
For INTRCPT1, β0
    INTRCPT2, γ00 52.643432 0.526139 100.056 125 <0.001
For URBAN slope, β1
    INTRCPT2, γ10 -0.450022 1.157924 -0.389 692 0.698

 

Final estimation of variance components

Random Effect Standard
Deviation
Variance
Component
  d.f. χ2 p-value
INTRCPT1, u0 3.76951 14.20923 125 292.48369 <0.001
level-1, r 8.39004 70.39271

Statistics for current covariance components model

 

Final estimation of fixed effects
(with robust standard errors)

Fixed Effect  Coefficient  Standard
error
 t-ratio  Approx.
d.f.
 p-value
For INTRCPT1, β0
    INTRCPT2, γ00 52.643459 0.526140 100.056 124 <0.001
    URBAN_LE, γ01 -0.449983 1.157920 -0.389 124 0.698

 

Final estimation of variance components

Random Effect Standard
Deviation
Variance
Component
  d.f. χ2 p-value
INTRCPT1, u0 3.76919 14.20678 124 292.48068 <0.001
level-1, r 8.39007 70.39334

Statistics for current covariance components model

 

Solutions for Fixed Effects
Effect Estimate Standard Error DF t Value Pr > |t|
Intercept 52.6434 0.5460 95.01 96.41 <.0001
urban -0.4501 1.1095 132.4 -0.41 0.6856
 

 

Covariance Parameter Estimates
Cov Parm Estimate Standard Error
schoolID 14.2150 3.4610
Residual 70.3913 3.7455

 

Datasets:

www.nippondream.com/file/datafiles_HLM.zip

Results from Model 1

Results from Model 2

Results from PROC GLIMMIX

 

PROC GLIMMIX error messages

While running PROC GLIMMIX with a large dataset with categorical variables explicitly treated as CLASS variables AND with weight, I got the following error message:

WARNING: Obtaining minimum variance quadratic unbiased estimates as starting values for the         covariance parameters failed.

The weight and the use of categorical variables were the cause of the problem as I found out that without weight the model produces results.

SAS support staff spent time trying to figure out what the issue is/was.  I sent them a dataset (though I replaced the variables with randomly generated values) and the exact syntax.  They determined that (thank you so much for going extra miles to figure this out!):

"the computation of MIVQUE fails because the default singular value (1e-12) leads to the detection of singularity while sweeping of the mixed model equation"  -- May 12, 2016.

They proposed that singular value should be specified as a very small value to let the conversion happen:

proc glimmix data=example singular=1e-9;
weight psweight;
class race_cate level2 ;
model outcome= race_cate / s dist=normal link=identity ddfm=kr;
random intercept/subject=level2;
run;

****

Before hearing this final response from SAS, I wrote this and said that I chose to use PROC MIXED because it produces results.

 

SAS Support person suggested that I use of PARMS statement after Random, so I feed in the initial values for variance-covariance values manually/directly.  I did something like this:

PARMS (0.2) (0.4);

Then I get:

“ERROR: Values given in PARMS statement are not feasible.”

Out of curiosity, I used PROC MIXED to run the same model without PARMS but with a weight statement.  This time, the model produced results without encountering any errors.

PROC MIXED and PROC GLIMMIX produce identical results (or almost or essentially identical results) when the same models are specified.

I *think* my solution for now is to use PROC MIXED for this specific model.

The following two produced the identical results.
proc mixed data=analysis noclprint ;
where random_number > 0.8;
weight psweight;
class level2 race_cate2 loc_cate;
model zposttest=
<I put variable names here -- including categorical variables, level2 race_cate2 loc_cate>
/ s ddfm=kr ;
random int/subject=level2;
ParameterEstimates=kaz5
Diffs=DIF_GROUP5
LSMeans=LS5
Tests3=jointtest;
run;

proc glimmix data=analysis noclprint ;
where random_number > 0.8;
weight psweight;
class level2 race_cate2 loc_cate;
model zposttest=
<I put variable names here -- including categorical variables, race_cate loc_cate>
/ s dist=normal link=identity ddfm=kr STDCOEF ;
lsmeans
race_cate2 loc_cate
/ ilink diff
at (
treat
Zpretest
male
LEP
SPECED
DISADV
)=(0 0 0 0 0 0);
random int/subject=level2;
*parms (.03 ) (.45) /* /hold=1,2*/ ;
*ods output
ParameterEstimates=kaz5
Diffs=DIF_GROUP5
LSMeans=LS5
Tests3=jointtest;
run;

Doing HLM and Time-series analysis at the same time using GLIMMIX

HLM (multilevel models) and econometric analyses (e.g., time series analysis, ARIMA, etc.) are treated as different approaches (the goal of which is to deal with data dependency problem), they can be implemented in the same model via. SAS PROC GLIMMIX.  However, I believe doing this is computationally demanding and models may not converge.

http://support.sas.com/documentation/cdl/en/statug/63033/HTML/default/viewer.htm#statug_glimmix_sect020.htm

For detailed discussion of this topic, see p.28 of my HLM manual:

http://www.estat.us/sas/PROCMIXED.doc

Kenward-Roger Degrees of Freedom (SAS GLIMMIX)

I will edit this essay later.

Imagine I have an HLM model where level-1 are students and level-2 are schools.  I can enter teacher-level variables, but people who are used to HLM software by SSI will wonder how the use of teacher-level variables is possible without making the model 3-level models (level1 students, level2 teachers, level3 schools).  This is because in the old HLM tradition (people who studied HLM using HLM software in 1990s), equations are written/processed separately by levels and data must be prepared by levels (student data file and school data file).

If there are student-level and school-level, some HLM users may wonder how teacher-level variables can enter the model (or prepared in what file? student-level file or school-level file???).

People who use SAS or other software programs (maybe R?) will not wonder this because they think in terms of one whole equation and say, "of course we can enter variables of any levels."

The software programs they use adjust degree of freedom to take into consideration that teacher-level variables have a lot less number of values compared to the outcome variable.

K-R option in PROC GLIMMIX adjusts degree of freedom to account for the fact that group-level variables have a lot less possible values when compared to outcome variables.

K-R degree of freedom option seems most appropriate for multilevel modeling applied in educational evaluation studies (where typically students are nested within schools).  This option kicks in only when at least one random coefficient is estimated (e.g., intercepts as random effects).

proc glimmix data=i3d;
class school ;
model y =< here variable names >
/solution ddfm=kr dist=normal link=identity s ;
random int /sub=school;
run;

After playing with data type of simulation (I will do a better and well-documented simulation in the future), I learned the following:

For student-level (level-1) variables, degree of freedom under KR option is close to the number of students (minus a small number of cases maybe close to the number of predictors).  DF is larger if variance contained in the variable is greater.  For binary variables, DF is the largest when variance is .50 (DF gets smaller as a proportion gets close to 0 or 1).

For school-level (level2) variables, DF is close to the number of schools minus a small number maybe close to a number of predictors.  This may be also adjusted by variance of the variable.

I created a fake teacher-level predictors by creating two possible numeric values per school.  DF for this variable was close to the number of teachers in the data (two per school) minus a small number.  I think this is also adjusted by variance of the variable (which I didn't test exactly).

 

Proc mixed and Proc glimmix produce identical results

PROC MIXED and PROC GLIMMIX produce identical results for the following linear model settings.

proc glimmix data=kaz.asdf;
where IMPACT_ANALYSIS=1;
class school;
model
y =
x1 x2 x3
/solution ddfm=kr dist=normal link=identity s ;
random school;
run;

proc mixed data=kaz.asdf;
where IMPACT_ANALYSIS=1;
class school;
model
y =
x1 x2 x3
/solution s ddfm=kr;
random school;
run;

How to derive subgroup adjusted outcome averages and conduct pairwise stat-test

When you run a statistical interaction model (e.g., Y=TREATMENT + GENDER + TREATMENT*GENDER), you also want to run a mathematically equivalent model:
Y= T_MALE + C_MALE + T_FEMALE + C_FEMALE.

SUBGROUP defines the four groups below.

This is for linear model (HLM because I have the random statement).  

proc glimmix data=asdf3 namelen=32;
class CAMPUSID  SUBGROUP;
model y=
x1 x2 x3
SUBGROUP
/solution ddfm=kr dist=normal link=identity s noint;
lsmeans SUBGROUP / ilink diff;
random int / subject = CAMPUSID;
ods output ModelInfo=x1var1 ParameterEstimates=x2var1 CovParms=x3var1
nobs=x4var1 Diffs=DIF_RESULT1;
run;

This is for logistic regression model.  

proc glimmix data=asdf METHOD=RSPL;
class CAMPUS_14 subgroup;
model y=x1 x2 x3 subgroup
/dist=binomial link=logit s ddfm=kr;
lsmeans group / ilink diff;
ods output  ModelInfo=x1var1 ParameterEstimates=x2var1 CovParms=x3var1
Diffs=DIF_RESULT1 LSMeans=LS1;
run;