The meaning of intercept and centering of predictor variables

The result table of a regression model includes, among other things, a column of coefficients.  The intercept value, shown in the top cell of the coefficient column, may look mysterious and even arbitrary.  The intercept is the predicted value for a subject whose values for all predictors in the model are 0.  If the regression model includes gender as a predictor (coded as 1 if male, else 0), the intercept will indicate the average outcome value for female subjects.  If the model includes gender and body weight, the intercept will indicate the average outcome value for females who have a body weight of zero.  Nobody’s weight is 0; thus, the meaning of the intercept in this case is nonsensical.  If an analyst is not particularly interested in giving the intercept a substantive meaning, he/she can ignore it and safely interpret the rest of the coefficients.

Personally, I want all values in my result tables to have a substantive, interpretable meaning.  As mentioned, with dummy variables (coded as 1 or 0) included in the model, the intercept already has a meaning.

If the model includes continuous variables, however, I recommend centering those variables around their average value.  If the variable in question is a test score whose value range is 0 to 100 and the average score was 65, I would subtract 65 from each subject’s test score (if a test score is 60, the centered value is 60 - 65 = -5).  In SAS, you can do:

proc standard data=abc out=abc2 mean=0;  /* writes the centered values to abc2 */
  var testscore;
run;


With centering, the intercept obtains a meaning: it indicates the predicted value for a subject whose test score is the average score.  Centering does not affect the coefficients of the other variables included in the model or any other values obtained from the model; only the intercept changes.

You can also center a predictor’s values and fix its standard deviation to be 1.  In SAS, you can do:

proc standard data=abc out=abc2 mean=0 std=1;  /* standardize: mean 0, SD 1 */
  var testscore;
run;

The resulting value is called a “z-score.”  Z-scores may be better known than the concept of centering; a z-score is one specific type of centering.  Its mean is zero (as all values are centered around the average value) and its standard deviation is fixed at 1.

I typically apply “z-scoring” to a pretest variable whose scores are large numbers (e.g., 953, 405, etc.).  Without this adjustment, the derived coefficients may be too small to read in the table (e.g., 0.00000014).
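
For reference, here is what PROC STANDARD computes under the hood.  This is a minimal datastep sketch, assuming the dataset abc and variable testscore from the examples above; the intermediate names (stats, ts_mean, ts_std) are hypothetical:

proc means data=abc noprint;
  var testscore;
  output out=stats mean=ts_mean std=ts_std;
run;

data abc2;
  if _n_=1 then set stats(keep=ts_mean ts_std);  /* attach the overall mean and SD to every row */
  set abc;
  testscore_c = testscore - ts_mean;             /* centered: mean 0 */
  testscore_z = (testscore - ts_mean) / ts_std;  /* z-score: mean 0, SD 1 */
run;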


WWC attrition table

P. 13 of the WWC standards document (boundaries for differential attrition, given the rate of overall attrition):

https://ies.ed.gov/ncee/wwc/Docs/referenceresources/wwc_procedures_v3_0_standards_handbook.pdf


Overall Attrition    Conservative Boundary    Liberal Boundary
0.00                 0.057                    0.100
0.01                 0.058                    0.101
0.02                 0.059                    0.102
0.03                 0.059                    0.103
0.04                 0.060                    0.104
0.05                 0.061                    0.105
0.06                 0.062                    0.107
0.07                 0.063                    0.108
0.08                 0.063                    0.109
0.09                 0.063                    0.109
0.10                 0.063                    0.109
0.11                 0.062                    0.109
0.12                 0.062                    0.109
0.13                 0.061                    0.108
0.14                 0.060                    0.108
0.15                 0.059                    0.107
0.16                 0.059                    0.106
0.17                 0.058                    0.105
0.18                 0.057                    0.103
0.19                 0.055                    0.102
0.20                 0.054                    0.100
0.21                 0.053                    0.099
0.22                 0.052                    0.097
0.23                 0.051                    0.095
0.24                 0.049                    0.094
0.25                 0.048                    0.092
0.26                 0.047                    0.090
0.27                 0.045                    0.088
0.28                 0.044                    0.086
0.29                 0.043                    0.084
0.30                 0.041                    0.082
0.31                 0.040                    0.080
0.32                 0.038                    0.078
0.33                 0.036                    0.076
0.34                 0.035                    0.074
0.35                 0.033                    0.072
0.36                 0.032                    0.070
0.37                 0.031                    0.067
0.38                 0.029                    0.065
0.39                 0.028                    0.063
0.40                 0.026                    0.060
0.41                 0.025                    0.058
0.42                 0.023                    0.056
0.43                 0.021                    0.053
0.44                 0.020                    0.051
0.45                 0.018                    0.049
0.46                 0.016                    0.046
0.47                 0.015                    0.044
0.48                 0.013                    0.042
0.49                 0.012                    0.039
0.50                 0.010                    0.037
0.51                 0.009                    0.035
0.52                 0.007                    0.032
0.53                 0.006                    0.030
0.54                 0.004                    0.028
0.55                 0.003                    0.026
0.56                 0.002                    0.023
0.57                 0.000                    0.021
0.58                 -                        0.019
0.59                 -                        0.016
0.60                 -                        0.014
0.61                 -                        0.011
0.62                 -                        0.009
0.63                 -                        0.007
0.64                 -                        0.005
0.65                 -                        0.003

T-test in SAS datastep

The following SAS steps conduct a two-sample t-test using functions in a datastep: PROC MEANS produces summary statistics by treatment group, and a datastep computes the mean difference, its standard error, and a two-sided p-value (using the normal approximation via PROBNORM).

/* Group-level summaries by treatment status; STACKODSOUTPUT keeps
   one row per variable per group in the output dataset */
proc means data=both stackodsoutput n mean std min max stderr;
  class treat;
  var

<Variables here>

  ;
  ods output summary=kaz2;
run;

/* Control group (treat=0): rename the statistics with a _c suffix */
data c;
  set kaz2;
  if treat=0;
  N_c=N;
  mean_c=mean;
  StdDev_c=StdDev;
  Min_c=Min;
  Max_c=Max;
  StdErr_c=StdErr;
  keep N_c mean_c StdDev_c Min_c Max_c StdErr_c Variable Label;
run;

/* Treatment group (treat=1): rename with a _t suffix; Variable_QC
   lets you verify that the merge below aligned the right rows */
data t;
  set kaz2;
  if treat=1;
  N_t=N;
  mean_t=mean;
  StdDev_t=StdDev;
  Min_t=Min;
  Max_t=Max;
  StdErr_t=StdErr;
  Variable_QC=Variable;
  keep N_t mean_t StdDev_t Min_t Max_t StdErr_t Variable_QC;
run;

data merge_CT;
  /* One-to-one merge (no BY statement): relies on both datasets listing
     the variables in the same order; check Variable against Variable_QC */
  merge c t;
  length sig $3;  /* without this, sig is truncated to 2 characters */
  difference=mean_t-mean_c;

  /* Standard error of the difference, unequal-variance (Welch) form:
     https://www.itl.nist.gov/div898/handbook/eda/section3/eda353.htm */
  POOLED_SE=sqrt( ( (StdDev_t*StdDev_t) / N_t ) + ( (StdDev_c*StdDev_c) / N_c ) );

  T_value=abs(difference)/POOLED_SE;

  /* Two-sided p-value from the normal approximation */
  P_value=(1-probnorm(T_value))*2;
  *if P_value < 0.1 then sig="t";
  if P_value < 0.05  then sig="*";
  if P_value < 0.01  then sig="**";
  if P_value < 0.001 then sig="***";
  if P_value = .     then sig="";
run;
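
If you prefer the t distribution to the normal approximation, you can add the Welch-Satterthwaite degrees of freedom in the same way.  A minimal sketch, assuming the variable names from merge_CT above (df_welch and P_value_t are hypothetical names):

data merge_CT2;
  set merge_CT;
  /* Welch-Satterthwaite approximation to the degrees of freedom */
  df_welch = ( (StdDev_t**2/N_t + StdDev_c**2/N_c)**2 ) /
             ( ((StdDev_t**2/N_t)**2)/(N_t-1) + ((StdDev_c**2/N_c)**2)/(N_c-1) );
  /* two-sided p-value from the t distribution instead of PROBNORM */
  P_value_t = (1-probt(T_value, df_welch))*2;
run;

With large group sizes the two p-values are nearly identical; the difference matters mainly for small samples.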

Statistical joint test of categorical variables when expressed as a series of dummy variables

When I have multiple subgroups represented by a series of dummy variables (e.g., race groups, grade levels, etc.), I want to know if the dummy variables as a system contribute to the model with statistical significance.  This may be called a joint test because I want to know if, for example, race groups together (not separately) make a difference to the model.

The easiest way to do this is to treat those variables as classification variables.  You will get a joint statistical test in one of the result tables (the Type III tests of fixed effects).

proc glimmix ..;
  class race grade_level;
  ....
run;

In my applications I almost always use numeric versions of the variables, i.e., dummy variables (coded as 0 or 1).  I like this approach because I can use PROC MEANS on them to create a descriptive statistics table, as sketched below.
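
A minimal sketch of that descriptive use; the dataset and dummy names match the model further below:

proc means data=usethis n mean std;
  var black hispanic other grade09 grade10 grade11;
run;

The mean of a 0/1 dummy is simply the proportion of subjects in that group, which is exactly what a descriptive table needs.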

The question is how to get joint statistical tests when all of my predictors are numerically coded and thus I can't rely on the class statement (shown in the syntax example above).

The GLIMMIX syntax below treats race groups and grade levels as numerically coded dummy variables (1 if yes, else 0).

The parameter estimates table will show a coefficient for each of the numeric variables; however, it won't tell me whether the race groups as a set matter to the model, or whether the grade levels as a system matter.  For example, even when the coefficient for being Black is statistically significant, that speaks only to how Black students differ from White students (the reference group in this example).  It does not tell us whether race as a group matters, i.e., whether the race dummies jointly make a statistically significant contribution to the model.

(Again, this can be done easily by using class variables instead, as shown earlier; however, I like using numeric variables in my models.)

Contrast statements will do the trick.  Each comma-separated row of a CONTRAST statement adds one restriction, and GLIMMIX tests all the rows jointly (e.g., H0: the coefficients for black, hispanic, and other are all 0).

proc glimmix data=usethis namelen=32;
  class groupunit;
  model Y= treat black hispanic other grade09 grade10 grade11/
        solution ddfm=kr dist=&dist link=&link ;
  output out=&outcome.gmxout residual=resid;
  random intercept /subject=groupunit;
  /* Each comma separates a row; the rows are tested jointly */
  CONTRAST 'Joint F-Test Race groups ' black 1, hispanic 1, other 1;
  CONTRAST 'Joint F-Test Grade levels' grade09 1, grade10 1, grade11 1;
  ods output
    ParameterEstimates=_3_&outcome.result covparms=_3_&outcome.cov
    Contrasts=cont&outcome;
run;
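
The joint tests land in the Contrasts table (saved above as cont&outcome).  A minimal sketch of reviewing it, assuming &outcome resolves as in the model above:

proc print data=cont&outcome;
run;

Each row reports an F test whose numerator degrees of freedom equal the number of rows in the corresponding CONTRAST statement (3 for the race groups, 3 for the grade levels).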


Why use METHOD=RSPL for PROC GLIMMIX

The reason for using the R (restricted) methods is that the alternative M (maximum likelihood) methods can produce biased estimates of the covariance parameters (the level-2 variance in our application).  The bias is worst when the number of group units is relatively small, so this is a real threat in multilevel applications.
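
A minimal sketch of how the choice is expressed, reusing the dataset and variable names from the joint-test example above (RSPL is GLIMMIX's default, so spelling it out is mainly for documentation):

proc glimmix data=usethis method=RSPL;  /* restricted pseudo-likelihood */
  class groupunit;
  model Y= treat / solution ddfm=kr;
  random intercept / subject=groupunit;
run;

Replacing method=RSPL with method=MSPL requests the maximum-likelihood analogue.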


PROC GLIMMIX's option: lsmeans subgroup / ilink diff;

The ILINK option reports the LS-means on the inverse-linked scale (for this logit model, as probabilities), and DIFF requests pairwise differences between the subgroup LS-means.  Note that the LSMEANS statement must name a class effect in the model, here subgroup.

proc glimmix data=asdf METHOD=RSPL;
  class CAMPUS_14 subgroup;
  model y=x1 x2 x3 subgroup
        /dist=binomial link=logit s ddfm=kr;
  /* ilink: also report LS-means on the probability scale;
     diff: pairwise differences between subgroup LS-means */
  lsmeans subgroup / ilink diff;
  ods output ModelInfo=x1var1 ParameterEstimates=x2var1 CovParms=x3var1
             Diffs=DIF_RESULT1 LSMeans=LS1;
run;