STDCOEF to request standardized coefficients from PROC GLIMMIX

proc glimmix data=qc2 method=RSPL;
  class group_ID;
  model Y = X / dist=binomial link=logit s ddfm=kr STDCOEF;
run;

It is not clear how the coefficients are standardized.  Based on my investigation, centering is definitely done around the variable's grand mean; however, the SD used for standardization is not 1.  When I simulated it, the SD was around 0.036 (meaning I created a z-score using mean = 0 and SD = 0.036 and obtained the same coefficient that STDCOEF reported).
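Below is a minimal sketch of that check.  The data set qc2 and the value 0.036 come from the note above; the intermediate data set names (xmean, qc2_z) and the variable Xz are made up for illustration.  The idea is to build the z-score by hand and confirm that refitting without STDCOEF reproduces the standardized coefficient.

/* Step 1: grand mean of X */
proc means data=qc2 noprint;
  var X;
  output out=xmean mean=xbar;
run;

/* Step 2: hand-made z-score using the SD found by trial and error (0.036) */
data qc2_z;
  if _n_ = 1 then set xmean;
  set qc2;
  Xz = (X - xbar) / 0.036;
run;

/* Step 3: refit WITHOUT STDCOEF; the coefficient for Xz should match
   the standardized coefficient reported by STDCOEF if the SD is right */
proc glimmix data=qc2_z method=RSPL;
  class group_ID;
  model Y = Xz / dist=binomial link=logit s ddfm=kr;
run;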

How to check if a SAS data set was created or not

/* Check whether a data set (e.g., COVPARMS) was created */
%macro checkds(dsn);
  %if %sysfunc(exist(&dsn)) %then %do;
    proc print data=&dsn; run;
    data HLM_OR_NOT; value="YES_HLM"; run;
  %end;
  %else %do;
    /* The data set does not exist: create placeholders instead */
    data covparms;   CovParm="NoCov";  run;
    data HLM_OR_NOT; value="NO_HLM ";  run;
  %end;
%mend checkds;

/* Invoke the macro; pass a non-existent data set name to test the %else branch */
%checkds(work.covparms);

/* Store the flag in a macro variable for later use */
data _null_;
  set hlm_or_not;
  call symput("hlm_or_not", value);
run;
%put &hlm_or_not;
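One practical note when testing: once work.covparms exists (including the placeholder created by the %else branch), the %if branch will always fire on later runs.  A small housekeeping sketch to reset before re-testing (data set names taken from the macro above):

proc datasets library=work nolist;
  delete covparms hlm_or_not;
quit;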

How to determine sample size for a binary variable

If you need to determine the sample size for your survey when the variable of interest is a binary outcome, you can use power analysis to decide how many subjects you need before collecting data.

(You can skip this paragraph if it is confusing.)  You should adjust the sample size by the expected missing-response rate: for example, if you aim to collect data from 100 people but expect only 95 to reply, use 95 as the sample size for the power evaluation.

I wrote an Excel file for sample size calculation, but let me write a bit more here about what I did in it:

https://drive.google.com/file/d/0B7AoA5fyqX_sMkZJOUZxN3JvbUk/view?usp=sharing

If you have an expectation as to what percentages you will be looking at after your experiment, use those percentages (one for the treatment group and one for the control/comparison group) and decide the sample size you will need to evaluate the difference between them with confidence.

Often we don't have such expectations, probably because no one has done a study similar to yours.  You can just assume the two percentages are close to 50%, which will give you the most conservative power analysis results.  So if you want to see whether a group difference of 5% will give you sufficient statistical confidence, given a certain sample size, you can set the two percentages to 47.5% and 52.5% (a difference of 5%).

If you want to see whether a group difference of 10% will give you sufficient statistical confidence, given a certain sample size, you can set the two percentages to 45% and 55% (a difference of 10%).
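If you prefer to do this in SAS rather than in the spreadsheet, PROC POWER can solve for the sample size directly.  This is only a sketch of an alternative, not what the Excel file does; the alpha and power values below are common defaults, not anything prescribed above.

proc power;
  twosamplefreq test=pchi
    groupproportions = (0.475 0.525)   /* 5% difference around 50% */
    alpha            = 0.05
    power            = 0.80
    npergroup        = .;              /* solve for n per group    */
run;

To fold in the missing-response point above, evaluate power at the number of subjects you actually expect to respond (e.g., 95 when 100 are invited) rather than the number invited.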

I wish I could write this more tightly.

Reference:

http://www.surveysystem.com/sscalc.htm

Thanks to Mr. George Ohashi for showing me the function that adjusts sample sizes by the expected missing rate.

What does "the intercept being statistically significant" mean?

You can safely ignore that information.

A statistical test of a coefficient asks whether that coefficient is different from zero, so in the most general context the intercept being different from zero doesn't have much meaning in and of itself.  We can force it to mean something, though.  If a) we standardize the outcome score such that its average is 0 (in SAS: proc standard mean=0; var outcome; run;) and b) the intercept is significant, then whatever the intercept represents (e.g., Hispanic students' average score) is significantly different from the overall average score (I'm using "significantly different" loosely here to focus on the main point of the explanation).
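Here is a minimal sketch of that setup.  The data set scores and the dummy variable non_hispanic (0 for Hispanic students, 1 otherwise) are invented for illustration.  With the outcome centered at 0, the intercept is the model's estimate of the Hispanic students' average, so its test against 0 is a test against the overall average.

/* Center the outcome at 0 */
proc standard data=scores mean=0 out=scores_c;
  var outcome;
run;

/* non_hispanic = 0 for Hispanic students, so the intercept is their predicted mean */
proc reg data=scores_c;
  model outcome = non_hispanic;
run;
quit;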

Having said that, I made that up just as an example.  Generally speaking, we should ignore the intercept's significance level.

In contrast, the intervention effect being significant means something important and often exciting.  If the intervention effect is .20, the statistical test examines whether .20 is different from 0.  Zero means no effect, so if the impact coefficient is statistically different from 0, that is good news (you would also want to examine the size of the coefficient).

How to understand and test statistical interaction effect

Using the regression model framework, an analyst can test whether the effect of X depends on another predictor.  If the outcome is student achievement and the most important independent variable is the intervention variable (1 if the student received the treatment; 0 otherwise), one can further ask whether the program effect depends on students' demographic factors, such as gender, race and ethnicity, and important student statuses (e.g., special education, English learner).  If the main research question is whether the program has an effect on the student outcome, we often ask the next question: "Does the program effect depend on student demographic factors or student status variables?"  If, for example, an educational intervention program works only for boys but not for girls, we expect to see a statistical interaction between intervention and gender.  You can say this in different ways:

  1. The program impact varies by gender
  2. The effect of the program depends on gender
  3. The program and gender interact
  4. Gosh, this program is effective particularly for boys!

I will continue to use gender as an example throughout this text.
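As a concrete illustration of what gets estimated, the sketch below fits achievement on treatment, gender, and their interaction; the data set and variable names are made up.  The treatment*gender term is the interaction: if its coefficient is statistically different from zero, the program effect differs by gender.

proc glm data=studydata;
  class treatment gender;
  model achievement = treatment gender treatment*gender / solution;
run;
quit;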

Read the rest in the following document (MS-WORD):

How to test and understand statistical interaction effect

If between-group variance increased instead of decreased in HLM

In a non-HLM model (e.g., OLS), the unexplained (residual) variance of the outcome will always decrease as you add predictors.

Variance here is about how the residuals are distributed.  If the predictors explain the outcome very well, the residual variance is small; if they do not, it is large.  Sometimes the outcome does not have enough variance to begin with (e.g., trying to explain high school graduation when only 3 people out of 100,000 subjects graduate).

In HLM, variance specific to a level (level-1 variance, level-2 variance) can increase, which is counter-intuitive.  For example, when modeling student achievement as an outcome, you add the pretest score to the model and all of a sudden the between-school variance increases.  The document below gives an example of how this can happen.

https://www.statmodel.com/download/Level-2%20R-square%20decreasing%20when%20adding%20level1-covariate.pdf

I will state my conclusions first.

a) It is theoretically and empirically possible to see the group-level variance increase when individual-level predictor(s) are added to the model.  By looking at the data (a case-by-case or group-average <e.g., school average> comparison of residuals from an unconditional model and a conditional model will help), you will need to understand why it happened (so you can explain it when your audience questions you).  Some situations will give you meaningful explanations (as in the housing price and location example in the PDF file referenced above).  Other situations will provide boring, matter-of-fact explanations (it just happens as level-1 predictors change the value of the group mean estimates <i.e., the random intercepts>).

b) Before reaching an explanation, always suspect an error in the data.  Errors in the data can be related to a between-group variance increase (I will provide an example of this in Situation 2).

 

Situation 1

The model was a multi-level logistic regression where:

  • The outcome: 0 or 1 (passing the posttest or not)
  • Pretest score: interval score
  • 2-level models: Students (=subjects) are nested within schools

What happened was:

The between-school variance increased when we entered the level-1 pretest score.  We compared the between-school variance from the ANOVA model (the unconditional model) with that from the conditional model (the model that includes predictors), and the between-group variance had increased.

This is how we solved it:

  • a) I identified which predictor was causing this situation.  We quickly found that it was the pretest, just by testing how the between-school variance changed as predictors were entered one at a time.
  • b) I examined residuals from the model that doesn't include the problem predictor (pretest) and from the model that includes it (see the sketch after this list).  When the two data columns were plotted against each other, two observations were off from the rest of the observation points, indicating that after the predictor was entered into the model, these two groups' errors (deviations from the mean) increased in size.
  • c) I examined how the outcome variable is related to the problem predictor.  Although the two are positively correlated overall, the two problem groups showed an unexpected association between the two variables: despite having low pretest scores, which would predict low outcome scores, they had relatively high outcome scores.
  • d) This means that the two groups have exceptionally high scores compared to the prediction, which increases the size of the errors associated with subjects in these two groups.
  • e) Two alternative solutions:
    • Remove the outliers and make note of the situation for readers
    • Keep the outliers and check the consistency of results; if the results do not change in substantive meaning (e.g., the impact coefficient stays more or less the same), make note of the situation for readers
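The sketch below shows one way to do the comparison in step b.  All data set and variable names (students, school, pass, pretest) are placeholders, and this is only one possible implementation: fit the model with and without the suspect predictor, keep the school-level random-intercept estimates from each fit, and plot them against each other so the schools that move a lot stand out.

/* Unconditional model: no predictors */
proc glimmix data=students;
  class school;
  model pass(event='1') = / dist=binomial link=logit s;
  random intercept / subject=school solution;
  ods output SolutionR=eblup_uncond;
run;

/* Conditional model: adds the suspect predictor (pretest) */
proc glimmix data=students;
  class school;
  model pass(event='1') = pretest / dist=binomial link=logit s;
  random intercept / subject=school solution;
  ods output SolutionR=eblup_cond;
run;

proc sort data=eblup_uncond; by school; run;
proc sort data=eblup_cond;   by school; run;

/* One record per school, with the random-intercept estimate from each model */
data eblup_both;
  merge eblup_uncond(rename=(Estimate=est_uncond))
        eblup_cond  (rename=(Estimate=est_cond));
  by school;
run;

proc sgplot data=eblup_both;
  scatter x=est_uncond y=est_cond / datalabel=school;
run;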

Situation 2

 

 

That is about it, but I may come back and edit.

I also want to create a fake data set to replicate this problem.  I will create an outcome variable and a predictor that are positively correlated (e.g., math posttest and math pretest).  For ease of interpretation I will convert them into z-scores (proc standard data=abc mean=0 std=1; var pretest posttest; run;).  I will then take 10% of the data and flip the sign of the predictor, so the association in this subgroup will be the opposite of that in the remaining 90%.  I *think* this will replicate the situation, but I am not 100% sure.
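A rough sketch of that simulation plan is below.  Everything in it (the number of schools and students, the coefficients, which schools get flipped) is an arbitrary choice for illustration, not something taken from the situations above.

/* Generate positively correlated pretest/posttest with a school effect */
data fake;
  call streaminit(123);
  do school = 1 to 50;
    u = rand('normal');                                 /* school effect        */
    do student = 1 to 40;
      pretest  = rand('normal');
      posttest = 0.6*pretest + 0.4*u + rand('normal');  /* positive association */
      output;
    end;
  end;
run;

/* Standardize both variables, as described above */
proc standard data=fake mean=0 std=1 out=fake_z;
  var pretest posttest;
run;

/* Flip the sign of the predictor for 10% of the schools (5 of 50) */
data fake_z;
  set fake_z;
  if school <= 5 then pretest = -pretest;
run;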

Statistical test on between-group variance

proc glimmix data=temp2 /*Method=RSPL*/;
  class CAMPUS;
  model Y = / dist=binomial link=logit s ddfm=kr;
  random intercept / subject=CAMPUS;
  covtest / wald;   /* Wald test of the between-campus variance component */
  output out=gmxout_alglog residual=resid;
run;