Tricky things to describe about stat model specification

When describing statistical models and results in writing, the following are tricky issues that require decisions and a standardized way of describing them (and the descriptions must be brief, intuitive, and full of meaning):

  • How do we choose omitted category/reference group?
  • Why is there no level-1 error term in logistic regression?
  • Why use HLM?
  • Why use logistic regression model?
  • Meaning of odds ratio
  • Effect size interpretation (Why 2.0 is often used)
  • Why use certain covariates
  • How do we talk about predictors, covariates, and the treatment indicator (1 if treatment subject; else 0)?  There seems to be a difference between predictors and covariates.
  • How to discuss variance change (R2, etc.)
  • Negative level-2 variance in case of HLM
  • What do we do when between-group variance is small (the model may not converge)
  • What to do when the model does not converge?
  • How to deal with model names such as HLM, HGLM, etc.
  • When converting a continuous or ordinal variable into a binary outcome for logistic regression, there are many possible cutpoints for defining 0 vs. 1 (low vs. high).  How do we justify the choice?
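
For the odds-ratio bullet above, a minimal numeric illustration often helps.  The counts below are hypothetical, purely for demonstration:

```python
# Hypothetical 2x2 table: outcome (pass/fail) by treatment status.
treat_pass, treat_fail = 60, 40   # treated: odds of passing = 60/40 = 1.5
ctrl_pass, ctrl_fail = 40, 60     # control: odds of passing = 40/60 ~ 0.67

odds_treat = treat_pass / treat_fail
odds_ctrl = ctrl_pass / ctrl_fail

# Odds ratio: the treated group's odds of passing are (60/40)/(40/60) = 2.25
# times the control group's odds.  In logistic regression, exp(coefficient)
# for the treatment indicator is this odds ratio.
odds_ratio = odds_treat / odds_ctrl
```

Note that an odds ratio of 2.25 does not mean the treated group is "2.25 times as likely" to pass; it is a ratio of odds, not of probabilities.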

Cronbach's alpha

The UCLA site explains Cronbach's alpha as the average internal correlation among survey items.  It also says that alpha is not a measure of unidimensionality; rather, it is a measure of internal consistency.  (Intuitively, I feel that what is internally consistent also tends to be unidimensional.  I think the point is that the measure is designed to assess internal correlation, not dimensionality.)

http://www.ats.ucla.edu/stat/spss/faq/alpha.html

Standardized versus Raw

This SAS website says one should use the standardized version of the measure (as opposed to raw).

https://support.sas.com/documentation/cdl/en/procstat/63104/HTML/default/viewer.htm#procstat_corr_sect032.htm

It says: "Because the variances of some variables vary widely, you should use the standardized score to estimate reliability."

A note to myself: does this mean that if I standardize all items before the analysis, I get the same value for the raw and standardized versions?  I can experiment with this.
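
The experiment can be sketched outside SAS.  The snippet below is my own illustration (function names, simulated data, and formulas are assumptions, not from the SAS documentation): raw alpha uses item and total-score variances, standardized alpha uses the average inter-item correlation.

```python
import numpy as np

def raw_alpha(X):
    # Raw alpha: (k/(k-1)) * (1 - sum of item variances / variance of total score)
    k = X.shape[1]
    item_vars = X.var(axis=0, ddof=1)
    total_var = X.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

def standardized_alpha(X):
    # Standardized alpha: based on the average inter-item correlation r_bar
    k = X.shape[1]
    R = np.corrcoef(X, rowvar=False)
    r_bar = (R.sum() - k) / (k * (k - 1))  # mean of the off-diagonal correlations
    return k * r_bar / (1 + (k - 1) * r_bar)

# Simulated survey: 4 correlated items, 200 respondents (arbitrary numbers).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4)) + rng.normal(size=(200, 1))

# Standardize each item (mean 0, SD 1), then apply the *raw* formula.
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
```

If the reasoning is right, `raw_alpha(Z)` should equal `standardized_alpha(X)`: after standardizing, item variances are all 1 and covariances become correlations, so the raw formula reduces algebraically to the standardized one.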

Survey Sampling Design and Regression Analysis using SAS SURVEYREG

Simple random sampling given the population size

http://support.sas.com/documentation/cdl/en/statug/63033/HTML/default/viewer.htm#statug_surveyreg_sect003.htm

Confirm that regular regression analysis produces larger standard errors.  By using the population total as part of the estimation (a finite population correction), PROC SURVEYREG achieves smaller standard errors (more precise estimates).

data IceCream;
input Grade Spending Income Kids @@;
datalines;
7 7 39 2 7 7 38 1 8 12 47 1
9 10 47 4 7 1 34 4 7 10 43 2
7 3 44 4 8 20 60 3 8 19 57 4
7 2 35 2 7 2 36 1 9 15 51 1
8 16 53 1 7 6 37 4 7 6 41 2
7 6 39 2 9 15 50 4 8 17 57 3
8 14 46 2 9 8 41 2 9 8 41 1
9 7 47 3 7 3 39 3 7 12 50 2
7 4 43 4 9 14 46 3 8 18 58 4
9 9 44 3 7 2 37 1 7 1 37 2
7 4 44 2 7 11 42 2 9 8 41 2
8 10 42 2 8 13 46 1 7 2 40 3
9 6 45 1 9 11 45 4 7 2 36 1
7 9 46 1
;
run;

proc surveyreg data=IceCream total=4000;
model Spending = Income / solution;
run;

proc reg data=IceCream;
model Spending = Income;
run;
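
The direction of the difference can be sketched by hand.  This is only an illustration of the finite population correction (FPC) factor, with a made-up standard error; PROC SURVEYREG's actual variance estimator is Taylor-series based, so the numbers will not match exactly:

```python
import math

# Under simple random sampling from a finite population, variance estimates
# carry a finite population correction of (1 - n/N).
n, N = 40, 4000   # sample size, and the population total given in TOTAL=4000
se_ols = 0.25     # hypothetical standard error from plain PROC REG

fpc = 1 - n / N                 # = 0.99 here
se_fpc = se_ols * math.sqrt(fpc)  # always <= se_ols
```

With only 1% of the population sampled, the FPC shrinks the standard error by only about 0.5% here; the correction matters more as n/N grows.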

Stratified Sampling
http://support.sas.com/documentation/cdl/en/statug/63033/HTML/default/viewer.htm#statug_surveyreg_sect004.htm

data IceCream;
input Grade Spending Income Kids @@;
datalines;
7 7 39 2 7 7 38 1 8 12 47 1
9 10 47 4 7 1 34 4 7 10 43 2
7 3 44 4 8 20 60 3 8 19 57 4
7 2 35 2 7 2 36 1 9 15 51 1
8 16 53 1 7 6 37 4 7 6 41 2
7 6 39 2 9 15 50 4 8 17 57 3
8 14 46 2 9 8 41 2 9 8 41 1
9 7 47 3 7 3 39 3 7 12 50 2
7 4 43 4 9 14 46 3 8 18 58 4
9 9 44 3 7 2 37 1 7 1 37 2
7 4 44 2 7 11 42 2 9 8 41 2
8 10 42 2 8 13 46 1 7 2 40 3
9 6 45 1 9 11 45 4 7 2 36 1
7 9 46 1
;
run;

data StudentTotals;
input Grade _TOTAL_;
datalines;
7 1824
8 1025
9 1151
;
run;

data IceCream2;
set IceCream;
if Grade=7 then Prob=20/1824;
if Grade=8 then Prob=9/1025;
if Grade=9 then Prob=11/1151;
Weight=1/Prob;
run;

proc surveyreg data=IceCream2 total=StudentTotals;
strata Grade /list;
class Kids;
model Spending = Income / solution;
weight Weight;
run;

proc reg data=IceCream2;
model Spending = Income;
weight Weight;
run;
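
The Weight computation in the IceCream2 step can be sanity-checked by hand.  A small sketch (stratum sample sizes n_h are the counts in the data above; totals N_h come from StudentTotals):

```python
# Inverse-probability sampling weights: Weight = 1/Prob = N_h / n_h per stratum.
strata = {7: {"n": 20, "N": 1824},
          8: {"n": 9,  "N": 1025},
          9: {"n": 11, "N": 1151}}

weights = {g: s["N"] / s["n"] for g, s in strata.items()}

# Each stratum's weights sum back to its population total (n_h * N_h/n_h = N_h),
# and the three totals sum to 4000, the population size used in the SRS example.
population = sum(s["N"] for s in strata.values())
```

This is why the weighted analysis can be read as each sampled student "standing in" for N_h/n_h students in his or her grade.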

How to conduct a statistical test for the difference between two dummy variables in the model

This is a question post.

Imagine that I have a regression model:

Y=b0+b1*black+b2*white+b3*asian+ error

where the omitted category is Hispanic.  Because of this omission, each race coefficient corresponds to the difference between that group and Hispanic subjects (in other words, Hispanic is the reference group).

If I want to know whether the difference between white and black is statistically significant, I could omit black and see if the coefficient for white subjects is statistically significant (or I could omit white instead).  This, however, requires running a whole separate model (though it is mathematically equivalent to the original model).

Another approach in SAS would be to request statistical tests evaluating every possible contrast among the race groups, but this again involves an extra step.

Is there an easy way to simply read off the information I already have from the original model and evaluate the between-group differences statistically (e.g., black vs. white, asian vs. black)?  For example, if I have the table below (with hypothetical values), is it possible to know whether the black vs. white difference (2.1 vs. 3) is statistically significant?  Does the table give me sufficient information?

            coeff.   stderr.   prob.
Intercept   1.2      (0.2)     0.1
black       2.1      (0.1)     0.6
white       3.0      (0.2)     0.05
asian       4.0      (0.01)    0.01

My hunch is that I can pool the two standard errors (one from black and the other from white) and use the pooled value to evaluate the black vs. white difference (and somehow I would have to figure out the appropriate DF).  However, I don't think pooled standard errors are used for the tests already reported in this table (e.g., the black effect is evaluated based only on its own standard error).  It would be strange if I had to rely on pooled standard errors.
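
For reference, one standard approach is a Wald-type z test on the difference, which needs the covariance between the two coefficients.  The table above does not report that covariance; it would have to come from the model's coefficient covariance matrix (e.g., the COVB option in SAS).  A sketch using the table's values plus an assumed covariance:

```python
import math

# Coefficients and SEs copied from the hypothetical table above.
b_black, se_black = 2.1, 0.1
b_white, se_white = 3.0, 0.2
cov_bw = 0.005  # ASSUMED for illustration only; not available from the table

# Var(b_black - b_white) = Var(b_black) + Var(b_white) - 2*Cov(b_black, b_white)
se_diff = math.sqrt(se_black**2 + se_white**2 - 2 * cov_bw)
z = (b_black - b_white) / se_diff
```

This suggests the reported table alone is not sufficient: because both coefficients share the same reference group (Hispanic), their sampling covariance is generally nonzero, and a calculation based only on the two reported standard errors would ignore that term.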

My goal is to create an Excel sheet that does this calculation, so people can conduct the test without rerunning the models (relying only on the result tables).

Reference:

Example results:

www.nippondream.com/file/dummy variable interpretation of reg results.xlsx

Thank you for your input on this (please email k u e k w a @ GMAIL) or leave a comment below.