Survey Sampling Design and Regression Analysis using SAS SURVEYREG

Simple random sampling given the population size

http://support.sas.com/documentation/cdl/en/statug/63033/HTML/default/viewer.htm#statug_surveyreg_sect003.htm

Confirm that regular regression analysis produces larger standard errors. Using the total sample size as a part of the modeling process, PROC SURVEYREG achieves smaller standard errors (more precise measurement).
data IceCream;
input Grade Spending Income Kids @@;
datalines;
7 7 39 2 7 7 38 1 8 12 47 1
9 10 47 4 7 1 34 4 7 10 43 2
7 3 44 4 8 20 60 3 8 19 57 4
7 2 35 2 7 2 36 1 9 15 51 1
8 16 53 1 7 6 37 4 7 6 41 2
7 6 39 2 9 15 50 4 8 17 57 3
8 14 46 2 9 8 41 2 9 8 41 1
9 7 47 3 7 3 39 3 7 12 50 2
7 4 43 4 9 14 46 3 8 18 58 4
9 9 44 3 7 2 37 1 7 1 37 2
7 4 44 2 7 11 42 2 9 8 41 2
8 10 42 2 8 13 46 1 7 2 40 3
9 6 45 1 9 11 45 4 7 2 36 1
7 9 46 1
;
run;

proc surveyreg data=IceCream total=4000;
model Spending = Income / solution;
run;

proc reg data=icecream;
model spending=income;run;

Stratified Sampling
http://support.sas.com/documentation/cdl/en/statug/63033/HTML/default/viewer.htm#statug_surveyreg_sect004.htm

data IceCream;
input Grade Spending Income Kids @@;
datalines;
7 7 39 2 7 7 38 1 8 12 47 1
9 10 47 4 7 1 34 4 7 10 43 2
7 3 44 4 8 20 60 3 8 19 57 4
7 2 35 2 7 2 36 1 9 15 51 1
8 16 53 1 7 6 37 4 7 6 41 2
7 6 39 2 9 15 50 4 8 17 57 3
8 14 46 2 9 8 41 2 9 8 41 1
9 7 47 3 7 3 39 3 7 12 50 2
7 4 43 4 9 14 46 3 8 18 58 4
9 9 44 3 7 2 37 1 7 1 37 2
7 4 44 2 7 11 42 2 9 8 41 2
8 10 42 2 8 13 46 1 7 2 40 3
9 6 45 1 9 11 45 4 7 2 36 1
7 9 46 1
;
run;

data StudentTotals;
input Grade _TOTAL_;
datalines;
7 1824
8 1025
9 1151
;run;

data IceCream2;
set IceCream;
if Grade=7 then Prob=20/1824;
if Grade=8 then Prob=9/1025;
if Grade=9 then Prob=11/1151;
Weight=1/Prob;
run;

proc surveyreg data=IceCream2 total=StudentTotals;
strata Grade /list;
class Kids;
model Spending = Income / solution;
weight Weight;
run;

proc reg data=icecream2;
model spending=income;
weight Weight;
run;

Excel function to replicate t-test off SAS PROCs (e.g., GLIMMIX)

Phil of SAS helped me identify this function. Thank you.

T-test conducted in PROC GLIMMIX (or most likely other regression procedures) is expressed in Excel function as:

=2*(1-T.DIST( T_VALUE , DEG_OF_FREEDOM ,TRUE))

where T_value must be an absolute value of the original t-value (e.g., if -2 then 2).

This expresses CDF (cumulative distribution function), not PDF (probability density function).  I will explicitly discuss what these are in the near future.

I wanted to know how much of statistical results (off PROC GLIMMIX in this case) comes from SAS's internal computation (i.e., I can't replicate results outside SAS) and how much of it can be done in Excel given what I get from SAS output.

 

How to conduct a stat test for the difference between two dummy variables in the model

This is a question post.

Imagine that I have a regression model:

Y=b0+b1*black+b2*white+b3*asian+ error

where the ommited category is hispanic.  Because of this omission, each of race variable coefficients corresponds to the difference between each one of the race variables and Hispanic subjects (in other words, Hispanic is being the reference group).

If I want to know if the difference between white and black is statistically significant, I could omit black and see if the coefficient for white subjects is statistically significant (or I can omit white instead).  This, however, requires running of one whole separate model (though it is a mathematically equivalent model as the original model).

Another approach in SAS would be to request stat tests evaluating every possible contrast between the race groups, but this again involves an extra step.

Is there an easy to just to read information I already have from the original model and evaluate the between-group differences statistically (e.g., black vs. white, asian vs. black, white vs. black)?  For example, if I have this table below (w/ hypothetical values), is it possible to know if the black vs. white difference is statistically significant?  Does this table give me sufficient information to know if the white vs black difference (2.1-3) is statistically significant?

coeff. stderr. prob.
Intercept 1.2 (0.2) 0.1
black 2.1 (0.1) 0.6
white 3 (0.2) 0.05
asian 4 (0.01) 0.01

My hunch is that I can pool two standard errors (one from black and the other from black) and use it to evaluate the black and white difference (and somehow I have to figure out what DF should be). However, I don't think pooled standard errors are utilized for stats tests that are being reported already in this table (e.g., black effect is evaluated only based on its own standard error). It would strange that I had to rely on pooled standard errors.

My goal is to create an Excel sheet that does this calculation, so people can conduct the test without necessarily rerunning the models again (and they just rely on result tables).

Reference:

Example results:

www.nippondream.com/file/dummy variable interpretation of reg results.xlsx

Thank you for your input on this (please email k u e k w a @ GMAIL) or leave a comment below.

Doing HLM and Time-series analysis at the same time using GLIMMIX

HLM (multilevel models) and econometric analyses (e.g., time series analysis, ARIMA, etc.) are treated as different approaches (the goal of which is to deal with data dependency problem), they can be implemented in the same model via. SAS PROC GLIMMIX.  However, I believe doing this is computationally demanding and models may not converge.

http://support.sas.com/documentation/cdl/en/statug/63033/HTML/default/viewer.htm#statug_glimmix_sect020.htm

For detailed discussion of this topic, see p.28 of my HLM manual:

http://www.estat.us/sas/PROCMIXED.doc

Kenward-Roger Degrees of Freedom (SAS GLIMMIX)

I will edit this essay later.

Imagine I have an HLM model where level-1 are students and level-2 are schools.  I can enter teacher-level variables, but people who are used to HLM software by SSI will wonder how the use of teacher-level variables is possible without making the model 3-level models (level1 students, level2 teachers, level3 schools).  This is because in the old HLM tradition (people who studied HLM using HLM software in 1990s), equations are written/processed separately by levels and data must be prepared by levels (student data file and school data file).

If there are student-level and school-level, some HLM users may wonder how teacher-level variables can enter the model (or prepared in what file? student-level file or school-level file???).

People who use SAS or other software programs (maybe R?) will not wonder this because they think in terms of one whole equation and say, "of course we can enter variables of any levels."

The software programs they use adjust degree of freedom to take into consideration that teacher-level variables have a lot less number of values compared to the outcome variable.

K-R option in PROC GLIMMIX adjusts degree of freedom to account for the fact that group-level variables have a lot less possible values when compared to outcome variables.

K-R degree of freedom option seems most appropriate for multilevel modeling applied in educational evaluation studies (where typically students are nested within schools).  This option kicks in only when at least one random coefficient is estimated (e.g., intercepts as random effects).

proc glimmix data=i3d;
class school ;
model y =< here variable names >
/solution ddfm=kr dist=normal link=identity s ;
random int /sub=school;
run;

After playing with data type of simulation (I will do a better and well-documented simulation in the future), I learned the following:

For student-level (level-1) variables, degree of freedom under KR option is close to the number of students (minus a small number of cases maybe close to the number of predictors).  DF is larger if variance contained in the variable is greater.  For binary variables, DF is the largest when variance is .50 (DF gets smaller as a proportion gets close to 0 or 1).

For school-level (level2) variables, DF is close to the number of schools minus a small number maybe close to a number of predictors.  This may be also adjusted by variance of the variable.

I created a fake teacher-level predictors by creating two possible numeric values per school.  DF for this variable was close to the number of teachers in the data (two per school) minus a small number.  I think this is also adjusted by variance of the variable (which I didn't test exactly).

 

Proc mixed and Proc glimmix produce identical results

PROC MIXED and PROC GLIMMIX produce identical results for the following linear model settings.

proc glimmix data=kaz.asdf;
where IMPACT_ANALYSIS=1;
class school;
model
y =
x1 x2 x3
/solution ddfm=kr dist=normal link=identity s ;
random school;
run;

proc mixed data=kaz.asdf;
where IMPACT_ANALYSIS=1;
class school;
model
y =
x1 x2 x3
/solution s ddfm=kr;
random school;
run;

Cluster effect: What does HLM solve?

When I first learned HLM (Hierarchical linear modeling) at graduate program in 1994/5, I struggled with the following expression:

Errors are correlated.

Up to that point in Stat 101, correlation was about two columns of data (e.g., math test score and science test score).  Errors in the context of regression analysis are residuals from the model and they are stored in one column.  I had a conceptual difficulty trying to understand why values contained in one column (one variable) can be correlated.

When I learned about geostatistics again at a workshop, the model was supposed to correct data dependence issue caused by geographical proximity.  This time, it was about how temperature of town A, for example, is similar to an adjacent town B and thus observations are dependent on one another.

I also learned about econometric approach of trying to deal with the fact that time and observations are correlated (my test score today is dependent on my test score tomorrow).

After hearing again and again about statisticians' attempts to correct for data dependence, correlation of data, etc., I finally realized that data can be correlated within one column of data.  If you and someone else are from the same school, your outcome data are correlated.

The traditional statistical modeling technique, such as OLS regression model, relies on the assumption that outcome data are uncorrelated (observation 1 and 2 are completely not related to one another).  If this assumption is violated, we can no longer consider results of statistical test good.  In fact, in the presence of data dependence problem, results of statistical test will be over-optimistic (too many statistically significant results).

I also learned that the use of HLM is one thing you can do to improve the situation, but it may be just one of many problems you may have in data.  Student test scores may be also related within friendship networks.  Typically we do not have data of this membership.

In the same model, you can try to deal with group dependence (via. HLM) or time dependence (via. ARIMA  model, for example).  This is not impossible, but testing these two at the same time is computationally challenging.  You will have to choose your battle and fix one thing at one time.

How to derive subgroup adjusted outcome averages and conduct pairwise stat-test

When you run a statistical interaction model (e.g., Y=TREATMENT + GENDER + TREATMENT*GENDER), you also want to run a mathematically equivalent model:
Y= T_MALE + C_MALE + T_FEMALE + C_FEMALE.

SUBGROUP defines the four groups below.

This is for linear model (HLM because I have the random statement).  

proc glimmix data=asdf3 namelen=32;
class CAMPUSID  SUBGROUP;
model y=
x1 x2 x3
SUBGROUP
/solution ddfm=kr dist=normal link=identity s noint;
lsmeans SUBGROUP / ilink diff;
random int / subject = CAMPUSID;
ods output ModelInfo=x1var1 ParameterEstimates=x2var1 CovParms=x3var1
nobs=x4var1 Diffs=DIF_RESULT1;
run;

This is for logistic regression model.  

proc glimmix data=asdf METHOD=RSPL;
class CAMPUS_14 subgroup;
model y=x1 x2 x3 subgroup
/dist=binomial link=logit s ddfm=kr;
lsmeans group / ilink diff;
ods output  ModelInfo=x1var1 ParameterEstimates=x2var1 CovParms=x3var1
Diffs=DIF_RESULT1 LSMeans=LS1;
run;