### SAFILE in Winsteps control file (Experiment)

Just an experiment with a Winsteps control option: what happens if you enter random numbers in SAFILE?

The CATEGORY PROBABILITIES table shows strangely shaped curves.
Reliability becomes erratic or low.

Conclusion: You will definitely notice that something went wrong.

Winsteps reference:
http://www.winsteps.com/winman/safile.htm

### Advantages of Rasch model

Rasch model analysis has the following advantages:

1. Being logit scores, Rasch scores have no theoretical upper or lower bound (useful for statistical analysis).
2. Rasch scores allow pretest and posttest comparison based on different sets of test items (you can avoid giving the identical test at pre and post).
3. The Rasch model can handle missing values (as long as a subject is not missing all items).
4. The Rasch model (or rather, Rasch model software) comes with an excellent set of diagnostic statistics for evaluating how well the model and the data fit.
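
To make the first point concrete, here is a minimal Python sketch of the dichotomous Rasch model. It assumes the standard form, in which the probability of a correct (or endorsed) response depends only on the difference between person ability and item difficulty. The function names are illustrative and are not taken from Winsteps or any other Rasch software.

```python
from math import exp, log

def rasch_probability(ability, difficulty):
    """Probability of a correct/endorsed response under the dichotomous Rasch model."""
    return 1.0 / (1.0 + exp(-(ability - difficulty)))

def logit(p):
    """Log-odds of a probability; grows without bound as p approaches 0 or 1."""
    return log(p / (1.0 - p))

# Probabilities are confined to (0, 1), but their logits are not.
for p in (0.01, 0.5, 0.99):
    print(p, round(logit(p), 2))
```

Taking the logit of `rasch_probability(ability, difficulty)` gives back `ability - difficulty`, which is why Rasch measures live on an unbounded logit scale.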

### QC strategy for Rasch model results

BASIC QC

#01 CHECK RESPONSE VALUES (IF THEY ARE CODED INTO CORRECT NUMERALS)

Items used for Rasch model analysis are usually ordinal variables based on response values such as “Strongly agree,” “Agree,” “Disagree,” and “Strongly Disagree.”  Code these so that higher agreement receives higher numbers:

• Strongly agree 4
• Agree 3
• Disagree 2
• Strongly Disagree 1

If these numbers are flipped by mistake, you will have a catastrophic situation where the results are flipped as well.  Do two things to prevent such a catastrophe:

1. Confirm the coding by looking at the actual survey and at the data (look at it until your eyes bleed).
2. People tend to agree with items because they feel social pressure to report good things when taking a survey. Look at the original data and check that positive responses dominate. If negative responses dominate instead, suspect flipped coding.
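
The second check can be sketched in a few lines of Python. This is only an illustration under assumptions: the responses are made up, and the cutoff of 3 ("Agree" or higher counts as positive) and the helper names are my own choices.

```python
# Coding scheme from the list above: higher agreement gets a higher number.
CODES = {"Strongly agree": 4, "Agree": 3, "Disagree": 2, "Strongly Disagree": 1}

def reverse_code(value, low=1, high=4):
    """Flip a score on a low..high scale (4 becomes 1, 3 becomes 2, and so on)."""
    return low + high - value

def share_positive(values, cutoff=3):
    """Fraction of responses at or above the cutoff."""
    return sum(1 for v in values if v >= cutoff) / len(values)

# Made-up responses for one item.
responses = ["Agree", "Strongly agree", "Agree", "Disagree", "Strongly agree"]
coded = [CODES[r] for r in responses]
flipped = [reverse_code(v) for v in coded]

# A mostly positive item should show a high share; a flipped one shows a low share.
print(share_positive(coded), share_positive(flipped))
```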

#02 CHECK THE N OF SUBJECTS INCLUDED IN THE ANALYSIS

Check the output and confirm that the number of subjects used is correct.  Checking the number of subjects is the numero uno protection against errors.

#03 CHECK THE N OF ITEMS INCLUDED IN THE ANALYSIS

Check the output and confirm that the number of items used is correct.  Especially when you are not using all of the items in your analysis (you might have decided to drop some), be sure you used the ones you wanted to use.  With Winsteps, a misspecified control file can lead to subject IDs being read as response data by mistake.  Avoid this (such a case will produce an extremely low reliability score).
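
Checks #02 and #03 can also be run on the raw data before it reaches Winsteps. The following Python sketch assumes the responses have already been read into a list of rows (one row per subject, `None` for a missing response). The function name, the data, and the valid codes 1 to 4 are assumptions for illustration.

```python
def check_response_matrix(rows, n_subjects, n_items, valid_codes=(1, 2, 3, 4)):
    """Return a list of problems found; an empty list means the checks passed."""
    problems = []
    if len(rows) != n_subjects:
        problems.append(f"expected {n_subjects} subjects, found {len(rows)}")
    for index, row in enumerate(rows):
        if len(row) != n_items:
            problems.append(f"row {index}: expected {n_items} items, found {len(row)}")
        # Values outside the valid codes often mean an ID column was read as a response.
        unexpected = [v for v in row if v is not None and v not in valid_codes]
        if unexpected:
            problems.append(f"row {index}: unexpected values {unexpected}")
    return problems

good = [[4, 3, 2], [1, None, 3]]                 # made-up data, 2 subjects x 3 items
with_id = [[1023, 4, 3, 2], [1024, 1, None, 3]]  # same data with an ID column left in

print(check_response_matrix(good, 2, 3))
print(check_response_matrix(with_id, 2, 3))
```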

#04 CHECK WHAT VALUE WAS USED FOR MISSING SCORES

When a subject does not provide any response, Winsteps marks the record with a token number (-2 in the status variable, according to the reference below) to indicate that it is a missing value.  This value should be treated as a missing value and should NOT be included in the analysis dataset.  If you treat a token value (-2 in the case of Winsteps) as a true value, you will have a catastrophic situation where an arbitrary value is used as a real data point.  You should replace such a number with “.” (dot) before analysis, because statistical software such as SAS or SPSS treats a dot as a missing value.

Winsteps reference: Definition of the status variable in Winsteps output

When a subject lacks data, the missing value is indicated by -2.

http://www.winsteps.com/winman/ifile.htm
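
A minimal Python sketch of this cleanup step follows. It assumes the Winsteps output has already been read into records with hypothetical `status` and `measure` fields, and that a status of -2 marks a record with no usable data, as described above. Keying on the status field rather than on the score itself avoids wiping out a legitimate measure that happens to equal -2.

```python
NO_RESPONSE_STATUS = -2  # assumption: the token value described in the note above

def blank_out_missing(records):
    """Return copies of the records with the measure set to None where status is -2."""
    cleaned = []
    for record in records:
        record = dict(record)  # copy, so the input is left untouched
        if record["status"] == NO_RESPONSE_STATUS:
            record["measure"] = None
        cleaned.append(record)
    return cleaned

# Made-up records; the second subject has a real measure of -2.0 logits.
records = [
    {"id": "A", "status": 1, "measure": 0.75},
    {"id": "B", "status": 1, "measure": -2.0},
    {"id": "C", "status": -2, "measure": 0.0},
]
print(blank_out_missing(records))
```

When writing the cleaned data back out for SAS or SPSS, `None` would be written as a dot.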

ADVANCED QC

Basic QC procedures should catch 99% of errors.  The advanced procedures below are more intricate.

#05 INVESTIGATE ITEM DIFFICULTY SCORES

If you are using item difficulty parameters provided by the developer, compare them against the ones you derive from the dataset you collected.  The two sets should be more or less comparable.  If they are not, investigate whether a data error is the cause.
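
One simple way to run this comparison is to flag the items whose two difficulty values differ by more than some tolerance. The Python sketch below assumes both sets are expressed on the same logit scale with the same origin. The 0.5-logit tolerance, the item names, and the values are arbitrary choices for illustration, not a published criterion.

```python
def flag_shifted_items(published, estimated, tolerance=0.5):
    """Return the items whose two difficulty values differ by more than the tolerance."""
    shared = sorted(set(published) & set(estimated))
    return [item for item in shared
            if abs(published[item] - estimated[item]) > tolerance]

# Made-up difficulties in logits.
published = {"Q1": -1.5, "Q2": -0.5, "Q3": 0.5, "Q4": 1.5}
estimated = {"Q1": -1.4, "Q2": -0.6, "Q3": 0.6, "Q4": -1.4}  # Q4 looks suspicious

print(flag_shifted_items(published, estimated))
```

An item that swings from hard to easy, like Q4 here, is a good candidate for a reversed coding or a misaligned column.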

#06 HISTORICALLY COMPARE RESULTS

If you are repeating the study, compare your results with historical data (e.g., last year’s result).

### Cronbach's alpha

The UCLA site explains Cronbach's alpha in terms of the average inter-item correlation among survey items.  It also says that alpha is not a measure of unidimensionality but a measure of internal consistency.  (Intuitively, I feel that what is coherent also tends to be unidimensional. I think the point is that the measure is designed to assess internal correlation, not dimensionality.)

http://www.ats.ucla.edu/stat/spss/faq/alpha.html

Standardized versus Raw

This SAS website says one should use the standardized version of the measure (as opposed to raw).

https://support.sas.com/documentation/cdl/en/procstat/63104/HTML/default/viewer.htm#procstat_corr_sect032.htm

It says: "Because the variances of some variables vary widely, you should use the standardized score to estimate reliability."

A note to myself: Does this mean that if I standardize all items before the analysis, I get the same value for raw and standardized alpha?  I can test this with an experiment.
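
Here is a small version of that experiment in plain Python, using the usual textbook formulas: raw alpha from item and total-score variances, standardized alpha from the mean inter-item correlation. The data are made up, with one item on a much wider scale than the others. Standardized items have a covariance matrix equal to their correlation matrix, so raw alpha on z-scored items should match standardized alpha on the original items.

```python
from math import sqrt

def mean(xs):
    return sum(xs) / len(xs)

def cov(xs, ys):
    """Sample covariance (n - 1 denominator); cov(xs, xs) is the sample variance."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)

def raw_alpha(items):
    """Raw alpha: k/(k-1) * (1 - sum of item variances / variance of total score)."""
    k = len(items)
    totals = [sum(row) for row in zip(*items)]
    item_variances = sum(cov(item, item) for item in items)
    return k / (k - 1) * (1 - item_variances / cov(totals, totals))

def standardized_alpha(items):
    """Standardized alpha: k*r / (1 + (k-1)*r), r = mean inter-item correlation."""
    k = len(items)
    correlations = [cov(a, b) / sqrt(cov(a, a) * cov(b, b))
                    for i, a in enumerate(items) for b in items[i + 1:]]
    r = mean(correlations)
    return k * r / (1 + (k - 1) * r)

def zscore(xs):
    m, s = mean(xs), sqrt(cov(xs, xs))
    return [(x - m) / s for x in xs]

# Made-up data: each inner list is one item, each position one subject.
items = [[1, 2, 3, 4, 5, 3], [2, 2, 4, 4, 5, 2], [10, 30, 20, 50, 40, 30]]

print(raw_alpha(items), standardized_alpha(items))
print(raw_alpha([zscore(item) for item in items]))
```

With these numbers the raw and standardized values differ considerably on the original items, and the raw alpha of the z-scored items agrees with the standardized alpha up to rounding error.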

### PROC DATASETS to delete all datasets in the WORK library

/* KILL deletes every member of the WORK library; NOLIST suppresses the directory listing */
proc datasets library = work kill nolist;
quit;

### Survey Sampling Design and Regression Analysis using SAS SURVEYREG

Simple random sampling given the population size

http://support.sas.com/documentation/cdl/en/statug/63033/HTML/default/viewer.htm#statug_surveyreg_sect003.htm

Confirm that regular regression analysis (PROC REG) produces larger standard errors. TOTAL=4000 gives PROC SURVEYREG the population size, not the sample size. With it, the procedure can apply a finite population correction, which leads to smaller standard errors (more precise estimates).

data IceCream;
input Grade Spending Income Kids @@;
datalines;
7 7 39 2 7 7 38 1 8 12 47 1
9 10 47 4 7 1 34 4 7 10 43 2
7 3 44 4 8 20 60 3 8 19 57 4
7 2 35 2 7 2 36 1 9 15 51 1
8 16 53 1 7 6 37 4 7 6 41 2
7 6 39 2 9 15 50 4 8 17 57 3
8 14 46 2 9 8 41 2 9 8 41 1
9 7 47 3 7 3 39 3 7 12 50 2
7 4 43 4 9 14 46 3 8 18 58 4
9 9 44 3 7 2 37 1 7 1 37 2
7 4 44 2 7 11 42 2 9 8 41 2
8 10 42 2 8 13 46 1 7 2 40 3
9 6 45 1 9 11 45 4 7 2 36 1
7 9 46 1
;
run;

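/* Survey regression: TOTAL= is the population size */
proc surveyreg data=IceCream total=4000;
model Spending = Income / solution;
run;

/* Ordinary regression on the same data, for comparison */
proc reg data=IceCream;
model Spending = Income;
run;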

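Stratified sampling by grade, given the population size of each stratum

/* Population size of each stratum (grade), read through the _TOTAL_ variable */
data StudentTotals;
input Grade _TOTAL_;
datalines;
7 1824
8 1025
9 1151
;
run;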

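/* Selection probability = students sampled / students in the grade */
data IceCream2;
set IceCream;
if Grade=7 then Prob=20/1824;
else if Grade=8 then Prob=9/1025;
else if Grade=9 then Prob=11/1151;
/* Sampling weight = inverse of the selection probability */
Weight=1/Prob;
run;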

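/* Stratified survey regression; Kids is declared in CLASS but not used in this model */
proc surveyreg data=IceCream2 total=StudentTotals;
strata Grade /list;
class Kids;
model Spending = Income / solution;
weight Weight;
run;

/* Weighted ordinary regression, for comparison */
proc reg data=IceCream2;
model Spending = Income;
weight Weight;
run;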
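
The weights used above are just population counts divided by sample counts. The Python sketch below reproduces that arithmetic from the numbers in the SAS code. It also computes a finite population correction factor of the common form 1 - n/N, which is my assumption about the kind of adjustment the population totals make possible, not a reproduction of PROC SURVEYREG's internal variance computation.

```python
# Sample and population counts per grade, taken from the SAS code above.
strata = {
    7: {"sampled": 20, "population": 1824},
    8: {"sampled": 9, "population": 1025},
    9: {"sampled": 11, "population": 1151},
}

def sampling_weight(sampled, population):
    """Inverse of the selection probability sampled/population."""
    return population / sampled

def fpc(sampled, population):
    """Finite population correction factor, 1 - n/N."""
    return 1 - sampled / population

for grade, counts in strata.items():
    print(grade, sampling_weight(**counts), fpc(**counts))

# Summed over all sampled students, the weights add back up to the population size.
total = sum(c["sampled"] * sampling_weight(**c) for c in strata.values())
print(total)
```

Because only 40 of 4000 students are sampled, the correction factors are close to 1, so the correction on its own changes the variance only slightly.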

### Excel function to replicate the t-test p-value from SAS PROCs (e.g., GLIMMIX)

Phil of SAS helped me identify this function. Thank you.

The two-sided p-value of a t-test conducted in PROC GLIMMIX (and most likely in other regression procedures) can be reproduced with this Excel formula:

=2*(1-T.DIST( T_VALUE , DEG_OF_FREEDOM ,TRUE))

where T_VALUE must be the absolute value of the original t-value (e.g., if it is -2, enter 2).

With the last argument set to TRUE, T.DIST returns the CDF (cumulative distribution function), not the PDF (probability density function).  I will explicitly discuss what these are in the near future.

I wanted to know how much of the statistical results (from PROC GLIMMIX in this case) comes from SAS's internal computation (i.e., I can't replicate the results outside SAS) and how much can be reproduced in Excel from the SAS output.
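
The same calculation can be reproduced outside both SAS and Excel. The Python sketch below uses only the standard library and integrates the Student t density numerically with Simpson's rule, which is my own choice of method and not how SAS or Excel computes the value. It returns the two-sided p-value, that is, the same quantity as 2*(1-CDF(|t|)).

```python
from math import exp, lgamma, log, pi

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_norm = lgamma((df + 1) / 2) - lgamma(df / 2) - 0.5 * log(df * pi)
    return exp(log_norm - (df + 1) / 2 * log(1 + x * x / df))

def two_sided_p(t_value, df, steps=2000):
    """Two-sided p-value, equal to 2 * (1 - CDF(|t|)); steps must be even."""
    upper = abs(t_value)
    if upper == 0:
        return 1.0
    h = upper / steps
    # Simpson's rule for the area under the density between 0 and |t|.
    total = t_pdf(0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    half_area = total * h / 3
    # The density is symmetric, so the area between -|t| and +|t| is twice that.
    return 1 - 2 * half_area

print(two_sided_p(-2.0, 10))
print(two_sided_p(1.0, 1))  # about 0.5: with 1 df the t distribution is the Cauchy
```

The sign of the t-value does not matter here because the function takes the absolute value itself, which mirrors the note about entering 2 instead of -2 in the Excel formula.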