Introduction
The concept of error was confusing to me in the beginning because I couldn't tell where the story of error
begins and ends. There are two parts to a statistical model: the structural part (which is NOT about errors)
and the error part (which IS about errors). Statistics teachers are usually talking about one or the other, but
I don't think they realize that students get confused precisely because the distinction is never made explicit.
Stat teachers should say, "now I am talking about the structural part" or "now I am talking about errors." They
should also say that 99% of their talk is about errors; I think this is true. If you program an OLS regression
model in a matrix language, beta = inv(t(x)*x) * t(x)*y; is the only line you need to get the structural
part of the model. The rest is all about how to deal with errors. The other source of confusion is
that "error" can mean a lot of different things. I think we should use the term "residuals" to refer to what
we often call errors.
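To make "the structural part is one line" concrete, here is a minimal sketch in Python with NumPy (my own illustration with made-up numbers; the original line is in a generic matrix language):

```python
import numpy as np

# Toy data: an intercept column plus one predictor (made-up numbers).
X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
y = np.array([2.0, 4.0, 6.0])

# The entire structural part of OLS in one line (the normal equations):
beta = np.linalg.inv(X.T @ X) @ (X.T @ y)
print(beta)  # [intercept, slope] -> [0. 2.] for this toy data
```

Everything else a stats course covers about regression is, in this sense, about the residuals.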
What is "error/residual" in the context of obtaining an average score?
If John's math score is 70, Mike's is 80, and Luke's is 90, the average score
is 80. The residuals for John, Mike, and Luke are, respectively, -10, 0, and +10. Residuals measure how
far each score is from the average score (and in which direction).
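The arithmetic is small enough to sketch in a few lines of Python (my own illustration; the original does this by hand):

```python
scores = [70, 80, 90]                  # John, Mike, Luke
average = sum(scores) / len(scores)    # 80.0
residuals = [s - average for s in scores]
print(residuals)                       # [-10.0, 0.0, 10.0]
```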
What is "residual" in the context of an OLS regression model?
OLS regression is usually the first statistical model you learn in STAT 101. The equation
looks like this:
Y = intercept + beta*X + residual
Before thinking about how this equation works, we should look at a model that is a lot simpler:
Y= intercept + residual.
This model has no predictor, and it is the same as the procedure that obtains an average.
If the data set contains math scores of John, Mike, and Luke, it will look like this:
Y= [70, 80, 90]
intercept= 80
residual= [-10, 0, +10]
OLS is a technique for finding the value (the intercept, i.e., the average score in this case) that minimizes
the size of the residuals (more precisely, the sum of their squares). Imagine I completely ignored the algorithm
and guessed the average. I say the average is 70 just because I feel like it! Then observe the size of the
residuals (they get bigger).
Y= [70, 80, 90]
intercept= 70
residual= [0, +10, +20]
Compare:
residual= [-10, 0, +10]
VS
residual= [0, +10, +20]
Can you tell the residuals got bigger because I guessed the average/intercept without relying on a correct
algorithm? OLS provides an algorithm that minimizes the size of the residuals. You can google for the exact
algorithm, but it seriously is one short line. It is obviously always more accurate than my random guessing.
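To see "minimizing the size of residuals" in numbers, here is a small Python sketch (my own illustration, not from the original) comparing the sum of squared residuals at the OLS answer (80) and at my random guess (70):

```python
scores = [70, 80, 90]

def ssr(intercept):
    """Sum of squared residuals for a given intercept (i.e., guessed average)."""
    return sum((s - intercept) ** 2 for s in scores)

print(ssr(80))  # 200 -- the OLS solution (the mean) gives the smallest value
print(ssr(70))  # 500 -- the random guess makes the residuals bigger
```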
TRY THIS IN SAS AND STUDY THE TABLE "INFLUENCE DIAGNOSTICS":
data this;
  input Name $ score;
  cards;
John 70
Mike 80
Luke 90
;
run;
proc mixed data=this;
  model score= / s influence;
run;
What is a correlated error problem?
I continue with the example from above. Our OLS regression model doesn't have a predictor, so the intercept
will return the average value. Let's now think about the standard error of the intercept. The standard error of
the intercept tells you how precisely the average is estimated (the bigger the standard error, the worse the
precision). There is an algorithm to obtain a standard error, and it is based on only two things: the variance
of the residuals and the number of observations. Let's confirm what these two things are. What is the variance
of the residuals? In the example above, it is the variance of the scores 70, 80, and 90. For now you can use
Excel to get this value, by typing =VAR(70,80,90). The number of observations is 3. Using these
two pieces of information, you can derive a standard error.
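Here is that derivation in Python, a sketch assuming the usual formula for the standard error of a mean, SE = sqrt(variance / n):

```python
import math

scores = [70, 80, 90]
n = len(scores)                       # number of observations: 3
mean = sum(scores) / n                # 80.0
# Sample variance of the residuals (same as the variance of the scores here).
variance = sum((s - mean) ** 2 for s in scores) / (n - 1)   # 100.0
se = math.sqrt(variance / n)          # about 5.77
print(se)
```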
You can search for the exact algorithm by googling it, but here is the point.
The obtained standard error is good only when observations are independent and normally distributed. If John
and Luke are brothers, their scores are not independent. If John copied Luke's answers during the test, their
scores are not independent. If John and Luke learn from the same teacher and Mike goes to a different school,
their scores are not independent. So statistical decisions about whether something is significant or not (tests
which rely on standard errors) are only correct when observations are independent. Observations are obviously
dependent in education research and other research settings, which is why we have a lot of techniques to solve
this "correlated error problem."
When I first heard the expression "errors are correlated," I had the hardest time understanding
it. For me, "correlation" required at least two variables. I was used to the notion that, for example, height
and weight are correlated. But I didn't immediately understand when I was told "errors can be correlated." I was
used to seeing residuals/errors as something shown in one column, i.e., sort of like one variable. So what does
it mean to have errors that are correlated? It turns out it is possible for observations in one column of data
to be correlated. If John and Luke went to the same school and learned math from the same teachers, their
scores are alike (= correlated).
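A small simulation makes this concrete. The sketch below (my own illustration with made-up numbers, not from the original) draws scores for students grouped into schools. When students share a school effect, scores within a school are alike, and the sample mean bounces around more from sample to sample than it does with independent observations, which is exactly why the naive standard error is too optimistic:

```python
import random

random.seed(0)

def sample_mean(correlated):
    # Hypothetical setup: 3 schools with 2 students each. If correlated,
    # students in the same school share a school effect, so their scores
    # are alike (correlated within one column of data).
    scores = []
    for _ in range(3):
        school_effect = random.gauss(0, 10) if correlated else 0.0
        for _ in range(2):
            scores.append(80 + school_effect + random.gauss(0, 10))
    return sum(scores) / len(scores)

def sd_of_means(correlated, reps=5000):
    # How much the sample mean varies across repeated samples.
    means = [sample_mean(correlated) for _ in range(reps)]
    m = sum(means) / reps
    return (sum((x - m) ** 2 for x in means) / reps) ** 0.5

print(sd_of_means(False))  # roughly 10/sqrt(6), the independent case
print(sd_of_means(True))   # noticeably larger: shared school effects
```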
