Cluster effect: What does HLM solve?

When I first learned HLM (hierarchical linear modeling) in a graduate program in 1994/5, I struggled with the following statement:

Errors are correlated.

Up to that point, in Stat 101, correlation was about two columns of data (e.g., math test scores and science test scores). Errors in the context of regression analysis are residuals from the model, and they are stored in one column. I had conceptual difficulty understanding how values contained in one column (one variable) could be correlated.

Later, when I learned about geostatistics at a workshop, the model was supposed to correct the data dependence caused by geographical proximity. This time, it was about how the temperature of town A, for example, is similar to that of an adjacent town B, so the observations depend on one another.

I also learned about the econometric approach to dealing with the fact that observations are correlated over time (my test score today depends on my test score yesterday).
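To make the time-dependence idea concrete, here is a minimal sketch in Python. It assumes a simple AR(1) process (each value is a fraction of the previous value plus noise), which is only one of many possible time-series models; the function names and the coefficient 0.6 are made up for illustration. It shows that a single column of numbers can be correlated with a shifted copy of itself.

```python
import random

def simulate_ar1(n, phi, seed=0):
    """Simulate an AR(1) series: y[t] = phi * y[t-1] + noise (illustrative only)."""
    rng = random.Random(seed)
    y = [rng.gauss(0, 1)]
    for _ in range(n - 1):
        y.append(phi * y[-1] + rng.gauss(0, 1))
    return y

def lag1_autocorrelation(y):
    """Pearson correlation between the series and itself shifted by one step."""
    a, b = y[:-1], y[1:]
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / (var_a * var_b) ** 0.5

# One column of "scores" where each value depends on the previous one.
dependent_scores = simulate_ar1(n=5000, phi=0.6)
# One column where each value is independent of the previous one.
independent_scores = simulate_ar1(n=5000, phi=0.0, seed=1)

r_dependent = lag1_autocorrelation(dependent_scores)
r_independent = lag1_autocorrelation(independent_scores)
print(round(r_dependent, 2), round(r_independent, 2))
```

The first number should come out clearly positive and the second close to zero, even though both are computed from a single column.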

After hearing again and again about statisticians' attempts to correct for data dependence, correlation of data, etc., I finally realized that data can be correlated within one column of data. If you and someone else are from the same school, your outcome data are correlated.
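One common way to put a number on "students from the same school are correlated" is the intraclass correlation (ICC). The sketch below is an illustration under stated assumptions, not a full HLM: it simulates scores as a school effect plus a student effect (the standard deviations 5 and 10 are arbitrary choices of mine) and estimates the ICC with the one-way ANOVA formula for equal-sized groups.

```python
import random

def simulate_scores(n_schools, n_students, sd_school, sd_student, seed=0):
    """Return (school_id, score) rows; students in a school share one school effect."""
    rng = random.Random(seed)
    rows = []
    for school in range(n_schools):
        school_effect = rng.gauss(0, sd_school)
        for _ in range(n_students):
            rows.append((school, 50 + school_effect + rng.gauss(0, sd_student)))
    return rows

def intraclass_correlation(rows):
    """One-way ANOVA estimate of the ICC, assuming equal-sized groups."""
    groups = {}
    for school, score in rows:
        groups.setdefault(school, []).append(score)
    k = len(groups)                          # number of schools
    n = len(next(iter(groups.values())))     # students per school
    grand_mean = sum(score for _, score in rows) / len(rows)
    means = {g: sum(v) / n for g, v in groups.items()}
    ms_between = n * sum((m - grand_mean) ** 2 for m in means.values()) / (k - 1)
    ms_within = sum(
        (s - means[g]) ** 2 for g, v in groups.items() for s in v
    ) / (k * (n - 1))
    return (ms_between - ms_within) / (ms_between + (n - 1) * ms_within)

# Schools differ: the theoretical ICC is 5**2 / (5**2 + 10**2) = 0.2.
icc_clustered = intraclass_correlation(simulate_scores(200, 25, 5, 10))
# Schools do not differ: the theoretical ICC is 0.
icc_flat = intraclass_correlation(simulate_scores(200, 25, 0, 10, seed=1))
print(round(icc_clustered, 2), round(icc_flat, 2))
```

An ICC near zero means school membership tells you nothing about the outcome; the further it is from zero, the more the one column of scores is "correlated with itself" through shared schools.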

Traditional statistical modeling techniques, such as OLS regression, rely on the assumption that the outcome data (strictly speaking, the errors) are uncorrelated: observations 1 and 2 are not related to one another at all. If this assumption is violated, we can no longer trust the results of statistical tests. In fact, in the presence of data dependence, standard errors tend to be underestimated, so test results are over-optimistic (too many statistically significant results).
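The over-optimism can be seen in a small simulation. The sketch below is my own illustration with made-up settings (20 schools, 30 students each, a school-level predictor with no true effect on the outcome). It fits plain OLS by hand, tests the slope at the 5% level using a normal approximation, and counts how often the test wrongly comes out significant, once with independent data and once with a shared school effect.

```python
import random

def naive_ols_rejects(n_schools, n_students, sd_school, rng):
    """Simulate one data set with NO true effect of x; test the slope with plain OLS."""
    x, y = [], []
    for _ in range(n_schools):
        school_x = rng.gauss(0, 1)               # school-level predictor, unrelated to y
        school_effect = rng.gauss(0, sd_school)  # shared by every student in the school
        for _ in range(n_students):
            x.append(school_x)
            y.append(school_effect + rng.gauss(0, 1))
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    rss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    se = (rss / (n - 2) / sxx) ** 0.5            # OLS standard error, assumes independence
    return abs(slope / se) > 1.96                # normal approximation to a 5% two-sided test

def false_positive_rate(sd_school, n_sims=500, seed=0):
    """Share of simulated data sets in which the (truly null) slope is 'significant'."""
    rng = random.Random(seed)
    hits = sum(naive_ols_rejects(20, 30, sd_school, rng) for _ in range(n_sims))
    return hits / n_sims

rate_independent = false_positive_rate(sd_school=0.0)  # no school effect
rate_clustered = false_positive_rate(sd_school=0.7)    # students share a school effect
print(rate_independent, rate_clustered)
```

With independent data the rate should land near the nominal 5%; with the school effect switched on it should be several times higher, which is the "too many statistically significant results" problem.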

I also learned that using HLM is one thing you can do to improve the situation, but group dependence may be just one of many problems in your data. Student test scores may also be related within friendship networks, and typically we do not have data on such membership.

In the same model, you can try to deal with group dependence (via HLM) or time dependence (via an ARIMA model, for example). Handling both at the same time is not impossible, but it is computationally challenging. You will have to choose your battles and fix one thing at a time.
