Improving the Accuracy of Statistical Projections

Investigators: J. Sunil Rao (Co-PI), Daniel Andres Diaz-Pachon (PI).

Summary

In many important practical problems, there is interest in predicting for new data that lie outside the range of the training data. This is the so-called extrapolation prediction, or projection, problem, which standard prediction methods handle poorly. Extrapolation has intrigued researchers since the time of Archimedes, who developed some of the first methods of function extrapolation for numerical analysis. In statistical circles, the focus has been on predictive modeling, with interesting distinctions drawn between predictability and prediction. In that literature, the most interesting extrapolation problem asks whether a result found from one set of data will hold for some other, different data (e.g. for a sample from another population). The main issue then comes down to the choice of model form relating the response of interest to a set of available covariates. Typically, even if the training model is correct, extrapolated prediction intervals widen dramatically. If the training model is incorrect, bias enters as well.
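The widening of extrapolated prediction intervals can be illustrated with a minimal sketch. It assumes a simple linear regression with a single covariate and a correctly specified model, and it uses a normal quantile in place of the exact t quantile to stay dependency-free. The function name `prediction_half_width` and the simulated data are hypothetical and are not taken from the project.

```python
import math
import random
from statistics import NormalDist, mean


def prediction_half_width(x, y, x0, level=0.95):
    """Approximate half-width of the prediction interval at x0 for the
    simple linear regression y = b0 + b1 * x + error.

    Assumption: a normal quantile replaces the exact t quantile.
    """
    n = len(x)
    xbar, ybar = mean(x), mean(y)
    sxx = sum((xi - xbar) ** 2 for xi in x)

    # Ordinary least squares estimates of slope and intercept.
    b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    b0 = ybar - b1 * xbar

    # Residual standard deviation with n - 2 degrees of freedom.
    rss = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
    s = math.sqrt(rss / (n - 2))

    # The (x0 - xbar)^2 / sxx term grows as x0 moves away from the
    # centre of the training covariates; this drives the widening.
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return z * s * math.sqrt(1 + 1 / n + (x0 - xbar) ** 2 / sxx)


# Simulated training data with covariate values between 0 and 5.
rng = random.Random(0)
x_train = [i / 10 for i in range(51)]
y_train = [1.0 + 2.0 * xi + rng.gauss(0, 1) for xi in x_train]

inside = prediction_half_width(x_train, y_train, 2.5)    # within the range
outside = prediction_half_width(x_train, y_train, 15.0)  # far outside it
```

Here `outside` exceeds `inside` even though the fitted model form is correct, and the gap keeps growing as the new covariate value moves further from the training range. If the model form were wrong, bias would come on top of this widening, and the sketch does not capture that.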

Here is a typical example. An older individual with later-stage cancer (designated patient X) is deciding whether or not to undergo chemotherapy. His doctors encourage him to do so, quoting a five-year success probability for the treatment (in terms of cancer recurrence) of over 50%. However, patient X also suffers from co-morbidities, including significant kidney disease resulting in markedly reduced kidney function, and reduced cardiac function that has required significant bypass surgery. The doctors indicate that their quoted success probability is not based on a model that can readily account for the co-morbidities and, on top of that, that the model was not designed for patients of advanced age. Any prediction for patient X would therefore be an extrapolation from the training model. The best advice they can give is to treat the 50% success estimate as an upper bound. When weighing such a big decision, this is less than satisfactory. Yet this type of problem occurs regularly in practice, and it frankly reflects a lack of diversity (age, gender, race, etc.) in the data used to train research models.

The above is an example of projecting a response associated with new data. In mixed model settings, projection of a mixed effect associated with new data is also of interest. For instance, in small area estimation (SAE), mixed models are used to derive model-based estimators of area mixed effects, such as small area means, using mixed model prediction (MMP) or classified mixed model prediction (CMMP). When new areas fall outside the range of the auxiliary variables in the training data, the goal is to generate a mixed model projection of the mixed effect for those areas. It is this type of mixed model projection that will be the focus of this paper.