The power of prediction: Predictive accuracy depends on data set size, sources and quality

In my last blog post, I wrote about how predictive analytics needs comprehensive health care data to achieve a high degree of predictive accuracy. In today’s post, I’ll dig deeper into the variables that increase predictive accuracy.

The first variable is the size of the sample data set. As the sample size grows, a model’s uncertainty and degree of bias decrease. Increasing the sample size increases the chances of seeing all likely events and patient variation. By using a large and diverse sample of patients from many health systems, geographic regions and demographics, you can reduce the likelihood of skewing a study with a homogeneous sample.

For example, a small sample of patients from the same health system with similar demographics would not effectively represent the larger heterogeneous congestive heart failure (CHF) population because it does not take into account the variability of all possible independent variables that could lead to a dependent event. Hence, the relative predictive power of a statistical model increases substantially when using millions of patients instead of hundreds of patients.
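To put a rough number on the sample size point, here is a minimal sketch of how the uncertainty around an estimated event rate narrows as the sample grows. The 20 percent rate and the function names are assumptions chosen for illustration, not figures from any study, and the calculation uses the standard normal approximation for a proportion.

```python
import math

# Hypothetical readmission rate, assumed purely for illustration.
ASSUMED_RATE = 0.20


def standard_error(rate: float, n: int) -> float:
    """Standard error of a proportion observed in n independent patients."""
    return math.sqrt(rate * (1.0 - rate) / n)


def confidence_interval_95(rate: float, n: int) -> tuple[float, float]:
    """Approximate 95% interval using the normal approximation."""
    half_width = 1.96 * standard_error(rate, n)
    return (rate - half_width, rate + half_width)


# Compare hundreds, thousands and millions of patients.
for n in (100, 10_000, 1_000_000):
    low, high = confidence_interval_95(ASSUMED_RATE, n)
    print(f"n={n:>9,}: 95% interval {low:.4f} to {high:.4f}")
```

Under these assumptions, the interval runs from roughly 12 percent to 28 percent with 100 patients and from about 19.9 percent to 20.1 percent with one million. Note that this only captures sampling uncertainty. A large sample drawn from a single homogeneous population can still be biased, which is why diversity of sources matters as much as size.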

Relevant and varied data sources are critical for uncovering the most predictive variables. And it’s not just about adding clinical data to claims data. Socioeconomic and care management data should also be integrated into one predictive data set.

In 2011, the Veterans Health Administration (VHA) reviewed risk prediction models specifically for hospital readmission. It found that social and environmental factors such as access to care, social support, and substance abuse contributed to readmission risk in some models. In addition, the authors discussed how care management factors such as discharge follow-up and coordination of care with primary care physicians likely impact readmission risk.

Finally, data quality is paramount. It is one thing to aggregate data from diverse sources, settings and organizations, but unless the resulting set is cleaned, normalized and validated, it will not be useful in predictive models. Each data set must be prepared the same way, using the same clinical classification (a single consistent ontology) or else statistical power will suffer.
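As a simple illustration of what preparing each data set the same way can look like, the sketch below maps diagnosis codes from two sources onto one shared concept and sets aside anything it cannot map. The lookup table, field names and `normalize` function are hypothetical and for illustration only. A real pipeline would rely on a validated clinical crosswalk rather than a hand-written dictionary.

```python
# Illustrative lookup only, not a validated clinical crosswalk.
CODE_TO_CONCEPT = {
    ("icd9", "428.0"): "heart_failure",
    ("icd10", "I50.9"): "heart_failure",
}


def normalize(records):
    """Map each record to a shared concept; collect records that cannot be mapped."""
    clean, rejected = [], []
    for rec in records:
        # Tidy up whitespace and letter case before looking up the code.
        key = (rec["system"].strip().lower(), rec["code"].strip().upper())
        concept = CODE_TO_CONCEPT.get(key)
        if concept is None:
            rejected.append(rec)  # hold for review instead of guessing
        else:
            clean.append({"patient_id": rec["patient_id"], "concept": concept})
    return clean, rejected


# Hypothetical records from two different source systems.
sample_records = [
    {"patient_id": "A1", "system": "ICD9", "code": "428.0"},
    {"patient_id": "B7", "system": "icd10 ", "code": "i50.9"},
    {"patient_id": "C3", "system": "local", "code": "CHF?"},
]

clean, rejected = normalize(sample_records)
print(f"{len(clean)} records normalized, {len(rejected)} held for review")
```

The design choice worth noting is that unmapped records are held back rather than silently dropped or force-fitted, so data quality problems stay visible before they reach a model.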

For more on the variables that determine the accuracy of predictive analytics, download “Predictive analytics: Poised to drive population health.”

–Jeremy Orr, MD, MPH, Chief Medical Officer, Optum Analytics
