Predicting Hospital Readmissions with Machine Learning

Hospital Automation: A Data-Driven Approach

A major concern for modern hospitals is the number of patients who are readmitted within 30 days of an original admission. Such an early readmission may be planned or unplanned, but the Medicare Payment Advisory Commission reported that 17.6% of U.S. hospital admissions resulted in readmissions within 30 days of discharge. As many as 76% of these readmissions were deemed avoidable through patient education, follow-up, and effective communication with primary care doctors. All in all, these avoidable readmissions are estimated to cost $15 billion in Medicare spending.

Given this avoidable financial burden, governments in the U.S. and around the world have introduced a variety of financial penalties for hospitals with excess early readmissions in an effort to curb unnecessary healthcare spending. But how can hospitals predict which patients are likely to be readmitted, so that they can help those patients avoid an early readmission? Many in the healthcare industry have turned to machine learning algorithms to build risk prediction models, home in on certain diseases and conditions, like sepsis, and more effectively predict relapse rates.

Methods for Prediction

The problem of predicting early readmissions is fundamentally a binary classification problem: a patient is either readmitted early or they are not. The first step, then, is to separate readmissions from non-readmissions wherever that data is available.

The Veterans Health Administration, for instance, has extensive data available on patient readmissions for its subset of hospitals. Using publicly available VHA data, we can see, for each hospital, the number of readmissions and the total number of cases over a two-year period.
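From summary data of this shape, a per-hospital readmission rate is a straightforward computation. The sketch below uses hypothetical hospital names, counts, and field names, not the actual VHA data fields:

```python
# Sketch: per-hospital readmission rates from VHA-style summary data.
# Hospital names, counts, and keys are illustrative only.
hospitals = [
    {"name": "Hospital A", "readmissions": 120, "total_cases": 800},
    {"name": "Hospital B", "readmissions": 45, "total_cases": 500},
]

for h in hospitals:
    rate = h["readmissions"] / h["total_cases"]
    print(f'{h["name"]}: {rate:.1%} readmission rate')
```

Ranking hospitals by this rate is one way to spot facilities with unusually high early-readmission burdens before any patient-level modeling begins.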

With detailed hospital-by-hospital information, researchers can use factors like patient diagnosis and other available demographic variables to predict whether each one of these patients will be readmitted or not (however, this patient-level data must be requested privately for privacy reasons). But what does it mean for a readmission to be tagged as “early”?

Medicare’s definition of an early readmission is a readmission that occurred within 30 days of discharge. This determination is both a clinical judgment and an empirical observation. Note that readmission percentages begin to stabilize at around 30 days after discharge for a number of diagnoses:
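Applying Medicare's 30-day rule to raw admission records is a simple date comparison. A minimal sketch, with illustrative dates rather than real patient data:

```python
from datetime import date

# Sketch: tag a readmission as "early" if it occurs within 30 days of
# the prior discharge, following Medicare's definition.
def is_early_readmission(discharge: date, readmission: date,
                         window_days: int = 30) -> bool:
    return 0 <= (readmission - discharge).days <= window_days

print(is_early_readmission(date(2020, 1, 1), date(2020, 1, 20)))  # True
print(is_early_readmission(date(2020, 1, 1), date(2020, 3, 1)))   # False
```

Running this function over each patient's admission history yields the 0/1 early-readmission labels used to train the models below.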

Graphical Representation of Time to Readmission (From Medicare Hospital-Wide (All-Condition) 30‐Day Risk-Standardized Readmission Measure Report, prepared for the Centers for Medicare & Medicaid Services)


Then, once early readmissions have been identified, it is possible to predict a binary event. For instance, fitting a simple binary model to predict whether a patient will be readmitted early or not (0 or 1) might involve two predictor variables: age and diagnosis. In Python, a simplified training dataset might be structured as follows, where each row of X is an individual observation from a patient (all of whom were diagnosed with pneumonia here, encoded numerically since sklearn expects numeric features):

>>> Age = [20, 30, 75, 60]
>>> Diagnosis = [0, 0, 0, 0]  # 0 = pneumonia
>>> Readmission = [0, 1, 1, 0]
>>> X = list(zip(Age, Diagnosis))  # one (age, diagnosis) row per patient
>>> y = Readmission

Then, using Python’s sklearn library, we could fit a binary prediction model. For instance, here, we fit a Support Vector Machine (SVM):

>>> from sklearn import svm
>>> clf = svm.SVC()
>>> clf.fit(X, y)

The SVM approach works by algorithmically finding an optimal dividing boundary (or hyperplane) between patients who were readmitted and those who were not, based on the demographic and diagnosis data provided for each patient. The algorithm identifies the decision boundary that maximizes the margin, that is, the minimum distance from the boundary to the nearest training data points.

Thus, in this particular example, the SVM algorithm finds the boundary that most widely separates the predictor (X) data of readmitted patients from that of non-readmitted patients. We can then make predictions based on where an individual patient lies in relation to the SVM decision boundary. Visually, the prediction task looks like the following graph:
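Once the classifier is fitted, scoring a new patient is a single call to `predict`. A self-contained sketch using the toy data from above (diagnosis encoded numerically, since sklearn requires numeric features; all values are illustrative):

```python
from sklearn import svm

# Toy training data: (age, diagnosis code) per patient; 0 = pneumonia
# here, an illustrative encoding rather than a real code system.
X = [[20, 0], [30, 0], [75, 0], [60, 0]]
y = [0, 1, 1, 0]  # 1 = readmitted within 30 days

clf = svm.SVC()
clf.fit(X, y)

# Predict for a new 70-year-old pneumonia patient: the output is
# 0 or 1 depending on which side of the boundary the patient falls.
print(clf.predict([[70, 0]]))
```

In practice the feature rows would carry many more variables than these two, but the fit/predict workflow is identical.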

Image and Data from Python sklearn's SVM Margins Example


If we imagine that values in the upper, brown portion of the graph are values the SVM has determined are likely to be readmissions, and those in the lower, blue portion are not, then hospitals can feed a patient's data into a fitted SVM classifier to identify which side of the decision boundary the patient lies on, and thus whether they are likely to be readmitted. If a patient’s data passes through the fitted SVM we produced above and the patient is predicted to fall in the upper, brown region, then hospital staff might spend more time with that patient, educating them and helping them avoid a costly early readmission.

To effectively predict whether a patient will be readmitted early, researchers have explored not only Support Vector Machines like the one outlined above, but also a variety of other modeling techniques, such as Random Forests, Neural Networks, and Logistic Regression. In addition, some efforts have drawn on different sources of data. Beyond Medicare and Veterans Health Administration data, some researchers have used retrospective as well as real-time administrative data to bring additional patient information (such as number of address changes and socioeconomic status) to bear on predicting early readmissions.
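One practical reason these comparisons are common is that sklearn's estimators share a uniform fit/predict interface, so swapping model families requires almost no code changes. A sketch on the same illustrative toy data:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

# Illustrative toy data: (age, diagnosis code) rows and 0/1 labels.
X = [[20, 0], [30, 0], [75, 0], [60, 0]]
y = [0, 1, 1, 0]

# The same data can be fed to different model families with an
# identical interface, making head-to-head comparison straightforward.
for model in (LogisticRegression(), RandomForestClassifier(random_state=0)):
    model.fit(X, y)
    print(type(model).__name__, model.predict([[70, 0]]))
```

On real readmissions data, each family would then be compared on held-out patients rather than on its training set.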

Validation and Success of Readmissions Predictions

Researchers validate their results by making predictions on patient data that was held out from the original model-fitting phase of the analysis, as well as on new data collected after the model was initially fit.
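The holdout procedure can be sketched with sklearn's `train_test_split`: fit on one portion of the patients, then score on the portion the model never saw. All data here is synthetic, for illustration only:

```python
from sklearn.model_selection import train_test_split
from sklearn import svm

# Synthetic toy data: (age, diagnosis code) rows; in this illustrative
# set, older patients happen to be the ones readmitted early.
X = [[age, 0] for age in range(20, 80, 5)]
y = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]

# Hold out 25% of patients from the model-fitting phase.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

clf = svm.SVC().fit(X_train, y_train)

# Fraction of held-out patients classified correctly.
print(clf.score(X_test, y_test))
```

Accuracy on held-out patients is only a first check; for readmissions work, researchers typically report the c-statistic discussed next.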

Let us consider the AUROC, or area under the receiver operating characteristic curve (also called the c-statistic), as a measure of model performance. The c-statistic is the proportion of randomly drawn pairs of one early-readmitted and one non-early-readmitted patient that a model ranks correctly, assigning the higher predicted risk to the readmitted patient. Across the board, the various binary prediction model types mentioned above tend to produce c-statistics of at most 70-80%, although some researchers report values in the 80-90% range for certain datasets using the same methods of binary prediction.
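Given true labels and a model's predicted risk scores, sklearn computes this statistic directly. The labels and scores below are illustrative values, not results from a real model:

```python
from sklearn.metrics import roc_auc_score

# Illustrative true labels (1 = early readmission) and predicted risks.
y_true = [0, 0, 1, 1, 0, 1]
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7]

# 8 of the 9 readmitted/non-readmitted pairs are ranked correctly,
# so the c-statistic is 8/9, roughly 0.89.
print(roc_auc_score(y_true, y_score))
```

A c-statistic of 0.5 corresponds to random ranking and 1.0 to perfect discrimination, which is why the 0.70-0.80 values reported in the literature represent useful but imperfect models.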

While these c-statistics are not perfect, they are serviceable for beginning the process of reducing the penalties imposed on hospitals for exceeding early readmissions expectations. No doubt, as research continues to identify new and important variables for predicting readmissions, these numbers will continue to improve, potentially saving hospitals a great amount of money.

Additionally, other researchers have pointed out that specific models tend to depend heavily on the setting and the population they were originally trained on. As a result, models that predict early readmissions for many different populations at once may see reduced performance. There appear to be many socioeconomic, geographic, and other factors at play whose influence researchers have not always been able to capture in their predictions.


Predicting early readmissions remains a problem of significant importance for those invested in healthcare around the world. Significant financial penalties are assigned to hospitals that exceed their limit of early readmissions. Thus, it is incredibly important for hospitals to be able to predict which patients may be readmitted early and put extra resources into these “high risk” patients. While existing binary prediction models are by no means perfect, new data and continued research into the variables that effectively predict early readmissions promise to continue to improve these approaches.