The Project

The diabetes readmission prediction model was developed with tree-based classification algorithms: a single Decision Tree and the ensemble methods Extreme Gradient Boosting (XGBoost), AdaBoost, and CatBoost. In addition, to increase accuracy, we stacked the models with a stacking classifier, using the CatBoost classifier as the final estimator.


  • Data Extraction: We downloaded the dataset from the UCI Machine Learning Repository and saved it locally in CSV format.
  • Data Exploration: We explored the dataset to understand each feature and the target class. We observed some missing values and inconsistent entries in the dataset.
  Plot showing missing values in the dataset
  • Data Transformation: We transformed the features in the dataset into appropriate categories and then encoded the categories as numerical values.
Diagnosis Feature after transformation
  • Data Normalization: We rescaled the data so that all values fall in the range 0 to 1.
  • Resampling: Because the dataset is imbalanced, we resampled it so that the classes are more evenly represented, which reduces model bias.
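
The normalization and resampling steps can be sketched in plain Python. This is only an illustration under assumptions: the post does not say which resampling method was used, so simple random oversampling of the minority class is assumed here, and the function names and toy values are hypothetical.

```python
import random
from collections import Counter


def min_max_scale(values):
    """Rescale a list of numbers so the smallest maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant feature carries no spread; map everything to 0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen rows of the smaller classes until every
    class has as many rows as the largest class (assumed method)."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        pool = [r for r, y in zip(rows, labels) if y == cls]
        for _ in range(target - n):
            out_rows.append(rng.choice(pool))
            out_labels.append(cls)
    return out_rows, out_labels


# Toy values, not taken from the dataset.
scaled = min_max_scale([1, 3, 5, 9])  # [0.0, 0.25, 0.5, 1.0]
rows, labels = random_oversample([[0.1], [0.4], [0.7], [0.9]], [0, 0, 0, 1])
print(scaled, Counter(labels))
```

Resampling is normally applied to the training split only, so that the validation data keeps the original class distribution.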


  • Base Model: We started each machine learning algorithm with an initial set of parameters, which were used to fit the models on the training and validation datasets.


  • Feature Selection: We used the XGBoost classifier's feature importance to identify the most informative features for the model.
Feature Importance
  • Hyperparameter Tuning: By tuning the model with the Grid Search method, we obtained a less biased model.
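
To make the idea behind grid search concrete, here is a minimal sketch in plain Python. It is not the code used in the project: `grid_search`, `toy_validation_score`, and the parameter values are hypothetical, and the scoring function is a stand-in for "train a model with these parameters and return its validation score".

```python
from itertools import product


def grid_search(param_grid, score_fn):
    """Try every combination of values in param_grid and return the
    (params, score) pair with the highest score."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for combo in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, combo))
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_validation_score(params):
    """Hypothetical stand-in for fitting a model and scoring it on
    validation data; it peaks at max_depth=4, learning_rate=0.1."""
    return -(params["max_depth"] - 4) ** 2 - 10 * (params["learning_rate"] - 0.1) ** 2


best_params, best_score = grid_search(
    {"max_depth": [2, 4, 6], "learning_rate": [0.01, 0.1, 0.3]},
    toy_validation_score,
)
print(best_params, best_score)
```

The cost grows with the product of the number of values per parameter (here 3 × 3 = 9 evaluations), which is why grids are usually kept small.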


  • Final Result: The CatBoost model outperformed the other models in terms of Recall, AUC, and Accuracy.


  Model ROC Curve
  Performance Comparison (AUC, Recall, and Accuracy)
  • Model Explanation: We used the TreeExplainer method to show how much each feature contributes to the CatBoost classifier's predictions.
   Model Explainer
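
For readers unfamiliar with the three metrics used in the performance comparison above, the following sketch shows how Recall, Accuracy, and AUC can be computed for a binary classifier. The labels and scores are made-up toy values, not results from this project, and the function names are hypothetical.

```python
def recall(y_true, y_pred):
    """Fraction of actual positives that the model predicted as positive."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp / (tp + fn) if (tp + fn) else 0.0


def accuracy(y_true, y_pred):
    """Fraction of all predictions that match the true label."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


def roc_auc(y_true, scores):
    """Area under the ROC curve, computed as the probability that a random
    positive gets a higher score than a random negative (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy example: 1 = readmitted, 0 = not readmitted (invented values).
y_true = [1, 0, 1, 0, 1, 0]
scores = [0.9, 0.2, 0.6, 0.7, 0.4, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]

print(recall(y_true, y_pred), accuracy(y_true, y_pred), roc_auc(y_true, scores))
```

Recall matters most when missing a high-risk patient is costlier than a false alarm, while AUC summarizes ranking quality across all thresholds.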


The causes of unexpected readmissions are many and interdependent. Machine learning algorithms can identify these interdependencies and use them to classify patients as high or low risk. In addition, machine learning can provide logical and informative explanations for classification outcomes.

As seen in the results section, switching from the decision tree to the ensemble models improves performance. Choosing more complex models over simpler ones therefore has a significant effect on the outcome.

Human-Centred Design (HCD) and Data Ethics

Each phase in the Machine Learning and Big-data analytics design process should consider the data citizens impacted by models, methods, and algorithms developed by data scientists. Biases in defective datasets, algorithms, and human users are numerous and should be considered.

We must not ignore that, owing to the vulnerability of data subjects and groups, the risk of discrimination is severe.

Furthermore, data scientists are also data citizens; aside from developing machine learning algorithms, they are also affected by such models. As a result, maintaining ethically acceptable data processing and analytics is a win-win scenario for all parties involved.

Other Plots

  • Missing value / Data type
Missing Value across different data types
  • Number of Lab Procedures (Area chart)
  Number of Lab Procedures
  • Number of Time in Hospital
Number of Time in Hospital

