The Project
The diabetes readmission prediction model was developed with tree-based classification algorithms: a Decision Tree and the ensemble methods Extreme Gradient Boosting (XGBoost), AdaBoost, and CatBoost. In addition, to increase accuracy, we combined the models with a stacking classifier, using a CatBoost classifier as the final estimator.
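The stacking idea can be sketched roughly as follows. This is a minimal illustration rather than the project's actual code: it assumes scikit-learn is available, runs on synthetic data from make_classification instead of the readmission dataset, and uses GradientBoostingClassifier as a stand-in for both XGBoost and the CatBoost final estimator so that the example needs no extra packages. All parameter values are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    AdaBoostClassifier,
    GradientBoostingClassifier,
    StackingClassifier,
)
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in data (assumption: NOT the real readmission dataset).
X, y = make_classification(
    n_samples=600, n_features=12, n_informative=6,
    weights=[0.8, 0.2], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Base learners. GradientBoostingClassifier is a stand-in for XGBoost here.
base_learners = [
    ("tree", DecisionTreeClassifier(max_depth=4, random_state=0)),
    ("ada", AdaBoostClassifier(n_estimators=50, random_state=0)),
    ("gb", GradientBoostingClassifier(n_estimators=50, random_state=0)),
]

# The final estimator is trained on cross-validated predictions of the base
# learners. The project used a CatBoost classifier in this position; a
# gradient-boosting model is swapped in to keep the sketch self-contained.
stack = StackingClassifier(
    estimators=base_learners,
    final_estimator=GradientBoostingClassifier(n_estimators=50, random_state=1),
    cv=5,
)
stack.fit(X_train, y_train)

test_accuracy = stack.score(X_test, y_test)
print(f"Stacked test accuracy: {test_accuracy:.3f}")
```

With CatBoost installed, the final estimator could presumably be replaced by catboost.CatBoostClassifier, which follows the same fit/predict interface.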
Experiments
- Data Extraction: We downloaded the dataset from the UCI Machine Learning Repository and saved it locally in CSV format.
- Data Exploration: We explored the dataset to understand each feature and the target class. We observed some missing values and inconsistent entries in the dataset.

- Data Transformation: We grouped the feature values in the dataset into appropriate categories, then encoded those categories as numerical values.

- Data Normalization: We rescaled the data so that all values fall within the range 0 to 1.
- Resampling: Because the dataset is imbalanced, we resampled it to give the classes a more equitable representation and reduce model bias.
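The encoding, normalization, and resampling steps above can be sketched in plain Python. The write-up does not say which techniques were used, so this example assumes integer category codes, min-max scaling, and random oversampling of the minority class. The helper names and the toy records are hypothetical and are not taken from the dataset.

```python
import random
from collections import Counter


def encode_categories(values):
    """Map each distinct category to an integer code (sorted for determinism)."""
    codes = {category: index for index, category in enumerate(sorted(set(values)))}
    return [codes[value] for value in values], codes


def min_max_scale(values):
    """Rescale numbers to the range [0, 1]; a constant column maps to 0.0."""
    low, high = min(values), max(values)
    if high == low:
        return [0.0 for _ in values]
    return [(value - low) / (high - low) for value in values]


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen minority-class rows until classes are balanced."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for label, count in counts.items():
        pool = [row for row, row_label in zip(rows, labels) if row_label == label]
        for _ in range(target - count):
            out_rows.append(rng.choice(pool))
            out_labels.append(label)
    return out_rows, out_labels


# Toy records (hypothetical): an age bracket, days in hospital, and a label.
age_brackets = ["[60-70)", "[70-80)", "[60-70)", "[50-60)", "[70-80)", "[50-60)"]
days_in_hospital = [1, 14, 3, 7, 5, 2]
readmitted = [0, 1, 0, 0, 1, 0]

age_codes, mapping = encode_categories(age_brackets)
scaled_days = min_max_scale(days_in_hospital)
rows = list(zip(age_codes, scaled_days))
balanced_rows, balanced_labels = random_oversample(rows, readmitted)

print(mapping)
print(Counter(balanced_labels))
```

In practice, resampling is usually applied to the training split only, so that the validation and test sets keep the original class distribution.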
Results
- Base Model: We started each machine learning algorithm with an initial set of parameters, which were used to fit the model on the training dataset and evaluate it on the validation dataset.
- Feature Selection: We used the XGBoost classifier’s feature importance scores to identify the most informative features for the model.

- Hyperparameter Tuning: By tuning the model with the grid search method, we obtained a less biased model.
- Final Result: The CatBoost model outperformed the other models in terms of recall, AUC, and accuracy.


- Model Explanation: We used the TreeExplainer method to show how much each feature contributes to the CatBoost classification result.
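The feature selection, tuning, and evaluation steps listed above might look roughly like the sketch below. It is an assumption-laden illustration, not the project's pipeline: it uses scikit-learn on synthetic data, substitutes GradientBoostingClassifier for the XGBoost and CatBoost models, and the parameter grid, the median importance threshold, and the choice of recall as the tuning score are all illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.metrics import accuracy_score, recall_score, roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (assumption: NOT the real readmission dataset).
X, y = make_classification(
    n_samples=500, n_features=15, n_informative=5, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Feature selection: keep features whose importance is at least the median.
# A gradient-boosting model stands in for the XGBoost classifier.
selector = SelectFromModel(
    GradientBoostingClassifier(n_estimators=50, random_state=0),
    threshold="median",
)
selector.fit(X_train, y_train)
X_train_sel = selector.transform(X_train)
X_test_sel = selector.transform(X_test)

# Grid search: try every combination in the grid with 3-fold cross-validation
# and keep the combination with the best mean recall.
param_grid = {"max_depth": [2, 3], "learning_rate": [0.05, 0.1]}
search = GridSearchCV(
    GradientBoostingClassifier(n_estimators=50, random_state=0),
    param_grid,
    scoring="recall",
    cv=3,
)
search.fit(X_train_sel, y_train)

# Evaluate the refitted best model with the three metrics named in the results.
predictions = search.predict(X_test_sel)
probabilities = search.predict_proba(X_test_sel)[:, 1]
metrics = {
    "recall": recall_score(y_test, predictions),
    "auc": roc_auc_score(y_test, probabilities),
    "accuracy": accuracy_score(y_test, predictions),
}
print(search.best_params_)
print(metrics)
```

Scoring the search on recall reflects one plausible priority for readmission prediction, where missing a high-risk patient is costly. The write-up does not state which score was optimised.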

Conclusion
The causes of unexpected readmissions are many and interdependent. Machine learning algorithms can identify these interdependencies and use them to classify patients as high or low risk. In addition, machine learning can provide logical and informative explanations for classification outcomes.
As seen in the Results section, switching from the decision tree to the ensemble models improved performance. Employing more complex models instead of simpler ones therefore has a substantial effect.
Human-Centred Design (HCD) and Data Ethics
Each phase of the machine learning and big data analytics design process should consider the data citizens affected by the models, methods, and algorithms that data scientists develop. Biases in defective datasets, algorithms, and human users are numerous and should be taken into account.
We must not ignore that, owing to the vulnerability of data subjects and groups, the risk of discrimination is severe.
Furthermore, data scientists are also data citizens: aside from developing machine learning algorithms, they are also affected by such models. As a result, maintaining ethically acceptable data processing and analytics is a win-win scenario for all parties involved.
Other Plots
- Missing value / Data type

- Number of Lab Procedures (Area chart)

- Number of Time in Hospital


COURSE
Applied Data Science
ABOUT ME
I am a meticulous and value-driven Data Scientist with years of demonstrated expertise in data analysis, visualisation, and reporting, as well as in-depth mathematical modelling experience. Teesside University’s Master’s degree in Applied Data Science provided relevant, professional knowledge of data analytics through rigorous lectures, lab sessions, and assessments. Aside from learning about a variety of data science approaches and technologies, I have also learnt about emerging ethical and security challenges in AI and data science across different sectors.
SPECIALTIES
- Machine Learning
- Deep Learning
- Data Science
- Data Science Ethics
- Data Analysis
- Data Visualisation
WORK EXPERIENCE
- Data Analyst – Blue Atoms Media, Lagos, Nigeria
- Data Analyst Intern – Training and Research Dept., NIEPA, Ondo, Nigeria
- Office Assistant – Blue Atoms Media, Lagos, Nigeria
- Volunteer Vaccine Recorder – Primary Health Centre
- Volunteer Enumerator – Primary Health Centre (NSHIP), Ondo State, Nigeria