Breast cancer is one of the most common cancers in women around the world. An accurate diagnostic method that helps detect breast cancer is therefore crucial in fighting this disease. This project proposes a simple but powerful Artificial Neural Network (ANN) classification model with very high accuracy. The algorithm was implemented in R with the support of the Keras library. To evaluate the performance of the proposed model, a series of four experiments was conducted using the Wisconsin Diagnostic Breast Cancer (WDBC) data set. Besides the full feature input, feature selection and feature extraction are considered to observe their potential impact on model performance. The results demonstrate that the proposed model works very well with both the full features and the feature inputs extracted using Principal Component Analysis (PCA), achieving an accuracy of 99.1%. Moreover, the proposed model outperforms other machine learning methods such as the Random Forest classifier. Based on these findings, the proposed model can be developed into a promising tool to support doctors in the cancer diagnosis process.
Several issues were observed in the data set:
– Unbalanced data: there are 357 benign (B) and 212 malignant (M) observations in the data set. The ratio of B to M is 0.63:0.37, which might affect model training.
– Varying ranges of cell feature values: the average feature values range from about 0.004 to 880.583, while the standard deviations vary from 0.003 to 569.357.
– Skewness: as can be seen from the histograms with density lines of the data distribution, the data are skewed to the right, except for the smoothness mean feature.
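The three checks above can be illustrated with a minimal sketch. The project itself was implemented in R with Keras; the Python below is only an illustration, not the project's code. The class counts come from the text, while the function names and the `sample` values are hypothetical and chosen purely to show a right-skewed feature.

```python
from statistics import mean, pstdev


def class_proportions(counts):
    """Return each class's share of the total number of observations."""
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


def standardize(values):
    """Z-score scaling: subtract the mean and divide by the standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    return [(v - mu) / sigma for v in values]


def skewness(values):
    """Sample skewness as the mean of cubed z-scores; positive means right-skewed."""
    return mean([z ** 3 for z in standardize(values)])


# Class counts reported for the data set: 357 benign (B), 212 malignant (M).
proportions = class_proportions({"B": 357, "M": 212})
print({label: round(p, 2) for label, p in proportions.items()})

# Hypothetical feature values with a long right tail (not taken from the data set).
sample = [1.0, 1.2, 1.5, 2.0, 9.0]
print("scaled:", [round(z, 2) for z in standardize(sample)])
print("right-skewed:", skewness(sample) > 0)
```

Scaling each feature this way is one common response to widely differing feature ranges; whether the project used z-scores or another normalisation is not stated here.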
The four experiments’ results are as follows:
- Accuracy and loss comparison:
- The learning curve of the best result (experiment 3 – PCA)
In this project, a new and effective ANN model was proposed to support breast cancer diagnosis. The proposed model was tested on a widely used data set (WDBC) and showed outstanding performance compared to other algorithms such as Random Forest. Although it is a simple neural network, it achieves a very high accuracy of 99.1%. Moreover, it works well with both the original data set with the full 30 tumour features and the feature inputs extracted using PCA. Based on these findings, the proposed model can be developed into a promising cancer diagnosis tool.
Applied Data Science
I am a highly motivated Data Analyst with strong experience in data analysis, financial and management reporting, and reporting process automation. I am currently studying for a Master of Applied Data Science, and my interest is in finding meaningful patterns in data and designing reporting dashboards. I have strong analytical skills and attention to detail, and I work well both in a team and independently.
Software & Hardware Proficiencies
Employment, Work Experience
I am currently a student ambassador supporting the delivery of the Business Intelligence with Power BI course. I was also involved in designing the Power BI student workbook. Previously, I was a reporting accountant at AB InBev South East Asia and an associate auditor at EY Vietnam.