Project: Cancer Patients
Objective
Discover interesting correlation and/or causation trends between the dataset's features and its three cancer levels. We want to identify the 5 most influential attributes (out of 23), and, where possible, explore why and how they matter.
When
June 2023
TECHNIQUES
Oversampling
Feature Selection
Outlier Detection
Standardization
Regularization
Cross Validation
Logistic Regression
Random Forest


notebook
GitHub
Problems
Limited Generalization: Small datasets may not provide enough diverse examples to capture the full range of variations and patterns present in real-world data. Consequently, models trained on small datasets may struggle to generalize to unseen data, leading to poor performance in real-world scenarios.

Overfitting: When the dataset is small, complex models can easily memorize the training samples instead of learning meaningful patterns. This leads to overfitting, where the model becomes too specialized to the training data and performs poorly on new, unseen examples. Overfitting can occur when the model has too many parameters relative to the available data points.

Validation and Testing Challenges: When the dataset is small, splitting it into training, validation, and testing sets can be problematic. The limited number of samples may lead to insufficient validation and testing data, resulting in unreliable estimates of the model's performance.

High Dimensionality: High-dimensional datasets, where the number of features is large compared to the number of samples, can exacerbate the overfitting problem. With limited samples and a large number of features, the model may struggle to find meaningful relationships and instead learn noise or irrelevant patterns.


TOOLS
MySQL, Excel, Python, Power BI

Data

The dataset was downloaded from Kaggle as a CSV file. It consists of 1,000 cancer patient records, each with 24 attributes (excluding the patient ID). MySQL was then used to create and extract three tables corresponding to the three cancer levels (low, medium, and high) for more focused analysis.
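The MySQL split into three level tables can be mirrored in Python as a simple group-by. The rows and column names below are hypothetical placeholders, not values from the actual dataset:

```python
from collections import defaultdict

# Hypothetical records mirroring the Kaggle CSV; column names are assumptions.
patients = [
    {"patient_id": "P1", "level": "High",   "smoking": 7},
    {"patient_id": "P2", "level": "Low",    "smoking": 2},
    {"patient_id": "P3", "level": "Medium", "smoking": 5},
    {"patient_id": "P4", "level": "High",   "smoking": 8},
]

# Equivalent in spirit to, e.g.:
#   CREATE TABLE high_level AS SELECT * FROM patients WHERE level = 'High';
tables = defaultdict(list)
for row in patients:
    tables[row["level"]].append(row)
```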

Approach

With this project, the aim is to highlight my thought patterns and general approach to analyzing data for a given purpose, emphasizing the value of sound core principles and practices as much as familiarity with the latest tools and techniques (important in their own right).

With Excel, we can then better view and transform our data. Since the dataset is already mostly clean, instead of pivot tables and formulas we will use Python to analyze the data, which also leads naturally into our machine learning work. Given the labeled nature of our data and our objective, we will use Logistic Regression and Random Forest models to classify patients into the different cancer levels.

Analysis of Data

Upon examining the basic characteristics of the dataset, I found that the majority of the patients were male. The ratio of cancer levels was approximately 3.65:3.03:3.32, indicating a relatively balanced distribution among the different levels. Notably, patients with high cancer levels were mostly associated with high levels of chest pain, suggesting a strong correlation between these two factors. Additionally, the most frequently recorded chronic lung disease level was 6.
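The class-balance check above can be reproduced with a simple frequency count. The counts below (365/303/332) are implied by the reported 3.65 : 3.03 : 3.32 ratio and the 1,000-record total (303 low-level patients are confirmed later in the analysis), so treat them as a reconstruction:

```python
from collections import Counter

# Reconstructed cancer-level labels implied by the reported ratio.
levels = ["High"] * 365 + ["Low"] * 303 + ["Medium"] * 332

counts = Counter(levels)
total = sum(counts.values())
# Express each class as a share of 10, matching the 3.65 : 3.03 : 3.32 ratio.
ratio = {lvl: round(10 * c / total, 2) for lvl, c in counts.items()}
```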

Cancer Level: High

In the high cancer level group, the average genetic risk level was 6.38, and the average clubbing of fingernails level was 4.21. Male patients dominate this group, but as the rest of the analysis shows, this is largely because the dataset contains more male cancer patients overall. Moreover, at least 89% of these patients recorded the highest obesity level, level 7. A significant number of patients (246) exhibited level 6 air pollution, followed by 50 patients at level 4. Additionally, a considerable proportion of male patients (146) reported level 7 smoking, followed by 51 at level 8.

Cancer Level: Medium

For the medium cancer level group, the average clubbing of fingernails level was 4.94. The number of patients reporting high levels of smoking decreased significantly, to around 12% when levels 8, 6, and 5 are combined. Obesity levels appeared to follow a normal distribution with a mean of approximately 4 and a slight right skew. Similarly, wheezing levels also appeared normally distributed, with an expected value of 5. This group contained the oldest patients, with an average age of 38.

Cancer Level: Low

Regarding the low cancer level group, the average clubbing of fingernails level was 2.47. This significantly lower average compared to the other groups suggests a correlation between cancer level and nail clubbing: higher cancer levels were associated with more clubbing, and lower levels with less. For our machine learning procedures, the low cancer level group will be our minority class, as it contained the smallest number of records, totaling 303 patients. The distribution of smoking levels appeared roughly normal with a left skew, peaking at level 2. Similarly, fatigue levels also appeared roughly normal, with an expected value of level 2.

Interpretation

The top 5 intuitively assumed risk factors or highly correlated variables for predicting the level of cancer (considering all levels, including low, medium, and high) are as follows:

Clubbing of Finger Nails: The average level of clubbing of fingernails was higher in the high cancer level group compared to other groups. This suggests that clubbing of fingernails may be a significant predictor of cancer level, with higher levels of clubbing indicating a higher likelihood of having cancer.

Smoking: Smoking levels varied among different cancer levels, with higher levels of smoking observed in the medium and high cancer-level groups. This indicates that smoking could be a relevant factor for predicting the level of cancer, with higher smoking levels associated with an increased risk of higher cancer levels.

Obesity: Obesity levels showed variations across different cancer levels, with higher obesity levels observed in the high cancer level group. This suggests that obesity may contribute to predicting the level of cancer, with higher obesity levels indicating a higher likelihood of having a more severe cancer level.

Chronic Lung Disease: The prevalence of chronic lung disease was highest in the high cancer level group. This implies that chronic lung disease may be an important predictor of cancer level, with a higher presence of chronic lung disease associated with a higher likelihood of having a higher cancer level.

Air Pollution: The high cancer level group had a substantial number of patients reporting high levels of air pollution. This indicates that air pollution may play a role in predicting the level of cancer, with higher levels of pollution associated with an increased risk of having a higher cancer level.

Machine Learning

Luckily for us, the data is mostly clean already and complete, with no null values. After checking for outliers and replacing them with the mean, we use oversampling to address our small dataset size, increasing our input from 1,000 records to 17,392, since a larger number of data points is generally preferable for machine learning.
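Neither the exact outlier rule nor the oversampling method is named above, so the sketch below makes two assumptions: outliers are values beyond k standard deviations from the mean (replaced with the column mean, as described), and oversampling is plain random duplication (SMOTE-style synthesis is a common alternative):

```python
import random
from statistics import mean, stdev

def replace_outliers(col, k=3.0):
    """Replace values more than k standard deviations from the mean
    with the column mean (the exact cutoff rule is an assumption)."""
    m, s = mean(col), stdev(col)
    return [m if abs(v - m) > k * s else v for v in col]

def random_oversample(rows, labels, target, seed=0):
    """Grow every class to `target` rows by duplicating random members.
    Plain duplication is shown here; the write-up does not name the
    exact oversampling method used."""
    rng = random.Random(seed)
    by_class = {}
    for row, lab in zip(rows, labels):
        by_class.setdefault(lab, []).append(row)
    out = []
    for lab, members in by_class.items():
        extra = [rng.choice(members) for _ in range(target - len(members))]
        out.extend((row, lab) for row in members + extra)
    return out
```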

After analyzing the correlation of each attribute with the level of cancer, using a cutoff of ±0.4, we chose the 10 most influential attributes. We also used mutual information to select a separate leading 10 features. Both selections will be modeled.
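The project most likely used a library routine such as Scikit-Learn's mutual_info_classif for this step; since the dataset's features are discrete levels, the idea can be shown with a small self-contained estimator (a simplified sketch, not the project's actual code):

```python
from collections import Counter
from math import log2

def mutual_information(feature, labels):
    """Mutual information (in bits) between two discrete sequences."""
    n = len(feature)
    p_xy = Counter(zip(feature, labels))
    p_x = Counter(feature)
    p_y = Counter(labels)
    mi = 0.0
    for (x, y), c in p_xy.items():
        # p(x,y) * log2( p(x,y) / (p(x) * p(y)) ), with the 1/n factors
        # cancelled into c * n / (count_x * count_y).
        mi += (c / n) * log2(c * n / (p_x[x] * p_y[y]))
    return mi

def top_k_features(columns, labels, k=10):
    """Rank named feature columns by MI with the label; keep the top k."""
    scores = {name: mutual_information(col, labels) for name, col in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]
```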

We also standardized the data to center each feature around zero and bring all features to a similar scale. Lastly, we split the dataset into three separate sets (training, validation, and testing) for each feature selection technique, to further minimize overfitting to the given data.
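A minimal sketch of the standardization and three-way split; the 70/15/15 fractions and the seed are assumptions, as the write-up does not state them:

```python
import random
from statistics import mean, stdev

def zscore(col):
    """Standardize one feature column to zero mean and unit variance."""
    m, s = mean(col), stdev(col)
    return [(v - m) / s for v in col]

def train_val_test_split(rows, val_frac=0.15, test_frac=0.15, seed=0):
    """Shuffle and split rows into train/validation/test sets."""
    rng = random.Random(seed)
    idx = list(range(len(rows)))
    rng.shuffle(idx)
    n_test = int(len(rows) * test_frac)
    n_val = int(len(rows) * val_frac)
    test = [rows[i] for i in idx[:n_test]]
    val = [rows[i] for i in idx[n_test:n_test + n_val]]
    train = [rows[i] for i in idx[n_test + n_val:]]
    return train, val, test
```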

Logistic Regression

Six Logistic Regression models were created and tested. Three were Scikit-Learn LogisticRegression models: LR-A-1, a baseline; LR-A-2, trained on features selected by mutual information; and LR-A-3, trained on manually selected features. The other three were SGDClassifier logistic regression models: LR-B-1, a baseline; LR-B-2, trained on mutual-information-selected features; and LR-B-3, trained on manually selected features.
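The models above rely on Scikit-Learn's LogisticRegression and SGDClassifier; as a dependency-free illustration of what the SGD variants optimize, here is a minimal binary logistic regression trained with per-sample gradient descent (the learning rate and epoch count are arbitrary assumptions, and regularization and learning-rate scheduling are omitted):

```python
from math import exp

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + exp(-z))
    ez = exp(z)
    return ez / (1.0 + ez)

def fit_logistic(X, y, lr=0.1, epochs=2000):
    """Per-sample gradient descent on the log loss, the core of what
    an SGD-based logistic regression classifier does."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b)
            g = p - yi  # gradient of the log loss w.r.t. the logit
            w = [wj - lr * g * xj for wj, xj in zip(w, xi)]
            b -= lr * g
    return w, b

def predict(w, b, x):
    return 1 if sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b) >= 0.5 else 0
```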

The classes of cancer levels are distributed in the ratio 3.65 : 3.03 : 3.32, which suggests the classes are relatively balanced. Given this, models A1 and B1 may be overfitting to the training data: their perfect accuracy, precision, recall, and F1-scores for all three classes on both validation and testing data may be too good to be true.

The performance of models B2 (MI) and A2 (MI), which were trained using the top ten features selected using mutual information, suggests that feature selection can be effective in improving model performance.

The performance of models B3 (Selected) and A3 (Selected), which were trained using the top ten manually selected features, is similar to that of models B2 (MI) and A2 (MI). This suggests that manual feature selection can be as effective as feature selection using mutual information.

Random Forest

Six Random Forest models were created and tested. Three were Scikit-Learn RandomForestClassifier models: RF-A-1, a baseline; RF-A-2, trained on features selected by mutual information; and RF-A-3, trained on manually selected features. The other three were XGBoost random forest models: RF-B-1, a baseline; RF-B-2, trained on mutual-information-selected features; and RF-B-3, trained on manually selected features.

Models A2 and B2 show lower precision, recall, and F1 scores for class 0 and class 1, which correspond to the high and low cancer levels, respectively. This indicates that these models might struggle to capture the patterns and characteristics specific to these classes.

On the other hand, models A3 and B3, which utilize manual feature selection, achieve perfect precision, recall, and F1-scores for all three classes. This suggests that these models can effectively capture the patterns and relationships associated with each cancer level, including the high, low, and medium levels.

Conclusion

From our analysis, the features of this cancer dataset appear to tell us more about the correlation between these attributes and the level of cancer than about causation.

Class 0 (High): Across the models, the top 3 most influential features for predicting a high level of cancer appear to be Obesity, Passive Smoking, and Alcohol Use.

Class 1 (Low): The top 3 features most likely to predict a low level of cancer appear to be Obesity, Coughing of Blood, and Passive Smoking.

Class 2 (Medium): Lastly, the top 3 features indicating a medium level of cancer are less clear-cut, but comparing the different models suggests Alcohol Use, Chest Pain, and Chronic Lung Disease.

It's important to note that the oversampling technique used to increase the dataset size may have helped alleviate potential overfitting issues. Additionally, the feature selection approaches, whether based on mutual information or manual selection, have allowed the models to focus on the most relevant features for prediction.

Further analysis and exploration of the dataset, as well as consideration of different feature engineering and model optimization techniques, can help refine the models and improve their predictive performance.