This project focuses on building a binary classification system to predict whether a patient has diabetes or not using machine learning models, while properly handling imbalanced data and deploying the final model using Streamlit.
- Problem Type: Binary Classification
- Target Variable: Diabetes (0 = No, 1 = Yes)
- Main Challenges:
- Imbalanced dataset
- Model selection & comparison
- Proper preprocessing
- Deployment
The project follows a clean and modular ML pipeline:
- Data cleaning using
FunctionTransformer - Handling missing values using
SimpleImputer - Feature scaling with
RobustScaler - Feature transformation using
PowerTransformer
To solve class imbalance, SMOTE-based techniques were applied:
SMOTESMOTETomek
These techniques were applied only on the training data to avoid data leakage.
The following models were trained and compared:
- Logistic Regression
- K-Nearest Neighbors (KNN)
- Decision Tree
- Random Forest
- Support Vector Machine (SVM)
- Naive Bayes (Gaussian)
- XGBoost
Each model was evaluated using multiple metrics.
Because the dataset is imbalanced, accuracy alone is not enough.
The following metrics were used:
- Accuracy (Train & Test)
- Precision
- Recall
- F1-score
- ROC-AUC
The F1-score was the primary metric for model selection.
- Technique: GridSearchCV
- Cross Validation: 5-Fold
- Scoring Metric: F1-score
Random Forest hyperparameters tuned include:
n_estimatorsmax_depthmin_samples_splitmin_samples_leafmax_features
- Model: Random Forest Classifier
- Why?
- Best balance between precision & recall
- High F1-score
- Stable performance after SMOTE
The trained artifacts were saved using pickle:
rf_model.pkl→ trained modelpreprocessor.pkl→ preprocessing pipeline
This allows:
- Reusing the model without retraining
- Easy deployment
A simple and interactive Streamlit web app was created where users can:
- Input patient medical data
- Get real-time diabetes prediction
- View prediction confidence
streamlit run app.py