This repository contains comprehensive implementations of various machine learning algorithms, covering both classification and regression tasks. Each algorithm is implemented with proper data preprocessing, model training, evaluation, and hyperparameter tuning where applicable.
```
ml_algorithm/
├── clustering/                    # Clustering Algorithms (K-Means, Hierarchical, DBSCAN)
├── KNN/                           # K-Nearest Neighbors Classification
├── Navie_bayes/                   # Naive Bayes Classification
├── Random_forest_classification/  # Random Forest Classification
├── logistic_regression/           # Logistic Regression (Binary & Multiclass)
├── support_vecotr_regression/     # Support Vector Regression
├── support_vector_classification/ # Support Vector Classification
├── Linear_regression/             # Linear Regression
├── multiple_linear_regression/    # Multiple Linear Regression
├── Polynomial_regression/         # Polynomial Regression
└── PCA/                           # Principal Component Analysis
```
- Location: `clustering/`
- Technique: Centroid-based clustering
- Features:
  - Elbow method for optimal k selection
  - Silhouette score analysis
  - Automatic elbow detection using KneeLocator
- Dataset: Synthetic blob dataset (150 samples, 2 features, 3 clusters)
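The elbow and silhouette steps can be sketched as follows; the `random_state` and the range of k values tried are assumptions, not the notebook's exact settings:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic blobs matching the dataset described above (random_state assumed)
X, _ = make_blobs(n_samples=150, n_features=2, centers=3, random_state=42)

# Elbow method: inertia drops sharply up to the true k, then flattens
inertias = {k: KMeans(n_clusters=k, n_init=10, random_state=42).fit(X).inertia_
            for k in range(2, 7)}

# Silhouette analysis: prefer the k with the highest average silhouette score
sil = {k: silhouette_score(X, KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X))
       for k in range(2, 7)}
best_k = max(sil, key=sil.get)
```

The notebook additionally automates the elbow pick with `KneeLocator` from the `kneed` package, which finds the point of maximum curvature in the inertia curve.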
- Location: `clustering/`
- Technique: Bottom-up (agglomerative) hierarchical clustering
- Features:
  - Dendrogram visualization
  - Ward linkage method
  - PCA for dimensionality reduction
- Dataset: Iris dataset (reduced to 2D using PCA)
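A minimal sketch of this workflow (the choice of 2 clusters mirrors the summary table below; plotting is omitted, but `scipy.cluster.hierarchy.dendrogram(Z)` would draw the dendrogram):

```python
from scipy.cluster.hierarchy import linkage
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)
X2 = PCA(n_components=2).fit_transform(X)  # reduce 4 features to 2D

# Ward linkage matrix: one merge per row, 149 merges for 150 samples
Z = linkage(X2, method="ward")

# Cut the tree at 2 clusters using the same ward criterion
labels = AgglomerativeClustering(n_clusters=2, linkage="ward").fit_predict(X2)
```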
- Location: `clustering/`
- Technique: Density-based clustering with noise detection
- Features:
  - Automatic cluster detection
  - Outlier/noise identification
  - Handles non-spherical clusters
- Dataset: Synthetic moon-shaped dataset (1000 samples, 2 features)
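A sketch under assumed parameters (the `noise`, `eps`, and `min_samples` values here are illustrative, not necessarily the notebook's):

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons
from sklearn.preprocessing import StandardScaler

# Moon-shaped data as described above; noise level is an assumption
X, _ = make_moons(n_samples=1000, noise=0.05, random_state=42)
X = StandardScaler().fit_transform(X)  # DBSCAN is distance-based, so scale first

db = DBSCAN(eps=0.3, min_samples=5).fit(X)
n_clusters = len(set(db.labels_)) - (1 if -1 in db.labels_ else 0)  # label -1 marks noise
n_noise = int(np.sum(db.labels_ == -1))
```

Note that the cluster count is not specified up front: DBSCAN discovers it from the density structure, which is why it separates the two interleaving moons where K-Means would fail.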
- Location: `KNN/`
- Technique: Instance-based learning algorithm
- Features: Binary classification with k=5 neighbors
- Accuracy: Evaluated using accuracy score
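A minimal version of this classifier, using a hypothetical `make_classification` dataset as a stand-in for the notebook's data:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Hypothetical binary dataset (shape and random_state are assumptions)
X, y = make_classification(n_samples=500, n_features=5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scaling matters for KNN since predictions depend on raw distances
scaler = StandardScaler().fit(X_train)
knn = KNeighborsClassifier(n_neighbors=5).fit(scaler.transform(X_train), y_train)
acc = accuracy_score(y_test, knn.predict(scaler.transform(X_test)))
```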
- Location: `Navie_bayes/`
- Technique: Gaussian Naive Bayes
- Features: Multi-class classification on Iris dataset
- Accuracy: ~96.67%
- Evaluation: Accuracy, classification report, confusion matrix
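The core of this notebook fits in a few lines; the split parameters below are assumptions, so the exact accuracy may differ slightly from the ~96.67% reported:

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

model = GaussianNB().fit(X_train, y_train)   # fits one Gaussian per feature per class
y_pred = model.predict(X_test)

acc = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)        # 3x3 matrix for the three Iris species
```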
- Location: `logistic_regression/`
- Technique: Logistic Regression with various configurations
- Features:
  - Binary classification (91% accuracy)
  - Multiclass classification (One-vs-Rest strategy)
  - Handling imbalanced datasets with class weights
  - ROC curve and AUC score analysis
  - Hyperparameter tuning (GridSearchCV, RandomizedSearchCV)
- Accuracy: 91% (binary), 59% (multiclass), 98.85% (imbalanced)
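The imbalanced-data configuration can be sketched like this; the 90/10 class ratio and the `C` grid are assumptions for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Imbalanced synthetic data (90/10 ratio assumed)
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

# class_weight="balanced" upweights the minority class during fitting;
# GridSearchCV tunes the regularization strength C with 5-fold CV
grid = GridSearchCV(
    LogisticRegression(class_weight="balanced", max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1, 10]},
    scoring="roc_auc", cv=5,
).fit(X_train, y_train)

# AUC is a better yardstick than accuracy when classes are imbalanced
auc = roc_auc_score(y_test, grid.predict_proba(X_test)[:, 1])
```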
- Location: `Random_forest_classification/`
- Technique: Ensemble learning with multiple decision trees
- Features:
  - Comprehensive data cleaning and preprocessing
  - Feature engineering
  - Multiple model comparison (Logistic Regression, Decision Tree, Random Forest)
  - Hyperparameter tuning with RandomizedSearchCV
- Accuracy: 90.90%
- Best Model: Random Forest with optimized hyperparameters
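The tuning step can be sketched as below; the parameter distributions and synthetic data are assumptions standing in for the notebook's travel dataset:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Hypothetical dataset standing in for the travel data
X, y = make_classification(n_samples=600, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# RandomizedSearchCV samples n_iter combinations instead of trying them all
param_dist = {
    "n_estimators": [100, 200, 300],
    "max_depth": [None, 5, 10],
    "min_samples_split": [2, 5, 10],
}
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions=param_dist,
    n_iter=5, cv=3, random_state=0,
).fit(X_train, y_train)

test_acc = search.score(X_test, y_test)  # best estimator, refit on full training set
```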
- Location: `support_vector_classification/`
- Technique: Support Vector Machine for classification
- Features:
  - Multiple kernel functions (linear, RBF, polynomial, sigmoid)
  - Hyperparameter tuning with GridSearchCV
- Best kernel: RBF
- Accuracy: 90%
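A sketch of the kernel search, with a hypothetical dataset and an assumed `C` grid; a scaling step is included in the pipeline because SVMs are sensitive to feature scale:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=400, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

# Pipeline so the scaler is refit inside each CV fold (no leakage)
pipe = make_pipeline(StandardScaler(), SVC())
grid = GridSearchCV(
    pipe,
    {"svc__kernel": ["linear", "rbf", "poly", "sigmoid"], "svc__C": [0.1, 1, 10]},
    cv=3,
).fit(X_train, y_train)

best_kernel = grid.best_params_["svc__kernel"]
acc = grid.score(X_test, y_test)
```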
- Location: `support_vecotr_regression/`
- Technique: Support Vector Machine for regression
- Features:
  - Categorical feature encoding (Label Encoding, One-Hot Encoding)
  - Multiple kernel support
  - Hyperparameter tuning
- Evaluation: R² score
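A sketch of the encode-then-regress pipeline; the tiny tips-style DataFrame and the SVR parameters are assumptions, not the notebook's actual data or settings:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.svm import SVR

# Hypothetical tips-style data (values repeated to have enough rows)
df = pd.DataFrame({
    "total_bill": [16.99, 10.34, 21.01, 23.68, 24.59, 25.29, 8.77, 26.88, 15.04, 14.78] * 5,
    "sex": ["Female", "Male"] * 25,
    "smoker": ["No", "Yes"] * 25,
    "tip": [1.01, 1.66, 3.50, 3.31, 3.61, 4.71, 2.00, 3.12, 1.96, 3.23] * 5,
})
X, y = df.drop(columns="tip"), df["tip"]

# Scale the numeric column, one-hot encode the categoricals
pre = ColumnTransformer([
    ("num", StandardScaler(), ["total_bill"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["sex", "smoker"]),
])
model = Pipeline([("pre", pre), ("svr", SVR(kernel="rbf", C=100))])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model.fit(X_train, y_train)
r2 = r2_score(y_test, model.predict(X_test))
```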
- Location: `Linear_regression/`
- Technique: Simple linear regression
- Dataset: Height-weight dataset
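In sketch form, with synthetic height-weight data standing in for the notebook's dataset (the slope, intercept, and noise level are invented):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical height (cm) -> weight (kg) data with a known linear trend
rng = np.random.default_rng(0)
height = rng.uniform(150, 190, size=100).reshape(-1, 1)
weight = 0.9 * height.ravel() - 90 + rng.normal(0, 3, size=100)

model = LinearRegression().fit(height, weight)
r2 = model.score(height, weight)  # R² on the training data
```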
- Location: `multiple_linear_regression/`
- Technique: Multiple linear regression
- Dataset: Economic index dataset
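The same idea with several predictors; the column names and coefficients below are hypothetical stand-ins for the economic index data:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Hypothetical economic-index-style data with a known linear relationship
rng = np.random.default_rng(1)
df = pd.DataFrame({
    "interest_rate": rng.uniform(1, 3, 200),
    "unemployment_rate": rng.uniform(4, 8, 200),
})
df["index_price"] = (1500 - 200 * df["interest_rate"]
                     - 50 * df["unemployment_rate"] + rng.normal(0, 20, 200))

X_train, X_test, y_train, y_test = train_test_split(
    df[["interest_rate", "unemployment_rate"]], df["index_price"],
    test_size=0.2, random_state=1)

model = LinearRegression().fit(X_train, y_train)  # one coefficient per feature
r2 = model.score(X_test, y_test)
```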
- Location: `PCA/`
- Technique: Linear dimensionality reduction
- Features:
  - Reduces 30 features to 2 principal components
  - Variance preservation
  - 2D visualization of high-dimensional data
- Dataset: Breast Cancer Wisconsin dataset (569 samples, 30 features)
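The reduction itself is compact; standardizing first is the usual practice since PCA is variance-driven (whether the notebook scales first is an assumption):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)        # 569 samples, 30 features
X_scaled = StandardScaler().fit_transform(X)      # PCA is sensitive to feature scale

pca = PCA(n_components=2)
X2 = pca.fit_transform(X_scaled)                  # shape (569, 2), ready for a 2D scatter plot
explained = pca.explained_variance_ratio_.sum()   # fraction of variance the 2 PCs preserve
```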
- Location: `Polynomial_regression/`
- Technique: Polynomial regression for non-linear relationships
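A minimal sketch: expand the features to polynomial terms, then fit an ordinary linear model on the expanded features. The quadratic data and degree below are invented for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical non-linear (quadratic) data with mild noise
rng = np.random.default_rng(0)
X = np.sort(rng.uniform(-3, 3, 100)).reshape(-1, 1)
y = 0.5 * X.ravel() ** 2 - X.ravel() + 2 + rng.normal(0, 0.3, 100)

# degree=2 adds x^2 as a feature; the regression itself stays linear in the coefficients
model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression()).fit(X, y)
r2 = model.score(X, y)
```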
- Data Preprocessing:
  - Train-test splitting (typically an 80-20 split)
  - Handling missing values (median/mode imputation)
  - Feature encoding (Label Encoding, One-Hot Encoding)
  - Feature scaling (StandardScaler)
  - Feature engineering
- Classification Metrics:
  - Accuracy score
  - Precision, Recall, F1-score
  - Confusion matrix
  - Classification report
  - ROC curve and AUC score
- Regression Metrics:
  - R² score (coefficient of determination)
  - Mean squared error (where applicable)
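A tiny hand-made example showing how these metrics relate (the labels and predictions are invented so the arithmetic is easy to verify):

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             mean_squared_error, precision_score, r2_score,
                             recall_score)

# Classification: 8 labels, 6 predicted correctly, 1 false positive, 1 false negative
y_true = [0, 1, 1, 0, 1, 0, 1, 1]
y_pred = [0, 1, 0, 0, 1, 1, 1, 1]

acc = accuracy_score(y_true, y_pred)    # 6/8 = 0.75
prec = precision_score(y_true, y_pred)  # TP=4, FP=1 -> 4/5 = 0.8
rec = recall_score(y_true, y_pred)      # TP=4, FN=1 -> 4/5 = 0.8
f1 = f1_score(y_true, y_pred)           # harmonic mean of precision and recall
cm = confusion_matrix(y_true, y_pred)   # [[TN, FP], [FN, TP]]

# Regression counterparts
y_rtrue = [3.0, 2.5, 4.0]
y_rpred = [2.8, 2.7, 3.9]
mse = mean_squared_error(y_rtrue, y_rpred)  # (0.2² + 0.2² + 0.1²) / 3 = 0.03
r2 = r2_score(y_rtrue, y_rpred)             # 1 - SSE/SST
```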
- Hyperparameter Tuning:
  - GridSearchCV: exhaustive search over a parameter grid
  - RandomizedSearchCV: random sampling from parameter distributions
  - Cross-validation (typically 3-5 folds)
- Model Comparison:
  - Multiple algorithm comparison
  - Performance metrics comparison
  - Best model selection based on evaluation metrics
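A sketch of the comparison pattern: score each candidate on the same cross-validation splits and keep the best. The Iris dataset here is just a convenient stand-in:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Candidate models, all evaluated with 5-fold cross-validation
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(random_state=0),
}
scores = {name: cross_val_score(m, X, y, cv=5).mean() for name, m in models.items()}
best_model = max(scores, key=scores.get)
```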
- Python: Primary programming language
- scikit-learn: Machine learning library
- pandas: Data manipulation and analysis
- numpy: Numerical computing
- matplotlib: Data visualization
- seaborn: Statistical data visualization
- Jupyter Notebooks: Interactive development environment
Install the dependencies:

```
pip install pandas numpy scikit-learn matplotlib seaborn jupyter
```

Then:

- Navigate to the desired algorithm folder
- Open the Jupyter notebook file (`.ipynb`)
- Run all cells to execute the complete workflow

For example:

```
cd KNN
jupyter notebook KNN_algorithm.ipynb
```

- Comprehensive Coverage: Both classification and regression algorithms
- Real-world Applications: Practical datasets and use cases
- Best Practices: Proper data preprocessing, evaluation, and hyperparameter tuning
- Documentation: Each folder contains detailed README explaining the implementation
- Code Quality: Clean, well-structured code with comments
| Algorithm | Task Type | Best Accuracy/R² | Dataset |
|---|---|---|---|
| K-Means | Clustering | k = 3 (elbow method) | Synthetic Blobs |
| Hierarchical | Clustering | 2 clusters | Iris |
| DBSCAN | Clustering | Auto-detected | Synthetic Moons |
| KNN | Classification | Evaluated | Synthetic |
| Naive Bayes | Classification | 96.67% | Iris |
| Logistic Regression | Classification | 91% (binary) | Synthetic |
| Random Forest | Classification | 90.90% | Travel |
| SVC | Classification | 90% | Synthetic |
| SVR | Regression | Evaluated | Tips |
| PCA | Dimensionality Reduction | 2 components | Breast Cancer |
- All implementations use scikit-learn library for consistency
- Random states are set for reproducibility
- Evaluation metrics are comprehensive and appropriate for each task type
- Hyperparameter tuning is performed where applicable to optimize performance
Feel free to explore each algorithm folder for detailed implementation and documentation. Each folder contains:
- A README.md file explaining the technique and implementation
- Jupyter notebook(s) with complete code
- Dataset files (where applicable)
This repository is for educational purposes, demonstrating various machine learning algorithms and their implementations.