# Feature Engineering

A comprehensive collection of feature engineering techniques and data preprocessing methods for machine learning. This project provides practical implementations of essential data preparation techniques organized into focused modules.
## Table of Contents

- Project Overview
- Project Structure
- Modules Overview
- Installation
- Dependencies
- Usage Guide
- Techniques Summary
- Best Practices
## Project Overview

Feature engineering is a crucial step in the machine learning pipeline that involves transforming raw data into features that better represent the underlying problem to predictive models. This project covers four essential areas:
- Data Encoding: Converting categorical data into numerical formats
- Handling Imbalanced Datasets: Techniques to address class imbalance
- Missing Value Imputation: Methods to handle missing data
- Outlier Detection: Statistical methods to identify and analyze outliers
## Project Structure

```
feature_engineering/
├── encoding/                  # Data encoding techniques
│   ├── Data_Encoding_nominal.ipynb
│   ├── label_encoding.ipynb
│   ├── target_encoding.ipynb
│   └── README.md
├── imbalance_dataset/         # Handling imbalanced datasets
│   ├── handle_imbalance_dataset.ipynb
│   ├── handle_imbalance_smote.ipynb
│   └── README.md
├── missing_value/             # Missing value imputation methods
│   ├── missing_value.ipynb
│   └── README.md
├── outliers/                  # Outlier detection and handling
│   ├── outlier.ipynb
│   └── README.md
└── README.md                  # Main project documentation
```
## Modules Overview

### Data Encoding (`encoding/`)

Techniques for encoding categorical and nominal data into numerical formats suitable for machine learning algorithms.

**Techniques Used:**

- **One-Hot Encoding**: Creates binary columns for each category using `sklearn.preprocessing.OneHotEncoder`
  - Best for: Nominal categorical data with no inherent order
  - Use case: Color categories (red, blue, green)
- **Label Encoding**: Assigns numerical labels to categories using `sklearn.preprocessing.LabelEncoder`
  - Best for: Simple categorical variables where order doesn't matter
  - Use case: Quick encoding for tree-based algorithms
- **Ordinal Encoding**: Encodes ordinal categorical data with ordered relationships using `sklearn.preprocessing.OrdinalEncoder`
  - Best for: Categorical data with inherent order
  - Use case: Size categories (small, medium, large)
- **Target Encoding**: Encodes categorical variables based on target variable statistics using group-based mean encoding
  - Best for: High-cardinality categorical features
  - Use case: City names, product categories with many unique values
### Handling Imbalanced Datasets (`imbalance_dataset/`)

Methods for handling imbalanced datasets in classification problems to prevent model bias toward majority classes.

**Techniques Used:**

- **Upsampling**: Increases minority class samples by resampling with replacement using `sklearn.utils.resample`
  - Best for: When you have sufficient data and want to preserve all majority samples
  - Method: Random resampling with replacement until classes are balanced
- **Downsampling**: Reduces majority class samples to match the minority class size using `sklearn.utils.resample`
  - Best for: Large datasets where reducing majority samples is acceptable
  - Method: Random sampling without replacement to shrink the majority class
- **SMOTE**: Synthetic Minority Over-sampling Technique using `imblearn.over_sampling.SMOTE`
  - Best for: Creating synthetic samples instead of duplicating existing ones
  - Method: Generates new samples by interpolating between existing minority class samples
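A minimal sketch of up- and downsampling with `sklearn.utils.resample` on a toy frame (for SMOTE, `imblearn.over_sampling.SMOTE().fit_resample(X, y)` follows a similar pattern but synthesizes new samples):

```python
import pandas as pd
from sklearn.utils import resample

# Toy imbalanced dataset: 6 majority (class 0) vs 2 minority (class 1) samples
df = pd.DataFrame({
    "feature": range(8),
    "label": [0, 0, 0, 0, 0, 0, 1, 1],
})
majority = df[df["label"] == 0]
minority = df[df["label"] == 1]

# Upsample the minority class with replacement to match the majority size
minority_up = resample(minority, replace=True,
                       n_samples=len(majority), random_state=42)
balanced_up = pd.concat([majority, minority_up])

# Downsample the majority class (without replacement) to match the minority size
majority_down = resample(majority, replace=False,
                         n_samples=len(minority), random_state=42)
balanced_down = pd.concat([majority_down, minority])
```

Both balanced frames end up with equal class counts: 6 per class after upsampling, 2 per class after downsampling.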
### Missing Value Imputation (`missing_value/`)

Various imputation techniques for handling missing values in datasets, categorized by data type and distribution.

**Types of Missing Data:**

- MCAR (Missing Completely at Random): No relationship with other data
- MAR (Missing at Random): Depends on observed data
- MNAR (Missing Not at Random): Depends on unobserved data

**Techniques Used:**

- **Mean Imputation**: Replaces missing values with the mean using `pandas.DataFrame.fillna()`
  - Best for: Numerical features with normal distribution
  - Use case: Age, temperature, continuous numerical features
- **Median Imputation**: Replaces missing values with the median using `pandas.DataFrame.fillna()`
  - Best for: Numerical features with skewed distributions or outliers
  - Use case: Income, price data with outliers
- **Mode Imputation**: Replaces missing categorical values with the most frequent category using `pandas.DataFrame.fillna()`
  - Best for: Categorical features where the mode is a reasonable default
  - Use case: Embarkation port, category fields
- **Frequent Category Imputation**: Replaces missing values based on grouped frequency analysis using `pandas.DataFrame.groupby()`
  - Best for: When missing values relate to other categorical features
  - Use case: Missing values that depend on other categorical variables
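The groupby-based imputation can be sketched as follows, on hypothetical Titanic-style columns: each missing fare is filled with the mean fare of its own passenger class rather than the global mean.

```python
import numpy as np
import pandas as pd

# Toy dataset: 'fare' is missing for some rows but varies strongly by class
df = pd.DataFrame({
    "pclass": [1, 1, 2, 2, 2, 3],
    "fare": [80.0, np.nan, 30.0, 34.0, np.nan, 8.0],
})

# Grouped imputation: fill each missing fare with the mean fare of that class
df["fare"] = df.groupby("pclass")["fare"].transform(lambda s: s.fillna(s.mean()))
```

Here the missing class-1 fare becomes 80.0 (the class-1 mean) and the missing class-2 fare becomes 32.0, which is usually more faithful than a single global fill value.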
### Outlier Detection (`outliers/`)

Techniques for detecting and analyzing outliers in numerical data using statistical methods and visualizations.

**Techniques Used:**

- **5-Number Summary**: Statistical summary using `numpy.quantile()` with quantiles `[0, 0.25, 0.50, 0.75, 1.0]`
  - Components: Minimum, Q1, Median, Q3, Maximum
  - Use case: Initial data exploration and understanding distribution
- **IQR Method**: Interquartile Range method to identify outliers
  - Formula: IQR = Q3 - Q1
  - Lower Bound = Q1 - 1.5 × IQR
  - Upper Bound = Q3 + 1.5 × IQR
  - Use case: Detecting outliers in numerical data, especially for non-normal distributions
- **Box Plot Visualization**: Visual representation using `seaborn.boxplot()`
  - Shows: Quartiles, whiskers, and outliers as points
  - Use case: Visual identification and analysis of outliers
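The 5-number summary and IQR fences can be computed directly with NumPy, for example on a small made-up sample containing one obvious outlier:

```python
import numpy as np

data = np.array([10, 12, 12, 13, 12, 11, 14, 13, 15, 102, 12, 14, 13])

# 5-number summary: min, Q1, median, Q3, max
minimum, q1, median, q3, maximum = np.quantile(data, [0, 0.25, 0.50, 0.75, 1.0])

# IQR fences: points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] are flagged
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

outliers = data[(data < lower) | (data > upper)]
```

For this sample, Q1 = 12, Q3 = 14, so the fences are 9 and 17, and only the value 102 is flagged.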
## Installation

**Prerequisites:**

- Python 3.7 or higher
- Jupyter Notebook or JupyterLab

**Steps:**

1. Clone or download this repository
2. Install the required packages:

   ```bash
   pip install pandas numpy scikit-learn seaborn matplotlib imbalanced-learn
   ```

   Or install from a requirements file (if available):

   ```bash
   pip install -r requirements.txt
   ```

## Dependencies

- **pandas**: Data manipulation and analysis
- **numpy**: Numerical computing and statistical operations
- **scikit-learn**: Machine learning preprocessing tools
- **seaborn**: Statistical data visualization
- **matplotlib**: Plotting and visualization
- **imbalanced-learn**: Advanced techniques for imbalanced datasets (SMOTE)
## Usage Guide

1. Navigate to the desired module folder (e.g., `encoding/`, `missing_value/`)
2. Open the Jupyter notebook for the technique you want to learn or use
3. Run the cells sequentially to see the implementation and results
4. Modify the code with your own data to apply these techniques
5. Refer to the individual README files in each folder for detailed explanations
**Quick Example:**

```python
# Example: Handling missing values
import pandas as pd
import numpy as np

# Load your data
df = pd.read_csv('your_data.csv')

# Check for missing values
print(df.isnull().sum())

# Apply mean imputation for numerical columns
df['numeric_column'] = df['numeric_column'].fillna(df['numeric_column'].mean())

# Apply mode imputation for categorical columns
df['categorical_column'] = df['categorical_column'].fillna(df['categorical_column'].mode()[0])
```

## Techniques Summary

| Module | Technique | Library/Method | Best For |
|---|---|---|---|
| Encoding | One-Hot Encoding | `sklearn.preprocessing.OneHotEncoder` | Nominal categorical data |
| Encoding | Label Encoding | `sklearn.preprocessing.LabelEncoder` | Simple categorical variables |
| Encoding | Ordinal Encoding | `sklearn.preprocessing.OrdinalEncoder` | Ordinal categorical data |
| Encoding | Target Encoding | `pandas.groupby().mean()` | High-cardinality features |
| Imbalance | Upsampling | `sklearn.utils.resample` | Preserving all data |
| Imbalance | Downsampling | `sklearn.utils.resample` | Large datasets |
| Imbalance | SMOTE | `imblearn.over_sampling.SMOTE` | Synthetic sample generation |
| Missing Value | Mean Imputation | `pandas.fillna(mean())` | Normal distributions |
| Missing Value | Median Imputation | `pandas.fillna(median())` | Skewed distributions |
| Missing Value | Mode Imputation | `pandas.fillna(mode())` | Categorical data |
| Missing Value | Frequent Category | `pandas.groupby()` | Grouped imputation |
| Outliers | 5-Number Summary | `numpy.quantile()` | Data exploration |
| Outliers | IQR Method | Statistical calculation | Outlier detection |
| Outliers | Box Plot | `seaborn.boxplot()` | Visual analysis |
## Best Practices

**Encoding:**

- Use One-Hot Encoding for nominal data with few categories (< 10)
- Use Target Encoding for high-cardinality categorical features
- Avoid Label Encoding for algorithms that interpret the numeric labels as ordered (e.g., linear models) when the categories have no inherent order; tree-based algorithms tolerate it

**Imbalanced Data:**

- Try SMOTE before simple resampling for better results
- Consider the cost of false positives/negatives when choosing a technique
- Evaluate model performance after applying balancing techniques

**Missing Values:**

- Always analyze the pattern of missingness (MCAR, MAR, MNAR)
- Use median imputation for skewed distributions
- Consider domain knowledge when choosing an imputation method
- Document the percentage of missing values before imputation

**Outliers:**

- Investigate outliers before removing them (they might be valid)
- Use the IQR method for non-normal distributions
- Consider the business context when handling outliers
- Visualize data before and after outlier treatment

**General Notes:**

- All notebooks are self-contained with examples
- Each module has its own README with detailed explanations
- Techniques can be combined based on your specific use case
- Always validate the impact of preprocessing on model performance
Feel free to explore each module and adapt the techniques to your specific needs. Each folder contains detailed documentation and working examples.
This project is for educational and reference purposes.
Happy Feature Engineering!