senthilkumaranT/feature_engineering

Feature Engineering Project

A comprehensive collection of feature engineering techniques and data preprocessing methods for machine learning. This project provides practical implementations of essential data preparation techniques organized into focused modules.

📋 Table of Contents

  • Project Overview
  • Project Structure
  • Modules Overview
  • Installation
  • Dependencies
  • Usage Guide
  • Techniques Summary
  • Best Practices
  • Notes
  • Contributing
  • License

🎯 Project Overview

Feature engineering is a crucial step in the machine learning pipeline that involves transforming raw data into features that better represent the underlying problem to predictive models. This project covers four essential areas:

  1. Data Encoding: Converting categorical data into numerical formats
  2. Handling Imbalanced Datasets: Techniques to address class imbalance
  3. Missing Value Imputation: Methods to handle missing data
  4. Outlier Detection: Statistical methods to identify and analyze outliers

πŸ“ Project Structure

feature_engineering/
β”œβ”€β”€ encoding/              # Data encoding techniques
β”‚   β”œβ”€β”€ Data_Encoding_nominal.ipynb
β”‚   β”œβ”€β”€ label_encoding.ipynb
β”‚   β”œβ”€β”€ target_encoding.ipynb
β”‚   └── README.md
β”œβ”€β”€ imbalance_dataset/     # Handling imbalanced datasets
β”‚   β”œβ”€β”€ handle_imbalance_dataset.ipynb
β”‚   β”œβ”€β”€ handle_imbalance_smote.ipynb
β”‚   └── README.md
β”œβ”€β”€ missing_value/         # Missing value imputation methods
β”‚   β”œβ”€β”€ missing_value.ipynb
β”‚   └── README.md
β”œβ”€β”€ outliers/              # Outlier detection and handling
β”‚   β”œβ”€β”€ outlier.ipynb
β”‚   └── README.md
└── README.md              # Main project documentation

📚 Modules Overview

1. Encoding (encoding/)

Techniques for encoding categorical and nominal data into numerical formats suitable for machine learning algorithms.

Types Used:

  • One-Hot Encoding: Creates binary columns for each category using sklearn.preprocessing.OneHotEncoder

    • Best for: Nominal categorical data with no inherent order
    • Use case: Color categories (red, blue, green)
  • Label Encoding: Assigns an arbitrary integer label to each category using sklearn.preprocessing.LabelEncoder (designed for target labels)

    • Best for: Target labels, or features consumed by tree-based algorithms that tolerate the implied order
    • Use case: Quick encoding for tree-based algorithms
  • Ordinal Encoding: Encodes ordinal categorical data with ordered relationships using sklearn.preprocessing.OrdinalEncoder

    • Best for: Categorical data with inherent order
    • Use case: Size categories (small, medium, large)
  • Target Encoding: Encodes categorical variables based on target variable statistics using group-based mean encoding

    • Best for: High-cardinality categorical features
    • Use case: City names, product categories with many unique values
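As an illustration, the four encoders above can be sketched on a toy frame (the column and category names here are made up for the example; the target encoding uses the group-mean approach described above):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, OrdinalEncoder

# Toy frame; column and category names are illustrative only
df = pd.DataFrame({
    "color": ["red", "blue", "green", "blue"],      # nominal
    "size": ["small", "large", "medium", "small"],  # ordinal
    "city": ["NY", "LA", "NY", "SF"],               # stand-in for high cardinality
    "target": [1, 0, 1, 0],
})

# One-Hot Encoding: one binary column per color (blue, green, red)
onehot = OneHotEncoder().fit_transform(df[["color"]]).toarray()

# Label Encoding: arbitrary integers, assigned in alphabetical order
labels = LabelEncoder().fit_transform(df["color"])

# Ordinal Encoding: integers that respect small < medium < large
ordinals = OrdinalEncoder(
    categories=[["small", "medium", "large"]]).fit_transform(df[["size"]])

# Target Encoding: replace each city with the mean target of that city
df["city_encoded"] = df["city"].map(df.groupby("city")["target"].mean())
```

Note that target encoding computed on the full dataset leaks the target; in practice it is fit on training folds only.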

📖 Detailed Documentation

2. Imbalance Dataset (imbalance_dataset/)

Methods for handling imbalanced datasets in classification problems to prevent model bias toward majority classes.

Types Used:

  • Upsampling: Increases minority class samples by resampling with replacement using sklearn.utils.resample

    • Best for: When you have sufficient data and want to preserve all majority samples
    • Method: Random resampling with replacement until classes are balanced
  • Downsampling: Reduces majority class samples to match minority class size using sklearn.utils.resample

    • Best for: Large datasets where discarding majority samples is acceptable
    • Method: Random resampling without replacement to shrink the majority class
  • SMOTE: Synthetic Minority Over-sampling Technique using imblearn.over_sampling.SMOTE

    • Best for: Creating synthetic samples instead of duplicating existing ones
    • Method: Generates new samples by interpolating between existing minority class samples

📖 Detailed Documentation

3. Missing Value (missing_value/)

Various imputation techniques for handling missing values in datasets, categorized by data type and distribution.

Types of Missing Data:

  • MCAR (Missing Completely at Random): No relationship with other data
  • MAR (Missing at Random): Depends on observed data
  • MNAR (Missing Not at Random): Depends on unobserved data

Types Used:

  • Mean Imputation: Replaces missing values with the mean using pandas.DataFrame.fillna()

    • Best for: Numerical features with normal distribution
    • Use case: Age, temperature, continuous numerical features
  • Median Imputation: Replaces missing values with the median using pandas.DataFrame.fillna()

    • Best for: Numerical features with skewed distributions or outliers
    • Use case: Income, price data with outliers
  • Mode Imputation: Replaces missing categorical values with the most frequent category using pandas.DataFrame.fillna()

    • Best for: Categorical features where mode is a reasonable default
    • Use case: Embarkation port, category fields
  • Frequent Category Imputation: Replaces missing values based on grouped frequency analysis using pandas.DataFrame.groupby()

    • Best for: When missing values relate to other categorical features
    • Use case: Missing values that depend on other categorical variables
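The four imputation strategies above can be sketched on a small frame (column names are hypothetical, loosely Titanic-style):

```python
import numpy as np
import pandas as pd

# Toy frame with gaps; column names are illustrative
df = pd.DataFrame({
    "age": [22.0, np.nan, 30.0, 26.0],
    "income": [40_000.0, 1_000_000.0, np.nan, 45_000.0],
    "port": ["S", "C", None, "S"],
    "pclass": [1, 2, 2, 1],
})

# Mean imputation for roughly normal numerical features
df["age"] = df["age"].fillna(df["age"].mean())

# Median imputation for skewed features (robust to the 1,000,000 outlier)
df["income"] = df["income"].fillna(df["income"].median())

# Frequent-category imputation within groups: fill a missing port with
# the most common port among rows sharing the same pclass
df["port_by_class"] = df.groupby("pclass")["port"].transform(
    lambda s: s.fillna(s.mode()[0]))

# Mode imputation: fall back to the globally most frequent category
df["port"] = df["port"].fillna(df["port"].mode()[0])
```

Note how the grouped and global strategies can disagree: the row with the missing port gets "C" within its pclass group but "S" from the global mode.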

📖 Detailed Documentation

4. Outliers (outliers/)

Techniques for detecting and analyzing outliers in numerical data using statistical methods and visualizations.

Types Used:

  • 5-Number Summary: Statistical summary using numpy.quantile() with quantiles [0, 0.25, 0.50, 0.75, 1.0]

    • Components: Minimum, Q1, Median, Q3, Maximum
    • Use case: Initial data exploration and understanding distribution
  • IQR Method: Interquartile Range method to identify outliers

    • Formula: IQR = Q3 - Q1
    • Lower Bound = Q1 - 1.5 × IQR
    • Upper Bound = Q3 + 1.5 × IQR
    • Use case: Detecting outliers in numerical data, especially for non-normal distributions
  • Box Plot Visualization: Visual representation using seaborn.boxplot()

    • Shows: Quartiles, whiskers, and outliers as points
    • Use case: Visual identification and analysis of outliers
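A quick sketch of the 5-number summary and IQR fences on a made-up sample:

```python
import numpy as np

data = np.array([12, 14, 15, 15, 16, 17, 18, 19, 21, 85])  # 85 looks suspect

# 5-number summary: min, Q1, median, Q3, max
minimum, q1, median, q3, maximum = np.quantile(data, [0, 0.25, 0.50, 0.75, 1.0])

# IQR fences: points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] are flagged
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outliers = data[(data < lower) | (data > upper)]

# Visual check (optional): seaborn.boxplot(x=data) draws the same fences
```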

📖 Detailed Documentation

🔧 Installation

Prerequisites

  • Python 3.7 or higher
  • Jupyter Notebook or JupyterLab

Setup

  1. Clone or download this repository

  2. Install required packages:

pip install pandas numpy scikit-learn seaborn matplotlib imbalanced-learn

Or install from requirements file (if available):

pip install -r requirements.txt

📦 Dependencies

  • pandas: Data manipulation and analysis
  • numpy: Numerical computing and statistical operations
  • scikit-learn: Machine learning preprocessing tools
  • seaborn: Statistical data visualization
  • matplotlib: Plotting and visualization
  • imbalanced-learn: Advanced techniques for imbalanced datasets (SMOTE)

🚀 Usage Guide

  1. Navigate to the desired module folder (e.g., encoding/, missing_value/, etc.)

  2. Open the Jupyter notebook for the technique you want to learn or use

  3. Run the cells sequentially to see the implementation and results

  4. Modify the code with your own data to apply these techniques

  5. Refer to individual README files in each folder for detailed explanations

Example Workflow

# Example: Handling missing values
import pandas as pd
import numpy as np

# Load your data
df = pd.read_csv('your_data.csv')

# Check for missing values
print(df.isnull().sum())

# Apply mean imputation for numerical columns
# (assign back instead of inplace=True, which is unreliable on a column slice)
df['numeric_column'] = df['numeric_column'].fillna(df['numeric_column'].mean())

# Apply mode imputation for categorical columns
df['categorical_column'] = df['categorical_column'].fillna(df['categorical_column'].mode()[0])

📊 Techniques Summary

| Module | Technique | Library/Method | Best For |
|---|---|---|---|
| Encoding | One-Hot Encoding | sklearn.preprocessing.OneHotEncoder | Nominal categorical data |
| Encoding | Label Encoding | sklearn.preprocessing.LabelEncoder | Simple categorical variables |
| Encoding | Ordinal Encoding | sklearn.preprocessing.OrdinalEncoder | Ordinal categorical data |
| Encoding | Target Encoding | pandas.groupby().mean() | High-cardinality features |
| Imbalance | Upsampling | sklearn.utils.resample | Preserving all data |
| Imbalance | Downsampling | sklearn.utils.resample | Large datasets |
| Imbalance | SMOTE | imblearn.over_sampling.SMOTE | Synthetic sample generation |
| Missing Value | Mean Imputation | pandas.fillna(mean()) | Normal distributions |
| Missing Value | Median Imputation | pandas.fillna(median()) | Skewed distributions |
| Missing Value | Mode Imputation | pandas.fillna(mode()) | Categorical data |
| Missing Value | Frequent Category | pandas.groupby() | Grouped imputation |
| Outliers | 5-Number Summary | numpy.quantile() | Data exploration |
| Outliers | IQR Method | Statistical calculation | Outlier detection |
| Outliers | Box Plot | seaborn.boxplot() | Visual analysis |

💡 Best Practices

Encoding

  • Use One-Hot Encoding for nominal data with few categories (< 10)
  • Use Target Encoding for high-cardinality categorical features
  • Avoid Label Encoding with linear or distance-based models on nominal data, since the integer labels imply an order that does not exist

Imbalanced Datasets

  • Try SMOTE before simple resampling for better results
  • Consider the cost of false positives/negatives when choosing technique
  • Evaluate model performance after applying balancing techniques

Missing Values

  • Always analyze the pattern of missingness (MCAR, MAR, MNAR)
  • Use median for skewed distributions
  • Consider domain knowledge when choosing imputation method
  • Document the percentage of missing values before imputation

Outliers

  • Investigate outliers before removing them (they might be valid)
  • Use IQR method for non-normal distributions
  • Consider the business context when handling outliers
  • Visualize data before and after outlier treatment

πŸ“ Notes

  • All notebooks are self-contained with examples
  • Each module has its own README with detailed explanations
  • Techniques can be combined based on your specific use case
  • Always validate the impact of preprocessing on model performance

🤝 Contributing

Feel free to explore each module and adapt the techniques to your specific needs. Each folder contains detailed documentation and working examples.

📄 License

This project is for educational and reference purposes.


Happy Feature Engineering! 🎉
