
A Unified ETL and Machine Learning Automation Platform with Real-Time Monitoring and Experiment Tracking


A platform for managing ETL pipelines, tracking ML experiments, and automating model selection with real-time monitoring.

Features • Architecture • Setup • Usage • API Reference


🎯 Why This Project?

In modern data-driven organizations, teams face a common challenge: the gap between data engineering and machine learning is too wide. Data engineers build pipelines in one system, ML engineers track experiments in another, and everyone struggles with visibility into what's actually happening.

This project bridges that gap by providing a unified platform where:

  • ETL pipelines and ML experiments live together
  • Real-time monitoring keeps everyone informed
  • AutoML democratizes model selection
  • Quality checks ensure data integrity at every step

The Problems We Solve

| Problem | Traditional Approach | Our Solution |
| --- | --- | --- |
| Fragmented tooling | Airflow + MLflow + custom scripts | Single unified dashboard |
| No real-time visibility | Check logs manually, wait for emails | WebSocket-powered live updates |
| ML expertise bottleneck | Only senior ML engineers can tune models | AutoML with one-click execution |
| Data quality blind spots | Issues discovered in production | Integrated validation at every step |
| Experiment chaos | Spreadsheets, notebooks, local files | Centralized experiment tracking |

👥 Who Is This For?

Primary Users

🔧 Data Engineers

  • Build and monitor ETL pipelines
  • Track data quality metrics
  • Debug pipeline failures in real-time

🧪 ML Engineers

  • Run experiments with different algorithms
  • Compare model performance across versions
  • Register and version trained models

📊 Data Scientists

  • Quickly prototype models with AutoML
  • Focus on problem-solving, not infrastructure
  • Iterate faster with automated hyperparameter tuning

βš™οΈ Platform/MLOps Teams

  • Monitor system health
  • Manage infrastructure at scale
  • Ensure reliability across pipelines

Use Cases

  1. Startups building their first ML infrastructure
  2. Enterprise teams modernizing legacy ETL systems
  3. Research teams needing reproducible experiment tracking
  4. Consultancies delivering ML solutions to clients
  5. Educational institutions teaching MLOps best practices

✨ Features

Pipeline Management

  • Visual Pipeline Builder: Define Extract → Transform → Load → Validate → Train steps
  • Background Execution: Non-blocking pipeline runs with progress tracking
  • Run History: Complete audit trail of all executions with duration and status
  • Real-time Updates: WebSocket-powered live status changes
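The background-execution model above can be sketched in a few lines: each step runs in order, its status and duration are recorded, and a failing step stops the run. The names here (`StepResult`, `PipelineRun`) are illustrative, not the platform's actual classes.

```python
import time
from dataclasses import dataclass, field


@dataclass
class StepResult:
    name: str
    status: str
    duration_s: float


@dataclass
class PipelineRun:
    steps: list  # list of (name, callable) pairs
    results: list = field(default_factory=list)

    def execute(self):
        """Run each step in order, recording status and duration."""
        for name, fn in self.steps:
            start = time.perf_counter()
            try:
                fn()
                status = "success"
            except Exception:
                status = "failed"
            self.results.append(StepResult(name, status, time.perf_counter() - start))
            if status == "failed":
                break  # stop the run on the first failing step
        return self.results


run = PipelineRun(steps=[
    ("extract", lambda: None),
    ("transform", lambda: None),
    ("load", lambda: None),
])
results = run.execute()
```

In the real backend this loop would run as a background task, with each `StepResult` broadcast over the WebSocket as it completes.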

ML Experiment Tracking

  • Multi-Algorithm Support: RandomForest, GradientBoosting, LogisticRegression, SVM
  • Metrics Dashboard: Accuracy, Precision, Recall, F1-Score visualization
  • Model Versioning: Automatic version management for trained models
  • Parameter Logging: Full reproducibility with stored hyperparameters
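The four dashboard metrics come straight from scikit-learn; a rough sketch (the toy labels below are made up for illustration):

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Toy predictions standing in for a trained model's output.
y_true = [0, 1, 1, 0, 1, 1, 0, 0]
y_pred = [0, 1, 0, 0, 1, 1, 0, 1]

metrics = {
    "accuracy": accuracy_score(y_true, y_pred),
    "precision": precision_score(y_true, y_pred),
    "recall": recall_score(y_true, y_pred),
    "f1_score": f1_score(y_true, y_pred),
}
```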

AutoML Engine

  • One-Click AutoML: Select algorithms, configure CV folds, and run
  • GridSearchCV Integration: Exhaustive hyperparameter search
  • Best Model Selection: Automatic identification and registration
  • Progress Broadcasting: Real-time updates during optimization
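A minimal sketch of the GridSearchCV flow described above, on a synthetic dataset; the grid and scorer are illustrative, not the service's actual configuration:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=200, n_features=10, random_state=42)

# Small grid for one algorithm; the real service builds one grid
# per algorithm the user selected.
search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid={"n_estimators": [50, 100], "max_depth": [5, 10]},
    cv=5,
    scoring="accuracy",
)
search.fit(X, y)

# The best candidate is what gets registered as a model.
best = {"params": search.best_params_, "cv_accuracy": search.best_score_}
```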

Data Quality

  • Validation Rules: Configurable data quality checks
  • Quality Metrics: Completeness, accuracy, consistency, timeliness
  • Issue Detection: Automated identification of data problems
  • Profile Generation: Dataset statistics and summaries
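Completeness, the first of the quality metrics above, is simply the fraction of non-null values per column. A stdlib sketch with hypothetical rows (this is not the platform's actual validator):

```python
rows = [
    {"user_id": 1, "activity": "walking", "duration": 30},
    {"user_id": 2, "activity": None, "duration": 45},
    {"user_id": 3, "activity": "running", "duration": None},
]


def completeness(rows, columns):
    """Fraction of non-null values per column, per the dashboard metric."""
    total = len(rows)
    return {c: sum(r[c] is not None for r in rows) / total for c in columns}


scores = completeness(rows, ["user_id", "activity", "duration"])
```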

Real-Time Monitoring

  • WebSocket Connection: Instant updates without polling
  • Live Log Streaming: Watch pipeline execution in real-time
  • Connection Indicator: Visual status of real-time connectivity
  • Multi-Client Support: Broadcast to all connected users
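The multi-client broadcast can be sketched with a minimal connection manager: every connected client receives every event. `FakeSocket` stands in for a real WebSocket here; the class is illustrative rather than the backend's actual implementation.

```python
import asyncio


class ConnectionManager:
    """Minimal broadcast manager: every connected client gets every event."""

    def __init__(self):
        self.active = []

    async def connect(self, ws):
        self.active.append(ws)

    async def broadcast(self, message: dict):
        for ws in list(self.active):
            await ws.send_json(message)


class FakeSocket:
    """Stand-in for a WebSocket connection, for demonstration only."""

    def __init__(self):
        self.received = []

    async def send_json(self, message):
        self.received.append(message)


async def demo():
    manager = ConnectionManager()
    a, b = FakeSocket(), FakeSocket()
    await manager.connect(a)
    await manager.connect(b)
    await manager.broadcast({"event": "pipeline_step", "step": "transform"})
    return a, b


a, b = asyncio.run(demo())
```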

πŸ— Architecture

System Overview

┌─────────────────────────────────────────────────────────────────────┐
│                           FRONTEND                                  │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐ │
│  │  Dashboard  │  │  Pipelines  │  │ Experiments │  │   AutoML    │ │
│  │   Charts    │  │   Manager   │  │   Tracker   │  │   Engine    │ │
│  └──────┬──────┘  └──────┬──────┘  └──────┬──────┘  └──────┬──────┘ │
│         │                │                │                │        │
│         └────────────────┴────────────────┴────────────────┘        │
│                                   │                                 │
│                          WebSocket + REST                           │
└───────────────────────────────────┬─────────────────────────────────┘
                                    │
┌───────────────────────────────────┴─────────────────────────────────┐
│                           BACKEND (FastAPI)                         │
│  ┌─────────────────────────────────────────────────────────────────┐│
│  │                    Connection Manager (WebSocket)               ││
│  └─────────────────────────────────────────────────────────────────┘│
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐ │
│  │  Pipeline   │  │    ML       │  │   AutoML    │  │    Data     │ │
│  │  Executor   │  │  Service    │  │   Service   │  │  Validator  │ │
│  └──────┬──────┘  └──────┬──────┘  └──────┬──────┘  └──────┬──────┘ │
│         │                │                │                │        │
│         └────────────────┴────────────────┴────────────────┘        │
│                                   │                                 │
└───────────────────────────────────┬─────────────────────────────────┘
                                    │
┌───────────────────────────────────┴─────────────────────────────────┐
│                           DATA LAYER                                │
│  ┌─────────────────────┐              ┌─────────────────────┐       │
│  │      MongoDB        │              │    File Storage     │       │
│  │  • pipelines        │              │  • Model artifacts  │       │
│  │  • experiments      │              │  • Datasets         │       │
│  │  • models           │              │  • Logs             │       │
│  │  • validations      │              │                     │       │
│  └─────────────────────┘              └─────────────────────┘       │
└─────────────────────────────────────────────────────────────────────┘

Tech Stack

| Layer | Technology | Purpose |
| --- | --- | --- |
| Frontend | React 18, Recharts, Lucide Icons | Interactive dashboard with visualizations |
| Backend | FastAPI, Uvicorn | High-performance async API server |
| Real-time | WebSocket | Bidirectional live updates |
| ML Engine | scikit-learn, pandas, numpy | Model training and AutoML |
| Database | MongoDB | Document storage for flexible schemas |
| Styling | Tailwind-inspired CSS | Modern dark theme UI |

Directory Structure

/app
├── backend/
│   ├── server.py              # FastAPI application (1200+ lines)
│   │   ├── WebSocket Manager  # Real-time connection handling
│   │   ├── Pipeline Executor  # Background task execution
│   │   ├── ML Service         # Model training logic
│   │   ├── AutoML Service     # GridSearchCV automation
│   │   └── REST Endpoints     # 30+ API routes
│   ├── requirements.txt       # Python dependencies
│   └── .env                   # Environment configuration
│
├── frontend/
│   ├── src/
│   │   ├── App.js             # Main React component (1700+ lines)
│   │   │   ├── useWebSocket    # Custom hook for real-time
│   │   │   ├── DashboardPage   # Stats & charts
│   │   │   ├── PipelinesPage   # Pipeline management
│   │   │   ├── ExperimentsPage # ML experiment tracking
│   │   │   ├── AutoMLPage      # Automated ML interface
│   │   │   ├── ValidationsPage # Data quality
│   │   │   └── LogsPage        # Real-time logs
│   │   ├── App.css            # Styling (500+ lines)
│   │   └── index.js           # Entry point
│   ├── package.json           # Node dependencies
│   └── .env                   # Frontend configuration
│
├── data/                      # Sample datasets
├── memory/
│   └── PRD.md                 # Product requirements document
└── README.md                  # This file

Data Flow

1. User Action (Frontend)
        │
        ▼
2. REST API / WebSocket (Backend)
        │
        ▼
3. Business Logic (Services)
        │
        ├──► Pipeline Executor (Background Task)
        │           │
        │           ▼
        │    Step-by-step execution with logging
        │           │
        │           ▼
        │    WebSocket broadcast to all clients
        │
        ├──► ML Service (Model Training)
        │           │
        │           ▼
        │    scikit-learn model fitting
        │           │
        │           ▼
        │    Metrics calculation & storage
        │
        └──► AutoML Service (Hyperparameter Search)
                    │
                    ▼
             GridSearchCV with CV folds
                    │
                    ▼
             Best model selection & registration
        │
        ▼
4. MongoDB (Persistence)
        │
        ▼
5. Response to Frontend (REST/WebSocket)

🚀 Setup

Prerequisites

  • Python 3.11+
  • Node.js 18+
  • MongoDB 6.0+
  • Git

Quick Start

# 1. Clone the repository
git clone https://github.com/Mattral/ETL-ML.git
cd ETL-ML

# 2. Setup Backend
cd backend
python -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate
pip install -r requirements.txt

# 3. Configure Environment
cat > .env << EOF
MONGO_URL={your uri}
DB_NAME=etl_ml_dashboard
EOF

# 4. Start Backend
uvicorn server:app --host 0.0.0.0 --port 8001 --reload

# 5. Setup Frontend (new terminal)
cd ../frontend
yarn install  # or npm install

# 6. Configure Frontend Environment
cat > .env << EOF
REACT_APP_BACKEND_URL=http://localhost:8001
EOF

# 7. Start Frontend
yarn start  # or npm start

Docker Setup (Alternative)

# Build and run with Docker Compose
docker-compose up -d

# Access the application
# Frontend: http://localhost:3000
# Backend:  http://localhost:8001/api/docs

Environment Variables

Backend (/backend/.env)

| Variable | Description | Default |
| --- | --- | --- |
| MONGO_URL | MongoDB connection string | mongodb://localhost:27017 |
| DB_NAME | Database name | etl_ml_dashboard |
| AWS_ACCESS_KEY_ID | AWS credentials (optional) | - |
| AWS_SECRET_ACCESS_KEY | AWS credentials (optional) | - |
| AWS_BUCKET_NAME | S3 bucket for artifacts | etl-ml-storage |

Frontend (/frontend/.env)

| Variable | Description | Default |
| --- | --- | --- |
| REACT_APP_BACKEND_URL | Backend API URL | http://localhost:8001 |

📖 Usage

1. Seed the Database

First, populate the database with sample data:

curl -X POST http://localhost:8001/api/seed

Or use the Settings page in the UI and click "Seed Database".

2. Explore the Dashboard

Navigate to http://localhost:3000 to see:

  • Stats Cards: Total pipelines, experiments, models, AutoML runs
  • Pipeline Runs Chart: Success/failure trends over 7 days
  • Model Accuracy Trend: Version-over-version improvement
  • Data Quality Metrics: Completeness, accuracy, consistency scores

📸 Application Screenshots

Dashboard Overview


Main dashboard showing real-time statistics, pipeline trends, model accuracy, and data quality metrics

Experiments Tracking


ML experiment tracking interface with multi-algorithm support and comprehensive metrics visualization

Pipeline Management


Visual pipeline management showing ETL workflows, execution status, and run history


3. Run a Pipeline

# List pipelines
curl http://localhost:8001/api/pipelines

# Run a specific pipeline
curl -X POST http://localhost:8001/api/pipelines/pip-001/run

4. Create an ML Experiment

curl -X POST http://localhost:8001/api/experiments \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Activity Recognition v1",
    "algorithm": "RandomForest",
    "parameters": {
      "n_estimators": 100,
      "max_depth": 10
    }
  }'
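The same request can be issued from Python with only the standard library; `create_experiment` is a hypothetical helper mirroring the curl call above, not part of the platform's codebase.

```python
import json
import urllib.request


def create_experiment(base_url: str, payload: dict) -> bytes:
    """POST an experiment definition to the API (mirrors the curl call above)."""
    req = urllib.request.Request(
        f"{base_url}/api/experiments",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read()


payload = {
    "name": "Activity Recognition v1",
    "algorithm": "RandomForest",
    "parameters": {"n_estimators": 100, "max_depth": 10},
}

# Calling it requires the backend from the Setup section to be running:
# print(create_experiment("http://localhost:8001", payload))
```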

5. Run AutoML

curl -X POST http://localhost:8001/api/automl/run \
  -H "Content-Type: application/json" \
  -d '{
    "experiment_name": "Best Model Search",
    "algorithms": ["RandomForest", "GradientBoosting", "LogisticRegression"],
    "cv_folds": 5,
    "max_trials": 20
  }'

6. Monitor in Real-Time

Connect to the WebSocket for live updates:

const ws = new WebSocket('ws://localhost:8001/ws');
ws.onmessage = (event) => {
  const data = JSON.parse(event.data);
  console.log('Real-time update:', data);
};

📡 API Reference

Core Endpoints

| Method | Endpoint | Description |
| --- | --- | --- |
| GET | /api/health | Health check |
| GET | /api/dashboard/stats | Dashboard statistics |
| GET | /api/dashboard/metrics | Chart data |

Pipelines

| Method | Endpoint | Description |
| --- | --- | --- |
| GET | /api/pipelines | List all pipelines |
| POST | /api/pipelines | Create pipeline |
| GET | /api/pipelines/{id} | Get pipeline details |
| DELETE | /api/pipelines/{id} | Delete pipeline |
| POST | /api/pipelines/{id}/run | Execute pipeline |
| GET | /api/pipelines/{id}/runs | Get run history |

Experiments

| Method | Endpoint | Description |
| --- | --- | --- |
| GET | /api/experiments | List experiments |
| POST | /api/experiments | Create & run experiment |
| GET | /api/experiments/{id} | Get experiment details |
| DELETE | /api/experiments/{id} | Delete experiment |

AutoML

| Method | Endpoint | Description |
| --- | --- | --- |
| POST | /api/automl/run | Start AutoML job |
| GET | /api/automl/runs | List AutoML runs |
| GET | /api/automl/runs/{id} | Get AutoML results |

Models

| Method | Endpoint | Description |
| --- | --- | --- |
| GET | /api/models | List registered models |
| GET | /api/models/{id} | Get model details |

Data Quality

| Method | Endpoint | Description |
| --- | --- | --- |
| GET | /api/validations | List validations |
| POST | /api/validations | Create validation |
| GET | /api/validations/{id} | Get validation details |

WebSocket

| Endpoint | Events |
| --- | --- |
| ws://localhost:8001/ws | pipeline_step, pipeline_completed, pipeline_failed, experiment_completed, automl_progress, automl_completed, log |
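On the client side, incoming frames can be dispatched by event type. The event names come from the table above; the payload fields (`step`, `status`, `trial`, `total`, `line`) are assumed shapes for illustration.

```python
import json

# One handler per event type of interest; unknown events fall through.
handlers = {
    "pipeline_step": lambda m: f"step {m.get('step')} -> {m.get('status')}",
    "automl_progress": lambda m: f"trial {m.get('trial')} of {m.get('total')}",
    "log": lambda m: m.get("line", ""),
}


def handle(raw: str) -> str:
    """Dispatch one WebSocket frame by its event type."""
    message = json.loads(raw)
    handler = handlers.get(message.get("event"))
    return handler(message) if handler else f"unhandled: {message.get('event')}"


out = handle('{"event": "pipeline_step", "step": "transform", "status": "running"}')
```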

🧪 Testing

Run Backend Tests

cd backend
pytest tests/ -v

Test API Endpoints

# Health check
curl http://localhost:8001/api/health

# Verify all systems
curl http://localhost:8001/api/dashboard/stats

Frontend Lint

cd frontend
yarn lint

🛣 Roadmap

Phase 2 (Planned)

  • AWS S3 integration for model artifacts
  • Pipeline scheduling with cron expressions
  • Email/Slack notifications
  • User authentication (JWT)

Phase 3 (Future)

  • Visual DAG pipeline editor
  • Model deployment as REST APIs
  • Apache Airflow integration
  • Multi-tenant support

🤝 Contributing

We welcome contributions! Please see our Contributing Guide for details.

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.


πŸ™ Acknowledgments

  • Reference implementation: ruslanmv/ETL-and-Machine-Learning
  • HMP Dataset for activity recognition benchmarks
  • scikit-learn team for the ML toolkit
  • FastAPI for the excellent async framework

Built with precision for scale. Designed for humans.

Report Bug • Request Feature
