A Unified ETL and Machine Learning Automation Platform with Real-Time Monitoring and Experiment Tracking
A platform for managing ETL pipelines, tracking ML experiments, and automating model selection with real-time monitoring.
Features • Architecture • Setup • Usage • API Reference
In modern data-driven organizations, teams face a common challenge: the gap between data engineering and machine learning is too wide. Data engineers build pipelines in one system, ML engineers track experiments in another, and everyone struggles with visibility into what's actually happening.
This project bridges that gap by providing a unified platform where:
- ETL pipelines and ML experiments live together
- Real-time monitoring keeps everyone informed
- AutoML democratizes model selection
- Quality checks ensure data integrity at every step
| Problem | Traditional Approach | Our Solution |
|---|---|---|
| Fragmented tooling | Airflow + MLflow + custom scripts | Single unified dashboard |
| No real-time visibility | Check logs manually, wait for emails | WebSocket-powered live updates |
| ML expertise bottleneck | Only senior ML engineers can tune models | AutoML with one-click execution |
| Data quality blindspots | Issues discovered in production | Integrated validation at every step |
| Experiment chaos | Spreadsheets, notebooks, local files | Centralized experiment tracking |
🔧 Data Engineers
- Build and monitor ETL pipelines
- Track data quality metrics
- Debug pipeline failures in real-time
🧪 ML Engineers
- Run experiments with different algorithms
- Compare model performance across versions
- Register and version trained models
📊 Data Scientists
- Quickly prototype models with AutoML
- Focus on problem-solving, not infrastructure
- Iterate faster with automated hyperparameter tuning
⚙️ Platform/MLOps Teams
- Monitor system health
- Manage infrastructure at scale
- Ensure reliability across pipelines
- Startups building their first ML infrastructure
- Enterprise teams modernizing legacy ETL systems
- Research teams needing reproducible experiment tracking
- Consultancies delivering ML solutions to clients
- Educational institutions teaching MLOps best practices
- Visual Pipeline Builder: Define Extract → Transform → Load → Validate → Train steps
- Background Execution: Non-blocking pipeline runs with progress tracking
- Run History: Complete audit trail of all executions with duration and status
- Real-time Updates: WebSocket-powered live status changes
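As a hedged sketch of the execution model behind these features, a pipeline run can be thought of as walking an ordered list of steps; the document shape and helper below are illustrative, not the platform's actual schema:

```python
# Hypothetical shape of a pipeline document as it might be stored in MongoDB.
# Field names are illustrative, not the platform's actual schema.
from datetime import datetime, timezone

pipeline = {
    "id": "pip-001",
    "name": "Activity Data Ingest",
    "steps": ["extract", "transform", "load", "validate", "train"],
    "created_at": datetime.now(timezone.utc).isoformat(),
}

def next_step(pipeline: dict, completed: list) -> "str | None":
    """Return the first step that has not completed yet, or None when done."""
    for step in pipeline["steps"]:
        if step not in completed:
            return step
    return None

print(next_step(pipeline, ["extract", "transform"]))  # prints: load
```

Tracking completed steps this way also gives the run history a natural per-step status trail.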
- Multi-Algorithm Support: RandomForest, GradientBoosting, LogisticRegression, SVM
- Metrics Dashboard: Accuracy, Precision, Recall, F1-Score visualization
- Model Versioning: Automatic version management for trained models
- Parameter Logging: Full reproducibility with stored hyperparameters
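For illustration, an experiment of this kind can be reproduced with scikit-learn; the synthetic dataset and hyperparameters below are made up, not the platform's defaults:

```python
# Illustrative experiment run: train one algorithm, log its parameters,
# and compute the four dashboard metrics with scikit-learn.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

params = {"n_estimators": 100, "max_depth": 10}  # stored for reproducibility
model = RandomForestClassifier(**params, random_state=42).fit(X_train, y_train)
pred = model.predict(X_test)

metrics = {
    "accuracy": accuracy_score(y_test, pred),
    "precision": precision_score(y_test, pred),
    "recall": recall_score(y_test, pred),
    "f1": f1_score(y_test, pred),
}
print(metrics)
```

Persisting `params` alongside `metrics` is what makes a run reproducible and comparable across versions.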
- One-Click AutoML: Select algorithms, configure CV folds, and run
- GridSearchCV Integration: Exhaustive hyperparameter search
- Best Model Selection: Automatic identification and registration
- Progress Broadcasting: Real-time updates during optimization
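A minimal sketch of what a GridSearchCV-backed search could look like under the hood, assuming one parameter grid per algorithm (the grids below are illustrative, not the platform's actual configuration):

```python
# Sketch of an AutoML pass: exhaustive grid search per candidate algorithm,
# then pick the best cross-validated model for registration.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=300, n_features=8, random_state=0)

candidates = {
    "RandomForest": (RandomForestClassifier(random_state=0),
                     {"n_estimators": [50, 100], "max_depth": [5, 10]}),
    "LogisticRegression": (LogisticRegression(max_iter=1000),
                           {"C": [0.1, 1.0, 10.0]}),
}

best_name, best_score, best_model = None, -1.0, None
for name, (estimator, grid) in candidates.items():
    search = GridSearchCV(estimator, grid, cv=5, scoring="accuracy")
    search.fit(X, y)
    if search.best_score_ > best_score:
        best_name = name
        best_score = search.best_score_
        best_model = search.best_estimator_

print(best_name, round(best_score, 3))  # winner would be registered as a model
```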
- Validation Rules: Configurable data quality checks
- Quality Metrics: Completeness, accuracy, consistency, timeliness
- Issue Detection: Automated identification of data problems
- Profile Generation: Dataset statistics and summaries
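As an example of what a completeness metric means in practice, here is a small pandas sketch; the column names are hypothetical and the rule format is not the platform's actual one:

```python
# Hypothetical completeness check: fraction of non-null values per column.
import pandas as pd

df = pd.DataFrame({"user": ["a", "b", None, "d"], "age": [34, None, 29, 41]})

def completeness(frame: pd.DataFrame) -> dict:
    """Return the fraction of non-null values for each column."""
    return frame.notna().mean().round(2).to_dict()

print(completeness(df))  # {'user': 0.75, 'age': 0.75}
```

Accuracy, consistency, and timeliness checks follow the same pattern: a per-column score between 0 and 1 that can be thresholded into pass/fail validations.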
- WebSocket Connection: Instant updates without polling
- Live Log Streaming: Watch pipeline execution in real-time
- Connection Indicator: Visual status of real-time connectivity
- Multi-Client Support: Broadcast to all connected users
```
┌──────────────────────────────────────────────────────────────────┐
│                             FRONTEND                             │
│  ┌───────────┐  ┌───────────┐  ┌─────────────┐  ┌───────────┐    │
│  │ Dashboard │  │ Pipelines │  │ Experiments │  │  AutoML   │    │
│  │  Charts   │  │  Manager  │  │   Tracker   │  │  Engine   │    │
│  └─────┬─────┘  └─────┬─────┘  └──────┬──────┘  └─────┬─────┘    │
│        └──────────────┴───────┬───────┴───────────────┘          │
│                               │                                  │
│                      WebSocket + REST                            │
└───────────────────────────────┬──────────────────────────────────┘
                                │
┌───────────────────────────────┴──────────────────────────────────┐
│                        BACKEND (FastAPI)                         │
│  ┌────────────────────────────────────────────────────────────┐  │
│  │              Connection Manager (WebSocket)                │  │
│  └────────────────────────────────────────────────────────────┘  │
│  ┌───────────┐  ┌───────────┐  ┌───────────┐  ┌─────────────┐    │
│  │ Pipeline  │  │    ML     │  │  AutoML   │  │    Data     │    │
│  │ Executor  │  │  Service  │  │  Service  │  │  Validator  │    │
│  └─────┬─────┘  └─────┬─────┘  └─────┬─────┘  └──────┬──────┘    │
│        └──────────────┴───────┬──────┴───────────────┘           │
└───────────────────────────────┬──────────────────────────────────┘
                                │
┌───────────────────────────────┴──────────────────────────────────┐
│                            DATA LAYER                            │
│   ┌─────────────────────┐      ┌─────────────────────┐           │
│   │       MongoDB       │      │    File Storage     │           │
│   │  • pipelines        │      │  • Model artifacts  │           │
│   │  • experiments      │      │  • Datasets         │           │
│   │  • models           │      │  • Logs             │           │
│   │  • validations      │      │                     │           │
│   └─────────────────────┘      └─────────────────────┘           │
└──────────────────────────────────────────────────────────────────┘
```
| Layer | Technology | Purpose |
|---|---|---|
| Frontend | React 18, Recharts, Lucide Icons | Interactive dashboard with visualizations |
| Backend | FastAPI, Uvicorn | High-performance async API server |
| Real-time | WebSocket | Bidirectional live updates |
| ML Engine | scikit-learn, pandas, numpy | Model training and AutoML |
| Database | MongoDB | Document storage for flexible schemas |
| Styling | Tailwind-inspired CSS | Modern dark theme UI |
```
/app
├── backend/
│   ├── server.py              # FastAPI application (1200+ lines)
│   │   ├── WebSocket Manager  # Real-time connection handling
│   │   ├── Pipeline Executor  # Background task execution
│   │   ├── ML Service         # Model training logic
│   │   ├── AutoML Service     # GridSearchCV automation
│   │   └── REST Endpoints     # 30+ API routes
│   ├── requirements.txt       # Python dependencies
│   └── .env                   # Environment configuration
│
├── frontend/
│   ├── src/
│   │   ├── App.js              # Main React component (1700+ lines)
│   │   │   ├── useWebSocket    # Custom hook for real-time
│   │   │   ├── DashboardPage   # Stats & charts
│   │   │   ├── PipelinesPage   # Pipeline management
│   │   │   ├── ExperimentsPage # ML experiment tracking
│   │   │   ├── AutoMLPage      # Automated ML interface
│   │   │   ├── ValidationsPage # Data quality
│   │   │   └── LogsPage        # Real-time logs
│   │   ├── App.css             # Styling (500+ lines)
│   │   └── index.js            # Entry point
│   ├── package.json            # Node dependencies
│   └── .env                    # Frontend configuration
│
├── data/                       # Sample datasets
├── memory/
│   └── PRD.md                  # Product requirements document
└── README.md                   # This file
```
```
1. User Action (Frontend)
        │
        ▼
2. REST API / WebSocket (Backend)
        │
        ▼
3. Business Logic (Services)
        │
        ├──▶ Pipeline Executor (Background Task)
        │        │
        │        ▼
        │    Step-by-step execution with logging
        │        │
        │        ▼
        │    WebSocket broadcast to all clients
        │
        ├──▶ ML Service (Model Training)
        │        │
        │        ▼
        │    scikit-learn model fitting
        │        │
        │        ▼
        │    Metrics calculation & storage
        │
        └──▶ AutoML Service (Hyperparameter Search)
                 │
                 ▼
             GridSearchCV with CV folds
                 │
                 ▼
             Best model selection & registration
        │
        ▼
4. MongoDB (Persistence)
        │
        ▼
5. Response to Frontend (REST/WebSocket)
```
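The broadcast step above can be sketched as a small connection manager. The real server uses FastAPI WebSockets; here any object exposing an async `send_json()` stands in, so the fan-out pattern can be shown (and tested) without a server:

```python
# Sketch of the WebSocket fan-out used in step 3: every connected client
# receives each pipeline event. FakeClient is a test stand-in, not the API.
import asyncio

class ConnectionManager:
    def __init__(self):
        self.active = []

    def connect(self, ws):
        self.active.append(ws)

    def disconnect(self, ws):
        self.active.remove(ws)

    async def broadcast(self, message: dict):
        # Copy the list so a disconnect during iteration is safe.
        for ws in list(self.active):
            await ws.send_json(message)

class FakeClient:
    def __init__(self):
        self.received = []

    async def send_json(self, msg):
        self.received.append(msg)

async def demo():
    manager = ConnectionManager()
    a, b = FakeClient(), FakeClient()
    manager.connect(a)
    manager.connect(b)
    await manager.broadcast({"type": "pipeline_step", "step": "transform"})
    return a.received, b.received

print(asyncio.run(demo()))
```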
- Python 3.11+
- Node.js 18+
- MongoDB 6.0+
- Git
```bash
# 1. Clone the repository
git clone https://github.com/Mattral/ETL-ML.git
cd ETL-ML

# 2. Set up the backend
cd backend
python -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate
pip install -r requirements.txt

# 3. Configure the environment
cat > .env << EOF
MONGO_URL={your uri}
DB_NAME=etl_ml_dashboard
EOF

# 4. Start the backend
uvicorn server:app --host 0.0.0.0 --port 8001 --reload

# 5. Set up the frontend (new terminal)
cd ../frontend
yarn install  # or npm install

# 6. Configure the frontend environment
cat > .env << EOF
REACT_APP_BACKEND_URL=http://localhost:8001
EOF

# 7. Start the frontend
yarn start  # or npm start
```

```bash
# Build and run with Docker Compose
docker-compose up -d

# Access the application
# Frontend: http://localhost:3000
# Backend:  http://localhost:8001/api/docs
```

| Variable | Description | Default |
|---|---|---|
| `MONGO_URL` | MongoDB connection string | `mongodb://localhost:27017` |
| `DB_NAME` | Database name | `etl_ml_dashboard` |
| `AWS_ACCESS_KEY_ID` | AWS credentials (optional) | - |
| `AWS_SECRET_ACCESS_KEY` | AWS credentials (optional) | - |
| `AWS_BUCKET_NAME` | S3 bucket for artifacts | `etl-ml-storage` |
| Variable | Description | Default |
|---|---|---|
| `REACT_APP_BACKEND_URL` | Backend API URL | `http://localhost:8001` |
First, populate the database with sample data:
```bash
curl -X POST http://localhost:8001/api/seed
```

Or use the Settings page in the UI and click "Seed Database".
Navigate to http://localhost:3000 to see:
- Stats Cards: Total pipelines, experiments, models, AutoML runs
- Pipeline Runs Chart: Success/failure trends over 7 days
- Model Accuracy Trend: Version-over-version improvement
- Data Quality Metrics: Completeness, accuracy, consistency scores
Main dashboard showing real-time statistics, pipeline trends, model accuracy, and data quality metrics
ML experiment tracking interface with multi-algorithm support and comprehensive metrics visualization
```bash
# List pipelines
curl http://localhost:8001/api/pipelines

# Run a specific pipeline
curl -X POST http://localhost:8001/api/pipelines/pip-001/run
```

```bash
curl -X POST http://localhost:8001/api/experiments \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Activity Recognition v1",
    "algorithm": "RandomForest",
    "parameters": {
      "n_estimators": 100,
      "max_depth": 10
    }
  }'
```

```bash
curl -X POST http://localhost:8001/api/automl/run \
  -H "Content-Type: application/json" \
  -d '{
    "experiment_name": "Best Model Search",
    "algorithms": ["RandomForest", "GradientBoosting", "LogisticRegression"],
    "cv_folds": 5,
    "max_trials": 20
  }'
```

Connect to the WebSocket for live updates:

```javascript
const ws = new WebSocket('ws://localhost:8001/ws');
ws.onmessage = (event) => {
  const data = JSON.parse(event.data);
  console.log('Real-time update:', data);
};
```

| Method | Endpoint | Description |
|---|---|---|
| GET | `/api/health` | Health check |
| GET | `/api/dashboard/stats` | Dashboard statistics |
| GET | `/api/dashboard/metrics` | Chart data |
| Method | Endpoint | Description |
|---|---|---|
| GET | `/api/pipelines` | List all pipelines |
| POST | `/api/pipelines` | Create pipeline |
| GET | `/api/pipelines/{id}` | Get pipeline details |
| DELETE | `/api/pipelines/{id}` | Delete pipeline |
| POST | `/api/pipelines/{id}/run` | Execute pipeline |
| GET | `/api/pipelines/{id}/runs` | Get run history |
| Method | Endpoint | Description |
|---|---|---|
| GET | `/api/experiments` | List experiments |
| POST | `/api/experiments` | Create & run experiment |
| GET | `/api/experiments/{id}` | Get experiment details |
| DELETE | `/api/experiments/{id}` | Delete experiment |
| Method | Endpoint | Description |
|---|---|---|
| POST | `/api/automl/run` | Start AutoML job |
| GET | `/api/automl/runs` | List AutoML runs |
| GET | `/api/automl/runs/{id}` | Get AutoML results |
| Method | Endpoint | Description |
|---|---|---|
| GET | `/api/models` | List registered models |
| GET | `/api/models/{id}` | Get model details |
| Method | Endpoint | Description |
|---|---|---|
| GET | `/api/validations` | List validations |
| POST | `/api/validations` | Create validation |
| GET | `/api/validations/{id}` | Get validation details |
| Endpoint | Events |
|---|---|
| `ws://localhost:8001/ws` | `pipeline_step`, `pipeline_completed`, `pipeline_failed`, `experiment_completed`, `automl_progress`, `automl_completed`, `log` |
```bash
cd backend
pytest tests/ -v
```

```bash
# Health check
curl http://localhost:8001/api/health

# Verify all systems
curl http://localhost:8001/api/dashboard/stats
```

```bash
cd frontend
yarn lint
```

- AWS S3 integration for model artifacts
- Pipeline scheduling with cron expressions
- Email/Slack notifications
- User authentication (JWT)
- Visual DAG pipeline editor
- Model deployment as REST APIs
- Apache Airflow integration
- Multi-tenant support
We welcome contributions! Please see our Contributing Guide for details.
- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing-feature`)
- Commit your changes (`git commit -m 'Add amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
- Reference implementation: ruslanmv/ETL-and-Machine-Learning
- HMP Dataset for activity recognition benchmarks
- scikit-learn team for the ML toolkit
- FastAPI for the excellent async framework
Built with precision for scale. Designed for humans.