105 changes: 105 additions & 0 deletions examples/MLOps_Ops_and_Enterprise_Features/cost_benchmark/README.md
@@ -0,0 +1,105 @@
# 💰 Cost/Performance Benchmark

## 🌟 Overview

This template provides a crucial framework for **FinOps (Financial Operations)** by running a **Cost/Performance Benchmark** on deep learning tasks. It accurately measures the trade-off between speed and cost, providing data to answer the core question: *Which hardware configuration delivers the best performance per dollar?*

It uses a **custom Python logger** to record key metrics, generating a structured report that can be used to compare different machine types (e.g., A100 vs. V100, or CPU vs. GPU).

### Key Metrics Tracked

* **Cost/Epoch:** Estimated cost of each epoch, derived from the measured epoch time and the configured hourly rate.
* **Tokens/sec:** Raw speed/throughput of the hardware (a short sketch of both formulas follows this list).
* **Job Summary:** Provides total estimated cost and total execution time.
* **Hardware:** Tracks CPU vs. GPU execution path.
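
As a quick illustration of how those two headline numbers relate, the sketch below applies the same formulas used in `cost_benchmark.py`; the epoch time and hourly rate are illustrative placeholders, not measured values.

```python
# Minimal sketch of the Cost/Epoch and Tokens/sec formulas (illustrative numbers).
GPU_HOURLY_RATE = 3.20   # $/hour -- must match your actual instance pricing
BATCH_SIZE = 32
TOKENS_PER_SAMPLE = 100
epoch_time_s = 0.05      # placeholder; the real value comes from the timed run

cost_per_epoch = (epoch_time_s / 3600.0) * GPU_HOURLY_RATE
tokens_per_sec = (BATCH_SIZE * TOKENS_PER_SAMPLE) / epoch_time_s

print(f"Cost/Epoch: ${cost_per_epoch:.5f} | Tokens/s: {tokens_per_sec:.0f}")
# -> Cost/Epoch: $0.00004 | Tokens/s: 64000
```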

-----

## 🛠️ Implementation Details

### 1\. Project Setup (Bash Script)

Save the following as `setup_benchmark_env.sh`. This script creates a virtual environment and installs PyTorch along with the supporting packages.

```bash
#!/bin/bash

ENV_NAME="cost_benchmark_env"
PYTHON_VERSION="3.11"

echo "================================================="
echo "🚀 Setting up Cost/Performance Benchmark Environment"
echo "================================================="

# 1. Create and Activate Stable VENV
rm -rf $ENV_NAME
python$PYTHON_VERSION -m venv $ENV_NAME
source $ENV_NAME/bin/activate

# 2. Install PyTorch (Required for accurate CUDA event timing)
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

# 3. Install Helpers
pip install numpy pandas psutil

echo "--- Installation Complete ---"
```

#### Execution

1. **Grant Permission:** `chmod +x setup_benchmark_env.sh`
2. **Run Setup:** `./setup_benchmark_env.sh`

-----

### 2\. Procedures (Job Execution)

#### Step A: Activate the Environment

```bash
source cost_benchmark_env/bin/activate
```

#### Step B: Configure Pricing (CRITICAL)

Before running the script, you **must** update the `GPU_HOURLY_RATE` constant in `cost_benchmark.py` to reflect the actual hourly cost of the machine you are testing on Saturn Cloud.

```python
# --- Configuration & Constants in cost_benchmark.py ---
# UPDATE THIS VALUE MANUALLY based on your Saturn Cloud instance type
GPU_HOURLY_RATE = 3.20 # Example $/hour for a high-end GPU (must be updated manually)
```

#### Step C: Run the Benchmark

Execute the Python script (`cost_benchmark.py`).

```bash
python cost_benchmark.py
```

### 3\. Verification and Reporting

The script will generate structured output to the console and a persistent file named **`benchmark_results.log`**.

| Log Entry Example | Metric Significance |
| :--- | :--- |
| `Time: 0.0500s` | Raw speed (lower is better). |
| `Cost: $0.00004` | **Cost/Epoch** (lower is better for efficiency). |
| `Tokens/s: 6400` | **Throughput/Speed** (higher is better for performance). |

This log file serves as the definitive source for generating a comparative chart (Cost/Epoch vs. Tokens/sec) for optimal rightsizing.
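
As one possible way to build that comparison, the sketch below parses the per-epoch lines of `benchmark_results.log` into a pandas DataFrame (pandas is installed by the setup script); the regular expression simply mirrors the log format emitted by `cost_benchmark.py`.

```python
import re
import pandas as pd

# Matches lines such as:
# 2025-01-01 12:00:00,000 | INFO | EPOCH: 1/5 | Time: 0.0500s | Cost: $0.00004 | Tokens/s: 64000
PATTERN = re.compile(
    r"EPOCH: (\d+)/\d+ \| Time: ([\d.]+)s \| Cost: \$([\d.]+) \| Tokens/s: (\d+)"
)

rows = []
with open("benchmark_results.log") as f:
    for line in f:
        match = PATTERN.search(line)
        if match:
            epoch, time_s, cost, tokens = match.groups()
            rows.append({
                "epoch": int(epoch),
                "time_s": float(time_s),
                "cost_usd": float(cost),
                "tokens_per_sec": int(tokens),
            })

df = pd.DataFrame(rows)
print(df)
print(f"Mean cost/epoch: ${df['cost_usd'].mean():.5f} | "
      f"mean tokens/s: {df['tokens_per_sec'].mean():.0f}")
```

Run this once per machine type and concatenate the resulting DataFrames to chart Cost/Epoch against Tokens/sec.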

-----

## 4\. 🔗 Conclusion and Scaling on Saturn Cloud

The **Cost/Performance Benchmark** template is fundamental to the **Optimize** phase of the FinOps lifecycle. By quantifying the true expense of your speed, you can make data-driven decisions to reduce cloud waste.

To operationalize this benchmarking practice, **Saturn Cloud** offers the ideal platform:

* **FinOps Integration:** Saturn Cloud is an all-in-one solution for data science and MLOps, essential for implementing robust FinOps practices.
* **Rightsizing and Optimization:** Easily run this job on different GPU types within Saturn Cloud to determine the most cost-effective solution before deploying models to production. [Saturn Cloud MLOps Documentation](https://www.saturncloud.io/docs/design-principles/concepts/mlops/)
* **Building a Cost-Conscious Culture:** Integrate cost awareness directly into your MLOps pipeline, aligning technical performance with financial goals. [Saturn Cloud Homepage](https://saturncloud.io/)

**Optimize your cloud spend by deploying this template on Saturn Cloud\!**
@@ -0,0 +1,155 @@
import time
import torch
import torch.nn as nn
import torch.optim as optim
import logging
import sys

# --- Configuration & Constants ---
# Use the correct GPU pricing for your cloud provider (e.g., Saturn Cloud, AWS, GCP)
# Example: NVIDIA A100 pricing (approximate, for demonstration)
GPU_HOURLY_RATE = 3.20  # Example $/hour for a high-end GPU (must be updated manually)
LOG_FILE = "benchmark_results.log"

# Hyperparameters for the simulated job
EPOCHS = 5
BATCH_SIZE = 32
TOTAL_SAMPLES = 50000
TOTAL_TOKENS_PER_SAMPLE = 100  # Represents tokens in an NLP task or features in an image
TOTAL_TOKENS = TOTAL_SAMPLES * TOTAL_TOKENS_PER_SAMPLE

# --- Custom Logger Setup ---

def setup_logger():
    """Configures the logger to write structured output to a file."""
    # Create the logger object
    logger = logging.getLogger('BenchmarkLogger')
    logger.setLevel(logging.INFO)

    # Define a custom format that includes time and specific placeholders
    # We use a custom format to easily parse the final report later
    formatter = logging.Formatter(
        '%(asctime)s | %(levelname)s | %(message)s'
    )

    # File Handler
    file_handler = logging.FileHandler(LOG_FILE, mode='w')
    file_handler.setFormatter(formatter)
    logger.addHandler(file_handler)

    # Console Handler (for real-time feedback)
    stream_handler = logging.StreamHandler(sys.stdout)
    stream_handler.setFormatter(formatter)
    logger.addHandler(stream_handler)

    return logger

# --- Model & Timing Functions ---

class SimpleModel(nn.Module):
    def __init__(self, input_size, output_size):
        super().__init__()
        self.linear = nn.Linear(input_size, output_size)

    def forward(self, x):
        return self.linear(x)

def run_training_benchmark(logger, device):
    """Runs the simulated training job and logs cost/performance metrics."""
    logger.info(f"--- STARTING BENCHMARK ON {device.type.upper()} ---")

    # Configuration based on device
    INPUT_SIZE = 512
    OUTPUT_SIZE = 1

    # Model and Data Setup (on the target device)
    model = SimpleModel(INPUT_SIZE, OUTPUT_SIZE).to(device)
    dummy_input = torch.randn(BATCH_SIZE, INPUT_SIZE, device=device)
    dummy_target = torch.randn(BATCH_SIZE, OUTPUT_SIZE, device=device)
    optimizer = optim.Adam(model.parameters())
    criterion = nn.MSELoss()

    # Total estimated cost
    total_estimated_cost = 0.0

    # Synchronization is crucial for accurate GPU timing
    if device.type == 'cuda':
        # Warm-up run is necessary to avoid compilation time bias
        logger.info("Performing CUDA warm-up run...")
        _ = model(dummy_input)
        torch.cuda.synchronize()

    # Start timing the entire job
    job_start_time = time.time()

    for epoch in range(1, EPOCHS + 1):

        if device.type == 'cuda':
            # Use synchronized CUDA events for precise timing
            start_event = torch.cuda.Event(enable_timing=True)
            end_event = torch.cuda.Event(enable_timing=True)
            start_event.record()
        else:
            start_event = time.time()

        # --- Simulated Training Step ---
        optimizer.zero_grad()
        output = model(dummy_input)
        loss = criterion(output, dummy_target)
        loss.backward()
        optimizer.step()
        # --- End Simulated Training Step ---

        if device.type == 'cuda':
            end_event.record()
            torch.cuda.synchronize()  # Wait for GPU to finish
            # elapsed_time returns milliseconds, convert to seconds
            epoch_time_s = start_event.elapsed_time(end_event) / 1000.0
        else:
            epoch_time_s = time.time() - start_event

        # --- COST AND PERFORMANCE CALCULATION ---

        # 1. Cost Calculation
        cost_per_epoch = (epoch_time_s / 3600.0) * GPU_HOURLY_RATE
        total_estimated_cost += cost_per_epoch

        # 2. Performance Calculation (Throughput)
        throughput_samples_sec = BATCH_SIZE / epoch_time_s
        throughput_tokens_sec = (BATCH_SIZE * TOTAL_TOKENS_PER_SAMPLE) / epoch_time_s

        # --- LOGGING THE RESULTS ---
        logger.info(
            f"EPOCH: {epoch}/{EPOCHS} | "
            f"Time: {epoch_time_s:.4f}s | "
            f"Cost: ${cost_per_epoch:.5f} | "
            f"Tokens/s: {throughput_tokens_sec:.0f}"
        )

    job_total_time = time.time() - job_start_time

    # --- FINAL REPORT ---
    # Report the tokens actually processed (one batch per simulated epoch)
    tokens_processed = BATCH_SIZE * TOTAL_TOKENS_PER_SAMPLE * EPOCHS
    logger.info("--- JOB SUMMARY ---")
    logger.info(f"FINAL_COST: ${total_estimated_cost:.4f}")
    logger.info(f"TOTAL_TIME: {job_total_time:.2f}s")
    logger.info(f"TOTAL_TOKENS_PROCESSED: {tokens_processed}")
    logger.info("-------------------")


def main():
    logger = setup_logger()
    logger.info(f"Configuration: GPU Hourly Rate = ${GPU_HOURLY_RATE}/hr")

    # 1. Check for GPU availability
    if torch.cuda.is_available():
        device = torch.device("cuda")
        logger.info("GPU detected. Running GPU Benchmark.")
    else:
        device = torch.device("cpu")
        logger.warning("GPU not detected. Running CPU Benchmark.")

    run_training_benchmark(logger, device)


if __name__ == "__main__":
    main()
@@ -0,0 +1,20 @@
#!/bin/bash

ENV_NAME="cost_benchmark_env"
PYTHON_VERSION="3.12"

echo "--- Setting up Cost/Performance Benchmark Environment ---"

# 1. Create and Activate Stable VENV
rm -rf $ENV_NAME
python$PYTHON_VERSION -m venv $ENV_NAME
source $ENV_NAME/bin/activate

# 2. Install PyTorch (GPU version for CUDA 12)
# We need PyTorch for accurate CUDA timing events.
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

# 3. Install Helpers
pip install numpy pandas psutil

echo "✅ Environment setup complete."
@@ -0,0 +1,92 @@
# 📈 MLflow Experiment Tracking Template (GPU Ready)

## 🌟 Overview

This template provides a robust, reproducible framework for **tracking Deep Learning experiments** on GPU-accelerated hardware. It leverages **MLflow Tracking** to automatically log hyperparameters, model artifacts, and vital **GPU system utilization metrics** (memory, temperature, and usage) during the training process.

This system is essential for comparing model performance and hardware efficiency across different runs—a key capability for MLOps on platforms like **Saturn Cloud**.

### Key Features

* **GPU Readiness:** Dynamically detects and utilizes available CUDA devices.
* **Automatic Tracking:** Uses `mlflow.pytorch.autolog()` to capture hyperparameters and model architecture.
* **System Metrics:** Logs GPU/CPU usage and memory over time using `log_system_metrics=True` (see the sketch after this list).
* **Centralized UI:** Easy verification and comparison of runs via the **MLflow UI table**.
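
A minimal sketch of how these pieces typically fit together is shown below; the model and training loop are placeholders rather than the template's actual `train_and_track.py`, and system-metrics logging assumes a recent MLflow release with `psutil`/`nvidia-ml-py` available (installed by the setup script).

```python
import mlflow
import mlflow.pytorch
import torch
import torch.nn as nn

mlflow.pytorch.autolog()  # capture parameters/artifacts where supported

# log_system_metrics=True samples GPU/CPU utilization and memory during the run
with mlflow.start_run(run_name="gpu-demo", log_system_metrics=True):
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model = nn.Linear(512, 1).to(device)           # placeholder model
    optimizer = torch.optim.Adam(model.parameters())
    loss_fn = nn.MSELoss()

    mlflow.log_param("device", device.type)
    for step in range(100):
        x = torch.randn(32, 512, device=device)
        y = torch.randn(32, 1, device=device)
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        optimizer.step()
        if step % 10 == 0:
            mlflow.log_metric("loss", loss.item(), step=step)
```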

-----

## 🛠️ How to Run the Template

### 1\. Project Setup (Bash Script)

This script sets up a stable Python environment, installs PyTorch, MLflow, and the necessary GPU monitoring packages (`nvidia-ml-py`).

#### File: `setup_mlflow_env.sh`

#### Step A: Grant Execution Permission

In your terminal, grant executable permission to the setup script.

```bash
chmod +x setup_mlflow_env.sh
```

#### Step B: Execute the Setup

Run the script to install all dependencies.

```bash
./setup_mlflow_env.sh
```

-----

### 2\. Procedures (Execution & Monitoring)

#### Step C: Activate the Environment

You must do this every time you open a new terminal session.

```bash
source mlflow_gpu_env_stable/bin/activate
```

#### Step D: Configure Tracking Location

The template uses the environment variable `MLFLOW_TRACKING_URI` to determine where to log data; a short resolution sketch follows the table below.

| Mode | Configuration (Terminal Command) | Use Case |
| :--- | :--- | :--- |
| **Local (Default)** | (No command needed) | Development and testing where logs are written to the local `mlruns/` folder. |
| **Remote (Server)** | `export MLFLOW_TRACKING_URI="http://<server-ip-or-host>:5000"` | Production jobs requiring centralized, shared tracking (e.g., **Saturn Cloud Managed MLflow**). |
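
MLflow picks up `MLFLOW_TRACKING_URI` automatically, but if you prefer the script to make the choice explicit, a small resolution sketch (with an assumed local `mlruns/` fallback, matching the default above) looks like this:

```python
import os
import mlflow

# Use the remote server when MLFLOW_TRACKING_URI is set; otherwise fall back
# to the local ./mlruns file store (MLflow's default).
tracking_uri = os.environ.get("MLFLOW_TRACKING_URI", "file:./mlruns")
mlflow.set_tracking_uri(tracking_uri)
print(f"Logging runs to: {mlflow.get_tracking_uri()}")
```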

#### Step E: Run the Tracking Sample

Execute the main pipeline script (`train_and_track.py`).

```bash
python train_and_track.py
```

#### Step F: Verification (Checking Tracked Data)

* **Local UI Access:** If running locally, start the UI server:

  ```bash
  mlflow ui --host 0.0.0.0 --port 5000
  ```

  Then, access the exposed IP and port in your browser.
* **Remote UI Access:** Navigate to the host address of your remote tracking server. The **MLflow UI Table** will display the run, confirming successful logging of all parameters, metrics, and **GPU utilization**; a programmatic check is sketched below.
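
Besides the UI, you can sanity-check the logged data programmatically; the sketch below pulls the most recent runs from the active experiment with `mlflow.search_runs()`, which returns a pandas DataFrame.

```python
import mlflow

# One row per run, with logged params and metrics as columns.
runs = mlflow.search_runs(order_by=["start_time DESC"], max_results=5)
print(runs[["run_id", "status", "start_time"]])
```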

-----

## 4\. 🔗 Conclusion and Scaling on Saturn Cloud

This template successfully creates a fully observable training environment, fulfilling the core requirements of MLOps for GPU-accelerated workloads. All run details—from hyperparameters to **GPU utilization metrics**—are now centralized and ready for comparison.

To maximize performance, streamline infrastructure management, and integrate MLOps practices, deploy this template on **Saturn Cloud**:

* **Official Saturn Cloud Website:** [Saturn Cloud](https://saturncloud.io/)
* **MLOps Guide:** Saturn Cloud enables a robust MLOps lifecycle by simplifying infrastructure, scaling, and experiment tracking. [A Practical Guide to MLOps](https://saturncloud.io/docs/design-principles/concepts/mlops/)
* **GPU Clusters:** Easily provision and manage GPU-equipped compute resources, including high-performance NVIDIA A100/H100 GPUs, directly within **Saturn Cloud**. [Saturn Cloud Documentation](https://saturncloud.io/docs/user-guide/)

**Start building your scalable MLOps pipeline today on Saturn Cloud\!**