105 changes: 105 additions & 0 deletions examples/MLOps_Ops_and_Enterprise_Features/cost_benchmark/README.md
@@ -0,0 +1,105 @@
# 💰 Cost/Performance Benchmark

## 🌟 Overview

This template provides a crucial framework for **FinOps (Financial Operations)** by running a **Cost/Performance Benchmark** on deep learning tasks. It accurately measures the trade-off between speed and cost, providing data to answer the core question: *Which hardware configuration delivers the best performance per dollar?*

It uses a **custom Python logger** to record key metrics, generating a structured report that can be used to compare different machine types (e.g., A100 vs. V100, or CPU vs. GPU).

### Key Metrics Tracked

* **Cost/Epoch:** Estimated cost of each epoch, derived from the measured epoch time and the configured hourly rate.
* **Tokens/sec:** Raw speed/throughput of the hardware (a short sketch of both formulas follows this list).
* **Job Summary:** Provides total estimated cost and total execution time.
* **Hardware:** Tracks CPU vs. GPU execution path.
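
As a quick illustration of how those two headline numbers relate, the sketch below applies the same formulas used in `cost_benchmark.py`; the epoch time and hourly rate are illustrative placeholders, not measured values.

```python
# Minimal sketch of the Cost/Epoch and Tokens/sec formulas (illustrative numbers).
GPU_HOURLY_RATE = 3.20   # $/hour -- must match your actual instance pricing
BATCH_SIZE = 32
TOKENS_PER_SAMPLE = 100
epoch_time_s = 0.05      # placeholder; the real value comes from the timed run

cost_per_epoch = (epoch_time_s / 3600.0) * GPU_HOURLY_RATE
tokens_per_sec = (BATCH_SIZE * TOKENS_PER_SAMPLE) / epoch_time_s

print(f"Cost/Epoch: ${cost_per_epoch:.5f} | Tokens/s: {tokens_per_sec:.0f}")
# -> Cost/Epoch: $0.00004 | Tokens/s: 64000
```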

-----

## 🛠️ Implementation Details

### 1\. Project Setup (Bash Script)

Save the following as `setup_benchmark_env.sh`. This script creates a virtual environment and installs PyTorch along with the supporting packages.

```bash
#!/bin/bash

ENV_NAME="cost_benchmark_env"
PYTHON_VERSION="3.11"

echo "================================================="
echo "🚀 Setting up Cost/Performance Benchmark Environment"
echo "================================================="

# 1. Create and Activate Stable VENV
rm -rf $ENV_NAME
python$PYTHON_VERSION -m venv $ENV_NAME
source $ENV_NAME/bin/activate

# 2. Install PyTorch (Required for accurate CUDA event timing)
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

# 3. Install Helpers
pip install numpy pandas psutil

echo "--- Installation Complete ---"
```

#### Execution

1. **Grant Permission:** `chmod +x setup_benchmark_env.sh`
2. **Run Setup:** `./setup_benchmark_env.sh`

-----

### 2\. Procedures (Job Execution)

#### Step A: Activate the Environment

```bash
source cost_benchmark_env/bin/activate
```

#### Step B: Configure Pricing (CRITICAL)

Before running the script, you **must** update the `GPU_HOURLY_RATE` constant in `cost_benchmark.py` to reflect the actual hourly cost of the machine you are testing on Saturn Cloud.

```python
# --- Configuration & Constants in cost_benchmark.py ---
# UPDATE THIS VALUE MANUALLY based on your Saturn Cloud instance type
GPU_HOURLY_RATE = 3.20 # Example $/hour for a high-end GPU (must be updated manually)
```

#### Step C: Run the Benchmark

Execute the Python script (`cost_benchmark.py`).

```bash
python cost_benchmark.py
```

### 3\. Verification and Reporting

The script will generate structured output to the console and a persistent file named **`benchmark_results.log`**.

| Log Entry Example | Metric Significance |
| :--- | :--- |
| `Time: 0.0500s` | Raw speed (lower is better). |
| `Cost: $0.00004` | **Cost/Epoch** (lower is better for efficiency). |
| `Tokens/s: 6400` | **Throughput/Speed** (higher is better for performance). |

This log file serves as the definitive source for generating a comparative chart (Cost/Epoch vs. Tokens/sec) for optimal rightsizing.
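
As one possible way to build that comparison, the sketch below parses the per-epoch lines of `benchmark_results.log` into a pandas DataFrame (pandas is installed by the setup script); the regular expression simply mirrors the log format emitted by `cost_benchmark.py`.

```python
import re
import pandas as pd

# Matches lines such as:
# 2025-01-01 12:00:00,000 | INFO | EPOCH: 1/5 | Time: 0.0500s | Cost: $0.00004 | Tokens/s: 64000
PATTERN = re.compile(
    r"EPOCH: (\d+)/\d+ \| Time: ([\d.]+)s \| Cost: \$([\d.]+) \| Tokens/s: (\d+)"
)

rows = []
with open("benchmark_results.log") as f:
    for line in f:
        match = PATTERN.search(line)
        if match:
            epoch, time_s, cost, tokens = match.groups()
            rows.append({
                "epoch": int(epoch),
                "time_s": float(time_s),
                "cost_usd": float(cost),
                "tokens_per_sec": int(tokens),
            })

df = pd.DataFrame(rows)
print(df)
print(f"Mean cost/epoch: ${df['cost_usd'].mean():.5f} | "
      f"mean tokens/s: {df['tokens_per_sec'].mean():.0f}")
```

Run this once per machine type and concatenate the resulting DataFrames to chart Cost/Epoch against Tokens/sec.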

-----

## 4\. 🔗 Conclusion and Scaling on Saturn Cloud

The **Cost/Performance Benchmark** template is fundamental to the **Optimize** phase of the FinOps lifecycle. By quantifying the true expense of your speed, you can make data-driven decisions to reduce cloud waste.

To operationalize this benchmarking practice, **Saturn Cloud** offers the ideal platform:

* **FinOps Integration:** Saturn Cloud is an all-in-one solution for data science and MLOps, essential for implementing robust FinOps practices.
* **Rightsizing and Optimization:** Easily run this job on different GPU types within Saturn Cloud to determine the most cost-effective solution before deploying models to production. [Saturn Cloud MLOps Documentation](https://www.saturncloud.io/docs/design-principles/concepts/mlops/)
* **Building a Cost-Conscious Culture:** Integrate cost awareness directly into your MLOps pipeline, aligning technical performance with financial goals. [Saturn Cloud Homepage](https://saturncloud.io/)

**Optimize your cloud spend by deploying this template on Saturn Cloud\!**
@@ -0,0 +1,155 @@
import time
import torch
import torch.nn as nn
import torch.optim as optim
import logging
import sys

# --- Configuration & Constants ---
# Use the correct GPU pricing for your cloud provider (e.g., Saturn Cloud, AWS, GCP)
# Example: NVIDIA A100 pricing (approximate, for demonstration)
GPU_HOURLY_RATE = 3.20  # Example $/hour for a high-end GPU (must be updated manually)
LOG_FILE = "benchmark_results.log"

# Hyperparameters for the simulated job
EPOCHS = 5
BATCH_SIZE = 32
TOTAL_SAMPLES = 50000
TOTAL_TOKENS_PER_SAMPLE = 100  # Represents tokens in an NLP task or features in an image
TOTAL_TOKENS = TOTAL_SAMPLES * TOTAL_TOKENS_PER_SAMPLE

# --- Custom Logger Setup ---

def setup_logger():
    """Configures the logger to write structured output to a file."""
    # Create the logger object
    logger = logging.getLogger('BenchmarkLogger')
    logger.setLevel(logging.INFO)

    # Define a custom format that includes time and specific placeholders
    # We use a custom format to easily parse the final report later
    formatter = logging.Formatter(
        '%(asctime)s | %(levelname)s | %(message)s'
    )

    # File Handler
    file_handler = logging.FileHandler(LOG_FILE, mode='w')
    file_handler.setFormatter(formatter)
    logger.addHandler(file_handler)

    # Console Handler (for real-time feedback)
    stream_handler = logging.StreamHandler(sys.stdout)
    stream_handler.setFormatter(formatter)
    logger.addHandler(stream_handler)

    return logger

# --- Model & Timing Functions ---

class SimpleModel(nn.Module):
    def __init__(self, input_size, output_size):
        super().__init__()
        self.linear = nn.Linear(input_size, output_size)

    def forward(self, x):
        return self.linear(x)

def run_training_benchmark(logger, device):
    """Runs the simulated training job and logs cost/performance metrics."""
    logger.info(f"--- STARTING BENCHMARK ON {device.type.upper()} ---")

    # Configuration based on device
    INPUT_SIZE = 512
    OUTPUT_SIZE = 1

    # Model and Data Setup (on the target device)
    model = SimpleModel(INPUT_SIZE, OUTPUT_SIZE).to(device)
    dummy_input = torch.randn(BATCH_SIZE, INPUT_SIZE, device=device)
    dummy_target = torch.randn(BATCH_SIZE, OUTPUT_SIZE, device=device)
    optimizer = optim.Adam(model.parameters())
    criterion = nn.MSELoss()

    # Total estimated cost
    total_estimated_cost = 0.0

    # Synchronization is crucial for accurate GPU timing
    if device.type == 'cuda':
        # Warm-up run is necessary to avoid compilation time bias
        logger.info("Performing CUDA warm-up run...")
        _ = model(dummy_input)
        torch.cuda.synchronize()

    # Start timing the entire job
    job_start_time = time.time()

    for epoch in range(1, EPOCHS + 1):

        if device.type == 'cuda':
            # Use synchronized CUDA events for precise timing
            start_event = torch.cuda.Event(enable_timing=True)
            end_event = torch.cuda.Event(enable_timing=True)
            start_event.record()
        else:
            start_event = time.time()

        # --- Simulated Training Step ---
        optimizer.zero_grad()
        output = model(dummy_input)
        loss = criterion(output, dummy_target)
        loss.backward()
        optimizer.step()
        # --- End Simulated Training Step ---

        if device.type == 'cuda':
            end_event.record()
            torch.cuda.synchronize()  # Wait for GPU to finish
            # elapsed_time returns milliseconds, convert to seconds
            epoch_time_s = start_event.elapsed_time(end_event) / 1000.0
        else:
            epoch_time_s = time.time() - start_event

        # --- COST AND PERFORMANCE CALCULATION ---

        # 1. Cost Calculation
        cost_per_epoch = (epoch_time_s / 3600.0) * GPU_HOURLY_RATE
        total_estimated_cost += cost_per_epoch

        # 2. Performance Calculation (Throughput)
        throughput_samples_sec = BATCH_SIZE / epoch_time_s
        throughput_tokens_sec = (BATCH_SIZE * TOTAL_TOKENS_PER_SAMPLE) / epoch_time_s

        # --- LOGGING THE RESULTS ---
        logger.info(
            f"EPOCH: {epoch}/{EPOCHS} | "
            f"Time: {epoch_time_s:.4f}s | "
            f"Cost: ${cost_per_epoch:.5f} | "
            f"Tokens/s: {throughput_tokens_sec:.0f}"
        )

    job_total_time = time.time() - job_start_time

    # --- FINAL REPORT ---
    # Report the tokens actually processed (one batch per simulated epoch)
    tokens_processed = BATCH_SIZE * TOTAL_TOKENS_PER_SAMPLE * EPOCHS
    logger.info("--- JOB SUMMARY ---")
    logger.info(f"FINAL_COST: ${total_estimated_cost:.4f}")
    logger.info(f"TOTAL_TIME: {job_total_time:.2f}s")
    logger.info(f"TOTAL_TOKENS_PROCESSED: {tokens_processed}")
    logger.info("-------------------")


def main():
    logger = setup_logger()
    logger.info(f"Configuration: GPU Hourly Rate = ${GPU_HOURLY_RATE}/hr")

    # 1. Check for GPU availability
    if torch.cuda.is_available():
        device = torch.device("cuda")
        logger.info("GPU detected. Running GPU Benchmark.")
    else:
        device = torch.device("cpu")
        logger.warning("GPU not detected. Running CPU Benchmark.")

    run_training_benchmark(logger, device)


if __name__ == "__main__":
    main()
@@ -0,0 +1,20 @@
#!/bin/bash

ENV_NAME="cost_benchmark_env"
PYTHON_VERSION="3.12"

echo "--- Setting up Cost/Performance Benchmark Environment ---"

# 1. Create and Activate Stable VENV
rm -rf $ENV_NAME
python$PYTHON_VERSION -m venv $ENV_NAME
source $ENV_NAME/bin/activate

# 2. Install PyTorch (GPU version for CUDA 12)
# We need PyTorch for accurate CUDA timing events.
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

# 3. Install Helpers
pip install numpy pandas psutil

echo "✅ Environment setup complete."
@@ -0,0 +1,92 @@
# 📈 MLflow Experiment Tracking Template (GPU Ready)

## 🌟 Overview

This template provides a robust, reproducible framework for **tracking Deep Learning experiments** on GPU-accelerated hardware. It leverages **MLflow Tracking** to automatically log hyperparameters, model artifacts, and vital **GPU system utilization metrics** (memory, temperature, and usage) during the training process.

This system is essential for comparing model performance and hardware efficiency across different runs—a key capability for MLOps on platforms like **Saturn Cloud**.

### Key Features

* **GPU Readiness:** Dynamically detects and utilizes available CUDA devices.
* **Automatic Tracking:** Uses `mlflow.pytorch.autolog()` to capture hyperparameters and model architecture.
* **System Metrics:** Logs GPU/CPU usage and memory over time using `log_system_metrics=True` (see the sketch after this list).
* **Centralized UI:** Easy verification and comparison of runs via the **MLflow UI table**.
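
A minimal sketch of how these pieces typically fit together is shown below; the model and training loop are placeholders rather than the template's actual `train_and_track.py`, and system-metrics logging assumes a recent MLflow release with `psutil`/`nvidia-ml-py` available (installed by the setup script).

```python
import mlflow
import mlflow.pytorch
import torch
import torch.nn as nn

mlflow.pytorch.autolog()  # capture parameters/artifacts where supported

# log_system_metrics=True samples GPU/CPU utilization and memory during the run
with mlflow.start_run(run_name="gpu-demo", log_system_metrics=True):
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model = nn.Linear(512, 1).to(device)           # placeholder model
    optimizer = torch.optim.Adam(model.parameters())
    loss_fn = nn.MSELoss()

    mlflow.log_param("device", device.type)
    for step in range(100):
        x = torch.randn(32, 512, device=device)
        y = torch.randn(32, 1, device=device)
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        optimizer.step()
        if step % 10 == 0:
            mlflow.log_metric("loss", loss.item(), step=step)
```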

-----

## 🛠️ How to Run the Template

### 1\. Project Setup (Bash Script)

This script sets up a stable Python environment, installs PyTorch, MLflow, and the necessary GPU monitoring packages (`nvidia-ml-py`).

#### File: `setup_mlflow_env.sh`

#### Step A: Grant Execution Permission

In your terminal, grant executable permission to the setup script.

```bash
chmod +x setup_mlflow_env.sh
```

#### Step B: Execute the Setup

Run the script to install all dependencies.

```bash
./setup_mlflow_env.sh
```

-----

### 2\. Procedures (Execution & Monitoring)

#### Step C: Activate the Environment

You must do this every time you open a new terminal session.

```bash
source mlflow_gpu_env_stable/bin/activate
```

#### Step D: Configure Tracking Location

The template uses the environment variable `MLFLOW_TRACKING_URI` to determine where to log data; a short resolution sketch follows the table below.

| Mode | Configuration (Terminal Command) | Use Case |
| :--- | :--- | :--- |
| **Local (Default)** | (No command needed) | Development and testing where logs are written to the local `mlruns/` folder. |
| **Remote (Server)** | `export MLFLOW_TRACKING_URI="http://<server-ip-or-host>:5000"` | Production jobs requiring centralized, shared tracking (e.g., **Saturn Cloud Managed MLflow**). |
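
MLflow picks up `MLFLOW_TRACKING_URI` automatically, but if you prefer the script to make the choice explicit, a small resolution sketch (with an assumed local `mlruns/` fallback, matching the default above) looks like this:

```python
import os
import mlflow

# Use the remote server when MLFLOW_TRACKING_URI is set; otherwise fall back
# to the local ./mlruns file store (MLflow's default).
tracking_uri = os.environ.get("MLFLOW_TRACKING_URI", "file:./mlruns")
mlflow.set_tracking_uri(tracking_uri)
print(f"Logging runs to: {mlflow.get_tracking_uri()}")
```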

#### Step E: Run the Tracking Sample

Execute the main pipeline script (`train_and_track.py`).

```bash
python train_and_track.py
```

#### Step F: Verification (Checking Tracked Data)

* **Local UI Access:** If running locally, start the UI server:

  ```bash
  mlflow ui --host 0.0.0.0 --port 5000
  ```

  Then, access the exposed IP and port in your browser.
* **Remote UI Access:** Navigate to the host address of your remote tracking server. The **MLflow UI Table** will display the run, confirming successful logging of all parameters, metrics, and **GPU utilization**; a programmatic check is sketched below.
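
Besides the UI, you can sanity-check the logged data programmatically; the sketch below pulls the most recent runs from the active experiment with `mlflow.search_runs()`, which returns a pandas DataFrame.

```python
import mlflow

# One row per run, with logged params and metrics as columns.
runs = mlflow.search_runs(order_by=["start_time DESC"], max_results=5)
print(runs[["run_id", "status", "start_time"]])
```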

-----

## 4\. 🔗 Conclusion and Scaling on Saturn Cloud

This template successfully creates a fully observable training environment, fulfilling the core requirements of MLOps for GPU-accelerated workloads. All run details—from hyperparameters to **GPU utilization metrics**—are now centralized and ready for comparison.

To maximize performance, streamline infrastructure management, and integrate MLOps practices, deploy this template on **Saturn Cloud**:

* **Official Saturn Cloud Website:** [Saturn Cloud](https://saturncloud.io/)
* **MLOps Guide:** Saturn Cloud enables a robust MLOps lifecycle by simplifying infrastructure, scaling, and experiment tracking. [A Practical Guide to MLOps](https://saturncloud.io/docs/design-principles/concepts/mlops/)
* **GPU Clusters:** Easily provision and manage GPU-equipped compute resources, including high-performance NVIDIA A100/H100 GPUs, directly within **Saturn Cloud**. [Saturn Cloud Documentation](https://saturncloud.io/docs/user-guide/)

**Start building your scalable MLOps pipeline today on Saturn Cloud\!**