**File:** `examples/multimodal_ai/cpu-whisper/README.md` (new, +108 lines)
# 🎙️ Whisper Speech-to-Text: Saturn Cloud Template

This template provides a production-ready environment for deploying **OpenAI Whisper** for high-accuracy speech-to-text tasks. It is optimized for **Saturn Cloud** GPU/CPU resources, allowing for seamless scaling from single-file transcription to large-scale batch processing.

## 📋 Overview

* **Title**: cpu-whisper Speech-to-Text
* **Tech Stack**: Whisper AI, PyTorch, FFmpeg, Librosa, Matplotlib
* **Resource Type**: Saturn Cloud Deployment / Jupyter Server
* **Description**: Transcribes sample audio with Whisper and PyTorch, produces a waveform plot and a transcript file, and logs progress to the terminal.

---

## 🚀 Environment Setup

The environment configuration is automated via a dedicated setup script designed for the Saturn Cloud file system.

### 1. Initialize the Environment

Run the custom setup script to install system dependencies (FFmpeg), configure your Python environment, and install the Whisper library.

```bash
# Execute your pre-configured setup script
bash setup_saturn.sh

```

### 2. Activate the Environment

Once the script completes, ensure you are working within the correct virtual environment:

```bash
source whisper_env/bin/activate

```
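Optionally, run a quick sanity check once the environment is active. This is a hedged convenience check, not part of the template itself; it only assumes the packages installed by the setup script (Whisper pulls in PyTorch as a dependency):

```bash
# Should print the Whisper CLI usage and the installed PyTorch version
whisper --help
python -c "import torch, whisper; print(torch.__version__)"
```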

---

## 🧪 Testing & Verification

Your environment contains two primary test scripts to verify the full functionality of the pipeline.

### 1. Running `test.py` (Audio Acquisition)

This script verifies network connectivity and hardware detection. It automatically downloads a high-quality sample audio file from Hugging Face and transcribes it, printing the output to the terminal.

**Command:**

```bash
python test.py

```

**Terminal Output:**

* **Device Detection**: Shows `Testing on Device: CUDA` (or CPU).
* **Download Log**: Displays `Downloading sample audio...` followed by `Download complete.`.
* **Model Loading**: Shows a progress bar while the Whisper `base` model (~139 MB) downloads.
* **Transcription**: Prints a raw text block of the transcribed audio to the terminal.

### 2. Running `test2.py` (Visualization & Export)

This script tests the advanced features of the template, including waveform generation and local file processing.

**Command:**

```bash
python test2.py

```

**Terminal Output:**

* **Status**: `Loading model and transcribing...`.
* **Visualization Log**: `Generating waveform...` using Librosa and Matplotlib.
* **Success Message**: `Verification Complete: Check transcript.txt and waveform.png`.

---

## 📂 Expected Output Files

After running the tests, verify that these files appear in your workspace file browser:

* **`sample1.flac`**: The downloaded test audio.
* **`transcript.txt`**: The saved text version of the transcription.
* **`waveform.png`**: The visual representation of the audio waves.
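If you prefer the terminal over the file browser, a simple listing confirms the same outputs (a hypothetical one-liner, assuming you run it from the project directory):

```bash
ls -lh sample1.flac transcript.txt waveform.png
```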

---

## 📊 Model Selection Guide

Choose the model size that best fits your hardware constraints on Saturn Cloud.

| Model | Parameters | Required VRAM | Relative Speed |
| --- | --- | --- | --- |
| **Tiny** | 39 M | ~1 GB | ~10x |
| **Base** | 74 M | ~1 GB | ~7x |
| **Small** | 244 M | ~2 GB | ~4x |
| **Medium** | 769 M | ~5 GB | ~2x |
| **Large** | 1550 M | ~10 GB | 1x |
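Switching sizes only requires changing the model name passed to `whisper.load_model`. A minimal sketch, assuming `sample1.flac` has already been downloaded by `test.py`:

```python
import whisper

# Pick any of: "tiny", "base", "small", "medium", "large"
model = whisper.load_model("small")
result = model.transcribe("sample1.flac")
print(result["text"].strip())
```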

---

## 🔗 Reference Links

* **Platform**: [Saturn Cloud Dashboard](https://saturncloud.io/)
* **Support**: [Saturn Cloud Documentation](https://saturncloud.io/docs/)
* **Community**: [Whisper AI Discussions](https://github.com/openai/whisper/discussions)
**File:** `examples/multimodal_ai/cpu-whisper/setup_saturn.sh` (new, +48 lines)
#!/bin/bash

# Exit on any error
set -e

echo "--- 1. Environment Pre-flight Check ---"

# Update package list
echo "Updating system package repositories..."
sudo apt update -y

# Install FFmpeg and Python dependencies
echo "Installing FFmpeg and Python tools..."
sudo apt install -y ffmpeg python3-pip python3-venv

# Verify FFmpeg installation
if ffmpeg -version > /dev/null 2>&1; then
    echo "SUCCESS: FFmpeg is installed and ready."
else
    echo "ERROR: FFmpeg installation failed."
    exit 1
fi

echo "--- 2. Setting Up Python Environment ---"

# Create and activate a virtual environment
echo "Creating virtual environment: whisper_env..."
python3 -m venv whisper_env
source whisper_env/bin/activate

# Install Whisper along with the visualization dependencies
echo "Installing OpenAI Whisper, Librosa, and Matplotlib..."
pip install -U openai-whisper librosa matplotlib

# Verify Whisper installation
if whisper --help > /dev/null 2>&1; then
    echo "SUCCESS: Whisper AI is installed."
else
    echo "ERROR: Whisper AI installation failed."
    exit 1
fi

echo "--- Setup Complete ---"
echo "You can now run your transcription tests using 'whisper <audio_file>'."
**File:** `examples/multimodal_ai/cpu-whisper/test.py` (new, +40 lines)
import torch
import whisper
import os
import urllib.request

# 1. Hardware Detection
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Testing on Device: {device.upper()}")

# 2. Verified Stable Test Audio
# This is a sample1.flac file from Hugging Face spaces
audio_url = "https://huggingface.co/spaces/speechbox/whisper-restore-punctuation/resolve/main/sample1.flac"
audio_file = "sample1.flac"

try:
    if not os.path.exists(audio_file):
        print(f"Downloading sample audio from {audio_url}...")
        # Standard headers to ensure the server accepts the request
        req = urllib.request.Request(audio_url, headers={'User-Agent': 'Mozilla/5.0'})
        with urllib.request.urlopen(req) as response, open(audio_file, 'wb') as out_file:
            out_file.write(response.read())
        print("Download complete.")
except Exception as e:
    print(f"Error downloading audio: {e}")
    exit(1)

# 3. Load Model and Transcribe
print("Loading Whisper 'base' model...")
# The 'base' model requires ~1GB VRAM and is ~7x faster than the large model
model = whisper.load_model("base", device=device)

print("Starting transcription...")
# Ensure ffmpeg is installed as it is required for audio processing
result = model.transcribe(audio_file)

# 4. Final Output Verification
print("-" * 30)
print("TRANSCRIPT OUTPUT:")
print(result["text"].strip())
print("-" * 30)
**File:** `examples/multimodal_ai/cpu-whisper/test2.py` (new, +24 lines)
import whisper
import torch
import librosa
import librosa.display  # explicit import needed for waveshow on older librosa versions
import matplotlib.pyplot as plt

# Check hardware
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Running on: {device.upper()}")

# Load model and transcribe
model = whisper.load_model("base", device=device)
result = model.transcribe("sample1.flac")

# Export Transcript
with open("transcript.txt", "w") as f:
f.write(result["text"])

# Generate Waveform
y, sr = librosa.load("sample1.flac")
plt.figure(figsize=(10, 4))
librosa.display.waveshow(y, sr=sr)
plt.savefig("waveform.png")

print("Verification Complete: Check transcript.txt and waveform.png")
**File:** `examples/multimodal_ai/nvidia-video-rag/README.md` (new, +47 lines)
# 🎥 Video Q&A Pipeline (LangChain + Transformers)

A lightweight, modular pipeline that enables question-answering from video content using frame extraction, image captioning, semantic retrieval, and LLM-based response generation.

## 🚀 Features

* ✅ Frame extraction from video (OpenCV)
* 🧠 Image captioning using ViT-GPT2 (Hugging Face)
* 🔍 Semantic retrieval with ChromaDB + LangChain
* 🤖 Q&A using `flan-t5-small` (Text2Text pipeline)
* 💻 Works in CPU/GPU environments

## 📦 Dependencies

* `torch`, `transformers`, `opencv-python-headless`
* `langchain`, `langchain-community`, `langchain-huggingface`
* `sentence-transformers`, `chromadb`, `Pillow`


## 🧩 Pipeline Overview

```text
Video → Frames → Captions → Embeddings → ChromaDB → Retriever + LLM → Answer
```
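The notebook implements each stage; the sketch below is a hedged, condensed outline of the same flow. The video filename, the one-frame-per-second sampling, the `nlpconnect/vit-gpt2-image-captioning` and `all-MiniLM-L6-v2` checkpoints, and the `k=4` retrieval depth are illustrative assumptions, not fixed choices of the template.

```python
import cv2
from PIL import Image
from transformers import pipeline
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma

# 1. Extract roughly one frame per second from the video with OpenCV
cap = cv2.VideoCapture("video.mp4")  # assumed filename
fps = int(cap.get(cv2.CAP_PROP_FPS)) or 1
frames, i = [], 0
while True:
    ok, frame = cap.read()
    if not ok:
        break
    if i % fps == 0:
        frames.append(Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)))
    i += 1
cap.release()

# 2. Caption each frame with a ViT-GPT2 image-captioning model
captioner = pipeline("image-to-text", model="nlpconnect/vit-gpt2-image-captioning")
captions = [captioner(f)[0]["generated_text"] for f in frames]

# 3. Embed the captions and store them in ChromaDB
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
db = Chroma.from_texts(captions, embedding=embeddings)

# 4. Retrieve the most relevant captions and answer with flan-t5-small
question = "What is happening in the video?"
context = " ".join(d.page_content for d in db.similarity_search(question, k=4))
qa = pipeline("text2text-generation", model="google/flan-t5-small")
prompt = f"Answer the question based on the context.\nContext: {context}\nQuestion: {question}"
print(qa(prompt)[0]["generated_text"])
```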

## 🛠️ Usage

1. **Run in Jupyter**

2. **Open the notebook** and follow steps:

* 📥 Download video
* 🖼️ Extract frames
* 🧾 Generate captions
* 💾 Store in ChromaDB
* ❓ Ask questions via LLM

## 🧠 Example Questions

* What is happening in the video?
* What objects or people appear?
* Describe the main activity.

## ✅ Conclusion

This template provides a clean foundation for building **video understanding** applications using modern AI tooling. Extend it with your own videos, models, or use cases.
