diff --git a/.githubification/README.md b/.githubification/README.md new file mode 100644 index 0000000000..66f06f6724 --- /dev/null +++ b/.githubification/README.md @@ -0,0 +1,476 @@ +# Githubification Assessment: Serving NeMo Agent Toolkit from GitHub Workflows + +This document provides a detailed assessment of the possibilities—and limitations—of +migrating the main functionality of the NVIDIA NeMo Agent Toolkit (NAT) to run entirely +as a **GitHub-infrastructured application**, where GitHub Actions workflows serve as +the primary compute and orchestration layer. + +--- + +## 1. Overview of Current Architecture + +NeMo Agent Toolkit is an enterprise-grade Python platform for building, instrumenting, +evaluating, and optimizing AI agents across multiple frameworks. Its runtime involves: + +| Component | Technology | +|-----------|-----------| +| **Core Runtime** | Python 3.11–3.13, FastAPI + Uvicorn | +| **CLI** | `nat` command (workflow execution, configuration, evaluation) | +| **LLM Providers** | NVIDIA NIM, OpenAI, Azure OpenAI, HuggingFace, Ollama, LiteLLM | +| **Vector Databases** | Milvus, Pinecone, Weaviate, ChromaDB | +| **Data Stores** | Redis, MySQL, PostgreSQL, S3/MinIO, DuckDB | +| **Observability** | OpenTelemetry, Arize Phoenix, Weights & Biases Weave | +| **Protocols** | Model Context Protocol (MCP), Agent-to-Agent (A2A) | +| **Front-Ends** | FastAPI REST API, Console CLI, built-in Chat UI | +| **Auth** | OAuth 2.0, API keys, JWT, PKCE | +| **Packaging** | 30+ modular sub-packages, Docker container, PyPI wheels | + +The existing CI already uses GitHub Actions (`pr.yaml` → `ci_pipe.yml`) for linting, +testing across Python 3.11/3.12/3.13 on amd64/arm64, documentation builds, and wheel +packaging. A parallel GitLab CI pipeline adds integration tests with a full service +stack (Redis, MySQL, Milvus, MinIO, Phoenix, Langfuse, Piston, etc.). + +--- + +## 2. 
Functions That Map Well to GitHub Workflows + +### 2.1 CI/CD Pipeline (Already Implemented) + +**Feasibility: ✅ Fully feasible — already in place.** + +The existing `pr.yaml` and `ci_pipe.yml` workflows already demonstrate that GitHub +Actions can handle: + +- **Code quality checks** — linting via Ruff, formatting via YAPF, pre-commit hooks. +- **Unit tests** — pytest across a 3×2 matrix (3 Python versions × 2 architectures). +- **Documentation builds** — Sphinx-based doc generation and artifact upload. +- **Wheel packaging** — building and uploading distribution wheels for all 30+ packages. +- **Coverage reporting** — Codecov integration. + +### 2.2 Scheduled Evaluation Runs + +**Feasibility: ✅ Highly feasible.** + +NAT's evaluation system (`nvidia_nat_eval`) runs offline benchmarks against agent +workflows using configurable evaluators. These are batch jobs that: + +- Accept a workflow YAML configuration and a test dataset. +- Execute the agent, collect outputs, and score them. +- Produce JSON/XML evaluation reports. + +GitHub Actions `schedule` triggers (cron) could run nightly or weekly evaluation +sweeps. Results would be stored as workflow artifacts or committed back to the +repository as versioned reports. + +```yaml +on: + schedule: + - cron: '0 3 * * 1' # Weekly Monday 3 AM +jobs: + evaluate: + runs-on: ubuntu-latest + steps: + - uses: actions/checkout@v4 + - run: pip install "nvidia-nat[eval]" + - run: nat evaluate --config_file eval_config.yml + - uses: actions/upload-artifact@v4 + with: + name: eval-results + path: results/ +``` + +### 2.3 Prompt and Hyper-Parameter Optimization + +**Feasibility: ✅ Feasible with caveats (API keys, runtime limits).** + +The optimizer (`nat optimize`) performs iterative prompt tuning and hyper-parameter +search using Optuna. Each iteration calls an LLM API and evaluates the result. This +is compute-light but latency-bound (waiting on API responses). 
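As a hedged sketch of what a manually triggered optimization run could look like — the `nvidia-nat[opt]` extra, the exact `nat optimize` flags, and the `.optuna/` output path are illustrative assumptions, not verified toolkit APIs:

```yaml
name: prompt-optimization
on:
  workflow_dispatch:        # manual trigger from the Actions tab
jobs:
  optimize:
    runs-on: ubuntu-latest
    env:
      # LLM credentials come from GitHub Secrets, never from the repo
      NVIDIA_API_KEY: ${{ secrets.NVIDIA_API_KEY }}
    steps:
      - uses: actions/checkout@v4
      - run: pip install "nvidia-nat[opt]"             # extra name is an assumption
      - run: nat optimize --config_file optimize_config.yml
      - uses: actions/upload-artifact@v4               # persist optimization state between runs
        with:
          name: optuna-study
          path: .optuna/
```

Downloading the previous run's artifact at the start of the job would let successive runs resume the same Optuna study rather than starting the search from scratch.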
GitHub Actions supports up to **6 hours per job** on hosted runners (self-hosted
runners allow considerably longer jobs), which is sufficient for most optimization
runs. Matrix strategies could parallelize the search space across multiple jobs.

**Requirements:**
- LLM API keys stored as GitHub Secrets.
- Optimization state persisted across runs via artifacts or external storage.

### 2.4 Documentation Publishing

**Feasibility: ✅ Fully feasible.**

Documentation is already built in the CI pipeline. Adding a deployment step to
GitHub Pages is straightforward:

```yaml
- uses: actions/deploy-pages@v4
```

This would serve the full NAT documentation at
`https://<org>.github.io/<repo>/`.

### 2.5 Release Automation and Wheel Publishing

**Feasibility: ✅ Fully feasible.**

Wheels are already built in the CI pipeline. A release workflow triggered on Git tags
could:

1. Build all 30+ package wheels.
2. Publish them to PyPI via `twine` or the `pypa/gh-action-pypi-publish` action.
3. Create a GitHub Release with changelogs and attached artifacts.
4. Build and push Docker images to GHCR (GitHub Container Registry).

### 2.6 Security Scanning and Dependency Auditing

**Feasibility: ✅ Fully feasible.**

GitHub-native tools integrate directly:

- **Dependabot** for automated dependency updates across all 30+ packages.
- **CodeQL** for static analysis of the Python codebase.
- **Secret scanning** for accidental credential leaks.
- **`pip-audit`** run as a workflow step.

### 2.7 Agent Workflow Smoke Tests (Headless)

**Feasibility: ✅ Feasible for API-backed agents.**

Simple workflows that call external LLM APIs (e.g., the "Hello World" Wikipedia
example) can run end-to-end in a GitHub Actions runner:

```bash
nat run --config_file workflow.yml --input "List five subspecies of Aardvarks"
```

This only requires an API key (stored as a secret) and network access to the LLM
provider. No GPU or local model inference is needed.

---

## 3. 
Functions That Are Partially Feasible

### 3.1 Integration Testing with Service Dependencies

**Feasibility: ⚠️ Partially feasible — requires service containers.**

The GitLab CI configuration reveals that full integration testing depends on 12+
external services running simultaneously:

| Service | Purpose |
|---------|---------|
| Redis | Caching, session memory |
| MySQL | Relational data storage |
| PostgreSQL | Langfuse backend |
| MinIO (S3) | Object storage |
| Milvus | Vector database |
| etcd | Milvus coordination |
| Arize Phoenix | LLM observability |
| ClickHouse | Analytics database |
| Langfuse (server + worker) | LLM tracing platform |
| OpenSearch | Search/analytics engine |
| Piston | Code execution sandbox |
| OAuth2 Server | Authentication testing |

**GitHub Actions can run service containers** via the `services:` key in job
definitions. However:

- **Resource limits:** GitHub-hosted runners support service containers, but
  orchestrating 12+ services simultaneously may exhaust the memory of standard
  runners (16 GB RAM for `ubuntu-latest`).
- **Custom images:** Some services (Piston, OAuth2 server) use custom registry images
  (`$CI_REGISTRY_IMAGE/...`) that would need to be rebuilt and pushed to GHCR.
- **Startup ordering:** Complex dependency chains (e.g., Milvus → etcd + MinIO,
  Langfuse → PostgreSQL + ClickHouse + Redis) require careful health-check scripting.

**Mitigation strategies:**
- Use **larger runners** (`ubuntu-latest-16-cores` with 64 GB RAM) for integration tests.
- Split the integration test suite by service dependency into separate jobs.
- Fall back to **self-hosted runners** if the full stack exceeds even larger-runner capacity.

### 3.2 MCP and A2A Server Hosting

**Feasibility: ⚠️ Partially feasible — for testing only, not production hosting.**

NAT can serve tools and agents as MCP (Model Context Protocol) servers and A2A
(Agent-to-Agent) protocol endpoints. 
These are long-running FastAPI servers. + +GitHub Actions can start these servers within a job for **integration testing** +purposes (background process + test client in the same job). However, Actions +workflows are not suitable for **production hosting** of persistent servers because: + +- Jobs have a maximum runtime of 6 hours. +- There is no inbound network routing to workflow runners. +- Runners are ephemeral and stateless between runs. + +**For production MCP/A2A serving**, the recommendation is to use GitHub-adjacent +infrastructure (e.g., Azure Container Apps triggered by GitHub deployments). + +### 3.3 Fine-Tuning Orchestration + +**Feasibility: ⚠️ Partially feasible — orchestration only, not GPU compute.** + +NAT supports fine-tuning LLMs via: +- **DPO with NeMo Customizer** — calls the NVIDIA NeMo Customizer API. +- **GRPO with OpenPipe ART** — calls the OpenPipe API. + +The **orchestration** (preparing datasets, launching fine-tuning jobs, monitoring +progress, running post-training evaluations) can run in GitHub Actions. The actual +GPU-intensive training happens on remote infrastructure (NVIDIA cloud, OpenPipe). + +A workflow could: +1. Prepare training data from evaluation results. +2. Submit the fine-tuning job to the NeMo Customizer API. +3. Poll for completion (with `workflow_dispatch` for manual re-triggers). +4. Run evaluation against the newly fine-tuned model. +5. Open a PR with updated model configuration if results improve. + +### 3.4 Data Flywheel Automation + +**Feasibility: ⚠️ Partially feasible.** + +The data flywheel package (`nvidia_nat_data_flywheel`) collects runtime traces, +identifies failure patterns, and generates training data. In a GitHub-infrastructured +model: + +- **Scheduled workflows** could pull traces from an Elasticsearch/OpenSearch instance. +- **Processing and analysis** would run in the workflow job. +- **Output** (curated training datasets) would be stored as artifacts or pushed to S3. 
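A hedged sketch of that scheduled pipeline — every endpoint, secret name, helper script, and package extra below is an illustrative assumption, not a NAT API:

```yaml
name: data-flywheel
on:
  schedule:
    - cron: '0 4 * * *'     # nightly at 4 AM UTC
jobs:
  curate:
    runs-on: ubuntu-latest
    env:
      # Externally hosted observability backend; values held in GitHub Secrets
      OPENSEARCH_URL: ${{ secrets.OPENSEARCH_URL }}
      OPENSEARCH_TOKEN: ${{ secrets.OPENSEARCH_TOKEN }}
    steps:
      - uses: actions/checkout@v4
      - run: pip install "nvidia-nat[data-flywheel]"      # extra name is an assumption
      - run: python scripts/pull_traces.py --since 24h    # hypothetical helper script
      - run: python scripts/build_dataset.py --out datasets/
      - uses: actions/upload-artifact@v4
        with:
          name: curated-training-data
          path: datasets/
```

The same final step could instead push the curated datasets to S3 if they need to outlive GitHub's artifact retention window.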
The limitation is that the flywheel depends on a live observability backend to read
from, which must be hosted externally.

---

## 4. Functions That Are Not Feasible

### 4.1 Production Agent Serving

**Feasibility: ❌ Not feasible.**

NAT's core value proposition is running AI agents in production via FastAPI servers
that handle user requests in real time. This requires:

- **Persistent, low-latency HTTP endpoints** — GitHub Actions runners cannot serve
  inbound traffic and are ephemeral.
- **Stateful sessions** — agent conversations require session persistence across
  requests; runners are destroyed after each job.
- **Horizontal scaling** — production workloads need load balancing and auto-scaling,
  which Actions does not provide.
- **GPU access** — local model inference (Ollama, HuggingFace, Dynamo) requires GPU
  hardware not available on standard GitHub runners.

**Alternative:** Use GitHub Actions for **deployment automation** (build container →
push to registry → deploy to Kubernetes/Cloud Run/ECS), not for hosting the agent
itself.

### 4.2 Real-Time Observability and Monitoring

**Feasibility: ❌ Not feasible.**

NAT's observability stack (OpenTelemetry, Phoenix, Weave) requires continuously
running collectors and dashboards. These are long-lived services that:

- Ingest streaming telemetry data from running agents.
- Provide real-time dashboards and alerting.
- Store historical traces for analysis.

GitHub Actions is a batch job runner and cannot host persistent monitoring
infrastructure.

### 4.3 Built-In Chat UI

**Feasibility: ❌ Not feasible for interactive use.**

NAT provides a built-in chat interface served by FastAPI. This requires a running web
server accessible to users. GitHub Actions cannot serve this because:

- No inbound HTTP routing to runners.
- Jobs are time-limited and non-interactive from a user's perspective. 
+ +**Alternative:** Deploy the UI as a static site (if it's a SPA) to GitHub Pages with +an API backend on external infrastructure, or use GitHub Codespaces for a development +preview. + +### 4.4 Local/On-Premise Model Inference + +**Feasibility: ❌ Not feasible.** + +Running LLMs locally via Ollama, HuggingFace Transformers, or NVIDIA Dynamo requires: + +- GPU hardware (CUDA-capable). +- Large model weights (multi-GB downloads). +- Persistent model caches. + +GitHub-hosted runners do not provide GPU access. Even with self-hosted GPU runners, +the ephemeral nature of workflow jobs makes model caching inefficient. + +--- + +## 5. Proposed GitHub-Infrastructured Architecture + +Below is a vision for maximizing the use of GitHub infrastructure while acknowledging +its boundaries. + +### Tier 1: Fully on GitHub Actions + +| Function | Trigger | Implementation | +|----------|---------|----------------| +| CI/CD (lint, test, build) | `push`, `pull_request` | Already implemented via `pr.yaml`/`ci_pipe.yml` | +| Nightly evaluations | `schedule` (cron) | New workflow calling `nat evaluate` | +| Prompt optimization | `workflow_dispatch` | Manual trigger with input parameters | +| Documentation publishing | `push` to `main` | Build docs → deploy to GitHub Pages | +| Release & PyPI publishing | Tag push | Build wheels → publish to PyPI + GHCR | +| Security scanning | `push`, `schedule` | Dependabot, CodeQL, pip-audit | +| Agent smoke tests | `pull_request` | Run headless agent workflows with API keys | + +### Tier 2: GitHub Actions as Orchestrator + +| Function | Trigger | Implementation | +|----------|---------|----------------| +| Fine-tuning orchestration | `workflow_dispatch` | Submit jobs to NeMo Customizer/OpenPipe APIs | +| Data flywheel processing | `schedule` | Pull traces → process → store datasets | +| Integration testing | `push` | Service containers on larger runners | +| Container builds | `push` to `main` | Build Docker image → push to GHCR | + +### Tier 3: 
External Infrastructure (Deployed by GitHub Actions) + +| Function | Infrastructure | Deployment Trigger | +|----------|---------------|-------------------| +| Agent serving (FastAPI) | Azure Container Apps / AWS ECS / GKE | GitHub Actions CD workflow | +| MCP/A2A servers | Kubernetes | GitHub Actions CD workflow | +| Observability stack | Managed services (Datadog, Grafana Cloud) | GitHub Actions IaC (Terraform) | +| Chat UI | Cloud hosting or GitHub Pages (static) | GitHub Actions CD workflow | +| Redis/MySQL/PostgreSQL | Managed cloud databases | IaC via GitHub Actions | + +### Architecture Diagram + +``` +┌─────────────────────────────────────────────────────────┐ +│ GitHub Platform │ +│ │ +│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │ +│ │ GitHub │ │ GitHub │ │ GitHub │ │ +│ │ Actions │ │ Pages │ │ Packages │ │ +│ │ │ │ │ │ (GHCR) │ │ +│ │ • CI/CD │ │ • Docs │ │ • Docker │ │ +│ │ • Eval │ │ • Reports │ │ images │ │ +│ │ • Optimize │ │ │ │ • Wheels │ │ +│ │ • Deploy │ │ │ │ │ │ +│ └──────┬───────┘ └──────────────┘ └──────────────┘ │ +│ │ │ +│ ┌──────┴───────┐ ┌──────────────┐ ┌──────────────┐ │ +│ │ Secrets & │ │ Dependabot │ │ CodeQL │ │ +│ │ Variables │ │ │ │ Scanning │ │ +│ │ (API keys) │ │ │ │ │ │ +│ └──────────────┘ └──────────────┘ └──────────────┘ │ +└─────────────────────────┬───────────────────────────────┘ + │ Deploys to / calls + ▼ +┌─────────────────────────────────────────────────────────┐ +│ External Infrastructure │ +│ │ +│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │ +│ │ Cloud │ │ LLM APIs │ │ Managed │ │ +│ │ Compute │ │ │ │ Databases │ │ +│ │ │ │ • NVIDIA NIM│ │ │ │ +│ │ • Agent │ │ • OpenAI │ │ • Redis │ │ +│ │ servers │ │ • Azure │ │ • PostgreSQL│ │ +│ │ • MCP/A2A │ │ • Bedrock │ │ • Milvus │ │ +│ │ • Chat UI │ │ │ │ │ │ +│ └──────────────┘ └──────────────┘ └──────────────┘ │ +└─────────────────────────────────────────────────────────┘ +``` + +--- + +## 6. 
GitHub Actions Resource Constraints

Understanding GitHub's runner limits is essential for planning:

| Resource | GitHub-Hosted (Standard) | GitHub-Hosted (Larger) | Self-Hosted |
|----------|------------------------|----------------------|-------------|
| **vCPUs** | 4 | Up to 64 | Custom |
| **RAM** | 16 GB | Up to 256 GB | Custom |
| **Storage** | 14 GB SSD | Up to 2 TB | Custom |
| **Job timeout** | 6 hours | 6 hours | Configurable |
| **Workflow timeout** | 35 days | 35 days | Configurable |
| **Concurrent jobs** | 20 (free) / 500 (enterprise) | Same | Unlimited |
| **GPU** | Not available | Not available | Custom |
| **Network ingress** | Not routable | Not routable | Custom |

**Key implications for NAT:**
- The 12+ service integration test stack needs more memory than the 16 GB on
  standard runners → use larger runners.
- No GPU means no local model inference → must use API-based LLM providers.
- No inbound routing means no production serving → use external compute for agents.
- The 6-hour job limit is sufficient for evaluations and optimization runs.

---

## 7. Cost Estimation

For a typical month with active development:

| Activity | Runner Type | Minutes/Month | Est. Cost |
|----------|-----------|---------------|-----------|
| CI per PR (lint+test+docs+wheels) | Standard (Linux) | ~3,000 | ~$24 |
| Nightly evaluations | Standard | ~1,500 | ~$12 |
| Weekly optimization runs | Standard | ~600 | ~$5 |
| Integration tests (larger runner) | 16-core Linux | ~500 | ~$32 |
| Container builds | Standard | ~200 | ~$2 |
| **Total GitHub Actions** | | | **~$75/month** |

*Note: LLM API costs for evaluations and smoke tests are separate and depend on usage.*

---

## 8. Migration Recommendations

### Phase 1: Consolidate CI on GitHub Actions (Low Effort)

The GitHub Actions CI is already in place. The remaining work is:

1. 
**Port integration tests from GitLab CI** — Recreate the service container stack + in a GitHub Actions workflow using the `services:` key and larger runners. +2. **Add Dependabot configuration** — Enable automated dependency updates for all + 30+ packages. +3. **Add CodeQL scanning** — Enable static analysis for the Python codebase. + +### Phase 2: Add Automation Workflows (Medium Effort) + +4. **Nightly evaluation workflow** — Scheduled runs of the evaluation system with + results stored as artifacts. +5. **Release automation** — Tag-triggered workflow that builds wheels, publishes to + PyPI, builds Docker images, and creates GitHub Releases. +6. **Documentation deployment** — Publish docs to GitHub Pages on merge to `main`. + +### Phase 3: Orchestration via GitHub Actions (Higher Effort) + +7. **Fine-tuning orchestration** — Workflow that prepares data and submits training + jobs to external APIs, then runs evaluation on the resulting model. +8. **Data flywheel automation** — Scheduled workflow that processes observability + traces and produces training datasets. +9. **Deployment workflows** — CD pipelines that deploy agent servers, MCP endpoints, + and the chat UI to cloud infrastructure, triggered by GitHub releases. + +--- + +## 9. Conclusion + +**GitHub Actions can serve approximately 60–70% of NAT's operational needs**, covering +CI/CD, batch evaluation, optimization orchestration, release management, security +scanning, and documentation publishing. These functions align well with GitHub's +event-driven, batch-job execution model. + +**The remaining 30–40%—production agent serving, real-time observability, interactive +UI hosting, and GPU-based inference—require persistent, routable, and often +GPU-equipped infrastructure** that is fundamentally outside GitHub Actions' design. +However, GitHub Actions excels as the **orchestration and deployment layer** for +these external services, managing the full lifecycle from code change to production +deployment. 
+ +The recommended approach is a **hybrid model**: maximize GitHub's native capabilities +for all batch and event-driven workloads while using GitHub Actions as the deployment +control plane for external runtime infrastructure. This provides a unified developer +experience centered on GitHub while leveraging the right infrastructure for each +workload type.