RunPod: Add reproducible pod setup, storage, and checkpoint-resume playbook #10

@georgepullen

Purpose

Make RunPod execution reproducible, interruption-safe, and auditable for this repo.

Mandatory Reading (blocking)

First comment must summarize:

  • reports/NL_IMPLEMENTATION_ORACLE.md, sections 6.3.3 and 6.3.4
  • docs/FSDP_SCALING_GUIDE.md
  • docs/release_checklist.md
  • docs/env_matrix.md

The first comment must also link the RunPod docs pages that were reviewed.

Required Code Anchors

  • scripts/compute/
  • training entrypoints (train.py, train_fsdp.py, train_deepspeed.py)
  • docs under docs/

Scope

  • Add a concrete RunPod playbook covering:
    • pod creation settings
    • persistent storage conventions
    • SSH and file transfer commands
    • checkpoint frequency guidance for spot and on-demand pods
    • forced stop/resume drill
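
As a starting point for the checkpoint and resume items above, here is a minimal sketch in Python. It assumes nothing about the repo's real checkpoint format: JSON stands in for whatever train.py actually saves, and every name (`checkpoint_interval_seconds`, `save_checkpoint`, `load_latest_checkpoint`, `run_steps`) is hypothetical. The interval helper uses Young's first-order approximation, a generic rule of thumb rather than guidance taken from this repo or from RunPod.

```python
import json
import math
import os
import tempfile
from pathlib import Path


def checkpoint_interval_seconds(save_cost_s: float, mean_time_between_interrupts_s: float) -> float:
    """Young's approximation: interval ~= sqrt(2 * save cost * mean time between interrupts)."""
    return math.sqrt(2.0 * save_cost_s * mean_time_between_interrupts_s)


def save_checkpoint(ckpt_dir: Path, step: int, state: dict) -> Path:
    """Write step_<N>.json atomically so a forced stop never leaves a half-written file."""
    ckpt_dir.mkdir(parents=True, exist_ok=True)
    final_path = ckpt_dir / f"step_{step:08d}.json"
    # Temp file lives in the same directory so os.replace stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=str(ckpt_dir), suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump({"step": step, "state": state}, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_name, final_path)  # atomic rename over the final name
    return final_path


def load_latest_checkpoint(ckpt_dir: Path):
    """Return the newest complete checkpoint as a dict, or None if there is none."""
    if not ckpt_dir.is_dir():
        return None
    # Zero-padded step numbers make lexicographic order match numeric order.
    candidates = sorted(ckpt_dir.glob("step_*.json"))
    if not candidates:
        return None
    with open(candidates[-1]) as f:
        return json.load(f)


def run_steps(ckpt_dir: Path, total_steps: int, every: int, stop_at=None) -> int:
    """Toy training loop: resume from the latest checkpoint, save every `every` steps.

    `stop_at` simulates a forced stop. A real loop would also restore model and
    optimizer state from the loaded checkpoint.
    """
    latest = load_latest_checkpoint(ckpt_dir)
    step = latest["step"] if latest else 0
    while step < total_steps:
        step += 1
        if step % every == 0:
            save_checkpoint(ckpt_dir, step, {"loss": 1.0 / step})
        if stop_at is not None and step == stop_at:
            return step  # simulated interruption
    return step
```

A drill with this sketch would call `run_steps` with `stop_at` set, then call it again without `stop_at` and confirm the second run starts from the last saved step, not from zero. On a pod, `ckpt_dir` would point at the persistent volume. The exact mount path is something the playbook has to pin down.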

Deliverables

Acceptance Criteria

  • A fresh pod can run the smoke test end-to-end using the docs alone.
  • Resume drill validated with evidence.
  • First issue comment contains mandatory reading summary.
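
The "evidence" for the resume drill is not specified here. One possible shape, sketched below in Python, is a small record that ties the checkpoint file (by hash) to the step at which the pod was stopped and the step the run resumed from. The field names and the `drill_evidence` helper are hypothetical, not an existing repo convention.

```python
import hashlib
import json
import time
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash a file in 1 MiB chunks so large checkpoints are not read into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def drill_evidence(checkpoint: Path, step_at_stop: int, step_resumed_from: int) -> dict:
    """Build a JSON-serializable record of one forced stop/resume drill."""
    return {
        "checkpoint_file": checkpoint.name,
        "checkpoint_sha256": sha256_of(checkpoint),
        "step_at_stop": step_at_stop,
        "step_resumed_from": step_resumed_from,
        # Work repeated after resume; bounded by the checkpoint interval.
        "steps_lost": step_at_stop - step_resumed_from,
        "recorded_at_utc": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
    }
```

Posting the output of `json.dumps(drill_evidence(...), indent=2)` in the issue, alongside the relevant log lines, would make the drill result easy to audit.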

Metadata

Assignees

No one assigned

Labels

  • enhancement: New feature or request
  • execution-board: Execution board ticket set for paper alignment
  • quality-gate: Has explicit acceptance criteria and test gates
  • runpod: RunPod infra and training execution tasks
