Phase 4: Port paper-faithful online/per-layer path to FSDP with parity tests #9

@georgepullen

Description

Purpose

Bring distributed training (starting with FSDP) to paper-faithful parity with single-GPU online/per-layer behavior.

Mandatory Reading (blocking)

First comment must summarize:

  • reports/NL_IMPLEMENTATION_ORACLE.md sections 5.3 and 6.1.4
  • docs/PAPER_COMPLIANCE.md distributed caveats
  • train_fsdp.py
  • src/nested_learning/training.py distributed guards and online loop

Required Code Anchors

  • train_fsdp.py
  • train_dist.py
  • src/nested_learning/training.py
  • parity tests in tests/

Scope

  • Add FSDP support for:
    • online chunk updates
    • per-layer teach signals
  • Keep fail-fast behavior explicit for unsupported combinations.
  • Add single-GPU vs FSDP parity harness.
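One way to keep the fail-fast behavior explicit is a single guard that runs before any model is built or wrapped. The sketch below is illustrative only: `RunConfig`, `SUPPORTED`, `check_supported`, and the strategy names are hypothetical and are not taken from `train_fsdp.py` or `src/nested_learning/training.py`. The support matrix is a placeholder to be filled in from the actual guards in the repository.

```python
from dataclasses import dataclass


class UnsupportedCombinationError(ValueError):
    """Raised before training starts when a run config cannot be honored."""


@dataclass(frozen=True)
class RunConfig:
    # Hypothetical config shape; the real training config may differ.
    strategy: str  # e.g. "single", "fsdp", "ddp"
    online_chunk_updates: bool = False
    per_layer_teach_signals: bool = False


FEATURES = ("online_chunk_updates", "per_layer_teach_signals")

# Placeholder support matrix (an assumption, not the repository's actual state).
SUPPORTED = {
    "single": frozenset(FEATURES),
    "fsdp": frozenset(FEATURES),
    "ddp": frozenset(),
}


def check_supported(cfg: RunConfig) -> None:
    """Fail fast, naming every requested feature the strategy cannot honor."""
    if cfg.strategy not in SUPPORTED:
        raise UnsupportedCombinationError(f"unknown strategy: {cfg.strategy!r}")
    requested = {name for name in FEATURES if getattr(cfg, name)}
    unsupported = sorted(requested - SUPPORTED[cfg.strategy])
    if unsupported:
        raise UnsupportedCombinationError(
            f"strategy {cfg.strategy!r} does not support: {', '.join(unsupported)}"
        )
```

Calling `check_supported(cfg)` at the top of the entry point means an unsupported run stops with a named reason instead of silently falling back to a non-faithful path.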

Deliverables

  • Updated FSDP path.
  • Parity test script + report template.

Acceptance Criteria

  • 1k-step FSDP faithful run completes.
  • Parity drift against the single-GPU baseline stays within the defined tolerance.
  • First issue comment contains mandatory reading summary.
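The parity criterion could be checked with a small comparison over per-step losses logged by the two runs. This is a minimal sketch, assuming both runs log one scalar loss per step with identical seeds and data order. `check_parity` and `ParityResult` are hypothetical names, and the default tolerances are placeholders, since this issue does not yet define the tolerance.

```python
import math
from typing import NamedTuple, Optional, Sequence


class ParityResult(NamedTuple):
    passed: bool
    max_abs_drift: float
    first_failing_step: Optional[int]


def check_parity(
    baseline: Sequence[float],
    candidate: Sequence[float],
    rel_tol: float = 1e-3,  # placeholder tolerance, not defined by the issue
    abs_tol: float = 1e-5,  # placeholder tolerance, not defined by the issue
) -> ParityResult:
    """Compare per-step losses from a single-GPU run and an FSDP run."""
    if len(baseline) != len(candidate):
        raise ValueError(
            f"step count mismatch: {len(baseline)} vs {len(candidate)}"
        )
    max_abs = 0.0
    first_bad: Optional[int] = None
    for step, (ref, got) in enumerate(zip(baseline, candidate)):
        drift = abs(ref - got)
        # NaN drift is excluded from the max but still fails the isclose check.
        if not math.isnan(drift):
            max_abs = max(max_abs, drift)
        if first_bad is None and not math.isclose(
            ref, got, rel_tol=rel_tol, abs_tol=abs_tol
        ):
            first_bad = step
    return ParityResult(first_bad is None, max_abs, first_bad)
```

Reporting the first failing step alongside the maximum drift makes it easier to tell an early divergence (likely a logic difference) from slow accumulation of floating-point error over a long run.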

Metadata

Assignees

No one assigned

    Labels

    enhancement (New feature or request)
    execution-board (Execution board ticket set for paper alignment)
    phase-4 (Phase 4: distributed faithful path parity)
    quality-gate (Has explicit acceptance criteria and test gates)

    Projects

    No projects

    Relationships

    None yet

    Development

    No branches or pull requests
