
feat(skill): add skill for debug gradient flow in the pt expt backend #5280

Merged
iProzd merged 1 commit into deepmodeling:master from wanghan-iapcm:feat-pt-expt-debug-skill
Mar 3, 2026

Conversation

wanghan-iapcm (Collaborator) commented Mar 2, 2026

Summary by CodeRabbit

Release Notes

  • Documentation
    • Added comprehensive debugging guide for diagnosing gradient flow issues during model training, including step-by-step diagnostic methodology and common root causes.
    • Added diagnostic script with practical code examples and walkthroughs for isolating gradient behavior across different training modes.

@wanghan-iapcm wanghan-iapcm requested a review from iProzd March 2, 2026 09:40
@dosubot dosubot bot added the new feature label Mar 2, 2026
coderabbitai bot (Contributor) commented Mar 2, 2026

📝 Walkthrough

This PR adds comprehensive debugging documentation for diagnosing gradient flow issues in model training with compiled pathways. It includes a structured diagnostic guide and a reusable Python script for identifying and isolating gradient flow discrepancies between compiled and uncompiled training regimes.

Changes

Cohort / File(s): Gradient Flow Debugging Guide
.github/skills/debug-gradient-flow/SKILL.md, .github/skills/debug-gradient-flow/references/gradient-probe-script.md

Summary: New documentation providing a step-by-step methodology for diagnosing gradient flow issues. Includes per-component gradient isolation procedures, code snippets for a gradient-probing function, test scenarios (energy only, force only, virial only, all losses), a comparison framework between uncompiled/compiled paths, common root causes with fixes, and a complete reusable Python script for automated gradient analysis across different torch.compile backends.
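The core of the isolation technique described above — run a backward pass and record which parameters actually received gradients — can be sketched as a minimal PyTorch probe (an illustrative toy, not the script this PR adds):

```python
import torch


def probe_grads(model: torch.nn.Module, loss: torch.Tensor) -> dict[str, bool]:
    """Return, for each trainable named parameter, whether backward() produced a gradient."""
    model.zero_grad(set_to_none=True)  # reset grads to None so "no grad" is detectable
    loss.backward(retain_graph=True)
    return {
        name: p.grad is not None
        for name, p in model.named_parameters()
        if p.requires_grad
    }


# Toy model to illustrate: branch "b" does not participate in the loss,
# so its parameters should report no gradient.
net = torch.nn.ModuleDict({"a": torch.nn.Linear(4, 1), "b": torch.nn.Linear(4, 1)})
x = torch.randn(8, 4)
loss = net["a"](x).sum()  # only branch "a" feeds the loss
status = probe_grads(net, loss)
```

Comparing such `status` dicts between an uncompiled and a compiled run is exactly the kind of side-by-side check the guide's test scenarios are built around.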

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~5 minutes

Suggested labels

new feature

Suggested reviewers

  • iProzd
  • njzjz
🚥 Pre-merge checks | ✅ 3 passed
  • Description check: ✅ Passed — check skipped; CodeRabbit's high-level summary is enabled.
  • Title check: ✅ Passed — the title accurately describes the main change: adding a new skill for debugging gradient flow in the PyTorch experimental backend.
  • Docstring coverage: ✅ Passed — no functions found in the changed files to evaluate; docstring coverage check skipped.



chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f34c3c7080


coderabbitai bot (Contributor) left a comment

Actionable comments posted: 1

🧹 Nitpick comments (3)
.github/skills/debug-gradient-flow/references/gradient-probe-script.md (2)

227-254: Restore monkeypatch via try/finally around patch scope.

If any unexpected exception escapes the backend loop, _trace_and_compile may stay patched for the current process.

Proposed fix
         from deepmd.pt_expt.train import training as training_mod
 
         orig_trace = training_mod._trace_and_compile
-
-        for backend in ["eager", "aot_eager", "inductor"]:
+        try:
+            for backend in ["eager", "aot_eager", "inductor"]:
 
-            def patched(model, ec, ea, nl, mp, fp, ap, opts, _b=backend):
-                opts["backend"] = _b
-                return orig_trace(model, ec, ea, nl, mp, fp, ap, opts)
+                def patched(model, ec, ea, nl, mp, fp, ap, opts, _b=backend):
+                    opts["backend"] = _b
+                    return orig_trace(model, ec, ea, nl, mp, fp, ap, opts)
 
-            training_mod._trace_and_compile = patched
-            try:
-                config = make_config(data_dir, enable_compile=True)
-                config = update_deepmd_input(config, warning=False)
-                config = normalize(config)
-                trainer = get_trainer(config)
-                status, _ = run_and_get_grads(
-                    trainer, {"find_energy": 0.0, "find_virial": 0.0}
-                )
-                count = sum(1 for v in status.values() if v)
-                total = len(status)
-                print(f"  {backend:<12}: {count}/{total} params have force grad")
-                del trainer
-            except Exception as e:
-                print(f"  {backend:<12}: FAILED ({e})")
-
-        training_mod._trace_and_compile = orig_trace
+                training_mod._trace_and_compile = patched
+                try:
+                    config = make_config(data_dir, enable_compile=True)
+                    config = update_deepmd_input(config, warning=False)
+                    config = normalize(config)
+                    trainer = get_trainer(config)
+                    status, _ = run_and_get_grads(
+                        trainer, {"find_energy": 0.0, "find_virial": 0.0}
+                    )
+                    count = sum(1 for v in status.values() if v)
+                    total = len(status)
+                    print(f"  {backend:<12}: {count}/{total} params have force grad")
+                    del trainer
+                except Exception as e:
+                    print(f"  {backend:<12}: FAILED ({e})")
+        finally:
+            training_mod._trace_and_compile = orig_trace
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/skills/debug-gradient-flow/references/gradient-probe-script.md
around lines 227 - 254, The patch replaces training_mod._trace_and_compile with
patched inside a loop but restores it only after the loop, leaving the
monkeypatch in place if an exception escapes; wrap the per-backend patching and
test in a try/finally so that training_mod._trace_and_compile is always reset to
orig_trace even on error. Specifically, for each backend when assigning
training_mod._trace_and_compile = patched and running
get_trainer/run_and_get_grads, ensure you restore
training_mod._trace_and_compile = orig_trace in a finally block (referencing
orig_trace, patched, and the backend loop) so the original _trace_and_compile is
never left patched.
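The restore-in-finally pattern the comment asks for can be shown in a self-contained sketch; `training_mod` below is a stand-in namespace, not the real `deepmd.pt_expt.train.training` module, and the callable signatures are simplified:

```python
import types

# Stand-in for the module whose _trace_and_compile gets monkeypatched.
training_mod = types.SimpleNamespace(
    _trace_and_compile=lambda model, opts: ("compiled", opts)
)


def run_with_backends(backends):
    """Patch _trace_and_compile per backend, guaranteeing restoration on any exit path."""
    orig_trace = training_mod._trace_and_compile
    results = {}
    try:
        for backend in backends:
            def patched(model, opts, _b=backend):
                opts = dict(opts, backend=_b)  # force the backend under test
                return orig_trace(model, opts)

            training_mod._trace_and_compile = patched
            try:
                results[backend] = training_mod._trace_and_compile("model", {})
            except Exception as e:
                results[backend] = f"FAILED ({e})"
    finally:
        # Always restore, even if the loop itself raises unexpectedly.
        training_mod._trace_and_compile = orig_trace
    return results


orig = training_mod._trace_and_compile
out = run_with_backends(["eager", "inductor"])
```

The inner try/except keeps one failing backend from aborting the sweep, while the outer try/finally guarantees the original function is back in place afterwards.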

139-142: Avoid leaking temporary directories.

tempfile.mkdtemp() leaves artifacts behind. Use TemporaryDirectory() so cleanup is automatic.

Proposed fix
-    tmpdir = tempfile.mkdtemp(prefix="grad_probe_")
-    old_cwd = os.getcwd()
-    os.chdir(tmpdir)
-    try:
+    old_cwd = os.getcwd()
+    with tempfile.TemporaryDirectory(prefix="grad_probe_") as tmpdir:
+        os.chdir(tmpdir)
+        try:
             # ... existing body ...
-    finally:
-        os.chdir(old_cwd)
+        finally:
+            os.chdir(old_cwd)

Also applies to: 255-257

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/skills/debug-gradient-flow/references/gradient-probe-script.md
around lines 139 - 142, Replace use of tempfile.mkdtemp/tmpdir with
tempfile.TemporaryDirectory to ensure automatic cleanup: wrap the block that
sets old_cwd = os.getcwd(), os.chdir(tmpdir) and the subsequent try/finally in a
with tempfile.TemporaryDirectory(prefix="grad_probe_") as tmpdir: context, use
tmpdir for chdir, and restore cwd in finally; apply the same change for the
other occurrence that mirrors this pattern (the code around the second tmpdir
usage). Ensure you remove manual rmdir/cleanup since TemporaryDirectory handles
it.
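A minimal sketch of the suggested `TemporaryDirectory` pattern (the helper name `run_in_scratch_dir` is hypothetical, not from the PR):

```python
import os
import tempfile


def run_in_scratch_dir(fn):
    """Run fn inside a throwaway working directory; the cwd is restored and the
    directory deleted even if fn raises."""
    old_cwd = os.getcwd()
    with tempfile.TemporaryDirectory(prefix="grad_probe_") as tmpdir:
        os.chdir(tmpdir)
        try:
            return fn()
        finally:
            os.chdir(old_cwd)  # restore before the context manager deletes tmpdir


scratch = run_in_scratch_dir(os.getcwd)
```

Restoring the cwd inside `finally` matters: on some platforms `TemporaryDirectory` cannot remove a directory that is still the process's working directory.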
.github/skills/debug-gradient-flow/SKILL.md (1)

170-172: Use dtype-aware tolerance guidance instead of hardcoded 1e-10.

atol=1e-10, rtol=1e-10 is unrealistically strict for common float32 training paths and can create false negatives in this debugging workflow.

Proposed fix
-1. **Numerical consistency**: compiled model energy/force/virial should match uncompiled to float precision (`atol=1e-10, rtol=1e-10`)
+1. **Numerical consistency**: compiled model energy/force/virial should match uncompiled within dtype-appropriate tolerance (e.g. float64: `atol=1e-10, rtol=1e-10`; float32: `atol=1e-6, rtol=1e-5`)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/skills/debug-gradient-flow/SKILL.md around lines 170 - 172, The
"Numerical consistency" check currently hardcodes atol=1e-10 and rtol=1e-10
which is too strict for float32; change the guidance so tolerances are
dtype-aware (e.g., compute tolerances from np.finfo(dtype).eps or
jnp.finfo(dtype).eps with a small multiplier) instead of fixed values, and
document recommended defaults for float32/float64 (for example float32 ->
atol/rtol ~1e-5–1e-6, float64 -> ~1e-10) so the test for "Numerical consistency"
uses dtype-derived tolerances when comparing compiled vs uncompiled
energies/forces/virials and when asserting decreases for rmse_f / rmse_v.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.github/skills/debug-gradient-flow/SKILL.md:
- Around lines 78-84: Add a language specifier to the fenced code block that
contains the table so markdownlint MD040 is satisfied: update the opening fence
(the triple backticks that start the block containing "Uncompiled  Compiled" and
the rows "energy only", "force only", "virial only", "all losses") to include a
language like "text" (e.g., ```text) so the block is explicitly marked as plain
text.


ℹ️ Review info

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a3db25a and f34c3c7.

📒 Files selected for processing (2)
  • .github/skills/debug-gradient-flow/SKILL.md
  • .github/skills/debug-gradient-flow/references/gradient-probe-script.md

codecov bot commented Mar 2, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 82.12%. Comparing base (a3db25a) to head (f34c3c7).
⚠️ Report is 2 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #5280      +/-   ##
==========================================
- Coverage   82.12%   82.12%   -0.01%     
==========================================
  Files         753      753              
  Lines       75825    75826       +1     
  Branches     3648     3648              
==========================================
  Hits        62275    62275              
- Misses      12382    12384       +2     
+ Partials     1168     1167       -1     


@iProzd iProzd enabled auto-merge March 3, 2026 11:25
@iProzd iProzd added this pull request to the merge queue Mar 3, 2026
Merged via the queue into deepmodeling:master with commit b30bfcb Mar 3, 2026
73 checks passed
