feat(skill): add skill for debugging gradient flow in the pt_expt backend #5280
Conversation
📝 Walkthrough

This PR adds comprehensive debugging documentation for diagnosing gradient flow issues in model training with compiled pathways. It includes a structured diagnostic guide and a reusable Python script for identifying and isolating gradient flow discrepancies between compiled and uncompiled training regimes.
Estimated code review effort: 🎯 1 (Trivial) | ⏱️ ~5 minutes
🚥 Pre-merge checks: ✅ Passed checks (3 passed)
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: f34c3c7080
Actionable comments posted: 1
🧹 Nitpick comments (3)
.github/skills/debug-gradient-flow/references/gradient-probe-script.md (2)
**227-254**: Restore the monkeypatch via `try`/`finally` around the patch scope. If any unexpected exception escapes the backend loop, `_trace_and_compile` may stay patched for the current process.

Proposed fix
```diff
 from deepmd.pt_expt.train import training as training_mod

 orig_trace = training_mod._trace_and_compile
-
-for backend in ["eager", "aot_eager", "inductor"]:
+try:
+    for backend in ["eager", "aot_eager", "inductor"]:

-    def patched(model, ec, ea, nl, mp, fp, ap, opts, _b=backend):
-        opts["backend"] = _b
-        return orig_trace(model, ec, ea, nl, mp, fp, ap, opts)
+        def patched(model, ec, ea, nl, mp, fp, ap, opts, _b=backend):
+            opts["backend"] = _b
+            return orig_trace(model, ec, ea, nl, mp, fp, ap, opts)

-    training_mod._trace_and_compile = patched
-    try:
-        config = make_config(data_dir, enable_compile=True)
-        config = update_deepmd_input(config, warning=False)
-        config = normalize(config)
-        trainer = get_trainer(config)
-        status, _ = run_and_get_grads(
-            trainer, {"find_energy": 0.0, "find_virial": 0.0}
-        )
-        count = sum(1 for v in status.values() if v)
-        total = len(status)
-        print(f"  {backend:<12}: {count}/{total} params have force grad")
-        del trainer
-    except Exception as e:
-        print(f"  {backend:<12}: FAILED ({e})")
-
-training_mod._trace_and_compile = orig_trace
+        training_mod._trace_and_compile = patched
+        try:
+            config = make_config(data_dir, enable_compile=True)
+            config = update_deepmd_input(config, warning=False)
+            config = normalize(config)
+            trainer = get_trainer(config)
+            status, _ = run_and_get_grads(
+                trainer, {"find_energy": 0.0, "find_virial": 0.0}
+            )
+            count = sum(1 for v in status.values() if v)
+            total = len(status)
+            print(f"  {backend:<12}: {count}/{total} params have force grad")
+            del trainer
+        except Exception as e:
+            print(f"  {backend:<12}: FAILED ({e})")
+finally:
+    training_mod._trace_and_compile = orig_trace
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.github/skills/debug-gradient-flow/references/gradient-probe-script.md around lines 227 - 254, The patch replaces training_mod._trace_and_compile with patched inside a loop but restores it only after the loop, leaving the monkeypatch in place if an exception escapes; wrap the per-backend patching and test in a try/finally so that training_mod._trace_and_compile is always reset to orig_trace even on error. Specifically, for each backend when assigning training_mod._trace_and_compile = patched and running get_trainer/run_and_get_grads, ensure you restore training_mod._trace_and_compile = orig_trace in a finally block (referencing orig_trace, patched, and the backend loop) so the original _trace_and_compile is never left patched.
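The restore-on-error pattern the comment asks for can be sketched in a self-contained way; the `training_mod` stub and its one-argument `_trace_and_compile` below are illustrative stand-ins, not the real `deepmd.pt_expt` API:

```python
# Generic sketch of "patch inside a loop, restore in finally".
# training_mod here is a stub, not the real deepmd module.
import types

training_mod = types.SimpleNamespace(
    _trace_and_compile=lambda opts: opts.get("backend", "default")
)

orig_trace = training_mod._trace_and_compile
results = {}
try:
    for backend in ["eager", "aot_eager", "inductor"]:

        def patched(opts, _b=backend):
            # Force the backend, then delegate to the original function.
            opts["backend"] = _b
            return orig_trace(opts)

        training_mod._trace_and_compile = patched
        try:
            results[backend] = training_mod._trace_and_compile({})
        except Exception as e:
            results[backend] = f"FAILED ({e})"
finally:
    # Always runs, so the monkeypatch never outlives the loop.
    training_mod._trace_and_compile = orig_trace
```

Even if an exception escaped the `for` loop itself, the `finally` block would still reinstate `orig_trace`.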
**139-142**: Avoid leaking temporary directories.

`tempfile.mkdtemp()` leaves artifacts behind. Use `TemporaryDirectory()` so cleanup is automatic.

Proposed fix
```diff
-tmpdir = tempfile.mkdtemp(prefix="grad_probe_")
-old_cwd = os.getcwd()
-os.chdir(tmpdir)
-try:
+old_cwd = os.getcwd()
+with tempfile.TemporaryDirectory(prefix="grad_probe_") as tmpdir:
+    os.chdir(tmpdir)
+    try:
     # ... existing body ...
-finally:
-    os.chdir(old_cwd)
+    finally:
+        os.chdir(old_cwd)
```

Also applies to: 255-257
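A runnable sketch of the suggested pattern (the scratch-file body is a placeholder for the script's actual work), showing that the directory is removed automatically once the context exits:

```python
import os
import tempfile

old_cwd = os.getcwd()
with tempfile.TemporaryDirectory(prefix="grad_probe_") as tmpdir:
    os.chdir(tmpdir)
    try:
        # ... existing body: write scratch files into the temp dir ...
        with open("scratch.txt", "w") as f:
            f.write("probe")
    finally:
        # Restore cwd before the context manager deletes tmpdir.
        os.chdir(old_cwd)

# Cleanup happened automatically; no grad_probe_* directory is left behind.
```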
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.github/skills/debug-gradient-flow/references/gradient-probe-script.md around lines 139 - 142, Replace use of tempfile.mkdtemp/tmpdir with tempfile.TemporaryDirectory to ensure automatic cleanup: wrap the block that sets old_cwd = os.getcwd(), os.chdir(tmpdir) and the subsequent try/finally in a with tempfile.TemporaryDirectory(prefix="grad_probe_") as tmpdir: context, use tmpdir for chdir, and restore cwd in finally; apply the same change for the other occurrence that mirrors this pattern (the code around the second tmpdir usage). Ensure you remove manual rmdir/cleanup since TemporaryDirectory handles it.

.github/skills/debug-gradient-flow/SKILL.md (1)
**170-172**: Use dtype-aware tolerance guidance instead of hardcoded `1e-10`.

`atol=1e-10, rtol=1e-10` is unrealistically strict for common float32 training paths and can create false negatives in this debugging workflow.
Proposed fix

```diff
-1. **Numerical consistency**: compiled model energy/force/virial should match uncompiled to float precision (`atol=1e-10, rtol=1e-10`)
+1. **Numerical consistency**: compiled model energy/force/virial should match uncompiled within dtype-appropriate tolerance (e.g. float64: `atol=1e-10, rtol=1e-10`; float32: `atol=1e-6, rtol=1e-5`)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.github/skills/debug-gradient-flow/SKILL.md around lines 170 - 172, The "Numerical consistency" check currently hardcodes atol=1e-10 and rtol=1e-10 which is too strict for float32; change the guidance so tolerances are dtype-aware (e.g., compute tolerances from np.finfo(dtype).eps or jnp.finfo(dtype).eps with a small multiplier) instead of fixed values, and document recommended defaults for float32/float64 (for example float32 -> atol/rtol ~1e-5–1e-6, float64 -> ~1e-10) so the test for "Numerical consistency" uses dtype-derived tolerances when comparing compiled vs uncompiled energies/forces/virials and when asserting decreases for rmse_f / rmse_v.
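A helper along the lines the prompt suggests, deriving tolerances from machine epsilon; the `factor` multiplier of 100 is an arbitrary illustrative choice, to be tuned per workflow:

```python
import numpy as np

def dtype_tolerances(dtype, factor=100.0):
    """Return (atol, rtol) scaled from the dtype's machine epsilon.

    factor=100.0 is an illustrative default, not a recommendation
    from the skill doc itself.
    """
    eps = np.finfo(dtype).eps
    return factor * eps, factor * eps

atol32, rtol32 = dtype_tolerances(np.float32)  # eps ~1.2e-7, so ~1.2e-5
atol64, rtol64 = dtype_tolerances(np.float64)  # eps ~2.2e-16, so ~2.2e-14
```

This keeps a single comparison path working for both float32 and float64 training runs instead of maintaining two hardcoded tolerance sets.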
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In @.github/skills/debug-gradient-flow/SKILL.md:
- Around line 78-84: Add a language specifier to the fenced code block that
contains the table so markdownlint MD040 is satisfied: update the opening fence
(the triple backticks that start the block containing "Uncompiled Compiled" and
the rows "energy only", "force only", "virial only", "all losses") to include a
language like "text" (e.g., ```text) so the block is explicitly marked as plain
text.
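For reference, the MD040-compliant form of that fence looks like this (table content abbreviated):

````markdown
```text
             Uncompiled   Compiled
energy only      ...          ...
```
````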
ℹ️ Review info
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- .github/skills/debug-gradient-flow/SKILL.md
- .github/skills/debug-gradient-flow/references/gradient-probe-script.md
Codecov Report

✅ All modified and coverable lines are covered by tests.

Additional details and impacted files:

```text
@@            Coverage Diff             @@
##           master    #5280      +/-   ##
==========================================
- Coverage   82.12%   82.12%   -0.01%
==========================================
  Files         753      753
  Lines       75825    75826       +1
  Branches     3648     3648
==========================================
  Hits        62275    62275
- Misses      12382    12384       +2
+ Partials     1168     1167       -1
```

☔ View full report in Codecov by Sentry.