
feat(pt_expt): add dp freeze support for pt_expt backend (.pte/.pt2)#5299

Draft
wanghan-iapcm wants to merge 4 commits into deepmodeling:master from wanghan-iapcm:feat-pt-expt-dpfrz

Conversation

Collaborator

@wanghan-iapcm wanghan-iapcm commented Mar 9, 2026

Add freeze() function that loads a training checkpoint, reconstructs the model, serializes it, and exports via the existing deserialize_to_file pipeline. Wire the freeze command in the main() CLI dispatcher with checkpoint-dir resolution and automatic .pt2 suffix defaulting.
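The "automatic .pt2 suffix defaulting" mentioned above can be sketched as a small helper. This is a minimal illustration, not the PR's actual code; the function name `resolve_output` and the exact set of recognized suffixes are assumptions:

```python
from pathlib import Path

def resolve_output(output: str) -> str:
    # Hypothetical sketch: default to .pt2 when the requested output
    # file has no recognized pt_expt suffix (assumed to be .pt2/.pte).
    p = Path(output)
    if p.suffix not in (".pt2", ".pte"):
        p = p.with_suffix(".pt2")
    return str(p)
```

With this, `dp --pt_expt freeze -o frozen_model` would produce `frozen_model.pt2`, while an explicit `.pte` output is left untouched.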

Summary by CodeRabbit

Release Notes

  • New Features

    • Added support for .pt2 (AOTInductor) model format as an alternative to .pte for exporting frozen models.
    • Introduced freeze command to export PyTorch checkpoints directly to .pt2 or .pte formats.
    • Enhanced descriptor implementations for improved numerical stability across different execution paths.
  • Bug Fixes

    • Fixed neighbor list indexing issues in export scenarios with non-periodic boundary conditions.

Han Wang added 4 commits March 7, 2026 15:55
- add support for serialization to .pt2
- deep eval with .pt2
- C/C++ inference api with .pt2
- add ut for python inference
- add ut for c++/c inference: se_e2_a, fparam/aparam, multi-frames, dpa1
Replace flat (nf*nall,) indexing in dpa1.py and exclude_mask.py with
xp_take_along_axis and xp_take_first_n. The flat indexing creates
Ne(nall, nloc) assertions during torch.export tracing that fail when
nall == nloc (NoPbc). The new gather-based approach avoids these
shape constraints.
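A NumPy analogue of this change (the real code uses the array-API helpers `xp_take_along_axis`/`xp_take_first_n` on dpmodel arrays; the shapes below are illustrative):

```python
import numpy as np

nf, nloc, nall, nnei = 1, 4, 4, 3  # NoPbc case: nall == nloc
atype_ext = np.arange(nf * nall).reshape(nf, nall)
nlist = np.random.default_rng(0).integers(0, nall, size=(nf, nloc, nnei))

# Flat (nf*nall,) indexing: the cross-frame offset arithmetic is what
# torch.export turns into a Ne(nall, nloc) shape guard during tracing.
flat = atype_ext.reshape(-1)[
    (nlist + np.arange(nf)[:, None, None] * nall).reshape(-1)
].reshape(nf, nloc, nnei)

# Gather-based approach: a per-frame along-axis gather, no flattening
# across frames, so no relation between nall and nloc is asserted.
gathered = np.take_along_axis(
    np.broadcast_to(atype_ext[:, None, :], (nf, nloc, nall)), nlist, axis=2
)
```

Both paths produce the same values; only the gather-based one traces cleanly when `nall == nloc`.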

Unify PT/pt_expt test models so .pth and .pt2 are generated from the
same dpmodel weights (type_one_side=True) via gen scripts. Remove
deeppot_sea.pth, deeppot_dpa.pth, and fparam_aparam.pth from git
tracking — they are now regenerated in CI via convert-models.sh.

Changes:
- dpa1.py, exclude_mask.py: xp_take_along_axis replaces flat indexing
- gen_dpa1.py: generates deeppot_dpa1.pth + .pt2 from dpmodel
- gen_fparam_aparam.py: generates fparam_aparam.pth + .pt2 from dpmodel
- convert-models.sh: regenerate .pth from yaml and run gen scripts in CI
- test_deeppot_dpa_ptexpt.cc: add NoPbc test fixture
- test_deeppot_dpa1_pt.cc: new DPA1 .pth test (PBC + NoPbc)
- test_deeppot_a_fparam_aparam_pt.cc: update reference values
…ckends

Fix a latent bug in all 5 C++ backends (DeepPotPT, DeepPotPTExpt,
DeepPotJAX, DeepSpinPT, DeepPotPD) where the ghost-to-local mapping
was incorrectly indexed through fwd_map when virtual atoms are present.
The old code `mapping[ii] = lmp_list.mapping[fwd_map[ii]]` treats the
post-filter index as an original index, causing undefined behavior when
fwd_map contains -1 entries for filtered atoms. The fix uses bkw_map
to correctly translate: new_idx -> orig_idx -> mapped_orig -> new_mapped.
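The index translation described above, as a toy Python model (the data values are made up; the actual fix lives in the C++ backends):

```python
# Assumed conventions, following the commit message: fwd_map maps an
# original index to its post-filter index (-1 for filtered-out atoms),
# and bkw_map maps a post-filter index back to its original index.
orig_mapping = [0, 1, 0, 2, 1]   # ghost-to-local map over original indices
fwd_map = [0, -1, 1, 2, -1]      # atoms 1 and 4 filtered out (e.g. virtual)
bkw_map = [0, 2, 3]              # post-filter index -> original index

def new_mapping(ii):
    orig_idx = bkw_map[ii]                # new_idx -> orig_idx
    mapped_orig = orig_mapping[orig_idx]  # original ghost -> original local
    return fwd_map[mapped_orig]           # original local -> new local index
```

The old code, `orig_mapping[fwd_map[ii]]`, feeds a post-filter index into `fwd_map`; when that lookup yields -1 for a filtered atom, the subsequent indexing is undefined behavior in C++.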

Also fix a use-after-free in DeepPotPTExpt.cc where torch::from_blob
referenced a local vector's data after the vector went out of scope.

Additionally fix dpmodel code (dpa2.py, repformers.py, make_model.py,
nlist.py) to use xp_take_first_n and xp_expand_dims instead of slice
indexing that creates shape constraints incompatible with torch.export.

Add DPA2 C++ test files for both .pth and .pt2 backends with PBC,
NoPbc, and type_sel test cases. Reduce DPA2 test model size for
machine-precision agreement between jit and export paths.
Add freeze() function that loads a training checkpoint, reconstructs the
model, serializes it, and exports via the existing deserialize_to_file
pipeline. Wire the freeze command in the main() CLI dispatcher with
checkpoint-dir resolution and automatic .pt2 suffix defaulting.
@wanghan-iapcm wanghan-iapcm requested a review from njzjz March 9, 2026 03:27
@wanghan-iapcm wanghan-iapcm marked this pull request as draft March 9, 2026 03:27
@wanghan-iapcm
Collaborator Author

merge after #5298

@dosubot dosubot bot added the new feature label Mar 9, 2026

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 6145fe8d69


elif FLAGS.command == "freeze":
    if Path(FLAGS.checkpoint_folder).is_dir():
        checkpoint_path = Path(FLAGS.checkpoint_folder)
        latest_ckpt_file = (checkpoint_path / "checkpoint").read_text()


P1 Badge Handle pt_expt checkpoint dirs without checkpoint file

Reading checkpoint_path / "checkpoint" unconditionally breaks dp --pt_expt freeze -c <dir> for directories produced by the pt_expt trainer, because deepmd/pt_expt/train/training.py writes model.ckpt-<step>.pt plus a model.ckpt.pt symlink and does not create a checkpoint pointer file; in that common flow (including the default -c .), freeze raises FileNotFoundError instead of resolving the latest checkpoint.
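A defensive resolution along the lines this finding suggests could look like the sketch below. This is a hypothetical fix, not code from the PR; the fallback order and the `resolve_checkpoint` name are assumptions, though the `model.ckpt.pt` symlink and `model.ckpt-<step>.pt` naming follow the trainer behavior described above:

```python
from pathlib import Path

def resolve_checkpoint(folder: str) -> Path:
    # Prefer the TF-style "checkpoint" pointer file when present;
    # otherwise fall back to the pt_expt trainer's model.ckpt.pt
    # symlink, then to the most recent model.ckpt-<step>.pt file.
    d = Path(folder)
    pointer = d / "checkpoint"
    if pointer.is_file():
        # strip the trailing newline so the joined path exists
        return d / pointer.read_text().strip()
    link = d / "model.ckpt.pt"
    if link.exists():
        return link
    cands = sorted(d.glob("model.ckpt-*.pt"), key=lambda p: p.stat().st_mtime)
    if not cands:
        raise FileNotFoundError(f"no checkpoint found in {d}")
    return cands[-1]
```

This keeps the default `-c .` flow working for pt_expt checkpoint directories that never write a `checkpoint` pointer file.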


)
nf, nloc, nnei, _ = dmatrix.shape
atype = atype_ext[:, :nloc]
nall = atype_ext.shape[1]

Check notice

Code scanning / CodeQL

Unused local variable Note

Variable nall is not used.
atomic=True,
)

nloc = 6

Check notice

Code scanning / CodeQL

Unused local variable Note test

Variable nloc is not used.

coderabbitai bot commented Mar 9, 2026

📝 Walkthrough

This PR introduces PyTorch-exportable (.pt2) model format support alongside the existing .pte format, using AOTInductor for compiled model execution. It refactors descriptor array operations using new utility functions (xp_take_first_n, xp_take_along_axis), adds a new PyTorchExportable backend enum, implements full C++ and C APIs for inference, provides model serialization/deserialization with ZIP-based metadata, adds a CLI freeze command for model export, and includes comprehensive test coverage across Python, C++, and C integration layers.

Changes

Cohort / File(s) Summary
Backend and format recognition
deepmd/backend/pt_expt.py, source/api_cc/include/common.h, source/api_cc/src/common.cc, source/api_cc/src/DeepPot.cc
Added PyTorchExportable backend enum value, updated file extension recognition to detect .pt2 files, and integrated new backend into DeepPot initialization path alongside existing backends.
Array API helpers
deepmd/dpmodel/model/make_model.py
Introduced xp_take_along_axis and xp_take_first_n public exports for standardized array slicing operations; refactored neighbor-list formatting to use new helpers instead of direct array operations.
Descriptor refactoring (DPA1, DPA2, repformers)
deepmd/dpmodel/descriptor/dpa1.py, deepmd/dpmodel/descriptor/dpa2.py, deepmd/dpmodel/descriptor/repformers.py
Replaced explicit slicing with xp_take_first_n and xp_take_along_axis utilities; introduced 2D neighbor-list gathering to avoid flat indexing issues during export; refactored mapping construction using expand_dims approach.
Utility function refactoring
deepmd/dpmodel/utils/exclude_mask.py, deepmd/dpmodel/utils/nlist.py
Replaced direct array slicing with xp_take_first_n and xp_take_along_axis for consistent neighbor-type and coordinate gathering; reshaped gather logic to mitigate export-related issues.
Mapping index fixes across backends
source/api_cc/src/DeepPotJAX.cc, source/api_cc/src/DeepPotPD.cc, source/api_cc/src/DeepPotPT.cc, source/api_cc/src/DeepSpinPT.cc
Updated mapping construction to use backward/forward map reindexing pattern (fwd_map[lmp_list.mapping[bkw_map[ii]]]) for correct per-atom index translation across all PT-based backends.
C++ pt_expt backend implementation
source/api_cc/include/DeepPotPTExpt.h, source/api_cc/src/DeepPotPTExpt.cc
New DeepPotPTExpt class providing full compute API with AOTInductor integration, custom ZIP parsing for metadata extraction, type-sorted neighbor list building, frame-based and mixed-type computation pipelines, and comprehensive template instantiations for multiple numeric types.
Model serialization and export
deepmd/pt_expt/utils/serialization.py, deepmd/pt_expt/infer/deep_eval.py, deepmd/pt_expt/entrypoints/main.py
Added serialization support for .pt2 format with metadata embedding, new _trace_and_export and _deserialize_to_file_pt2 functions, pt2 file recognition in DeepEval with dual-path execution, and new CLI freeze command with output format enforcement.
C API test suite for pt_expt
source/api_c/tests/test_deeppot_a_ptexpt.cc, source/api_c/tests/test_deeppot_a_fparam_aparam_ptexpt.cc
New C API test fixtures for double/float precision inference validation against precomputed references; includes PBC, NoPBC, and parameter-variant test cases.
C++ API test suites for pt_expt
source/api_cc/tests/test_deeppot_ptexpt.cc, source/api_cc/tests/test_deeppot_a_fparam_aparam_ptexpt.cc, source/api_cc/tests/test_deeppot_a_fparam_aparam_nframes_ptexpt.cc, source/api_cc/tests/test_deeppot_dpa1_pt.cc, source/api_cc/tests/test_deeppot_dpa2_pt.cc, source/api_cc/tests/test_deeppot_dpa2_ptexpt.cc, source/api_cc/tests/test_deeppot_dpa_ptexpt.cc
Comprehensive templated test suites covering single/multi-frame inference, atomic outputs, LAMMPS-style neighbor lists, NoPBC scenarios, and cross-format validation; includes tolerance adjustments per value type.
PT test data updates and utilities
source/api_cc/tests/test_deeppot_a_fparam_aparam_pt.cc, source/api_cc/tests/test_utils.h
Updated expected reference values for pt backend tests; promoted test utility member visibility from private to protected for derived test classes.
Model generation and conversion utilities
source/tests/infer/convert-models.sh, source/tests/infer/gen_dpa1.py, source/tests/infer/gen_dpa2.py, source/tests/infer/gen_fparam_aparam.py
Added conversion scripts and Python generators for .pt2 and .pth model artifacts; includes PBC/NoPBC inference validation and cross-format consistency checks.
Python integration tests
source/tests/pt_expt/infer/test_deep_eval.py, source/tests/pt_expt/test_freeze.py
New test classes for pt_expt deep evaluation covering model metadata, ZIP file validation, multi-frame scenarios, and freeze CLI functionality; includes round-trip serialization and format cross-checks.

Sequence Diagram(s)

sequenceDiagram
    actor User
    participant CLI as Freeze Command
    participant Serializer as Serialization
    participant Tracer as PyTorch Tracer
    participant Compiler as AOTInductor
    participant Storage as File Storage

    User->>CLI: freeze(checkpoint_path, output.pt2)
    CLI->>CLI: Load checkpoint state_dict
    CLI->>CLI: Reconstruct model from params
    CLI->>Serializer: Collect metadata and serialize
    Serializer->>Tracer: Create sample inputs
    Tracer->>Tracer: Trace model forward pass
    Tracer->>Compiler: Export with dynamic shapes
    Compiler->>Compiler: Compile to AOTInductor format
    Compiler->>Storage: Package model + metadata + extra/
    Storage-->>User: Output .pt2 file (ZIP archive)
sequenceDiagram
    actor User
    participant DeepPot as DeepPot Loader
    participant Format as Format Detector
    participant PTExpt as DeepPotPTExpt
    participant Loader as AOTILoader
    participant Model as Compiled Model
    participant Extractor as Output Extractor

    User->>DeepPot: Load model file (.pt2)
    DeepPot->>Format: Detect file type
    Format-->>DeepPot: PyTorchExportable (.pt2)
    DeepPot->>PTExpt: Create DeepPotPTExpt instance
    PTExpt->>PTExpt: Read metadata from ZIP
    PTExpt->>Loader: Initialize AOTIModelPackageLoader
    Loader-->>PTExpt: Package loaded
    User->>PTExpt: compute(coord, atype, box, ...)
    PTExpt->>PTExpt: Build tensors and neighbor lists
    PTExpt->>Model: run_model(inputs)
    Model-->>PTExpt: Raw flat outputs as tensors
    PTExpt->>Extractor: extract_outputs(flat_outputs)
    Extractor-->>PTExpt: Keyed output map
    PTExpt-->>User: energy, force, virial, ...
sequenceDiagram
    participant Descriptor as DPA1/DPA2 Descriptor
    participant ArrayAPI as Array API
    participant Embedding as Embedding Layer
    participant Gathering as Gather Operations

    Descriptor->>ArrayAPI: xp_take_first_n(atype_embd_ext, 1, nloc)
    ArrayAPI->>Gathering: First-n selection along axis
    Gathering-->>Descriptor: atype_embd (local atoms only)
    Descriptor->>ArrayAPI: xp_take_along_axis(atype_embd_nlist, indices_2d)
    ArrayAPI->>Gathering: 2D along-axis gather
    Gathering-->>Descriptor: atype_embd_nlist (neighbor embeddings)
    Descriptor->>Descriptor: Reshape for concatenation
    Descriptor->>Embedding: Concat embeddings for attention
    Embedding-->>Descriptor: Combined features

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

Possibly related PRs

Suggested labels

new feature, PyTorch, C++, export

Suggested reviewers

  • iProzd
  • njzjz
🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 18.75% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The pull request title clearly and specifically describes the main change: adding freeze support for the pt_expt backend with .pte/.pt2 format support.



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 16

🧹 Nitpick comments (7)
source/tests/infer/gen_dpa1.py (1)

143-164: Prefix unused unpacked variables with underscore for ruff compliance.

Static analysis flags several unused variables from the dp.eval() return tuples. Per coding guidelines, prefix them with _ to pass ruff check.

♻️ Proposed fix
-    e1, f1, v1, ae1, av1 = dp.eval(coord, box, atype, atomic=True)
+    e1, f1, _v1, ae1, av1 = dp.eval(coord, box, atype, atomic=True)
     print(f"\n// PBC total energy: {e1[0, 0]:.18e}")  # noqa: T201
     print_cpp_values("PBC reference values", ae1, f1, av1)

     # ---- 5. Run inference for NoPbc test ----
-    e_np, f_np, v_np, ae_np, av_np = dp.eval(coord, None, atype, atomic=True)
+    e_np, f_np, _v_np, ae_np, av_np = dp.eval(coord, None, atype, atomic=True)
     print(f"\n// NoPbc total energy: {e_np[0, 0]:.18e}")  # noqa: T201
     print_cpp_values("NoPbc reference values", ae_np, f_np, av_np)

     # ---- 6. Verify .pth gives same results ----
     dp_pth = DeepPot(pth_path)
-    e_pth, f_pth, v_pth, ae_pth, av_pth = dp_pth.eval(coord, box, atype, atomic=True)
+    e_pth, f_pth, _v_pth, _ae_pth, _av_pth = dp_pth.eval(coord, box, atype, atomic=True)
     print(f"\n// .pth PBC total energy: {e_pth[0, 0]:.18e}")  # noqa: T201
     print(f"// .pth vs .pt2 energy diff: {abs(e1[0, 0] - e_pth[0, 0]):.2e}")  # noqa: T201
     print(f"// .pth vs .pt2 force max diff: {np.max(np.abs(f1 - f_pth)):.2e}")  # noqa: T201

-    e_pth_np, f_pth_np, _, ae_pth_np, av_pth_np = dp_pth.eval(
+    e_pth_np, _f_pth_np, _, _ae_pth_np, _av_pth_np = dp_pth.eval(
         coord, None, atype, atomic=True
     )

As per coding guidelines: **/*.py: Always run ruff check . and ruff format . before committing changes or CI will fail.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@source/tests/infer/gen_dpa1.py` around lines 143 - 164, The tmp variables
returned by dp.eval/dp_pth.eval that are unused (v1, v_np, v_pth) should be
prefixed with an underscore (e.g., rename v1 -> _v1, v_np -> _v_np, v_pth ->
_v_pth) to satisfy ruff; update the unpacking in the calls to DeepPot.eval (the
tuples assigned to e1,f1,v1,ae1,av1; e_np,f_np,v_np,ae_np,av_np; and
e_pth,f_pth,v_pth,ae_pth,av_pth) accordingly and then run ruff check . and ruff
format . before committing.
source/tests/pt_expt/infer/test_deep_eval.py (1)

490-791: Mirror the dynamic-shape regression for .pt2 as well.

This new suite validates fixed-size .pt2 exports, but the refactor in this PR is specifically about avoiding torch.export shape constraints. Porting the existing dynamic-shape check here would cover the main failure mode this change is targeting.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@source/tests/pt_expt/infer/test_deep_eval.py` around lines 490 - 791, The
test class TestDeepEvalEnerPt2 is only exercising fixed-size .pt2 exports; add a
dynamic-shape regression equivalent to the existing dynamic-shape check (used
for .pte) to ensure .pt2 DeepPot.eval accepts variable frame counts and avoids
torch.export shape constraints—modify TestDeepEvalEnerPt2 to include a new test
(e.g., test_pt2_dynamic_shape) that mirrors the logic used in the .pte
dynamic-shape test: generate variable nframes (1,3,5...), create coords and
cells accordingly, call self.dp.eval(...) with atomic=True and compare outputs
(energy, force, virial, atom_energy, atom_virial) across frames to either the
direct model forward (self.model.forward) or the .pte reference
(self.dp_pte.eval) using np.testing.assert_allclose with tight rtol/atol; ensure
the test uses the same RNG seed pattern and atom_types as other tests so it
targets the torch.export shape issue.
source/api_cc/src/DeepPotPTExpt.cc (5)

805-808: Redundant resize() before assign() on datom_energy.

Line 805 resizes datom_energy to nall_real elements, then line 806-808 immediately reassigns it using assign(), which discards the previous resize. The resize() call is unnecessary.

♻️ Proposed fix
-    datom_energy.resize(nall_real, 0.0);
     datom_energy.assign(
         cpu_atom_energy_.data_ptr<VALUETYPE>(),
         cpu_atom_energy_.data_ptr<VALUETYPE>() + cpu_atom_energy_.numel());
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@source/api_cc/src/DeepPotPTExpt.cc` around lines 805 - 808, Remove the
unnecessary pre-sizing: the call to datom_energy.resize(nall_real, 0.0) is
redundant because datom_energy is immediately overwritten by
datom_energy.assign(...). Delete the resize() invocation and keep the
datom_energy.assign(...) that copies from cpu_atom_energy_.data_ptr<VALUETYPE>()
to correctly set the container contents.

526-535: Redundant catch (...) { throw; } block.

The catch block on lines 532-534 catches all exceptions and immediately rethrows them without any processing. This is redundant and can be removed.

♻️ Proposed fix
 DeepPotPTExpt::DeepPotPTExpt(const std::string& model,
                              const int& gpu_rank,
                              const std::string& file_content)
     : inited(false) {
-  try {
-    translate_error([&] { init(model, gpu_rank, file_content); });
-  } catch (...) {
-    throw;
-  }
+  translate_error([&] { init(model, gpu_rank, file_content); });
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@source/api_cc/src/DeepPotPTExpt.cc` around lines 526 - 535, The DeepPotPTExpt
constructor contains a redundant catch-all rethrow (catch (...) { throw; })
after calling translate_error which serves no purpose; remove the catch (...)
block so the constructor simply calls translate_error([&]{ init(model, gpu_rank,
file_content); }); and lets exceptions propagate naturally—update the
DeepPotPTExpt::DeepPotPTExpt implementation accordingly, leaving translate_error
and init unchanged.

1124-1129: Minor: get_type_map appends trailing space.

The function appends a space after each type name, resulting in a trailing space in the output string. This may be intentional for compatibility but could cause issues if the caller expects clean output.

♻️ Proposed fix if trailing space is unintended
 void DeepPotPTExpt::get_type_map(std::string& type_map_str) {
-  for (const auto& t : type_map) {
-    type_map_str += t;
-    type_map_str += " ";
+  for (size_t i = 0; i < type_map.size(); ++i) {
+    if (i > 0) {
+      type_map_str += " ";
+    }
+    type_map_str += type_map[i];
   }
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@source/api_cc/src/DeepPotPTExpt.cc` around lines 1124 - 1129,
DeepPotPTExpt::get_type_map currently appends a space after every entry in
type_map which leaves a trailing space in type_map_str; change the logic to only
insert spaces between entries (e.g., iterate with an index or detect first/last
element) so type_map_str contains the type names separated by single spaces with
no trailing space, preserving the function signature and using the existing
type_map and type_map_str variables.

158-174: parse_bool() and parse_null() lack validation of actual content.

parse_bool() only checks if the input starts with "true" but assumes anything else is "false" without verifying the actual characters. Similarly, parse_null() blindly advances 4 positions without confirming the input is actually "null".

Since this parses trusted model metadata, this is low risk, but adding validation would make it more robust.

♻️ Proposed fix to add validation
   JsonValue parse_bool() {
     JsonValue v;
     v.type = JsonValue::Bool;
     if (s_.substr(pos_, 4) == "true") {
       v.bool_val = true;
       pos_ += 4;
-    } else {
+    } else if (s_.substr(pos_, 5) == "false") {
       v.bool_val = false;
       pos_ += 5;
+    } else {
+      throw deepmd::deepmd_exception("Invalid JSON boolean");
     }
     return v;
   }

   JsonValue parse_null() {
+    if (s_.substr(pos_, 4) != "null") {
+      throw deepmd::deepmd_exception("Invalid JSON null");
+    }
     pos_ += 4;
     return JsonValue();
   }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@source/api_cc/src/DeepPotPTExpt.cc` around lines 158 - 174, parse_bool() and
parse_null() should validate the actual input before advancing pos_ and
returning a JsonValue: check that s_.substr(pos_, ...) matches "true" or "false"
(and that there are enough characters remaining) and only then set v.bool_val
and advance pos_ by 4 or 5; if neither matches, handle the parse error (e.g.,
throw/return an error state). Likewise, in parse_null() verify s_.substr(pos_,4)
== "null" and enough length before advancing pos_ by 4 and returning
JsonValue(), otherwise handle the parse error. Use the existing symbols
parse_bool, parse_null, JsonValue, s_, and pos_ to locate and update the logic.

601-614: Silent exception swallowing when setting thread counts.

The empty catch (...) blocks on lines 606-607 and 612-613 silently discard any exceptions from at::set_num_interop_threads() and at::set_num_threads(). This makes debugging thread configuration issues difficult.

Consider logging a warning when thread configuration fails.

♻️ Proposed fix to add warning logging
   if (num_inter_nthreads) {
     try {
       at::set_num_interop_threads(num_inter_nthreads);
     } catch (...) {
+      std::cerr << "WARNING: Failed to set inter-op threads to "
+                << num_inter_nthreads << std::endl;
     }
   }
   if (num_intra_nthreads) {
     try {
       at::set_num_threads(num_intra_nthreads);
     } catch (...) {
+      std::cerr << "WARNING: Failed to set intra-op threads to "
+                << num_intra_nthreads << std::endl;
     }
   }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@source/api_cc/src/DeepPotPTExpt.cc` around lines 601 - 614, The code
currently swallows exceptions from at::set_num_interop_threads and
at::set_num_threads via empty catch(...) blocks; replace these with try/catch
that logs a warning: catch(const std::exception& e) should log a clear warning
including e.what() and the function name (at::set_num_interop_threads or
at::set_num_threads) and the attempted thread count, and retain a catch(...)
that logs an "unknown exception" warning if non-std exceptions occur; locate
these changes around get_env_nthreads handling and ensure the warning uses your
project's logging mechanism (or std::cerr if no logger is available).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@deepmd/dpmodel/descriptor/dpa1.py`:
- Line 1060: Remove the unused local assignment "nall = atype_ext.shape[1]" from
dpa1.py (the dead variable 'nall' referenced in the review) so the F841 lint
error is resolved; locate the assignment near the code that references
'atype_ext' and delete that single line, then run "ruff check ." and "ruff
format ." to ensure no other lint errors remain.

In `@deepmd/pt_expt/entrypoints/main.py`:
- Around line 247-250: The code reads the checkpoint pointer into
latest_ckpt_file with Path(...).read_text() which can include a trailing
newline, causing FLAGS.model to point to a non-existent path; update the logic
that sets latest_ckpt_file (used when FLAGS.checkpoint_folder is a directory) to
strip whitespace/newlines (e.g., .strip()) before calling
checkpoint_path.joinpath(...) so FLAGS.model is set to the clean filename used
by torch.load().
- Around line 163-208: The freeze function currently ignores the head parameter;
update freeze (the function handling model loading, get_model, ModelWrapper,
serialize, deserialize_to_file) to fail-fast when head is not None by raising a
clear exception (e.g., ValueError) indicating per-head freezing is not yet
supported, so callers are not silently misled; add the check near the start of
freeze before reconstructing the model and ensure the error message references
the unsupported head behavior, then run ruff check . and ruff format . to
satisfy linting/style rules.

In `@deepmd/pt_expt/infer/deep_eval.py`:
- Around line 167-169: The change clears self.exported_module after loading a
.pt2 model which breaks get_model() (which returns exported_module); update the
loader so that after aoti_load_package(model_file) you either set
self.exported_module = self._pt2_runner (so get_model() continues to work) or
update get_model() to detect the .pt2 case and return self._pt2_runner (or raise
a clear UnsupportedOperationError) — locate the code around
aoti_load_package(model_file), the attributes self._pt2_runner and
self.exported_module, and the get_model() method to implement one of these
fixes.

In `@deepmd/pt_expt/utils/serialization.py`:
- Around line 355-362: The code uses the private API
torch._inductor.aoti_compile_and_package (called with exported and
package_path=model_file after _trace_and_export), which is unstable; update the
call site to guard against API changes by either pinning the supported PyTorch
version in packaging/tests or adding a runtime compatibility layer: attempt an
import of aoti_compile_and_package from torch._inductor, fall back to an
alternative public API or raise a clear error that includes the PyTorch version
and helpful instructions; keep references to exported, _trace_and_export, and
model_file so the fallback/compatibility check wraps the exact call site and
logs actionable messages when the private symbol is unavailable or
signature-changed.

In `@source/api_cc/include/common.h`:
- Line 16: The new PyTorchExportable enumerator shifted existing ordinals and
can break binary compatibility for get_backend(); to fix, preserve existing enum
values by moving PyTorchExportable to the end just before Unknown (i.e., keep
the original order TensorFlow, PyTorch, Paddle, JAX, PyTorchExportable, Unknown)
or alternatively give each DPBackend an explicit integral value so TensorFlow,
PyTorch, Paddle, JAX retain their previous numeric values; update the DPBackend
enum definition accordingly to avoid renumbering.

In `@source/api_cc/src/common.cc`:
- Around line 1455-1461: The get_backend function (deepmd::get_backend)
currently checks for ".pt2" but not ".pte", causing ".pte" filenames to be
treated as unsupported; update the suffix checks in deepmd::get_backend so that
files ending with ".pte" are routed to deepmd::DPBackend::PyTorchExportable
(i.e., treat ".pte" the same as ".pt2")—modify the branch that returns
deepmd::DPBackend::PyTorchExportable to accept either suffix.

In `@source/api_cc/src/DeepPotPTExpt.cc`:
- Around line 65-66: The get() method lacks a bounds check and can read past s_
when the input is malformed; update get() (which uses pos_ and s_) to first
verify pos_ < s_.size() and handle the out-of-bounds case (e.g., return '\0' or
signal an error) instead of unconditionally indexing and incrementing pos_,
ensuring it mirrors the safety of peek().
- Around line 240-263: The ZIP parser can underflow when computing
content.size() - 22; before the EOCD search, add a guard that verifies
content.size() >= 22 (or return/throw an error) and avoid computing
content.size() - 22 when the file is smaller; update the EOCD search around
variables content and eocd_pos (and the loop that starts from content.size() -
22) to only run when the size check passes and otherwise throw the same
deepmd::deepmd_exception("Invalid ZIP file: " + zip_path) or a more specific
message.
- Around line 641-648: The size check in DeepPotPTExpt::extract_outputs
currently uses assert(), which is compiled out in release builds; replace it
with a runtime check (e.g., an if that compares flat_outputs.size() and
output_keys.size()) and throw a descriptive exception (std::runtime_error or
similar) if they differ to prevent undefined behavior when populating output_map
using output_keys and flat_outputs; keep the rest of the function logic the
same.

In `@source/api_cc/tests/test_deeppot_dpa_ptexpt.cc`:
- Around line 378-382: The loop that populates coord_vir only copies one scalar
per virtual atom into a buffer sized nvir*3, producing malformed virtual atom
coordinates; update the loop that fills coord_vir (and keeps atype_vir) to copy
all three coordinates per virtual atom by writing coord_vir[3*ii + 0..2] from
coord[3*ii + 0..2] for each ii (0..nvir-1) so each synthetic atom has full x,y,z
positions.

In `@source/api_cc/tests/test_deeppot_dpa2_ptexpt.cc`:
- Around line 387-391: The coord_vir buffer is sized for nvir*3 but the loop
only copies one scalar per virtual atom, leaving two coordinates zero; update
the loop that fills coord_vir (the section using coord_vir, coord, and nvir) to
copy all three coordinates per atom (e.g., iterate j=0..2 and assign
coord_vir[3*ii + j] = coord[3*ii + j] or memcpy three values) so each virtual
atom gets its full (x,y,z) position rather than a single scalar.

In `@source/api_cc/tests/test_deeppot_ptexpt.cc`:
- Around line 360-364: The loop that initializes coord_vir only copies a single
coordinate per virtual atom into a buffer sized nvir*3; update the
initialization for coord_vir so that for each virtual atom index ii you copy all
three coordinates (e.g., assign coord_vir[3*ii + 0..2] = coord[0..2] or use
std::copy/memcpy) so the 3D coordinates for each synthetic virtual atom are
properly populated; keep atype_vir initialization as-is.

In `@source/tests/infer/gen_dpa2.py`:
- Around line 127-135: The current flow attempts
pt_deserialize_to_file(pth_path, ...) and then later still checks
os.path.exists(pth_path), which can cause loading a stale deeppot_dpa2.pth if
the export failed; change the logic so that after catching RuntimeError from
pt_deserialize_to_file you do not proceed to verify or load pth_path (e.g., set
a local success flag or return/continue) and only perform the
os.path.exists(pth_path) verification when the export actually succeeded; apply
the same fix for the second occurrence that mirrors lines 181-198.
- Around line 171-190: The lint error comes from unpacking unused returns from
dp.eval/dp_pth.eval; update the unused variable names by prefixing them with an
underscore (e.g., change v_np -> _v_np, ae_np -> _ae_np, av_np -> _av_np in the
dp.eval call and similarly use _ for the unused third return in dp_pth.eval and
prefix any unused ae_pth_np/av_pth_np as needed) so that eval(...) still returns
all values but ruff RUF059 is satisfied; locate uses around the dp.eval and
dp_pth.eval calls and rename only the unused symbols (leave e_np, f_np, e_pth,
f_pth etc. as-is).

In `@source/tests/infer/gen_fparam_aparam.py`:
- Around line 127-177: The new test generator declares unused variables (v,
nloc, ae_pth, av_pth) which causes ruff F841; fix by either using or marking
them as intentionally unused—rename them with a leading underscore (e.g., _v,
_nloc, _ae_pth, _av_pth) or replace their assignments with a single
underscore/tuple discard when calling DeepPot.eval (e.g., e_pth, f_pth, _ = ...
or e_pth, f_pth, *_ = ...) so that functions/methods DeepPot.eval and local
names v, nloc, ae_pth, av_pth no longer trigger unused-variable errors.

---

Nitpick comments:
In `@source/api_cc/src/DeepPotPTExpt.cc`:
- Around line 805-808: Remove the unnecessary pre-sizing: the call to
datom_energy.resize(nall_real, 0.0) is redundant because datom_energy is
immediately overwritten by datom_energy.assign(...). Delete the resize()
invocation and keep the datom_energy.assign(...) that copies from
cpu_atom_energy_.data_ptr<VALUETYPE>() to correctly set the container contents.
- Around line 526-535: The DeepPotPTExpt constructor wraps its call to
translate_error in a redundant catch-all rethrow (catch (...) { throw; }) that
serves no purpose; remove the catch (...) block so the constructor simply calls
translate_error([&]{ init(model, gpu_rank, file_content); }); and lets
exceptions propagate naturally—update the DeepPotPTExpt::DeepPotPTExpt
implementation accordingly, leaving translate_error and init unchanged.
- Around line 1124-1129: DeepPotPTExpt::get_type_map currently appends a space
after every entry in type_map which leaves a trailing space in type_map_str;
change the logic to only insert spaces between entries (e.g., iterate with an
index or detect first/last element) so type_map_str contains the type names
separated by single spaces with no trailing space, preserving the function
signature and using the existing type_map and type_map_str variables.
- Around line 158-174: parse_bool() and parse_null() should validate the actual
input before advancing pos_ and returning a JsonValue: check that
s_.substr(pos_, ...) matches "true" or "false" (and that there are enough
characters remaining) and only then set v.bool_val and advance pos_ by 4 or 5;
if neither matches, handle the parse error (e.g., throw/return an error state).
Likewise, in parse_null() verify s_.substr(pos_,4) == "null" and enough length
before advancing pos_ by 4 and returning JsonValue(), otherwise handle the parse
error. Use the existing symbols parse_bool, parse_null, JsonValue, s_, and pos_
to locate and update the logic.
- Around line 601-614: The code currently swallows exceptions from
at::set_num_interop_threads and at::set_num_threads via empty catch(...) blocks;
replace these with try/catch that logs a warning: catch(const std::exception& e)
should log a clear warning including e.what() and the function name
(at::set_num_interop_threads or at::set_num_threads) and the attempted thread
count, and retain a catch(...) that logs an "unknown exception" warning if
non-std exceptions occur; locate these changes around get_env_nthreads handling
and ensure the warning uses your project's logging mechanism (or std::cerr if no
logger is available).

In `@source/tests/infer/gen_dpa1.py`:
- Around line 143-164: The tmp variables returned by dp.eval/dp_pth.eval that
are unused (v1, v_np, v_pth) should be prefixed with an underscore (e.g., rename
v1 -> _v1, v_np -> _v_np, v_pth -> _v_pth) to satisfy ruff; update the unpacking
in the calls to DeepPot.eval (the tuples assigned to e1,f1,v1,ae1,av1;
e_np,f_np,v_np,ae_np,av_np; and e_pth,f_pth,v_pth,ae_pth,av_pth) accordingly and
then run ruff check . and ruff format . before committing.

In `@source/tests/pt_expt/infer/test_deep_eval.py`:
- Around line 490-791: The test class TestDeepEvalEnerPt2 is only exercising
fixed-size .pt2 exports; add a dynamic-shape regression equivalent to the
existing dynamic-shape check (used for .pte) to ensure .pt2 DeepPot.eval accepts
variable frame counts and avoids torch.export shape constraints—modify
TestDeepEvalEnerPt2 to include a new test (e.g., test_pt2_dynamic_shape) that
mirrors the logic used in the .pte dynamic-shape test: generate variable nframes
(1,3,5...), create coords and cells accordingly, call self.dp.eval(...) with
atomic=True and compare outputs (energy, force, virial, atom_energy,
atom_virial) across frames to either the direct model forward
(self.model.forward) or the .pte reference (self.dp_pte.eval) using
np.testing.assert_allclose with tight rtol/atol; ensure the test uses the same
RNG seed pattern and atom_types as other tests so it targets the torch.export
shape issue.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 4c480049-2bf8-44ef-a12e-752b9dcff61a

📥 Commits

Reviewing files that changed from the base of the PR and between 24e54bf and 6145fe8.

📒 Files selected for processing (39)
  • deepmd/backend/pt_expt.py
  • deepmd/dpmodel/descriptor/dpa1.py
  • deepmd/dpmodel/descriptor/dpa2.py
  • deepmd/dpmodel/descriptor/repformers.py
  • deepmd/dpmodel/model/make_model.py
  • deepmd/dpmodel/utils/exclude_mask.py
  • deepmd/dpmodel/utils/nlist.py
  • deepmd/pt_expt/entrypoints/main.py
  • deepmd/pt_expt/infer/deep_eval.py
  • deepmd/pt_expt/utils/serialization.py
  • source/api_c/tests/test_deeppot_a_fparam_aparam_ptexpt.cc
  • source/api_c/tests/test_deeppot_a_ptexpt.cc
  • source/api_cc/include/DeepPotPTExpt.h
  • source/api_cc/include/common.h
  • source/api_cc/src/DeepPot.cc
  • source/api_cc/src/DeepPotJAX.cc
  • source/api_cc/src/DeepPotPD.cc
  • source/api_cc/src/DeepPotPT.cc
  • source/api_cc/src/DeepPotPTExpt.cc
  • source/api_cc/src/DeepSpinPT.cc
  • source/api_cc/src/common.cc
  • source/api_cc/tests/test_deeppot_a_fparam_aparam_nframes_ptexpt.cc
  • source/api_cc/tests/test_deeppot_a_fparam_aparam_pt.cc
  • source/api_cc/tests/test_deeppot_a_fparam_aparam_ptexpt.cc
  • source/api_cc/tests/test_deeppot_dpa1_pt.cc
  • source/api_cc/tests/test_deeppot_dpa2_pt.cc
  • source/api_cc/tests/test_deeppot_dpa2_ptexpt.cc
  • source/api_cc/tests/test_deeppot_dpa_ptexpt.cc
  • source/api_cc/tests/test_deeppot_ptexpt.cc
  • source/api_cc/tests/test_utils.h
  • source/tests/infer/convert-models.sh
  • source/tests/infer/deeppot_dpa.pth
  • source/tests/infer/deeppot_sea.pth
  • source/tests/infer/fparam_aparam.pth
  • source/tests/infer/gen_dpa1.py
  • source/tests/infer/gen_dpa2.py
  • source/tests/infer/gen_fparam_aparam.py
  • source/tests/pt_expt/infer/test_deep_eval.py
  • source/tests/pt_expt/test_freeze.py

)
nf, nloc, nnei, _ = dmatrix.shape
atype = atype_ext[:, :nloc]
nall = atype_ext.shape[1]
⚠️ Potential issue | 🟠 Major

Drop the dead nall assignment.

Ruff already flags this as F841, so the Python lint job will keep failing until this local is removed.

💡 Suggested fix
-        nall = atype_ext.shape[1]
         atype = xp_take_first_n(atype_ext, 1, nloc)
As per coding guidelines, `**/*.py`: Always run `ruff check .` and `ruff format .` before committing changes or CI will fail.
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
nall = atype_ext.shape[1]
atype = xp_take_first_n(atype_ext, 1, nloc)
🧰 Tools
🪛 Ruff (0.15.4)

[error] 1060-1060: Local variable nall is assigned to but never used

Remove assignment to unused variable nall

(F841)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@deepmd/dpmodel/descriptor/dpa1.py` at line 1060, Remove the unused local
assignment "nall = atype_ext.shape[1]" from dpa1.py (the dead variable 'nall'
referenced in the review) so the F841 lint error is resolved; locate the
assignment near the code that references 'atype_ext' and delete that single
line, then run "ruff check ." and "ruff format ." to ensure no other lint errors
remain.
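As a quick illustration of why the F841 finding matters, here is a hedged Python sketch (the function and variable names are hypothetical, not the actual dpa1.py code): the dead store changes nothing observable, so deleting it is safe.

```python
def take_first_n(rows, nloc):
    # stand-in for xp_take_first_n on a list-of-lists "array"
    return [row[:nloc] for row in rows]


def forward_before(atype_ext, nloc):
    nall = len(atype_ext[0])  # noqa: F841 -- dead store, flagged by ruff
    return take_first_n(atype_ext, nloc)


def forward_after(atype_ext, nloc):
    # identical behavior with the dead local removed
    return take_first_n(atype_ext, nloc)


atype_ext = [[0, 1, 0, 1, 1]]
assert forward_before(atype_ext, 3) == forward_after(atype_ext, 3) == [[0, 1, 0]]
```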

Comment on lines +163 to +208
def freeze(
model: str,
output: str = "frozen_model.pt2",
head: str | None = None,
) -> None:
"""Freeze a pt_expt training checkpoint to .pte or .pt2 format.

Parameters
----------
model : str
Path to the training checkpoint (.pt file).
output : str
Path for the frozen model output (.pte or .pt2).
head : str or None
Head to freeze in a multi-task model (not yet supported).
"""
import torch

from deepmd.pt_expt.model import (
get_model,
)
from deepmd.pt_expt.train.wrapper import (
ModelWrapper,
)
from deepmd.pt_expt.utils.env import (
DEVICE,
)
from deepmd.pt_expt.utils.serialization import (
deserialize_to_file,
)

state_dict = torch.load(model, map_location=DEVICE, weights_only=True)
if "model" in state_dict:
state_dict = state_dict["model"]
model_params = state_dict["_extra_state"]["model_params"]

# Reconstruct model and load weights
pt_expt_model = get_model(model_params).to(DEVICE)
wrapper = ModelWrapper(pt_expt_model)
wrapper.load_state_dict(state_dict)
pt_expt_model.eval()

# Serialize to dict and export
model_dict = pt_expt_model.serialize()
deserialize_to_file(output, {"model": model_dict})
log.info(f"Saved frozen model to {output}")
⚠️ Potential issue | 🟠 Major

Don't silently ignore --head.

If the caller passes a head today, freeze() still exports the default/full model and never tells them the request was ignored. Fail fast until per-head freezing is actually implemented.

🛑 Suggested guard
 def freeze(
     model: str,
     output: str = "frozen_model.pt2",
     head: str | None = None,
 ) -> None:
@@
     from deepmd.pt_expt.utils.serialization import (
         deserialize_to_file,
     )

+    if head is not None:
+        raise NotImplementedError(
+            "--head is not supported for the pt_expt freeze command yet."
+        )
+
     state_dict = torch.load(model, map_location=DEVICE, weights_only=True)

As per coding guidelines, "Always run ruff check . and ruff format . before committing changes or CI will fail."

🧰 Tools
🪛 Ruff (0.15.4)

[warning] 166-166: Unused function argument: head

(ARG001)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@deepmd/pt_expt/entrypoints/main.py` around lines 163 - 208, The freeze
function currently ignores the head parameter; update freeze (the function
handling model loading, get_model, ModelWrapper, serialize, deserialize_to_file)
to fail-fast when head is not None by raising a clear exception (e.g.,
ValueError) indicating per-head freezing is not yet supported, so callers are
not silently misled; add the check near the start of freeze before
reconstructing the model and ensure the error message references the unsupported
head behavior, then run ruff check . and ruff format . to satisfy linting/style
rules.

Comment on lines +247 to +250
if Path(FLAGS.checkpoint_folder).is_dir():
checkpoint_path = Path(FLAGS.checkpoint_folder)
latest_ckpt_file = (checkpoint_path / "checkpoint").read_text()
FLAGS.model = str(checkpoint_path.joinpath(latest_ckpt_file))
⚠️ Potential issue | 🟡 Minor

Strip the checkpoint pointer before joining the path.

Path.read_text() preserves trailing newlines, so a line-based checkpoint file can produce FLAGS.model with a trailing \n and make torch.load() miss the file.

🧩 Suggested fix
-            latest_ckpt_file = (checkpoint_path / "checkpoint").read_text()
-            FLAGS.model = str(checkpoint_path.joinpath(latest_ckpt_file))
+            latest_ckpt_file = (checkpoint_path / "checkpoint").read_text().strip()
+            FLAGS.model = str(checkpoint_path / latest_ckpt_file)
📝 Committable suggestion


Suggested change
if Path(FLAGS.checkpoint_folder).is_dir():
checkpoint_path = Path(FLAGS.checkpoint_folder)
latest_ckpt_file = (checkpoint_path / "checkpoint").read_text()
FLAGS.model = str(checkpoint_path.joinpath(latest_ckpt_file))
if Path(FLAGS.checkpoint_folder).is_dir():
checkpoint_path = Path(FLAGS.checkpoint_folder)
latest_ckpt_file = (checkpoint_path / "checkpoint").read_text().strip()
FLAGS.model = str(checkpoint_path / latest_ckpt_file)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@deepmd/pt_expt/entrypoints/main.py` around lines 247 - 250, The code reads
the checkpoint pointer into latest_ckpt_file with Path(...).read_text() which
can include a trailing newline, causing FLAGS.model to point to a non-existent
path; update the logic that sets latest_ckpt_file (used when
FLAGS.checkpoint_folder is a directory) to strip whitespace/newlines (e.g.,
.strip()) before calling checkpoint_path.joinpath(...) so FLAGS.model is set to
the clean filename used by torch.load().
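A minimal, runnable sketch of the failure mode (the directory and checkpoint file names are made up for illustration): Path.read_text() keeps the trailing newline, so the joined checkpoint path is wrong until it is stripped.

```python
import tempfile
from pathlib import Path

# simulate a line-oriented checkpoint pointer file, newline included
ckpt_dir = Path(tempfile.mkdtemp())
(ckpt_dir / "checkpoint").write_text("model.ckpt-100.pt\n")

raw = (ckpt_dir / "checkpoint").read_text()
clean = raw.strip()

assert raw.endswith("\n")                  # the newline survives read_text()
assert str(ckpt_dir / raw).endswith("\n")  # broken path: torch.load would miss it
assert (ckpt_dir / clean).name == "model.ckpt-100.pt"  # stripped path is correct
```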

Comment on lines +167 to +169
# Load the AOTInductor model
self._pt2_runner = aoti_load_package(model_file)
self.exported_module = None
⚠️ Potential issue | 🟠 Major

Don’t silently break get_model() for .pt2.

Line 169 clears self.exported_module, but get_model() still returns that attribute at Line 710. After this change, .pt2 callers get None instead of the loaded artifact. Either return _pt2_runner from get_model() for this format or make get_model() raise a clear unsupported-operation error.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@deepmd/pt_expt/infer/deep_eval.py` around lines 167 - 169, The change clears
self.exported_module after loading a .pt2 model which breaks get_model() (which
returns exported_module); update the loader so that after
aoti_load_package(model_file) you either set self.exported_module =
self._pt2_runner (so get_model() continues to work) or update get_model() to
detect the .pt2 case and return self._pt2_runner (or raise a clear
UnsupportedOperationError) — locate the code around
aoti_load_package(model_file), the attributes self._pt2_runner and
self.exported_module, and the get_model() method to implement one of these
fixes.

Comment on lines +355 to +362
from torch._inductor import (
aoti_compile_and_package,
)

exported, metadata, data_for_json, output_keys = _trace_and_export(data)

# Compile via AOTInductor into a .pt2 package
aoti_compile_and_package(exported, package_path=model_file)
⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🌐 Web query:

PyTorch AOTInductor aoti_compile_and_package stable API

💡 Result:

torch._inductor.aoti_compile_and_package() is not a stable/public API in the PyTorch sense: it lives under the underscored torch._inductor namespace, which PyTorch uses for internal/developer-facing functionality that may change without a backward-compatibility guarantee (even if it’s documented). PyTorch’s docs explicitly warn that private/developer-facing APIs “may be subject to changes without warning.” [3]

What is “stable” around AOTInductor today is not the Python compile entrypoint, but the deployment/runtime ABI surface:

  • The official AOTInductor tutorial documents torch._inductor.aoti_compile_and_package(exported_program, *, package_path=None, inductor_configs=None) -> str and torch._inductor.aoti_load_package(...) for producing/loading a .pt2 artifact (PT2 archive). [1]
  • For cross-version binary compatibility, PyTorch has a LibTorch Stable ABI effort that includes stable C headers (originating from AOTInductor) and higher-level torch::stable C++ wrappers, with an explicit ABI stability promise (not modifying headers for at least ~2 years after release). [2][4]

Practical takeaway: if you need “stable,” treat aoti_compile_and_package as version-pinned build tooling, and rely on the LibTorch Stable ABI / torch::stable side for the runtime integration guarantees. [2][4]

Sources: [1] AOTInductor docs (PyTorch) • [2] LibTorch Stable ABI docs (PyTorch) • [3] “Extending PyTorch” docs note on private APIs changing without warning • [4] Torch Stable API C++ docs.


torch._inductor.aoti_compile_and_package is a private/unstable PyTorch API that may change between versions.

While it is currently the documented way to produce AOTInductor .pt2 packages, torch._inductor is internal and not covered by PyTorch's backward-compatibility guarantees. Consider either pinning the PyTorch version or adding API compatibility checks to handle potential breaking changes in future releases.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@deepmd/pt_expt/utils/serialization.py` around lines 355 - 362, The code uses
the private API torch._inductor.aoti_compile_and_package (called with exported
and package_path=model_file after _trace_and_export), which is unstable; update
the call site to guard against API changes by either pinning the supported
PyTorch version in packaging/tests or adding a runtime compatibility layer:
attempt an import of aoti_compile_and_package from torch._inductor, fall back to
an alternative public API or raise a clear error that includes the PyTorch
version and helpful instructions; keep references to exported,
_trace_and_export, and model_file so the fallback/compatibility check wraps the
exact call site and logs actionable messages when the private symbol is
unavailable or signature-changed.
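One possible shape for the compatibility layer described above (a sketch only; the wrapper name and the injectable importer are inventions for illustration, while aoti_compile_and_package itself is the private symbol discussed in the review):

```python
def load_aoti_compiler(importer=None):
    """Resolve the private AOTInductor entry point with a clear failure mode.

    ``importer`` is injectable so the fallback path can be tested without torch.
    """
    if importer is None:
        def importer():
            from torch._inductor import aoti_compile_and_package
            return aoti_compile_and_package
    try:
        return importer()
    except ImportError as err:
        raise RuntimeError(
            ".pt2 export requires torch._inductor.aoti_compile_and_package, "
            "a private API that may move between PyTorch releases; pin a "
            "supported torch version or export to .pte instead."
        ) from err


def _broken_importer():
    raise ImportError("no torch._inductor here")


try:
    load_aoti_compiler(_broken_importer)
    raised = False
except RuntimeError as err:
    raised = "aoti_compile_and_package" in str(err)
assert raised  # the unavailable private symbol surfaces as an actionable error
```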

Comment on lines +387 to +391
std::vector<VALUETYPE> coord_vir(nvir * 3);
std::vector<int> atype_vir(nvir, 2);
for (int ii = 0; ii < nvir; ++ii) {
coord_vir[ii] = coord[ii];
}
⚠️ Potential issue | 🟡 Minor

Copy all 3 coordinates for each synthetic virtual atom.

This loop only copies nvir scalars into a nvir * 3 coordinate buffer, so most of the virtual-atom coordinates stay zero. The type-selection test then runs against malformed positions.

💡 Suggested fix
-  for (int ii = 0; ii < nvir; ++ii) {
-    coord_vir[ii] = coord[ii];
-  }
+  std::copy_n(coord.begin(), nvir * 3, coord_vir.begin());
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@source/api_cc/tests/test_deeppot_dpa2_ptexpt.cc` around lines 387 - 391, The
coord_vir buffer is sized for nvir*3 but the loop only copies one scalar per
virtual atom, leaving two coordinates zero; update the loop that fills coord_vir
(the section using coord_vir, coord, and nvir) to copy all three coordinates per
atom (e.g., iterate j=0..2 and assign coord_vir[3*ii + j] = coord[3*ii + j] or
memcpy three values) so each virtual atom gets its full (x,y,z) position rather
than a single scalar.

Comment on lines +360 to +364
std::vector<VALUETYPE> coord_vir(nvir * 3);
std::vector<int> atype_vir(nvir, 2);
for (int ii = 0; ii < nvir; ++ii) {
coord_vir[ii] = coord[ii];
}
⚠️ Potential issue | 🟡 Minor

Copy all 3 coordinates for each synthetic virtual atom.

This loop only fills the first nvir entries of a nvir * 3 buffer, so the remaining coordinates stay zero. The test then exercises the type-selection path with malformed virtual atoms.

💡 Suggested fix
-  for (int ii = 0; ii < nvir; ++ii) {
-    coord_vir[ii] = coord[ii];
-  }
+  std::copy_n(coord.begin(), nvir * 3, coord_vir.begin());
📝 Committable suggestion


Suggested change
std::vector<VALUETYPE> coord_vir(nvir * 3);
std::vector<int> atype_vir(nvir, 2);
for (int ii = 0; ii < nvir; ++ii) {
coord_vir[ii] = coord[ii];
}
std::vector<VALUETYPE> coord_vir(nvir * 3);
std::vector<int> atype_vir(nvir, 2);
std::copy_n(coord.begin(), nvir * 3, coord_vir.begin());
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@source/api_cc/tests/test_deeppot_ptexpt.cc` around lines 360 - 364, The loop
that initializes coord_vir only copies a single coordinate per virtual atom into
a buffer sized nvir*3; update the initialization for coord_vir so that for each
virtual atom index ii you copy all three coordinates (e.g., assign
coord_vir[3*ii + 0..2] = coord[0..2] or use std::copy/memcpy) so the 3D
coordinates for each synthetic virtual atom are properly populated; keep
atype_vir initialization as-is.
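The bug is easy to see in a few lines of Python (a sketch with made-up numbers, mirroring the C++ loop above):

```python
nvir = 2
coord = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]  # two atoms, (x, y, z) each

buggy = [0.0] * (nvir * 3)
for ii in range(nvir):       # mirrors: coord_vir[ii] = coord[ii]
    buggy[ii] = coord[ii]

fixed = [0.0] * (nvir * 3)
fixed[:nvir * 3] = coord[:nvir * 3]  # mirrors: std::copy_n(coord.begin(), ...)

assert buggy == [1.0, 2.0, 0.0, 0.0, 0.0, 0.0]  # y/z of both atoms lost
assert fixed == [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]  # full coordinates preserved
```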

Comment on lines +127 to +135
pth_path = os.path.join(base_dir, "deeppot_dpa2.pth")
print(f"Exporting to {pth_path} ...") # noqa: T201
try:
pt_deserialize_to_file(pth_path, copy.deepcopy(data))
except RuntimeError as e:
# Custom ops (e.g. tabulate_fusion_se_t_tebd) may not be available
# in all build environments; .pth generation is not critical.
print(f"WARNING: .pth export failed ({e}), skipping.") # noqa: T201

⚠️ Potential issue | 🟠 Major

Don’t verify a stale .pth artifact after export failure.

If pt_deserialize_to_file() throws and an old deeppot_dpa2.pth is already present, the later os.path.exists(pth_path) branch will still load that stale artifact. That can silently validate the wrong weights and leave regenerated fixtures out of sync with the fresh .pt2 model.

💡 Suggested fix
     pth_path = os.path.join(base_dir, "deeppot_dpa2.pth")
+    pth_generated = False
     print(f"Exporting to {pth_path} ...")  # noqa: T201
     try:
+        if os.path.exists(pth_path):
+            os.remove(pth_path)
         pt_deserialize_to_file(pth_path, copy.deepcopy(data))
+        pth_generated = True
     except RuntimeError as e:
+        if os.path.exists(pth_path):
+            os.remove(pth_path)
         # Custom ops (e.g. tabulate_fusion_se_t_tebd) may not be available
         # in all build environments; .pth generation is not critical.
         print(f"WARNING: .pth export failed ({e}), skipping.")  # noqa: T201
@@
-    if os.path.exists(pth_path):
+    if pth_generated:

Also applies to: 181-198

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@source/tests/infer/gen_dpa2.py` around lines 127 - 135, The current flow
attempts pt_deserialize_to_file(pth_path, ...) and then later still checks
os.path.exists(pth_path), which can cause loading a stale deeppot_dpa2.pth if
the export failed; change the logic so that after catching RuntimeError from
pt_deserialize_to_file you do not proceed to verify or load pth_path (e.g., set
a local success flag or return/continue) and only perform the
os.path.exists(pth_path) verification when the export actually succeeded; apply
the same fix for the second occurrence that mirrors lines 181-198.

Comment on lines +171 to +190
e1, f1, v1, ae1, av1 = dp.eval(coord, box, atype, atomic=True)
print(f"\n// PBC total energy: {e1[0, 0]:.18e}") # noqa: T201
print_cpp_values("PBC reference values", ae1, f1, av1)

# ---- 5. Run inference for NoPbc test ----
e_np, f_np, v_np, ae_np, av_np = dp.eval(coord, None, atype, atomic=True)
print(f"\n// NoPbc total energy: {e_np[0, 0]:.18e}") # noqa: T201
print_cpp_values("NoPbc reference values", ae_np, f_np, av_np)

# ---- 6. Verify .pth gives same results ----
if os.path.exists(pth_path):
dp_pth = DeepPot(pth_path)
e_pth, f_pth, v_pth, ae_pth, av_pth = dp_pth.eval(
coord, box, atype, atomic=True
)
print(f"\n// .pth PBC total energy: {e_pth[0, 0]:.18e}") # noqa: T201
print(f"// .pth vs .pt2 energy diff: {abs(e1[0, 0] - e_pth[0, 0]):.2e}") # noqa: T201
print(f"// .pth vs .pt2 force max diff: {np.max(np.abs(f1 - f_pth)):.2e}") # noqa: T201

e_pth_np, f_pth_np, _, ae_pth_np, av_pth_np = dp_pth.eval(
⚠️ Potential issue | 🟠 Major

Prefix the intentionally unused eval outputs.

ruff is already reporting RUF059 here for the unused unpacked values, so this helper will keep failing lint until those names are marked as intentionally unused.

💡 Suggested fix
-    e1, f1, v1, ae1, av1 = dp.eval(coord, box, atype, atomic=True)
+    e1, f1, _v1, ae1, av1 = dp.eval(coord, box, atype, atomic=True)
@@
-    e_np, f_np, v_np, ae_np, av_np = dp.eval(coord, None, atype, atomic=True)
+    e_np, f_np, _v_np, ae_np, av_np = dp.eval(coord, None, atype, atomic=True)
@@
-        e_pth, f_pth, v_pth, ae_pth, av_pth = dp_pth.eval(
+        e_pth, f_pth, _v_pth, _ae_pth, _av_pth = dp_pth.eval(
             coord, box, atype, atomic=True
         )
@@
-        e_pth_np, f_pth_np, _, ae_pth_np, av_pth_np = dp_pth.eval(
+        e_pth_np, _f_pth_np, _, _ae_pth_np, _av_pth_np = dp_pth.eval(
             coord, None, atype, atomic=True
         )
As per coding guidelines, `**/*.py`: Always run `ruff check .` and `ruff format .` before committing changes or CI will fail.
🧰 Tools
🪛 Ruff (0.15.4)

[warning] 171-171: Unpacked variable v1 is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)


[warning] 176-176: Unpacked variable v_np is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)


[warning] 183-183: Unpacked variable v_pth is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)


[warning] 183-183: Unpacked variable ae_pth is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)


[warning] 183-183: Unpacked variable av_pth is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)


[warning] 190-190: Unpacked variable f_pth_np is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)


[warning] 190-190: Unpacked variable ae_pth_np is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)


[warning] 190-190: Unpacked variable av_pth_np is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@source/tests/infer/gen_dpa2.py` around lines 171 - 190, The lint error comes
from unpacking unused returns from dp.eval/dp_pth.eval; update the unused
variable names by prefixing them with an underscore (e.g., change v_np -> _v_np,
ae_np -> _ae_np, av_np -> _av_np in the dp.eval call and similarly use _ for the
unused third return in dp_pth.eval and prefix any unused ae_pth_np/av_pth_np as
needed) so that eval(...) still returns all values but ruff RUF059 is satisfied;
locate uses around the dp.eval and dp_pth.eval calls and rename only the unused
symbols (leave e_np, f_np, e_pth, f_pth etc. as-is).
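For reference, the RUF059-clean unpacking pattern looks like this (fake_eval is a stand-in for DeepPot.eval, with made-up return values):

```python
def fake_eval():
    """Stand-in for DeepPot.eval returning (e, f, v, ae, av)."""
    return 1.0, [0.1], [0.2], [0.3], [0.4]


# before: e, f, v, ae, av = fake_eval()  -> ruff RUF059 flags the unused v
e, f, _v, ae, av = fake_eval()  # underscore marks v as intentionally unused

assert e == 1.0
assert f == [0.1]
assert ae == [0.3] and av == [0.4]
```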

Comment on lines +127 to +177
e, f, v, ae, av = dp.eval(
coord,
box,
atype,
fparam=fparam_val,
aparam=aparam_val,
atomic=True,
)

nloc = 6
atom_energy = ae[0, :, 0]
force = f[0]
atom_virial = av[0]

# Print C++ format
print("\n// ---- Reference values for C++ test (shared by .pth and .pt2) ----") # noqa: T201
print(f"// Total energy: {e[0, 0]:.18e}") # noqa: T201
print() # noqa: T201
print(" std::vector<VALUETYPE> expected_e = {") # noqa: T201
for ii, ev in enumerate(atom_energy):
comma = "," if ii < len(atom_energy) - 1 else ""
print(f" {ev:.18e}{comma}") # noqa: T201
print(" };") # noqa: T201

print(" std::vector<VALUETYPE> expected_f = {") # noqa: T201
force_flat = force.flatten()
for ii, fv in enumerate(force_flat):
comma = "," if ii < len(force_flat) - 1 else ""
print(f" {fv:.18e}{comma}") # noqa: T201
print(" };") # noqa: T201

print(" std::vector<VALUETYPE> expected_v = {") # noqa: T201
virial_flat = atom_virial.flatten()
for ii, vv in enumerate(virial_flat):
comma = "," if ii < len(virial_flat) - 1 else ""
print(f" {vv:.18e}{comma}") # noqa: T201
print(" };") # noqa: T201

# ---- 5. Verify .pth gives same results ----
dp_pth = DeepPot(pth_path)
e_pth, f_pth, _, ae_pth, av_pth = dp_pth.eval(
coord,
box,
atype,
fparam=fparam_val,
aparam=aparam_val,
atomic=True,
)
print(f"\n// .pth vs .pt2 energy diff: {abs(e[0, 0] - e_pth[0, 0]):.2e}") # noqa: T201
print(f"// .pth vs .pt2 force max diff: {np.max(np.abs(f - f_pth)):.2e}") # noqa: T201

⚠️ Potential issue | 🟠 Major

Fix the Ruff failures in this generator.

v, nloc, ae_pth, and av_pth are unused here, so this new file will fail ruff check as-is.

♻️ Suggested cleanup
-    e, f, v, ae, av = dp.eval(
+    e, f, _, ae, av = dp.eval(
         coord,
         box,
         atype,
         fparam=fparam_val,
         aparam=aparam_val,
         atomic=True,
     )

-    nloc = 6
     atom_energy = ae[0, :, 0]
     force = f[0]
     atom_virial = av[0]
@@
-    e_pth, f_pth, _, ae_pth, av_pth = dp_pth.eval(
+    e_pth, f_pth, _, _ae_pth, _av_pth = dp_pth.eval(
         coord,
         box,
         atype,
         fparam=fparam_val,
         aparam=aparam_val,
         atomic=True,
     )

As per coding guidelines, "Always run ruff check . and ruff format . before committing changes or CI will fail."

📝 Committable suggestion


Suggested change

Before:

    e, f, v, ae, av = dp.eval(
        coord,
        box,
        atype,
        fparam=fparam_val,
        aparam=aparam_val,
        atomic=True,
    )
    nloc = 6
    atom_energy = ae[0, :, 0]
    force = f[0]
    atom_virial = av[0]
    # Print C++ format
    print("\n// ---- Reference values for C++ test (shared by .pth and .pt2) ----")  # noqa: T201
    print(f"// Total energy: {e[0, 0]:.18e}")  # noqa: T201
    print()  # noqa: T201
    print(" std::vector<VALUETYPE> expected_e = {")  # noqa: T201
    for ii, ev in enumerate(atom_energy):
        comma = "," if ii < len(atom_energy) - 1 else ""
        print(f" {ev:.18e}{comma}")  # noqa: T201
    print(" };")  # noqa: T201
    print(" std::vector<VALUETYPE> expected_f = {")  # noqa: T201
    force_flat = force.flatten()
    for ii, fv in enumerate(force_flat):
        comma = "," if ii < len(force_flat) - 1 else ""
        print(f" {fv:.18e}{comma}")  # noqa: T201
    print(" };")  # noqa: T201
    print(" std::vector<VALUETYPE> expected_v = {")  # noqa: T201
    virial_flat = atom_virial.flatten()
    for ii, vv in enumerate(virial_flat):
        comma = "," if ii < len(virial_flat) - 1 else ""
        print(f" {vv:.18e}{comma}")  # noqa: T201
    print(" };")  # noqa: T201
    # ---- 5. Verify .pth gives same results ----
    dp_pth = DeepPot(pth_path)
    e_pth, f_pth, _, ae_pth, av_pth = dp_pth.eval(
        coord,
        box,
        atype,
        fparam=fparam_val,
        aparam=aparam_val,
        atomic=True,
    )
    print(f"\n// .pth vs .pt2 energy diff: {abs(e[0, 0] - e_pth[0, 0]):.2e}")  # noqa: T201
    print(f"// .pth vs .pt2 force max diff: {np.max(np.abs(f - f_pth)):.2e}")  # noqa: T201

After:

    e, f, _, ae, av = dp.eval(
        coord,
        box,
        atype,
        fparam=fparam_val,
        aparam=aparam_val,
        atomic=True,
    )
    atom_energy = ae[0, :, 0]
    force = f[0]
    atom_virial = av[0]
    # Print C++ format
    print("\n// ---- Reference values for C++ test (shared by .pth and .pt2) ----")  # noqa: T201
    print(f"// Total energy: {e[0, 0]:.18e}")  # noqa: T201
    print()  # noqa: T201
    print(" std::vector<VALUETYPE> expected_e = {")  # noqa: T201
    for ii, ev in enumerate(atom_energy):
        comma = "," if ii < len(atom_energy) - 1 else ""
        print(f" {ev:.18e}{comma}")  # noqa: T201
    print(" };")  # noqa: T201
    print(" std::vector<VALUETYPE> expected_f = {")  # noqa: T201
    force_flat = force.flatten()
    for ii, fv in enumerate(force_flat):
        comma = "," if ii < len(force_flat) - 1 else ""
        print(f" {fv:.18e}{comma}")  # noqa: T201
    print(" };")  # noqa: T201
    print(" std::vector<VALUETYPE> expected_v = {")  # noqa: T201
    virial_flat = atom_virial.flatten()
    for ii, vv in enumerate(virial_flat):
        comma = "," if ii < len(virial_flat) - 1 else ""
        print(f" {vv:.18e}{comma}")  # noqa: T201
    print(" };")  # noqa: T201
    # ---- 5. Verify .pth gives same results ----
    dp_pth = DeepPot(pth_path)
    e_pth, f_pth, _, _ae_pth, _av_pth = dp_pth.eval(
        coord,
        box,
        atype,
        fparam=fparam_val,
        aparam=aparam_val,
        atomic=True,
    )
    print(f"\n// .pth vs .pt2 energy diff: {abs(e[0, 0] - e_pth[0, 0]):.2e}")  # noqa: T201
    print(f"// .pth vs .pt2 force max diff: {np.max(np.abs(f - f_pth)):.2e}")  # noqa: T201
🧰 Tools
🪛 Ruff (0.15.4)

[warning] 127-127: Unpacked variable v is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)


[error] 136-136: Local variable nloc is assigned to but never used

Remove assignment to unused variable nloc

(F841)


[warning] 167-167: Unpacked variable ae_pth is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)


[warning] 167-167: Unpacked variable av_pth is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@source/tests/infer/gen_fparam_aparam.py` around lines 127 - 177, the new test
generator declares unused variables (v, nloc, ae_pth, av_pth), which trigger the ruff
F841/RUF059 unused-variable rules. Mark them as intentionally unused: rename them
with a leading underscore (e.g., _v, _ae_pth, _av_pth), delete the unused nloc
assignment outright, or discard the extra return values when calling DeepPot.eval
(e.g., e_pth, f_pth, _ = ... or e_pth, f_pth, *_ = ...), so that the unused-variable
errors no longer fire.
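The underscore convention the prompt describes can be shown with a tiny standalone sketch; `eval_stub` here is a made-up stand-in for `DeepPot.eval`'s five return values, not the real API:

```python
def eval_stub():
    # Hypothetical stand-in returning (energy, force, virial, atom_energy, atom_virial).
    return 1.0, [0.0, 0.0], None, [0.5], [0.25]


# Prefixing unused names with "_" (or collecting them with *_) tells Ruff's
# RUF059/F841 rules the unpacked values are intentionally discarded.
e, f, _, _ae, _av = eval_stub()
e2, f2, *_ = eval_stub()
print(e, e2)  # → 1.0 1.0
```

Either spelling keeps the unpacking shape explicit while silencing the lint.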

    std::vector<VALUETYPE> frame_aparam;
    if (!aparam.empty()) {
      frame_aparam.assign(aparam.begin() + ff * dap,
                          aparam.begin() + (ff + 1) * dap);

Check failure

Code scanning / CodeQL

Multiplication result converted to larger type High

Multiplication result may overflow 'int' before it is converted to 'difference_type'.
    }
    std::vector<VALUETYPE> frame_aparam;
    if (!aparam.empty()) {
      frame_aparam.assign(aparam.begin() + ff * dap,

Check failure

Code scanning / CodeQL

Multiplication result converted to larger type High

Multiplication result may overflow 'int' before it is converted to 'difference_type'.
    std::vector<VALUETYPE> frame_fparam;
    if (!fparam.empty()) {
      frame_fparam.assign(fparam.begin() + ff * dfp,
                          fparam.begin() + (ff + 1) * dfp);

Check failure

Code scanning / CodeQL

Multiplication result converted to larger type High

Multiplication result may overflow 'int' before it is converted to 'difference_type'.
    }
    std::vector<VALUETYPE> frame_fparam;
    if (!fparam.empty()) {
      frame_fparam.assign(fparam.begin() + ff * dfp,

Check failure

Code scanning / CodeQL

Multiplication result converted to larger type High

Multiplication result may overflow 'int' before it is converted to 'difference_type'.
    std::vector<VALUETYPE> frame_coord(coord.begin() + ff * natoms * 3,
                                       coord.begin() + (ff + 1) * natoms * 3);
    std::vector<int> frame_atype(atype.begin() + ff * natoms,
                                 atype.begin() + (ff + 1) * natoms);

Check failure

Code scanning / CodeQL

Multiplication result converted to larger type High

Multiplication result may overflow 'int' before it is converted to 'difference_type'.
    atom_virial.clear();
  }
  for (int ff = 0; ff < nframes; ++ff) {
    std::vector<VALUETYPE> frame_coord(coord.begin() + ff * natoms * 3,

Check failure

Code scanning / CodeQL

Multiplication result converted to larger type High

Multiplication result may overflow 'int' before it is converted to 'difference_type'.
  for (int kk = 0; kk < nframes; ++kk) {
    std::copy(in.begin() + kk * nall * ndim,
              in.begin() + kk * nall * ndim + nloc * ndim,
              out.begin() + kk * nloc * ndim);

Check failure

Code scanning / CodeQL

Multiplication result converted to larger type High

Multiplication result may overflow 'int' before it is converted to 'difference_type'.
  out.resize(static_cast<size_t>(nframes) * nloc * ndim);
  for (int kk = 0; kk < nframes; ++kk) {
    std::copy(in.begin() + kk * nall * ndim,
              in.begin() + kk * nall * ndim + nloc * ndim,

Check failure

Code scanning / CodeQL

Multiplication result converted to larger type High

Multiplication result may overflow 'int' before it is converted to 'difference_type'.
  out.resize(static_cast<size_t>(nframes) * nloc * ndim);
  for (int kk = 0; kk < nframes; ++kk) {
    std::copy(in.begin() + kk * nall * ndim,
              in.begin() + kk * nall * ndim + nloc * ndim,

Check failure

Code scanning / CodeQL

Multiplication result converted to larger type High

Multiplication result may overflow 'int' before it is converted to 'difference_type'.
                 const int nframes = 1) {
  out.resize(static_cast<size_t>(nframes) * nloc * ndim);
  for (int kk = 0; kk < nframes; ++kk) {
    std::copy(in.begin() + kk * nall * ndim,

Check failure

Code scanning / CodeQL

Multiplication result converted to larger type High

Multiplication result may overflow 'int' before it is converted to 'difference_type'.