
feat(pt_expt): add dp freeze support for pt_expt backend (.pte/.pt2)#5299

Draft
wanghan-iapcm wants to merge 4 commits into deepmodeling:master from wanghan-iapcm:feat-pt-expt-dpfrz

Conversation

Collaborator

@wanghan-iapcm wanghan-iapcm commented Mar 9, 2026

Add freeze() function that loads a training checkpoint, reconstructs the model, serializes it, and exports via the existing deserialize_to_file pipeline. Wire the freeze command in the main() CLI dispatcher with checkpoint-dir resolution and automatic .pt2 suffix defaulting.
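The "automatic .pt2 suffix defaulting" mentioned above can be sketched as a small helper. This is a minimal illustration, not the PR's actual code; the function name `resolve_output` and the exact set of recognized suffixes are assumptions:

```python
from pathlib import Path

def resolve_output(output: str) -> str:
    # Hypothetical sketch: default to .pt2 when the requested output
    # file has no recognized pt_expt suffix (assumed to be .pt2/.pte).
    p = Path(output)
    if p.suffix not in (".pt2", ".pte"):
        p = p.with_suffix(".pt2")
    return str(p)
```

With this, `dp --pt_expt freeze -o frozen_model` would produce `frozen_model.pt2`, while an explicit `.pte` output is left untouched.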

Summary by CodeRabbit

Release Notes

  • New Features

    • Added support for .pt2 (AOTInductor) model format as an alternative to .pte for exporting frozen models.
    • Introduced freeze command to export PyTorch checkpoints directly to .pt2 or .pte formats.
    • Enhanced descriptor implementations for improved numerical stability across different execution paths.
  • Bug Fixes

    • Fixed neighbor list indexing issues in export scenarios with non-periodic boundary conditions.

Han Wang added 4 commits March 7, 2026 15:55
- add support for serialization to .pt2
- deep eval with .pt2
- C/C++ inference api with .pt2
- add ut for python inference
- add ut for c++/c inference: se_e2_a, fparam/aparam, multi-frames, dpa1
Replace flat (nf*nall,) indexing in dpa1.py and exclude_mask.py with
xp_take_along_axis and xp_take_first_n. The flat indexing creates
Ne(nall, nloc) assertions during torch.export tracing that fail when
nall == nloc (NoPbc). The new gather-based approach avoids these
shape constraints.
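A NumPy analogue of this change (the real code uses the array-API helpers `xp_take_along_axis`/`xp_take_first_n` on dpmodel arrays; the shapes below are illustrative):

```python
import numpy as np

nf, nloc, nall, nnei = 1, 4, 4, 3  # NoPbc case: nall == nloc
atype_ext = np.arange(nf * nall).reshape(nf, nall)
nlist = np.random.default_rng(0).integers(0, nall, size=(nf, nloc, nnei))

# Flat (nf*nall,) indexing: the cross-frame offset arithmetic is what
# torch.export turns into a Ne(nall, nloc) shape guard during tracing.
flat = atype_ext.reshape(-1)[
    (nlist + np.arange(nf)[:, None, None] * nall).reshape(-1)
].reshape(nf, nloc, nnei)

# Gather-based approach: a per-frame along-axis gather, no flattening
# across frames, so no relation between nall and nloc is asserted.
gathered = np.take_along_axis(
    np.broadcast_to(atype_ext[:, None, :], (nf, nloc, nall)), nlist, axis=2
)
```

Both paths produce the same values; only the gather-based one traces cleanly when `nall == nloc`.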

Unify PT/pt_expt test models so .pth and .pt2 are generated from the
same dpmodel weights (type_one_side=True) via gen scripts. Remove
deeppot_sea.pth, deeppot_dpa.pth, and fparam_aparam.pth from git
tracking — they are now regenerated in CI via convert-models.sh.

Changes:
- dpa1.py, exclude_mask.py: xp_take_along_axis replaces flat indexing
- gen_dpa1.py: generates deeppot_dpa1.pth + .pt2 from dpmodel
- gen_fparam_aparam.py: generates fparam_aparam.pth + .pt2 from dpmodel
- convert-models.sh: regenerate .pth from yaml and run gen scripts in CI
- test_deeppot_dpa_ptexpt.cc: add NoPbc test fixture
- test_deeppot_dpa1_pt.cc: new DPA1 .pth test (PBC + NoPbc)
- test_deeppot_a_fparam_aparam_pt.cc: update reference values
…ckends

Fix a latent bug in all 5 C++ backends (DeepPotPT, DeepPotPTExpt,
DeepPotJAX, DeepSpinPT, DeepPotPD) where the ghost-to-local mapping
was incorrectly indexed through fwd_map when virtual atoms are present.
The old code `mapping[ii] = lmp_list.mapping[fwd_map[ii]]` treats the
post-filter index as an original index, causing undefined behavior when
fwd_map contains -1 entries for filtered atoms. The fix uses bkw_map
to correctly translate: new_idx -> orig_idx -> mapped_orig -> new_mapped.
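The index translation described above, as a toy Python model (the data values are made up; the actual fix lives in the C++ backends):

```python
# Assumed conventions, following the commit message: fwd_map maps an
# original index to its post-filter index (-1 for filtered-out atoms),
# and bkw_map maps a post-filter index back to its original index.
orig_mapping = [0, 1, 0, 2, 1]   # ghost-to-local map over original indices
fwd_map = [0, -1, 1, 2, -1]      # atoms 1 and 4 filtered out (e.g. virtual)
bkw_map = [0, 2, 3]              # post-filter index -> original index

def new_mapping(ii):
    orig_idx = bkw_map[ii]                # new_idx -> orig_idx
    mapped_orig = orig_mapping[orig_idx]  # original ghost -> original local
    return fwd_map[mapped_orig]           # original local -> new local index
```

The old code, `orig_mapping[fwd_map[ii]]`, feeds a post-filter index into `fwd_map`; when that lookup yields -1 for a filtered atom, the subsequent indexing is undefined behavior in C++.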

Also fix a use-after-free in DeepPotPTExpt.cc where torch::from_blob
referenced a local vector's data after the vector went out of scope.

Additionally fix dpmodel code (dpa2.py, repformers.py, make_model.py,
nlist.py) to use xp_take_first_n and xp_expand_dims instead of slice
indexing that creates shape constraints incompatible with torch.export.

Add DPA2 C++ test files for both .pth and .pt2 backends with PBC,
NoPbc, and type_sel test cases. Reduce DPA2 test model size for
machine-precision agreement between jit and export paths.
Add freeze() function that loads a training checkpoint, reconstructs the
model, serializes it, and exports via the existing deserialize_to_file
pipeline. Wire the freeze command in the main() CLI dispatcher with
checkpoint-dir resolution and automatic .pt2 suffix defaulting.
@wanghan-iapcm wanghan-iapcm requested a review from njzjz March 9, 2026 03:27
@wanghan-iapcm wanghan-iapcm marked this pull request as draft March 9, 2026 03:27
@wanghan-iapcm
Collaborator Author

merge after #5298

@dosubot dosubot bot added the new feature label Mar 9, 2026

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 6145fe8d69


elif FLAGS.command == "freeze":
    if Path(FLAGS.checkpoint_folder).is_dir():
        checkpoint_path = Path(FLAGS.checkpoint_folder)
        latest_ckpt_file = (checkpoint_path / "checkpoint").read_text()


P1 Badge Handle pt_expt checkpoint dirs without checkpoint file

Reading checkpoint_path / "checkpoint" unconditionally breaks dp --pt_expt freeze -c <dir> for directories produced by the pt_expt trainer, because deepmd/pt_expt/train/training.py writes model.ckpt-<step>.pt plus a model.ckpt.pt symlink and does not create a checkpoint pointer file; in that common flow (including the default -c .), freeze raises FileNotFoundError instead of resolving the latest checkpoint.
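A defensive resolution along the lines this finding suggests could look like the sketch below. This is a hypothetical fix, not code from the PR; the fallback order and the `resolve_checkpoint` name are assumptions, though the `model.ckpt.pt` symlink and `model.ckpt-<step>.pt` naming follow the trainer behavior described above:

```python
from pathlib import Path

def resolve_checkpoint(folder: str) -> Path:
    # Prefer the TF-style "checkpoint" pointer file when present;
    # otherwise fall back to the pt_expt trainer's model.ckpt.pt
    # symlink, then to the most recent model.ckpt-<step>.pt file.
    d = Path(folder)
    pointer = d / "checkpoint"
    if pointer.is_file():
        # strip the trailing newline so the joined path exists
        return d / pointer.read_text().strip()
    link = d / "model.ckpt.pt"
    if link.exists():
        return link
    cands = sorted(d.glob("model.ckpt-*.pt"), key=lambda p: p.stat().st_mtime)
    if not cands:
        raise FileNotFoundError(f"no checkpoint found in {d}")
    return cands[-1]
```

This keeps the default `-c .` flow working for pt_expt checkpoint directories that never write a `checkpoint` pointer file.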


)
nf, nloc, nnei, _ = dmatrix.shape
atype = atype_ext[:, :nloc]
nall = atype_ext.shape[1]

Check notice

Code scanning / CodeQL

Unused local variable Note

Variable nall is not used.
atomic=True,
)

nloc = 6

Check notice

Code scanning / CodeQL

Unused local variable Note test

Variable nloc is not used.

coderabbitai bot commented Mar 9, 2026

📝 Walkthrough

This PR introduces PyTorch-exportable (.pt2) model format support alongside the existing .pte format, using AOTInductor for compiled model execution. It refactors descriptor array operations using new utility functions (xp_take_first_n, xp_take_along_axis), adds a new PyTorchExportable backend enum, implements full C++ and C APIs for inference, provides model serialization/deserialization with ZIP-based metadata, adds a CLI freeze command for model export, and includes comprehensive test coverage across Python, C++, and C integration layers.

Changes

Cohort / File(s) Summary
Backend and format recognition
deepmd/backend/pt_expt.py, source/api_cc/include/common.h, source/api_cc/src/common.cc, source/api_cc/src/DeepPot.cc
Added PyTorchExportable backend enum value, updated file extension recognition to detect .pt2 files, and integrated new backend into DeepPot initialization path alongside existing backends.
Array API helpers
deepmd/dpmodel/model/make_model.py
Introduced xp_take_along_axis and xp_take_first_n public exports for standardized array slicing operations; refactored neighbor-list formatting to use new helpers instead of direct array operations.
Descriptor refactoring (DPA1, DPA2, repformers)
deepmd/dpmodel/descriptor/dpa1.py, deepmd/dpmodel/descriptor/dpa2.py, deepmd/dpmodel/descriptor/repformers.py
Replaced explicit slicing with xp_take_first_n and xp_take_along_axis utilities; introduced 2D neighbor-list gathering to avoid flat indexing issues during export; refactored mapping construction using expand_dims approach.
Utility function refactoring
deepmd/dpmodel/utils/exclude_mask.py, deepmd/dpmodel/utils/nlist.py
Replaced direct array slicing with xp_take_first_n and xp_take_along_axis for consistent neighbor-type and coordinate gathering; reshaped gather logic to mitigate export-related issues.
Mapping index fixes across backends
source/api_cc/src/DeepPotJAX.cc, source/api_cc/src/DeepPotPD.cc, source/api_cc/src/DeepPotPT.cc, source/api_cc/src/DeepSpinPT.cc
Updated mapping construction to use backward/forward map reindexing pattern (fwd_map[lmp_list.mapping[bkw_map[ii]]]) for correct per-atom index translation across all PT-based backends.
C++ pt_expt backend implementation
source/api_cc/include/DeepPotPTExpt.h, source/api_cc/src/DeepPotPTExpt.cc
New DeepPotPTExpt class providing full compute API with AOTInductor integration, custom ZIP parsing for metadata extraction, type-sorted neighbor list building, frame-based and mixed-type computation pipelines, and comprehensive template instantiations for multiple numeric types.
Model serialization and export
deepmd/pt_expt/utils/serialization.py, deepmd/pt_expt/infer/deep_eval.py, deepmd/pt_expt/entrypoints/main.py
Added serialization support for .pt2 format with metadata embedding, new _trace_and_export and _deserialize_to_file_pt2 functions, pt2 file recognition in DeepEval with dual-path execution, and new CLI freeze command with output format enforcement.
C API test suite for pt_expt
source/api_c/tests/test_deeppot_a_ptexpt.cc, source/api_c/tests/test_deeppot_a_fparam_aparam_ptexpt.cc
New C API test fixtures for double/float precision inference validation against precomputed references; includes PBC, NoPBC, and parameter-variant test cases.
C++ API test suites for pt_expt
source/api_cc/tests/test_deeppot_ptexpt.cc, source/api_cc/tests/test_deeppot_a_fparam_aparam_ptexpt.cc, source/api_cc/tests/test_deeppot_a_fparam_aparam_nframes_ptexpt.cc, source/api_cc/tests/test_deeppot_dpa1_pt.cc, source/api_cc/tests/test_deeppot_dpa2_pt.cc, source/api_cc/tests/test_deeppot_dpa2_ptexpt.cc, source/api_cc/tests/test_deeppot_dpa_ptexpt.cc
Comprehensive templated test suites covering single/multi-frame inference, atomic outputs, LAMMPS-style neighbor lists, NoPBC scenarios, and cross-format validation; includes tolerance adjustments per value type.
PT test data updates and utilities
source/api_cc/tests/test_deeppot_a_fparam_aparam_pt.cc, source/api_cc/tests/test_utils.h
Updated expected reference values for pt backend tests; promoted test utility member visibility from private to protected for derived test classes.
Model generation and conversion utilities
source/tests/infer/convert-models.sh, source/tests/infer/gen_dpa1.py, source/tests/infer/gen_dpa2.py, source/tests/infer/gen_fparam_aparam.py
Added conversion scripts and Python generators for .pt2 and .pth model artifacts; includes PBC/NoPBC inference validation and cross-format consistency checks.
Python integration tests
source/tests/pt_expt/infer/test_deep_eval.py, source/tests/pt_expt/test_freeze.py
New test classes for pt_expt deep evaluation covering model metadata, ZIP file validation, multi-frame scenarios, and freeze CLI functionality; includes round-trip serialization and format cross-checks.

Sequence Diagram(s)

sequenceDiagram
    actor User
    participant CLI as Freeze Command
    participant Serializer as Serialization
    participant Tracer as PyTorch Tracer
    participant Compiler as AOTInductor
    participant Storage as File Storage

    User->>CLI: freeze(checkpoint_path, output.pt2)
    CLI->>CLI: Load checkpoint state_dict
    CLI->>CLI: Reconstruct model from params
    CLI->>Serializer: Collect metadata and serialize
    Serializer->>Tracer: Create sample inputs
    Tracer->>Tracer: Trace model forward pass
    Tracer->>Compiler: Export with dynamic shapes
    Compiler->>Compiler: Compile to AOTInductor format
    Compiler->>Storage: Package model + metadata + extra/
    Storage-->>User: Output .pt2 file (ZIP archive)
sequenceDiagram
    actor User
    participant DeepPot as DeepPot Loader
    participant Format as Format Detector
    participant PTExpt as DeepPotPTExpt
    participant Loader as AOTILoader
    participant Model as Compiled Model
    participant Extractor as Output Extractor

    User->>DeepPot: Load model file (.pt2)
    DeepPot->>Format: Detect file type
    Format-->>DeepPot: PyTorchExportable (.pt2)
    DeepPot->>PTExpt: Create DeepPotPTExpt instance
    PTExpt->>PTExpt: Read metadata from ZIP
    PTExpt->>Loader: Initialize AOTIModelPackageLoader
    Loader-->>PTExpt: Package loaded
    User->>PTExpt: compute(coord, atype, box, ...)
    PTExpt->>PTExpt: Build tensors and neighbor lists
    PTExpt->>Model: run_model(inputs)
    Model-->>PTExpt: Raw flat outputs as tensors
    PTExpt->>Extractor: extract_outputs(flat_outputs)
    Extractor-->>PTExpt: Keyed output map
    PTExpt-->>User: energy, force, virial, ...
sequenceDiagram
    participant Descriptor as DPA1/DPA2 Descriptor
    participant ArrayAPI as Array API
    participant Embedding as Embedding Layer
    participant Gathering as Gather Operations

    Descriptor->>ArrayAPI: xp_take_first_n(atype_embd_ext, 1, nloc)
    ArrayAPI->>Gathering: First-n selection along axis
    Gathering-->>Descriptor: atype_embd (local atoms only)
    Descriptor->>ArrayAPI: xp_take_along_axis(atype_embd_nlist, indices_2d)
    ArrayAPI->>Gathering: 2D along-axis gather
    Gathering-->>Descriptor: atype_embd_nlist (neighbor embeddings)
    Descriptor->>Descriptor: Reshape for concatenation
    Descriptor->>Embedding: Concat embeddings for attention
    Embedding-->>Descriptor: Combined features

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

Possibly related PRs

Suggested labels

new feature, PyTorch, C++, export

Suggested reviewers

  • iProzd
  • njzjz
🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 18.75% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The pull request title clearly and specifically describes the main change: adding freeze support for the pt_expt backend with .pte/.pt2 format support.



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 16

🧹 Nitpick comments (7)
source/tests/infer/gen_dpa1.py (1)

143-164: Prefix unused unpacked variables with underscore for ruff compliance.

Static analysis flags several unused variables from the dp.eval() return tuples. Per coding guidelines, prefix them with _ to pass ruff check.

♻️ Proposed fix
-    e1, f1, v1, ae1, av1 = dp.eval(coord, box, atype, atomic=True)
+    e1, f1, _v1, ae1, av1 = dp.eval(coord, box, atype, atomic=True)
     print(f"\n// PBC total energy: {e1[0, 0]:.18e}")  # noqa: T201
     print_cpp_values("PBC reference values", ae1, f1, av1)

     # ---- 5. Run inference for NoPbc test ----
-    e_np, f_np, v_np, ae_np, av_np = dp.eval(coord, None, atype, atomic=True)
+    e_np, f_np, _v_np, ae_np, av_np = dp.eval(coord, None, atype, atomic=True)
     print(f"\n// NoPbc total energy: {e_np[0, 0]:.18e}")  # noqa: T201
     print_cpp_values("NoPbc reference values", ae_np, f_np, av_np)

     # ---- 6. Verify .pth gives same results ----
     dp_pth = DeepPot(pth_path)
-    e_pth, f_pth, v_pth, ae_pth, av_pth = dp_pth.eval(coord, box, atype, atomic=True)
+    e_pth, f_pth, _v_pth, _ae_pth, _av_pth = dp_pth.eval(coord, box, atype, atomic=True)
     print(f"\n// .pth PBC total energy: {e_pth[0, 0]:.18e}")  # noqa: T201
     print(f"// .pth vs .pt2 energy diff: {abs(e1[0, 0] - e_pth[0, 0]):.2e}")  # noqa: T201
     print(f"// .pth vs .pt2 force max diff: {np.max(np.abs(f1 - f_pth)):.2e}")  # noqa: T201

-    e_pth_np, f_pth_np, _, ae_pth_np, av_pth_np = dp_pth.eval(
+    e_pth_np, _f_pth_np, _, _ae_pth_np, _av_pth_np = dp_pth.eval(
         coord, None, atype, atomic=True
     )

As per coding guidelines: **/*.py: Always run ruff check . and ruff format . before committing changes or CI will fail.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@source/tests/infer/gen_dpa1.py` around lines 143 - 164, The tmp variables
returned by dp.eval/dp_pth.eval that are unused (v1, v_np, v_pth) should be
prefixed with an underscore (e.g., rename v1 -> _v1, v_np -> _v_np, v_pth ->
_v_pth) to satisfy ruff; update the unpacking in the calls to DeepPot.eval (the
tuples assigned to e1,f1,v1,ae1,av1; e_np,f_np,v_np,ae_np,av_np; and
e_pth,f_pth,v_pth,ae_pth,av_pth) accordingly and then run ruff check . and ruff
format . before committing.
source/tests/pt_expt/infer/test_deep_eval.py (1)

490-791: Mirror the dynamic-shape regression for .pt2 as well.

This new suite validates fixed-size .pt2 exports, but the refactor in this PR is specifically about avoiding torch.export shape constraints. Porting the existing dynamic-shape check here would cover the main failure mode this change is targeting.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@source/tests/pt_expt/infer/test_deep_eval.py` around lines 490 - 791, The
test class TestDeepEvalEnerPt2 is only exercising fixed-size .pt2 exports; add a
dynamic-shape regression equivalent to the existing dynamic-shape check (used
for .pte) to ensure .pt2 DeepPot.eval accepts variable frame counts and avoids
torch.export shape constraints—modify TestDeepEvalEnerPt2 to include a new test
(e.g., test_pt2_dynamic_shape) that mirrors the logic used in the .pte
dynamic-shape test: generate variable nframes (1,3,5...), create coords and
cells accordingly, call self.dp.eval(...) with atomic=True and compare outputs
(energy, force, virial, atom_energy, atom_virial) across frames to either the
direct model forward (self.model.forward) or the .pte reference
(self.dp_pte.eval) using np.testing.assert_allclose with tight rtol/atol; ensure
the test uses the same RNG seed pattern and atom_types as other tests so it
targets the torch.export shape issue.
source/api_cc/src/DeepPotPTExpt.cc (5)

805-808: Redundant resize() before assign() on datom_energy.

Line 805 resizes datom_energy to nall_real elements, then line 806-808 immediately reassigns it using assign(), which discards the previous resize. The resize() call is unnecessary.

♻️ Proposed fix
-    datom_energy.resize(nall_real, 0.0);
     datom_energy.assign(
         cpu_atom_energy_.data_ptr<VALUETYPE>(),
         cpu_atom_energy_.data_ptr<VALUETYPE>() + cpu_atom_energy_.numel());
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@source/api_cc/src/DeepPotPTExpt.cc` around lines 805 - 808, Remove the
unnecessary pre-sizing: the call to datom_energy.resize(nall_real, 0.0) is
redundant because datom_energy is immediately overwritten by
datom_energy.assign(...). Delete the resize() invocation and keep the
datom_energy.assign(...) that copies from cpu_atom_energy_.data_ptr<VALUETYPE>()
to correctly set the container contents.

526-535: Redundant catch (...) { throw; } block.

The catch block on lines 532-534 catches all exceptions and immediately rethrows them without any processing. This is redundant and can be removed.

♻️ Proposed fix
 DeepPotPTExpt::DeepPotPTExpt(const std::string& model,
                              const int& gpu_rank,
                              const std::string& file_content)
     : inited(false) {
-  try {
-    translate_error([&] { init(model, gpu_rank, file_content); });
-  } catch (...) {
-    throw;
-  }
+  translate_error([&] { init(model, gpu_rank, file_content); });
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@source/api_cc/src/DeepPotPTExpt.cc` around lines 526 - 535, The DeepPotPTExpt
constructor contains a redundant catch-all rethrow (catch (...) { throw; })
after calling translate_error which serves no purpose; remove the catch (...)
block so the constructor simply calls translate_error([&]{ init(model, gpu_rank,
file_content); }); and lets exceptions propagate naturally—update the
DeepPotPTExpt::DeepPotPTExpt implementation accordingly, leaving translate_error
and init unchanged.

1124-1129: Minor: get_type_map appends trailing space.

The function appends a space after each type name, resulting in a trailing space in the output string. This may be intentional for compatibility but could cause issues if the caller expects clean output.

♻️ Proposed fix if trailing space is unintended
 void DeepPotPTExpt::get_type_map(std::string& type_map_str) {
-  for (const auto& t : type_map) {
-    type_map_str += t;
-    type_map_str += " ";
+  for (size_t i = 0; i < type_map.size(); ++i) {
+    if (i > 0) {
+      type_map_str += " ";
+    }
+    type_map_str += type_map[i];
   }
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@source/api_cc/src/DeepPotPTExpt.cc` around lines 1124 - 1129,
DeepPotPTExpt::get_type_map currently appends a space after every entry in
type_map which leaves a trailing space in type_map_str; change the logic to only
insert spaces between entries (e.g., iterate with an index or detect first/last
element) so type_map_str contains the type names separated by single spaces with
no trailing space, preserving the function signature and using the existing
type_map and type_map_str variables.

158-174: parse_bool() and parse_null() lack validation of actual content.

parse_bool() only checks if the input starts with "true" but assumes anything else is "false" without verifying the actual characters. Similarly, parse_null() blindly advances 4 positions without confirming the input is actually "null".

Since this parses trusted model metadata, this is low risk, but adding validation would make it more robust.

♻️ Proposed fix to add validation
   JsonValue parse_bool() {
     JsonValue v;
     v.type = JsonValue::Bool;
     if (s_.substr(pos_, 4) == "true") {
       v.bool_val = true;
       pos_ += 4;
-    } else {
+    } else if (s_.substr(pos_, 5) == "false") {
       v.bool_val = false;
       pos_ += 5;
+    } else {
+      throw deepmd::deepmd_exception("Invalid JSON boolean");
     }
     return v;
   }

   JsonValue parse_null() {
+    if (s_.substr(pos_, 4) != "null") {
+      throw deepmd::deepmd_exception("Invalid JSON null");
+    }
     pos_ += 4;
     return JsonValue();
   }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@source/api_cc/src/DeepPotPTExpt.cc` around lines 158 - 174, parse_bool() and
parse_null() should validate the actual input before advancing pos_ and
returning a JsonValue: check that s_.substr(pos_, ...) matches "true" or "false"
(and that there are enough characters remaining) and only then set v.bool_val
and advance pos_ by 4 or 5; if neither matches, handle the parse error (e.g.,
throw/return an error state). Likewise, in parse_null() verify s_.substr(pos_,4)
== "null" and enough length before advancing pos_ by 4 and returning
JsonValue(), otherwise handle the parse error. Use the existing symbols
parse_bool, parse_null, JsonValue, s_, and pos_ to locate and update the logic.

601-614: Silent exception swallowing when setting thread counts.

The empty catch (...) blocks on lines 606-607 and 612-613 silently discard any exceptions from at::set_num_interop_threads() and at::set_num_threads(). This makes debugging thread configuration issues difficult.

Consider logging a warning when thread configuration fails.

♻️ Proposed fix to add warning logging
   if (num_inter_nthreads) {
     try {
       at::set_num_interop_threads(num_inter_nthreads);
     } catch (...) {
+      std::cerr << "WARNING: Failed to set inter-op threads to "
+                << num_inter_nthreads << std::endl;
     }
   }
   if (num_intra_nthreads) {
     try {
       at::set_num_threads(num_intra_nthreads);
     } catch (...) {
+      std::cerr << "WARNING: Failed to set intra-op threads to "
+                << num_intra_nthreads << std::endl;
     }
   }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@source/api_cc/src/DeepPotPTExpt.cc` around lines 601 - 614, The code
currently swallows exceptions from at::set_num_interop_threads and
at::set_num_threads via empty catch(...) blocks; replace these with try/catch
that logs a warning: catch(const std::exception& e) should log a clear warning
including e.what() and the function name (at::set_num_interop_threads or
at::set_num_threads) and the attempted thread count, and retain a catch(...)
that logs an "unknown exception" warning if non-std exceptions occur; locate
these changes around get_env_nthreads handling and ensure the warning uses your
project's logging mechanism (or std::cerr if no logger is available).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@deepmd/dpmodel/descriptor/dpa1.py`:
- Line 1060: Remove the unused local assignment "nall = atype_ext.shape[1]" from
dpa1.py (the dead variable 'nall' referenced in the review) so the F841 lint
error is resolved; locate the assignment near the code that references
'atype_ext' and delete that single line, then run "ruff check ." and "ruff
format ." to ensure no other lint errors remain.

In `@deepmd/pt_expt/entrypoints/main.py`:
- Around line 247-250: The code reads the checkpoint pointer into
latest_ckpt_file with Path(...).read_text() which can include a trailing
newline, causing FLAGS.model to point to a non-existent path; update the logic
that sets latest_ckpt_file (used when FLAGS.checkpoint_folder is a directory) to
strip whitespace/newlines (e.g., .strip()) before calling
checkpoint_path.joinpath(...) so FLAGS.model is set to the clean filename used
by torch.load().
- Around line 163-208: The freeze function currently ignores the head parameter;
update freeze (the function handling model loading, get_model, ModelWrapper,
serialize, deserialize_to_file) to fail-fast when head is not None by raising a
clear exception (e.g., ValueError) indicating per-head freezing is not yet
supported, so callers are not silently misled; add the check near the start of
freeze before reconstructing the model and ensure the error message references
the unsupported head behavior, then run ruff check . and ruff format . to
satisfy linting/style rules.

In `@deepmd/pt_expt/infer/deep_eval.py`:
- Around line 167-169: The change clears self.exported_module after loading a
.pt2 model which breaks get_model() (which returns exported_module); update the
loader so that after aoti_load_package(model_file) you either set
self.exported_module = self._pt2_runner (so get_model() continues to work) or
update get_model() to detect the .pt2 case and return self._pt2_runner (or raise
a clear UnsupportedOperationError) — locate the code around
aoti_load_package(model_file), the attributes self._pt2_runner and
self.exported_module, and the get_model() method to implement one of these
fixes.

In `@deepmd/pt_expt/utils/serialization.py`:
- Around line 355-362: The code uses the private API
torch._inductor.aoti_compile_and_package (called with exported and
package_path=model_file after _trace_and_export), which is unstable; update the
call site to guard against API changes by either pinning the supported PyTorch
version in packaging/tests or adding a runtime compatibility layer: attempt an
import of aoti_compile_and_package from torch._inductor, fall back to an
alternative public API or raise a clear error that includes the PyTorch version
and helpful instructions; keep references to exported, _trace_and_export, and
model_file so the fallback/compatibility check wraps the exact call site and
logs actionable messages when the private symbol is unavailable or
signature-changed.

In `@source/api_cc/include/common.h`:
- Line 16: The new PyTorchExportable enumerator shifted existing ordinals and
can break binary compatibility for get_backend(); to fix, preserve existing enum
values by moving PyTorchExportable to the end just before Unknown (i.e., keep
the original order TensorFlow, PyTorch, Paddle, JAX, PyTorchExportable, Unknown)
or alternatively give each DPBackend an explicit integral value so TensorFlow,
PyTorch, Paddle, JAX retain their previous numeric values; update the DPBackend
enum definition accordingly to avoid renumbering.

In `@source/api_cc/src/common.cc`:
- Around line 1455-1461: The get_backend function (deepmd::get_backend)
currently checks for ".pt2" but not ".pte", causing ".pte" filenames to be
treated as unsupported; update the suffix checks in deepmd::get_backend so that
files ending with ".pte" are routed to deepmd::DPBackend::PyTorchExportable
(i.e., treat ".pte" the same as ".pt2")—modify the branch that returns
deepmd::DPBackend::PyTorchExportable to accept either suffix.

In `@source/api_cc/src/DeepPotPTExpt.cc`:
- Around line 65-66: The get() method lacks a bounds check and can read past s_
when the input is malformed; update get() (which uses pos_ and s_) to first
verify pos_ < s_.size() and handle the out-of-bounds case (e.g., return '\0' or
signal an error) instead of unconditionally indexing and incrementing pos_,
ensuring it mirrors the safety of peek().
- Around line 240-263: The ZIP parser can underflow when computing
content.size() - 22; before the EOCD search, add a guard that verifies
content.size() >= 22 (or return/throw an error) and avoid computing
content.size() - 22 when the file is smaller; update the EOCD search around
variables content and eocd_pos (and the loop that starts from content.size() -
22) to only run when the size check passes and otherwise throw the same
deepmd::deepmd_exception("Invalid ZIP file: " + zip_path) or a more specific
message.
- Around line 641-648: The size check in DeepPotPTExpt::extract_outputs
currently uses assert(), which is compiled out in release builds; replace it
with a runtime check (e.g., an if that compares flat_outputs.size() and
output_keys.size()) and throw a descriptive exception (std::runtime_error or
similar) if they differ to prevent undefined behavior when populating output_map
using output_keys and flat_outputs; keep the rest of the function logic the
same.

In `@source/api_cc/tests/test_deeppot_dpa_ptexpt.cc`:
- Around line 378-382: The loop that populates coord_vir only copies one scalar
per virtual atom into a buffer sized nvir*3, producing malformed virtual atom
coordinates; update the loop that fills coord_vir (and keeps atype_vir) to copy
all three coordinates per virtual atom by writing coord_vir[3*ii + 0..2] from
coord[3*ii + 0..2] for each ii (0..nvir-1) so each synthetic atom has full x,y,z
positions.

In `@source/api_cc/tests/test_deeppot_dpa2_ptexpt.cc`:
- Around line 387-391: The coord_vir buffer is sized for nvir*3 but the loop
only copies one scalar per virtual atom, leaving two coordinates zero; update
the loop that fills coord_vir (the section using coord_vir, coord, and nvir) to
copy all three coordinates per atom (e.g., iterate j=0..2 and assign
coord_vir[3*ii + j] = coord[3*ii + j] or memcpy three values) so each virtual
atom gets its full (x,y,z) position rather than a single scalar.

In `@source/api_cc/tests/test_deeppot_ptexpt.cc`:
- Around line 360-364: The loop that initializes coord_vir only copies a single
coordinate per virtual atom into a buffer sized nvir*3; update the
initialization for coord_vir so that for each virtual atom index ii you copy all
three coordinates (e.g., assign coord_vir[3*ii + 0..2] = coord[0..2] or use
std::copy/memcpy) so the 3D coordinates for each synthetic virtual atom are
properly populated; keep atype_vir initialization as-is.

In `@source/tests/infer/gen_dpa2.py`:
- Around line 127-135: The current flow attempts
pt_deserialize_to_file(pth_path, ...) and then later still checks
os.path.exists(pth_path), which can cause loading a stale deeppot_dpa2.pth if
the export failed; change the logic so that after catching RuntimeError from
pt_deserialize_to_file you do not proceed to verify or load pth_path (e.g., set
a local success flag or return/continue) and only perform the
os.path.exists(pth_path) verification when the export actually succeeded; apply
the same fix for the second occurrence that mirrors lines 181-198.
- Around line 171-190: The lint error comes from unpacking unused returns from
dp.eval/dp_pth.eval; update the unused variable names by prefixing them with an
underscore (e.g., change v_np -> _v_np, ae_np -> _ae_np, av_np -> _av_np in the
dp.eval call and similarly use _ for the unused third return in dp_pth.eval and
prefix any unused ae_pth_np/av_pth_np as needed) so that eval(...) still returns
all values but ruff RUF059 is satisfied; locate uses around the dp.eval and
dp_pth.eval calls and rename only the unused symbols (leave e_np, f_np, e_pth,
f_pth etc. as-is).

In `@source/tests/infer/gen_fparam_aparam.py`:
- Around line 127-177: The new test generator declares unused variables (v,
nloc, ae_pth, av_pth) which causes ruff F841; fix by either using or marking
them as intentionally unused—rename them with a leading underscore (e.g., _v,
_nloc, _ae_pth, _av_pth) or replace their assignments with a single
underscore/tuple discard when calling DeepPot.eval (e.g., e_pth, f_pth, _ = ...
or e_pth, f_pth, *_ = ...) so that functions/methods DeepPot.eval and local
names v, nloc, ae_pth, av_pth no longer trigger unused-variable errors.

---

Nitpick comments:
In `@source/api_cc/src/DeepPotPTExpt.cc`:
- Around line 805-808: Remove the unnecessary pre-sizing: the call to
datom_energy.resize(nall_real, 0.0) is redundant because datom_energy is
immediately overwritten by datom_energy.assign(...). Delete the resize()
invocation and keep the datom_energy.assign(...) that copies from
cpu_atom_energy_.data_ptr<VALUETYPE>() to correctly set the container contents.
- Around line 526-535: The DeepPotPTExpt constructor wraps its call to
translate_error in a redundant catch-all rethrow (catch (...) { throw; }) that
serves no purpose; remove the catch (...) block so the constructor simply calls
translate_error([&]{ init(model, gpu_rank, file_content); }); and lets
exceptions propagate naturally—update the DeepPotPTExpt::DeepPotPTExpt
implementation accordingly, leaving translate_error and init unchanged.
- Around line 1124-1129: DeepPotPTExpt::get_type_map currently appends a space
after every entry in type_map which leaves a trailing space in type_map_str;
change the logic to only insert spaces between entries (e.g., iterate with an
index or detect first/last element) so type_map_str contains the type names
separated by single spaces with no trailing space, preserving the function
signature and using the existing type_map and type_map_str variables.
- Around line 158-174: parse_bool() and parse_null() should validate the actual
input before advancing pos_ and returning a JsonValue: check that
s_.substr(pos_, ...) matches "true" or "false" (and that there are enough
characters remaining) and only then set v.bool_val and advance pos_ by 4 or 5;
if neither matches, handle the parse error (e.g., throw/return an error state).
Likewise, in parse_null() verify s_.substr(pos_,4) == "null" and enough length
before advancing pos_ by 4 and returning JsonValue(), otherwise handle the parse
error. Use the existing symbols parse_bool, parse_null, JsonValue, s_, and pos_
to locate and update the logic.
- Around line 601-614: The code currently swallows exceptions from
at::set_num_interop_threads and at::set_num_threads via empty catch(...) blocks;
replace these with try/catch that logs a warning: catch(const std::exception& e)
should log a clear warning including e.what() and the function name
(at::set_num_interop_threads or at::set_num_threads) and the attempted thread
count, and retain a catch(...) that logs an "unknown exception" warning if
non-std exceptions occur; locate these changes around get_env_nthreads handling
and ensure the warning uses your project's logging mechanism (or std::cerr if no
logger is available).

In `@source/tests/infer/gen_dpa1.py`:
- Around line 143-164: The tmp variables returned by dp.eval/dp_pth.eval that
are unused (v1, v_np, v_pth) should be prefixed with an underscore (e.g., rename
v1 -> _v1, v_np -> _v_np, v_pth -> _v_pth) to satisfy ruff; update the unpacking
in the calls to DeepPot.eval (the tuples assigned to e1,f1,v1,ae1,av1;
e_np,f_np,v_np,ae_np,av_np; and e_pth,f_pth,v_pth,ae_pth,av_pth) accordingly and
then run ruff check . and ruff format . before committing.

In `@source/tests/pt_expt/infer/test_deep_eval.py`:
- Around line 490-791: The test class TestDeepEvalEnerPt2 is only exercising
fixed-size .pt2 exports; add a dynamic-shape regression equivalent to the
existing dynamic-shape check (used for .pte) to ensure .pt2 DeepPot.eval accepts
variable frame counts and avoids torch.export shape constraints—modify
TestDeepEvalEnerPt2 to include a new test (e.g., test_pt2_dynamic_shape) that
mirrors the logic used in the .pte dynamic-shape test: generate variable nframes
(1,3,5...), create coords and cells accordingly, call self.dp.eval(...) with
atomic=True and compare outputs (energy, force, virial, atom_energy,
atom_virial) across frames to either the direct model forward
(self.model.forward) or the .pte reference (self.dp_pte.eval) using
np.testing.assert_allclose with tight rtol/atol; ensure the test uses the same
RNG seed pattern and atom_types as other tests so it targets the torch.export
shape issue.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 4c480049-2bf8-44ef-a12e-752b9dcff61a

📥 Commits

Reviewing files that changed from the base of the PR and between 24e54bf and 6145fe8.

📒 Files selected for processing (39)
  • deepmd/backend/pt_expt.py
  • deepmd/dpmodel/descriptor/dpa1.py
  • deepmd/dpmodel/descriptor/dpa2.py
  • deepmd/dpmodel/descriptor/repformers.py
  • deepmd/dpmodel/model/make_model.py
  • deepmd/dpmodel/utils/exclude_mask.py
  • deepmd/dpmodel/utils/nlist.py
  • deepmd/pt_expt/entrypoints/main.py
  • deepmd/pt_expt/infer/deep_eval.py
  • deepmd/pt_expt/utils/serialization.py
  • source/api_c/tests/test_deeppot_a_fparam_aparam_ptexpt.cc
  • source/api_c/tests/test_deeppot_a_ptexpt.cc
  • source/api_cc/include/DeepPotPTExpt.h
  • source/api_cc/include/common.h
  • source/api_cc/src/DeepPot.cc
  • source/api_cc/src/DeepPotJAX.cc
  • source/api_cc/src/DeepPotPD.cc
  • source/api_cc/src/DeepPotPT.cc
  • source/api_cc/src/DeepPotPTExpt.cc
  • source/api_cc/src/DeepSpinPT.cc
  • source/api_cc/src/common.cc
  • source/api_cc/tests/test_deeppot_a_fparam_aparam_nframes_ptexpt.cc
  • source/api_cc/tests/test_deeppot_a_fparam_aparam_pt.cc
  • source/api_cc/tests/test_deeppot_a_fparam_aparam_ptexpt.cc
  • source/api_cc/tests/test_deeppot_dpa1_pt.cc
  • source/api_cc/tests/test_deeppot_dpa2_pt.cc
  • source/api_cc/tests/test_deeppot_dpa2_ptexpt.cc
  • source/api_cc/tests/test_deeppot_dpa_ptexpt.cc
  • source/api_cc/tests/test_deeppot_ptexpt.cc
  • source/api_cc/tests/test_utils.h
  • source/tests/infer/convert-models.sh
  • source/tests/infer/deeppot_dpa.pth
  • source/tests/infer/deeppot_sea.pth
  • source/tests/infer/fparam_aparam.pth
  • source/tests/infer/gen_dpa1.py
  • source/tests/infer/gen_dpa2.py
  • source/tests/infer/gen_fparam_aparam.py
  • source/tests/pt_expt/infer/test_deep_eval.py
  • source/tests/pt_expt/test_freeze.py

)
nf, nloc, nnei, _ = dmatrix.shape
atype = atype_ext[:, :nloc]
nall = atype_ext.shape[1]
⚠️ Potential issue | 🟠 Major

Drop the dead nall assignment.

Ruff already flags this as F841, so the Python lint job will keep failing until this local is removed.

💡 Suggested fix
-        nall = atype_ext.shape[1]
         atype = xp_take_first_n(atype_ext, 1, nloc)
As per coding guidelines, `**/*.py`: Always run `ruff check .` and `ruff format .` before committing changes or CI will fail.
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
nall = atype_ext.shape[1]
atype = xp_take_first_n(atype_ext, 1, nloc)
🧰 Tools
🪛 Ruff (0.15.4)

[error] 1060-1060: Local variable nall is assigned to but never used

Remove assignment to unused variable nall

(F841)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@deepmd/dpmodel/descriptor/dpa1.py` at line 1060, Remove the unused local
assignment "nall = atype_ext.shape[1]" from dpa1.py (the dead variable 'nall'
referenced in the review) so the F841 lint error is resolved; locate the
assignment near the code that references 'atype_ext' and delete that single
line, then run "ruff check ." and "ruff format ." to ensure no other lint errors
remain.
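As a quick illustration of why the F841 finding matters, here is a hedged Python sketch (the function and variable names are hypothetical, not the actual dpa1.py code): the dead store changes nothing observable, so deleting it is safe.

```python
def take_first_n(rows, nloc):
    # stand-in for xp_take_first_n on a list-of-lists "array"
    return [row[:nloc] for row in rows]


def forward_before(atype_ext, nloc):
    nall = len(atype_ext[0])  # noqa: F841 -- dead store, flagged by ruff
    return take_first_n(atype_ext, nloc)


def forward_after(atype_ext, nloc):
    # identical behavior with the dead local removed
    return take_first_n(atype_ext, nloc)


atype_ext = [[0, 1, 0, 1, 1]]
assert forward_before(atype_ext, 3) == forward_after(atype_ext, 3) == [[0, 1, 0]]
```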

Comment on lines +163 to +208
def freeze(
model: str,
output: str = "frozen_model.pt2",
head: str | None = None,
) -> None:
"""Freeze a pt_expt training checkpoint to .pte or .pt2 format.

Parameters
----------
model : str
Path to the training checkpoint (.pt file).
output : str
Path for the frozen model output (.pte or .pt2).
head : str or None
Head to freeze in a multi-task model (not yet supported).
"""
import torch

from deepmd.pt_expt.model import (
get_model,
)
from deepmd.pt_expt.train.wrapper import (
ModelWrapper,
)
from deepmd.pt_expt.utils.env import (
DEVICE,
)
from deepmd.pt_expt.utils.serialization import (
deserialize_to_file,
)

state_dict = torch.load(model, map_location=DEVICE, weights_only=True)
if "model" in state_dict:
state_dict = state_dict["model"]
model_params = state_dict["_extra_state"]["model_params"]

# Reconstruct model and load weights
pt_expt_model = get_model(model_params).to(DEVICE)
wrapper = ModelWrapper(pt_expt_model)
wrapper.load_state_dict(state_dict)
pt_expt_model.eval()

# Serialize to dict and export
model_dict = pt_expt_model.serialize()
deserialize_to_file(output, {"model": model_dict})
log.info(f"Saved frozen model to {output}")
⚠️ Potential issue | 🟠 Major

Don't silently ignore --head.

If the caller passes a head today, freeze() still exports the default/full model and never tells them the request was ignored. Fail fast until per-head freezing is actually implemented.

🛑 Suggested guard
 def freeze(
     model: str,
     output: str = "frozen_model.pt2",
     head: str | None = None,
 ) -> None:
@@
     from deepmd.pt_expt.utils.serialization import (
         deserialize_to_file,
     )

+    if head is not None:
+        raise NotImplementedError(
+            "--head is not supported for the pt_expt freeze command yet."
+        )
+
     state_dict = torch.load(model, map_location=DEVICE, weights_only=True)

As per coding guidelines, "Always run ruff check . and ruff format . before committing changes or CI will fail."

🧰 Tools
🪛 Ruff (0.15.4)

[warning] 166-166: Unused function argument: head

(ARG001)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@deepmd/pt_expt/entrypoints/main.py` around lines 163 - 208, The freeze
function currently ignores the head parameter; update freeze (the function
handling model loading, get_model, ModelWrapper, serialize, deserialize_to_file)
to fail-fast when head is not None by raising a clear exception (e.g.,
ValueError) indicating per-head freezing is not yet supported, so callers are
not silently misled; add the check near the start of freeze before
reconstructing the model and ensure the error message references the unsupported
head behavior, then run ruff check . and ruff format . to satisfy linting/style
rules.

Comment on lines +247 to +250
if Path(FLAGS.checkpoint_folder).is_dir():
checkpoint_path = Path(FLAGS.checkpoint_folder)
latest_ckpt_file = (checkpoint_path / "checkpoint").read_text()
FLAGS.model = str(checkpoint_path.joinpath(latest_ckpt_file))
⚠️ Potential issue | 🟡 Minor

Strip the checkpoint pointer before joining the path.

Path.read_text() preserves trailing newlines, so a line-based checkpoint file can produce FLAGS.model with a trailing \n and make torch.load() miss the file.

🧩 Suggested fix
-            latest_ckpt_file = (checkpoint_path / "checkpoint").read_text()
-            FLAGS.model = str(checkpoint_path.joinpath(latest_ckpt_file))
+            latest_ckpt_file = (checkpoint_path / "checkpoint").read_text().strip()
+            FLAGS.model = str(checkpoint_path / latest_ckpt_file)
📝 Committable suggestion


Suggested change
if Path(FLAGS.checkpoint_folder).is_dir():
checkpoint_path = Path(FLAGS.checkpoint_folder)
latest_ckpt_file = (checkpoint_path / "checkpoint").read_text()
FLAGS.model = str(checkpoint_path.joinpath(latest_ckpt_file))
if Path(FLAGS.checkpoint_folder).is_dir():
checkpoint_path = Path(FLAGS.checkpoint_folder)
latest_ckpt_file = (checkpoint_path / "checkpoint").read_text().strip()
FLAGS.model = str(checkpoint_path / latest_ckpt_file)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@deepmd/pt_expt/entrypoints/main.py` around lines 247 - 250, The code reads
the checkpoint pointer into latest_ckpt_file with Path(...).read_text() which
can include a trailing newline, causing FLAGS.model to point to a non-existent
path; update the logic that sets latest_ckpt_file (used when
FLAGS.checkpoint_folder is a directory) to strip whitespace/newlines (e.g.,
.strip()) before calling checkpoint_path.joinpath(...) so FLAGS.model is set to
the clean filename used by torch.load().
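A minimal, runnable sketch of the failure mode (the directory and checkpoint file names are made up for illustration): Path.read_text() keeps the trailing newline, so the joined checkpoint path is wrong until it is stripped.

```python
import tempfile
from pathlib import Path

# simulate a line-oriented checkpoint pointer file, newline included
ckpt_dir = Path(tempfile.mkdtemp())
(ckpt_dir / "checkpoint").write_text("model.ckpt-100.pt\n")

raw = (ckpt_dir / "checkpoint").read_text()
clean = raw.strip()

assert raw.endswith("\n")                  # the newline survives read_text()
assert str(ckpt_dir / raw).endswith("\n")  # broken path: torch.load would miss it
assert (ckpt_dir / clean).name == "model.ckpt-100.pt"  # stripped path is correct
```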

Comment on lines +167 to +169
# Load the AOTInductor model
self._pt2_runner = aoti_load_package(model_file)
self.exported_module = None
⚠️ Potential issue | 🟠 Major

Don’t silently break get_model() for .pt2.

Line 169 clears self.exported_module, but get_model() still returns that attribute at Line 710. After this change, .pt2 callers get None instead of the loaded artifact. Either return _pt2_runner from get_model() for this format or make get_model() raise a clear unsupported-operation error.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@deepmd/pt_expt/infer/deep_eval.py` around lines 167 - 169, The change clears
self.exported_module after loading a .pt2 model which breaks get_model() (which
returns exported_module); update the loader so that after
aoti_load_package(model_file) you either set self.exported_module =
self._pt2_runner (so get_model() continues to work) or update get_model() to
detect the .pt2 case and return self._pt2_runner (or raise a clear
UnsupportedOperationError) — locate the code around
aoti_load_package(model_file), the attributes self._pt2_runner and
self.exported_module, and the get_model() method to implement one of these
fixes.

Comment on lines +355 to +362
from torch._inductor import (
aoti_compile_and_package,
)

exported, metadata, data_for_json, output_keys = _trace_and_export(data)

# Compile via AOTInductor into a .pt2 package
aoti_compile_and_package(exported, package_path=model_file)
⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🌐 Web query:

PyTorch AOTInductor aoti_compile_and_package stable API

💡 Result:

torch._inductor.aoti_compile_and_package() is not a stable/public API in the PyTorch sense: it lives under the underscored torch._inductor namespace, which PyTorch uses for internal/developer-facing functionality that may change without a backward-compatibility guarantee (even if it’s documented). PyTorch’s docs explicitly warn that private/developer-facing APIs “may be subject to changes without warning.” [3]

What is “stable” around AOTInductor today is not the Python compile entrypoint, but the deployment/runtime ABI surface:

  • The official AOTInductor tutorial documents torch._inductor.aoti_compile_and_package(exported_program, *, package_path=None, inductor_configs=None) -> str and torch._inductor.aoti_load_package(...) for producing/loading a .pt2 artifact (PT2 archive). [1]
  • For cross-version binary compatibility, PyTorch has a LibTorch Stable ABI effort that includes stable C headers (originating from AOTInductor) and higher-level torch::stable C++ wrappers, with an explicit ABI stability promise (not modifying headers for at least ~2 years after release). [2][4]

Practical takeaway: if you need “stable,” treat aoti_compile_and_package as version-pinned build tooling, and rely on the LibTorch Stable ABI / torch::stable side for the runtime integration guarantees. [2][4]

Sources: [1] AOTInductor docs (PyTorch) • [2] LibTorch Stable ABI docs (PyTorch) • [3] “Extending PyTorch” docs note on private APIs changing without warning • [4] Torch Stable API C++ docs.


torch._inductor.aoti_compile_and_package is a private/unstable PyTorch API that may change between versions.

While it is currently the documented way to produce AOTInductor .pt2 packages, torch._inductor is internal and not covered by PyTorch's backward-compatibility guarantees. Consider either pinning the PyTorch version or adding API compatibility checks to handle potential breaking changes in future releases.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@deepmd/pt_expt/utils/serialization.py` around lines 355 - 362, The code uses
the private API torch._inductor.aoti_compile_and_package (called with exported
and package_path=model_file after _trace_and_export), which is unstable; update
the call site to guard against API changes by either pinning the supported
PyTorch version in packaging/tests or adding a runtime compatibility layer:
attempt an import of aoti_compile_and_package from torch._inductor, fall back to
an alternative public API or raise a clear error that includes the PyTorch
version and helpful instructions; keep references to exported,
_trace_and_export, and model_file so the fallback/compatibility check wraps the
exact call site and logs actionable messages when the private symbol is
unavailable or signature-changed.
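One possible shape for the compatibility layer described above (a sketch only; the wrapper name and the injectable importer are inventions for illustration, while aoti_compile_and_package itself is the private symbol discussed in the review):

```python
def load_aoti_compiler(importer=None):
    """Resolve the private AOTInductor entry point with a clear failure mode.

    ``importer`` is injectable so the fallback path can be tested without torch.
    """
    if importer is None:
        def importer():
            from torch._inductor import aoti_compile_and_package
            return aoti_compile_and_package
    try:
        return importer()
    except ImportError as err:
        raise RuntimeError(
            ".pt2 export requires torch._inductor.aoti_compile_and_package, "
            "a private API that may move between PyTorch releases; pin a "
            "supported torch version or export to .pte instead."
        ) from err


def _broken_importer():
    raise ImportError("no torch._inductor here")


try:
    load_aoti_compiler(_broken_importer)
    raised = False
except RuntimeError as err:
    raised = "aoti_compile_and_package" in str(err)
assert raised  # the unavailable private symbol surfaces as an actionable error
```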

Comment on lines +387 to +391
std::vector<VALUETYPE> coord_vir(nvir * 3);
std::vector<int> atype_vir(nvir, 2);
for (int ii = 0; ii < nvir; ++ii) {
coord_vir[ii] = coord[ii];
}
⚠️ Potential issue | 🟡 Minor

Copy all 3 coordinates for each synthetic virtual atom.

This loop only copies nvir scalars into a nvir * 3 coordinate buffer, so most of the virtual-atom coordinates stay zero. The type-selection test then runs against malformed positions.

💡 Suggested fix
-  for (int ii = 0; ii < nvir; ++ii) {
-    coord_vir[ii] = coord[ii];
-  }
+  std::copy_n(coord.begin(), nvir * 3, coord_vir.begin());
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@source/api_cc/tests/test_deeppot_dpa2_ptexpt.cc` around lines 387 - 391, The
coord_vir buffer is sized for nvir*3 but the loop only copies one scalar per
virtual atom, leaving two coordinates zero; update the loop that fills coord_vir
(the section using coord_vir, coord, and nvir) to copy all three coordinates per
atom (e.g., iterate j=0..2 and assign coord_vir[3*ii + j] = coord[3*ii + j] or
memcpy three values) so each virtual atom gets its full (x,y,z) position rather
than a single scalar.

Comment on lines +360 to +364
std::vector<VALUETYPE> coord_vir(nvir * 3);
std::vector<int> atype_vir(nvir, 2);
for (int ii = 0; ii < nvir; ++ii) {
coord_vir[ii] = coord[ii];
}
⚠️ Potential issue | 🟡 Minor

Copy all 3 coordinates for each synthetic virtual atom.

This loop only fills the first nvir entries of a nvir * 3 buffer, so the remaining coordinates stay zero. The test then exercises the type-selection path with malformed virtual atoms.

💡 Suggested fix
-  for (int ii = 0; ii < nvir; ++ii) {
-    coord_vir[ii] = coord[ii];
-  }
+  std::copy_n(coord.begin(), nvir * 3, coord_vir.begin());
📝 Committable suggestion


Suggested change
std::vector<VALUETYPE> coord_vir(nvir * 3);
std::vector<int> atype_vir(nvir, 2);
for (int ii = 0; ii < nvir; ++ii) {
coord_vir[ii] = coord[ii];
}
std::vector<VALUETYPE> coord_vir(nvir * 3);
std::vector<int> atype_vir(nvir, 2);
std::copy_n(coord.begin(), nvir * 3, coord_vir.begin());
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@source/api_cc/tests/test_deeppot_ptexpt.cc` around lines 360 - 364, The loop
that initializes coord_vir only copies a single coordinate per virtual atom into
a buffer sized nvir*3; update the initialization for coord_vir so that for each
virtual atom index ii you copy all three coordinates (e.g., assign
coord_vir[3*ii + 0..2] = coord[0..2] or use std::copy/memcpy) so the 3D
coordinates for each synthetic virtual atom are properly populated; keep
atype_vir initialization as-is.
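The bug is easy to see in a few lines of Python (a sketch with made-up numbers, mirroring the C++ loop above):

```python
nvir = 2
coord = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]  # two atoms, (x, y, z) each

buggy = [0.0] * (nvir * 3)
for ii in range(nvir):       # mirrors: coord_vir[ii] = coord[ii]
    buggy[ii] = coord[ii]

fixed = [0.0] * (nvir * 3)
fixed[:nvir * 3] = coord[:nvir * 3]  # mirrors: std::copy_n(coord.begin(), ...)

assert buggy == [1.0, 2.0, 0.0, 0.0, 0.0, 0.0]  # y/z of both atoms lost
assert fixed == [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]  # full coordinates preserved
```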

Comment on lines +127 to +135
pth_path = os.path.join(base_dir, "deeppot_dpa2.pth")
print(f"Exporting to {pth_path} ...") # noqa: T201
try:
pt_deserialize_to_file(pth_path, copy.deepcopy(data))
except RuntimeError as e:
# Custom ops (e.g. tabulate_fusion_se_t_tebd) may not be available
# in all build environments; .pth generation is not critical.
print(f"WARNING: .pth export failed ({e}), skipping.") # noqa: T201

⚠️ Potential issue | 🟠 Major

Don’t verify a stale .pth artifact after export failure.

If pt_deserialize_to_file() throws and an old deeppot_dpa2.pth is already present, the later os.path.exists(pth_path) branch will still load that stale artifact. That can silently validate the wrong weights and leave regenerated fixtures out of sync with the fresh .pt2 model.

💡 Suggested fix
     pth_path = os.path.join(base_dir, "deeppot_dpa2.pth")
+    pth_generated = False
     print(f"Exporting to {pth_path} ...")  # noqa: T201
     try:
+        if os.path.exists(pth_path):
+            os.remove(pth_path)
         pt_deserialize_to_file(pth_path, copy.deepcopy(data))
+        pth_generated = True
     except RuntimeError as e:
+        if os.path.exists(pth_path):
+            os.remove(pth_path)
         # Custom ops (e.g. tabulate_fusion_se_t_tebd) may not be available
         # in all build environments; .pth generation is not critical.
         print(f"WARNING: .pth export failed ({e}), skipping.")  # noqa: T201
@@
-    if os.path.exists(pth_path):
+    if pth_generated:

Also applies to: 181-198

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@source/tests/infer/gen_dpa2.py` around lines 127 - 135, The current flow
attempts pt_deserialize_to_file(pth_path, ...) and then later still checks
os.path.exists(pth_path), which can cause loading a stale deeppot_dpa2.pth if
the export failed; change the logic so that after catching RuntimeError from
pt_deserialize_to_file you do not proceed to verify or load pth_path (e.g., set
a local success flag or return/continue) and only perform the
os.path.exists(pth_path) verification when the export actually succeeded; apply
the same fix for the second occurrence that mirrors lines 181-198.

Comment on lines +171 to +190
e1, f1, v1, ae1, av1 = dp.eval(coord, box, atype, atomic=True)
print(f"\n// PBC total energy: {e1[0, 0]:.18e}") # noqa: T201
print_cpp_values("PBC reference values", ae1, f1, av1)

# ---- 5. Run inference for NoPbc test ----
e_np, f_np, v_np, ae_np, av_np = dp.eval(coord, None, atype, atomic=True)
print(f"\n// NoPbc total energy: {e_np[0, 0]:.18e}") # noqa: T201
print_cpp_values("NoPbc reference values", ae_np, f_np, av_np)

# ---- 6. Verify .pth gives same results ----
if os.path.exists(pth_path):
dp_pth = DeepPot(pth_path)
e_pth, f_pth, v_pth, ae_pth, av_pth = dp_pth.eval(
coord, box, atype, atomic=True
)
print(f"\n// .pth PBC total energy: {e_pth[0, 0]:.18e}") # noqa: T201
print(f"// .pth vs .pt2 energy diff: {abs(e1[0, 0] - e_pth[0, 0]):.2e}") # noqa: T201
print(f"// .pth vs .pt2 force max diff: {np.max(np.abs(f1 - f_pth)):.2e}") # noqa: T201

e_pth_np, f_pth_np, _, ae_pth_np, av_pth_np = dp_pth.eval(
⚠️ Potential issue | 🟠 Major

Prefix the intentionally unused eval outputs.

ruff is already reporting RUF059 here for the unused unpacked values, so this helper will keep failing lint until those names are marked as intentionally unused.

💡 Suggested fix
-    e1, f1, v1, ae1, av1 = dp.eval(coord, box, atype, atomic=True)
+    e1, f1, _v1, ae1, av1 = dp.eval(coord, box, atype, atomic=True)
@@
-    e_np, f_np, v_np, ae_np, av_np = dp.eval(coord, None, atype, atomic=True)
+    e_np, f_np, _v_np, ae_np, av_np = dp.eval(coord, None, atype, atomic=True)
@@
-        e_pth, f_pth, v_pth, ae_pth, av_pth = dp_pth.eval(
+        e_pth, f_pth, _v_pth, _ae_pth, _av_pth = dp_pth.eval(
             coord, box, atype, atomic=True
         )
@@
-        e_pth_np, f_pth_np, _, ae_pth_np, av_pth_np = dp_pth.eval(
+        e_pth_np, _f_pth_np, _, _ae_pth_np, _av_pth_np = dp_pth.eval(
             coord, None, atype, atomic=True
         )
As per coding guidelines, `**/*.py`: Always run `ruff check .` and `ruff format .` before committing changes or CI will fail.
🧰 Tools
🪛 Ruff (0.15.4)

[warning] 171-171: Unpacked variable v1 is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)


[warning] 176-176: Unpacked variable v_np is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)


[warning] 183-183: Unpacked variable v_pth is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)


[warning] 183-183: Unpacked variable ae_pth is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)


[warning] 183-183: Unpacked variable av_pth is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)


[warning] 190-190: Unpacked variable f_pth_np is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)


[warning] 190-190: Unpacked variable ae_pth_np is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)


[warning] 190-190: Unpacked variable av_pth_np is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@source/tests/infer/gen_dpa2.py` around lines 171 - 190, The lint error comes
from unpacking unused returns from dp.eval/dp_pth.eval; update the unused
variable names by prefixing them with an underscore (e.g., change v_np -> _v_np,
ae_np -> _ae_np, av_np -> _av_np in the dp.eval call and similarly use _ for the
unused third return in dp_pth.eval and prefix any unused ae_pth_np/av_pth_np as
needed) so that eval(...) still returns all values but ruff RUF059 is satisfied;
locate uses around the dp.eval and dp_pth.eval calls and rename only the unused
symbols (leave e_np, f_np, e_pth, f_pth etc. as-is).
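For reference, the RUF059-clean unpacking pattern looks like this (fake_eval is a stand-in for DeepPot.eval, with made-up return values):

```python
def fake_eval():
    """Stand-in for DeepPot.eval returning (e, f, v, ae, av)."""
    return 1.0, [0.1], [0.2], [0.3], [0.4]


# before: e, f, v, ae, av = fake_eval()  -> ruff RUF059 flags the unused v
e, f, _v, ae, av = fake_eval()  # underscore marks v as intentionally unused

assert e == 1.0
assert f == [0.1]
assert ae == [0.3] and av == [0.4]
```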

Comment on lines +127 to +177
e, f, v, ae, av = dp.eval(
coord,
box,
atype,
fparam=fparam_val,
aparam=aparam_val,
atomic=True,
)

nloc = 6
atom_energy = ae[0, :, 0]
force = f[0]
atom_virial = av[0]

# Print C++ format
print("\n// ---- Reference values for C++ test (shared by .pth and .pt2) ----") # noqa: T201
print(f"// Total energy: {e[0, 0]:.18e}") # noqa: T201
print() # noqa: T201
print(" std::vector<VALUETYPE> expected_e = {") # noqa: T201
for ii, ev in enumerate(atom_energy):
comma = "," if ii < len(atom_energy) - 1 else ""
print(f" {ev:.18e}{comma}") # noqa: T201
print(" };") # noqa: T201

print(" std::vector<VALUETYPE> expected_f = {") # noqa: T201
force_flat = force.flatten()
for ii, fv in enumerate(force_flat):
comma = "," if ii < len(force_flat) - 1 else ""
print(f" {fv:.18e}{comma}") # noqa: T201
print(" };") # noqa: T201

print(" std::vector<VALUETYPE> expected_v = {") # noqa: T201
virial_flat = atom_virial.flatten()
for ii, vv in enumerate(virial_flat):
comma = "," if ii < len(virial_flat) - 1 else ""
print(f" {vv:.18e}{comma}") # noqa: T201
print(" };") # noqa: T201

# ---- 5. Verify .pth gives same results ----
dp_pth = DeepPot(pth_path)
e_pth, f_pth, _, ae_pth, av_pth = dp_pth.eval(
coord,
box,
atype,
fparam=fparam_val,
aparam=aparam_val,
atomic=True,
)
print(f"\n// .pth vs .pt2 energy diff: {abs(e[0, 0] - e_pth[0, 0]):.2e}") # noqa: T201
print(f"// .pth vs .pt2 force max diff: {np.max(np.abs(f - f_pth)):.2e}") # noqa: T201

⚠️ Potential issue | 🟠 Major

Fix the Ruff failures in this generator.

v, nloc, ae_pth, and av_pth are unused here, so this new file will fail ruff check as-is.

♻️ Suggested cleanup
-    e, f, v, ae, av = dp.eval(
+    e, f, _, ae, av = dp.eval(
         coord,
         box,
         atype,
         fparam=fparam_val,
         aparam=aparam_val,
         atomic=True,
     )

-    nloc = 6
     atom_energy = ae[0, :, 0]
     force = f[0]
     atom_virial = av[0]
@@
-    e_pth, f_pth, _, ae_pth, av_pth = dp_pth.eval(
+    e_pth, f_pth, _, _ae_pth, _av_pth = dp_pth.eval(
         coord,
         box,
         atype,
         fparam=fparam_val,
         aparam=aparam_val,
         atomic=True,
     )

As per coding guidelines, "Always run ruff check . and ruff format . before committing changes or CI will fail."

📝 Committable suggestion


Suggested change

Before:

    e, f, v, ae, av = dp.eval(
        coord,
        box,
        atype,
        fparam=fparam_val,
        aparam=aparam_val,
        atomic=True,
    )
    nloc = 6
    atom_energy = ae[0, :, 0]
    force = f[0]
    atom_virial = av[0]
    # Print C++ format
    print("\n// ---- Reference values for C++ test (shared by .pth and .pt2) ----")  # noqa: T201
    print(f"// Total energy: {e[0, 0]:.18e}")  # noqa: T201
    print()  # noqa: T201
    print(" std::vector<VALUETYPE> expected_e = {")  # noqa: T201
    for ii, ev in enumerate(atom_energy):
        comma = "," if ii < len(atom_energy) - 1 else ""
        print(f" {ev:.18e}{comma}")  # noqa: T201
    print(" };")  # noqa: T201
    print(" std::vector<VALUETYPE> expected_f = {")  # noqa: T201
    force_flat = force.flatten()
    for ii, fv in enumerate(force_flat):
        comma = "," if ii < len(force_flat) - 1 else ""
        print(f" {fv:.18e}{comma}")  # noqa: T201
    print(" };")  # noqa: T201
    print(" std::vector<VALUETYPE> expected_v = {")  # noqa: T201
    virial_flat = atom_virial.flatten()
    for ii, vv in enumerate(virial_flat):
        comma = "," if ii < len(virial_flat) - 1 else ""
        print(f" {vv:.18e}{comma}")  # noqa: T201
    print(" };")  # noqa: T201
    # ---- 5. Verify .pth gives same results ----
    dp_pth = DeepPot(pth_path)
    e_pth, f_pth, _, ae_pth, av_pth = dp_pth.eval(
        coord,
        box,
        atype,
        fparam=fparam_val,
        aparam=aparam_val,
        atomic=True,
    )
    print(f"\n// .pth vs .pt2 energy diff: {abs(e[0, 0] - e_pth[0, 0]):.2e}")  # noqa: T201
    print(f"// .pth vs .pt2 force max diff: {np.max(np.abs(f - f_pth)):.2e}")  # noqa: T201

After:

    e, f, _, ae, av = dp.eval(
        coord,
        box,
        atype,
        fparam=fparam_val,
        aparam=aparam_val,
        atomic=True,
    )
    atom_energy = ae[0, :, 0]
    force = f[0]
    atom_virial = av[0]
    # Print C++ format
    print("\n// ---- Reference values for C++ test (shared by .pth and .pt2) ----")  # noqa: T201
    print(f"// Total energy: {e[0, 0]:.18e}")  # noqa: T201
    print()  # noqa: T201
    print(" std::vector<VALUETYPE> expected_e = {")  # noqa: T201
    for ii, ev in enumerate(atom_energy):
        comma = "," if ii < len(atom_energy) - 1 else ""
        print(f" {ev:.18e}{comma}")  # noqa: T201
    print(" };")  # noqa: T201
    print(" std::vector<VALUETYPE> expected_f = {")  # noqa: T201
    force_flat = force.flatten()
    for ii, fv in enumerate(force_flat):
        comma = "," if ii < len(force_flat) - 1 else ""
        print(f" {fv:.18e}{comma}")  # noqa: T201
    print(" };")  # noqa: T201
    print(" std::vector<VALUETYPE> expected_v = {")  # noqa: T201
    virial_flat = atom_virial.flatten()
    for ii, vv in enumerate(virial_flat):
        comma = "," if ii < len(virial_flat) - 1 else ""
        print(f" {vv:.18e}{comma}")  # noqa: T201
    print(" };")  # noqa: T201
    # ---- 5. Verify .pth gives same results ----
    dp_pth = DeepPot(pth_path)
    e_pth, f_pth, _, _ae_pth, _av_pth = dp_pth.eval(
        coord,
        box,
        atype,
        fparam=fparam_val,
        aparam=aparam_val,
        atomic=True,
    )
    print(f"\n// .pth vs .pt2 energy diff: {abs(e[0, 0] - e_pth[0, 0]):.2e}")  # noqa: T201
    print(f"// .pth vs .pt2 force max diff: {np.max(np.abs(f - f_pth)):.2e}")  # noqa: T201
🧰 Tools
🪛 Ruff (0.15.4)

[warning] 127-127: Unpacked variable v is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)


[error] 136-136: Local variable nloc is assigned to but never used

Remove assignment to unused variable nloc

(F841)


[warning] 167-167: Unpacked variable ae_pth is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)


[warning] 167-167: Unpacked variable av_pth is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@source/tests/infer/gen_fparam_aparam.py` around lines 127 - 177, the new test
generator declares unused variables (v, nloc, ae_pth, av_pth), which trigger the ruff
F841/RUF059 unused-variable rules. Mark them as intentionally unused: rename them
with a leading underscore (e.g., _v, _ae_pth, _av_pth), delete the unused nloc
assignment outright, or discard the extra return values when calling DeepPot.eval
(e.g., e_pth, f_pth, _ = ... or e_pth, f_pth, *_ = ...), so that the unused-variable
errors no longer fire.
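The underscore convention the prompt describes can be shown with a tiny standalone sketch; `eval_stub` here is a made-up stand-in for `DeepPot.eval`'s five return values, not the real API:

```python
def eval_stub():
    # Hypothetical stand-in returning (energy, force, virial, atom_energy, atom_virial).
    return 1.0, [0.0, 0.0], None, [0.5], [0.25]


# Prefixing unused names with "_" (or collecting them with *_) tells Ruff's
# RUF059/F841 rules the unpacked values are intentionally discarded.
e, f, _, _ae, _av = eval_stub()
e2, f2, *_ = eval_stub()
print(e, e2)  # → 1.0 1.0
```

Either spelling keeps the unpacking shape explicit while silencing the lint.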

    std::vector<VALUETYPE> frame_aparam;
    if (!aparam.empty()) {
      frame_aparam.assign(aparam.begin() + ff * dap,
                          aparam.begin() + (ff + 1) * dap);

Check failure

Code scanning / CodeQL

Multiplication result converted to larger type High

Multiplication result may overflow 'int' before it is converted to 'difference_type'.
    }
    std::vector<VALUETYPE> frame_aparam;
    if (!aparam.empty()) {
      frame_aparam.assign(aparam.begin() + ff * dap,

Check failure

Code scanning / CodeQL

Multiplication result converted to larger type High

Multiplication result may overflow 'int' before it is converted to 'difference_type'.
    std::vector<VALUETYPE> frame_fparam;
    if (!fparam.empty()) {
      frame_fparam.assign(fparam.begin() + ff * dfp,
                          fparam.begin() + (ff + 1) * dfp);

Check failure

Code scanning / CodeQL

Multiplication result converted to larger type High

Multiplication result may overflow 'int' before it is converted to 'difference_type'.
    }
    std::vector<VALUETYPE> frame_fparam;
    if (!fparam.empty()) {
      frame_fparam.assign(fparam.begin() + ff * dfp,

Check failure

Code scanning / CodeQL

Multiplication result converted to larger type High

Multiplication result may overflow 'int' before it is converted to 'difference_type'.
    std::vector<VALUETYPE> frame_coord(coord.begin() + ff * natoms * 3,
                                       coord.begin() + (ff + 1) * natoms * 3);
    std::vector<int> frame_atype(atype.begin() + ff * natoms,
                                 atype.begin() + (ff + 1) * natoms);

Check failure

Code scanning / CodeQL

Multiplication result converted to larger type High

Multiplication result may overflow 'int' before it is converted to 'difference_type'.
    atom_virial.clear();
  }
  for (int ff = 0; ff < nframes; ++ff) {
    std::vector<VALUETYPE> frame_coord(coord.begin() + ff * natoms * 3,

Check failure

Code scanning / CodeQL

Multiplication result converted to larger type High

Multiplication result may overflow 'int' before it is converted to 'difference_type'.
  for (int kk = 0; kk < nframes; ++kk) {
    std::copy(in.begin() + kk * nall * ndim,
              in.begin() + kk * nall * ndim + nloc * ndim,
              out.begin() + kk * nloc * ndim);

Check failure

Code scanning / CodeQL

Multiplication result converted to larger type High

Multiplication result may overflow 'int' before it is converted to 'difference_type'.
  out.resize(static_cast<size_t>(nframes) * nloc * ndim);
  for (int kk = 0; kk < nframes; ++kk) {
    std::copy(in.begin() + kk * nall * ndim,
              in.begin() + kk * nall * ndim + nloc * ndim,

Check failure

Code scanning / CodeQL

Multiplication result converted to larger type High

Multiplication result may overflow 'int' before it is converted to 'difference_type'.
  out.resize(static_cast<size_t>(nframes) * nloc * ndim);
  for (int kk = 0; kk < nframes; ++kk) {
    std::copy(in.begin() + kk * nall * ndim,
              in.begin() + kk * nall * ndim + nloc * ndim,

Check failure

Code scanning / CodeQL

Multiplication result converted to larger type High

Multiplication result may overflow 'int' before it is converted to 'difference_type'.
                 const int nframes = 1) {
  out.resize(static_cast<size_t>(nframes) * nloc * ndim);
  for (int kk = 0; kk < nframes; ++kk) {
    std::copy(in.begin() + kk * nall * ndim,

Check failure

Code scanning / CodeQL

Multiplication result converted to larger type High

Multiplication result may overflow 'int' before it is converted to 'difference_type'.