
fix(c++): fix NULL type in custom op #4889

Merged
wanghan-iapcm merged 26 commits into deepmodeling:master from iProzd:D0814_debug_c_nlist
Mar 4, 2026

Conversation

@iProzd
Collaborator

@iProzd iProzd commented Aug 14, 2025

Replaces usage of lmp_list send/recv arrays with new vectors that map indices using fwd_map and synchronize counts via MPI. Updates tensor construction to use these new vectors, improving correctness and flexibility in distributed communication.

Summary by CodeRabbit

  • Refactor
    • Message-passing now remaps send lists to actual atoms before building communication tensors, improving consistency for distributed runs.
  • Bug Fixes
    • Invalid or out-of-range send indices are filtered and counts updated to prevent communication mismatches and related errors.
  • Tests
    • Added a test covering a type-map scenario with a NULL mapping to exercise the updated handling.

@iProzd iProzd marked this pull request as draft August 14, 2025 13:26
@github-actions github-actions bot added the C++ label Aug 14, 2025
@coderabbitai
Contributor

coderabbitai bot commented Aug 14, 2025

📝 Walkthrough

Calls a new helper to remap LAMMPS sendlists using a forward-map before constructing MPI/send tensors in DeepPotPT/DeepPotPD/DeepSpinPT, and adds the helper declaration/implementation and a small test. No public API signature changes beyond the new helper.

Changes

Cohort / File(s) | Summary

Sendlist remapping helper
source/api_cc/src/common.cc, source/api_cc/include/common.h
- Adds the deepmd::select_real_atoms_sendlist(const deepmd::InputNlist& inlist, const std::vector<int>& fwd_map) declaration and implementation. Filters and remaps per-swap sendlist indices via fwd_map, discards invalid mappings, and updates sendlist, sendnum, and recvnum.

DeepPot message-passing callers
source/api_cc/src/DeepPotPT.cc, source/api_cc/src/DeepPotPD.cc, source/api_cc/src/DeepSpinPT.cc
- After obtaining nswap in the message-passing branch, calls select_real_atoms_sendlist(lmp_list, fwd_map) so that subsequent tensor/comm_dict construction uses the remapped sendlists. No other control-flow or API changes.

Tests
source/lmp/tests/test_lammps_dpa_pt.py
- Adds test_pair_deepmd_type_map_with_null, which sets up a DeepMD type map with pair_coeff("* * H NULL") and runs 0 steps; no assertions added.

Sequence Diagram(s)

sequenceDiagram
  participant Compute as DeepPot*::compute
  participant Select as select_real_atoms_sendlist
  participant Map as fwd_map
  participant Tensors as Tensor & comm_dict build

  Compute->>Select: call select_real_atoms_sendlist(lmp_list, fwd_map)
  Select->>Map: map each send index -> forwarded index
  Map-->>Select: remapped sendlist (invalids removed)
  Select-->>Compute: lmp_list updated (sendlist/sendnum/recvnum)
  Compute->>Tensors: build send/recv tensors and comm_dict from updated lmp_list

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Suggested reviewers

  • wanghan-iapcm
  • njzjz
  • CaRoLZhangxy

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

🧹 Nitpick comments (2)
source/api_cc/src/DeepPotPT.cc (2)

251-257: Remove stale commented-out code

Dead commented code obscures the current data path and makes maintenance harder.

Apply this diff:

-//      torch::Tensor firstrecv_tensor =
-//          torch::from_blob(lmp_list.firstrecv, {nswap}, int32_option);
-//      torch::Tensor recvnum_tensor =
-//          torch::from_blob(lmp_list.recvnum, {nswap}, int32_option);
-//      torch::Tensor sendnum_tensor =
-//          torch::from_blob(lmp_list.sendnum, {nswap}, int32_option);

266-269: Remove redundant commented-out legacy code

Same reasoning; commented legacy path is preserved in git history.

Apply this diff:

-//      int total_send =
-//          std::accumulate(lmp_list.sendnum, lmp_list.sendnum + nswap, 0);
-//      torch::Tensor sendlist_tensor =
-//          torch::from_blob(lmp_list.sendlist, {total_send}, int32_option);
📜 Review details


📥 Commits

Reviewing files that changed from the base of the PR and between accc331 and f99ad4d.

📒 Files selected for processing (1)
  • source/api_cc/src/DeepPotPT.cc (3 hunks)
🔇 Additional comments (2)
source/api_cc/src/DeepPotPT.cc (2)

185-204: Remapping logic LGTM

Correctly rebuilds per-swap send counts and a dense send list using fwd_map with bounds checks and filtering. Reserving capacity via the accumulated legacy counts is a good optimization.


226-232: firstrecv_new is unused and not required — original comment is incorrect

Short: deepmd/pt/model/descriptor/repflows.py builds comm_dict and calls torch.ops.deepmd.border_op with send_list, send_proc, recv_proc, send_num, recv_num, communicator (no first_recv). The computed firstrecv_new/firstrecv_tensor in the PT wrappers is dead code — remove it or document why it is kept.

Files to update:

  • source/api_cc/src/DeepPotPT.cc
    • Remove the firstrecv_new prefix-sum computation (around lines 226–231) and the unused firstrecv_tensor creation (around line 238).
  • source/api_cc/src/DeepSpinPT.cc
    • Same pattern: firstrecv_tensor is created around lines 187–191 but never used/inserted.

Suggested change (remove unused code) — example diff for DeepPotPT.cc:
@@
-      std::vector<int> firstrecv_new(nswap, 0);
-      int acc = 0;
-      for (int s = 0; s < nswap; ++s) {
-        firstrecv_new[s] = acc;
-        acc += recvnum_new[s];
-      }
+      /* firstrecv computation removed — not used by border_op */
@@
-      torch::Tensor firstrecv_tensor =
-          torch::from_blob(firstrecv_new.data(), {nswap}, int32_option).clone();
+      /* firstrecv tensor omitted — border_op expects recv_num, not first_recv */

If you prefer to keep the computation for clarity, add a short comment explaining it's intentionally unused.

Likely an incorrect or invalid review comment.

@caic99 caic99 linked an issue Aug 22, 2025 that may be closed by this pull request
@iProzd iProzd marked this pull request as ready for review August 27, 2025 10:25
@caic99 caic99 requested a review from Copilot August 27, 2025 10:28
Contributor

Copilot AI left a comment


Pull Request Overview

This PR fixes a bug in the C++ custom operations for distributed MPI communication by implementing proper handling of NULL types in the send/receive lists. The fix ensures that atom indices are correctly mapped and synchronized across MPI processes by filtering out invalid indices and updating counts accordingly.

Key changes:

  • Added a new function to select and remap real atoms in send lists using forward mapping
  • Integrated the send list filtering into PyTorch, Paddle, and DeepSpin compute functions
  • Added a test case to verify NULL type handling in LAMMPS pair coefficients

Reviewed Changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.

File | Description
source/api_cc/src/common.cc | Implements the select_real_atoms_sendlist function to filter and remap atom indices in MPI send lists
source/api_cc/include/common.h | Adds the declaration for the new send-list selection function
source/api_cc/src/DeepPotPT.cc | Integrates send-list filtering into the PyTorch backend compute function
source/api_cc/src/DeepPotPD.cc | Integrates send-list filtering into the Paddle backend compute function
source/api_cc/src/DeepSpinPT.cc | Integrates send-list filtering into the DeepSpin PyTorch compute function
source/lmp/tests/test_lammps_dpa_pt.py | Adds a test case for NULL type handling in LAMMPS pair coefficients

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
source/api_cc/src/DeepSpinPT.cc (1)

205-213: Flatten the sendlist pointer-of-pointers into a contiguous buffer before calling from_blob

InputNlist::sendlist is declared as int ** (an array of pointers to per-swap arrays) in api_c/include/deepmd.hpp, so passing it directly to

torch::from_blob(lmp_list.sendlist, {total_send}, int32_option)

will interpret the array‐of‐pointers as a contiguous block of ints, which is invalid and will lead to incorrect data being read (or a crash). You must first flatten all per-swap subarrays into a single contiguous int buffer.

Please update the following files:

  • source/api_cc/src/DeepSpinPT.cc (around line 207)
  • source/api_cc/src/DeepPotPT.cc (around line 200)

Proposed refactoring to safely flatten (ensuring the buffer outlives the call to run_method):

-      torch::Tensor sendlist_tensor =
-          torch::from_blob(lmp_list.sendlist, {total_send}, int32_option);
+      // Flatten the int** sendlist into a single contiguous vector
+      std::vector<int> flat_sendlist;
+      flat_sendlist.reserve(total_send);
+      for (int s = 0; s < nswap; ++s) {
+        flat_sendlist.insert(
+            flat_sendlist.end(),
+            lmp_list.sendlist[s],
+            lmp_list.sendlist[s] + lmp_list.sendnum[s]);
+      }
+      // Create a tensor from the contiguous buffer
+      torch::Tensor sendlist_tensor =
+          torch::from_blob(flat_sendlist.data(), {total_send}, int32_option);


source/api_cc/src/DeepPotPD.cc (1)

421-439: Shape bug: send_list should carry nswap pointer addresses, not total_send elements

You reshape send_list to total_send but copy only nswap pointer addresses. This is a mismatched shape and can corrupt reads downstream.

-      int total_send =
-          std::accumulate(lmp_list.sendnum, lmp_list.sendnum + nswap, 0);
-      sendlist_tensor->Reshape({total_send});
+      // send_list carries one pointer per swap
+      sendlist_tensor->Reshape({nswap});
       /**
       ** NOTE: paddle do not support construct a Tensor with from_blob(T**, ...)
       ** from a double pointer, so we convert int* pointer to indptr_t for each
       ** entry and wrap it into int64 Tensor as a workaround.
       */
       std::vector<std::intptr_t> pointer_addresses;
-      pointer_addresses.reserve(nswap);
+      pointer_addresses.reserve(nswap);
🧹 Nitpick comments (1)
source/api_cc/include/common.h (1)

105-107: API const-correctness: this mutates inlist; avoid const on parameter

The helper writes through inlist pointers (sendnum/recvnum/sendlist). Marking the parameter const is misleading.

-void select_real_atoms_sendlist(const deepmd::InputNlist& inlist,
-                                const std::vector<int>& fwd_map);
+// In-place: trims and remaps sendlist; updates sendnum/recvnum/(firstrecv if present)
+void select_real_atoms_sendlist(deepmd::InputNlist& inlist,
+                                const std::vector<int>& fwd_map);

If changing the signature ripples too far now, at least document the in-place side effects in a comment.

📜 Review details


📥 Commits

Reviewing files that changed from the base of the PR and between f99ad4d and af2411d.

📒 Files selected for processing (6)
  • source/api_cc/include/common.h (1 hunks)
  • source/api_cc/src/DeepPotPD.cc (1 hunks)
  • source/api_cc/src/DeepPotPT.cc (1 hunks)
  • source/api_cc/src/DeepSpinPT.cc (1 hunks)
  • source/api_cc/src/common.cc (1 hunks)
  • source/lmp/tests/test_lammps_dpa_pt.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • source/api_cc/src/DeepPotPT.cc
🧰 Additional context used
🧬 Code graph analysis (5)
source/lmp/tests/test_lammps_dpa_pt.py (1)
source/lmp/tests/test_lammps.py (1)
  • lammps_type_map (289-292)
source/api_cc/src/common.cc (1)
source/op/tf/prod_env_mat_multi_device.cc (1)
  • inlist (1590-1590)
source/api_cc/include/common.h (2)
source/api_cc/src/common.cc (2)
  • select_real_atoms_sendlist (235-258)
  • select_real_atoms_sendlist (235-236)
source/api_c/include/deepmd.hpp (4)
  • InputNlist (807-814)
  • InputNlist (815-822)
  • InputNlist (823-850)
  • InputNlist (851-851)
source/api_cc/src/DeepSpinPT.cc (1)
source/api_cc/src/common.cc (2)
  • select_real_atoms_sendlist (235-258)
  • select_real_atoms_sendlist (235-236)
source/api_cc/src/DeepPotPD.cc (1)
source/api_cc/src/common.cc (2)
  • select_real_atoms_sendlist (235-258)
  • select_real_atoms_sendlist (235-236)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (27)
  • GitHub Check: Build wheels for cp310-manylinux_aarch64
  • GitHub Check: Build wheels for cp311-macosx_arm64
  • GitHub Check: Build wheels for cp311-win_amd64
  • GitHub Check: Build wheels for cp311-manylinux_x86_64
  • GitHub Check: Build wheels for cp311-macosx_x86_64
  • GitHub Check: Build wheels for cp311-manylinux_x86_64
  • GitHub Check: Build C library (2.18, libdeepmd_c.tar.gz)
  • GitHub Check: Build C library (2.14, >=2.5.0,<2.15, libdeepmd_c_cu11.tar.gz)
  • GitHub Check: Build C++ (cuda120, cuda)
  • GitHub Check: Build C++ (clang, clang)
  • GitHub Check: Build C++ (rocm, rocm)
  • GitHub Check: Build C++ (cuda, cuda)
  • GitHub Check: Test Python (2, 3.12)
  • GitHub Check: Test Python (3, 3.12)
  • GitHub Check: Test Python (6, 3.12)
  • GitHub Check: Test Python (6, 3.9)
  • GitHub Check: Test Python (5, 3.9)
  • GitHub Check: Test Python (4, 3.9)
  • GitHub Check: Test Python (3, 3.9)
  • GitHub Check: Test Python (5, 3.12)
  • GitHub Check: Test Python (4, 3.12)
  • GitHub Check: Test Python (2, 3.9)
  • GitHub Check: Test Python (1, 3.12)
  • GitHub Check: Test Python (1, 3.9)
  • GitHub Check: Test C++ (false)
  • GitHub Check: Test C++ (true)
  • GitHub Check: Analyze (c-cpp)
🔇 Additional comments (1)
source/api_cc/src/common.cc (1)

235-258: Recompute firstrecv after trimming and verify symmetry

You need to keep the receive‐offset array in sync whenever you shrink recvnum, and also confirm that the assumption recvnum[s] == sendnum[s] holds across all ranks/types.

• After your loop that sets

inlist.sendnum[s] = cnt;
inlist.recvnum[s] = cnt;

add:

+  // Recompute receive‐offsets to match updated recvnum[]
+  if (inlist.firstrecv) {
+    int offset = 0;
+    for (int s = 0; s < nswap; ++s) {
+      inlist.firstrecv[s] = offset;
+      offset += inlist.recvnum[s];
+    }
+  }

This keeps the firstrecv prefix sums correct ( firstrecv is “where to put 1st recv atom in each swap” ) (docs.deepmodeling.com).

• Because you’re setting recvnum[s] = sendnum[s], you’re assuming perfect send/recv symmetry per swap. Please verify this holds under all type‐map filters and MPI ranks. Consider adding a debug‐only assertion or a collective check (e.g. an MPI_Allreduce on the arrays) to catch any imbalance early.

@link89
Contributor

link89 commented Feb 12, 2026

It looks like this PR is blocked by a false-positive failure on a build machine. Can you rerun those tests and see if this PR can be merged? I am looking forward to this issue being resolved.

@codecov

codecov bot commented Feb 20, 2026

Codecov Report

❌ Patch coverage is 71.87500% with 36 lines in your changes missing coverage. Please review.
✅ Project coverage is 82.16%. Comparing base (a3db25a) to head (280729f).
⚠️ Report is 6 commits behind head on master.

Files with missing lines | Patch % | Lines
source/api_cc/src/DeepPotPD.cc | 5.55% | 17 Missing ⚠️
source/api_cc/src/DeepSpinPT.cc | 6.66% | 13 Missing and 1 partial ⚠️
source/api_cc/include/commonPT.h | 92.30% | 0 Missing and 2 partials ⚠️
source/api_cc/src/DeepPotPT.cc | 87.50% | 0 Missing and 1 partial ⚠️
source/lmp/fix_dplr.cpp | 0.00% | 1 Missing ⚠️
source/lmp/pair_deepspin.cpp | 0.00% | 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #4889      +/-   ##
==========================================
+ Coverage   82.12%   82.16%   +0.03%     
==========================================
  Files         753      755       +2     
  Lines       75825    75874      +49     
  Branches     3648     3660      +12     
==========================================
+ Hits        62275    62342      +67     
+ Misses      12382    12362      -20     
- Partials     1168     1170       +2     

☔ View full report in Codecov by Sentry.

@link89
Contributor

link89 commented Feb 25, 2026

I found that the latest CI job failed due to a test-case timeout. I suspect a dead loop or something similar is triggered by this PR.
Another PR has been created to set a timeout for pytest, which will make it easier to debug such errors: #5265

Maybe you should merge or cherry-pick that fix and rerun the pipeline to see which case is broken.

@njzjz njzjz disabled auto-merge February 25, 2026 21:16
@iProzd
Collaborator Author

iProzd commented Mar 3, 2026

@link89 thanks for your commit; I have merged your modifications from #5268 into this PR and completed them. Now I think this bug is fixed.

@iProzd iProzd requested review from njzjz and wanghan-iapcm and removed request for CaRoLZhangxy March 3, 2026 11:36
@link89
Contributor

link89 commented Mar 3, 2026

@iProzd Great. However, the current approach requires rebuilding various atom arrays even if only a single virtual atom is present. For large-scale systems, this introduces significant performance overhead. Should we consider moving the masking logic directly into the model inference stage in the future? This would eliminate unnecessary atom index remapping and array reconstruction, leading to a more performant and cleaner implementation.

@wanghan-iapcm wanghan-iapcm added this pull request to the merge queue Mar 4, 2026
@iProzd
Collaborator Author

iProzd commented Mar 4, 2026

@iProzd Great. However, the current approach requires rebuilding various atom arrays even if only a single virtual atom is present. For large-scale systems, this introduces significant performance overhead. Should we consider moving the masking logic directly into the model inference stage in the future? This would eliminate unnecessary atom index remapping and array reconstruction, leading to a more performant and cleaner implementation.

@link89 Yes you are right. I'll open an issue to track this.

Merged via the queue into deepmodeling:master with commit 7e648f0 Mar 4, 2026
70 checks passed

Development

Successfully merging this pull request may close these issues.

NULL atom type can not be used with deepmd 3.1.0a0 pth

7 participants