Skip to content

feat(pt/dpmodel): add lmdb dataloader#5283

Open
iProzd wants to merge 5 commits intodeepmodeling:masterfrom
iProzd:0225-lmdb-dataloader
Open

feat(pt/dpmodel): add lmdb dataloader#5283
iProzd wants to merge 5 commits intodeepmodeling:masterfrom
iProzd:0225-lmdb-dataloader

Conversation

@iProzd
Copy link
Collaborator

@iProzd iProzd commented Mar 3, 2026

Summary by CodeRabbit

  • New Features

    • Full LMDB data support across training, validation and testing flows, with PyTorch-friendly dataset/dataloader integration and per-atom-count batching.
    • Distributed-aware batch sampling that partitions and deterministically shuffles across ranks.
  • Tests

    • Extensive unit tests validating reader, dataset, samplers, collation, and mixed-/uniform-nloc scenarios.
  • Documentation

    • Added example LMDB training configuration.

Copilot AI review requested due to automatic review settings March 3, 2026 07:56
@iProzd iProzd marked this pull request as draft March 3, 2026 07:56
@dosubot dosubot bot added the new feature label Mar 3, 2026

def print_summary(self, name: str, prob: Any) -> None:
"""Print basic dataset info."""
unique_nlocs = sorted(self._nloc_groups.keys())

Check notice

Code scanning / CodeQL

Unused local variable Note

Variable unique_nlocs is not used.
DataRequirementItem,
)

log = logging.getLogger(__name__)

Check notice

Code scanning / CodeQL

Unused global variable Note

The global variable 'log' is not used.
def test_getitem_out_of_range(self, lmdb_dir):
ds = LmdbDataset(lmdb_dir, type_map=["O", "H"], batch_size=2)
with pytest.raises(IndexError):
ds[999]

Check notice

Code scanning / CodeQL

Statement has no effect Note test

This statement has no effect.
self._world_size - 1 - self._rank :: self._world_size
]

s_default = DistributedSameNlocBatchSampler(

Check notice

Code scanning / CodeQL

Unused local variable Note test

Variable s_default is not used.
@coderabbitai
Copy link
Contributor

coderabbitai bot commented Mar 3, 2026

📝 Walkthrough

Walkthrough

Adds full LMDB data support: a framework-agnostic LMDB reader and utilities, PyTorch LMDB Dataset/colation/samplers, training and test entrypoint LMDB branches, example LMDB config, and extensive unit tests; also adds lmdb and msgpack dependencies.

Changes

Cohort / File(s) Summary
LMDB Core
deepmd/dpmodel/utils/lmdb_data.py, deepmd/dpmodel/utils/__init__.py
New LMDB utilities: LmdbDataReader, LmdbTestData, merge_lmdb, key decoding/remapping, per-frame handling, SameNlocBatchSampler, DistributedSameNlocBatchSampler, and exports added to module init.
PyTorch integration
deepmd/pt/utils/lmdb_dataset.py, deepmd/pt/entrypoints/main.py, deepmd/pt/train/training.py
New LmdbDataset, torch collate _collate_lmdb_batch, sampler adapter _SameNlocBatchSamplerTorch, and training pipeline updates to accept/use LmdbDataset and LMDB samplers/collation.
CLI / Test entrypoint
deepmd/entrypoints/test.py
Detects LMDB inputs via is_lmdb, uses LmdbTestData and per-nloc views, and runs tests per LMDB nloc group.
Examples
examples/lmdb_data/input_lmdb.json
Adds example LMDB training configuration demonstrating LMDB training/validation paths and batch settings.
Tests
source/tests/consistent/test_lmdb_data.py, source/tests/pt/test_lmdb_dataloader.py
Extensive tests for reader vs dataset consistency, uniform and mixed-nloc scenarios, sampler determinism, collate behavior, and distributed batching.
Project deps
pyproject.toml
Adds lmdb and msgpack to project dependencies.

Sequence Diagrams

sequenceDiagram
    participant Trainer as Trainer
    participant Main as prepare_trainer_input_single
    participant Checker as is_lmdb
    participant DS as LmdbDataset / DpLoaderSet
    participant Reader as LmdbDataReader
    participant LMDB as LMDB DB
    participant DL as PyTorch DataLoader

    Trainer->>Main: request(train_systems, val_systems)
    Main->>Checker: is_lmdb(train_systems)
    alt LMDB path
        Main->>DS: create LmdbDataset(train_path)
        Main->>DS: create LmdbDataset(val_path) or DpLoaderSet
    else Non-LMDB
        Main->>DS: create DpLoaderSet(...)
    end
    Main-->>Trainer: (train_data, val_data, DPPath)

    Trainer->>DL: get_data_loader(train_data)
    DL->>DS: initialize (sampler, collate_fn)
    DS->>Reader: Reader.__init__(lmdb_path)
    Reader->>LMDB: open read-only txn and read __metadata__

    loop per-batch
        DL->>Reader: Reader.__getitem__(frame_id)
        Reader->>LMDB: fetch frame bytes
        Reader->>Reader: decode arrays & remap keys
        Reader-->>DL: frame dict (np arrays)
        DL->>DL: _collate_lmdb_batch(batch)
        DL-->>Trainer: batched tensors
    end
Loading
sequenceDiagram
    participant App as Test CLI
    participant LD as LmdbTestData
    participant View as _LmdbTestDataNlocView
    participant TestRunner as test functions

    App->>LD: LmdbTestData(lmdb_path)
    LD->>LD: compute nloc_groups
    alt multiple nloc groups
        LD->>View: create per-nloc views
        loop per nloc
            App->>TestRunner: run tests(view, label)
        end
    else single group
        App->>TestRunner: run tests(LD, label)
    end
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Suggested reviewers

  • njzjz
  • wanghan-iapcm
🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 48.26% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately describes the main feature addition: LMDB dataloader support for both PyTorch and DPModel frameworks, which is comprehensively covered across the changeset.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 9

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
deepmd/pt/train/training.py (1)

529-542: ⚠️ Potential issue | 🔴 Critical

Multi-task LMDB path still assumes .sampler.weights and can crash.

In LMDB mode, dataloaders are batch-sampler based and do not provide sampler weights. The current multi-task code still dereferences .sampler.weights in both summary printing and num_epoch_dict total-batch computation.

Proposed fix
                 training_data[model_key].print_summary(
                     f"training in {model_key}",
-                    to_numpy_array(self.training_dataloader[model_key].sampler.weights),
+                    to_numpy_array(self.training_dataloader[model_key].sampler.weights)
+                    if not isinstance(training_data[model_key], LmdbDataset)
+                    else [1.0],
                 )
@@
                     validation_data[model_key].print_summary(
                         f"validation in {model_key}",
-                        to_numpy_array(
-                            self.validation_dataloader[model_key].sampler.weights
-                        ),
+                        to_numpy_array(
+                            self.validation_dataloader[model_key].sampler.weights
+                        )
+                        if not isinstance(validation_data[model_key], LmdbDataset)
+                        else [1.0],
                     )
@@
                 for model_key in self.model_keys:
-                    sampler_weights = to_numpy_array(
-                        self.training_dataloader[model_key].sampler.weights
-                    )
-                    per_task_total.append(
-                        compute_total_numb_batch(
-                            training_data[model_key].index,
-                            sampler_weights,
-                        )
-                    )
+                    if isinstance(training_data[model_key], LmdbDataset):
+                        per_task_total.append(training_data[model_key].total_batch)
+                    else:
+                        sampler_weights = to_numpy_array(
+                            self.training_dataloader[model_key].sampler.weights
+                        )
+                        per_task_total.append(
+                            compute_total_numb_batch(
+                                training_data[model_key].index,
+                                sampler_weights,
+                            )
+                        )

Also applies to: 583-590

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@deepmd/pt/train/training.py` around lines 529 - 542, The code assumes
sampler.weights exists for training/validation dataloaders (see
training_data[model_key].print_summary calls and the num_epoch_dict total-batch
computation) which breaks for LMDB batch-sampler dataloaders; fix by guarding
access to sampler.weights: when printing call
to_numpy_array(self.training_dataloader[model_key].sampler.weights) only if
hasattr(self.training_dataloader[model_key].sampler, "weights") (or use getattr
with a default None) and pass None or an empty array otherwise, and for
num_epoch_dict total-batch computation replace uses of sum(sampler.weights) with
a safe fallback that uses len(self.training_dataloader[model_key]) (number of
batches) when weights are absent; update the same logic for
validation_dataloader and any other places that reference .sampler.weights
(e.g., the block around num_epoch_dict).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@deepmd/dpmodel/utils/lmdb_data.py`:
- Around line 242-253: The batch_size handling silently coerces unknown strings
and accepts non-positive integers; update the branch in the batch_size setter so
that only "auto" or "auto:<positive int>" strings are allowed (parse "auto:<n>"
into an int and validate n>0), otherwise raise a ValueError with a clear
message; likewise, when batch_size is numeric (the else branch that sets
self.batch_size = int(batch_size)) validate the parsed integer is >0 and raise a
ValueError if not; keep using _compute_batch_size(self._natoms, self._auto_rule)
for the "auto" case but ensure _auto_rule is a positive int after parsing.
- Line 374: The local variable unique_nlocs assigned from
sorted(self._nloc_groups.keys()) is unused and triggers a Ruff F841 lint error;
remove this assignment statement (the unused local unique_nlocs) from the method
where it appears so the code no longer creates unique_nlocs but retains any
needed logic that uses self._nloc_groups directly.
- Around line 880-882: The merge logic assumes src_nlocs has at least nframes
entries and does frame_nlocs.append(int(src_nlocs[i])) without bounds checking;
change this to guard the index: if src_nlocs is not None and i < len(src_nlocs)
then append int(src_nlocs[i]) else compute/derive the per-frame natoms (e.g., by
parsing the frame from the source data or using the fallback natoms value
already available) so that malformed/truncated metadata does not raise
IndexError during the merge; update the block around the frame_nlocs population
(references: frame_nlocs, src_nlocs, nframes) to perform this fallback behavior.
- Around line 705-719: In get_test(), handle the empty-dataset case before using
max on self._nloc_groups: if self._nloc_groups is empty (or len(self._frames) ==
0) raise a clear ValueError (e.g. "No frames in LMDB / no nloc groups
available") instead of letting max(...) raise a cryptic error; otherwise proceed
with the existing logic that sets natoms and frame_indices from the largest
group in self._nloc_groups.
- Around line 263-267: Before building natoms_vec/vec, validate that all atom
type ids in atype are within [0, self._ntypes-1]; detect any ids <0 or >=
self._ntypes and raise a descriptive ValueError (including the offending ids or
their min/max and, if available, the frame index) instead of silently truncating
via np.bincount(... )[: self._ntypes]. Place this check immediately before the
np.bincount call that computes counts (the code using atype and self._ntypes and
filling vec/natoms_vec) so you fail fast and keep natoms_vec consistent with the
frame contents.

In `@deepmd/entrypoints/test.py`:
- Around line 233-266: The append_detail flag is incorrectly derived from cc so
LMDB sub-groups for the first system overwrite detail files; instead track
whether a given sys_label has already been written and pass append_detail=True
for subsequent sub-groups. Modify the loop over data_items: introduce a small
seen map (e.g., seen_systems dict keyed by sys_label) and compute append_detail
= seen_systems.get(sys_label, False) when calling
test_ener/test_dos/test_property (replace the current append_detail=(cc != 0));
after the call set seen_systems[sys_label] = True so later sub-groups append
rather than overwrite.

In `@source/tests/consistent/test_lmdb_data.py`:
- Line 128: The loop variable j in the loop "for j in range(count):" is unused;
rename it to "_" (or "_i"/"_count" per project convention) to satisfy Ruff B007
and avoid lint failures, update any related references in that loop (none
expected), and re-run ruff check . and ruff format . to ensure formatting and
linting pass.
- Around line 405-407: The current test contains a tautological assertion
self.assertEqual(atype.shape[1], atype.shape[1]) which does not validate batch
consistency; replace it with an assertion that compares the nloc dimension
across frames in the batch (use atype as the per-batch array) — e.g. compute a
reference_nloc from the first frame (reference_nloc = atype[0].shape[1]) and
assert every frame's shape[1] equals reference_nloc (or use numpy/all to check
atype[:,1?] consistency) so the test in test_lmdb_data.py actually verifies "All
frames in batch have same nloc".

In `@source/tests/pt/test_lmdb_dataloader.py`:
- Line 480: The test contains Ruff violations: the loop uses an unused loop
variable i and the variable s_default is assigned but never used; fix by
changing the loop to use a throwaway name (e.g., replace "for i in range(10):"
with "for _ in range(10):") or otherwise use i, and remove or use the s_default
assignment (remove the dead "s_default = ..." line if it's not needed or
reference it where intended) so there are no unused variables left.

---

Outside diff comments:
In `@deepmd/pt/train/training.py`:
- Around line 529-542: The code assumes sampler.weights exists for
training/validation dataloaders (see training_data[model_key].print_summary
calls and the num_epoch_dict total-batch computation) which breaks for LMDB
batch-sampler dataloaders; fix by guarding access to sampler.weights: when
printing call
to_numpy_array(self.training_dataloader[model_key].sampler.weights) only if
hasattr(self.training_dataloader[model_key].sampler, "weights") (or use getattr
with a default None) and pass None or an empty array otherwise, and for
num_epoch_dict total-batch computation replace uses of sum(sampler.weights) with
a safe fallback that uses len(self.training_dataloader[model_key]) (number of
batches) when weights are absent; update the same logic for
validation_dataloader and any other places that reference .sampler.weights
(e.g., the block around num_epoch_dict).

ℹ️ Review info

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a3db25a and 5a10d4d.

📒 Files selected for processing (9)
  • deepmd/dpmodel/utils/__init__.py
  • deepmd/dpmodel/utils/lmdb_data.py
  • deepmd/entrypoints/test.py
  • deepmd/pt/entrypoints/main.py
  • deepmd/pt/train/training.py
  • deepmd/pt/utils/lmdb_dataset.py
  • examples/lmdb_data/input_lmdb.json
  • source/tests/consistent/test_lmdb_data.py
  • source/tests/pt/test_lmdb_dataloader.py

Comment on lines +242 to +253
if isinstance(batch_size, str):
if batch_size == "auto":
self._auto_rule = 32
elif batch_size.startswith("auto:"):
self._auto_rule = int(batch_size.split(":")[1])
else:
self._auto_rule = 32
# Default batch_size uses first frame's nloc (for total_batch estimate)
self.batch_size = _compute_batch_size(self._natoms, self._auto_rule)
else:
self.batch_size = int(batch_size)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Validate batch_size strictly instead of silently coercing invalid input.

At Line 248, unknown string specs are silently treated as "auto:32", and at Line 252 non-positive integers are accepted. This can hide config mistakes and trigger runtime failures later (e.g., division by zero at Line 393).

Suggested fix
         self._auto_rule: int | None = None
         if isinstance(batch_size, str):
             if batch_size == "auto":
                 self._auto_rule = 32
             elif batch_size.startswith("auto:"):
-                self._auto_rule = int(batch_size.split(":")[1])
+                try:
+                    self._auto_rule = int(batch_size.split(":", 1)[1])
+                except ValueError as exc:
+                    raise ValueError(
+                        "batch_size must be 'auto', 'auto:N', or a positive integer"
+                    ) from exc
+                if self._auto_rule <= 0:
+                    raise ValueError("auto batch_size rule N must be > 0")
             else:
-                self._auto_rule = 32
+                raise ValueError(
+                    "batch_size must be 'auto', 'auto:N', or a positive integer"
+                )
             # Default batch_size uses first frame's nloc (for total_batch estimate)
             self.batch_size = _compute_batch_size(self._natoms, self._auto_rule)
         else:
             self.batch_size = int(batch_size)
+            if self.batch_size <= 0:
+                raise ValueError("batch_size must be > 0")

Also applies to: 393-393

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@deepmd/dpmodel/utils/lmdb_data.py` around lines 242 - 253, The batch_size
handling silently coerces unknown strings and accepts non-positive integers;
update the branch in the batch_size setter so that only "auto" or
"auto:<positive int>" strings are allowed (parse "auto:<n>" into an int and
validate n>0), otherwise raise a ValueError with a clear message; likewise, when
batch_size is numeric (the else branch that sets self.batch_size =
int(batch_size)) validate the parsed integer is >0 and raise a ValueError if
not; keep using _compute_batch_size(self._natoms, self._auto_rule) for the
"auto" case but ensure _auto_rule is a positive int after parsing.

Comment on lines +263 to +267
counts = np.bincount(atype, minlength=self._ntypes)[: self._ntypes]
vec = np.empty(self._ntypes + 2, dtype=np.int64)
vec[0] = nloc
vec[1] = nloc
vec[2:] = counts
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Guard against out-of-range atom type ids before building natoms_vec.

np.bincount(... )[: self._ntypes] truncates counts for type ids >= self._ntypes, so natoms_vec can become inconsistent with the frame contents without any error.

Suggested fix
         nloc = len(atype)
+        if atype.size and (np.min(atype) < 0 or np.max(atype) >= self._ntypes):
+            raise ValueError(
+                f"atype contains out-of-range ids; expected [0, {self._ntypes - 1}]"
+            )
         counts = np.bincount(atype, minlength=self._ntypes)[: self._ntypes]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@deepmd/dpmodel/utils/lmdb_data.py` around lines 263 - 267, Before building
natoms_vec/vec, validate that all atom type ids in atype are within [0,
self._ntypes-1]; detect any ids <0 or >= self._ntypes and raise a descriptive
ValueError (including the offending ids or their min/max and, if available, the
frame index) instead of silently truncating via np.bincount(... )[:
self._ntypes]. Place this check immediately before the np.bincount call that
computes counts (the code using atype and self._ntypes and filling
vec/natoms_vec) so you fail fast and keep natoms_vec consistent with the frame
contents.


def print_summary(self, name: str, prob: Any) -> None:
"""Print basic dataset info."""
unique_nlocs = sorted(self._nloc_groups.keys())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Remove unused local to keep Ruff clean (F841).

unique_nlocs is assigned but never used. This is already flagged by static analysis and should be removed to avoid CI lint failures.

Suggested fix
-        unique_nlocs = sorted(self._nloc_groups.keys())
         nloc_info = ", ".join(
             f"{nloc}({len(idxs)})" for nloc, idxs in sorted(self._nloc_groups.items())
         )

As per coding guidelines, **/*.py: Always run ruff check . and ruff format . before committing changes or CI will fail.

🧰 Tools
🪛 Ruff (0.15.2)

[error] 374-374: Local variable unique_nlocs is assigned to but never used

Remove assignment to unused variable unique_nlocs

(F841)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@deepmd/dpmodel/utils/lmdb_data.py` at line 374, The local variable
unique_nlocs assigned from sorted(self._nloc_groups.keys()) is unused and
triggers a Ruff F841 lint error; remove this assignment statement (the unused
local unique_nlocs) from the method where it appears so the code no longer
creates unique_nlocs but retains any needed logic that uses self._nloc_groups
directly.

Comment on lines +705 to +719
if nloc is not None:
if nloc not in self._nloc_groups:
raise ValueError(
f"No frames with nloc={nloc}. Available: {sorted(self._nloc_groups.keys())}"
)
frame_indices = self._nloc_groups[nloc]
natoms = nloc
elif len(self._nloc_groups) == 1:
# Uniform nloc — use all frames
natoms = next(iter(self._nloc_groups))
frame_indices = list(range(len(self._frames)))
else:
# Mixed nloc — use the largest group
natoms = max(self._nloc_groups, key=lambda k: len(self._nloc_groups[k]))
frame_indices = self._nloc_groups[natoms]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Handle empty datasets explicitly in get_test().

If the LMDB has zero frames, self._nloc_groups is empty and max(...) at Line 718 raises a cryptic error path. Add an explicit guard for the empty case.

Suggested fix
         if nloc is not None:
             if nloc not in self._nloc_groups:
                 raise ValueError(
                     f"No frames with nloc={nloc}. Available: {sorted(self._nloc_groups.keys())}"
                 )
             frame_indices = self._nloc_groups[nloc]
             natoms = nloc
+        elif not self._frames:
+            raise ValueError("LMDB dataset contains no frames")
         elif len(self._nloc_groups) == 1:
🧰 Tools
🪛 Ruff (0.15.2)

[warning] 707-709: Avoid specifying long messages outside the exception class

(TRY003)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@deepmd/dpmodel/utils/lmdb_data.py` around lines 705 - 719, In get_test(),
handle the empty-dataset case before using max on self._nloc_groups: if
self._nloc_groups is empty (or len(self._frames) == 0) raise a clear ValueError
(e.g. "No frames in LMDB / no nloc groups available") instead of letting
max(...) raise a cryptic error; otherwise proceed with the existing logic that
sets natoms and frame_indices from the largest group in self._nloc_groups.

Comment on lines +880 to +882
if src_nlocs is not None:
frame_nlocs.append(int(src_nlocs[i]))
else:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Guard frame_nlocs metadata indexing during merge.

src_nlocs[i] assumes metadata length matches nframes. On malformed/truncated metadata this raises IndexError and aborts merge. Add a bounds check and fall back to frame parsing/fallback natoms.

Suggested fix
-                if src_nlocs is not None:
-                    frame_nlocs.append(int(src_nlocs[i]))
-                else:
+                if src_nlocs is not None and i < len(src_nlocs):
+                    frame_nlocs.append(int(src_nlocs[i]))
+                else:
                     frame_raw = msgpack.unpackb(raw, raw=False)
                     atype_raw = frame_raw.get("atom_types")
                     if isinstance(atype_raw, dict):
                         shape = atype_raw.get("shape") or atype_raw.get(b"shape")
                         if shape:
                             frame_nlocs.append(int(shape[0]))
                         else:
                             frame_nlocs.append(fallback_natoms)
                     else:
                         frame_nlocs.append(fallback_natoms)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@deepmd/dpmodel/utils/lmdb_data.py` around lines 880 - 882, The merge logic
assumes src_nlocs has at least nframes entries and does
frame_nlocs.append(int(src_nlocs[i])) without bounds checking; change this to
guard the index: if src_nlocs is not None and i < len(src_nlocs) then append
int(src_nlocs[i]) else compute/derive the per-frame natoms (e.g., by parsing the
frame from the source data or using the fallback natoms value already available)
so that malformed/truncated metadata does not raise IndexError during the merge;
update the block around the frame_nlocs population (references: frame_nlocs,
src_nlocs, nframes) to perform this fallback behavior.

Comment on lines +233 to +266
for data, sys_label in data_items:
if sys_label != system:
log.info(f"# testing sub-group : {sys_label}")

if isinstance(dp, DeepPot):
err = test_ener(
dp,
data,
sys_label,
numb_test,
detail_file,
atomic,
append_detail=(cc != 0),
)
elif isinstance(dp, DeepDOS):
err = test_dos(
dp,
data,
sys_label,
numb_test,
detail_file,
atomic,
append_detail=(cc != 0),
)
elif isinstance(dp, DeepProperty):
err = test_property(
dp,
data,
sys_label,
numb_test,
detail_file,
atomic,
append_detail=(cc != 0),
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

append_detail is wrong for LMDB sub-groups and causes detail-file overwrite.

When one LMDB system is split into multiple nloc groups, all sub-groups of the first system still pass append_detail=False, so .e/.f/.v detail files get overwritten.

Proposed fix
-        for data, sys_label in data_items:
+        for sub_idx, (data, sys_label) in enumerate(data_items):
+            append_detail = (cc != 0) or (sub_idx != 0)
             if sys_label != system:
                 log.info(f"# testing sub-group : {sys_label}")
@@
                 err = test_ener(
@@
-                    append_detail=(cc != 0),
+                    append_detail=append_detail,
                 )
@@
                 err = test_dos(
@@
-                    append_detail=(cc != 0),
+                    append_detail=append_detail,
                 )
@@
                 err = test_property(
@@
-                    append_detail=(cc != 0),
+                    append_detail=append_detail,
                 )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@deepmd/entrypoints/test.py` around lines 233 - 266, The append_detail flag is
incorrectly derived from cc so LMDB sub-groups for the first system overwrite
detail files; instead track whether a given sys_label has already been written
and pass append_detail=True for subsequent sub-groups. Modify the loop over
data_items: introduce a small seen map (e.g., seen_systems dict keyed by
sys_label) and compute append_detail = seen_systems.get(sys_label, False) when
calling test_ener/test_dos/test_property (replace the current append_detail=(cc
!= 0)); after the call set seen_systems[sys_label] = True so later sub-groups
append rather than overwrite.

txn.put(b"__metadata__", msgpack.packb(meta, use_bin_type=True))
idx = 0
for natoms, count in frames_spec:
for j in range(count):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Rename unused loop variable to satisfy Ruff (B007).

Line 128 uses a loop variable that is never read.

Proposed cleanup
-            for j in range(count):
+            for _j in range(count):

As per coding guidelines **/*.py: Always run ruff check . and ruff format . before committing changes or CI will fail.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
for j in range(count):
for _j in range(count):
🧰 Tools
🪛 Ruff (0.15.2)

[warning] 128-128: Loop control variable j not used within loop body

Rename unused j to _j

(B007)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@source/tests/consistent/test_lmdb_data.py` at line 128, The loop variable j
in the loop "for j in range(count):" is unused; rename it to "_" (or
"_i"/"_count" per project convention) to satisfy Ruff B007 and avoid lint
failures, update any related references in that loop (none expected), and re-run
ruff check . and ruff format . to ensure formatting and linting pass.

Comment on lines +405 to +407
# All frames in batch have same nloc
self.assertEqual(atype.shape[1], atype.shape[1])
break # just check first batch
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Tautological assertion makes this test ineffective.

Line 406 compares the same value to itself, so it never validates batch consistency.

Proposed assertion
                 if atype is not None:
-                    # All frames in batch have same nloc
-                    self.assertEqual(atype.shape[1], atype.shape[1])
+                    nloc = atype.shape[1]
+                    for i in range(atype.shape[0]):
+                        self.assertEqual(int(batch["natoms"][i, 0]), nloc)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@source/tests/consistent/test_lmdb_data.py` around lines 405 - 407, The
current test contains a tautological assertion self.assertEqual(atype.shape[1],
atype.shape[1]) which does not validate batch consistency; replace it with an
assertion that compares the nloc dimension across frames in the batch (use atype
as the per-batch array) — e.g. compute a reference_nloc from the first frame
(reference_nloc = atype[0].shape[1]) and assert every frame's shape[1] equals
reference_nloc (or use numpy/all to check atype[:,1?] consistency) so the test
in test_lmdb_data.py actually verifies "All frames in batch have same nloc".

with env.begin(write=True) as txn:
idx = 0
for natoms in [4, 6, 8]:
for i in range(10):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Fix Ruff violations before merge (unused loop variable + dead assignment).

Line 617 (s_default) is assigned but never used (F841), and Line 480 uses an unused loop variable (B007).

Proposed cleanup
-            for i in range(10):
+            for _i in range(10):
                 key = format(idx, fmt).encode()
                 frame = _make_frame(natoms=natoms, seed=idx * 100)
                 txn.put(key, msgpack.packb(frame, use_bin_type=True))
                 frame_nlocs.append(natoms)
                 idx += 1
@@
-        s_default = DistributedSameNlocBatchSampler(
-            reader, rank=0, world_size=2, shuffle=True, seed=42
-        )
         s_custom = ReversePartition(reader, rank=0, world_size=2, shuffle=True, seed=42)

As per coding guidelines **/*.py: Always run ruff check . and ruff format . before committing changes or CI will fail.

Also applies to: 617-617

🧰 Tools
🪛 Ruff (0.15.2)

[warning] 480-480: Loop control variable i not used within loop body

Rename unused i to _i

(B007)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@source/tests/pt/test_lmdb_dataloader.py` at line 480, The test contains Ruff
violations: the loop uses an unused loop variable i and the variable s_default
is assigned but never used; fix by changing the loop to use a throwaway name
(e.g., replace "for i in range(10):" with "for _ in range(10):") or otherwise
use i, and remove or use the s_default assignment (remove the dead "s_default =
..." line if it's not needed or reference it where intended) so there are no
unused variables left.

@codecov
Copy link

codecov bot commented Mar 3, 2026

Codecov Report

❌ Patch coverage is 78.85196% with 140 lines in your changes missing coverage. Please review.
✅ Project coverage is 82.29%. Comparing base (24e54bf) to head (f91ba7d).

Files with missing lines Patch % Lines
deepmd/dpmodel/utils/lmdb_data.py 81.58% 86 Missing ⚠️
deepmd/entrypoints/test.py 58.82% 21 Missing ⚠️
deepmd/pt/train/training.py 26.08% 17 Missing ⚠️
deepmd/pt/entrypoints/main.py 52.63% 9 Missing ⚠️
deepmd/pt/utils/lmdb_dataset.py 93.06% 7 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #5283      +/-   ##
==========================================
- Coverage   82.32%   82.29%   -0.03%     
==========================================
  Files         768      770       +2     
  Lines       77098    77726     +628     
  Branches     3659     3659              
==========================================
+ Hits        63469    63963     +494     
- Misses      12458    12592     +134     
  Partials     1171     1171              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

@iProzd iProzd requested review from njzjz and wanghan-iapcm March 9, 2026 07:22
@iProzd iProzd marked this pull request as ready for review March 9, 2026 07:22
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (5)
deepmd/dpmodel/utils/lmdb_data.py (5)

248-260: ⚠️ Potential issue | 🟡 Minor

Validate batch_size to prevent silent misconfiguration.

Unknown string specs (e.g., "auto_invalid") fall through to self._auto_rule = 32 at line 256, hiding config mistakes. Non-positive integers are also accepted at line 260.

Suggested fix
         self._auto_rule: int | None = None
         if isinstance(batch_size, str):
             if batch_size == "auto":
                 self._auto_rule = 32
             elif batch_size.startswith("auto:"):
-                self._auto_rule = int(batch_size.split(":")[1])
+                try:
+                    self._auto_rule = int(batch_size.split(":", 1)[1])
+                except ValueError as exc:
+                    raise ValueError(
+                        f"Invalid batch_size spec: {batch_size!r}"
+                    ) from exc
+                if self._auto_rule <= 0:
+                    raise ValueError("auto batch_size rule must be > 0")
             else:
-                self._auto_rule = 32
+                raise ValueError(
+                    f"batch_size must be 'auto', 'auto:N', or a positive int, got {batch_size!r}"
+                )
             self.batch_size = _compute_batch_size(self._natoms, self._auto_rule)
         else:
             self.batch_size = int(batch_size)
+            if self.batch_size <= 0:
+                raise ValueError("batch_size must be > 0")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@deepmd/dpmodel/utils/lmdb_data.py` around lines 248 - 260, The batch_size
parsing in the constructor silently treats unknown string specs and non-positive
integers as valid (setting self._auto_rule = 32 or accepting non-positive ints),
so add explicit validation: in the batch_size handling for strings
(function/class using self._auto_rule and _compute_batch_size) only accept
"auto" or "auto:<positive_int>" and raise a ValueError for any other string
spec; when batch_size is numeric (the else branch that sets self.batch_size =
int(batch_size)) validate that the parsed int is > 0 and raise ValueError if
not; ensure error messages reference the provided batch_size value and keep use
of _compute_batch_size(self._natoms, self._auto_rule) unchanged when the auto
rule is valid.

974-976: ⚠️ Potential issue | 🟡 Minor

Guard against truncated frame_nlocs metadata during merge.

If src_nlocs has fewer entries than nframes (malformed metadata), src_nlocs[i] raises IndexError.

Suggested fix
-                if src_nlocs is not None:
+                if src_nlocs is not None and i < len(src_nlocs):
                     frame_nlocs.append(int(src_nlocs[i]))
                 else:
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@deepmd/dpmodel/utils/lmdb_data.py` around lines 974 - 976, The code indexing
src_nlocs[i] can raise IndexError when src_nlocs is shorter than nframes; in the
block that checks "if src_nlocs is not None:" (where
frame_nlocs.append(int(src_nlocs[i])) is called), first verify i <
len(src_nlocs) (or use a safe iterator) and if the entry is missing append a
sensible default (e.g., 0) or otherwise handle the malformed metadata (log/warn
and append 0) instead of indexing past the end; update the logic around
frame_nlocs, src_nlocs and nframes to perform this bounds check before
casting/indexing.

785-808: ⚠️ Potential issue | 🟡 Minor

Handle empty datasets in get_test().

If the LMDB has zero frames, self._nloc_groups is empty and max(self._nloc_groups, ...) at line 798 raises ValueError: max() arg is an empty sequence.

Suggested fix
         if nloc is not None:
             if nloc not in self._nloc_groups:
                 raise ValueError(
                     f"No frames with nloc={nloc}. Available: {sorted(self._nloc_groups.keys())}"
                 )
             frame_indices = self._nloc_groups[nloc]
             natoms = nloc
+        elif not self._frames:
+            raise ValueError("LMDB dataset contains no frames")
         elif len(self._nloc_groups) == 1:
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@deepmd/dpmodel/utils/lmdb_data.py` around lines 785 - 808, The method that
selects frames (inside get_test / the function using self._nloc_groups and
_frames) fails when the LMDB is empty because max(...) is called on an empty
self._nloc_groups; add an early check for empty dataset (e.g., if not
self._nloc_groups or not self._frames) and handle it by either raising a clear
ValueError ("No frames in LMDB") or returning an appropriate empty result before
reaching the mixed-nloc branch; update the branches that reference natoms and
frame_indices (the code that computes natoms = max(...) and frame_indices = ...)
so they only run when groups exist to avoid the max() on empty sequence.

436-447: ⚠️ Potential issue | 🟡 Minor

Remove unused unique_nlocs variable (Ruff F841).

unique_nlocs is assigned but never used. This will cause CI failure per coding guidelines.

Suggested fix
     def print_summary(self, name: str, prob: Any) -> None:
         """Print basic dataset info."""
-        unique_nlocs = sorted(self._nloc_groups.keys())
         nloc_info = ", ".join(
             f"{nloc}({len(idxs)})" for nloc, idxs in sorted(self._nloc_groups.items())
         )

As per coding guidelines, **/*.py: Always run ruff check . and ruff format . before committing changes or CI will fail.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@deepmd/dpmodel/utils/lmdb_data.py` around lines 436 - 447, In print_summary
(method print_summary) the local variable unique_nlocs is assigned from
self._nloc_groups but never used; remove the unused assignment (unique_nlocs =
sorted(self._nloc_groups.keys())) so the method only builds nloc_info from
self._nloc_groups and logs it, eliminating the Ruff F841 unused-variable error.

265-276: ⚠️ Potential issue | 🟡 Minor

Guard against out-of-range atom type IDs.

np.bincount(...)[: self._ntypes] truncates counts for type IDs >= self._ntypes, making natoms_vec inconsistent with frame contents.

Suggested fix
     def _compute_natoms_vec(self, atype: np.ndarray) -> np.ndarray:
         nloc = len(atype)
+        if atype.size and (np.min(atype) < 0 or np.max(atype) >= self._ntypes):
+            raise ValueError(
+                f"atype contains out-of-range IDs; expected [0, {self._ntypes - 1}]"
+            )
         counts = np.bincount(atype, minlength=self._ntypes)[: self._ntypes]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@deepmd/dpmodel/utils/lmdb_data.py` around lines 265 - 276,
_compute_natoms_vec currently truncates counts for atom type IDs >=
self._ntypes; to fix, validate the atype array before counting: compute max_id =
atype.max() (handle empty atype properly) and if max_id >= self._ntypes or any
id < 0 raise a clear ValueError referencing _compute_natoms_vec and include the
offending max_id and self._ntypes in the message; otherwise use
np.bincount(atype, minlength=self._ntypes) to build counts and keep the existing
vec[0]=nloc, vec[1]=nloc, vec[2:]=counts logic.
🧹 Nitpick comments (2)
deepmd/pt/train/training.py (1)

263-265: Consider documenting the _reader access pattern.

_data._reader accesses a private attribute of LmdbDataset. While acceptable within the same package, consider either making this a public property or adding a brief comment explaining this coupling is intentional for distributed sampler construction.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@deepmd/pt/train/training.py` around lines 263 - 265, The code is accessing a
private attribute _data._reader of LmdbDataset when constructing the
DistributedSameNlocBatchSampler; update the codebase to avoid relying on a
private member by either exposing a public property/method on LmdbDataset (e.g.,
reader or get_reader) and using that, or add a concise inline comment next to
the DistributedSameNlocBatchSampler construction documenting that accessing
_reader is an intentional coupling for distributed sampler creation and should
not be refactored without updating the sampler; reference symbols:
_data._reader, DistributedSameNlocBatchSampler, LmdbDataset.
deepmd/dpmodel/utils/lmdb_data.py (1)

869-869: Rename unused loop variable to _req_info.

req_info is not used within the loop body (Ruff B007). Rename to _req_info to indicate it's intentionally unused.

-        for key, req_info in all_keys.items():
+        for key, _req_info in all_keys.items():
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@deepmd/dpmodel/utils/lmdb_data.py` at line 869, The loop over all_keys uses
an unused variable name req_info which triggers a lint warning; update the for
statement in lmdb_data.py (the loop "for key, req_info in all_keys.items():") to
rename req_info to _req_info so it becomes "for key, _req_info in
all_keys.items():" to indicate the variable is intentionally unused (no other
changes needed if req_info is not referenced elsewhere).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@deepmd/pt/train/training.py`:
- Around line 554-555: The reported mismatch comes from using
LmdbDataset.total_batch (which does nframes // batch_size) to set
total_numb_batch while SameNlocBatchSampler.__len__ uses per-nloc ceiling
division; fix by computing total_numb_batch from the actual sampler instead of
total_batch: when using SameNlocBatchSampler (or when auto batch sizing / mixed
nloc are possible) set total_numb_batch = len(sampler) or recompute it by
summing ceil(group_size / batch_size_for_that_nloc) so num_steps (used later)
matches the sampler’s actual number of batches; update the block in training.py
that currently uses training_data.total_batch to use
SameNlocBatchSampler.__len__ (or equivalent ceiling-sum logic) so the two agree.

---

Duplicate comments:
In `@deepmd/dpmodel/utils/lmdb_data.py`:
- Around line 248-260: The batch_size parsing in the constructor silently treats
unknown string specs and non-positive integers as valid (setting self._auto_rule
= 32 or accepting non-positive ints), so add explicit validation: in the
batch_size handling for strings (function/class using self._auto_rule and
_compute_batch_size) only accept "auto" or "auto:<positive_int>" and raise a
ValueError for any other string spec; when batch_size is numeric (the else
branch that sets self.batch_size = int(batch_size)) validate that the parsed int
is > 0 and raise ValueError if not; ensure error messages reference the provided
batch_size value and keep use of _compute_batch_size(self._natoms,
self._auto_rule) unchanged when the auto rule is valid.
- Around line 974-976: The code indexing src_nlocs[i] can raise IndexError when
src_nlocs is shorter than nframes; in the block that checks "if src_nlocs is not
None:" (where frame_nlocs.append(int(src_nlocs[i])) is called), first verify i <
len(src_nlocs) (or use a safe iterator) and if the entry is missing append a
sensible default (e.g., 0) or otherwise handle the malformed metadata (log/warn
and append 0) instead of indexing past the end; update the logic around
frame_nlocs, src_nlocs and nframes to perform this bounds check before
casting/indexing.
- Around line 785-808: The method that selects frames (inside get_test / the
function using self._nloc_groups and _frames) fails when the LMDB is empty
because max(...) is called on an empty self._nloc_groups; add an early check for
empty dataset (e.g., if not self._nloc_groups or not self._frames) and handle it
by either raising a clear ValueError ("No frames in LMDB") or returning an
appropriate empty result before reaching the mixed-nloc branch; update the
branches that reference natoms and frame_indices (the code that computes natoms
= max(...) and frame_indices = ...) so they only run when groups exist to avoid
the max() on empty sequence.
- Around line 436-447: In print_summary (method print_summary) the local
variable unique_nlocs is assigned from self._nloc_groups but never used; remove
the unused assignment (unique_nlocs = sorted(self._nloc_groups.keys())) so the
method only builds nloc_info from self._nloc_groups and logs it, eliminating the
Ruff F841 unused-variable error.
- Around line 265-276: _compute_natoms_vec currently truncates counts for atom
type IDs >= self._ntypes; to fix, validate the atype array before counting:
compute max_id = atype.max() (handle empty atype properly) and if max_id >=
self._ntypes or any id < 0 raise a clear ValueError referencing
_compute_natoms_vec and include the offending max_id and self._ntypes in the
message; otherwise use np.bincount(atype, minlength=self._ntypes) to build
counts and keep the existing vec[0]=nloc, vec[1]=nloc, vec[2:]=counts logic.

---

Nitpick comments:
In `@deepmd/dpmodel/utils/lmdb_data.py`:
- Line 869: The loop over all_keys uses an unused variable name req_info which
triggers a lint warning; update the for statement in lmdb_data.py (the loop "for
key, req_info in all_keys.items():") to rename req_info to _req_info so it
becomes "for key, _req_info in all_keys.items():" to indicate the variable is
intentionally unused (no other changes needed if req_info is not referenced
elsewhere).

In `@deepmd/pt/train/training.py`:
- Around line 263-265: The code is accessing a private attribute _data._reader
of LmdbDataset when constructing the DistributedSameNlocBatchSampler; update the
codebase to avoid relying on a private member by either exposing a public
property/method on LmdbDataset (e.g., reader or get_reader) and using that, or
add a concise inline comment next to the DistributedSameNlocBatchSampler
construction documenting that accessing _reader is an intentional coupling for
distributed sampler creation and should not be refactored without updating the
sampler; reference symbols: _data._reader, DistributedSameNlocBatchSampler,
LmdbDataset.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: e42a3a8f-33f3-46a5-b818-47bcea016afe

📥 Commits

Reviewing files that changed from the base of the PR and between 5a10d4d and f91ba7d.

📒 Files selected for processing (3)
  • deepmd/dpmodel/utils/lmdb_data.py
  • deepmd/pt/train/training.py
  • pyproject.toml

Comment on lines +554 to +555
if isinstance(training_data, LmdbDataset):
total_numb_batch = training_data.total_batch
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Check how total_batch is computed vs sampler length
echo "=== LmdbDataReader.total_batch property ==="
rg -n "def total_batch" deepmd/dpmodel/utils/lmdb_data.py -A 3

echo -e "\n=== SameNlocBatchSampler.__len__ ==="
rg -n "def __len__" deepmd/dpmodel/utils/lmdb_data.py -A 6 | head -30

echo -e "\n=== LmdbDataReader.index property ==="
rg -n "def index" deepmd/dpmodel/utils/lmdb_data.py -A 3

Repository: deepmodeling/deepmd-kit

Length of output: 1452


Fix batch count mismatch between total_batch and SameNlocBatchSampler.__len__.

training_data.total_batch uses simple integer division: nframes // batch_size, while SameNlocBatchSampler.__len__ computes batches per nloc group using ceiling division and sums them. For datasets with mixed nloc values and auto batch_size, these calculations diverge, causing num_steps (line 568) to not match actual batches produced per epoch.

The issue: total_batch returns self.index[0] (max(1, nframes // batch_size)), but the sampler counts Σ ceil(group_size / batch_size_for_nloc) across all nloc groups—fundamentally different formulas.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@deepmd/pt/train/training.py` around lines 554 - 555, The reported mismatch
comes from using LmdbDataset.total_batch (which does nframes // batch_size) to
set total_numb_batch while SameNlocBatchSampler.__len__ uses per-nloc ceiling
division; fix by computing total_numb_batch from the actual sampler instead of
total_batch: when using SameNlocBatchSampler (or when auto batch sizing / mixed
nloc are possible) set total_numb_batch = len(sampler) or recompute it by
summing ceil(group_size / batch_size_for_that_nloc) so num_steps (used later)
matches the sampler’s actual number of batches; update the block in training.py
that currently uses training_data.total_batch to use
SameNlocBatchSampler.__len__ (or equivalent ceiling-sum logic) so the two agree.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants