Skip to content

Fix O(N) query regression in bulk folder operations (#1199)#1237

Open
JSv4 wants to merge 17 commits intomainfrom
claude/fix-issue-1199-cA3bg
Open

Fix O(N) query regression in bulk folder operations (#1199)#1237
JSv4 wants to merge 17 commits intomainfrom
claude/fix-issue-1199-cA3bg

Conversation

@JSv4
Copy link
Copy Markdown
Collaborator

@JSv4 JSv4 commented Apr 12, 2026

Summary

Optimize move_documents_to_folder() and delete_folder() in DocumentFolderService to reduce database queries from ~3 per document to 2 total queries per batch operation. This fixes a performance regression where bulk operations issued hundreds of queries instead of leveraging batch operations.

Key Changes

  • Batch path disambiguation: Replace per-document _disambiguate_path() queries with a single pre-fetch of occupied paths in the target directory, shared across all documents in the batch. Within-batch conflicts are resolved in-memory by mutating the shared set as each path is claimed.

  • Batch database writes: Replace individual save(update_fields=["is_current"]) and DocumentPath.objects.create() calls with two bulk operations:

    • Single DocumentPath.objects.filter().update(is_current=False) for all superseded paths
    • Single DocumentPath.objects.bulk_create() for all new successor paths
  • Signal dispatch for bulk_create: Since bulk_create() bypasses per-row post_save signals, manually dispatch them via new _dispatch_document_path_created_signals() to preserve the document-text embedding side effect.

  • Query optimization helpers:

    • _fetch_occupied_paths_in_directory(): Single-query fetch of active paths in a directory (with regex filtering to avoid pulling entire subtrees)
    • _target_directory_string(): Normalize folder paths to the format expected by the fetch helper
    • Updated _disambiguate_path() to accept occupied_override parameter for pre-fetched sets
  • Row-level locking refinement: Add of=("self",) to select_for_update() calls to scope locks to DocumentPath rows only, avoiding accidental locks on Document rows for the transaction duration.

Implementation Details

  • The batch path planning phase pre-computes all target paths and detects within-batch conflicts before any database writes occur
  • Occupied paths are tracked in a mutable set that grows as each document claims its disambiguated path, ensuring siblings with identical filenames receive unique suffixes
  • All changes occur within existing transaction.atomic() blocks, preserving rollback semantics
  • Test updates reflect the shift from sequential to batch processing (e.g., extra_occupiedoccupied_override parameter)

https://claude.ai/code/session_01Wp8voAR22aHiENmipX9jjL

Fix issue #1199: move_documents_to_folder() and delete_folder() regressed
to ~3 DB round-trips per document after PR #1195 added DocumentPath
history nodes. A 100-document bulk move issued ~300 queries instead of
the single .update() the pre-lineage code used.

Both methods now:

1. Pre-fetch occupied target-directory paths in a single query via the
   new _fetch_occupied_paths_in_directory() helper and share a mutable
   set across all disambiguations via _disambiguate_path's new
   occupied_override kwarg (replaces the single-use extra_occupied).
2. Batch-deactivate superseded rows with .filter(pk__in=...).update().
3. Batch-insert successor rows with bulk_create(), then manually
   dispatch post_save(created=True) via the new
   _dispatch_document_path_created_signals() helper so the
   document-text embedding side effect in
   documents.signals.process_doc_on_document_path_create still fires.
4. Use select_related("document") + select_for_update(of=("self",)) on
   the affected-path query to eliminate an N+1 on current.document and
   scope row locks to the DocumentPath table.

Net result: a 100-document bulk move is ~4 DB round-trips instead of
~300, with full Path Tree history preserved.

Tests were updated to match the batched call shape:
- The two monkeypatches of _disambiguate_path now use **kwargs so they
  pass through the new occupied_override parameter transparently.
- The two IntegrityError rollback tests for bulk move now patch
  bulk_create instead of create.
@claude
Copy link
Copy Markdown

claude bot commented Apr 12, 2026

Code Review

The optimization is well-conceived and correctly implemented. The shift from ~2N round-trips to 4 total queries for a bulk operation is the right approach. A few things worth looking at before merge:


1. _fetch_occupied_paths_in_directory — regex vs. index-friendly prefix scan

path__regex=r"^/[^/]+$" and path__regex=rf"^{re.escape(directory)}[^/]+$" are correct but PostgreSQL can't use a btree index for these patterns. A prefix scan would be index-eligible:

# root case
qs = qs.filter(path__startswith="/").exclude(path__contains="/", path__regex=r"^/.+/.+$")
# OR simply:
qs = qs.filter(path__regex=r"^/[^/]+$")  # current — correct but seq-scan on path

A more index-friendly alternative for the non-root case:

# Uses the index for the prefix, then filters the suffix in-memory
qs = qs.filter(path__startswith=directory).extra(
    where=["path NOT LIKE %s"],
    params=[directory + "%/%"],
)

Or, since path always starts with /, you can count slashes:

from django.db.models.functions import Length, Replace
qs = qs.filter(path__startswith=directory).annotate(
    slash_count=Length("path") - Length(Replace("path", Value("/"), Value("")))
).filter(slash_count=directory.count("/"))

For typical corpus sizes this is unlikely to matter, but if a corpus has thousands of files at root the regex forces a sequential scan on the path column. Worth at least adding a comment acknowledging this, or adding a GinIndex / pg_trgm index if the project doesn't have one already.


2. Test name vs. new semantics — test_delete_folder_rolls_back_successful_relocations_on_later_failure

Before this PR, the failure occurred mid-write (after the first document was already persisted), so the test genuinely verified rollback of partial writes. With the new two-phase approach, _disambiguate_path is called during the planning phase — before any DB writes — so the ValueError on call 2 propagates before filter().update() or bulk_create() ever runs.

The test still passes and the observable end-state is the same (no rows changed), but the test name and docstring now describe behavior that no longer occurs. Suggesting a rename/docstring update:

test_delete_folder_rolls_back_on_planning_failure

and update the docstring to reflect that the failure now happens in the path-planning phase, before any writes.


3. _dispatch_document_path_created_signals — implicit update_fields omission

post_save.send() is called without update_fields. For created=True this defaults to None, which matches what a regular create() would send. The process_doc_on_document_path_create handler doesn't inspect update_fields, so this is fine in practice. It's a minor doc/convention gap but not a bug.


4. occupied_override mutation contract — worth a comment

In _disambiguate_path the parameter is used as a read-only snapshot:

occupied = occupied_override  # direct alias, not a copy

The callers then mutate occupied_paths after the call returns. If a future change were to accidentally mutate occupied inside the function (e.g. during the suffix loop), callers would see unexpected side-effects. A short contract comment would help:

# ``occupied_override`` is treated as read-only inside this method;
# callers are responsible for appending the returned path to the set.
occupied = occupied_override

Confirmed correct

  • from __future__ import annotations (line 36) defers evaluation of the list[DocumentPath] type hint in _dispatch_document_path_created_signals, so the nested import correctly avoids the circular dependency without a NameError at class definition time. ✓

  • Signal handler N+1: DocumentPath(document=current.document, ...) in the list comprehension populates _document_cache from the pre-fetched current.document (which came via select_related). instance.document in the signal handler resolves from cache, not a new query. ✓

  • transaction.on_commit inside atomic block: Embedding tasks are queued via on_commit in the handler, so they fire only when the outermost transaction commits — same semantics as the old per-row create(). ✓

  • bulk_create PK population on PostgreSQL: Django uses RETURNING id on PostgreSQL, so the instances in created_paths have PKs set and instance=path in the signal dispatch is fully populated. ✓

  • Row-lock scoping of=("self",): Correctly limits SELECT FOR UPDATE to the DocumentPath table, preventing accidental locks on joined Document rows. ✓

  • IntegrityError tests updated: Mocking bulk_create instead of create is the correct change given the new write path. ✓


Overall this is a clean, well-documented optimization. The three items above are minor — the regex index efficiency note is the only one I'd consider before shipping on a large-dataset instance.

@codecov
Copy link
Copy Markdown

codecov bot commented Apr 12, 2026

Codecov Report

❌ Patch coverage is 98.46154% with 1 line in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
opencontractserver/corpuses/folder_service.py 98.46% 1 Missing ⚠️

📢 Thoughts on this report? Let us know!

JSv4 added 2 commits April 12, 2026 22:20
- Add comment noting regex filters in _fetch_occupied_paths_in_directory
  cannot use btree indexes and may benefit from GIN/pg_trgm for large
  directory listings.
- Add read-only contract comment on occupied_override parameter in
  _disambiguate_path to prevent accidental in-method mutation.
- Document intentional update_fields=None omission in
  _dispatch_document_path_created_signals matching Model.save() semantics.
- Rename test_delete_folder_rolls_back_successful_relocations_on_later_failure
  to test_delete_folder_rolls_back_on_planning_failure with updated docstring
  reflecting that failure now occurs in the path-planning phase before writes.
@claude
Copy link
Copy Markdown

claude bot commented Apr 13, 2026

PR Review: Fix O(N) query regression in bulk folder operations

This is a well-executed performance fix. The architecture is sound, the comments are thorough, and the batch approach correctly eliminates the O(N) round-trip regression. A few things worth addressing before merge:


Medium — move_single_document_to_folder not updated to of=("self",)

The bulk methods received select_for_update(of=("self",)) to scope row locks to DocumentPath only, but move_single_document_to_folder at line 998 still uses the unscoped select_for_update(), which would accidentally hold a lock on the joined Document row for the duration of its savepoint. Should be consistent:

# folder_service.py ~line 997-998
current = (
    DocumentPath.objects.select_for_update(of=("self",))  # add of=("self",)
    ...
)

Medium — _dispatch_document_path_created_signals omits raw and using kwargs

Django's Model.save() dispatches post_save with these kwargs:

post_save.send(sender=origin, instance=self, created=..., update_fields=..., raw=raw, using=using)

The manual dispatch omits raw and using. The current handler absorbs them via **kwargs, so nothing breaks today, but any future handler that inspects raw (e.g. a data-import guard) or using (multi-DB routing) would receive wrong values silently. Consider:

for path in paths:
    post_save.send(
        sender=DocumentPath,
        instance=path,
        created=True,
        raw=False,
        using=path._state.db,
    )

Low — _fetch_occupied_paths_in_directory empty-string fallback loads all corpus paths

When directory == "" (i.e., base_path has no /), the method queries is_current=True, is_deleted=False with no path filter — pulling the entire active path set for the corpus into memory. The docstring acknowledges this as a "rare fallback," but paths without a leading / appear to be structurally invalid in this codebase (every stored path starts with /). A defensive guard or assertion would make the invariant explicit and prevent a silent full-table scan if a malformed path ever reaches this code:

if not directory:
    # base_path had no leading slash — structurally unexpected.
    # Fall through to the unfiltered query rather than crashing,
    # but log a warning so we can catch regressions early.
    logger.warning("_fetch_occupied_paths_in_directory called with empty directory for corpus %s", corpus.id)

Low — Test coverage gap: signal dispatch after bulk_create

The test suite verifies rollback scenarios and path disambiguation for the new batch path, but I don't see a test that asserts _dispatch_document_path_created_signals (or the downstream calculate_embedding_for_doc_text.delay call) is actually invoked after a successful bulk move. The old code got this for free via Django's native post_save; the new manual dispatch is a new code path that could silently regress. A test that mocks post_save.send (or the Celery task) and asserts it was called N times would be good insurance.


Nit — _target_directory_string docstring example

- Folder ``Legal/Contracts`` → ``"/Legal/Contracts/"``

Should mention that the input comes from get_path() which returns a path already stripped of leading/trailing slashes (or link to the get_path() contract), since the .strip("/") call is load-bearing here and it's not obvious why it's needed.


Positive callouts

  • The re.escape(directory) in the regex filter is the right call — prevents injection from user-controlled folder names.
  • The select_related("document") + of=("self",) pairing is clean; it avoids the N+1 on current.document in the build loop while keeping the lock footprint minimal.
  • The in-memory occupied_paths.add(new_path) mutation for within-batch sibling disambiguation is well-explained and correctly mirrors the old batch_claimed approach.
  • Test updates accurately reflect the semantic shift from sequential→batch (the fail_on_second mock using **kwargs pass-through is clean).
  • Deleting the TODO(perf) comments that this PR resolves is the right move per the "no dead code" rule in CLAUDE.md.

JSv4 added 2 commits April 12, 2026 23:41
…tch test

- Add of=("self",) to move_document_to_folder's select_for_update for
  consistency with bulk methods (prevents accidental Document row locks)
- Supply raw=False and using=path._state.db in manual post_save dispatch
  to match Django's native Model.save() kwargs
- Add logger.warning for empty-directory fallback in
  _fetch_occupied_paths_in_directory to surface malformed paths early
- Clarify _target_directory_string docstring re: strip("/") rationale
- Add test_bulk_move_dispatches_post_save_for_each_created_path to verify
  manual signal dispatch fires exactly N times with correct kwargs
@claude
Copy link
Copy Markdown

claude bot commented Apr 13, 2026

Code Review

This is a well-executed optimization. The core approach — pre-fetching occupied paths once, planning all moves in-memory, then executing two bulk queries — is correct and the implementation is clean. A few items worth addressing before merge:


Bug / Correctness

1. _target_directory_string edge case with empty get_path() return

# _target_directory_string
folder_path = target_folder.get_path().strip("/")
return f"/{folder_path}/"

If CorpusFolder.get_path() ever returns an empty string (e.g. a root-level folder that isn't None), strip("/") produces "" and the result is "//". _fetch_occupied_paths_in_directory would then fall into the elif directory: branch and apply regex ^//[^/]+$, matching no real paths — the batch move would silently succeed with no conflict detection. Worth adding an assertion or guard:

if not folder_path:
    raise ValueError(f"CorpusFolder {target_folder.pk} returned empty path from get_path()")

Test Coverage Gap

2. No test for delete_folder signal dispatch

test_bulk_move_dispatches_post_save_for_each_created_path only covers move_documents_to_folder. The delete_folder method also calls _dispatch_document_path_created_signals (line 132 of the diff), but there's no corresponding test. Given that the signal dispatch is the primary side-effect safety net for document-text embedding, this path deserves the same coverage.


Minor Issues

3. directory == "" fallback loads ALL active paths

In _fetch_occupied_paths_in_directory, the empty-directory fallback:

else:
    logger.warning(...)
    # falls through with no filter — loads all active paths in corpus

The warning is good, but a corpus with tens of thousands of documents could hit an OOM or very slow query if this path is ever triggered. Consider raising instead of warning, or at minimum adding a hard filter like qs = qs.none() since this state is documented as "structurally unexpected":

else:
    logger.error(...)
    raise ValueError(f"_fetch_occupied_paths_in_directory: empty directory for corpus {corpus.id}")

4. occupied_override mutation contract is implicit

The docstring says occupied_override is "treated as read-only inside this method", but callers immediately mutate it with occupied_paths.add(new_path) after each call. This shared-mutable-set contract is unusual. It works correctly, but a comment at the call site clarifying the invariant (rather than just in the method docstring) would help future readers:

new_path = cls._disambiguate_path(new_path, corpus, occupied_override=occupied_paths)
occupied_paths.add(new_path)  # Claim path so next sibling in batch resolves to a different suffix

(The comment at line 208 already does this — just flagging it's worth keeping/not losing in future refactors.)


Positive Notes

  • select_for_update(of=("self",)) is an excellent improvement — the original lock on joined rows could deadlock with concurrent operations on the Document table.
  • The select_related("document") avoidance of N+1 in the planning loop is clean.
  • Renaming test_delete_folder_rolls_back_successful_relocations_on_later_failuretest_delete_folder_rolls_back_on_planning_failure is accurate — the semantics genuinely changed (no writes happen before the rollback now).
  • The regex injection safety via re.escape(directory) is correctly handled.
  • Manual post_save dispatch arguments (raw=False, using=path._state.db, omitted update_fields) correctly replicate Django's native Model.save() dispatch semantics.

Summary

Blocking: item 1 (silent correctness failure on empty get_path()) and item 2 (missing test for delete_folder signal dispatch). Items 3 and 4 are suggestions rather than blockers, but 3 in particular would prevent a hard-to-diagnose OOM in production.

JSv4 added 2 commits April 13, 2026 08:31
…-cA3bg

# Conflicts:
#	CHANGELOG.md
#	opencontractserver/corpuses/folder_service.py
…te_folder signal test

- _target_directory_string: raise ValueError if CorpusFolder.get_path()
  returns empty string, preventing "//" directory which would match all
  paths instead of the intended folder
- _fetch_occupied_paths_in_directory: upgrade empty-directory fallback
  from a warning (which silently loads ALL active paths) to a ValueError,
  failing fast on malformed input rather than masking bugs
- Add test_delete_folder_dispatches_post_save_for_each_created_path to
  verify signal dispatch parity with the existing bulk-move signal test
@claude
Copy link
Copy Markdown

claude bot commented Apr 13, 2026

Review: Fix O(N) query regression in bulk folder operations (#1199)

This is a well-executed optimization. The batch architecture is correct, the concurrency documentation is thorough, and the test expansion is meaningful. A few items worth addressing.


Bug: missing update_fields in post_save.send()

File: opencontractserver/corpuses/folder_service.py, lines 1387-1394

Django's Model.save() always dispatches post_save with update_fields set explicitly (None for creates). The manual dispatch in _dispatch_document_path_created_signals omits it:

# current
post_save.send(
    sender=DocumentPath,
    instance=path,
    created=True,
    raw=False,
    using=path._state.db,
)

The inline comment says "update_fields is intentionally omitted (defaults to None)", but omitting the key is not the same as passing update_fields=None. Any signal handler that declares update_fields as an explicit keyword argument without a default will raise TypeError. The current process_doc_on_document_path_create handler uses **kwargs so it survives, but this creates a silent compatibility trap for future handlers.

Fix — add the missing kwarg:

post_save.send(
    sender=DocumentPath,
    instance=path,
    created=True,
    update_fields=None,   # matches Django Model.save() dispatch for creates
    raw=False,
    using=path._state.db,
)

The two signal dispatch tests should also assert kwargs.get("update_fields") is None to lock this in.


Missing test: delete_folder + IntegrityError during bulk_create

TestCoverageGap_BulkMoveIntegrityErrorRollback covers the move_documents_to_folder path. delete_folder uses the identical bulk_create -> signal dispatch pattern (lines 918-919) but has no equivalent test. The planning-phase rollback test (test_delete_folder_rolls_back_on_planning_failure) covers failures before any DB writes; a complementary test that patches DocumentPath.objects.bulk_create to raise IntegrityError after the filter().update(is_current=False) call would verify that the outer transaction.atomic() correctly undoes the deactivation update too.


Minor: _target_directory_string error path has no test

The ValueError raised when a non-root CorpusFolder.get_path() returns an empty string (lines 1353-1356) is not exercised anywhere. A short unit test that mocks get_path to return "" would complete the contract.


Observations (no action required)

Regex index note is accurate but worth a follow-up ticket: The comment in _fetch_occupied_paths_in_directory correctly flags that path__regex can't use a btree index on path. For typical corpus sizes this is fine; opening a tracking issue for the GIN/pg_trgm path would keep it from being forgotten.

In-memory sync after filter().update(): The loop that sets current.is_current = False in memory (lines 1221-1222) is a good defensive pattern. It only syncs the objects in planned_paths; the original current_paths list (which includes already-in-target skipped entries) is not touched — but that's fine since the method returns immediately after.

Concurrency documentation: The _disambiguate_path docstring accurately describes the TOCTOU window and the unique constraint as the true safety net. This is exactly the right way to document this class of race condition.

target_folder_path pre-resolution: Hoisting folder.get_path() out of the inner loop (line 1188) correctly avoids O(N) recursive CTE queries. Good catch.

@JSv4 JSv4 linked an issue Apr 14, 2026 that may be closed by this pull request
JSv4 added 3 commits April 13, 2026 20:48
- move_documents_to_folder now calls get_path() exactly once and derives
  both the target directory string and the compute-moved-path argument
  from the cached value via new _target_directory_string_from_path helper
- TestCoverageGap_DisambiguateNoSlashPath tests updated to expect
  ValueError since _fetch_occupied_paths_in_directory now intentionally
  rejects empty directory strings (commit 62cfc7d)
- Add explicit update_fields=None to post_save.send() in
  _dispatch_document_path_created_signals to match Django Model.save()
  dispatch signature and prevent TypeError in future handlers
- Add update_fields assertion to both signal dispatch tests
- Add TestCoverageGap_DeleteFolderIntegrityErrorRollback: verifies
  IntegrityError during delete_folder's bulk_create rolls back the
  deactivation update and folder deletion
- Add TestCoverageGap_TargetDirectoryStringEmptyPath: exercises the
  ValueError guard when get_path() returns empty string
@claude
Copy link
Copy Markdown

claude bot commented Apr 14, 2026

PR Review: Fix O(N) query regression in bulk folder operations (#1199)

This is a solid performance fix with well-thought-out batching semantics. Here are my observations:


Strengths

Query reduction is well-targeted. The two-query batch pattern (filter().update() + bulk_create()) is the right approach for this use case. The move from ~3 queries/document to ~4 queries total for any batch size is a genuine O(N)→O(1) reduction.

Signal dispatch is correctly preserved. Manually firing post_save via _dispatch_document_path_created_signals() correctly compensates for bulk_create's signal bypass. The kwargs (created=True, raw=False, update_fields=None, using=path._state.db) match what Django's Model.save() would emit for a new row.

In-memory occupancy tracking is sound. Using a shared mutable set that grows as each path is claimed (within the planning loop) correctly handles same-batch filename conflicts without extra DB round-trips.

of=("self",) on select_for_update is a good defensive fix—scoping the row lock to DocumentPath prevents accidental locks on joined Document rows.

Test coverage is thorough. New tests cover: signal dispatch count, rollback on IntegrityError during bulk_create, ValueError on empty get_path(), and the behavior change for slashless paths.


Issues & Concerns

1. Behavior break: slashless paths now raise instead of silently succeeding

In _disambiguate_path, when occupied_override is None and the path has no /, the old code set directory = "" and matched all active paths (a wide but non-crashing fallback). The new code delegates to _fetch_occupied_paths_in_directory which explicitly raise ValueError for directory == "".

The test class TestCoverageGap_DisambiguateNoSlashPath has been updated to assert the new ValueError behavior—but this is a semantic change that the PR description does not highlight. If any caller can reach _disambiguate_path with a slashless path at runtime (e.g. a document with a legacy stored path missing the leading /), that caller now gets an unhandled exception rather than a graceful degradation.

Recommendation: Add a guard in callers or at the top of _disambiguate_path that asserts/normalizes the leading slash, and document this invariant clearly (it's partially noted in the docstring already).

2. Two redundant static methods doing the same thing

_target_directory_string(folder: CorpusFolder | None) and _target_directory_string_from_path(folder_path: str | None) are nearly identical:

# _target_directory_string
if target_folder is None:
    return "/"
folder_path = target_folder.get_path().strip("/")
...
return f"/{folder_path}/"

# _target_directory_string_from_path
if folder_path is None:
    return "/"
stripped = folder_path.strip("/")
...
return f"/{stripped}/"

The only difference is whether the caller hands in a CorpusFolder or a pre-resolved path string. This duplicates logic and two error messages. Per the project's DRY principle, a single method with an optional pre-resolved path parameter would be cleaner—or simply inline _target_directory_string_from_path at its one call site in move_documents_to_folder (it's a 3-line transformation).

3. _fetch_occupied_paths_in_directory uses unindexed regex scans

The docstring calls this out, but it's worth underscoring: path__regex=r"^/[^/]+" and path__regex=rf"^{re.escape(directory)}[^/]+$" cannot use a btree index on path. For any corpus with thousands of files, every call to this helper does a full table scan of DocumentPath filtered to (corpus, is_current=True, is_deleted=False).

The comment suggests a GIN/pg_trgm index as a future improvement. Given this is a performance fix, it's worth leaving a concrete TODO(index) comment or opening a follow-up issue so this doesn't silently become the new bottleneck for large corpora.

4. Signal dispatch test is fragile—it patches post_save at the module level

@patch("opencontractserver.corpuses.folder_service.post_save")
def test_delete_folder_dispatches_post_save_for_each_created_path(self, mock_signal):

Patching post_save at the module reference level is correct, but if _dispatch_document_path_created_signals is ever moved to a utility module or the import path changes, these tests will silently stop verifying the real dispatch (they'd patch a different name). This is low-risk today, but worth noting.

Also, the call.kwargs extraction logic inside the assertion loop is duplicated verbatim between the two new signal-dispatch tests:

kwargs = call.kwargs if call.kwargs else {}
if not kwargs and len(call) > 1:
    kwargs = call[1]

This would be worth extracting to a helper in the test class.

5. moved_count = len(created_paths) changes semantics slightly

Previously moved_count was incremented inside the per-document loop (i.e. it counted successful individual moves). Now it equals len(created_paths). Since bulk_create returns all inserted rows (including any that might have been filtered by update_conflicts—though that flag isn't used here), this is functionally equivalent. Just noting this in case bulk_create behavior changes or update_conflicts gets added later.

6. Minor: in-memory sync after .update() iterates twice over planned_paths

DocumentPath.objects.filter(pk__in=old_path_pks).update(is_current=False)
for current, _ in planned_paths:
    current.is_current = False

This double-iteration (once to build old_path_pks, once to sync in-memory) is fine for correctness, but the comment says "Keep in-memory instances in sync… so that callers relying on current.is_current after this method returns see the correct value without a refetch." In practice, current instances aren't returned to callers—only moved_count and an error string are. This sync is defensive but documented reasoning could note it's for signal handlers that might inspect the instance during _dispatch_document_path_created_signals.


Minor Nits

  • The _dispatch_document_path_created_signals docstring says "Nested import to avoid circular dependency during app initialization"—this import is inside the method body, which is fine, but the comment would be more accurate if it noted this is a runtime circular import guard (not just initialization).
  • planned_paths: list[tuple] = [] # (current, new_path) — worth typing this as list[tuple[DocumentPath, str]] for clarity.

Summary

The core optimization is correct, well-tested, and clearly explained. The main issues worth addressing before merge are:

  1. The slashless-path behavior change (now raises instead of degrading gracefully) should be called out explicitly and any legacy path data validated.
  2. The two near-identical directory-string helpers violate DRY and should be consolidated.
  3. A follow-up issue for the regex index gap should be filed or the TODO made more actionable.

Everything else is minor. Good work on the batching logic and thorough signal-dispatch testing.

JSv4 added 2 commits April 14, 2026 00:21
…ding-slash guard

- Consolidate _target_directory_string to delegate to _target_directory_string_from_path,
  eliminating duplicated path-normalization logic (DRY)
- Add early leading-slash guard in _disambiguate_path so callers that pass slashless
  paths get a clear error at the point of call rather than a confusing ValueError
  from _fetch_occupied_paths_in_directory
- Remove dead else branch (directory = "") in _disambiguate_path now that the guard
  makes it unreachable
@JSv4
Copy link
Copy Markdown
Collaborator Author

JSv4 commented Apr 14, 2026

Review feedback addressed

Merged main into the PR branch and assessed the four review items:

1. Slashless paths raising from wrong location (FIXED)

Valid concern, addressed. Added an early startswith("/") guard at the top of _disambiguate_path so callers that pass slashless paths get a clear, localized error message rather than a confusing ValueError from _fetch_occupied_paths_in_directory. Also removed the now-unreachable directory = "" dead code branch.

In practice, all current callers pass paths from _compute_moved_path, which always returns paths starting with /, so this is defense-in-depth for future callers.

2. Two redundant static methods (FIXED)

Valid concern, addressed. Consolidated _target_directory_string to delegate to _target_directory_string_from_path instead of duplicating the path-normalization logic. The _from_path variant is the canonical implementation; the folder-accepting variant is now a thin wrapper that calls get_path() then delegates.

3. move_single_document_to_folder not updated to of=("self",) (ALREADY ADDRESSED)

Not a real issue. The single-document method move_document_to_folder at line 998 already uses select_for_update(of=("self",)). This was present before the review.

4. Missing update_fields in post_save.send() (ALREADY ADDRESSED)

Not a real issue. The signal dispatch in _dispatch_document_path_created_signals already passes update_fields=None, which is correct for created=True signals — this matches Django's own Model.save() behavior for newly created instances.

@claude
Copy link
Copy Markdown

claude bot commented Apr 14, 2026

Review: Fix O(N) query regression in bulk folder operations (#1237)

The core optimization is sound. Replacing ~3N DB round-trips with ~4 total queries using bulk_create + filter().update() is the right approach, the batch path-planning phase correctly resolves within-batch filename conflicts in memory, and the select_for_update(of=("self",)) scoping is a good tightening of the lock strategy. Here are a few things worth addressing before merge:


Bug / Dead Code

_target_directory_string is unreachable production code (folder_service.py:1363–1378)

The new _target_directory_string(target_folder) wrapper is never called by production code. move_documents_to_folder calls _target_directory_string_from_path(target_folder_path) directly (after resolving get_path() once), and delete_folder hardcodes "/" without going through either helper. The method is only exercised by two test assertions in TestCoverageGap_TargetDirectoryStringEmptyPath. Per the "No dead code" rule in CLAUDE.md, this should either be removed or actively used in the call sites that comment about it.

Note: there's also a stale comment at folder_service.py:1172 that says "We reuse the same value for both _target_directory_string and _compute_moved_path" — but _target_directory_string is not actually called there.


Correctness Concern

TOCTOU blast radius is now larger with bulk_create

The docstring correctly notes the TOCTOU race: a concurrent insert can claim a path between the occupied-paths pre-fetch and the bulk_create. Previously, per-row DocumentPath.objects.create() calls meant a race only caused the one conflicting document to raise IntegrityError, which could be handled with a savepoint and retry. Now, a single constraint violation in bulk_create rolls back the entire batch. For most workloads this is fine — the caller retries — but it's a behavioral change worth calling out explicitly in the docstring, since the old code's partial-success semantics are gone.


Minor Issues

Imprecise type annotation (folder_service.py:881, 1190)

planned_paths: list[tuple] = []  # (current, new_path)

This could be list[tuple[DocumentPath, str]]. The comment explains the tuple shape, but the type is not enforced or communicated to tooling.


Fragile call_args inspection in signal dispatch tests (test_document_folder_service.py:2601–2608)

kwargs = call.kwargs if call.kwargs else {}
if not kwargs and len(call) > 1:
    kwargs = call[1]

unittest.mock.call objects expose .args and .kwargs. The len(call) > 1 branch is accessing mock internals and will silently produce an empty dict if the call structure ever changes. A simpler and more robust pattern:

for call in send_calls:
    _, call_kwargs = call
    self.assertEqual(call_kwargs["sender"], DocumentPath)
    self.assertTrue(call_kwargs["created"])

Or just call.kwargs if all arguments are passed as keyword args (they are — post_save.send is called with all kwargs explicitly).


delete_folder inconsistency with _target_directory_string_from_path (folder_service.py:879)

occupied_paths = cls._fetch_occupied_paths_in_directory(corpus, "/")

Using _target_directory_string_from_path(None) here would mirror the move_documents_to_folder pattern and make the intent explicit that "/" means corpus root. Minor, but the helper exists for exactly this purpose.


What's Good

  • Signal dispatch fires inside transaction.atomic(), so transaction.on_commit callbacks in process_doc_on_document_path_create are correctly deferred until the outer transaction commits. ✓
  • re.escape(directory) in _fetch_occupied_paths_in_directory prevents regex injection from adversarial folder names. ✓
  • The performance regression test (test_bulk_move_calls_get_path_once) that asserts get_path() is called exactly once is a great pattern for locking in this invariant. ✓
  • In-memory is_current = False sync after the batch update is a clean defensive touch. ✓
  • The btree-index caveat in _fetch_occupied_paths_in_directory docstring is a good proactive note for future maintainers. ✓

JSv4 added 2 commits April 14, 2026 06:16
# Conflicts:
#	CHANGELOG.md
#	opencontractserver/corpuses/folder_service.py
#	opencontractserver/tests/test_document_folder_service.py
…ify tests

- Remove unused _target_directory_string wrapper (dead code per CLAUDE.md rules);
  update its test class to exercise _target_directory_string_from_path directly
  with additional edge-case coverage (empty, slash-only, normal path).
- Fix stale comment referencing removed _target_directory_string method.
- Tighten planned_paths type annotation from list[tuple] to
  list[tuple[DocumentPath, str]] for static analysis and readability.
- Simplify fragile call_args inspection in signal dispatch tests: replace
  multi-branch kwargs extraction with direct tuple unpacking.
@claude
Copy link
Copy Markdown

claude bot commented Apr 14, 2026

PR Review: Fix O(N) query regression in bulk folder operations (#1199)

The performance optimisation direction is sound — replacing per-document round-trips with two batch queries is a meaningful win for large corpora. The implementation is clean and well-commented. However, there are a few issues worth addressing before this merges.


Bug: Pre-existing tests will fail (critical)

Three tests that are not in the diff patch DocumentPath.objects.create, but the new bulk path calls DocumentPath.objects.bulk_create instead. The mocks never fire, so the assertion on call counts fails — and in one case the entire test outcome flips:

Test Assertion that breaks Expected Will get
TestBulkMoveIntegrityRecovery.test_bulk_move_recovers_from_transient_integrity_error self.assertEqual(attempts["count"], 3) 3 (retry counted) 0 (mock never called)
TestDeleteFolderIntegrityRecovery.test_delete_folder_recovers_from_transient_integrity_error self.assertEqual(attempts["count"], 2) 2 0
TestDeleteFolderIntegrityRecovery.test_persistent_failure_preserves_folder_and_documents self.assertFalse(success) False True — folder is deleted

The third case is the most dangerous: the test that was supposed to verify "persistent failures preserve the folder" now passes the operation through cleanly because bulk_create isn't mocked, and the assertion on success fails in the wrong direction.

These tests need to either be updated to patch bulk_create or removed/replaced if the per-batch-retry invariant is intentionally dropped.


Correctness concern: TOCTOU retry protection removed for bulk operations

The previous implementation used _create_successor_path_with_retry per document, which wraps each insert in a savepoint and retries on IntegrityError from the unique_active_path_per_corpus partial unique index. This PR removes that retry loop for bulk paths.

The new approach: pre-fetch occupied paths → plan all paths in memory → batch updatebulk_create. The window between the pre-fetch and bulk_create is still vulnerable to concurrent transactions inserting paths in the target directory. When that race is lost, the entire batch is rolled back with no retry.

For small corpora this is unlikely to matter. For high-concurrency workloads (multiple users bulk-moving documents to the same folder simultaneously), a batch of 100 documents can now fail catastrophically where previously only the one document that lost the race would retry transparently.

The tradeoff (throughput vs. concurrency resilience) is legitimate, but it should be:

  1. Explicitly acknowledged in the docstring
  2. Reflected in a changelog entry under the TOCTOU issue (DocumentPath: TOCTOU race on path uniqueness needs partial unique index #1200), noting the scope change
  3. Covered by a test that verifies the all-or-nothing rollback for bulk paths (the new TestCoverageGapBulkMoveIntegrityErrorRollback does this — it just needs to be paired with the removal/update of the retry-specific tests above)

Stale docstring in move_documents_to_folder

Lines 1085-1096 still describe the old sequential approach:

**Within-batch conflict detection**: Each move runs sequentially
inside the outer transaction.  As each successor row is committed,
subsequent disambiguations naturally see the new path via both an
in-memory ``batch_claimed`` set and a re-query of the corpus's
active paths.

batch_claimed no longer exists; within-batch detection is now done via the mutable occupied_paths set during the planning phase. The docstring should be updated to describe the two-phase approach (plan all paths → batch write).

Similarly, lines 1093-1096 say:

**TOCTOU race recovery**: Each successor insert runs in a savepoint
and is retried ... — see ``_create_successor_path_with_retry``.

This is no longer true for the bulk path. Only move_document_to_folder (single-doc) still uses _create_successor_path_with_retry.


Minor nits

_dispatch_document_path_created_signals circular import comment: The nested import in that helper (from opencontractserver.documents.models import DocumentPath) is fine, but DocumentPath is already imported at the top of the class methods via local imports (e.g. line 840). Adding a module-level note or grouping the deferred imports together would make it easier to audit.

delete_folder error message doesn't include corpus ID: The logger.error at line 931 logs folder.id but not corpus.id, while move_documents_to_folder's error log includes both. Consistent logging makes on-call debugging easier.

Regex filter performance note is good but could include migration guidance: The comment noting that path__regex can't use a btree index (lines 387-392) is appreciated. If this becomes a hotspot, a concrete follow-up ticket reference would help someone pick it up later.


What works well

  • select_for_update(of=("self",)) scoping is a clean fix for the accidental Document table locking
  • select_related("document") eliminates the N+1 in the successor-row build loop
  • Pre-fetching occupied_paths once and mutating it in the planning loop correctly handles within-batch filename collisions
  • The manual post_save signal dispatch is correctly placed inside the transaction.atomic() block so on_commit callbacks fire against the outer transaction
  • New tests for _target_directory_string_from_path edge cases and the _dispatch_document_path_created_signals count are solid additions
  • The ValueError on bare paths (no leading /) in _disambiguate_path is a good guard

JSv4 added 2 commits April 14, 2026 19:36
Remove TestBulkMoveIntegrityRecovery and TestDeleteFolderIntegrityRecovery
test classes that mocked DocumentPath.objects.create but the implementation
uses bulk_create. These tests tested per-row retry logic from the old
sequential approach that no longer exists in the batch code path. The
equivalent rollback behavior is already covered by
TestCoverageGapBulkMoveIntegrityErrorRollback and
TestCoverageGap_DeleteFolderIntegrityErrorRollback which correctly mock
bulk_create.

Also update the move_documents_to_folder docstring to accurately describe
the two-phase plan-then-execute approach and all-or-nothing TOCTOU
semantics, replacing stale references to the old sequential approach
and _create_successor_path_with_retry.

Add corpus_id to delete_folder error log for consistency with
move_documents_to_folder error logging.
@claude
Copy link
Copy Markdown

claude bot commented Apr 15, 2026

Code Review: Fix O(N) query regression in bulk folder operations (#1199)

Overview

This PR optimizes move_documents_to_folder() and delete_folder() in DocumentFolderService from ~3 DB round-trips per document down to ~4 total for an entire batch. The core strategy: pre-fetch occupied paths once, plan all disambiguated target paths in-memory, then execute deactivations and inserts in two bulk queries. A new _dispatch_document_path_created_signals() helper compensates for bulk_create bypassing per-row post_save signals.

The changes are well-structured and solve a real problem. The notes below are mostly about edge cases and risks that appear worth discussing before merge.


Correctness / Potential Bugs

TOCTOU regression — bulk path is now more fragile than the old per-row approach

The old code used _create_successor_path_with_retry() (savepoint + up to MAX_PATH_CREATE_RETRIES retries per row) to handle the race between the pre-fetch and the INSERT. The new code drops per-row retry entirely: a single concurrent path claim between _fetch_occupied_paths_in_directory and bulk_create rolls back the entire batch. The docstring acknowledges this:

A concurrent transaction can claim a path between the occupied-paths
pre-fetch and the ``bulk_create``.  If that happens, the
``unique_active_path_per_corpus`` partial unique constraint raises
``IntegrityError``, which rolls back the entire batch.

For small, human-initiated moves this is probably fine. For large programmatic bulk-moves (e.g. import pipelines) under concurrent load this could cause repeated full-batch failures that are difficult to recover from. Consider whether this trade-off is acceptable for all callers, and document it clearly in the public method's docstring (the current note is buried in the implementation comment).

_dispatch_document_path_created_signals imports DocumentPath locally but the parameter type annotation also references it

# opencontractserver/corpuses/folder_service.py
def _dispatch_document_path_created_signals(
    paths: list[DocumentPath],   # ← uses top-level import or forward ref?
) -> None:
    from opencontractserver.documents.models import DocumentPath  # ← local import

If DocumentPath is not imported at the module level (due to the circular-import avoidance pattern already present in this file), the type annotation on the parameter will fail at class-body evaluation time in Python < 3.10 unless from __future__ import annotations is present. Verify this is covered by annotations futures or add a TYPE_CHECKING guard for the annotation.

bulk_create return value portability

DocumentPath.objects.bulk_create(new_path_rows) returns the list of created objects with PKs populated on PostgreSQL (the only supported backend here), but the code relies on _state.db being set on the returned objects for _dispatch_document_path_created_signals. This is safe on Postgres with Django 4.1+, but it's worth a one-line comment confirming the assumption (or asserting path._state.db is not None in the dispatch helper).

moved_count = len(created_paths) vs len(planned_paths)

If bulk_create raises, moved_count is never set (the exception propagates). That's fine. But if bulk_create silently skips rows (e.g. ignore_conflicts=True were ever added in the future), moved_count would be understated. The current code doesn't pass ignore_conflicts, so this is only a latent risk — worth noting.

In-memory sync after .update(is_current=False)

for current, _ in planned_paths:
    current.is_current = False

This is a nice touch for callers that hold references to the old DocumentPath instances. However, the affected_paths list in delete_folder is fetched as list(qs) — Django doesn't maintain a live reference, so callers outside this method won't see the update regardless. The sync is harmless but the comment ("Keep in-memory instances in sync … without a refetch") might be misleading. It only helps callers within this call stack that captured planned_paths entries.


Performance Observations

Regex filters don't use btree indexes — acknowledged but no mitigation

The PR comment is honest:

NOTE: These regex filters cannot use a btree index on ``path``
(PostgreSQL requires anchored patterns with no alternation for
index-only scans). ... corpuses with thousands of files at the same
directory level may benefit from a GIN/pg_trgm index…

For the target use-case (bulk move of N documents to a directory with M existing files), N is bounded but M could be large. A path__startswith=directory + Python-side slash-count filter would use the existing btree index on (corpus, is_current, path) if one exists. Consider documenting this as a follow-up issue rather than a code comment so it doesn't get lost.

_fetch_occupied_paths_in_directory loads all paths into Python memory

For directories with tens of thousands of files this could be significant. The current behaviour is bounded by corpus size and directory size, which are typically small — but a note in the method docstring with an estimated safe upper limit would help future contributors gauge when the approach needs revisiting.


Design / Architecture

extra_occupied parameter kept but semantics changed

_disambiguate_path still accepts extra_occupied but the docstring now says it is "used during retry loops" — yet _create_successor_path_with_retry (the only retry helper) has been removed. Nothing in the new code passes extra_occupied. If no caller uses it, remove it to avoid dead code (per the project's "No dead code" rule in CLAUDE.md). If it's intentionally kept as a hook for the single-document retry path in move_document_to_folder, document that explicitly.

TestBulkMoveIntegrityRecovery and TestDeleteFolderIntegrityRecovery removed without replacement

These test classes tested the per-row retry semantics of _create_successor_path_with_retry. Now that per-row retry is gone the tests are correctly removed. However, the new behaviour — a TOCTOU race causes a full-batch IntegrityError rollback — is tested only via TestCoverageGapBulkMoveIntegrityErrorRollback (mocked bulk_create). It would be worth adding at least one integration-style test that actually hits the partial unique constraint (via a concurrent thread or a pre-existing conflicting path) to verify the rollback semantics against a real database, not just a mock.

_target_directory_string_from_path raises on "/" input

stripped = folder_path.strip("/")
if not stripped:
    raise ValueError(...)

A caller passing "/" (slash-only string) gets a ValueError. The docstring says to pass None for root. This is fine, but if CorpusFolder.get_path() can ever return "/" for a root-level folder, that would silently surface here as an error rather than being caught early. Verify that get_path() never returns "/" for non-root folders, or add a guard in move_documents_to_folder before calling _target_directory_string_from_path.


Security

No new attack surface introduced. The select_for_update(of=("self",)) scoping improvement is a correctness win — previously the Document rows were being accidentally locked. The permission checks upstream of these service methods are unchanged.


Test Coverage

  • Signal dispatch tests (test_delete_folder_dispatches_post_save_for_each_created_path, test_bulk_move_dispatches_post_save_for_each_created_path) correctly verify the manual dispatch. Patching post_save at the module level is the right approach.
  • TestCoverageGap_TargetDirectoryStringFromPathEdgeCases is comprehensive.
  • The no-slash-path tests correctly flip from "returns unchanged" to "raises ValueError" to match the new strict validation.
  • Missing: a test for _fetch_occupied_paths_in_directory with exclude_pk to confirm that parameter is exercised (it's used in single-document moves via _disambiguate_path but not directly tested).

Minor / Nits

  • The @classmethod decorator was added to _disambiguate_path (needed because _fetch_occupied_paths_in_directory is now called via cls.), but _fetch_occupied_paths_in_directory and _dispatch_document_path_created_signals are @staticmethod. The asymmetry is a bit inconsistent — _disambiguate_path could stay @staticmethod if _fetch_occupied_paths_in_directory is called directly (no cls needed). Minor, but worth a quick check.
  • The _dispatch_document_path_created_signals docstring references documents.signals.connect_corpus_document_signals but the PR description mentions process_doc_on_document_path_create. Align the docstring with the actual receiver name.
  • Log message updated to include corpus_id in the rollback error — good improvement.

Summary

This is a well-motivated optimization with clear query reduction (O(N) → O(1) bulk queries). The main concerns to address before merge:

  1. Document the TOCTOU trade-off more prominently in the public method docstrings, not just implementation comments — callers need to know the retry semantics changed.
  2. Verify or remove extra_occupied — it appears to be dead code now.
  3. Check the DocumentPath type annotation in _dispatch_document_path_created_signals for a potential NameError at class evaluation time.

Everything else is minor. Happy to discuss any of these points further.

@claude
Copy link
Copy Markdown

claude bot commented Apr 15, 2026

Code Review: PR #1237 — Fix O(N) query regression in bulk folder operations

Overview

This PR replaces per-document save()/create() loops in move_documents_to_folder() and delete_folder() with two-query batch operations: one filter().update(is_current=False) and one bulk_create(). For a 100-document batch, this cuts ~300 DB round-trips down to roughly 4. The implementation is well-motivated and the helper extraction is clean. A few issues warrant attention before merging.


Issues

1. TOCTOU recovery regression (significant behavioral change)

The old _create_successor_path_with_retry() provided per-document savepoint recovery: a TOCTOU race on the partial unique index caused only one document's insertion to retry with a new path — the rest of the batch proceeded. This PR removes that behavior for bulk operations:

"If that happens, the unique_active_path_per_corpus partial unique constraint raises IntegrityError, which rolls back the entire batch."

The deleted test classes TestBulkMoveIntegrityRecovery and TestDeleteFolderIntegrityRecovery documented this now-gone behavior. For low-concurrency deployments this is probably fine, but for systems that run concurrent bulk moves (e.g. parallel Celery tasks moving files to the same folder), a single path race now silently fails 100 documents instead of retrying 1. The changelog entry for issue #1200 (which added retry logic in a prior PR) describes this as fixed; this PR effectively reverts that fix for the batch path.

At minimum this should be called out prominently in the PR description/changelog, and callers (Celery tasks, GraphQL mutations) need to be aware they must now retry the entire batch on IntegrityError.

2. extra_occupied parameter is effectively dead code

_disambiguate_path retains extra_occupied with the docstring note "used during retry loops to avoid within-batch collisions." But _create_successor_path_with_retry (the only caller that passed extra_occupied) is no longer invoked from the batch paths. Does it still exist? If it does, is it still used anywhere? If not, extra_occupied should be removed per CLAUDE.md's "no dead code" rule.

# folder_service.py — _disambiguate_path signature
def _disambiguate_path(
    base_path: str,
    corpus: Corpus,
    exclude_pk: int | None = None,
    occupied_override: set[str] | None = None,
    extra_occupied: set[str] | None = None,  # <- still needed?
) -> str:

3. Redundant nested import in _dispatch_document_path_created_signals

The nested from opencontractserver.documents.models import DocumentPath inside _dispatch_document_path_created_signals is described as avoiding a circular dependency, but folder_service.py already imports DocumentPath at the module level (it uses it throughout the file). This nested import is harmless but misleading — the comment suggests a circular dependency that either doesn't exist or is already handled by the top-level import. If the circular dependency is real, add a brief explanation of the import cycle. If it isn't, remove the nested import.

4. Stale occupancy window for very large batches

The pre-fetched occupied_paths set is valid at the moment of the query but can become stale if another transaction commits new paths before bulk_create. This is the documented TOCTOU window and is acceptable. However, for large batches the planning phase itself (the for current in paths_to_move loop) can take non-trivial time holding the select_for_update row locks. Consider documenting the expected batch-size ceiling (e.g. hundreds, not tens of thousands) to set appropriate caller expectations.

5. of=("self",) on select_for_update — PostgreSQL-only concern documented

The addition of of=("self",) is correct and beneficial (scopes lock to DocumentPath rows). Good change, and it's already PostgreSQL-only per pgvector requirements. No issue, just confirming this is intentional.


Minor / Style Notes

  • Test class naming inconsistency: New test classes introduced in this PR use a mixed naming convention — TestCoverageGap_DeleteFolderIntegrityErrorRollback and TestCoverageGap_TargetDirectoryStringFromPathEdgeCases use an underscore separator, while pre-existing classes like TestCoverageGapBulkMoveIntegrityErrorRollback do not. Pick one style and be consistent.

  • _target_directory_string_from_path raises on "/" input: The ValueError for a slash-only input is correct (None should be used for root), but the error message could be clearer about why "/" is invalid as a folder path string (vs. None for root). A one-liner in the docstring would help.

  • moved_count = len(created_paths) is correct, but worth noting: Django's bulk_create returns the list of created instances on PostgreSQL (which supports RETURNING). This is reliable here, but for portability awareness, the behavior differs on databases without RETURNING support (not relevant for this project, just a note).

  • Changelog / TOCTOU entry: The changelog still references TestMoveDocumentIntegrityRecovery, TestBulkMoveIntegrityRecovery, and TestDeleteFolderIntegrityRecovery as added tests under the DocumentPath: TOCTOU race on path uniqueness needs partial unique index #1200 fix, but those classes are now deleted by this PR. The changelog should be updated to reflect the removal and explain why (the batch approach doesn't support per-document retry).


What's done well

  • The two-query batch pattern (filter().update() + bulk_create()) is the right approach and the in-memory occupancy set correctly handles within-batch filename conflicts.
  • Manual post_save dispatch via _dispatch_document_path_created_signals is a necessary and well-commented workaround for bulk_create's signal bypass.
  • _fetch_occupied_paths_in_directory correctly uses regex anchoring to avoid loading subtrees, and the root-level special case (^/[^/]+$) is handled correctly.
  • select_related("document") addition eliminates N+1 access to current.document inside the build loop — good catch.
  • The regex injection risk in rf"^{re.escape(directory)}[^/]+$" is handled correctly with re.escape.
  • Tests for the new signal dispatch behavior and rollback semantics are well-structured.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

DocumentPath: O(N) queries in bulk move/delete operations

2 participants