Skip to content

Implement server-side pagination for extract grid embeds (Issue #1204)#1235

Merged
JSv4 merged 25 commits intomainfrom
claude/resolve-issue-1204-aIifN
Apr 15, 2026
Merged

Implement server-side pagination for extract grid embeds (Issue #1204)#1235
JSv4 merged 25 commits intomainfrom
claude/resolve-issue-1204-aIifN

Conversation

@JSv4
Copy link
Copy Markdown
Collaborator

@JSv4 JSv4 commented Apr 12, 2026

Summary

Resolves Issue #1204 by implementing server-side pagination for the ExtractGridEmbed component. The fullDatacellList GraphQL field now accepts optional limit and offset arguments to bound the payload size, and a new datacellCount field reports the total visible cells so the UI can display "showing N of M" indicators when truncated.

Key Changes

  • GraphQL Schema & Backend

    • Added limit and offset arguments to ExtractType.fullDatacellList with deterministic ordering (document_id, column_id, id) to ensure stable pagination
    • Added datacellCount field to ExtractType to report total visible datacells regardless of pagination bounds
    • Implemented resolve_full_datacell_list() and resolve_datacell_count() resolvers with proper slicing logic
  • Frontend Component

    • Updated GET_EXTRACT_GRID_EMBED query to pass limit: EXTRACT_GRID_EMBED_CELL_LIMIT (500 cells) to bound server payload
    • Replaced the "too many rows" error guard with a graceful partial-data footer banner
    • Added OverflowFooter styled component that displays "showing N of M" messages for both:
      • Server-side truncation (cells fetched < total cells)
      • Client-side display clipping (rows rendered < total rows from fetched cells)
    • Refactored row truncation logic to clip to EXTRACT_GRID_EMBED_MAX_ROWS (200) as a pure display bound while keeping the fetched payload bounded server-side
  • Constants & Configuration

    • Added EXTRACT_GRID_EMBED_CELL_LIMIT = 500 constant for server-side payload cap
    • Updated EXTRACT_GRID_EMBED_MAX_ROWS documentation to clarify it's now a display-only bound
  • Tests & Mocks

    • Added comprehensive backend test (test_full_datacell_list_limit_offset_and_count) covering pagination, offset, and count behavior
    • Updated all GraphQL mocks to use shared MOCK_VARS with explicit limit parameter
    • Added partialExtractMock to test server-side truncation scenario (800 total cells, 4 returned)
    • Updated component tests to verify "showing N of M" banner instead of error message
    • Added new test case for partial payload rendering

Implementation Details

  • The server applies deterministic ordering to ensure pagination stability across requests
  • datacellCount is fetched separately and always reports the full total, enabling accurate "N of M" indicators
  • The component now handles two independent truncation scenarios:
    1. Server-side: When fetchedCellCount < totalCellCount (payload bounded by limit)
    2. Client-side: When rows.length > EXTRACT_GRID_EMBED_MAX_ROWS (display clipping for safety)
  • The footer banner intelligently combines both messages when both truncations apply
  • Backward compatible: omitting limit/offset returns all visible cells as before

https://claude.ai/code/session_018GC4vMGMVWCvLHiJMYVxRf

Resolves #1204 — the GET_EXTRACT_GRID_EMBED query previously fetched the
unbounded full_datacell_list, which for extracts with hundreds of
documents and columns could return tens of thousands of cells.

Backend:
- ExtractType.fullDatacellList now accepts optional limit/offset args.
- Resolver applies a deterministic order_by(document_id, column_id, id)
  before slicing so limit/offset windows are stable across requests.
- New ExtractType.datacellCount field reports total visible cells
  ignoring pagination, so clients can render "showing N of M" indicators.
- Schema.graphql updated to match.

Frontend:
- GET_EXTRACT_GRID_EMBED query passes limit: EXTRACT_GRID_EMBED_CELL_LIMIT
  (500) and fetches datacellCount alongside fullDatacellList.
- ExtractGridEmbed renders a partial-data footer banner ("Showing N of M
  cells" / "Showing X of Y documents") instead of the previous hard
  "exceeds limit" error screen when the payload is bounded.
- New EXTRACT_GRID_EMBED_CELL_LIMIT constant; existing
  EXTRACT_GRID_EMBED_MAX_ROWS repurposed as a pure display bound.

Tests:
- test_extract_queries.test_full_datacell_list_limit_offset_and_count
  verifies limit/offset pagination, datacellCount, disjoint pages, and
  partial final-page behavior.
- ExtractGridEmbedTestWrapper gets a new "partial" state mock; the
  "too-many-rows" state now exercises the partial-data banner instead of
  the removed error screen.
- Two new component tests cover the partial-data rendering paths.
@claude
Copy link
Copy Markdown

claude bot commented Apr 12, 2026

Code Review: Server-side pagination for extract grid embeds (#1204)

Overall this is a well-structured, well-documented change that cleanly resolves #1204. The approach (bounded fullDatacellList + separate datacellCount field + graceful footer banner) is the right design. A few issues worth addressing before merge, ranging from a potential server error to UX inaccuracies.


Bug: Negative offset will crash the Django queryset

config/graphql/extract_types.py, resolve_full_datacell_list

start = offset or 0

In Python, -1 or 0 evaluates to -1 (since -1 is truthy). Django ORM does not support negative indexing on querysets and raises AssertionError: Negative indexing is not supported. A user passing offset: -1 would get a 500 error instead of a validation message.

Replace with:

start = max(0, offset) if offset is not None else 0

Security/DoS: No upper bound on the limit argument

config/graphql/extract_types.py

The resolver guards against limit < 0 but places no ceiling on limit. Any authenticated caller can pass limit: 2147483647 and bypass the intended 500-cell cap entirely. The frontend defaults to 500, but the GraphQL API is directly callable.

Suggest adding a server-enforced maximum. Something like:

MAX_FULL_DATACELL_LIST_LIMIT = 500  # or a constant from opencontractserver/constants/

if limit is not None:
    limit = max(0, min(limit, MAX_FULL_DATACELL_LIST_LIMIT))

UX Inaccuracy: Banner document count is misleading when both truncations apply

frontend/src/components/extracts/ExtractGridEmbed.tsx, banner logic

When rowsTruncated && cellsTruncated are both true, the message reads:

Showing 200 of 201 documents (4 of 800 cells fetched). View the full extract in the Extracts panel.

The 201 here is rows.length — the number of documents in the fetched slice (bounded by EXTRACT_GRID_EMBED_CELL_LIMIT), not the true total. If there are 800 total cells across 400 total documents, the message implies the full extract has only 201 documents, which is incorrect.

A more accurate message would note that the 201 document count is from the fetched slice only, e.g.:

Showing 200 of 201 fetched documents (4 of 800 total cells). View the full extract in the Extracts panel.

Or simplify to just cite cells when server truncation applies:

Showing 200 documents from a bounded payload of 4 of 800 total cells. View the full extract in the Extracts panel.


Minor: Two DB queries per request for fullDatacellList + datacellCount

config/graphql/extract_types.py

resolve_full_datacell_list and resolve_datacell_count each invoke ExtractQueryOptimizer.get_extract_datacells() independently, so every request that fetches both fields hits the DB twice for the queryset setup. This is low risk at current scale and is somewhat inherent to separate GraphQL resolvers, but it's worth noting for future optimization if this becomes a hot path.


Nit: EXTRACT_GRID_EMBED_CELL_LIMIT is the right constant to use as the server max

If you do add a server-side max cap (point 2 above), consider whether to define it in opencontractserver/constants/ and reference it in the frontend constant file comments, or keep them separate. The constant names suggest they're aligned (EXTRACT_GRID_EMBED_CELL_LIMIT = 500 frontend, and any server-side cap should match), so documenting that relationship would prevent them from drifting apart.


What's done well

  • MOCK_VARS shared constant is a clean DRY improvement for test mocks — prevents silent variable-mismatch failures.
  • Deterministic order_by("document_id", "column_id", "id") ensures stable pagination — this is easy to forget and good to see explicitly documented.
  • The partialExtractMock covers the critical new scenario (fetched < total).
  • Backward-compatible: omitting limit/offset returns all cells as before.
  • Backend test covers unbounded, page 1, page 2, and boundary (partial last page) — solid coverage.
  • Schema documentation on the new fields is clear and precise.

@codecov
Copy link
Copy Markdown

codecov bot commented Apr 12, 2026

Codecov Report

❌ Patch coverage is 52.94118% with 32 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
...ntend/src/components/extracts/ExtractGridEmbed.tsx 31.91% 30 Missing and 2 partials ⚠️

📢 Thoughts on this report? Let us know!

… improve banner UX

- Clamp negative offset to 0 in resolve_full_datacell_list to prevent
  Django AssertionError on negative queryset indexing
- Add MAX_FULL_DATACELL_LIST_LIMIT (500) server-side cap to prevent
  DoS via arbitrarily large limit values in GraphQL API
- Fix misleading banner message when both row and cell truncation apply:
  now distinguishes fetched document count from total
- Add backend tests for negative offset and limit cap scenarios
- Add ComponentEmbedErrorFallback component test with docScreenshot
- Cross-reference frontend/backend constants for maintainability
from opencontractserver.constants.corpus_actions import * # noqa: F401, F403
from opencontractserver.constants.discovery import * # noqa: F401, F403
from opencontractserver.constants.document_processing import * # noqa: F401, F403
from opencontractserver.constants.extracts import * # noqa: F401, F403
@claude
Copy link
Copy Markdown

claude bot commented Apr 12, 2026

Code Review

Good work overall — the core approach is solid. The resolver logic is correct, the UI degrades gracefully instead of hard-erroring, and the test coverage is thoughtful. A few items worth addressing before merge:


Medium: Two separate DB queries per extract when both fields are fetched

resolve_full_datacell_list and resolve_datacell_count each call ExtractQueryOptimizer.get_extract_datacells(...) independently. When the frontend fetches both in the same GraphQL request (which it always does), Django evaluates two separate querysets. This is fine today since GET_EXTRACT_GRID_EMBED resolves a single extract, but if datacellCount ever appears on a list query it will N+1. Consider making a note (or a # TODO) that the count could be folded into the pagination queryset via len(qs) after slicing, or cached on the ExtractType instance.


Medium: Combined truncation message is untested and grammatically awkward

config/graphql/extract_types.py correctly surfaces two independent truncation paths — server-side cell limit and client-side row clip. But the branch where both fire simultaneously (cellsTruncated && rowsTruncated) produces:

"Showing X documents from a bounded payload of Y of Z total cells."

That phrasing is hard to parse (mixes "documents" and "cells" in one sentence). More importantly, this branch has no test. The existing mocks only cover the two single-truncation cases. Suggest either adding a test fixture that triggers both conditions, or simplifying to two separate footer lines.


Minor: GetExtractGridEmbedOutput.datacellCount typed as required number vs. nullable schema field

schema.graphql declares datacellCount: Int (nullable). The TypeScript interface in queries.ts types it as datacellCount: number (non-optional), while graphql-api.ts correctly has it as datacellCount?: number. The component handles null via ?? extract.fullDatacellList.length, so there's no runtime issue, but the inconsistency between the two type files is a latent confusion. The queries.ts interface should match: datacellCount?: number | null.


Minor: test_full_datacell_list_limit_capped_at_server_max doesn't actually verify capping

The test fixture has 1 cell. Passing limit = MAX_FULL_DATACELL_LIST_LIMIT + 9999 and getting 1 result back doesn't confirm the cap was applied — you'd get 1 result with any limit ≥ 1. The test as written only verifies "no 500 error", which is useful, but the comment promises more. Consider either: (a) acknowledging the limitation in the docstring, or (b) creating MAX_FULL_DATACELL_LIST_LIMIT + 1 cells and asserting len(result) == MAX_FULL_DATACELL_LIST_LIMIT.


Observation: Manual sync required between backend and frontend constants

opencontractserver/constants/extracts.py:MAX_FULL_DATACELL_LIST_LIMIT and frontend/src/assets/configurations/constants.ts:EXTRACT_GRID_EMBED_CELL_LIMIT must stay equal (500 each). Both files carry a comment to that effect, which is good. However, there's no programmatic enforcement. A future PR could add a trivial backend test:

def test_backend_cell_limit_matches_frontend_constant(self):
    # Guards against accidental drift between the two mirrored constants.
    from opencontractserver.constants.extracts import MAX_FULL_DATACELL_LIST_LIMIT
    self.assertEqual(MAX_FULL_DATACELL_LIST_LIMIT, 500)  # Must match frontend EXTRACT_GRID_EMBED_CELL_LIMIT

…or document the sync requirement in a single canonical place (e.g. the extracts.py module docstring already does a decent job).


Nit: offset in schema with no frontend usage

The offset arg is wired through the schema and resolver but the frontend query only passes limit. That's fine — future pagination support is a reasonable thing to pre-wire. Worth adding a short comment in GET_EXTRACT_GRID_EMBED noting offset isn't used yet:

# offset not yet passed — reserved for client-driven pagination
fullDatacellList(limit: $limit)

Nit: Backend test query construction uses string interpolation

All three new backend tests embed the extract_id and limit values via %s / .format() string formatting rather than query variables. For test-only code with programmatically-generated IDs this is safe, but it diverges from how the production frontend sends variables. Using variables={"extractId": extract_id, "limit": 2} in self.client.execute(query, variables=...) would be more idiomatic and slightly closer to real usage.


Overall verdict: The feature is correct and well-structured. The medium items above are the ones worth fixing before merge; the rest can be addressed as follow-ups if preferred.

JSv4 added 5 commits April 12, 2026 20:05
The fullDatacellList and datacellCount tests were failing because the
test setUp never associated the document with the extract's documents
M2M or granted the test user read permissions on the document. The
permission-aware datacell resolver (ExtractQueryOptimizer) returned
empty results for non-superuser users without document permissions.
- Fix datacellCount type in GetExtractGridEmbedOutput to match nullable
  GraphQL schema (number -> number | null | undefined)
- Clarify test_full_datacell_list_limit_capped_at_server_max docstring
  to accurately describe what it tests vs. what it cannot verify
- Add TODO noting the duplicate DB query when both fullDatacellList and
  datacellCount are fetched in the same GraphQL request
Adds a "both-truncated" test state and corresponding Playwright test
to cover the code path where both server-side cell truncation and
client-side row clipping fire simultaneously. The previous test suite
only covered each truncation independently.
New docScreenshot for the both-truncated extract grid embed state.
Updated loading screenshot from local test run.
@claude
Copy link
Copy Markdown

claude bot commented Apr 13, 2026

Code Review: Server-side Pagination for Extract Grid Embeds (#1204)

This is a well-scoped fix with good test coverage. The core approach is sound. A few items worth addressing before merge, roughly ordered by severity.


Issues

1. Double DB query on every request (performance)

The TODO in resolve_datacell_count is the most important issue to resolve — not just leave as a TODO. Since the frontend always fetches both fullDatacellList and datacellCount in the same query, every single embed load hits the database twice for the same permission-filtered queryset. For extracts with large corpora, this matters.

A straightforward fix is to cache the base queryset on the ExtractType instance during resolution:

# config/graphql/extract_types.py

def _get_datacell_qs(self, user):
    """Returns the permission-filtered, deterministically ordered queryset.
    Cached on the instance so both resolvers share a single queryset build."""
    if not hasattr(self, "_datacell_qs"):
        from opencontractserver.annotations.query_optimizer import ExtractQueryOptimizer
        self._datacell_qs = (
            ExtractQueryOptimizer.get_extract_datacells(self, user, document_id=None)
            .order_by("document_id", "column_id", "id")
        )
    return self._datacell_qs

def resolve_full_datacell_list(self, info, limit=None, offset=None):
    from opencontractserver.constants.extracts import MAX_FULL_DATACELL_LIST_LIMIT
    qs = self._get_datacell_qs(info.context.user)
    start = max(0, offset) if offset is not None else 0
    if limit is not None:
        limit = max(0, min(limit, MAX_FULL_DATACELL_LIST_LIMIT))
        return qs[start : start + limit]
    return qs[start:] if start else qs

def resolve_datacell_count(self, info) -> int:
    return self._get_datacell_qs(info.context.user).count()

This makes one queryset build per ExtractType instance within a request, while still issuing two SQL statements (SELECT ... LIMIT N and SELECT COUNT(*)). If you want a single SQL round-trip you could use qs.count() plus slice from a Window annotation, but the instance-level cache is the minimal fix and removes the comment.

2. Confusing "both truncated" banner copy (UX)

The combined message mixes units mid-sentence in a way that's hard to parse:

"Showing 200 documents from a bounded payload of 500 of 1000 total cells. View the full extract in the Extracts panel."

"500 of 1000 total cells" buried in the middle of a sentence about documents is jarring. Consider separating the two facts:

`Showing ${visibleRows.length} of ${rows.length} documents (payload bounded to ${fetchedCellCount} of ${totalCellCount} total cells). View the full extract in the Extracts panel.`

Or split into two lines inside the footer so each truncation is clearly labeled. The single-truncation messages are fine as-is.

3. fullDatacellList accessed without null guard in component

In ExtractGridEmbed.tsx:

const fetchedCellCount = extract.fullDatacellList.length;

fullDatacellList is typed as DatacellType[] | undefined in GetExtractGridEmbedOutput.extract. If the field is ever absent (e.g., a backend error returns a partial response), this throws at runtime. Guard it:

const fetchedCellCount = extract.fullDatacellList?.length ?? 0;

Minor / Nit

4. Missing test for offset-only (no limit)

The resolver has a specific code path for if start: return qs[start:] — when offset is provided but limit is not. There's no test that exercises this path. test_full_datacell_list_limit_offset_and_count only tests offset with limit. Adding one case (e.g., offset: 3, no limit, expect all 5 − 3 = 2 cells) would give that branch coverage.

5. EXTRACT_GRID_EMBED_CELL_LIMIT and MAX_FULL_DATACELL_LIST_LIMIT are separate constants that must stay in sync

The comment in constants.ts says "Must match the backend constant." This is a valid concern: they're both 500 today but can drift. Since the backend clamps the limit anyway the behavior stays correct (frontend gets fewer cells than it requested), but the N of M indicator would silently undercount. Worth adding an assertion or at least a test that the backend rejects a limit higher than MAX_FULL_DATACELL_LIST_LIMIT (which test_full_datacell_list_limit_capped_at_server_max already does — so you're covered. Just keep both constants in sync when either changes).

6. Inline imports in resolvers

resolve_full_datacell_list and resolve_datacell_count import ExtractQueryOptimizer and MAX_FULL_DATACELL_LIST_LIMIT inside the function body. Per project convention, imports belong at module level. The original code had inline imports too, so this is pre-existing style, but the new code doubles them. If these are inline to avoid circular imports, a comment saying so would help.


What's good

  • The deterministic order_by("document_id", "column_id", "id") is correct and well-commented.
  • Negative-offset and limit-capping guards are handled defensively with tests for each.
  • MOCK_VARS shared constant in the test wrapper is exactly the DRY approach this codebase calls for.
  • resolve_full_datacell_list returning a queryset slice (not a list) means Django can still apply LIMIT/OFFSET at the SQL level rather than fetching all rows into Python.
  • The backward-compatibility guarantee (omitting limit/offset = full list) is preserved.
  • Changelog entry is detailed and accurately references changed files.

Summary: Items 1 (double DB query) and 2 (banner copy) are worth fixing before merge. Items 3–6 are lower priority but 3 is a potential runtime crash in degenerate cases and is a one-liner fix.

Add backend test for offset-only (no limit) code path in
resolve_full_datacell_list, covering the `if start: return qs[start:]`
branch. Export formatCellValue helper and add Vitest unit tests for all
value formatting branches (null, boolean, number, string, object
truncation).
@claude
Copy link
Copy Markdown

claude bot commented Apr 13, 2026

PR Review: Server-side pagination for extract grid embeds

Overall this is a well-structured implementation. The architecture is sound, tests are solid, and it cleanly resolves the unbounded-payload problem. A few things worth addressing before merge.

@claude
Copy link
Copy Markdown

claude bot commented Apr 13, 2026

Issues and Suggestions for PR 1235:

  1. Double DB query on every embed render (performance): resolve_full_datacell_list and resolve_datacell_count each independently build the same permission-filtered queryset, meaning every GET_EXTRACT_GRID_EMBED call incurs two separate DB round-trips. The quickest mitigation is caching the base queryset on the ExtractType instance during the request cycle so both resolvers share one DB trip.

  2. Both-truncated banner message is hard to parse: The combined message 'Showing N documents from a bounded payload of X of Y total cells' is awkward. Clearer: 'Showing N of M documents (payload bounded to X of Y total cells). View the full extract in the Extracts panel.'

  3. formatCellValue belongs in a utility file: Per CLAUDE.md, shared helpers go in frontend/src/utils/ (e.g. formatters.ts). Moving it to the utils directory makes it reusable and aligns with the project convention.

  4. Frontend never uses offset -- asymmetric API surface: GetExtractGridEmbedInput has limit but no offset. A brief comment in queries.ts explaining the intentional omission would prevent future confusion.

  5. Constants-sync comment overstates the risk: datacellCount is computed independently of limit, so diverging constants leave the 'showing N of M' math accurate. The only real risk is a misleading UX. The comment could be softened.

  6. Minor: no test for limit=0. Tests cover negative-offset clamping and over-large-limit capping, but limit=0 is unexercised.

  7. Minor: GQL test queries use string formatting -- a variables dict is cleaner and avoids potential escaping issues.

What is done well:

  • Deterministic ordering (document_id, column_id, id) before slicing ensures stable pagination.
  • Negative-offset guard (max(0, offset)) prevents AssertionError from Django queryset layer.
  • MAX_FULL_DATACELL_LIST_LIMIT cap prevents callers from bypassing the payload bound via the API.
  • MOCK_VARS shared constant in the test wrapper eliminates the class of Apollo variable mismatch bugs.
  • Backward compatible -- omitting limit/offset returns all cells as before.
  • Test coverage is comprehensive: negative offset, over-large limit, offset-only, pagination windows, disjoint-page assertion, and both truncation paths.
  • Unit tests for formatCellValue are a welcome addition.
  • CHANGELOG entry is clear and detailed.

Items 1-4 are worth addressing before merge; 5-7 are optional improvements.

…s, improve banner copy

- Cache permission-filtered datacell queryset on ExtractType instance so
  resolve_full_datacell_list and resolve_datacell_count share one queryset
  build per request instead of duplicating the DB round-trip.
- Move formatCellValue from inline definition in ExtractGridEmbed.tsx to
  frontend/src/utils/formatters.ts per project convention for shared helpers.
- Clarify the combined truncation banner message to read "Showing N of M
  documents (payload bounded to X of Y total cells)" instead of the harder
  to parse previous phrasing.
- Add null guard on fullDatacellList access for defensive runtime safety.
- Add comment in GraphQL query noting offset is accepted but not yet used.
@claude
Copy link
Copy Markdown

claude bot commented Apr 13, 2026

PR Review: Server-side pagination for extract grid embeds (#1204)

Overall this is a well-structured, clean implementation. The change correctly resolves the unbounded-payload issue, maintains backward compatibility, and is accompanied by solid test coverage. A few things worth addressing before merge.


Issues

1. Instance-level queryset cache ignores the user argument (config/graphql/extract_types.py)

def _get_datacell_qs(self, user):
    if not hasattr(self, "_datacell_qs"):
        ...
        self._datacell_qs = ExtractQueryOptimizer.get_extract_datacells(
            self, user, document_id=None
        ).order_by(...)
    return self._datacell_qs

The cache key is just the instance; the user argument is not part of it. In current Graphene, a fresh ExtractType instance is created per request, so datacellCount and fullDatacellList in the same query always share the same user — safe today.

However, if this pattern is ever copied by someone who does apply object-level instance caching (e.g. via DataLoader), the first user's queryset will silently be returned for a different user, leaking data. A minimal safeguard would be to add a comment explicitly noting the assumption, or to cache with the user as part of the key:

cache_key = f"_datacell_qs_{user.pk}"
if not hasattr(self, cache_key):
    ...
    setattr(self, cache_key, qs)
return getattr(self, cache_key)

This is low-risk today but worth hardening given the security-sensitive context.


2. "Showing N of M documents" M-value is misleading when both bounds fire (ExtractGridEmbed.tsx)

In the rowsTruncated && cellsTruncated branch:

`Showing ${visibleRows.length} of ${rows.length} documents (payload bounded to ${fetchedCellCount} of ${totalCellCount} total cells). View the full extract...`

rows.length here is derived from the already server-truncated fullDatacellList. If the extract has 1 000 documents but the server only returned 500 cells' worth (~250 rows at 2 columns each), the banner reads "Showing 200 of 250 documents" — making the extract appear far smaller than it is.

The test mock confirms this: bothTruncatedExtractMock has datacellCount: 1000 but fullDatacellList only contains 201 items, so rows.length is 201, not the true document count.

Since there is no documentCount field on the server today, an exact fix would require a new field. A minimal improvement for this PR would be to rephrase the combined-truncation message so it doesn't claim to know the total document count when cells were also truncated:

Showing 200 of 201 fetched documents (full extract has more — payload was bounded to 201 of 1000 total cells). View the full extract...

3. test_full_datacell_list_limit_capped_at_server_max doesn't verify capping (test_extract_queries.py)

The test acknowledges this in its docstring, but the consequence is that the most important behaviour — that limit: 100000 actually returns ≤ 500 rows — is never exercised. Consider creating MAX_FULL_DATACELL_LIST_LIMIT + 1 cells in the test and asserting that the returned count equals MAX_FULL_DATACELL_LIST_LIMIT. Given the test already loops to create cells in nearby tests, this should be straightforward.


Minor nits

GetExtractGridEmbedOutput type inconsistency (frontend/src/graphql/queries.ts vs frontend/src/types/graphql-api.ts):
GetExtractGridEmbedOutput types datacellCount as number | null while ExtractType types it as number. Both are fine individually but slightly inconsistent. number | null is the correct shape for a nullable GraphQL Int.

Coupled constants with no automated enforcement:
MAX_FULL_DATACELL_LIST_LIMIT = 500 (Python) and EXTRACT_GRID_EMBED_CELL_LIMIT = 500 (TypeScript) must stay in sync. Comments in both files note this, which is the right low-overhead approach. Worth also adding a # sync-required or similar label in both so a grep can surface all such pairs.


What's good

  • DRY extraction of formatCellValue into formatters.ts with unit tests — exactly the right call, and the extraction path follows the CLAUDE.md utility-file guidance.
  • Backward compatibility preserved — omitting limit/offset still returns all visible cells.
  • Deterministic ordering (document_id, column_id, id) ensures stable pagination windows; this is easy to get wrong and was done correctly.
  • Negative-offset and over-limit clamping with test coverage for both edge cases.
  • MOCK_VARS shared constant in the test wrapper correctly resolves the strict-variable-matching pitfall called out in CLAUDE.md.
  • Pagination test covers disjoint pages — the isdisjoint assertion is exactly the right thing to verify.
  • OverflowFooter gracefully degrades rather than blocking access to whatever data is available — better UX than the previous hard error screen.

JSv4 added 2 commits April 12, 2026 23:40
- Key queryset cache by user.pk to prevent cross-user data leaks if the
  pattern is ever reused with shared instances (e.g. DataLoader).
- Clarify combined-truncation banner: "fetched documents" and "total cells
  loaded" avoids implying rows.length is the true document count.
- Convert all pagination test queries to use GraphQL variables dict instead
  of string interpolation, matching production usage patterns.
- Improve cap test: create MAX_FULL_DATACELL_LIST_LIMIT+1 cells and verify
  the returned count is exactly the cap, not just "no 500 error".
- Add test for limit=0 edge case (returns empty list, count still accurate).
@claude
Copy link
Copy Markdown

claude bot commented Apr 13, 2026

PR Review: Server-side pagination for extract grid embeds (#1235)

Overall this is a solid, well-scoped implementation. The backend resolver design is clean, the test coverage is excellent, and the UX improvement (graceful banner vs. hard error) is the right call. A few things worth addressing before merge:


Potential Issues

1. UX: combined-truncation banner is hard to parse

In ExtractGridEmbed.tsx (around the OverflowFooter section), the message when both paths fire is:

Showing ${visibleRows.length} of ${rows.length} fetched documents
(${fetchedCellCount} of ${totalCellCount} total cells loaded).

The phrase "fetched documents" is confusing — rows.length is the number of rows from the already-bounded server payload, not a document count that means anything to users. More importantly, mixing document-row counts with cell counts in a single sentence makes users do mental math to understand their actual exposure. Consider two separate sentences or a simpler fallback:

Showing ${visibleRows.length} of ${totalRowCount} documents.
${fetchedCellCount} of ${totalCellCount} total cells loaded.

where totalRowCount is derived from totalCellCount / columns.length (or just leave it as the existing individual messages). The current wording in tooManyRows-only and cellsTruncated-only branches is clear; only the combined branch is awkward.

2. No enforcement of constant parity

MAX_FULL_DATACELL_LIST_LIMIT = 500 (backend opencontractserver/constants/extracts.py) and EXTRACT_GRID_EMBED_CELL_LIMIT = 500 (frontend constants.ts) are documented to match, but there's no test that enforces this. If one is bumped without the other, the server silently returns fewer cells than the client expects, and the "showing N of M" indicator will report N = M even when it isn't. A backend integration test that queries fullDatacellList(limit: FRONTEND_CONSTANT_VALUE) and asserts the cap is unnecessary complexity, but at minimum the comment in each file should reference the exact file path of the counterpart (the backend comment references the frontend path, but the frontend comment only says "Must match the backend constant" without the full path — add it so a grep lands in one hop).

3. Instance-level queryset cache (_get_datacell_qs) — docs vs. reality

The cache comment says it protects against a DataLoader scenario where one ExtractType instance serves multiple users. That's accurate forward-thinking, but it's worth noting the current Graphene implementation creates a fresh instance per resolved object, so the cache is only beneficial for the single-request two-field case (count + list). The comment is fine as-is, but if someone reading it wonders "why bother caching on self?", the in-request benefit (avoiding a duplicate permission-filtered queryset build when both fullDatacellList and datacellCount are queried) should be made explicit.

4. offset-only path needs a docstring

if start:
    return qs[start:]

This branch (offset without limit) is tested by test_full_datacell_list_offset_only_without_limit, which is great, but the resolver itself has no comment explaining why the three-way branch exists (limit+offset, offset-only, unbounded). A one-liner above each branch would help future readers avoid accidentally collapsing them.


Observations / Minor Nits

  • formatCellValue extraction is the right call — moving it to formatters.ts and adding unit tests is a textbook DRY cleanup. The new formatCellValue.test.ts covers all branches cleanly.

  • MOCK_VARS shared across all test mocks — good pattern; previously each mock hardcoded variables independently. This eliminates the class of test failure where the query variables change but one mock is missed.

  • both-truncated mock correctness — the mock has datacellCount: 1000 and 201 cells in fullDatacellList, but the test asserts 201 of 1000 total cells loaded. That's correct because fetchedCellCount = fullDatacellList.length = 201 and totalCellCount = datacellCount = 1000. Just worth confirming the mock intent is documented (it is, in the JSDoc above the mock — good).

  • test_full_datacell_list_limit_capped_at_server_max bulk_create — creating 500 records is heavy for a unit test. Consider reducing to MAX_FULL_DATACELL_LIST_LIMIT // 10 (50 records) and adjusting the cap to a test-local value via mock.patch. The invariant being tested is the clamping logic, not the absolute cap number. That said, if the test suite already runs in parallel and the DB is fast, this is acceptable.

  • Apollo cache keyArgs — if GET_EXTRACT_GRID_EMBED is ever called with different limit values from the same component instance (e.g., during a future "load more" feature), Apollo will serve stale cached results because limit isn't in keyArgs. This PR only calls it with a fixed constant, so it's not a bug today, but worth a comment in cache.ts if/when the query is registered there.

  • schema.graphql update included — correct practice; good.


Summary

Area Status
Backend resolver logic ✅ Clean, well-guarded
Permission filtering shared via _get_datacell_qs ✅ Correct
Backward compatibility (optional args)
Server-side cap enforcement ✅ Tested
Frontend graceful degradation ✅ Better than hard error
Combined-truncation banner UX ⚠️ Wording could be clearer
Constant parity enforcement ⚠️ No test/tooling
Test coverage (backend) ✅ Excellent (5 test methods)
Test coverage (frontend) ✅ Good (4 states covered)
DRY / utility extraction formatCellValue moved correctly

Blocking on the combined-banner wording (#1 above) is a judgment call — it works correctly, it just reads oddly. Everything else is advisory. Happy to approve once the authors have had a chance to look at these.

JSv4 added 2 commits April 13, 2026 07:28
Move _get_datacell_qs from ExtractType method to module-level function.
Graphene-Django resolvers receive the Django model instance as self, not
the GraphQL type, so methods defined on ExtractType are not available at
resolve time. This caused all 6 CI test failures.

Also addresses review comments:
- Simplify combined truncation banner to two separate sentences
- Add full path reference (::) for backend constant parity comment
- Add one-liner comments above each resolver branch
@claude
Copy link
Copy Markdown

claude bot commented Apr 13, 2026

PR Review: Server-side pagination for extract grid embeds (#1204)

Overall this is a well-structured solution to a real payload-size problem. The three-layered approach (server cap, display clip, and informational banner) is user-friendly and the edge-case tests are thorough. A few issues worth addressing before merge:


Bug: "Both Truncated" test expectations don't match component output

The component (ExtractGridEmbed.tsx:451) renders:

`Showing ${visibleRows.length} of ${totalRowCount} documents. ` +
`${fetchedCellCount} of ${totalCellCount} total cells loaded. ` +
"View the full extract in the Extracts panel."

But ExtractGridEmbed.ct.tsx asserts:

await expect(page.getByText(/Showing 200 of 201 fetched documents/)).toBeVisible();

The word "fetched" doesn't appear in the banner text, so this regex will never match. The test will fail (or is masked by a timeout). The assertion should be:

await expect(page.getByText(/Showing 200 of 201 documents/)).toBeVisible();

Design gap: offset-only calls bypass the server cap

The resolver has three branches:

if limit is not None:
    limit = max(0, min(limit, MAX_FULL_DATACELL_LIST_LIMIT))
    return qs[start : start + limit]
if start:
    return qs[start:]   # ← no cap applied here
return qs               # ← backward-compat unbounded

A caller can send fullDatacellList(offset: 1) without a limit and receive the entire remaining queryset with no cap. The docstring says "Use together with limit" but this isn't enforced. While the backward-compatible unbounded path is intentional, the offset-only branch being silently uncapped is inconsistent. Consider either:

  • Applying the cap when start > 0 (same as the limit branch), or
  • Documenting in the schema that offset without limit is unbounded (to set correct expectations for API consumers)

Manual sync requirement between frontend and backend constants

MAX_FULL_DATACELL_LIST_LIMIT = 500 (Python) and EXTRACT_GRID_EMBED_CELL_LIMIT = 500 (TypeScript) are documented as needing to stay in sync, but there's no automated check. A drift here would cause the frontend to request exactly the server cap, silently triggering truncation every time a caller passes limit = EXTRACT_GRID_EMBED_CELL_LIMIT when the server interprets it as capped. Consider adding a note in the docstring about the risk, or adding a CI check (e.g., a grep-based test that validates both values match).


Minor: _get_datacell_qs cache is lost on order_by in slices

The cached queryset is the deterministically-ordered base (order_by("document_id", "column_id", "id")). Both resolve_full_datacell_list and resolve_datacell_count share it correctly. This is good design. One note: if a future resolver adds a different ordering or filter on top of the cached qs, the cache key (_datacell_qs_{user.pk}) won't reflect that variation. Worth a comment warning against extending the cache for cases with additional filters.


Positives worth calling out

  • Backward compatibility: Omitting limit/offset returns all visible cells as before — no breaking change.
  • Input validation: negative offset clamped to 0, limit capped at MAX_FULL_DATACELL_LIST_LIMIT, limit=0 returns empty without crashing. All three are tested.
  • DRY improvement: formatCellValue correctly extracted to frontend/src/utils/formatters.ts with its own unit tests.
  • UX improvement: replacing the hard error screen with an informational banner is much more user-friendly.
  • Test coverage: the five new backend test methods cover the main pagination branches comprehensively.
  • Constants placement: both the backend and frontend constants land in the project's designated constants files per CLAUDE.md guidance.

Summary

The "both-truncated" test mismatch is a concrete bug that will cause CI failures. The offset-only cap gap is a design inconsistency worth resolving before the pagination surface is widened. Everything else is solid.

JSv4 added 3 commits April 13, 2026 20:49
- Add _get_datacell_qs() with user-keyed instance cache so both
  resolve_full_datacell_list and resolve_datacell_count share a single
  queryset build per request instead of hitting the DB twice
- Improve "both truncated" banner message to clearly separate document
  count from cell count: "Showing N of M fetched documents (X of Y
  total cells loaded)"
- Add null guard on fullDatacellList access (optional chaining)
- Fix datacellCount type in graphql-api.ts to include | null to match
  the nullable GraphQL Int schema
- Add branch comments to resolver for offset-only and unbounded paths
- Update test assertion to match improved banner text
… into claude/resolve-issue-1204-aIifN

# Conflicts:
#	config/graphql/extract_types.py
#	frontend/src/components/extracts/ExtractGridEmbed.tsx
#	frontend/tests/ExtractGridEmbed.ct.tsx
@claude
Copy link
Copy Markdown

claude bot commented Apr 14, 2026

PR Review: Server-side pagination for extract grid embeds (#1204)

Overall this is a well-structured fix. The approach — bounded limit/offset on the resolver, a datacellCount field for N-of-M display, and a graceful footer banner instead of a hard error — is the right design. Good backward compatibility (omitting limit still returns all cells) and solid test coverage across edge cases.

A few items to address before merging:


Bug: "Both Truncated" test assertion will fail

The test in ExtractGridEmbed - Both Truncated asserts:

await expect(
  page.getByText(/Showing 200 of 201 fetched documents/)
).toBeVisible();

But the component (ExtractGridEmbed.tsx) renders:

`Showing ${visibleRows.length} of ${totalRowCount} documents. ...`

The word "fetched" does not appear anywhere in the component's message. The regex /Showing 200 of 201 fetched documents/ will not match "Showing 200 of 201 documents." — this test will fail in CI. The fix is to drop "fetched" from the assertion (or add it to the component message if that phrasing is intentional).


Design gap: offset-only calls bypass the server cap

The resolver has three branches:

if limit is not None:
    limit = max(0, min(limit, MAX_FULL_DATACELL_LIST_LIMIT))
    return qs[start : start + limit]
if start:
    return qs[start:]   # ← no cap applied
return qs               # ← backward-compat unbounded

A caller can send fullDatacellList(offset: 1) with no limit and receive the entire remaining queryset uncapped. The schema docstring says "Use together with limit" but this isn't enforced. The offset-only path should either apply the cap (qs[start : start + MAX_FULL_DATACELL_LIST_LIMIT]) or be explicitly documented in the schema description as unbounded so API consumers aren't surprised.


Minor: queryset instance-cache is fragile

_get_datacell_qs caches the queryset via setattr(extract, f"_datacell_qs_{user.pk}", qs). This works when Graphene passes the same Python model instance to both resolvers in a single query (which it does today), but this is an undocumented Graphene-Django implementation detail. A comment in the function body explaining this assumption — and what breaks if it doesn't hold — would help future maintainers. The user-scoped cache key (user.pk in the attr name) is a good security touch.


Minor: deferred import in a module-level function

ExtractQueryOptimizer is imported inside _get_datacell_qs() rather than at the top of extract_types.py. If there's a circular import concern driving this, a brief comment would clarify. Otherwise, top-level imports are conventional in this codebase and easier for static analysis tools.


Missing test: negative limit values

The resolver correctly clamps negative limits to 0 via max(0, min(limit, MAX_FULL_DATACELL_LIST_LIMIT)), but there's no test for this branch. Given that negative offset already has a dedicated test (test_full_datacell_list_negative_offset_clamped_to_zero), parity for negative limit would be consistent and ensure the clamping is not accidentally removed.


Positives

  • Backward compatible: omitting limit/offset returns all visible cells — no breaking change.
  • Input validation: negative offset clamped, limit capped at MAX_FULL_DATACELL_LIST_LIMIT, limit=0 returns empty without crash. All three are tested.
  • DRY improvement: formatCellValue correctly extracted to frontend/src/utils/formatters.ts with its own unit test file.
  • UX improvement: replacing the hard error wall with an informational footer banner is a meaningfully better user experience.
  • Constants placement: both backend (opencontractserver/constants/extracts.py) and frontend (constants.ts) land in the designated constants files per project conventions.
  • Schema doc: new limit/offset/datacellCount fields are properly documented in the GraphQL schema.

Must fix before merge: the "both-truncated" test assertion ("fetched" mismatch) — this will break CI. The offset-only cap gap is also worth resolving while this code is being touched.

Use the project's coverage fixture (./utils/coverage) instead of importing
directly from @playwright/experimental-ct-react, so Istanbul coverage data
is actually collected. Also fix the "both-truncated" test assertion that
referenced non-existent "fetched" text in the banner regex.
@claude
Copy link
Copy Markdown

claude bot commented Apr 14, 2026

Code Review

This is a clean, well-structured feature. The overall approach is sound — bound the payload at the resolver level, surface a count separately, and replace the hard error with an informative footer banner. The test coverage is thorough. A few things worth discussing before merge:


Medium: Over-engineered instance-level queryset caching

config/graphql/extract_types.py_get_datacell_qs

cache_attr = f"_datacell_qs_{user.pk}"
if not hasattr(extract, cache_attr):
    setattr(extract, cache_attr, ExtractQueryOptimizer.get_extract_datacells(...).order_by(...))
return getattr(extract, cache_attr)

The comment says this saves "a single queryset build when both fields are requested." But get_extract_datacells() just constructs a lazy queryset — no DB hit until evaluation. What it does not save is the two actual DB round-trips (SELECT + COUNT), which still happen regardless of the cache.

The pattern also has a subtle quirk: in the unbounded case, resolve_full_datacell_list returns the exact cached queryset object. Graphene iterates it, populating _result_cache on that object. resolve_datacell_count then calls .count() on the same object — which correctly bypasses _result_cache and hits the DB, so it's not wrong, but it is confusing if you're trying to reason about it later.

Simpler alternative: just call ExtractQueryOptimizer.get_extract_datacells() twice (once per resolver). The queryset construction is a handful of Python dict lookups — cheaper than the cognitive overhead of the caching pattern, and it avoids mutating model instances.


Medium: Inline import could be module-level

config/graphql/extract_types.py, resolve_full_datacell_list:

def resolve_full_datacell_list(self, info, limit=None, offset=None):
    from opencontractserver.constants.extracts import MAX_FULL_DATACELL_LIST_LIMIT

opencontractserver.constants.extracts is a leaf module with no Django model imports, so there's no circular import risk here (unlike ExtractQueryOptimizer). Moving it to the top of the file would be consistent with the project's style and make the cap more visible to future readers.


Low: Missing test for negative limit

opencontractserver/tests/test_extract_queries.py covers:

  • limit=0 → empty list ✓
  • offset=-5 (negative offset clamped) ✓
  • limit > MAX (cap enforced) ✓

But limit=-1 is not tested. The resolver does max(0, min(limit, MAX)), so -1 clamps to 0 and returns an empty list — same behavior as limit=0. Worth a one-liner assertion in test_full_datacell_list_limit_zero_returns_empty, or a separate test, just to lock in the contract.


Low: datacellCount N+1 if field is reused outside the embed

resolve_datacell_count issues a COUNT(*) SQL query per ExtractType instance. For the single-extract embed this is fine. If datacellCount is ever added to a list query (e.g., an extract listing page), it will create N+1 queries. Worth a brief # N+1 if used in list context — add a DataLoader before doing so comment on the resolver to give future readers a heads-up.


Nit: offset-only branch is a redundant code path

if limit is not None:
    ...
    return qs[start : start + limit]
if start:
    return qs[start:]
return qs

The middle branch (offset without limit) could be unified with the last: return qs[start:] when start == 0 is equivalent to return qs. Alternatively, return qs[start:] unconditionally (dropping the if start: guard) would collapse the last two branches and make the logic easier to follow. Minor, but reduces the number of code paths to reason about.


Nit: Dual-constant sync is undocumented at the call site

frontend/src/assets/configurations/constants.ts and opencontractserver/constants/extracts.py both hardcode 500 and the comments say "must match." This is fine for now, but there's no test that catches divergence if one is changed without the other. Something like a backend integration test that asserts the GraphQL schema cap equals MAX_FULL_DATACELL_LIST_LIMIT would be belt-and-suspenders, though I understand if that's overkill for this change.


Looks good

  • Pagination logic (limit/offset/clamping) is correct
  • _get_datacell_qs correctly uses user.pk in the cache key, preventing cross-user queryset leakage
  • datacellCount correctly counts the full unsliced queryset regardless of limit/offset
  • The OverflowFooter handles all three banner states (server-truncated, display-clipped, both) correctly
  • MOCK_VARS shared constant ensures all Apollo mocks agree on the limit variable — good catch, this would silently fail otherwise
  • formatCellValue extraction to formatters.ts + unit tests is a clean DRY improvement
  • Backend tests cover all meaningful edge cases; the test_full_datacell_list_limit_capped_at_server_max bulk-create test may be slow in CI (creates 501 rows) but the boundary case is worth testing explicitly

JSv4 added 2 commits April 14, 2026 00:21
…imports, add negative limit test

- Remove instance-level queryset caching from _get_datacell_qs; the underlying
  get_extract_datacells() returns a lazy queryset with no DB hit, so the
  hasattr/setattr cache pattern added no value.
- Move MAX_FULL_DATACELL_LIST_LIMIT import to module level since
  opencontractserver.constants.extracts is a leaf module with no circular
  dependency risk.
- Add test_full_datacell_list_negative_limit_clamped_to_zero to verify that
  limit=-1 is clamped to 0 and returns an empty list.
@claude
Copy link
Copy Markdown

claude bot commented Apr 14, 2026

Code Review — PR #1235: Server-side pagination for extract grid embeds

Overall this is a solid, well-scoped implementation. The architecture follows project conventions and the test coverage is comprehensive. A few issues need attention before merge.


Bugs / Correctness

1. Unbounded offset-only queries bypass the payload cap

config/graphql/extract_types.pyresolve_full_datacell_list:

# Offset only: skip the first N cells, return the rest.
if start:
    return qs[start:]
# Unbounded: return the full queryset as-is.
return qs

MAX_FULL_DATACELL_LIST_LIMIT is only enforced when limit is not None. Any authenticated caller can issue fullDatacellList(offset: 1) and receive every remaining datacell without a limit. The intent of the constant is bypassed by the offset-only branch. Consider enforcing the cap regardless of whether limit is provided:

effective_limit = min(limit, MAX_FULL_DATACELL_LIST_LIMIT) if limit is not None else MAX_FULL_DATACELL_LIST_LIMIT
return qs[start : start + effective_limit]

If truly unbounded access needs to be supported (e.g., internal bulk export paths), that should be a separate resolver or require superuser/staff privilege.

2. test_full_datacell_list_offset_only_without_limit tests the problematic branch

That test validates that offset-without-limit works, but it inadvertently documents the unbounded behaviour as intentional. If the cap is tightened (see above), the test expectation will need updating.


Performance

3. Double queryset for fullDatacellList + datacellCount in a single request

The frontend GET_EXTRACT_GRID_EMBED query requests both fields simultaneously. Each field calls _get_datacell_qs(self, info.context.user) independently, so a single GraphQL request issues two separate SELECT queries (one with LIMIT/OFFSET, one with COUNT(*)) on the same base queryset. Caching the queryset on the instance avoids the redundant permission join:

def _cached_datacell_qs(self, user):
    if not hasattr(self, "_datacell_qs_cache"):
        self._datacell_qs_cache = _get_datacell_qs(self, user)
    return self._datacell_qs_cache

The COUNT(*) query is already efficient (Django issues it as SELECT COUNT(*) …), so this is a minor optimisation, but worth having given both fields are always co-fetched.


Maintenance Risk

4. Two constants that must stay in sync

The docstring in opencontractserver/constants/extracts.py documents the manual sync requirement between MAX_FULL_DATACELL_LIST_LIMIT (backend) and EXTRACT_GRID_EMBED_CELL_LIMIT (frontend). This is already acknowledged in the comment, which is good. Consider adding an assertion in a backend test to make the contract explicit and fail loudly if someone changes one without the other — or at minimum add a CI lint rule. Not blocking, but worth tracking.


Rule Violation

5. PR description contains a Claude Code session URL

The PR body ends with https://claude.ai/code/session_018GC4vMGMVWCvLHiJMYVxRf. Per CLAUDE.md:

Never credit Claude or Claude Code in commit messages, PR messages, comments, or any other artifacts.

Please edit the PR description to remove that URL before merge.


What's Done Well

  • _get_datacell_qs() helper is a clean DRY pattern — both resolvers share the same permission-filtered, deterministically ordered queryset.
  • formatCellValue() extraction to formatters.ts is proper SRP compliance per the project conventions.
  • Constants placement is correct: opencontractserver/constants/extracts.py (backend) and frontend/src/assets/configurations/constants.ts (frontend).
  • setUp fix (adding document to extract + granting READ permission) is an important correctness fix — prior tests silently relied on superuser privilege and didn't exercise the permission-filtered code path.
  • MOCK_VARS constant in ExtractGridEmbedTestWrapper.tsx is the right pattern to prevent Apollo variable-matching bugs (null vs undefined matters per CLAUDE.md).
  • Graceful degradation: replacing the hard error screen with an overflow footer banner is the right UX direction.
  • Test coverage is thorough: 6 backend cases covering negative offset clamping, limit capping, zero limit, offset-only, and full pagination; 2 new frontend component test suites.
  • Negative offset guard (max(0, offset)) is a good defensive choice — Django querysets do not support negative indexing.

Summary: Items 1 and 5 should be resolved before merge. Item 3 is a nice-to-have optimisation. Item 2 is a test update that follows from item 1.

JSv4 added 2 commits April 14, 2026 06:08
…warning

- Cap offset-only queryset to MAX_FULL_DATACELL_LIST_LIMIT to prevent
  unbounded results when offset is provided without limit
- Clarify combined-truncation banner wording to avoid mixing document/cell
  units
- Add N+1 warning comment on resolve_datacell_count
- Add clarifying comments on _get_datacell_qs lifecycle and inline import
@claude
Copy link
Copy Markdown

claude bot commented Apr 14, 2026

PR Review: Server-side pagination for extract grid embeds (#1204)

Overall this is a well-structured implementation. The security guards, test coverage, and DRY refactoring are all solid. A few issues worth addressing before merge.


Bugs / Correctness

1. Misleading behavior in test_full_datacell_list_offset_only_without_limit

The test docstring says it verifies the resolver "skips the first N cells and returns the remainder" (unlimited). In practice, the resolver applies MAX_FULL_DATACELL_LIST_LIMIT as an implicit cap when offset > 0 and no limit is given:

if start:
    return qs[start : start + MAX_FULL_DATACELL_LIST_LIMIT]

The test only has 5 cells, so it never hits the 500-cell cap, meaning the implicit cap is never exercised by this test. The docstring should reflect the actual behavior (capped at server max), and a test case with total_cells > MAX_FULL_DATACELL_LIST_LIMIT should be added to verify the cap fires. As written, the cap is untested on this code path.

2. GraphQL field description doesn't document the implicit cap on offset-only calls

The offset arg description says "Use together with limit for client-driven pagination." But providing offset without limit silently applies MAX_FULL_DATACELL_LIST_LIMIT. A caller following the description in good faith could get fewer results than expected. Either document this behavior in the field description, or simplify to always require limit when offset is used (return an error or apply the cap explicitly).


Performance / Architecture

3. _get_datacell_qs called twice per request (N+1 COUNT(*))

The code itself has the right warning comment:

# N+1 warning: issues a COUNT(*) per ExtractType instance. Safe for
# the single-extract embed query; add a DataLoader before exposing
# this field on list queries.

This is fine for the current embed use case, but datacellCount is now part of the public ExtractType schema. There's nothing stopping a list query from requesting it across hundreds of extracts. Consider adding an @deprecated hint or schema directive to gate this field, or at minimum add a note in schema.graphql recommending against fetching it on list queries.


Minor Issues

4. Unbounded backward-compatible path is still unlimited

When neither limit nor offset is provided, the resolver returns the full queryset:

# Unbounded (no limit, no offset): backward-compatible full list.
return qs

This means issue #1204 is only resolved for clients that explicitly pass limit. The endpoint remains unbounded for any caller that omits the argument. This is a reasonable trade-off for backward compatibility, but the PR description says "Resolves #1204" — worth clarifying in the description that unbounded fetches are still possible and this is intentional.

5. Frontend/backend constant sync is manual

opencontractserver/constants/extracts.py has a comment noting MAX_FULL_DATACELL_LIST_LIMIT must match EXTRACT_GRID_EMBED_CELL_LIMIT in the frontend. Since these live in two separate files with no automated check, this will silently drift. A // NOTE: keep in sync with... comment is already there, but consider a backend test asserting the two values match (e.g. reading the constants.ts value via a fixture or comparing against a hardcoded expected value):

def test_backend_and_frontend_cell_limit_match(self):
    # If this fails, update frontend/src/assets/configurations/constants.ts
    from opencontractserver.constants.extracts import MAX_FULL_DATACELL_LIST_LIMIT
    self.assertEqual(MAX_FULL_DATACELL_LIST_LIMIT, 500)

That's still manual, but at least it breaks loudly when changed.


Positive Highlights

  • Security: max(0, min(limit, MAX_FULL_DATACELL_LIST_LIMIT)) correctly prevents limit bypass — good.
  • Negative offset/limit clamping with explicit tests for each edge case — thorough.
  • formatCellValue extracted to formatters.ts: correct application of the DRY rule and utility-file convention from CLAUDE.md. The new unit test file for it is a nice addition.
  • MOCK_VARS refactor in test wrapper: using a shared constant ensures Apollo's strict variable-matching doesn't create confusing test failures.
  • Deterministic ordering (document_id, column_id, id) makes pagination stable across requests.
  • OverflowFooter banner is a much better UX than the previous hard error screen; the combined message for both truncation paths covers the edge case cleanly.
  • ComponentEmbedErrorFallback test is a welcome addition.

JSv4 added 2 commits April 14, 2026 19:36
…ocstring

- Rewrite the both-truncated OverflowFooter to separate document count
  and cell count into distinct sentences for clarity
- Update test assertion to match the revised banner wording
- Fix misleading docstring on offset-only test to document the server
  cap behavior instead of implying unbounded results
- Improve inline comments on resolver branches for readability
@claude
Copy link
Copy Markdown

claude bot commented Apr 15, 2026

Code Review: PR #1235 — Server-side pagination for extract grid embeds

Overview

This PR addresses issue #1204 by bounding the fullDatacellList GraphQL payload with optional limit/offset arguments and a companion datacellCount field. The frontend embed now fetches at most 500 cells and renders a "Showing N of M" footer instead of the previous hard error screen. Overall this is a well-structured, well-tested improvement to a real scalability problem.


Strengths

  • Test coverage is excellent. Five distinct backend test cases cover the happy path, pagination, negative inputs, zero limit, offset-only, and server-cap enforcement. The new frontend test states (partial, both-truncated) close the gap on the two new banner branches.
  • Deterministic ordering (order_by("document_id", "column_id", "id")) ensures stable pagination windows across requests — a common oversight in ad-hoc pagination implementations.
  • Input sanitization is thorough: negative offset clamped to 0, negative limit clamped to 0, oversized limit capped at MAX_FULL_DATACELL_LIST_LIMIT.
  • formatCellValue correctly moved from an inline component helper to utils/formatters.ts, matching project conventions for utility placement.
  • MOCK_VARS constant in the test wrapper is a nice DRY fix — GraphQL mock variable matching is a common source of silent test failures and centralising it prevents future drift.

Issues and Suggestions

1. Unbounded backward-compat path is still vulnerable (config/graphql/extract_types.py)

# No limit, no offset: backward-compatible full list for callers
# that need all visible cells (e.g., the Extracts panel).
return qs

When limit=None and offset=None (or offset=0), the resolver returns a fully unbounded queryset. The PR description documents this as intentional, but it means any caller that omits limit — including a future query or a direct API consumer — still gets an unbounded payload. The original problem (#1204) is solved for the embed, but the API-level risk persists.

Suggestion: Consider emitting a logging.warning() on the unbounded path (guarded by a feature flag or Django debug mode) so operators can detect callers that haven't migrated. Alternatively, document in the GraphQL description that the no-args path will eventually be deprecated.

2. N+1 query concern for datacellCount (config/graphql/extract_types.py:116)

The inline comment already calls this out:

# N+1 warning: issues a COUNT(*) per ExtractType instance. Safe for
# the single-extract embed query; add a DataLoader before exposing
# this field on list queries.

The warning is correct, but ExtractType is already reachable through the extracts list query in GraphQL. If datacellCount is ever included in a list query (e.g., for a dashboard showing many extracts), each row triggers a COUNT. No action needed now, but consider adding a @skip_in_list guard or schema-level documentation to prevent accidental misuse.

3. Inline import to avoid circular dependency is a code smell (config/graphql/extract_types.py:43-49)

def _get_datacell_qs(extract, user):
    # Inline import to avoid circular dependency with the annotations module.
    from opencontractserver.annotations.query_optimizer import (
        ExtractQueryOptimizer,
    )

Inline imports inside functions are a known signal that the module graph has a structural problem. It works, but future maintainers may be puzzled by it.

Suggestion: If ExtractQueryOptimizer is used only for its get_extract_datacells method, consider moving the helper function to a dedicated utility module (e.g., opencontractserver/utils/extracts.py) that can import from both the annotations optimizer and the constants without creating a cycle.

4. Frontend/backend constant synchronization is manual (opencontractserver/constants/extracts.py, frontend/src/assets/configurations/constants.ts)

Both files document "must match" in comments:

# Must match the frontend constant ``EXTRACT_GRID_EMBED_CELL_LIMIT``
// Must match the backend constant
// ``opencontractserver/constants/extracts.py::MAX_FULL_DATACELL_LIST_LIMIT``.

Comments are easy to miss during updates. A simple safeguard would be a backend test that reads the frontend constant file and asserts the values are equal, or a grep-based pre-commit check. Not a blocker, but worth noting.

5. Minor: offset=0 explicit vs. offset=None behaviour difference (config/graphql/extract_types.py:97)

start = max(0, offset) if offset is not None else 0

An explicit offset: 0 in the GraphQL call produces start=0, which then falls through to the unbounded path (same as no offset at all). This is correct, but it means fullDatacellList(offset: 0) is unbounded while fullDatacellList(offset: 1) is capped at MAX_FULL_DATACELL_LIST_LIMIT. This asymmetry is subtle and worth a comment near the if start: check to avoid future confusion.

6. Banner UX: "Showing N of M cells" vs. "Showing N of M documents" (ExtractGridEmbed.tsx:331-340)

The combined banner text is:

Showing 200 of 201 documents. 201 of 1000 total cells loaded. View the full extract in the Extracts panel.

When both truncations apply, the "201 of 1000 total cells loaded" refers to the fetched-vs-total cell count, while "200 of 201 documents" refers to rendered-vs-available documents. These are different denominators, which could be confusing to users.

Suggestion: Consider clarifying the message:

"Showing 200 of 201 documents (201 of 1000 cells fetched). View the full extract in the Extracts panel."


Summary

Category Assessment
Correctness ✅ Pagination logic is correct and well-guarded
Test coverage ✅ Both backend and frontend cases covered
Security ✅ Server-side cap prevents bypass via API
Performance ⚠️ N+1 on datacellCount in list queries (documented, low risk now)
Code quality ⚠️ Inline import for circular-dependency workaround; manual constant sync
UX Minor banner copy clarity issue when both truncations apply

The implementation is solid and the issue (#1204) is substantively resolved. The items above are mostly minor or forward-looking — none are blockers for merge. The most important one to address before merge is the datacellCount N+1 exposure if there's any chance it will appear in a list query in the short term.

@claude
Copy link
Copy Markdown

claude bot commented Apr 15, 2026

Code Review: Server-side pagination for extract grid embeds (#1204)

Overview

This PR resolves a real scalability problem: fullDatacellList was unbounded, and large extracts could return thousands of cells in a single response. The solution is clean — add limit/offset to the GraphQL field, add a datacellCount field for "N of M" display, and replace the hard error guard with a graceful footer banner. Overall this is a solid, well-structured change. The test coverage is thorough and the DRY improvements (extracting formatCellValue, centralising mock variables) are good housekeeping.


Issues

1. Asymmetric behavior in the "offset only, no limit" path (medium severity)

config/graphql/extract_types.py, resolve_full_datacell_list

if start:
    return qs[start : start + MAX_FULL_DATACELL_LIST_LIMIT]
# No limit, no offset: backward-compatible full list
return qs

The no-limit + no-offset path returns all visible cells (backward-compatible). But the no-limit + has-offset path silently caps at MAX_FULL_DATACELL_LIST_LIMIT. A caller who passes fullDatacellList(offset: 0) gets capped at 500, while fullDatacellList() returns everything. This asymmetry is surprising and is not surfaced by datacellCount.

The simplest fix is to cap the unbounded no-args path as well, or to document in the schema description that the no-args path is uncapped. Since the PR description says "backward compatible", keeping the current behavior is acceptable, but a schema description clarification would prevent confusion:

# Schema description update
"Returns all visible datacells if omitted (uncapped). Callers that "
"need a bounded payload should always supply `limit`."

2. The unbounded fullDatacellList() (no args) is still reachable by direct API callers (low-medium severity)

The embed now always passes limit: 500. However, authenticated users making direct GraphQL requests can still call fullDatacellList without a limit and receive the full unbounded payload. This was the original behavior, but now that there's an explicit server cap for the bounded path, it may be worth logging a note or opening a follow-up issue to decide whether the no-limit path should also be capped.

3. Dual-constant fragility — frontend and backend values must stay in sync manually (low severity)

EXTRACT_GRID_EMBED_CELL_LIMIT = 500 (frontend) and MAX_FULL_DATACELL_LIST_LIMIT = 500 (backend) are documented in comments but not automatically verified. If they drift apart, the frontend requests 600 cells but the server caps at 500 (or vice versa), producing silently incorrect "N of M" math.

Suggestion: Add an assertion to the backend test suite or a CI check that reads the frontend constant from constants.ts and compares it, or at minimum note this in CHANGELOG.md with a clear warning. A simpler near-term approach: add a comment in both files pointing to an issue number so any future change is tracked.

4. resolve_datacell_count issues two queries per embed request, and the comment is slightly misleading (low severity)

config/graphql/extract_types.py line 135-139

def resolve_datacell_count(self, info) -> int:
    # N+1 warning: issues a COUNT(*) per ExtractType instance.
    return _get_datacell_qs(self, info.context.user).count()

The comment correctly flags the N+1 risk for list queries. However, for the single-extract embed query it's actually N+2 queries (one SELECT ... LIMIT N for fullDatacellList + one SELECT COUNT(*) for datacellCount). This is acceptable for the current use case — just noting so the comment is accurate. Consider updating to: "Issues a COUNT(*) in addition to the main list query per ExtractType instance."

5. _get_datacell_qs circular-import workaround is a smell worth tracking (nit)

# Inline import to avoid circular dependency with the annotations module.
from opencontractserver.annotations.query_optimizer import ExtractQueryOptimizer

This works fine, but it fires on every resolver call. It'd be worth opening a follow-up issue to break the circular dependency properly (e.g., move ExtractQueryOptimizer to a dedicated module). Not a blocker.


What's Good

  • Constants centralisation: opencontractserver/constants/extracts.py + frontend/src/assets/configurations/constants.ts follows the project's no-magic-numbers rule well.
  • Test coverage is excellent: Negative offset, negative limit, limit=0, limit > server max, offset-only, and both-truncated paths are all covered. This is above-average thoroughness.
  • formatCellValue moved to formatters.ts: Correct DRY refactor; the companion unit test in formatCellValue.test.ts is a nice addition.
  • MOCK_VARS shared object: Eliminates the Apollo strict-matching footgun across all mock definitions — good catch.
  • OverflowFooter design: Replacing the hard error screen with a graceful banner improves UX significantly.
  • Deterministic ordering: order_by("document_id", "column_id", "id") ensures stable pagination windows across requests.
  • Codec change (file:files: in backend.yml): Legitimate fix, not scope creep.

Summary

Severity Count
Medium 2
Low 3
Nit 1

None of the issues are blockers. The asymmetric no-args behavior is the most likely to cause a future bug. Everything else is documentation or follow-up work. The core implementation is correct and the test suite gives good confidence in the resolver logic.

@JSv4 JSv4 mentioned this pull request Apr 15, 2026
@JSv4 JSv4 merged commit 3121cc5 into main Apr 15, 2026
16 checks passed
@JSv4 JSv4 deleted the claude/resolve-issue-1204-aIifN branch April 15, 2026 03:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add pagination/limits to ExtractGridEmbed fullDatacellList query

2 participants