Implement server-side pagination for extract grid embeds (Issue #1204)#1235
Implement server-side pagination for extract grid embeds (Issue #1204)#1235
Conversation
Resolves #1204 — the GET_EXTRACT_GRID_EMBED query previously fetched the unbounded full_datacell_list, which for extracts with hundreds of documents and columns could return tens of thousands of cells. Backend: - ExtractType.fullDatacellList now accepts optional limit/offset args. - Resolver applies a deterministic order_by(document_id, column_id, id) before slicing so limit/offset windows are stable across requests. - New ExtractType.datacellCount field reports total visible cells ignoring pagination, so clients can render "showing N of M" indicators. - Schema.graphql updated to match. Frontend: - GET_EXTRACT_GRID_EMBED query passes limit: EXTRACT_GRID_EMBED_CELL_LIMIT (500) and fetches datacellCount alongside fullDatacellList. - ExtractGridEmbed renders a partial-data footer banner ("Showing N of M cells" / "Showing X of Y documents") instead of the previous hard "exceeds limit" error screen when the payload is bounded. - New EXTRACT_GRID_EMBED_CELL_LIMIT constant; existing EXTRACT_GRID_EMBED_MAX_ROWS repurposed as a pure display bound. Tests: - test_extract_queries.test_full_datacell_list_limit_offset_and_count verifies limit/offset pagination, datacellCount, disjoint pages, and partial final-page behavior. - ExtractGridEmbedTestWrapper gets a new "partial" state mock; the "too-many-rows" state now exercises the partial-data banner instead of the removed error screen. - Two new component tests cover the partial-data rendering paths.
Code Review: Server-side pagination for extract grid embeds (#1204)Overall this is a well-structured, well-documented change that cleanly resolves #1204. The approach (bounded Bug: Negative
|
Codecov Report❌ Patch coverage is
📢 Thoughts on this report? Let us know! |
… improve banner UX - Clamp negative offset to 0 in resolve_full_datacell_list to prevent Django AssertionError on negative queryset indexing - Add MAX_FULL_DATACELL_LIST_LIMIT (500) server-side cap to prevent DoS via arbitrarily large limit values in GraphQL API - Fix misleading banner message when both row and cell truncation apply: now distinguishes fetched document count from total - Add backend tests for negative offset and limit cap scenarios - Add ComponentEmbedErrorFallback component test with docScreenshot - Cross-reference frontend/backend constants for maintainability
| from opencontractserver.constants.corpus_actions import * # noqa: F401, F403 | ||
| from opencontractserver.constants.discovery import * # noqa: F401, F403 | ||
| from opencontractserver.constants.document_processing import * # noqa: F401, F403 | ||
| from opencontractserver.constants.extracts import * # noqa: F401, F403 |
Code ReviewGood work overall — the core approach is solid. The resolver logic is correct, the UI degrades gracefully instead of hard-erroring, and the test coverage is thoughtful. A few items worth addressing before merge: Medium: Two separate DB queries per extract when both fields are fetched
Medium: Combined truncation message is untested and grammatically awkward
That phrasing is hard to parse (mixes "documents" and "cells" in one sentence). More importantly, this branch has no test. The existing mocks only cover the two single-truncation cases. Suggest either adding a test fixture that triggers both conditions, or simplifying to two separate footer lines. Minor:
|
The fullDatacellList and datacellCount tests were failing because the test setUp never associated the document with the extract's documents M2M or granted the test user read permissions on the document. The permission-aware datacell resolver (ExtractQueryOptimizer) returned empty results for non-superuser users without document permissions.
- Fix datacellCount type in GetExtractGridEmbedOutput to match nullable GraphQL schema (number -> number | null | undefined) - Clarify test_full_datacell_list_limit_capped_at_server_max docstring to accurately describe what it tests vs. what it cannot verify - Add TODO noting the duplicate DB query when both fullDatacellList and datacellCount are fetched in the same GraphQL request
Adds a "both-truncated" test state and corresponding Playwright test to cover the code path where both server-side cell truncation and client-side row clipping fire simultaneously. The previous test suite only covered each truncation independently.
New docScreenshot for the both-truncated extract grid embed state. Updated loading screenshot from local test run.
Code Review: Server-side Pagination for Extract Grid Embeds (#1204)This is a well-scoped fix with good test coverage. The core approach is sound. A few items worth addressing before merge, roughly ordered by severity. Issues1. Double DB query on every request (performance)The A straightforward fix is to cache the base queryset on the # config/graphql/extract_types.py
def _get_datacell_qs(self, user):
"""Returns the permission-filtered, deterministically ordered queryset.
Cached on the instance so both resolvers share a single queryset build."""
if not hasattr(self, "_datacell_qs"):
from opencontractserver.annotations.query_optimizer import ExtractQueryOptimizer
self._datacell_qs = (
ExtractQueryOptimizer.get_extract_datacells(self, user, document_id=None)
.order_by("document_id", "column_id", "id")
)
return self._datacell_qs
def resolve_full_datacell_list(self, info, limit=None, offset=None):
from opencontractserver.constants.extracts import MAX_FULL_DATACELL_LIST_LIMIT
qs = self._get_datacell_qs(info.context.user)
start = max(0, offset) if offset is not None else 0
if limit is not None:
limit = max(0, min(limit, MAX_FULL_DATACELL_LIST_LIMIT))
return qs[start : start + limit]
return qs[start:] if start else qs
def resolve_datacell_count(self, info) -> int:
return self._get_datacell_qs(info.context.user).count()This makes one queryset build per 2. Confusing "both truncated" banner copy (UX)The combined message mixes units mid-sentence in a way that's hard to parse:
"500 of 1000 total cells" buried in the middle of a sentence about documents is jarring. Consider separating the two facts: `Showing ${visibleRows.length} of ${rows.length} documents (payload bounded to ${fetchedCellCount} of ${totalCellCount} total cells). View the full extract in the Extracts panel.`Or split into two lines inside the footer so each truncation is clearly labeled. The single-truncation messages are fine as-is. 3.
|
Add backend test for offset-only (no limit) code path in resolve_full_datacell_list, covering the `if start: return qs[start:]` branch. Export formatCellValue helper and add Vitest unit tests for all value formatting branches (null, boolean, number, string, object truncation).
PR Review: Server-side pagination for extract grid embedsOverall this is a well-structured implementation. The architecture is sound, tests are solid, and it cleanly resolves the unbounded-payload problem. A few things worth addressing before merge. |
|
Issues and Suggestions for PR 1235:
What is done well:
Items 1-4 are worth addressing before merge; 5-7 are optional improvements. |
…s, improve banner copy - Cache permission-filtered datacell queryset on ExtractType instance so resolve_full_datacell_list and resolve_datacell_count share one queryset build per request instead of duplicating the DB round-trip. - Move formatCellValue from inline definition in ExtractGridEmbed.tsx to frontend/src/utils/formatters.ts per project convention for shared helpers. - Clarify the combined truncation banner message to read "Showing N of M documents (payload bounded to X of Y total cells)" instead of the harder to parse previous phrasing. - Add null guard on fullDatacellList access for defensive runtime safety. - Add comment in GraphQL query noting offset is accepted but not yet used.
PR Review: Server-side pagination for extract grid embeds (#1204)Overall this is a well-structured, clean implementation. The change correctly resolves the unbounded-payload issue, maintains backward compatibility, and is accompanied by solid test coverage. A few things worth addressing before merge. Issues1. Instance-level queryset cache ignores the
|
- Key queryset cache by user.pk to prevent cross-user data leaks if the pattern is ever reused with shared instances (e.g. DataLoader). - Clarify combined-truncation banner: "fetched documents" and "total cells loaded" avoids implying rows.length is the true document count. - Convert all pagination test queries to use GraphQL variables dict instead of string interpolation, matching production usage patterns. - Improve cap test: create MAX_FULL_DATACELL_LIST_LIMIT+1 cells and verify the returned count is exactly the cap, not just "no 500 error". - Add test for limit=0 edge case (returns empty list, count still accurate).
PR Review: Server-side pagination for extract grid embeds (#1235)Overall this is a solid, well-scoped implementation. The backend resolver design is clean, the test coverage is excellent, and the UX improvement (graceful banner vs. hard error) is the right call. A few things worth addressing before merge: Potential Issues1. UX: combined-truncation banner is hard to parse In The phrase "fetched documents" is confusing — where 2. No enforcement of constant parity
3. Instance-level queryset cache ( The cache comment says it protects against a DataLoader scenario where one 4. if start:
return qs[start:]This branch (offset without limit) is tested by Observations / Minor Nits
Summary
Blocking on the combined-banner wording (#1 above) is a judgment call — it works correctly, it just reads oddly. Everything else is advisory. Happy to approve once the authors have had a chance to look at these. |
Move _get_datacell_qs from ExtractType method to module-level function. Graphene-Django resolvers receive the Django model instance as self, not the GraphQL type, so methods defined on ExtractType are not available at resolve time. This caused all 6 CI test failures. Also addresses review comments: - Simplify combined truncation banner to two separate sentences - Add full path reference (::) for backend constant parity comment - Add one-liner comments above each resolver branch
PR Review: Server-side pagination for extract grid embeds (#1204)Overall this is a well-structured solution to a real payload-size problem. The three-layered approach (server cap, display clip, and informational banner) is user-friendly and the edge-case tests are thorough. A few issues worth addressing before merge: Bug: "Both Truncated" test expectations don't match component outputThe component ( But await expect(page.getByText(/Showing 200 of 201 fetched documents/)).toBeVisible();The word "fetched" doesn't appear in the banner text, so this regex will never match. The test will fail (or is masked by a timeout). The assertion should be: await expect(page.getByText(/Showing 200 of 201 documents/)).toBeVisible();Design gap:
|
- Add _get_datacell_qs() with user-keyed instance cache so both resolve_full_datacell_list and resolve_datacell_count share a single queryset build per request instead of hitting the DB twice - Improve "both truncated" banner message to clearly separate document count from cell count: "Showing N of M fetched documents (X of Y total cells loaded)" - Add null guard on fullDatacellList access (optional chaining) - Fix datacellCount type in graphql-api.ts to include | null to match the nullable GraphQL Int schema - Add branch comments to resolver for offset-only and unbounded paths - Update test assertion to match improved banner text
… into claude/resolve-issue-1204-aIifN # Conflicts: # config/graphql/extract_types.py # frontend/src/components/extracts/ExtractGridEmbed.tsx # frontend/tests/ExtractGridEmbed.ct.tsx
PR Review: Server-side pagination for extract grid embeds (#1204)Overall this is a well-structured fix. The approach — bounded A few items to address before merging: Bug: "Both Truncated" test assertion will failThe test in await expect(
page.getByText(/Showing 200 of 201 fetched documents/)
).toBeVisible();But the component ( The word "fetched" does not appear anywhere in the component's message. The regex Design gap:
|
Use the project's coverage fixture (./utils/coverage) instead of importing directly from @playwright/experimental-ct-react, so Istanbul coverage data is actually collected. Also fix the "both-truncated" test assertion that referenced non-existent "fetched" text in the banner regex.
Code ReviewThis is a clean, well-structured feature. The overall approach is sound — bound the payload at the resolver level, surface a count separately, and replace the hard error with an informative footer banner. The test coverage is thorough. A few things worth discussing before merge: Medium: Over-engineered instance-level queryset caching
cache_attr = f"_datacell_qs_{user.pk}"
if not hasattr(extract, cache_attr):
setattr(extract, cache_attr, ExtractQueryOptimizer.get_extract_datacells(...).order_by(...))
return getattr(extract, cache_attr)The comment says this saves "a single queryset build when both fields are requested." But The pattern also has a subtle quirk: in the unbounded case, Simpler alternative: just call Medium: Inline import could be module-level
def resolve_full_datacell_list(self, info, limit=None, offset=None):
from opencontractserver.constants.extracts import MAX_FULL_DATACELL_LIST_LIMIT
Low: Missing test for negative
|
…imports, add negative limit test - Remove instance-level queryset caching from _get_datacell_qs; the underlying get_extract_datacells() returns a lazy queryset with no DB hit, so the hasattr/setattr cache pattern added no value. - Move MAX_FULL_DATACELL_LIST_LIMIT import to module level since opencontractserver.constants.extracts is a leaf module with no circular dependency risk. - Add test_full_datacell_list_negative_limit_clamped_to_zero to verify that limit=-1 is clamped to 0 and returns an empty list.
Code Review — PR #1235: Server-side pagination for extract grid embedsOverall this is a solid, well-scoped implementation. The architecture follows project conventions and the test coverage is comprehensive. A few issues need attention before merge. Bugs / Correctness1. Unbounded offset-only queries bypass the payload cap
# Offset only: skip the first N cells, return the rest.
if start:
return qs[start:]
# Unbounded: return the full queryset as-is.
return qs
effective_limit = min(limit, MAX_FULL_DATACELL_LIST_LIMIT) if limit is not None else MAX_FULL_DATACELL_LIST_LIMIT
return qs[start : start + effective_limit]If truly unbounded access needs to be supported (e.g., internal bulk export paths), that should be a separate resolver or require superuser/staff privilege. 2. That test validates that offset-without-limit works, but it inadvertently documents the unbounded behaviour as intentional. If the cap is tightened (see above), the test expectation will need updating. Performance3. Double queryset for The frontend def _cached_datacell_qs(self, user):
if not hasattr(self, "_datacell_qs_cache"):
self._datacell_qs_cache = _get_datacell_qs(self, user)
return self._datacell_qs_cacheThe Maintenance Risk4. Two constants that must stay in sync The docstring in Rule Violation5. PR description contains a Claude Code session URL The PR body ends with
Please edit the PR description to remove that URL before merge. What's Done Well
Summary: Items 1 and 5 should be resolved before merge. Item 3 is a nice-to-have optimisation. Item 2 is a test update that follows from item 1. |
…warning - Cap offset-only queryset to MAX_FULL_DATACELL_LIST_LIMIT to prevent unbounded results when offset is provided without limit - Clarify combined-truncation banner wording to avoid mixing document/cell units - Add N+1 warning comment on resolve_datacell_count - Add clarifying comments on _get_datacell_qs lifecycle and inline import
PR Review: Server-side pagination for extract grid embeds (#1204)Overall this is a well-structured implementation. The security guards, test coverage, and DRY refactoring are all solid. A few issues worth addressing before merge. Bugs / Correctness1. Misleading behavior in The test docstring says it verifies the resolver "skips the first N cells and returns the remainder" (unlimited). In practice, the resolver applies if start:
return qs[start : start + MAX_FULL_DATACELL_LIST_LIMIT]The test only has 5 cells, so it never hits the 500-cell cap, meaning the implicit cap is never exercised by this test. The docstring should reflect the actual behavior (capped at server max), and a test case with 2. GraphQL field description doesn't document the implicit cap on The Performance / Architecture3. The code itself has the right warning comment: # N+1 warning: issues a COUNT(*) per ExtractType instance. Safe for
# the single-extract embed query; add a DataLoader before exposing
# this field on list queries.This is fine for the current embed use case, but Minor Issues4. Unbounded backward-compatible path is still unlimited When neither # Unbounded (no limit, no offset): backward-compatible full list.
return qsThis means issue #1204 is only resolved for clients that explicitly pass 5. Frontend/backend constant sync is manual
def test_backend_and_frontend_cell_limit_match(self):
# If this fails, update frontend/src/assets/configurations/constants.ts
from opencontractserver.constants.extracts import MAX_FULL_DATACELL_LIST_LIMIT
self.assertEqual(MAX_FULL_DATACELL_LIST_LIMIT, 500)That's still manual, but at least it breaks loudly when changed. Positive Highlights
|
…ocstring - Rewrite the both-truncated OverflowFooter to separate document count and cell count into distinct sentences for clarity - Update test assertion to match the revised banner wording - Fix misleading docstring on offset-only test to document the server cap behavior instead of implying unbounded results - Improve inline comments on resolver branches for readability
Code Review: PR #1235 — Server-side pagination for extract grid embedsOverviewThis PR addresses issue #1204 by bounding the Strengths
Issues and Suggestions1. Unbounded backward-compat path is still vulnerable (
|
| Category | Assessment |
|---|---|
| Correctness | ✅ Pagination logic is correct and well-guarded |
| Test coverage | ✅ Both backend and frontend cases covered |
| Security | ✅ Server-side cap prevents bypass via API |
| Performance | datacellCount in list queries (documented, low risk now) |
| Code quality | |
| UX | Minor banner copy clarity issue when both truncations apply |
The implementation is solid and the issue (#1204) is substantively resolved. The items above are mostly minor or forward-looking — none are blockers for merge. The most important one to address before merge is the datacellCount N+1 exposure if there's any chance it will appear in a list query in the short term.
Code Review: Server-side pagination for extract grid embeds (#1204)OverviewThis PR resolves a real scalability problem: Issues1. Asymmetric behavior in the "offset only, no limit" path (medium severity)
if start:
return qs[start : start + MAX_FULL_DATACELL_LIST_LIMIT]
# No limit, no offset: backward-compatible full list
return qsThe no-limit + no-offset path returns all visible cells (backward-compatible). But the no-limit + has-offset path silently caps at The simplest fix is to cap the unbounded no-args path as well, or to document in the schema description that the no-args path is uncapped. Since the PR description says "backward compatible", keeping the current behavior is acceptable, but a schema description clarification would prevent confusion: # Schema description update
"Returns all visible datacells if omitted (uncapped). Callers that "
"need a bounded payload should always supply `limit`."2. The unbounded
|
| Severity | Count |
|---|---|
| Medium | 2 |
| Low | 3 |
| Nit | 1 |
None of the issues are blockers. The asymmetric no-args behavior is the most likely to cause a future bug. Everything else is documentation or follow-up work. The core implementation is correct and the test suite gives good confidence in the resolver logic.
Summary
Resolves Issue #1204 by implementing server-side pagination for the
ExtractGridEmbedcomponent. ThefullDatacellListGraphQL field now accepts optionallimitandoffsetarguments to bound the payload size, and a newdatacellCountfield reports the total visible cells so the UI can display "showing N of M" indicators when truncated.Key Changes
GraphQL Schema & Backend
limitandoffsetarguments toExtractType.fullDatacellListwith deterministic ordering (document_id,column_id,id) to ensure stable paginationdatacellCountfield toExtractTypeto report total visible datacells regardless of pagination boundsresolve_full_datacell_list()andresolve_datacell_count()resolvers with proper slicing logicFrontend Component
GET_EXTRACT_GRID_EMBEDquery to passlimit: EXTRACT_GRID_EMBED_CELL_LIMIT(500 cells) to bound server payloadOverflowFooterstyled component that displays "showing N of M" messages for both:EXTRACT_GRID_EMBED_MAX_ROWS(200) as a pure display bound while keeping the fetched payload bounded server-sideConstants & Configuration
EXTRACT_GRID_EMBED_CELL_LIMIT = 500constant for server-side payload capEXTRACT_GRID_EMBED_MAX_ROWSdocumentation to clarify it's now a display-only boundTests & Mocks
test_full_datacell_list_limit_offset_and_count) covering pagination, offset, and count behaviorMOCK_VARSwith explicitlimitparameterpartialExtractMockto test server-side truncation scenario (800 total cells, 4 returned)Implementation Details
datacellCountis fetched separately and always reports the full total, enabling accurate "N of M" indicatorsfetchedCellCount < totalCellCount(payload bounded bylimit)rows.length > EXTRACT_GRID_EMBED_MAX_ROWS(display clipping for safety)limit/offsetreturns all visible cells as beforehttps://claude.ai/code/session_018GC4vMGMVWCvLHiJMYVxRf