
fix(api): add missing error handling in Notion table rows extractor#34553

Open
lawrence3699 wants to merge 3 commits into langgenius:main from lawrence3699:fix/notion-table-rows-error-handling

Conversation

@lawrence3699

I noticed that _read_table_rows in notion_extractor.py is the only method in the Notion extractor that doesn't validate the HTTP response before processing. Its sibling methods _get_notion_block_data (lines 171-188) and _read_block (line 245) both check the status code and validate the response structure, but _read_table_rows goes straight to data["results"][0] without any checks.

This crashes in two situations:

  1. Notion API returns an error (401 expired token, 429 rate limit, 500 server error): The response body has no "results" key, so data["results"] raises KeyError.

  2. Empty table results: If the results array is empty, data["results"][0] raises IndexError.

Both are easy to reproduce:

# Error response (expired token):
data = {"object": "error", "status": 401, "code": "unauthorized"}
data["results"][0]  # → KeyError: 'results'

# Empty results:
data = {"results": [], "next_cursor": None}
data["results"][0]  # → IndexError: list index out of range

The fix adds the same validation pattern already used by _get_notion_block_data:

  • HTTP status code check with ValueError on non-200
  • httpx.HTTPError catch for network failures
  • "results" key and empty-list check before indexing

Tests added: Four test cases covering HTTP error, network error, empty results, and error-in-200-body scenarios.

_read_table_rows did not validate the HTTP response status or check
for the presence of "results" in the response body, unlike its sibling
methods _get_notion_block_data and _read_block which both do.

When the Notion API returns an error (401/403/429/500) or an empty
result set, the method crashed with KeyError or IndexError instead of
raising a clear ValueError.

Added status code check, httpx.HTTPError handling, and results
validation matching the pattern in _get_notion_block_data (lines
171-188). Added four regression tests.
Copilot AI review requested due to automatic review settings April 5, 2026 00:06
@dosubot dosubot Bot added the size:M This PR changes 30-99 lines, ignoring generated files. label Apr 5, 2026
Contributor

Copilot AI left a comment


Pull request overview

This PR improves the Notion extractor’s robustness by adding HTTP/network error handling and response-shape validation to _read_table_rows, preventing crashes on non-200 responses and empty result sets.

Changes:

  • Add httpx.HTTPError handling and non-200 status checks in NotionExtractor._read_table_rows.
  • Guard against missing/invalid/empty "results" before indexing table header rows.
  • Add unit tests covering HTTP error, network error, empty results, and “error object in 200 body” scenarios.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

  • api/core/rag/extractor/notion_extractor.py — Adds status-code/network handling and "results" validation to _read_table_rows.
  • api/tests/unit_tests/core/rag/extractor/test_notion_extractor.py — Adds regression tests for the new _read_table_rows error handling paths.


data = res.json()
except httpx.HTTPError as e:
raise ValueError("Error fetching Notion table rows") from e
if "results" not in data or not isinstance(data["results"], list) or len(data["results"]) == 0:

Copilot AI Apr 5, 2026


The response-shape validation currently breaks the pagination loop when results is missing / not a list. That can silently return partial table output (or an empty string) on malformed 200 responses (e.g., {object: "error", ...}), which differs from _get_notion_block_data (it raises on missing/non-list results). Consider raising a ValueError for unexpected response shapes and only returning "" for the legitimate empty-table case (results == []).

Suggested change
if "results" not in data or not isinstance(data["results"], list) or len(data["results"]) == 0:
if "results" not in data or not isinstance(data["results"], list):
raise ValueError("Unexpected Notion table rows response shape")
if len(data["results"]) == 0:
done = True

Comment on lines +295 to +307
res = httpx.request(
"GET",
block_url,
headers={
"Authorization": "Bearer " + self._notion_access_token,
"Content-Type": "application/json",
"Notion-Version": "2022-06-28",
},
params=query_dict,
)
if res.status_code != 200:
raise ValueError(f"Error fetching Notion table rows: {res.text}")
data = res.json()

Copilot AI Apr 5, 2026


PR description mentions _read_block also checks HTTP status codes, but in the current implementation _read_block does not validate res.status_code or catch httpx.HTTPError (it immediately calls res.json()). Either update the PR description to reflect the current behavior, or consider aligning _read_block with the new error-handling pattern for consistency (could be a follow-up if out of scope here).

Comment on lines +410 to +420
def test_read_table_rows_returns_empty_on_missing_results(self, mocker: MockerFixture):
extractor = notion_extractor.NotionExtractor(
notion_workspace_id="ws",
notion_obj_id="obj",
notion_page_type="page",
tenant_id="tenant",
notion_access_token="token",
)

mocker.patch("httpx.request", return_value=_mock_response({"results": [], "next_cursor": None}))
assert extractor._read_table_rows("tbl-1") == ""

Copilot AI Apr 5, 2026


Test name says "missing_results" but the mocked response uses {"results": [], ...} (empty results rather than a missing results key). Renaming this test (or adjusting the mocked payload) would make the intent clearer and avoid conflating two distinct scenarios.
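The two scenarios this comment distinguishes behave differently under plain dict access, which is why conflating them in a test name matters. The payloads below are illustrative:

```python
# Two distinct response shapes the test name conflates:
empty_results = {"results": [], "next_cursor": None}   # key present, list empty
missing_results = {"object": "error", "status": 401}   # key absent entirely

assert empty_results.get("results") == []        # empty table: safe to handle
assert missing_results.get("results") is None    # error body: no "results" key

# Direct indexing fails differently in each case:
# empty_results["results"][0]   raises IndexError
# missing_results["results"]    raises KeyError
```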

@github-actions
Contributor

github-actions Bot commented Apr 5, 2026

Pyrefly Diff

base → PR
--- /tmp/pyrefly_base.txt	2026-04-05 10:53:25.021505173 +0000
+++ /tmp/pyrefly_pr.txt	2026-04-05 10:53:15.149317494 +0000
@@ -274,9 +274,9 @@
 ERROR Argument `dict[str, bytes | str]` is not assignable to parameter `headers` with type `Headers | Mapping[bytes, bytes] | Mapping[str, str] | Sequence[tuple[bytes, bytes]] | Sequence[tuple[str, str]] | None` in function `httpx._api.request` [bad-argument-type]
    --> core/rag/extractor/notion_extractor.py:237:25
 ERROR Argument `dict[str, bytes | str]` is not assignable to parameter `headers` with type `Headers | Mapping[bytes, bytes] | Mapping[str, str] | Sequence[tuple[bytes, bytes]] | Sequence[tuple[str, str]] | None` in function `httpx._api.request` [bad-argument-type]
-   --> core/rag/extractor/notion_extractor.py:297:25
+   --> core/rag/extractor/notion_extractor.py:298:29
 ERROR Argument `dict[str, bytes | str]` is not assignable to parameter `headers` with type `Headers | Mapping[bytes, bytes] | Mapping[str, str] | Sequence[tuple[bytes, bytes]] | Sequence[tuple[str, str]] | None` in function `httpx._api.request` [bad-argument-type]
-   --> core/rag/extractor/notion_extractor.py:371:21
+   --> core/rag/extractor/notion_extractor.py:378:21
 ERROR Argument `Unknown | None` is not assignable to parameter `result_object` with type `dict[str, Any]` in function `WaterCrawlProvider._structure_data` [bad-argument-type]
    --> core/rag/extractor/watercrawl/provider.py:108:37
 ERROR Object of class `BaseOxmlElement` has no attribute `body` [missing-attribute]
@@ -4897,7 +4897,7 @@
 ERROR Object of class `NoneType` has no attribute `strip` [missing-attribute]
   --> tests/unit_tests/core/rag/extractor/test_markdown_extractor.py:30:16
 ERROR Argument `SimpleNamespace` is not assignable to parameter `document_model` with type `Document | None` in function `core.rag.extractor.notion_extractor.NotionExtractor.update_last_edited_time` [bad-argument-type]
-   --> tests/unit_tests/core/rag/extractor/test_notion_extractor.py:435:43
+   --> tests/unit_tests/core/rag/extractor/test_notion_extractor.py:487:43
 ERROR Type `object` is not iterable [not-iterable]
   --> tests/unit_tests/core/rag/extractor/test_word_extractor.py:90:9
 ERROR `Any | None` is not assignable to attribute `FILES_URL` with type `str` [bad-assignment]
@@ -6519,25 +6519,25 @@
 ERROR Argument `SimpleNamespace` is not assignable to parameter `dataset` with type `Dataset` in function `services.dataset_service.SegmentService.delete_child_chunk` [bad-argument-type]
    --> tests/unit_tests/services/test_dataset_service_segment.py:206:64
 ERROR Argument `list[Unknown] | None` is not assignable to parameter `obj` with type `Sized` in function `len` [bad-argument-type]
-   --> tests/unit_tests/services/test_dataset_service_segment.py:466:20
+   --> tests/unit_tests/services/test_dataset_service_segment.py:469:20
 ERROR Type `None` is not iterable [not-iterable]
-   --> tests/unit_tests/services/test_dataset_service_segment.py:467:49
+   --> tests/unit_tests/services/test_dataset_service_segment.py:470:49
 ERROR Type `None` is not iterable [not-iterable]
-   --> tests/unit_tests/services/test_dataset_service_segment.py:468:47
+   --> tests/unit_tests/services/test_dataset_service_segment.py:471:47
 ERROR Type `None` is not iterable [not-iterable]
-   --> tests/unit_tests/services/test_dataset_service_segment.py:469:61
+   --> tests/unit_tests/services/test_dataset_service_segment.py:472:61
 ERROR Type `None` is not iterable [not-iterable]
-   --> tests/unit_tests/services/test_dataset_service_segment.py:470:60
+   --> tests/unit_tests/services/test_dataset_service_segment.py:473:60
 ERROR Type `None` is not iterable [not-iterable]
-   --> tests/unit_tests/services/test_dataset_service_segment.py:471:68
+   --> tests/unit_tests/services/test_dataset_service_segment.py:474:68
 ERROR Argument `SimpleNamespace` is not assignable to parameter `segment` with type `DocumentSegment` in function `services.dataset_service.SegmentService.update_child_chunk` [bad-argument-type]
-   --> tests/unit_tests/services/test_dataset_service_segment.py:813:49
+   --> tests/unit_tests/services/test_dataset_service_segment.py:826:49
 ERROR Argument `SimpleNamespace` is not assignable to parameter `document` with type `Document` in function `services.dataset_service.SegmentService.update_child_chunk` [bad-argument-type]
-   --> tests/unit_tests/services/test_dataset_service_segment.py:813:68
+   --> tests/unit_tests/services/test_dataset_service_segment.py:826:68
 ERROR Argument `SimpleNamespace` is not assignable to parameter `dataset` with type `Dataset` in function `services.dataset_service.SegmentService.update_child_chunk` [bad-argument-type]
-   --> tests/unit_tests/services/test_dataset_service_segment.py:813:87
+   --> tests/unit_tests/services/test_dataset_service_segment.py:826:87
 ERROR Argument `SimpleNamespace` is not assignable to parameter `dataset` with type `Dataset` in function `services.dataset_service.SegmentService.delete_child_chunk` [bad-argument-type]
-   --> tests/unit_tests/services/test_dataset_service_segment.py:827:60
+   --> tests/unit_tests/services/test_dataset_service_segment.py:840:60
 ERROR Argument `Literal['']` is not assignable to parameter `plan` with type `CloudPlan` in function `DocumentIndexingTaskProxyTestDataFactory.create_mock_features` [bad-argument-type]
    --> tests/unit_tests/services/test_document_indexing_task_proxy.py:260:114
 ERROR Argument `None` is not assignable to parameter `plan` with type `CloudPlan` in function `DocumentIndexingTaskProxyTestDataFactory.create_mock_features` [bad-argument-type]

…ion_provider

The existing mocks used Mock(json=lambda: data) without setting status_code,
which caused the new HTTP status check to trigger on Mock objects.
@lawrence3699
Author

Pushed a fix for the CI failures in test_notion_provider.py. The three existing _read_table_rows tests used Mock(json=lambda: data) without setting status_code, so the new HTTP status check was triggering on the mock objects. Updated the mocks to include status_code = 200.

The other failing tests in the CI run (test_pause_resume, test_knowledge_retrieval_node, etc.) are unrelated to this PR — they appear to be pre-existing failures on the base branch.
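The mock change described in the comment above can be sketched as follows. The payload is illustrative, but the Mock behavior is standard unittest.mock: an attribute that was never set is itself a Mock, so comparing it to 200 with `!=` is truthy and the new status check trips.

```python
from unittest.mock import Mock

# Illustrative table-rows payload; the real tests use their own fixtures.
data = {"results": [{"table_row": {"cells": []}}], "next_cursor": None}

# Before: status_code was never set, so it is an auto-created Mock and the
# new `res.status_code != 200` check raises on a healthy fake response.
old_mock = Mock(json=lambda: data)
assert old_mock.status_code != 200

# After: setting status_code=200 lets the new validation pass.
new_mock = Mock(json=lambda: data, status_code=200)
assert new_mock.status_code == 200
assert new_mock.json() == data
```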

@lawrence3699
Author

Updated the test fixes — found one more mock in TestNotionExtractorTableAdvanced::test_read_table_rows_with_many_columns that was also missing status_code = 200. All 4 failing tests should now be resolved. The new CI runs are waiting for maintainer authorization to execute.


Labels

size:M This PR changes 30-99 lines, ignoring generated files.
