fix(api): add missing error handling in Notion table rows extractor #34553

lawrence3699 wants to merge 3 commits into langgenius:main from …
Conversation
`_read_table_rows` did not validate the HTTP response status or check for the presence of `"results"` in the response body, unlike its sibling methods `_get_notion_block_data` and `_read_block`, which both do. When the Notion API returns an error (401/403/429/500) or an empty result set, the method crashed with `KeyError` or `IndexError` instead of raising a clear `ValueError`. Added a status code check, `httpx.HTTPError` handling, and results validation matching the pattern in `_get_notion_block_data` (lines 171-188). Added four regression tests.
Pull request overview
This PR improves the Notion extractor’s robustness by adding HTTP/network error handling and response-shape validation to _read_table_rows, preventing crashes on non-200 responses and empty result sets.
Changes:
- Add `httpx.HTTPError` handling and non-200 status checks in `NotionExtractor._read_table_rows`.
- Guard against missing/invalid/empty `"results"` before indexing table header rows.
- Add unit tests covering HTTP error, network error, empty results, and "error object in 200 body" scenarios.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| `api/core/rag/extractor/notion_extractor.py` | Adds status-code/network handling and `"results"` validation to `_read_table_rows`. |
| `api/tests/unit_tests/core/rag/extractor/test_notion_extractor.py` | Adds regression tests for the new `_read_table_rows` error handling paths. |
```python
        data = res.json()
    except httpx.HTTPError as e:
        raise ValueError("Error fetching Notion table rows") from e
    if "results" not in data or not isinstance(data["results"], list) or len(data["results"]) == 0:
```
The response-shape validation currently breaks the pagination loop when results is missing / not a list. That can silently return partial table output (or an empty string) on malformed 200 responses (e.g., {object: "error", ...}), which differs from _get_notion_block_data (it raises on missing/non-list results). Consider raising a ValueError for unexpected response shapes and only returning "" for the legitimate empty-table case (results == []).
Suggested change:

```diff
-if "results" not in data or not isinstance(data["results"], list) or len(data["results"]) == 0:
+if "results" not in data or not isinstance(data["results"], list):
+    raise ValueError("Unexpected Notion table rows response shape")
+if len(data["results"]) == 0:
+    done = True
```
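Pulled out as a standalone helper, the distinction the suggestion draws (malformed response shape vs. a legitimately empty table) can be sketched as follows; note the PR inlines these checks rather than adding a helper, so the function name here is hypothetical:

```python
def validate_table_rows_response(data: dict) -> list:
    """Return the rows list, raising ValueError on a malformed body.

    A missing or non-list "results" is an error (e.g. an
    {"object": "error", ...} body returned with HTTP 200); an empty
    list is a legitimate empty table and is returned as-is so the
    caller can finish pagination and return "".
    """
    if "results" not in data or not isinstance(data["results"], list):
        raise ValueError("Unexpected Notion table rows response shape")
    return data["results"]  # may legitimately be []
```

With this split, `validate_table_rows_response({"results": []})` returns `[]`, while an error object in a 200 body raises instead of silently producing partial output.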
```python
res = httpx.request(
    "GET",
    block_url,
    headers={
        "Authorization": "Bearer " + self._notion_access_token,
        "Content-Type": "application/json",
        "Notion-Version": "2022-06-28",
    },
    params=query_dict,
)
if res.status_code != 200:
    raise ValueError(f"Error fetching Notion table rows: {res.text}")
data = res.json()
```
PR description mentions _read_block also checks HTTP status codes, but in the current implementation _read_block does not validate res.status_code or catch httpx.HTTPError (it immediately calls res.json()). Either update the PR description to reflect the current behavior, or consider aligning _read_block with the new error-handling pattern for consistency (could be a follow-up if out of scope here).
```python
def test_read_table_rows_returns_empty_on_missing_results(self, mocker: MockerFixture):
    extractor = notion_extractor.NotionExtractor(
        notion_workspace_id="ws",
        notion_obj_id="obj",
        notion_page_type="page",
        tenant_id="tenant",
        notion_access_token="token",
    )

    mocker.patch("httpx.request", return_value=_mock_response({"results": [], "next_cursor": None}))
    assert extractor._read_table_rows("tbl-1") == ""
```
Test name says "missing_results" but the mocked response uses {"results": [], ...} (empty results rather than a missing results key). Renaming this test (or adjusting the mocked payload) would make the intent clearer and avoid conflating two distinct scenarios.
Pyrefly Diff (base → PR):

--- /tmp/pyrefly_base.txt 2026-04-05 10:53:25.021505173 +0000
+++ /tmp/pyrefly_pr.txt 2026-04-05 10:53:15.149317494 +0000
@@ -274,9 +274,9 @@
ERROR Argument `dict[str, bytes | str]` is not assignable to parameter `headers` with type `Headers | Mapping[bytes, bytes] | Mapping[str, str] | Sequence[tuple[bytes, bytes]] | Sequence[tuple[str, str]] | None` in function `httpx._api.request` [bad-argument-type]
--> core/rag/extractor/notion_extractor.py:237:25
ERROR Argument `dict[str, bytes | str]` is not assignable to parameter `headers` with type `Headers | Mapping[bytes, bytes] | Mapping[str, str] | Sequence[tuple[bytes, bytes]] | Sequence[tuple[str, str]] | None` in function `httpx._api.request` [bad-argument-type]
- --> core/rag/extractor/notion_extractor.py:297:25
+ --> core/rag/extractor/notion_extractor.py:298:29
ERROR Argument `dict[str, bytes | str]` is not assignable to parameter `headers` with type `Headers | Mapping[bytes, bytes] | Mapping[str, str] | Sequence[tuple[bytes, bytes]] | Sequence[tuple[str, str]] | None` in function `httpx._api.request` [bad-argument-type]
- --> core/rag/extractor/notion_extractor.py:371:21
+ --> core/rag/extractor/notion_extractor.py:378:21
ERROR Argument `Unknown | None` is not assignable to parameter `result_object` with type `dict[str, Any]` in function `WaterCrawlProvider._structure_data` [bad-argument-type]
--> core/rag/extractor/watercrawl/provider.py:108:37
ERROR Object of class `BaseOxmlElement` has no attribute `body` [missing-attribute]
@@ -4897,7 +4897,7 @@
ERROR Object of class `NoneType` has no attribute `strip` [missing-attribute]
--> tests/unit_tests/core/rag/extractor/test_markdown_extractor.py:30:16
ERROR Argument `SimpleNamespace` is not assignable to parameter `document_model` with type `Document | None` in function `core.rag.extractor.notion_extractor.NotionExtractor.update_last_edited_time` [bad-argument-type]
- --> tests/unit_tests/core/rag/extractor/test_notion_extractor.py:435:43
+ --> tests/unit_tests/core/rag/extractor/test_notion_extractor.py:487:43
ERROR Type `object` is not iterable [not-iterable]
--> tests/unit_tests/core/rag/extractor/test_word_extractor.py:90:9
ERROR `Any | None` is not assignable to attribute `FILES_URL` with type `str` [bad-assignment]
@@ -6519,25 +6519,25 @@
ERROR Argument `SimpleNamespace` is not assignable to parameter `dataset` with type `Dataset` in function `services.dataset_service.SegmentService.delete_child_chunk` [bad-argument-type]
--> tests/unit_tests/services/test_dataset_service_segment.py:206:64
ERROR Argument `list[Unknown] | None` is not assignable to parameter `obj` with type `Sized` in function `len` [bad-argument-type]
- --> tests/unit_tests/services/test_dataset_service_segment.py:466:20
+ --> tests/unit_tests/services/test_dataset_service_segment.py:469:20
ERROR Type `None` is not iterable [not-iterable]
- --> tests/unit_tests/services/test_dataset_service_segment.py:467:49
+ --> tests/unit_tests/services/test_dataset_service_segment.py:470:49
ERROR Type `None` is not iterable [not-iterable]
- --> tests/unit_tests/services/test_dataset_service_segment.py:468:47
+ --> tests/unit_tests/services/test_dataset_service_segment.py:471:47
ERROR Type `None` is not iterable [not-iterable]
- --> tests/unit_tests/services/test_dataset_service_segment.py:469:61
+ --> tests/unit_tests/services/test_dataset_service_segment.py:472:61
ERROR Type `None` is not iterable [not-iterable]
- --> tests/unit_tests/services/test_dataset_service_segment.py:470:60
+ --> tests/unit_tests/services/test_dataset_service_segment.py:473:60
ERROR Type `None` is not iterable [not-iterable]
- --> tests/unit_tests/services/test_dataset_service_segment.py:471:68
+ --> tests/unit_tests/services/test_dataset_service_segment.py:474:68
ERROR Argument `SimpleNamespace` is not assignable to parameter `segment` with type `DocumentSegment` in function `services.dataset_service.SegmentService.update_child_chunk` [bad-argument-type]
- --> tests/unit_tests/services/test_dataset_service_segment.py:813:49
+ --> tests/unit_tests/services/test_dataset_service_segment.py:826:49
ERROR Argument `SimpleNamespace` is not assignable to parameter `document` with type `Document` in function `services.dataset_service.SegmentService.update_child_chunk` [bad-argument-type]
- --> tests/unit_tests/services/test_dataset_service_segment.py:813:68
+ --> tests/unit_tests/services/test_dataset_service_segment.py:826:68
ERROR Argument `SimpleNamespace` is not assignable to parameter `dataset` with type `Dataset` in function `services.dataset_service.SegmentService.update_child_chunk` [bad-argument-type]
- --> tests/unit_tests/services/test_dataset_service_segment.py:813:87
+ --> tests/unit_tests/services/test_dataset_service_segment.py:826:87
ERROR Argument `SimpleNamespace` is not assignable to parameter `dataset` with type `Dataset` in function `services.dataset_service.SegmentService.delete_child_chunk` [bad-argument-type]
- --> tests/unit_tests/services/test_dataset_service_segment.py:827:60
+ --> tests/unit_tests/services/test_dataset_service_segment.py:840:60
ERROR Argument `Literal['']` is not assignable to parameter `plan` with type `CloudPlan` in function `DocumentIndexingTaskProxyTestDataFactory.create_mock_features` [bad-argument-type]
--> tests/unit_tests/services/test_document_indexing_task_proxy.py:260:114
ERROR Argument `None` is not assignable to parameter `plan` with type `CloudPlan` in function `DocumentIndexingTaskProxyTestDataFactory.create_mock_features` [bad-argument-type]
…ion_provider — the existing mocks used `Mock(json=lambda: data)` without setting `status_code`, which caused the new HTTP status check to trigger on Mock objects.
Pushed a fix for the CI failures in … The other failing tests in the CI run (…)
Updated the test fixes — found one more mock in …
I noticed that `_read_table_rows` in `notion_extractor.py` is the only method in the Notion extractor that doesn't validate the HTTP response before processing. Its sibling methods `_get_notion_block_data` (lines 171-188) and `_read_block` (line 245) both check the status code and validate the response structure, but `_read_table_rows` goes straight to `data["results"][0]` without any checks.

This crashes in two situations:

- Notion API returns an error (401 expired token, 429 rate limit, 500 server error): the response body has no `"results"` key, so `data["results"]` raises `KeyError`.
- Empty table results: if the results array is empty, `data["results"][0]` raises `IndexError`.

Both are easy to reproduce.

The fix adds the same validation pattern already used by `_get_notion_block_data`:

- `ValueError` on non-200 status codes
- `httpx.HTTPError` catch for network failures
- `"results"` key and empty-list check before indexing

Tests added: four test cases covering HTTP error, network error, empty results, and error-in-200-body scenarios.
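The two crash modes described above can be reproduced on raw response bodies, independent of the extractor, since they are plain dict/list access errors. A sketch (the payloads are illustrative stand-ins, not captured Notion responses):

```python
# Pre-fix failure modes: unguarded data["results"][0] on two body shapes.
error_body = {"object": "error", "status": 401, "message": "API token is invalid."}
empty_body = {"results": [], "next_cursor": None}

def first_row(data: dict) -> dict:
    # Mirrors the old unguarded access pattern in _read_table_rows
    return data["results"][0]

try:
    first_row(error_body)
except KeyError:
    print("error body -> KeyError")     # no "results" key at all

try:
    first_row(empty_body)
except IndexError:
    print("empty table -> IndexError")  # results list has no rows
```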