
fix(api): add missing error handling in Notion table rows extractor#34553

Open
lawrence3699 wants to merge 3 commits into langgenius:main from lawrence3699:fix/notion-table-rows-error-handling

Conversation

@lawrence3699

I noticed that _read_table_rows in notion_extractor.py is the only method in the Notion extractor that doesn't validate the HTTP response before processing. Its sibling methods _get_notion_block_data (lines 171-188) and _read_block (line 245) both check the status code and validate the response structure, but _read_table_rows goes straight to data["results"][0] without any checks.

This crashes in two situations:

  1. Notion API returns an error (401 expired token, 429 rate limit, 500 server error): The response body has no "results" key, so data["results"] raises KeyError.

  2. Empty table results: If the results array is empty, data["results"][0] raises IndexError.

Both are easy to reproduce:

# Error response (expired token):
data = {"object": "error", "status": 401, "code": "unauthorized"}
data["results"][0]  # → KeyError: 'results'

# Empty results:
data = {"results": [], "next_cursor": None}
data["results"][0]  # → IndexError: list index out of range

The fix adds the same validation pattern already used by _get_notion_block_data:

  • HTTP status code check with ValueError on non-200
  • httpx.HTTPError catch for network failures
  • "results" key and empty-list check before indexing

Tests added: Four test cases covering HTTP error, network error, empty results, and error-in-200-body scenarios.

_read_table_rows did not validate the HTTP response status or check
for the presence of "results" in the response body, unlike its sibling
methods _get_notion_block_data and _read_block which both do.

When the Notion API returns an error (401/403/429/500) or an empty
result set, the method crashed with KeyError or IndexError instead of
raising a clear ValueError.

Added status code check, httpx.HTTPError handling, and results
validation matching the pattern in _get_notion_block_data (lines
171-188). Added four regression tests.
Copilot AI review requested due to automatic review settings April 5, 2026 00:06
@dosubot dosubot Bot added the size:M This PR changes 30-99 lines, ignoring generated files. label Apr 5, 2026
Contributor

Copilot AI left a comment


Pull request overview

This PR improves the Notion extractor’s robustness by adding HTTP/network error handling and response-shape validation to _read_table_rows, preventing crashes on non-200 responses and empty result sets.

Changes:

  • Add httpx.HTTPError handling and non-200 status checks in NotionExtractor._read_table_rows.
  • Guard against missing/invalid/empty "results" before indexing table header rows.
  • Add unit tests covering HTTP error, network error, empty results, and “error object in 200 body” scenarios.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

  • api/core/rag/extractor/notion_extractor.py — Adds status-code/network handling and "results" validation to _read_table_rows.
  • api/tests/unit_tests/core/rag/extractor/test_notion_extractor.py — Adds regression tests for the new _read_table_rows error handling paths.


data = res.json()
except httpx.HTTPError as e:
raise ValueError("Error fetching Notion table rows") from e
if "results" not in data or not isinstance(data["results"], list) or len(data["results"]) == 0:

Copilot AI Apr 5, 2026


The response-shape validation currently breaks the pagination loop when results is missing / not a list. That can silently return partial table output (or an empty string) on malformed 200 responses (e.g., {object: "error", ...}), which differs from _get_notion_block_data (it raises on missing/non-list results). Consider raising a ValueError for unexpected response shapes and only returning "" for the legitimate empty-table case (results == []).

Suggested change
if "results" not in data or not isinstance(data["results"], list) or len(data["results"]) == 0:
if "results" not in data or not isinstance(data["results"], list):
raise ValueError("Unexpected Notion table rows response shape")
if len(data["results"]) == 0:
done = True

Comment on lines +295 to +307
res = httpx.request(
"GET",
block_url,
headers={
"Authorization": "Bearer " + self._notion_access_token,
"Content-Type": "application/json",
"Notion-Version": "2022-06-28",
},
params=query_dict,
)
if res.status_code != 200:
raise ValueError(f"Error fetching Notion table rows: {res.text}")
data = res.json()

Copilot AI Apr 5, 2026


PR description mentions _read_block also checks HTTP status codes, but in the current implementation _read_block does not validate res.status_code or catch httpx.HTTPError (it immediately calls res.json()). Either update the PR description to reflect the current behavior, or consider aligning _read_block with the new error-handling pattern for consistency (could be a follow-up if out of scope here).

Comment on lines +410 to +420
def test_read_table_rows_returns_empty_on_missing_results(self, mocker: MockerFixture):
extractor = notion_extractor.NotionExtractor(
notion_workspace_id="ws",
notion_obj_id="obj",
notion_page_type="page",
tenant_id="tenant",
notion_access_token="token",
)

mocker.patch("httpx.request", return_value=_mock_response({"results": [], "next_cursor": None}))
assert extractor._read_table_rows("tbl-1") == ""

Copilot AI Apr 5, 2026


Test name says "missing_results" but the mocked response uses {"results": [], ...} (empty results rather than a missing results key). Renaming this test (or adjusting the mocked payload) would make the intent clearer and avoid conflating two distinct scenarios.
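The two scenarios this comment distinguishes behave differently under plain dict access, which is why conflating them in a test name matters. The payloads below are illustrative:

```python
# Two distinct response shapes the test name conflates:
empty_results = {"results": [], "next_cursor": None}   # key present, list empty
missing_results = {"object": "error", "status": 401}   # key absent entirely

assert empty_results.get("results") == []        # empty table: safe to handle
assert missing_results.get("results") is None    # error body: no "results" key

# Direct indexing fails differently in each case:
# empty_results["results"][0]   raises IndexError
# missing_results["results"]    raises KeyError
```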

@github-actions
Contributor

github-actions Bot commented Apr 5, 2026

Pyrefly Diff

base → PR
--- /tmp/pyrefly_base.txt	2026-04-05 10:53:25.021505173 +0000
+++ /tmp/pyrefly_pr.txt	2026-04-05 10:53:15.149317494 +0000
@@ -274,9 +274,9 @@
 ERROR Argument `dict[str, bytes | str]` is not assignable to parameter `headers` with type `Headers | Mapping[bytes, bytes] | Mapping[str, str] | Sequence[tuple[bytes, bytes]] | Sequence[tuple[str, str]] | None` in function `httpx._api.request` [bad-argument-type]
    --> core/rag/extractor/notion_extractor.py:237:25
 ERROR Argument `dict[str, bytes | str]` is not assignable to parameter `headers` with type `Headers | Mapping[bytes, bytes] | Mapping[str, str] | Sequence[tuple[bytes, bytes]] | Sequence[tuple[str, str]] | None` in function `httpx._api.request` [bad-argument-type]
-   --> core/rag/extractor/notion_extractor.py:297:25
+   --> core/rag/extractor/notion_extractor.py:298:29
 ERROR Argument `dict[str, bytes | str]` is not assignable to parameter `headers` with type `Headers | Mapping[bytes, bytes] | Mapping[str, str] | Sequence[tuple[bytes, bytes]] | Sequence[tuple[str, str]] | None` in function `httpx._api.request` [bad-argument-type]
-   --> core/rag/extractor/notion_extractor.py:371:21
+   --> core/rag/extractor/notion_extractor.py:378:21
 ERROR Argument `Unknown | None` is not assignable to parameter `result_object` with type `dict[str, Any]` in function `WaterCrawlProvider._structure_data` [bad-argument-type]
    --> core/rag/extractor/watercrawl/provider.py:108:37
 ERROR Object of class `BaseOxmlElement` has no attribute `body` [missing-attribute]
@@ -4897,7 +4897,7 @@
 ERROR Object of class `NoneType` has no attribute `strip` [missing-attribute]
   --> tests/unit_tests/core/rag/extractor/test_markdown_extractor.py:30:16
 ERROR Argument `SimpleNamespace` is not assignable to parameter `document_model` with type `Document | None` in function `core.rag.extractor.notion_extractor.NotionExtractor.update_last_edited_time` [bad-argument-type]
-   --> tests/unit_tests/core/rag/extractor/test_notion_extractor.py:435:43
+   --> tests/unit_tests/core/rag/extractor/test_notion_extractor.py:487:43
 ERROR Type `object` is not iterable [not-iterable]
   --> tests/unit_tests/core/rag/extractor/test_word_extractor.py:90:9
 ERROR `Any | None` is not assignable to attribute `FILES_URL` with type `str` [bad-assignment]
@@ -6519,25 +6519,25 @@
 ERROR Argument `SimpleNamespace` is not assignable to parameter `dataset` with type `Dataset` in function `services.dataset_service.SegmentService.delete_child_chunk` [bad-argument-type]
    --> tests/unit_tests/services/test_dataset_service_segment.py:206:64
 ERROR Argument `list[Unknown] | None` is not assignable to parameter `obj` with type `Sized` in function `len` [bad-argument-type]
-   --> tests/unit_tests/services/test_dataset_service_segment.py:466:20
+   --> tests/unit_tests/services/test_dataset_service_segment.py:469:20
 ERROR Type `None` is not iterable [not-iterable]
-   --> tests/unit_tests/services/test_dataset_service_segment.py:467:49
+   --> tests/unit_tests/services/test_dataset_service_segment.py:470:49
 ERROR Type `None` is not iterable [not-iterable]
-   --> tests/unit_tests/services/test_dataset_service_segment.py:468:47
+   --> tests/unit_tests/services/test_dataset_service_segment.py:471:47
 ERROR Type `None` is not iterable [not-iterable]
-   --> tests/unit_tests/services/test_dataset_service_segment.py:469:61
+   --> tests/unit_tests/services/test_dataset_service_segment.py:472:61
 ERROR Type `None` is not iterable [not-iterable]
-   --> tests/unit_tests/services/test_dataset_service_segment.py:470:60
+   --> tests/unit_tests/services/test_dataset_service_segment.py:473:60
 ERROR Type `None` is not iterable [not-iterable]
-   --> tests/unit_tests/services/test_dataset_service_segment.py:471:68
+   --> tests/unit_tests/services/test_dataset_service_segment.py:474:68
 ERROR Argument `SimpleNamespace` is not assignable to parameter `segment` with type `DocumentSegment` in function `services.dataset_service.SegmentService.update_child_chunk` [bad-argument-type]
-   --> tests/unit_tests/services/test_dataset_service_segment.py:813:49
+   --> tests/unit_tests/services/test_dataset_service_segment.py:826:49
 ERROR Argument `SimpleNamespace` is not assignable to parameter `document` with type `Document` in function `services.dataset_service.SegmentService.update_child_chunk` [bad-argument-type]
-   --> tests/unit_tests/services/test_dataset_service_segment.py:813:68
+   --> tests/unit_tests/services/test_dataset_service_segment.py:826:68
 ERROR Argument `SimpleNamespace` is not assignable to parameter `dataset` with type `Dataset` in function `services.dataset_service.SegmentService.update_child_chunk` [bad-argument-type]
-   --> tests/unit_tests/services/test_dataset_service_segment.py:813:87
+   --> tests/unit_tests/services/test_dataset_service_segment.py:826:87
 ERROR Argument `SimpleNamespace` is not assignable to parameter `dataset` with type `Dataset` in function `services.dataset_service.SegmentService.delete_child_chunk` [bad-argument-type]
-   --> tests/unit_tests/services/test_dataset_service_segment.py:827:60
+   --> tests/unit_tests/services/test_dataset_service_segment.py:840:60
 ERROR Argument `Literal['']` is not assignable to parameter `plan` with type `CloudPlan` in function `DocumentIndexingTaskProxyTestDataFactory.create_mock_features` [bad-argument-type]
    --> tests/unit_tests/services/test_document_indexing_task_proxy.py:260:114
 ERROR Argument `None` is not assignable to parameter `plan` with type `CloudPlan` in function `DocumentIndexingTaskProxyTestDataFactory.create_mock_features` [bad-argument-type]

…ion_provider

The existing mocks used Mock(json=lambda: data) without setting status_code,
which caused the new HTTP status check to trigger on Mock objects.
@lawrence3699
Author

Pushed a fix for the CI failures in test_notion_provider.py. The three existing _read_table_rows tests used Mock(json=lambda: data) without setting status_code, so the new HTTP status check was triggering on the mock objects. Updated the mocks to include status_code = 200.

The other failing tests in the CI run (test_pause_resume, test_knowledge_retrieval_node, etc.) are unrelated to this PR — they appear to be pre-existing failures on the base branch.
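The mock change described in the comment above can be sketched as follows. The payload is illustrative, but the Mock behavior is standard unittest.mock: an attribute that was never set is itself a Mock, so comparing it to 200 with `!=` is truthy and the new status check trips.

```python
from unittest.mock import Mock

# Illustrative table-rows payload; the real tests use their own fixtures.
data = {"results": [{"table_row": {"cells": []}}], "next_cursor": None}

# Before: status_code was never set, so it is an auto-created Mock and the
# new `res.status_code != 200` check raises on a healthy fake response.
old_mock = Mock(json=lambda: data)
assert old_mock.status_code != 200

# After: setting status_code=200 lets the new validation pass.
new_mock = Mock(json=lambda: data, status_code=200)
assert new_mock.status_code == 200
assert new_mock.json() == data
```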

@lawrence3699
Author

Updated the test fixes — found one more mock in TestNotionExtractorTableAdvanced::test_read_table_rows_with_many_columns that was also missing status_code = 200. All 4 failing tests should now be resolved. The new CI runs are waiting for maintainer authorization to execute.


Labels

size:M This PR changes 30-99 lines, ignoring generated files.
