
fix(ingestion): cap OM and Snowflake socket waits to stop silent ingestion hangs #28131

Open · harshach wants to merge 2 commits into main from harshach/snowflake-ingestion-hang


Conversation

@harshach (Collaborator)

Describe your changes:

A Snowflake metadata ingestion run was observed to pause silently in a hybrid-runner K8s pod: the pod stayed alive (no OOM, no crash), no further API calls hit the OpenMetadata server, and no further log lines appeared. Pod-side logs ended at a successful `Source.GET executed` HTTP call to the OM server; the next expected `Source executed` / `Sink.PATCH` lines never reached the server. Streamable-log shipping continued from its daemon thread for ~13 minutes (draining its buffer) and then went quiet too.

Root cause is structural: HTTP calls from the connector to the OM server, and the SQLAlchemy/Snowflake connection itself, both go out with no read timeout. `requests.Session()` and the Snowflake driver keep TCP connections in pools; intermediate network devices (K8s NAT, AWS NLB default 350s, Azure LB default 4 min) silently sever idle sockets, and the next call grabs a stale connection and blocks in `socket.recv` forever (Linux TCP keepalive default is 2 hours).

Three minimal changes:

  • `ingestion/src/metadata/ingestion/ometa/client.py`: default `ClientConfig.timeout` to `(10, 300)` (10s connect, 5min read) instead of `None`. The existing `if self._timeout` guard still honours an explicit `None` for callers that need it (sketched after this list).
  • `ingestion/src/metadata/ingestion/source/database/snowflake/connection.py`: `setdefault` `client_session_keep_alive=True` and `network_timeout=600` on the connection arguments. User-supplied `connectionArguments` still win.
  • `ingestion/setup.py`: add `py-spy>=0.3.14` to `base_requirements` so the next time a worker hangs we can `py-spy dump --pid <pid>` in place, without first having to install anything.
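
A minimal sketch of the guard behaviour the first bullet describes; the names here (`DEFAULT_TIMEOUT`, `get_with_guard`) are illustrative stand-ins, not the real client.py code:

```python
from typing import Tuple, Union

import requests

# 10s to connect, 300s to read -- the same shape requests accepts natively
DEFAULT_TIMEOUT: Union[int, Tuple[int, int], None] = (10, 300)

def get_with_guard(session: requests.Session, url: str, timeout=DEFAULT_TIMEOUT):
    kwargs = {}
    if timeout:  # an explicit None is falsy: no timeout kwarg, old wait-forever behaviour
        kwargs["timeout"] = timeout
    # A silently severed pooled socket now raises requests.exceptions.ReadTimeout
    # after the read timeout instead of blocking in socket.recv for hours.
    return session.get(url, **kwargs)
```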

Type of change:

  • Bug fix

High-level design:

N/A — small change.

Tests:

Use cases covered

  • Snowflake ingestion in a K8s pod whose outbound TCP idle timeout (NAT/LB) is shorter than tcp_keepalive_time: previously hung indefinitely on the next reuse of a half-open pooled socket; now surfaces as a requests.exceptions.ReadTimeout (caught by the existing except Exception in _one_request, logged, and the workflow continues) within 5 minutes.
  • Existing callers passing ClientConfig(timeout=None) retain the old "wait forever" behaviour (the if self._timeout guard at client.py:224 is unchanged).
  • Existing Snowflake users supplying client_session_keep_alive / network_timeout in connectionArguments continue to win via setdefault.

Unit tests

  • Not added: the change is a default-value tweak in existing fields; no logic branches were introduced. Pydantic instantiation of all four shapes (default, int, None, tuple) was verified locally; a spot-check is sketched below.
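
A spot-check along those lines, using a Pydantic stand-in rather than the real `ClientConfig`:

```python
from typing import Tuple, Union

from pydantic import BaseModel

class ClientConfig(BaseModel):  # stand-in with the assumed field type
    timeout: Union[int, Tuple[int, int], None] = (10, 300)

assert ClientConfig().timeout == (10, 300)               # default
assert ClientConfig(timeout=60).timeout == 60            # plain int
assert ClientConfig(timeout=None).timeout is None        # wait-forever opt-out
assert ClientConfig(timeout=(5, 30)).timeout == (5, 30)  # explicit (connect, read)
```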

Backend integration tests

  • Not applicable (no backend API changes).

Ingestion integration tests

  • Not applicable: change is a default-arg tweak with no new code paths.

Playwright (UI) tests

  • Not applicable (no UI changes).

Manual testing performed

  • AST-parsed both edited files.
  • Instantiated ClientConfig in a Pydantic stand-in for default (10, 300), int=60, None, and (5, 30) overrides — all accepted.
  • Confirmed SSE streaming uses its own httpx.Client(timeout=None) in sse_client.py:82 and is unaffected.

UI screen recording / screenshots:

Not applicable.

Checklist:

  • I have read the CONTRIBUTING document.
  • I have commented on my code, particularly in hard-to-understand areas.

…ed TCP connection no longer hangs the worker

A Snowflake metadata ingestion run was observed to silently pause: the
Airflow pod stayed alive (no OOM, no crash), no further API calls hit
the OpenMetadata server, no further log lines appeared. The pod-side
log ended at `Source.GET executed in 59.057ms` -- a successful HTTP
GET to the OM server -- with the next expected `Source executed` /
`Sink.PATCH` lines never reaching the server. Streamable-log shipping
continued from its daemon thread for ~13 minutes (draining its buffer)
and then went quiet too.

Root cause is a structural one: HTTP calls from the connector to the
OM server, and the SQLAlchemy/Snowflake connection itself, both go out
with no read timeout. `requests.Session()` and the Snowflake driver
keep their TCP connections alive in pools; intermediate network
devices (K8s NAT, ingress NLB, AWS NLB default 350s, Azure LB default
4 min) silently sever idle sockets. The next call grabs a stale
connection from the pool and blocks in `socket.recv` waiting for bytes
that will never arrive. Linux TCP keepalive defaults to 2 hours.

Three minimal changes:

- `ingestion/src/metadata/ingestion/ometa/client.py`: default
  `ClientConfig.timeout` to `(10, 300)` (10s connect, 5min read)
  instead of `None`. The existing `if self._timeout` guard still
  honours an explicit `None` override for callers that need it.
- `ingestion/src/metadata/ingestion/source/database/snowflake/connection.py`:
  `setdefault` `client_session_keep_alive=True` and
  `network_timeout=600` on the connection arguments. User-supplied
  values in `connectionArguments` still win (see the sketch below).
- `ingestion/setup.py`: add `py-spy>=0.3.14` to `base_requirements`
  so the next time a worker hangs we can `py-spy dump --pid <pid>` in
  place, without first having to install anything in the container.
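
The precedence rule in play, as a tiny standalone sketch (a plain dict stands in
for `connection.connectionArguments.root`; note the keep-alive default is
dropped again later in this PR):

```python
# User supplied network_timeout via connectionArguments but left keep-alive unset.
connection_args = {"network_timeout": 300}

# setdefault only fills missing keys, so user-supplied values always win.
connection_args.setdefault("client_session_keep_alive", True)
connection_args.setdefault("network_timeout", 600)

assert connection_args["network_timeout"] == 300             # user value kept
assert connection_args["client_session_keep_alive"] is True  # default applied
```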

After this, the same kind of dropped socket surfaces as a
`requests.exceptions.ReadTimeout` (caught by the existing
`except Exception` in `_one_request`, logged, and the workflow
continues) instead of wedging the pod indefinitely. SSE streaming
uses its own httpx client and is unaffected.
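
How that surfaces in practice, as a hedged sketch of the flow described above
(`one_request` is an illustrative stand-in; the real `_one_request` internals
are not shown in this PR text):

```python
import logging

import requests

logger = logging.getLogger(__name__)

def one_request(session: requests.Session, url: str):
    try:
        return session.get(url, timeout=(10, 300))
    except Exception as exc:  # broad catch, as the paragraph above describes
        # A severed socket lands here as requests.exceptions.ReadTimeout within
        # 5 minutes; it is logged and the workflow moves on instead of wedging.
        logger.warning("Request to %s failed: %s", url, exc)
        return None
```
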
Copilot AI review requested due to automatic review settings May 14, 2026 23:02
@harshach harshach requested a review from a team as a code owner May 14, 2026 23:02
@github-actions bot added the backend and safe to test labels on May 14, 2026
Comment thread on ingestion/src/metadata/ingestion/source/database/snowflake/connection.py (Outdated)
Comment on lines +205 to +211
```python
if keep_alive := self._get_client_session_keep_alive():
    connection.connectionArguments.root["client_session_keep_alive"] = keep_alive

# Safe defaults: prevent indefinite hangs when the Snowflake socket is
# silently severed (NAT/LB idle reaping in K8s/hybrid runners). Any
# value the user supplied via connectionArguments wins via setdefault.
connection.connectionArguments.root.setdefault("client_session_keep_alive", True)
```
Contributor:

There is a UI toggle for this. It is handled above.

Collaborator (Author):

Good catch — you are right. The UI toggle handles this and a setdefault(True) would silently override a user who unchecks it. Dropped in 20a72d0. The hang protection now relies only on network_timeout=600, which bounds any silently-severed Snowflake socket to a 10-minute ReadTimeout. Keep-alive (which only protects Snowflake-side session expiry) stays user-controlled via the UI toggle.

Copilot AI left a comment (Contributor)

Pull request overview

This PR addresses a real-world ingestion hang scenario caused by indefinite socket reads on (1) the OpenMetadata REST client and (2) Snowflake driver connections when pooled sockets are silently severed by NAT/LB idle timeouts. It introduces bounded timeouts/defaults to surface failures as timeouts rather than blocking forever, and adds an in-container profiling tool to simplify future debugging.

Changes:

  • Default OpenMetadata REST client timeout to (connect=10s, read=300s) while preserving the ability to explicitly disable timeouts via timeout=None.
  • Add Snowflake connection-argument defaults for client_session_keep_alive and network_timeout (with user-provided connectionArguments taking precedence via setdefault).
  • Add py-spy to ingestion base requirements to enable in-place stack sampling of hung workers.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| ingestion/src/metadata/ingestion/source/database/snowflake/connection.py | Adds safe default Snowflake driver args to reduce the chance of indefinite socket hangs. |
| ingestion/src/metadata/ingestion/ometa/client.py | Sets a bounded default requests timeout to avoid silent hangs on stale pooled HTTP connections. |
| ingestion/setup.py | Adds py-spy to base dependencies to support post-mortem sampling/debugging inside ingestion environments. |

Comment on lines +211 to +212

```python
connection.connectionArguments.root.setdefault("client_session_keep_alive", True)
connection.connectionArguments.root.setdefault("network_timeout", 600)
```

…spy to Dockerfile

- snowflake/connection.py: drop the setdefault("client_session_keep_alive",
  True). The IngestionPipeline UI already exposes a clientSessionKeepAlive
  toggle (handled at lines 205-206); imposing a default of True here would
  silently override a user who unchecks it. Hang protection is now covered
  by network_timeout=600 alone, which bounds any silently-severed Snowflake
  socket to a 10-minute ReadTimeout instead of hanging the worker forever.
  Keep-alive only matters for Snowflake-side session expiry, which is a
  separate concern from the original hang.

- setup.py + ingestion/Dockerfile: move py-spy from base_requirements into
  the ingestion Dockerfile. The original goal — "available in the running
  container without pip install" — is met without forcing a native Rust
  binary on dev laptops, CI, and non-container installs that won't have
  CAP_SYS_PTRACE anyway.
@gitar-bot commented May 14, 2026

Code Review ✅ Approved 2 resolved / 2 findings

Limits socket waits in OpenMetadata and Snowflake connectors to prevent indefinite ingestion hangs caused by stale TCP connections. Includes the addition of py-spy to production dependencies to facilitate future debugging.

✅ 2 resolved
Quality: py-spy is a debug tool added to base production dependencies

📄 ingestion/setup.py:157-159
Adding py-spy>=0.3.14 to base_requirements means every installation of the ingestion framework (including developer laptops, CI, non-container environments) pulls in this native profiling binary. py-spy also requires CAP_SYS_PTRACE or root to function, so it won't even work in most environments without additional setup. It should be in a dedicated optional extras group (e.g., [debug] or [ops]) and pre-installed only in the official container images via a Dockerfile layer, rather than forcing it on all users.
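
As a concrete shape for that suggestion, a hedged sketch (the extra name `debug` and the package name in the comment are illustrative assumptions, not the project's actual packaging):

```python
# In ingestion/setup.py: keep py-spy out of base_requirements and expose it
# via an optional extra instead; official images can still pre-install it.
extras_require = {
    "debug": ["py-spy>=0.3.14"],  # pip install "openmetadata-ingestion[debug]"
}
```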

Edge Case: setdefault for client_session_keep_alive is partially redundant

📄 ingestion/src/metadata/ingestion/source/database/snowflake/connection.py:205-212
On lines 205-206, if connection.clientSessionKeepAlive is truthy, the key is set via direct assignment (which overwrites any user-supplied value in connectionArguments). On line 211, setdefault only fires when the key is absent. This means:

  1. If the user supplied client_session_keep_alive in connectionArguments but did NOT set the dedicated clientSessionKeepAlive config field, the setdefault correctly preserves their value.
  2. If the user supplied it in connectionArguments AND set the dedicated field, line 206 overwrites their connectionArguments value (pre-existing behavior, not introduced here).

The logic works but the interaction between the explicit assignment (line 206) and setdefault (line 211) is subtle. A brief inline comment noting that line 206 intentionally takes priority over user-supplied connectionArguments would help future readers.
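
Applied to the excerpt from the thread above, that inline comment might read as follows (hypothetical placement; note the keep-alive setdefault itself was dropped later in this PR):

```python
if keep_alive := self._get_client_session_keep_alive():
    # The dedicated clientSessionKeepAlive field intentionally takes priority,
    # overwriting any user-supplied connectionArguments value (line 206).
    connection.connectionArguments.root["client_session_keep_alive"] = keep_alive

# setdefault fires only when the key is still absent, i.e. when neither the
# dedicated field above nor user-supplied connectionArguments provided it.
connection.connectionArguments.root.setdefault("client_session_keep_alive", True)
connection.connectionArguments.root.setdefault("network_timeout", 600)
```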


@github-actions (Contributor)

The Python checkstyle failed.

Please run make py_format and py_format_check in the root of your repository and commit the changes to this PR.
You can also use pre-commit to automate the Python code formatting.

You can install the pre-commit hooks with make install_test precommit_install.


@github-actions (Contributor)

🔴 Playwright Results — 1 failure(s), 24 flaky

✅ 4045 passed · ❌ 1 failed · 🟡 24 flaky · ⏭️ 103 skipped

| Shard | Passed | Failed | Flaky | Skipped |
| --- | --- | --- | --- | --- |
| 🔴 Shard 1 | 298 | 1 | 0 | 4 |
| 🟡 Shard 2 | 743 | 0 | 7 | 25 |
| 🟡 Shard 3 | 779 | 0 | 5 | 7 |
| 🟡 Shard 4 | 788 | 0 | 2 | 18 |
| 🟡 Shard 5 | 706 | 0 | 3 | 41 |
| 🟡 Shard 6 | 731 | 0 | 7 | 8 |

Genuine Failures (failed on all attempts)

Pages/AuditLogs.spec.ts › should apply both User and EntityType filters simultaneously (shard 1)
Test timeout of 60000ms exceeded.
🟡 24 flaky test(s) (passed on retry)
  • Features/BulkEditEntity.spec.ts › Glossary (shard 2, 1 retry)
  • Features/Glossary/GlossaryWorkflow.spec.ts › should display correct status badge color and icon (shard 2, 1 retry)
  • Features/Glossary/GlossaryWorkflow.spec.ts › should start term as Draft when glossary has reviewers (shard 2, 1 retry)
  • Features/KnowledgeCenterList.spec.ts › Knowledge Center List - Test infinite scroll/pagination (shard 2, 1 retry)
  • Features/KnowledgeCenterTextEditor.spec.ts › Rich Text Editor - Text Formatting (shard 2, 1 retry)
  • Features/KnowledgeCenterTextEditor.spec.ts › Rich Text Editor - Text Formatting (shard 2, 1 retry)
  • Features/KnowledgeCenterTextEditor.spec.ts › Rich Text Editor - Text Formatting (shard 2, 1 retry)
  • Features/RTL.spec.ts › Verify Following widget functionality (shard 3, 1 retry)
  • Features/Table.spec.ts › Table pagination with sorting should works (shard 3, 1 retry)
  • Features/Table.spec.ts › Tags term should be consistent for search (shard 3, 1 retry)
  • Features/Tasks/DomainFiltering.spec.ts › selecting All Domains removes domain filter from feed API call (shard 3, 1 retry)
  • Flow/PersonaFlow.spec.ts › Set default persona for team should work properly (shard 3, 1 retry)
  • Pages/DataContractInheritance.spec.ts › Run Validation - Inherited contract validation uses entity-based validation (shard 4, 1 retry)
  • Pages/DataContractsSemanticRules.spec.ts › Validate Description Rule Is_Set (shard 4, 1 retry)
  • Pages/EntityDataSteward.spec.ts › Tier Add, Update and Remove (shard 5, 1 retry)
  • Pages/ExplorePageRightPanel.spec.ts › Should display and verify schema fields for dashboardDataModel (shard 5, 1 retry)
  • Pages/ExplorePageRightPanel.spec.ts › Should verify deleted glossary term not visible in selection for searchIndex (shard 5, 1 retry)
  • Pages/Lineage/DataAssetLineage.spec.ts › Column lineage for apiEndpoint -> dashboardDataModel (shard 6, 1 retry)
  • Pages/Lineage/LineageFilters.spec.ts › Verify Impact Analysis service filter selection (shard 6, 1 retry)
  • Pages/Lineage/LineageRightPanel.spec.ts › Verify custom properties tab IS visible for supported type: searchIndex (shard 6, 1 retry)
  • Pages/ServiceEntity.spec.ts › Delete Search Index Service (shard 6, 1 retry)
  • Pages/ServiceEntity.spec.ts › User as Owner Add, Update and Remove (shard 6, 1 retry)
  • Pages/UserDetails.spec.ts › Create team with domain and verify visibility of inherited domain in user profile after team removal (shard 6, 1 retry)
  • VersionPages/EntityVersionPages.spec.ts › Store Procedure (shard 6, 1 retry)

📦 Download artifacts

How to debug locally:

```sh
# Download the playwright-test-results-<shard> artifact and unzip it
npx playwright show-trace path/to/trace.zip    # view trace
```


Labels: backend, safe to test

3 participants