
fix(ingestion): cap OM and Snowflake socket waits to stop silent ingestion hangs #28131

Open · harshach wants to merge 2 commits into main from harshach/snowflake-ingestion-hang


Conversation

@harshach (Collaborator)

Describe your changes:

A Snowflake metadata ingestion run was observed to pause silently in a hybrid-runner K8s pod: the pod stayed alive (no OOM, no crash), no further API calls hit the OpenMetadata server, and no further log lines appeared. Pod-side logs ended at a successful `Source.GET executed` HTTP call to the OM server; the next expected `Source executed` / `Sink.PATCH` lines never reached the server. Streamable-log shipping continued from its daemon thread for ~13 minutes (draining its buffer) and then went quiet too.

Root cause is structural: HTTP calls from the connector to the OM server, and the SQLAlchemy/Snowflake connection itself, both go out with no read timeout. `requests.Session()` and the Snowflake driver keep TCP connections in pools; intermediate network devices (K8s NAT, AWS NLB default 350s, Azure LB default 4 min) silently sever idle sockets, and the next call grabs a stale connection and blocks in `socket.recv` forever (Linux TCP keepalive default is 2 hours).

Three minimal changes:

  • `ingestion/src/metadata/ingestion/ometa/client.py`: default `ClientConfig.timeout` to `(10, 300)` (10s connect, 5min read) instead of `None`. The existing `if self._timeout` guard still honours an explicit `None` for callers that need it (sketched after this list).
  • `ingestion/src/metadata/ingestion/source/database/snowflake/connection.py`: `setdefault` `client_session_keep_alive=True` and `network_timeout=600` on the connection arguments. User-supplied `connectionArguments` still win.
  • `ingestion/setup.py`: add `py-spy>=0.3.14` to `base_requirements` so the next time a worker hangs we can `py-spy dump --pid <pid>` in place, without first having to install anything.
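
A minimal sketch of the guard behaviour the first bullet describes; the names here (`DEFAULT_TIMEOUT`, `get_with_guard`) are illustrative stand-ins, not the real client.py code:

```python
from typing import Tuple, Union

import requests

# 10s to connect, 300s to read -- the same shape requests accepts natively
DEFAULT_TIMEOUT: Union[int, Tuple[int, int], None] = (10, 300)

def get_with_guard(session: requests.Session, url: str, timeout=DEFAULT_TIMEOUT):
    kwargs = {}
    if timeout:  # an explicit None is falsy: no timeout kwarg, old wait-forever behaviour
        kwargs["timeout"] = timeout
    # A silently severed pooled socket now raises requests.exceptions.ReadTimeout
    # after the read timeout instead of blocking in socket.recv for hours.
    return session.get(url, **kwargs)
```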

Type of change:

  • Bug fix

High-level design:

N/A — small change.

Tests:

Use cases covered

  • Snowflake ingestion in a K8s pod whose outbound TCP idle timeout (NAT/LB) is shorter than tcp_keepalive_time: previously hung indefinitely on the next reuse of a half-open pooled socket; now surfaces as a requests.exceptions.ReadTimeout (caught by the existing except Exception in _one_request, logged, and the workflow continues) within 5 minutes.
  • Existing callers passing ClientConfig(timeout=None) retain the old "wait forever" behaviour (the if self._timeout guard at client.py:224 is unchanged).
  • Existing Snowflake users supplying client_session_keep_alive / network_timeout in connectionArguments continue to win via setdefault.

Unit tests

  • Not added: the change is a default-value tweak in existing fields; no logic branches were introduced. Pydantic instantiation of all four shapes (default, int, None, tuple) was verified locally; a spot-check is sketched below.
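
A spot-check along those lines, using a Pydantic stand-in rather than the real `ClientConfig`:

```python
from typing import Tuple, Union

from pydantic import BaseModel

class ClientConfig(BaseModel):  # stand-in with the assumed field type
    timeout: Union[int, Tuple[int, int], None] = (10, 300)

assert ClientConfig().timeout == (10, 300)               # default
assert ClientConfig(timeout=60).timeout == 60            # plain int
assert ClientConfig(timeout=None).timeout is None        # wait-forever opt-out
assert ClientConfig(timeout=(5, 30)).timeout == (5, 30)  # explicit (connect, read)
```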

Backend integration tests

  • Not applicable (no backend API changes).

Ingestion integration tests

  • Not applicable: change is a default-arg tweak with no new code paths.

Playwright (UI) tests

  • Not applicable (no UI changes).

Manual testing performed

  • AST-parsed both edited files.
  • Instantiated ClientConfig in a Pydantic stand-in for default (10, 300), int=60, None, and (5, 30) overrides — all accepted.
  • Confirmed SSE streaming uses its own httpx.Client(timeout=None) in sse_client.py:82 and is unaffected.

UI screen recording / screenshots:

Not applicable.

Checklist:

  • I have read the CONTRIBUTING document.
  • I have commented on my code, particularly in hard-to-understand areas.

…ed TCP connection no longer hangs the worker

A Snowflake metadata ingestion run was observed to silently pause: the
Airflow pod stayed alive (no OOM, no crash), no further API calls hit
the OpenMetadata server, no further log lines appeared. The pod-side
log ended at `Source.GET executed in 59.057ms` -- a successful HTTP
GET to the OM server -- with the next expected `Source executed` /
`Sink.PATCH` lines never reaching the server. Streamable-log shipping
continued from its daemon thread for ~13 minutes (draining its buffer)
and then went quiet too.

Root cause is a structural one: HTTP calls from the connector to the
OM server, and the SQLAlchemy/Snowflake connection itself, both go out
with no read timeout. `requests.Session()` and the Snowflake driver
keep their TCP connections alive in pools; intermediate network
devices (K8s NAT, ingress NLB, AWS NLB default 350s, Azure LB default
4 min) silently sever idle sockets. The next call grabs a stale
connection from the pool and blocks in `socket.recv` waiting for bytes
that will never arrive. Linux TCP keepalive defaults to 2 hours.

Three minimal changes:

- `ingestion/src/metadata/ingestion/ometa/client.py`: default
  `ClientConfig.timeout` to `(10, 300)` (10s connect, 5min read)
  instead of `None`. The existing `if self._timeout` guard still
  honours an explicit `None` override for callers that need it.
- `ingestion/src/metadata/ingestion/source/database/snowflake/connection.py`:
  `setdefault` `client_session_keep_alive=True` and
  `network_timeout=600` on the connection arguments. User-supplied
  values in `connectionArguments` still win (see the sketch below).
- `ingestion/setup.py`: add `py-spy>=0.3.14` to `base_requirements`
  so the next time a worker hangs we can `py-spy dump --pid <pid>` in
  place, without first having to install anything in the container.
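
The precedence rule in play, as a tiny standalone sketch (a plain dict stands in
for `connection.connectionArguments.root`; note the keep-alive default is
dropped again later in this PR):

```python
# User supplied network_timeout via connectionArguments but left keep-alive unset.
connection_args = {"network_timeout": 300}

# setdefault only fills missing keys, so user-supplied values always win.
connection_args.setdefault("client_session_keep_alive", True)
connection_args.setdefault("network_timeout", 600)

assert connection_args["network_timeout"] == 300             # user value kept
assert connection_args["client_session_keep_alive"] is True  # default applied
```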

After this, the same kind of dropped socket surfaces as a
`requests.exceptions.ReadTimeout` (caught by the existing
`except Exception` in `_one_request`, logged, and the workflow
continues) instead of wedging the pod indefinitely. SSE streaming
uses its own httpx client and is unaffected.
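
How that surfaces in practice, as a hedged sketch of the flow described above
(`one_request` is an illustrative stand-in; the real `_one_request` internals
are not shown in this PR text):

```python
import logging

import requests

logger = logging.getLogger(__name__)

def one_request(session: requests.Session, url: str):
    try:
        return session.get(url, timeout=(10, 300))
    except Exception as exc:  # broad catch, as the paragraph above describes
        # A severed socket lands here as requests.exceptions.ReadTimeout within
        # 5 minutes; it is logged and the workflow moves on instead of wedging.
        logger.warning("Request to %s failed: %s", url, exc)
        return None
```
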
Copilot AI review requested due to automatic review settings May 14, 2026 23:02
@harshach harshach requested a review from a team as a code owner May 14, 2026 23:02
@github-actions bot added the backend and safe to test labels on May 14, 2026
Comment thread on ingestion/src/metadata/ingestion/source/database/snowflake/connection.py (Outdated)
Comment on lines +205 to +211
```python
if keep_alive := self._get_client_session_keep_alive():
    connection.connectionArguments.root["client_session_keep_alive"] = keep_alive

# Safe defaults: prevent indefinite hangs when the Snowflake socket is
# silently severed (NAT/LB idle reaping in K8s/hybrid runners). Any
# value the user supplied via connectionArguments wins via setdefault.
connection.connectionArguments.root.setdefault("client_session_keep_alive", True)
```
Contributor:

There is a UI toggle for this. It is handled above.

Collaborator (Author):

Good catch — you are right. The UI toggle handles this and a setdefault(True) would silently override a user who unchecks it. Dropped in 20a72d0. The hang protection now relies only on network_timeout=600, which bounds any silently-severed Snowflake socket to a 10-minute ReadTimeout. Keep-alive (which only protects Snowflake-side session expiry) stays user-controlled via the UI toggle.

Copilot AI left a comment (Contributor)

Pull request overview

This PR addresses a real-world ingestion hang scenario caused by indefinite socket reads on (1) the OpenMetadata REST client and (2) Snowflake driver connections when pooled sockets are silently severed by NAT/LB idle timeouts. It introduces bounded timeouts/defaults to surface failures as timeouts rather than blocking forever, and adds an in-container profiling tool to simplify future debugging.

Changes:

  • Default OpenMetadata REST client timeout to (connect=10s, read=300s) while preserving the ability to explicitly disable timeouts via timeout=None.
  • Add Snowflake connection-argument defaults for client_session_keep_alive and network_timeout (with user-provided connectionArguments taking precedence via setdefault).
  • Add py-spy to ingestion base requirements to enable in-place stack sampling of hung workers.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| ingestion/src/metadata/ingestion/source/database/snowflake/connection.py | Adds safe default Snowflake driver args to reduce the chance of indefinite socket hangs. |
| ingestion/src/metadata/ingestion/ometa/client.py | Sets a bounded default requests timeout to avoid silent hangs on stale pooled HTTP connections. |
| ingestion/setup.py | Adds py-spy to base dependencies to support post-mortem sampling/debugging inside ingestion environments. |

Comment on lines +211 to +212

```python
connection.connectionArguments.root.setdefault("client_session_keep_alive", True)
connection.connectionArguments.root.setdefault("network_timeout", 600)
```

…spy to Dockerfile

- snowflake/connection.py: drop the setdefault("client_session_keep_alive",
  True). The IngestionPipeline UI already exposes a clientSessionKeepAlive
  toggle (handled at lines 205-206); imposing a default of True here would
  silently override a user who unchecks it. Hang protection is now covered
  by network_timeout=600 alone, which bounds any silently-severed Snowflake
  socket to a 10-minute ReadTimeout instead of hanging the worker forever.
  Keep-alive only matters for Snowflake-side session expiry, which is a
  separate concern from the original hang.

- setup.py + ingestion/Dockerfile: move py-spy from base_requirements into
  the ingestion Dockerfile. The original goal — "available in the running
  container without pip install" — is met without forcing a native Rust
  binary on dev laptops, CI, and non-container installs that won't have
  CAP_SYS_PTRACE anyway.
@gitar-bot commented May 14, 2026

Code Review ✅ Approved 2 resolved / 2 findings

Limits socket waits in OpenMetadata and Snowflake connectors to prevent indefinite ingestion hangs caused by stale TCP connections. Includes the addition of py-spy to production dependencies to facilitate future debugging.

✅ 2 resolved
Quality: py-spy is a debug tool added to base production dependencies

📄 ingestion/setup.py:157-159
Adding py-spy>=0.3.14 to base_requirements means every installation of the ingestion framework (including developer laptops, CI, non-container environments) pulls in this native profiling binary. py-spy also requires CAP_SYS_PTRACE or root to function, so it won't even work in most environments without additional setup. It should be in a dedicated optional extras group (e.g., [debug] or [ops]) and pre-installed only in the official container images via a Dockerfile layer, rather than forcing it on all users.
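
As a concrete shape for that suggestion, a hedged sketch (the extra name `debug` and the package name in the comment are illustrative assumptions, not the project's actual packaging):

```python
# In ingestion/setup.py: keep py-spy out of base_requirements and expose it
# via an optional extra instead; official images can still pre-install it.
extras_require = {
    "debug": ["py-spy>=0.3.14"],  # pip install "openmetadata-ingestion[debug]"
}
```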

Edge Case: setdefault for client_session_keep_alive is partially redundant

📄 ingestion/src/metadata/ingestion/source/database/snowflake/connection.py:205-212
On lines 205-206, if connection.clientSessionKeepAlive is truthy, the key is set via direct assignment (which overwrites any user-supplied value in connectionArguments). On line 211, setdefault only fires when the key is absent. This means:

  1. If the user supplied client_session_keep_alive in connectionArguments but did NOT set the dedicated clientSessionKeepAlive config field, the setdefault correctly preserves their value.
  2. If the user supplied it in connectionArguments AND set the dedicated field, line 206 overwrites their connectionArguments value (pre-existing behavior, not introduced here).

The logic works but the interaction between the explicit assignment (line 206) and setdefault (line 211) is subtle. A brief inline comment noting that line 206 intentionally takes priority over user-supplied connectionArguments would help future readers.
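
Applied to the excerpt from the thread above, that inline comment might read as follows (hypothetical placement; note the keep-alive setdefault itself was dropped later in this PR):

```python
if keep_alive := self._get_client_session_keep_alive():
    # The dedicated clientSessionKeepAlive field intentionally takes priority,
    # overwriting any user-supplied connectionArguments value (line 206).
    connection.connectionArguments.root["client_session_keep_alive"] = keep_alive

# setdefault fires only when the key is still absent, i.e. when neither the
# dedicated field above nor user-supplied connectionArguments provided it.
connection.connectionArguments.root.setdefault("client_session_keep_alive", True)
connection.connectionArguments.root.setdefault("network_timeout", 600)
```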


@github-actions (Contributor)

The Python checkstyle failed.

Please run make py_format and py_format_check in the root of your repository and commit the changes to this PR.
You can also use pre-commit to automate the Python code formatting.

You can install the pre-commit hooks with make install_test precommit_install.


@github-actions (Contributor)

🔴 Playwright Results — 1 failure(s), 24 flaky

✅ 4045 passed · ❌ 1 failed · 🟡 24 flaky · ⏭️ 103 skipped

| Shard | Passed | Failed | Flaky | Skipped |
| --- | --- | --- | --- | --- |
| 🔴 Shard 1 | 298 | 1 | 0 | 4 |
| 🟡 Shard 2 | 743 | 0 | 7 | 25 |
| 🟡 Shard 3 | 779 | 0 | 5 | 7 |
| 🟡 Shard 4 | 788 | 0 | 2 | 18 |
| 🟡 Shard 5 | 706 | 0 | 3 | 41 |
| 🟡 Shard 6 | 731 | 0 | 7 | 8 |

Genuine Failures (failed on all attempts)

Pages/AuditLogs.spec.ts › should apply both User and EntityType filters simultaneously (shard 1)
Test timeout of 60000ms exceeded.
🟡 24 flaky test(s) (passed on retry)
  • Features/BulkEditEntity.spec.ts › Glossary (shard 2, 1 retry)
  • Features/Glossary/GlossaryWorkflow.spec.ts › should display correct status badge color and icon (shard 2, 1 retry)
  • Features/Glossary/GlossaryWorkflow.spec.ts › should start term as Draft when glossary has reviewers (shard 2, 1 retry)
  • Features/KnowledgeCenterList.spec.ts › Knowledge Center List - Test infinite scroll/pagination (shard 2, 1 retry)
  • Features/KnowledgeCenterTextEditor.spec.ts › Rich Text Editor - Text Formatting (shard 2, 1 retry)
  • Features/KnowledgeCenterTextEditor.spec.ts › Rich Text Editor - Text Formatting (shard 2, 1 retry)
  • Features/KnowledgeCenterTextEditor.spec.ts › Rich Text Editor - Text Formatting (shard 2, 1 retry)
  • Features/RTL.spec.ts › Verify Following widget functionality (shard 3, 1 retry)
  • Features/Table.spec.ts › Table pagination with sorting should works (shard 3, 1 retry)
  • Features/Table.spec.ts › Tags term should be consistent for search (shard 3, 1 retry)
  • Features/Tasks/DomainFiltering.spec.ts › selecting All Domains removes domain filter from feed API call (shard 3, 1 retry)
  • Flow/PersonaFlow.spec.ts › Set default persona for team should work properly (shard 3, 1 retry)
  • Pages/DataContractInheritance.spec.ts › Run Validation - Inherited contract validation uses entity-based validation (shard 4, 1 retry)
  • Pages/DataContractsSemanticRules.spec.ts › Validate Description Rule Is_Set (shard 4, 1 retry)
  • Pages/EntityDataSteward.spec.ts › Tier Add, Update and Remove (shard 5, 1 retry)
  • Pages/ExplorePageRightPanel.spec.ts › Should display and verify schema fields for dashboardDataModel (shard 5, 1 retry)
  • Pages/ExplorePageRightPanel.spec.ts › Should verify deleted glossary term not visible in selection for searchIndex (shard 5, 1 retry)
  • Pages/Lineage/DataAssetLineage.spec.ts › Column lineage for apiEndpoint -> dashboardDataModel (shard 6, 1 retry)
  • Pages/Lineage/LineageFilters.spec.ts › Verify Impact Analysis service filter selection (shard 6, 1 retry)
  • Pages/Lineage/LineageRightPanel.spec.ts › Verify custom properties tab IS visible for supported type: searchIndex (shard 6, 1 retry)
  • Pages/ServiceEntity.spec.ts › Delete Search Index Service (shard 6, 1 retry)
  • Pages/ServiceEntity.spec.ts › User as Owner Add, Update and Remove (shard 6, 1 retry)
  • Pages/UserDetails.spec.ts › Create team with domain and verify visibility of inherited domain in user profile after team removal (shard 6, 1 retry)
  • VersionPages/EntityVersionPages.spec.ts › Store Procedure (shard 6, 1 retry)

📦 Download artifacts

How to debug locally:

```sh
# Download the playwright-test-results-<shard> artifact and unzip it
npx playwright show-trace path/to/trace.zip    # view trace
```


Labels: backend, safe to test

3 participants