fix(ingestion): cap OM and Snowflake socket waits to stop silent ingestion hangs #28131

harshach wants to merge 2 commits into
Conversation
…ed TCP connection no longer hangs the worker

A Snowflake metadata ingestion run was observed to silently pause: the Airflow pod stayed alive (no OOM, no crash), no further API calls hit the OpenMetadata server, and no further log lines appeared. The pod-side log ended at `Source.GET executed in 59.057ms` -- a successful HTTP GET to the OM server -- with the next expected `Source executed` / `Sink.PATCH` lines never reaching the server. Streamable-log shipping continued from its daemon thread for ~13 minutes (draining its buffer) and then went quiet too.

The root cause is structural: HTTP calls from the connector to the OM server, and the SQLAlchemy/Snowflake connection itself, both go out with no read timeout. `requests.Session()` and the Snowflake driver keep their TCP connections alive in pools; intermediate network devices (K8s NAT, ingress NLB, AWS NLB default 350s, Azure LB default 4 min) silently sever idle sockets. The next call grabs a stale connection from the pool and blocks in `socket.recv` waiting for bytes that will never arrive. Linux TCP keepalive defaults to 2 hours.

Three minimal changes:

- `ingestion/src/metadata/ingestion/ometa/client.py`: default `ClientConfig.timeout` to `(10, 300)` (10s connect, 5min read) instead of `None`. The existing `if self._timeout` guard still honours an explicit `None` override for callers that need it.
- `ingestion/src/metadata/ingestion/source/database/snowflake/connection.py`: `setdefault` `client_session_keep_alive=True` and `network_timeout=600` on the connection arguments. User-supplied values in `connectionArguments` still win.
- `ingestion/setup.py`: add `py-spy>=0.3.14` to `base_requirements` so the next time a worker hangs we can `py-spy dump --pid <pid>` in place, without first having to install anything in the container.

After this, the same kind of dropped socket surfaces as a `requests.exceptions.ReadTimeout` (caught by the existing `except Exception` in `_one_request`, logged, and the workflow continues) instead of wedging the pod indefinitely. SSE streaming uses its own httpx client and is unaffected.
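For illustration, a minimal sketch of the `client.py` change, assuming a plain class in place of the real (larger) `ClientConfig`; `DEFAULT_TIMEOUT` and `one_request` are illustrative names here, not the exact upstream identifiers:

```python
from typing import Optional, Tuple, Union

import requests

# (connect seconds, read seconds) -- the new default shape described above.
DEFAULT_TIMEOUT: Tuple[int, int] = (10, 300)


class ClientConfigSketch:
    def __init__(
        self, timeout: Optional[Union[int, Tuple[int, int]]] = DEFAULT_TIMEOUT
    ):
        self.timeout = timeout


def one_request(session: requests.Session, url: str, config: ClientConfigSketch):
    kwargs = {}
    # Mirrors the `if self._timeout` guard: an explicit timeout=None skips the
    # kwarg entirely, preserving the old wait-forever behaviour for callers
    # that opt out.
    if config.timeout:
        kwargs["timeout"] = config.timeout
    return session.get(url, **kwargs)
```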
```python
if keep_alive := self._get_client_session_keep_alive():
    connection.connectionArguments.root["client_session_keep_alive"] = keep_alive

# Safe defaults: prevent indefinite hangs when the Snowflake socket is
# silently severed (NAT/LB idle reaping in K8s/hybrid runners). Any
# value the user supplied via connectionArguments wins via setdefault.
connection.connectionArguments.root.setdefault("client_session_keep_alive", True)
```
There is a UI toggle for this. It is handled above
Good catch — you are right. The UI toggle handles this and a setdefault(True) would silently override a user who unchecks it. Dropped in 20a72d0. The hang protection now relies only on network_timeout=600, which bounds any silently-severed Snowflake socket to a 10-minute ReadTimeout. Keep-alive (which only protects Snowflake-side session expiry) stays user-controlled via the UI toggle.
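A minimal sketch of the post-review behaviour, with `connection_arguments` standing in for `connection.connectionArguments.root` and a hypothetical helper name (the real change is a single `setdefault` call in `connection.py`):

```python
def with_hang_protection(connection_arguments: dict) -> dict:
    # setdefault never clobbers a user-supplied value; a silently-severed
    # Snowflake socket now surfaces as a timeout after at most 600s instead
    # of hanging the worker forever.
    connection_arguments.setdefault("network_timeout", 600)
    return connection_arguments


# The default applies only when the user has not set a value themselves.
assert with_hang_protection({})["network_timeout"] == 600
assert with_hang_protection({"network_timeout": 60})["network_timeout"] == 60
```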
Pull request overview
This PR addresses a real-world ingestion hang scenario caused by indefinite socket reads on (1) the OpenMetadata REST client and (2) Snowflake driver connections when pooled sockets are silently severed by NAT/LB idle timeouts. It introduces bounded timeouts/defaults to surface failures as timeouts rather than blocking forever, and adds an in-container profiling tool to simplify future debugging.
Changes:
- Default the OpenMetadata REST client timeout to `(connect=10s, read=300s)` while preserving the ability to explicitly disable timeouts via `timeout=None` (see the sketch below).
- Add Snowflake connection-argument defaults for `client_session_keep_alive` and `network_timeout` (with user-provided `connectionArguments` taking precedence via `setdefault`).
- Add `py-spy` to ingestion base requirements to enable in-place stack sampling of hung workers.
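A hedged illustration of the first bullet, using a placeholder URL: with a `(connect, read)` tuple, a dead socket becomes an exception the existing error handling can absorb rather than an indefinite block.

```python
import requests

session = requests.Session()
try:
    session.get(
        "https://openmetadata.example.com/api/v1/system/version",  # placeholder host
        timeout=(10, 300),  # (connect, read) -- the new default shape
    )
except requests.exceptions.ConnectionError:
    pass  # the placeholder host will not resolve; irrelevant to the sketch
except requests.exceptions.ReadTimeout:
    # In the PR this is caught by the broad `except Exception` in
    # _one_request, logged, and the workflow moves on.
    pass
```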
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| ingestion/src/metadata/ingestion/source/database/snowflake/connection.py | Adds safe default Snowflake driver args to reduce the chance of indefinite socket hangs. |
| ingestion/src/metadata/ingestion/ometa/client.py | Sets a bounded default requests timeout to avoid silent hangs on stale pooled HTTP connections. |
| ingestion/setup.py | Adds py-spy to base dependencies to support post-mortem sampling/debugging inside ingestion environments. |
```python
connection.connectionArguments.root.setdefault("client_session_keep_alive", True)
connection.connectionArguments.root.setdefault("network_timeout", 600)
```
…spy to Dockerfile
- snowflake/connection.py: drop the setdefault("client_session_keep_alive",
True). The IngestionPipeline UI already exposes a clientSessionKeepAlive
toggle (handled at lines 205-206); imposing a default of True here would
silently override a user who unchecks it. Hang protection is now covered
by network_timeout=600 alone, which bounds any silently-severed Snowflake
socket to a 10-minute ReadTimeout instead of hanging the worker forever.
Keep-alive only matters for Snowflake-side session expiry, which is a
separate concern from the original hang.
- setup.py + ingestion/Dockerfile: move py-spy from base_requirements into
the ingestion Dockerfile. The original goal — "available in the running
container without pip install" — is met without forcing a native Rust
binary on dev laptops, CI, and non-container installs that won't have
CAP_SYS_PTRACE anyway.
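For reference, a sketch of the debugging flow the Dockerfile change enables; the PID handling and `subprocess` wrapper are illustrative, while `py-spy dump --pid` is py-spy's documented command for printing every thread's current stack. Running it requires the py-spy binary on PATH and ptrace permission in the container.

```python
import os
import subprocess

# In practice this would be the hung ingestion worker's PID inside the pod;
# own PID used here only so the sketch is self-contained.
hung_pid = os.getpid()
subprocess.run(["py-spy", "dump", "--pid", str(hung_pid)], check=False)
```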
Code Review: ✅ Approved (2 resolved / 2 findings)

Limits socket waits in OpenMetadata and Snowflake connectors to prevent indefinite ingestion hangs caused by stale TCP connections. Includes the addition of py-spy to production dependencies to facilitate future debugging.

2 resolved:
- ✅ Quality: py-spy is a debug tool added to base production dependencies
- ✅ Edge Case: setdefault for client_session_keep_alive is partially redundant
The Python checkstyle failed. Please run the Python formatter locally and push the result. You can install the pre-commit hooks to catch this automatically before committing.
🔴 Playwright Results: 1 failure, 24 flaky

✅ 4045 passed · ❌ 1 failed · 🟡 24 flaky · ⏭️ 103 skipped

Genuine Failures (failed on all attempts): ❌



Describe your changes:
A Snowflake metadata ingestion run was observed silently pausing in a hybrid-runner K8s pod: the pod stayed alive (no OOM, no crash), no further API calls hit the OpenMetadata server, no further log lines appeared. Pod-side logs ended at a successful `Source.GET executed` HTTP call to the OM server; the next expected `Source executed` / `Sink.PATCH` lines never reached the server. Streamable-log shipping continued from its daemon thread for ~13 minutes (draining its buffer) and then went quiet too.

Root cause is structural: HTTP calls from the connector to the OM server, and the SQLAlchemy/Snowflake connection itself, both go out with no read timeout. `requests.Session()` and the Snowflake driver keep TCP connections in pools; intermediate network devices (K8s NAT, AWS NLB default 350s, Azure LB default 4 min) silently sever idle sockets, and the next call grabs a stale connection and blocks in `socket.recv` forever (Linux TCP keepalive default is 2 hours).

Three minimal changes:

- `ingestion/src/metadata/ingestion/ometa/client.py`: default `ClientConfig.timeout` to `(10, 300)` (10s connect, 5min read) instead of `None`. The existing `if self._timeout` guard still honours an explicit `None` for callers that need it.
- `ingestion/src/metadata/ingestion/source/database/snowflake/connection.py`: `setdefault` `client_session_keep_alive=True` and `network_timeout=600` on the connection arguments. User-supplied `connectionArguments` still win.
- `ingestion/setup.py`: add `py-spy>=0.3.14` to `base_requirements` so the next time a worker hangs we can `py-spy dump --pid <pid>` in place, without first having to install anything.

Type of change:
High-level design:
N/A — small change.
Tests:
Use cases covered
- `tcp_keepalive_time`: previously hung indefinitely on next reuse of a half-open pooled socket; now surfaces as a `requests.exceptions.ReadTimeout` (caught by the existing `except Exception` in `_one_request`, logged, workflow continues) within 5 minutes.
- Callers passing `ClientConfig(timeout=None)` retain the old "wait forever" behaviour (the `if self._timeout` guard at `client.py:224` is unchanged).
- User-supplied `client_session_keep_alive` / `network_timeout` in `connectionArguments` continue to win via `setdefault`.

Unit tests
Backend integration tests
Ingestion integration tests
Playwright (UI) tests
Manual testing performed
- Validated `ClientConfig` in a Pydantic stand-in for the default `(10, 300)`, `int=60`, `None`, and `(5, 30)` overrides — all accepted.
- SSE streaming uses its own `httpx.Client(timeout=None)` in `sse_client.py:82` and is unaffected.
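A reproduction of that manual check, using a hypothetical Pydantic stand-in (the real `ClientConfig` carries more fields than shown):

```python
from typing import Optional, Tuple, Union

from pydantic import BaseModel


class ClientConfigStandIn(BaseModel):
    timeout: Optional[Union[int, Tuple[int, int]]] = (10, 300)


assert ClientConfigStandIn().timeout == (10, 300)             # new default
assert ClientConfigStandIn(timeout=60).timeout == 60          # int override
assert ClientConfigStandIn(timeout=None).timeout is None      # explicit opt-out
assert ClientConfigStandIn(timeout=(5, 30)).timeout == (5, 30)
```

UI screen recording / screenshots: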
Not applicable.
Checklist: