@cawthorne cawthorne commented Jan 20, 2026

Summary

Adds observability for the WebSocket failover mechanism to help diagnose connection issues.

Problem

During a Tiingo incident (2026-01-13 03:19-03:32 UTC), we could not determine whether failover triggered, because:

  • The streamHandlerInvocationsWithNoConnection counter was not exposed as a metric
  • Counter increments were logged at TRACE level (not visible with LOG_LEVEL=info)
  • URL changes were logged at DEBUG level and suppressed with CENSOR_SENSITIVE_LOGS=true

This made it impossible to answer:

  • Did failover trigger during the incident?
  • What was the counter value at any given time?
  • When did URL switches occur?

Changes

1. New Prometheus Metric

  • Added a ws_connection_failover_count gauge metric
  • Exposes the streamHandlerInvocationsWithNoConnection value in real time
  • Labeled by transport_name for per-transport tracking
  • Updated whenever an unresponsive connection is detected
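
To illustrate the bullets above, here is a minimal sketch of how a gauge labeled by transport_name might be updated from the counter. It assumes a gauge object with a `set(labels, value)` method (prom-client's `Gauge` provides one). `GaugeLike`, `InMemoryGauge`, `recordFailoverCount`, and the transport name `'ws'` are hypothetical names used only for illustration, not the framework's actual code:

```typescript
// Hypothetical minimal shape of a labeled gauge (an assumption for this sketch).
interface GaugeLike {
  set(labels: Record<string, string>, value: number): void
}

// In-memory stand-in so the sketch runs without a metrics library.
class InMemoryGauge implements GaugeLike {
  readonly values = new Map<string, number>()

  set(labels: Record<string, string>, value: number): void {
    const key = labels['transport_name'] ?? 'unknown'
    this.values.set(key, value)
  }
}

// Assumed call site: invoked each time an unresponsive connection is detected.
function recordFailoverCount(gauge: GaugeLike, transportName: string, count: number): void {
  gauge.set({ transport_name: transportName }, count)
}

const gauge = new InMemoryGauge()
let streamHandlerInvocationsWithNoConnection = 0

// Simulate one detection of an unresponsive connection.
streamHandlerInvocationsWithNoConnection += 1
recordFailoverCount(gauge, 'ws', streamHandlerInvocationsWithNoConnection)
```

Because the gauge mirrors the counter directly, a dashboard can show the counter value at any point in time without relying on TRACE or DEBUG logs.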

@github-actions
Contributor

NPM Publishing labels 🏷️

🛑 This PR needs labels to indicate how to increase the current package version in the automated workflows. Please add one of the following labels: none, patch, minor, or major.

@cawthorne cawthorne changed the title from "Add WebSocket failover counter metric, abnormal closure tracking, and URL change logging" to "Add WebSocket failover counter metric and URL change logging" on Jan 20, 2026
  }),
  wsConnectionFailoverCount: new client.Gauge({
    name: 'ws_connection_failover_count',
    help: 'The number of consecutive connection issues (unresponsive/no data, abnormal closures), used to trigger URL failover. Resets to 0 when data flows successfully.',
Contributor

I don't see where this is reset to 0.

Contributor Author

It doesn't. The underlying variable it is meant to expose, streamHandlerInvocationsWithNoConnection, also never resets. It just increments forever, and Tiingo uses modulo arithmetic on it. It is used in this PR (even before my changes):
https://github.com/smartcontractkit/external-adapters-js/pull/4543/files

Open to resetting it if there is a good reason.
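
For context, here is a simplified sketch of why a counter that only ever increments can still drive failover through modulo arithmetic. This is not Tiingo's actual logic (see the linked PR for that); `pickUrl` and the placeholder URLs are hypothetical and assume a plain round-robin over the available endpoints:

```typescript
// Hypothetical URL picker: cycles through endpoints as the failure counter grows.
function pickUrl(urls: readonly string[], failureCount: number): string {
  if (urls.length === 0) {
    throw new Error('at least one URL is required')
  }
  // The counter never resets; the modulo keeps the index within bounds.
  return urls[failureCount % urls.length] as string
}

// Placeholder endpoints, not real provider URLs.
const urls = ['wss://primary.example', 'wss://secondary.example']

// Counter values 0..3 alternate between the two endpoints.
const picked = [0, 1, 2, 3].map((count) => pickUrl(urls, count))
```

With this kind of scheme the absolute counter value matters less than its remainder, which is why resetting is not strictly required for the failover itself.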

Contributor

I don't know if it should reset, but the description says "Resets to 0 when data flows successfully."

Contributor

After chatting with @cawthorne in person: we'll remove that part of the description, since the counter does not reset to 0. Good call.

    help: 'The number of addresses in PoR request input parameters',
    labelNames: ['feed_id'] as const,
  }),
  wsConnectionFailoverCount: new client.Gauge({
Contributor Author

The other thing I was toying with was making this counter also increment when a WS connection closes with a code != 1000 (an unhealthy close).

Currently we only increment when an initial WS connection fails to connect or when an open connection is unresponsive.
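
A rough sketch of what that idea could look like, assuming the counter lives behind a small class. `FailoverCounter` and its method names are hypothetical and not part of this PR; 1000 is the normal-closure code and 1006 the abnormal-closure code defined for WebSocket:

```typescript
// WebSocket close code for a normal closure.
const NORMAL_CLOSURE = 1000

// Hypothetical wrapper around the failover counter.
class FailoverCounter {
  private count = 0

  // Existing behaviour: the initial connection attempt failed.
  onConnectFailed(): void {
    this.count += 1
  }

  // Existing behaviour: an open connection stopped delivering data.
  onUnresponsive(): void {
    this.count += 1
  }

  // Proposed addition: count any close that is not a normal closure.
  onClose(code: number): void {
    if (code !== NORMAL_CLOSURE) {
      this.count += 1
    }
  }

  get value(): number {
    return this.count
  }
}

const counter = new FailoverCounter()
counter.onClose(1000) // healthy close: counter unchanged
counter.onClose(1006) // abnormal closure: counter increments
counter.onUnresponsive() // unresponsive connection: counter increments
```

One trade-off to weigh: counting abnormal closes would make failover trigger sooner on flaky connections, which may or may not be desirable for a given provider.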
