Skip to content

reverse_tunnel: add lifecycle access logging#44298

Open
roll-no-21 wants to merge 13 commits intoenvoyproxy:mainfrom
roll-no-21:reverse-tunnel-lifecycle-logging
Open

reverse_tunnel: add lifecycle access logging#44298
roll-no-21 wants to merge 13 commits intoenvoyproxy:mainfrom
roll-no-21:reverse-tunnel-lifecycle-logging

Conversation

@roll-no-21
Copy link
Copy Markdown
Contributor

Emit structured lifecycle records for tunnel setup, handoff, close, and post-handoff HTTP/2 keepalive timeouts so reverse-tunnel behavior can be debugged from access logs without relying on codec changes.

Commit Message: reverse_tunnel: add lifecycle access logging
Additional Description:

Today, debugging reverse tunnel connection issues (silent drops, keepalive
timeouts, unexpected closes) requires correlating codec-level debug logs
across multiple threads and connections. This PR adds first-class access log
support to the reverse tunnel upstream socket interface so that operators can
observe the full tunnel lifecycle from structured, machine-parseable access
logs.

Proto change (UpstreamReverseConnectionSocketInterface)

A new repeated config.accesslog.v3.AccessLog access_log = 6 field is added
to the bootstrap config. When configured, lifecycle events are emitted through
the standard Envoy access log framework. No logs are emitted when the field is
absent — fully backward compatible.

Lifecycle info struct (ReverseTunnelLifecycleInfo)

A new reverse_tunnel_lifecycle_info.h header defines:

  • ReverseTunnelLifecycleInfo — per-socket metadata carried through the tunnel
    lifetime: node_id, cluster_id, tenant_id, local/remote addresses,
    worker name, file descriptor, and state flags (handed_off_to_upstream,
    upstream_lifecycle_filter_attached, socket_dead_notified,
    close_log_emitted, close_reason).
  • Well-known string constants for all lifecycle events, socket states, handoff
    kinds, close reasons, filter-state keys, and the dynamic metadata namespace.

Lifecycle events emitted

Each access log record is written into the envoy.reverse_tunnel.lifecycle
dynamic metadata namespace with these fields:

Field Description
event One of the event names below.
node_id Originating node identifier.
cluster_id Originating cluster identifier.
tenant_id Tenant identifier (empty when tenant isolation is off).
worker Envoy worker thread that owns the socket.
fd File descriptor number.
socket_state idle, handed_off, or in_use.
handoff_kind Present only on socket_handoff: pool_to_upstream.
close_reason Present only on tunnel_closed (see close reasons below).

The same node/cluster/tenant/worker/fd values are also written into connection
filter state under envoy.reverse_tunnel.* keys so they can be referenced in
%FILTER_STATE(...)% format strings.

Core lifecycle events:

  • tunnel_setup — reverse tunnel accepted and cached.
  • socket_handoff — idle socket handed off to the upstream connection pool.
  • tunnel_closed — socket removed from tracking (final event for every tunnel).
  • http2_keepalive_timeout — upstream closed locally due to HTTP/2 PING timeout
    after handoff.

Idle-phase ping events (while socket is cached):

  • idle_ping_sent / idle_ping_ack / idle_ping_miss / idle_ping_timeout

Post-handoff HTTP/2 keepalive events:

  • http2_keepalive_ping_sent / http2_keepalive_ping_ack

Close reason values on tunnel_closed:

  • idle_peer_close — peer closed while idle.
  • idle_read_error — read error on the idle socket.
  • idle_ping_write_failure — RPING write failed.
  • idle_ping_timeout — RPINGs went unanswered.
  • remote_close — peer closed after handoff.
  • local_close — Envoy closed after handoff.
  • explicit_close — explicit close (e.g. shutdown).

Upstream network filter (envoy.filters.upstream_network.reverse_tunnel_lifecycle)

A new ReverseTunnelUpstreamLifecycleFilter (upstream read filter) is
introduced to observe post-handoff connection events. On onNewConnection() it
copies the tunnel's ReverseTunnelLifecycleInfo from the socket manager into
the upstream connection's filter state. On LocalClose with
http2_ping_timeout, it emits the http2_keepalive_timeout event through the
extension's access loggers. It also provides the final close reason to the
socket manager so the deferred tunnel_closed record carries the real reason
rather than a generic explicit_close.

UpstreamSocketManager changes

  • Tracks fd_to_lifecycle_info_ map alongside existing fd maps.
  • Emits tunnel_setup on addConnectionSocket().
  • Emits socket_handoff on getConnectionSocket().
  • Emits tunnel_closed from markSocketDead(), with deferred close log
    support: when the upstream lifecycle filter is attached and no close reason
    is known yet, the close log is deferred until the filter reports the reason
    via maybeEmitDeferredCloseLog().
  • Sets close reason on idle-phase failures: idle_peer_close,
    idle_read_error, idle_ping_write_failure, idle_ping_timeout.
  • Passes tenant_id through handoffSocketToWorker() and
    addConnectionSocket() so lifecycle info retains the original tenant.

ReverseTunnelAcceptorExtension changes

  • Constructor moved from header to .cc to instantiate access loggers from
    config via AccessLog::AccessLogFactory::fromProto().
  • Two new emit methods:
    • emitSyntheticLifecycleLog() — builds a temporary StreamInfoImpl from
      lifecycle metadata (used for idle-phase events and close events where no
      real connection StreamInfo exists).
    • emitConnectionLifecycleLog() — populates an existing connection's
      StreamInfo with lifecycle metadata (used for post-handoff events).
  • Both write dynamic metadata and filter state, then call all configured
    access loggers.

ReverseTunnelFilter change

  • processAcceptedConnection() now passes the original (unscoped) node_id,
    cluster_id, and tenant_id to addConnectionSocket(). The socket manager
    derives tenant-scoped internal keys itself, keeping lifecycle metadata
    clean.

Docs

  • New "Lifecycle access logs" section in reverse_tunnel.rst documenting config,
    events, metadata fields, and the upstream network filter.

Risk Level: Low — additive only; no logs emitted unless access_log is configured.
Testing:

  • reverse_tunnel_upstream_lifecycle_test.cc — unit tests for the upstream
    lifecycle filter: filter state propagation, keepalive timeout event emission,
    remote/local close reason forwarding, deferred close log behavior.
  • upstream_socket_manager_test.cc — lifecycle log tests: tunnel_setup on add,
    socket_handoff on get, tunnel_closed on markSocketDead with correct close
    reasons for each idle-phase failure mode.
  • config_validation_test.cc — validates the new access_log proto field parses.
    Docs Changes: Added lifecycle access logs section and upstream network filter
    usage to docs/root/configuration/other_features/reverse_tunnel.rst.
    Release Notes: N/A (reverse tunnel is not a public-facing feature)
    Platform Specific Features: N/A

@repokitteh-read-only
Copy link
Copy Markdown

CC @envoyproxy/api-shepherds: Your approval is needed for changes made to (api/envoy/|docs/root/api-docs/).
envoyproxy/api-shepherds assignee is @markdroth
CC @envoyproxy/api-watchers: FYI only for changes made to (api/envoy/|docs/root/api-docs/).

🐱

Caused by: #44298 was opened by roll-no-21.

see: more, trace.

@roll-no-21
Copy link
Copy Markdown
Contributor Author

cc: @ivpr

@roll-no-21 roll-no-21 force-pushed the reverse-tunnel-lifecycle-logging branch 2 times, most recently from 4ee916f to 280ebee Compare April 7, 2026 07:50
@agrawroh
Copy link
Copy Markdown
Member

agrawroh commented Apr 7, 2026

/gemini review

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces lifecycle access logging for reverse tunnels. It adds an access_log configuration field to the upstream reverse connection socket interface and implements a new upstream network filter, envoy.filters.upstream_network.reverse_tunnel_lifecycle, to capture post-handoff events such as HTTP/2 keepalive timeouts. The implementation includes a new ReverseTunnelLifecycleInfo structure for metadata tracking and updates to the UpstreamSocketManager to support deferred logging. Feedback suggests addressing potential risks associated with raw file descriptor tracking and ensuring that the deferred logging logic has a fallback mechanism to prevent missing log entries if the filter fails to report a close reason.

return Network::FilterStatus::Continue;
}

fd_ = socket->ioHandle().fdDoNotUse();
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Using fdDoNotUse() is generally discouraged in Envoy as it bypasses the IoHandle abstraction. While it is used here to index into the UpstreamSocketManager maps, ensure that the file descriptor is guaranteed to be valid and unique for the duration of the lifecycle tracking. If the IoHandle is closed and the FD is recycled before the manager is notified, it could lead to incorrect metadata association.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The usage is safe here for four reasons: (1) the lifecycle filter is a connection callback, so the socket is guaranteed alive during onNewConnection() and
onEvent(); (2) UpstreamReverseConnectionIOHandle::close() calls markSocketDead(fd_) before releasing the socket, preventing FD recycling before cleanup; (3)
each worker thread has its own UpstreamSocketManager, eliminating cross-thread races; (4) all reverse tunnel FD-indexed maps follow this same ownership pattern.
Added a comment documenting these invariants.

Comment on lines +441 to +443
const bool defer_close_log = lifecycle.handed_off_to_upstream &&
lifecycle.upstream_lifecycle_filter_attached &&
!lifecycle.close_log_emitted && lifecycle.close_reason.empty();
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The logic for defer_close_log assumes that if a lifecycle filter is attached, it will eventually provide a close reason. If the filter fails to report a reason (e.g., due to an unexpected connection state or filter chain termination), the tunnel_closed log might be permanently deferred, leading to missing access log entries. Consider adding a fallback mechanism or a timeout to ensure the log is eventually emitted.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Valid concern. Added a fallback in ~UpstreamSocketManager() that emits any remaining un-emitted tunnel_closed logs with explicit_close reason before the
extension is nulled out. This covers the case where a lifecycle filter never fires (e.g., during shutdown or unexpected connection teardown). Added a test for
this.

@repokitteh-read-only
Copy link
Copy Markdown

CC @envoyproxy/coverage-shephards: FYI only for changes made to (test/coverage.yaml).
envoyproxy/coverage-shephards assignee is @RyanTheOptimist

🐱

Caused by: #44298 was synchronize by roll-no-21.

see: more, trace.

inline constexpr absl::string_view kFilterStateFd = "envoy.reverse_tunnel.fd";

inline constexpr absl::string_view kLifecycleEventTunnelSetup = "tunnel_setup";
inline constexpr absl::string_view kLifecycleEventIdlePingSent = "idle_ping_sent";
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't seem like these are implemented?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Implemented the four idle ping events — idle_ping_sent in sendPingForConnection(), idle_ping_ack in onPingResponse(), idle_ping_miss and idle_ping_timeout in
onPingTimeout() (with miss_count metadata). Removed http2_keepalive_ping_sent and http2_keepalive_ping_ack — those require HTTP/2 codec-layer hooks that are
outside this extension's scope. Updated docs accordingly.

StreamInfo::StreamInfoImpl stream_info(time_source, connection_info_provider,
StreamInfo::FilterState::LifeSpan::Connection);
if (lifecycle.fd >= 0) {
connection_info_provider->setConnectionID(lifecycle.fd);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this needed? The connection ID is supposed to be a monotonically increasing identifier for the connection, but FDs can be reused.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch — removed it. FDs are recycled by the OS so they're not valid connection IDs. The FD is already available in dynamic metadata (fd field) and filter
state (envoy.reverse_tunnel.fd) for log correlation, so %CONNECTION_ID% isn't needed on these synthetic entries.

@markdroth
Copy link
Copy Markdown
Contributor

/lgtm api

@repokitteh-read-only repokitteh-read-only Bot removed the api label Apr 8, 2026
Emit structured lifecycle records for tunnel setup, handoff, close, and
post-handoff HTTP/2 keepalive timeouts so reverse-tunnel behavior can be
debugged from access logs without relying on codec changes.

Signed-off-by: Krishna Sharma <krishnagpl2001@gmail.com>
…r metadata

The BUILD file referenced a non-existent reverse_tunnel_reporting_service
package, and the new upstream network filter was missing its entry in
extensions_metadata.yaml.

Co-authored-by: Isaac
Signed-off-by: Krishna Sharma <krishnagpl2001@gmail.com>
reverse_tunnel_acceptor_extension.cc includes source/server/generic_factory_context.h
but the BUILD file was missing the corresponding dependency.

Co-authored-by: Isaac
Signed-off-by: Krishna Sharma <krishnagpl2001@gmail.com>
- Add envoy.filters.upstream_network to extension categories schema
- Add type_urls (google.protobuf.Empty) to lifecycle filter metadata
- Fix clang-format violations in lifecycle and socket manager files
- Add 'handoff' and 'observations' to spelling dictionary

Co-authored-by: Isaac
Signed-off-by: Krishna Sharma <krishnagpl2001@gmail.com>
…s(1)

- Replace direct google/protobuf/empty.pb.h include with
  source/common/protobuf/protobuf.h per Envoy convention
- Fix clang-format in test files
- Remove redundant .Times(1) from EXPECT_CALL statements

Co-authored-by: Isaac
Signed-off-by: Krishna Sharma <krishnagpl2001@gmail.com>
Move protobuf.h into the source/ include block (alphabetical order).

Co-authored-by: Isaac
Signed-off-by: Krishna Sharma <krishnagpl2001@gmail.com>
Null out extension_ before calling markSocketDead during destruction.
During TLS teardown the ReverseTunnelAcceptorExtension may already be
destroyed, causing a segfault when markSocketDead tries to emit
lifecycle logs through the dangling pointer.

Co-authored-by: Isaac
Signed-off-by: Krishna Sharma <krishnagpl2001@gmail.com>
The lifecycle logging code has conditional paths (destructor cleanup,
empty access_logs guard) that aren't fully exercised in unit tests.

Co-authored-by: Isaac
Signed-off-by: Krishna Sharma <krishnagpl2001@gmail.com>
Co-authored-by: Isaac
Signed-off-by: Krishna Sharma <krishnagpl2001@gmail.com>
- Implement idle ping lifecycle events (idle_ping_sent, idle_ping_ack,
  idle_ping_miss, idle_ping_timeout) in UpstreamSocketManager at the
  appropriate points in sendPingForConnection, onPingResponse, and
  onPingTimeout. Remove unimplementable HTTP/2 keepalive ping constants
  (http2_keepalive_ping_sent, http2_keepalive_ping_ack) that require
  codec-layer hooks outside this extension's scope.

- Remove setConnectionID(lifecycle.fd) from emitSyntheticLifecycleLog.
  FDs are recycled by the OS and are not valid monotonically-increasing
  connection IDs. The FD is still available via dynamic metadata.

- Add deferred close log fallback in ~UpstreamSocketManager to emit
  any remaining un-emitted tunnel_closed logs before the extension is
  torn down, preventing permanently lost log entries.

- Add safety comment documenting why fdDoNotUse() is safe in the
  upstream lifecycle filter (ownership guarantee, connection callback
  lifetime, thread-local isolation).

- Add tests for all new idle ping events, destructor fallback, and
  synthetic connection ID behavior.

Co-authored-by: Isaac
Signed-off-by: Krishna Sharma <krishnagpl2001@gmail.com>
Co-authored-by: Isaac
Signed-off-by: Krishna Sharma <krishnagpl2001@gmail.com>
@roll-no-21 roll-no-21 force-pushed the reverse-tunnel-lifecycle-logging branch from dd4f684 to 2fd012b Compare April 9, 2026 13:54
The implicit member destruction order destroys access_logs_ before
tls_slot_. When tls_slot_ is then destroyed, the UpstreamSocketManager
destructor calls emitSyntheticLifecycleLog() which accesses the
already-destroyed access_logs_ vector, causing a segfault under GCC.

Add an explicit destructor that resets tls_slot_ first while all other
members (including access_logs_) are still alive.

Signed-off-by: Krishna Sharma <krishna.sharma@nutanix.com>

Signed-off-by: Krishna Sharma <krishnagpl2001@gmail.com>
The CopiesLifecycleIntoFilterStateAndLogsKeepaliveTimeoutOnce test
expects exactly one access log call. After the destructor fix, the
extension destructor properly resets the TLS slot which triggers a
deferred tunnel_closed close log. Clear access logs after assertions
to avoid this expected teardown behavior triggering mock over-saturation.

Signed-off-by: Krishna Sharma <krishna.sharma@nutanix.com>

Signed-off-by: Krishna Sharma <krishnagpl2001@gmail.com>
Copy link
Copy Markdown
Member

@phlax phlax left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks @roll-no-21 - some docs review

* ``tunnel_setup`` – emitted when a new reverse tunnel connection is accepted and cached.
* ``socket_handoff`` – emitted when a cached idle socket is handed off to an upstream connection pool.
* ``tunnel_closed`` – emitted when the reverse tunnel connection is closed.
* ``http2_keepalive_timeout`` – emitted when the upstream connection closes locally due to an HTTP/2 keepalive (PING) timeout after handoff.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* ``http2_keepalive_timeout`` – emitted when the upstream connection closes locally due to an HTTP/2 keepalive (PING) timeout after handoff.
* ``http2_keepalive_timeout`` – emitted when the upstream connection closes locally due to an HTTP/2 keepalive (``PING``) timeout after handoff.


**Idle-phase ping events** (emitted while the socket is idle in the cache):

* ``idle_ping_sent`` – an RPING probe was sent to verify the idle connection is alive.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* ``idle_ping_sent`` – an RPING probe was sent to verify the idle connection is alive.
* ``idle_ping_sent`` – an ``RPING`` probe was sent to verify the idle connection is alive.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same below


.. code-block:: yaml

bootstrap_extensions:
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should be in a literalinclude

i was hesitant to say this as im not clear whether the current validation works with bootstrap_extensions (altho have a feeling that it may have just been fixed)

if the validator does not support it - we can exclude it from validation - either way lets put this to a full bootstrap file

.. code-block:: yaml

clusters:
- name: reverse_connection_cluster
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

similar here - this should be a literalinclude

@phlax
Copy link
Copy Markdown
Member

phlax commented Apr 14, 2026

@RyanTheOptimist could you take a look please

Comment thread test/coverage.yaml
source/extensions/api_listeners: 55.0 # Many IS_ENVOY_BUG are not covered.
source/extensions/api_listeners/default_api_listener: 67.8 # Many IS_ENVOY_BUG are not covered.
source/extensions/bootstrap/reverse_tunnel/downstream_socket_interface: 96.3
source/extensions/bootstrap/reverse_tunnel/upstream_socket_interface: 95.4 # Lifecycle logging conditional paths
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to add unit tests which cover these codepaths?

@mathetake
Copy link
Copy Markdown
Member

@roll-no-21 looks like there are a few review comments to be addressed

/wait

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants