reverse_tunnel: add lifecycle access logging#44298
reverse_tunnel: add lifecycle access logging#44298roll-no-21 wants to merge 13 commits intoenvoyproxy:mainfrom
Conversation
|
CC @envoyproxy/api-shepherds: Your approval is needed for changes made to |
|
cc: @ivpr |
4ee916f to
280ebee
Compare
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request introduces lifecycle access logging for reverse tunnels. It adds an access_log configuration field to the upstream reverse connection socket interface and implements a new upstream network filter, envoy.filters.upstream_network.reverse_tunnel_lifecycle, to capture post-handoff events such as HTTP/2 keepalive timeouts. The implementation includes a new ReverseTunnelLifecycleInfo structure for metadata tracking and updates to the UpstreamSocketManager to support deferred logging. Feedback suggests addressing potential risks associated with raw file descriptor tracking and ensuring that the deferred logging logic has a fallback mechanism to prevent missing log entries if the filter fails to report a close reason.
| return Network::FilterStatus::Continue; | ||
| } | ||
|
|
||
| fd_ = socket->ioHandle().fdDoNotUse(); |
There was a problem hiding this comment.
Using fdDoNotUse() is generally discouraged in Envoy as it bypasses the IoHandle abstraction. While it is used here to index into the UpstreamSocketManager maps, ensure that the file descriptor is guaranteed to be valid and unique for the duration of the lifecycle tracking. If the IoHandle is closed and the FD is recycled before the manager is notified, it could lead to incorrect metadata association.
There was a problem hiding this comment.
The usage is safe here for four reasons: (1) the lifecycle filter is a connection callback, so the socket is guaranteed alive during onNewConnection() and
onEvent(); (2) UpstreamReverseConnectionIOHandle::close() calls markSocketDead(fd_) before releasing the socket, preventing FD recycling before cleanup; (3)
each worker thread has its own UpstreamSocketManager, eliminating cross-thread races; (4) all reverse tunnel FD-indexed maps follow this same ownership pattern.
Added a comment documenting these invariants.
| const bool defer_close_log = lifecycle.handed_off_to_upstream && | ||
| lifecycle.upstream_lifecycle_filter_attached && | ||
| !lifecycle.close_log_emitted && lifecycle.close_reason.empty(); |
There was a problem hiding this comment.
The logic for defer_close_log assumes that if a lifecycle filter is attached, it will eventually provide a close reason. If the filter fails to report a reason (e.g., due to an unexpected connection state or filter chain termination), the tunnel_closed log might be permanently deferred, leading to missing access log entries. Consider adding a fallback mechanism or a timeout to ensure the log is eventually emitted.
There was a problem hiding this comment.
Valid concern. Added a fallback in ~UpstreamSocketManager() that emits any remaining un-emitted tunnel_closed logs with explicit_close reason before the
extension is nulled out. This covers the case where a lifecycle filter never fires (e.g., during shutdown or unexpected connection teardown). Added a test for
this.
|
CC @envoyproxy/coverage-shephards: FYI only for changes made to |
| inline constexpr absl::string_view kFilterStateFd = "envoy.reverse_tunnel.fd"; | ||
|
|
||
| inline constexpr absl::string_view kLifecycleEventTunnelSetup = "tunnel_setup"; | ||
| inline constexpr absl::string_view kLifecycleEventIdlePingSent = "idle_ping_sent"; |
There was a problem hiding this comment.
Doesn't seem like these are implemented?
There was a problem hiding this comment.
Implemented the four idle ping events — idle_ping_sent in sendPingForConnection(), idle_ping_ack in onPingResponse(), idle_ping_miss and idle_ping_timeout in
onPingTimeout() (with miss_count metadata). Removed http2_keepalive_ping_sent and http2_keepalive_ping_ack — those require HTTP/2 codec-layer hooks that are
outside this extension's scope. Updated docs accordingly.
| StreamInfo::StreamInfoImpl stream_info(time_source, connection_info_provider, | ||
| StreamInfo::FilterState::LifeSpan::Connection); | ||
| if (lifecycle.fd >= 0) { | ||
| connection_info_provider->setConnectionID(lifecycle.fd); |
There was a problem hiding this comment.
Is this needed? The connection ID is supposed to be a monotonically increasing identifier for the connection, but FDs can be reused.
There was a problem hiding this comment.
Good catch — removed it. FDs are recycled by the OS so they're not valid connection IDs. The FD is already available in dynamic metadata (fd field) and filter
state (envoy.reverse_tunnel.fd) for log correlation, so %CONNECTION_ID% isn't needed on these synthetic entries.
|
/lgtm api |
Emit structured lifecycle records for tunnel setup, handoff, close, and post-handoff HTTP/2 keepalive timeouts so reverse-tunnel behavior can be debugged from access logs without relying on codec changes. Signed-off-by: Krishna Sharma <krishnagpl2001@gmail.com>
…r metadata The BUILD file referenced a non-existent reverse_tunnel_reporting_service package, and the new upstream network filter was missing its entry in extensions_metadata.yaml. Co-authored-by: Isaac Signed-off-by: Krishna Sharma <krishnagpl2001@gmail.com>
reverse_tunnel_acceptor_extension.cc includes source/server/generic_factory_context.h but the BUILD file was missing the corresponding dependency. Co-authored-by: Isaac Signed-off-by: Krishna Sharma <krishnagpl2001@gmail.com>
- Add envoy.filters.upstream_network to extension categories schema - Add type_urls (google.protobuf.Empty) to lifecycle filter metadata - Fix clang-format violations in lifecycle and socket manager files - Add 'handoff' and 'observations' to spelling dictionary Co-authored-by: Isaac Signed-off-by: Krishna Sharma <krishnagpl2001@gmail.com>
…s(1) - Replace direct google/protobuf/empty.pb.h include with source/common/protobuf/protobuf.h per Envoy convention - Fix clang-format in test files - Remove redundant .Times(1) from EXPECT_CALL statements Co-authored-by: Isaac Signed-off-by: Krishna Sharma <krishnagpl2001@gmail.com>
Move protobuf.h into the source/ include block (alphabetical order). Co-authored-by: Isaac Signed-off-by: Krishna Sharma <krishnagpl2001@gmail.com>
Null out extension_ before calling markSocketDead during destruction. During TLS teardown the ReverseTunnelAcceptorExtension may already be destroyed, causing a segfault when markSocketDead tries to emit lifecycle logs through the dangling pointer. Co-authored-by: Isaac Signed-off-by: Krishna Sharma <krishnagpl2001@gmail.com>
The lifecycle logging code has conditional paths (destructor cleanup, empty access_logs guard) that aren't fully exercised in unit tests. Co-authored-by: Isaac Signed-off-by: Krishna Sharma <krishnagpl2001@gmail.com>
Co-authored-by: Isaac Signed-off-by: Krishna Sharma <krishnagpl2001@gmail.com>
- Implement idle ping lifecycle events (idle_ping_sent, idle_ping_ack, idle_ping_miss, idle_ping_timeout) in UpstreamSocketManager at the appropriate points in sendPingForConnection, onPingResponse, and onPingTimeout. Remove unimplementable HTTP/2 keepalive ping constants (http2_keepalive_ping_sent, http2_keepalive_ping_ack) that require codec-layer hooks outside this extension's scope. - Remove setConnectionID(lifecycle.fd) from emitSyntheticLifecycleLog. FDs are recycled by the OS and are not valid monotonically-increasing connection IDs. The FD is still available via dynamic metadata. - Add deferred close log fallback in ~UpstreamSocketManager to emit any remaining un-emitted tunnel_closed logs before the extension is torn down, preventing permanently lost log entries. - Add safety comment documenting why fdDoNotUse() is safe in the upstream lifecycle filter (ownership guarantee, connection callback lifetime, thread-local isolation). - Add tests for all new idle ping events, destructor fallback, and synthetic connection ID behavior. Co-authored-by: Isaac Signed-off-by: Krishna Sharma <krishnagpl2001@gmail.com>
Co-authored-by: Isaac Signed-off-by: Krishna Sharma <krishnagpl2001@gmail.com>
dd4f684 to
2fd012b
Compare
The implicit member destruction order destroys access_logs_ before tls_slot_. When tls_slot_ is then destroyed, the UpstreamSocketManager destructor calls emitSyntheticLifecycleLog() which accesses the already-destroyed access_logs_ vector, causing a segfault under GCC. Add an explicit destructor that resets tls_slot_ first while all other members (including access_logs_) are still alive. Signed-off-by: Krishna Sharma <krishna.sharma@nutanix.com> Signed-off-by: Krishna Sharma <krishnagpl2001@gmail.com>
The CopiesLifecycleIntoFilterStateAndLogsKeepaliveTimeoutOnce test expects exactly one access log call. After the destructor fix, the extension destructor properly resets the TLS slot which triggers a deferred tunnel_closed close log. Clear access logs after assertions to avoid this expected teardown behavior triggering mock over-saturation. Signed-off-by: Krishna Sharma <krishna.sharma@nutanix.com> Signed-off-by: Krishna Sharma <krishnagpl2001@gmail.com>
phlax
left a comment
There was a problem hiding this comment.
thanks @roll-no-21 - some docs review
| * ``tunnel_setup`` – emitted when a new reverse tunnel connection is accepted and cached. | ||
| * ``socket_handoff`` – emitted when a cached idle socket is handed off to an upstream connection pool. | ||
| * ``tunnel_closed`` – emitted when the reverse tunnel connection is closed. | ||
| * ``http2_keepalive_timeout`` – emitted when the upstream connection closes locally due to an HTTP/2 keepalive (PING) timeout after handoff. |
There was a problem hiding this comment.
| * ``http2_keepalive_timeout`` – emitted when the upstream connection closes locally due to an HTTP/2 keepalive (PING) timeout after handoff. | |
| * ``http2_keepalive_timeout`` – emitted when the upstream connection closes locally due to an HTTP/2 keepalive (``PING``) timeout after handoff. |
|
|
||
| **Idle-phase ping events** (emitted while the socket is idle in the cache): | ||
|
|
||
| * ``idle_ping_sent`` – an RPING probe was sent to verify the idle connection is alive. |
There was a problem hiding this comment.
| * ``idle_ping_sent`` – an RPING probe was sent to verify the idle connection is alive. | |
| * ``idle_ping_sent`` – an ``RPING`` probe was sent to verify the idle connection is alive. |
|
|
||
| .. code-block:: yaml | ||
|
|
||
| bootstrap_extensions: |
There was a problem hiding this comment.
this should be in a literalinclude
i was hesitant to say this as im not clear whether the current validation works with bootstrap_extensions (altho have a feeling that it may have just been fixed)
if the validator does not support it - we can exclude it from validation - either way lets put this to a full bootstrap file
| .. code-block:: yaml | ||
|
|
||
| clusters: | ||
| - name: reverse_connection_cluster |
There was a problem hiding this comment.
similar here - this should be a literalinclude
|
@RyanTheOptimist could you take a look please |
| source/extensions/api_listeners: 55.0 # Many IS_ENVOY_BUG are not covered. | ||
| source/extensions/api_listeners/default_api_listener: 67.8 # Many IS_ENVOY_BUG are not covered. | ||
| source/extensions/bootstrap/reverse_tunnel/downstream_socket_interface: 96.3 | ||
| source/extensions/bootstrap/reverse_tunnel/upstream_socket_interface: 95.4 # Lifecycle logging conditional paths |
There was a problem hiding this comment.
Is it possible to add unit tests which cover these codepaths?
|
@roll-no-21 looks like there are a few review comments to be addressed /wait |
Emit structured lifecycle records for tunnel setup, handoff, close, and post-handoff HTTP/2 keepalive timeouts so reverse-tunnel behavior can be debugged from access logs without relying on codec changes.
Commit Message: reverse_tunnel: add lifecycle access logging
Additional Description:
Today, debugging reverse tunnel connection issues (silent drops, keepalive
timeouts, unexpected closes) requires correlating codec-level debug logs
across multiple threads and connections. This PR adds first-class access log
support to the reverse tunnel upstream socket interface so that operators can
observe the full tunnel lifecycle from structured, machine-parseable access
logs.
Proto change (
UpstreamReverseConnectionSocketInterface)A new
repeated config.accesslog.v3.AccessLog access_log = 6field is addedto the bootstrap config. When configured, lifecycle events are emitted through
the standard Envoy access log framework. No logs are emitted when the field is
absent — fully backward compatible.
Lifecycle info struct (
ReverseTunnelLifecycleInfo)A new
reverse_tunnel_lifecycle_info.hheader defines:ReverseTunnelLifecycleInfo— per-socket metadata carried through the tunnellifetime:
node_id,cluster_id,tenant_id, local/remote addresses,worker name, file descriptor, and state flags (
handed_off_to_upstream,upstream_lifecycle_filter_attached,socket_dead_notified,close_log_emitted,close_reason).kinds, close reasons, filter-state keys, and the dynamic metadata namespace.
Lifecycle events emitted
Each access log record is written into the
envoy.reverse_tunnel.lifecycledynamic metadata namespace with these fields:
eventnode_idcluster_idtenant_idworkerfdsocket_stateidle,handed_off, orin_use.handoff_kindsocket_handoff:pool_to_upstream.close_reasontunnel_closed(see close reasons below).The same node/cluster/tenant/worker/fd values are also written into connection
filter state under
envoy.reverse_tunnel.*keys so they can be referenced in%FILTER_STATE(...)%format strings.Core lifecycle events:
tunnel_setup— reverse tunnel accepted and cached.socket_handoff— idle socket handed off to the upstream connection pool.tunnel_closed— socket removed from tracking (final event for every tunnel).http2_keepalive_timeout— upstream closed locally due to HTTP/2 PING timeoutafter handoff.
Idle-phase ping events (while socket is cached):
idle_ping_sent/idle_ping_ack/idle_ping_miss/idle_ping_timeoutPost-handoff HTTP/2 keepalive events:
http2_keepalive_ping_sent/http2_keepalive_ping_ackClose reason values on
tunnel_closed:idle_peer_close— peer closed while idle.idle_read_error— read error on the idle socket.idle_ping_write_failure— RPING write failed.idle_ping_timeout— RPINGs went unanswered.remote_close— peer closed after handoff.local_close— Envoy closed after handoff.explicit_close— explicit close (e.g. shutdown).Upstream network filter (
envoy.filters.upstream_network.reverse_tunnel_lifecycle)A new
ReverseTunnelUpstreamLifecycleFilter(upstream read filter) isintroduced to observe post-handoff connection events. On
onNewConnection()itcopies the tunnel's
ReverseTunnelLifecycleInfofrom the socket manager intothe upstream connection's filter state. On
LocalClosewithhttp2_ping_timeout, it emits thehttp2_keepalive_timeoutevent through theextension's access loggers. It also provides the final close reason to the
socket manager so the deferred
tunnel_closedrecord carries the real reasonrather than a generic
explicit_close.UpstreamSocketManagerchangesfd_to_lifecycle_info_map alongside existing fd maps.tunnel_setuponaddConnectionSocket().socket_handoffongetConnectionSocket().tunnel_closedfrommarkSocketDead(), with deferred close logsupport: when the upstream lifecycle filter is attached and no close reason
is known yet, the close log is deferred until the filter reports the reason
via
maybeEmitDeferredCloseLog().idle_peer_close,idle_read_error,idle_ping_write_failure,idle_ping_timeout.tenant_idthroughhandoffSocketToWorker()andaddConnectionSocket()so lifecycle info retains the original tenant.ReverseTunnelAcceptorExtensionchanges.ccto instantiate access loggers fromconfig via
AccessLog::AccessLogFactory::fromProto().emitSyntheticLifecycleLog()— builds a temporaryStreamInfoImplfromlifecycle metadata (used for idle-phase events and close events where no
real connection StreamInfo exists).
emitConnectionLifecycleLog()— populates an existing connection'sStreamInfo with lifecycle metadata (used for post-handoff events).
access loggers.
ReverseTunnelFilterchangeprocessAcceptedConnection()now passes the original (unscoped) node_id,cluster_id, and tenant_id to
addConnectionSocket(). The socket managerderives tenant-scoped internal keys itself, keeping lifecycle metadata
clean.
Docs
reverse_tunnel.rstdocumenting config,events, metadata fields, and the upstream network filter.
Risk Level: Low — additive only; no logs emitted unless
access_logis configured.Testing:
reverse_tunnel_upstream_lifecycle_test.cc— unit tests for the upstreamlifecycle filter: filter state propagation, keepalive timeout event emission,
remote/local close reason forwarding, deferred close log behavior.
upstream_socket_manager_test.cc— lifecycle log tests: tunnel_setup on add,socket_handoff on get, tunnel_closed on markSocketDead with correct close
reasons for each idle-phase failure mode.
config_validation_test.cc— validates the new access_log proto field parses.Docs Changes: Added lifecycle access logs section and upstream network filter
usage to
docs/root/configuration/other_features/reverse_tunnel.rst.Release Notes: N/A (reverse tunnel is not a public-facing feature)
Platform Specific Features: N/A