Shutdown Websocket Connections Gracefully in the queue-proxy #16362

Merged
knative-prow[bot] merged 10 commits into knative:main from dprotaso:websocket-graceful on Feb 10, 2026
Conversation

@dprotaso
Member

  • have the websocket server test image send a close message when shutting down gracefully
  • include an e2e test that asserts websocket connections are gracefully handled
  • include a special handler to track hijacked connections and create a drain method (see the sketch below)
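
In rough strokes, the hijack-tracking handler mentioned in the last bullet could look like the sketch below. This is illustrative only, not the handler added in this PR: the type names (hijackTracker, trackingWriter, trackedConn), the polling interval, and the counting strategy are all assumptions.

```go
// Illustrative sketch; not the queue-proxy's actual handler.
package sketch

import (
	"bufio"
	"context"
	"net"
	"net/http"
	"sync"
	"sync/atomic"
	"time"
)

// hijackTracker wraps a handler, counts connections that get hijacked
// (e.g. by httputil.ReverseProxy when proxying websockets), and exposes
// a Drain method that waits for those connections to close.
type hijackTracker struct {
	next     http.Handler
	inflight atomic.Int64
}

func (h *hijackTracker) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	if hj, ok := w.(http.Hijacker); ok {
		w = &trackingWriter{ResponseWriter: w, hijacker: hj, count: &h.inflight}
	}
	h.next.ServeHTTP(w, r)
}

// Drain blocks until every hijacked connection has closed or ctx expires.
func (h *hijackTracker) Drain(ctx context.Context) error {
	for h.inflight.Load() > 0 {
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(10 * time.Millisecond):
		}
	}
	return nil
}

// trackingWriter intercepts Hijack so the tracker sees the raw net.Conn.
type trackingWriter struct {
	http.ResponseWriter
	hijacker http.Hijacker
	count    *atomic.Int64
}

func (t *trackingWriter) Hijack() (net.Conn, *bufio.ReadWriter, error) {
	conn, rw, err := t.hijacker.Hijack()
	if err == nil {
		t.count.Add(1)
		conn = &trackedConn{Conn: conn, count: t.count}
	}
	return conn, rw, err
}

// trackedConn decrements the in-flight count exactly once when closed.
type trackedConn struct {
	net.Conn
	once  sync.Once
	count *atomic.Int64
}

func (c *trackedConn) Close() error {
	c.once.Do(func() { c.count.Add(-1) })
	return c.Conn.Close()
}
```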

@knative-prow knative-prow bot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Jan 27, 2026
@dprotaso dprotaso changed the title Shutdown Websocket Connections Gracefully [wip] Shutdown Websocket Connections Gracefully Jan 27, 2026
@knative-prow

knative-prow bot commented Jan 27, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dprotaso

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@knative-prow knative-prow bot added approved Indicates a PR has been approved by an approver from all required OWNERS files. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. labels Jan 27, 2026
@dprotaso dprotaso changed the title [wip] Shutdown Websocket Connections Gracefully [wip] Shutdown Websocket Connections Gracefully in the queue-proxy Jan 27, 2026
@codecov

codecov bot commented Jan 27, 2026

Codecov Report

❌ Patch coverage is 41.66667% with 21 lines in your changes missing coverage. Please review.
✅ Project coverage is 80.22%. Comparing base (2cd582b) to head (1e11ed1).
⚠️ Report is 9 commits behind head on main.

Files with missing lines | Patch % | Lines
pkg/queue/sharedmain/handlers.go | 0.00% | 11 Missing ⚠️
pkg/queue/sharedmain/main.go | 0.00% | 10 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main   #16362      +/-   ##
==========================================
+ Coverage   80.18%   80.22%   +0.03%     
==========================================
  Files         216      217       +1     
  Lines       13440    13507      +67     
==========================================
+ Hits        10777    10836      +59     
- Misses       2298     2306       +8     
  Partials      365      365              

☔ View full report in Codecov by Sentry.

@dprotaso
Member Author

One observation I have is that the activator will drop hijacked connections, similar to the queue-proxy.

Hijacking happens implicitly with the httputil.ReverseProxy - https://cs.opensource.google/go/go/+/refs/tags/go1.25.6:src/net/http/httputil/reverseproxy.go;l=504

The solution in the queue-proxy is straightforward because we know K8s will send a TERM signal to the user container, so we can expect apps to return a close frame when that happens. Those are the changes I made to our websocket server. This isn't strictly necessary, but it's arguably good practice.
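
In code, the test-image change boils down to something like the following. This is a hedged sketch assuming github.com/gorilla/websocket and a single global connection registry; it is not the exact code in the PR.

```go
// Sketch: send a websocket close frame to connected clients when the
// container receives SIGTERM. Assumes github.com/gorilla/websocket;
// this is not the actual knative websocket test image.
package main

import (
	"log"
	"net/http"
	"os"
	"os/signal"
	"sync"
	"syscall"
	"time"

	"github.com/gorilla/websocket"
)

var (
	upgrader = websocket.Upgrader{}
	mu       sync.Mutex
	conns    = map[*websocket.Conn]struct{}{}
)

func echo(w http.ResponseWriter, r *http.Request) {
	c, err := upgrader.Upgrade(w, r, nil)
	if err != nil {
		return
	}
	mu.Lock()
	conns[c] = struct{}{}
	mu.Unlock()

	// Echo until the client (or our shutdown path) ends the connection.
	for {
		mt, msg, err := c.ReadMessage()
		if err != nil {
			break
		}
		if err := c.WriteMessage(mt, msg); err != nil {
			break
		}
	}

	mu.Lock()
	delete(conns, c)
	mu.Unlock()
	c.Close()
}

func main() {
	go func() {
		sig := make(chan os.Signal, 1)
		signal.Notify(sig, syscall.SIGTERM)
		<-sig

		// Tell every client we are going away; WriteControl is safe to
		// call concurrently with the echo loop's writes.
		mu.Lock()
		defer mu.Unlock()
		msg := websocket.FormatCloseMessage(websocket.CloseGoingAway, "shutting down")
		for c := range conns {
			c.WriteControl(websocket.CloseMessage, msg, time.Now().Add(time.Second))
		}
	}()

	http.HandleFunc("/", echo)
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```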

With the activator, though, there's no way to signal downstream to end the connection. We could maybe send a close frame on the connection, but that makes the activator protocol-aware, and I don't think we can write a message on an upstream connection like this (it could be interleaved with another).

Given this, I don't think the activator issue should prevent this change from merging; it's just that the solution there isn't straightforward.

@dprotaso
Member Author

/retest

hmm

websocket_test.go:162: Successfully wrote: "Hello, websocket"
websocket_test.go:190: error running test failed to read message: read tcp 10.250.69.110:60956->34.138.223.27:80: read: connection reset by peer

@dprotaso dprotaso force-pushed the websocket-graceful branch 2 times, most recently from 9f80c61 to cd8e486, on February 5, 2026 02:46
@knative-prow-robot knative-prow-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Feb 5, 2026
@knative-prow-robot knative-prow-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Feb 5, 2026
@dprotaso
Member Author

dprotaso commented Feb 6, 2026

/retest

@dprotaso dprotaso changed the title [wip] Shutdown Websocket Connections Gracefully in the queue-proxy Shutdown Websocket Connections Gracefully in the queue-proxy Feb 6, 2026
@knative-prow knative-prow bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Feb 6, 2026
  gotestsum --format testname -- \
  -race -count=1 -parallel=1 -tags=e2e \
- -timeout=30m \
+ -timeout=35m \
Member Author

With the extra test we need to bump this, since we run things with -parallel=1 because the default GitHub Actions runners are so under-provisioned.

Member Author

@dprotaso dprotaso Feb 7, 2026

I adjusted the test to not use time.Sleep to speed it up

Contributor

Maybe not that relevant for this PR, but timing tests could also be sped up using synctest once Go 1.25 is required; see https://go.dev/blog/testing-time
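
For reference, a minimal sketch of what testing/synctest enables in Go 1.25 (illustrative only, not code from this PR):

```go
// Illustrative only; not from this PR.
package sketch

import (
	"testing"
	"testing/synctest"
	"time"
)

func TestWithFakeClock(t *testing.T) {
	// Inside the bubble the clock is virtual: it jumps forward whenever
	// every goroutine in the bubble is durably blocked, so this test
	// finishes immediately in real time.
	synctest.Test(t, func(t *testing.T) {
		done := make(chan struct{})
		go func() {
			time.Sleep(30 * time.Second) // no real 30s elapse
			close(done)
		}()
		<-done // blocking here lets the fake clock advance past the sleep
	})
}
```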

Member Author

Not for e2e. It would run faster if we ran tests in parallel, but with kind on the GitHub Actions runners they are so under-provisioned that tests flake out, etc.

@dprotaso
Member Author

dprotaso commented Feb 7, 2026

/assign @linkvt @elijah-rou

@knative-prow

knative-prow bot commented Feb 7, 2026

@dprotaso: GitHub didn't allow me to assign the following users: elijah-rou.

Note that only knative members with read permissions, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time.
For more information please see the contributor guide

In response to this:

/assign @linkvt @elijah-rou

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Contributor

@linkvt linkvt left a comment

Looks good to me in general. It took me a while to figure out the concurrency and what is actually tested; I left a few minor comments.

Comment on lines 67 to 73
<-inHandler
close(drainStarted)
drainResult <- h.Drain(context.Background())
}()

<-drainStarted
close(handlerWait)
Contributor

I'm not sure this always behaves as expected; couldn't a potential sequence be:

  1. B: inHandler is unblocked
  2. B: close drainStarted
  3. Test: drainStarted is unblocked
  4. Test: close handlerWait
  5. A: handlerWait is unblocked, returns and inflight requests equals 0
  6. B: nothing to Drain

Should this test verify that drain works correctly when a connection is closed after the tracker was asked to Drain?

Member Author

You think I should add a sleep in-between drainStarted being unblocked and closing handlerWait?

Member Author

I think I was assuming that close(drainStarted) wouldn't yield the goroutine, so the drain would actually start, but I don't believe there is a guarantee like that in Go.

Contributor

@linkvt linkvt Feb 10, 2026

Yes, I think we can remove drainStarted, move <-inHandler out of the second goroutine (between the two goroutines), and add a sleep before close(handlerWait).

This way the handler would start and unblock <-inHandler, the drain goroutine would start, and the test func would sleep so that Drain runs its loop a few times before close(handlerWait) lets the handler finish.

Edit: We could even add a select to check whether <-drainResult has received a value before this sleep passes.
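
A self-contained sketch of that suggested shape, using a stand-in tracker with a polling Drain rather than the PR's actual handler (all names and timings here are illustrative):

```go
// Self-contained illustration of the suggested test shape. The tracker
// here is a stand-in, not the handler from this PR.
package sketch

import (
	"context"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
	"testing"
	"time"
)

type tracker struct {
	next     http.Handler
	inflight atomic.Int64
}

func (tr *tracker) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	tr.inflight.Add(1)
	defer tr.inflight.Add(-1)
	tr.next.ServeHTTP(w, r)
}

// Drain polls until no requests are in flight or ctx expires.
func (tr *tracker) Drain(ctx context.Context) error {
	for tr.inflight.Load() > 0 {
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(5 * time.Millisecond):
		}
	}
	return nil
}

func TestDrainWaitsForInflight(t *testing.T) {
	inHandler := make(chan struct{})
	handlerWait := make(chan struct{})
	drainResult := make(chan error, 1)

	tr := &tracker{next: http.HandlerFunc(func(http.ResponseWriter, *http.Request) {
		close(inHandler) // signal the test that the request is in flight
		<-handlerWait    // block until the test releases the handler
	})}

	go tr.ServeHTTP(httptest.NewRecorder(), httptest.NewRequest(http.MethodGet, "/", nil))

	// Only start draining once the request is definitely in the handler.
	<-inHandler
	go func() { drainResult <- tr.Drain(context.Background()) }()

	// Drain must not finish while the request is still blocked; give its
	// polling loop time to run a few iterations.
	select {
	case err := <-drainResult:
		t.Fatalf("Drain returned %v before the request finished", err)
	case <-time.After(50 * time.Millisecond):
	}

	close(handlerWait) // release the handler; now Drain should complete
	if err := <-drainResult; err != nil {
		t.Fatalf("Drain() = %v, want nil", err)
	}
}
```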

Member Author

done - take a look

@linkvt
Contributor

linkvt commented Feb 10, 2026

/lgtm

@knative-prow knative-prow bot added the lgtm Indicates that a PR is ready to be merged. label Feb 10, 2026
@knative-prow knative-prow bot merged commit ce8ae6c into knative:main Feb 10, 2026
90 checks passed
@dprotaso dprotaso deleted the websocket-graceful branch February 10, 2026 19:03