Add kn.revision.request.queued and kn.revision.request.active metrics #16367

knative-prow[bot] merged 4 commits into knative:main
Conversation
Adds an observable gauge metric that reports the number of in-flight requests per revision in the activator's throttler. Uses a callback-based approach to avoid any overhead on the request hot path - values are only computed when metrics are scraped.
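For reference, a minimal sketch of that pattern with the OpenTelemetry Go metric API; the queueDepthProvider interface is a hypothetical stand-in for the throttler, while the metric name matches the kn.activator.queue.depth gauge quoted in the review below:

```go
package net

import (
	"context"

	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/metric"
)

// queueDepthProvider is a hypothetical interface for anything that can
// report the current in-flight request count per revision.
type queueDepthProvider interface {
	QueueDepths() map[string]int64
}

// registerQueueDepthGauge wires the provider to an observable gauge.
// Nothing runs per request; the callback fires only when the meter is
// scraped, which is what keeps the hot path free of metric bookkeeping.
func registerQueueDepthGauge(meter metric.Meter, p queueDepthProvider) (metric.Registration, error) {
	gauge, err := meter.Int64ObservableGauge(
		"kn.activator.queue.depth",
		metric.WithDescription("Number of requests currently queued in the activator for a revision"),
	)
	if err != nil {
		return nil, err
	}
	return meter.RegisterCallback(func(_ context.Context, o metric.Observer) error {
		for rev, depth := range p.QueueDepths() {
			o.ObserveInt64(gauge, depth,
				metric.WithAttributes(attribute.String("kn.revision.name", rev)))
		}
		return nil
	}, gauge)
}
```

Unregistering the returned metric.Registration is what the Shutdown method discussed below does.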
Hi @prashanthjos. Thanks for your PR. I'm waiting for a knative member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test. Once the patch is verified, the new status will be reflected by the ok-to-test label. I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
Codecov Report

❌ Patch coverage is …
Additional details and impacted files

```
@@            Coverage Diff             @@
##             main   #16367      +/-   ##
==========================================
+ Coverage   80.17%   80.25%   +0.08%
==========================================
  Files         216      216
  Lines       13440    13483      +43
==========================================
+ Hits        10775    10821      +46
+ Misses       2300     2296       -4
- Partials      365      366       +1
```

☔ View full report in Codecov by Sentry.
linkvt left a comment:
Hi @prashanthjos, thanks for the PR!
I left a few comments; this will also still need a review from a maintainer with more experience.
/ok-to-test
pkg/activator/net/metrics_test.go (outdated)

```go
func TestThrottlerMetricsShutdownNilObserver(t *testing.T) {
```

Seems like a duplicate of what is tested in TestNewThrottlerMetricsWithNilProvider.
pkg/activator/net/metrics_test.go (outdated)

```go
func TestThrottlerMetricsEmptyQueueDepth(t *testing.T) {
```

We could maybe merge this and TestNewThrottlerMetrics by using a table-driven test? See https://go.dev/wiki/TableDrivenTests#using-a-map-to-store-test-cases
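A possible shape for that merge, using the map-keyed layout from the linked wiki page; the case data and the length check are placeholders for the real fake-provider setup and metric assertions:

```go
package net

import "testing"

// Sketch of a map-keyed table-driven test; each case name doubles as the
// subtest name, so adding a new case is a one-line change.
func TestThrottlerMetrics(t *testing.T) {
	tests := map[string]struct {
		depths  map[string]int64 // per-revision depths fed to a fake provider
		wantObs int              // observations expected when the gauge is scraped
	}{
		"nil provider":      {depths: nil, wantObs: 0},
		"empty queue depth": {depths: map[string]int64{}, wantObs: 0},
		"single revision":   {depths: map[string]int64{"rev-1": 3}, wantObs: 1},
	}
	for name, tc := range tests {
		t.Run(name, func(t *testing.T) {
			// Placeholder assertion; the real test would scrape the meter
			// and compare the observed data points.
			if got := len(tc.depths); got != tc.wantObs {
				t.Errorf("observations = %d, want %d", got, tc.wantObs)
			}
		})
	}
}
```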
pkg/activator/net/throttler.go (outdated)

```go
// Initialize metrics with the throttler as the queue depth provider.
metrics, err := newThrottlerMetrics(mp, t)
if err != nil {
	logger.Warnw("Failed to initialize throttler metrics", "error", err)
```

I understand that there is currently no error passed to callers of NewThrottler, but I don't think that swallowing the error here is the best solution.
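One alternative to swallowing the error, sketched below with stub types standing in for the PR's real Throttler and newThrottlerMetrics, is to let NewThrottler return it so callers decide how to handle a broken metrics setup:

```go
package net

import (
	"fmt"

	"go.opentelemetry.io/otel/metric"
)

// Stubs so the sketch compiles standalone; the real definitions live in the PR.
type throttlerMetrics struct{}

type Throttler struct {
	metrics *throttlerMetrics
}

func newThrottlerMetrics(mp metric.MeterProvider, t *Throttler) (*throttlerMetrics, error) {
	_, _ = mp, t
	return &throttlerMetrics{}, nil
}

// NewThrottler propagates the metrics-initialization error to the caller
// instead of logging and dropping it.
func NewThrottler(mp metric.MeterProvider) (*Throttler, error) {
	t := &Throttler{}
	m, err := newThrottlerMetrics(mp, t)
	if err != nil {
		return nil, fmt.Errorf("initializing throttler metrics: %w", err)
	}
	t.metrics = m
	return t, nil
}
```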
pkg/activator/net/metrics.go (outdated)

```go
// Shutdown unregisters the metrics callback.
func (m *throttlerMetrics) Shutdown() error {
```

Maybe I'm missing something, but it seems this is only called in tests?
pkg/activator/net/metrics.go (outdated)

```go
queueDepth, err := meter.Int64ObservableGauge(
	"kn.activator.queue.depth",
	metric.WithDescription("Number of requests currently queued in the activator for a revision"),
```

The metric name and description use the word "queue", but below we are tracking in-flight requests, which seems a bit different to me. Is this intended?

The terminology "queue" is used in the documentation, and I am open to changing it:

> The activator is part of the data-plane. It is responsible for queueing incoming requests (if a Knative Service is scaled to zero). It communicates with the autoscaler to bring scaled-to-zero Services back up and forward the queued requests. The activator can also act as a request buffer to handle traffic bursts.

Additional details can be found here.
Thank you @vincent-pli, I will wait for @dprotaso to also review and will address all the comments together.
We already have the metric kn.revision.request.concurrency - is this not the metric you're looking for, based on your description?
Thank you Dave. Looking at the code more carefully, I see that ConcurrencyReporter.Handler wraps the activationHandler (which calls throttler.Try), so handleRequestIn/handleRequestOut are called before entering and after exiting the throttler, respectively. This means kn.revision.request.concurrency already captures requests that are waiting in the throttler queue. The difference between the two metrics comes down to aggregation: we want to capture the raw number instead of having it pre-averaged, so we can apply our own aggregations (average/sum/count/max/percentiles) based on specific monitoring needs. Pre-averaged data limits flexibility in how we can analyze queue behavior.
I think it would be easier to add this type of instrumentation where we already produce our spans - then we shouldn't need to mess with our breakers:

serving/pkg/activator/handler/handler.go, line 114 in 11899d4

It might be worth refactoring the spans logic as well into a similar struct that encapsulates both, similar to https://github.com/knative/serving/blob/main/pkg/activator/handler/metrics.go

As for attribute naming, we already have kn.revision.name and k8s.namespace.name, so we should use those. We can/probably should include other attributes if they exist.

As for the metric names - thoughts on kn.revision.request.queued and kn.revision.request.active?
@dprotaso I am OK with the other changes, but from handler.go we can only count requests entering/exiting throttler.Try(). We can't distinguish "waiting in queue" vs "actively executing" without accessing the breaker internals.
@prashanthjos it should work similar to how we measure spans:

```go
// increment queue count
if err := a.throttler.Try(tryContext, revID, func(dest string, isClusterIP bool) error {
	// decrement queue count
	// increment executing count
	// defer decrement executing count
	...
	return nil
}); err != nil {
	// we need to distinguish between failure to proxy and failure to queue
	// decrement queue count if failure to queue
	...
}
```
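For concreteness, here is a compilable version of that sketch under some assumptions: the trier interface stands in for the real throttler (whose Try also takes a revision ID), the instruments and attribute options are passed in, and the executed flag is one way to tell "failed to queue" apart from "failed to proxy":

```go
package handler

import (
	"context"

	"go.opentelemetry.io/otel/metric"
)

// trier abstracts throttler.Try for this sketch; the real signature also
// takes a revision ID.
type trier interface {
	Try(ctx context.Context, fn func(dest string, isClusterIP bool) error) error
}

// tryWithMetrics places the increments/decrements around Try. The callback
// only runs once capacity is granted, so the executed flag distinguishes a
// failure to queue (callback never ran) from a failure to proxy.
func tryWithMetrics(
	ctx context.Context,
	t trier,
	queued, active metric.Int64UpDownCounter,
	opts metric.MeasurementOption,
	proxy func(dest string, isClusterIP bool) error,
) error {
	// Detached context: metric recording must not be cancelled with the request.
	mctx := context.Background()

	queued.Add(mctx, 1, opts)
	executed := false
	err := t.Try(ctx, func(dest string, isClusterIP bool) error {
		executed = true
		// Capacity granted: no longer queued, now active.
		queued.Add(mctx, -1, opts)
		active.Add(mctx, 1, opts)
		defer active.Add(mctx, -1, opts)
		return proxy(dest, isClusterIP)
	})
	if err != nil && !executed {
		// The request never got capacity: undo the queued increment.
		queued.Add(mctx, -1, opts)
	}
	return err
}
```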
@dprotaso I changed the code, please let me know your thoughts.
pkg/activator/handler/handler.go (outdated)

```go
tryContext, trySpan := a.tracer.Start(r.Context(), "throttler_try")

revID := RevIDFrom(r.Context())
rev := RevisionFrom(r.Context())
```

Why do we need rev when it seems like revID has what we need?
pkg/activator/handler/handler.go (outdated)

```go
if err := a.throttler.Try(tryContext, revID, func(dest string, isClusterIP bool) error {
	// Request got capacity - decrement queued, increment active
	a.metrics.requestQueued.Add(r.Context(), -1, metricOpts)
```

Don't use r.Context() when recording metrics - we shouldn't cancel metric recording when the request is cancelled.
/lgtm

Thanks @prashanthjos. cc @linkvt for any post-merge comments.
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dprotaso, prashanthjos

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files. Approvers can indicate their approval by writing /approve in a comment.
Description
This PR moves activator queue metrics from the throttler to the handler layer, adding two new metrics that report instantaneous request counts per revision.
Design
Uses UpDownCounter instruments with a callback-based tracking pattern:

- context.Background() is used when recording, to prevent metric recording from being cancelled when requests are cancelled
Metric Details

- kn.revision.request.queued
- kn.revision.request.active

Labels (both metrics; see the creation sketch below):

- kn.revision.name: Revision name
- k8s.namespace.name: Namespace
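For illustration, a minimal sketch of creating these two instruments and their label set with the OpenTelemetry Go API; the constructor, helper names, and metric descriptions are paraphrased rather than the PR's actual code, though the requestQueued field name matches the diff quoted earlier:

```go
package handler

import (
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/metric"
)

type requestMetrics struct {
	requestQueued metric.Int64UpDownCounter
	requestActive metric.Int64UpDownCounter
}

func newRequestMetrics(meter metric.Meter) (*requestMetrics, error) {
	queued, err := meter.Int64UpDownCounter(
		"kn.revision.request.queued",
		metric.WithDescription("Number of requests waiting in the activator for capacity"),
	)
	if err != nil {
		return nil, err
	}
	active, err := meter.Int64UpDownCounter(
		"kn.revision.request.active",
		metric.WithDescription("Number of requests currently being handled for a revision"),
	)
	if err != nil {
		return nil, err
	}
	return &requestMetrics{requestQueued: queued, requestActive: active}, nil
}

// revisionAttrs builds the label set listed above; pass the result as the
// AddOption on every Add call for a given revision.
func revisionAttrs(revision, namespace string) metric.MeasurementOption {
	return metric.WithAttributes(
		attribute.String("kn.revision.name", revision),
		attribute.String("k8s.namespace.name", namespace),
	)
}
```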
Testing

- pkg/activator/handler and pkg/activator/net tests pass

release-note
Add kn.revision.request.queued and kn.revision.request.active metrics to monitor instantaneous request counts per revision in the activator