Skip to content

Add kn.revision.request.queued and kn.revision.request.active metrics#16367

Merged
knative-prow[bot] merged 4 commits intoknative:mainfrom
prashanthjos:activator_queue_size
Feb 8, 2026
Merged

Add kn.revision.request.queued and kn.revision.request.active metrics#16367
knative-prow[bot] merged 4 commits intoknative:mainfrom
prashanthjos:activator_queue_size

Conversation

@prashanthjos
Copy link
Contributor

@prashanthjos prashanthjos commented Jan 30, 2026

Description

This PR moves activator queue metrics from the throttler to the handler layer, adding two new metrics that report instantaneous request counts per revision.

Design

Uses UpDownCounter with callback-based tracking pattern:

  • Metrics updated at request state transitions (queued → active → complete)
  • Uses context.Background() to prevent metric cancellation when requests are cancelled
  • Handler doesn't interact with instruments directly - uses encapsulated methods

Metric Details

Name Type Description
kn.revision.request.queued Int64UpDownCounter Number of requests currently queued waiting for capacity
kn.revision.request.active Int64UpDownCounter Number of requests currently being proxied

Labels:

  • kn.revision.name: Revision name
  • k8s.namespace.name: Namespace

Testing

  • Updated handler tests for new New() signature
  • All pkg/activator/handler and pkg/activator/net tests pass

release-note
Add kn.revision.request.queued and kn.revision.request.active metrics to monitor instantaneous request counts per revision in the activator

Adds an observable gauge metric that reports the number of in-flight requests per revision in the activator's throttler. Uses a callback-based approach to avoid any overhead on the request hot path - values are only computed when metrics are scraped.
@knative-prow knative-prow bot requested review from dprotaso and skonto January 30, 2026 21:50
@knative-prow knative-prow bot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jan 30, 2026
@knative-prow
Copy link

knative-prow bot commented Jan 30, 2026

Hi @prashanthjos. Thanks for your PR.

I'm waiting for a knative member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@codecov
Copy link

codecov bot commented Jan 30, 2026

Codecov Report

❌ Patch coverage is 90.69767% with 4 lines in your changes missing coverage. Please review.
✅ Project coverage is 80.25%. Comparing base (4004491) to head (bbedb07).
⚠️ Report is 4 commits behind head on main.

Files with missing lines Patch % Lines
pkg/activator/handler/metrics.go 89.18% 2 Missing and 2 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main   #16367      +/-   ##
==========================================
+ Coverage   80.17%   80.25%   +0.08%     
==========================================
  Files         216      216              
  Lines       13440    13483      +43     
==========================================
+ Hits        10775    10821      +46     
+ Misses       2300     2296       -4     
- Partials      365      366       +1     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Copy link
Contributor

@linkvt linkvt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @prashanthjos , thanks for the PR!
I left a few comments and would also still require a review from a maintainer with more experience.

/ok-to-test

}
}

func TestThrottlerMetricsShutdownNilObserver(t *testing.T) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like a duplicate of what is tested in TestNewThrottlerMetricsWithNilProvider

}
}

func TestThrottlerMetricsEmptyQueueDepth(t *testing.T) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could maybe merge this and the TestNewThrottlerMetrics by using a table driven test? See https://go.dev/wiki/TableDrivenTests#using-a-map-to-store-test-cases

// Initialize metrics with the throttler as the queue depth provider.
metrics, err := newThrottlerMetrics(mp, t)
if err != nil {
logger.Warnw("Failed to initialize throttler metrics", "error", err)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand that there is currently no error passed to callers of NewThrottler but I don't think that swallowing the error here is the best solution.

}

// Shutdown unregisters the metrics callback.
func (m *throttlerMetrics) Shutdown() error {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe I'm missing something but it seems this is only called in tests?


queueDepth, err := meter.Int64ObservableGauge(
"kn.activator.queue.depth",
metric.WithDescription("Number of requests currently queued in the activator for a revision"),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The metric name and description use the word queue but below we are tracking inflight requests which seems a bit different to me? Is this intended?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The terminology "queue" was used here in the documentation, I am open to change it.

The activator is part of the data-plane. It is responsible to queue incoming requests (if a Knative Service is scaled-to-zero). It communicates with the autoscaler to bring scaled-to-zero Services back up and forward the queued requests. Activator can also act as a request buffer to handle traffic bursts. Additional details can be found here.

https://knative.dev/docs/serving/architecture/#diagram

@knative-prow knative-prow bot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Feb 2, 2026
@prashanthjos
Copy link
Contributor Author

prashanthjos commented Feb 2, 2026

Thank you @vincent-pli, I will wait for @dprotaso also to review and will address all the comments together.

@dprotaso
Copy link
Member

dprotaso commented Feb 2, 2026

We already have the metric - kn.revision.request.concurrency. This metric reporter happens before the throttler.

Is this not the metric you're looking for based on your description reports the number of requests currently in-flight

@prashanthjos
Copy link
Contributor Author

prashanthjos commented Feb 2, 2026

Thank you Dave, Looking at the code more carefully, I see that ConcurrencyReporter.Handler wraps the activationHandler (which calls throttler.Try), so handleRequestIn/handleRequestOut are called before entering and after exiting the throttler respectively. This means kn.revision.request.concurrency already captures requests that are waiting in the throttler queue.

The differences between the two metrics are:

We want to capture this raw number instead of having it pre-averaged, so we can apply our own aggregation scenarios (average/sum/count/max/percentiles) based on specific monitoring needs. Pre-averaged data limits flexibility in how we can analyze queue behavior.

@dprotaso
Copy link
Member

dprotaso commented Feb 5, 2026

I think it would be easier to add this type of instrumentation to where we already produce our spans - then we shouldn't need to mess with our breakers.

proxyCtx, proxySpan := a.tracer.Start(r.Context(), "activator_proxy")

Might be worth refactoring the spans logic as well into a similar struct that encapsulates both similar to https://github.com/knative/serving/blob/main/pkg/activator/handler/metrics.go

As for attribute naming we already have the following so we should use those -

k8s.namespace.name
kn.revision.name

We can/probably should include other attributes if they exist. See:

As for the metric name kn.activator.queue.depth maybe we should tweak it to pair with kn.revision.request.concurrency

Thoughts on
kn.revision.request.active
kn.revision.request.queued

@prashanthjos
Copy link
Contributor Author

I think it would be easier to add this type of instrumentation to where we already produce our spans - then we shouldn't need to mess with our breakers.

@dprotaso I am ok with other changes, but from handler.go, we can only count requests entering/exiting throttler.Try(). We can't distinguish "waiting in queue" vs "actively executing" without accessing the breaker internals.

@dprotaso
Copy link
Member

dprotaso commented Feb 5, 2026

@prashanthjos it should work similar to how we measure spans

	// increment queue count
	if err := a.throttler.Try(tryContext, revID, func(dest string, isClusterIP bool) error {
	        // decrement queue count
	        // increment executing count
	        // defer decrement executing count
                  ...
		return nil
	}); err != nil {
	        // we need to distinguish between failure to proxy and failure to queue 
	        // decrement queue count if failure to queue
	        ...
	}

@prashanthjos
Copy link
Contributor Author

@dprotaso i changed the code, please do let me know your thoughts.

tryContext, trySpan := a.tracer.Start(r.Context(), "throttler_try")

revID := RevIDFrom(r.Context())
rev := RevisionFrom(r.Context())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we need rev when it seems like revID has what we need?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed !


if err := a.throttler.Try(tryContext, revID, func(dest string, isClusterIP bool) error {
// Request got capacity - decrement queued, increment active
a.metrics.requestQueued.Add(r.Context(), -1, metricOpts)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't use r.Context() when recording metrics - we shouldn't cancel metric recording when the request is cancelled

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done !

@prashanthjos prashanthjos changed the title Add activator queue depth metric (kn.activator.queue.depth) Add kn.revision.request.queued and kn.revision.request.active metrics Feb 7, 2026
@dprotaso
Copy link
Member

dprotaso commented Feb 8, 2026

/lgtm
/approve

thanks @prashanthjos

cc @linkvt for any post merge comments

@knative-prow knative-prow bot added the lgtm Indicates that a PR is ready to be merged. label Feb 8, 2026
@knative-prow
Copy link

knative-prow bot commented Feb 8, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dprotaso, prashanthjos

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@knative-prow knative-prow bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 8, 2026
@knative-prow knative-prow bot merged commit 9c43aa8 into knative:main Feb 8, 2026
93 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants