Merged
18 changes: 10 additions & 8 deletions .github/workflows/docs.yml
@@ -34,10 +34,11 @@ jobs:
   with:
     persist-credentials: false
 - name: Install imaging dependencies
-  run: >-
-    sudo apt-get install -y --no-install-recommends
-    libcairo2-dev libfreetype6-dev libffi-dev
-    libjpeg-dev libpng-dev libz-dev
+  run: |
+    sudo apt-get update
+    sudo apt-get install -y --no-install-recommends \
+      libcairo2-dev libfreetype6-dev libffi-dev \
+      libjpeg-dev libpng-dev libz-dev
 - uses: astral-sh/setup-uv@08807647e7069bb48b6ef5acd8ec9567f424441b # v8.1.0
   with:
     enable-cache: false
@@ -70,10 +71,11 @@ jobs:
   env:
     GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
 - name: Install imaging dependencies
-  run: >-
-    sudo apt-get install -y --no-install-recommends
-    libcairo2-dev libfreetype6-dev libffi-dev
-    libjpeg-dev libpng-dev libz-dev
+  run: |
+    sudo apt-get update
+    sudo apt-get install -y --no-install-recommends \
+      libcairo2-dev libfreetype6-dev libffi-dev \
+      libjpeg-dev libpng-dev libz-dev
 - uses: astral-sh/setup-uv@08807647e7069bb48b6ef5acd8ec9567f424441b # v8.1.0
   with:
     enable-cache: false
67 changes: 66 additions & 1 deletion docs/guides/administration/scaling.md
@@ -73,7 +73,7 @@ For the full list of workers and their defaults, see the
under sustained load), lower the queue's capacity to apply backpressure. Once a queue hits capacity, the
scheduler pauses task creation, propagating throttling back to BOM upload clients.

-Change capacity at runtime from the administrator panel under *Workflows Task Queues*, or via the REST
+Change capacity at runtime from the administrator panel under *Workflows > Task Queues*, or via the REST
API. Defaults live in the [task queues reference](../../reference/configuration/dex-engine.md#task-queues).

**Lower-level engine tuning.** When metrics show write-buffer flush latency, run-history cache misses, or
@@ -83,6 +83,71 @@ and
[`notification.outbox-relay.*`](../../reference/configuration/properties.md#dtnotificationoutbox-relaypoll-interval-ms)
properties.

## Scale workers horizontally

After tuning the vertical knobs, if an activity backlog keeps growing, scale worker instances
horizontally on demand signals from the durable execution engine. Do not scale on CPU or memory
alone: activity workers are I/O-bound (database, registry calls, vulnerability sources) and spend
most of their time waiting, so CPU stays low while tasks queue up.

The engine exposes three Prometheus metrics for this. The management server serves them once you
[turn on Prometheus metrics scraping](configuring-observability.md#enabling-prometheus-metrics-scraping).

| Metric | What it tells you | Use as |
|---|---|---|
| `dt_dex_engine_activity_task_queue_backlog{queueName}` | Approximate count of ready-to-schedule activity tasks per queue, capped at 10000. | Primary scale-up trigger. |
| `dt_dex_engine_activity_task_queue_backlog_age_seconds{queueName}` | Age of the oldest ready-to-schedule activity task per queue. | SLO-aligned secondary trigger. |
| `dt_dex_engine_task_worker_concurrency_utilization{workerType,name}` | Fraction (0–1) of a worker's concurrency slots currently in use. | Scale-down guard. |

Scale up when the backlog exceeds a target per instance, or when the oldest task has waited longer
than the SLO. Scale down only when worker slots stay below a low-use threshold across all
instances.
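
The rule above can be sketched as a small decision function. This is an illustration only, not the engine's or any autoscaler's actual logic: the `Signals` type, the `decide` function, and the `low_utilization` default of 0.3 are hypothetical, and the backlog and age defaults are example values that an operator would tune.

```python
from dataclasses import dataclass


@dataclass
class Signals:
    """Values already collapsed across instances and queues with max(...)."""
    max_backlog: float        # from dt_dex_engine_activity_task_queue_backlog
    max_backlog_age_s: float  # from dt_dex_engine_activity_task_queue_backlog_age_seconds
    max_utilization: float    # from dt_dex_engine_task_worker_concurrency_utilization (0 to 1)


def decide(signals: Signals, replicas: int,
           backlog_per_instance: float = 1000,  # example target, tune per deployment
           slo_seconds: float = 300,            # example SLO, tune per deployment
           low_utilization: float = 0.3) -> str:  # hypothetical scale-down guard
    """Return "scale_up", "scale_down", or "hold" for the current signals."""
    over_backlog = signals.max_backlog > backlog_per_instance * replicas
    over_slo = signals.max_backlog_age_s > slo_seconds
    if over_backlog or over_slo:
        return "scale_up"
    # The maximum utilization being low means every worker is below the threshold.
    if signals.max_utilization < low_utilization:
        return "scale_down"
    return "hold"
```

Scale-up is checked first, so a deployment with idle-looking workers but an aged backlog still grows rather than shrinks.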

!!! note "Combine across instances and queues"
    Every instance publishes the backlog and age gauges. Most deployments run all activity workers
    together, so the right HPA signal is "any queue needs scale-up." Collapse to a single value
    with `max(...)` (no `by` clause), for example
    `max(dt_dex_engine_activity_task_queue_backlog)`. Add `by (queueName)` only if you split
    worker types across separate Deployments and want per-queue scaling.

!!! note "Backlog count is approximate"
    The engine caps the count at 10000 per queue to bound query cost. Beyond the cap, the value
    saturates at 10000. This is precise enough to drive scaling decisions.

<!-- vale Google.Headings = NO -->
### KEDA example
<!-- vale Google.Headings = YES -->

[KEDA](https://keda.sh) can drive a Deployment from these metrics. The `ScaledObject` below
targets worker nodes (no `web` profile), scaling on the worst-case backlog across all queues,
with the worst-case oldest-task age as a secondary trigger. Each query wraps the metric in
`avg_over_time(...[5m:30s])` so a transient spike (a single large BOM upload) does not trigger
churn.

??? example "`ScaledObject` manifest"
    ```yaml linenums="1"
    apiVersion: keda.sh/v1alpha1
    kind: ScaledObject
    metadata:
      name: dependencytrack-worker
    spec:
      scaleTargetRef:
        name: dependencytrack-worker
      minReplicaCount: 2
      maxReplicaCount: 5
      triggers:
        # Primary trigger: worst-case backlog across all queues, smoothed over 5 minutes.
        - type: prometheus
          metadata:
            serverAddress: http://prometheus.monitoring:9090
            query: avg_over_time(max(dt_dex_engine_activity_task_queue_backlog)[5m:30s])
            threshold: "1000"
        # Secondary trigger: worst-case age of the oldest ready task, in seconds.
        - type: prometheus
          metadata:
            serverAddress: http://prometheus.monitoring:9090
            query: avg_over_time(max(dt_dex_engine_activity_task_queue_backlog_age_seconds)[5m:30s])
            threshold: "300"
    ```
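
To get a feel for how the thresholds in the manifest translate into replica counts, the sketch below approximates the arithmetic. It assumes KEDA's default `AverageValue` metric type, under which the HPA asks for roughly `ceil(value / threshold)` replicas per trigger and takes the largest result. It ignores HPA tolerance, stabilization windows, and cooldowns, and `desired_replicas` is a hypothetical helper, not a KEDA API.

```python
import math


def desired_replicas(triggers, min_replicas=2, max_replicas=5):
    """Approximate the replica count for (metric value, threshold) pairs.

    Assumption: each trigger wants ceil(value / threshold) replicas, the
    largest request wins, and the result is clamped to the configured bounds.
    """
    wanted = max(math.ceil(value / threshold) for value, threshold in triggers)
    return max(min_replicas, min(max_replicas, wanted))


# Backlog of 3500 tasks with the oldest task waiting 120 seconds.
print(desired_replicas([(3500, 1000), (120, 300)]))   # 4
# Backlog saturated at the 10000 cap: the request is clamped to maxReplicaCount.
print(desired_replicas([(10000, 1000), (0, 300)]))    # 5
```

With a threshold of 1000, the 10000 cap still maps to more replicas than `maxReplicaCount` allows, so under these assumptions the cap does not limit scaling in this configuration.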

## Pool connections centrally

- **Up to roughly 5 instances** at the default pool size of 30: the per-instance pool works.
1 change: 1 addition & 0 deletions docs/includes/abbreviations.md
@@ -14,5 +14,6 @@
*[OSV]: Open Source Vulnerabilities
*[PURL]: Package URL, a standardized format for identifying software packages
*[SBOM]: Software Bill of Materials
*[SLO]: Service Level Objective
*[VDR]: Vulnerability Disclosure Report
*[VEX]: Vulnerability Exploitability eXchange