Conversation
…, not hardcoded in plaintext
…fana; also added docs file explaining what is what
Note: Reviews paused
It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior in the review settings.
📝 Walkthrough
Consolidates auth flows into a new AuthService, adds AccountLockedError and CSRF/security metrics wiring, significantly trims many metrics modules, adds multiple Grafana dashboards and a dashboard-vs-code contract test, and exposes Grafana env/CI variables plus an nginx grafana proxy rewrite.
Changes
Sequence Diagram(s)

    sequenceDiagram
        participant Client
        participant Route as auth.py (login)
        participant Auth as AuthService
        participant Lockout as LoginLockoutService
        participant Repo as UserRepository
        participant Sec as SecurityService
        participant Producer as EventProducer
        Client->>Route: POST /login (username, password, headers)
        Route->>Auth: login(username, password, ip, user_agent)
        Auth->>Lockout: check_lockout(username)
        alt account locked
            Auth->>Auth: record lockout metric
            Auth-->>Route: raise AccountLockedError
        else not locked
            Auth->>Repo: fetch user by username
            alt user missing
                Auth->>Auth: perform dummy password verify (timing mitigation)
                Auth->>Auth: record failed auth metric
                Auth-->>Route: raise InvalidCredentialsError
            else user found
                Auth->>Sec: verify_password(password, stored_hash)
                alt invalid password
                    Auth->>Auth: record failed attempt & maybe lock
                    Auth-->>Route: raise InvalidCredentialsError
                else valid password
                    Auth->>Lockout: reset_lockout(username)
                    Auth->>Auth: generate access & csrf tokens
                    Auth->>Producer: publish UserLoggedInEvent(metadata)
                    Auth-->>Route: return LoginResult (tokens, csrf, timeout)
                    Route->>Client: set cookies, 200 OK
                end
            end
        end
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 2 | ❌ 2
❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings.
✨ Finishing touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.
13 issues found across 54 files
Prompt for AI agents (all issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="backend/app/core/metrics/notifications.py">
<violation number="1" location="backend/app/core/metrics/notifications.py:150">
P2: The new decrement_unread_count blindly subtracts 1 and can drive unread counts negative or drift from actual per-user state. Preserve a cached per-user count and clamp at zero when decrementing so the metric stays accurate.</violation>
</file>
<file name="backend/app/dlq/manager.py">
<violation number="1" location="backend/app/dlq/manager.py:249">
P2: `record_dlq_processing_error` expects an event type label, but this passes `event_id`, creating high-cardinality metrics per message. Use the message’s event type (or a fixed label like "manual_retry") instead of a unique ID.</violation>
</file>
<file name="backend/tests/contract/test_grafana_metrics.py">
<violation number="1" location="backend/tests/contract/test_grafana_metrics.py:59">
P2: This fixture removes OTEL_SDK_DISABLED without restoring the previous value, so later tests will see a different environment. Consider restoring the prior value (or use pytest’s monkeypatch fixture) to avoid cross-test side effects.</violation>
<violation number="2" location="backend/tests/contract/test_grafana_metrics.py:63">
P2: This fixture sets a global meter provider without restoring the previous provider, which can leak into other tests. Capture the current provider and restore it in a teardown to avoid cross-test interference.</violation>
</file>
<file name="backend/grafana/provisioning/dashboards/notifications.json">
<violation number="1" location="backend/grafana/provisioning/dashboards/notifications.json:204">
P2: `histogram_quantile` on classic `_bucket` metrics should aggregate the bucket rates with `sum ... by (le)`; without it, the quantile calculation is incorrect or fragmented across bucket series. Wrap the rate with `sum by (le)` (and include any grouping labels you want to keep).</violation>
<violation number="2" location="backend/grafana/provisioning/dashboards/notifications.json:265">
P2: `histogram_quantile` requires `sum ... by (le)` over `_bucket` rates; otherwise the percentile is computed from unaggregated bucket series and is incorrect. Wrap the bucket rate with `sum by (le)`.</violation>
<violation number="3" location="backend/grafana/provisioning/dashboards/notifications.json:360">
P2: `histogram_quantile` should aggregate `_bucket` rates with `sum ... by (le)` to produce a valid quantile. Without the aggregation, Grafana will compute the percentile from per-bucket series and return misleading values.</violation>
<violation number="4" location="backend/grafana/provisioning/dashboards/notifications.json:397">
P2: `histogram_quantile` on classic histogram buckets needs `sum ... by (le)` around the `rate(...)`; otherwise the percentile is not computed correctly. Aggregate the bucket rates first.</violation>
<violation number="5" location="backend/grafana/provisioning/dashboards/notifications.json:426">
P2: `histogram_quantile` should receive `sum ... by (le)` of the bucket rates; otherwise the quantile is computed on raw bucket series. Aggregate the rate by `le` (and any grouping labels you need).</violation>
</file>
<file name="docker-compose.yaml">
<violation number="1" location="docker-compose.yaml:178">
P1: Hard-coding a default Grafana admin password means the service will start with a known credential whenever GRAFANA_ADMIN_PASSWORD is not set. Require the password to be provided (or use Docker secrets) to avoid insecure defaults.</violation>
</file>
<file name="backend/app/services/auth_service.py">
<violation number="1" location="backend/app/services/auth_service.py:125">
P2: Timing side-channel: username enumeration is possible. When the user doesn't exist, `_fail_login` is called immediately without running `verify_password`. When the user exists but password is wrong, the expensive password hash comparison runs first. An attacker can measure response times to determine whether a username is registered.
Mitigate by running a dummy `verify_password` against a fixed hash when the user is not found, so both paths take comparable time.</violation>
<violation number="2" location="backend/app/services/auth_service.py:226">
P1: Bug: `publish_logout_event` passes the **username** (from `decode_token`) as `user_id` to `UserLoggedOutEvent` and `_build_metadata`. All other events in this service (login, registration, auth-failed) correctly use `str(user.user_id)` (the actual UUID). This means logout events will be uncorrelatable with other user events.
You need to look up the user by username to get the real `user_id`, similar to how `get_current_user` does it.</violation>
</file>
<file name="backend/grafana/provisioning/dashboards/event-stream-monitoring.json">
<violation number="1" location="backend/grafana/provisioning/dashboards/event-stream-monitoring.json:605">
P3: The "Event Buffer Performance" row is now empty (no panels between this row and the next), which leaves a blank section in the dashboard. Either restore the event buffer panels or remove the row to avoid a confusing empty header.</violation>
</file>
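As an illustration of the unread-count drift flagged in the notifications.py violation above, a minimal sketch of the clamped per-user bookkeeping the review suggests. The class and method names here are placeholders, not the module's actual API; the real code would feed the delta into its OpenTelemetry UpDownCounter:

```python
class UnreadCountTracker:
    """Sketch: cache per-user unread counts and clamp at zero so the
    UpDownCounter only ever receives deltas that reflect real state."""

    def __init__(self, emit_delta) -> None:
        # emit_delta stands in for e.g. lambda d, uid: counter.add(d, {"user": uid})
        self._counts: dict[str, int] = {}
        self._emit_delta = emit_delta

    def increment(self, user_id: str) -> None:
        self._counts[user_id] = self._counts.get(user_id, 0) + 1
        self._emit_delta(1, user_id)

    def decrement(self, user_id: str) -> None:
        current = self._counts.get(user_id, 0)
        new_value = max(current - 1, 0)  # never drive the count negative
        if new_value != current:
            self._emit_delta(new_value - current, user_id)
        self._counts[user_id] = new_value
```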
Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.
backend/grafana/provisioning/dashboards/event-stream-monitoring.json (outdated review thread, resolved)
Actionable comments posted: 7
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
backend/app/core/metrics/replay.py (1)
144-150: ⚠️ Potential issue | 🟡 Minor
`update_replay_queue_size` state leaks for finished sessions. The per-session attribute `_queue_{session_id}` set via `setattr` is never cleaned up when a session finalizes. In `replay_service.py` `_finalize_session` (lines 261-274), the session is removed from all internal dicts but `update_replay_queue_size(session_id, 0)` is never called — so the UpDownCounter never decrements to zero for that session, and the cached attribute remains on the metrics object indefinitely. Consider calling `self._metrics.update_replay_queue_size(session.session_id, 0)` during finalization, or adding a cleanup method to `ReplayMetrics`.
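A sketch of that cleanup; the `_finalize_session` shape and the surrounding state are assumptions, only the `update_replay_queue_size(session_id, 0)` call mirrors the comment:

```python
class ReplayService:
    """Sketch only: zero the per-session gauge before forgetting the session."""

    def __init__(self, metrics) -> None:
        self._metrics = metrics
        self._sessions: dict[str, object] = {}

    def _finalize_session(self, session_id: str) -> None:
        # Bring the UpDownCounter back to zero so finished sessions
        # do not leave a stale cached attribute on the metrics object.
        self._metrics.update_replay_queue_size(session_id, 0)
        self._sessions.pop(session_id, None)
```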
backend/app/core/metrics/security.py (1)
78-89: ⚠️ Potential issue | 🟠 Major
`user_id` and `ip_address` as metric attributes risk cardinality explosion and PII leakage. Multiple recording methods (`record_authentication_attempt`, `record_password_reset_request`, `record_weak_password_attempt`, `record_brute_force_attempt`, `record_account_locked`) embed `user_id` or `ip_address` directly as metric label values. In Prometheus/VictoriaMetrics, each unique label combination creates a separate time series. With many users or IPs, this causes unbounded cardinality growth — degrading query performance and increasing storage costs.
Additionally, `user_id` and `ip_address` in metrics labels may constitute PII under GDPR/CCPA, which typically requires justification and retention controls that metrics backends don't natively provide.
Consider removing these high-cardinality identifiers from metric attributes and instead logging them via structured logs (where retention and access controls are easier to manage), or using hashed/bucketed values if breakdowns are needed.
Also applies to: 131-157
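If a breakdown is still wanted, one hedged option is to bucket the identifier before it ever becomes a label, as sketched below; the helper and the counter wiring are hypothetical, not the existing `SecurityMetrics` API:

```python
import hashlib

def bucket_identifier(value: str, buckets: int = 16) -> str:
    """Map an unbounded identifier onto a small, fixed label set so
    metric cardinality stays bounded and no raw PII is exported."""
    digest = hashlib.sha256(value.encode("utf-8")).hexdigest()
    return f"bucket_{int(digest, 16) % buckets}"

def record_authentication_attempt(counter, user_id: str, success: bool) -> None:
    # Hypothetical wrapper: only bounded, non-reversible attributes reach the metric.
    counter.add(
        1,
        attributes={"success": str(success), "user_bucket": bucket_identifier(user_id)},
    )
```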
🤖 Fix all issues with AI agents
In `@backend/app/dlq/manager.py`:
- Line 249: The call to metrics.record_dlq_processing_error currently passes
event_id as the event_type, creating high-cardinality labels; update the call in
manager.py so that metrics.record_dlq_processing_error(original_topic,
event_type, error_type) receives the actual event type extracted from the
message (e.g., use the message/event metadata field that contains the event
type) instead of event_id, preserving event_id only for payload/logging; ensure
you reference the same symbols: metrics.record_dlq_processing_error, event_id,
original_topic and use the message's event_type field (or
message.metadata["event_type"] / equivalent accessor) when invoking the
function.
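A sketch of the corrected call shape, assuming the DLQ message exposes an `event_type` field (the real accessor in `manager.py` may differ):

```python
def record_retry_failure(metrics, message, original_topic: str, error: Exception) -> None:
    """Sketch: label by event type (bounded) instead of event_id (unbounded)."""
    event_type = getattr(message, "event_type", None) or "manual_retry"
    # event_id stays in logs/payloads only, never in metric labels.
    metrics.record_dlq_processing_error(original_topic, event_type, type(error).__name__)
```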
In `@backend/app/services/auth_service.py`:
- Around line 223-234: publish_logout_event is passing the token subject
(username returned by security_service.decode_token) into
UserLoggedOutEvent.user_id and _build_metadata, causing mismatched IDs vs the
login flow; fix by resolving the username to the canonical user ID before
producing the event: call your user lookup (e.g., fetch by username or use
existing user repository/service used in the login flow to get
str(user.user_id)), then pass that user_id into UserLoggedOutEvent(user_id=...)
and _build_metadata(user_id=...), and use that same user_id as the produce key
when calling self._producer.produce so logout events match login events.
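A sketch of that lookup; the event constructor and `produce` signature are assumptions drawn from the prompt, not verified against the real code:

```python
async def publish_logout_event(self, token: str) -> None:
    # decode_token returns the token subject, i.e. the username, not the UUID.
    username = self.security_service.decode_token(token)

    # Resolve the canonical user_id so logout events correlate with login events.
    user = await self.user_repo.get_user(username)
    if user is None:
        return  # unknown user: nothing meaningful to publish

    user_id = str(user.user_id)
    event = UserLoggedOutEvent(user_id=user_id, metadata=self._build_metadata(user_id))
    await self._producer.produce(event, key=user_id)
```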
- Around line 108-110: The metric calls currently pass raw usernames into
security_metrics (calls to record_account_locked and
record_weak_password_attempt), which creates high-cardinality/PII labels; update
the calls in AuthService (where locked handling and weak-password handling
occur) to stop passing the plain username — instead pass a stable non-PII
identifier (e.g., numeric user_id if available), or a fixed bucket (e.g.,
"unknown"), or a one-way hash of the username using a secure hash function
(sha256) before calling security_metrics.record_account_locked and
security_metrics.record_weak_password_attempt so the metric labels no longer
contain raw usernames.
In `@backend/app/services/notification_service.py`:
- Around line 548-551: The action string currently set to "enabled" or "updated"
is ambiguous when update_data.enabled is False and should be "disabled" instead;
change the logic around the call to self.metrics.record_subscription_change in
the method containing repository.upsert_subscription so that action = "enabled"
if update_data.enabled is True, action = "disabled" if update_data.enabled is
False, and otherwise "updated" for non-enabled field changes. Also pass the
channel value in the correct parameter name (it represents a
NotificationChannel) when calling self.metrics.record_subscription_change —
ensure you provide it to the parameter that tracks channel/notification channel
rather than notification_type so dashboards receive the proper semantic field.
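A sketch of the disambiguated action mapping; the `record_subscription_change` parameter names are taken from the prompt above and may not match the real signature exactly:

```python
def subscription_action(enabled: bool | None) -> str:
    """'enabled'/'disabled' when the enabled flag itself changed; 'updated' otherwise."""
    if enabled is True:
        return "enabled"
    if enabled is False:
        return "disabled"
    return "updated"

# Hypothetical call site, passing the channel under its own parameter:
# self.metrics.record_subscription_change(
#     channel=channel.value, action=subscription_action(update_data.enabled)
# )
```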
In `@backend/grafana/provisioning/dashboards/coordinator-execution.json`:
- Around line 38-47: The histogram_quantile calls (e.g.,
histogram_quantile(0.95,
rate(coordinator_scheduling_duration_seconds_bucket[5m]))) are missing an
aggregation and must be changed to aggregate by bucket le before applying
histogram_quantile; update each affected target (panel ids 1,2,5,7,8) to wrap
rate(...) with sum(...) by (le) so the query becomes histogram_quantile(0.95,
sum(rate(coordinator_scheduling_duration_seconds_bucket[5m])) by (le)); apply
the same pattern to any other histogram_quantile uses in this JSON (and
corresponding files noted) to ensure a single aggregated quantile across series.
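For reference, the corrected query shape from the instruction above, written as the panel-target dict a provisioning or contract-test script might hold; the dict structure is illustrative, the point is the `sum ... by (le)` aggregation around the bucket rate:

```python
# Hypothetical panel target; only the PromQL expression mirrors the suggested fix.
corrected_target = {
    "refId": "A",
    "expr": (
        "histogram_quantile(0.95, "
        "sum(rate(coordinator_scheduling_duration_seconds_bucket[5m])) by (le))"
    ),
}
```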
In `@backend/grafana/provisioning/dashboards/event-stream-monitoring.json`:
- Around line 595-607: Remove the empty dashboard row object with "id": 60 and
"title": "Event Buffer Performance" (gridPos y: 16) from the JSON so it no
longer produces a blank header gap; alternatively, if panels should remain under
that section, move their panel objects to be children below this row or adjust
their gridPos y values, but the simplest fix is to delete the row object with id
60 and its empty "panels": [] entry.
In `@backend/grafana/provisioning/dashboards/notifications.json`:
- Around line 195-211: Panels that plot
histogram_quantile(..._seconds_bucket...) are using the "short" unit so
durations render as raw numbers; update the panels with id 7, 9, 13, and 14
(titles "Channel Delivery Time", "Overall Delivery Time", "Webhook Delivery",
"Slack") to set their unit to "s" (seconds) in the panel JSON (replace the unit
field value currently "short" with "s") so the times display as human-readable
seconds; keep all other panel settings unchanged.
🧹 Nitpick comments (15)
.github/workflows/release-deploy.yml (1)
143-143: Consider parameterizing `GRAFANA_ROOT_URL` instead of hardcoding it. The production domain is baked into the deploy script. If it ever changes, this line is easy to miss. A GitHub repository variable (e.g., `vars.GRAFANA_ROOT_URL`) would keep all environment-specific config in one place alongside the other secrets.
backend/grafana/provisioning/dashboards/event-replay.json (1)
370-371: Minor inconsistencies with other dashboards in the repo. This dashboard uses `schemaVersion: 33` and includes `"style": "dark"`, while the sibling `kafka-events-monitoring.json` uses `schemaVersion: 38` and omits `style`. The `style` field was deprecated in Grafana 9+. Consider aligning the schema version and dropping the `style` field for consistency.
backend/grafana/provisioning/dashboards/http-middleware.json (1)
48-71: Duration and size panels use `"unit": "short"` instead of semantically correct units. Several panels display metrics measured in seconds, bytes, or percent, but all default to `"unit": "short"`. This means Grafana won't apply unit-aware formatting (e.g., "1.23 s", "4.5 KiB", "85%"), making values harder to interpret at a glance. Affected panels:
- Request Duration (id 2, line 52): queries `http_request_duration_seconds_bucket` → should use `"s"`
- Request/Response Size (id 4, line 99): queries `*_size_bytes_bucket` → should use `"bytes"` or `"decbytes"`
- MongoDB Query Duration (id 6, line 170): queries `*_duration_seconds_bucket` → should use `"s"`
- Event Store & Idempotency (id 8, line 217): queries `*_duration_seconds_bucket` → should use `"s"`
- CPU (id 11, line 324): queries `system_cpu_percent` → should use `"percent"`
Example fix for the Request Duration panel: change `"unit": "short"` to `"unit": "s"` under `fieldConfig.defaults`.
backend/grafana/provisioning/dashboards/security-auth.json (1)
53-76: Duration panels should use `"unit": "s"` instead of `"short"`. Same issue as in `http-middleware.json`: the Authentication Duration panel (id 2) queries `authentication_duration_seconds_bucket` and Token Expiry Time (id 8, line 220) queries `token_expiry_time_seconds_bucket` — both should use `"unit": "s"` so Grafana formats values with time-aware suffixes (ms/s/min).
Suggested fix: for panel id 2 (Authentication Duration) and id 8 (Token Expiry Time), change `"unit": "short"` to `"unit": "s"` under `fieldConfig.defaults`.
backend/grafana/provisioning/dashboards/coordinator-execution.json (1)
197-254: Script Resources panels use `"unit": "short"` for memory and percent metrics.
- Memory Usage (id 7) queries `script_memory_usage_MiB_bucket` → consider `"unit": "MiB"` (or `"decmbytes"`)
- Memory Utilization (id 8) queries `script_memory_utilization_percent_bucket` → consider `"unit": "percent"`
This is the same pattern flagged in the other new dashboards for duration panels.
backend/app/core/metrics/kubernetes.py (1)
83-89: The `getattr`/`setattr` delta-tracking pattern is not thread-safe. `update_active_pod_creations`, `update_pod_monitor_pods_watched`, and `update_pods_by_phase` all use `getattr`/`setattr` to track the last-known value and compute a delta. If called concurrently from different threads, a race between reading the current value and writing the new value could produce incorrect deltas. This is a pre-existing pattern, but worth noting given that Kubernetes operations are often concurrent.
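If the concurrency risk is worth addressing, a small lock-guarded tracker like the sketch below could replace the `getattr`/`setattr` pattern; the class and its wiring are hypothetical:

```python
import threading

class GaugeDeltaTracker:
    """Sketch: compute UpDownCounter deltas against a lock-protected snapshot."""

    def __init__(self, up_down_counter) -> None:
        self._counter = up_down_counter
        self._last: dict[str, int] = {}
        self._lock = threading.Lock()

    def update(self, key: str, current: int, attributes: dict | None = None) -> None:
        with self._lock:
            delta = current - self._last.get(key, 0)
            self._last[key] = current
        if delta:
            self._counter.add(delta, attributes=attributes or {})
```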
backend/grafana/provisioning/dashboards/kubernetes-pods.json (1)
53-76: Duration panels use `"unit": "short"` instead of `"s"` — values will display as raw numbers rather than human-readable durations. The `pod_creation_duration_seconds_bucket` histogram quantile returns values in seconds, but Grafana's `"short"` unit won't format them as time (e.g., "1.2s"). The same applies to the Monitor Processing Duration panel (id 10, line 290) and Pod Lifetime panel (id 6, line 160). Consider using `"unit": "s"` for these panels so Grafana renders durations properly.
backend/app/services/sse/sse_service.py (1)
61-61: Consider using `time.monotonic()` instead of `datetime.now(timezone.utc)` for duration measurement. Wall-clock time can jump due to NTP adjustments or clock skew. `time.monotonic()` is the idiomatic choice for measuring elapsed durations. For long-lived SSE connections this is unlikely to be noticeable, but it's a minor correctness improvement.
♻️ Proposed fix: import `time`, replace `start_time = datetime.now(timezone.utc)` with `start_time = time.monotonic()`, and compute `duration = time.monotonic() - start_time` instead of `(datetime.now(timezone.utc) - start_time).total_seconds()`.
Also applies to: 125-126
backend/app/core/metrics/notifications.py (1)
163-164: Minor: retry success rate is only recorded when `attempt_number > 1`. This means the first attempt's outcome is never recorded in the `retry_success_rate` histogram. If the intent is to track success rate of retries specifically (not first attempts), this is correct. If you want to track overall success rate including first attempts, the condition should be `attempt_number >= 1`.
backend/tests/contract/test_grafana_metrics.py (3)
58-63: Global OTel state mutation without teardown. `os.environ.pop("OTEL_SDK_DISABLED")` and `otel_api.set_meter_provider(provider)` permanently alter process-global state. If any other test in the same process relies on the default no-op provider or on `OTEL_SDK_DISABLED`, it will break. Consider restoring the env var and shutting down the provider in a `yield`-based fixture, or running these tests in a dedicated subprocess/marker with `--forked`.
Sketch: yield-based cleanup

     @pytest.fixture(scope="module")
     def prometheus_families() -> dict[str, set[str]]:
    -    os.environ.pop("OTEL_SDK_DISABLED", None)
    +    old_val = os.environ.pop("OTEL_SDK_DISABLED", None)
         reader = PrometheusMetricReader()
         provider = MeterProvider(metric_readers=[reader])
         otel_api.set_meter_provider(provider)
         ...
    -    return families
    +    yield families
    +
    +    provider.shutdown()
    +    if old_val is not None:
    +        os.environ["OTEL_SDK_DISABLED"] = old_val

Note: OTel's global `set_meter_provider` cannot be fully reverted, so subprocess isolation (pytest-forked) is the most robust option.
106-113: Metric name extraction heuristic may produce false positives. The `len(t) > 3` filter on Line 113 is a rough heuristic. Short but valid metric names (e.g., a hypothetical `cpu`) would pass, while PromQL label values leaking through the regex stripping could be misidentified as metrics. This is acceptable for now since Prometheus naming conventions typically use underscores, but worth documenting the limitation as a code comment.
73-89: Use the public `MetricReader.collect()` API instead of accessing the private `_collector` attribute. The code relies on private OpenTelemetry SDK internals (`provider._meters`, `meter._instrument_id_instrument`, `reader._collector`) that may change across releases. Replace `reader._collector.collect()` with the public `reader.collect(timeout_millis=...)` method to ensure stability. The `_meters` and `_instrument_id_instrument` access should also be reconsidered if possible to avoid direct iteration over internal SDK state.
backend/tests/unit/core/test_csrf.py (1)
29-62: Consider extracting a shared fixture for `SecurityService` to reduce repetition. Every test method in this file constructs `SecurityService(test_settings, security_metrics)` individually. A class-scoped `pytest.fixture` (like in `test_security.py`) would eliminate the boilerplate across ~15 test methods. This is a pre-existing pattern, so fine to defer.
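A sketch of what that shared fixture could look like; the import path and sample test are placeholders, only the `SecurityService(test_settings, security_metrics)` construction comes from the comment above:

```python
import pytest

from app.core.security import SecurityService  # assumed import path, adjust to the real module


@pytest.fixture(scope="class")
def security_service(test_settings, security_metrics):
    """One SecurityService per test class instead of one per test method."""
    return SecurityService(test_settings, security_metrics)


class TestCsrf:
    def test_uses_shared_service(self, security_service) -> None:
        assert security_service is not None  # placeholder use of the shared fixture
```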
backend/app/services/auth_service.py (2)
80-81: Prefer structured keyword arguments over f-strings in structlog calls. Using `f"Admin access denied for user: {user.username} (role: {user.role})"` embeds variables into the message string, which defeats structured logging's purpose (greppable messages, indexed fields). Same pattern on lines 92–93.
Suggested fix:

    - self.logger.warning(f"Admin access denied for user: {user.username} (role: {user.role})")
    + self.logger.warning("Admin access denied", username=user.username, role=user.role)

    - self.logger.warning(
    -     f"Login failed - {reason}",
    + self.logger.warning(
    +     "Login failed",
    +     reason=reason,
33-52: Constructor has 8 dependencies — consider whether this signals the service is doing too much.
`AuthService` now handles login, registration, logout, current-user resolution, and admin authorization checks. The admin authorization check (`get_admin`) feels out of place in an auth service that's primarily about the login/register lifecycle. That said, this consolidation is the explicit goal of the PR, so flagging for awareness only.
backend/grafana/provisioning/dashboards/event-stream-monitoring.json (outdated review thread, resolved)
Actionable comments posted: 8
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (3)
backend/grafana/provisioning/dashboards/event-stream-monitoring.json (1)
1094-1098: ⚠️ Potential issue | 🟡 Minor
"SSE Shutdown Monitoring" section has only a right-half panel — left side is empty. The "SSE Connections Being Drained" panel (id 73) is positioned at `x: 12, w: 12`, leaving the left half (x: 0–11) blank. A panel (likely id 72) appears to have been removed. Either widen this panel to full width or add a complementary shutdown metric (e.g., shutdown duration, drain timeout events).
🔧 Quick fix – make the panel full-width:

     "gridPos": {
       "h": 8,
    -  "w": 12,
    -  "x": 12,
    +  "w": 24,
    +  "x": 0,
       "y": 52
     },

🤖 Prompt for AI Agents
Before applying any fix, first verify the finding against the current code and decide whether a code change is actually needed. If the finding is not valid or no change is required, do not modify code for that item and briefly explain why it was skipped. In `@backend/grafana/provisioning/dashboards/event-stream-monitoring.json` around lines 1094 - 1098, The "SSE Connections Being Drained" panel (id 73) is only occupying the right half (gridPos x:12, w:12); update its gridPos to full width by setting x:0 and w:24 in the panel definition for id 73 (panel title "SSE Connections Being Drained") so it fills the left half gap, and verify it doesn't overlap other panels (e.g., any panel id 72) after the change.
backend/app/core/metrics/security.py (2)
137-147: ⚠️ Potential issue | 🟠 Major
High-cardinality PII labels: `ip_address` and `target_user` in brute-force metrics. `ip_address` and `target_user` are unbounded label values that will cause cardinality explosion. Brute-force detection by definition involves many distinct IPs. Record these in logs/alerts, not as metric dimensions.
Proposed fix:

     def record_brute_force_attempt(
    -    self, ip_address: str, target_user: str | None = None, action_taken: str = "logged"
    +    self, action_taken: str = "logged"
     ) -> None:
         self.brute_force_attempts.add(
             1,
             attributes={
    -            "ip_address": ip_address,
    -            "target_user": target_user or "multiple",
                 "action_taken": action_taken,
             },
         )

🤖 Prompt for AI Agents
Before applying any fix, first verify the finding against the current code and decide whether a code change is actually needed. If the finding is not valid or no change is required, do not modify code for that item and briefly explain why it was skipped. In `@backend/app/core/metrics/security.py` around lines 137 - 147, The record_brute_force_attempt method is adding high‑cardinality PII (ip_address and target_user) as metric attributes on self.brute_force_attempts; remove those unbounded labels and only record bounded dimensions (e.g., keep action_taken and optionally a coarse target category like target_user_present: "yes"/"no" or target_type: "multiple"/"single") when calling self.brute_force_attempts.add, and move the detailed ip_address and target_user values to logs/alerts (e.g., logger.warn or alerting pipeline) so you still capture them for investigation without exploding metric cardinality.
78-86: ⚠️ Potential issue | 🟠 Major
High-cardinality PII labels: `user_id` in authentication metrics. `user_id` is added as a metric attribute on every authentication attempt and failure (lines 82, 86). Each unique user creates a new time series, degrading Prometheus/VictoriaMetrics performance. It also exposes PII in monitoring systems. Consider removing `user_id` from metric attributes entirely (it belongs in structured logs, not in metric labels), or bucketing it if cardinality is bounded.
🤖 Prompt for AI Agents
Before applying any fix, first verify the finding against the current code and decide whether a code change is actually needed. If the finding is not valid or no change is required, do not modify code for that item and briefly explain why it was skipped. In `@backend/app/core/metrics/security.py` around lines 78 - 86, The record_authentication_attempt function is adding user_id as a metric label on authentication_attempts and authentication_failures which creates high-cardinality/PII in metrics; remove user_id from the attributes passed to authentication_attempts.add and authentication_failures.add (keep only method and success where needed) and instead record user_id in structured logs or trace spans if required, or implement a deterministic low-cardinality bucketing strategy (e.g., hashed bucket) before adding as a metric label if you must keep a coarse identifier; target the record_authentication_attempt function and the authentication_attempts/authentication_failures metric add calls to apply the change.
🤖 Fix all issues with AI agents
Before applying any fix, first verify the finding against the current code and
decide whether a code change is actually needed. If the finding is not valid or
no change is required, do not modify code for that item and briefly explain why
it was skipped.
In `@backend/app/core/metrics/dlq.py`:
- Around line 54-59: The helpers increment_dlq_queue_size and
decrement_dlq_queue_size modify the UpDownCounter (dlq_queue_size) but never
update the cached _dlq_sizes, causing stale deltas in update_dlq_queue_size; fix
by having those helpers compute the new size from _dlq_sizes (e.g. new_size =
_dlq_sizes.get(original_topic, 0) +/- 1), call dlq_queue_size.add(±1,
attributes={"original_topic": original_topic}) and then set
_dlq_sizes[original_topic] = new_size (alternatively, have the helpers call
update_dlq_queue_size with the computed new_size so all updates funnel through
update_dlq_queue_size).
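A sketch of that funnelling; the surrounding class is reduced to the relevant pieces and is not the module's full API:

```python
class DLQMetrics:
    """Sketch: route all size changes through update_dlq_queue_size
    so the cached per-topic sizes never go stale."""

    def __init__(self, dlq_queue_size) -> None:
        self.dlq_queue_size = dlq_queue_size   # UpDownCounter
        self._dlq_sizes: dict[str, int] = {}   # cached sizes per topic

    def update_dlq_queue_size(self, original_topic: str, new_size: int) -> None:
        delta = new_size - self._dlq_sizes.get(original_topic, 0)
        if delta:
            self.dlq_queue_size.add(delta, attributes={"original_topic": original_topic})
        self._dlq_sizes[original_topic] = new_size

    def increment_dlq_queue_size(self, original_topic: str) -> None:
        self.update_dlq_queue_size(original_topic, self._dlq_sizes.get(original_topic, 0) + 1)

    def decrement_dlq_queue_size(self, original_topic: str) -> None:
        self.update_dlq_queue_size(original_topic, max(self._dlq_sizes.get(original_topic, 0) - 1, 0))
```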
In `@backend/app/core/metrics/security.py`:
- Around line 131-132: The password_reset_requests metric is recording
PII/high-cardinality via the user_id attribute in record_password_reset_request;
remove user_id from the attributes passed to password_reset_requests.add and
only include the non-PII dimension (method). Update the
record_password_reset_request method so password_reset_requests.add(…) is called
with attributes={"method": method} (do not log or hash user_id here) to avoid
exposing user identifiers.
- Around line 137-147: The record_brute_force_attempt method is adding
high‑cardinality PII (ip_address and target_user) as metric attributes on
self.brute_force_attempts; remove those unbounded labels and only record bounded
dimensions (e.g., keep action_taken and optionally a coarse target category like
target_user_present: "yes"/"no" or target_type: "multiple"/"single") when
calling self.brute_force_attempts.add, and move the detailed ip_address and
target_user values to logs/alerts (e.g., logger.warn or alerting pipeline) so
you still capture them for investigation without exploding metric cardinality.
- Around line 78-86: The record_authentication_attempt function is adding
user_id as a metric label on authentication_attempts and authentication_failures
which creates high-cardinality/PII in metrics; remove user_id from the
attributes passed to authentication_attempts.add and authentication_failures.add
(keep only method and success where needed) and instead record user_id in
structured logs or trace spans if required, or implement a deterministic
low-cardinality bucketing strategy (e.g., hashed bucket) before adding as a
metric label if you must keep a coarse identifier; target the
record_authentication_attempt function and the
authentication_attempts/authentication_failures metric add calls to apply the
change.
In `@backend/app/services/auth_service.py`:
- Around line 120-175: The login flow never emits SecurityMetrics, so update the
login method to call SecurityMetrics.record_authentication_attempt (or via
self.security_metrics) on both success and failure paths: measure start time at
the top of login, then on the user-not-found branch and invalid-password branch
invoke record_authentication_attempt with failure=True (and include reason like
"user_not_found" or "invalid_password", username, ip_address, user_agent,
duration) before calling or inside _fail_login; on the successful path (before
return) call record_authentication_attempt with failure=False, the username,
ip_address, user_agent, and duration; alternatively refactor _fail_login to
accept and emit the metric so all failure paths centralize metric recording.
Ensure you reference SecurityMetrics.record_authentication_attempt and the login
and _fail_login symbols when making the changes.
- Around line 177-229: The UserRegisteredEvent emitted from the register method
currently includes created_user.email which may accidentally expose PII; review
the register function and UserRegisteredEvent usage and either remove the email
field from the event payload or replace it with a hashed/consented identifier
(e.g., omit email when constructing UserRegisteredEvent or supply an email_hash
or consent flag) so downstream consumers do not receive raw email addresses;
update the producer call site where UserRegisteredEvent is constructed and any
consumers that expect the email field to handle its absence or the new hashed
value.
In `@backend/grafana/provisioning/dashboards/coordinator-execution.json`:
- Around line 26-29: Several panels set fieldConfig.defaults.unit to "short" but
their metrics have units; update the fieldConfig.defaults.unit for the affected
panels (panel id 1, 2, 5 --> use "s"; panel id 7 --> use "decmbytes" or add an
explicit axis label; panel id 8 --> use "percent") and also adjust the same
setting for the other listed ranges (panels around lines referenced: 50-53,
133-136, 198-202, 228-231) so each panel's "unit" matches its metric suffix
(e.g., replace "short" with "s", "decmbytes", or "percent" in the JSON objects
that contain "fieldConfig": {"defaults": {"unit": ...}} for the corresponding
panels).
- Line 25: The dashboard hardcodes "datasource": "Victoria Metrics" in multiple
panels (see occurrences at the panel datasource properties) which breaks
portability; add a dashboard-level templating variable named DS of type
"datasource" (query "prometheus" and default current value "Victoria Metrics")
and replace each panel's hardcoded datasource value with the template reference
"${DS}" so panels use the provisioned datasource; update the templating block
under the top-level "templating" -> "list" and change every panel datasource
field (e.g., in the panel objects that currently have "datasource": "Victoria
Metrics") to "datasource": "${DS}" to make the dashboard portable.
In `@backend/grafana/provisioning/dashboards/event-stream-monitoring.json`:
- Line 287: Panel 54 ("Event Publishing Rate") uses a 1-minute rate window in
the expression sum(rate(events_published_total[1m])); update this to use a
5-minute window (sum(rate(events_published_total[5m]))) to match the other
panels (e.g., panel 66 "Event Publishing Rate by Type") and avoid noisy or
missing data when scrape intervals are ≥1m; ensure any other rate() calls in the
dashboard remain consistent with the chosen window.
- Around line 104-108: The "Event Stream Overview" stat row has unused grid
columns because panels were removed and remaining panels (e.g., panel id
52/54/56 or the panels with gridPos entries shown) are left with x/w settings
that only cover 14 of 24 units; update the panels in the dashboard JSON to
eliminate gaps by either increasing the "w" of the existing stat/timeseries
panels and adjusting their "x" values to pack across x:0–23, or add new stat
panels (suggest metrics like total_messages_sent or event_store_failure_rate) to
occupy the missing grid columns; locate the affected panels by their panel ids
(the ones skipping 53 and 55 referenced in the comment) and modify their
"gridPos" objects so the row fills the full 24-unit width consistently (apply
the same fix to the other occurrence noted around lines 261–266).
- Around line 1094-1098: The "SSE Connections Being Drained" panel (id 73) is
only occupying the right half (gridPos x:12, w:12); update its gridPos to full
width by setting x:0 and w:24 in the panel definition for id 73 (panel title
"SSE Connections Being Drained") so it fills the left half gap, and verify it
doesn't overlap other panels (e.g., any panel id 72) after the change.
In `@backend/grafana/provisioning/dashboards/http-middleware.json`:
- Around line 62-70: The duration panels are using "unit": "short" instead of
the proper units; update each affected panel object (e.g., the panel with
"title": "Request Duration" and the other panels at IDs 2, 4, 6, and 8
referenced in the review) to set "unit": "s" for metrics derived from
*_duration_seconds_bucket and "unit": "bytes" for metrics from
*_size_bytes_bucket (request/response size panels). Locate the JSON panel
objects by their "id" and "title" fields in the dashboard file and replace the
unit value accordingly so Grafana will auto-format seconds and bytes.
- Around line 86-93: The dashboard query is using the wrong metric name; update
the Grafana dashboard JSON so the target's "expr" value is
"http_requests_active" instead of "http_requests_active_requests". Locate the
dashboard object that has "title": "Active Requests" and the "targets" array
(the target with "refId": "A") and replace the expr string accordingly to match
the metric registered as "http_requests_active".
- Around line 357-364: The dashboard panel targets a nonexistent, non-conformant
metric "process_metrics_mixed" (panel title "Process Metrics"); update the panel
in http-middleware.json by replacing the expression in the targets.expr field
with a valid Prometheus metric (for example use process_resident_memory_bytes,
process_open_fds, or another single-quantity metric used elsewhere in the
codebase) or remove the panel entirely if that data source is not available;
ensure the expr uses a real metric name consistent with Prometheus naming
conventions and adjust the panel title/legend if you pick a specific metric.
- Line 25: Multiple panels currently hardcode "datasource": "Victoria Metrics",
which breaks dashboards across environments; add a dashboard-level template
variable (e.g., name it DS_VICTORIA_METRICS) in the dashboard's
"templating.list" and replace each panel's "datasource": "Victoria Metrics" with
"datasource": "${DS_VICTORIA_METRICS}" (update occurrences around the existing
"datasource" keys such as the ones shown) so the datasource is resolved
per-environment instead of hardcoded.
In `@backend/grafana/provisioning/dashboards/security-auth.json`:
- Around line 53-76: Panel units are incorrect: change the fieldConfig default
unit from "short" to the seconds unit "s" for the duration panels. Update the
JSON for the panel with id 2 (title "Authentication Duration") and the panel
with id 8 (title "Token Expiry Time") by replacing "unit": "short" with "unit":
"s" under each panel's fieldConfig.defaults so Grafana renders values as
human-friendly seconds.
🧹 Nitpick comments (5)
🤖 Fix all nitpicks with AI agents
Before applying any fix, first verify the finding against the current code and decide whether a code change is actually needed. If the finding is not valid or no change is required, do not modify code for that item and briefly explain why it was skipped. In `@backend/app/services/auth_service.py`: - Around line 177-229: The UserRegisteredEvent emitted from the register method currently includes created_user.email which may accidentally expose PII; review the register function and UserRegisteredEvent usage and either remove the email field from the event payload or replace it with a hashed/consented identifier (e.g., omit email when constructing UserRegisteredEvent or supply an email_hash or consent flag) so downstream consumers do not receive raw email addresses; update the producer call site where UserRegisteredEvent is constructed and any consumers that expect the email field to handle its absence or the new hashed value. In `@backend/grafana/provisioning/dashboards/coordinator-execution.json`: - Line 25: The dashboard hardcodes "datasource": "Victoria Metrics" in multiple panels (see occurrences at the panel datasource properties) which breaks portability; add a dashboard-level templating variable named DS of type "datasource" (query "prometheus" and default current value "Victoria Metrics") and replace each panel's hardcoded datasource value with the template reference "${DS}" so panels use the provisioned datasource; update the templating block under the top-level "templating" -> "list" and change every panel datasource field (e.g., in the panel objects that currently have "datasource": "Victoria Metrics") to "datasource": "${DS}" to make the dashboard portable. In `@backend/grafana/provisioning/dashboards/event-stream-monitoring.json`: - Line 287: Panel 54 ("Event Publishing Rate") uses a 1-minute rate window in the expression sum(rate(events_published_total[1m])); update this to use a 5-minute window (sum(rate(events_published_total[5m]))) to match the other panels (e.g., panel 66 "Event Publishing Rate by Type") and avoid noisy or missing data when scrape intervals are ≥1m; ensure any other rate() calls in the dashboard remain consistent with the chosen window. - Around line 104-108: The "Event Stream Overview" stat row has unused grid columns because panels were removed and remaining panels (e.g., panel id 52/54/56 or the panels with gridPos entries shown) are left with x/w settings that only cover 14 of 24 units; update the panels in the dashboard JSON to eliminate gaps by either increasing the "w" of the existing stat/timeseries panels and adjusting their "x" values to pack across x:0–23, or add new stat panels (suggest metrics like total_messages_sent or event_store_failure_rate) to occupy the missing grid columns; locate the affected panels by their panel ids (the ones skipping 53 and 55 referenced in the comment) and modify their "gridPos" objects so the row fills the full 24-unit width consistently (apply the same fix to the other occurrence noted around lines 261–266). 
In `@backend/grafana/provisioning/dashboards/http-middleware.json`:
- Line 25: Multiple panels currently hardcode "datasource": "Victoria Metrics", which breaks dashboards across environments; add a dashboard-level template variable (e.g., name it DS_VICTORIA_METRICS) in the dashboard's "templating.list" and replace each panel's "datasource": "Victoria Metrics" with "datasource": "${DS_VICTORIA_METRICS}" (update occurrences around the existing "datasource" keys such as the ones shown) so the datasource is resolved per-environment instead of hardcoded.
backend/grafana/provisioning/dashboards/coordinator-execution.json (1)
25-25: Hardcoded datasource name reduces portability. Every panel pins `"datasource": "Victoria Metrics"` by name. If the provisioned datasource is renamed or the dashboard is imported into an environment with a different name, all panels will break. Consider using a dashboard-level templating variable:

    "templating": {
      "list": [
        {
          "name": "DS",
          "type": "datasource",
          "query": "prometheus",
          "current": { "text": "Victoria Metrics", "value": "Victoria Metrics" }
        }
      ]
    }

Then reference `"datasource": "${DS}"` in each panel. This is a common Grafana best practice for provisioned dashboards.
Also applies to: 49-49, 73-73, 96-96, 132-132, 156-156, 198-198, 227-227
🤖 Prompt for AI Agents
Before applying any fix, first verify the finding against the current code and decide whether a code change is actually needed. If the finding is not valid or no change is required, do not modify code for that item and briefly explain why it was skipped. In `@backend/grafana/provisioning/dashboards/coordinator-execution.json` at line 25, The dashboard hardcodes "datasource": "Victoria Metrics" in multiple panels (see occurrences at the panel datasource properties) which breaks portability; add a dashboard-level templating variable named DS of type "datasource" (query "prometheus" and default current value "Victoria Metrics") and replace each panel's hardcoded datasource value with the template reference "${DS}" so panels use the provisioned datasource; update the templating block under the top-level "templating" -> "list" and change every panel datasource field (e.g., in the panel objects that currently have "datasource": "Victoria Metrics") to "datasource": "${DS}" to make the dashboard portable.
backend/grafana/provisioning/dashboards/http-middleware.json (1)
25-25: Consider using a datasource variable instead of hardcoding `"Victoria Metrics"`. Every panel hardcodes `"datasource": "Victoria Metrics"`. If the datasource name differs across environments (dev, staging, prod), all panels break. The standard Grafana practice is to define a dashboard-level template variable (e.g., `${DS_VICTORIA_METRICS}`) and reference it in each panel. This is optional but improves portability.
Also applies to: 49-49, 138-138
🤖 Prompt for AI Agents
Before applying any fix, first verify the finding against the current code and decide whether a code change is actually needed. If the finding is not valid or no change is required, do not modify code for that item and briefly explain why it was skipped. In `@backend/grafana/provisioning/dashboards/http-middleware.json` at line 25, Multiple panels currently hardcode "datasource": "Victoria Metrics", which breaks dashboards across environments; add a dashboard-level template variable (e.g., name it DS_VICTORIA_METRICS) in the dashboard's "templating.list" and replace each panel's "datasource": "Victoria Metrics" with "datasource": "${DS_VICTORIA_METRICS}" (update occurrences around the existing "datasource" keys such as the ones shown) so the datasource is resolved per-environment instead of hardcoded.
backend/grafana/provisioning/dashboards/event-stream-monitoring.json (2)
287-287: Inconsistent rate window: `[1m]` here vs `[5m]` everywhere else. Panel 54 ("Event Publishing Rate") uses `rate(events_published_total[1m])` while the corresponding breakdown panel 66 ("Event Publishing Rate by Type") and all other `rate()` calls in this dashboard use `[5m]`. The `[1m]` window will produce noisier data and may show no data if the scrape interval is ≥ 1 minute. Consider aligning to `[5m]` for consistency, or at least ensure the scrape interval supports a 1-minute rate window.
🔧 Suggested fix:

    - "expr": "sum(rate(events_published_total[1m]))",
    + "expr": "sum(rate(events_published_total[5m]))",

🤖 Prompt for AI Agents
Before applying any fix, first verify the finding against the current code and decide whether a code change is actually needed. If the finding is not valid or no change is required, do not modify code for that item and briefly explain why it was skipped. In `@backend/grafana/provisioning/dashboards/event-stream-monitoring.json` at line 287, Panel 54 ("Event Publishing Rate") uses a 1-minute rate window in the expression sum(rate(events_published_total[1m])); update this to use a 5-minute window (sum(rate(events_published_total[5m]))) to match the other panels (e.g., panel 66 "Event Publishing Rate by Type") and avoid noisy or missing data when scrape intervals are ≥1m; ensure any other rate() calls in the dashboard remain consistent with the chosen window.
104-108: Gaps in the "Event Stream Overview" stat row. The three stat/timeseries panels in the overview row occupy only 14 of 24 grid units (`x: 0 w: 4`, `x: 4 w: 4`, `x: 12 w: 6`), leaving a 4-unit gap at `x: 8–11` and a 6-unit gap at `x: 18–23`. This suggests panels were removed (the id sequence skips 53 and 55) without redistributing the remaining panels. Consider widening the existing panels or adding new stat panels (e.g., total messages sent, event store failure rate) to fill the row.
Also applies to: 261-266
🤖 Prompt for AI Agents
Before applying any fix, first verify the finding against the current code and decide whether a code change is actually needed. If the finding is not valid or no change is required, do not modify code for that item and briefly explain why it was skipped. In `@backend/grafana/provisioning/dashboards/event-stream-monitoring.json` around lines 104 - 108, The "Event Stream Overview" stat row has unused grid columns because panels were removed and remaining panels (e.g., panel id 52/54/56 or the panels with gridPos entries shown) are left with x/w settings that only cover 14 of 24 units; update the panels in the dashboard JSON to eliminate gaps by either increasing the "w" of the existing stat/timeseries panels and adjusting their "x" values to pack across x:0–23, or add new stat panels (suggest metrics like total_messages_sent or event_store_failure_rate) to occupy the missing grid columns; locate the affected panels by their panel ids (the ones skipping 53 and 55 referenced in the comment) and modify their "gridPos" objects so the row fills the full 24-unit width consistently (apply the same fix to the other occurrence noted around lines 261–266).
backend/app/services/auth_service.py (1)
177-229: Registration emits the user's email in the event payload. Line 223 includes `email=created_user.email` in `UserRegisteredEvent`. While domain events are different from metrics (PII in events is often acceptable), ensure downstream consumers handle email appropriately per GDPR/privacy requirements (e.g., not logging it, encrypting at rest in event stores).
🤖 Prompt for AI Agents
Before applying any fix, first verify the finding against the current code and decide whether a code change is actually needed. If the finding is not valid or no change is required, do not modify code for that item and briefly explain why it was skipped. In `@backend/app/services/auth_service.py` around lines 177 - 229, The UserRegisteredEvent emitted from the register method currently includes created_user.email which may accidentally expose PII; review the register function and UserRegisteredEvent usage and either remove the email field from the event payload or replace it with a hashed/consented identifier (e.g., omit email when constructing UserRegisteredEvent or supply an email_hash or consent flag) so downstream consumers do not receive raw email addresses; update the producer call site where UserRegisteredEvent is constructed and any consumers that expect the email field to handle its absence or the new hashed value.
1 issue found across 10 files (changes from recent commits).
Prompt for AI agents (all issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="backend/grafana/provisioning/dashboards/coordinator-execution.json">
<violation number="1" location="backend/grafana/provisioning/dashboards/coordinator-execution.json:201">
P2: `script_memory_usage_MiB_bucket` is measured in MiB, but the panel now uses the decimal megabytes unit (`decmbytes`), which will display 1 MiB as 1.048 MB. Use the mebibytes unit (`mbytes`) or convert the metric to MB so the unit matches the data.</violation>
</file>
Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.
backend/grafana/provisioning/dashboards/coordinator-execution.json (outdated review thread, resolved)
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
backend/app/core/metrics/security.py (1)
145-155: ⚠️ Potential issue | 🟠 Major
`ip_address` and `target_user` as metric attributes create high-cardinality series and may leak PII. `ip_address` is unbounded and `target_user` may contain usernames. Both become Prometheus labels, causing cardinality explosion and potential PII exposure in monitoring systems — the same class of issue previously flagged for `user_id` on other methods in this file. Consider replacing `ip_address` with a coarser bucket (e.g., subnet prefix or just "external"/"internal") and dropping `target_user` or replacing it with a boolean like `"targeted": str(target_user is not None)`.
Proposed fix:

     def record_brute_force_attempt(
    -    self, ip_address: str, target_user: str | None = None, action_taken: str = "logged"
    +    self, action_taken: str = "logged", targeted: bool = False
     ) -> None:
         self.brute_force_attempts.add(
             1,
             attributes={
    -            "ip_address": ip_address,
    -            "target_user": target_user or "multiple",
                 "action_taken": action_taken,
    +            "targeted": str(targeted),
             },
         )

🤖 Prompt for AI Agents
Before applying any fix, first verify the finding against the current code and decide whether a code change is actually needed. If the finding is not valid or no change is required, do not modify code for that item and briefly explain why it was skipped. In `@backend/app/core/metrics/security.py` around lines 145 - 155, The record_brute_force_attempt metric currently uses high-cardinality PII labels (ip_address and target_user); modify it to avoid leaking identifiers by replacing ip_address with a coarse bucket (e.g., subnet prefix or a simple "external"/"internal" label computed from ip_address) and remove target_user or replace it with a low-cardinality flag such as "targeted": str(target_user is not None); update the call to self.brute_force_attempts.add in record_brute_force_attempt to send only the coarse ip bucket (or external/internal) and the "targeted" boolean plus action_taken, ensuring no raw IPs or usernames are emitted as metric attributes.
🤖 Fix all issues with AI agents
Before applying any fix, first verify the finding against the current code and
decide whether a code change is actually needed. If the finding is not valid or
no change is required, do not modify code for that item and briefly explain why
it was skipped.
In `@backend/app/core/metrics/security.py`:
- Around line 145-155: The record_brute_force_attempt metric currently uses
high-cardinality PII labels (ip_address and target_user); modify it to avoid
leaking identifiers by replacing ip_address with a coarse bucket (e.g., subnet
prefix or a simple "external"/"internal" label computed from ip_address) and
remove target_user or replace it with a low-cardinality flag such as "targeted":
str(target_user is not None); update the call to self.brute_force_attempts.add
in record_brute_force_attempt to send only the coarse ip bucket (or
external/internal) and the "targeted" boolean plus action_taken, ensuring no raw
IPs or usernames are emitted as metric attributes.
In `@backend/app/services/auth_service.py`:
- Around line 34-39: The hardcoded _dummy_hash uses cost factor 12 which can
differ from Settings.BCRYPT_ROUNDS and break the timing mitigation; update the
initialization so the dummy hash is generated at startup using the active cost
factor (Settings.BCRYPT_ROUNDS) instead of a static string, e.g., replace the
static _dummy_hash with logic in the AuthService (or module init) that calls
bcrypt.hash with a dummy password and the configured rounds, and ensure
verify_password continues to use this generated _dummy_hash when a username is
not found so timing matches real user hashes.
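A sketch of generating the dummy hash at construction time with the configured cost factor; this assumes the project hashes with bcrypt directly, and if it goes through passlib or `SecurityService` instead, the same idea applies with that API:

```python
import bcrypt

class AuthService:
    def __init__(self, settings, security_service) -> None:
        self._security = security_service
        # Build the dummy hash once, with the *configured* rounds, so the
        # "user not found" path costs roughly the same as a real verification.
        self._dummy_hash = bcrypt.hashpw(
            b"dummy-password", bcrypt.gensalt(rounds=settings.BCRYPT_ROUNDS)
        ).decode()

    def _burn_verification_time(self, password: str) -> None:
        # Called when no user exists for the supplied username.
        self._security.verify_password(password, self._dummy_hash)
```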
- Around line 127-128: The AccountLockedError is raised before the metrics
context manager used by track_authentication runs, so lockout rejections are not
recorded in authentication_attempts_total/authentication_failures_total; to
include them, move the call to self._lockout.check_locked(username) (and the
subsequent raise AccountLockedError) inside the track_authentication context
manager used by the authenticate flow (i.e., within the block that
track_authentication wraps) so the same metrics are incremented for lockout
rejections, or leave it as-is if you intentionally want to exclude lockouts from
metrics — update the placement in authenticate/track_authentication accordingly.
- Around line 120-176: The current with
self.security_metrics.track_authentication("login") wraps the entire login flow
so infrastructure errors (user_repo.get_user,
_runtime_settings.get_effective_settings) are recorded as authentication
failures; narrow the scope by removing the outer context and instead open the
context only around the credential-verification block where passwords are
checked and failures are recorded (i.e., around security_service.verify_password
calls and the calls to _fail_login / _lockout.clear_attempts) inside the login
method so DB or settings errors are not counted as auth failures.
In `@backend/grafana/provisioning/dashboards/http-middleware.json`:
- Around line 320-342: The CPU stat panel (id 11, title "CPU") is using the
wrong unit; update the fieldConfig.defaults.unit for the panel that targets the
metric "system_cpu_percent" from "short" to "percent" so Grafana renders a %
suffix for 0–100 values (if your metric were 0–1 use "percentunit" instead);
locate the panel by its id/title or the target expr "system_cpu_percent" and
change the unit string accordingly.
In `@backend/grafana/provisioning/dashboards/security-auth.json`:
- Line 25: The dashboard currently hardcodes the datasource value "Victoria
Metrics" in each panel's "datasource" field, which breaks portability; update
every panel in security-auth.json that sets "datasource": "Victoria Metrics" to
instead reference a template variable (e.g. "${DS_VICTORIAMETRICS}") or use the
UID form (an object with "uid": "<datasource-uid>") so the panels bind to the
provisioned datasource regardless of display name; search for every "datasource"
key in this file and replace the hardcoded string with the chosen variable/UID
form and ensure the dashboard JSON includes a matching dashboard variable
(DS_VICTORIAMETRICS) if you use the template approach.
🧹 Nitpick comments (3)
🤖 Fix all nitpicks with AI agents
Before applying any fix, first verify the finding against the current code and decide whether a code change is actually needed. If the finding is not valid or no change is required, do not modify code for that item and briefly explain why it was skipped. In `@backend/app/services/auth_service.py`: - Around line 127-128: The AccountLockedError is raised before the metrics context manager used by track_authentication runs, so lockout rejections are not recorded in authentication_attempts_total/authentication_failures_total; to include them, move the call to self._lockout.check_locked(username) (and the subsequent raise AccountLockedError) inside the track_authentication context manager used by the authenticate flow (i.e., within the block that track_authentication wraps) so the same metrics are incremented for lockout rejections, or leave it as-is if you intentionally want to exclude lockouts from metrics — update the placement in authenticate/track_authentication accordingly. - Around line 120-176: The current with self.security_metrics.track_authentication("login") wraps the entire login flow so infrastructure errors (user_repo.get_user, _runtime_settings.get_effective_settings) are recorded as authentication failures; narrow the scope by removing the outer context and instead open the context only around the credential-verification block where passwords are checked and failures are recorded (i.e., around security_service.verify_password calls and the calls to _fail_login / _lockout.clear_attempts) inside the login method so DB or settings errors are not counted as auth failures. In `@backend/grafana/provisioning/dashboards/security-auth.json`: - Line 25: The dashboard currently hardcodes the datasource value "Victoria Metrics" in each panel's "datasource" field, which breaks portability; update every panel in security-auth.json that sets "datasource": "Victoria Metrics" to instead reference a template variable (e.g. "${DS_VICTORIAMETRICS}") or use the UID form (an object with "uid": "<datasource-uid>") so the panels bind to the provisioned datasource regardless of display name; search for every "datasource" key in this file and replace the hardcoded string with the chosen variable/UID form and ensure the dashboard JSON includes a matching dashboard variable (DS_VICTORIAMETRICS) if you use the template approach.backend/app/services/auth_service.py (2)
127-128: Locked-account check before `track_authentication`: `AccountLockedError` bypasses auth metrics. This means locked-account rejections won't appear in the `authentication_attempts_total` or `authentication_failures_total` metrics. If you want visibility into lockout rates via the same dashboard, the lockout check should be inside the context manager. If the intent is to only count "real" authentication attempts, the current placement is fine; just calling it out as a design choice.
🤖 Prompt for AI Agents
Before applying any fix, first verify the finding against the current code and decide whether a code change is actually needed. If the finding is not valid or no change is required, do not modify code for that item and briefly explain why it was skipped. In `@backend/app/services/auth_service.py` around lines 127 - 128, The AccountLockedError is raised before the metrics context manager used by track_authentication runs, so lockout rejections are not recorded in authentication_attempts_total/authentication_failures_total; to include them, move the call to self._lockout.check_locked(username) (and the subsequent raise AccountLockedError) inside the track_authentication context manager used by the authenticate flow (i.e., within the block that track_authentication wraps) so the same metrics are incremented for lockout rejections, or leave it as-is if you intentionally want to exclude lockouts from metrics — update the placement in authenticate/track_authentication accordingly.
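If the team does want lockouts counted, the ordering would look roughly like the sketch below. Everything here is a stand-in: `track_authentication`, `check_locked`, and `AccountLockedError` are toy versions that only illustrate where the lockout check sits relative to the metrics context, not the actual AuthService code.

```python
from contextlib import contextmanager

class AccountLockedError(Exception):
    """Stand-in for the real locked-account error."""

@contextmanager
def track_authentication(method: str):
    # Toy stand-in for the security metrics context manager: the real one
    # increments attempt/failure counters and records duration on exit.
    try:
        yield
        print(f"auth_attempt method={method} outcome=success")
    except Exception:
        print(f"auth_attempt method={method} outcome=failure")
        raise

def check_locked(username: str) -> None:
    # Toy lockout check; the real lockout service tracks failed attempts.
    if username == "locked-user":
        raise AccountLockedError(username)

def login(username: str, password: str) -> None:
    # With the lockout check inside the metrics context, a locked account is
    # still counted as an attempt/failure before AccountLockedError propagates.
    with track_authentication("login"):
        check_locked(username)
        # ... credential verification and token issuance would follow ...
```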
120-176: `track_authentication` now wraps the login flow — past review concern resolved. The context manager in the `finally` block records both success and failure outcomes with duration. One subtlety: infrastructure errors (e.g., DB failures from `get_user` on line 131 or `get_effective_settings` on line 142) will be counted as authentication failures in the histogram. If this is undesirable, consider narrowing the context manager to only the credential-verification portion (lines 131–138).
🤖 Prompt for AI Agents
Before applying any fix, first verify the finding against the current code and decide whether a code change is actually needed. If the finding is not valid or no change is required, do not modify code for that item and briefly explain why it was skipped. In `@backend/app/services/auth_service.py` around lines 120 - 176, The current with self.security_metrics.track_authentication("login") wraps the entire login flow so infrastructure errors (user_repo.get_user, _runtime_settings.get_effective_settings) are recorded as authentication failures; narrow the scope by removing the outer context and instead open the context only around the credential-verification block where passwords are checked and failures are recorded (i.e., around security_service.verify_password calls and the calls to _fail_login / _lockout.clear_attempts) inside the login method so DB or settings errors are not counted as auth failures.
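Conversely, the narrowed placement this prompt describes could look roughly like the sketch below. The helper and error names mirror those quoted in the review, but the overall structure (and the stubbed context manager) is an assumption for illustration, not the actual implementation.

```python
from contextlib import contextmanager

class InvalidCredentialsError(Exception):
    """Stand-in for the real invalid-credentials error."""

@contextmanager
def track_authentication(method: str):
    # Toy stand-in: the real metrics context records outcome and duration.
    yield

def login(username: str, password: str, user_repo, security_service):
    # Repository/settings lookups stay outside the metrics context, so a DB or
    # settings outage is not recorded as an authentication failure.
    user = user_repo.get_user(username)

    # Only credential verification is wrapped; failures recorded here reflect
    # bad credentials rather than infrastructure errors.
    with track_authentication("login"):
        if user is None or not security_service.verify_password(
            password, user.hashed_password
        ):
            raise InvalidCredentialsError(username)
```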
backend/grafana/provisioning/dashboards/security-auth.json (1)
25-25: Hardcoded datasource name `"Victoria Metrics"` across all panels. If the provisioned datasource is renamed or the deployment uses a different name, every panel silently fails. Consider using a template variable (e.g., `"datasource": "${DS_VICTORIAMETRICS}"`) or at minimum the `"uid"` form instead of a display-name string, so the dashboard is portable across environments.
Also applies to: 54-54, 78-78, 101-101, 124-124, 147-147, 183-183, 217-217, 241-241, 283-283, 312-312
🤖 Prompt for AI Agents
Before applying any fix, first verify the finding against the current code and decide whether a code change is actually needed. If the finding is not valid or no change is required, do not modify code for that item and briefly explain why it was skipped. In `@backend/grafana/provisioning/dashboards/security-auth.json` at line 25, The dashboard currently hardcodes the datasource value "Victoria Metrics" in each panel's "datasource" field, which breaks portability; update every panel in security-auth.json that sets "datasource": "Victoria Metrics" to instead reference a template variable (e.g. "${DS_VICTORIAMETRICS}") or use the UID form (an object with "uid": "<datasource-uid>") so the panels bind to the provisioned datasource regardless of display name; search for every "datasource" key in this file and replace the hardcoded string with the chosen variable/UID form and ensure the dashboard JSON includes a matching dashboard variable (DS_VICTORIAMETRICS) if you use the template approach.
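One way to make the portability point stick mechanically, in the same spirit as the PR's dashboard contract test: a pytest-style check that fails whenever a panel pins a datasource by display name. This is an illustrative sketch, not part of the PR; the directory path and the "${...}" variable convention are assumptions.

```python
import json
from pathlib import Path

# Assumed location of the provisioned dashboards.
DASHBOARD_DIR = Path("backend/grafana/provisioning/dashboards")

def test_panels_do_not_hardcode_datasource_display_name() -> None:
    for dashboard_path in DASHBOARD_DIR.glob("*.json"):
        dashboard = json.loads(dashboard_path.read_text())
        # Note: row-nested panels are not handled in this sketch.
        for panel in dashboard.get("panels", []):
            ds = panel.get("datasource")
            if isinstance(ds, str):
                # Template variables like "${DS_VICTORIAMETRICS}" are fine;
                # bare display names such as "Victoria Metrics" are not.
                assert ds.startswith("${"), (
                    f"{dashboard_path.name} panel {panel.get('id')} hardcodes "
                    f"datasource {ds!r}; use a template variable or a uid object"
                )
```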
Actionable comments posted: 1
🤖 Fix all issues with AI agents
Verify each finding against the current code and only fix it if needed.
In `@backend/grafana/provisioning/dashboards/http-middleware.json`:
- Around line 213-241: The "Event Store & Idempotency" panel mixes two different
units under a single field default ("unit": "short"): the "Store Ops" series
(rate of event_store_operations_total, ops/s) and the "Idempotency p95" series
(histogram_quantile over idempotency_processing_duration_seconds_bucket,
seconds). Fix by either splitting into two panels (one for Store Ops, one for
Idempotency p95) or add field overrides on the panel to set the unit for the
"Idempotency p95" series to "s" and the "Store Ops" series to an appropriate ops
unit; locate the panel by title "Event Store & Idempotency" and targets with
legendFormat "Store Ops" and "Idempotency p95" to implement the change.
- Around line 24-47: The Request Rate panel currently sets the axis unit to
"short"; change the fieldConfig.defaults.unit for the panel with title "Request
Rate" (id 1) to "reqps" so the timeseries using expr
"rate(http_requests_total[5m])" displays requests/second units; update the
"unit" value in that panel's fieldConfig.defaults object accordingly.
🧹 Nitpick comments (1)
🤖 Fix all nitpicks with AI agents
Verify each finding against the current code and only fix it if needed.
In `@backend/grafana/provisioning/dashboards/http-middleware.json`:
- Around line 24-47: The Request Rate panel currently sets the axis unit to "short"; change the fieldConfig.defaults.unit for the panel with title "Request Rate" (id 1) to "reqps" so the timeseries using expr "rate(http_requests_total[5m])" displays requests/second units; update the "unit" value in that panel's fieldConfig.defaults object accordingly.
backend/grafana/provisioning/dashboards/http-middleware.json (1)
24-47: Request Rate panel could use `"unit": "reqps"` instead of `"short"`. Grafana's `reqps` (requests/second) unit would add proper axis labeling for the `rate(http_requests_total[5m])` expression. This is a minor readability improvement.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/grafana/provisioning/dashboards/http-middleware.json` around lines 24 - 47, The Request Rate panel currently sets the axis unit to "short"; change the fieldConfig.defaults.unit for the panel with title "Request Rate" (id 1) to "reqps" so the timeseries using expr "rate(http_requests_total[5m])" displays requests/second units; update the "unit" value in that panel's fieldConfig.defaults object accordingly.
🤖 Fix all issues with AI agents
Verify each finding against the current code and only fix it if needed.
In `@backend/grafana/provisioning/dashboards/http-middleware.json`:
- Around line 1-405: Panel "Request Rate" (id 1) is using the generic unit
"short" — change the fieldConfig.defaults.unit for the timeseries panel with id
1 (title "Request Rate") from "short" to a more appropriate Grafana unit such as
"reqps" or "ops" so the rate(http_requests_total[5m]) series displays with a
requests-per-second suffix.
🧹 Nitpick comments (1)
🤖 Fix all nitpicks with AI agents
Verify each finding against the current code and only fix it if needed.
In `@backend/grafana/provisioning/dashboards/http-middleware.json`:
- Around line 1-405: Panel "Request Rate" (id 1) is using the generic unit "short" — change the fieldConfig.defaults.unit for the timeseries panel with id 1 (title "Request Rate") from "short" to a more appropriate Grafana unit such as "reqps" or "ops" so the rate(http_requests_total[5m]) series displays with a requests-per-second suffix.
backend/grafana/provisioning/dashboards/http-middleware.json (1)
1-405: Request Rate panel (id 1) uses `"unit": "short"`; consider `"reqps"` or `"ops"`. The `rate(http_requests_total[5m])` expression yields requests-per-second, so Grafana's `"reqps"` or `"ops"` unit would add a meaningful suffix. This is minor and cosmetic.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/grafana/provisioning/dashboards/http-middleware.json` around lines 1 - 405, Panel "Request Rate" (id 1) is using the generic unit "short" — change the fieldConfig.defaults.unit for the timeseries panel with id 1 (title "Request Rate") from "short" to a more appropriate Grafana unit such as "reqps" or "ops" so the rate(http_requests_total[5m]) series displays with a requests-per-second suffix.



Summary by cubic
Aligns backend metrics with Grafana via contract tests and focused dashboards, and moves Grafana config to env-driven secrets with a clean /grafana proxy. Simplifies and instruments auth (lockout → 423), improves CSRF tracking, and tightens observability with fewer, clearer metrics.
New Features
Migration
Written for commit 24c40ca. Summary will update on new commits.
Summary by CodeRabbit
New Features
Improvements
Documentation