
Fix/keys #200

Merged
HardMax71 merged 9 commits into main from fix/keys on Feb 16, 2026

Conversation

@HardMax71 (Owner) commented Feb 16, 2026


Summary by cubic

Aligns backend metrics with Grafana via contract tests and focused dashboards, and moves Grafana config to env-driven secrets with a clean /grafana proxy. Simplifies and instruments auth (lockout → 423), improves CSRF tracking, and tightens observability with fewer, clearer metrics.

  • New Features

    • Contract tests validate Grafana dashboards against OTel/Prometheus metric names; added docs.
    • New Grafana dashboards: Security & Auth, Notifications, Kubernetes Pods, Coordinator & Execution, Event Replay, HTTP & Middleware; simplified Kafka/Event Stream views; fixed HTTP ops plot in the HTTP & Middleware dashboard.
    • Authentication: routes delegate to AuthService; login lockout mapped to 423; emits auth events; CSRF validation reasons recorded via SecurityService; corrected "Invalid credentials" message.
    • Observability: SSE connection duration, replay session creation/speed, notification throttle/subscription change, DLQ processing errors (incl. batch retry), active pod creations; removed low-value metrics (including HealthMetrics and unused counters).
  • Migration

    • Set GRAFANA_ADMIN_USER, GRAFANA_ADMIN_PASSWORD, and GRAFANA_ROOT_URL in CI/runtime; release workflow and docker-compose read them; grafana.ini no longer stores these; frontend nginx rewrites /grafana to strip the prefix.
    • Clients should handle HTTP 423 on login when accounts are locked (see the client-side sketch after this list).
    • Several metrics were removed/renamed; rely on updated dashboards and contract tests.
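
A hedged client-side sketch of that 423 handling using httpx; the endpoint path and payload shape are assumptions, not the project's actual API:

import httpx

def login(client: httpx.Client, username: str, password: str) -> dict:
    # client is assumed to be constructed with a base_url pointing at the backend
    resp = client.post("/api/auth/login", data={"username": username, "password": password})
    if resp.status_code == 423:
        # Account exists but is temporarily locked; surface a retry-later error
        raise RuntimeError("Account locked, try again later")
    resp.raise_for_status()
    return resp.json()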

Written for commit 24c40ca. Summary will update on new commits.

Summary by CodeRabbit

  • New Features

    • Account locking (HTTP 423) and several new Grafana dashboards for monitoring (Coordinator & Execution, Event Replay, SSE/Event Stream, HTTP & Middleware, Kubernetes & Pods, Notifications, Security/Auth, DLQ).
  • Improvements

    • Unified authentication flow with improved token/session handling and added security/operational metrics.
    • Grafana deployment enhancements: root URL, admin credentials support, and proxy path handling.
  • Documentation

    • Added contract testing docs and automated dashboard-vs-code contract tests.

@coderabbitai bot commented Feb 16, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough

Consolidates auth flows into a new AuthService, adds AccountLockedError and CSRF/security metrics wiring, significantly trims many metrics modules, adds multiple Grafana dashboards and a dashboard-vs-code contract test, and exposes Grafana env/CI variables plus an nginx grafana proxy rewrite.

Changes

• Auth & domain (backend/app/api/routes/auth.py, backend/app/services/auth_service.py, backend/app/domain/user/user_models.py, backend/app/domain/user/__init__.py): Routes now delegate login/register/logout to a unified AuthService; a LoginResult dataclass was added and exported; AuthService gains login/register/publish_logout_event, lockout handling, metrics, runtime settings, and producer/settings wiring.
• Exceptions & handlers (backend/app/domain/exceptions.py, backend/app/core/exceptions/handlers.py): New AccountLockedError added to domain exceptions and mapped to HTTP 423 in the exception handlers.
• Security / CSRF (backend/app/core/security.py, backend/app/core/middlewares/csrf.py): SecurityService now accepts SecurityMetrics and records CSRF validation failures; CSRF middleware logging switched to structlog with structured arguments.
• Metrics package pruning & changes (backend/app/core/metrics/... : __init__.py, connections.py, dlq.py, events.py, execution.py, health.py, kubernetes.py, notifications.py, replay.py, security.py): Removed HealthMetrics and many instruments/public recorder methods across metrics modules; DLQ now tracks per-topic sizes internally; the security metrics surface was reworked (new context manager, trimmed APIs); multiple metric methods were removed or had their signatures changed.
• Providers & DI wiring (backend/app/core/providers.py): Removed the health metrics provider; SecurityMetrics is propagated into the security/auth/admin providers, and constructors were updated to accept the additional metrics/dependencies.
• Service integrations (backend/app/services/... : admin_user_service.py, notification_service.py, k8s_worker/worker.py, event_replay/replay_service.py, sse/sse_service.py, dlq/manager.py): SecurityMetrics wired into the admin service; metric calls added or adjusted across the notification, replay, SSE, DLQ, and pod-creation flows; some paths gain metric side effects (e.g., a DLQ error recorded on batch retry failure).
• Grafana provisioning & config (backend/grafana/provisioning/dashboards/*.json, backend/grafana/grafana.ini): Many new and updated Grafana dashboards; grafana.ini no longer embeds admin credentials and some server settings.
• Deployment & proxy (docker-compose.yaml, .github/workflows/release-deploy.yml, frontend/nginx.conf.template): GRAFANA_ROOT_URL, GRAFANA_ADMIN_USER, and GRAFANA_ADMIN_PASSWORD exposed via compose and CI; release-deploy.yml passes the new secrets; the nginx proxy now rewrites the /grafana/ prefix.
• Contract testing & docs (backend/tests/contract/test_grafana_metrics.py, docs/testing/contract-testing.md, backend/pyproject.toml, mkdocs.yml): New contract test asserts that Grafana PromQL metric names match the Python OTel metric families; added a pytest marker and documentation for contract testing.
• Tests updated (backend/tests/unit/..., backend/tests/unit/conftest.py): Unit tests updated to reflect removed metrics/APIs, the HealthMetrics fixture removed, SecurityService/CSRF tests adjusted to inject SecurityMetrics, and multiple metric call sites/signatures updated.

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant Route as auth.py (login)
    participant Auth as AuthService
    participant Lockout as LoginLockoutService
    participant Repo as UserRepository
    participant Sec as SecurityService
    participant Producer as EventProducer

    Client->>Route: POST /login (username, password, headers)
    Route->>Auth: login(username, password, ip, user_agent)
    Auth->>Lockout: check_lockout(username)
    alt account locked
        Auth->>Auth: record lockout metric
        Auth-->>Route: raise AccountLockedError
    else not locked
        Auth->>Repo: fetch user by username
        alt user missing
            Auth->>Auth: perform dummy password verify (timing mitigation)
            Auth->>Auth: record failed auth metric
            Auth-->>Route: raise InvalidCredentialsError
        else user found
            Auth->>Sec: verify_password(password, stored_hash)
            alt invalid password
                Auth->>Auth: record failed attempt & maybe lock
                Auth-->>Route: raise InvalidCredentialsError
            else valid password
                Auth->>Lockout: reset_lockout(username)
                Auth->>Auth: generate access & csrf tokens
                Auth->>Producer: publish UserLoggedInEvent(metadata)
                Auth-->>Route: return LoginResult (tokens, csrf, timeout)
                Route->>Client: set cookies, 200 OK
            end
        end
    end
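The walkthrough above notes that AccountLockedError is mapped to HTTP 423 in the exception handlers. A minimal sketch of that mapping, assuming a FastAPI-style app; apart from the error name and the 423 status, the names here are illustrative rather than the project's actual handlers.py:

from fastapi import FastAPI, Request
from fastapi.responses import JSONResponse

class AccountLockedError(Exception):
    """Raised by the login flow when the lockout threshold is exceeded."""

def register_exception_handlers(app: FastAPI) -> None:
    @app.exception_handler(AccountLockedError)
    async def handle_account_locked(request: Request, exc: AccountLockedError) -> JSONResponse:
        # 423 Locked: the credentials may be valid, but the account is temporarily locked
        return JSONResponse(status_code=423, content={"detail": "Account temporarily locked"})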

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

Poem

🐰 I pruned old gauges and hopped through each line,
Tokens now gather in one tidy design,
Dashboards bloom and tests chase every metric name,
Nginx trims the path and Grafana greets the frame,
A rabbit cheers—auth unified and metrics tame.

🚥 Pre-merge checks | ✅ 2 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

• Docstring Coverage: ⚠️ Warning. Docstring coverage is 33.33%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
• Title check: ❓ Inconclusive. The title "Fix/keys" is vague and does not clearly convey the main changes in this substantial PR, which covers Grafana metrics, dashboards, authentication refactoring, and observability improvements. Resolution: consider a more descriptive title such as "Align Grafana metrics with contract tests and refactor auth service" or "Add Grafana dashboards and metric contract validation" to better reflect the primary objectives.
✅ Passed checks (2 passed)
• Description Check: ✅ Passed. Check skipped because CodeRabbit's high-level summary is enabled.
• Merge Conflict Detection: ✅ Passed. No merge conflicts detected when merging into main.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


@cubic-dev-ai bot left a comment


13 issues found across 54 files

Prompt for AI agents (all issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="backend/app/core/metrics/notifications.py">

<violation number="1" location="backend/app/core/metrics/notifications.py:150">
P2: The new decrement_unread_count blindly subtracts 1 and can drive unread counts negative or drift from actual per-user state. Preserve a cached per-user count and clamp at zero when decrementing so the metric stays accurate.</violation>
</file>
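
A self-contained sketch of the clamp-at-zero approach suggested above; the real module uses an OTel UpDownCounter, which is stubbed here, and the attribute names are assumptions:

class _StubUpDownCounter:
    def __init__(self) -> None:
        self.value = 0

    def add(self, amount: int) -> None:
        self.value += amount

class NotificationMetrics:
    def __init__(self) -> None:
        self._unread_counts: dict[str, int] = {}    # cached per-user unread count
        self.unread_gauge = _StubUpDownCounter()

    def decrement_unread_count(self, user_id: str) -> None:
        current = self._unread_counts.get(user_id, 0)
        new_value = max(current - 1, 0)              # clamp so the metric never goes negative
        self.unread_gauge.add(new_value - current)   # -1 normally, 0 when already at zero
        self._unread_counts[user_id] = new_value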

<file name="backend/app/dlq/manager.py">

<violation number="1" location="backend/app/dlq/manager.py:249">
P2: `record_dlq_processing_error` expects an event type label, but this passes `event_id`, creating high-cardinality metrics per message. Use the message’s event type (or a fixed label like "manual_retry") instead of a unique ID.</violation>
</file>

<file name="backend/tests/contract/test_grafana_metrics.py">

<violation number="1" location="backend/tests/contract/test_grafana_metrics.py:59">
P2: This fixture removes OTEL_SDK_DISABLED without restoring the previous value, so later tests will see a different environment. Consider restoring the prior value (or use pytest’s monkeypatch fixture) to avoid cross-test side effects.</violation>

<violation number="2" location="backend/tests/contract/test_grafana_metrics.py:63">
P2: This fixture sets a global meter provider without restoring the previous provider, which can leak into other tests. Capture the current provider and restore it in a teardown to avoid cross-test interference.</violation>
</file>

<file name="backend/grafana/provisioning/dashboards/notifications.json">

<violation number="1" location="backend/grafana/provisioning/dashboards/notifications.json:204">
P2: `histogram_quantile` on classic `_bucket` metrics should aggregate the bucket rates with `sum ... by (le)`; without it, the quantile calculation is incorrect or fragmented across bucket series. Wrap the rate with `sum by (le)` (and include any grouping labels you want to keep).</violation>

<violation number="2" location="backend/grafana/provisioning/dashboards/notifications.json:265">
P2: `histogram_quantile` requires `sum ... by (le)` over `_bucket` rates; otherwise the percentile is computed from unaggregated bucket series and is incorrect. Wrap the bucket rate with `sum by (le)`.</violation>

<violation number="3" location="backend/grafana/provisioning/dashboards/notifications.json:360">
P2: `histogram_quantile` should aggregate `_bucket` rates with `sum ... by (le)` to produce a valid quantile. Without the aggregation, Grafana will compute the percentile from per-bucket series and return misleading values.</violation>

<violation number="4" location="backend/grafana/provisioning/dashboards/notifications.json:397">
P2: `histogram_quantile` on classic histogram buckets needs `sum ... by (le)` around the `rate(...)`; otherwise the percentile is not computed correctly. Aggregate the bucket rates first.</violation>

<violation number="5" location="backend/grafana/provisioning/dashboards/notifications.json:426">
P2: `histogram_quantile` should receive `sum ... by (le)` of the bucket rates; otherwise the quantile is computed on raw bucket series. Aggregate the rate by `le` (and any grouping labels you need).</violation>
</file>

<file name="docker-compose.yaml">

<violation number="1" location="docker-compose.yaml:178">
P1: Hard-coding a default Grafana admin password means the service will start with a known credential whenever GRAFANA_ADMIN_PASSWORD is not set. Require the password to be provided (or use Docker secrets) to avoid insecure defaults.</violation>
</file>

<file name="backend/app/services/auth_service.py">

<violation number="1" location="backend/app/services/auth_service.py:125">
P2: Timing side-channel: username enumeration is possible. When the user doesn't exist, `_fail_login` is called immediately without running `verify_password`. When the user exists but password is wrong, the expensive password hash comparison runs first. An attacker can measure response times to determine whether a username is registered.

Mitigate by running a dummy `verify_password` against a fixed hash when the user is not found, so both paths take comparable time.</violation>
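
A standalone sketch of the constant-cost pattern described above; the hashing scheme and names are illustrative only, not the project's SecurityService:

import hashlib
import hmac
import os

def verify_password(password: str, stored_hash: bytes, salt: bytes) -> bool:
    candidate = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 100_000)
    return hmac.compare_digest(candidate, stored_hash)

_DUMMY_SALT = os.urandom(16)
_DUMMY_HASH = hashlib.pbkdf2_hmac("sha256", b"dummy", _DUMMY_SALT, 100_000)

def login(users: dict[str, tuple[bytes, bytes]], username: str, password: str) -> bool:
    record = users.get(username)
    if record is None:
        # Burn roughly the same time as a real check so "unknown user" and
        # "wrong password" are indistinguishable by response latency.
        verify_password(password, _DUMMY_HASH, _DUMMY_SALT)
        return False
    stored_hash, salt = record
    return verify_password(password, stored_hash, salt)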

<violation number="2" location="backend/app/services/auth_service.py:226">
P1: Bug: `publish_logout_event` passes the **username** (from `decode_token`) as `user_id` to `UserLoggedOutEvent` and `_build_metadata`. All other events in this service (login, registration, auth-failed) correctly use `str(user.user_id)` (the actual UUID). This means logout events will be uncorrelatable with other user events.

You need to look up the user by username to get the real `user_id`, similar to how `get_current_user` does it.</violation>
</file>
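
A sketch of the username-to-user_id resolution the logout comment asks for; the repository, producer, and event interfaces are assumptions borrowed from the description above, so treat it as a shape rather than the actual method:

from dataclasses import dataclass

@dataclass
class UserLoggedOutEvent:            # stand-in for the real domain event
    user_id: str
    metadata: dict

async def publish_logout_event(security_service, user_repository, producer, token: str) -> None:
    username = security_service.decode_token(token)         # token subject is the username
    user = await user_repository.get_by_username(username)
    if user is None:
        return                                              # nothing to correlate
    user_id = str(user.user_id)                             # same UUID the login events use
    event = UserLoggedOutEvent(user_id=user_id, metadata={"user_id": user_id})
    await producer.produce(event, key=user_id)              # keyed like the login flow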

<file name="backend/grafana/provisioning/dashboards/event-stream-monitoring.json">

<violation number="1" location="backend/grafana/provisioning/dashboards/event-stream-monitoring.json:605">
P3: The "Event Buffer Performance" row is now empty (no panels between this row and the next), which leaves a blank section in the dashboard. Either restore the event buffer panels or remove the row to avoid a confusing empty header.</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.


@coderabbitai bot left a comment


Actionable comments posted: 7

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
backend/app/core/metrics/replay.py (1)

144-150: ⚠️ Potential issue | 🟡 Minor

update_replay_queue_size state leaks for finished sessions.

The per-session attribute _queue_{session_id} set via setattr is never cleaned up when a session finalizes. In replay_service.py _finalize_session (lines 261-274), the session is removed from all internal dicts but update_replay_queue_size(session_id, 0) is never called — so the UpDownCounter never decrements to zero for that session, and the cached attribute remains on the metrics object indefinitely.

Consider calling self._metrics.update_replay_queue_size(session.session_id, 0) during finalization, or adding a cleanup method to ReplayMetrics.

backend/app/core/metrics/security.py (1)

78-89: ⚠️ Potential issue | 🟠 Major

user_id and ip_address as metric attributes risk cardinality explosion and PII leakage.

Multiple recording methods (record_authentication_attempt, record_password_reset_request, record_weak_password_attempt, record_brute_force_attempt, record_account_locked) embed user_id or ip_address directly as metric label values. In Prometheus/VictoriaMetrics, each unique label combination creates a separate time series. With many users or IPs, this causes unbounded cardinality growth — degrading query performance and increasing storage costs.

Additionally, user_id and ip_address in metrics labels may constitute PII under GDPR/CCPA, which typically requires justification and retention controls that metrics backends don't natively provide.

Consider removing these high-cardinality identifiers from metric attributes and instead logging them via structured logs (where retention and access controls are easier to manage), or using hashed/bucketed values if breakdowns are needed.

Also applies to: 131-157
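
A small sketch of the hashed/bucketed alternative mentioned above, keeping label cardinality bounded; the bucket count and helper name are assumptions, not project code:

import hashlib

N_BUCKETS = 32   # bounded label space instead of one series per user or IP

def metric_bucket(identifier: str) -> str:
    digest = hashlib.sha256(identifier.encode()).hexdigest()
    return f"bucket_{int(digest, 16) % N_BUCKETS}"

# e.g. attributes={"user_bucket": metric_bucket(user_id)} instead of {"user_id": user_id}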

🤖 Fix all issues with AI agents
In `@backend/app/dlq/manager.py`:
- Line 249: The call to metrics.record_dlq_processing_error currently passes
event_id as the event_type, creating high-cardinality labels; update the call in
manager.py so that metrics.record_dlq_processing_error(original_topic,
event_type, error_type) receives the actual event type extracted from the
message (e.g., use the message/event metadata field that contains the event
type) instead of event_id, preserving event_id only for payload/logging; ensure
you reference the same symbols: metrics.record_dlq_processing_error, event_id,
original_topic and use the message's event_type field (or
message.metadata["event_type"] / equivalent accessor) when invoking the
function.

In `@backend/app/services/auth_service.py`:
- Around line 223-234: publish_logout_event is passing the token subject
(username returned by security_service.decode_token) into
UserLoggedOutEvent.user_id and _build_metadata, causing mismatched IDs vs the
login flow; fix by resolving the username to the canonical user ID before
producing the event: call your user lookup (e.g., fetch by username or use
existing user repository/service used in the login flow to get
str(user.user_id)), then pass that user_id into UserLoggedOutEvent(user_id=...)
and _build_metadata(user_id=...), and use that same user_id as the produce key
when calling self._producer.produce so logout events match login events.
- Around line 108-110: The metric calls currently pass raw usernames into
security_metrics (calls to record_account_locked and
record_weak_password_attempt), which creates high-cardinality/PII labels; update
the calls in AuthService (where locked handling and weak-password handling
occur) to stop passing the plain username — instead pass a stable non-PII
identifier (e.g., numeric user_id if available), or a fixed bucket (e.g.,
"unknown"), or a one-way hash of the username using a secure hash function
(sha256) before calling security_metrics.record_account_locked and
security_metrics.record_weak_password_attempt so the metric labels no longer
contain raw usernames.

In `@backend/app/services/notification_service.py`:
- Around line 548-551: The action string currently set to "enabled" or "updated"
is ambiguous when update_data.enabled is False and should be "disabled" instead;
change the logic around the call to self.metrics.record_subscription_change in
the method containing repository.upsert_subscription so that action = "enabled"
if update_data.enabled is True, action = "disabled" if update_data.enabled is
False, and otherwise "updated" for non-enabled field changes. Also pass the
channel value in the correct parameter name (it represents a
NotificationChannel) when calling self.metrics.record_subscription_change —
ensure you provide it to the parameter that tracks channel/notification channel
rather than notification_type so dashboards receive the proper semantic field.
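
A tiny sketch of the enabled/disabled/updated decision described above; the field semantics are taken from the prompt and may differ from the real models:

def subscription_action(enabled: bool | None) -> str:
    if enabled is True:
        return "enabled"
    if enabled is False:
        return "disabled"
    return "updated"   # enabled untouched; some other subscription field changed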

In `@backend/grafana/provisioning/dashboards/coordinator-execution.json`:
- Around line 38-47: The histogram_quantile calls (e.g.,
histogram_quantile(0.95,
rate(coordinator_scheduling_duration_seconds_bucket[5m]))) are missing an
aggregation and must be changed to aggregate by bucket le before applying
histogram_quantile; update each affected target (panel ids 1,2,5,7,8) to wrap
rate(...) with sum(...) by (le) so the query becomes histogram_quantile(0.95,
sum(rate(coordinator_scheduling_duration_seconds_bucket[5m])) by (le)); apply
the same pattern to any other histogram_quantile uses in this JSON (and
corresponding files noted) to ensure a single aggregated quantile across series.

In `@backend/grafana/provisioning/dashboards/event-stream-monitoring.json`:
- Around line 595-607: Remove the empty dashboard row object with "id": 60 and
"title": "Event Buffer Performance" (gridPos y: 16) from the JSON so it no
longer produces a blank header gap; alternatively, if panels should remain under
that section, move their panel objects to be children below this row or adjust
their gridPos y values, but the simplest fix is to delete the row object with id
60 and its empty "panels": [] entry.

In `@backend/grafana/provisioning/dashboards/notifications.json`:
- Around line 195-211: Panels that plot
histogram_quantile(..._seconds_bucket...) are using the "short" unit so
durations render as raw numbers; update the panels with id 7, 9, 13, and 14
(titles "Channel Delivery Time", "Overall Delivery Time", "Webhook Delivery",
"Slack") to set their unit to "s" (seconds) in the panel JSON (replace the unit
field value currently "short" with "s") so the times display as human-readable
seconds; keep all other panel settings unchanged.
🧹 Nitpick comments (15)
.github/workflows/release-deploy.yml (1)

143-143: Consider parameterizing GRAFANA_ROOT_URL instead of hardcoding it.

The production domain is baked into the deploy script. If it ever changes, this line is easy to miss. A GitHub repository variable (e.g., vars.GRAFANA_ROOT_URL) would keep all environment-specific config in one place alongside the other secrets.

backend/grafana/provisioning/dashboards/event-replay.json (1)

370-371: Minor inconsistencies with other dashboards in the repo.

This dashboard uses schemaVersion: 33 and includes "style": "dark", while the sibling kafka-events-monitoring.json uses schemaVersion: 38 and omits style. The style field was deprecated in Grafana 9+. Consider aligning the schema version and dropping the style field for consistency.

backend/grafana/provisioning/dashboards/http-middleware.json (1)

48-71: Duration and size panels use "unit": "short" instead of semantically correct units.

Several panels display metrics measured in seconds, bytes, or percent, but all default to "unit": "short". This means Grafana won't apply unit-aware formatting (e.g., "1.23 s", "4.5 KiB", "85%"), making values harder to interpret at a glance.

Affected panels:

  • Request Duration (id 2, line 52): queries http_request_duration_seconds_bucket → should use "s"
  • Request/Response Size (id 4, line 99): queries *_size_bytes_bucket → should use "bytes" or "decbytes"
  • MongoDB Query Duration (id 6, line 170): queries *_duration_seconds_bucket → should use "s"
  • Event Store & Idempotency (id 8, line 217): queries *_duration_seconds_bucket → should use "s"
  • CPU (id 11, line 324): queries system_cpu_percent → should use "percent"
Example fix for Request Duration panel
       "fieldConfig": {
         "defaults": {
-          "unit": "short"
+          "unit": "s"
         }
       },
backend/grafana/provisioning/dashboards/security-auth.json (1)

53-76: Duration panels should use "unit": "s" instead of "short".

Same issue as in http-middleware.json: the Authentication Duration panel (id 2) queries authentication_duration_seconds_bucket and Token Expiry Time (id 8, line 220) queries token_expiry_time_seconds_bucket — both should use "unit": "s" so Grafana formats values with time-aware suffixes (ms/s/min).

Suggested fix
 // Panel id 2 (Authentication Duration) and id 8 (Token Expiry Time):
       "fieldConfig": {
         "defaults": {
-          "unit": "short"
+          "unit": "s"
         }
       },
backend/grafana/provisioning/dashboards/coordinator-execution.json (1)

197-254: Script Resources panels use "unit": "short" for memory and percent metrics.

  • Memory Usage (id 7) queries script_memory_usage_MiB_bucket → consider "unit": "MiB" (or "decmbytes")
  • Memory Utilization (id 8) queries script_memory_utilization_percent_bucket → consider "unit": "percent"

This is the same pattern flagged in the other new dashboards for duration panels.

backend/app/core/metrics/kubernetes.py (1)

83-89: The getattr/setattr delta-tracking pattern is not thread-safe.

update_active_pod_creations, update_pod_monitor_pods_watched, and update_pods_by_phase all use getattr/setattr to track the last-known value and compute a delta. If called concurrently from different threads, a race between reading the current value and writing the new value could produce incorrect deltas. This is a pre-existing pattern, but worth noting given that Kubernetes operations are often concurrent.
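
One way to make that read-compute-write cycle atomic is a small lock-guarded tracker; this is a hedged sketch with illustrative names, not the module's current code:

import threading

class DeltaTracker:
    """Thread-safe last-value cache for UpDownCounter-style delta updates."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._last: dict[str, int] = {}

    def delta(self, key: str, current: int) -> int:
        with self._lock:                      # read, compute, and store as one atomic step
            previous = self._last.get(key, 0)
            self._last[key] = current
            return current - previous

# usage sketch: active_pod_creations.add(tracker.delta("active_pod_creations", current_count))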

backend/grafana/provisioning/dashboards/kubernetes-pods.json (1)

53-76: Duration panels use "unit": "short" instead of "s" — values will display as raw numbers rather than human-readable durations.

The pod_creation_duration_seconds_bucket histogram quantile returns values in seconds, but Grafana's "short" unit won't format them as time (e.g., "1.2s"). The same applies to the Monitor Processing Duration panel (id 10, line 290) and Pod Lifetime panel (id 6, line 160). Consider using "unit": "s" for these panels so Grafana renders durations properly.

backend/app/services/sse/sse_service.py (1)

61-61: Consider using time.monotonic() instead of datetime.now(timezone.utc) for duration measurement.

Wall-clock time can jump due to NTP adjustments or clock skew. time.monotonic() is the idiomatic choice for measuring elapsed durations. For long-lived SSE connections this is unlikely to be noticeable, but it's a minor correctness improvement.

♻️ Proposed fix
+import time
 ...
-        start_time = datetime.now(timezone.utc)
+        start_time = time.monotonic()
 ...
-            duration = (datetime.now(timezone.utc) - start_time).total_seconds()
+            duration = time.monotonic() - start_time

Also applies to: 125-126

backend/app/core/metrics/notifications.py (1)

163-164: Minor: retry success rate is only recorded when attempt_number > 1.

This means the first attempt's outcome is never recorded in the retry_success_rate histogram. If the intent is to track success rate of retries specifically (not first attempts), this is correct. If you want to track overall success rate including first attempts, the condition should be attempt_number >= 1.

backend/tests/contract/test_grafana_metrics.py (3)

58-63: Global OTel state mutation without teardown.

os.environ.pop("OTEL_SDK_DISABLED") and otel_api.set_meter_provider(provider) permanently alter process-global state. If any other test in the same process relies on the default no-op provider or on OTEL_SDK_DISABLED, it will break. Consider restoring the env var and shutting down the provider in a yield-based fixture, or running these tests in a dedicated subprocess/marker with --forked.

Sketch: yield-based cleanup
 @pytest.fixture(scope="module")
 def prometheus_families() -> dict[str, set[str]]:
-    os.environ.pop("OTEL_SDK_DISABLED", None)
+    old_val = os.environ.pop("OTEL_SDK_DISABLED", None)
 
     reader = PrometheusMetricReader()
     provider = MeterProvider(metric_readers=[reader])
     otel_api.set_meter_provider(provider)
 
     ...
 
-    return families
+    yield families
+
+    provider.shutdown()
+    if old_val is not None:
+        os.environ["OTEL_SDK_DISABLED"] = old_val

Note: OTel's global set_meter_provider cannot be fully reverted, so subprocess isolation (pytest-forked) is the most robust option.


106-113: Metric name extraction heuristic may produce false positives.

The len(t) > 3 filter on Line 113 is a rough heuristic. Short but valid metric names (e.g., a hypothetical cpu) would pass, while PromQL label values leaking through the regex stripping could be misidentified as metrics. This is acceptable for now since Prometheus naming conventions typically use underscores, but worth documenting the limitation as a code comment.


73-89: Use public MetricReader.collect() API instead of accessing private _collector attribute.

The code relies on private OpenTelemetry SDK internals (provider._meters, meter._instrument_id_instrument, reader._collector) that may change across releases. Replace reader._collector.collect() with the public reader.collect(timeout_millis=...) method to ensure stability. The _meters and _instrument_id_instrument access should also be reconsidered if possible to avoid direct iteration over internal SDK state.

backend/tests/unit/core/test_csrf.py (1)

29-62: Consider extracting a shared fixture for SecurityService to reduce repetition.

Every test method in this file constructs SecurityService(test_settings, security_metrics) individually. A class-scoped pytest.fixture (like in test_security.py) would eliminate the boilerplate across ~15 test methods.

This is a pre-existing pattern, so fine to defer.
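
For illustration, a class-scoped fixture along the lines the comment suggests; the fixture names and import path are assumptions based on the surrounding tests:

import pytest

from app.core.security import SecurityService   # import path assumed

@pytest.fixture(scope="class")
def security_service(test_settings, security_metrics):
    # Built once per test class instead of in every test method
    return SecurityService(test_settings, security_metrics)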

backend/app/services/auth_service.py (2)

80-81: Prefer structured keyword arguments over f-strings in structlog calls.

Using f"Admin access denied for user: {user.username} (role: {user.role})" embeds variables into the message string, which defeats structured logging's purpose (greppable messages, indexed fields). Same pattern on lines 92–93.

Suggested fix
-            self.logger.warning(f"Admin access denied for user: {user.username} (role: {user.role})")
+            self.logger.warning("Admin access denied", username=user.username, role=user.role)
-        self.logger.warning(
-            f"Login failed - {reason}",
+        self.logger.warning(
+            "Login failed",
+            reason=reason,

33-52: Constructor has 8 dependencies — consider whether this signals the service is doing too much.

AuthService now handles login, registration, logout, current-user resolution, and admin authorization checks. The admin authorization check (get_admin) feels out of place in an auth service that's primarily about the login/register lifecycle. That said, this consolidation is the explicit goal of the PR, so flagging for awareness only.


@coderabbitai bot left a comment


Actionable comments posted: 8

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)
backend/grafana/provisioning/dashboards/event-stream-monitoring.json (1)

1094-1098: ⚠️ Potential issue | 🟡 Minor

"SSE Shutdown Monitoring" section has only a right-half panel — left side is empty.

The "SSE Connections Being Drained" panel (id 73) is positioned at x: 12, w: 12, leaving the left half (x: 0–11) blank. A panel (likely id 72) appears to have been removed. Either widen this panel to full width or add a complementary shutdown metric (e.g., shutdown duration, drain timeout events).

🔧 Quick fix – make the panel full-width
       "gridPos": {
         "h": 8,
-        "w": 12,
-        "x": 12,
+        "w": 24,
+        "x": 0,
         "y": 52
       },
🤖 Prompt for AI Agents
Before applying any fix, first verify the finding against the current code and
decide whether a code change is actually needed. If the finding is not valid or
no change is required, do not modify code for that item and briefly explain why
it was skipped.
In `@backend/grafana/provisioning/dashboards/event-stream-monitoring.json` around
lines 1094 - 1098, The "SSE Connections Being Drained" panel (id 73) is only
occupying the right half (gridPos x:12, w:12); update its gridPos to full width
by setting x:0 and w:24 in the panel definition for id 73 (panel title "SSE
Connections Being Drained") so it fills the left half gap, and verify it doesn't
overlap other panels (e.g., any panel id 72) after the change.
backend/app/core/metrics/security.py (2)

137-147: ⚠️ Potential issue | 🟠 Major

High-cardinality PII labels: ip_address and target_user in brute-force metrics.

ip_address and target_user are unbounded label values that will cause cardinality explosion. Brute-force detection by definition involves many distinct IPs. Record these in logs/alerts, not as metric dimensions.

Proposed fix
     def record_brute_force_attempt(
-        self, ip_address: str, target_user: str | None = None, action_taken: str = "logged"
+        self, action_taken: str = "logged"
     ) -> None:
         self.brute_force_attempts.add(
             1,
             attributes={
-                "ip_address": ip_address,
-                "target_user": target_user or "multiple",
                 "action_taken": action_taken,
             },
         )
🤖 Prompt for AI Agents
Before applying any fix, first verify the finding against the current code and
decide whether a code change is actually needed. If the finding is not valid or
no change is required, do not modify code for that item and briefly explain why
it was skipped.
In `@backend/app/core/metrics/security.py` around lines 137 - 147, The
record_brute_force_attempt method is adding high‑cardinality PII (ip_address and
target_user) as metric attributes on self.brute_force_attempts; remove those
unbounded labels and only record bounded dimensions (e.g., keep action_taken and
optionally a coarse target category like target_user_present: "yes"/"no" or
target_type: "multiple"/"single") when calling self.brute_force_attempts.add,
and move the detailed ip_address and target_user values to logs/alerts (e.g.,
logger.warn or alerting pipeline) so you still capture them for investigation
without exploding metric cardinality.

78-86: ⚠️ Potential issue | 🟠 Major

High-cardinality PII labels: user_id in authentication metrics.

user_id is added as a metric attribute on every authentication attempt and failure (lines 82, 86). Each unique user creates a new time series, degrading Prometheus/VictoriaMetrics performance. It also exposes PII in monitoring systems.

Consider removing user_id from metric attributes entirely (it belongs in structured logs, not in metric labels), or bucketing it if cardinality is bounded.

🤖 Prompt for AI Agents
Before applying any fix, first verify the finding against the current code and
decide whether a code change is actually needed. If the finding is not valid or
no change is required, do not modify code for that item and briefly explain why
it was skipped.
In `@backend/app/core/metrics/security.py` around lines 78 - 86, The
record_authentication_attempt function is adding user_id as a metric label on
authentication_attempts and authentication_failures which creates
high-cardinality/PII in metrics; remove user_id from the attributes passed to
authentication_attempts.add and authentication_failures.add (keep only method
and success where needed) and instead record user_id in structured logs or trace
spans if required, or implement a deterministic low-cardinality bucketing
strategy (e.g., hashed bucket) before adding as a metric label if you must keep
a coarse identifier; target the record_authentication_attempt function and the
authentication_attempts/authentication_failures metric add calls to apply the
change.
🤖 Fix all issues with AI agents
Before applying any fix, first verify the finding against the current code and
decide whether a code change is actually needed. If the finding is not valid or
no change is required, do not modify code for that item and briefly explain why
it was skipped.

In `@backend/app/core/metrics/dlq.py`:
- Around line 54-59: The helpers increment_dlq_queue_size and
decrement_dlq_queue_size modify the UpDownCounter (dlq_queue_size) but never
update the cached _dlq_sizes, causing stale deltas in update_dlq_queue_size; fix
by having those helpers compute the new size from _dlq_sizes (e.g. new_size =
_dlq_sizes.get(original_topic, 0) +/- 1), call dlq_queue_size.add(±1,
attributes={"original_topic": original_topic}) and then set
_dlq_sizes[original_topic] = new_size (alternatively, have the helpers call
update_dlq_queue_size with the computed new_size so all updates funnel through
update_dlq_queue_size).
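
A compact sketch of funnelling the increment/decrement helpers through update_dlq_queue_size, as the prompt describes; the counter is stubbed and the names mirror the prompt rather than the actual module:

class _StubUpDownCounter:
    def add(self, amount: int, attributes: dict | None = None) -> None:
        pass   # the real code uses an OTel UpDownCounter

class DLQMetrics:
    def __init__(self) -> None:
        self._dlq_sizes: dict[str, int] = {}           # cached per-topic sizes
        self.dlq_queue_size = _StubUpDownCounter()

    def update_dlq_queue_size(self, original_topic: str, new_size: int) -> None:
        delta = new_size - self._dlq_sizes.get(original_topic, 0)
        if delta:
            self.dlq_queue_size.add(delta, attributes={"original_topic": original_topic})
        self._dlq_sizes[original_topic] = new_size

    def increment_dlq_queue_size(self, original_topic: str) -> None:
        self.update_dlq_queue_size(original_topic, self._dlq_sizes.get(original_topic, 0) + 1)

    def decrement_dlq_queue_size(self, original_topic: str) -> None:
        self.update_dlq_queue_size(original_topic, max(self._dlq_sizes.get(original_topic, 0) - 1, 0))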

In `@backend/app/core/metrics/security.py`:
- Around line 131-132: The password_reset_requests metric is recording
PII/high-cardinality via the user_id attribute in record_password_reset_request;
remove user_id from the attributes passed to password_reset_requests.add and
only include the non-PII dimension (method). Update the
record_password_reset_request method so password_reset_requests.add(…) is called
with attributes={"method": method} (do not log or hash user_id here) to avoid
exposing user identifiers.
- Around line 137-147: The record_brute_force_attempt method is adding
high‑cardinality PII (ip_address and target_user) as metric attributes on
self.brute_force_attempts; remove those unbounded labels and only record bounded
dimensions (e.g., keep action_taken and optionally a coarse target category like
target_user_present: "yes"/"no" or target_type: "multiple"/"single") when
calling self.brute_force_attempts.add, and move the detailed ip_address and
target_user values to logs/alerts (e.g., logger.warn or alerting pipeline) so
you still capture them for investigation without exploding metric cardinality.
- Around line 78-86: The record_authentication_attempt function is adding
user_id as a metric label on authentication_attempts and authentication_failures
which creates high-cardinality/PII in metrics; remove user_id from the
attributes passed to authentication_attempts.add and authentication_failures.add
(keep only method and success where needed) and instead record user_id in
structured logs or trace spans if required, or implement a deterministic
low-cardinality bucketing strategy (e.g., hashed bucket) before adding as a
metric label if you must keep a coarse identifier; target the
record_authentication_attempt function and the
authentication_attempts/authentication_failures metric add calls to apply the
change.

In `@backend/app/services/auth_service.py`:
- Around line 120-175: The login flow never emits SecurityMetrics, so update the
login method to call SecurityMetrics.record_authentication_attempt (or via
self.security_metrics) on both success and failure paths: measure start time at
the top of login, then on the user-not-found branch and invalid-password branch
invoke record_authentication_attempt with failure=True (and include reason like
"user_not_found" or "invalid_password", username, ip_address, user_agent,
duration) before calling or inside _fail_login; on the successful path (before
return) call record_authentication_attempt with failure=False, the username,
ip_address, user_agent, and duration; alternatively refactor _fail_login to
accept and emit the metric so all failure paths centralize metric recording.
Ensure you reference SecurityMetrics.record_authentication_attempt and the login
and _fail_login symbols when making the changes.
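
A minimal sketch of emitting exactly one authentication-attempt metric per login, on both outcomes, with a duration; the recorder signature and label set here are assumptions, not SecurityMetrics' real API:

import time

def login_with_metrics(verify, record_authentication_attempt, username: str, password: str) -> None:
    start = time.monotonic()
    success = False
    try:
        verify(username, password)       # the real flow raises on invalid credentials or lockout
        success = True
    finally:
        # One attempt metric per login, covering both the success and failure paths
        record_authentication_attempt(
            success=success,
            method="password",
            duration_seconds=time.monotonic() - start,
        )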

In `@backend/grafana/provisioning/dashboards/coordinator-execution.json`:
- Around line 26-29: Several panels set fieldConfig.defaults.unit to "short" but
their metrics have units; update the fieldConfig.defaults.unit for the affected
panels (panel id 1, 2, 5 --> use "s"; panel id 7 --> use "decmbytes" or add an
explicit axis label; panel id 8 --> use "percent") and also adjust the same
setting for the other listed ranges (panels around lines referenced: 50-53,
133-136, 198-202, 228-231) so each panel's "unit" matches its metric suffix
(e.g., replace "short" with "s", "decmbytes", or "percent" in the JSON objects
that contain "fieldConfig": {"defaults": {"unit": ...}} for the corresponding
panels).

In `@backend/grafana/provisioning/dashboards/event-stream-monitoring.json`:
- Around line 1094-1098: The "SSE Connections Being Drained" panel (id 73) is
only occupying the right half (gridPos x:12, w:12); update its gridPos to full
width by setting x:0 and w:24 in the panel definition for id 73 (panel title
"SSE Connections Being Drained") so it fills the left half gap, and verify it
doesn't overlap other panels (e.g., any panel id 72) after the change.

In `@backend/grafana/provisioning/dashboards/http-middleware.json`:
- Around line 62-70: The duration panels are using "unit": "short" instead of
the proper units; update each affected panel object (e.g., the panel with
"title": "Request Duration" and the other panels at IDs 2, 4, 6, and 8
referenced in the review) to set "unit": "s" for metrics derived from
*_duration_seconds_bucket and "unit": "bytes" for metrics from
*_size_bytes_bucket (request/response size panels). Locate the JSON panel
objects by their "id" and "title" fields in the dashboard file and replace the
unit value accordingly so Grafana will auto-format seconds and bytes.
- Around line 86-93: The dashboard query is using the wrong metric name; update
the Grafana dashboard JSON so the target's "expr" value is
"http_requests_active" instead of "http_requests_active_requests". Locate the
dashboard object that has "title": "Active Requests" and the "targets" array
(the target with "refId": "A") and replace the expr string accordingly to match
the metric registered as "http_requests_active".
- Around line 357-364: The dashboard panel targets a nonexistent, non-conformant
metric "process_metrics_mixed" (panel title "Process Metrics"); update the panel
in http-middleware.json by replacing the expression in the targets.expr field
with a valid Prometheus metric (for example use process_resident_memory_bytes,
process_open_fds, or another single-quantity metric used elsewhere in the
codebase) or remove the panel entirely if that data source is not available;
ensure the expr uses a real metric name consistent with Prometheus naming
conventions and adjust the panel title/legend if you pick a specific metric.

In `@backend/grafana/provisioning/dashboards/security-auth.json`:
- Around line 53-76: Panel units are incorrect: change the fieldConfig default
unit from "short" to the seconds unit "s" for the duration panels. Update the
JSON for the panel with id 2 (title "Authentication Duration") and the panel
with id 8 (title "Token Expiry Time") by replacing "unit": "short" with "unit":
"s" under each panel's fieldConfig.defaults so Grafana renders values as
human-friendly seconds.
🧹 Nitpick comments (5)
🤖 Fix all nitpicks with AI agents
Before applying any fix, first verify the finding against the current code and
decide whether a code change is actually needed. If the finding is not valid or
no change is required, do not modify code for that item and briefly explain why
it was skipped.

In `@backend/app/services/auth_service.py`:
- Around line 177-229: The UserRegisteredEvent emitted from the register method
currently includes created_user.email which may accidentally expose PII; review
the register function and UserRegisteredEvent usage and either remove the email
field from the event payload or replace it with a hashed/consented identifier
(e.g., omit email when constructing UserRegisteredEvent or supply an email_hash
or consent flag) so downstream consumers do not receive raw email addresses;
update the producer call site where UserRegisteredEvent is constructed and any
consumers that expect the email field to handle its absence or the new hashed
value.

In `@backend/grafana/provisioning/dashboards/coordinator-execution.json`:
- Line 25: The dashboard hardcodes "datasource": "Victoria Metrics" in multiple
panels (see occurrences at the panel datasource properties) which breaks
portability; add a dashboard-level templating variable named DS of type
"datasource" (query "prometheus" and default current value "Victoria Metrics")
and replace each panel's hardcoded datasource value with the template reference
"${DS}" so panels use the provisioned datasource; update the templating block
under the top-level "templating" -> "list" and change every panel datasource
field (e.g., in the panel objects that currently have "datasource": "Victoria
Metrics") to "datasource": "${DS}" to make the dashboard portable.

In `@backend/grafana/provisioning/dashboards/event-stream-monitoring.json`:
- Line 287: Panel 54 ("Event Publishing Rate") uses a 1-minute rate window in
the expression sum(rate(events_published_total[1m])); update this to use a
5-minute window (sum(rate(events_published_total[5m]))) to match the other
panels (e.g., panel 66 "Event Publishing Rate by Type") and avoid noisy or
missing data when scrape intervals are ≥1m; ensure any other rate() calls in the
dashboard remain consistent with the chosen window.
- Around line 104-108: The "Event Stream Overview" stat row has unused grid
columns because panels were removed and remaining panels (e.g., panel id
52/54/56 or the panels with gridPos entries shown) are left with x/w settings
that only cover 14 of 24 units; update the panels in the dashboard JSON to
eliminate gaps by either increasing the "w" of the existing stat/timeseries
panels and adjusting their "x" values to pack across x:0–23, or add new stat
panels (suggest metrics like total_messages_sent or event_store_failure_rate) to
occupy the missing grid columns; locate the affected panels by their panel ids
(the ones skipping 53 and 55 referenced in the comment) and modify their
"gridPos" objects so the row fills the full 24-unit width consistently (apply
the same fix to the other occurrence noted around lines 261–266).

In `@backend/grafana/provisioning/dashboards/http-middleware.json`:
- Line 25: Multiple panels currently hardcode "datasource": "Victoria Metrics",
which breaks dashboards across environments; add a dashboard-level template
variable (e.g., name it DS_VICTORIA_METRICS) in the dashboard's
"templating.list" and replace each panel's "datasource": "Victoria Metrics" with
"datasource": "${DS_VICTORIA_METRICS}" (update occurrences around the existing
"datasource" keys such as the ones shown) so the datasource is resolved
per-environment instead of hardcoded.
backend/grafana/provisioning/dashboards/coordinator-execution.json (1)

25-25: Hardcoded datasource name reduces portability.

Every panel pins "datasource": "Victoria Metrics" by name. If the provisioned datasource is renamed or the dashboard is imported into an environment with a different name, all panels will break. Consider using a dashboard-level templating variable:

"templating": {
  "list": [
    {
      "name": "DS",
      "type": "datasource",
      "query": "prometheus",
      "current": { "text": "Victoria Metrics", "value": "Victoria Metrics" }
    }
  ]
}

Then reference "datasource": "${DS}" in each panel. This is a common Grafana best practice for provisioned dashboards.

Also applies to: 49-49, 73-73, 96-96, 132-132, 156-156, 198-198, 227-227

🤖 Prompt for AI Agents
Before applying any fix, first verify the finding against the current code and
decide whether a code change is actually needed. If the finding is not valid or
no change is required, do not modify code for that item and briefly explain why
it was skipped.
In `@backend/grafana/provisioning/dashboards/coordinator-execution.json` at line
25, The dashboard hardcodes "datasource": "Victoria Metrics" in multiple panels
(see occurrences at the panel datasource properties) which breaks portability;
add a dashboard-level templating variable named DS of type "datasource" (query
"prometheus" and default current value "Victoria Metrics") and replace each
panel's hardcoded datasource value with the template reference "${DS}" so panels
use the provisioned datasource; update the templating block under the top-level
"templating" -> "list" and change every panel datasource field (e.g., in the
panel objects that currently have "datasource": "Victoria Metrics") to
"datasource": "${DS}" to make the dashboard portable.
backend/grafana/provisioning/dashboards/http-middleware.json (1)

25-25: Consider using a datasource variable instead of hardcoding "Victoria Metrics".

Every panel hardcodes "datasource": "Victoria Metrics". If the datasource name differs across environments (dev, staging, prod), all panels break. The standard Grafana practice is to define a dashboard-level template variable (e.g., ${DS_VICTORIA_METRICS}) and reference it in each panel. This is optional but improves portability.

Also applies to: 49-49, 138-138

🤖 Prompt for AI Agents
Before applying any fix, first verify the finding against the current code and
decide whether a code change is actually needed. If the finding is not valid or
no change is required, do not modify code for that item and briefly explain why
it was skipped.
In `@backend/grafana/provisioning/dashboards/http-middleware.json` at line 25,
Multiple panels currently hardcode "datasource": "Victoria Metrics", which
breaks dashboards across environments; add a dashboard-level template variable
(e.g., name it DS_VICTORIA_METRICS) in the dashboard's "templating.list" and
replace each panel's "datasource": "Victoria Metrics" with "datasource":
"${DS_VICTORIA_METRICS}" (update occurrences around the existing "datasource"
keys such as the ones shown) so the datasource is resolved per-environment
instead of hardcoded.
backend/grafana/provisioning/dashboards/event-stream-monitoring.json (2)

287-287: Inconsistent rate window: [1m] here vs [5m] everywhere else.

Panel 54 ("Event Publishing Rate") uses rate(events_published_total[1m]) while the corresponding breakdown panel 66 ("Event Publishing Rate by Type") and all other rate() calls in this dashboard use [5m]. The [1m] window will produce noisier data and may show no data if the scrape interval is ≥ 1 minute. Consider aligning to [5m] for consistency, or at least ensure the scrape interval supports a 1-minute rate window.

🔧 Suggested fix
-          "expr": "sum(rate(events_published_total[1m]))",
+          "expr": "sum(rate(events_published_total[5m]))",
🤖 Prompt for AI Agents
Before applying any fix, first verify the finding against the current code and
decide whether a code change is actually needed. If the finding is not valid or
no change is required, do not modify code for that item and briefly explain why
it was skipped.
In `@backend/grafana/provisioning/dashboards/event-stream-monitoring.json` at line
287, Panel 54 ("Event Publishing Rate") uses a 1-minute rate window in the
expression sum(rate(events_published_total[1m])); update this to use a 5-minute
window (sum(rate(events_published_total[5m]))) to match the other panels (e.g.,
panel 66 "Event Publishing Rate by Type") and avoid noisy or missing data when
scrape intervals are ≥1m; ensure any other rate() calls in the dashboard remain
consistent with the chosen window.

104-108: Gaps in the "Event Stream Overview" stat row.

The three stat/timeseries panels in the overview row occupy only 14 of 24 grid units (x: 0 w: 4, x: 4 w: 4, x: 12 w: 6), leaving a 4-unit gap at x: 8–11 and a 6-unit gap at x: 18–23. This suggests panels were removed (the id sequence skips 53 and 55) without redistributing the remaining panels. Consider widening the existing panels or adding new stat panels (e.g., total messages sent, event store failure rate) to fill the row.

Also applies to: 261-266

🤖 Prompt for AI Agents
Before applying any fix, first verify the finding against the current code and
decide whether a code change is actually needed. If the finding is not valid or
no change is required, do not modify code for that item and briefly explain why
it was skipped.
In `@backend/grafana/provisioning/dashboards/event-stream-monitoring.json` around
lines 104 - 108, The "Event Stream Overview" stat row has unused grid columns
because panels were removed and remaining panels (e.g., panel id 52/54/56 or the
panels with gridPos entries shown) are left with x/w settings that only cover 14
of 24 units; update the panels in the dashboard JSON to eliminate gaps by either
increasing the "w" of the existing stat/timeseries panels and adjusting their
"x" values to pack across x:0–23, or add new stat panels (suggest metrics like
total_messages_sent or event_store_failure_rate) to occupy the missing grid
columns; locate the affected panels by their panel ids (the ones skipping 53 and
55 referenced in the comment) and modify their "gridPos" objects so the row
fills the full 24-unit width consistently (apply the same fix to the other
occurrence noted around lines 261–266).
backend/app/services/auth_service.py (1)

177-229: Registration emits email in the domain event — verify this is intentional.

Line 223 includes email=created_user.email in UserRegisteredEvent. While domain events are different from metrics (PII in events is often acceptable), ensure downstream consumers handle email appropriately per GDPR/privacy requirements (e.g., not logging it, encrypting at rest in event stores).

🤖 Prompt for AI Agents
Before applying any fix, first verify the finding against the current code and
decide whether a code change is actually needed. If the finding is not valid or
no change is required, do not modify code for that item and briefly explain why
it was skipped.
In `@backend/app/services/auth_service.py` around lines 177 - 229, The
UserRegisteredEvent emitted from the register method currently includes
created_user.email which may accidentally expose PII; review the register
function and UserRegisteredEvent usage and either remove the email field from
the event payload or replace it with a hashed/consented identifier (e.g., omit
email when constructing UserRegisteredEvent or supply an email_hash or consent
flag) so downstream consumers do not receive raw email addresses; update the
producer call site where UserRegisteredEvent is constructed and any consumers
that expect the email field to handle its absence or the new hashed value.
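If the hashed-identifier route is chosen, a minimal sketch of the idea is below; the email_hash field, the EVENT_HASH_KEY setting, and the constructor call shown in the comment are assumptions about the actual UserRegisteredEvent schema, not the repo's code:

```python
import hashlib
import hmac

def email_fingerprint(email: str, secret: bytes) -> str:
    # Keyed hash: stable enough for correlation downstream, but not
    # reversible or brute-forceable from a list of known addresses
    # without the key.
    normalized = email.strip().lower().encode("utf-8")
    return hmac.new(secret, normalized, hashlib.sha256).hexdigest()

# Hypothetical call site in AuthService.register():
# event = UserRegisteredEvent(
#     user_id=created_user.id,
#     email_hash=email_fingerprint(created_user.email, settings.EVENT_HASH_KEY),
# )
```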

@cubic-dev-ai cubic-dev-ai bot left a comment

1 issue found across 10 files (changes from recent commits).

Prompt for AI agents (all issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="backend/grafana/provisioning/dashboards/coordinator-execution.json">

<violation number="1" location="backend/grafana/provisioning/dashboards/coordinator-execution.json:201">
P2: `script_memory_usage_MiB_bucket` is measured in MiB, but the panel now uses the decimal megabytes unit (`decmbytes`), which will display 1 MiB as 1.048 MB. Use the mebibytes unit (`mbytes`) or convert the metric to MB so the unit matches the data.</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
backend/app/core/metrics/security.py (1)

145-155: ⚠️ Potential issue | 🟠 Major

ip_address and target_user as metric attributes create high-cardinality series and may leak PII.

ip_address is unbounded and target_user may contain usernames. Both become Prometheus labels, causing cardinality explosion and potential PII exposure in monitoring systems — the same class of issue previously flagged for user_id on other methods in this file.

Consider replacing ip_address with a coarser bucket (e.g., subnet prefix or just "external"/"internal") and dropping target_user or replacing it with a boolean like "targeted": str(target_user is not None).

Proposed fix
     def record_brute_force_attempt(
-        self, ip_address: str, target_user: str | None = None, action_taken: str = "logged"
+        self, action_taken: str = "logged", targeted: bool = False
     ) -> None:
         self.brute_force_attempts.add(
             1,
             attributes={
-                "ip_address": ip_address,
-                "target_user": target_user or "multiple",
                 "action_taken": action_taken,
+                "targeted": str(targeted),
             },
         )
🤖 Prompt for AI Agents
Before applying any fix, first verify the finding against the current code and
decide whether a code change is actually needed. If the finding is not valid or
no change is required, do not modify code for that item and briefly explain why
it was skipped.
In `@backend/app/core/metrics/security.py` around lines 145 - 155, The
record_brute_force_attempt metric currently uses high-cardinality PII labels
(ip_address and target_user); modify it to avoid leaking identifiers by
replacing ip_address with a coarse bucket (e.g., subnet prefix or a simple
"external"/"internal" label computed from ip_address) and remove target_user or
replace it with a low-cardinality flag such as "targeted": str(target_user is
not None); update the call to self.brute_force_attempts.add in
record_brute_force_attempt to send only the coarse ip bucket (or
external/internal) and the "targeted" boolean plus action_taken, ensuring no raw
IPs or usernames are emitted as metric attributes.
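A minimal sketch of the coarse-bucketing idea using Python's standard ipaddress module; the metric wiring in the comment mirrors the proposed fix above and is an assumption about the surrounding code:

```python
import ipaddress

def ip_bucket(ip_address: str) -> str:
    """Collapse a raw IP into a low-cardinality label."""
    try:
        ip = ipaddress.ip_address(ip_address)
    except ValueError:
        return "invalid"
    return "internal" if ip.is_private or ip.is_loopback else "external"

# Hypothetical usage inside record_brute_force_attempt():
# self.brute_force_attempts.add(
#     1,
#     attributes={
#         "source": ip_bucket(ip_address),
#         "targeted": str(target_user is not None),
#         "action_taken": action_taken,
#     },
# )
```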
🤖 Fix all issues with AI agents
Before applying any fix, first verify the finding against the current code and
decide whether a code change is actually needed. If the finding is not valid or
no change is required, do not modify code for that item and briefly explain why
it was skipped.

In `@backend/app/core/metrics/security.py`:
- Around line 145-155: The record_brute_force_attempt metric currently uses
high-cardinality PII labels (ip_address and target_user); modify it to avoid
leaking identifiers by replacing ip_address with a coarse bucket (e.g., subnet
prefix or a simple "external"/"internal" label computed from ip_address) and
remove target_user or replace it with a low-cardinality flag such as "targeted":
str(target_user is not None); update the call to self.brute_force_attempts.add
in record_brute_force_attempt to send only the coarse ip bucket (or
external/internal) and the "targeted" boolean plus action_taken, ensuring no raw
IPs or usernames are emitted as metric attributes.

In `@backend/app/services/auth_service.py`:
- Around line 34-39: The hardcoded _dummy_hash uses cost factor 12 which can
differ from Settings.BCRYPT_ROUNDS and break the timing mitigation; update the
initialization so the dummy hash is generated at startup using the active cost
factor (Settings.BCRYPT_ROUNDS) instead of a static string, e.g., replace the
static _dummy_hash with logic in the AuthService (or module init) that calls
bcrypt.hash with a dummy password and the configured rounds, and ensure
verify_password continues to use this generated _dummy_hash when a username is
not found so timing matches real user hashes (a sketch follows after this list).
- Around line 127-128: The AccountLockedError is raised before the metrics
context manager used by track_authentication runs, so lockout rejections are not
recorded in authentication_attempts_total/authentication_failures_total; to
include them, move the call to self._lockout.check_locked(username) (and the
subsequent raise AccountLockedError) inside the track_authentication context
manager used by the authenticate flow (i.e., within the block that
track_authentication wraps) so the same metrics are incremented for lockout
rejections, or leave it as-is if you intentionally want to exclude lockouts from
metrics — update the placement in authenticate/track_authentication accordingly.
- Around line 120-176: The current with
self.security_metrics.track_authentication("login") wraps the entire login flow
so infrastructure errors (user_repo.get_user,
_runtime_settings.get_effective_settings) are recorded as authentication
failures; narrow the scope by removing the outer context and instead open the
context only around the credential-verification block where passwords are
checked and failures are recorded (i.e., around security_service.verify_password
calls and the calls to _fail_login / _lockout.clear_attempts) inside the login
method so DB or settings errors are not counted as auth failures.

In `@backend/grafana/provisioning/dashboards/http-middleware.json`:
- Around line 320-342: The CPU stat panel (id 11, title "CPU") is using the
wrong unit; update the fieldConfig.defaults.unit for the panel that targets the
metric "system_cpu_percent" from "short" to "percent" so Grafana renders a %
suffix for 0–100 values (if your metric were 0–1 use "percentunit" instead);
locate the panel by its id/title or the target expr "system_cpu_percent" and
change the unit string accordingly.

In `@backend/grafana/provisioning/dashboards/security-auth.json`:
- Line 25: The dashboard currently hardcodes the datasource value "Victoria
Metrics" in each panel's "datasource" field, which breaks portability; update
every panel in security-auth.json that sets "datasource": "Victoria Metrics" to
instead reference a template variable (e.g. "${DS_VICTORIAMETRICS}") or use the
UID form (an object with "uid": "<datasource-uid>") so the panels bind to the
provisioned datasource regardless of display name; search for every "datasource"
key in this file and replace the hardcoded string with the chosen variable/UID
form and ensure the dashboard JSON includes a matching dashboard variable
(DS_VICTORIAMETRICS) if you use the template approach.
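For the _dummy_hash item above (auth_service.py, lines 34-39), a minimal sketch using the bcrypt package; the Settings class and helper names are stand-ins, not the repo's actual code:

```python
import bcrypt

class Settings:
    BCRYPT_ROUNDS = 12  # hypothetical stand-in for the configured cost factor

def make_dummy_hash(rounds: int) -> str:
    # Generated once at startup with the active cost factor, so verifying a
    # password for a non-existent user takes about as long as a real check.
    return bcrypt.hashpw(b"dummy-password", bcrypt.gensalt(rounds=rounds)).decode()

DUMMY_HASH = make_dummy_hash(Settings.BCRYPT_ROUNDS)

def verify_password_with_mitigation(password: str, stored_hash: str | None) -> bool:
    candidate = stored_hash or DUMMY_HASH
    matches = bcrypt.checkpw(password.encode("utf-8"), candidate.encode("utf-8"))
    return matches and stored_hash is not None
```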
🧹 Nitpick comments (3)
backend/app/services/auth_service.py (2)

127-128: Locked-account check happens before track_authentication, so AccountLockedError bypasses auth metrics.

This means locked-account rejections won't appear in the authentication_attempts_total or authentication_failures_total metrics. If you want visibility into lockout rates via the same dashboard, the lockout check should be inside the context manager. If the intent is to only count "real" authentication attempts, the current placement is fine — just calling it out as a design choice.

🤖 Prompt for AI Agents
Before applying any fix, first verify the finding against the current code and
decide whether a code change is actually needed. If the finding is not valid or
no change is required, do not modify code for that item and briefly explain why
it was skipped.
In `@backend/app/services/auth_service.py` around lines 127 - 128, The
AccountLockedError is raised before the metrics context manager used by
track_authentication runs, so lockout rejections are not recorded in
authentication_attempts_total/authentication_failures_total; to include them,
move the call to self._lockout.check_locked(username) (and the subsequent raise
AccountLockedError) inside the track_authentication context manager used by the
authenticate flow (i.e., within the block that track_authentication wraps) so
the same metrics are incremented for lockout rejections, or leave it as-is if
you intentionally want to exclude lockouts from metrics — update the placement
in authenticate/track_authentication accordingly.

120-176: track_authentication now wraps the login flow — past review concern resolved.

The context manager in the finally block records both success and failure outcomes with duration. One subtlety: infrastructure errors (e.g., DB failures from get_user on line 131 or get_effective_settings on line 142) will be counted as authentication failures in the histogram. If this is undesirable, consider narrowing the context manager to only the credential-verification portion (lines 131–138).

🤖 Prompt for AI Agents
Before applying any fix, first verify the finding against the current code and
decide whether a code change is actually needed. If the finding is not valid or
no change is required, do not modify code for that item and briefly explain why
it was skipped.
In `@backend/app/services/auth_service.py` around lines 120 - 176, The current
with self.security_metrics.track_authentication("login") wraps the entire login
flow so infrastructure errors (user_repo.get_user,
_runtime_settings.get_effective_settings) are recorded as authentication
failures; narrow the scope by removing the outer context and instead open the
context only around the credential-verification block where passwords are
checked and failures are recorded (i.e., around security_service.verify_password
calls and the calls to _fail_login / _lockout.clear_attempts) inside the login
method so DB or settings errors are not counted as auth failures.
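A sketch of the narrowed scope; the repository, settings, lockout, and token-issuing calls are assumptions modelled on the names used in this thread, not the actual AuthService API:

```python
async def login(self, username: str, password: str) -> "TokenPair":
    # Lookups happen outside the metrics context, so a DB or settings error
    # surfaces as an exception rather than an authentication failure.
    user = await self.user_repo.get_user(username)
    settings = await self._runtime_settings.get_effective_settings()

    # Only credential verification is measured and counted.
    with self.security_metrics.track_authentication("login"):
        stored_hash = user.password_hash if user else None
        if user is None or not self.security_service.verify_password(password, stored_hash):
            await self._fail_login(username, settings)
            raise InvalidCredentialsError("Invalid credentials")
        self._lockout.clear_attempts(username)

    return await self._issue_tokens(user)
```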
backend/grafana/provisioning/dashboards/security-auth.json (1)

25-25: Hardcoded datasource name "Victoria Metrics" across all panels.

If the provisioned datasource is renamed or the deployment uses a different name, every panel silently fails. Consider using a template variable (e.g., "datasource": "${DS_VICTORIAMETRICS}") or at minimum the "uid" form instead of a display-name string, so the dashboard is portable across environments.

Also applies to: 54-54, 78-78, 101-101, 124-124, 147-147, 183-183, 217-217, 241-241, 283-283, 312-312

🤖 Prompt for AI Agents
Before applying any fix, first verify the finding against the current code and
decide whether a code change is actually needed. If the finding is not valid or
no change is required, do not modify code for that item and briefly explain why
it was skipped.
In `@backend/grafana/provisioning/dashboards/security-auth.json` at line 25, The
dashboard currently hardcodes the datasource value "Victoria Metrics" in each
panel's "datasource" field, which breaks portability; update every panel in
security-auth.json that sets "datasource": "Victoria Metrics" to instead
reference a template variable (e.g. "${DS_VICTORIAMETRICS}") or use the UID form
(an object with "uid": "<datasource-uid>") so the panels bind to the provisioned
datasource regardless of display name; search for every "datasource" key in this
file and replace the hardcoded string with the chosen variable/UID form and
ensure the dashboard JSON includes a matching dashboard variable
(DS_VICTORIAMETRICS) if you use the template approach.
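A sketch of the template-variable form; the variable name DS_VICTORIAMETRICS and the panel title are placeholders, and the datasource type is "prometheus" on the assumption that VictoriaMetrics is queried through Grafana's Prometheus-compatible datasource:

```json
{
  "templating": {
    "list": [
      {
        "name": "DS_VICTORIAMETRICS",
        "label": "Datasource",
        "type": "datasource",
        "query": "prometheus"
      }
    ]
  },
  "panels": [
    {
      "title": "Authentication Attempts",
      "datasource": "${DS_VICTORIAMETRICS}"
    }
  ]
}
```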

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
Verify each finding against the current code and only fix it if needed.


In `@backend/grafana/provisioning/dashboards/http-middleware.json`:
- Around line 213-241: The "Event Store & Idempotency" panel mixes two different
units under a single field default ("unit": "short"): the "Store Ops" series
(rate of event_store_operations_total, ops/s) and the "Idempotency p95" series
(histogram_quantile over idempotency_processing_duration_seconds_bucket,
seconds). Fix by either splitting into two panels (one for Store Ops, one for
Idempotency p95) or add field overrides on the panel to set the unit for the
"Idempotency p95" series to "s" and the "Store Ops" series to an appropriate ops
unit; locate the panel by title "Event Store & Idempotency" and targets with
legendFormat "Store Ops" and "Idempotency p95" to implement the change.
- Around line 24-47: The Request Rate panel currently sets the axis unit to
"short"; change the fieldConfig.defaults.unit for the panel with title "Request
Rate" (id 1) to "reqps" so the timeseries using expr
"rate(http_requests_total[5m])" displays requests/second units; update the
"unit" value in that panel's fieldConfig.defaults object accordingly.
🧹 Nitpick comments (1)
backend/grafana/provisioning/dashboards/http-middleware.json (1)

24-47: Request Rate panel could use "unit": "reqps" instead of "short".

Grafana's reqps (requests/second) unit would add proper axis labeling for the rate(http_requests_total[5m]) expression. This is a minor readability improvement.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/grafana/provisioning/dashboards/http-middleware.json` around lines 24
- 47, The Request Rate panel currently sets the axis unit to "short"; change the
fieldConfig.defaults.unit for the panel with title "Request Rate" (id 1) to
"reqps" so the timeseries using expr "rate(http_requests_total[5m])" displays
requests/second units; update the "unit" value in that panel's
fieldConfig.defaults object accordingly.
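A sketch of the unit change on the Request Rate panel; only the relevant fragment of the panel JSON is shown:

```json
{
  "id": 1,
  "title": "Request Rate",
  "fieldConfig": { "defaults": { "unit": "reqps" } },
  "targets": [ { "expr": "rate(http_requests_total[5m])" } ]
}
```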

@sonarqubecloud
@coderabbitai coderabbitai bot left a comment

🤖 Fix all issues with AI agents
Verify each finding against the current code and only fix it if needed.


In `@backend/grafana/provisioning/dashboards/http-middleware.json`:
- Around line 1-405: Panel "Request Rate" (id 1) is using the generic unit
"short" — change the fieldConfig.defaults.unit for the timeseries panel with id
1 (title "Request Rate") from "short" to a more appropriate Grafana unit such as
"reqps" or "ops" so the rate(http_requests_total[5m]) series displays with a
requests-per-second suffix.
🧹 Nitpick comments (1)
backend/grafana/provisioning/dashboards/http-middleware.json (1)

1-405: Request Rate panel (id 1) uses "unit": "short" — consider "reqps" or "ops".

The rate(http_requests_total[5m]) expression yields requests-per-second. Grafana's "reqps" or "ops" unit would add a meaningful suffix. This is minor and cosmetic.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/grafana/provisioning/dashboards/http-middleware.json` around lines 1
- 405, Panel "Request Rate" (id 1) is using the generic unit "short" — change
the fieldConfig.defaults.unit for the timeseries panel with id 1 (title "Request
Rate") from "short" to a more appropriate Grafana unit such as "reqps" or "ops"
so the rate(http_requests_total[5m]) series displays with a requests-per-second
suffix.

@HardMax71 HardMax71 merged commit 731f312 into main Feb 16, 2026
19 checks passed
@HardMax71 HardMax71 deleted the fix/keys branch February 16, 2026 22:43