diff --git a/.claude/skills/local-monitoring/SKILL.md b/.claude/skills/local-monitoring/SKILL.md
new file mode 100644
index 00000000..5557f867
--- /dev/null
+++ b/.claude/skills/local-monitoring/SKILL.md
@@ -0,0 +1,121 @@
+---
+name: local-monitoring
+description: Query the local Grafana/Prometheus/Loki stack shipped with this CDVN repo. Use when investigating cluster health, charon/beacon/EL errors, peer connectivity, validator performance, or log patterns against the locally-running monitoring stack (not Obol's hosted Grafana).
+user-invokable: true
+---
+
+# Local Monitoring
+
+Query the local monitoring stack (Grafana, Prometheus, Loki) that ships with this repo to investigate cluster health and diagnose issues.
+
+For Obol's hosted Grafana (across all clusters), use the `obol-monitoring` skill instead. This skill is for the local stack only.
+
+## Prerequisites
+
+Before running, verify:
+1. The monitoring stack is up: `docker compose ps prometheus grafana loki` shows them running
+2. Grafana is reachable on the host at `http://localhost:${MONITORING_PORT_GRAFANA:-3000}` (default 3000)
+3. The user knows their Grafana admin credentials, or has unauthenticated access enabled (default in this repo's `grafana.ini`)
+
+If the stack isn't up, point the user to `docker compose up -d prometheus grafana loki` first.
+
+## Architecture notes
+
+- **Prometheus** (`:9090`) and **Loki** (`:3100`) are on the docker network only — not exposed to the host by default. Query them through one of:
+  - **Grafana datasource proxy** (preferred): `http://localhost:3000/api/datasources/proxy/uid/<uid>/<path>` — uses Grafana's own connection
+  - **`docker compose exec`** fallback: `docker compose exec prometheus wget -qO- 'http://localhost:9090/api/v1/query?query=...'`
+- Datasource UIDs (from `grafana/datasource.yml`): `prometheus`, `loki`, `tempo`
+- Charon metrics are labeled with `cluster_name` and `cluster_peer` — get these from `.env` (`CLUSTER_NAME`, `CLUSTER_PEER`) before querying
+
+## Gather Arguments
+
+Use AskUserQuestion to clarify what the user wants to investigate. Common shapes:
+
+1. **What to investigate** — pick one:
+   - Cluster health snapshot (readyz, peers, active validators)
+   - Charon error/log search (last N minutes)
+   - Beacon node performance (latency, sync status)
+   - Peer connectivity (ping latency, connection types)
+   - Custom PromQL / LogQL query
+2. **Time range** — default last 15m; ask if investigating a specific incident
+3. **Cluster scope** — usually their own (`$CLUSTER_NAME` from `.env`); ask only if multiple clusters share this Prometheus
+
+If the request is already specific (e.g. "show me charon errors from the last hour"), skip AskUserQuestion and proceed.
+
+## Execution
+
+### Instant query (Prometheus)
+
+```bash
+GRAFANA_URL="http://localhost:${MONITORING_PORT_GRAFANA:-3000}"
+curl -sG "$GRAFANA_URL/api/datasources/proxy/uid/prometheus/api/v1/query" \
+  --data-urlencode 'query=<promql>'
+```
+
+### Range query (Prometheus)
+
+```bash
+curl -sG "$GRAFANA_URL/api/datasources/proxy/uid/prometheus/api/v1/query_range" \
+  --data-urlencode 'query=<promql>' \
+  --data-urlencode "start=$(( $(date -u +%s) - 900 ))" \
+  --data-urlencode "end=$(date -u +%s)" \
+  --data-urlencode 'step=30s'
+```
+
+### Log search (Loki)
+
+```bash
+curl -sG "$GRAFANA_URL/api/datasources/proxy/uid/loki/loki/api/v1/query_range" \
+  --data-urlencode 'query={service_name="charon"} |= "error"' \
+  --data-urlencode "start=$(( $(date -u +%s) - 900 ))000000000" \
+  --data-urlencode "end=$(date -u +%s)000000000" \
+  --data-urlencode 'limit=200'
+```
+
+### Fallback via `docker compose exec`
+
+If the Grafana proxy is unavailable:
+```bash
+docker compose exec prometheus wget -qO- "http://localhost:9090/api/v1/query?query=<promql>"
+docker compose exec loki wget -qO- "http://localhost:3100/loki/api/v1/query_range?query=<...>"
+```
+
+For a query cookbook (cluster health, charon errors, peer ping, BN latency, validator effectiveness), see [queries.md](queries.md).
+
+## Output handling
+
+Parse the JSON response and present results clearly:
+
+- **Prometheus instant query** — show metric labels + value, flag anomalies (zeros where non-zero expected, threshold breaches)
+- **Prometheus range query** — summarise min/max/avg over the window; call out spikes
+- **Loki logs** — group by `cluster_peer` if present; surface error/warn lines verbatim with timestamps; suppress repetitive noise
+- Always print the **exact query that was run** so the user can re-run it in Grafana
+
+If the response contains `"status":"error"`, surface the `error` and `errorType` fields and stop — do not invent results.
+
+## Common diagnoses
+
+When showing results, watch for these patterns and call them out:
+
+- **`app_monitoring_readyz != 1`** — node is not ready; explain what readyz state means (1=ready, other=various failure modes documented in charon docs)
+- **High `p2p_ping_latency_secs` p90** — peer network is slow; check `p2p_peer_connection_types` for relayed vs direct
+- **`p2p_ping_success == 0`** for a peer — that operator is unreachable
+- **Charon log `error` spikes** — group by `topic` / `component` to identify which subsystem is failing
+- **`core_scheduler_validators_active` lower than `cluster_validators`** — some validators not active (not yet activated, or exited)
+- **EL/CL container missing from metrics** — check `docker compose ps` and respective container logs
+
+## Pointers to dashboards
+
+Direct the user to the pre-provisioned dashboards in `grafana/dashboards/` rather than reinventing them:
+- `charon_overview_dashboard.json` — readyz, peers, validator activity (start here)
+- `cluster_dashboard.json` — full cluster view across operators
+- `node_overview_dashboard.json` — host/EL/CL/VC resource usage
+- `logs_dashboard.json` — Loki log explorer with charon filters
+
+Open in browser: `http://localhost:${MONITORING_PORT_GRAFANA:-3000}/dashboards`.
+
+## Dependencies
+
+- `curl`, `jq` (for parsing responses cleanly)
+- Running `prometheus`, `grafana`, `loki` containers from this compose stack
+- `CLUSTER_NAME` and `CLUSTER_PEER` set in `.env` (used as Prometheus label values)
diff --git a/.claude/skills/local-monitoring/queries.md b/.claude/skills/local-monitoring/queries.md
new file mode 100644
index 00000000..372ff529
--- /dev/null
+++ b/.claude/skills/local-monitoring/queries.md
@@ -0,0 +1,121 @@
+# Query Cookbook
+
+Curated PromQL and LogQL examples for the local monitoring stack. All examples assume `cluster_name="$CLUSTER_NAME"` from `.env`. Substitute the real value before running.
+
+## PromQL — Cluster health
+
+```promql
+# Is this node ready? 1 = ready, anything else = degraded.
+app_monitoring_readyz{cluster_name="$CLUSTER_NAME"}
+
+# Active validators on the local node
+core_scheduler_validators_active{cluster_name="$CLUSTER_NAME"}
+
+# Total validators in the cluster (from cluster-lock)
+cluster_validators{cluster_name="$CLUSTER_NAME"}
+
+# Operators and consensus threshold
+cluster_operators{cluster_name="$CLUSTER_NAME"}
+cluster_threshold{cluster_name="$CLUSTER_NAME"}
+```
+
+## PromQL — Peer connectivity
+
+```promql
+# Per-peer ping success, current value (0 = peer unreachable)
+sum by (cluster_peer, peer) (p2p_ping_success{cluster_name="$CLUSTER_NAME"})
+
+# p90 ping latency by peer over 5m
+histogram_quantile(
+  0.90,
+  sum by (le, peer) (
+    rate(p2p_ping_latency_secs_bucket{cluster_name="$CLUSTER_NAME"}[5m])
+  )
+)
+
+# Connection type breakdown (direct vs relayed)
+max by (peer, type) (p2p_peer_connection_types{cluster_name="$CLUSTER_NAME"})
+```
+
+## PromQL — Duty performance
+
+```promql
+# Successful duties per type, last 5m
+sum by (duty) (
+  increase(core_tracker_success_duties_total{cluster_name="$CLUSTER_NAME"}[5m])
+)
+
+# Failed duties per type, last 5m (should be near zero)
+sum by (duty, reason) (
+  increase(core_tracker_failed_duties_total{cluster_name="$CLUSTER_NAME"}[5m])
+)
+```
+
+## PromQL — Beacon node health
+
+```promql
+# BN call latency p95 by endpoint
+histogram_quantile(
+  0.95,
+  sum by (le, endpoint) (
+    rate(app_beacon_node_latency_secs_bucket{cluster_name="$CLUSTER_NAME"}[5m])
+  )
+)
+
+# BN errors per endpoint
+sum by (endpoint) (
+  rate(app_beacon_node_errors_total{cluster_name="$CLUSTER_NAME"}[5m])
+)
+
+# Detected BN client per peer (useful for diversity audit)
+max by (cluster_peer, beacon_id) (
+  app_beacon_node_version{cluster_name="$CLUSTER_NAME"}
+)
+```
+
+## PromQL — Health checks / alerts
+
+```promql
+# Currently failing health checks, with severity and description
+max by (name, severity, description) (
+  app_health_checks_failed_total{cluster_name="$CLUSTER_NAME"} > 0
+)
+```
+
+## LogQL — Charon log search
+
+```logql
+# All charon errors in the window
+{service_name="charon"} |= "error"
+
+# Errors excluding noisy known-benign topics
+{service_name="charon"} |= "error" != "context canceled"
+
+# Filter by component (e.g. p2p, core, validatorapi)
+{service_name="charon"} | json | component="p2p" | level="error"
+
+# Logs around a specific slot
+{service_name="charon"} |= "slot=12345678"
+
+# Per-second error rate averaged over 1m (for a Grafana stat panel)
+sum(rate({service_name="charon"} |= "error" [1m]))
+```
+
+## LogQL — Beacon / execution / VC logs
+
+```logql
+# Lighthouse beacon node warnings
+{container_name=~".*lighthouse.*"} |~ "(?i)warn|error"
+
+# Lodestar VC missed duties
+{container_name=~".*vc-lodestar.*"} |= "missed"
+
+# Generic: errors or panics from any container (cap results with the request's `limit` parameter)
+{container_name=~".+"} |~ "(?i)error|panic" | line_format "{{.container_name}} {{ __line__ }}"
+```
+
+## Tips
+
+- All Charon metrics carry `cluster_peer` — add `by (cluster_peer)` to fan out per-operator.
+- Loki labels in this stack come from Alloy's promtail-equivalent — common labels are `service_name`, `container_name`, `level`.
+- For dashboard queries, prefer copying directly from `grafana/dashboards/*.json` (search for `"expr":` lines) so units and label selectors stay consistent.
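
The cookbook asks for the real cluster name to be substituted for `$CLUSTER_NAME` before a query is run. A minimal shell sketch of that step follows, assuming `.env` holds a plain `CLUSTER_NAME=value` line (no `export`, no inline comments). The helper names `env_value` and `render_query` are hypothetical and are not part of this repo.

```bash
# Sketch only: env_value and render_query are illustrative helpers, not repo code.

# Print the value of KEY from a dotenv-style file (default: .env).
# Assumes simple KEY=value lines; strips one pair of surrounding quotes.
env_value() {
  sed -n "s/^$1=//p" "${2:-.env}" | tail -n 1 |
    sed -e 's/^"\(.*\)"$/\1/' -e "s/^'\(.*\)'\$/\1/"
}

# Replace the literal text $CLUSTER_NAME in a cookbook query with a real value.
render_query() {
  # Escape characters that are special in a sed replacement (\ & and the | delimiter).
  escaped=$(printf '%s' "$2" | sed 's/[\\&|]/\\&/g')
  printf '%s\n' "$1" | sed "s|\\\$CLUSTER_NAME|$escaped|g"
}
```

A possible use is `render_query 'app_monitoring_readyz{cluster_name="$CLUSTER_NAME"}' "$(env_value CLUSTER_NAME)"`, with the result passed to `curl --data-urlencode "query=..."` as in the SKILL.md examples. Substituting text rather than sourcing `.env` avoids executing the file as shell code.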