Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
121 changes: 121 additions & 0 deletions .claude/skills/local-monitoring/SKILL.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,121 @@
---
name: local-monitoring
description: Query the local Grafana/Prometheus/Loki stack shipped with this CDVN repo. Use when investigating cluster health, charon/beacon/EL errors, peer connectivity, validator performance, or log patterns against the locally-running monitoring stack (not Obol's hosted Grafana).
user-invokable: true
---

# Local Monitoring

Query the local monitoring stack (Grafana, Prometheus, Loki) that ships with this repo to investigate cluster health and diagnose issues.

For Obol's hosted Grafana (across all clusters), use the `obol-monitoring` skill instead. This skill is for the local stack only.

## Prerequisites

Before running, verify:
1. The monitoring stack is up: `docker compose ps prometheus grafana loki` shows them running
2. Grafana is reachable on the host at `http://localhost:${MONITORING_PORT_GRAFANA:-3000}` (default 3000)
3. The user knows their Grafana admin credentials, or has unauthenticated access enabled (default in this repo's `grafana.ini`)

If the stack isn't up, point the user to `docker compose up -d prometheus grafana loki` first.

## Architecture notes

- **Prometheus** (`:9090`) and **Loki** (`:3100`) are on the docker network only — not exposed to the host by default. Query them through one of:
- **Grafana datasource proxy** (preferred): `http://localhost:3000/api/datasources/proxy/uid/<prometheus|loki>/<path>` — uses Grafana's own connection
- **`docker compose exec`** fallback: `docker compose exec prometheus wget -qO- 'http://localhost:9090/api/v1/query?query=...'`
- Datasource UIDs (from `grafana/datasource.yml`): `prometheus`, `loki`, `tempo`
- Charon metrics are labeled with `cluster_name` and `cluster_peer` — get these from `.env` (`CLUSTER_NAME`, `CLUSTER_PEER`) before querying

## Gather Arguments

Use AskUserQuestion to clarify what the user wants to investigate. Common shapes:

1. **What to investigate** — pick one:
- Cluster health snapshot (readyz, peers, active validators)
- Charon error/log search (last N minutes)
- Beacon node performance (latency, sync status)
- Peer connectivity (ping latency, connection types)
- Custom PromQL / LogQL query
2. **Time range** — default last 15m; ask if investigating a specific incident
3. **Cluster scope** — usually their own (`$CLUSTER_NAME` from `.env`); ask only if multiple clusters share this Prometheus

If the request is already specific (e.g. "show me charon errors from the last hour"), skip AskUserQuestion and proceed.

## Execution

### Instant query (Prometheus)

```bash
GRAFANA_URL="http://localhost:${MONITORING_PORT_GRAFANA:-3000}"
curl -sG "$GRAFANA_URL/api/datasources/proxy/uid/prometheus/api/v1/query" \
--data-urlencode 'query=<PROMQL>'
```

### Range query (Prometheus)

```bash
curl -sG "$GRAFANA_URL/api/datasources/proxy/uid/prometheus/api/v1/query_range" \
--data-urlencode 'query=<PROMQL>' \
--data-urlencode "start=$(date -u -v-15M +%s)" \
--data-urlencode "end=$(date -u +%s)" \
--data-urlencode 'step=30s'
```

### Log search (Loki)

```bash
curl -sG "$GRAFANA_URL/api/datasources/proxy/uid/loki/loki/api/v1/query_range" \
--data-urlencode 'query={service_name="charon"} |= "error"' \
--data-urlencode "start=$(date -u -v-15M +%s)000000000" \
--data-urlencode "end=$(date -u +%s)000000000" \
--data-urlencode 'limit=200'
```

### Fallback via `docker compose exec`

If the Grafana proxy is unavailable:
```bash
docker compose exec prometheus wget -qO- "http://localhost:9090/api/v1/query?query=<URL_ENCODED_PROMQL>"
docker compose exec loki wget -qO- "http://localhost:3100/loki/api/v1/query_range?query=<...>"
```

For a query cookbook (cluster health, charon errors, peer ping, BN latency, validator effectiveness), see [queries.md](queries.md).

## Output handling

Parse the JSON response and present results clearly:

- **Prometheus instant query** — show metric labels + value, flag anomalies (zeros where non-zero expected, threshold breaches)
- **Prometheus range query** — summarise min/max/avg over the window; call out spikes
- **Loki logs** — group by `cluster_peer` if present; surface error/warn lines verbatim with timestamps; suppress repetitive noise
- Always print the **exact query that was run** so the user can re-run it in Grafana

If the response contains `"status":"error"`, surface the `error` and `errorType` fields and stop — do not invent results.

## Common diagnoses

When showing results, watch for these patterns and call them out:

- **`app_monitoring_readyz != 1`** — node is not ready; explain what readyz state means (1=ready, other=various failure modes documented in charon docs)
- **High `p2p_ping_latency_secs` p90** — peer network is slow; check `p2p_peer_connection_types` for relayed vs direct
- **`p2p_ping_success == 0`** for a peer — that operator is unreachable
- **Charon log `error` spikes** — group by `topic` / `component` to identify which subsystem
- **`core_scheduler_validators_active` lower than `cluster_validators`** — some validators not active (not yet activated, or exited)
- **EL/CL container missing from metrics** — check `docker compose ps` and respective container logs

## Pointers to dashboards

Direct the user to the pre-provisioned dashboards in `grafana/dashboards/` rather than reinventing them:
- `charon_overview_dashboard.json` — readyz, peers, validator activity (start here)
- `cluster_dashboard.json` — full cluster view across operators
- `node_overview_dashboard.json` — host/EL/CL/VC resource usage
- `logs_dashboard.json` — Loki log explorer with charon filters

Open in browser: `http://localhost:${MONITORING_PORT_GRAFANA:-3000}/dashboards`.

## Dependencies

- `curl`, `jq` (for parsing responses cleanly)
- Running `prometheus`, `grafana`, `loki` containers from this compose stack
- `CLUSTER_NAME` and `CLUSTER_PEER` set in `.env` (used as Prometheus label values)
121 changes: 121 additions & 0 deletions .claude/skills/local-monitoring/queries.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,121 @@
# Query Cookbook

Curated PromQL and LogQL examples for the local monitoring stack. All examples assume `cluster_name="$CLUSTER_NAME"` from `.env`. Substitute the real value before running.

## PromQL — Cluster health

```promql
# Is this node ready? 1 = ready, anything else = degraded.
app_monitoring_readyz{cluster_name="$CLUSTER_NAME"}

# Active validators on the local node
core_scheduler_validators_active{cluster_name="$CLUSTER_NAME"}

# Total validators in the cluster (from cluster-lock)
cluster_validators{cluster_name="$CLUSTER_NAME"}

# Operators and consensus threshold
cluster_operators{cluster_name="$CLUSTER_NAME"}
cluster_threshold{cluster_name="$CLUSTER_NAME"}
```

## PromQL — Peer connectivity

```promql
# Per-peer ping success in the last 5m (0 = peer unreachable)
sum by (cluster_peer, peer) (p2p_ping_success{cluster_name="$CLUSTER_NAME"})

# p90 ping latency by peer over 5m
histogram_quantile(
0.90,
sum by (le, peer) (
rate(p2p_ping_latency_secs_bucket{cluster_name="$CLUSTER_NAME"}[5m])
)
)

# Connection type breakdown (direct vs relayed)
max by (peer, type) (p2p_peer_connection_types{cluster_name="$CLUSTER_NAME"})
```

## PromQL — Duty performance

```promql
# Successful duties per type, last 5m
sum by (duty) (
increase(core_tracker_success_duties_total{cluster_name="$CLUSTER_NAME"}[5m])
)

# Failed duties per type, last 5m (should be near zero)
sum by (duty, reason) (
increase(core_tracker_failed_duties_total{cluster_name="$CLUSTER_NAME"}[5m])
)
```

## PromQL — Beacon node health

```promql
# BN call latency p95 by endpoint
histogram_quantile(
0.95,
sum by (le, endpoint) (
rate(app_beacon_node_latency_secs_bucket{cluster_name="$CLUSTER_NAME"}[5m])
)
)

# BN errors per endpoint
sum by (endpoint) (
rate(app_beacon_node_errors_total{cluster_name="$CLUSTER_NAME"}[5m])
)

# Detected BN client per peer (useful for diversity audit)
max by (cluster_peer, beacon_id) (
app_beacon_node_version{cluster_name="$CLUSTER_NAME"}
)
```

## PromQL — Health checks / alerts

```promql
# Currently failing health checks, with severity and description
max by (name, severity, description) (
app_health_checks_failed_total{cluster_name="$CLUSTER_NAME"} > 0
)
```

## LogQL — Charon log search

```logql
# All charon errors in the window
{service_name="charon"} |= "error"

# Errors excluding noisy known-benign topics
{service_name="charon"} |= "error" != "context canceled"

# Filter by component (e.g. p2p, core, validatorapi)
{service_name="charon"} | json | component="p2p" | level="error"

# Logs around a specific slot
{service_name="charon"} |= "slot=12345678"

# Rate of errors per minute (for grafana stat panel)
sum(rate({service_name="charon"} |= "error" [1m]))
```

## LogQL — Beacon / execution / VC logs

```logql
# Lighthouse beacon node warnings
{container_name=~".*lighthouse.*"} |~ "(?i)warn|error"

# Lodestar VC missed duties
{container_name=~".*vc-lodestar.*"} |= "missed"

# Generic: any container, last 100 errors
{container_name=~".+"} |~ "(?i)error|panic" | line_format "{{.container_name}} {{.message}}"
```

## Tips

- All Charon metrics carry `cluster_peer` — add `by (cluster_peer)` to fan out per-operator.
- Loki labels in this stack come from Alloy's promtail-equivalent — common labels are `service_name`, `container_name`, `level`.
- For dashboard queries, prefer copying directly from `grafana/dashboards/*.json` (search for `"expr":` lines) so units and label selectors stay consistent.