Skip to content

cardano-tracer | timeseries: Prometheus-aligned HTTP API, node info endpoints, Grafana datasource#6562

Open
Russoul wants to merge 34 commits into
masterfrom
russoul/timeseries-align-http-api-with-prometheus
Open

cardano-tracer | timeseries: Prometheus-aligned HTTP API, node info endpoints, Grafana datasource#6562
Russoul wants to merge 34 commits into
masterfrom
russoul/timeseries-align-http-api-with-prometheus

Conversation

@Russoul
Copy link
Copy Markdown
Contributor

@Russoul Russoul commented May 6, 2026

Summary

Query language improvements

  • epoch parser bug fixedepoch now correctly produces Timestamp 0 (Unix epoch); previously it parsed identically to now, making epoch + Nms evaluate to ~year 2082 instead of the intended absolute timestamp
  • Metric-name awareness in the elaborator — unknown variable names now produce Undefined name: <x> errors at elaboration time instead of confusing downstream type-mismatch errors; St carries availableMetrics :: Set MetricIdentifier populated from the store
  • "Did you mean?" suggestions in Undefined-name errors — the elaborator computes Levenshtein distance (via edit-distance) to known metric names and local bindings, appending the closest match to the error message
  • Local name shadowinglet x = 1 in let x = 2 in x now correctly returns 2; previously the elaborator rejected re-use of any name
  • RangeVector/Scalar arithmetic+, -, *, / between a range vector and a scalar are now supported in both elaboration and interpretation
  • Unit and List typesUnit, Nil, and Cons constructors added to Value and the semantic Expr; metrics expression returns the metric list as a Cons-list of Text values
  • sum_over_time elab bug fixedSurface.SumOverTime was incorrectly elaborated to Semantic.AvgOverTime (copy-paste error)
  • Noncanonical arithmetic rules — four new bidirectional rules let the elaborator infer types in previously-stuck expressions:
    • Duration + Duration : ? → result is Duration (e.g. 1s + 2s now elaborates standalone)
    • ? - Duration : ? → lhs must be Timestamp (e.g. \x -> x - 1s infers x : Timestamp)
    • ? + ? : Duration → both args must be Duration (e.g. \x -> \y -> m[now; now : x + y] infers both lambda args)
    • ? - ? : Timestamp → lhs must be Timestamp, rhs must be Duration (e.g. \x -> \y -> m[x - y; x] infers arg types from the range start position)

Test suite

  • 362 tests covering the full stack: parser, elaborator, and interpreter
  • Parser suite: full coverage of surface expression parsing including operator precedence and associativity
  • Elaboration suite: ~115 cases covering literals, arithmetic (well-typed and ill-typed), comparisons, boolean operators, let/lambda, pairs/projections, to_scalar, abs/round, the metrics builtin, range vectors, instant-vector operations, metric-name resolution, and "Did you mean?" suggestions
  • Interpretation suite: scalar arithmetic, boolean logic, timestamp/duration arithmetic, instant-vector operations, range-vector aggregations (avg_over_time, sum_over_time, rate, increase, quantile_over_time), label filtering, metrics query, shadowing, and error cases

cardano-tracer timeseries HTTP server

  • Query endpoints:
    • POST /timeseries/query — body: application/x-www-form-urlencoded; parameters:
      • query=<expr> (required) — the query expression
      • time=<unix_s> (optional, float) — evaluation instant; defaults to server time (now)
    • Response: {"status":"success","data":<Value>} or {"status":"error","errorType":"...","error":"..."}
    • GET /timeseries/query — same parameters and response, passed as URL query string
  • Node metadata endpoints (new, via TracerEnv):
    • GET /timeseries/nodes — JSON array of connected node ID strings
    • GET /timeseries/node/{id}/infoNodeInfo fields plus server-computed uptimeSeconds
    • GET /timeseries/node/{id}/startupNodeStartupInfo (era, slot length, epoch length, KES period)
    • GET /timeseries/node/{id}/state (RTVIEW){"syncProgress":<pct>} from the NodeAddBlock data point
  • Data point key fix — the /state endpoint uses key "NodeAddBlock" (the namespace cardano-node actually registers the data point under) rather than "NodeState", which was always empty

Grafana datasource plugin (bench/grafana-datasource/)

  • $__from / $__to support — the plugin pre-processes queries before template substitution, rewriting $__from and $__to to epoch + Nms expressions the query language understands; dashboard panels use [$__from; $__to] ranges driven by the Grafana time picker
  • Error handling — server-side errors are surfaced as Grafana DataQueryError with the first line as the panel message and the full elaboration/interpretation trace in the inspector
  • Node info query types — four additional query types expose RTView header data per connected node:
    • nodes — list of node IDs (populates the $node_id template variable via metricFindQuery)
    • node-info — name, protocol, version, commit, start time, uptime in seconds
    • node-startup — era, slot length, epoch length, KES period length
    • node-state — sync progress percentage
  • Provisioned RTView dashboard — panels covering Resources, Blockchain, Leadership, Transactions, and a Node section with Connected Nodes table and per-node Info/Startup/Sync/Uptime panels (all repeating by $node_id via a multi-select query variable)
  • Metric name sanitisation. and - in EKG metric names are replaced with _ so names match the cardano_node_metrics_* convention used in queries

Workbench profiles

  • 6-dense-timeseries-1h — 6-node dense-topology profile with tracer.timeseries = true and tracer.rtview = true; 1-hour duration; replaces the earlier 2-node ci-bench-timeseries profile
  • macOS compatibilityps h --ppid (GNU ps) replaced with pgrep -P in supervisor.sh; the former errors on macOS BSD ps causing noisy output on local runs

Russoul added 24 commits May 15, 2026 15:45
…eseries profile

**cardano-tracer / cardano-timeseries-io**
- Add `Cardano.Timeseries.JSON` with `ToJSON` instances for `Value`, `Instant`,
  `Timeseries` and `SeriesIdentifier` (orphan module, imported for side-effects)
- Switch `POST /timeseries/query` response from plain text to `application/json`

**cardano-profile**
- Add `timeseries :: Bool` field to `Tracer`
- Add `tracerTimeseries` primitive
- Add `ci-bench-timeseries` profile: 2-node local cluster with timeseries endpoint
  enabled, no shutdown condition, generator runs for 100 000 epochs (effectively
  indefinite) — intended for interactive exploration via Grafana
- Regenerate `all-profiles-coay.json` and `wb_profiles.mk`; update all test fixtures

**Nix**
- `cardano-tracer-service-workbench.nix`: add `timeseries` option → `hasTimeseries`
- `cardano-tracer-service.nix`: add `timeseriesEnable/Host/Port` NixOS options
- `tracer.nix`: wire `profile.tracer.timeseries` → `{epHost, epPort = 3400}`
- `supervisor.sh`: replace hanging `netstat -pltn` with `lsof -nP -iTCP:9001
  -sTCP:LISTEN` for macOS compatibility

**bench/grafana-datasource** (new)
- TypeScript Grafana datasource plugin (`iog-cardanotimeseries-datasource`)
- Sends `POST /timeseries/query` via Grafana server-side proxy (avoids CORS)
- Converts `Value` tagged-union JSON to Grafana `DataFrame[]`
- `docker-compose.yaml` for local development; datasource auto-provisioned at
  `http://host.docker.internal:3400`; Colima-compatible (`extra_hosts`)
- Parse/eval errors from the server are now surfaced as DataQueryError:
  the banner shows the first line (summary); the full multi-line message
  (source location + caret) is in data.message for the inspector
- Requests go via Grafana's server-side proxy (instanceSettings.url) to
  avoid CORS and host-resolution issues in the browser
- Add Unit value: renders as an empty frame (no data to display)
- POST /timeseries/query now accepts JSON body {"query": "...", "time": <optional Unix seconds>}
- Success response wrapped in {"status":"success","data":...} envelope
- Error responses use {"status":"error","errorType":"...","error":"..."} format
- errorType mirrors Prometheus: "parse", "bad_data", "execution"
- Plugin updated to send application/json body and unwrap response envelope
- Remove ConfigEditor.tsx (Grafana's built-in URL field suffices)
Introduces Nil and Cons constructors to both the expression language and
the value domain. The primary motivation is to give the built-in Metrics
expression a well-typed return value: it now returns a proper linked list
of Text values (metric names) rather than a right-nested Pair tuple
terminated by Unit, making it possible to assign it the type List Text in
the elaborator.

- Interp: Metrics now builds a Nil/Cons spine instead of a Pair/Unit tuple
- JSON: ToJSON instances for Nil and Cons (tag/head/tail encoding)
- Show: show Nil = "[]", show (Cons h t) = "[h|t]"
- Resolve: structural pass-throughs for Nil and Cons
- Grafana plugin: Nil/Cons variants in the TypeScript Value union;
  Cons case walks the spine into a string table frame, Text items
  render their actual string value
…ew in ci-bench-timeseries

- Add provisioned Grafana dashboard mimicking RTView's four sections:
  Resources, Blockchain, Leadership, Transactions (26 panels, row-based
  collapsible sections, byte units for memory/mempool panels)
- Pin datasource uid in provisioning so dashboard references are stable
- Mount dashboards directory in docker-compose
- Enable RTView (hasRTView, port 3300) in ci-bench-timeseries profile
  alongside the existing timeseries endpoint
… inconsistencies

Implements pointwise arithmetic (add, sub, mul, div) between RangeVector Scalar
and Scalar, following the existing InstantVector/Scalar pattern. This enables
queries like 'metric[now - 1h; now] / 1048576' in Grafana panels.

Fixes ten inconsistencies found in typing.txt: missing List type in grammar,
filter_by_label argument order, A->B type formation rule conclusion, := vs ≔
notation, 'Syntax hugar' typo, map variable name mismatch, RangeVector missing
type parameter in quantile_over_time, four missing instant_vector_scalar relation
rules, mul_instant_vector_scalar missing arguments, and stale metrics return type
(Text -> List Text).

Adds a precondition note to 'interp' that the input expression must be well-typed.
…mar docs

Add a cardano-timeseries-test suite (tasty/tasty-hunit) mirroring the
cardano-recon-framework layout, with three suites:
- Elab.Expr.Parser.Suite: comprehensive parser coverage across all constructs,
  operator precedence checks, and wrong-arity error cases (152 tests total)
- Elab.Suite: well-typed / ill-typed elaboration checks
- Interp.Suite: end-to-end execution via API.execute against an empty store

Fix parser bug: the `let` RHS was parsed as `exprOr` instead of `exprUniverse`,
preventing unparenthesised lambdas and nested lets from appearing as let-RHS.
The `in` keyword acts as a natural terminator so no ambiguity arises.

Update grammar documentation:
- Elab/Expr.hs inline grammar: `let x = t{> universe}` -> `t{≥ universe}`
- docs/elab.txt: fix `<int>min` -> `<int>m`; add missing `round t`,
  `earliest x`, `latest x`
… elab bug

The test suite covers scalar/bool arithmetic and comparisons, duration and
timestamp literals, type conversions (to_scalar, abs, round), pairs, let/lambda,
InstantVector lookup/aggregation/filter/map/label-filter/unless/join,
InstantVector-Scalar arithmetic and relations, RangeVector aggregations
(avg_over_time, sum_over_time, rate, increase, quantile_over_time), and
metrics.

Writing the tests surfaced a bug in Elab.hs where Surface.SumOverTime was
elaborated to Semantic.AvgOverTime instead of Semantic.SumOverTime, causing
sum_over_time queries to silently return the mean.  Fixed.
Drop the checkFresh guard so duplicate names are no longer rejected.
Change variable lookup to search the context right-to-left (Seq.reverse)
so that the innermost binding wins, as scoping requires.
St now carries availableMetrics :: Set MetricIdentifier, populated at
the call site via metrics store (or Set.empty in metric-free contexts
like the elab test suite).

The fallback Variable case — previously an unconditional metric assumption
— now checks membership in availableMetrics and throws "Undefined name: <v>"
when the name appears neither in the local context nor in the store,
replacing the confusing downstream type-mismatch error that occurred before.
…tamp 0

The `epoch` keyword was mapped to `Now` in the parser, making it
identical to `now` (i.e. current server time).  It now produces
`Epoch`, which the interpreter evaluates to `Timestamp 0` (Unix
epoch), so expressions like `epoch + 1778499300385ms` yield the
intended absolute timestamp rather than ~year 2082.
The ci-bench-timeseries profile has ~900M transactions (100 000 epochs ×
~9k txs/epoch) but inherited fundsDefault (10 000 ADA) from base, which
covers only a handful of transactions before the generator exits with
"insufficient funds".

Add fundsTimeseries (25 000 000 ADA = 25 × 10^15 lovelace) to Vocabulary
and a baseTimeseries variant of base that uses it. Switch ciTimeseries02Value
to baseTimeseries so the ci-bench-timeseries profile has enough genesis funds
to sustain a long-running workload.
100 000 epochs needed ~23B ADA in the generator wallet (split grows
exponentially with tx_count via unfoldSplitSequence) but genesis could
only supply 22.5B, causing an immediate "insufficient funds" crash.

12 epochs at 15 tps / timescaleCompressed = ~108 000 transactions = 2 hours.
The required split is now ~2.3M ADA, well within the 22.5B available.
When a variable is not found in either the local context or the metric
store, the elaborator now appends a "Did you mean: ..." hint by ranking
all candidate names (locals + metrics) by Levenshtein distance and
showing the closest ones (up to 5) within threshold max(1, len/3).

Uses the `edit-distance` library (the canonical Haskell implementation,
also used by GHC for its own "did you mean" diagnostics).
Replace hardcoded [now - 1h; now] in all 25 dashboard panels with
[$__from; $__to] so the Grafana time picker controls the window.

Pre-process $__from/$__to in the datasource plugin before getTemplateSrv()
sees them — Grafana treats these as built-in variables and ignores
scopedVars overrides — converting them to epoch + Nms expressions that
the query language interprets as absolute timestamps.
Replaces the 7-test skeleton with a comprehensive suite covering all
elaborator code paths: literals, duration literals, arithmetic
(well-typed and ill-typed), comparisons, boolean operators, let/lambda,
pairs/projections, to_scalar, abs/round, metrics builtin, range
vectors, instant-vector operations, and metric-name resolution (known
metric, undefined name, Levenshtein suggestions).

Notes two elaborator bidirectionality constraints discovered via
test failures:
- `Duration + Duration` requires the result type to be driven by
  a containing expression (tested via `now + (1s + 2s)`).
- `\x -> x + 1` leaves `x`'s type unconstrained; replaced with
  `\x -> now + x` which forces `x : Duration` via the Timestamp+?
  noncanonical rule.
When both operands are known to be Duration but the result type is
still a hole, force the hole to Duration and re-queue as the canonical
Duration + Duration : Duration problem. This allows standalone
expressions like `1s + 2s` to elaborate without needing an outer
context to drive the result type.
All three rules fire only when the known type information uniquely
determines the outcome:

  ? - Duration : ?  ->  Timestamp - Duration : Timestamp
    Only Timestamp - Duration exists with Duration on the Sub rhs.
    Enables: \x -> x - 1s  (infer x : Timestamp)

  ? + ? : Duration  ->  Duration + Duration : Duration
    Only Duration + Duration produces Duration.
    Enables: \x -> \y -> m [now; now : x + y]  (infer both : Duration)

  ? - ? : Timestamp  ->  Timestamp - Duration : Timestamp
    Only Timestamp - Duration produces Timestamp via Sub.
    Enables: \x -> \y -> m [x - y; x]  (infer x : Timestamp, y : Duration)

Ordering: A (? - Duration) before C (? - ? : Timestamp) so the more
specific rhsTy=Duration match takes priority. B (? + ? : Duration)
after the existing Duration + Duration : ? rule so the known-both-sides
case is tried first.
- JSON.hs: rename tag/value to resultType/result; use Prometheus
  resultType names (scalar, vector, matrix); encode timestamps and
  durations as Unix seconds (Double); encode data-point values as
  strings; rename labels->metric and data->values in Instant/Timeseries
- TimeseriesServer: parse query endpoint body as
  application/x-www-form-urlencoded (Prometheus wire format); add GET
  support alongside POST, sharing a single handleQuery helper
- grafana-datasource: update types, toDataFrames, and datasource to
  match new wire format; switch fetch to x-www-form-urlencoded
Remove the 2-node ci-bench-timeseries profile along with its dedicated
baseTimeseries and fundsTimeseries combinators (now dead code). Add
6-dense-timeseries-1h as a copy of 6-dense-1h with tracerTimeseries
enabled: 6 nodes, 1-hour duration, dense topology. The tracerTimeseries
combinator in Primitives.hs is retained. Regenerate all-profiles-coay.json.
@Russoul Russoul force-pushed the russoul/timeseries-align-http-api-with-prometheus branch from 21567bd to d9368e4 Compare May 15, 2026 11:46
Russoul added 5 commits May 15, 2026 16:45
…files-coay.json

all-profiles-coay.json was corrupted by cabal build output leaking into
stdout during a previous regeneration. Regenerated cleanly by separating
the build step from the capture step.
Replace GNU ps long-option syntax with pgrep -P, which is available
on both macOS (BSD) and Linux.
…ies server

New routes on the timeseries server:
  GET /timeseries/nodes                        — list connected node IDs
  GET /timeseries/node/{id}/info               — NodeInfo + uptimeSeconds
  GET /timeseries/node/{id}/startup            — NodeStartupInfo
  GET /timeseries/node/{id}/state  (RTVIEW)    — sync progress %

The server now receives TracerEnv instead of individual fields so it can
access teDPRequestors, teCurrentDPLock, and teConnectedNodes.

The /state endpoint uses data point key "NodeAddBlock" (the namespace
cardano-node actually stores the NodeState data point under) rather than
"NodeState", which was always empty.
Russoul added 4 commits May 15, 2026 17:38
…d dashboard panels

New query types in the datasource plugin:
  nodes       — lists all connected node IDs (used by $node_id variable)
  node-info   — name, protocol, version, commit, start time, uptime
  node-startup — era, slot length, epoch length, KES period
  node-state  — sync progress %

New panels in rtview.json (all repeat by $node_id variable):
  Connected Nodes table, Node Info table, Startup table,
  Sync % stat, Uptime stat

metricFindQuery populates the $node_id query variable automatically
from /timeseries/nodes.

$__from/$__to in timeseries queries are pre-processed to valid
timestamp expressions before template variable expansion.
@Russoul Russoul changed the title bench | timeseries: JSON API, Grafana datasource plugin, ci-bench-timeseries profile cardano-tracer | timeseries: Prometheus-aligned HTTP API, node info endpoints, Grafana datasource May 15, 2026
@Russoul Russoul marked this pull request as ready for review May 15, 2026 15:08
@Russoul Russoul requested review from a team as code owners May 15, 2026 15:08
… code review

- Elab.hs: fix copy-paste error in binary arithmetic op elab (rhs hole was
  unified against lhsTy instead of rhsTy); rename evalBinaryArithmethicOpElabProblem
  to evalBinaryArithmeticOpElabProblem (typo)
- Elab.hs: elaborate `metrics` as List Text (was Text); add Str elab case
- Elab/Typing.hs, Resolve.hs, Unify.hs: add List Ty to support metrics type
- Interp.hs: guard avg/min/max against empty instant vector; fix rate to
  error on single-point timeseries instead of dividing by zero
- Interp/Value.hs: use showFFloat in Show instance for Scalar to avoid
  scientific notation in JSON output
- TimeseriesServer.hs: fix minimumRetentionMillis units (seconds → ms);
  remove unused RecordWildCards pragma; align sleep delay with Monitoring.hs
- Acceptors/Utils.hs: align new imports with surrounding import block
@Russoul Russoul force-pushed the russoul/timeseries-align-http-api-with-prometheus branch from 1297902 to 560cae8 Compare May 15, 2026 16:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant