feat: per-conversation wiki digest#51

Merged
aniongithub merged 8 commits into main from feat/digest
May 24, 2026

@aniongithub
Owner

Implements the Digest Plan — Persistent Per-Conversation Wiki Context end-to-end.

What

A compact, always-current orientation blob (~4 KB / ~1K tokens) that mind-map-aware agents can consume at the start of every conversation. Three signals, deterministic, no LLM in the regeneration loop:

  • Word/phrase cloud — top-K unigrams + bigrams across all page bodies. Frequency-based; built-in EN stopwords + user extras filter the noise. mind-map, page_count, web-ui survive intact thanks to a custom Go tokenizer (FTS5's tokenizer doesn't cleanly expose token sequences in pure-Go SQLite).
  • Active-use recents LRU — a 20-slot ring of paths the user or agent actually touched (Create / Update / Get / Move / GetBacklinks). Distinct from recent_pages (mtime-sorted), which surfaces sync churn rather than intent.
  • Per-area page counts — driven by the indexed pages table, with the area's index page title as a one-line description.

How

Seven commits, one per plan step, plus one for the WebUI:

| # | Commit | What |
|---|---------|------|
| 1 | 6c54f8d | LRU ring + touch hooks |
| 2 | f9a7213 | Cloud builder (tokenizer + bigrams + top-K) |
| 3 | 2480dd6 | Digest assembly + renderer + version-keyed cache |
| 4 | 5712fe1 | SQLite wiki_state table for recents/cloud persistence |
| 5 | 8ac8ad0 | MCP get_wiki_digest + HTTP GET /api/digest + get_wiki_context backwards-compat |
| 6 | 389e114 | internal/digest.Manager background tickers (cloud 5m, recents 30s) |
| 7 | 97aa7f9 | config.DigestConfig + SKILL.md / README updates |
|   | 380df0e | WebUI Digest settings section with tag-input for stopwords |

Surfaces

  • MCP: new get_wiki_digest tool. get_wiki_context keeps its legacy {page_count, recent_pages, top_level_dirs} shape and gains the new digest fields (plan open question #4: keep old clients working).
  • HTTP: GET /api/digest returns the full Digest JSON for WebUI / non-MCP callers.
  • Config: full digest section in config.json (cloud_size, recents_size, cloud_refresh, stopwords_extra, max_render_bytes). Backwards-compatible with pre-digest configs.
  • WebUI: new Settings → Digest section with chip-style tag-input for stopwords.

Lifecycle

A new internal/digest.Manager mirrors internal/sync.Manager exactly: NewManager(*wiki.Wiki, Options) → Start(ctx) / Stop(). The first cloud build runs synchronously on Start so cold-start digests have an About: line immediately. Wired into both stdio and HTTP modes in cmd/mind-map; defer dm.Stop() is registered after defer w.Close(), so with Go's LIFO defer order Stop runs first and the ticker quiesces before the DB closes (no sql: database is closed races).

Persisted state survives restarts: a freshly-restarted server has a useful digest immediately, not after the first 5-minute ticker tick.

Live data

Sample from GET /api/digest against the in-container mind-map docs wiki:

This wiki contains 34 pages across 6 areas. About:
mind-map, architecture, concepts, wiki, page, mcp, pages, agents, see, guides,
design, graph, index, same, service, agent, path, wikilinks, search, engine,
install, backlinks, ui, mb, files, binary, markdown, web, web ui,
architecture web-ui, web-ui, comparisons, api, file, sqlite, source, obsidian,
also, concepts wikilinks, frontmatter, yes, links, notion, server,
agents mcp-tools, go, mcp-tools, table, architecture mcp-server, git

## Areas
- guides (8)
- architecture (6) — architecture/index: "Architecture"
- concepts (6) — concepts/index: "Concepts"
- comparisons (4) — comparisons/index: "Comparisons"
- design (4) — design/index: "Design"
- agents (3) — agents/index: "Agents"

## Recently active
- agents/index

Full skill: SKILL.md. Use `get_wiki_digest` for the live version.

Cloud surfaces real domain terms (mind-map, mcp, wikilinks, web-ui); bigrams come through (web ui, agents mcp-tools, concepts wikilinks); the hyphen-preservation rule pays off. Noise terms (see, also, same, yes) motivated the WebUI stopword tag-input commit.

Tests

| Package | New tests |
|---------|-----------|
| internal/wiki | ~44 (recents LRU, cloud builder, digest renderer, state persistence) |
| internal/digest | 5 (manager lifecycle + tickers) |
| internal/config | 3 (digest config roundtrip + backwards compat) |
| internal/mcp | 2 (tool listing + get_wiki_digest) |
| internal/httpapi | 1 (GET /api/digest) |

go test ./..., go vet ./..., go build ./... and npm run build all clean.

Out of scope (deliberate)

  • mind-map install-context-hooks — installer subcommand to write per-agent rules files. Plan calls this a separate follow-up.
  • LLM-generated narrative summaries.
  • WebUI word-cloud visualization widget. Data is on /api/digest; rendering is a separate UI task.
  • TF-IDF cloud weighting. Frequency is good enough for v1 per the plan's lean; revisit if noisy in practice.

A doubly-linked-list + map ring (O(1) touch/remove/rename) tracking
pages the user or agent actively used — Create, Update, Get, Move
(both ends), Delete, GetBacklinks — rather than what disk mtime says
was last changed. Distinguishes 'intent' from 'sync churn' for the
upcoming digest's recents signal.

Touches fire only on the success path: a failed CreatePage on an
existing page, or a GetPage on a missing path, does not pollute the
ring. MovePage renames in place so a move shows up as one continuous
use rather than dropping the old name and freshly inserting the new.
DeletePage drops the entry — a deleted page in recents would mislead
the agent.

Capacity defaults to 20 (the plan's recents_size default); a later
step will swap this for a config-driven value. Persistence lives in
state.go (next steps); recentsLRU itself is storage-agnostic, exposing
load/snapshot/takeDirty for a ticker to consume.
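
The ring described above can be sketched with the standard library's container/list plus a map. This is a minimal illustration, not the PR's recentsLRU: the type and method names below are hypothetical, and it assumes a rename also counts as a use (moves the entry to the front).

```go
package main

import (
	"container/list"
	"fmt"
)

// ring is a hypothetical stand-in for the recents LRU: a doubly-linked
// list ordered most-recent-first plus a map for O(1) lookup.
type ring struct {
	capacity int
	order    *list.List               // front = most recently used
	index    map[string]*list.Element // path -> list node
}

func newRing(capacity int) *ring {
	return &ring{capacity: capacity, order: list.New(), index: map[string]*list.Element{}}
}

// touch moves path to the front, inserting it and evicting the oldest
// entry when the ring is over capacity.
func (r *ring) touch(path string) {
	if el, ok := r.index[path]; ok {
		r.order.MoveToFront(el)
		return
	}
	r.index[path] = r.order.PushFront(path)
	if r.order.Len() > r.capacity {
		oldest := r.order.Back()
		r.order.Remove(oldest)
		delete(r.index, oldest.Value.(string))
	}
}

// remove drops path, as a page delete would.
func (r *ring) remove(path string) {
	if el, ok := r.index[path]; ok {
		r.order.Remove(el)
		delete(r.index, path)
	}
}

// rename rewrites the entry in place so a move reads as one continuous use.
func (r *ring) rename(from, to string) {
	el, ok := r.index[from]
	if !ok || from == to {
		r.touch(to)
		return
	}
	r.remove(to) // avoid a duplicate if the destination was already tracked
	delete(r.index, from)
	el.Value = to
	r.index[to] = el
	r.order.MoveToFront(el)
}

// snapshot returns the tracked paths, most recent first.
func (r *ring) snapshot() []string {
	out := make([]string, 0, r.order.Len())
	for el := r.order.Front(); el != nil; el = el.Next() {
		out = append(out, el.Value.(string))
	}
	return out
}

func main() {
	r := newRing(2)
	r.touch("a")
	r.touch("b")
	r.rename("a", "c") // "a" becomes "c" and moves to the front
	r.touch("d")       // over capacity: "b" is evicted
	fmt.Println(r.snapshot()) // [d c]
}
```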

Step 1 of the digest plan (mind-map/plans/digest).

A deterministic, frequency-based summary of 'what is this wiki about'.
One pass over pages.body produces unigram + bigram counts, filters
through a built-in English stopword list (plus user extras), and
selects top K with alphabetical tie-break for stable output across
rebuilds.
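
The top-K selection with an alphabetical tie-break can be sketched as a two-key sort. The function name topK is hypothetical; the point is only that equal counts fall back to term order, so repeated rebuilds over the same data produce the same list.

```go
package main

import (
	"fmt"
	"sort"
)

// topK returns up to k terms ordered by descending count, breaking ties
// alphabetically so the result does not depend on map iteration order.
func topK(counts map[string]int, k int) []string {
	terms := make([]string, 0, len(counts))
	for t := range counts {
		terms = append(terms, t)
	}
	sort.Slice(terms, func(i, j int) bool {
		ci, cj := counts[terms[i]], counts[terms[j]]
		if ci != cj {
			return ci > cj
		}
		return terms[i] < terms[j]
	})
	if k < 0 {
		k = 0
	}
	if k < len(terms) {
		terms = terms[:k]
	}
	return terms
}

func main() {
	counts := map[string]int{"wiki": 3, "mcp": 3, "page": 5, "go": 1}
	fmt.Println(topK(counts, 3)) // [page mcp wiki]
}
```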

Tokenizer design notes:
- Custom Go pass over pages.body rather than reaching into FTS5's
  C tokenizer. modernc.org/sqlite doesn't cleanly expose token
  sequences to Go, and reusing FTS5 would lose bigram ordering or
  drag in CGO-adjacent complexity. unicode61-equivalent for our
  purposes: lowercase, non-alnum split, with hyphens/underscores
  preserved mid-token so 'mind-map' and 'page_count' survive as one
  token each.
- Wikilink brackets stripped so [[projects/mind-map]] contributes
  its target words to the cloud naturally.
- Code fences and inline code are NOT stripped: identifiers in code
  are real 'about' signal in a technical wiki; dropping them would
  flatten the cloud.
- Bigrams require both endpoints to pass the stopword filter (the
  plan's chosen lean on open question #2): 'the wiki' must not
  appear just because 'the' is high-frequency.
- Single-char and all-digit tokens are dropped as a low-information
  short-circuit before the stopword map lookup.
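
A rough sketch of the rules in this list, under stated assumptions: the function names are hypothetical, the stopword map is a tiny placeholder rather than the built-in list, and bigrams are formed from adjacent tokens without regard to sentence boundaries (the notes above do not say how punctuation is handled).

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// tokenize lowercases text and splits on non-alphanumeric runes, keeping
// '-' and '_' only when they sit between two alphanumeric runes.
func tokenize(text string) []string {
	rs := []rune(strings.ToLower(text))
	isAlnum := func(r rune) bool { return unicode.IsLetter(r) || unicode.IsDigit(r) }
	var tokens []string
	var cur []rune
	flush := func() {
		if len(cur) > 0 {
			tokens = append(tokens, string(cur))
			cur = cur[:0]
		}
	}
	for i, r := range rs {
		switch {
		case isAlnum(r):
			cur = append(cur, r)
		case (r == '-' || r == '_') && len(cur) > 0 && i+1 < len(rs) && isAlnum(rs[i+1]):
			cur = append(cur, r)
		default:
			flush() // brackets, slashes, spaces, punctuation all split
		}
	}
	flush()
	return tokens
}

// keep drops single-character and all-digit tokens before the stopword lookup.
func keep(tok string, stop map[string]bool) bool {
	if len([]rune(tok)) < 2 {
		return false
	}
	allDigits := true
	for _, r := range tok {
		if !unicode.IsDigit(r) {
			allDigits = false
			break
		}
	}
	return !allDigits && !stop[tok]
}

// countTerms returns unigram and bigram frequencies; a bigram is counted
// only when both adjacent tokens pass the filter.
func countTerms(text string, stop map[string]bool) map[string]int {
	counts := map[string]int{}
	toks := tokenize(text)
	for i, t := range toks {
		if !keep(t, stop) {
			continue
		}
		counts[t]++
		if i+1 < len(toks) && keep(toks[i+1], stop) {
			counts[t+" "+toks[i+1]]++
		}
	}
	return counts
}

func main() {
	stop := map[string]bool{"the": true, "is": true}
	c := countTerms("The [[projects/mind-map]] wiki is the mind-map wiki; page_count 42", stop)
	fmt.Println(c["mind-map"], c["mind-map wiki"], c["the wiki"], c["42"]) // 2 2 0 0
}
```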

A single-slot cloudCache exposes Set/Get with defensive copies so
the upcoming 5-minute rebuild ticker (Step 6) can swap clouds
without readers racing on slice aliasing.
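
As an illustration of the defensive-copy idea, here is a minimal single-slot cache. It is a sketch, not the PR's cloudCache: the term type is reduced to a string and the version counter is an assumption drawn from the later caching notes.

```go
package main

import (
	"fmt"
	"sync"
)

// slotCache holds one cloud at a time. Set and Get both copy the slice,
// so neither the writer nor any reader can alias the stored backing array.
type slotCache struct {
	mu      sync.RWMutex
	terms   []string
	version uint64 // bumped on every Set
}

func (c *slotCache) Set(terms []string) {
	cp := append([]string(nil), terms...)
	c.mu.Lock()
	c.terms = cp
	c.version++
	c.mu.Unlock()
}

func (c *slotCache) Get() ([]string, uint64) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	return append([]string(nil), c.terms...), c.version
}

func main() {
	var c slotCache
	in := []string{"mind-map", "wiki"}
	c.Set(in)
	in[0] = "mutated" // the caller's later writes do not reach the cache
	got, ver := c.Get()
	fmt.Println(got, ver) // [mind-map wiki] 1
}
```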

Frequency, not TF-IDF, for v1 (plan open question #1 lean). Easy
swap later if the cloud reads noisy in practice.

Step 2 of the digest plan (mind-map/plans/digest).

Digest() returns a structured Digest{PageCount, Cloud, Recents,
Areas, Markdown} — the typed fields drive the WebUI / HTTP JSON, the
markdown is what an LLM consumes in a per-conversation orientation
prompt. Shape matches the example in the plan:

    This wiki contains N pages across M areas. About:
    term1, term2, term3, …

    ## Areas
    - foo (45) — foo/index: "Foo Area"
    - bar (12)
    - …

    ## Recently active
    - path/one
    - …

    Full skill: SKILL.md. Use `get_wiki_digest` for the live version.

Trim discipline when over the soft cap (default 4096 bytes): drop
recents from the tail first, then cloud, never areas. Areas are the
smallest section and the most structurally important — losing them
means losing the map of the wiki. Footer hint is also preserved.

Caching is version-keyed: cloudCache and recentsLRU each expose a
monotonic counter; digestCache stores (cloudVer, recentsSeq,
pageCount) alongside the cached *Digest and rebuilds on any
mismatch. pageCount is part of the key because pure content edits
that don't touch the LRU still change the header sentence. CRUD
operations automatically bust the cache through their existing LRU
touches, so callers don't need to invalidate explicitly.
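
One way to picture the version-keyed cache: store the key triple next to the cached value and rebuild whenever any component differs. The names and the use of a plain string for the cached digest are assumptions made for a compact sketch.

```go
package main

import (
	"fmt"
	"sync"
)

// cacheKey mirrors the (cloudVer, recentsSeq, pageCount) triple; the
// struct is comparable, so a mismatch in any field is a single != check.
type cacheKey struct {
	cloudVer   uint64
	recentsSeq uint64
	pageCount  int
}

type versionedCache struct {
	mu    sync.Mutex
	key   cacheKey
	value string
	valid bool
}

// get returns the cached value when the key matches, otherwise rebuilds.
func (c *versionedCache) get(key cacheKey, build func() string) string {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.valid && c.key == key {
		return c.value
	}
	c.key, c.value, c.valid = key, build(), true
	return c.value
}

func main() {
	var c versionedCache
	builds := 0
	build := func() string { builds++; return fmt.Sprintf("digest #%d", builds) }

	c.get(cacheKey{1, 1, 34}, build)
	c.get(cacheKey{1, 1, 34}, build) // hit: nothing changed
	c.get(cacheKey{1, 2, 34}, build) // an LRU touch bumped recentsSeq
	c.get(cacheKey{1, 2, 35}, build) // page count changed
	fmt.Println(builds) // 3
}
```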

Area summaries are driven by the indexed `pages` table, not by
filesystem listing — the source of truth for the digest is what's
queryable, not what's on disk. Flat-rooted pages (no slash) are
ignored: a top-level page is not an area.
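
The area rule (first path segment, flat-rooted pages skipped) can be sketched over a list of indexed paths. The function and type names are hypothetical, and the ordering (count descending, then name) is an assumption inferred from the sample output in the PR description.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

type area struct {
	Name  string
	Count int
}

// areaCounts groups page paths by their first segment. Paths without a
// slash are top-level pages, not areas, and are skipped.
func areaCounts(paths []string) []area {
	counts := map[string]int{}
	for _, p := range paths {
		name, _, found := strings.Cut(p, "/")
		if !found || name == "" {
			continue
		}
		counts[name]++
	}
	out := make([]area, 0, len(counts))
	for n, c := range counts {
		out = append(out, area{Name: n, Count: c})
	}
	sort.Slice(out, func(i, j int) bool {
		if out[i].Count != out[j].Count {
			return out[i].Count > out[j].Count
		}
		return out[i].Name < out[j].Name
	})
	return out
}

func main() {
	paths := []string{"guides/a", "guides/b", "agents/index", "readme", "design/x"}
	fmt.Println(areaCounts(paths)) // [{guides 2} {agents 1} {design 1}]
}
```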

Also: hook Reindex Phase 4 into recents.remove() so pages that
vanish via raw-filesystem delete + reindex (common after `git pull`
in sync) don't linger in the LRU as 404 candidates. With this hook,
the renderer can trust the LRU as-is — no filter, no purge — and
the LRU stays consistent with `pages` at all times.

Step 3 of the digest plan (mind-map/plans/digest).

A wiki_state table (key/value/updated) stores the LRU snapshot and
the word/phrase cloud in JSON so a freshly-restarted server has a
useful digest immediately, not after the first 5-minute ticker fires.

The rendered digest markdown is NOT persisted — it's sub-ms to
re-assemble from cloud + LRU, and the in-memory digestCache already
handles 'don't re-format on every hit'. Adding a third write path
buys nothing measurable.

Load happens at the tail of Open(), after Reindex. Persisted
recents are filtered against the current `pages` table so paths
that vanished while the server was off (deleted on disk, or
sync-pulled away) don't reappear in the LRU as 404 candidates.
The cloud loads as-is — global frequency counts remain a reasonable
approximation across small content changes, and the next rebuild
ticker (Step 6) will refresh it within minutes.

Save points:
- persistRecents() — called by Close() for a clean shutdown flush
  and (in Step 6) on a 30s dirty-gated ticker.
- persistCloud() — called by Step 6's 5m rebuild ticker. No-ops
  when the cloud has never been populated so we don't clobber a
  previously-good copy with an empty placeholder.

Failure modes are deliberately lenient: a corrupt JSON row, a
missing table, or an unreachable column logs at WARN and falls back
to fresh-wiki state rather than panicking. The digest is an
orientation signal, not a correctness boundary; losing it shouldn't
take down the server.
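
The lenient-load behaviour, combined with the filter against the current pages table, might look like the sketch below. The function name, the JSON encoding as a plain string array, and the exists callback standing in for a pages lookup are all assumptions.

```go
package main

import (
	"encoding/json"
	"fmt"
	"log"
)

// loadRecents decodes a persisted snapshot and keeps only paths that still
// exist. Any decode error is logged and treated as "no saved state".
func loadRecents(raw []byte, exists func(string) bool) []string {
	var saved []string
	if err := json.Unmarshal(raw, &saved); err != nil {
		log.Printf("WARN digest: discarding unreadable recents state: %v", err)
		return nil
	}
	kept := make([]string, 0, len(saved))
	for _, p := range saved {
		if exists(p) {
			kept = append(kept, p)
		}
	}
	return kept
}

func main() {
	pages := map[string]bool{"agents/index": true}
	exists := func(p string) bool { return pages[p] }

	fmt.Println(loadRecents([]byte(`["agents/index","gone/page"]`), exists)) // [agents/index]
	fmt.Println(len(loadRecents([]byte(`{not json`), exists)))                // 0
}
```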

Also: made Close() idempotent via sync.Once. testWiki's t.Cleanup
plus explicit defer Close in state tests would otherwise run the
persistRecents flush against an already-closed DB.
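
A minimal sketch of an idempotent Close using sync.Once. The store type and its counter are hypothetical; the counter only stands in for the flush-then-close work so the effect is observable.

```go
package main

import (
	"fmt"
	"sync"
)

// store is a hypothetical resource whose shutdown work must run once.
type store struct {
	closeOnce sync.Once
	closeErr  error
	flushes   int // counts how often the shutdown flush actually ran
}

// Close runs the flush-and-close sequence on the first call only; later
// calls return the first call's result without touching the resource.
func (s *store) Close() error {
	s.closeOnce.Do(func() {
		s.flushes++      // stand-in for the recents flush
		s.closeErr = nil // stand-in for the underlying DB close result
	})
	return s.closeErr
}

func main() {
	s := &store{}
	_ = s.Close() // e.g. an explicit defer in a test
	_ = s.Close() // e.g. a cleanup hook firing afterwards
	fmt.Println(s.flushes) // 1
}
```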

Step 4 of the digest plan (mind-map/plans/digest).

Three surfaces, one signal:

- MCP get_wiki_digest — new tool. Returns the structured Digest
  (page count, cloud terms, recents LRU, per-area summaries, rendered
  markdown). Tool description nudges agents to call it at the start
  of every conversation.

- MCP get_wiki_context — the legacy {page_count, recent_pages,
  top_level_dirs} shape is preserved verbatim so existing clients
  (opencode, Claude Code in the wild, per plan open question #4)
  keep working. New fields (cloud_terms, recents, areas, markdown)
  are layered on the same response — old clients ignore them; new
  clients get the orientation upgrade without a tool-name change.

- HTTP GET /api/digest — returns the full Digest as JSON. Intended
  for the WebUI (so it can render its own word-cloud or recents
  widgets off the structured fields rather than parsing the markdown)
  and for non-MCP scripts/tests.

Implementation: WikiContext gets new optional fields (omitempty so
the JSON shape is additive). Wiki.Context() delegates to Digest()
to populate them; a digest failure logs at WARN but doesn't fail
the Context call — the legacy fields are still valuable on their
own, and the digest is an enhancement, not a contract.
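
To show why omitempty keeps the response additive, here is a reduced struct using the JSON field names from this PR. The Go field names and types are assumptions, and only two of the new fields are included for brevity.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// contextResponse is a hypothetical, reduced version of the context payload.
type contextResponse struct {
	// Legacy fields: always present.
	PageCount    int      `json:"page_count"`
	RecentPages  []string `json:"recent_pages"`
	TopLevelDirs []string `json:"top_level_dirs"`
	// Digest fields: omitted entirely when empty.
	CloudTerms []string `json:"cloud_terms,omitempty"`
	Markdown   string   `json:"markdown,omitempty"`
}

func encode(r contextResponse) string {
	b, err := json.Marshal(r)
	if err != nil {
		return ""
	}
	return string(b)
}

func main() {
	legacy := contextResponse{PageCount: 1, RecentPages: []string{"a"}, TopLevelDirs: []string{"x"}}
	fmt.Println(encode(legacy))
	// {"page_count":1,"recent_pages":["a"],"top_level_dirs":["x"]}

	legacy.CloudTerms = []string{"wiki"}
	fmt.Println(encode(legacy))
	// {"page_count":1,"recent_pages":["a"],"top_level_dirs":["x"],"cloud_terms":["wiki"]}
}
```

If the digest step fails, leaving the new fields at their zero values yields exactly the legacy shape.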

Step 5 of the digest plan (mind-map/plans/digest).

A new internal/digest.Manager mirrors internal/sync.Manager's shape:
NewManager(*wiki.Wiki, Options) → Start(ctx) / Stop() lifecycle, with
the embedder (cmd/mind-map) supervising. Sync's separation between
storage engine and goroutine-owning supervisor is a good pattern;
the digest follows it so cmd/mind-map sees a uniform 'subsystems are
supervised, not implicit' model.

Two tickers in one goroutine:
  - cloud_refresh (5m default): full cloud rebuild via Wiki.BuildCloud,
    SetCloud, PersistCloud. Synchronous first build on Start() so the
    very first post-open digest read has cloud terms — cold start
    over a 1k-page wiki is < 100ms.
  - recents_refresh (30s default): gated PersistRecents call. Skips
    SQLite writes on idle servers via a non-mutating peekDirty
    probe; only takeDirty (which clears the flag) runs after a
    successful write.

Shutdown contract: Stop() cancels the loop's context, the loop runs
one final detached-context flushRecents so the last ~30s of touches
land on disk, then closes done. Idempotent via sync.Once on both
Start and Stop. In cmd/mind-map, 'defer w.Close()' is registered first
and 'defer dm.Stop()' second; defers run in LIFO order, so Stop runs
first and the ticker quiesces before the DB closes (prevents
'sql: database is closed' races during shutdown).
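
Go runs deferred calls in last-in, first-out order, so the call that must run first has to be deferred last. A small sketch with stand-in closures (the labels are illustrative, not the real methods) makes the shutdown order visible:

```go
package main

import "fmt"

// shutdownOrder registers two deferred calls and records the order in
// which they actually run when the enclosing function returns.
func shutdownOrder() []string {
	var order []string
	func() {
		defer func() { order = append(order, "w.Close") }() // registered first, runs last
		defer func() { order = append(order, "dm.Stop") }() // registered last, runs first
	}()
	return order
}

func main() {
	fmt.Println(shutdownOrder()) // [dm.Stop w.Close]
}
```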

Exposed helpers on *Wiki:
  - BuildCloud / SetCloud / PersistCloud — public entry points
    for the supervisor; the lowercase internals stay for tests.
  - PersistRecents — clears dirty only on a successful write so
    a failed persist retries on the next tick rather than dropping
    the diff silently.
  - RecentsDirty — read-only peek, used by the manager's gate.

Wiring: both runStdio and runHTTPServer in cmd/mind-map start a
manager after wiki.Open and Stop it before w.Close. The HTTP path
derives the manager's context from stopCh so /api/restart and
ctrl+C take down the tickers cleanly. The service-mode launcher
delegates to runHTTPServer so it picks up the wiring for free.

Step 6 of the digest plan (mind-map/plans/digest).

Adds the digest section to config.json with the five knobs called
out in the plan:

  {
    "digest": {
      "cloud_size":       50,       // top-K terms in cloud
      "recents_size":     20,       // active-use LRU capacity
      "cloud_refresh":    "5m",     // rebuild interval (>=30s)
      "stopwords_extra":  ["TODO"], // appends to built-in EN list
      "max_render_bytes": 4096      // soft cap on rendered markdown
    }
  }

All fields are optional. A legacy config without a digest section
loads cleanly and yields zero-valued fields that consumers
interpret as 'use built-in defaults' — covered by an explicit
backwards-compat test. ParseCloudRefresh floors at 30 seconds: any
faster is wasted CPU for a signal nobody reads that often.

Wiring:

- wiki.Open(dir, opts ...OpenOption) — added variadic options so
  Open(dir) callers (10 in the tree, mostly tests) keep compiling
  unchanged. WithOptions(wiki.Options{...}) sets RecentsSize,
  MaxRenderBytes, and StopwordsExtra in one call. MaxRenderBytes
  semantics: > 0 trims, == 0 uses default, < 0 disables trimming.

- cmd/mind-map: both runStdio and runHTTPServer now load config
  before opening the wiki, pass digest tunables through helpers
  wikiOptionsFromConfig / digestOptionsFromConfig. Stdio mode
  previously bypassed config entirely; now both modes are
  consistent and a single config.json controls both.

- digest.Manager: StopwordsExtra is now forwarded into BuildCloud
  on every tick rebuild, not just the synchronous first build.
  The plumbing existed but was dropped on the floor — fixed.
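
The variadic-option pattern behind wiki.Open can be sketched as follows. The identifiers mirror the names in the commit message, but the bodies are assumptions: the Wiki struct is a stub, and defaults are resolved inside Open for illustration.

```go
package main

import "fmt"

// Options carries the digest tunables named in the commit message.
type Options struct {
	RecentsSize    int
	MaxRenderBytes int // > 0 trims, == 0 uses default, < 0 disables trimming
	StopwordsExtra []string
}

// OpenOption mutates the options before the wiki is opened.
type OpenOption func(*Options)

// WithOptions sets all tunables in one call.
func WithOptions(o Options) OpenOption {
	return func(dst *Options) { *dst = o }
}

// Wiki is a stub; the real type holds the database and caches.
type Wiki struct {
	dir  string
	opts Options
}

// Open accepts zero or more options, so existing Open(dir) calls compile unchanged.
func Open(dir string, opts ...OpenOption) (*Wiki, error) {
	var o Options
	for _, apply := range opts {
		apply(&o)
	}
	if o.RecentsSize <= 0 {
		o.RecentsSize = 20
	}
	if o.MaxRenderBytes == 0 {
		o.MaxRenderBytes = 4096
	}
	return &Wiki{dir: dir, opts: o}, nil
}

func main() {
	legacy, _ := Open("wiki")
	tuned, _ := Open("wiki", WithOptions(Options{RecentsSize: 5, MaxRenderBytes: -1}))
	fmt.Println(legacy.opts.RecentsSize, legacy.opts.MaxRenderBytes) // 20 4096
	fmt.Println(tuned.opts.RecentsSize, tuned.opts.MaxRenderBytes)   // 5 -1
}
```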

Docs:

- SKILL.md: rewritten Getting Oriented section to feature
  get_wiki_digest as the canonical 'start of conversation' call,
  with get_wiki_context retained for backwards compatibility.
  Tool list updated.

- README.md: tool count 10 → 11, new get_wiki_digest row, the
  legacy get_wiki_context row mentions it now returns digest
  fields too, and Wiki Features gets a digest bullet.

Step 7 of the digest plan (mind-map/plans/digest). Plan now fully
implemented end-to-end.

Five new controls in the settings panel, between Sync and Index:

  - Extra Stopwords  → tag-input (comma / space / Enter to commit a
                       chip; Backspace on empty input pops the last)
  - Cloud Size       → number input, blank = server default (50)
  - Recents Size     → number input, blank = server default (20)
  - Cloud Refresh    → text input (5m, 10m, etc.), blank = 5m
  - Max Render Bytes → number input, 0 disables trim, blank = 4096

A new TagInput component (webui/src/TagInput.tsx) implements the
chips UX: type → commit on separator → click × or Backspace to
remove. Pasted strings with commas or whitespace fan out into
multiple chips in one shot, so an operator can paste
'TODO, FIXME, see also' and get four tags. Duplicate detection is
case-insensitive but display preserves what the user typed; the
case-folding for matching happens server-side in the cloud
builder.

CSS uses the existing --accent / --border palette so chips
match the rest of the settings UI in both light and dark mode.

No backend changes: putSettings already unmarshals the full
config.Config (which gained the Digest section in step 7 of the
digest plan), so the new fields round-trip transparently. Changes
take effect on next restart — same contract as Sync.Interval; the
existing 'Settings saved. Restart to apply.' banner already says so.

Closes the loop on the digest plan's stopword tuning observation:
operators can now add domain-specific noise words from the UI
without editing config.json by hand.
@aniongithub aniongithub merged commit 30a82cc into main May 24, 2026
1 check passed
@aniongithub aniongithub deleted the feat/digest branch May 24, 2026 21:37
aniongithub added a commit that referenced this pull request May 25, 2026
Resolves a single conflict in internal/mcp/server.go where main's
digest PR (#51) and this branch both added new MCP tools.

Resolution:

- get_wiki_context: take main's revised description that mentions the
  new digest fields (auto-merged cleanly outside the conflict region).
- get_wiki_digest: keep main's new tool registration AND handler.
- get_page handler: drop the old main-side getPage that takes
  pagePathInput. This branch's slice 3 already replaced it with
  getPageWithFlags (in images.go) which accepts the new
  IncludeImages / IncludeImageMetadata flags via getPageInput.
  Keeping both would mean two handlers for the same tool name.
- Placeholder comment in server.go points readers at images.go for
  the new get_page handler.

Verified: go vet ./... clean, go test ./... passes (8 packages, including
the new internal/digest package from main).