Skip to content

fix: bridge startup race, dirty-reward orphan traces, RPC loopback auth#1817

Open
chiefmojo wants to merge 4 commits into
MemTensor:mainfrom
chiefmojo:pr/may27-fixes
Open

fix: bridge startup race, dirty-reward orphan traces, RPC loopback auth#1817
chiefmojo wants to merge 4 commits into
MemTensor:mainfrom
chiefmojo:pr/may27-fixes

Conversation

@chiefmojo
Copy link
Copy Markdown

Summary

Three targeted fixes from companion infrastructure stabilization:

1. Skip orphan trace insert during dirty-reward recovery (fc7252fc)

When dirtyFlag triggers a rescore on episodes with no matching turns, the capture path inserts a trace with r_reward=0 but no valid idParent. This creates orphan rows that inflate the database and break FTS indexing. This fix checks for the orphan condition and skips the trace insert entirely, logging a warning instead.

2. Bridge startup race (ef311ebb)

Hermes gateway restarts can race with the MemOS daemon warm-up (~30s). The default 15s poll deadline and 30s session.open timeout cause cascading failures when the bridge isn't ready. This extends the deadline to 45s, adds exponential backoff health probes, and ups the session timeout to 60s. Double-spawn prevention via PID file singleton guard ensures only one bridge process survives a race.

3. RPC loopback auth (a459a0a5)

The /api/v1/rpc session exemption was overly broad — any caller could bypass auth. This restricts the exemption to loopback callers only (127.0.0.1 / ::1).

Testing

All three fixes have been running on three companion instances (neuromancer, wintermute, case) for 48+ hours with zero bridge failures, no new orphan traces, and clean RPC auth.

chiefmojo and others added 3 commits May 27, 2026 21:38
runReflect's orphan-steps fallback was inserting new trace rows whenever
a recovered episode's snapshot timestamps didn't match existing DB rows.
This caused trace_ids_json to grow on every bridge restart, keeping
reward.traceCount != traceIds.length and looping forever — generating
48k+ empty traces and continuous OpenRouter traffic.

Guard: when meta.recoveryReason === "dirty_reward_rescore", log and skip
the insert instead of writing new rows. Legitimate orphan handling (test
paths, genuinely dropped events) is unchanged.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
daemon_manager.py: extend ensure_viewer_daemon health-poll deadline
from 15s to 45s to cover cold Node.js starts (tsx compile + SQLite
open + FTS warmup). After the first 15s, back off probes from 0.5s
to 2s to avoid hammering a slow-starting daemon.

__init__.py: increase _open_session timeout in init from 30s to 60s
so session.open doesn't time out before the daemon finishes starting.

Add a 1s double-spawn guard: when probe_viewer_status() returns
"free", wait 1s and re-probe before falling through to
ensure_viewer_daemon(). Eliminates the race where a second gateway
session spawns a second stdio bridge because the daemon from the
first session hasn't bound the port yet.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The RPC endpoint was unconditionally exempt from session-cookie auth
to allow the local Python adapter to connect without a browser session.
On hub-mode deployments (bindHost=0.0.0.0) this left the endpoint
reachable from the network without any auth when password protection
is enabled.

Now the exemption only applies when remoteAddress is 127.0.0.1, ::1,
or ::ffff:127.0.0.1. Standard installs (bindHost=127.0.0.1) are
unaffected; the Python adapter always connects from loopback and
continues to work. Network callers on hub deployments must hold a
valid session cookie.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@Memtensor-AI Memtensor-AI added the plugin Plugin/adapter/bridge layer (apps/ directory) label May 28, 2026
Copy link
Copy Markdown
Collaborator

@Memtensor-AI Memtensor-AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me review this PR carefully.

Needs changes (minor) — solid fixes overall, a few issues worth addressing

The three fixes are well-motivated and the 48h soak test is reassuring. A few things caught my eye:

1. __init__.py — Duplicated connection logic begs for extraction

The HTTP bridge connect-and-register pattern is copy-pasted 3 times (initial probe, late probe, reconnect). If the handshake ever changes (e.g., a second handler registration), you'll need to patch all three. Extract a helper like _try_http_bridge(session_id, timeout) -> bool.

2. __init__.pyhttp_bridge may be unbound on exception

http_bridge: MemosHttpClient | None = MemosHttpClient()

If MemosHttpClient() itself raises, http_bridge is never assigned, yet the except block calls http_bridge.close(). The type: ignore comment masks this. Safer:

http_bridge: MemosHttpClient | None = None
http_bridge = MemosHttpClient()

Same pattern applies to http_bridge_late.

3. __init__.pytime.sleep(1.0) in initialize

A hard 1s sleep on every cold start where the viewer port is free adds latency to the common "first launch" path. Consider making this conditional on evidence of a concurrent session (e.g., PID file exists but port not yet open), or at least document why 1s was chosen.

4. daemon_manager.py — Backoff heuristic is inverted

time.sleep(0.5 if (deadline - time.time()) > 30 else 2.0)

With a 45s deadline, deadline - time.time() > 30 is true only during the first ~15s. So you poll fast (0.5s) for the first 15s, then slow (2s) for the remaining 30s. That matches the PR description — just confirming the intent is correct. One nit: if the loop body takes non-trivial time, deadline - time.time() could go negative on the last iteration; not harmful, but a max(0, ...) would be clearer.

5. auth.ts — Loopback check is good but consider X-Forwarded-For

If the server ever sits behind a local reverse proxy (nginx, Caddy), remoteAddress will always be 127.0.0.1 regardless of the true client. This is fine today since the PR description says the binding is 0.0.0.0 directly, but worth a code comment noting the assumption. If a proxy is introduced later, this becomes an auth bypass.

6. capture.ts — Coupling to meta.recoveryReason magic string

if (input.episode.meta?.recoveryReason === "dirty_reward_rescore") {

If the recovery codepath ever changes the string or omits the field, the orphan loop silently resumes. Consider exporting the constant from the recovery module and importing it here, or at minimum adding a shared RECOVERY_REASONS enum/object.

7. capture.ts — Silent skip may hide real orphans during recovery

If a genuine orphan does appear during a dirty-reward rescore (e.g., a race between capture and recovery), it will be silently dropped. The warning log is good, but consider emitting a metric or structured event so monitoring can catch unexpected counts.


What looks good

  • PID-file singleton guard and zombie cleanup are the right call for the double-spawn problem.
  • The loopback auth restriction is a meaningful security improvement with minimal complexity.
  • The dirty-reward orphan fix correctly breaks the infinite recovery loop.
  • Error handling is consistent (suppress on cleanup, warn on fallback).

Ship it after addressing items 2 and 6; the rest are suggestions.

- auth.ts: add reverse-proxy assumption comment to loopback RPC exemption
- capture.ts + memory-core.ts: extract magic string to RECOVERY_REASONS.DIRTY_REWARD_RESCORE in new recovery-constants.ts
- memory-core.test.ts: update assertions to use shared constant
- __init__.py: extract _connect_http_bridge() helper; fix unbound http_bridge on constructor exception; conditional cold-start sleep
- daemon_manager.py: guard negative sleep in backoff loop; add probe_viewer_status() + startup_lock_active() helpers
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

plugin Plugin/adapter/bridge layer (apps/ directory)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants