fix: bridge startup race, dirty-reward orphan traces, RPC loopback auth by chiefmojo · Pull Request #1817 · MemTensor/MemOS

chiefmojo · 2026-05-28T04:52:38Z

Summary

Three targeted fixes from companion infrastructure stabilization:

1. Skip orphan trace insert during dirty-reward recovery (`fc7252fc`)

When dirtyFlag triggers a rescore on episodes with no matching turns, the capture path inserts a trace with r_reward=0 but no valid idParent. This creates orphan rows that inflate the database and break FTS indexing. This fix checks for the orphan condition and skips the trace insert entirely, logging a warning instead.

2. Bridge startup race (`ef311ebb`)

Hermes gateway restarts can race with the MemOS daemon warm-up (~30s). The default 15s poll deadline and 30s session.open timeout cause cascading failures when the bridge isn't ready. This extends the deadline to 45s, adds exponential backoff health probes, and ups the session timeout to 60s. Double-spawn prevention via PID file singleton guard ensures only one bridge process survives a race.

3. RPC loopback auth (`a459a0a5`)

The /api/v1/rpc session exemption was overly broad — any caller could bypass auth. This restricts the exemption to loopback callers only (127.0.0.1 / ::1).

Testing

All three fixes have been running on three companion instances (neuromancer, wintermute, case) for 48+ hours with zero bridge failures, no new orphan traces, and clean RPC auth.

runReflect's orphan-steps fallback was inserting new trace rows whenever a recovered episode's snapshot timestamps didn't match existing DB rows. This caused trace_ids_json to grow on every bridge restart, keeping reward.traceCount != traceIds.length and looping forever — generating 48k+ empty traces and continuous OpenRouter traffic. Guard: when meta.recoveryReason === "dirty_reward_rescore", log and skip the insert instead of writing new rows. Legitimate orphan handling (test paths, genuinely dropped events) is unchanged. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

daemon_manager.py: extend ensure_viewer_daemon health-poll deadline from 15s to 45s to cover cold Node.js starts (tsx compile + SQLite open + FTS warmup). After the first 15s, back off probes from 0.5s to 2s to avoid hammering a slow-starting daemon. __init__.py: increase _open_session timeout in init from 30s to 60s so session.open doesn't time out before the daemon finishes starting. Add a 1s double-spawn guard: when probe_viewer_status() returns "free", wait 1s and re-probe before falling through to ensure_viewer_daemon(). Eliminates the race where a second gateway session spawns a second stdio bridge because the daemon from the first session hasn't bound the port yet. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

The RPC endpoint was unconditionally exempt from session-cookie auth to allow the local Python adapter to connect without a browser session. On hub-mode deployments (bindHost=0.0.0.0) this left the endpoint reachable from the network without any auth when password protection is enabled. Now the exemption only applies when remoteAddress is 127.0.0.1, ::1, or ::ffff:127.0.0.1. Standard installs (bindHost=127.0.0.1) are unaffected; the Python adapter always connects from loopback and continues to work. Network callers on hub deployments must hold a valid session cookie. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Memtensor-AI

Let me review this PR carefully.

Needs changes (minor) — solid fixes overall, a few issues worth addressing

The three fixes are well-motivated and the 48h soak test is reassuring. A few things caught my eye:

1. `init.py` — Duplicated connection logic begs for extraction

The HTTP bridge connect-and-register pattern is copy-pasted 3 times (initial probe, late probe, reconnect). If the handshake ever changes (e.g., a second handler registration), you'll need to patch all three. Extract a helper like _try_http_bridge(session_id, timeout) -> bool.

2. `init.py` — `http_bridge` may be unbound on exception

http_bridge: MemosHttpClient | None = MemosHttpClient()

If MemosHttpClient() itself raises, http_bridge is never assigned, yet the except block calls http_bridge.close(). The type: ignore comment masks this. Safer:

http_bridge: MemosHttpClient | None = None
http_bridge = MemosHttpClient()

Same pattern applies to http_bridge_late.

3. `init.py` — `time.sleep(1.0)` in `initialize`

A hard 1s sleep on every cold start where the viewer port is free adds latency to the common "first launch" path. Consider making this conditional on evidence of a concurrent session (e.g., PID file exists but port not yet open), or at least document why 1s was chosen.

4. `daemon_manager.py` — Backoff heuristic is inverted

time.sleep(0.5 if (deadline - time.time()) > 30 else 2.0)

With a 45s deadline, deadline - time.time() > 30 is true only during the first ~15s. So you poll fast (0.5s) for the first 15s, then slow (2s) for the remaining 30s. That matches the PR description — just confirming the intent is correct. One nit: if the loop body takes non-trivial time, deadline - time.time() could go negative on the last iteration; not harmful, but a max(0, ...) would be clearer.

5. `auth.ts` — Loopback check is good but consider `X-Forwarded-For`

If the server ever sits behind a local reverse proxy (nginx, Caddy), remoteAddress will always be 127.0.0.1 regardless of the true client. This is fine today since the PR description says the binding is 0.0.0.0 directly, but worth a code comment noting the assumption. If a proxy is introduced later, this becomes an auth bypass.

6. `capture.ts` — Coupling to `meta.recoveryReason` magic string

if (input.episode.meta?.recoveryReason === "dirty_reward_rescore") {

If the recovery codepath ever changes the string or omits the field, the orphan loop silently resumes. Consider exporting the constant from the recovery module and importing it here, or at minimum adding a shared RECOVERY_REASONS enum/object.

7. `capture.ts` — Silent skip may hide real orphans during recovery

If a genuine orphan does appear during a dirty-reward rescore (e.g., a race between capture and recovery), it will be silently dropped. The warning log is good, but consider emitting a metric or structured event so monitoring can catch unexpected counts.

What looks good

PID-file singleton guard and zombie cleanup are the right call for the double-spawn problem.
The loopback auth restriction is a meaningful security improvement with minimal complexity.
The dirty-reward orphan fix correctly breaks the infinite recovery loop.
Error handling is consistent (suppress on cleanup, warn on fallback).

Ship it after addressing items 2 and 6; the rest are suggestions.

- auth.ts: add reverse-proxy assumption comment to loopback RPC exemption - capture.ts + memory-core.ts: extract magic string to RECOVERY_REASONS.DIRTY_REWARD_RESCORE in new recovery-constants.ts - memory-core.test.ts: update assertions to use shared constant - __init__.py: extract _connect_http_bridge() helper; fix unbound http_bridge on constructor exception; conditional cold-start sleep - daemon_manager.py: guard negative sleep in backoff loop; add probe_viewer_status() + startup_lock_active() helpers

chiefmojo and others added 3 commits May 27, 2026 21:38

Memtensor-AI added the plugin Plugin/adapter/bridge layer (apps/ directory) label May 28, 2026

Memtensor-AI reviewed May 28, 2026

View reviewed changes

chiefmojo mentioned this pull request May 28, 2026

fix(memos): address review feedback on PRs #1813 and #1817 #1830

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: bridge startup race, dirty-reward orphan traces, RPC loopback auth#1817

fix: bridge startup race, dirty-reward orphan traces, RPC loopback auth#1817
chiefmojo wants to merge 4 commits into
MemTensor:mainfrom
chiefmojo:pr/may27-fixes

chiefmojo commented May 28, 2026

Uh oh!

Memtensor-AI left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

chiefmojo commented May 28, 2026

Summary

1. Skip orphan trace insert during dirty-reward recovery (fc7252fc)

2. Bridge startup race (ef311ebb)

3. RPC loopback auth (a459a0a5)

Testing

Uh oh!

Memtensor-AI left a comment

Choose a reason for hiding this comment

Needs changes (minor) — solid fixes overall, a few issues worth addressing

1. __init__.py — Duplicated connection logic begs for extraction

2. __init__.py — http_bridge may be unbound on exception

3. __init__.py — time.sleep(1.0) in initialize

4. daemon_manager.py — Backoff heuristic is inverted

5. auth.ts — Loopback check is good but consider X-Forwarded-For

6. capture.ts — Coupling to meta.recoveryReason magic string

7. capture.ts — Silent skip may hide real orphans during recovery

What looks good

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

1. Skip orphan trace insert during dirty-reward recovery (`fc7252fc`)

2. Bridge startup race (`ef311ebb`)

3. RPC loopback auth (`a459a0a5`)

1. `init.py` — Duplicated connection logic begs for extraction

2. `init.py` — `http_bridge` may be unbound on exception

3. `init.py` — `time.sleep(1.0)` in `initialize`

4. `daemon_manager.py` — Backoff heuristic is inverted

5. `auth.ts` — Loopback check is good but consider `X-Forwarded-For`

6. `capture.ts` — Coupling to `meta.recoveryReason` magic string

7. `capture.ts` — Silent skip may hide real orphans during recovery