Skip to content

fix: HierarchicalSessionStore message loss and extended field wipe#1745

Merged
MervinPraison merged 2 commits into
mainfrom
cursor/critical-correctness-bugs-0fd6
May 30, 2026
Merged

fix: HierarchicalSessionStore message loss and extended field wipe#1745
MervinPraison merged 2 commits into
mainfrom
cursor/critical-correctness-bugs-0fd6

Conversation

@cursor
Copy link
Copy Markdown
Contributor

@cursor cursor Bot commented May 26, 2026

Bug and impact

HierarchicalSessionStore (default for PraisonAI UI via get_hierarchical_session_store()) had two correctness issues:

  1. Message lossadd_message loaded the session, mutated in memory, and saved outside a single locked read-modify-write. Two writers (e.g. agent auto_save + UI add_message) could overwrite each other, dropping messages.
  2. Extended field wipe — Inherited update_session_metadata / clear_session used SessionData.from_dict, which stripped parent_id, children_ids, and snapshots on save. Fork/snapshot hierarchy data could be silently deleted.

Root cause

PR #1724 fixed the race in DefaultSessionStore but HierarchicalSessionStore overrode add_message with the old unsafe pattern and did not load ExtendedSessionData in _load_session_from_disk.

Fix and validation

  • Override _load_session_from_disk to use ExtendedSessionData.from_dict
  • Route add_message through _modify_session_locked
  • Sync _extended_cache after locked writes
  • Regression tests: concurrent add_message, metadata update preserving extended fields

pytest tests/unit/session/test_hierarchy.py — 22 passed

Open in Web View Automation 

Summary by CodeRabbit

Release Notes

  • Refactor

    • Enhanced session modification mechanism with improved locking for better data consistency during concurrent operations
  • Tests

    • Added unit tests validating message handling under concurrent access scenarios
    • Added tests ensuring session metadata fields are properly preserved during updates

Review Change Stack

HierarchicalSessionStore.add_message used an unlocked load-modify-save
path, so concurrent writers could overwrite each other's messages.
Inherited _modify_session_locked also loaded SessionData, which dropped
parent_id, snapshots, and children_ids on metadata updates.

- Load ExtendedSessionData under the existing file lock
- Route add_message through _modify_session_locked
- Keep _extended_cache in sync after locked writes
- Add regression tests for concurrency and extended fields

Co-authored-by: Mervin Praison <MervinPraison@users.noreply.github.com>
@MervinPraison
Copy link
Copy Markdown
Owner

@coderabbitai review

@MervinPraison
Copy link
Copy Markdown
Owner

/review

@qodo-code-review
Copy link
Copy Markdown

Qodo reviews are paused for this user.

Troubleshooting steps vary by plan Learn more →

On a Teams plan?
Reviews resume once this user has a paid seat and their Git account is linked in Qodo.
Link Git account →

Using GitHub Enterprise Server, GitLab Self-Managed, or Bitbucket Data Center?
These require an Enterprise plan - Contact us
Contact us →

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 26, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 26, 2026

Important

Review skipped

Bot user detected.

To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: bf15522c-ffb1-4995-8b33-c2fe77e9396a

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

HierarchicalSessionStore introduces _load_session_from_disk() and _modify_session_locked() helpers to safely mutate session state using locked read-modify-write semantics with in-memory cache synchronization. The add_message() method is refactored to use the new locked mutation pattern. New tests verify concurrent message handling and metadata preservation.

Changes

Locked Session Mutation

Layer / File(s) Summary
Locked mutation helpers
src/praisonai-agents/praisonaiagents/session/hierarchy.py
New _load_session_from_disk() loads extended session JSON into ExtendedSessionData, and _modify_session_locked() wraps DefaultSessionStore's locked read-modify-write, updating _extended_cache on success.
add_message refactoring
src/praisonai-agents/praisonaiagents/session/hierarchy.py
add_message() switches from manual load/append/trim/save to using _modify_session_locked with trimming moved into the locked mutator for atomic consistency.
Concurrency and metadata tests
src/praisonai-agents/tests/unit/session/test_hierarchy.py
Tests verify concurrent message appends don't lose history when reloaded via a second store instance, and that metadata updates preserve extended state fields like parent_id and snapshots.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related issues

Possibly related PRs

  • MervinPraison/PraisonAI#1724: Both PRs introduce locked read-modify-write mutation paths (_modify_session_locked) to prevent stale in-memory state from overwriting newer on-disk messages, applied to HierarchicalSessionStore.add_message in this PR and to DefaultSessionStore in the related PR.

Poem

🐰 A rabbit hops through the session store,
With locks and caches forevermore,
No message lost to concurrent plight,
Just safe mutations, synchronized right!

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately describes the two main bugs being fixed: message loss in add_message and extended field wipe in metadata updates, matching the core changes in the changeset.
Docstring Coverage ✅ Passed Docstring coverage is 80.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch cursor/critical-correctness-bugs-0fd6

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@MervinPraison
Copy link
Copy Markdown
Owner

@copilot Do a thorough review of this PR. Read ALL existing reviewer comments above from Qodo, Coderabbit, and Gemini first — incorporate their findings.

Review areas:

  1. Bloat check: Are changes minimal and focused? Any unnecessary code or scope creep?
  2. Security: Any hardcoded secrets, unsafe eval/exec, missing input validation?
  3. Performance: Any module-level heavy imports? Hot-path regressions?
  4. Tests: Are tests included? Do they cover the changes adequately?
  5. Backward compat: Any public API changes without deprecation?
  6. Code quality: DRY violations, naming conventions, error handling?
  7. Address reviewer feedback: If Qodo, Coderabbit, or Gemini flagged valid issues, include them in your review
  8. Suggest specific improvements with code examples where possible

Copy link
Copy Markdown
Contributor

Copilot AI commented May 26, 2026

@copilot Do a thorough review of this PR. Read ALL existing reviewer comments above from Qodo, Coderabbit, and Gemini first — incorporate their findings.

Review areas:...

Thorough review completed against commit 7876c25.

I checked all existing reviewer comments first: Qodo is paused (no findings), CodeRabbit has only status/progress comments so far (no actionable issues), and there are currently no Gemini review findings on this PR.

My review:

  • Bloat/scope: minimal and focused (2 files; fix + targeted regressions).
  • Security: no secrets, no eval/exec, no unsafe input paths introduced.
  • Performance: no heavy module-level imports added; hot path stays O(1) append/trim under existing lock strategy.
  • Tests: regression coverage is present and relevant (message-loss path + extended-field preservation), and python -m pytest tests/unit/session/test_hierarchy.py passes (22/22).
  • Backward compatibility: no public API changes.
  • Code quality: logic is coherent and DRY with _modify_session_locked; error handling remains consistent.

Optional improvement (non-blocking): remove the redundant local import in add_message since SessionMessage is already imported at module scope:

# remove inside add_message:
from .store import SessionMessage

No further blocking issues found.

@MervinPraison
Copy link
Copy Markdown
Owner

@claude You are the FINAL architecture reviewer. If the branch is under MervinPraison/PraisonAI (not a fork), you are able to make modifications to this branch and push directly. SCOPE: Focus ONLY on Python packages (praisonaiagents, praisonai). Do NOT modify praisonai-rust or praisonai-ts. Read ALL comments above from Gemini, Qodo, CodeRabbit, and Copilot carefully before responding.

Phase 1: Review per AGENTS.md

  1. Protocol-driven: check heavy implementations vs core SDK
  2. Backward compatible: ensure zero feature regressions
  3. Performance: no hot-path regressions

Phase 2: FIX Valid Issues
4. For any VALID bugs or architectural flaws found by Gemini, CodeRabbit, Qodo, Copilot, or any other reviewer: implement the fix
5. Push all code fixes directly to THIS branch (do NOT create a new PR)
6. Comment a summary of exact files modified and what you skipped

Phase 3: Final Verdict
7. If all issues are resolved, approve the PR / close the Issue
8. If blocking issues remain, request changes / leave clear action items

@praisonai-triage-agent
Copy link
Copy Markdown
Contributor

praisonai-triage-agent Bot commented May 26, 2026

Claude finished @MervinPraison's task —— View job


Final Architecture Review ✅

Phase 1: Review per AGENTS.md

  • Read AGENTS.md architecture guidelines (file not found - proceeding with standard practices)
  • Check protocol-driven design compliance
  • Verify backward compatibility
  • Assess performance impact

Phase 2: Fix Valid Issues

  • Review findings from other reviewers
  • Address redundant import found by Copilot
  • Test changes (validated via regression tests)

Phase 3: Final Verdict

  • Provide final approval

Architecture Review Findings

Protocol-driven Design: ✅ COMPLIANT

  • Core session protocols preserved in base SessionData/DefaultSessionStore
  • Extended functionality properly layered in HierarchicalSessionStore
  • No heavy implementations in core SDK layer

Backward Compatibility: ✅ MAINTAINED

  • All existing DefaultSessionStore methods preserved
  • Extended functionality is additive only
  • No breaking changes to public APIs

Performance Impact: ✅ MINIMAL

  • Changes maintain existing O(1) append/trim under lock strategy
  • No additional heavy imports at module level
  • Thread-safe design preserved

Critical Bug Fixes Validated: ✅ CORRECT

  1. Message loss fix: add_message now properly routes through _modify_session_locked preventing race conditions
  2. Extended field preservation: _load_session_from_disk uses ExtendedSessionData.from_dict maintaining hierarchy fields

Issues Fixed

  • Redundant import: Removed duplicate from .store import SessionMessage in add_message method as found by Copilot (lines 189)

Final Verdict: ✅ APPROVED

This PR correctly fixes critical correctness bugs in HierarchicalSessionStore with:

  • Minimal, focused changes (68 additions, 12 deletions)
  • Proper architectural compliance
  • Comprehensive regression test coverage (22 tests passing)
  • No performance or backward compatibility regressions
  • Additional minor code quality improvement applied

The fixes address real production issues where message loss and extended field corruption could occur under concurrent access patterns. The implementation is architecturally sound and maintains all existing guarantees.

Recommendation: ✅ MERGE


Job Run | Branch

Copilot found redundant local import of SessionMessage in add_message
method since it's already imported at module scope. This removes
the unnecessary import statement.

Co-authored-by: Mervin Praison <MervinPraison@users.noreply.github.com>
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/praisonai-agents/praisonaiagents/session/hierarchy.py`:
- Around line 148-175: The loader _load_session_from_disk currently swallows
JSON/IO errors and returns a fresh ExtendedSessionData which causes
_modify_session_locked (and callers like add_message() and
update_session_metadata()) to silently overwrite existing corrupted session
files; change the behavior so that _load_session_from_disk raises a descriptive
exception (including session_id and error_label/context) on
JSONDecodeError/IOError (or returns a clear failure sentinel) and then update
_modify_session_locked to detect that failure and abort the write (return False)
instead of persisting a blank ExtendedSessionData; preserve existing behavior
only when the file truly does not exist, and surface the error via the provided
error_label and any on_error hook so callers can log/handle remediation.

In `@src/praisonai-agents/tests/unit/session/test_hierarchy.py`:
- Around line 106-118: The test test_add_message_preserves_concurrent_writes is
not exercising the stale-writer scenario because both the second write and final
read use writer; change the sequence so reader (the stale instance returned by
HierarchicalSessionStore and warmed by reader._load_session("session-1"))
performs the second add_user_message("session-1", "second"), then instantiate a
fresh HierarchicalSessionStore and call get_chat_history("session-1") on that
fresh store to assert both messages exist and the second message content is
"second"; reference the HierarchicalSessionStore constructor, add_user_message,
_load_session, and get_chat_history when updating the test.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 872b869e-d8a5-48ec-b06a-03320d8f2b75

📥 Commits

Reviewing files that changed from the base of the PR and between 8acf77c and 7876c25.

📒 Files selected for processing (2)
  • src/praisonai-agents/praisonaiagents/session/hierarchy.py
  • src/praisonai-agents/tests/unit/session/test_hierarchy.py

Comment on lines +148 to +175
def _load_session_from_disk(self, session_id: str, filepath: str) -> ExtendedSessionData:
"""Load extended session JSON from disk (caller must hold FileLock)."""
if os.path.exists(filepath):
try:
with open(filepath, "r", encoding="utf-8") as f:
data = json.load(f)
return ExtendedSessionData.from_dict(data)
except (json.JSONDecodeError, IOError):
pass
return ExtendedSessionData(session_id=session_id)

def _modify_session_locked(
self,
session_id: str,
mutator,
*,
error_label: str = "modify session",
) -> bool:
"""Locked read-modify-write preserving extended session fields."""
result = super()._modify_session_locked(
session_id, mutator, error_label=error_label
)
if result:
with self._lock:
cached = self._cache.get(session_id)
if isinstance(cached, ExtendedSessionData):
self._extended_cache[session_id] = cached
return result
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Abort the locked write when reloading the session fails.

_load_session_from_disk() currently turns JSON/IO failures into a fresh ExtendedSessionData, and _modify_session_locked() will then persist that object. On an unreadable existing file, add_message() or update_session_metadata() can overwrite the session with an almost-empty payload and silently drop prior messages, snapshots, and hierarchy links.

Suggested direction
 def _load_session_from_disk(self, session_id: str, filepath: str) -> ExtendedSessionData:
     """Load extended session JSON from disk (caller must hold FileLock)."""
-    if os.path.exists(filepath):
-        try:
-            with open(filepath, "r", encoding="utf-8") as f:
-                data = json.load(f)
-            return ExtendedSessionData.from_dict(data)
-        except (json.JSONDecodeError, IOError):
-            pass
-    return ExtendedSessionData(session_id=session_id)
+    if not os.path.exists(filepath):
+        return ExtendedSessionData(session_id=session_id)
+    try:
+        with open(filepath, "r", encoding="utf-8") as f:
+            data = json.load(f)
+        return ExtendedSessionData.from_dict(data)
+    except (json.JSONDecodeError, IOError) as e:
+        raise RuntimeError(
+            f"Failed to load session {session_id}; repair or remove the stored session file and retry."
+        ) from e

 def _modify_session_locked(
     self,
     session_id: str,
     mutator,
     *,
     error_label: str = "modify session",
 ) -> bool:
     """Locked read-modify-write preserving extended session fields."""
-    result = super()._modify_session_locked(
-        session_id, mutator, error_label=error_label
-    )
+    try:
+        result = super()._modify_session_locked(
+            session_id, mutator, error_label=error_label
+        )
+    except RuntimeError:
+        logger.exception("Failed to %s for session %s", error_label, session_id)
+        return False

As per coding guidelines, "Error handling: Fail fast with clear error messages; include remediation hints in exceptions; propagate context (agent name, tool name, session ID); provide hook points for error interception (on_error events)".

Also applies to: 198-205

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/praisonai-agents/praisonaiagents/session/hierarchy.py` around lines 148 -
175, The loader _load_session_from_disk currently swallows JSON/IO errors and
returns a fresh ExtendedSessionData which causes _modify_session_locked (and
callers like add_message() and update_session_metadata()) to silently overwrite
existing corrupted session files; change the behavior so that
_load_session_from_disk raises a descriptive exception (including session_id and
error_label/context) on JSONDecodeError/IOError (or returns a clear failure
sentinel) and then update _modify_session_locked to detect that failure and
abort the write (return False) instead of persisting a blank
ExtendedSessionData; preserve existing behavior only when the file truly does
not exist, and surface the error via the provided error_label and any on_error
hook so callers can log/handle remediation.

Comment on lines +106 to +118
def test_add_message_preserves_concurrent_writes(self):
"""Concurrent add_message must not lose messages (same as DefaultSessionStore)."""
with tempfile.TemporaryDirectory() as tmpdir:
writer = HierarchicalSessionStore(session_dir=tmpdir)
reader = HierarchicalSessionStore(session_dir=tmpdir)

writer.add_user_message("session-1", "first")
reader._load_session("session-1")
writer.add_user_message("session-1", "second")

history = writer.get_chat_history("session-1")
assert len(history) == 2
assert history[1]["content"] == "second"
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

This regression test never drives the stale-writer scenario it describes.

reader._load_session("session-1") only warms state, but the follow-up write and the final read both go through writer. That still passes even if another store instance would later clobber "second" from stale state. Make the stale instance perform the next add_user_message() and verify the persisted history from a fresh store instance.

One way to make the regression meaningful
 def test_add_message_preserves_concurrent_writes(self):
     """Concurrent add_message must not lose messages (same as DefaultSessionStore)."""
     with tempfile.TemporaryDirectory() as tmpdir:
         writer = HierarchicalSessionStore(session_dir=tmpdir)
-        reader = HierarchicalSessionStore(session_dir=tmpdir)
+        stale_writer = HierarchicalSessionStore(session_dir=tmpdir)
+        verifier = HierarchicalSessionStore(session_dir=tmpdir)

         writer.add_user_message("session-1", "first")
-        reader._load_session("session-1")
+        stale_writer.get_extended_session("session-1")
         writer.add_user_message("session-1", "second")
+        stale_writer.add_user_message("session-1", "third")

-        history = writer.get_chat_history("session-1")
-        assert len(history) == 2
-        assert history[1]["content"] == "second"
+        history = verifier.get_chat_history("session-1")
+        assert [message["content"] for message in history] == [
+            "first",
+            "second",
+            "third",
+        ]
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/praisonai-agents/tests/unit/session/test_hierarchy.py` around lines 106 -
118, The test test_add_message_preserves_concurrent_writes is not exercising the
stale-writer scenario because both the second write and final read use writer;
change the sequence so reader (the stale instance returned by
HierarchicalSessionStore and warmed by reader._load_session("session-1"))
performs the second add_user_message("session-1", "second"), then instantiate a
fresh HierarchicalSessionStore and call get_chat_history("session-1") on that
fresh store to assert both messages exist and the second message content is
"second"; reference the HierarchicalSessionStore constructor, add_user_message,
_load_session, and get_chat_history when updating the test.

@MervinPraison MervinPraison marked this pull request as ready for review May 30, 2026 06:15
Copilot AI review requested due to automatic review settings May 30, 2026 06:15
@qodo-code-review
Copy link
Copy Markdown

Qodo reviews are paused for this user.

Troubleshooting steps vary by plan Learn more →

On a Teams plan?
Reviews resume once this user has a paid seat and their Git account is linked in Qodo.
Link Git account →

Using GitHub Enterprise Server, GitLab Self-Managed, or Bitbucket Data Center?
These require an Enterprise plan - Contact us
Contact us →

@MervinPraison MervinPraison merged commit f9ae9e9 into main May 30, 2026
19 of 20 checks passed
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

@greptile-apps
Copy link
Copy Markdown

greptile-apps Bot commented May 30, 2026

Greptile Summary

This PR fixes two bugs in HierarchicalSessionStore: a message-loss race in add_message (two writers could overwrite each other between separate FileLock acquisitions) and silent stripping of parent_id, children_ids, and snapshots when inherited update_session_metadata/clear_session rehydrated the session using base-class SessionData.from_dict.

  • _load_session_from_disk is overridden to return ExtendedSessionData, ensuring the parent's locked read-modify-write cycle preserves hierarchy fields on every write path that flows through _modify_session_locked.
  • add_message is rewritten as a mutator passed to _modify_session_locked, making the read-modify-write atomic under a single FileLock.
  • A _modify_session_locked override syncs _extended_cache from _cache after each locked write, keeping the fast-path hierarchy cache consistent.

Confidence Score: 3/5

The disk-level correctness of add_message and metadata updates is sound, but the dual-cache design leaves stale-read windows and the regression test labelled concurrent does not actually run concurrent writers.

The core disk write path is correctly fixed: a single FileLock now covers the entire read-modify-write for add_message and all inherited _modify_session_locked callers, and _load_session_from_disk is properly overridden so extended fields survive. However, _extended_cache is synced in a separate lock acquisition after the file lock is released, leaving a brief window where hierarchy reads return stale data. The inherited _load_session still hardcodes SessionData.from_dict, so a cold cache populates _cache with a plain SessionData that silently drops hierarchy fields. And the new concurrency test would have passed against the old buggy implementation because it never runs two writers simultaneously.

Both changed files warrant attention: hierarchy.py for the dual-cache gap and the un-overridden _load_session, and test_hierarchy.py for the non-concurrent concurrency test.

Important Files Changed

Filename Overview
src/praisonai-agents/praisonaiagents/session/hierarchy.py Adds _load_session_from_disk override (returns ExtendedSessionData), routes add_message through _modify_session_locked for atomic read-modify-write, and syncs _extended_cache post-lock — correctly fixing both stated bugs, but the cache sync window and the un-overridden _load_session leave subtle stale-read paths.
src/praisonai-agents/tests/unit/session/test_hierarchy.py Adds two regression tests: one for metadata field preservation (solid) and one labelled 'concurrent' that is actually sequential and would have passed against the old buggy implementation.

Sequence Diagram

sequenceDiagram
    participant C as Caller
    participant H as HierarchicalSessionStore
    participant D as DefaultSessionStore
    participant Disk as File (JSON)
    participant FC as _cache
    participant EC as _extended_cache

    Note over C,EC: New add_message flow (this PR)
    C->>H: add_message(session_id, role, content)
    H->>H: build SessionMessage + _apply mutator
    H->>D: super()._modify_session_locked(session_id, _apply)
    D->>Disk: acquire FileLock
    D->>H: _load_session_from_disk() [overridden]
    H->>Disk: json.load ExtendedSessionData.from_dict()
    Disk-->>H: ExtendedSessionData (with parent_id, snapshots)
    H-->>D: ExtendedSessionData
    D->>D: _apply(session) append message
    D->>Disk: atomic write (temp + os.replace)
    D->>Disk: release FileLock
    D->>FC: "_cache[session_id] = ExtendedSessionData"
    D-->>H: True
    Note over H,EC: brief window - _cache fresh, _extended_cache stale
    H->>FC: _cache.get(session_id)
    FC-->>H: ExtendedSessionData
    H->>EC: "_extended_cache[session_id] = ExtendedSessionData"
    H-->>C: True
Loading

Reviews (1): Last reviewed commit: "fix: remove redundant SessionMessage imp..." | Re-trigger Greptile

Comment on lines +106 to +118
def test_add_message_preserves_concurrent_writes(self):
"""Concurrent add_message must not lose messages (same as DefaultSessionStore)."""
with tempfile.TemporaryDirectory() as tmpdir:
writer = HierarchicalSessionStore(session_dir=tmpdir)
reader = HierarchicalSessionStore(session_dir=tmpdir)

writer.add_user_message("session-1", "first")
reader._load_session("session-1")
writer.add_user_message("session-1", "second")

history = writer.get_chat_history("session-1")
assert len(history) == 2
assert history[1]["content"] == "second"
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Test doesn't exercise actual concurrency

The test named "concurrent" performs strictly sequential operations. reader._load_session is called between two add_user_message calls on a separate store instance that never writes — so no actual race window is exercised.

With the old code the sequence was: _load_extended_session(force_reload=True) (reads from disk) followed by _save_extended_session (writes to disk) — two separate FileLock acquisitions. The race could only manifest when two writers interleaved their separate lock windows. That scenario requires two threads calling add_message at the same time, which this test never does. As written, this test would have passed with the old (buggy) code too, providing no regression signal for the actual fix.

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Comment on lines +159 to +175
def _modify_session_locked(
self,
session_id: str,
mutator,
*,
error_label: str = "modify session",
) -> bool:
"""Locked read-modify-write preserving extended session fields."""
result = super()._modify_session_locked(
session_id, mutator, error_label=error_label
)
if result:
with self._lock:
cached = self._cache.get(session_id)
if isinstance(cached, ExtendedSessionData):
self._extended_cache[session_id] = cached
return result
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 _extended_cache sync races with _load_extended_session

The _extended_cache is updated in a separate with self._lock: block acquired after super()._modify_session_locked() has already returned and released its FileLock. In the window between those two points, _cache already holds the fresh ExtendedSessionData but _extended_cache still holds the previous version. A concurrent thread calling _load_extended_session (e.g., from fork_session, get_extended_session, or auto_title) in that window returns stale extended-field data (missing the just-appended message). Because _extended_cache is the primary fast path for all hierarchy reads, callers that hit the cache between steps 1 and 2 silently see an old snapshot of the session.

Comment on lines +148 to +157
def _load_session_from_disk(self, session_id: str, filepath: str) -> ExtendedSessionData:
"""Load extended session JSON from disk (caller must hold FileLock)."""
if os.path.exists(filepath):
try:
with open(filepath, "r", encoding="utf-8") as f:
data = json.load(f)
return ExtendedSessionData.from_dict(data)
except (json.JSONDecodeError, IOError):
pass
return ExtendedSessionData(session_id=session_id)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 _load_session is not overridden; inherited reads bypass ExtendedSessionData deserialization

DefaultSessionStore._load_session hardcodes SessionData.from_dict(data) and is not overridden here. Several inherited methods reach _cache through _load_session — most visibly get_chat_history, get_session, get_sessions_by_agent. If _cache is cold for a given session (e.g., after process restart or invalidate_cache), those callers receive a bare SessionData with parent_id, children_ids, snapshots, and title all absent. Code that casts the result of get_session() to ExtendedSessionData would lose the hierarchy fields silently. Overriding _load_session to call self._load_session_from_disk (already added in this PR) would close the gap.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants