fix(session): prevent concurrent commit re-committing old messages #783
deepakdevp wants to merge 4 commits into volcengine:main
commit_async() now acquires an asyncio.Lock during Phase 1 (copy + clear + file write). This prevents concurrent commits from re-committing the same messages. The lock is released before the slow LLM summary and memory extraction, so it doesn't block other operations.

The phase order is changed: live messages are cleared BEFORE the archive summary is generated, closing the race window where a second commit could see stale data. If the file-clear fails, messages are rolled back.

Fixes volcengine#580.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
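The commit message above can be sketched roughly as follows. This is a minimal illustration, not the actual `Session` code: `SketchSession`, `_clear_live_file`, `_summarize`, and the `fail_clear` hook are hypothetical names, and the archive is just an in-memory list.

```python
import asyncio


class SketchSession:
    """Hypothetical stand-in for the real Session; all names are illustrative."""

    def __init__(self):
        self.messages = []        # live, uncommitted messages
        self.archives = []        # one entry per completed commit
        self._commit_lock = asyncio.Lock()
        self.fail_clear = False   # hook to simulate a failed file clear

    async def _clear_live_file(self):
        # Placeholder for truncating the on-disk message log.
        if self.fail_clear:
            raise OSError("simulated clear failure")

    async def _summarize(self, batch):
        # Placeholder for the slow LLM summary call.
        await asyncio.sleep(0.01)
        return f"summary of {len(batch)} messages"

    async def commit_async(self):
        # Phase 1: copy + clear under the lock, so no other commit sees this batch.
        async with self._commit_lock:
            if not self.messages:
                return None
            batch = list(self.messages)
            self.messages.clear()
            try:
                await self._clear_live_file()
            except Exception:
                # Roll back: restore the batch ahead of anything added meanwhile.
                self.messages[:0] = batch
                raise
        # Phase 2: the slow work runs after the lock has been released.
        summary = await self._summarize(batch)
        self.archives.append((batch, summary))
        return summary
```

Because the live list is emptied before the slow step starts, a second caller that takes the lock afterwards finds nothing to commit and returns early.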
Tests verify that:
- Two concurrent commit_async() calls on the same session produce exactly one archive (the other returns early)
- Messages added while a commit is running are preserved in the session and not lost or re-committed

Part of fix for volcengine#580.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
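The two checks described above might look something like this. It is a sketch under assumptions: `TinySession` is a hypothetical in-memory stand-in, and the real tests in the PR are not shown in this thread.

```python
import asyncio


class TinySession:
    """Hypothetical minimal session used only to illustrate the two checks."""

    def __init__(self):
        self.messages = []
        self.archives = []
        self._lock = asyncio.Lock()

    async def commit_async(self):
        async with self._lock:
            if not self.messages:
                return False            # nothing to archive: return early
            batch, self.messages = self.messages, []
        await asyncio.sleep(0.01)       # stands in for the slow summary step
        self.archives.append(batch)
        return True


async def check_single_archive():
    session = TinySession()
    session.messages = ["m1", "m2"]
    results = await asyncio.gather(session.commit_async(), session.commit_async())
    return session, results


async def check_late_message_preserved():
    session = TinySession()
    session.messages = ["m1"]
    task = asyncio.create_task(session.commit_async())
    await asyncio.sleep(0)              # let the commit get through Phase 1
    session.messages.append("late")     # arrives during the slow step
    await task
    return session
```

In the second check the late message should stay in the live list and stay out of the archive that was already in flight.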
Thanks for the contribution! The race condition analysis is spot-on: there is indeed a window during

Existing protections

The session commit path already has several layers of protection:

What's missing is exactly what you identified: Phase 1 atomicity, the gap between copying messages and clearing them, with a slow LLM summary call in between.

Why
Replaces the in-process asyncio.Lock with the existing PathLock (distributed filesystem lock via LockContext) for Phase 1 of commit_async(). This ensures commit serialization works across multiple HTTP workers and service instances, not just within a single Python process. Addresses review feedback from qin-ctx on PR volcengine#783.
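To illustrate why a filesystem-level lock can serialize callers that an in-process `asyncio.Lock` cannot, here is a small sketch built on exclusive file creation. It is not the project's `PathLock` or `LockContext`; `lock_file`, `try_commit`, and `demo` are hypothetical helpers, and the approach assumes a filesystem where `O_CREAT | O_EXCL` is atomic.

```python
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def lock_file(path):
    """Hypothetical stand-in for a path lock; not the project's PathLock."""
    # O_CREAT | O_EXCL raises FileExistsError if the file is already there,
    # so any process sharing the filesystem sees the same lock state.
    fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    try:
        yield
    finally:
        os.close(fd)
        os.remove(path)


def try_commit(lock_path):
    try:
        with lock_file(lock_path):
            return "committed"
    except FileExistsError:
        return "busy"


def demo():
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "session.lock")
        with lock_file(path):
            held = try_commit(path)   # someone else holds the lock
        free = try_commit(path)       # lock has been released
    return held, free
```

A production lock would also need handling for stale lock files left by crashed holders, which this sketch omits.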
Thanks @qin-ctx for the thorough review and the suggestion! You're absolutely right that

I've replaced it with

Changes in the latest push:

Please take another look when you get a chance!
    # Use filesystem-based distributed lock so this works across workers/processes.
    session_path = self._viking_fs._uri_to_path(self._session_uri, ctx=self.ctx)
    async with LockContext(get_lock_manager(), [session_path], lock_mode="point"):
        if not self._messages:
Small thing — the old code returned early on empty _messages before acquiring any lock. Now every commit_async() call on an empty session takes a filesystem PathLock just to check the list length and return. Probably negligible in practice, but if something is calling commit_async frequently (e.g. a keep-alive or periodic flush), the lock contention could add up. Worth moving the empty check back above the async with, or is there a reason it needs to be inside the lock now?
Good catch — applied a double-check locking pattern in 2f85807. A fast pre-check for not self._messages now sits above the async with LockContext, so empty sessions skip the filesystem lock entirely (common case, zero I/O). The authoritative check inside the lock is still there to handle the race where two concurrent callers both pass the pre-check but only the first should archive. Thanks for the suggestion!
Add a double-check locking pattern to commit_async(): a fast pre-check for empty _messages before acquiring the PathLock, with an authoritative check inside the lock to handle concurrent callers.

This avoids unnecessary filesystem lock acquisition (and the associated AGFS .path.ovlock round-trip) for the common case where commit_async() is called on a session with no pending messages.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
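The double-check pattern described in this commit can be sketched as below. The code is illustrative only: `CountingLock` is a hypothetical wrapper around `asyncio.Lock` that simulates an I/O round-trip on acquisition, standing in for the real filesystem lock.

```python
import asyncio


class CountingLock:
    """Hypothetical lock that counts acquisitions of an 'expensive' lock."""

    def __init__(self):
        self._lock = asyncio.Lock()
        self.acquisitions = 0

    async def __aenter__(self):
        await asyncio.sleep(0)      # stands in for the lock's I/O round-trip
        await self._lock.acquire()
        self.acquisitions += 1

    async def __aexit__(self, *exc):
        self._lock.release()


async def commit(messages, lock, archives):
    if not messages:                # fast pre-check: empty sessions skip the lock
        return False
    async with lock:
        if not messages:            # authoritative check: another caller won
            return False
        archives.append(list(messages))
        messages.clear()
        return True


async def demo():
    lock, archives = CountingLock(), []
    empty_result = await commit([], lock, archives)
    after_empty = lock.acquisitions
    msgs = ["a"]
    results = await asyncio.gather(
        commit(msgs, lock, archives), commit(msgs, lock, archives)
    )
    return empty_result, after_empty, results, archives, lock.acquisitions
```

In `demo`, both concurrent callers pass the pre-check before either holds the lock, so the check inside the lock is what stops the second one from archiving the same messages.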
Summary
- Add asyncio.Lock to Session to serialize concurrent commit_async() calls
- Reorder commit_async() to clear live messages before the slow LLM summary generation, closing the race window where a second commit could see stale data

Fixes #580

Root Cause

commit_async() had no synchronization. When called concurrently, both calls would copy the same self._messages, generate separate archives, and trigger duplicate memory extraction. The race window spanned the entire LLM summary generation (seconds), during which the live messages.jsonl still contained the old messages.

Changes Made

- openviking/session/session.py:
  - Add self._commit_lock = asyncio.Lock() to Session.__init__
  - async with self._commit_lock
- tests/session/test_session_commit_race.py (new): 2 tests

Type of Change
Testing
🤖 Generated with Claude Code