Skip to content

fix stream name mismatch inbox => stream#104

Merged
FirwoodLin merged 2 commits into
mainfrom
fix_stream_mismatch
May 27, 2026
Merged

fix stream name mismatch inbox => stream#104
FirwoodLin merged 2 commits into
mainfrom
fix_stream_mismatch

Conversation

@FirwoodLin
Copy link
Copy Markdown
Collaborator

No description provided.

Copilot AI review requested due to automatic review settings May 27, 2026 15:20
@FirwoodLin FirwoodLin temporarily deployed to self-hosted-rdma May 27, 2026 15:20 — with GitHub Actions Inactive
Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces mailbox stream deletion during cleanup in the Redis repository, tracks the session start time for peer agents, and implements a stale message filtering mechanism in the mailbox. However, a critical issue was identified in the stale message filtering logic: comparing absolute wall-clock timestamps across distributed systems introduces vulnerabilities to clock drift and startup race conditions, which can permanently stall connection handshakes.

Comment thread dlslime/dlslime/peer_agent/_mailbox.py Outdated
Comment on lines +235 to +245
def _is_stale_message(self, fields: Dict[str, str]) -> bool:
raw_timestamp = fields.get("timestamp")
if raw_timestamp is None:
return False
if isinstance(raw_timestamp, bytes):
raw_timestamp = raw_timestamp.decode("utf-8", errors="replace")
try:
timestamp = float(raw_timestamp)
except (TypeError, ValueError):
return False
return timestamp < getattr(self._agent, "_session_started_at", 0.0)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

critical

Critical Bug: Clock Drift and Startup Race Condition in Stale Message Filtering

The introduction of _is_stale_message to filter out messages where timestamp < self._session_started_at introduces two severe issues that can permanently stall the connection handshake:

  1. Startup Rendezvous Violation:
    As documented in the comments for start() (lines 59–63), a peer whose connect_to runs before this agent registers will write qp_ready to our stream before our listener/session starts. In this valid scenario, the message's timestamp will naturally be less than self._session_started_at. By classifying these messages as stale and ignoring them, the agent will fail to establish the connection, stalling the handshake forever.

  2. Distributed Clock Drift:
    The timestamp is generated using the local system clock of the sending peer or the control plane (time.time()), while _session_started_at is generated using the local system clock of the receiving agent. In a distributed environment, even minor clock drift (e.g., if the sender's clock is slightly behind the receiver's clock) will cause newly sent, perfectly valid messages to be incorrectly discarded as stale.

Suggested Remedy:

Instead of using absolute wall-clock timestamps from different machines to detect stale messages, consider using a unique session/run identifier (e.g., a UUID or an incrementing epoch/incarnation number generated during registration) that is passed in the messages. Alternatively, rely on the fact that cleanup_agent already deletes the stream key upon deregistration/restart, or use Redis stream IDs (which are monotonic and generated by the single Redis server clock) to establish a safe boundary.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR updates mailbox cleanup and stale-message handling for Redis stream-based peer handshakes, addressing stream/inbox naming drift and stale RDMA exchange data.

Changes:

  • Adds timestamp-based stale stream message filtering in StreamMailbox.
  • Records agent session start time for stale-message comparisons.
  • Deletes stream:{agent} during controller-side agent cleanup and expands Python startup exchange-key cleanup.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

File Description
dlslime/dlslime/peer_agent/_mailbox.py Adds stale-message detection before handling stream messages.
dlslime/dlslime/peer_agent/_agent.py Tracks session start time and expands stale exchange-key cleanup pattern.
dlslime-ctrl/src/redis_repo.rs Deletes stream mailbox keys alongside legacy inbox keys during cleanup.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread dlslime/dlslime/peer_agent/_mailbox.py Outdated
Comment on lines +236 to +246
raw_timestamp = fields.get("timestamp")
if raw_timestamp is None:
return False
if isinstance(raw_timestamp, bytes):
raw_timestamp = raw_timestamp.decode("utf-8", errors="replace")
try:
timestamp = float(raw_timestamp)
except (TypeError, ValueError):
return False
return timestamp < getattr(self._agent, "_session_started_at", 0.0)

patterns = [
f"{prefix}exchange:{self.alias}:*",
f"{prefix}exchange:*:{self.alias}",
f"{prefix}exchange:*:{self.alias}:*",
@FirwoodLin FirwoodLin merged commit ed422cd into main May 27, 2026
@FirwoodLin FirwoodLin temporarily deployed to self-hosted-rdma May 27, 2026 15:30 — with GitHub Actions Inactive
@CLAassistant
Copy link
Copy Markdown

CLAassistant commented May 27, 2026

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution.
0 out of 2 committers have signed the CLA.

❌ root
❌ FirwoodLin


root seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you have already a GitHub account, please add the email address used for this commit to your account.
You have signed the CLA already but the status is still pending? Let us recheck it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants