Skip to content

fix(qss): track endpoint identity in QSSClient connection coalescing#3210

Open
holmesworcester wants to merge 1 commit into
7.1.0from
fix/qss-client-endpoint-coalescing
Open

fix(qss): track endpoint identity in QSSClient connection coalescing#3210
holmesworcester wants to merge 1 commit into
7.1.0from
fix/qss-client-endpoint-coalescing

Conversation

@holmesworcester
Copy link
Copy Markdown
Collaborator

Summary

This fixes a QSS client connection race where one in-flight endpoint connection can incorrectly satisfy a later request for a different endpoint.

Changes:

  • Preserve same-endpoint coalescing: concurrent requests for ws://qss-a still share the same in-flight promise.
  • Track the endpoint associated with the in-flight connect promise.
  • If a new request targets a different endpoint, abort the stale socket connect and start a fresh socket for the requested endpoint.
  • Add QSSClient regression coverage for both same-endpoint coalescing and endpoint replacement.

Code Path

The vulnerable primitive in 7.1.0 is QSSClient.createSocketAndConnect().

In the base branch:

  • packages/backend/src/nest/qss/qss.client.ts:82-89 stores one global _connectPromise.
  • If _connectPromise exists, the method returns it without checking what endpoint it belongs to.
  • packages/backend/src/nest/qss/qss.client.ts:100-129 then creates a socket for the first requested endpoint and waits for that socket to connect.

That means this sequence is possible inside one client process:

  1. A QSS connect starts for endpoint A.
  2. Before A connects, the user abandons that flow or the app starts a replacement flow, such as leave/recreate, aborted join/retry, or a new invite/server endpoint.
  3. The app asks to connect to endpoint B.
  4. The old code returns endpoint A's pending promise to the endpoint B caller.
  5. If A eventually connects, B's caller receives A's socket and the client believes the replacement flow connected even though it is on the stale server.

This does not require multiple active communities or users in one Quiet client. It only requires one client to replace an in-flight QSS connection attempt before the first attempt settles.

Fix

The fix adds endpoint identity to the in-flight connection state:

  • packages/backend/src/nest/qss/qss.client.ts:34-36 now tracks _connectPromiseEndpoint and an abort controller for the in-flight connect.
  • packages/backend/src/nest/qss/qss.client.ts:84-101 still returns the existing promise when the requested endpoint matches.
  • packages/backend/src/nest/qss/qss.client.ts:87-98 aborts the stale in-flight connect when the requested endpoint differs.
  • packages/backend/src/nest/qss/qss.client.ts:104-118 records and clears the promise/endpoint/abort state together.
  • packages/backend/src/nest/qss/qss.client.ts:166-229 makes _waitForConnect() operate on the specific socket it was given and reject promptly when that connect is aborted.

The key invariant is now: coalesce only when the endpoint is the same; replace when the endpoint changes.

Why This Is A Real Bug Class

This is not a "multiple communities active at once" bug. The realistic version is stale state during transition:

  • A user starts a QSS-backed create/join flow.
  • The endpoint connect is still pending.
  • The user leaves, aborts, retries, or recreates with a different invite/server endpoint.
  • The backend starts a new QSS connection request while the old one has not settled.

The previous implementation had no way to distinguish those requests. That is a real correctness issue because QSS endpoint identity is part of the community/join context. Returning a socket for endpoint A to a caller that requested endpoint B can make subsequent auth/sign-in/log-sync traffic go to the wrong QSS server.

Honest likelihood assessment:

  • This is timing-sensitive. If the first connection settles before the replacement flow starts, the bug does not appear.
  • It is most plausible under slow or flaky network, CPU throttling, QSS startup delay, aborted join/create flows, or leave/recreate flows where cleanup and retry overlap.
  • The failure mode is high impact when hit because the client can proceed using a stale QSS connection for the replacement flow.
  • The POC does not require unsupported multiple active communities. It models one active client replacing an abandoned pending endpoint with a new endpoint.

Regression Test

The new test file is packages/backend/src/nest/qss/qss.client.spec.ts.

It covers two cases:

  • packages/backend/src/nest/qss/qss.client.spec.ts:106-128: concurrent requests for the same endpoint still coalesce onto one socket.
  • packages/backend/src/nest/qss/qss.client.spec.ts:130-160: a newer request for ws://new-qss must not be satisfied by an abandoned pending connection to ws://old-qss.

Red on raw 7.1.0 after adding the test only:

FAIL src/nest/qss/qss.client.spec.ts
QSSClient
  ✓ coalesces concurrent connection requests for the same endpoint (4 ms)
  ✕ does not satisfy a newer endpoint request with an abandoned endpoint connection (1 ms)

expect(jest.fn()).toHaveBeenCalledTimes(expected)
Expected number of calls: 2
Received number of calls: 1

Green after this fix:

PASS src/nest/qss/qss.client.spec.ts
QSSClient
  ✓ coalesces concurrent connection requests for the same endpoint (5 ms)
  ✓ does not satisfy a newer endpoint request with an abandoned endpoint connection (2 ms)

Validation

PATH=/home/holmes/.nvm/versions/node/v20.20.1/bin:$PATH npm run test -- --runInBand --verbose --detectOpenHandles --forceExit src/nest/qss/qss.client.spec.ts

Result:

PASS src/nest/qss/qss.client.spec.ts
Tests: 2 passed, 2 total

Note

Replaces holmesworcester/quiet#2, which targeted the workflow-stripped mirror at holmesworcester/quiet:7.1.0. This PR retargets the same patch directly at upstream TryQuiet/quiet:7.1.0.

Copy link
Copy Markdown
Collaborator

@adrastaea adrastaea left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't fit into the current release. Adds too much complexity for an unrealistic case.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants