Commit 48d83cb
committed
drpcmanager: fix context cancellation error for unary RPCs
Fix a race condition where a unary RPC with a cancelled context could
return io.EOF instead of codes.Canceled. Two changes, mirroring how
gRPC handles this:
1. Early ctx.Err() check in NewClientStream before creating the stream.
2. Deferred stream.CheckCancelError in doInvoke to convert io.EOF to
the cancel error if the stream was cancelled mid-operation.
The problem:
With multiplexing, each stream gets a manageStream goroutine that
watches ctx.Done() and calls SendCancel + Cancel when the context is
cancelled. This races with doInvoke, which writes invoke and message
frames through the same stream. The race has three outcomes depending
on who acquires the stream's write lock first:
1. doInvoke wins the lock, completes all writes, and the RPC succeeds
even though it should have been cancelled.
2. SendCancel wins, sets send=io.EOF before doInvoke runs.
rawWriteLocked sees send.IsSet() and returns io.EOF. Invoke's
ToRPCErr passes io.EOF through unchanged, so the caller gets the
wrong error code.
3. doInvoke finishes writes, then MsgRecv sees the cancellation and
returns codes.Canceled. This is the correct outcome but only
happens by luck of timing.
Why this didn't happen before multiplexing:
The old single-stream manager used a non-blocking SendCancel that
returned (busy=true) when the write lock was held by an in-progress
write. With SoftCancel=false (the default), the fallback path was:
manageStream calls stream.Cancel(ctx.Err()). The stream is not
finished because doInvoke holds the write lock, so the manager calls
m.terminate(), which closes the entire transport. The in-flight
Writer.Write() fails with an IO error, and checkCancelError sees
cancel.IsSet() and returns context.Canceled.
The correct error surfaced, but through connection termination. This was
fine in single-stream mode where one stream is one connection. With
multiplexing, we cannot terminate the entire connection for one stream's
cancellation. The new SendCancel blocks on the write lock to guarantee
the cancel frame is sent, and that introduced this race.
How gRPC handles this (verified against grpc-go source):
gRPC uses two mechanisms. First, newAttemptLocked (stream.go:408) checks
cs.ctx.Err() before creating the transport stream. This catches the
already-cancelled case without allocating resources. Second, for unary
RPCs, csAttempt.sendMsg (stream.go:1092) swallows write errors and
returns nil when !cs.desc.ClientStreams. The real error always surfaces
from RecvMsg, which detects context cancellation via
recvBufferReader.readClient (transport.go:239) and returns
status.Error(codes.Canceled, ...). This means gRPC never returns io.EOF
from a unary RPC because it never short-circuits on a send error.
For streaming RPCs, gRPC returns io.EOF from Send() after cancel (the
stream is done for writing) and codes.Canceled from Recv() (the actual
reason). Our grpccompat tests confirm this by comparing gRPC and DRPC
error results for identical cancel scenarios.
Our fix:
Rather than restructuring doInvoke to swallow send errors like gRPC, we
use the stream's existing CheckCancelError mechanism.
NewClientStream checks ctx.Err() before creating the stream. This
mirrors gRPC's newAttemptLocked check and avoids wasting a stream ID,
spawning a goroutine, and allocating stream resources.
doInvoke defers stream.CheckCancelError on its return value. If any
operation in doInvoke fails because SendCancel won the write lock race
(returning io.EOF via the send signal), CheckCancelError replaces it
with the cancel signal's error (context.Canceled). This is the same
function the stream already uses internally for transport write
failures. CheckCancelError is exported (was checkCancelError) so that
doInvoke in the drpcconn package can call it.
On TOCTOU:
The NewClientStream check is technically TOCTOU: the context could be
cancelled immediately after the check passes. This is acceptable because
Go's context cancellation model is cooperative, not preemptive. The
context package provides Done() "for use in select statements," and
operations check at natural boundaries rather than continuously. The
standard library follows this pattern: http.Client.Do checks between
redirect hops, database/sql checks before query execution, and gRPC
checks in newAttemptLocked before creating the transport stream. If the
context is cancelled mid-operation, manageStream handles cleanup and the
deferred CheckCancelError corrects the error code.1 parent 68e31cb commit 48d83cb
3 files changed
Lines changed: 11 additions & 7 deletions
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
89 | 89 | | |
90 | 90 | | |
91 | 91 | | |
| 92 | + | |
92 | 93 | | |
93 | 94 | | |
94 | 95 | | |
| |||
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
328 | 328 | | |
329 | 329 | | |
330 | 330 | | |
| 331 | + | |
| 332 | + | |
| 333 | + | |
331 | 334 | | |
332 | 335 | | |
333 | 336 | | |
| |||
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
297 | 297 | | |
298 | 298 | | |
299 | 299 | | |
300 | | - | |
| 300 | + | |
301 | 301 | | |
302 | 302 | | |
303 | | - | |
| 303 | + | |
304 | 304 | | |
305 | 305 | | |
306 | 306 | | |
| |||
401 | 401 | | |
402 | 402 | | |
403 | 403 | | |
404 | | - | |
| 404 | + | |
405 | 405 | | |
406 | 406 | | |
407 | 407 | | |
| |||
496 | 496 | | |
497 | 497 | | |
498 | 498 | | |
499 | | - | |
| 499 | + | |
500 | 500 | | |
501 | 501 | | |
502 | 502 | | |
| |||
519 | 519 | | |
520 | 520 | | |
521 | 521 | | |
522 | | - | |
| 522 | + | |
523 | 523 | | |
524 | 524 | | |
525 | 525 | | |
| |||
540 | 540 | | |
541 | 541 | | |
542 | 542 | | |
543 | | - | |
| 543 | + | |
544 | 544 | | |
545 | 545 | | |
546 | 546 | | |
| |||
563 | 563 | | |
564 | 564 | | |
565 | 565 | | |
566 | | - | |
| 566 | + | |
567 | 567 | | |
568 | 568 | | |
569 | 569 | | |
| |||
0 commit comments