Skip to content

fix: guard handle_read against closed socket race (#614)#902

Draft
fruch wants to merge 1 commit into
scylladb:masterfrom
fruch:fruch/fix-614-handle-read-guard
Draft

fix: guard handle_read against closed socket race (#614)#902
fruch wants to merge 1 commit into
scylladb:masterfrom
fruch:fruch/fix-614-handle-read-guard

Conversation

@fruch

@fruch fruch commented Jun 11, 2026

Copy link
Copy Markdown

Minimal fix for issue #614 — Bad file descriptor

2-line change. Adds if self.is_closed: return at the top of LibevConnection.handle_read().

Root Cause

Thread race between close() and handle_read():

  • close() calls self._socket.close() synchronously (from any thread)
  • Watcher stop is asynchronous (batched in _loop_will_run(), next loop iteration)
  • If the reactor has already dispatched handle_read() for that connection before the watcher is stopped, recv() hits a closed FD → EBADF
Thread A (app):    close() → _socket.close()     [immediate]
Thread B (reactor): handle_read() → recv()       [EBADF - socket already closed]
                    ...
                    _loop_will_run() → watcher.stop()  [too late]

The Fix

 def handle_read(self, watcher, revents, errno=None):
+    if self.is_closed:
+        return
     if revents & libev.EV_ERROR:

Why this is safe

  • is_closed is set under self.lock in close() before _socket.close() is called
  • Reading a boolean attribute is atomic in CPython (GIL guarantees)
  • If is_closed is True, the socket is either already closed or about to be — no useful work can be done

Reproducer

100% reproduction rate: https://github.com/scylladb/scylla-dtest/pull/7079

Note on existing fix branches

  • fix/factory-close-race-614 — fixes factory returning dead connections (different race)
  • fix/asyncio-close-race-614 — fixes AsyncioConnection race (different reactor)

Neither adds the handle_read guard for LibevConnection. This PR is complementary.

How to test with SCT/Hydra

# In SCT requirements, point python-driver to this branch:
pip install git+https://github.com/scylladb/python-driver.git@fruch/fix-614-handle-read-guard

# Or in hydra, override the driver version in the test config

@vponomaryov — could you try building hydra with this branch and check if it fixes the EBADF reports in longevity runs? This is the minimal possible fix (2 lines). If it works, we can merge this independently of the other #614 fix branches.

Closes #614

Add is_closed check at the top of LibevConnection.handle_read() to
prevent recv() on an already-closed socket.

Root cause: close() is called from any thread (main/app) and does
_socket.close() synchronously, but the libev reactor thread may already
be inside handle_read() for that connection (dispatched before
_loop_will_run batches the watcher stop). Without this guard, recv()
on the closed FD raises OSError [Errno 9] Bad file descriptor.

This is the minimal fix for issue scylladb#614. The race is inherent in the
threading model (single reactor thread shared across all sessions,
close callable from any thread). The watcher stop is asynchronous
(batched in _loop_will_run), but socket.close() is synchronous —
this guard bridges the gap.

Reproducer: scylladb/scylla-dtest#7079
@coderabbitai

coderabbitai Bot commented Jun 11, 2026

Copy link
Copy Markdown

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: 3bd2400c-2235-4161-98c0-c950a898d977

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@vponomaryov

vponomaryov commented Jun 12, 2026

Copy link
Copy Markdown

I ran the CI job here: https://argus.scylladb.com/tests/scylla-cluster-tests/59720349-680a-44ed-bd9f-109ddf28657f/results
But list of nemesis has changed and the picked ones are questionnable for triggering the connections closings.

So, I re-triggered it with different nemesis seed: scylla-staging/valerii/vp-longevity-100gb-4h-oci-test#38

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Driver reported "[Errno 9] Bad file descriptor"

2 participants