env/posix: improve io_uring init failure diagnostics in ReadAsync #14736

Open
krhancoc wants to merge 1 commit into facebook:main from krhancoc:fix/readasync-enomem-better-error


@krhancoc krhancoc commented May 12, 2026

Problem

When io_uring_queue_init fails with ENOMEM (memlock limit exhausted, which is common on systems running many concurrent threads), CreateIOUring returns
nullptr and ReadAsync propagates a generic IOStatus::NotSupported("ReadAsync") that gives no indication of why the ring creation failed. This makes the failure opaque and hard to diagnose.

Root Cause Chain

 MultiGet(async_io=true)
   → MultiGetFromSSTCoroutine (per SST file)
     → FilePrefetchBuffer::ReadAsync
       → PosixRandomAccessFile::ReadAsync
         → CreateIOUring() → io_uring_queue_init(256, ring, flags)
           ← ENOMEM (memlock exhausted)
         ← returns null
       ← returns IOStatus::NotSupported("ReadAsync")   ← opaque, no errno
     ← per-key error propagated to caller

Fix

  • CreateIOUring now accepts an optional int* err_out parameter. On failure it captures the errno from io_uring_queue_init, logs a diagnostic to stderr
    (with a memlock-specific hint when errno is ENOMEM), and — if the caller passes a pointer — fills it with the raw errno so the caller can respond
    appropriately.
  • PosixRandomAccessFile::ReadAsync now uses err_out to return a typed IOStatus:
      • ENOMEM → IOStatus::IOError(...) with a message explaining the memlock limit
      • ENOSYS / EPERM → IOStatus::NotSupported(...) (kernel doesn't support io_uring)
      • Other errno → IOStatus::IOError(...) with the raw errno string
  • PosixRandomAccessFile::MultiRead is unchanged in behavior — it already falls back to synchronous reads when ring init fails. It calls CreateIOUring()
    with no argument (default nullptr), relying on the internal stderr logging.
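
To illustrate the err_out pattern described in the first bullet, here is a minimal sketch rather than the code in this PR. SketchRing, RingInitFn, and CreateRingSketch are hypothetical names; the init callback stands in for io_uring_queue_init and follows its convention of returning 0 on success and a negative errno on failure. The queue depth of 256 is taken from the call chain above.

```cpp
#include <cerrno>
#include <cstdio>
#include <cstring>
#include <functional>
#include <memory>

// Hypothetical stand-in for the ring; real code would use liburing's
// struct io_uring.
struct SketchRing {
  unsigned entries = 0;
};

// Init callback using the io_uring_queue_init convention:
// 0 on success, negative errno on failure.
using RingInitFn = std::function<int(unsigned entries, SketchRing* ring)>;

// Returns the ring on success, or nullptr on failure. On failure it logs to
// stderr and, if err_out is non-null, stores the positive errno there.
inline std::unique_ptr<SketchRing> CreateRingSketch(const RingInitFn& init,
                                                    int* err_out = nullptr) {
  if (err_out != nullptr) {
    *err_out = 0;
  }
  std::unique_ptr<SketchRing> ring(new SketchRing());
  const int ret = init(256, ring.get());
  if (ret != 0) {
    const int err = -ret;  // convert to a positive errno
    if (err == ENOMEM) {
      std::fprintf(stderr,
                   "io_uring init failed: ENOMEM; RLIMIT_MEMLOCK may be "
                   "exhausted, consider raising it\n");
    } else {
      std::fprintf(stderr, "io_uring init failed: %s\n", std::strerror(err));
    }
    if (err_out != nullptr) {
      *err_out = err;
    }
    return nullptr;
  }
  return ring;
}
```

A caller that only wants the fallback behavior can omit err_out and rely on the stderr line; a caller that needs to choose a status passes a pointer and inspects the stored errno.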

When io_uring_queue_init fails, ReadAsync previously returned a single
generic IOStatus::NotSupported("ReadAsync: failed to init io_uring")
regardless of why the ring could not be created.  ENOMEM (memlock quota
exhausted) and ENOSYS (kernel has no io_uring) have very different
remediation paths, yet both produced identical, opaque error messages.
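
For the ENOMEM case, one way an operator might check the current limit is the POSIX getrlimit call. The sketch below is not part of this PR; ReadMemlockLimit and PrintMemlockLimit are hypothetical helper names, and it assumes a Linux system on which ring memory is charged against RLIMIT_MEMLOCK.

```cpp
#include <sys/resource.h>

#include <cstdio>

// Reads the soft and hard RLIMIT_MEMLOCK values for the current process.
// Returns false if getrlimit fails (errno is left set by getrlimit).
inline bool ReadMemlockLimit(rlim_t* soft, rlim_t* hard) {
  struct rlimit lim;
  if (getrlimit(RLIMIT_MEMLOCK, &lim) != 0) {
    return false;
  }
  *soft = lim.rlim_cur;
  *hard = lim.rlim_max;
  return true;
}

// Prints the soft limit to stderr, treating RLIM_INFINITY as "unlimited".
inline void PrintMemlockLimit() {
  rlim_t soft = 0;
  rlim_t hard = 0;
  if (!ReadMemlockLimit(&soft, &hard)) {
    std::perror("getrlimit(RLIMIT_MEMLOCK)");
    return;
  }
  if (soft == RLIM_INFINITY) {
    std::fprintf(stderr, "RLIMIT_MEMLOCK soft limit: unlimited\n");
  } else {
    std::fprintf(stderr, "RLIMIT_MEMLOCK soft limit: %llu bytes\n",
                 static_cast<unsigned long long>(soft));
  }
}
```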

Changes:

1. CreateIOUring (env/io_posix.h)
   - Add optional int* err_out parameter so callers receive the raw
     positive errno from io_uring_queue_init.
   - Log to stderr (was stdout) so the message is captured by log
     pipelines that suppress stdout.
   - Emit a dedicated message for ENOMEM that names RLIMIT_MEMLOCK as
     the likely cause and suggests raising it.

2. ReadAsync (env/io_posix.cc)
   - Capture the errno via the new err_out parameter.
   - Map errno to a precise IOStatus:
       ENOMEM          -> IOError  (resource exhaustion; io_uring IS
                          supported, resources are just depleted)
       ENOSYS / EPERM  -> NotSupported (kernel/permissions barrier)
       other non-zero  -> IOError with the strerror description
       zero (TLS null) -> NotSupported (io_uring disabled at file-open)
   - Using IOError rather than NotSupported for ENOMEM is intentional:
     callers that treat NotSupported as a silent "retry sync" signal
     would otherwise hide the resource-exhaustion problem from operators.

3. MultiRead (env/io_posix.cc)
   - Update CreateIOUring() call to pass &init_err so the improved
     ENOMEM log line fires when MultiRead's ring init fails (no
     behavioral change; MultiRead already falls back to serialized
     reads on init failure).
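
The errno-to-status mapping in item 2 can be sketched as follows. This is an illustration under assumptions, not the code in env/io_posix.cc: SketchCode, SketchStatus, and MapRingInitErrno are hypothetical stand-ins for RocksDB's IOStatus machinery, and the message strings are invented for the example.

```cpp
#include <cerrno>
#include <cstring>
#include <string>

// Hypothetical stand-in for the two IOStatus kinds discussed above.
enum class SketchCode { kNotSupported, kIOError };

struct SketchStatus {
  SketchCode code;
  std::string message;
};

// Maps the positive errno captured from a failed ring init to a status.
// init_errno == 0 models the case where no ring init was attempted.
inline SketchStatus MapRingInitErrno(int init_errno) {
  switch (init_errno) {
    case 0:
      return {SketchCode::kNotSupported, "ReadAsync: io_uring not enabled"};
    case ENOMEM:
      // Resource exhaustion: io_uring is supported, so do not report
      // NotSupported, which callers may treat as a silent fallback signal.
      return {SketchCode::kIOError,
              "ReadAsync: io_uring init failed with ENOMEM; "
              "RLIMIT_MEMLOCK may be too low"};
    case ENOSYS:
    case EPERM:
      return {SketchCode::kNotSupported,
              std::string("ReadAsync: io_uring unavailable: ") +
                  std::strerror(init_errno)};
    default:
      return {SketchCode::kIOError,
              std::string("ReadAsync: io_uring init failed: ") +
                  std::strerror(init_errno)};
  }
}
```

Keeping the mapping in one pure function makes each branch easy to unit test without needing a kernel that actually fails ring creation.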

This is motivated by a flaky CI failure (Not implemented: ReadAsync in
AnnIndexSPANNTest) where ENOMEM was the root cause but the error message
gave no indication of RLIMIT_MEMLOCK being relevant.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

meta-cla Bot commented May 12, 2026

Hi @krhancoc!

Thank you for your pull request.

We require contributors to sign our Contributor License Agreement, and yours needs attention.

You currently have a record in our system, but the CLA is no longer valid, and will need to be resubmitted.

Process

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (eg your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.

If you have received this in error or have any questions, please contact us at cla@meta.com. Thanks!

@meta-cla meta-cla Bot added the CLA Signed label May 12, 2026

meta-cla Bot commented May 12, 2026

Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Meta Open Source project. Thanks!
