Skip to content

fix(logging): prevent log file race condition under parallel attempts#1827

Open
nuthalapativarun wants to merge 2 commits into
NVIDIA:mainfrom
nuthalapativarun:fix/1355-parallel-log-race-condition-clean
Open

fix(logging): prevent log file race condition under parallel attempts#1827
nuthalapativarun wants to merge 2 commits into
NVIDIA:mainfrom
nuthalapativarun:fix/1355-parallel-log-race-condition-clean

Conversation

@nuthalapativarun
Copy link
Copy Markdown

Summary

Fixes #1355.

When multiprocessing.Pool forks worker processes they inherit the parent's open FileHandler file descriptor. Concurrent writes — or reentrant flushes triggered by third-party library destructors (e.g. openai/httpcore closing connections in __del__) — on those shared handles cause a RuntimeError: reentrant call inside <_io.BufferedWriter> in Python 3.13+.

Fix: pass initializer=_worker_logging_init to Pool(...) in garak/probes/base.py. The initializer runs once in each worker process and:

  1. Closes and removes every inherited log handler.
  2. Re-opens a fresh FileHandler via logging.basicConfig, using the GARAK_LOG_FILE env var that garak/__init__.py already sets before any children are spawned.

Each worker then has its own private file descriptor pointing at the same log file, eliminating the shared-handle race.

Files changed

  • garak/probes/base.py — add _worker_logging_init() and pass it as initializer to Pool

Test plan

  • Reproduce with parallel_attempts: 2 and a multi-probe run (e.g. the config from the issue comments) — confirm no RuntimeError appears in the log
  • Existing unit tests pass (pre-existing missing-module failures are unrelated to this change)

When multiprocessing.Pool forks worker processes they inherit the
parent's open FileHandler file descriptors.  Concurrent writes (or
reentrant flushes triggered by third-party library destructors such as
openai/httpcore closing connections in __del__) on those shared handles
cause a RuntimeError in Python 3.13+.

Add a _worker_logging_init pool initializer that closes all inherited
handlers and re-opens a fresh FileHandler in each worker, giving every
worker its own private file descriptor.  The log file path is read from
the GARAK_LOG_FILE env var that garak/__init__.py already sets before
spawning any children.

Fixes NVIDIA#1355

Signed-off-by: Varun Nuthalapati <nuthalapativarun@gmail.com>
@aishwaryap
Copy link
Copy Markdown
Collaborator

The fix description appears promising but can we obtain:

  1. A reproducible test that demonstrates that this fix resolves the error in question.
  2. If a reproducible test can be found can that be converted into a unit / integration test

Copy link
Copy Markdown
Collaborator

@aishwaryap aishwaryap left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See earlier comment

Adds tests/probes/test_probes_base_parallel_logging.py to reproduce and
verify the fix for NVIDIA#1355. Four unit tests cover handler teardown, file
handler creation from GARAK_LOG_FILE, absence of handler when env is
unset, and idempotency. Two integration tests spawn a real Pool to confirm
workers log concurrently without RuntimeError and that each worker opens
its own private file descriptor rather than sharing the parent's inherited
fd.

Signed-off-by: Varun Nuthalapati <nuthalapativarun@gmail.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@nuthalapativarun
Copy link
Copy Markdown
Author

Thanks for the clear feedback, @aishwaryap!

I've added tests/probes/test_probes_base_parallel_logging.py with six tests (latest commit 727728ba):

Unit tests for _worker_logging_init directly:

  • test_worker_init_clears_inherited_handlers — verifies the initializer removes all inherited handlers (the root cause of the race)
  • test_worker_init_opens_file_handler_when_env_set — verifies a fresh FileHandler is created pointing at GARAK_LOG_FILE
  • test_worker_init_no_file_handler_when_env_unset — verifies no handler is added when the env var is absent
  • test_worker_init_is_idempotent — verifies calling it twice doesn't accumulate duplicate handlers

Integration tests that exercise the actual Pool path:

  • test_parallel_workers_log_without_error — spawns Pool(4) with _worker_logging_init, maps 8 worker tasks that each log 50 lines concurrently; asserts no RuntimeError is raised
  • test_parallel_worker_fds_are_independent — verifies each worker's FileHandler has a different file descriptor than the parent's, confirming the shared-fd race is eliminated

All six pass locally. Let me know if you'd like any adjustments.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Race condition occurs when using parallel attempts while reading log file

2 participants