Skip to content

fix(proactor_threads): fix that proactor_threads is not respected in k8s env#7030

Merged
romange merged 1 commit into
dragonflydb:mainfrom
starek4:fix/proactor_threads-not-respected
Apr 8, 2026
Merged

fix(proactor_threads): fix that proactor_threads is not respected in k8s env#7030
romange merged 1 commit into
dragonflydb:mainfrom
starek4:fix/proactor_threads-not-respected

Conversation

@starek4
Copy link
Copy Markdown
Contributor

@starek4 starek4 commented Mar 31, 2026

Fixing issue: #4251

When a pod has CPU resource limits set, UpdateResourceLimitsIfInsideContainer derives a thread count from the cgroup CPU quota and passes it directly to ProactorPool. Because helio's ProactorPool constructor only checks FLAGS_proactor_threads when its pool_size argument is 0, a non-zero cgroup-derived value silently overrides whatever the user set via --proactor_threads (or DFLY_proactor_threads).

This caused pods to start with a thread count dictated by the CPU limit rather than the explicitly configured value, often failing the maxmemory >= threads * 256MB check on startup.

Changes:

  • After cgroup detection, reset max_available_threads to 0 when --proactor_threads is explicitly set (non-zero), so ProactorPool uses the flag value instead
  • Handles both --proactor_threads CLI flag and DFLY_proactor_threads environment variable, since ParseFlagsFromEnv() runs before this check

Behavior:

  • No CPU limits → unchanged (helio auto-detects CPUs)
  • CPU limits set, no --proactor_threads → unchanged (cgroup value used)
  • CPU limits set + --proactor_threads=N → fixed (flag is now respected)

Tests:

  • tests/dragonfly/proactor_threads_test.py — three new integration tests:
    • test_proactor_threads_flag_is_respected → "verifies --proactor_threads=N results in exactly N threads"
    • test_proactor_threads_env_var_is_respected → "verifies DFLY_proactor_threads=N env var (the Kubernetes env: equivalent) is respected the same way"
    • test_proactor_threads_flag_overrides_env_var → "verifies CLI flag takes priority over the env var, consistent with existing WasPresentOnCommandLine behaviour"

@starek4 starek4 force-pushed the fix/proactor_threads-not-respected branch from c58b415 to 94810e6 Compare March 31, 2026 08:46
@starek4
Copy link
Copy Markdown
Contributor Author

starek4 commented Apr 1, 2026

@dranikpg would you be so kind to fill in some reviewers? I am not so familiar with GitHub, and it seems I cannot add any after the PR is created. Thanks!

@starek4 starek4 force-pushed the fix/proactor_threads-not-respected branch 3 times, most recently from 6086c40 to bd85b67 Compare April 1, 2026 11:11
@romange romange self-requested a review April 2, 2026 08:09
@romange
Copy link
Copy Markdown
Collaborator

romange commented Apr 2, 2026

@starek4 thanks for the contribution. I am trying to understand - the tests are not related to the fix, correct?

They are just "fixing" the current behavior with cli args and the env vars?

@romange
Copy link
Copy Markdown
Collaborator

romange commented Apr 2, 2026

Also, can I reproduce the problem locally with docker/podman or I need k8s environment for that?

(Sorry, not a k8s expert).

@starek4
Copy link
Copy Markdown
Contributor Author

starek4 commented Apr 2, 2026

@romange, you can definitely reproduce the issue locally (and also the fix), let me show you how:

  1. Create a cgroup with a CPU limit (2CPUs, 200000us quota / 100000us period)
sudo mkdir -p /sys/fs/cgroup/dragonfly-test
echo "200000 100000" | sudo tee /sys/fs/cgroup/dragonfly-test/cpu.max
  1. Check your machine has more than 2 threads (otherwise the fix is a no-op)
    nproc

  2. Run dragonfly without the flag (should be limited to 2 threads by cgroup)

sudo cgexec -g cpu:dragonfly-test ./build-dbg/dragonfly --alsologtostderr 2>&1 | grep -i "thread\|proactor"
# Before and after fix: thread_count:2
  1. Run dragonfly WITH --proactor_threads=6 (should override the cgroup limit)
sudo cgexec -g cpu:dragonfly-test ./build-dbg/dragonfly --proactor_threads=6 --alsologtostderr
# Before fix: thread_count:2
# After fix: thread_count:6
  1. Test env var path too
sudo cgexec -g cpu:dragonfly-test env DFLY_proactor_threads=5 ./build-dbg/dragonfly --alsologtostderr
# Before fix: thread_count:2
# After fix: thread_count:5
  1. Cleanup
    sudo rmdir /sys/fs/cgroup/dragonfly-test

And about the tests, you are right. It is only about passing the arguments.

@romange
Copy link
Copy Markdown
Collaborator

romange commented Apr 2, 2026

Thanks! Let me few days to review it (currently on Passover vacation). Meanwhile can you pls edit your comment above and add what is the actual behavior before the fix for each of the cases you wrote down?

@starek4
Copy link
Copy Markdown
Contributor Author

starek4 commented Apr 2, 2026

Enjoy your vacation! I have added the before and after fix behaviour.

@romange
Copy link
Copy Markdown
Collaborator

romange commented Apr 2, 2026

The fix is correct and minimal. When FLAGS_proactor_threads > 0, resetting max_available_threads = 0 makes ProactorPool fall back to the flag value.

However, the tests don't validate the actual bug scenario. They run on a normal machine where max_available_threads is already 0 (no cgroup limits), so they exercise the path that was never broken.

Suggested: integration test with real cgroup limits

The existing CI runs pytest inside Docker containers where /sys/fs/cgroup is read-only and Docker isn't available. Add a separate job that runs directly on the runner and uses docker run --cpus=N to verify dragonfly's behavior under real cgroup limits. Maybe something like this:

cgroup-thread-detection:
  runs-on: ubuntu-latest
  needs: [build]
  steps:
    - uses: actions/checkout@v6

    # Reuse the binary built by the build job (requires uploading it as an artifact there)
    - uses: actions/download-artifact@v7
      with:
        name: dragonfly-binary
        path: build

    - name: Test cgroup CPU auto-detection
      run: |
        chmod +x build/dragonfly

        # Scenario 1: --cpus=2, no flag -> should auto-detect 2 threads
        CID=$(docker run --rm -d --cpus=2 --security-opt seccomp=unconfined \
          -v "$PWD/build/dragonfly:/dragonfly" -p 16380:6379 ubuntu:24.04 \
          /dragonfly --port 6379 --maxmemory 1G --dbfilename "" --noversion_check)
        #
        # Wait for PING to return PONG 
        #
        THREADS=$(redis-cli -p 16380 INFO server | grep -oP 'thread_count:\K\d+')
        docker rm -f "$CID"
        echo "Scenario 1: expected 2, got $THREADS"
        [ "$THREADS" -eq 2 ]

        # Scenario 2: --cpus=2 + --proactor_threads=4 -> flag must win
        CID=$(docker run --rm -d --cpus=2 --security-opt seccomp=unconfined \
          -v "$PWD/build/dragonfly:/dragonfly" -p 16381:6379 ubuntu:24.04 \
          /dragonfly --port 6379 --proactor_threads 4 --maxmemory 1G --dbfilename "" --noversion_check)
        #
        # Wait for PING to return PONG
        #
        THREADS=$(redis-cli -p 16381 INFO server | grep -oP 'thread_count:\K\d+')
        docker rm -f "$CID"
        echo "Scenario 2: expected 4, got $THREADS"
        [ "$THREADS" -eq 4 ]

Minor suggestion

Add a LOG(INFO) when the flag overrides the cgroup value -- helps debugging in production:

if (absl::GetFlag(FLAGS_proactor_threads) > 0) {
  LOG(INFO) << "Using proactor_threads=" << absl::GetFlag(FLAGS_proactor_threads)
            << " (overriding cgroup-derived " << max_available_threads << ")";
  max_available_threads = 0;
}

@starek4 starek4 force-pushed the fix/proactor_threads-not-respected branch from c8f0bc5 to 04e2a23 Compare April 4, 2026 11:31
romange
romange previously approved these changes Apr 6, 2026
@romange
Copy link
Copy Markdown
Collaborator

romange commented Apr 7, 2026

@starek4 you need to install redis-tools or valkey-tools to be able to ping the server

@starek4
Copy link
Copy Markdown
Contributor Author

starek4 commented Apr 7, 2026

@romange yep, no worries, will do it when I have time.

@starek4 starek4 force-pushed the fix/proactor_threads-not-respected branch from b232107 to 8d96ea5 Compare April 8, 2026 08:42
@starek4
Copy link
Copy Markdown
Contributor Author

starek4 commented Apr 8, 2026

@romange it should work now, tested on my forked actions

@romange
Copy link
Copy Markdown
Collaborator

romange commented Apr 8, 2026

@starek4 please squash and sign your commits

@starek4 starek4 force-pushed the fix/proactor_threads-not-respected branch 2 times, most recently from 3e2d6c2 to 17d2038 Compare April 8, 2026 18:21
@starek4 starek4 force-pushed the fix/proactor_threads-not-respected branch from 17d2038 to f7c2807 Compare April 8, 2026 18:25
@starek4
Copy link
Copy Markdown
Contributor Author

starek4 commented Apr 8, 2026

@romange done

@romange
Copy link
Copy Markdown
Collaborator

romange commented Apr 8, 2026

thank you!

@romange romange merged commit 567f353 into dragonflydb:main Apr 8, 2026
13 checks passed
@starek4 starek4 deleted the fix/proactor_threads-not-respected branch April 9, 2026 06:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants