Skip to content

run: cap Redis maxmemory to N% of total_system_memory before init_commands#536

Open
fcostaoliveira wants to merge 1 commit intomasterfrom
feat/auto-maxmemory-cap
Open

run: cap Redis maxmemory to N% of total_system_memory before init_commands#536
fcostaoliveira wants to merge 1 commit intomasterfrom
feat/auto-maxmemory-cap

Conversation

@fcostaoliveira
Copy link
Copy Markdown
Contributor

Summary

Adds an opt-in mechanism to set Redis maxmemory proportionally to the bench-server's total memory before any spec-defined init_commands run, so workloads that previously OOM-killed the OS now hit Redis-level OOM errors instead.

Three sources, in precedence order:

Source Where
--no-maxmemory-cap CLI — force-disable for this run, beats everything below
--maxmemory-pct N CLI — override applied regardless of spec
dbconfig.maxmemory_pct: N spec field — per-spec opt-in

When none is set, nothing is sent. Existing specs and existing CI pipelines (RediSearch, RediSearchEnterprise, anyone else consuming this CLI) see no behaviour change unless they explicitly opt in.

Why

The search-expire-doc-10-milliseconds spec on RediSearch makes the bench-server (m7g.8xlarge / m7i.8xlarge) unreachable via SSH around 140-170s into the test, with paramiko: Error reading SSH protocol banner after Connection closed by server. Signature is consistent with the kernel OOM-killer leaving sshd memory-starved (terraform destroy still works, but no result is captured). Reproduced on:

  • RediSearch/RediSearch 8.2 / 8.4 aarch64
  • redislabsdev/RediSearch v8.2.9 / v8.4.5 aarch64 + v2.8.35 / v2.10.28 x86_64
  • ftsb-10K (read-only, 10 k docs) on the same setups runs cleanly → workload-specific.

Setting maxmemory ahead of time means Redis returns OOM command not allowed when used memory > 'maxmemory' instead of letting the kernel decide. Run completes with a partial-result warning, instance stays reachable, comparison data still flows to RTS.

Cap formula

cap = min(pct% * total_system_memory, total - 4 GiB)   if total >  4 GiB
    = pct% * total_system_memory                       if total <= 4 GiB

The 4 GiB OS-reserve floor only kicks in when the instance has enough memory to support it. Tiny test instances (e.g. CI scratch boxes) keep the simple pct cap.

total_system_memory is read from INFO memory so we don't need an instance-type catalog or /proc/meminfo SSH; it works on every supported Redis version. Failure to read it (older Redis, mocked test env) is logged and skipped — never blocks the run.

If Redis already has a non-zero maxmemory configured, the new cap overwrites it with a WARNING log line. Spec authors who want to set their own absolute byte value in init_commands can do so — the explicit CONFIG SET maxmemory <bytes> they emit later will take precedence over our preface.

What changes

  • redisbench_admin/run/args.py: new --maxmemory-pct N (int, default None) and --no-maxmemory-cap (store_true) on common_run_args.
  • redisbench_admin/run/common.py: new helpers:
    • _get_maxmemory_pct_from_dbconfig — handles both dict and list dbconfig formats
    • _resolve_maxmemory_pct — CLI > spec precedence
    • _compute_maxmemory_cap_bytes — cap arithmetic + floor
    • apply_maxmemory_cap — orchestrator, called as the first thing inside execute_init_commands
    • execute_init_commands and run_redis_pre_steps gain an optional args=None kwarg (default preserves existing behaviour for any caller that doesn't update).
  • redisbench_admin/run_local/local_db.py: passes args through the two run_redis_pre_steps callsites (it already had args from local_db_spin).
  • run_remote/remote_db.py: not modified in this PR. The spec-field path still fires (it doesn't need args); CLI override on the remote path is a small follow-up.

Compatibility

  • RediSearchEnterprise (disk project): zero behaviour change unless their specs add maxmemory_pct or their pipelines pass --maxmemory-pct.
  • Existing RediSearch specs: zero behaviour change.
  • Existing local-bench runs without args wiring: zero behaviour change.

Tests

13 new unit tests in tests/test_common.py, all pass:

tests/test_common.py::test_get_maxmemory_pct_from_dbconfig_dict           PASSED
tests/test_common.py::test_get_maxmemory_pct_from_dbconfig_list           PASSED
tests/test_common.py::test_resolve_maxmemory_pct_precedence               PASSED
tests/test_common.py::test_compute_maxmemory_cap_bytes_floor_kicks_in_on_large_instances    PASSED
tests/test_common.py::test_compute_maxmemory_cap_bytes_floor_caps_high_pct_on_smaller_box   PASSED
tests/test_common.py::test_compute_maxmemory_cap_bytes_no_floor_for_tiny_instances          PASSED
tests/test_common.py::test_apply_maxmemory_cap_no_op_without_opt_in       PASSED
tests/test_common.py::test_apply_maxmemory_cap_no_op_when_cli_disables    PASSED
tests/test_common.py::test_apply_maxmemory_cap_sets_from_spec_field       PASSED
tests/test_common.py::test_apply_maxmemory_cap_floor_wins_on_smaller_box  PASSED
tests/test_common.py::test_apply_maxmemory_cap_skips_on_invalid_pct       PASSED
tests/test_common.py::test_apply_maxmemory_cap_skips_when_info_unavailable PASSED
tests/test_common.py::test_apply_maxmemory_cap_overwrites_existing_with_warn   PASSED

Coverage:

  • precedence resolution
  • both dbconfig formats (dict and list)
  • cap arithmetic at 128 GiB (floor doesn't bite), 8 GiB (floor wins), 2 GiB (no floor)
  • the 4 no-op paths (no opt-in / --no-maxmemory-cap / invalid pct / INFO memory unavailable)
  • the spec-field happy path
  • overwrite-with-warn against an existing maxmemory

3 pre-existing tests still need RTS_PORT for a live Redis (test_execute_init_commands, test_common_exporter_logic, test_dbconfig_keyspacelen_check) — confirmed failing identically on origin/master without that env, so unrelated to this change.

Test plan after merge

  • Bump tests/benchmarks/requirements.txt in RediSearch / RediSearchEnterprise to a release that includes this change.
  • Add maxmemory_pct: 80 to the 4 search-expire-* specs only.
  • Re-trigger the expiration runs across the 6 historical tags on redislabsdev/RediSearch (priv PR #241) and confirm the bench-server stays reachable.
  • Spot-check 2-3 RediSearchEnterprise specs via run-local to confirm no CONFIG SET maxmemory is emitted when the field is absent.

…mands

Adds an opt-in mechanism to set Redis `maxmemory` proportionally to
the bench-server's total memory before any spec-defined init_commands
run, so workloads that previously OOM-killed the OS now hit
Redis-level OOM errors instead.

Default behaviour is unchanged: nothing is sent unless a spec field
or CLI flag asks for it. RediSearchEnterprise (and any other
downstream consumer) keeps its existing semantics.

Sources, in precedence order:
  --no-maxmemory-cap        force-disable for this run
  --maxmemory-pct N         CLI override
  dbconfig.maxmemory_pct N  spec-level opt-in (per-spec)

Cap formula: min(pct% * total_system_memory, total - 4 GiB), where the
4 GiB OS-reserve floor is only applied when total_system_memory is
greater than 4 GiB. On tiny instances the floor is bypassed and only
the pct cap applies. Existing non-zero `maxmemory` values are
overwritten with a WARNING log line.

Implementation:
  - new helpers in redisbench_admin/run/common.py:
      _get_maxmemory_pct_from_dbconfig (handles both dict and list
                                        dbconfig formats)
      _resolve_maxmemory_pct           (precedence resolver)
      _compute_maxmemory_cap_bytes     (cap arithmetic + 4 GiB floor)
      apply_maxmemory_cap              (orchestrator: reads INFO
                                        memory, logs, applies)
  - execute_init_commands and run_redis_pre_steps gain an optional
    `args` kwarg (default None). When set, CLI overrides are honoured;
    when not, only the spec field is consulted, so the spec-driven
    path works regardless of caller plumbing.
  - run_local/local_db.py threads `args` through the two
    run_redis_pre_steps callsites. The remote_db path keeps `args=None`
    for now (spec field still fires); CLI override there is a small
    follow-up.

Resolves the bench-server unresponsiveness observed when the
`search-expire-doc-10-milliseconds` spec is run on m7g.8xlarge /
m7i.8xlarge: instead of the kernel OOM-killer leaving sshd starved
and the box unreachable (terraform destroy still works, but no
result is captured), Redis now refuses writes once the cap is hit
and the bench completes with a partial-result warning.

Tests:
  - 13 new unit tests in tests/test_common.py covering precedence,
    cap arithmetic at 8/128 GiB and tiny instances, no-op paths,
    overwrite-with-warn, and INFO memory failure modes. All pass.
  - 3 pre-existing tests still need RTS_PORT (live Redis) and were
    already failing on origin/master without it; not affected.
help="If set, before running each spec's init_commands send `CONFIG SET maxmemory <pct%% of total_system_memory>` so OS-level OOM is replaced with Redis-level OOM. Floor: leaves >=4 GiB for the OS when total > 4 GiB. Overrides `dbconfig.maxmemory_pct` from the spec. Range: (0, 100].",
)
parser.add_argument(
"--no-maxmemory-cap",
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure is good to have negative logic. It is hard to keep it backwards compatible?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if setting None to the maxmemory-pct should be the way to remove it, and it should be the default?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants