Skip to content

Add script to benchmark column stats creation#3106

Open
markovskipetar wants to merge 15 commits into
masterfrom
column_stats_improvements
Open

Add script to benchmark column stats creation#3106
markovskipetar wants to merge 15 commits into
masterfrom
column_stats_improvements

Conversation

@markovskipetar
Copy link
Copy Markdown
Collaborator

@markovskipetar markovskipetar commented May 13, 2026

Column Stats Benchmarks

Standalone benchmarks for measuring the performance and memory usage of ArcticDB column stats creation. These are not part of the ASV framework — they are designed for manual profiling and regression checks without ASV overhead.

Scripts

Script Role
run_bench_col_stats.py Orchestrator — runs all scenarios and prints a summary table
bench_col_stats.py Worker — runs a single operation in isolation and prints JSON results

How It Works

run_bench_col_stats.py iterates over a set of (rows, cols) scenarios. For each scenario it:

  1. Writes a float64 dataframe of the given shape to LMDB via a subprocess.
  2. Runs create_column_stats (MINMAX on all columns) 10 times, each in its own subprocess.
  3. Collects elapsed time and peak RSS delta from each run.
  4. Prints a summary table with min, max, and variance for both time and memory.

Each operation runs in a fresh subprocess (bench_col_stats.py) so that:

  • RSS measurements start from a clean baseline, unaffected by previous runs or accumulated Python/ArcticDB state.
  • A crash or OOM kill in one run does not abort the entire benchmark — the orchestrator catches the failure, cleans up, and reports a descriptive error.

Results are returned as JSON over stdout and parsed by the orchestrator.

Usage

# Run all scenarios
python python/benchmarks/non_asv/run_bench_col_stats.py

# Run a single operation manually (useful for profiling one case)
python python/benchmarks/non_asv/bench_col_stats.py --scenario 500x500 --operation write_symbol
python python/benchmarks/non_asv/bench_col_stats.py --scenario 500x500 --operation create_stats

--scenario is ROWSxCOLS. --operation is either write_symbol or create_stats.

Output

        rows    cols      write_time_s    stats_time_min    stats_time_max    stats_time_var    stats_rss_min_mb    stats_rss_max_mb    stats_rss_var_mb
---------------------------------------------------------------------------------------------------------------------------------------------------------
         500     500              0.42              1.23              1.45            0.0031                45.2                52.1                1.23
  • write_time_s — wall-clock time in seconds for the symbol write (one run).
  • stats_time_* — min, max, and variance of elapsed time across 10 stats-creation runs.
  • stats_rss_* — min, max, and variance of peak RSS increase in MB across the same 10 runs. Sampled at 10 ms intervals inside the worker process.

High variance in either column is a signal of instability worth investigating.

Scenarios

Default scenarios in run_bench_col_stats.py:

rows cols
500 500
500 100
600 600
700 700
800 500
900 800
1,000 100

A commented-out block of larger scenarios (5,000×5,000 up to 10,000×10,000) is also present for heavier profiling — uncomment and replace SCENARIOS to use them.

Cleanup

The LMDB store at /tmp/arcticdb_bench_col_stats is cleaned up automatically:

  • After each scenario (symbol deleted so the next write starts fresh).
  • On normal exit and on SIGINT/SIGTERM via atexit and signal handlers.

@github-actions
Copy link
Copy Markdown

Label error. Requires exactly 1 of: patch, minor, major. Found:

parser.add_argument("--simple_tests", action="store_true", help="Run only the first 4 scenarios")
args = parser.parse_args()

scenarios = SCENARIOS[:2] if args.simple_tests else SCENARIOS
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bug: this slices the first 2 scenarios, but the --simple_tests help text on line 35 advertises "Run only the first 4 scenarios". Either update the help string or change the slice (e.g. SCENARIOS[:4]) so the flag actually does what it claims.

@claude
Copy link
Copy Markdown
Contributor

claude Bot commented May 13, 2026

ArcticDB Code Review Summary

This PR adds a non-ASV benchmark harness for create_column_stats (worker + orchestrator, peak RSS via ru_maxrss in a fresh subprocess per experiment). The latest push adds a KeyboardInterrupt handler to the orchestrator top-level try so Ctrl-C exits cleanly rather than printing "Benchmark aborted unexpectedly". No other code change in this delta. Items below still need attention.

PR Title & Description

  • Title "Column Stats Improvements" misleading — retitled to "Add script to benchmark column stats creation".
  • Description is badly stale. The body still describes the original LMDB flow, the old script names run_bench_col_stats.py / bench_col_stats.py, the old --scenario / --operation flags, the old SCENARIOS table (≤1000×1000), and the old output format. The current code uses three scripts (col_stats_bench_orchestrator.py, col_stats_bench_write_symbol.py, col_stats_bench_create_stats.py), ahl.mongo.NativeMongoose, infinite row/col generators (stepping by 3×), and a different output table. Please refresh the PR body end-to-end before merge.
  • ⚠️ PR has no labels — no-release-notes should be set since this is developer-tooling only.

Correctness

  • run_subprocess re-raises every non-zero subprocess exit as MemoryError (col_stats_bench_orchestrator.py:40). Any failure — a mongo blip, a KeyError from the library lookup, an unhandled exception in the worker, a programming bug — trips run_phase except MemoryError and silently terminates the phase. The prior code distinguished signal-killed (likely OOM) from a regular exit code; that distinction is gone. Inline comment posted in a previous review.
  • ⚠️ row_scenarios() and col_scenarios() are still infinite generators (col_stats_bench_orchestrator.py:79–90). MULTIPLIER = 3 softens the step, but there is still no explicit termination condition; on a large machine the run never ends, and on any machine a transient non-OOM failure (see broad-MemoryError above) aborts the phase before the real memory ceiling is found. Please add an explicit upper bound (rows×cols ceiling or max scenario count).

Code Quality

  • Hardcoded Man-Group-internal dependency. All three scripts from ahl.mongo import NativeMongoose and target pmarkovski.columns_stats on mktdatad. The benchmark cannot run outside Man Group, and concurrent runs by different engineers collide on the same shared library/symbol. Parametrise library / symbol / connection via argv or env vars; at minimum extract LIBRARY_NAME / SYMBOL_NAME constants. Previous inline comment on col_stats_bench_create_stats.py:16 still applies.
  • ⚠️ Partial results are silently dropped. In measure(), result = Result(...) is appended before the stats loop. If any stats subprocess raises MemoryError, result.stats_create_times stays empty and print_results() skips that row — the successful symbol_write_time is never reported.
  • ⚠️ peak_rss_mb returned by col_stats_bench_write_symbol.py is not consumed by the orchestrator (only elapsed_seconds is read). Either drop the measurement or hook it into the output table. As-is it is misleading dead code, especially since ru_maxrss is captured after the np.random.rand(rows, cols) allocation, which for the larger scenarios dominates any RSS reading.
  • ⚠️ Bare except Exception: pass in cleanup() still silently swallows mongo auth / connectivity / permission failures. Previous inline comment still applies.
  • ⚠️ "test_symbol" literal and the NativeMongoose("mktdatad").get_library("pmarkovski.columns_stats", api="v2") lookup are duplicated across all three scripts. Replace with a shared constants module or pass via argv.
  • ⚠️ np.random.rand(...) is still called without a seed in the worker, so successive runs across machines/processes are not directly comparable.
  • ⚠️ lib._nvs still reaches into a private attribute. Acceptable for a one-off bench script, but a one-line comment noting why the public API is not used would help.

Prior review feedback now addressed

  • ✅ Mean / max reporting across measured iterations.
  • WARMUP_RUNS = 2 runs before RUNS = 10 measured iterations.
  • --simple_tests inconsistent with help string — flag removed.
  • col_stats_bench_write_symbol.py post-loop timing only measured the last chunk — fixed by removing the chunked loop entirely.
  • Orchestrator constants WRITE_SYMBOL_SCRIPT / CREATE_STATS_SCRIPT referenced filenames that did not exist — fixed.
  • Trailing-whitespace lint issues — clean.
  • Commented-out alternative SCENARIOS block — removed.
  • col_scenarios() step softened from 10× to 3× (MULTIPLIER = 3) — the immediate "second yield already OOMs" concern from the prior review is mitigated, though the generators remain unbounded (see Correctness above).
  • KeyboardInterrupt is now caught at top level (col_stats_bench_orchestrator.py:131–132) — Ctrl-C now exits cleanly via cleanup() rather than producing a misleading "aborted unexpectedly" message.

Copy link
Copy Markdown
Collaborator

@poodlewars poodlewars left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR titles should always be more descriptive than this, eg Add script to benchmark column stats creation rather than Column stats improvements. You should also always include a description.



atexit.register(_cleanup)
signal.signal(signal.SIGINT, lambda *_: exit(130))
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You shouldn't do cleanup like this. This file should end with:

if __name__ == "__main__":
  try:
    run()
  finally:
    cleanup()

or you could put the same idea in your run(). There's no need for exit handlers.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done


col_stats = {f"col_{i}": {"MINMAX"} for i in range(cols)}

mem_before = psutil.Process().memory_info().rss / 1e6
Copy link
Copy Markdown
Collaborator

@poodlewars poodlewars May 13, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These memory measurements are wrong, we care about peak RSS for the function calls, not RSS after they're done. This means that each experiment must be run in its own process. We also care about measuring peak RSS for the write call. This means that in each run of the script you should just do a write or create stats for a single example, then you can write a bash script to run your code across the different examples and summarize the results in a readable way.

mem_before = psutil.Process().memory_info().rss / 1e6
t0 = time.time()
nvs.create_column_stats(sym, col_stats)
stats_time = time.time() - t0
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unless test cases are very slow, it's always best to run timings with a couple of un-measured warmup iterations, then N measured iterations, to get more stable results. Then report the mean, median and max across the measured iterations.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@poodlewars
Copy link
Copy Markdown
Collaborator

Claude is right about the very large examples. These should be created by chunking up the dataframe with a series of append calls rather than ever loading the large dataframe in to memory.

@markovskipetar
Copy link
Copy Markdown
Collaborator Author

markovskipetar commented May 13, 2026

=== scenario 500x500 ===
[write_symbol] 500x500
[create_stats] run 1/10
[create_stats] run 2/10
[create_stats] run 3/10
[create_stats] run 4/10
[create_stats] run 5/10
[create_stats] run 6/10
[create_stats] run 7/10
[create_stats] run 8/10
[create_stats] run 9/10
[create_stats] run 10/10

=== scenario 500x100 ===
[write_symbol] 500x100
[create_stats] run 1/10
[create_stats] run 2/10
[create_stats] run 3/10
[create_stats] run 4/10
[create_stats] run 5/10
[create_stats] run 6/10
[create_stats] run 7/10
[create_stats] run 8/10
[create_stats] run 9/10
[create_stats] run 10/10

=== scenario 600x600 ===
[write_symbol] 600x600
[create_stats] run 1/10
[create_stats] run 2/10
[create_stats] run 3/10
[create_stats] run 4/10
[create_stats] run 5/10
[create_stats] run 6/10
[create_stats] run 7/10
[create_stats] run 8/10
[create_stats] run 9/10
[create_stats] run 10/10

=== scenario 700x700 ===
[write_symbol] 700x700
[create_stats] run 1/10
[create_stats] run 2/10
[create_stats] run 3/10
[create_stats] run 4/10
[create_stats] run 5/10
[create_stats] run 6/10
[create_stats] run 7/10
[create_stats] run 8/10
[create_stats] run 9/10
[create_stats] run 10/10

=== scenario 800x500 ===
[write_symbol] 800x500
[create_stats] run 1/10
[create_stats] run 2/10
[create_stats] run 3/10
[create_stats] run 4/10
[create_stats] run 5/10
[create_stats] run 6/10
[create_stats] run 7/10
[create_stats] run 8/10
[create_stats] run 9/10
[create_stats] run 10/10

=== scenario 900x800 ===
[write_symbol] 900x800
[create_stats] run 1/10
[create_stats] run 2/10
[create_stats] run 3/10
[create_stats] run 4/10
[create_stats] run 5/10
[create_stats] run 6/10
[create_stats] run 7/10
[create_stats] run 8/10
[create_stats] run 9/10
[create_stats] run 10/10

=== scenario 1000x100 ===
[write_symbol] 1000x100
[create_stats] run 1/10
[create_stats] run 2/10
[create_stats] run 3/10
[create_stats] run 4/10
[create_stats] run 5/10
[create_stats] run 6/10
[create_stats] run 7/10
[create_stats] run 8/10
[create_stats] run 9/10
[create_stats] run 10/10

    rows    cols      write_time_s    stats_time_min    stats_time_max    stats_time_var  stats_rss_min_mb  stats_rss_max_mb  stats_rss_var_mb
     500     500              0.06              0.13              0.14            0.0000              26.1              26.2              0.00
     500     100              0.02              0.03              0.04            0.0000              11.9              12.0              0.00
     600     600              0.07              0.16              0.17            0.0000              30.6              30.7              0.00
     700     700              0.08              0.19              0.20            0.0000              36.2              36.3              0.00
     800     500              0.06              0.15              0.15            0.0000              28.7              28.8              0.00
     900     800              0.09              0.23              0.25            0.0000              42.5              42.6              0.00
   1,000     100              0.02              0.04              0.04            0.0000              12.7              12.8              0.00

Comment on lines +32 to +40
# SCENARIOS = [
# (5_000, 5_000),
# (5_000, 10_000),
# (6_000, 6_000),
# (7_000, 7_000),
# (8_000, 5_000),
# (9_000, 8_000),
# (10_000, 10_000),
# ]
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't ship a commented-out alternative SCENARIOS block — either expose it as a CLI flag (e.g. --large selecting between two named presets) or delete it. As-is, switching between scenario sets requires editing the source.

Also, the active list maxes out at 1000 rows × 800 cols, which is very small for a create_column_stats performance benchmark — operations may complete in <10 ms and never get sampled by the RSS thread (see comment on bench_col_stats.py).

Comment on lines +64 to +72
for result in results:
elapsed_times = result.stats_elapsed_time
memory_values = result.stats_memory_delta_mb
print(
f"{result.rows:>12,} {result.cols:>6,}"
f" {result.symbol_write_time:>{column_width}.2f}"
f" {min(elapsed_times):>{column_width}.2f} {max(elapsed_times):>{column_width}.2f} {statistics.variance(elapsed_times):>{column_width}.4f}"
f" {min(memory_values):>{column_width}.1f} {max(memory_values):>{column_width}.1f} {statistics.variance(memory_values):>{column_width}.2f}"
)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Prior review feedback is only partially addressed here:

  1. The reviewer asked for mean, median, and max across measured iterations — this reports min/max/variance. Variance is harder to interpret and sensitive to outliers; please add mean and median (max is fine to keep).
  2. The reviewer also asked for a couple of un-measured warmup iterations before the N measured iterations. All 10 runs in measure() (lines 98-102) are still measured. Discard at least the first 1–2 to stabilise results — first-run latency on LMDB / page-cache effects can dominate.

delta_mb = process.memory_info().rss / 1e6 - rss_baseline_mb
if delta_mb > peak_rss_delta_mb[0]:
peak_rss_delta_mb[0] = delta_mb
time.sleep(0.01)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

10 ms sampling is too coarse for the currently-active scenarios in run_bench_col_stats.py (≤1000 rows × 800 cols). The sampler loop samples once, then sleeps; if the operation finishes during the first sleep, only the baseline sample is taken and peak_rss_delta_mb returns 0. Either drop the period (e.g. 1 ms) or guard against operations that completed below the sampler resolution (e.g. log a warning when elapsed_seconds < period).

@markovskipetar markovskipetar changed the title Column Stats Improvements Add script to benchmark column stats creation May 13, 2026
LMDB_PATH = "/tmp/arcticdb_bench_col_stats"
SYMBOL_NAME = "test_symbol"

signal.signal(signal.SIGINT, lambda *_: exit(130))
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You don't need this??

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Comment on lines +11 to +12
RUNS = 10
WRITE_SYMBOL_SCRIPT = Path(__file__).parent / "bench_write_symbol.py"
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Critical: these paths don't match the actual filenames. The worker scripts in this PR are col_stats_bench_write_symbol.py and col_stats_bench_create_stats.py, so every subprocess.run call will fail with FileNotFoundError and the orchestrator can't run at all. Please update the constants:

Suggested change
RUNS = 10
WRITE_SYMBOL_SCRIPT = Path(__file__).parent / "bench_write_symbol.py"
WRITE_SYMBOL_SCRIPT = Path(__file__).parent / "col_stats_bench_write_symbol.py"
CREATE_STATS_SCRIPT = Path(__file__).parent / "col_stats_bench_create_stats.py"

Comment on lines +15 to +23
SCENARIOS = [
(10, 10),
(500,500),
(400,400),
(500,500),
(1_000, 1_000),
(700,700),
(900,900),
]
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(500, 500) is listed twice and the list isn't size-ordered, which makes the output table harder to read. The whitespace style is also mixed ((500,500) vs (1_000, 1_000)). Please dedupe, sort by total cell count, and pick one style — e.g.:

Suggested change
SCENARIOS = [
(10, 10),
(500,500),
(400,400),
(500,500),
(1_000, 1_000),
(700,700),
(900,900),
]
SCENARIOS = [
(10, 10),
(400, 400),
(500, 500),
(700, 700),
(900, 900),
(1_000, 1_000),
]

Co-authored-by: Copilot <copilot@github.com>
columns=column_names,
)

start_time = time.time()
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bug: start_time is now reassigned on every iteration of the loop, so the post-loop elapsed_seconds = time.time() - start_time (line 40) only measures the last chunk's write/append — not the total write time. With the active scenarios (rows ≤ 1000 vs CHUNK_ROWS = 100_000) there is only ever one chunk, so the bug is silent today, but it will silently under-report as soon as a scenario exceeds 100k rows (and that's exactly when accurate write timing matters most).

If the intent of this move was to exclude np.random.rand(...) from the measurement, accumulate per chunk instead:

total_elapsed = 0.0
for chunk_start in range(0, rows, CHUNK_ROWS):
    chunk_row_count = min(CHUNK_ROWS, rows - chunk_start)
    chunk = pd.DataFrame(
        np.random.rand(chunk_row_count, cols).astype(np.float64),
        columns=column_names,
    )
    start_time = time.time()
    if chunk_start == 0:
        lib.write(SYMBOL_NAME, chunk)
    else:
        lib.append(SYMBOL_NAME, chunk)
    total_elapsed += time.time() - start_time

elapsed_seconds = total_elapsed

Also: line 27 has trailing whitespace and lines 24/32 add stray blank lines — these will likely be flagged by make lint.

markovskipetar and others added 5 commits May 14, 2026 15:53
Co-authored-by: Copilot <copilot@github.com>
Co-authored-by: Copilot <copilot@github.com>
Co-authored-by: Copilot <copilot@github.com>
Co-authored-by: Copilot <copilot@github.com>
def main():
cols = int(sys.argv[1])

lib = NativeMongoose("mktdatad").get_library("pmarkovski.columns_stats", api="v2")
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This push replaces self-contained LMDB with ahl.mongo.NativeMongoose and hardcodes a user-specific library namespace pmarkovski.columns_stats. Two concrete problems:

  1. The benchmark is now unrunnable outside Man Group (and the PR description still describes the LMDB flow / old script names / old CLI args — please update).
  2. The library/symbol pair is hardcoded identically in all three scripts. Two engineers running this concurrently would clobber each other's data on the shared mktdatad instance.

Please parametrise connection/library/symbol via argv or env vars (e.g. ARCTICDB_BENCH_LIBRARY, ARCTICDB_BENCH_SYMBOL) and pass them from the orchestrator, or at least define LIBRARY / SYMBOL constants in a shared module rather than triplicating the literals.

lib = NativeMongoose("mktdatad").get_library("pmarkovski.columns_stats", api="v2")
try:
lib.delete("test_symbol")
except Exception:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bare except Exception: pass swallows real failures here (mongo auth, connection drops, permission errors, library-not-found). If cleanup silently fails, the next run sees a stale symbol and the measurements are garbage. At minimum log it:

Suggested change
except Exception:
try:
lib.delete("test_symbol")
except Exception as exc:
print(f"cleanup: lib.delete failed: {exc}", file=sys.stderr)

Also note that this is called from if __name__ == "__main__": before any scenario runs — if mongo is unreachable at startup the script will only fail later on the first write/create_stats subprocess, with no obvious clue that the issue is auth/connectivity.

Comment on lines +17 to +23
(10, 10),
(1_000, 1_000),
(100_000, 1_000),
(100_000, 10_000),
(1_000_000, 1_000),
(1_000_000, 5_000),
(10_000_000, 1_000),
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This push reinstates the massive-scenario block that the previous review summary tracked as resolved. Be aware of what this implies for the worker:

  • (100_000, 10_000): a single 100K × 10K float64 chunk is ~8 GB per chunk in col_stats_bench_write_symbol.py. That will OOM most dev machines.
  • (10_000_000, 1_000): ~80 GB total writes across 100 chunks of ~800 MB; many hours against mktdatad.
  • (1_000_000, 5_000): ~40 GB total, 4 GB per chunk.

If these are intentional targets against the shared Man Group mongo, fine — but (a) call this out in the PR description so reviewers know the expected wall-clock and storage cost, and (b) consider gating the heavy scenarios behind a --full flag so a casual run doesn't accidentally write tens of GB to mktdatad under pmarkovski.columns_stats.

…row-slicing+append)

Co-authored-by: Copilot <copilot@github.com>
)
return json.loads(completed.stdout)
except subprocess.CalledProcessError as e:
raise MemoryError(f"[{label}] exited with code {e.returncode}") from None
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

raise MemoryError(...) classifies every non-zero subprocess exit as an OOM — a mongo connection blip, a KeyError looking up the library, an unhandled AssertionError, a programmer bug in the worker, all become "MemoryError" here. That trips run_phase's except MemoryError and silently terminates the row/col walk, hiding the real failure.

It also discards the previous distinction between SIGKILL (negative returncode, the actual OOM-killer signature) and a regular exit code (almost always a bug or environment issue).

Either:

  • only re-classify as OOM when e.returncode < 0 (or specifically -signal.SIGKILL), and re-raise other failures as RuntimeError so they propagate up to the user, or
  • inspect the worker's stderr / dmesg for an OOM marker before deciding.

As written, a single mongo hiccup in the warmup of the first scenario will silently end the entire benchmark.

Comment on lines +79 to +90
def row_scenarios():
rows = BASE_ROWS
while True:
yield rows, BASE_COLS
rows *= 10


def col_scenarios():
cols = BASE_COLS * 10
while True:
yield BASE_ROWS, cols
cols *= 10
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both row_scenarios() and col_scenarios() are unbounded generators that escalate by 10× indefinitely. The only stopping condition is the MemoryError raised inside run_subprocess — which, per the comment on line 40, currently fires for any non-zero exit, not just real OOMs.

Consequences:

  • On a sufficiently large machine, the row phase never terminates and col_scenarios() is never reached.
  • On any machine, a transient mongo/network blip terminates the phase before its actual memory ceiling is found.
  • The 10× step is also coarse: col_scenarios() jumps (100_000, 1_270) → (100_000, 12_700) → (100_000, 127_000); the second step is already ~10 GB of float64 in the writer (see col_stats_bench_write_symbol.py — it now materialises the whole frame in one np.random.rand call), so the col walk almost certainly OOMs on the second yield and you get one data point for the whole phase.

Please bound the walk explicitly (e.g. max iterations, or rows * cols cap) and consider a 2× step so the failure point is found with usable resolution.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants