Add script to benchmark column stats creation#3106
Conversation
|
Label error. Requires exactly 1 of: patch, minor, major. Found: |
| parser.add_argument("--simple_tests", action="store_true", help="Run only the first 4 scenarios") | ||
| args = parser.parse_args() | ||
|
|
||
| scenarios = SCENARIOS[:2] if args.simple_tests else SCENARIOS |
There was a problem hiding this comment.
Bug: this slices the first 2 scenarios, but the --simple_tests help text on line 35 advertises "Run only the first 4 scenarios". Either update the help string or change the slice (e.g. SCENARIOS[:4]) so the flag actually does what it claims.
ArcticDB Code Review SummaryThis PR adds a non-ASV benchmark harness for PR Title & Description
Correctness
Code Quality
Prior review feedback now addressed
|
poodlewars
left a comment
There was a problem hiding this comment.
PR titles should always be more descriptive than this, eg Add script to benchmark column stats creation rather than Column stats improvements. You should also always include a description.
|
|
||
|
|
||
| atexit.register(_cleanup) | ||
| signal.signal(signal.SIGINT, lambda *_: exit(130)) |
There was a problem hiding this comment.
You shouldn't do cleanup like this. This file should end with:
if __name__ == "__main__":
try:
run()
finally:
cleanup()
or you could put the same idea in your run(). There's no need for exit handlers.
|
|
||
| col_stats = {f"col_{i}": {"MINMAX"} for i in range(cols)} | ||
|
|
||
| mem_before = psutil.Process().memory_info().rss / 1e6 |
There was a problem hiding this comment.
These memory measurements are wrong, we care about peak RSS for the function calls, not RSS after they're done. This means that each experiment must be run in its own process. We also care about measuring peak RSS for the write call. This means that in each run of the script you should just do a write or create stats for a single example, then you can write a bash script to run your code across the different examples and summarize the results in a readable way.
| mem_before = psutil.Process().memory_info().rss / 1e6 | ||
| t0 = time.time() | ||
| nvs.create_column_stats(sym, col_stats) | ||
| stats_time = time.time() - t0 |
There was a problem hiding this comment.
Unless test cases are very slow, it's always best to run timings with a couple of un-measured warmup iterations, then N measured iterations, to get more stable results. Then report the mean, median and max across the measured iterations.
|
Claude is right about the very large examples. These should be created by chunking up the dataframe with a series of |
Co-authored-by: Copilot <copilot@github.com>
|
=== scenario 500x500 === === scenario 500x100 === === scenario 600x600 === === scenario 700x700 === === scenario 800x500 === === scenario 900x800 === === scenario 1000x100 === |
| # SCENARIOS = [ | ||
| # (5_000, 5_000), | ||
| # (5_000, 10_000), | ||
| # (6_000, 6_000), | ||
| # (7_000, 7_000), | ||
| # (8_000, 5_000), | ||
| # (9_000, 8_000), | ||
| # (10_000, 10_000), | ||
| # ] |
There was a problem hiding this comment.
Don't ship a commented-out alternative SCENARIOS block — either expose it as a CLI flag (e.g. --large selecting between two named presets) or delete it. As-is, switching between scenario sets requires editing the source.
Also, the active list maxes out at 1000 rows × 800 cols, which is very small for a create_column_stats performance benchmark — operations may complete in <10 ms and never get sampled by the RSS thread (see comment on bench_col_stats.py).
| for result in results: | ||
| elapsed_times = result.stats_elapsed_time | ||
| memory_values = result.stats_memory_delta_mb | ||
| print( | ||
| f"{result.rows:>12,} {result.cols:>6,}" | ||
| f" {result.symbol_write_time:>{column_width}.2f}" | ||
| f" {min(elapsed_times):>{column_width}.2f} {max(elapsed_times):>{column_width}.2f} {statistics.variance(elapsed_times):>{column_width}.4f}" | ||
| f" {min(memory_values):>{column_width}.1f} {max(memory_values):>{column_width}.1f} {statistics.variance(memory_values):>{column_width}.2f}" | ||
| ) |
There was a problem hiding this comment.
Prior review feedback is only partially addressed here:
- The reviewer asked for mean, median, and max across measured iterations — this reports min/max/variance. Variance is harder to interpret and sensitive to outliers; please add mean and median (max is fine to keep).
- The reviewer also asked for a couple of un-measured warmup iterations before the N measured iterations. All 10 runs in
measure()(lines 98-102) are still measured. Discard at least the first 1–2 to stabilise results — first-run latency on LMDB / page-cache effects can dominate.
| delta_mb = process.memory_info().rss / 1e6 - rss_baseline_mb | ||
| if delta_mb > peak_rss_delta_mb[0]: | ||
| peak_rss_delta_mb[0] = delta_mb | ||
| time.sleep(0.01) |
There was a problem hiding this comment.
10 ms sampling is too coarse for the currently-active scenarios in run_bench_col_stats.py (≤1000 rows × 800 cols). The sampler loop samples once, then sleeps; if the operation finishes during the first sleep, only the baseline sample is taken and peak_rss_delta_mb returns 0. Either drop the period (e.g. 1 ms) or guard against operations that completed below the sampler resolution (e.g. log a warning when elapsed_seconds < period).
| LMDB_PATH = "/tmp/arcticdb_bench_col_stats" | ||
| SYMBOL_NAME = "test_symbol" | ||
|
|
||
| signal.signal(signal.SIGINT, lambda *_: exit(130)) |
…or, symbol write, col stats create
| RUNS = 10 | ||
| WRITE_SYMBOL_SCRIPT = Path(__file__).parent / "bench_write_symbol.py" |
There was a problem hiding this comment.
Critical: these paths don't match the actual filenames. The worker scripts in this PR are col_stats_bench_write_symbol.py and col_stats_bench_create_stats.py, so every subprocess.run call will fail with FileNotFoundError and the orchestrator can't run at all. Please update the constants:
| RUNS = 10 | |
| WRITE_SYMBOL_SCRIPT = Path(__file__).parent / "bench_write_symbol.py" | |
| WRITE_SYMBOL_SCRIPT = Path(__file__).parent / "col_stats_bench_write_symbol.py" | |
| CREATE_STATS_SCRIPT = Path(__file__).parent / "col_stats_bench_create_stats.py" |
| SCENARIOS = [ | ||
| (10, 10), | ||
| (500,500), | ||
| (400,400), | ||
| (500,500), | ||
| (1_000, 1_000), | ||
| (700,700), | ||
| (900,900), | ||
| ] |
There was a problem hiding this comment.
(500, 500) is listed twice and the list isn't size-ordered, which makes the output table harder to read. The whitespace style is also mixed ((500,500) vs (1_000, 1_000)). Please dedupe, sort by total cell count, and pick one style — e.g.:
| SCENARIOS = [ | |
| (10, 10), | |
| (500,500), | |
| (400,400), | |
| (500,500), | |
| (1_000, 1_000), | |
| (700,700), | |
| (900,900), | |
| ] | |
| SCENARIOS = [ | |
| (10, 10), | |
| (400, 400), | |
| (500, 500), | |
| (700, 700), | |
| (900, 900), | |
| (1_000, 1_000), | |
| ] |
Co-authored-by: Copilot <copilot@github.com>
| columns=column_names, | ||
| ) | ||
|
|
||
| start_time = time.time() |
There was a problem hiding this comment.
Bug: start_time is now reassigned on every iteration of the loop, so the post-loop elapsed_seconds = time.time() - start_time (line 40) only measures the last chunk's write/append — not the total write time. With the active scenarios (rows ≤ 1000 vs CHUNK_ROWS = 100_000) there is only ever one chunk, so the bug is silent today, but it will silently under-report as soon as a scenario exceeds 100k rows (and that's exactly when accurate write timing matters most).
If the intent of this move was to exclude np.random.rand(...) from the measurement, accumulate per chunk instead:
total_elapsed = 0.0
for chunk_start in range(0, rows, CHUNK_ROWS):
chunk_row_count = min(CHUNK_ROWS, rows - chunk_start)
chunk = pd.DataFrame(
np.random.rand(chunk_row_count, cols).astype(np.float64),
columns=column_names,
)
start_time = time.time()
if chunk_start == 0:
lib.write(SYMBOL_NAME, chunk)
else:
lib.append(SYMBOL_NAME, chunk)
total_elapsed += time.time() - start_time
elapsed_seconds = total_elapsedAlso: line 27 has trailing whitespace and lines 24/32 add stray blank lines — these will likely be flagged by make lint.
Co-authored-by: Copilot <copilot@github.com>
Co-authored-by: Copilot <copilot@github.com>
Co-authored-by: Copilot <copilot@github.com>
Co-authored-by: Copilot <copilot@github.com>
| def main(): | ||
| cols = int(sys.argv[1]) | ||
|
|
||
| lib = NativeMongoose("mktdatad").get_library("pmarkovski.columns_stats", api="v2") |
There was a problem hiding this comment.
This push replaces self-contained LMDB with ahl.mongo.NativeMongoose and hardcodes a user-specific library namespace pmarkovski.columns_stats. Two concrete problems:
- The benchmark is now unrunnable outside Man Group (and the PR description still describes the LMDB flow / old script names / old CLI args — please update).
- The library/symbol pair is hardcoded identically in all three scripts. Two engineers running this concurrently would clobber each other's data on the shared
mktdatadinstance.
Please parametrise connection/library/symbol via argv or env vars (e.g. ARCTICDB_BENCH_LIBRARY, ARCTICDB_BENCH_SYMBOL) and pass them from the orchestrator, or at least define LIBRARY / SYMBOL constants in a shared module rather than triplicating the literals.
| lib = NativeMongoose("mktdatad").get_library("pmarkovski.columns_stats", api="v2") | ||
| try: | ||
| lib.delete("test_symbol") | ||
| except Exception: |
There was a problem hiding this comment.
Bare except Exception: pass swallows real failures here (mongo auth, connection drops, permission errors, library-not-found). If cleanup silently fails, the next run sees a stale symbol and the measurements are garbage. At minimum log it:
| except Exception: | |
| try: | |
| lib.delete("test_symbol") | |
| except Exception as exc: | |
| print(f"cleanup: lib.delete failed: {exc}", file=sys.stderr) |
Also note that this is called from if __name__ == "__main__": before any scenario runs — if mongo is unreachable at startup the script will only fail later on the first write/create_stats subprocess, with no obvious clue that the issue is auth/connectivity.
| (10, 10), | ||
| (1_000, 1_000), | ||
| (100_000, 1_000), | ||
| (100_000, 10_000), | ||
| (1_000_000, 1_000), | ||
| (1_000_000, 5_000), | ||
| (10_000_000, 1_000), |
There was a problem hiding this comment.
This push reinstates the massive-scenario block that the previous review summary tracked as resolved. Be aware of what this implies for the worker:
(100_000, 10_000): a single 100K × 10K float64 chunk is ~8 GB per chunk incol_stats_bench_write_symbol.py. That will OOM most dev machines.(10_000_000, 1_000): ~80 GB total writes across 100 chunks of ~800 MB; many hours againstmktdatad.(1_000_000, 5_000): ~40 GB total, 4 GB per chunk.
If these are intentional targets against the shared Man Group mongo, fine — but (a) call this out in the PR description so reviewers know the expected wall-clock and storage cost, and (b) consider gating the heavy scenarios behind a --full flag so a casual run doesn't accidentally write tens of GB to mktdatad under pmarkovski.columns_stats.
…row-slicing+append) Co-authored-by: Copilot <copilot@github.com>
| ) | ||
| return json.loads(completed.stdout) | ||
| except subprocess.CalledProcessError as e: | ||
| raise MemoryError(f"[{label}] exited with code {e.returncode}") from None |
There was a problem hiding this comment.
raise MemoryError(...) classifies every non-zero subprocess exit as an OOM — a mongo connection blip, a KeyError looking up the library, an unhandled AssertionError, a programmer bug in the worker, all become "MemoryError" here. That trips run_phase's except MemoryError and silently terminates the row/col walk, hiding the real failure.
It also discards the previous distinction between SIGKILL (negative returncode, the actual OOM-killer signature) and a regular exit code (almost always a bug or environment issue).
Either:
- only re-classify as OOM when
e.returncode < 0(or specifically-signal.SIGKILL), and re-raise other failures asRuntimeErrorso they propagate up to the user, or - inspect the worker's stderr /
dmesgfor an OOM marker before deciding.
As written, a single mongo hiccup in the warmup of the first scenario will silently end the entire benchmark.
| def row_scenarios(): | ||
| rows = BASE_ROWS | ||
| while True: | ||
| yield rows, BASE_COLS | ||
| rows *= 10 | ||
|
|
||
|
|
||
| def col_scenarios(): | ||
| cols = BASE_COLS * 10 | ||
| while True: | ||
| yield BASE_ROWS, cols | ||
| cols *= 10 |
There was a problem hiding this comment.
Both row_scenarios() and col_scenarios() are unbounded generators that escalate by 10× indefinitely. The only stopping condition is the MemoryError raised inside run_subprocess — which, per the comment on line 40, currently fires for any non-zero exit, not just real OOMs.
Consequences:
- On a sufficiently large machine, the row phase never terminates and
col_scenarios()is never reached. - On any machine, a transient mongo/network blip terminates the phase before its actual memory ceiling is found.
- The 10× step is also coarse:
col_scenarios()jumps(100_000, 1_270) → (100_000, 12_700) → (100_000, 127_000); the second step is already ~10 GB of float64 in the writer (seecol_stats_bench_write_symbol.py— it now materialises the whole frame in onenp.random.randcall), so the col walk almost certainly OOMs on the second yield and you get one data point for the whole phase.
Please bound the walk explicitly (e.g. max iterations, or rows * cols cap) and consider a 2× step so the failure point is found with usable resolution.
Column Stats Benchmarks
Standalone benchmarks for measuring the performance and memory usage of ArcticDB column stats creation. These are not part of the ASV framework — they are designed for manual profiling and regression checks without ASV overhead.
Scripts
run_bench_col_stats.pybench_col_stats.pyHow It Works
run_bench_col_stats.pyiterates over a set of(rows, cols)scenarios. For each scenario it:float64dataframe of the given shape to LMDB via a subprocess.create_column_stats(MINMAX on all columns) 10 times, each in its own subprocess.Each operation runs in a fresh subprocess (
bench_col_stats.py) so that:Results are returned as JSON over stdout and parsed by the orchestrator.
Usage
--scenarioisROWSxCOLS.--operationis eitherwrite_symbolorcreate_stats.Output
write_time_s— wall-clock time in seconds for the symbol write (one run).stats_time_*— min, max, and variance of elapsed time across 10 stats-creation runs.stats_rss_*— min, max, and variance of peak RSS increase in MB across the same 10 runs. Sampled at 10 ms intervals inside the worker process.High variance in either column is a signal of instability worth investigating.
Scenarios
Default scenarios in
run_bench_col_stats.py:A commented-out block of larger scenarios (
5,000×5,000up to10,000×10,000) is also present for heavier profiling — uncomment and replaceSCENARIOSto use them.Cleanup
The LMDB store at
/tmp/arcticdb_bench_col_statsis cleaned up automatically:SIGINT/SIGTERMviaatexitand signal handlers.