Add script to benchmark column stats creation by markovskipetar · Pull Request #3106 · man-group/ArcticDB

markovskipetar · 2026-05-13T11:27:18Z

Column Stats Benchmarks

Standalone benchmarks for measuring the performance and memory usage of ArcticDB column stats creation. These are not part of the ASV framework — they are designed for manual profiling and regression checks without ASV overhead.

Scripts

Script	Role
`run_bench_col_stats.py`	Orchestrator — runs all scenarios and prints a summary table
`bench_col_stats.py`	Worker — runs a single operation in isolation and prints JSON results

How It Works

run_bench_col_stats.py iterates over a set of (rows, cols) scenarios. For each scenario it:

Writes a float64 dataframe of the given shape to LMDB via a subprocess.
Runs create_column_stats (MINMAX on all columns) 10 times, each in its own subprocess.
Collects elapsed time and peak RSS delta from each run.
Prints a summary table with min, max, and variance for both time and memory.

Each operation runs in a fresh subprocess (bench_col_stats.py) so that:

RSS measurements start from a clean baseline, unaffected by previous runs or accumulated Python/ArcticDB state.
A crash or OOM kill in one run does not abort the entire benchmark — the orchestrator catches the failure, cleans up, and reports a descriptive error.

Results are returned as JSON over stdout and parsed by the orchestrator.

Usage

# Run all scenarios
python python/benchmarks/non_asv/run_bench_col_stats.py

# Run a single operation manually (useful for profiling one case)
python python/benchmarks/non_asv/bench_col_stats.py --scenario 500x500 --operation write_symbol
python python/benchmarks/non_asv/bench_col_stats.py --scenario 500x500 --operation create_stats

--scenario is ROWSxCOLS. --operation is either write_symbol or create_stats.

Output

        rows    cols      write_time_s    stats_time_min    stats_time_max    stats_time_var    stats_rss_min_mb    stats_rss_max_mb    stats_rss_var_mb
---------------------------------------------------------------------------------------------------------------------------------------------------------
         500     500              0.42              1.23              1.45            0.0031                45.2                52.1                1.23

write_time_s — wall-clock time in seconds for the symbol write (one run).
stats_time_* — min, max, and variance of elapsed time across 10 stats-creation runs.
stats_rss_* — min, max, and variance of peak RSS increase in MB across the same 10 runs. Sampled at 10 ms intervals inside the worker process.

High variance in either column is a signal of instability worth investigating.

Scenarios

Default scenarios in run_bench_col_stats.py:

rows	cols
500	500
500	100
600	600
700	700
800	500
900	800
1,000	100

A commented-out block of larger scenarios (5,000×5,000 up to 10,000×10,000) is also present for heavier profiling — uncomment and replace SCENARIOS to use them.

Cleanup

The LMDB store at /tmp/arcticdb_bench_col_stats is cleaned up automatically:

After each scenario (symbol deleted so the next write starts fresh).
On normal exit and on SIGINT/SIGTERM via atexit and signal handlers.

…reation)

github-actions · 2026-05-13T11:27:29Z

Label error. Requires exactly 1 of: patch, minor, major. Found:

claude · 2026-05-13T11:29:17Z

+parser.add_argument("--simple_tests", action="store_true", help="Run only the first 4 scenarios")
+args = parser.parse_args()
+
+scenarios = SCENARIOS[:2] if args.simple_tests else SCENARIOS


Bug: this slices the first 2 scenarios, but the --simple_tests help text on line 35 advertises "Run only the first 4 scenarios". Either update the help string or change the slice (e.g. SCENARIOS[:4]) so the flag actually does what it claims.

claude · 2026-05-13T11:31:25Z

ArcticDB Code Review Summary

This PR adds a non-ASV benchmark harness for create_column_stats (worker + orchestrator, peak RSS via ru_maxrss in a fresh subprocess per experiment). The latest push adds a KeyboardInterrupt handler to the orchestrator top-level try so Ctrl-C exits cleanly rather than printing "Benchmark aborted unexpectedly". No other code change in this delta. Items below still need attention.

PR Title & Description

✅ ~~Title "Column Stats Improvements" misleading~~ — retitled to "Add script to benchmark column stats creation".
❌ Description is badly stale. The body still describes the original LMDB flow, the old script names run_bench_col_stats.py / bench_col_stats.py, the old --scenario / --operation flags, the old SCENARIOS table (≤1000×1000), and the old output format. The current code uses three scripts (col_stats_bench_orchestrator.py, col_stats_bench_write_symbol.py, col_stats_bench_create_stats.py), ahl.mongo.NativeMongoose, infinite row/col generators (stepping by 3×), and a different output table. Please refresh the PR body end-to-end before merge.
⚠️ PR has no labels — no-release-notes should be set since this is developer-tooling only.

Correctness

❌ run_subprocess re-raises every non-zero subprocess exit as MemoryError (col_stats_bench_orchestrator.py:40). Any failure — a mongo blip, a KeyError from the library lookup, an unhandled exception in the worker, a programming bug — trips run_phase except MemoryError and silently terminates the phase. The prior code distinguished signal-killed (likely OOM) from a regular exit code; that distinction is gone. Inline comment posted in a previous review.
⚠️ row_scenarios() and col_scenarios() are still infinite generators (col_stats_bench_orchestrator.py:79–90). MULTIPLIER = 3 softens the step, but there is still no explicit termination condition; on a large machine the run never ends, and on any machine a transient non-OOM failure (see broad-MemoryError above) aborts the phase before the real memory ceiling is found. Please add an explicit upper bound (rows×cols ceiling or max scenario count).

Code Quality

❌ Hardcoded Man-Group-internal dependency. All three scripts from ahl.mongo import NativeMongoose and target pmarkovski.columns_stats on mktdatad. The benchmark cannot run outside Man Group, and concurrent runs by different engineers collide on the same shared library/symbol. Parametrise library / symbol / connection via argv or env vars; at minimum extract LIBRARY_NAME / SYMBOL_NAME constants. Previous inline comment on col_stats_bench_create_stats.py:16 still applies.
⚠️ Partial results are silently dropped. In measure(), result = Result(...) is appended before the stats loop. If any stats subprocess raises MemoryError, result.stats_create_times stays empty and print_results() skips that row — the successful symbol_write_time is never reported.
⚠️ peak_rss_mb returned by col_stats_bench_write_symbol.py is not consumed by the orchestrator (only elapsed_seconds is read). Either drop the measurement or hook it into the output table. As-is it is misleading dead code, especially since ru_maxrss is captured after the np.random.rand(rows, cols) allocation, which for the larger scenarios dominates any RSS reading.
⚠️ Bare except Exception: pass in cleanup() still silently swallows mongo auth / connectivity / permission failures. Previous inline comment still applies.
⚠️ "test_symbol" literal and the NativeMongoose("mktdatad").get_library("pmarkovski.columns_stats", api="v2") lookup are duplicated across all three scripts. Replace with a shared constants module or pass via argv.
⚠️ np.random.rand(...) is still called without a seed in the worker, so successive runs across machines/processes are not directly comparable.
⚠️ lib._nvs still reaches into a private attribute. Acceptable for a one-off bench script, but a one-line comment noting why the public API is not used would help.

Prior review feedback now addressed

✅ Mean / max reporting across measured iterations.
✅ WARMUP_RUNS = 2 runs before RUNS = 10 measured iterations.
✅ ~~--simple_tests inconsistent with help string~~ — flag removed.
✅ ~~col_stats_bench_write_symbol.py post-loop timing only measured the last chunk~~ — fixed by removing the chunked loop entirely.
✅ ~~Orchestrator constants WRITE_SYMBOL_SCRIPT / CREATE_STATS_SCRIPT referenced filenames that did not exist~~ — fixed.
✅ ~~Trailing-whitespace lint issues~~ — clean.
✅ ~~Commented-out alternative SCENARIOS block~~ — removed.
✅ col_scenarios() step softened from 10× to 3× (MULTIPLIER = 3) — the immediate "second yield already OOMs" concern from the prior review is mitigated, though the generators remain unbounded (see Correctness above).
✅ KeyboardInterrupt is now caught at top level (col_stats_bench_orchestrator.py:131–132) — Ctrl-C now exits cleanly via cleanup() rather than producing a misleading "aborted unexpectedly" message.

poodlewars

PR titles should always be more descriptive than this, eg Add script to benchmark column stats creation rather than Column stats improvements. You should also always include a description.

poodlewars · 2026-05-13T11:30:29Z

+
+
+atexit.register(_cleanup)
+signal.signal(signal.SIGINT, lambda *_: exit(130))


You shouldn't do cleanup like this. This file should end with:

if __name__ == "__main__": try: run() finally: cleanup()

or you could put the same idea in your run(). There's no need for exit handlers.

poodlewars · 2026-05-13T11:31:43Z

+
+    col_stats = {f"col_{i}": {"MINMAX"} for i in range(cols)}
+
+    mem_before = psutil.Process().memory_info().rss / 1e6


These memory measurements are wrong, we care about peak RSS for the function calls, not RSS after they're done. This means that each experiment must be run in its own process. We also care about measuring peak RSS for the write call. This means that in each run of the script you should just do a write or create stats for a single example, then you can write a bash script to run your code across the different examples and summarize the results in a readable way.

poodlewars · 2026-05-13T11:33:31Z

+    mem_before = psutil.Process().memory_info().rss / 1e6
+    t0 = time.time()
+    nvs.create_column_stats(sym, col_stats)
+    stats_time = time.time() - t0


Unless test cases are very slow, it's always best to run timings with a couple of un-measured warmup iterations, then N measured iterations, to get more stable results. Then report the mean, median and max across the measured iterations.

poodlewars · 2026-05-13T11:35:37Z

Claude is right about the very large examples. These should be created by chunking up the dataframe with a series of append calls rather than ever loading the large dataframe in to memory.

Co-authored-by: Copilot <copilot@github.com>

markovskipetar · 2026-05-13T15:37:07Z

=== scenario 500x500 ===
[write_symbol] 500x500
[create_stats] run 1/10
[create_stats] run 2/10
[create_stats] run 3/10
[create_stats] run 4/10
[create_stats] run 5/10
[create_stats] run 6/10
[create_stats] run 7/10
[create_stats] run 8/10
[create_stats] run 9/10
[create_stats] run 10/10

=== scenario 500x100 ===
[write_symbol] 500x100
[create_stats] run 1/10
[create_stats] run 2/10
[create_stats] run 3/10
[create_stats] run 4/10
[create_stats] run 5/10
[create_stats] run 6/10
[create_stats] run 7/10
[create_stats] run 8/10
[create_stats] run 9/10
[create_stats] run 10/10

=== scenario 600x600 ===
[write_symbol] 600x600
[create_stats] run 1/10
[create_stats] run 2/10
[create_stats] run 3/10
[create_stats] run 4/10
[create_stats] run 5/10
[create_stats] run 6/10
[create_stats] run 7/10
[create_stats] run 8/10
[create_stats] run 9/10
[create_stats] run 10/10

=== scenario 700x700 ===
[write_symbol] 700x700
[create_stats] run 1/10
[create_stats] run 2/10
[create_stats] run 3/10
[create_stats] run 4/10
[create_stats] run 5/10
[create_stats] run 6/10
[create_stats] run 7/10
[create_stats] run 8/10
[create_stats] run 9/10
[create_stats] run 10/10

=== scenario 800x500 ===
[write_symbol] 800x500
[create_stats] run 1/10
[create_stats] run 2/10
[create_stats] run 3/10
[create_stats] run 4/10
[create_stats] run 5/10
[create_stats] run 6/10
[create_stats] run 7/10
[create_stats] run 8/10
[create_stats] run 9/10
[create_stats] run 10/10

=== scenario 900x800 ===
[write_symbol] 900x800
[create_stats] run 1/10
[create_stats] run 2/10
[create_stats] run 3/10
[create_stats] run 4/10
[create_stats] run 5/10
[create_stats] run 6/10
[create_stats] run 7/10
[create_stats] run 8/10
[create_stats] run 9/10
[create_stats] run 10/10

=== scenario 1000x100 ===
[write_symbol] 1000x100
[create_stats] run 1/10
[create_stats] run 2/10
[create_stats] run 3/10
[create_stats] run 4/10
[create_stats] run 5/10
[create_stats] run 6/10
[create_stats] run 7/10
[create_stats] run 8/10
[create_stats] run 9/10
[create_stats] run 10/10

    rows    cols      write_time_s    stats_time_min    stats_time_max    stats_time_var  stats_rss_min_mb  stats_rss_max_mb  stats_rss_var_mb
     500     500              0.06              0.13              0.14            0.0000              26.1              26.2              0.00
     500     100              0.02              0.03              0.04            0.0000              11.9              12.0              0.00
     600     600              0.07              0.16              0.17            0.0000              30.6              30.7              0.00
     700     700              0.08              0.19              0.20            0.0000              36.2              36.3              0.00
     800     500              0.06              0.15              0.15            0.0000              28.7              28.8              0.00
     900     800              0.09              0.23              0.25            0.0000              42.5              42.6              0.00
   1,000     100              0.02              0.04              0.04            0.0000              12.7              12.8              0.00

claude · 2026-05-13T15:38:39Z

+# SCENARIOS = [
+#     (5_000, 5_000),
+#     (5_000, 10_000),
+#     (6_000, 6_000),
+#     (7_000, 7_000),
+#     (8_000, 5_000),
+#     (9_000, 8_000),
+#     (10_000, 10_000),
+# ]


Don't ship a commented-out alternative SCENARIOS block — either expose it as a CLI flag (e.g. --large selecting between two named presets) or delete it. As-is, switching between scenario sets requires editing the source.

Also, the active list maxes out at 1000 rows × 800 cols, which is very small for a create_column_stats performance benchmark — operations may complete in <10 ms and never get sampled by the RSS thread (see comment on bench_col_stats.py).

claude · 2026-05-13T15:38:43Z

+    for result in results:
+        elapsed_times = result.stats_elapsed_time
+        memory_values = result.stats_memory_delta_mb
+        print(
+            f"{result.rows:>12,}  {result.cols:>6,}"
+            f"  {result.symbol_write_time:>{column_width}.2f}"
+            f"  {min(elapsed_times):>{column_width}.2f}  {max(elapsed_times):>{column_width}.2f}  {statistics.variance(elapsed_times):>{column_width}.4f}"
+            f"  {min(memory_values):>{column_width}.1f}  {max(memory_values):>{column_width}.1f}  {statistics.variance(memory_values):>{column_width}.2f}"
+        )


Prior review feedback is only partially addressed here:

The reviewer asked for mean, median, and max across measured iterations — this reports min/max/variance. Variance is harder to interpret and sensitive to outliers; please add mean and median (max is fine to keep).

The reviewer also asked for a couple of un-measured warmup iterations before the N measured iterations. All 10 runs in measure() (lines 98-102) are still measured. Discard at least the first 1–2 to stabilise results — first-run latency on LMDB / page-cache effects can dominate.

claude · 2026-05-13T15:38:45Z

+            delta_mb = process.memory_info().rss / 1e6 - rss_baseline_mb
+            if delta_mb > peak_rss_delta_mb[0]:
+                peak_rss_delta_mb[0] = delta_mb
+            time.sleep(0.01)


10 ms sampling is too coarse for the currently-active scenarios in run_bench_col_stats.py (≤1000 rows × 800 cols). The sampler loop samples once, then sleeps; if the operation finishes during the first sleep, only the baseline sample is taken and peak_rss_delta_mb returns 0. Either drop the period (e.g. 1 ms) or guard against operations that completed below the sampler resolution (e.g. log a warning when elapsed_seconds < period).

poodlewars · 2026-05-13T15:57:41Z

+LMDB_PATH = "/tmp/arcticdb_bench_col_stats"
+SYMBOL_NAME = "test_symbol"
+
+signal.signal(signal.SIGINT, lambda *_: exit(130))


You don't need this??

…or, symbol write, col stats create

claude · 2026-05-14T12:14:36Z

+RUNS = 10
+WRITE_SYMBOL_SCRIPT = Path(__file__).parent / "bench_write_symbol.py"


Critical: these paths don't match the actual filenames. The worker scripts in this PR are col_stats_bench_write_symbol.py and col_stats_bench_create_stats.py, so every subprocess.run call will fail with FileNotFoundError and the orchestrator can't run at all. Please update the constants:

Suggested change

RUNS = 10

WRITE_SYMBOL_SCRIPT = Path(__file__).parent / "bench_write_symbol.py"

WRITE_SYMBOL_SCRIPT = Path(__file__).parent / "col_stats_bench_write_symbol.py"

CREATE_STATS_SCRIPT = Path(__file__).parent / "col_stats_bench_create_stats.py"

claude · 2026-05-14T12:14:39Z

+SCENARIOS = [
+    (10, 10),
+    (500,500),
+    (400,400),
+    (500,500),
+    (1_000, 1_000),
+    (700,700),
+    (900,900),
+]


(500, 500) is listed twice and the list isn't size-ordered, which makes the output table harder to read. The whitespace style is also mixed ((500,500) vs (1_000, 1_000)). Please dedupe, sort by total cell count, and pick one style — e.g.:

Suggested change

SCENARIOS = [

(10, 10),

(500,500),

(400,400),

(500,500),

(1_000, 1_000),

(700,700),

(900,900),

]

SCENARIOS = [

(10, 10),

(400, 400),

(500, 500),

(700, 700),

(900, 900),

(1_000, 1_000),

]

Co-authored-by: Copilot <copilot@github.com>

claude · 2026-05-14T12:26:07Z

+            columns=column_names,
+        )
+
+        start_time = time.time()


Bug: start_time is now reassigned on every iteration of the loop, so the post-loop elapsed_seconds = time.time() - start_time (line 40) only measures the last chunk's write/append — not the total write time. With the active scenarios (rows ≤ 1000 vs CHUNK_ROWS = 100_000) there is only ever one chunk, so the bug is silent today, but it will silently under-report as soon as a scenario exceeds 100k rows (and that's exactly when accurate write timing matters most).

If the intent of this move was to exclude np.random.rand(...) from the measurement, accumulate per chunk instead:

total_elapsed = 0.0 for chunk_start in range(0, rows, CHUNK_ROWS): chunk_row_count = min(CHUNK_ROWS, rows - chunk_start) chunk = pd.DataFrame( np.random.rand(chunk_row_count, cols).astype(np.float64), columns=column_names, ) start_time = time.time() if chunk_start == 0: lib.write(SYMBOL_NAME, chunk) else: lib.append(SYMBOL_NAME, chunk) total_elapsed += time.time() - start_time elapsed_seconds = total_elapsed

Also: line 27 has trailing whitespace and lines 24/32 add stray blank lines — these will likely be flagged by make lint.

Co-authored-by: Copilot <copilot@github.com>

claude · 2026-05-15T16:23:43Z

+def main():
+    cols = int(sys.argv[1])
+
+    lib = NativeMongoose("mktdatad").get_library("pmarkovski.columns_stats", api="v2")


This push replaces self-contained LMDB with ahl.mongo.NativeMongoose and hardcodes a user-specific library namespace pmarkovski.columns_stats. Two concrete problems:

The benchmark is now unrunnable outside Man Group (and the PR description still describes the LMDB flow / old script names / old CLI args — please update).

The library/symbol pair is hardcoded identically in all three scripts. Two engineers running this concurrently would clobber each other's data on the shared mktdatad instance.

Please parametrise connection/library/symbol via argv or env vars (e.g. ARCTICDB_BENCH_LIBRARY, ARCTICDB_BENCH_SYMBOL) and pass them from the orchestrator, or at least define LIBRARY / SYMBOL constants in a shared module rather than triplicating the literals.

claude · 2026-05-15T16:23:48Z

+    lib = NativeMongoose("mktdatad").get_library("pmarkovski.columns_stats", api="v2")
+    try:
+        lib.delete("test_symbol")
+    except Exception:


Bare except Exception: pass swallows real failures here (mongo auth, connection drops, permission errors, library-not-found). If cleanup silently fails, the next run sees a stale symbol and the measurements are garbage. At minimum log it:

Suggested change

except Exception:

try:

lib.delete("test_symbol")

except Exception as exc:

print(f"cleanup: lib.delete failed: {exc}", file=sys.stderr)

Also note that this is called from if __name__ == "__main__": before any scenario runs — if mongo is unreachable at startup the script will only fail later on the first write/create_stats subprocess, with no obvious clue that the issue is auth/connectivity.

claude · 2026-05-15T16:23:54Z

+    (10, 10),
+    (1_000, 1_000),
+    (100_000, 1_000),
+    (100_000, 10_000),
+    (1_000_000, 1_000),
+    (1_000_000, 5_000),
+    (10_000_000, 1_000),


This push reinstates the massive-scenario block that the previous review summary tracked as resolved. Be aware of what this implies for the worker:

(100_000, 10_000): a single 100K × 10K float64 chunk is ~8 GB per chunk in col_stats_bench_write_symbol.py. That will OOM most dev machines.

(10_000_000, 1_000): ~80 GB total writes across 100 chunks of ~800 MB; many hours against mktdatad.

(1_000_000, 5_000): ~40 GB total, 4 GB per chunk.

If these are intentional targets against the shared Man Group mongo, fine — but (a) call this out in the PR description so reviewers know the expected wall-clock and storage cost, and (b) consider gating the heavy scenarios behind a --full flag so a casual run doesn't accidentally write tens of GB to mktdatad under pmarkovski.columns_stats.

…row-slicing+append) Co-authored-by: Copilot <copilot@github.com>

claude · 2026-05-19T09:26:56Z

+        )
+        return json.loads(completed.stdout)
+    except subprocess.CalledProcessError as e:
+        raise MemoryError(f"[{label}] exited with code {e.returncode}") from None


raise MemoryError(...) classifies every non-zero subprocess exit as an OOM — a mongo connection blip, a KeyError looking up the library, an unhandled AssertionError, a programmer bug in the worker, all become "MemoryError" here. That trips run_phase's except MemoryError and silently terminates the row/col walk, hiding the real failure.

It also discards the previous distinction between SIGKILL (negative returncode, the actual OOM-killer signature) and a regular exit code (almost always a bug or environment issue).

Either:

only re-classify as OOM when e.returncode < 0 (or specifically -signal.SIGKILL), and re-raise other failures as RuntimeError so they propagate up to the user, or

inspect the worker's stderr / dmesg for an OOM marker before deciding.

As written, a single mongo hiccup in the warmup of the first scenario will silently end the entire benchmark.

claude · 2026-05-19T09:27:03Z

+def row_scenarios():
+    rows = BASE_ROWS
+    while True:
+        yield rows, BASE_COLS
+        rows *= 10
+
+
+def col_scenarios():
+    cols = BASE_COLS * 10
+    while True:
+        yield BASE_ROWS, cols
+        cols *= 10


Both row_scenarios() and col_scenarios() are unbounded generators that escalate by 10× indefinitely. The only stopping condition is the MemoryError raised inside run_subprocess — which, per the comment on line 40, currently fires for any non-zero exit, not just real OOMs.

Consequences:

On a sufficiently large machine, the row phase never terminates and col_scenarios() is never reached.

On any machine, a transient mongo/network blip terminates the phase before its actual memory ceiling is found.

The 10× step is also coarse: col_scenarios() jumps (100_000, 1_270) → (100_000, 12_700) → (100_000, 127_000); the second step is already ~10 GB of float64 in the writer (see col_stats_bench_write_symbol.py — it now materialises the whole frame in one np.random.rand call), so the col walk almost certainly OOMs on the second yield and you get one data point for the whole phase.

Please bound the walk explicitly (e.g. max iterations, or rows * cols cap) and consider a 2× step so the failure point is found with usable resolution.

Create simple column stats benchmarks (symbol write vs column stats c…

77c7a25

…reation)

markovskipetar requested review from IvoDD, alexowens90 and poodlewars as code owners May 13, 2026 11:27

claude Bot reviewed May 13, 2026

View reviewed changes

poodlewars reviewed May 13, 2026

View reviewed changes

Isolate each symbol write and stats creation into a single subprocess

adabce9

Co-authored-by: Copilot <copilot@github.com>

claude Bot reviewed May 13, 2026

View reviewed changes

markovskipetar changed the title ~~Column Stats Improvements~~ Add script to benchmark column stats creation May 13, 2026

poodlewars reviewed May 13, 2026

View reviewed changes

markovskipetar added 3 commits May 14, 2026 09:47

add warmup runs

e3a2717

add cleanup functionality

73c969d

rename files, remove unused file, split into three files - orchestrat…

4c95eef

…or, symbol write, col stats create

claude Bot reviewed May 14, 2026

View reviewed changes

fix timing to exclude non-measured operations

a923e70

Co-authored-by: Copilot <copilot@github.com>

claude Bot reviewed May 14, 2026

View reviewed changes

markovskipetar and others added 5 commits May 14, 2026 15:53

fix: measure only symbol writing

92cf7ab

Co-authored-by: Copilot <copilot@github.com>

remove duplicated scenarios

6be8596

Co-authored-by: Copilot <copilot@github.com>

measure rss peak right after creating stats

5336d74

Co-authored-by: Copilot <copilot@github.com>

remove local lmdb and use mongoose + little refactorings

1cb6325

Co-authored-by: Copilot <copilot@github.com>

remove simple dataframes scenarios, used for local tests

ea73298

claude Bot reviewed May 15, 2026

View reviewed changes

Update the scenarios to use arctic's paralellism (removed the custom …

b84a38e

…row-slicing+append) Co-authored-by: Copilot <copilot@github.com>

claude Bot reviewed May 19, 2026

View reviewed changes

markovskipetar added 3 commits May 19, 2026 11:36

add multipler as factor for growing rows and cols

9ec27ff

Merge branch 'master' into column_stats_improvements

1317110

add handler for keyboard interruption

8296670



		atexit.register(_cleanup)
		signal.signal(signal.SIGINT, lambda *_: exit(130))


		col_stats = {f"col_{i}": {"MINMAX"} for i in range(cols)}

		mem_before = psutil.Process().memory_info().rss / 1e6

		RUNS = 10
		WRITE_SYMBOL_SCRIPT = Path(__file__).parent / "bench_write_symbol.py"

Conversation

markovskipetar commented May 13, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Column Stats Benchmarks

Scripts

How It Works

Usage

Output

Scenarios

Cleanup

Uh oh!

github-actions Bot commented May 13, 2026

Uh oh!

claude Bot May 13, 2026

Choose a reason for hiding this comment

Uh oh!

claude Bot commented May 13, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

ArcticDB Code Review Summary

Uh oh!

poodlewars left a comment

Choose a reason for hiding this comment

Uh oh!

poodlewars May 13, 2026

Choose a reason for hiding this comment

Uh oh!

markovskipetar May 14, 2026

Choose a reason for hiding this comment

Uh oh!

poodlewars May 13, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

poodlewars May 13, 2026

Choose a reason for hiding this comment

Uh oh!

markovskipetar May 14, 2026

Choose a reason for hiding this comment

Uh oh!

poodlewars commented May 13, 2026

Uh oh!

markovskipetar commented May 13, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

claude Bot May 13, 2026

Choose a reason for hiding this comment

Uh oh!

claude Bot May 13, 2026

Choose a reason for hiding this comment

Uh oh!

claude Bot May 13, 2026

Choose a reason for hiding this comment

Uh oh!

poodlewars May 13, 2026

Choose a reason for hiding this comment

Uh oh!

markovskipetar May 14, 2026

Choose a reason for hiding this comment

Uh oh!

claude Bot May 14, 2026

Choose a reason for hiding this comment

Uh oh!

claude Bot May 14, 2026

Choose a reason for hiding this comment

Uh oh!

claude Bot May 14, 2026

Choose a reason for hiding this comment

Uh oh!

claude Bot May 15, 2026

Choose a reason for hiding this comment

Uh oh!

claude Bot May 15, 2026

Choose a reason for hiding this comment

Uh oh!

claude Bot May 15, 2026

Choose a reason for hiding this comment

Uh oh!

claude Bot May 19, 2026

Choose a reason for hiding this comment

Uh oh!

markovskipetar commented May 13, 2026 •

edited

Loading

claude Bot commented May 13, 2026 •

edited

Loading

poodlewars May 13, 2026 •

edited

Loading

markovskipetar commented May 13, 2026 •

edited

Loading