[OPIK-5659] [BE] fix: prevent concurrent experiment aggregation and reduce ClickHouse query overload#6076
Conversation
…educe ClickHouse query overload

Root cause: the aggregation lock TTL (1 min) was far shorter than the actual processing time for large experiments (~87 min at batchSize=1000 for 1M items). When the lock expired, multiple nodes reprocessed the same experiment concurrently, each issuing the full set of ClickHouse queries per batch — causing 2,000+ queries/minute spikes.

Fixes:
- Switch from executeWithLockCustomExpire to bestEffortLock with a 500ms acquire wait, so a locked experiment is immediately skipped rather than queued concurrently (see the sketch below).
- Apply .timeout(lockTTL) to the processing Mono so Reactor cancels in-flight R2DBC queries when the TTL elapses, instead of leaving them running silently after lock expiry.
- Raise the aggregationLockTime default from 1 min to 10 min.
- Add retry logic: on timeout, re-publish via debounce up to maxLockExpiryRetries=3; reset the counter on successful completion.
- All new config fields (lockAcquireWait, maxLockExpiryRetries, retryCounterTtl) are externalised in ExperimentDenormalizationConfig — no hardcoded values.
- Resolve getProjectId once per experiment (not per batch), eliminating N-1 redundant ClickHouse round trips across the expand() loop.
- Pass traceIds extracted from the already-fetched batch directly to all 5 parallel queries, replacing the repeated experiment_items FINAL CTE subquery in each.
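A minimal Reactor-style sketch of the lock-and-timeout shape this commit describes, assuming a `LockService.bestEffortLock(key, action, fallback, acquireWait, ttl)` signature (the real Opik API may differ); only the operator flow mirrors the commit message:

```java
import java.time.Duration;
import java.util.concurrent.TimeoutException;
import reactor.core.publisher.Mono;

// Sketch only: `LockService` and its signature are assumptions for illustration.
final class AggregationLockSketch {

    interface LockService {
        <T> Mono<T> bestEffortLock(String key, Mono<T> action, Mono<T> fallback,
                Duration acquireWait, Duration ttl);
    }

    Mono<Void> process(String experimentId, LockService lockService,
            Duration lockAcquireWait, Duration lockTtl) {
        return lockService.bestEffortLock(
                "experiment-aggregation:" + experimentId,
                // Action runs only if the lock was acquired; .timeout(lockTtl) makes
                // Reactor cancel the upstream R2DBC subscription when the TTL elapses,
                // instead of letting queries outlive the lock.
                populateAggregations(experimentId).timeout(lockTtl),
                // Lock held by another node: skip immediately instead of queueing.
                Mono.empty(),
                lockAcquireWait,   // 500ms acquire wait per the commit
                lockTtl)
                // Timeout means the lock TTL genuinely expired: hand off to the
                // debounced re-publish path (capped by maxLockExpiryRetries).
                .onErrorResume(TimeoutException.class,
                        e -> retriggerIfBelowMaxRetries(experimentId));
    }

    private Mono<Void> populateAggregations(String experimentId) {
        return Mono.empty(); // stand-in for the real batch-processing pipeline
    }

    private Mono<Void> retriggerIfBelowMaxRetries(String experimentId) {
        return Mono.empty(); // stand-in for the retry re-publish described above
    }
}
```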
Move resetRetryCounter into the action Mono so it only runs when this node actually acquired the lock and processing completed successfully. Previously it ran after bestEffortLock regardless of whether the lock was acquired, resetting the shared counter on skip (Mono.empty) paths. Also add workspaceId to the doOnError log for consistency with other log statements in the same method.
…cross aggregation queries

Removes FINAL from the experiments/spans/traces/experiment_items/assertion_results reads and replaces it with an explicit ORDER BY (<sort_key>) DESC, last_updated_at DESC + LIMIT 1 BY <dedup_key> to avoid the per-query ReplacingSorted merge work (see the sketch below).

Also fixes two bugs:
- ::trace_ids typo (double colon) → :trace_ids
- GET_ASSERTIONS_DATA inner subquery was missing the name column and the LIMIT 1 BY clause, causing "Unknown expression or function identifier 'name'" at runtime. Aligned with the canonical assertion_results dedup pattern used elsewhere in the file.
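To make the pattern concrete, a sketch in the DAO style the codebase already uses (SQL inside a Java text block); table, column, and parameter names are illustrative, not the production schema:

```java
// Sketch of the FINAL-free dedup read described above; names are illustrative.
private static final String GET_TRACES_DATA_SKETCH = """
        SELECT *
        FROM traces
        WHERE workspace_id = :workspace_id
          AND id IN :trace_ids  -- single colon: the ::trace_ids typo broke binding
        ORDER BY (workspace_id, project_id, id) DESC, last_updated_at DESC
        LIMIT 1 BY id           -- latest row per id, without FINAL's merge cost
        """;
```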
ORDER BY ... ASC, last_updated_at ASC before LIMIT 1 BY id kept the oldest row per id, which could cause populateExperimentItemAggregates to process stale snapshots. Flipped to DESC, last_updated_at DESC to match the latest-row dedup pattern used in GET_TRACES_DATA and GET_SPANS_DATA.
…TEMS The prior commit flipped ORDER BY to DESC on all columns so LIMIT 1 BY id would keep the latest row per id, but that broke cursor pagination: with DESC ordering and the filter id > :cursor, each subsequent batch only excludes the smallest id of the previous batch and re-returns the rest, so the iteration only ever processes the top ~batchSize ids of the experiment and never progresses. Keep id ASC for forward cursor progression while using last_updated_at DESC so LIMIT 1 BY id still selects the latest version per id.
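A sketch of the resulting cursor-safe shape (illustrative names, same caveats as above): the cursor key stays ascending so `id > :cursor` progresses, while the version column stays descending so `LIMIT 1 BY id` keeps the latest row.

```java
// Sketch: forward cursor pagination combined with latest-version dedup.
private static final String GET_EXPERIMENT_ITEMS_PAGE_SKETCH = """
        SELECT *
        FROM experiment_items
        WHERE experiment_id = :experiment_id
          AND id > :cursor                      -- forward cursor progression
        ORDER BY id ASC, last_updated_at DESC   -- ASC on the cursor key only
        LIMIT 1 BY id                           -- still keeps the latest version per id
        LIMIT :batch_size
        """;
```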
…use HTTP JSONEachRow

The prior INSERT used a StringTemplate that rendered one VALUES tuple per item with N distinct named parameters (`:id0, :id1, ...`). R2DBC's named-parameter resolution grew super-linearly with batch size, so `EXPERIMENT_AGGREGATES_BATCH_SIZE=10000` made the Java-side render + bind pipeline take ~45s per batch (the CH server-side INSERT itself was <200ms). At the 60s lock TTL this left ~0 headroom: every attempt processed one 10k-item batch, then got cancelled. Smaller batch sizes avoided the super-linear cost but capped throughput at ~1.5k items/sec.

Replace the R2DBC path for this INSERT with the ClickHouse v2 HTTP client (`com.clickhouse:client-v2`) POSTing JSONEachRow directly, with client-request + server-response compression enabled. The rest of the aggregation pipeline keeps using R2DBC.

At batchSize=10000 the per-batch cost drops from ~45s to ~60-100ms (bodyMs ~20-40 + executeMs ~35-60). End-to-end, a 1M-item experiment now completes aggregation in ~45s as a single attempt, well within the 60s lock TTL. ExperimentAggregatesIntegrationTest (126 tests) passes.

Also wires `EXPERIMENT_AGGREGATES_BATCH_SIZE` through the backend docker-compose service, defaulting to 10000.
✅ Test environment is now available! To configure additional environment variables for your environment, run the [Deploy Opik AdHoc Environment workflow](https://github.com/comet-ml/comet-deployment/actions/workflows/deploy_opik_adhoc_env.yaml).

Access Information
The deployment has completed successfully and the version has been verified.
Python SDK E2E Tests Results (Python 3.13): 347 tests, 345 ✅, 12m 27s ⏱️. Results for commit 23e7323. ♻️ This comment has been updated with latest results.
ldaugusto
left a comment
It's very cool that you found out client-v2 is so much more efficient in this situation; if it's something very big, it's worth having a second pool to manage these operations. Do you have a general guide on which kinds of operations would be better to do on each pool? If bound inserts are this much better with client-v2, would it be the same for traces/spans inserts?
Also, the PR description mentions the new default for EXPERIMENT_AGGREGATES_BATCH_SIZE is 10000, but I don't see this change.
andrescrz
left a comment
This PR is complex, but generally well done.
I left feedback around the stuff to verify before moving forward.
Good job!
```java
                    message.experimentId(), message.workspaceId(), retryCount,
                    config.getMaxLockExpiryRetries());

            return publisher.publish(
```
I understand this re-injects back a message to this class, making it self-retryable. Please make sure there are no race conditions or other circumstances that can lead to a storm of retries here.
You need to ensure there's a short circuit or exit condition to always prevent this from happening.
Good callout — here are the safeguards in place (in retriggerIfBelowMaxRetries / resetRetryCounter):

- Atomic retry counter per `(workspaceId, experimentId)` — uses `RedissonClient.getAtomicLong(key)` with `incrementAndGet()`, so increments are race-free across nodes.
- Counter increments before publishing — we only re-publish if `incrementAndGet()` returns `<= maxLockExpiryRetries` (default 3, configurable via `maxLockExpiryRetries`). Past that we `counter.delete()` and return `Mono.empty()` — the retry chain terminates, and no more messages are emitted for that experiment.
- TTL on the counter — `counter.expire(retryCounterTtl)` (default 10m, configurable) is set after each successful re-publish. If the experiment stops producing events, the counter self-cleans and a future legitimate trigger starts fresh.
- Counter reset on successful processing — `resetRetryCounter` deletes the key only on the success path (after `populateAggregations` completes within the lock TTL). Fixed by a recent commit (c48ece1fbe) so the counter is NOT reset when the lock was never acquired (otherwise the cap could be bypassed).
- Only triggered by `TimeoutException` — i.e. the lock TTL genuinely expired. Other errors fall through to `doOnError` and do NOT re-publish.
- Debounce on the publisher side — `publisher.publish(...)` dedupes concurrent triggers for the same experiment within a window, so even if multiple retries overlap they collapse to a single message.
- Log trail at WARN level on every retry and on cap-reached, so a retry storm would be visible in logs and metrics immediately.

Net effect: at most maxLockExpiryRetries automatic re-triggers per (workspace, experiment), then the chain stops — even under concurrent producers and node restarts.
🤖 Reply posted via /address-github-pr-comments
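To make the short-circuit concrete, a hedged sketch of the guard using Redisson's reactive API; the key format, class shape, and helper names are assumptions, while the increment-check-publish-expire flow follows the bullets above:

```java
import java.time.Duration;
import org.redisson.api.RAtomicLongReactive;
import org.redisson.api.RedissonReactiveClient;
import reactor.core.publisher.Mono;

// Sketch only: key format and method names are assumptions for illustration.
final class RetryGuardSketch {

    private final RedissonReactiveClient redisson;
    private final int maxLockExpiryRetries;   // default 3 per the discussion
    private final Duration retryCounterTtl;   // default 10m per the discussion

    RetryGuardSketch(RedissonReactiveClient redisson, int maxRetries, Duration ttl) {
        this.redisson = redisson;
        this.maxLockExpiryRetries = maxRetries;
        this.retryCounterTtl = ttl;
    }

    Mono<Void> retriggerIfBelowMaxRetries(String workspaceId, String experimentId) {
        RAtomicLongReactive counter = redisson
                .getAtomicLong("aggregation-retry:%s:%s".formatted(workspaceId, experimentId));
        return counter.incrementAndGet()                             // race-free across nodes
                .flatMap(retryCount -> retryCount <= maxLockExpiryRetries
                        ? publish(workspaceId, experimentId)         // debounced re-publish
                                .then(counter.expire(retryCounterTtl)) // self-clean if idle
                                .then()
                        : counter.delete().then());                  // cap reached: stop the chain
    }

    Mono<Void> resetRetryCounter(String workspaceId, String experimentId) {
        // Success path only: delete the counter so future triggers start fresh.
        return redisson
                .getAtomicLong("aggregation-retry:%s:%s".formatted(workspaceId, experimentId))
                .delete()
                .then();
    }

    private Mono<Void> publish(String workspaceId, String experimentId) {
        return Mono.empty(); // stand-in for publisher.publish(...) with debounce
    }
}
```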
Commit daa62f7 addressed this comment by adding retriggerIfBelowMaxRetries, which increments a per-experiment atomic counter and stops re-publishing once maxLockExpiryRetries is reached, ensuring the retry loop short-circuits instead of creating a storm of retries.
```java
            Map<String, BigDecimal> feedbackScoresMap = Optional.ofNullable(feedback)
                    .map(FeedbackScoreData::feedbackScores).orElse(Map.of());

            var node = JsonUtils.createObjectNode();
```
Switching clients is a major architectural change. It's fine under the scope of this PR, but we'll need to be careful with the exact configuration of the JSON mapper to be used. I believe JsonUtils should be fine, so we keep things centralised. Let's make sure that no configuration for it breaks the ClickHouse communication.
I guess this is fine for now as tests pass. Likely no action for now beyond maybe extending the javadoc in this PR, starting with this method.
Fixed in d055ca6 — add javadoc on appendJsonRow documenting the JSONEachRow contract, JsonUtils reuse, and null-coalescing policy.
🤖 Reply posted via /address-github-pr-comments
Commit d055ca6 addressed this comment by adding detailed javadoc to the appendJsonRow helper that explains why we rely on JsonUtils for JSON serialization and how the bulk insert flow keeps ClickHouse configuration centralized; the javadoc now captures the reasoning the reviewer asked for.
```java
                    Optional.ofNullable(assertionsMap.get(item.traceId())).map(AssertionData::assertionsArray)
                            .orElse(EMPTY_ARRAY_STR));

            out.append(node).append('\n');
```
Same comment about this. Likely no action for now other than documenting. In the long run we'll need to document these nuances in order to extend the usage of this ClickHouse V2 client.
Fixed in d055ca6 — added javadoc on insertExperimentItems explaining that the v2-client + JSONEachRow path is scoped to this batch insert (used when EXPERIMENT_AGGREGATES_BATCH_SIZE > 1K), the per-request date_time_input_format setting, and the NUM_ROWS_WRITTEN return semantics.
🤖 Reply posted via /address-github-pr-comments
Commit d055ca6 addressed this comment by adding extensive JavaDoc around appendJsonRow and insertExperimentItems that explains the JSONEachRow format, shared buffer usage, and ClickHouse V2 client considerations mentioned in the discussion.
```java
            var settings = new InsertSettings()
                    .logComment(logComment)
                    .serverSetting("date_time_input_format", "best_effort");
```
What about this setting for ClickHouse V2 client, can we configure it globally?
We kept it as a per-request InsertSettings on purpose: in the regular R2DBC insert paths, timestamp strings are normalised via a server-side function (e.g. parseDateTime64BestEffort(...)) inside the SQL itself, so no global setting is required. On the JSONEachRow path there's no function-wrapping opportunity — the column receives the raw string from the JSON payload — so we need date_time_input_format=best_effort on the request so ClickHouse parses Instant.toString() (ISO-8601 with nanos) directly into DateTime64(9, 'UTC').
I'd keep it per-request for now — it's scoped only to this bulk-insert path and leaves the global Client defaults conservative. If we extend the v2 client to more paths in the future and they all want the same behaviour, we can promote it to a global Client.Builder setting (or equivalently add it to custom_http_params in config.yml) as a follow-up.
🤖 Reply posted via /address-github-pr-comments
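For reference, promoting the setting later would look roughly like this (a sketch; endpoint and credentials are placeholders, and builder-level `serverSetting` is the same mechanism this PR uses for custom_http_params):

```java
import com.clickhouse.client.api.Client;

// Sketch: applying the setting globally via the v2 Client builder instead of
// per-request InsertSettings. Endpoint and credentials are placeholders.
Client client = new Client.Builder()
        .addEndpoint("http://localhost:8123")
        .setUsername("default")
        .setPassword("")
        .serverSetting("date_time_input_format", "best_effort") // every request
        .build();
```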
Commit d055ca6 addressed this comment by documenting that the v2 bulk insert path explicitly sets date_time_input_format=best_effort on each insert request and keeps the global Client.Builder defaults conservative, explaining why we do not lift the setting into a global configuration right now.
Hi @ldaugusto, on which operations go where:

- When the v2 path actually matters: workspaces with experiments of ≥1M items.
- Rule of thumb: use R2DBC for everything except a bulk-insert path that has to run with >1k rows per statement (where the fixed cost per statement is dominated by named-parameter bind time rather than server-side work).
- Traces/spans inserts: same template pattern, but in practice batched ≤1,000 rows per call via the SDK — comfortably below the inflection point, so no win from switching. If ingest rates ever push those per-statement batch sizes above ~1k, the same v2-HTTP pattern would apply. Worth a follow-up once we agree on a shared helper.

🤖 Reply posted via /address-github-pr-comments
- Log values moved to the end of the sentence for production greppability (ExperimentAggregatesSubscriber, ExperimentAggregatesService).
- `@Builder` on BatchResult; call sites use the builder pattern.
- Javadoc on appendJsonRow documenting the JSONEachRow contract, JsonUtils reuse, the shared-StringBuilder rationale, and the null-coalescing policy.
- Javadoc on insertExperimentItems documenting why the ClickHouse v2 HTTP client + JSONEachRow path is used only for this high-volume batch insert (EXPERIMENT_AGGREGATES_BATCH_SIZE can exceed 1K), why date_time_input_format is scoped per-request, and the NUM_ROWS_WRITTEN return semantics.
- DatabaseAnalyticsFactory encapsulates v2 Client construction via buildClient(); parseQueryParameters() splits queryParameters into driver options (R2DBC-specific, not forwarded) and server settings (custom_http_params content → Client.Builder.serverSetting). The DatabaseAnalyticsModule.getDatabaseAnalyticsFactory provider was removed. (A sketch of the split follows below.)
- Unit tests (DatabaseAnalyticsFactoryTest) for parseQueryParameters and integration tests (DatabaseAnalyticsFactoryIntegrationTest) verifying that custom_http_params entries land in system.settings via a real ClickHouse Testcontainer, that driver options don't leak, and that a JSONEachRow round-trip completes.
- config.yml: new env-var-overridable settings for the retry pipeline (lockAcquireWait, maxLockExpiryRetries, retryCounterTtl); updated aggregationLockTime default to 10m to match the code.
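A sketch of the parseQueryParameters split described above; the exact option names and the URL-encoded inner format of custom_http_params are assumptions for illustration:

```java
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch: driver-only R2DBC options stay local, while the contents of
// `custom_http_params` become v2 Client server settings. The inner
// comma-separated, URL-encoded format is an assumption for illustration.
final class QueryParameterSplitSketch {

    static Map<String, String> serverSettings(String queryParameters) {
        Map<String, String> settings = new LinkedHashMap<>();
        for (String pair : queryParameters.split("&")) {
            String[] kv = pair.split("=", 2);
            if (kv.length == 2 && kv[0].equals("custom_http_params")) {
                // e.g. custom_http_params=max_query_size%3D1048576 (assumed encoding)
                String decoded = URLDecoder.decode(kv[1], StandardCharsets.UTF_8);
                for (String setting : decoded.split(",")) {
                    String[] skv = setting.split("=", 2);
                    if (skv.length == 2) {
                        settings.put(skv[0], skv[1]); // forwarded via Client.Builder.serverSetting
                    }
                }
            }
            // Other keys (e.g. connect_timeout) are R2DBC driver options: not forwarded.
        }
        return settings;
    }
}
```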
… and guard it

The "finished processing all experiments" log previously fired at INFO on every 5s polling tick, even when no experiments were pending. Now the job tracks the number of processed experiments and only emits the completion log when at least one was actually processed. The value is placed at the end of the sentence per the log format convention used across the aggregation pipeline.
- The per-batch log in ExperimentAggregatesService was labelled batchSize but carried result.processedCount(); renamed the placeholder to processedCount so the wording matches the value. The configured batch size is still logged once at job start.
- The denormalization job counter previously incremented in doOnNext before flatMap(processExperiment), so the processedExperiments log counted items that failed and were skipped by onErrorContinue. Moved the increment to doOnSuccess on the inner Mono so the count reflects actually processed experiments (see the sketch below).
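A small Reactor sketch of the counting fix (illustrative names): the increment sits on the inner Mono's success signal, so errors swallowed by the resume path are not counted.

```java
import java.util.concurrent.atomic.AtomicLong;
import reactor.core.publisher.Flux;
import reactor.core.publisher.Mono;

// Sketch only: count experiments whose inner Mono actually completed.
final class ProcessedCounterSketch {

    Mono<Long> runOnce(Flux<String> pendingExperimentIds) {
        AtomicLong processed = new AtomicLong();
        return pendingExperimentIds
                .flatMap(id -> processExperiment(id)
                        // Fires once per inner Mono that completes successfully
                        // (with a value or empty), unlike doOnNext on Mono<Void>.
                        .doOnSuccess(v -> processed.incrementAndGet())
                        // Failures are skipped without incrementing the counter.
                        .onErrorResume(e -> Mono.empty()))
                .then(Mono.fromSupplier(processed::get));
    }

    private Mono<Void> processExperiment(String id) {
        return Mono.empty(); // stand-in for the real per-experiment pipeline
    }
}
```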
Python SDK E2E Tests Results (Python 3.11): 347 tests, 345 ✅, 12m 48s ⏱️. Results for commit c04ab13. ♻️ This comment has been updated with latest results.
ldaugusto
left a comment
Thanks for addressing all the suggestions! It's approved on my side.
andrescrz
left a comment
LGTM:
- Just double check the correctness of the values on the aggregated query.
- Double check the OPIK_EXPERIMENT_DENORM_RETRY_COUNTER_TTL default of 2h.
- Missing mirroring values in test config YML file.
The rest is minor, feel free to go ahead if all good.
```sql
count() AS n,
avg(value) AS avg_value,
any(value) AS any_value,
any(reason) AS any_reason,
arrayStringConcat(groupArray(if(reason = '', '<no reason>', reason)), ', ') AS reason_concat,
arrayStringConcat(groupArray(category_name), ', ') AS category_name_concat,
any(source) AS source_any,
arrayStringConcat(groupArray(created_by), ', ') AS created_by_concat,
arrayStringConcat(groupArray(last_updated_by), ', ') AS last_updated_by_concat,
min(created_at) AS created_at_min,
max(last_updated_at) AS last_updated_at_max,
mapFromArrays(
    groupArray(author),
    groupArray(tuple(value, reason, category_name, source, last_updated_at))
) AS value_by_author
```
Could you double check the correctness of this? Mostly the count without an inner ID (maybe with distinct) and the any functions.
Commit 4eccf7f addressed this comment by changing the aggregate count to use count(DISTINCT author), aligning with the requested clarification about counting without an inner ID.
Fixed in 4eccf7f (count() → count(DISTINCT author)). Tested the full query against a live ClickHouse with three scenarios: 1 author, 2 authors (alice@10:55 + bob@11:00), and 3 authors (alice@11:50, bob@11:55, carol@12:00). Also added a duplicate-insert stress test (a second alice row at 10:57 for trace2) to simulate dedup edge cases.

count() vs count(DISTINCT author)

- With the current `LIMIT 1 BY ..., author`: `count()` == `count(DISTINCT author)` in every scenario (1/1, 2/2, 3/3). The dedup guarantees one row per author, so both forms agree.
- Without the `LIMIT 1 BY` (to simulate a regressed dedup): `count()` reports 2/1, 4/2, 4/3 — overcounts; `count(DISTINCT author)` stays correct.

Switched to count(DISTINCT author) — same semantics today, defensive against accidental regression of the dedup clause. Low-risk wording change with no perf impact (the CTE is already 1 row per author).

any(value) / any(reason) / any(source)

- Ran the aggregation 5 times sequentially + 3 times with `max_threads=8` — every run produced identical output (all 3 `any(...)` picked the latest author's row: bob for trace2, carol for trace3).
- Compared against the raw-query pattern used in `ExperimentItemDAO.java:231` (`arrayElement(entries, 1).4 AS source`) — both produce the same output. In practice the outer aggregation reads the deduped CTE in scan order, which is governed by its `ORDER BY last_updated_at DESC`, so `any()` and `arrayElement(entries, 1)` both yield the latest author's values.
- `any(value)` and `any(reason)` are additionally gated by `IF(n=1, any_*, avg/concat)`, so even if the order ever slipped, a mismatch would only show up for 1-row groups — where `any(x)` over a single row is trivially deterministic. The same `count() + any(value)` pattern is used in 14 places across `DatasetItemDAO`, `DatasetItemVersionDAO`, `ExperimentDAO`, `KpiCardDAO`, `OptimizationDAO`, `ProjectMetricsDAO`, `SpanDAO`, and `TraceDAO`, so this PR is consistent with the existing codebase convention.

Leaving any(...) as-is in this PR since behaviour matches the raw path end-to-end. Happy to tighten further in a follow-up (e.g., argMax(source, last_updated_at)) if you'd prefer explicit determinism across the aggregation stack.
🤖 Reply posted via /address-github-pr-comments
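For reference, the follow-up tightening mentioned above would swap the order-dependent `any(...)` picks for `argMax`, which pins each value to the row with the greatest last_updated_at regardless of scan order (a sketch, not a change made in this PR):

```java
// Sketch: deterministic alternative to any(...), keyed on last_updated_at.
private static final String DETERMINISTIC_PICK_SKETCH = """
        argMax(value, last_updated_at)  AS latest_value,
        argMax(reason, last_updated_at) AS latest_reason,
        argMax(source, last_updated_at) AS latest_source
        """;
```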
…ggregation

Defensive change: count() was correct given the LIMIT 1 BY ..., author dedup, but silently overcounts if the dedup ever regresses. Switching to count(DISTINCT author) keeps the invariant explicit at negligible cost (one row per author in the deduped CTE).
- ExperimentAggregatesSubscriber: extract a buildRetryCountKey helper so retriggerIfBelowMaxRetries and resetRetryCounter share a single key-generation site.
- ExperimentDenormalizationJob: switch processExperiment to return the experiment id so the outer doOnNext actually fires per successful processing (a Mono<Void> never triggers doOnNext). Retains the same per-experiment log and counter-increment semantics.
- config.yml: lower the retryCounterTtl default from 2h to 30m to match the worst-case retry cycle (3 attempts at 10m lock TTL) plus a modest idle buffer.
- config-test.yml: add the missing lockAcquireWait, maxLockExpiryRetries, retryCounterTtl entries to the experimentDenormalization block.
Backend Tests - Integration Group 8: 35 files, 35 suites, 6m 40s ⏱️. Results for commit c996420. ♻️ This comment has been updated with latest results.
Previously counter.expire(...) ran only after publisher.publish(...) succeeded. If publish failed, the retry counter was incremented without a TTL and could linger, causing subsequent lock-expiry retries to hit maxLockExpiryRetries prematurely. Reordering so expire runs right after the increment guarantees the counter always has a TTL and will age out naturally regardless of whether the publish succeeds.
Details
Experiment aggregation was generating 2,000+ ClickHouse queries/minute under load. The root cause was a mismatch between the aggregation lock TTL (1 minute) and the actual time required to process large experiments. When the lock expired, multiple nodes reprocessed the same experiment concurrently, each issuing the full set of ClickHouse queries per batch. A subsequent load test against 2 experiments with 1M items each (5M traces, 15M spans, 8M feedback scores) surfaced a client-side INSERT scaling wall that was resolved in this PR.
Root cause — lock TTL vs processing time:
Original concurrency/contention fixes
- Switched `executeWithLockCustomExpire` (blocks until TTL, enables concurrent processing) to `bestEffortLock` with `lockAcquireWait=500ms` — if the lock is held, the message is immediately ack'd without blocking worker threads.
- Applied `.timeout(lockTTL)` to the processing Mono. Reactor cancels the upstream R2DBC subscription, stopping in-flight ClickHouse queries when the TTL elapses — instead of leaving them running after the lock silently expires.

Query-level redundancies removed
Client-side INSERT bottleneck eliminated
Bulk-insert `experiment_item_aggregates` via ClickHouse HTTP JSONEachRow: the original template rendered one VALUES tuple per item with distinct named parameters (`:id0, :id1, …, :id9999`). R2DBC's named-parameter resolution scales super-linearly with batch size.
Attempting R2DBC's `Statement.add()` batching with a single-row template silently produced zero-row INSERTs (the ClickHouse R2DBC driver does not serialize accumulated bindings into the `FORMAT Values` body).
Replaced the R2DBC INSERT with `com.clickhouse:client-v2`'s HTTP client posting `FORMAT JSONEachRow` directly, with client-request + server-response compression. The rest of the aggregation pipeline keeps using R2DBC. `InsertSettings.serverSetting("date_time_input_format", "best_effort")` is set so ClickHouse accepts ISO 8601 `Instant.toString()` values for DateTime64 columns.
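A hedged sketch of that insert path using the v2 client's insert API; the table name matches the PR, while the body construction and error handling are simplified placeholders (the real code streams a JSONEachRow body built via JsonUtils):

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import com.clickhouse.client.api.Client;
import com.clickhouse.client.api.insert.InsertResponse;
import com.clickhouse.client.api.insert.InsertSettings;
import com.clickhouse.data.ClickHouseFormat;

// Sketch only: a minimal JSONEachRow bulk insert via the v2 HTTP client.
final class JsonEachRowInsertSketch {

    long insert(Client client, String jsonEachRowBody) throws Exception {
        var settings = new InsertSettings()
                // Let ClickHouse parse ISO-8601 Instant.toString() into DateTime64.
                .serverSetting("date_time_input_format", "best_effort");
        try (InsertResponse response = client.insert(
                "experiment_item_aggregates",
                new ByteArrayInputStream(jsonEachRowBody.getBytes(StandardCharsets.UTF_8)),
                ClickHouseFormat.JSONEachRow,
                settings).get()) {
            return response.getWrittenRows(); // NUM_ROWS_WRITTEN semantics
        }
    }
}
```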
Result at batchSize=10,000: the per-batch cost drops from ~45s to ~60-100ms (bodyMs ~20-40 + executeMs ~35-60), and a 1M-item experiment completes aggregation in ~45s as a single attempt.
`EXPERIMENT_AGGREGATES_BATCH_SIZE` is exposed as an env var on the backend docker-compose service so operators can tune it per workspace. The application default stays at 1,000 (appropriate for typical workspaces); raising it to ~10,000 is only needed for workspaces with experiments of ≥1M items, where the smaller batch does not fit in the 60s lock TTL.
Config
All new config fields are externalised in `ExperimentDenormalizationConfig` with no hardcoded values. `aggregationLockTime` default remains 1 min — 1M-item experiments complete in ~45 s at batchSize=10,000, leaving comfortable headroom. `maxLockExpiryRetries` defaults to 3.
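A sketch of what the externalised config surface looks like, assuming a plain Dropwizard-style class; field names follow the PR, but types, validation, and exact defaults in the real ExperimentDenormalizationConfig may differ:

```java
import com.fasterxml.jackson.annotation.JsonProperty;
import io.dropwizard.util.Duration;

// Sketch only: field names follow the PR discussion; defaults per the final
// description (lock TTL 1 min, 3 retries) and review (retryCounterTtl 30m).
public class ExperimentDenormalizationConfigSketch {

    @JsonProperty
    private Duration aggregationLockTime = Duration.minutes(1);

    @JsonProperty
    private Duration lockAcquireWait = Duration.milliseconds(500);

    @JsonProperty
    private int maxLockExpiryRetries = 3;

    @JsonProperty
    private Duration retryCounterTtl = Duration.minutes(30);

    public Duration getAggregationLockTime() { return aggregationLockTime; }
    public Duration getLockAcquireWait() { return lockAcquireWait; }
    public int getMaxLockExpiryRetries() { return maxLockExpiryRetries; }
    public Duration getRetryCounterTtl() { return retryCounterTtl; }
}
```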
Change checklist
Issues
AI-WATERMARK
AI-WATERMARK: yes
Testing
End-to-end load test against a local dockerised backend (MySQL + ClickHouse + Redis + MinIO + Zookeeper):
Scenarios covered:
Environment: local process mode with Testcontainers for integration tests; local docker-compose stack for the load test.
Documentation
N/A — internal processing component, no user-facing API changes.