
storage: remove compaction_ratio metric from probe#29994

Merged
WillemKauf merged 1 commit into redpanda-data:dev from WillemKauf:compaction_ratio_metric_agg on Apr 7, 2026

Conversation

@WillemKauf (Contributor) commented Mar 30, 2026

The cardinality of this metric is much too high, and its usefulness does not justify the cost. Aggregating it would also reduce its usefulness.

We have plenty of log lines from which one could infer the "compactibility" of a segment/log/topic.

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v26.1.x
  • v25.3.x
  • v25.2.x

Release Notes

  • none

Copilot AI (Contributor) left a comment

Pull request overview

Reduces /metrics cardinality for storage compaction ratio by making the metric eligible for label aggregation across partitions (topic-level rollup) rather than emitting a distinct time series per partition.

Changes:

  • Removed the non-aggregated compaction_ratio metric registration.
  • Added a new aggregated metric compaction_ratio_total with updated help text describing topic-level aggregation (see the sketch after this list).
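For reference, label aggregation in the seastar metrics API (the style used throughout probe.cc) looks roughly like the sketch below. This is an illustrative reconstruction, not the PR's actual diff; names such as `partition_label`, `labels`, and `_compaction_ratio` are assumptions:

```cpp
// Illustrative sketch: registering a gauge so that, when metric
// aggregation is enabled, time series differing only in the shard and
// partition labels are summed into a single topic-level series.
namespace sm = ss::metrics;

// Assumed defined elsewhere: the per-partition label set.
// std::vector<sm::label_instance> labels = {...};

auto aggregate_labels = config::shard_local_cfg().aggregate_metrics()
                          ? std::vector<sm::label>{sm::shard_label, partition_label}
                          : std::vector<sm::label>{};

_metrics.add_group(
  prometheus_sanitize::metrics_name("storage:log"),
  {sm::make_gauge(
     "compaction_ratio",
     [this] { return _compaction_ratio; },
     sm::description("Average segment compaction ratio"),
     labels)
     .aggregate(aggregate_labels)});
```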

@WillemKauf force-pushed the compaction_ratio_metric_agg branch from 0d04d47 to cf61deb on March 30, 2026 19:42
@StephanDollberg (Member) left a comment

Does it make more sense to calculate a ratio more directly at the topic level instead of doing the average of averages approach (weighted by bytes or something)?
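(To make the distinction concrete with made-up numbers: if partition A compacts 100 MiB down to 10 MiB (ratio 0.1) and partition B compacts 1 MiB down to 0.9 MiB (ratio 0.9), the average of the per-partition ratios is 0.5, while the bytes-weighted topic-level ratio is (10 + 0.9) / (100 + 1) ≈ 0.108.)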

@WillemKauf (Contributor, Author)

> Does it make more sense to calculate a ratio more directly at the topic level instead of doing the average of averages approach (weighted by bytes or something)?

To be honest, this metric is already fairly approximate, given its sliding-window nature and its sample size of 5 segments/operations. I don't believe it is a good indicator of "compactibility", and doing the low-effort thing here probably isn't a huge deal.

wdyt @dotnwat ?

@vbotbuildovich (Collaborator)

CI test results

Test results on build #82498:

  • RedpandaNodeOperationsSmokeTest.test_node_ops_smoke_test, args {"cloud_storage_type": 1, "mixed_versions": false}, integration: FLAKY (passed 10/11). Test passes after retries; no significant increase in flaky rate (baseline=0.0000, p0=1.0000, reject_threshold=0.0100; adj_baseline=0.1000, p1=0.3487, trust_threshold=0.5000). Job: https://buildkite.com/redpanda/redpanda/builds/82498#019d4052-e5f7-4866-aa5e-e66447800099 History: https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=RedpandaNodeOperationsSmokeTest&test_method=test_node_ops_smoke_test
  • WriteCachingFailureInjectionE2ETest.test_crash_all, args {"use_transactions": false}, integration: FLAKY (passed 18/21). Test passes after retries; no significant increase in flaky rate (baseline=0.0825, p0=0.5002, reject_threshold=0.0100; adj_baseline=0.2278, p1=0.1333, trust_threshold=0.5000). Job: https://buildkite.com/redpanda/redpanda/builds/82498#019d4052-d9f8-49ca-9df6-fef61d0aabd6 History: https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=WriteCachingFailureInjectionE2ETest&test_method=test_crash_all

@WillemKauf requested a review from andrwng on April 1, 2026 18:30
Comment on lines +261 to +265 of src/v/storage/probe.cc (outdated):

```cpp
sm::description(
  "Sum of compaction ratios across segments in this partition. "
  "When aggregate_metrics is enabled, this is summed across all "
  "partitions in a topic; divide by the partition count to get "
  "the average compaction ratio."),
```
Contributor

I'm fine with this change, but just looking at how it's used (a moving average of the ratio, computed on every segment rewrite), I'm wondering how useful it is in practice as-is today (even without this aggregation).

If we're already going to need callers to do some math with this division, maybe we should instead expose some `bytes_rewritten_in` and `bytes_rewritten_out` and aggregate that, so that the collective compaction ratio is always bytes out / bytes in, whether aggregated or not.
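A minimal sketch of that proposal, assuming two new monotonic counters on storage::probe; the names and the update hook below are hypothetical, not code from this PR:

```cpp
#include <cstdint>

// Hypothetical sketch: track raw byte counts rather than a ratio.
// Monotonic counters sum cleanly across partitions, so the collective
// compaction ratio stays well-defined whether or not the metric is
// label-aggregated: ratio = bytes_rewritten_out / bytes_rewritten_in.
class probe {
public:
    // Would be called once per compaction, e.g. with the sizes carried
    // by compaction_result.
    void add_compaction_bytes(uint64_t size_before, uint64_t size_after) {
        _bytes_rewritten_in += size_before;
        _bytes_rewritten_out += size_after;
    }

private:
    uint64_t _bytes_rewritten_in{0};
    uint64_t _bytes_rewritten_out{0};
};
```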

Contributor Author

> I'm wondering how useful it is in practice as-is today

I don't think it's all that useful. I was leaning towards fully removing it.

> If we're already going to need callers to do some math with this division, maybe we should instead expose some `bytes_rewritten_in` and `bytes_rewritten_out` and aggregate that, so that the collective compaction ratio is always bytes out / bytes in, whether aggregated or not.

I'd prefer this change as well, but it would require some additional state/book-keeping in disk_log_impl, and I wasn't sure the juice was worth the squeeze.

Contributor

+1 to removing it

> additional state/book-keeping

FWIW, the `compaction_result` (where we compute the compaction ratio today) already seems to have `size_before` and `size_after`. Presumably we could just add these to counters, or am I missing something?

@andrwng (Contributor) commented Apr 1, 2026

Ah, I guess the issue is that we'd kind of want a rolling average of both of the counters in order to make sense of them over longer spans of time. Though maybe for monitoring, even without the rolling average, we could do something like `rate(size_after) / rate(size_before)`.
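(Assuming the sizes were exported as Prometheus counters, that would look like `rate(size_after_total[5m]) / rate(size_before_total[5m])`: the collective compaction ratio over the last five minutes, with no moving average required. The metric names here are hypothetical.)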

Contributor Author

Currently the `compaction_ratio` is represented as a `moving_average` with 5 samples in `disk_log_impl`, so the book-keeping isn't just two `size_t`s or something like that.
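For context, a moving average of this shape is a small ring buffer plus a running sum. A minimal sketch, illustrative only and not the actual utility used by disk_log_impl:

```cpp
#include <algorithm>
#include <array>
#include <cstddef>

// Minimal sketch of a fixed-window moving average (e.g. over the last
// 5 compaction ratios). Illustrative only.
template<typename T, size_t Samples>
class moving_average {
public:
    void update(T sample) {
        // Drop the oldest sample from the running sum so get() is O(1).
        _sum += sample - _samples[_idx];
        _samples[_idx] = sample;
        _idx = (_idx + 1) % Samples;
        _count = std::min(_count + 1, Samples);
    }

    T get() const { return _count == 0 ? T{} : _sum / _count; }

private:
    std::array<T, Samples> _samples{};
    T _sum{};
    size_t _idx{0};
    size_t _count{0};
};
```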

Contributor Author

(race)

Contributor

Right, I edited my message with another thought (using rates). If we do that, I don't think a moving average is necessary. Though I'm also not confident enough in my Grafana fu to know that what I'm suggesting gets us something useful.

Contributor Author

I'm leaning towards outright removing the metric and the moving average itself from disk_log_impl now. @dotnwat do you have any objections to this?

@dotnwat (Member) commented Apr 2, 2026

I'll defer to your best judgement on whether you think it's useful or not. Compaction ratio seems useful on the surface, in that it tells us something about the data, but I think we lose that usefulness if we aggregate. So if the choice is between aggregating it and dropping it, dropping it is probably the right move.

However, if we kind of like knowing that it is there, we could expose it through the admin interface (for later inclusion in the debug bundle), or we could log a summary of compaction work (including the ratio) after a compaction run at debug or trace level?

Although maybe debug/trace also isn't useful in practice, depending on when you'd want to know the compaction ratio 🤷

Contributor Author

> or we could log a summary of compaction work (including the ratio) after a compaction run at debug or trace level?

We log some interesting compaction info at INFO level which (per segment) notes the number of batches processed, the number of records discarded, etc. It's not a complete summary of the entire compaction run (which would likely be missed for long-running compactions if logged at TRACE or DEBUG anyway), but it is pretty good information from which the compaction ratio can be extrapolated.

Commit message:

The cardinality of this metric is much too high, and its usefulness does
not justify the cost. Aggregating it would also reduce its usefulness.

We have plenty of log lines from which one could infer the "compactibility"
of a `segment`/`log`/`topic`.
@WillemKauf force-pushed the compaction_ratio_metric_agg branch from cf61deb to 88fa106 on April 3, 2026 02:54
@WillemKauf changed the title from "storage: aggregate compaction_ratio metric across partitions" to "storage: remove compaction_ratio metric from probe" on Apr 3, 2026
@WillemKauf (Contributor, Author)

Force push to:

  • Outright remove the metric from the storage::probe.

@WillemKauf requested a review from andrwng on April 3, 2026 02:55
@WillemKauf merged commit 6af2630 into redpanda-data:dev on Apr 7, 2026
21 checks passed