[PEP] Add compression ratio calculation and per-column compression stats #18184

### What

Add compression stats and storage breakdown to the existing `GET /tables/{table}/size` and `GET /tables/{table}/metadata` endpoints. No new endpoints.

### Modules Changed

Stats are written once at segment creation into `metadata.properties`, loaded into `ColumnMetadata` at segment load, and aggregated on-demand per API call in the controller — no background jobs, no separate store.

| Module | Change |
|--------|--------|
| `pinot-spi` / `IndexingConfig` | New `compressionStatsEnabled` boolean (default `false`) |
| `pinot-segment-local` writers | `BaseChunkForwardIndexWriter`, `VarByteChunkForwardIndexWriterV4/V5/V6`, `CLPForwardIndexCreatorV2` — track raw byte count during writes |
| `pinot-segment-local` / `BaseSegmentCreator` | Persists uncompressed size + codec to segment `metadata.properties` at creation time |
| `pinot-segment-spi` / `ColumnMetadata` | Two new default methods to read persisted stats at segment load |
| `pinot-common` | New DTOs: `ColumnCompressionStatsInfo`, `CompressionStatsSummary`, `StorageBreakdownInfo`; extended: `SegmentSizeInfo`, `TableMetadataInfo`; new `ControllerGauge` entries |
| `pinot-server` | `/tables/{table}/size` and `/tables/{table}/metadata` — read per-column stats from `ColumnMetadata` and include in response |
| `pinot-controller` | `TableSizeReader` + `ServerSegmentMetadataReader` aggregate per-segment responses on each API call; emit Prometheus gauges |
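
The write-once, read-at-load flow described above can be sketched roughly as follows. This is a minimal illustration, not the code in the draft PR: the property key names, the helper functions, and the `(-1, None)` default for segments without stats are all assumptions made for the example.

```python
# Hypothetical property keys; the real names used in metadata.properties may differ.
UNCOMPRESSED_KEY = "column.{col}.forwardIndex.uncompressedSizeInBytes"
CODEC_KEY = "column.{col}.forwardIndex.compressionCodec"


def persist_stats(props: dict, column: str, uncompressed_size: int, codec: str) -> None:
    """Segment creation: write the stats once into the properties map."""
    props[UNCOMPRESSED_KEY.format(col=column)] = str(uncompressed_size)
    props[CODEC_KEY.format(col=column)] = codec


def load_stats(props: dict, column: str) -> tuple:
    """Segment load: read the stats back; (-1, None) is an assumed default when absent."""
    size = int(props.get(UNCOMPRESSED_KEY.format(col=column), -1))
    codec = props.get(CODEC_KEY.format(col=column))
    return size, codec


props = {}
persist_stats(props, "url", 120_000_000, "LZ4")
print(load_stats(props, "url"))          # stats were persisted for this column
print(load_stats(props, "status_code"))  # nothing persisted, falls back to the default
```

Because the stats live next to the rest of the segment metadata, no separate store has to be kept in sync with the segment lifecycle.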

### New Fields (both endpoints, same structure)

```json
"compressionStats": {
  "rawForwardIndexSizePerReplicaInBytes": 550000000,
  "compressedForwardIndexSizePerReplicaInBytes": 30000000,
  "compressionRatio": 18.3,
  "segmentsWithStats": 312,
  "totalSegments": 801,
  "isPartialCoverage": true
},
"columnCompressionStats": [
  {
    "column": "url",
    "codec": "LZ4",
    "hasDictionary": false,
    "uncompressedSizeInBytes": 120000000,
    "compressedSizeInBytes": 8000000,
    "compressionRatio": 15.0,
    "indexes": ["forward_index"]
  },
  {
    "column": "status_code",
    "codec": null,
    "hasDictionary": true,
    "uncompressedSizeInBytes": -1,
    "compressedSizeInBytes": 500000,
    "compressionRatio": 0
  }
],
"storageBreakdown": {
  "tiers": {
    "hotTier":  { "count": 50,  "sizePerReplicaInBytes": 10000000 },
    "coldTier": { "count": 262, "sizePerReplicaInBytes": 20000000 }
  }
}
```
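
As a rough illustration of how a client might consume these fields, the sketch below recomputes the per-column ratio from a trimmed copy of the sample response. It assumes `compressionRatio` is the uncompressed size divided by the compressed size, which is consistent with the sample values; `column_ratio` is a hypothetical helper, not part of the API.

```python
import json

# Trimmed copy of the sample response fields shown above.
response = json.loads("""
{
  "columnCompressionStats": [
    {"column": "url", "codec": "LZ4", "hasDictionary": false,
     "uncompressedSizeInBytes": 120000000, "compressedSizeInBytes": 8000000},
    {"column": "status_code", "codec": null, "hasDictionary": true,
     "uncompressedSizeInBytes": -1, "compressedSizeInBytes": 500000}
  ]
}
""")


def column_ratio(stats: dict) -> float:
    """Return uncompressed/compressed, or 0.0 when the uncompressed size is unknown (-1)."""
    raw = stats["uncompressedSizeInBytes"]
    compressed = stats["compressedSizeInBytes"]
    if raw < 0 or compressed <= 0:
        return 0.0
    return round(raw / compressed, 1)


ratios = {c["column"]: column_ratio(c) for c in response["columnCompressionStats"]}
print(ratios)  # {'url': 15.0, 'status_code': 0.0}
```

The same division applied to the table-level sample (550000000 / 30000000) gives 18.3 after rounding to one decimal place.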

### Behavior

**Feature flag** (`tableIndexConfig.compressionStatsEnabled`, default `false`):
- `false`: zero overhead — no tracking in writers, nothing written to disk, `compressionStats` and `columnCompressionStats` absent from responses
- `true`: writers track raw byte counts; codec + uncompressed size persisted to segment metadata
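
The flag-gated tracking can be pictured with a toy writer like the one below. It is a sketch only: `ChunkWriter` is a hypothetical class, and `zlib` stands in for the real codecs (such as LZ4), which are not in the Python standard library.

```python
import zlib


class ChunkWriter:
    """Toy chunk writer; zlib stands in for the real codec."""

    def __init__(self, stats_enabled: bool):
        self.stats_enabled = stats_enabled
        self.raw_bytes = 0  # stays 0 when stats are disabled
        self.chunks = []

    def write(self, chunk: bytes) -> None:
        if self.stats_enabled:
            self.raw_bytes += len(chunk)  # size before compression
        self.chunks.append(zlib.compress(chunk))

    def compressed_bytes(self) -> int:
        return sum(len(c) for c in self.chunks)


writer = ChunkWriter(stats_enabled=True)
for _ in range(4):
    writer.write(b"https://example.com/path?id=1234&" * 100)

print(writer.raw_bytes, writer.compressed_bytes())
```

With the flag off, the only cost left in the write path of this sketch is a single boolean check per chunk.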

**`storageBreakdown`** is always returned regardless of the flag.

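**Dictionary columns** appear in `columnCompressionStats` with `hasDictionary=true`, `uncompressedSizeInBytes=-1`, and `codec=null`. Their forward index size is still reported in `compressedSizeInBytes`, and `compressionRatio` is `0` because the uncompressed size is not tracked.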

**Partial coverage**: enabling the flag on an existing table only affects new segments. Old segments are excluded from ratio computation (not counted as zero). `isPartialCoverage=true` and `segmentsWithStats < totalSegments` signal this.
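
A minimal sketch of the exclusion rule, assuming each per-segment record carries a raw size that is `None` when the segment predates the flag. The `aggregate` function and the record shape are hypothetical; only the output field names come from the response above.

```python
def aggregate(segments: list) -> dict:
    """Aggregate per-segment stats; segments whose raw size is None have no stats."""
    with_stats = [s for s in segments if s["raw"] is not None]
    raw = sum(s["raw"] for s in with_stats)
    compressed = sum(s["compressed"] for s in with_stats)
    return {
        "rawForwardIndexSizePerReplicaInBytes": raw,
        "compressedForwardIndexSizePerReplicaInBytes": compressed,
        "compressionRatio": round(raw / compressed, 1) if compressed > 0 else 0.0,
        "segmentsWithStats": len(with_stats),
        "totalSegments": len(segments),
        "isPartialCoverage": len(with_stats) < len(segments),
    }


segments = [
    {"name": "seg_old", "raw": None, "compressed": 4_000_000},  # created before the flag
    {"name": "seg_new_1", "raw": 60_000_000, "compressed": 3_000_000},
    {"name": "seg_new_2", "raw": 30_000_000, "compressed": 2_000_000},
]
summary = aggregate(segments)
print(summary["compressionRatio"], summary["isPartialCoverage"])  # 18.0 True
```

If `seg_old` were counted with a raw size of zero instead, its 4 MB of compressed data would pull the ratio down to 10.0, which is why such segments are left out of both sums.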

**Realtime**: consuming segments are excluded; stats appear only after a segment is committed.

**All ingestion paths covered**: offline batch, realtime, and minion tasks all converge at `SegmentIndexCreationDriverImpl` → `BaseSegmentCreator`.

**Prometheus gauges**: `TABLE_COMPRESSION_RATIO_PERCENT`, `TABLE_RAW_FORWARD_INDEX_SIZE_PER_REPLICA`, `TABLE_COMPRESSED_FORWARD_INDEX_SIZE_PER_REPLICA`, `TABLE_TIERED_STORAGE_SIZE`. The gauges are cleared when the flag is disabled or when the table has only dictionary-encoded columns.

### What's Out of Scope
- **Dictionary-encoded column uncompressed size tracking (follow-up)**: Forward index writers for dict-encoded columns only see dictionary IDs, not raw values — tracking true uncompressed sizes requires instrumenting the stats collection phase before dictionary encoding.
- **UI changes (follow-up)**: Surface compression ratio, per-column stats, and tier breakdown in Pinot Console table detail page — API already returns all required data, purely a rendering change.

### Use Cases

1. **COGS estimation**: Compression ratio and per-column breakdown for informed storage cost projections
2. **Codec optimization**: Identify columns with poor compression ratios and switch codecs (e.g., LZ4 → ZSTANDARD for cold data)
3. **Capacity planning**: Right-size clusters by understanding true storage footprint with local vs tiered breakdown
4. **Schema optimization**: Identify columns that benefit from dictionary encoding vs raw encoding
5. **Index cost analysis**: Per-column index size visibility to evaluate cost-vs-performance trade-offs when adding or removing indexes
6. **Monitoring/alerting**: Alert when compression ratio degrades after schema changes or data pattern shifts

### Related Issues and PRs

- #18092 — Pluggable codec pipeline for RAW forward indexes (more codecs need visibility)
- #18097 — Parameterized forward index compression codecs
- #17826 — Delta/DeltaDelta encoding on raw columns
- #17291 — System tables support (future SQL interface for this data)
- #17169 — QueryLog System Table (precedent for system tables)
- #6804 — Support LZO/LZ4/ZSTD/DEFLATE/GZIP compression codecs for raw index
- #7973 — Chunk compression hardcoded to passthrough for metric columns

### Draft PR
#18185
