Add distribution aggregation and dual-write by sanghoonio · Pull Request #113 · databio/bbconf

sanghoonio · 2026-04-03T22:59:04Z

Summary

Adds bedset-level distribution aggregation, with the aggregation implemented primarily in SQL (leveraging gtars PR #248's reference-aligned bin widths).

Scope evolution

Initial scope was "add aggregation" (commit 300b45f). During review, two further changes landed:

0152f10: ruff import fix
7fe5c64: prune aggregation to meaningful fields + switch to SQL-side computation

The current PR reflects the full scope described below.

What's included

Schema changes

Drop dead tssdist column (no model references, no writers, no readers)
Add median_neighbor_distance scalar column to BedStats + model field
Add distributions: dict | None field to BedStatsModel (per-file JSONB, from gtars backend)

New `modules/aggregation.py`

Computes BedSetDistributions from member files' per-file distributions:

SQL aggregation (heavy lifting done in Postgres, not Python):
- scalar_summaries: AVG/STDDEV on BedStats scalar columns + 25-bin histogram of per-file means
- region_distribution: jsonb_each + jsonb_array_elements_text WITH ORDINALITY + GROUP BY (chrom, bin_idx) — valid now that gtars #248 gives reference-aligned bin widths per genome
- tss_histogram: element-wise aggregation across fixed-axis 100-bin arrays
Python aggregation (small nested JSONB):
- partitions: mean ± sd of per-file partition percentages
- composition: distinct value counts per metadata field (genome, assay, cell_type, tissue, target)
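As a rough illustration of what the `region_distribution` query computes, the sketch below reproduces the GROUP BY (chrom, bin_idx) step in plain Python. The input shape (one `{chrom: [bin values]}` dict per file) and the function name are assumptions made for illustration; they are not taken from `aggregation.py`.

```python
from collections import defaultdict
from statistics import mean, stdev


def aggregate_region_distribution(per_file):
    """Mean/sd per (chrom, bin_idx) across files.

    per_file: list of {chrom: [value_per_bin, ...]} dicts (assumed shape).
    """
    groups = defaultdict(list)
    for dist in per_file:
        for chrom, bins in dist.items():            # analogous to jsonb_each
            for bin_idx, value in enumerate(bins):  # analogous to WITH ORDINALITY
                groups[(chrom, bin_idx)].append(float(value))

    result = {}
    for (chrom, bin_idx), values in sorted(groups.items()):
        # Sample standard deviation is undefined for a single value,
        # so report None in that case.
        sd = stdev(values) if len(values) > 1 else None
        result.setdefault(chrom, []).append(
            {"bin": bin_idx, "mean": mean(values), "sd": sd, "n": len(values)}
        )
    return result


files = [{"chr1": [1, 2], "chr2": [4]}, {"chr1": [3, 2]}]
summary = aggregate_region_distribution(files)
```

Grouping by bin index is only meaningful when every file uses the same bin boundaries per chromosome, which is why the description ties this step to the reference-aligned bin widths from gtars #248.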

Aggregation fields pruned (not meaningful at collection level)

These stay in per-file distributions JSONB for single-file views, just aren't aggregated:

widths_histogram: per-file variable-range bins aren't summable; use scalar_summaries.mean_region_width histogram instead
neighbor_distances KDE: use new median_neighbor_distance scalar instead
gc_content KDE: use scalar_summaries.gc_content mean instead
chromosome_summaries: redundant with region_distribution
expected_partitions: per-file null hypothesis, not a collection property

Dual-write in `bedsets.create()`

Old SQL aggregation → bedset_means / bedset_standard_deviation columns (backward compat)
New distribution aggregation → bedset_stats JSONB column (when members have distributions)
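A minimal sketch of the dual-write idea, assuming "when members have distributions" means at least one member carries a non-empty blob. `BedSetRow`, `dual_write`, and the `aggregate` callable are hypothetical stand-ins; only the three column names come from this PR.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class BedSetRow:
    """Hypothetical stand-in for the bedset record; not the real ORM model."""
    bedset_means: Optional[dict] = None
    bedset_standard_deviation: Optional[dict] = None
    bedset_stats: Optional[dict] = None


def dual_write(
    row: BedSetRow,
    legacy_means: dict,
    legacy_sd: dict,
    member_distributions: list,
    aggregate: Callable[[list], dict],
) -> BedSetRow:
    # Legacy columns are always written, so existing readers keep working.
    row.bedset_means = legacy_means
    row.bedset_standard_deviation = legacy_sd
    # The new JSONB column is written only if some member has distributions
    # (assumption: "some", not "all").
    present = [d for d in member_distributions if d]
    row.bedset_stats = aggregate(present) if present else None
    return row


with_dists = dual_write(
    BedSetRow(), {"gc_content": 0.45}, {"gc_content": 0.02},
    [{"tss_histogram": [1, 2]}, None],
    aggregate=lambda ds: {"n_files": len(ds)},
)
without_dists = dual_write(
    BedSetRow(), {"gc_content": 0.45}, {"gc_content": 0.02},
    [None, None],
    aggregate=lambda ds: {"n_files": len(ds)},
)
```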

New retrieval methods

BedAgentBedSet.get_distributions(): read bedset_stats JSONB with fallback to legacy scalar columns
BedAgentBedFile.get_batch(): multi-ID bed metadata retrieval
BedAgentBedFile.aggregate_collection(): ad-hoc collection stats on arbitrary bed ID list
distributions: bool param on get_stats()
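The fallback behaviour described for `get_distributions()` could look roughly like the sketch below. The row is modelled as a plain dict, and the shape of the fallback result (a `scalar_summaries` mapping with `mean` and `sd`) is an assumption; the actual method may return a different structure.

```python
def read_distributions(row: dict) -> dict:
    """Prefer bedset_stats; otherwise rebuild scalar summaries from legacy columns."""
    stats = row.get("bedset_stats")
    if stats:
        return stats
    means = row.get("bedset_means") or {}
    sds = row.get("bedset_standard_deviation") or {}
    return {
        "scalar_summaries": {
            name: {"mean": value, "sd": sds.get(name)}
            for name, value in means.items()
        }
    }


new_style = read_distributions({"bedset_stats": {"composition": {"genome": {"hg38": 3}}}})
legacy = read_distributions(
    {
        "bedset_stats": None,
        "bedset_means": {"gc_content": 0.45},
        "bedset_standard_deviation": {"gc_content": 0.02},
    }
)
```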

Performance impact (1000-file aggregation)

| Metric | Before | After |
| --- | --- | --- |
| Wire transfer Postgres→worker | ~40 MB | ~150 KB |
| Latency | 1-3 s | <500 ms |
| Worker memory | ~40 MB of parsed JSONB | a few KB |

The SQL path avoids pulling 1000 raw distribution blobs into the Python worker — Postgres does the element-wise summation and returns only the aggregate result.
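To make the size argument concrete, here is a small sketch of element-wise summation over fixed-axis arrays, written in Python only to show the shape of the result. The function name and the toy 4-bin arrays are illustrative; the PR performs this step in Postgres on 100-bin arrays.

```python
def elementwise_sum(arrays):
    """Sum equal-length arrays position by position."""
    if not arrays:
        return []
    length = len(arrays[0])
    if any(len(a) != length for a in arrays):
        raise ValueError("all arrays must share the same fixed axis")
    return [sum(column) for column in zip(*arrays)]


per_file = [[1, 0, 2, 0], [0, 3, 1, 1], [2, 2, 0, 0]]
total = elementwise_sum(per_file)
```

The result has one value per bin regardless of how many files were summed, so the payload returned to the worker stays the same size as the collection grows.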

Dependencies

Depends on PR #112, "Add modular analysis backend config and distribution schema" (branch modular-backend-schema; adds the distributions JSONB column and related models)
Uses gtars PR #248's reference-aligned region_distribution bin widths (required for SQL aggregation correctness)

Test plan

Existing test suite passes locally: 52 passed, 5 skipped, 0 failed
CI lint + pytest
Manual parity check: compare SQL aggregation output vs old Python output on a small test bedset (after gtars-processed data is available)

🤖 Generated with Claude Code

New module aggregation.py computes collection-level stats from per-file distributions JSONB (composition, scalars, histograms, KDEs, partitions, chromosome stats). Returns BedSetDistributions. bedsets.py create() now dual-writes: old SQL mean/sd columns AND new bedset_stats JSONB. get_distributions() reads JSONB with fallback to old scalar columns. get_metadata() populates distributions when available. bedfiles.py adds get_batch() for multi-ID retrieval, aggregate_collection() wrapper, and distributions param on get_stats(). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Drop distributions from bedset aggregation that aren't meaningful at collection level (full blobs stay in per-file storage for single-file views):
- widths_histogram: use scalar_summaries.mean_region_width instead
- neighbor_distances KDE: use new median_neighbor_distance scalar
- gc_content KDE: use scalar_summaries.gc_content mean
- chromosome_summaries: redundant with region_distribution
- expected_partitions: per-file null hypothesis

Schema:
- Drop dead tssdist column (no model/writer/reader references)
- Add median_neighbor_distance column + model field

Aggregation: switch heavy lifting from Python to SQL. With gtars #248's reference-aligned region_distribution bin widths, Postgres can do element-wise aggregation via jsonb_array_elements + GROUP BY:
- region_distribution: SQL jsonb_each + unnest per-chrom arrays, GROUP BY (chrom, bin_idx), AVG/STDDEV. Returns only aggregated rows, not raw per-file blobs.
- tss_histogram: SQL element-wise SUM across fixed-axis 100-bin arrays.
- scalars: AVG/STDDEV on BedStats columns (no JSONB parsing). Plus histogram-of-means computed from the raw scalar values.
- partitions: stay in Python (small nested JSONB, already fast).

Remove obsolete Python helpers:
- _aggregate_variable_histogram (widths)
- _aggregate_variable_kde (neighbor_distances, gc_content)
- _aggregate_region_distribution (old Python re-bin-and-stack version)
- _aggregate_fixed_axis + _aggregate_fixed_axis_from_dists (TSS via JSONB)
- _aggregate_chromosome_stats

Expected performance impact for 1000-file aggregation:
- Wire transfer Postgres→worker: ~40MB → ~150KB
- Latency: 1-3s → <500ms
- Worker memory: ~40MB held → ~few KB

Tests: 52 passed, 5 skipped (same as before changes).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Supports bedboss's parallel Python-bindings-direct backend for side-by-side performance comparison against the subprocess-based 'gtars' backend. Both backends coexist during testing; only one will remain after benchmarking. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Backend Literal narrowed to "r" | "gtars" — the gtars-py backend was removed from bedboss after benchmarking showed the pure CLI with .fab binary FASTA matches its performance with simpler architecture. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

- Rewrite aggregation.py: composition, scalars, histograms, and partitions all computed in SQL (no more per-row Python loops or numpy)
- Partition aggregation uses flat percentage columns (works for all beds, both R and gtars backends)
- Scalar aggregation uses single query with AVG/STDDEV/MIN/MAX + width_bucket for histograms
- get_batch() gains distributions param; batch endpoint excludes distribution blobs by default to avoid large payloads

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
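For readers unfamiliar with `width_bucket`, the sketch below mirrors PostgreSQL's `width_bucket(operand, low, high, count)` semantics for ascending bounds in Python and builds a histogram from it. Folding the overflow bucket (values equal to `high`, such as the observed maximum) into the last bin is an assumption about how a MIN/MAX-bounded histogram would typically be handled, not something stated in this commit.

```python
def width_bucket(value, low, high, count):
    """Equal-width bucket number for ascending bounds.

    Returns 0 below `low`, count + 1 at or above `high`, otherwise 1..count.
    """
    if count <= 0 or low >= high:
        raise ValueError("need count > 0 and low < high")
    if value < low:
        return 0
    if value >= high:
        return count + 1
    return int((value - low) / (high - low) * count) + 1


def histogram(values, low, high, count):
    """Counts per bucket, with the overflow bucket folded into the last bin."""
    counts = [0] * count
    for v in values:
        bucket = min(max(width_bucket(v, low, high, count), 1), count)
        counts[bucket - 1] += 1
    return counts


bucket = width_bucket(5.35, 0.024, 10.06, 5)
counts = histogram([0, 5, 10], low=0, high=10, count=2)
```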

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

- Change distributions (bed_stats) and bedset_stats (bedsets) columns from JSON to JSONB for native operator support in aggregation queries
- Remove ::jsonb casts from aggregation SQL (no longer needed)
- Add BedBatchResult model with BedMetadataAll results so batch endpoint includes stats in serialized response
- get_batch returns BedBatchResult instead of BedListResult

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Match the client-side collection histogram bin count: min(25, max(3, ceil(sqrt(n)))). Previously used min(25, n) which produced too many bins for small collections.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
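The bin-count rule can be written directly from the formula in this commit message. The function names below are illustrative, and `n` is taken to be the number of files in the collection.

```python
import math


def collection_bin_count(n: int) -> int:
    """New rule: square-root of n, clamped to the range 3..25."""
    return min(25, max(3, math.ceil(math.sqrt(n))))


def previous_bin_count(n: int) -> int:
    """Old rule, shown for comparison."""
    return min(25, n)


comparison = {n: (previous_bin_count(n), collection_bin_count(n)) for n in (4, 10, 100, 1000)}
```

For a 10-file collection the old rule gives 10 bins (one per file), while the new rule gives 4, which is the "too many bins for small collections" case the message refers to.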

sanghoonio and others added 10 commits April 3, 2026 16:23

Fix ruff F821: import BedSetDistributions at module level

0152f10

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Fix ruff formatting in aggregation.py

d57b578

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Fix ruff: add BedBatchResult to top-level imports

ea92a13

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

sanghoonio wants to merge 10 commits into modular-backend-schema from modular-backend-logic
