Add distribution aggregation and dual-write#113

Open
sanghoonio wants to merge 10 commits into
modular-backend-schemafrom
modular-backend-logic
@sanghoonio sanghoonio commented Apr 3, 2026

Summary

Adds bedset-level distribution aggregation, computed primarily in SQL (this relies on the reference-aligned bin widths introduced in gtars PR #248).

Scope evolution

Initial scope was "add aggregation" (commit 300b45f). During review, two further changes landed:

  • 0152f10: ruff import fix
  • 7fe5c64: prune aggregation to meaningful fields + switch to SQL-side computation

The current PR reflects the full scope described below.

What's included

Schema changes

  • Drop dead tssdist column (no model references, no writers, no readers)
  • Add median_neighbor_distance scalar column to BedStats + model field
  • Add distributions: dict | None field to BedStatsModel (per-file JSONB, from gtars backend)

New modules/aggregation.py

Computes BedSetDistributions from member files' per-file distributions:

  • SQL aggregation (heavy lifting done in Postgres, not Python):
    • scalar_summaries: AVG/STDDEV on BedStats scalar columns + histogram of per-file means (at most 25 bins)
    • region_distribution: jsonb_each + jsonb_array_elements_text WITH ORDINALITY + GROUP BY (chrom, bin_idx) — valid now that gtars #248 gives reference-aligned bin widths per genome
    • tss_histogram: element-wise aggregation across fixed-axis 100-bin arrays
  • Python aggregation (small nested JSONB):
    • partitions: mean ± sd of per-file partition percentages
    • composition: distinct value counts per metadata field (genome, assay, cell_type, tissue, target)
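
As a reading aid for the element-wise aggregation described above, here is a minimal pure-Python sketch of what grouping by (chrom, bin_idx) computes. It is not the PR's implementation (that runs in Postgres). The input shape (one `{chrom: [bin values]}` dict per file) and the function names `aggregate_region_distribution` and `sum_fixed_axis` are assumptions made for illustration.

```python
from collections import defaultdict
from statistics import mean, stdev


def aggregate_region_distribution(per_file):
    """Mean/sd per (chrom, bin_idx) across files.

    per_file: list of {chrom: [bin values]} dicts, one per member file
    (assumed shape). Only valid when every file uses the same bin layout
    per chromosome, which is what reference-aligned bin widths provide.
    """
    grouped = defaultdict(list)
    for dist in per_file:
        for chrom, bins in dist.items():
            # enumerate is 0-based; SQL WITH ORDINALITY is 1-based.
            for bin_idx, value in enumerate(bins):
                grouped[(chrom, bin_idx)].append(float(value))

    result = {}
    for (chrom, bin_idx), values in sorted(grouped.items()):
        # Sample standard deviation; undefined (None) for a single value,
        # mirroring how SQL STDDEV yields NULL for one input row.
        sd = stdev(values) if len(values) > 1 else None
        result.setdefault(chrom, []).append(
            {"bin": bin_idx, "mean": mean(values), "sd": sd}
        )
    return result


def sum_fixed_axis(histograms):
    """Element-wise sum across equal-length arrays (e.g. 100-bin TSS histograms)."""
    return [sum(column) for column in zip(*histograms)]


if __name__ == "__main__":
    files = [{"chr1": [1, 2]}, {"chr1": [3, 6]}]
    print(aggregate_region_distribution(files))
    print(sum_fixed_axis([[1, 2, 3], [4, 5, 6]]))
```

A reference like this can also serve as the Python side of a parity check against the SQL output on a small bedset.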

Aggregation fields pruned (not meaningful at collection level)

These stay in per-file distributions JSONB for single-file views, just aren't aggregated:

  • widths_histogram: per-file variable-range bins aren't summable; use scalar_summaries.mean_region_width histogram instead
  • neighbor_distances KDE: use new median_neighbor_distance scalar instead
  • gc_content KDE: use scalar_summaries.gc_content mean instead
  • chromosome_summaries: redundant with region_distribution
  • expected_partitions: per-file null hypothesis, not a collection property

Dual-write in bedsets.create()

  • Old SQL aggregation → bedset_means / bedset_standard_deviation columns (backward compat)
  • New distribution aggregation → bedset_stats JSONB column (when members have distributions)
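
The dual-write rule can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in bedsets.create(): members are plain dicts, `build_bedset_row` is a hypothetical helper, the content placed in `bedset_stats` is a placeholder, and "when members have distributions" is read here as "at least one member has them".

```python
from statistics import mean, stdev


def build_bedset_row(members):
    """Build the column values written for a new bedset (hypothetical helper).

    members: list of dicts with a 'scalars' dict and an optional
    'distributions' dict (assumed shape).
    """
    names = sorted({name for m in members for name in m["scalars"]})
    means, sds = {}, {}
    for name in names:
        values = [
            m["scalars"][name]
            for m in members
            if m["scalars"].get(name) is not None
        ]
        means[name] = mean(values) if values else None
        sds[name] = stdev(values) if len(values) > 1 else None

    # Legacy columns are always written, for backward compatibility.
    row = {
        "bedset_means": means,
        "bedset_standard_deviation": sds,
        "bedset_stats": None,
    }

    # New JSONB column is written only when distributions exist.
    with_dists = [m for m in members if m.get("distributions")]
    if with_dists:
        # Placeholder payload; the real aggregate is richer.
        row["bedset_stats"] = {"n_files_with_distributions": len(with_dists)}
    return row


if __name__ == "__main__":
    legacy_only = [
        {"scalars": {"mean_region_width": 100.0}},
        {"scalars": {"mean_region_width": 300.0}},
    ]
    print(build_bedset_row(legacy_only))
```

Writing both targets in one place keeps older readers of the scalar columns working while newer clients move to `bedset_stats`.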

New retrieval methods

  • BedAgentBedSet.get_distributions(): read bedset_stats JSONB with fallback to legacy scalar columns
  • BedAgentBedFile.get_batch(): multi-ID bed metadata retrieval
  • BedAgentBedFile.aggregate_collection(): ad-hoc collection stats on arbitrary bed ID list
  • distributions: bool param on get_stats()
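
The read-with-fallback behaviour of get_distributions() might look roughly like the sketch below. The row is modelled as a plain dict, `read_distributions` is a hypothetical stand-in, and the shape of the fallback result (a `scalar_summaries` mapping of mean/sd pairs) is an assumption.

```python
def read_distributions(row):
    """Prefer the bedset_stats JSONB; fall back to legacy scalar columns.

    row: dict with optional 'bedset_stats', 'bedset_means' and
    'bedset_standard_deviation' entries (assumed shape).
    """
    stats = row.get("bedset_stats")
    if stats:
        return stats

    means = row.get("bedset_means") or {}
    sds = row.get("bedset_standard_deviation") or {}
    if not means:
        return None  # nothing stored in either place

    # Rebuild a minimal summary from the legacy mean/sd columns.
    return {
        "scalar_summaries": {
            name: {"mean": value, "sd": sds.get(name)}
            for name, value in means.items()
        }
    }


if __name__ == "__main__":
    old_row = {
        "bedset_stats": None,
        "bedset_means": {"gc_content": 0.5},
        "bedset_standard_deviation": {"gc_content": 0.1},
    }
    print(read_distributions(old_row))
```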

Performance impact (1000-file aggregation)

| Metric | Before | After |
|---|---|---|
| Wire transfer Postgres→worker | ~40 MB | ~150 KB |
| Latency | 1-3 s | <500 ms |
| Worker memory | ~40 MB of parsed JSONB | ~few KB |

The SQL path avoids pulling 1000 raw distribution blobs into the Python worker — Postgres does the element-wise summation and returns only the aggregate result.

Dependencies

Test plan

  • Existing tests pass locally (52 passed, 5 skipped, 0 failed)
  • CI lint + pytest
  • Manual parity check: compare SQL aggregation output vs old Python output on a small test bedset (after gtars-processed data is available)
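
One possible shape for the manual parity check is a recursive comparison of the two aggregate results with a floating-point tolerance, since SQL and Python may differ in the last few digits. The helper name `find_mismatches` and the default tolerances are assumptions, not part of this PR.

```python
import math


def find_mismatches(expected, actual, rel_tol=1e-9, abs_tol=1e-12, path=""):
    """Return the paths at which two nested JSON-like results differ."""
    if isinstance(expected, dict) and isinstance(actual, dict):
        mismatches = []
        for key in sorted(set(expected) | set(actual)):
            child = f"{path}.{key}" if path else str(key)
            if key not in expected or key not in actual:
                mismatches.append(child)  # key missing on one side
            else:
                mismatches.extend(
                    find_mismatches(
                        expected[key], actual[key], rel_tol, abs_tol, child
                    )
                )
        return mismatches

    if isinstance(expected, list) and isinstance(actual, list):
        if len(expected) != len(actual):
            return [path]
        mismatches = []
        for i, (e, a) in enumerate(zip(expected, actual)):
            mismatches.extend(
                find_mismatches(e, a, rel_tol, abs_tol, f"{path}[{i}]")
            )
        return mismatches

    if isinstance(expected, (int, float)) and isinstance(actual, (int, float)):
        close = math.isclose(expected, actual, rel_tol=rel_tol, abs_tol=abs_tol)
        return [] if close else [path]

    return [] if expected == actual else [path]


if __name__ == "__main__":
    python_out = {"tss_histogram": [1, 2, 3], "scalar_summaries": {"gc_content": {"mean": 0.5}}}
    sql_out = {"tss_histogram": [1, 2, 3.0], "scalar_summaries": {"gc_content": {"mean": 0.6}}}
    print(find_mismatches(python_out, sql_out))
```

An empty list means the two paths agree within tolerance; otherwise the returned paths point at the fields to inspect.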

🤖 Generated with Claude Code

sanghoonio and others added 10 commits April 3, 2026 16:23
New module aggregation.py computes collection-level stats from per-file
distributions JSONB (composition, scalars, histograms, KDEs, partitions,
chromosome stats). Returns BedSetDistributions.

bedsets.py create() now dual-writes: old SQL mean/sd columns AND new
bedset_stats JSONB. get_distributions() reads JSONB with fallback to
old scalar columns. get_metadata() populates distributions when available.

bedfiles.py adds get_batch() for multi-ID retrieval, aggregate_collection()
wrapper, and distributions param on get_stats().

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Drop distributions from bedset aggregation that aren't meaningful at
collection level (full blobs stay in per-file storage for single-file
views):
- widths_histogram: use scalar_summaries.mean_region_width instead
- neighbor_distances KDE: use new median_neighbor_distance scalar
- gc_content KDE: use scalar_summaries.gc_content mean
- chromosome_summaries: redundant with region_distribution
- expected_partitions: per-file null hypothesis

Schema:
- Drop dead tssdist column (no model/writer/reader references)
- Add median_neighbor_distance column + model field

Aggregation: switch heavy lifting from Python to SQL. With gtars #248's
reference-aligned region_distribution bin widths, Postgres can do
element-wise aggregation via jsonb_array_elements + GROUP BY:

- region_distribution: SQL jsonb_each + unnest per-chrom arrays, GROUP
  BY (chrom, bin_idx), AVG/STDDEV. Returns only aggregated rows, not
  raw per-file blobs.
- tss_histogram: SQL element-wise SUM across fixed-axis 100-bin arrays.
- scalars: AVG/STDDEV on BedStats columns (no JSONB parsing). Plus
  histogram-of-means computed from the raw scalar values.
- partitions: stay in Python (small nested JSONB, already fast).

Remove obsolete Python helpers:
- _aggregate_variable_histogram (widths)
- _aggregate_variable_kde (neighbor_distances, gc_content)
- _aggregate_region_distribution (old Python re-bin-and-stack version)
- _aggregate_fixed_axis + _aggregate_fixed_axis_from_dists (TSS via JSONB)
- _aggregate_chromosome_stats

Expected performance impact for 1000-file aggregation:
- Wire transfer Postgres→worker: ~40MB → ~150KB
- Latency: 1-3s → <500ms
- Worker memory: ~40MB held → ~few KB

Tests: 52 passed, 5 skipped (same as before changes).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Supports bedboss's parallel Python-bindings-direct backend for
side-by-side performance comparison against the subprocess-based
'gtars' backend. Both backends coexist during testing; only one
will remain after benchmarking.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Backend Literal narrowed to "r" | "gtars" — the gtars-py backend was
removed from bedboss after benchmarking showed the pure CLI with .fab
binary FASTA matches its performance with simpler architecture.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Rewrite aggregation.py: composition, scalars, histograms, and
  partitions all computed in SQL (no more per-row Python loops or numpy)
- Partition aggregation uses flat percentage columns (works for all beds,
  both R and gtars backends)
- Scalar aggregation uses single query with AVG/STDDEV/MIN/MAX +
  width_bucket for histograms
- get_batch() gains distributions param; batch endpoint excludes
  distribution blobs by default to avoid large payloads
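
For readers unfamiliar with Postgres's width_bucket, the sketch below mirrors its bucket numbering in Python for the ascending case (low < high): bucket 0 is below the range, buckets 1..count cover [low, high), and count + 1 holds values at or above high. The `histogram` helper, including its choice to fold the maximum value into the last bin, is an assumption for illustration and may differ from what the aggregation query does.

```python
def width_bucket(operand, low, high, count):
    """Python mirror of Postgres width_bucket for low < high."""
    if count <= 0 or low >= high:
        raise ValueError("count must be positive and low < high in this sketch")
    if operand < low:
        return 0
    if operand >= high:
        return count + 1
    index = int((operand - low) / (high - low) * count)
    # Guard against floating-point rounding pushing a value just below
    # `high` into a non-existent bucket.
    return min(index, count - 1) + 1


def histogram(values, count):
    """Equal-width histogram over [min, max] built on width_bucket (assumed design)."""
    low, high = min(values), max(values)
    if low == high:
        # Degenerate range: put everything in the first bin.
        return [len(values)] + [0] * (count - 1)
    counts = [0] * count
    for value in values:
        bucket = width_bucket(value, low, high, count)
        # The maximum lands in bucket count + 1; fold it into the last bin.
        counts[min(bucket, count) - 1] += 1
    return counts


if __name__ == "__main__":
    print(width_bucket(5.35, 0.024, 10.06, 5))
    print(histogram([0, 1, 2, 3, 4], 2))
```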

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Change distributions (bed_stats) and bedset_stats (bedsets) columns
  from JSON to JSONB for native operator support in aggregation queries
- Remove ::jsonb casts from aggregation SQL (no longer needed)
- Add BedBatchResult model with BedMetadataAll results so batch
  endpoint includes stats in serialized response
- get_batch returns BedBatchResult instead of BedListResult

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Match the client-side collection histogram bin count: min(25, max(3, ceil(sqrt(n)))).
Previously used min(25, n) which produced too many bins for small collections.
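
A small worked example of the two rules, written as a sketch. The function names are hypothetical; the formulas are the ones quoted in this commit message.

```python
import math


def histogram_bin_count(n):
    """New rule: min(25, max(3, ceil(sqrt(n)))) for a collection of n files."""
    return min(25, max(3, math.ceil(math.sqrt(n))))


def previous_bin_count(n):
    """Old rule: min(25, n)."""
    return min(25, n)


if __name__ == "__main__":
    for n in (1, 5, 16, 100, 1000):
        print(n, previous_bin_count(n), histogram_bin_count(n))
```

For a 16-file collection the old rule gives 16 bins (one file per bin on average), while the new rule gives 4; both rules cap at 25 for large collections.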

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>