Add distribution aggregation and dual-write #113
Open
sanghoonio wants to merge 10 commits into
New module aggregation.py computes collection-level stats from per-file distributions JSONB (composition, scalars, histograms, KDEs, partitions, chromosome stats). Returns BedSetDistributions.

bedsets.py create() now dual-writes: old SQL mean/sd columns AND new bedset_stats JSONB. get_distributions() reads JSONB with fallback to old scalar columns. get_metadata() populates distributions when available.

bedfiles.py adds get_batch() for multi-ID retrieval, aggregate_collection() wrapper, and distributions param on get_stats().

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
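To make the read-side fallback concrete, here is a minimal sketch of the logic described for get_distributions(). It is not the code in this PR: the function name read_distributions, the plain-dict row, and the shape of the rebuilt payload are all assumptions made for illustration. Only the column names bedset_stats, bedset_means, and bedset_standard_deviation come from the PR.

```python
from __future__ import annotations

from typing import Any, Optional


def read_distributions(row: dict[str, Any]) -> Optional[dict[str, Any]]:
    """Prefer the new JSONB payload, else rebuild scalars from legacy columns.

    `row` is a hypothetical plain-dict view of one bedset record.
    """
    stats = row.get("bedset_stats")
    if stats:
        # New-style record written by the dual-write path.
        return stats

    means = row.get("bedset_means") or {}
    sds = row.get("bedset_standard_deviation") or {}
    if not means:
        return None  # nothing to fall back to

    # Legacy records only carry scalar mean/sd pairs, so only scalar
    # summaries can be reconstructed (output shape is an assumption).
    return {
        "scalar_summaries": {
            name: {"mean": value, "sd": sds.get(name)}
            for name, value in means.items()
        }
    }
```

Records created before this change never get a bedset_stats value, so the fallback branch is what keeps older bedsets readable through the same method.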
Drop distributions from bedset aggregation that aren't meaningful at collection level (full blobs stay in per-file storage for single-file views):
- widths_histogram: use scalar_summaries.mean_region_width instead
- neighbor_distances KDE: use new median_neighbor_distance scalar
- gc_content KDE: use scalar_summaries.gc_content mean
- chromosome_summaries: redundant with region_distribution
- expected_partitions: per-file null hypothesis

Schema:
- Drop dead tssdist column (no model/writer/reader references)
- Add median_neighbor_distance column + model field

Aggregation: switch heavy lifting from Python to SQL. With gtars #248's reference-aligned region_distribution bin widths, Postgres can do element-wise aggregation via jsonb_array_elements + GROUP BY:
- region_distribution: SQL jsonb_each + unnest per-chrom arrays, GROUP BY (chrom, bin_idx), AVG/STDDEV. Returns only aggregated rows, not raw per-file blobs.
- tss_histogram: SQL element-wise SUM across fixed-axis 100-bin arrays.
- scalars: AVG/STDDEV on BedStats columns (no JSONB parsing). Plus histogram-of-means computed from the raw scalar values.
- partitions: stay in Python (small nested JSONB, already fast).

Remove obsolete Python helpers:
- _aggregate_variable_histogram (widths)
- _aggregate_variable_kde (neighbor_distances, gc_content)
- _aggregate_region_distribution (old Python re-bin-and-stack version)
- _aggregate_fixed_axis + _aggregate_fixed_axis_from_dists (TSS via JSONB)
- _aggregate_chromosome_stats

Expected performance impact for 1000-file aggregation:
- Wire transfer Postgres→worker: ~40MB → ~150KB
- Latency: 1-3s → <500ms
- Worker memory: ~40MB held → ~few KB

Tests: 52 passed, 5 skipped (same as before changes).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
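For reviewers who want a reference for what the GROUP BY (chrom, bin_idx) query should return, the following is a pure-Python sketch of the same grouping. It is not the removed _aggregate_region_distribution helper; the function name and the assumed per-file shape ({chrom: [bin values]}) are illustrative. It uses the sample standard deviation, which is what Postgres STDDEV computes, and returns None where Postgres would return NULL (a group with a single value).

```python
from collections import defaultdict
from statistics import mean, stdev


def aggregate_region_distribution(per_file):
    """Average each (chrom, bin_idx) cell across files.

    per_file: list of {chrom: [value_bin0, value_bin1, ...]} dicts
    (assumed shape, one dict per BED file).
    """
    groups = defaultdict(list)
    for dist in per_file:
        for chrom, bins in dist.items():
            for bin_idx, value in enumerate(bins):
                groups[(chrom, bin_idx)].append(float(value))

    rows = []
    for (chrom, bin_idx), values in sorted(groups.items()):
        rows.append(
            {
                "chrom": chrom,
                "bin_idx": bin_idx,
                "mean": mean(values),
                # Sample stddev; undefined for a single observation.
                "sd": stdev(values) if len(values) > 1 else None,
            }
        )
    return rows
```

The grouping is only meaningful when bin_idx refers to the same genomic interval in every file, which is why the commit depends on reference-aligned bin widths.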
Supports bedboss's parallel Python-bindings-direct backend for side-by-side performance comparison against the subprocess-based 'gtars' backend. Both backends coexist during testing; only one will remain after benchmarking.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Backend Literal narrowed to "r" | "gtars". The gtars-py backend was removed from bedboss after benchmarking showed that the pure CLI with a .fab binary FASTA matches its performance with a simpler architecture.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Rewrite aggregation.py: composition, scalars, histograms, and partitions all computed in SQL (no more per-row Python loops or numpy)
- Partition aggregation uses flat percentage columns (works for all beds, both R and gtars backends)
- Scalar aggregation uses single query with AVG/STDDEV/MIN/MAX + width_bucket for histograms
- get_batch() gains distributions param; batch endpoint excludes distribution blobs by default to avoid large payloads

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
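As a reference for the width_bucket step, here is a small Python sketch of the bucketing rule Postgres applies (bucket 0 below the range, count + 1 at or above the upper bound, equal-width buckets 1..count in between). How the actual query treats the maximum value is not stated in this PR; clamping it into the last bucket, as done in the hypothetical histogram() helper below, is an assumption.

```python
def width_bucket(value, low, high, count):
    """Mirror Postgres width_bucket(operand, low, high, count) for low < high."""
    if value < low:
        return 0
    if value >= high:
        return count + 1
    return int((value - low) / (high - low) * count) + 1


def histogram(values, n_bins):
    """Equal-width histogram over [min, max] of a non-empty list."""
    low, high = min(values), max(values)
    counts = [0] * n_bins
    if low == high:
        # Degenerate range: every value falls in the first bin.
        counts[0] = len(values)
        return counts
    for v in values:
        # Assumption: the maximum is clamped into the last bucket
        # instead of landing in the overflow bucket n_bins + 1.
        bucket = min(width_bucket(v, low, high, n_bins), n_bins)
        counts[bucket - 1] += 1
    return counts
```

Computing MIN and MAX in the same query as the bucketing is what lets the histogram be produced without returning the raw per-file scalars to the worker.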
- Change distributions (bed_stats) and bedset_stats (bedsets) columns from JSON to JSONB for native operator support in aggregation queries
- Remove ::jsonb casts from aggregation SQL (no longer needed)
- Add BedBatchResult model with BedMetadataAll results so batch endpoint includes stats in serialized response
- get_batch returns BedBatchResult instead of BedListResult

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Match the client-side collection histogram bin count: min(25, max(3, ceil(sqrt(n)))). The previous rule, min(25, n), produced too many bins for small collections.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
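The bin-count rule from this commit, written out as a small Python function for comparison with the old behaviour. The function names are hypothetical; only the two formulas come from the commit message.

```python
import math


def histogram_bin_count(n):
    """New rule: square-root choice, clamped to the range [3, 25]."""
    return min(25, max(3, math.ceil(math.sqrt(n))))


def old_histogram_bin_count(n):
    """Previous rule: one bin per file, capped at 25."""
    return min(25, n)
```

For a 10-file collection the old rule gives 10 bins (one file per bin on average), while the new rule gives 4. Both rules cap at 25, which the new one reaches once n exceeds 576.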
Summary
Adds bedset-level distribution aggregation, with the aggregation implemented primarily in SQL (leveraging gtars PR #248's reference-aligned bin widths).
Scope evolution
Initial scope was "add aggregation" (commit 300b45f). During review, two further changes landed:
- 0152f10: ruff import fix
- 7fe5c64: prune aggregation to meaningful fields + switch to SQL-side computation
The current PR reflects the full scope described below.
What's included
Schema changes
- Drop dead tssdist column (no model references, no writers, no readers)
- Add median_neighbor_distance scalar column to BedStats + model field
- Add distributions: dict | None field to BedStatsModel (per-file JSONB, from gtars backend)
New modules/aggregation.py
Computes BedSetDistributions from member files' per-file distributions:
- scalar_summaries: AVG/STDDEV on BedStats scalar columns + 25-bin histogram of per-file means
- region_distribution: jsonb_each + jsonb_array_elements_text WITH ORDINALITY + GROUP BY (chrom, bin_idx); valid now that gtars #248 gives reference-aligned bin widths per genome
- tss_histogram: element-wise aggregation across fixed-axis 100-bin arrays
- partitions: mean ± sd of per-file partition percentages
- composition: distinct value counts per metadata field (genome, assay, cell_type, tissue, target)
Aggregation fields pruned (not meaningful at collection level)
These stay in per-file distributions JSONB for single-file views, just aren't aggregated:
- widths_histogram: per-file variable-range bins aren't summable; use scalar_summaries.mean_region_width histogram instead
- neighbor_distances KDE: use new median_neighbor_distance scalar instead
- gc_content KDE: use scalar_summaries.gc_content mean instead
- chromosome_summaries: redundant with region_distribution
- expected_partitions: per-file null hypothesis, not a collection property
Dual-write in bedsets.create()
- bedset_means / bedset_standard_deviation columns (backward compat)
- bedset_stats JSONB column (when members have distributions)
New retrieval methods
- BedAgentBedSet.get_distributions(): read bedset_stats JSONB with fallback to legacy scalar columns
- BedAgentBedFile.get_batch(): multi-ID bed metadata retrieval
- BedAgentBedFile.aggregate_collection(): ad-hoc collection stats on arbitrary bed ID list
- distributions: bool param on get_stats()
Performance impact (1000-file aggregation)
The SQL path avoids pulling 1000 raw distribution blobs into the Python worker — Postgres does the element-wise summation and returns only the aggregate result.
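A Python sketch of the element-wise summation the database now performs for fixed-axis arrays such as tss_histogram. This illustrates the operation only, not the SQL in this PR; the function name is hypothetical. In the Python path every per-file array would have to be loaded into the worker before a loop like this could run, whereas the SQL path returns only the summed array.

```python
def sum_fixed_axis(histograms, n_bins=100):
    """Element-wise sum of equal-length histograms that share one bin axis."""
    totals = [0] * n_bins
    for hist in histograms:
        if len(hist) != n_bins:
            # Summing position by position is only valid on a shared axis.
            raise ValueError(f"expected {n_bins} bins, got {len(hist)}")
        for i, count in enumerate(hist):
            totals[i] += count
    return totals
```

The length check reflects the same constraint the SQL version relies on: bin i must mean the same thing in every file.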
Dependencies
- modular-backend-schema (adds distributions JSONB column and related models)
- Reference-aligned region_distribution bin widths from gtars #248 (required for SQL aggregation correctness)
Test plan
🤖 Generated with Claude Code