Skip to content

Fix CollectHsMetrics running for only a subset of input samples#230

Open
phonegor95 wants to merge 1 commit intonf-core:devfrom
phonegor95:fix/hsmetrics-channel-broadcast
Open

Fix CollectHsMetrics running for only a subset of input samples#230
phonegor95 wants to merge 1 commit intonf-core:devfrom
phonegor95:fix/hsmetrics-channel-broadcast

Conversation

@phonegor95
Copy link
Copy Markdown

Summary

Fixes a bug where PICARD_COLLECTHSMETRICS runs for only one input sample even though PICARD_COLLECTMULTIPLEMETRICS (in the same BAM_QC subworkflow) runs for all of them. With 4 input BAMs the pipeline silently produces 1 of 4 expected *.CollectHsMetrics.coverage_metrics files.

There are two interacting channel-topology bugs in the path that feeds PICARD_COLLECTHSMETRICS. Either alone is enough to suppress samples; both need fixing for full broadcast.

Bug 1: bait/target interval channels are single-emission queues

In workflows/seqinspector.nf:

bait_intervals   ? channel.fromPath(bait_intervals).collect()   : channel.empty()
target_intervals ? channel.fromPath(target_intervals).collect() : channel.empty()

channel.fromPath(...).collect() produces a queue channel that emits once. As the right-hand side of .combine() it is consumed by the first BAM and the queue ends — subsequent BAMs have nothing to combine against, so the cartesian product collapses.

Fix: use .first(), which returns a value channel that broadcasts to every consumer.

Bug 2: ch_bam_bai is consumed by both a process and an operator

In subworkflows/local/bam_qc/main.nf:

ch_hsmetrics_in = ch_bam_bai.combine(ch_bait_intervals).combine(ch_target_intervals)

PICARD_COLLECTHSMETRICS(ch_hsmetrics_in.filter { ... }, ...)
PICARD_COLLECTMULTIPLEMETRICS(ch_bam_bai.filter { ... }, ...)

ch_bam_bai is referenced as both the left operand of .combine() and a direct process input. When a queue channel feeds a process input and an operator in the same scope, items get distributed between consumers rather than broadcast — so some BAMs reach only one of the two paths. After fixing Bug 1 alone, only 2 of 4 BAMs reached PICARD_COLLECTHSMETRICS (the others were absorbed by PICARD_COLLECTMULTIPLEMETRICS).

Fix: fork the channel explicitly with multiMap into two named branches so each consumer gets every emission.

PICARD_COLLECTMULTIPLEMETRICS always runs for all samples because it's the consumer that wins the race when the queue is split.

Reproduction (before this PR)

  1. Samplesheet with ≥2 samples
  2. --run_picard_collecthsmetrics true with any valid --bait_intervals and --target_intervals
  3. Run the pipeline
  4. results/picard_collecthsmetrics/ contains 1 file regardless of input count

Verification (after this PR)

Tested locally with 4 input FASTQ pairs:

Process Before After
PICARD_COLLECTMULTIPLEMETRICS 4 of 4 ✔ 4 of 4 ✔
PICARD_COLLECTHSMETRICS 1 of 1 4 of 4

All four *.CollectHsMetrics.coverage_metrics appear under results/picard_collecthsmetrics/ and the MultiQC HsMetrics section now lists every sample.

Test plan

  • Manually verified that 4 BAMs → 4 PICARD_COLLECTHSMETRICS tasks instead of 1
  • Manually verified PICARD_COLLECTMULTIPLEMETRICS still receives all BAMs (it shares the same forked channel)
  • CHANGELOG.md updated under ## dev### Fixed
  • CI nf-test pipeline tests pass

Notes for reviewers

  • The fix is two minimal, independent changes (4 lines of effective code in workflows/seqinspector.nf plus a multiMap block in subworkflows/local/bam_qc/main.nf). Each change is necessary; verified locally that applying only one of the two leaves a residual broadcast bug (1 of N or 2 of N depending on which is applied).
  • A short comment is included in bam_qc/main.nf explaining why multiMap is needed, since the consumed-twice queue-channel pitfall is easy to reintroduce.
  • Out of scope: the default process_single resource label given to PICARD_COLLECTHSMETRICS is wildly insufficient for WGS-sized intervals (the per-base coverage arrays alone need ~12 GB for 3 Gb of "target"). Anyone setting bait_intervals = target_intervals = wgs_calling_regions.bed will hit OOM well before the channel fix matters. Worth a separate issue and a process_high_memory (or new process_huge) label, but I've kept this PR scoped to the channel topology bug.

PR checklist

  • This comment contains a description of changes (with reason).
  • If you've fixed a bug or added code that should be tested, add tests! (channel-topology test exercised manually; happy to add an nf-test if maintainers prefer)
  • If you've added a new tool - have you followed the pipeline conventions in the contribution docs (N/A — bug fix)
  • If necessary, also make a PR on the nf-core/seqinspector branch on the nf-core/test-datasets repository. (N/A)
  • Make sure your code lints (nf-core pipelines lint).
  • Ensure the test suite passes (nextflow run . -profile test,docker --outdir <OUTDIR>).
  • Check for unexpected warnings in debug mode (nextflow run . -profile debug,test,docker --outdir <OUTDIR>).
  • Usage Documentation in docs/usage.md is updated. (N/A — no parameter changes)
  • Output Documentation in docs/output.md is updated. (N/A — no output changes)
  • CHANGELOG.md is updated.
  • README.md is updated (including new tool citations and authors/contributors). (N/A)

Two interacting channel-topology bugs caused PICARD_COLLECTHSMETRICS to
receive fewer items than the BAM channel emits:

1. Bait/target interval channels were built with `channel.fromPath(...)
   .collect()`, which produces a single-emission queue channel. Used as
   the right operand of `.combine()` it is consumed by the first BAM,
   leaving subsequent BAMs with nothing to combine against. Switching to
   `.first()` returns a value channel that broadcasts to every consumer.

2. `ch_bam_bai` was used both as a direct process input to
   PICARD_COLLECTMULTIPLEMETRICS and as the left operand of `.combine()`
   feeding PICARD_COLLECTHSMETRICS. When a queue channel feeds a process
   input and an operator in the same scope, items are distributed
   between the two paths rather than broadcast. Forking the channel
   explicitly with `multiMap` into two named branches ensures both
   consumers receive every emission.

Verified locally with 4 input FASTQ pairs: before, only 1 of 4 BAMs
reached CollectHsMetrics; after, all 4 do, matching CollectMultipleMetrics.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant