Fix CollectHsMetrics running for only a subset of input samples#230
Open
phonegor95 wants to merge 1 commit intonf-core:devfrom
Open
Fix CollectHsMetrics running for only a subset of input samples#230phonegor95 wants to merge 1 commit intonf-core:devfrom
phonegor95 wants to merge 1 commit intonf-core:devfrom
Conversation
Two interacting channel-topology bugs caused PICARD_COLLECTHSMETRICS to receive fewer items than the BAM channel emits: 1. Bait/target interval channels were built with `channel.fromPath(...) .collect()`, which produces a single-emission queue channel. Used as the right operand of `.combine()` it is consumed by the first BAM, leaving subsequent BAMs with nothing to combine against. Switching to `.first()` returns a value channel that broadcasts to every consumer. 2. `ch_bam_bai` was used both as a direct process input to PICARD_COLLECTMULTIPLEMETRICS and as the left operand of `.combine()` feeding PICARD_COLLECTHSMETRICS. When a queue channel feeds a process input and an operator in the same scope, items are distributed between the two paths rather than broadcast. Forking the channel explicitly with `multiMap` into two named branches ensures both consumers receive every emission. Verified locally with 4 input FASTQ pairs: before, only 1 of 4 BAMs reached CollectHsMetrics; after, all 4 do, matching CollectMultipleMetrics. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Fixes a bug where
PICARD_COLLECTHSMETRICSruns for only one input sample even thoughPICARD_COLLECTMULTIPLEMETRICS(in the sameBAM_QCsubworkflow) runs for all of them. With 4 input BAMs the pipeline silently produces 1 of 4 expected*.CollectHsMetrics.coverage_metricsfiles.There are two interacting channel-topology bugs in the path that feeds
PICARD_COLLECTHSMETRICS. Either alone is enough to suppress samples; both need fixing for full broadcast.Bug 1: bait/target interval channels are single-emission queues
In
workflows/seqinspector.nf:channel.fromPath(...).collect()produces a queue channel that emits once. As the right-hand side of.combine()it is consumed by the first BAM and the queue ends — subsequent BAMs have nothing to combine against, so the cartesian product collapses.Fix: use
.first(), which returns a value channel that broadcasts to every consumer.Bug 2:
ch_bam_baiis consumed by both a process and an operatorIn
subworkflows/local/bam_qc/main.nf:ch_bam_baiis referenced as both the left operand of.combine()and a direct process input. When a queue channel feeds a process input and an operator in the same scope, items get distributed between consumers rather than broadcast — so some BAMs reach only one of the two paths. After fixing Bug 1 alone, only 2 of 4 BAMs reachedPICARD_COLLECTHSMETRICS(the others were absorbed byPICARD_COLLECTMULTIPLEMETRICS).Fix: fork the channel explicitly with
multiMapinto two named branches so each consumer gets every emission.PICARD_COLLECTMULTIPLEMETRICSalways runs for all samples because it's the consumer that wins the race when the queue is split.Reproduction (before this PR)
--run_picard_collecthsmetrics truewith any valid--bait_intervalsand--target_intervalsresults/picard_collecthsmetrics/contains 1 file regardless of input countVerification (after this PR)
Tested locally with 4 input FASTQ pairs:
PICARD_COLLECTMULTIPLEMETRICSPICARD_COLLECTHSMETRICSAll four
*.CollectHsMetrics.coverage_metricsappear underresults/picard_collecthsmetrics/and the MultiQC HsMetrics section now lists every sample.Test plan
PICARD_COLLECTHSMETRICStasks instead of 1PICARD_COLLECTMULTIPLEMETRICSstill receives all BAMs (it shares the same forked channel)## dev→### FixedNotes for reviewers
workflows/seqinspector.nfplus amultiMapblock insubworkflows/local/bam_qc/main.nf). Each change is necessary; verified locally that applying only one of the two leaves a residual broadcast bug (1 of N or 2 of N depending on which is applied).bam_qc/main.nfexplaining whymultiMapis needed, since the consumed-twice queue-channel pitfall is easy to reintroduce.process_singleresource label given toPICARD_COLLECTHSMETRICSis wildly insufficient for WGS-sized intervals (the per-base coverage arrays alone need ~12 GB for 3 Gb of "target"). Anyone settingbait_intervals = target_intervals = wgs_calling_regions.bedwill hit OOM well before the channel fix matters. Worth a separate issue and aprocess_high_memory(or newprocess_huge) label, but I've kept this PR scoped to the channel topology bug.PR checklist
nf-core pipelines lint).nextflow run . -profile test,docker --outdir <OUTDIR>).nextflow run . -profile debug,test,docker --outdir <OUTDIR>).docs/usage.mdis updated. (N/A — no parameter changes)docs/output.mdis updated. (N/A — no output changes)CHANGELOG.mdis updated.README.mdis updated (including new tool citations and authors/contributors). (N/A)