CollectHsMetrics only runs for one sample regardless of input count #229

@phonegor95

Bug: CollectHsMetrics only runs for one sample regardless of input count

Description

When --run_picard_collecthsmetrics true is set, PICARD_COLLECTHSMETRICS runs for only one of the input samples, no matter how many FASTQ pairs or BAMs are in the run. The other samples are skipped silently, with no error or warning. The Nextflow progress UI confirms that only one task is ever instantiated:

[xx/xxxxxx] NFCORE_SEQINSPECTOR:SEQINSPECTOR:QC_BAM:PICARD_COLLECTHSMETRICS (sampleX) [100%] 1 of 1

PICARD_COLLECTMULTIPLEMETRICS, which sits in the same subworkflow but does not consume the interval channels, runs correctly for all samples.

Expected behaviour

PICARD_COLLECTHSMETRICS should run once per input BAM, just like PICARD_COLLECTMULTIPLEMETRICS does.

Actual behaviour

Only one task is created. Which sample "wins" is non-deterministic: it is whichever one the Nextflow scheduler happens to pull off the BAM queue first.

Root cause

In workflows/seqinspector.nf the bait and target interval channels are constructed with .collect():

ch_bait_intervals   = bait_intervals   ? channel.fromPath(bait_intervals).collect()   : channel.empty()
ch_target_intervals = target_intervals ? channel.fromPath(target_intervals).collect() : channel.empty()

channel.fromPath(...).collect() produces a queue channel that emits once (one list of paths). In subworkflows/local/qc_bam/main.nf these channels are joined to the BAM channel via .combine(...):

ch_hsmetrics_in = ch_bam_bai
    .combine(ch_bait_intervals)
    .combine(ch_target_intervals)

When .combine() is applied to two queue channels, the right-hand channel is consumed once. With a single emission on the right, the first BAM consumes that emission and the channel ends — the remaining N-1 BAMs find nothing to combine with, so the cartesian product collapses to one tuple.

PICARD_COLLECTMULTIPLEMETRICS is unaffected because it does not combine with the interval channels.
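
The mechanism described above can be checked in isolation with a small standalone script. This is only a sketch: the file name check_combine.nf, the sample names, and the intervals.bed path are placeholders that do not come from the pipeline, and the sample channel is a simplified stand-in for ch_bam_bai. The Nextflow documentation lists both collect and first among the operators that return a value channel, so it may be worth confirming on the Nextflow version in use whether the two constructions really behave differently under .combine(). No expected output is given here; the printed counts are the result to look at.

```nextflow
// check_combine.nf: hypothetical standalone script, not part of nf-core/seqinspector.
workflow {
    // Stand-in for ch_bam_bai: four sample names instead of real [meta, bam, bai] tuples.
    ch_samples = channel.of('sample1', 'sample2', 'sample3', 'sample4')

    // Placeholder path; point it at any existing interval file.
    intervals = 'intervals.bed'

    // The two constructions compared in this report.
    ch_collect = channel.fromPath(intervals).collect()
    ch_first   = channel.fromPath(intervals).first()

    // Count how many tuples each combination emits.
    ch_samples.combine(ch_collect).count().view { n -> "collect(): ${n} tuples" }
    ch_samples.combine(ch_first).count().view { n -> "first(): ${n} tuples" }
}
```

Run it with nextflow run check_combine.nf. If both lines report four tuples, the single HsMetrics task likely has a different cause than the .combine() step, for example how the interval channels are passed into the process itself.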

Reproduction

  1. Provide a samplesheet with at least two samples / FASTQ pairs.
  2. Set --run_picard_collecthsmetrics true and supply --bait_intervals and --target_intervals (any valid BED or interval_list works).
  3. Run the pipeline.
  4. Observe results/picard_collecthsmetrics/ — only one *.CollectHsMetrics.coverage_metrics file is produced.

Minimal reproducer config

input:                       samplesheet.csv      # >=2 rows
genome:                      GRCh38
fasta:                       /path/to/Homo_sapiens_assembly38.fasta
dict:                        /path/to/Homo_sapiens_assembly38.dict
bwamem2:                     /path/to/BWAmem2Index
run_picard_collecthsmetrics: true
bait_intervals:              /path/to/wgs_calling_regions.bed
target_intervals:            /path/to/wgs_calling_regions.bed

Evidence from a real run

  • 4 input FASTQ pairs → 4 BAMs produced by BWAMEM2_MEM
  • PICARD_COLLECTMULTIPLEMETRICS → 4 tasks ✅
  • PICARD_COLLECTHSMETRICS → 1 task ❌

.nextflow.log shows only a single PICARD_COLLECTHSMETRICS cache/submission entry across the whole run.

Proposed fix

Use .first() (which produces a value channel that broadcasts to every consumer) instead of .collect():

--- a/workflows/seqinspector.nf
+++ b/workflows/seqinspector.nf
@@ -184,8 +184,8 @@
     if (!("picard_collectmultiplemetrics" in skip_tools)) {
 
-        ch_bait_intervals = bait_intervals ? channel.fromPath(bait_intervals).collect() : channel.empty()
-        ch_target_intervals = target_intervals ? channel.fromPath(target_intervals).collect() : channel.empty()
+        ch_bait_intervals = bait_intervals ? channel.fromPath(bait_intervals).first() : channel.empty()
+        ch_target_intervals = target_intervals ? channel.fromPath(target_intervals).first() : channel.empty()
 
         QC_BAM(
             ch_bwamem2_mem,

Verified locally: applying this patch produces the expected N tasks for N input BAMs, all four *.CollectHsMetrics.coverage_metrics files appear under results/picard_collecthsmetrics/, and MultiQC's HsMetrics section now lists every sample.

Environment

  • nf-core/seqinspector: master (exact commit not recorded at time of report; to be replaced with the output of nextflow info nf-core/seqinspector)
  • Nextflow: 25.10.4
  • Profile: singularity
  • Picard: 3.4.0 (from the pipeline-pinned container)
  • Executor: SLURM
