norm -m + --atomize inconsistent representation of complex variants #2482

Description

@mi3112

I am working with a WES tumor-only experiment.
Variant calling was performed using three different tools:

  • Mpileup
  • Mutect2
  • Freebayes

(in the pictures I kept the same order)

Image

All three callers detect the same complex variant, but each represents it differently in the original VCF.

To normalize the variants, I used the following command:

bcftools norm --atomize -f ref.fasta -o output.vcf input.vcf

Before normalization, the callers report the following representations:

Freebayes

chr17	7675081	.	GGGGCAGC	GGA

Mutect2

chr17	7675082	.	GGGC	G	
chr17	7675086	.	AGC	A	

Mpileup

chr17	7675081	.	GGGGCAG	GG	
chr17	7675088	.	C	A	

After normalization, the results were:

Freebayes:

chr17	7675083	.	GGCAGC	A

Mutect2:

chr17	7675082	.	GGGC	G	
chr17	7675086	.	AGC	A

Mpileup:

chr17	7675081	.	GGGGCA	G	
chr17	7675088	.	C	A

Even after applying bcftools norm --atomize, the same biological variant is still represented differently across callers:

  • Different POS
  • Different decomposition boundaries
  • Different REF/ALT lengths
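To back up the claim that these records describe the same change, here is a minimal Python sketch that applies each caller's normalized records to the local reference and compares the resulting sequences. It assumes the reference at chr17:7675081-7675088 is GGGGCAGC (inferred from the REF alleles above, not read from ref.fasta) and that each caller's records sit on the same haplotype. The helper name `apply_records` is hypothetical and is not part of bcftools.

```python
# Reference bases at chr17:7675081-7675088, inferred from the REF alleles
# quoted in this issue (an assumption, not read from ref.fasta).
REF_START = 7675081
REF_SEQ = "GGGGCAGC"


def apply_records(records, ref_seq=REF_SEQ, ref_start=REF_START):
    """Apply non-overlapping (pos, ref, alt) records and return the altered sequence."""
    out = []
    cursor = ref_start  # next reference position that has not been copied yet
    for pos, ref, alt in sorted(records):
        if pos < cursor:
            raise ValueError(f"overlapping record at {pos}")
        offset = pos - ref_start
        if ref_seq[offset:offset + len(ref)] != ref:
            raise ValueError(f"REF mismatch at {pos}")
        out.append(ref_seq[cursor - ref_start:offset])  # untouched bases before the record
        out.append(alt)
        cursor = pos + len(ref)
    out.append(ref_seq[cursor - ref_start:])  # untouched bases after the last record
    return "".join(out)


# Records as they appear after `bcftools norm --atomize` in this issue.
normalized = {
    "Freebayes": [(7675083, "GGCAGC", "A")],
    "Mutect2": [(7675082, "GGGC", "G"), (7675086, "AGC", "A")],
    "Mpileup": [(7675081, "GGGGCA", "G"), (7675088, "C", "A")],
}

haplotypes = {caller: apply_records(recs) for caller, recs in normalized.items()}
for caller, hap in haplotypes.items():
    print(caller, hap)
```

Under those assumptions, all three callers yield the same altered sequence, GGA, in place of GGGGCAGC, so the records are equivalent at the haplotype level even though POS, REF and ALT differ.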

I was expecting --atomize to produce a canonical, consistent representation across callers (same coordinates and minimal atomic variants), but this did not happen.

Is this behavior expected?
Does --atomize intentionally preserve caller-specific breakpoints or representations?

Is there a recommended way to obtain an identical representation for complex variants across different callers, so that intersections/overlaps between VCFs can be computed reliably?
