Commit 008387a

Incorporated @yfarjoun #421 into VCFv4.4
Added PSO field to remove traversal ambiguity. Using preceding GT notation to match BCF. Added BCF clarification on what to do with the missing first allele GT separator. Defined implicit GT separator based on the other separators. Removed obsolete definition of bundles. #643
1 parent e1acf3f commit 008387a

1 file changed

Lines changed: 57 additions & 62 deletions

VCFv4.4.draft.tex

@@ -489,6 +489,10 @@ \subsubsection{Genotype fields}
489489
PP & G & Integer & Phred-scaled genotype posterior probabilities rounded to the closest integer \\
490490
PQ & 1 & Integer & Phasing quality \\
491491
PS & 1 & Integer & Phase set \\
492+
PSL & P & String & Phase set list \\
493+
PSO & P & Integer & Phase set list ordinal \\
494+
PSQ & P & Integer & Phase set list quality \\
495+
492496
\end{longtable}
493497
494498
\begin{itemize}
@@ -503,17 +507,18 @@ \subsubsection{Genotype fields}
503507
No whitespace or semicolons permitted.
504508
\item GQ (Integer): Conditional genotype quality, encoded as a phred quality $-10log_{10}$ p(genotype call is wrong, conditioned on the site's being variant).
505509
\item GP (Float): Genotype posterior probabilities in the range 0 to 1 using the same ordering as the GL field; one use can be to store imputed genotype probabilities.
506-
\item GT (String): Genotype, encoded as allele values separated by either of $/$ or $\mid$.
507-
The allele values are 0 for the reference allele (what is in the REF field), 1 for the first allele listed in ALT, 2 for the second allele list in ALT and so on.
508-
For diploid calls examples could be $0/1$, $1\mid0$, or $1/2$, etc.
509-
Haploid calls, e.g.\ on Y, male non-pseudoautosomal X, or mitochondrion, are indicated by having only one allele value.
510-
A triploid call might look like $0/0/1$.
511-
If a call cannot be made for a sample at a given locus, `.' must be specified for each missing allele in the GT field (for example `$./.$' for a diploid genotype and `.' for haploid genotype).
512-
The meanings of the separators are as follows (see the PS field below for more details on incorporating phasing information into the genotypes):
513-
\begin{itemize}
514-
\item $/$ : genotype unphased
515-
\item $\mid$ : genotype phased
516-
\end{itemize}
510+
\item GT (String): Genotype, encoded as allele values, each preceded by either $/$ or $\mid$ depending on whether that allele is considered phased.
511+
The first separator may be omitted, in which case it is implicitly $/$ if any other separator is $/$, and $\mid$ otherwise.
512+
The allele values are 0 for the reference allele (what is in the REF field), 1 for the first allele listed in ALT, 2 for the second allele listed in ALT, and so on.
513+
For diploid calls examples could be $0/1$, $1\mid0$, $/0/1$, or $1/2$, etc.
514+
Haploid calls, e.g.\ on Y, male non-pseudoautosomal X, or mitochondria, should be indicated by having only one allele value.
515+
A triploid call might look like $0/0/1$, and a partially phased triploid call could be $|0/1/2$ to indicate that the first allele is phased with another variant in the VCF.
516+
If a call cannot be made for a sample at a given locus, `$.$' must be specified for each missing allele in the {\tt GT} field (for example `$./.$' for a diploid genotype and `$.$' for a haploid genotype).
517+
The meanings of the separators are as follows (see the {\tt PS} and {\tt PSL} fields below for more details on incorporating phasing information into the genotypes):
518+
\begin{itemize}
519+
\item $/$ : the allele that follows is unphased
520+
\item $\mid$ : the allele that follows is phased (according to the phase set indicated in {\tt PS} or {\tt PSL})
521+
\end{itemize}
517522
518523
\item GL (Float): Genotype likelihoods comprised of comma separated floating point $log_{10}$-scaled likelihoods for all possible genotypes given the set of alleles defined in the REF and ALT fields.
519524
In presence of the GT field the same ploidy is expected; without GT field, diploidy is assumed.
@@ -583,6 +588,45 @@ \subsubsection{Genotype fields}
583588
All phased genotypes that do not contain a PS subfield are assumed to belong to the same phased set.
584589
If the genotype in the GT field is unphased, the corresponding PS field is ignored.
585590
The recommended convention is to use the position of the first variant in the set as the PS identifier (although this is not required).
591+
\item PSL (List of Strings): The list of phase sets, one for each allele specified in the {\tt GT}.
592+
Unphased alleles (those without a $\mid$ separator before them) must have the value `$.$' in their corresponding position in the list.
593+
Unlike {\tt PS} (which is defined per CHROM), records with different CHROM but the same phase-set name are considered part of the same phase set.
594+
If an implementation cannot guarantee uniqueness of phase-set names across the VCF (for example, when phasing a streaming VCF or when each CHROM is processed independently in parallel), new phase-set names should be of the format CHROM*POS*ALLELE-NUMBER of the ``first'' allele which is included in this set, with ALLELE-NUMBER being the index of the allele in the {\tt GT} field, since multiple distinct phase sets could start at the same position.\footnote{The `*' character is used as a separator since `:' is not reserved in the CHROM column.}
595+
A given sample-genotype must not have values for both PS and PSL.
596+
In addition, PS and PSL are not interoperable: a PS mentioned in one variant cannot be referenced in a PSL in another, because a PS value is not tied to a specific haplotype (i.e.\ first or second), whereas a PSL value is.
597+
598+
Example:
599+
600+
\vspace{0.5em}
601+
\begin{tabular}{ l l l l l l l l l l}
602+
\#CHROM & POS & ID & REF & ALT & QUAL & FILTER & INFO & FORMAT & SAMPLE1\\
603+
chr19 & $5$ & . & T & G & . & PASS & DP=100 &GT:PSL & \tt{|0/1:chr19*5*1,.}\\
604+
chr20 & $10$ & . & A & T,G & . & PASS & DP=100 &GT:PSL & \tt{|1/2|3:chr20*10*1,.,chr19*5*1} \\
605+
chr20 & $15$ & . & G & C & . & PASS & DP=100 &GT:PSL & \tt{1|2:.,chr20*10*1}\\
606+
\end{tabular}
607+
608+
\item PSO (List of integers): List of phase set ordinals.
609+
For each phase-set name, PSO defines the order in which variants are encountered when traversing a derivative chromosome.
610+
The missing value `$.$' should be used when the corresponding PSO value is missing.
611+
For each phase-set name, PSO should be defined if any allele with that phase-set name on any record is a symbolic structural variant or is in breakpoint notation.
612+
Variants in breakpoint notation must have the same PSL and PSO on both records.
613+
614+
Without explicitly specifying the derivative chromosome traversal order, multiple derivative chromosome reconstructions are possible.
615+
Take for example this tandem duplication in a triploid organism with SNVs (ID/QUAL/FILTER columns removed for clarity):
616+
617+
\vspace{0.5em}
618+
\begin{tabular}{ l l l l l l l l l l}
619+
\#CHROM & POS & REF & ALT & INFO & FORMAT & SAMPLE1\\
620+
chr1 & $10$ & T & $<$DUP$>$ & SVCLAIM=DJ & GT:PSL:PSO & \tt{/0/0|1:.,.,chr1*10*1:.,.,3}\\
621+
chr1 & $20$ & A & G & . & GT:PSL:PSO & \tt{/0/0|0|1:.,chr1*10*1:.,.,4,1} \\
622+
chr1 & $30$ & G & T & . & GT:PSL:PSO & \tt{/0/0|0|1:.,chr1*10*1:.,.,2,5} \\
623+
\end{tabular}
624+
625+
Without PSO, it would be ambiguous which copy of the duplicated region each SNV occurs on.
626+
In this example, the presence of the PSO field clarifies that the SNVs are cis-phased with the duplication, that the first SNV occurs on the first copy of the duplicated region, and that the second SNV occurs on the second copy.
627+
628+
\item PSQ (List of integers): The list of PQs, one for each phase set in PSL (encoded like PQ).
629+
The missing value `$.$' should be used when the corresponding PSL value is missing, or when the phasing is of unknown quality.
586630
\end{itemize}
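The preceding-separator GT notation and the PSL rule for unphased alleles added in this commit can be illustrated with a small sketch. This is not part of the specification or of any existing library: the function names `parse_gt` and `check_psl` are hypothetical, and the handling of a GT with no separators at all (treated as phased) is only one literal reading of the "implicitly $/$ if any separator is $/$ and $\mid$ otherwise" rule.

```python
import re

# Whole-string shape: optional leading separator, then alleles joined by / or |.
_GT_RE = re.compile(r"[/|]?(?:\.|\d+)(?:[/|](?:\.|\d+))*")
# One allele together with the separator that precedes it (possibly omitted).
_TOKEN_RE = re.compile(r"([/|]?)(\.|\d+)")


def parse_gt(gt):
    """Return a list of (allele, phased) pairs for a GT string.

    allele is an int, or None for the missing value '.'.
    phased is True when the separator preceding the allele is '|'.
    """
    if not _GT_RE.fullmatch(gt):
        raise ValueError(f"malformed GT: {gt!r}")
    tokens = _TOKEN_RE.findall(gt)
    seps = [sep for sep, _ in tokens]
    if seps[0] == "":
        # Implicit first separator: '/' if any other separator is '/',
        # '|' otherwise (assumption: this also covers haploid calls).
        seps[0] = "/" if "/" in seps[1:] else "|"
    return [
        (None if allele == "." else int(allele), sep == "|")
        for (_, allele), sep in zip(tokens, seps)
    ]


def check_psl(gt, psl):
    """Return a list of problems found when pairing a PSL value with a GT."""
    pairs = parse_gt(gt)
    names = psl.split(",")
    if len(names) != len(pairs):
        return [f"PSL has {len(names)} entries but GT has {len(pairs)} alleles"]
    return [
        f"allele {i} is unphased but PSL entry is {name!r}"
        for i, ((_, phased), name) in enumerate(zip(pairs, names))
        if not phased and name != "."
    ]


print(parse_gt("|0/1/2"))
print(check_psl("|0/1", "chr20*10*1,."))
```

The check only enforces what the text states explicitly: one PSL entry per allele, and `.` for every unphased allele. Whether a phased allele may carry `.` is left open here, since the draft does not say.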
587631
588632
@@ -1541,57 +1585,6 @@ \subsubsection{Clonal derivation relationships}
15411585
In the case of the duplication of a region within a haplotype, one copy retains the original haplotype identifier, and the others are considered to be novel haplotypes with their own unique identifiers.
15421586
All these novel haplotypes have in common their \textbf{haplotype ancestor} in the parent genome.
15431587
1544-
\subsubsection{Phasing adjacencies in an aneuploid context}
1545-
In a cancer genome, due to duplication followed by mutation, there can in principle exist any number of haplotypes in the sampled genome for a given location in the reference genome.
1546-
We assume each haplotype that the user chooses to name is named with a numerical haplotype identifier.
1547-
Although it is difficult with current technologies to associate haplotypes with novel adjacencies, it might be partially possible to deconvolve these connections in the near future.
1548-
We therefore propose the following notation to allow haplotype-ambiguous as well as haplotype-unambiguous connections to be described.
1549-
The general term for these haplotype-specific adjacencies is \textbf{bundles}.
1550-
1551-
The diagram in Figure 11 will be used to support examples below:
1552-
1553-
\begin{figure}[ht]
1554-
\centering
1555-
\includegraphics[width=4in,height=2.59in]{img/phasing-400x259.png}
1556-
\caption{Phasing}
1557-
\end{figure}
1558-
1559-
In this example, we know that in the sampled genome:
1560-
1561-
\begin{enumerate}
1562-
\item A reference bundle connects breakend U, haplotype 5 on chr13 to its partner, breakend X, haplotype 5 on chr13,
1563-
\item A novel bundle connects breakend U, haplotype 1 on chr13 to its mate breakend V, haplotype 11 on chr2, and finally,
1564-
\item A novel bundle connects breakend U, haplotypes 2, 3 and 4 on chr13 to breakend V, haplotypes 12, 13 or 14 on chr2 without any explicit pairing.
1565-
\end{enumerate}
1566-
1567-
These three are the bundles for breakend U. Each such bundle is referred to as a haplotype of the breakend U.
1568-
Each allele of a breakend corresponds to one or more haplotypes.
1569-
In the above case there are two alleles: the 0 allele, corresponding to the adjacency to the partner X, which has haplotype (1), and the 1 allele, corresponding to the two haplotypes (2) and (3) with adjacency to the mate V.
1570-
1571-
For each haplotype of a breakend, say the haplotype (2) of breakend U above, connecting the end of haplotype 1 on a segment of Chr 13 to a mate on Chr 2 with haplotype 11, in addition to the list of haplotype-specific adjacencies that define it, we can also specify in VCF several other quantities.
1572-
These include:
1573-
1574-
\begin{enumerate}
1575-
\item The depth of reads on the segment where the breakend occurs that support the haplotype, e.g., the depth of reads supporting haplotype 1 in the segment containing breakend U
1576-
\item The estimated copy number of the haplotype on the segment where the breakend occurs
1577-
\item The depth of paired-end or split reads that support the haplotype-specific adjacencies, e.g., that support the adjacency between haplotype 1 on Chr 13 to haplotype 11 on Chr 2
1578-
\item The estimated copy number of the haplotype-specific adjacencies
1579-
\item An overall quality score indicating how confident we are in this asserted haplotype
1580-
\end{enumerate}
1581-
These are specified using the using the DP, CN, BDP, BCN, and HQ subfields, respectively.
1582-
The total information available about the three haplotypes of breakend U in the figure above may be visualized in a table as follows.
1583-
1584-
\vspace{0.3cm}
1585-
\begin{tabular}{ l l l l }
1586-
Allele & 1 & 1 & 0 \\
1587-
Haplotype & 1$>$11 & 2,3,4$>$12,13,14 & 5$>$5 \\
1588-
Segment Depth & 5 & 17 & 4 \\
1589-
Segment Copy Number & 1 & 3 & 1 \\
1590-
Bundle Depth & 4 & 0 & 3 \\
1591-
Bundle Copy Number & 1 & 3 & 1 \\
1592-
Haplotype quality & 30 & 40 & 40 \\
1593-
\end{tabular}
1594-
15951588
\pagebreak
15961589
\subsection{Representing unspecified alleles and REF-only blocks (gVCF)}
15971590
\label{unspecified-allele}
@@ -2037,6 +2030,7 @@ \subsubsection{Type encoding}
20372030
For one individual, each integer in the vector is organized as $(allele+1) << 1 \mid phased$ where allele is set to $-1$ if the allele in GT is a dot `.' (thus the higher bits are all 0).
20382031
The vector is padded with END\_OF\_VECTOR values if the GT has a lower ploidy.
20392032
We note specifically that except for the END\_OF\_VECTOR byte, no other negative values are allowed in the GT array.
2033+
When processing VCF version 4.3 or earlier files, the phasing of the first allele should be treated as missing and inferred from the remaining alleles.
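As a rough illustration of the per-allele integer encoding $(allele+1) << 1 \mid phased$ and of the new note about files from version 4.3 or earlier, the sketch below encodes and decodes a GT vector. The function names and the `legacy_first_phase` flag are hypothetical, END\_OF\_VECTOR padding is not handled, and the inference rule for the first allele (phased only if every remaining allele is phased) is an assumption borrowed from the implicit-separator rule in the GT definition.

```python
def encode_gt(pairs):
    """Encode (allele, phased) pairs as BCF GT integers.

    allele is an int, or None for the missing value '.', which is
    encoded as allele = -1 so that the higher bits are all 0.
    """
    return [
        (((-1 if allele is None else allele) + 1) << 1) | int(phased)
        for allele, phased in pairs
    ]


def decode_gt(values, legacy_first_phase=False):
    """Decode BCF GT integers back into (allele, phased) pairs.

    With legacy_first_phase=True (assumed use: VCF 4.3 or earlier input),
    the phase bit of the first allele is ignored and inferred from the
    remaining alleles instead.
    """
    pairs = []
    for value in values:
        allele = (value >> 1) - 1
        pairs.append((None if allele < 0 else allele, bool(value & 1)))
    if legacy_first_phase and pairs:
        # Assumption: first allele is phased only if all others are.
        rest_phased = all(phased for _, phased in pairs[1:])
        pairs[0] = (pairs[0][0], rest_phased)
    return pairs


print(encode_gt([(0, False), (1, True)]))
print(decode_gt([2, 5], legacy_first_phase=True))
```

Keeping the legacy behaviour behind an explicit flag makes it clear at the call site which interpretation of the first phase bit is in force.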
20402034
20412035
Examples:
20422036
@@ -2302,6 +2296,7 @@ \subsection{Changes between VCFv4.4 and VCFv4.3}
23022296
\item Deprecate SVTYPE INFO field preferring the use of symbolic alleles in the ALT field
23032297
\item Define new reserved INFO field EVENT, EVENTTYPE and SVCLAIM
23042298
\item Redefined INFO field SVLEN to be always positive
2299+
\item Added Phase-Set List fields (PSL, PSO and PSQ) and allele-specific phasing notation (in GT)
23052300
\end{itemize}
23062301
23072302
\subsection{Changes to VCFv4.3}
