Commit df03971

Restrict allowed VCF Contig ID chars to those allowed in SAM RNAMEs
Disallow \ , "`' (){} punctuation characters in VCF contig IDs. The characters []<> were already disallowed in VCF; this also relaxes the prohibition of * to merely disallowing initial *.

Statistics gathered from various reference sequence archives suggest that the characters restricted appear vanishingly infrequently in SAM reference sequence names in existing files in the wild. To the extent that all contig IDs in VCF files come from corresponding SAM/BAM files, this means there is little concern about making the same restrictions in VCF contig IDs.

Fixes samtools#124 and fixes samtools#167 for VCF; their SAM aspects were previously fixed by PR samtools#333.
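The rule described in this commit message can be exercised with the regular expression that the diff adds to VCFv4.3.tex. The following is a minimal Python sketch, not part of the commit: the helper name `is_valid_contig_id` is hypothetical, and the check covers only the character rules. It does not test whether a name collides with a reserved symbolic allele name, which the specification also forbids.

```python
import re

# Pattern copied from the text this commit adds to VCFv4.3.tex.
# First character: no '*' or '='; later characters may include both.
CONTIG_ID_RE = re.compile(
    r"[0-9A-Za-z!#$%&+./:;?@^_|~-][0-9A-Za-z!#$%&*+./:;=?@^_|~-]*"
)


def is_valid_contig_id(name: str) -> bool:
    """Hypothetical helper: True if the whole name matches the pattern."""
    return CONTIG_ID_RE.fullmatch(name) is not None


if __name__ == "__main__":
    for candidate in ["chr1", "HLA-A*01:01:01:01", "*chr1", "chr1,chr2", "<DEL>"]:
        print(candidate, is_valid_contig_id(candidate))
```

Using `fullmatch` rather than `match` matters here: a name such as `chr1,chr2` begins with a valid prefix, so a prefix-only match would wrongly accept it.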
1 parent 51e28f5 commit df03971

1 file changed

Lines changed: 12 additions & 1 deletion

VCFv4.3.tex

@@ -226,7 +226,14 @@ \subsubsection{Contig field format}
 \end{verbatim}
 
 \noindent
-Valid contig names must follow the reference sequence names allowed by the SAM format ("{\tt [!-)+-\char60\char62-\char126][!-\char126]*}") excluding the characters "\texttt{\textless\textgreater[]*}" to avoid clashes with symbolic alleles.
+Contig names follow the same rules as the SAM format's reference sequence names:
+they may contain any printable ASCII characters in the range \verb|[!-~]| apart from `{\tt\verb|\|\,,\,"`'\,()\,[]\,\verb|{}|\,<>}' and may not start with `{\tt *}' or `{\tt =}'.
+Thus they match the following regular expression:
+\begin{verbatim}
+[0-9A-Za-z!#$%&+./:;?@^_|~-][0-9A-Za-z!#$%&*+./:;=?@^_|~-]*
+\end{verbatim}
+\noindent
+In particular, excluding commas facilitates parsing \verb|##contig| lines, and excluding the characters `\verb|<>[]|' and initial~`{\tt *}' avoids clashes with symbolic alleles.
 The contig names must not use a reserved symbolic allele name.
 
 
@@ -2047,6 +2054,10 @@ \subsection{Changes to VCFv4.3}
 \item Tables with Type and Number definitions for INFO and FORMAT reserved keys
 
 \item
+The set of characters allowed in VCF contig names is now the same as that allowed in SAM reference sequence names, which was restricted in January 2019.
+The characters `{\tt\verb|\|\,,\,"`'\,()\,\verb|{}|}' are now invalid in VCF contig names, while `{\tt *}' is now valid when not the first character.
+(The characters `{\tt []\,<>}' and initial~`{\tt *}'/`{\tt =}' were already invalid and remain so.)
+
 The VCF specification previously disallowed colons (`{\tt :}') in contig names to avoid confusion when parsing breakends, but this was unnecessary.
 Even with contig names containing colons, the breakend mate position notation can be unambiguously parsed because the ``{\tt :}\emph{pos}'' part is \textbf{always} present.
 \end{itemize}
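The last context lines of the diff argue that breakend mate positions stay parseable even when contig names contain colons. A minimal Python sketch of that reasoning follows. It is not part of the commit, and the names `BREAKEND_RE` and `split_mate_locus` are hypothetical. It assumes the four bracketed breakend ALT forms (`t[p[`, `t]p]`, `]p]t`, `[p[t`) and relies on two facts from the text above: square brackets cannot occur in contig names, and the `:pos` suffix is always present, so the last colon separates name from position.

```python
import re
from typing import Tuple

# One pair of identical brackets encloses the mate locus "contig:pos".
# Brackets are not allowed in contig names, so they delimit it unambiguously.
BREAKEND_RE = re.compile(
    r"^[^\[\]]*(?P<bracket>[\[\]])(?P<mate>[^\[\]]+)(?P=bracket)[^\[\]]*$"
)


def split_mate_locus(alt: str) -> Tuple[str, int]:
    """Hypothetical helper: return (contig, pos) of a breakend ALT's mate."""
    m = BREAKEND_RE.match(alt)
    if m is None:
        raise ValueError(f"not a bracketed breakend ALT: {alt!r}")
    # ":pos" is always present, so split at the LAST colon only;
    # any earlier colons belong to the contig name.
    contig, sep, pos = m.group("mate").rpartition(":")
    if not sep or not contig or not (pos.isascii() and pos.isdigit()):
        raise ValueError(f"mate locus lacks a ':pos' suffix: {alt!r}")
    return contig, int(pos)


if __name__ == "__main__":
    print(split_mate_locus("G]17:198982]"))
    print(split_mate_locus("]HLA-A*01:01:01:01:500]T"))
```

Splitting with `rpartition` rather than `split(":")` is the design choice that makes colons in contig names harmless. The sketch returns the name exactly as written between the brackets, so a contig written in angle brackets is returned with them intact.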
