Disallow commas and other punctuation in RNAME etc (PR samtools#333)

jmarshall · jmarshall · commit 6be1fbf4eaf9 · 2018-09-20T09:41:47.000+01:00
Disallow \ , "`' ()[]{}<> punctuation characters in reference sequence
names. Commas and angle brackets are used to delimit refnames in other
SAM fields (e.g. SA) and in VCF files, and restricting these other
characters facilitates future delimiter and quoting syntax.

Statistics gathered from various reference sequence archives suggest
that these characters appear vanishingly infrequently in refnames in
existing files in the wild.

Add previously omitted SQ-AN history note.
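
For illustration only (not part of this commit or of the specification),
the refname rule described above can be sketched in Python. The pattern is
copied from the regular expression this commit adds to SAMv1.tex; the
function names is_valid_rname and is_valid_an are hypothetical, and the
AN check assumes the comma-separated name(,name)* form given in the @SQ
table.

```python
import re

# Allowed characters: printable ASCII [!-~] minus \ , " ` ' ( ) [ ] { } < >
# The first character additionally may not be '*' or '='.
RNAME_RE = re.compile(
    r"[0-9A-Za-z!#$%&+./:;?@^_|~-][0-9A-Za-z!#$%&*+./:;=?@^_|~-]*"
)


def is_valid_rname(name: str) -> bool:
    """Return True if name matches the reference sequence name pattern."""
    return RNAME_RE.fullmatch(name) is not None


def is_valid_an(value: str) -> bool:
    """Return True if every comma-separated alternative name is valid.

    Assumes the @SQ AN value has the form name(,name)*.
    """
    return all(is_valid_rname(part) for part in value.split(","))


if __name__ == "__main__":
    for candidate in ["chr1", "HLA-A*01:01:01:01", "*", "=chr1", "chr1,chr2"]:
        print(candidate, is_valid_rname(candidate))
```

A comma is rejected inside a single refname but acts as the delimiter in an
AN value, which is the kind of ambiguity the restriction is meant to avoid.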
diff --git a/SAMv1.tex b/SAMv1.tex
@@ -32,6 +32,9 @@
 \newcommand*{\firstbytebox}[2]{\byteboxAux{#1}{#2}{\put(0,0){\line(0,1){\bytetotalheight}}}}
 \newcommand*{\bytebox}[2]{\byteboxAux{#1}{#2}{}}
 
+\newcommand*{\cclass}[1]{{\rm\sf :#1:}}
+\newcommand*{\caret}{\textsuperscript{$\wedge$}}
+
 \makeindex
 
 \begin{document}
@@ -178,6 +181,28 @@ \subsection{Terminologies and Concepts}
 mapping, all the other mappings get mapping quality $<$Q3
 and are ignored by most SNP/INDEL callers.}
 
+\subsubsection{Character set restrictions}\label{sec:charset}
+
+Reference sequence names, CIGAR strings, and several other field types are used as values or parts of values of other fields in SAM and related formats such as VCF.
+To ensure that these other fields' representations are unambiguous, these field types disallow particular delimiter characters.
+
+Query or read names may contain any printable ASCII characters in the range \verb"[!-~]" apart from `\verb"@"', so that SAM alignment lines can be easily distinguished from header lines.
+(They are also limited in length.)
+
+Reference sequence names may contain any printable ASCII characters in the range {\tt [!-\verb:~:]} apart from backslashes, commas, quotation marks, and brackets---i.e., apart from `{\tt \verb:\:\,,\,"`'\,()\,[]\,\verb:{}:\,<>}'---and may not start with `{\tt *}' or `{\tt =}'.
+Thus they match the following regular expression:
+\begin{center}
+{\tt [\verb"0-9A-Za-z!#$%&+./:;?@^_|~-"][\verb"0-9A-Za-z!#$%&*+./:;=?@^_|~-"]*}
+\end{center}
+
+% Pedantically this should be [[:rname:]^*=][[:rname:]]*, but we take advantage
+% of POSIX (Issue 7) section 9.3.5/8 to elide the excess brackets for clarity.
+\newcommand*{\rnameRegexp}{[\cclass{rname}\caret*=][\cclass{rname}]*}
+
+\noindent
+Elsewhere in this specification we write this set of characters as a character class~{\tt [\cclass{rname}]} and extend the POSIX regular expression notation to use {\tt\caret *=} to indicate the omission of `{\tt *}' and `{\tt =}' from the character class.
+Thus this regular expression can be written more compactly as {\tt\rnameRegexp}.
+
 \subsection{The header section}
 Each header line begins with the character `{\tt @}' followed by
 one of the two-letter header record type codes defined in this section.
@@ -229,7 +254,7 @@ \subsection{The header section}
 The {\tt SN} tags and all individual {\tt AN} names in all {\tt @SQ} lines
 must be distinct.
   The value of this field is used in the
-  alignment records in {\sf RNAME} and {\sf RNEXT} fields. Regular expression: {\tt [!-)+-\char60\char62-\char126][!-\char126]*}\\\cline{2-3}
+  alignment records in {\sf RNAME} and {\sf RNEXT} fields. Regular expression: {\tt\rnameRegexp}\\\cline{2-3}
   & {\tt LN}* & Reference sequence length. \emph{Range}: $[1,\,2^{31}-1]$\\\cline{2-3}
   & {\tt AH} & Indicates that this sequence is an alternate locus.%
 \footnote{See \url{https://www.ncbi.nlm.nih.gov/grc/help/definitions} for descriptions of \emph{alternate locus} and \emph{primary assembly}.}
@@ -240,13 +265,12 @@ \subsection{The header section}
 to this reference sequence.%
 \footnote{For example, given `{\tt @SQ SN:MT AN:chrMT,M,chrM LN:16569}',
 tools can ensure that a user's request for any of `MT', `chrMT', `M',
-or~`chrM' succeeds and refers to the same sequence.
-Note the restricted set of characters allowed in an alternative name.}
+or~`chrM' succeeds and refers to the same sequence.}
 These alternative names are not used elsewhere within the SAM file;
 in particular, they must not appear in alignment records' {\sf RNAME}
 or~{\sf RNEXT} fields.
 \emph{Regular expression}: \emph{name}{\tt (,}\emph{name}{\tt )*}
-where \emph{name} is {\tt [0-9A-Za-z][0-9A-Za-z*+.@\_|-]*}\\\cline{2-3}
+where \emph{name} is {\tt\rnameRegexp}\\\cline{2-3}
   & {\tt AS} & Genome assembly identifier. \\\cline{2-3}
   & {\tt DS} & Description.  UTF-8 encoding may be used.\\\cline{2-3}
   & {\tt M5} & MD5 checksum of the sequence.  See Section~\ref{sec:ref-md5}\\\cline{2-3}
@@ -348,11 +372,11 @@ \subsection{The alignment section: mandatory fields}\label{sec:alnrecord}
   \hline
   1 & {\sf QNAME} & String & \verb:[!-?A-~]{1,254}: & Query template NAME\\
   2 & {\sf FLAG} & Int & $[0,\,2^{16}-1]$ & bitwise FLAG \\
-  3 & {\sf RNAME} & String & {\tt \char92*|[!-()+-\char60\char62-\char126][!-\char126]*} & Reference sequence NAME\\
+  3 & {\sf RNAME} & String & {\tt \verb"\*"|\rnameRegexp} & Reference sequence NAME\footnotemark \\
   4 & {\sf POS} & Int & $[0,\,2^{31}-1]$ & 1-based leftmost mapping POSition \\
   5 & {\sf MAPQ} & Int & $[0,\,2^8-1]$ & MAPping Quality \\
   6 & {\sf CIGAR} & String & {\tt \char92*|([0-9]+[MIDNSHPX=])+} & CIGAR string \\
-  7 & {\sf RNEXT} & String & {\tt \char92*|=|[!-()+-\char60\char62-\char126][!-\char126]*} & Ref. name of the mate/next read\\
+  7 & {\sf RNEXT} & String & {\tt \verb"\*"|=|\rnameRegexp} & Reference name of the mate/next read \\
   8 & {\sf PNEXT} & Int & $[0,\,2^{31}-1]$ & Position of the mate/next read \\
   9 & {\sf TLEN} & Int & $[-2^{31}+1,\,2^{31}-1]$ & observed Template LENgth \\
   10 & {\sf SEQ} & String & {\tt \char92*|[A-Za-z=.]+} & segment SEQuence\\
@@ -361,6 +385,9 @@ \subsection{The alignment section: mandatory fields}\label{sec:alnrecord}
 \end{tabular}
 \end{center}
 
+\footnotetext{Reference sequence names may contain any printable ASCII characters with the exception of certain punctuation characters, and may not start with `{\tt *}' or `{\tt =}'.
+See Section~\ref{sec:charset} for details and an explanation of the {\tt [\cclass{rname}]} notation.}
+
 \begin{enumerate}
 \item {\sf QNAME}: Query template NAME. Reads/segments having identical {\sf QNAME}
 	are regarded to come from the same template. A {\sf QNAME} `{\tt *}'
@@ -1233,6 +1260,9 @@ \section{SAM Version History}\label{sec:history}
 \subsection*{1.6: 28 November 2017 to current}
 
 \begin{itemize}
+\item Restricted the allowable punctuation characters in RNAME and similar fields.
+The sets of characters allowed in {\tt @SQ SN} and {\tt @SQ AN} are now identical, which enlarges the previous {\tt AN} set.
+(Sep 2018)
 \item B array optional fields may have no entries---this was already representable in BAM, clarified that empty arrays are permitted in SAM too. (Jul 2018)
 \item Add {\tt @SQ DS} header tag. (Jul 2018)
 \item Add {\tt @RG BC} header tag. (Apr 2018)
@@ -1243,6 +1273,7 @@ \subsection*{1.6: 28 November 2017 to current}
 \subsection*{1.5: 23 May 2013 to November 2017}
 
 \begin{itemize}
+\item Add {\tt @SQ AN} header tag, allowing only alphanumeric and `\verb"*+.@_|-"' characters in its names. (Jul 2017)
 \item Add {\tt @SQ AH} header tag. (Mar 2017)
 \item Auxiliary tags migrated to SAMtags document. (Sep 2016)
 \item Z and H auxiliary tags are permitted to be zero length. (Jun 2016)