Commit 6be1fbf

Disallow commas and other punctuation in RNAME etc (PR samtools#333)
Disallow \ , "`' ()[]{}<> punctuation characters in reference sequence names. Commas and angle brackets are used to delimit refnames in other SAM fields (e.g. SA) and in VCF files, and restricting these other characters facilitates future delimiter and quoting syntax. Statistics gathered from various reference sequence archives suggest that these characters appear vanishingly infrequently in refnames in existing files in the wild. Add previously omitted SQ-AN history note.
1 parent af89db4 commit 6be1fbf
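The restriction described in the commit message can be checked mechanically. The following is a minimal sketch, assuming the regular expression added to the specification in this commit is transcribed correctly; the names `RNAME_RE` and `is_valid_rname` are hypothetical and do not come from the specification or from any existing tool.

```python
import re

# Assumed transcription of the reference-name regular expression from this commit:
# printable ASCII except \ , " ` ' ( ) [ ] { } < >, and not starting with '*' or '='.
RNAME_RE = re.compile(
    r"[0-9A-Za-z!#$%&+./:;?@^_|~-]"      # first character: '*' and '=' excluded
    r"[0-9A-Za-z!#$%&*+./:;=?@^_|~-]*"   # remaining characters: '*' and '=' allowed
)


def is_valid_rname(name: str) -> bool:
    """Return True if `name` is permitted as a reference sequence name."""
    return RNAME_RE.fullmatch(name) is not None


if __name__ == "__main__":
    for candidate in ["chr1", "HLA-A*01:01:01:01", "chr1,chr2", "*", "=chr1", "scaffold<12>"]:
        print(f"{candidate!r}: {is_valid_rname(candidate)}")
```

`fullmatch` is used rather than `match` so that a trailing disallowed character (for example a comma or a newline) makes the whole name invalid.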

1 file changed

Lines changed: 37 additions & 6 deletions

SAMv1.tex

@@ -32,6 +32,9 @@
 \newcommand*{\firstbytebox}[2]{\byteboxAux{#1}{#2}{\put(0,0){\line(0,1){\bytetotalheight}}}}
 \newcommand*{\bytebox}[2]{\byteboxAux{#1}{#2}{}}
 
+\newcommand*{\cclass}[1]{{\rm\sf :#1:}}
+\newcommand*{\caret}{\textsuperscript{$\wedge$}}
+
 \makeindex
 
 \begin{document}
@@ -178,6 +181,28 @@ \subsection{Terminologies and Concepts}
 mapping, all the other mappings get mapping quality $<$Q3
 and are ignored by most SNP/INDEL callers.}
 
+\subsubsection{Character set restrictions}\label{sec:charset}
+
+Reference sequence names, CIGAR strings, and several other field types are used as values or parts of values of other fields in SAM and related formats such as VCF.
+To ensure that these other fields' representations are unambiguous, these field types disallow particular delimiter characters.
+
+Query or read names may contain any printable ASCII characters in the range \verb"[!-~]" apart from `\verb"@"', so that SAM alignment lines can be easily distinguished from header lines.
+(They are also limited in length.)
+
+Reference sequence names may contain any printable ASCII characters in the range {\tt [!-\verb:~:]} apart from backslashes, commas, quotation marks, and brackets---i.e., apart from `{\tt \verb:\:\,,\,"`'\,()\,[]\,\verb:{}:\,<>}'---and may not start with `{\tt *}' or `{\tt =}'.
+Thus they match the following regular expression:
+\begin{center}
+{\tt [\verb"0-9A-Za-z!#$%&+./:;?@^_|~-"][\verb"0-9A-Za-z!#$%&*+./:;=?@^_|~-"]*}
+\end{center}
+
+% Pedantically this should be [[:rname:]^*=][[:rname:]]*, but we take advantage
+% of POSIX (Issue 7) section 9.3.5/8 to elide the excess brackets for clarity.
+\newcommand*{\rnameRegexp}{[\cclass{rname}\caret*=][\cclass{rname}]*}
+
+\noindent
+For clarity, elsewhere in this specification we write this set of characters as a character class~{\tt [\cclass{rname}]} and extend the POSIX regular expression notation to use {\tt\caret *=} to indicate the omission of `{\tt *}' and `{\tt =}' from the character class.
+Thus this regular expression can be written more clearly as {\tt\rnameRegexp}.
+
 \subsection{The header section}
 Each header line begins with the character `{\tt @}' followed by
 one of the two-letter header record type codes defined in this section.
@@ -229,7 +254,7 @@ \subsection{The header section}
 The {\tt SN} tags and all individual {\tt AN} names in all {\tt @SQ} lines
 must be distinct.
 The value of this field is used in the
-alignment records in {\sf RNAME} and {\sf RNEXT} fields. Regular expression: {\tt [!-)+-\char60\char62-\char126][!-\char126]*}\\\cline{2-3}
+alignment records in {\sf RNAME} and {\sf RNEXT} fields. Regular expression: {\tt\rnameRegexp}\\\cline{2-3}
 & {\tt LN}* & Reference sequence length. \emph{Range}: $[1,\,2^{31}-1]$\\\cline{2-3}
 & {\tt AH} & Indicates that this sequence is an alternate locus.%
 \footnote{See \url{https://www.ncbi.nlm.nih.gov/grc/help/definitions} for descriptions of \emph{alternate locus} and \emph{primary assembly}.}
@@ -240,13 +265,12 @@ \subsection{The header section}
 to this reference sequence.%
 \footnote{For example, given `{\tt @SQ SN:MT AN:chrMT,M,chrM LN:16569}',
 tools can ensure that a user's request for any of `MT', `chrMT', `M',
-or~`chrM' succeeds and refers to the same sequence.
-Note the restricted set of characters allowed in an alternative name.}
+or~`chrM' succeeds and refers to the same sequence.}
 These alternative names are not used elsewhere within the SAM file;
 in particular, they must not appear in alignment records' {\sf RNAME}
 or~{\sf RNEXT} fields.
 \emph{Regular expression}: \emph{name}{\tt (,}\emph{name}{\tt )*}
-where \emph{name} is {\tt [0-9A-Za-z][0-9A-Za-z*+.@\_|-]*}\\\cline{2-3}
+where \emph{name} is {\tt\rnameRegexp}\\\cline{2-3}
 & {\tt AS} & Genome assembly identifier. \\\cline{2-3}
 & {\tt DS} & Description. UTF-8 encoding may be used.\\\cline{2-3}
 & {\tt M5} & MD5 checksum of the sequence. See Section~\ref{sec:ref-md5}\\\cline{2-3}
@@ -348,11 +372,11 @@ \subsection{The alignment section: mandatory fields}\label{sec:alnrecord}
 \hline
 1 & {\sf QNAME} & String & \verb:[!-?A-~]{1,254}: & Query template NAME\\
 2 & {\sf FLAG} & Int & $[0,\,2^{16}-1]$ & bitwise FLAG \\
-3 & {\sf RNAME} & String & {\tt \char92*|[!-()+-\char60\char62-\char126][!-\char126]*} & Reference sequence NAME\\
+3 & {\sf RNAME} & String & {\tt \verb"\*"|\rnameRegexp} & Reference sequence NAME\footnotemark \\
 4 & {\sf POS} & Int & $[0,\,2^{31}-1]$ & 1-based leftmost mapping POSition \\
 5 & {\sf MAPQ} & Int & $[0,\,2^8-1]$ & MAPping Quality \\
 6 & {\sf CIGAR} & String & {\tt \char92*|([0-9]+[MIDNSHPX=])+} & CIGAR string \\
-7 & {\sf RNEXT} & String & {\tt \char92*|=|[!-()+-\char60\char62-\char126][!-\char126]*} & Ref. name of the mate/next read\\
+7 & {\sf RNEXT} & String & {\tt \verb"\*"|=|\rnameRegexp} & Reference name of the mate/next read \\
 8 & {\sf PNEXT} & Int & $[0,\,2^{31}-1]$ & Position of the mate/next read \\
 9 & {\sf TLEN} & Int & $[-2^{31}+1,\,2^{31}-1]$ & observed Template LENgth \\
 10 & {\sf SEQ} & String & {\tt \char92*|[A-Za-z=.]+} & segment SEQuence\\
@@ -361,6 +385,9 @@ \subsection{The alignment section: mandatory fields}\label{sec:alnrecord}
 \end{tabular}
 \end{center}
 
+\footnotetext{Reference sequence names may contain any printable ASCII characters with the exception of certain punctuation characters, and may not start with `{\tt *}' or `{\tt =}'.
+See Section~\ref{sec:charset} for details and an explanation of the {\tt [\cclass{rname}]} notation.}
+
 \begin{enumerate}
 \item {\sf QNAME}: Query template NAME. Reads/segments having identical {\sf QNAME}
 are regarded to come from the same template. A {\sf QNAME} `{\tt *}'
@@ -1233,6 +1260,9 @@ \section{SAM Version History}\label{sec:history}
 \subsection*{1.6: 28 November 2017 to current}
 
 \begin{itemize}
+\item Restricted the allowable punctuation characters in RNAME and similar fields.
+The sets of characters allowed in {\tt @SQ SN} and {\tt @SQ AN} are now identical, which enlarges the previous {\tt AN} set.
+(Sep 2018)
 \item B array optional fields may have no entries---this was already representable in BAM, clarified that empty arrays are permitted in SAM too. (Jul 2018)
 \item Add {\tt @SQ DS} header tag. (Jul 2018)
 \item Add {\tt @RG BC} header tag. (Apr 2018)
@@ -1243,6 +1273,7 @@ \subsection*{1.6: 28 November 2017 to current}
 \subsection*{1.5: 23 May 2013 to November 2017}
 
 \begin{itemize}
+\item Add {\tt @SQ AN} header tag, allowing only alphanumeric and `\verb"*+.@_|-"' characters in its names. (Jul 2017)
 \item Add {\tt @SQ AH} header tag. (Mar 2017)
 \item Auxiliary tags migrated to SAMtags document. (Sep 2016)
 \item Z and H auxiliary tags are permitted to be zero length. (Jun 2016)
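Because this change makes the `@SQ AN` names use the same character set as `SN`, and that set excludes the comma, an `AN` value of the form name(,name)* can be split on commas without ambiguity. The sketch below illustrates this under the assumption that the regular expression from this diff is transcribed correctly; `RNAME`, `AN_RE`, and `split_an` are hypothetical names chosen for this example only.

```python
import re
from typing import List

# Assumed transcription of the reference-name pattern introduced by this commit.
RNAME = r"[0-9A-Za-z!#$%&+./:;?@^_|~-][0-9A-Za-z!#$%&*+./:;=?@^_|~-]*"

# An AN value is one name followed by zero or more comma-separated names.
AN_RE = re.compile(rf"{RNAME}(,{RNAME})*")


def split_an(value: str) -> List[str]:
    """Validate an @SQ AN value and return its individual alternative names."""
    if AN_RE.fullmatch(value) is None:
        raise ValueError(f"invalid @SQ AN value: {value!r}")
    # Commas cannot occur inside a name, so a plain split is unambiguous.
    return value.split(",")


if __name__ == "__main__":
    # Example value taken from the footnote in the specification text above.
    print(split_an("chrMT,M,chrM"))
```

Validating the whole value before splitting means that empty names (as in `chrMT,,M`) and names starting with `*` or `=` are rejected rather than silently passed on.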
