Disallow commas and other punctuation in RNAME etc (PR samtools#333)
Disallow \ , "`' ()[]{}<> punctuation characters in reference sequence
names. Commas and angle brackets are used to delimit refnames in other
SAM fields (e.g. SA) and in VCF files, and restricting these other
characters facilitates future delimiter and quoting syntax.
Statistics gathered from various reference sequence archives suggest
that these characters appear vanishingly infrequently in refnames in
existing files in the wild.
Add previously omitted SQ-AN history note.
@@ -178,6 +181,28 @@ \subsection{Terminologies and Concepts}
mapping, all the other mappings get mapping quality $<$Q3
and are ignored by most SNP/INDEL callers.}
\subsubsection{Character set restrictions}\label{sec:charset}
Reference sequence names, CIGAR strings, and several other field types are used as values or parts of values of other fields in SAM and related formats such as VCF.
To ensure that these other fields' representations are unambiguous, these field types disallow particular delimiter characters.

Query or read names may contain any printable ASCII characters in the range \verb"[!-~]" apart from `\verb"@"', so that SAM alignment lines can be easily distinguished from header lines.
(They are also limited in length.)

Reference sequence names may contain any printable ASCII characters in the range {\tt [!-\verb:~:]} apart from backslashes, commas, quotation marks, and brackets---i.e., apart from `{\tt\verb:\:\,,\,"`'\,()\,[]\,\verb:{}:\,<>}'---and may not start with `{\tt *}' or `{\tt =}'.
For clarity, elsewhere in this specification we write this set of characters as a character class~{\tt [\cclass{rname}]} and extend the POSIX regular expression notation to use {\tt\caret *=} to indicate the omission of `{\tt *}' and `{\tt =}' from the character class.
Thus this regular expression can be written more clearly as {\tt\rnameRegexp}.
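
The character rules above can be checked mechanically. The following is a minimal Python sketch, assuming only the rules stated in this section; the names {\tt is\_valid\_rname} and {\tt is\_valid\_qname} are hypothetical helpers for illustration and are not part of any SAM library. It does not enforce the length limit on query names, since that limit is not given here.

```python
import re

# Printable ASCII range [!-~], as described in the text.
PRINTABLE = {chr(c) for c in range(ord("!"), ord("~") + 1)}

# Punctuation the text excludes from reference sequence names:
# backslash, comma, quotation marks, and brackets.
RNAME_EXCLUDED = set("\\,\"`'()[]{}<>")

# Characters that may not appear first in a reference sequence name.
RNAME_BAD_FIRST = set("*=")

RNAME_CHARS = PRINTABLE - RNAME_EXCLUDED


def _char_class(chars):
    """Build a regex character class from a set of literal characters."""
    return "[" + "".join(re.escape(c) for c in sorted(chars)) + "]"


# First character excludes '*' and '='; later characters may include them.
RNAME_RE = re.compile(
    _char_class(RNAME_CHARS - RNAME_BAD_FIRST) + _char_class(RNAME_CHARS) + "*"
)

# Query names: any printable ASCII character apart from '@'.
# Assumption: the length limit mentioned in the text is not checked here.
QNAME_RE = re.compile(_char_class(PRINTABLE - {"@"}) + "+")


def is_valid_rname(name):
    """Return True if name satisfies the reference name rules above."""
    return RNAME_RE.fullmatch(name) is not None


def is_valid_qname(name):
    """Return True if name uses only the characters allowed in query names."""
    return QNAME_RE.fullmatch(name) is not None
```

Under this sketch a name such as {\tt HLA-A*01:01} is accepted because `{\tt *}' is only forbidden as the first character, while {\tt chr1,chr2} is rejected because of the comma.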
\subsection{The header section}
Each header line begins with the character `{\tt @}' followed by
one of the two-letter header record type codes defined in this section.
\footnotetext{Reference sequence names may contain any printable ASCII characters with the exception of certain punctuation characters, and may not start with `{\tt *}' or `{\tt =}'.
See Section~\ref{sec:charset} for details and an explanation of the {\tt [\cclass{rname}]} notation.}
are regarded to come from the same template. A {\sf QNAME} `{\tt *}'
@@ -1233,6 +1260,9 @@ \section{SAM Version History}\label{sec:history}
\subsection*{1.6: 28 November 2017 to current}
\begin{itemize}
\item Restricted the allowable punctuation characters in RNAME and similar fields.
The sets of characters allowed in {\tt @SQ SN} and {\tt @SQ AN} are now identical, which enlarges the previous {\tt AN} set.
(Sep 2018)
\item B array optional fields may have no entries---this was already representable in BAM, clarified that empty arrays are permitted in SAM too. (Jul 2018)
\item Add {\tt @SQ DS} header tag. (Jul 2018)
\item Add {\tt @RG BC} header tag. (Apr 2018)
@@ -1243,6 +1273,7 @@ \subsection*{1.6: 28 November 2017 to current}
\subsection*{1.5: 23 May 2013 to November 2017}
\begin{itemize}
\item Add {\tt @SQ AN} header tag, allowing only alphanumeric and `\verb"*+.@_|-"' characters in its names. (Jul 2017)
\item Add {\tt @SQ AH} header tag. (Mar 2017)
\item Auxiliary tags migrated to SAMtags document. (Sep 2016)
\item Z and H auxiliary tags are permitted to be zero length. (Jun 2016)