Commit d131782

[DRAFT] Disallow commas and other punctuation in RNAME etc
1 parent ecf37f8 commit d131782

1 file changed

Lines changed: 15 additions & 0 deletions

SAMv1.tex

@@ -230,6 +230,7 @@ \subsection{The header section}
 must be distinct.
 The value of this field is used in the
 alignment records in {\sf RNAME} and {\sf RNEXT} fields. Regular expression: {\tt [!-)+-\char60\char62-\char126][!-\char126]*}\\\cline{2-3}
+% FIXME SQ-SN regexp
 & {\tt LN}* & Reference sequence length. \emph{Range}: $[1,\,2^{31}-1]$\\\cline{2-3}
 & {\tt AH} & Indicates that this sequence is an alternate locus.%
 \footnote{See \url{https://www.ncbi.nlm.nih.gov/grc/help/definitions} for descriptions of \emph{alternate locus} and \emph{primary assembly}.}
@@ -247,6 +248,7 @@ \subsection{The header section}
 or~{\sf RNEXT} fields.
 \emph{Regular expression}: \emph{name}{\tt (,}\emph{name}{\tt )*}
 where \emph{name} is {\tt [0-9A-Za-z][0-9A-Za-z*+.@\_|-]*}\\\cline{2-3}
+% FIXME SQ-AN regexp
 & {\tt AS} & Genome assembly identifier. \\\cline{2-3}
 & {\tt DS} & Description. UTF-8 encoding may be used.\\\cline{2-3}
 & {\tt M5} & MD5 checksum of the sequence. See Section~\ref{sec:ref-md5}\\\cline{2-3}
@@ -346,9 +348,18 @@ \subsection{The alignment section: mandatory fields}\label{sec:alnrecord}
 \hline
 {\bf Col} & {\bf Field} & {\bf Type} & {\bf Regexp/Range} & {\bf Brief description} \\
 \hline
+% !-?A-~ :: all punct except @
 1 & {\sf QNAME} & String & \verb:[!-?A-~]{1,254}: & Query template NAME\\
 2 & {\sf FLAG} & Int & $[0,\,2^{16}-1]$ & bitwise FLAG \\
+% !-()+-<>-~ -- all punct except * = forbid comma change +-< to +.-<>-~-
+
+% [!-()+-<>-~][!-~]* 7e14ccee Nov 2010 fixed a few bugs in regexp
+% [!-)+-<>-~][!-~]* e6ca3195 Jul 2010 remove implict rules; more descs
+% [!-)+-<>-~]+ cdaf8624 Jul 2010 update based on MKTrost (ie [^*=])
+% [!-~]+ 07dc1c67 Jul 2010
 3 & {\sf RNAME} & String & {\tt \char92*|[!-()+-\char60\char62-\char126][!-\char126]*} & Reference sequence NAME\\
+3 & {\sf RNAME} & String & {\tt \char92*|[0-9A-Za-z!\#\char36\%\&+./:;?@\char94\_|\char126-][0-9A-Za-z!\#\char36\%\&+./:;?@\char94\_|\char126*=-]*} & Reference sequence NAME\\
+3 & {\sf RNAME} & String & {\tt \char92*|[\char94"'`,()<>[]\{\}\char92*=][\char94"'`,()<>[]\{\}\char92]*} & Reference sequence NAME\\
 4 & {\sf POS} & Int & $[0,\,2^{31}-1]$ & 1-based leftmost mapping POSition \\
 5 & {\sf MAPQ} & Int & $[0,\,2^8-1]$ & MAPping Quality \\
 6 & {\sf CIGAR} & String & {\tt \char92*|([0-9]+[MIDNSHPX=])+} & CIGAR string \\
@@ -435,6 +446,9 @@ \subsection{The alignment section: mandatory fields}\label{sec:alnrecord}
 also have an ordinary coordinate such that it can be placed at a
 desired position after sorting. If {\sf RNAME} is `*', no assumptions
 can be made about {\sf POS} and {\sf CIGAR}.
+
+FIXME reference name details.
+
 \item {\sf POS}: 1-based leftmost mapping POSition of the first {\sf
 CIGAR} operation that ``consumes'' a reference base (see table below).
 The first base in a reference sequence has coordinate 1. {\sf
@@ -1233,6 +1247,7 @@ \section{SAM Version History}\label{sec:history}
 \subsection*{1.6: 28 November 2017 to current}
 
 \begin{itemize}
+\item Allowable punctuation in RNAME and similar fields restricted. (Aug 2018)
 \item B array optional fields may have no entries---this was already representable in BAM, clarified that empty arrays are permitted in SAM too. (Jul 2018)
 \item Add {\tt @SQ DS} header tag. (Jul 2018)
 \item Add {\tt @RG BC} header tag. (Apr 2018)
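
The RNAME hunk leaves three candidate rows side by side: the existing pattern and two draft replacements, one listing the permitted characters and one listing the forbidden ones. The sketch below is one way to compare them on sample names. It is not part of the commit. The Python patterns are hand transcriptions of the LaTeX, decoding \char92 as a backslash, \char60 and \char62 as < and >, \char126 as ~, \char36 as $, and \char94 as ^. The labels `current`, `allowlist`, and `denylist`, the helper `accepted_by`, and the sample names are hypothetical, and the escaping of the square brackets and backslash in the third pattern is an assumption about what the draft intends.

```python
import re

# Hand transcriptions of the three RNAME rows in the diff (labels are hypothetical).
PATTERNS = {
    # Existing row: any printable ASCII except '*' and '=' as the first character.
    "current": r"\*|[!-()+-<>-~][!-~]*",
    # First added row: explicit list of permitted characters.
    "allowlist": r"\*|[0-9A-Za-z!#$%&+./:;?@^_|~-][0-9A-Za-z!#$%&+./:;?@^_|~*=-]*",
    # Second added row: explicit list of forbidden characters.
    # Assumption: '[', ']' and '\' are meant as literal members of the class.
    "denylist": r"""\*|[^"'`,()<>\[\]{}\\*=][^"'`,()<>\[\]{}\\]*""",
}

COMPILED = {label: re.compile(pattern) for label, pattern in PATTERNS.items()}


def accepted_by(name):
    """Report, per pattern, whether the whole name matches."""
    return {label: rx.fullmatch(name) is not None for label, rx in COMPILED.items()}


if __name__ == "__main__":
    for name in ["chr1", "chr1,chr2", "HLA-A*01:01", "scaffold(1)", "*", "="]:
        print(f"{name!r}: {accepted_by(name)}")
```

Under these transcriptions, a name containing a comma such as `chr1,chr2` is accepted by the existing pattern but rejected by both draft replacements, which is the restriction the commit title describes.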
