You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Clarify SAM file encoding (ASCII, UTF-8 in designated fields)
Reword so that it is clear that the encodings specified apply to the
*entirety* of SAM file contents. Mention allowable line terminators,
and note that most UTF-8 line terminating characters are invalid
(as this whitespace is in the ASCII-only parts outwith UTF-8 fields).
Fixes#664.
Spell out that the locale considerations are about e.g. requiring that
floating-point values use '.' rather than a localised decimal point.
Copy file name to clipboardExpand all lines: SAMv1.tex
+10-2Lines changed: 10 additions & 2 deletions
Original file line number
Diff line number
Diff line change
@@ -67,8 +67,16 @@ \section{The SAM Format Specification}
67
67
BAM file may optionally specify the version being used via the
68
68
{\tt @HD VN} tag. For full version history see Appendix~\ref{sec:history}.
69
69
70
-
Unless explicitly specified elsewhere, all fields are encoded using 7-bit US-ASCII \footnote{Charset ANSI\_X3.4-1968 as defined in RFC1345.} in using the POSIX / C locale.
71
-
Regular expressions listed use the POSIX / IEEE Std 1003.1 extended syntax.
70
+
SAM file contents are 7-bit US-ASCII, except for certain field values as individually specified which may contain other Unicode characters encoded in UTF-8.
71
+
Alternatively and equivalently, SAM files are encoded in UTF-8 but non-ASCII characters are permitted only within certain field values as explicitly specified in the descriptions of those fields.%
72
+
\footnote{Hence in particular SAM files must not begin with a byte order mark~(BOM) and lines of text are delimited by ASCII line terminator characters only.
73
+
% Unicode identifies VT and FF as line break characters as well, but no one uses them in SAM.
74
+
In addition to the local platform's text file line termination conventions, implementations may wish to support \textsc{lf} and \textsc{cr\>lf} for interoperability with other platforms.}
75
+
76
+
Where it makes a difference, SAM file contents should be read and written using the POSIX\,/\,C locale.
77
+
For example, floating-point values in SAM always use `{\tt .}' for the decimal-point character.
78
+
79
+
The regular expressions in this specification are written using the POSIX\,/\,IEEE Std 1003.1 extended syntax.
72
80
73
81
\subsection{An example}\label{sec:example}
74
82
Suppose we have the following alignment with bases in lowercase
0 commit comments