Commit 1011356

Clarify SAM file encoding (ASCII, UTF-8 in designated fields)
Reword so that it is clear that the encodings specified apply to the *entirety* of SAM file contents. Mention allowable line terminators, and note that most UTF-8 line terminating characters are invalid (as this whitespace is in the ASCII-only parts outwith UTF-8 fields). Fixes #664. Spell out that the locale considerations are about e.g. requiring that floating-point values use '.' rather than a localised decimal point.
1 parent 3c493e7 commit 1011356

1 file changed

Lines changed: 10 additions & 2 deletions

SAMv1.tex

Lines changed: 10 additions & 2 deletions
@@ -67,8 +67,16 @@ \section{The SAM Format Specification}
 BAM file may optionally specify the version being used via the
 {\tt @HD VN} tag. For full version history see Appendix~\ref{sec:history}.
 
-Unless explicitly specified elsewhere, all fields are encoded using 7-bit US-ASCII \footnote{Charset ANSI\_X3.4-1968 as defined in RFC1345.} in using the POSIX / C locale.
-Regular expressions listed use the POSIX / IEEE Std 1003.1 extended syntax.
+SAM file contents are 7-bit US-ASCII, except for certain field values as individually specified which may contain other Unicode characters encoded in UTF-8.
+Alternatively and equivalently, SAM files are encoded in UTF-8 but non-ASCII characters are permitted only within certain field values as explicitly specified in the descriptions of those fields.%
+\footnote{Hence in particular SAM files must not begin with a byte order mark~(BOM) and lines of text are delimited by ASCII line terminator characters only.
+% Unicode identifies VT and FF as line break characters as well, but no one uses them in SAM.
+In addition to the local platform's text file line termination conventions, implementations may wish to support \textsc{lf} and \textsc{cr\>lf} for interoperability with other platforms.}
+
+Where it makes a difference, SAM file contents should be read and written using the POSIX\,/\,C locale.
+For example, floating-point values in SAM always use `{\tt .}' for the decimal-point character.
+
+The regular expressions in this specification are written using the POSIX\,/\,IEEE Std 1003.1 extended syntax.
 
 \subsection{An example}\label{sec:example}
 Suppose we have the following alignment with bases in lowercase
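
The encoding rules added in this hunk can be illustrated with a short Python sketch. It is not part of the commit or of SAMv1.tex; the names `check_sam_bytes` and `format_float` are hypothetical. The check is deliberately coarse: it only reports lines that contain non-ASCII bytes, because deciding whether those bytes sit inside a field that permits UTF-8 depends on the per-field descriptions, which this diff does not show.

```python
BOM = b"\xef\xbb\xbf"  # UTF-8 encoding of U+FEFF, the byte order mark


def check_sam_bytes(data):
    """Return a list of problems found in raw SAM file bytes (empty if none).

    Hypothetical helper: a coarse check only. It does not know which
    fields may hold UTF-8, so it flags every line with non-ASCII bytes
    for further per-field review.
    """
    problems = []
    if data.startswith(BOM):
        problems.append("file begins with a byte order mark")
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        problems.append("invalid UTF-8 at byte offset %d" % err.start)
        return problems
    # Split on LF only, then drop one trailing CR so that CR LF is accepted
    # as well. Non-ASCII Unicode line separators are never treated as
    # terminators here.
    for number, line in enumerate(data.split(b"\n"), start=1):
        if line.endswith(b"\r"):
            line = line[:-1]
        if not line.isascii():
            problems.append(
                "line %d: non-ASCII bytes (valid only inside a UTF-8 field)"
                % number
            )
    return problems


def format_float(value):
    """Format a float with '.' as the decimal point.

    Python's format() with the 'g' type does not consult the process
    locale (only the 'n' type does), so the output is locale-independent.
    """
    return format(value, "g")
```

A caller might run `check_sam_bytes(open(path, "rb").read())` on a file and treat any reported line as one that needs checking against the field descriptions. Reading in binary mode keeps the platform's text layer from rewriting line terminators before they are inspected.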
