Commit 1f92c08

[DRAFT] Clarify SAM file encoding (ASCII, UTF-8 "subset")
1 parent 3c493e7 commit 1f92c08

1 file changed

Lines changed: 7 additions & 2 deletions

SAMv1.tex

@@ -67,8 +67,13 @@ \section{The SAM Format Specification}
 BAM file may optionally specify the version being used via the
 {\tt @HD VN} tag. For full version history see Appendix~\ref{sec:history}.
 
-Unless explicitly specified elsewhere, all fields are encoded using 7-bit US-ASCII \footnote{Charset ANSI\_X3.4-1968 as defined in RFC1345.} in using the POSIX / C locale.
-Regular expressions listed use the POSIX / IEEE Std 1003.1 extended syntax.
+SAM files are encoded in UTF-8, though most of their content is limited to ASCII.
+They must not begin with a byte order mark, and non-ASCII characters are permitted only in certain field values as individually specified.%
+\footnote{Equivalently, SAM file content is primarily US-ASCII characters in the usual single-byte encoding; certain field values as specified may contain other Unicode characters and are encoded as UTF-8.}
+Where it makes a difference, SAM file contents should be read and written using the POSIX\,/\,C locale.%
+\footnote{For example, floating-point values in SAM always use `{\tt .}' (\textsc{Full Stop}) for the decimal-point character.}
+
+The regular expressions in this specification have been written using the POSIX\,/\,IEEE Std 1003.1 extended syntax.
 
 \subsection{An example}\label{sec:example}
 Suppose we have the following alignment with bases in lowercase
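
As an illustration of what the proposed wording would mean for a reader or validator, the following is a minimal Python sketch, not part of the commit or the specification. The function names `check_sam_encoding` and `non_ascii_lines` are hypothetical. The sketch checks only the two file-level rules stated in the added text (valid UTF-8, no leading byte order mark) and reports which lines contain non-ASCII bytes. It does not decide whether those bytes sit in a field that permits them, because this hunk does not list those fields.

```python
import codecs


def check_sam_encoding(data):
    """Return a list of problems found in raw SAM bytes (empty if none).

    Hypothetical helper: covers only the file-level rules in the
    proposed text, namely no byte order mark and valid UTF-8.
    """
    problems = []
    # The proposed wording forbids a leading byte order mark.
    if data.startswith(codecs.BOM_UTF8):
        problems.append("file begins with a UTF-8 byte order mark")
    # The whole file must decode as UTF-8.
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        problems.append("invalid UTF-8 at byte offset %d" % exc.start)
    return problems


def non_ascii_lines(data):
    """Return 1-based numbers of lines that contain any non-ASCII byte.

    Whether a flagged line is acceptable depends on per-field rules
    that are outside the scope of this sketch.
    """
    return [
        number
        for number, line in enumerate(data.split(b"\n"), start=1)
        if not line.isascii()
    ]
```

A caller would typically read the file in binary mode, run `check_sam_encoding` first, and then inspect the lines returned by `non_ascii_lines` against the field-level rules given elsewhere in the specification.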
