ITF-8/LTF-8 spec text has typos, an incorrect prefix description, and lacks examples #855


I understand ITF-8/LTF-8 may be on the chopping block for [future CRAM versions](https://github.com/samtools/hts-specs/issues/144), but for versions where they're still in use, the spec should be unambiguous.
 
The current text (quoted from the spec):
 
> **ITF-8 integer (itf8)**
> This is an alternative way to write an integer value. The idea is similar to UTF-8 encoding and therefore this encoding is called ITF-8 (Integer Transformation Format - 8 bit). The most significant bits of the first byte have special meaning and are called 'prefix'. These are 0 to 4 true bits followed by a 0. The number of 1's denote the number of bytes to follow. To accommodate 32 bits such representation requires 5 bytes with only 4 lower bits used in the last byte 5.
>
> **LTF-8 long (ltf8)**
> See ITF-8 for more details. The only difference between ITF-8 and LTF-8 is the number of bytes used to encode a single value. To do so 64 bits are required and this can be done with 9 byte at most with the first byte consisting of just 1s or 0xFF value.
 
## Issues
 
### 1. Typos
 
- Dangling "5" at the end of the ITF-8 paragraph.
- "9 byte" -> "9 bytes" in the LTF-8 paragraph.
 
### 2. The prefix description is wrong for the maximum-length case
 
The spec says the prefix is "0 to 4 true bits followed by a 0." This holds for the 1- through 4-byte encodings, but **not** for the 5-byte encoding. Both htsjdk and htslib implement the 5-byte prefix as `1111`: four ones with **no** trailing zero. Because only the lower 4 bits of the last byte are used, the first byte has to carry 4 data bits (4 + 24 + 4 = 32), which leaves no room for a terminating zero.
 
From htsjdk (`ITF8.java`):
```java
return ((b1 & 15) << 28) | ... | (15 & buffer.get());
```
 
From htslib (`cram_io.h`):
```c
uint32_t uv = (((uint32_t)up[0] & 0x0f)<<28) | (up[1]<<20) | (up[2]<<12) | (up[3]<<4) | (up[4] & 0x0f);
```
 
Both mask the first byte with `0x0F` (decimal 15), confirming that the prefix is `1111` (4 bits), not `11110` (5 bits).
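For illustration, here is a minimal decoder sketch consistent with the two snippets above. The function name `itf8_decode_sketch` is hypothetical (it is not taken from htslib or htsjdk), the caller is assumed to supply as many bytes as the first byte calls for, and the result is returned as an unsigned 32-bit pattern, leaving any signed reinterpretation to the caller.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of htslib or htsjdk.
 * Decodes one ITF-8 value from buf and stores the number of bytes
 * consumed in *len. Assumes buf holds enough bytes for the encoding. */
uint32_t itf8_decode_sketch(const uint8_t *buf, size_t *len)
{
    uint8_t b0 = buf[0];

    if (b0 < 0x80) {            /* 0xxxxxxx: 7 data bits */
        *len = 1;
        return b0;
    }
    if (b0 < 0xC0) {            /* 10xxxxxx: 6 + 8 data bits */
        *len = 2;
        return ((uint32_t)(b0 & 0x3F) << 8) | buf[1];
    }
    if (b0 < 0xE0) {            /* 110xxxxx: 5 + 16 data bits */
        *len = 3;
        return ((uint32_t)(b0 & 0x1F) << 16) |
               ((uint32_t)buf[1] << 8) | buf[2];
    }
    if (b0 < 0xF0) {            /* 1110xxxx: 4 + 24 data bits */
        *len = 4;
        return ((uint32_t)(b0 & 0x0F) << 24) |
               ((uint32_t)buf[1] << 16) |
               ((uint32_t)buf[2] << 8) | buf[3];
    }
    /* 1111xxxx: no terminating zero; the low 4 bits of the first byte
     * and the low 4 bits of the fifth byte both carry data. */
    *len = 5;
    return ((uint32_t)(b0 & 0x0F) << 28) |
           ((uint32_t)buf[1] << 20) |
           ((uint32_t)buf[2] << 12) |
           ((uint32_t)buf[3] << 4) |
           (uint32_t)(buf[4] & 0x0F);
}
```

Note that the last branch is taken for any first byte from `0xF0` to `0xFF`, so the fifth bit of the first byte is read as data rather than checked as a terminator.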
 
The same issue applies to LTF-8: the 9-byte case uses a `0xFF` first byte (all ones, no trailing zero), which the spec alludes to but doesn't reconcile with the "followed by a 0" rule.
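As a sketch of how the LTF-8 length rule could be stated without the "followed by a 0" wording: count the leading one bits of the first byte, stopping at 8, and add one. The helper name `ltf8_length_sketch` is hypothetical, and the 5- through 8-byte prefixes are assumed to extend the ITF-8 pattern (`11110`, `111110`, and so on).

```c
#include <stdint.h>

/* Hypothetical helper, not part of htslib or htsjdk.
 * Returns the total encoded length (1..9 bytes) implied by the first
 * byte of an LTF-8 value: the number of leading one bits, capped at 8,
 * plus the first byte itself. */
int ltf8_length_sketch(uint8_t first)
{
    int ones = 0;

    /* Walk from the most significant bit down while bits are set. */
    while (ones < 8 && (first & (0x80u >> ones))) {
        ones++;
    }
    return ones + 1;
}
```

With this formulation `0xFE` (seven ones, then a zero) gives 8 bytes and `0xFF` (eight ones, no zero) gives 9, so the maximum-length case needs no special wording.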
 
### 3. The description lacks examples and leaves critical differences from UTF-8 unstated
 
The spec says ITF-8 is "similar to UTF-8," but there is a key structural difference that is never stated: in UTF-8, *every* byte (including continuation bytes) carries a prefix (`10xxxxxx`); in ITF-8, **only the first byte** has a prefix — all subsequent bytes are pure data. A reader familiar with UTF-8 will assume the wrong thing.
 
A bit-pattern table (analogous to what the [UTF-8 Wikipedia page](https://en.wikipedia.org/wiki/UTF-8#Encoding) provides) would make the encoding unambiguous:
 
| Bytes | Byte 1     | Byte 2     | Byte 3     | Byte 4     | Byte 5     | Data bits |
|-------|------------|------------|------------|------------|------------|-----------|
| 1     | `0xxxxxxx` |            |            |            |            | 7         |
| 2     | `10xxxxxx` | `xxxxxxxx` |            |            |            | 14        |
| 3     | `110xxxxx` | `xxxxxxxx` | `xxxxxxxx` |            |            | 21        |
| 4     | `1110xxxx` | `xxxxxxxx` | `xxxxxxxx` | `xxxxxxxx` |            | 28        |
| 5     | `1111xxxx` | `xxxxxxxx` | `xxxxxxxx` | `xxxxxxxx` | `0000xxxx` | 32        |
 
Note the 5-byte row: no trailing zero in the prefix, and only the lower 4 bits of the last byte carry data.
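A worked encoder may help alongside the table. The sketch below follows the table row by row; the name `itf8_encode_sketch` is hypothetical, the input is treated as an unsigned 32-bit pattern, and the upper four bits of the fifth byte are zeroed to match the `0000xxxx` cell (whether existing writers do the same is not checked here).

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of htslib or htsjdk.
 * Writes the ITF-8 encoding of v into out (at most 5 bytes) and
 * returns the number of bytes written. */
size_t itf8_encode_sketch(uint32_t v, uint8_t out[5])
{
    if (v < (1u << 7)) {                     /* 0xxxxxxx */
        out[0] = (uint8_t)v;
        return 1;
    }
    if (v < (1u << 14)) {                    /* 10xxxxxx + 1 byte */
        out[0] = (uint8_t)(0x80 | (v >> 8));
        out[1] = (uint8_t)(v & 0xFF);
        return 2;
    }
    if (v < (1u << 21)) {                    /* 110xxxxx + 2 bytes */
        out[0] = (uint8_t)(0xC0 | (v >> 16));
        out[1] = (uint8_t)((v >> 8) & 0xFF);
        out[2] = (uint8_t)(v & 0xFF);
        return 3;
    }
    if (v < (1u << 28)) {                    /* 1110xxxx + 3 bytes */
        out[0] = (uint8_t)(0xE0 | (v >> 24));
        out[1] = (uint8_t)((v >> 16) & 0xFF);
        out[2] = (uint8_t)((v >> 8) & 0xFF);
        out[3] = (uint8_t)(v & 0xFF);
        return 4;
    }
    /* 1111xxxx + 4 bytes; only the low 4 bits of the last byte are data. */
    out[0] = (uint8_t)(0xF0 | (v >> 28));
    out[1] = (uint8_t)((v >> 20) & 0xFF);
    out[2] = (uint8_t)((v >> 12) & 0xFF);
    out[3] = (uint8_t)((v >> 4) & 0xFF);
    out[4] = (uint8_t)(v & 0x0F);
    return 5;
}
```

For example, 300 (`0x12C`) needs 9 bits, so it takes the 2-byte form: `0x81 0x2C`. The value `0xFFFFFFFF` takes the 5-byte form `0xFF 0xFF 0xFF 0xFF 0x0F`.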
 
