I understand ITF-8/LTF-8 may be on the chopping block for future CRAM versions, but for versions where they're still in use, the spec should be unambiguous.
The current text (quoted from the spec):
ITF-8 integer (itf8)
This is an alternative way to write an integer value. The idea is similar to UTF-8 encoding and therefore this encoding is called ITF-8 (Integer Transformation Format - 8 bit). The most significant bits of the first byte have special meaning and are called 'prefix'. These are 0 to 4 true bits followed by a 0. The number of 1's denote the number of bytes to follow. To accommodate 32 bits such representation requires 5 bytes with only 4 lower bits used in the last byte 5.
LTF-8 long (ltf8)
See ITF-8 for more details. The only difference between ITF-8 and LTF-8 is the number of bytes used to encode a single value. To do so 64 bits are required and this can be done with 9 byte at most with the first byte consisting of just 1s or 0xFF value.
Issues
1. Typos
- Dangling "5" at the end of the ITF-8 paragraph.
- "9 byte" -> "9 bytes" in the LTF-8 paragraph.
2. The prefix description is wrong for the maximum-length case
The spec says the prefix is "0 to 4 true bits followed by a 0." This holds for the 1- through 4-byte encodings, but not for the 5-byte encoding. Both htsjdk and htslib implement the 5-byte prefix as 1111: four ones with no trailing zero. Because only the lower 4 bits of the last byte are used, a terminating zero would leave just 31 data bits (3 + 8 + 8 + 8 + 4), one short of the 32 required.
From htsjdk (ITF8.java):
return ((b1 & 15) << 28) | ... | (15 & buffer.get());
From htslib (cram_io.h):
uint32_t uv = (((uint32_t)up[0] & 0x0f)<<28) | (up[1]<<20) | (up[2]<<12) | (up[3]<<4) | (up[4] & 0x0f);
Both mask the first byte with 0x0F (15), confirming that the prefix is 1111 (4 bits), not 11110 (5 bits).
The same issue applies to LTF-8: the 9-byte case uses a 0xFF first byte (all ones, no trailing zero), which the spec alludes to but doesn't reconcile with the "followed by a 0" rule.
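To make the LTF-8 case concrete, here is a minimal decoding sketch. It assumes that the intermediate lengths follow the same pattern as ITF-8 (n leading ones, then a zero, then data) and that only the 9-byte form drops the zero. The function name `ltf8_get_sketch` is hypothetical and is not the htslib or htsjdk API; bounds checking is omitted.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not a library API): decodes one LTF-8 value
 * from `in`, stores it in *v, and returns the number of bytes consumed.
 * Assumes the caller guarantees enough input bytes are available. */
size_t ltf8_get_sketch(const uint8_t *in, uint64_t *v)
{
    uint8_t first = in[0];

    /* Count leading 1 bits; this is the number of bytes that follow. */
    unsigned n = 0;
    while (n < 8 && (first & (0x80u >> n)))
        n++;

    /* Data bits remaining in the first byte. For n < 7 the mask also
     * skips the terminating zero; for n = 7 (0xFE) and n = 8 (0xFF)
     * the mask is 0, so the first byte carries no data at all. */
    uint64_t val = first & (0xFFu >> (n + 1));

    /* All following bytes are pure data, most significant first. */
    for (unsigned i = 1; i <= n; i++)
        val = (val << 8) | in[i];

    *v = val;
    return (size_t)n + 1;
}
```

Under these assumptions, a first byte of 0xFF is simply the n = 8 case: there is no bit left for a terminating zero, and the 64 data bits come entirely from the 8 following bytes.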
3. The description lacks examples and leaves critical differences from UTF-8 unstated
The spec says ITF-8 is "similar to UTF-8," but there is a key structural difference that is never stated: in UTF-8, every byte (including continuation bytes) carries a prefix (10xxxxxx); in ITF-8, only the first byte has a prefix — all subsequent bytes are pure data. A reader familiar with UTF-8 will assume the wrong thing.
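A small sketch of the two-byte forms shows the difference. The function names are hypothetical and only cover the two-byte case. Note also that the prefixes differ: a two-byte UTF-8 sequence starts with 110 (the count of ones is the total length), while a two-byte ITF-8 value starts with 10 (the count of ones is the number of bytes to follow).

```c
#include <stdint.h>

/* Two-byte ITF-8: 10xxxxxx xxxxxxxx.
 * Only the first byte has a prefix; 6 + 8 = 14 data bits. */
uint32_t itf8_decode2(const uint8_t b[2])
{
    return ((uint32_t)(b[0] & 0x3F) << 8) | b[1];
}

/* Two-byte UTF-8: 110xxxxx 10xxxxxx.
 * The continuation byte also has a prefix; 5 + 6 = 11 data bits. */
uint32_t utf8_decode2(const uint8_t b[2])
{
    return ((uint32_t)(b[0] & 0x1F) << 6) | (b[1] & 0x3F);
}
```

For example, the value 169 is the byte pair 0x80 0xA9 in ITF-8 but 0xC2 0xA9 in UTF-8, so applying UTF-8 rules to ITF-8 data gives the wrong length and the wrong value.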
A bit-pattern table (analogous to what the UTF-8 Wikipedia page provides) would make the encoding unambiguous:
| Bytes | Byte 1 | Byte 2 | Byte 3 | Byte 4 | Byte 5 | Data bits |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | 0xxxxxxx | | | | | 7 |
| 2 | 10xxxxxx | xxxxxxxx | | | | 14 |
| 3 | 110xxxxx | xxxxxxxx | xxxxxxxx | | | 21 |
| 4 | 1110xxxx | xxxxxxxx | xxxxxxxx | xxxxxxxx | | 28 |
| 5 | 1111xxxx | xxxxxxxx | xxxxxxxx | xxxxxxxx | 0000xxxx | 32 |
Note the 5-byte row: no trailing zero in the prefix, and only the lower 4 bits of the last byte carry data.
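The table can also be expressed as a short encoder/decoder pair. This is a minimal sketch under the layout described above, not the htslib or htsjdk code: the names `itf8_put_sketch` and `itf8_get_sketch` are hypothetical, and input bounds checking is omitted.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: writes v as ITF-8 into out (room for 5 bytes
 * assumed) and returns the number of bytes written. */
size_t itf8_put_sketch(uint8_t *out, uint32_t v)
{
    assert(out != NULL);
    if (v < (1u << 7)) {                    /* 0xxxxxxx */
        out[0] = (uint8_t)v;
        return 1;
    }
    if (v < (1u << 14)) {                   /* 10xxxxxx + 1 data byte */
        out[0] = (uint8_t)(0x80 | (v >> 8));
        out[1] = (uint8_t)v;
        return 2;
    }
    if (v < (1u << 21)) {                   /* 110xxxxx + 2 data bytes */
        out[0] = (uint8_t)(0xC0 | (v >> 16));
        out[1] = (uint8_t)(v >> 8);
        out[2] = (uint8_t)v;
        return 3;
    }
    if (v < (1u << 28)) {                   /* 1110xxxx + 3 data bytes */
        out[0] = (uint8_t)(0xE0 | (v >> 24));
        out[1] = (uint8_t)(v >> 16);
        out[2] = (uint8_t)(v >> 8);
        out[3] = (uint8_t)v;
        return 4;
    }
    /* 1111xxxx, three full data bytes, then 0000xxxx */
    out[0] = (uint8_t)(0xF0 | (v >> 28));
    out[1] = (uint8_t)(v >> 20);
    out[2] = (uint8_t)(v >> 12);
    out[3] = (uint8_t)(v >> 4);
    out[4] = (uint8_t)(v & 0x0F);
    return 5;
}

/* Hypothetical helper: reads one ITF-8 value from in, stores it in *v,
 * and returns the number of bytes consumed. */
size_t itf8_get_sketch(const uint8_t *in, uint32_t *v)
{
    uint8_t b = in[0];
    if (b < 0x80) {
        *v = b;
        return 1;
    }
    if (b < 0xC0) {
        *v = ((uint32_t)(b & 0x3F) << 8) | in[1];
        return 2;
    }
    if (b < 0xE0) {
        *v = ((uint32_t)(b & 0x1F) << 16) | ((uint32_t)in[1] << 8) | in[2];
        return 3;
    }
    if (b < 0xF0) {
        *v = ((uint32_t)(b & 0x0F) << 24) | ((uint32_t)in[1] << 16)
           | ((uint32_t)in[2] << 8) | in[3];
        return 4;
    }
    /* 5-byte form: 4 bits from the first byte, 4 from the last. */
    *v = ((uint32_t)(b & 0x0F) << 28) | ((uint32_t)in[1] << 20)
       | ((uint32_t)in[2] << 12) | ((uint32_t)in[3] << 4)
       | (uint32_t)(in[4] & 0x0F);
    return 5;
}
```

With this layout, 0xFFFFFFFF (that is, -1 as a 32-bit integer) encodes as FF FF FF FF 0F, and 300 encodes as 81 2C.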