ITF-8/LTF-8 spec text has typos, an incorrect prefix description, and lacks examples #855

@yfarjoun

I understand ITF-8/LTF-8 may be on the chopping block for future CRAM versions, but for versions where they're still in use, the spec should be unambiguous.

The current text (quoted from the spec):

ITF-8 integer (itf8)
This is an alternative way to write an integer value. The idea is similar to UTF-8 encoding and therefore this encoding is called ITF-8 (Integer Transformation Format - 8 bit). The most significant bits of the first byte have special meaning and are called 'prefix'. These are 0 to 4 true bits followed by a 0. The number of 1's denote the number of bytes to follow. To accommodate 32 bits such representation requires 5 bytes with only 4 lower bits used in the last byte 5.

LTF-8 long (ltf8)
See ITF-8 for more details. The only difference between ITF-8 and LTF-8 is the number of bytes used to encode a single value. To do so 64 bits are required and this can be done with 9 byte at most with the first byte consisting of just 1s or 0xFF value.

Issues

1. Typos

  • Dangling "5" at the end of the ITF-8 paragraph.
  • "9 byte" -> "9 bytes" in the LTF-8 paragraph.

2. The prefix description is wrong for the maximum-length case

The spec says the prefix is "0 to 4 true bits followed by a 0." This holds for the 1- through 4-byte encodings, but not for the 5-byte encoding. Both htsjdk and htslib implement the 5-byte prefix as 1111: four ones with no trailing zero. Because only the lower 4 bits of the last byte are used, a terminating zero would leave just 31 data bits (3 + 24 + 4) instead of the required 32.

From htsjdk (ITF8.java):

return ((b1 & 15) << 28) | ... | (15 & buffer.get());

From htslib (cram_io.h):

uint32_t uv = (((uint32_t)up[0] & 0x0f)<<28) | (up[1]<<20) | (up[2]<<12) | (up[3]<<4) | (up[4] & 0x0f);

Both mask the first byte with 0x0F / 15, confirming the prefix is 1111 (4 bits), not 11110 (5 bits).
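To make the masking concrete, here is a minimal decoder sketch in C covering all five lengths. It is not the htslib or htsjdk code: `itf8_decode_sketch` is a hypothetical name, the function does no bounds checking, and it assumes the caller supplies a complete encoding. The 5-byte branch mirrors the masks and shifts in the two snippets above.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of htslib or htsjdk.
 * Decodes one ITF-8 value from buf into *out and returns the number of
 * bytes consumed. Assumes buf holds a complete encoding (up to 5 bytes). */
size_t itf8_decode_sketch(const uint8_t *buf, uint32_t *out)
{
    uint32_t b0 = buf[0];

    if ((b0 & 0x80) == 0x00) {          /* 0xxxxxxx: 7 data bits */
        *out = b0;
        return 1;
    }
    if ((b0 & 0xC0) == 0x80) {          /* 10xxxxxx: 6 + 8 data bits */
        *out = ((b0 & 0x3F) << 8) | buf[1];
        return 2;
    }
    if ((b0 & 0xE0) == 0xC0) {          /* 110xxxxx: 5 + 16 data bits */
        *out = ((b0 & 0x1F) << 16) | ((uint32_t)buf[1] << 8) | buf[2];
        return 3;
    }
    if ((b0 & 0xF0) == 0xE0) {          /* 1110xxxx: 4 + 24 data bits */
        *out = ((b0 & 0x0F) << 24) | ((uint32_t)buf[1] << 16)
             | ((uint32_t)buf[2] << 8) | buf[3];
        return 4;
    }
    /* 1111xxxx: no terminating zero. 4 bits from the first byte, 8 + 8 + 8
     * from the middle bytes, and only the low 4 bits of the last byte. */
    *out = ((b0 & 0x0F) << 28) | ((uint32_t)buf[1] << 20)
         | ((uint32_t)buf[2] << 12) | ((uint32_t)buf[3] << 4)
         | (buf[4] & 0x0Fu);
    return 5;
}
```

Under this sketch the bytes FF FF FF FF 0F decode to 0xFFFFFFFF, and the upper 4 bits of the fifth byte are ignored.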

The same issue applies to LTF-8: the 9-byte case uses a 0xFF first byte (all ones, no trailing zero), which the spec alludes to but doesn't reconcile with the "followed by a 0" rule.
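The LTF-8 case can be sketched the same way. The issue only confirms the 0xFF first byte for the 9-byte form, so the code below assumes that the intermediate lengths follow the same pattern as ITF-8: the number of leading 1 bits gives the number of bytes that follow. `ltf8_decode_sketch` is a hypothetical name, and the function does no bounds checking.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of htslib or htsjdk.
 * Assumption: the count of leading 1 bits in the first byte (0 to 8)
 * equals the number of bytes that follow it. */
size_t ltf8_decode_sketch(const uint8_t *buf, uint64_t *out)
{
    uint8_t b0 = buf[0];
    size_t extra = 0;                   /* leading ones = bytes to follow */

    while (extra < 8 && (b0 & (0x80u >> extra)))
        extra++;

    /* With fewer than 8 leading ones, a 0 bit terminates the prefix and the
     * remaining low bits of the first byte are data. With 8 leading ones
     * (0xFF) there is no terminator and the first byte carries no data. */
    uint64_t value = (extra < 8) ? (uint64_t)(b0 & (0xFFu >> (extra + 1))) : 0;

    for (size_t i = 1; i <= extra; i++)
        value = (value << 8) | buf[i];

    *out = value;
    return extra + 1;
}
```

Written this way, the 0xFF case is the only one in which the "followed by a 0" rule cannot hold: all 8 bits of the first byte are prefix and the 64 data bits come entirely from the 8 bytes that follow.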

3. The description lacks examples and leaves critical differences from UTF-8 unstated

The spec says ITF-8 is "similar to UTF-8," but there is a key structural difference that is never stated: in UTF-8, every byte (including continuation bytes) carries a prefix (10xxxxxx); in ITF-8, only the first byte has a prefix — all subsequent bytes are pure data. A reader familiar with UTF-8 will assume the wrong thing.

A bit-pattern table (analogous to what the UTF-8 Wikipedia page provides) would make the encoding unambiguous:

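| Bytes | Byte 1 | Byte 2 | Byte 3 | Byte 4 | Byte 5 | Data bits |
|-------|----------|----------|----------|----------|----------|-----------|
| 1 | 0xxxxxxx | | | | | 7 |
| 2 | 10xxxxxx | xxxxxxxx | | | | 14 |
| 3 | 110xxxxx | xxxxxxxx | xxxxxxxx | | | 21 |
| 4 | 1110xxxx | xxxxxxxx | xxxxxxxx | xxxxxxxx | | 28 |
| 5 | 1111xxxx | xxxxxxxx | xxxxxxxx | xxxxxxxx | 0000xxxx | 32 |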

Note the 5-byte row: no trailing zero in the prefix, and only the lower 4 bits of the last byte carry data.
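Worked examples could accompany such a table. The encoder below is a minimal sketch that follows the five bit patterns directly and always picks the shortest form. `itf8_encode_sketch` is a hypothetical name, and choosing the shortest form is an assumption made for illustration, not something the quoted spec text states.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of htslib or htsjdk.
 * Writes the ITF-8 encoding of v into out (room for 5 bytes) and returns
 * the number of bytes written. Only the first byte carries a prefix. */
size_t itf8_encode_sketch(uint32_t v, uint8_t out[5])
{
    if (v < (1u << 7)) {                /* 0xxxxxxx */
        out[0] = (uint8_t)v;
        return 1;
    }
    if (v < (1u << 14)) {               /* 10xxxxxx xxxxxxxx */
        out[0] = (uint8_t)(0x80 | (v >> 8));
        out[1] = (uint8_t)v;
        return 2;
    }
    if (v < (1u << 21)) {               /* 110xxxxx + 2 data bytes */
        out[0] = (uint8_t)(0xC0 | (v >> 16));
        out[1] = (uint8_t)(v >> 8);
        out[2] = (uint8_t)v;
        return 3;
    }
    if (v < (1u << 28)) {               /* 1110xxxx + 3 data bytes */
        out[0] = (uint8_t)(0xE0 | (v >> 24));
        out[1] = (uint8_t)(v >> 16);
        out[2] = (uint8_t)(v >> 8);
        out[3] = (uint8_t)v;
        return 4;
    }
    /* 1111xxxx + 3 data bytes + low nibble of the last byte */
    out[0] = (uint8_t)(0xF0 | (v >> 28));
    out[1] = (uint8_t)(v >> 20);
    out[2] = (uint8_t)(v >> 12);
    out[3] = (uint8_t)(v >> 4);
    out[4] = (uint8_t)(v & 0x0F);
    return 5;
}
```

Under this sketch, 127 encodes as 7F, 300 (0x12C) as 81 2C, 16384 (0x4000) as C0 40 00, and 0xFFFFFFFF as FF FF FF FF 0F.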
