Commit 033a471

Doc: Attempting to specify this format
1 parent 6a0aef1 commit 033a471

1 file changed

Lines changed: 140 additions & 0 deletions

format.md

@@ -0,0 +1,140 @@
# About this document

This document attempts to specify the format currently implemented by binjs-fbssdc, also known as "context 0.1".

# Global structure

```
Stream ::= MagicHeader BrotliBody
```

# Magic Header

The magic header serves to identify the format.

```
MagicHeader ::= "\x89BJS\r\n\0\n" FormatVersion
FormatVersion ::= 0b00000010
```

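A reader can reject foreign files before attempting any decompression by checking these nine bytes. The following is a minimal Python sketch of that check; the name `split_magic_header` is illustrative and is not taken from binjs-fbssdc.

```python
MAGIC = b"\x89BJS\r\n\x00\n"   # the literal from the MagicHeader rule
FORMAT_VERSION = 0b00000010     # the FormatVersion byte


def split_magic_header(stream: bytes) -> bytes:
    """Validate the header and return the bytes that follow it.

    Hypothetical helper: the returned bytes are the BrotliBody.
    """
    header_len = len(MAGIC) + 1
    if len(stream) < header_len:
        raise ValueError("stream is shorter than the magic header")
    if stream[:len(MAGIC)] != MAGIC:
        raise ValueError("magic bytes do not match")
    if stream[len(MAGIC)] != FORMAT_VERSION:
        raise ValueError("unsupported format version")
    return stream[header_len:]
```
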
# Brotli content

With the exception of the header, the entire file is Brotli-compressed.

```
BrotliBody ::= Brotli(Body)
Body ::= Prelude AST
```

Here, `Brotli(...)` represents data that can be decompressed by the
`brotli` command-line tool or any compatible library.

# Prelude

The prelude defines a dictionary of strings and a dictionary of probabilities. The order of both is meaningful.

```
Prelude ::= StringPrelude ProbabilityPrelude
```

## String dictionary

```
StringPrelude ::= n=NumberOfStrings StringDefinition{n}
NumberOfStrings ::= varnum
StringDefinition ::= NonZeroByte* ZeroByte
NonZeroByte ::= 0x01-0xFF
ZeroByte ::= 0x00
```

Strings are UTF-8 encoded; any embedded `0x00 0x01` is then replaced with `0x01 0x01` (FIXME: Why does this work?)

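The grammar above can be read with a short loop: one count, then that many zero-terminated byte strings. The Python sketch below illustrates this under a loud assumption: this document does not define `varnum`, so `read_varnum` assumes unsigned LEB128 (7 payload bits per byte, low group first), which may differ from what binjs-fbssdc actually does. Both function names are hypothetical. Because the escaping rule is still marked FIXME, strings are returned as raw bytes, without unescaping or UTF-8 decoding.

```python
from typing import List, Tuple


def read_varnum(buf: bytes, pos: int) -> Tuple[int, int]:
    """Return (value, next position).

    ASSUMPTION: unsigned LEB128. The varnum encoding is not specified
    in this document, so this is a placeholder.
    """
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if byte & 0x80 == 0:
            return result, pos
        shift += 7


def read_string_prelude(buf: bytes, pos: int = 0) -> Tuple[List[bytes], int]:
    """Read StringPrelude; return (raw strings in order, next position)."""
    count, pos = read_varnum(buf, pos)          # NumberOfStrings
    strings = []
    for _ in range(count):
        end = buf.index(0, pos)                 # locate the ZeroByte
        strings.append(buf[pos:end])            # the NonZeroByte* run
        pos = end + 1
    return strings, pos
```

The returned list keeps the order of definition, since string indices elsewhere in the format refer to positions in this list.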
## Probability dictionaries

```
ProbabilityPrelude ::= ProbabilityTable* # FIXME: How do we determine the number of tables?
ProbabilityTable ::= ProbabilityTableUnreachable # Compression artifact. A table that needs to appear here but is never used.
    | ProbabilityTableOptimizedOne # Optimization: A probability table with a single symbol.
    | ProbabilityTableExplicitSymbols # Used for strings, numbers.
    | ProbabilityTableIndexedSymbols # Used for enums, booleans, sums of interfaces.
```

The probability tables are written down in an order extracted from the grammar and define a model
`huffman_at: (parent type, my type) -> HuffmanTable`.

FIXME: Specify how the order is extracted from the grammar.

```
ProbabilityTableUnreachable ::= 0x02
ProbabilityTableOptimizedOne ::= 0x00 ExplicitSymbolData # Used for strings, numbers.
    | 0x00 Index # Used for enums, booleans, sums of interfaces.
ProbabilityTableExplicitSymbols ::= 0x01 n=ProbabilityTableLen Probability{n} ExplicitSymbolData{n} # Only list the symbols actually used.
ProbabilityTableIndexedSymbols ::= 0x01 Probability* # List all symbols, in the order extracted from the grammar.
ProbabilityTableLen ::= varnum
Index ::= varnum
Probability ::= u8
ExplicitSymbolData ::= ExplicitSymbolStringIndex
    | ExplicitSymbolOptionalStringIndex
    | ExplicitSymbolF64
    | ExplicitSymbolU32
    | ExplicitSymbolI32
ExplicitSymbolStringIndex ::= varnum
ExplicitSymbolOptionalStringIndex ::= 0x00
    | n=varnum
      where n > 0
ExplicitSymbolF64 ::= f64 (IEEE 754, big endian)
ExplicitSymbolU32 ::= u32 (big endian)
ExplicitSymbolI32 ::= i32 (big endian)
```

An `Index` is an index into a list of well-known symbols (enums, booleans, sums of interfaces). The list is
extracted statically from the grammar.

FIXME: Specify the order of well-known symbols.

Both `ExplicitSymbolStringIndex` and `ExplicitSymbolOptionalStringIndex` are indices into the list of strings.
The list is in the order specified by `StringPrelude`. In `ExplicitSymbolOptionalStringIndex`, if the decoded
value `n` is non-zero, the actual index is `n - 1`.

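The fixed-width symbols and the optional string index can be decoded with a few lines each. The Python sketch below illustrates the big-endian layouts and the `n - 1` adjustment described above. All function names are hypothetical, and mapping the `0x00` case to `None` is an assumption about how an absent string would be represented.

```python
import struct
from typing import Optional


def decode_f64(raw: bytes) -> float:
    """ExplicitSymbolF64: 8 bytes, IEEE 754, big endian."""
    return struct.unpack(">d", raw)[0]


def decode_u32(raw: bytes) -> int:
    """ExplicitSymbolU32: 4 bytes, unsigned, big endian."""
    return struct.unpack(">I", raw)[0]


def decode_i32(raw: bytes) -> int:
    """ExplicitSymbolI32: 4 bytes, signed, big endian."""
    return struct.unpack(">i", raw)[0]


def resolve_optional_string_index(n: int) -> Optional[int]:
    """ExplicitSymbolOptionalStringIndex, given the decoded value n.

    ASSUMPTION: 0 stands for "no string" and is returned as None;
    any other value n refers to string number n - 1.
    """
    return None if n == 0 else n - 1
```
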
# AST

AST definitions are recursive. Any AST definition may itself contain further definitions,
used to represent lazy functions.

```
AST ::= RootNode n=NumberOfLazyParts LazyPartByteLen{n} LazyAST{n}
NumberOfLazyParts ::= varnum
LazyPartByteLen ::= varnum
LazyAST ::= Node
```

In the definition of `AST`, for each `i`, `LazyPartByteLen[i]` represents the number
of bytes used to store the sub-AST `LazyAST[i]`.

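Because every lazy part carries its byte length, a decoder can locate each `LazyAST[i]` without decoding the parts before it. The Python sketch below illustrates that slicing step; the function name and its parameters are hypothetical, and it assumes the lengths have already been read and that `start` is the offset of `LazyAST[0]`.

```python
from typing import List, Sequence


def split_lazy_parts(buf: bytes, start: int, byte_lens: Sequence[int]) -> List[bytes]:
    """Slice out each LazyAST[i] using LazyPartByteLen[i], without decoding."""
    parts = []
    pos = start
    for length in byte_lens:
        end = pos + length
        if end > len(buf):
            raise ValueError("lazy part runs past the end of the buffer")
        parts.append(buf[pos:end])
        pos = end
    return parts
```

Each returned slice can then be decoded on demand, for example when the corresponding lazy function is first needed.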
# Nodes

Nodes are stored as sequences of Huffman-encoded values. Note that the encoding uses
numerous distinct Huffman tables. Each `(parent tag, value type)` pair determines the
Huffman table to be used to decode the next few bits in the sequence.

```
RootNode ::= Value(ε)*
Node(parent) ::= t=Tag(parent) Field(t)*
Tag(parent) ::= Primitive(parent, TAG)
Value(parent) ::= "" # If field is lazy
    | Node(parent) # If field is an interface or sum of interfaces
    | List(parent) # If field is a list
    | Primitive(parent, U32) # If field is a u32
    | Primitive(parent, I32) # If field is an i32
    | Primitive(parent, F64) # ...
    | Primitive(parent, StringIndex)
    | Primitive(parent, OptionalStringIndex)
List(parent) ::= ListLength(parent) Value(parent)*
ListLength(parent) ::= Primitive(ListLength<parent>, U32) # List lengths are u32 values with a special parent
Primitive(parent, type) ::= bit*
```

In every instance of `Primitive(parent, type)`, we use the Huffman table defined as `huffman_at` (see above)
to both determine the number of bits to read and interpret these bits as a value of the corresponding `type`.
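
The table lookup can be illustrated with a small Python sketch. Everything in it is an assumption made for illustration: tables are modelled as dictionaries from bit strings to values, bits are read most-significant first, and a single-symbol table is modelled as one empty code that consumes no bits. This document does not specify how codes are derived from the probabilities, so the tables here are written by hand.

```python
class BitReader:
    """Reads bits from a byte string, most significant bit first (assumed order)."""

    def __init__(self, data: bytes) -> None:
        self.data = data
        self.pos = 0  # position in bits

    def read_bit(self) -> int:
        byte = self.data[self.pos >> 3]
        bit = (byte >> (7 - (self.pos & 7))) & 1
        self.pos += 1
        return bit


def decode_primitive(reader, huffman_at, parent, value_type):
    """Decode one Primitive(parent, type) using the table for that pair.

    huffman_at maps (parent, value_type) to a dict of bit string -> value
    (a hypothetical representation of a prefix-free code).
    """
    table = huffman_at[(parent, value_type)]
    longest = max(len(code) for code in table)
    code = ""
    # Extend the candidate code one bit at a time until it matches an entry.
    while code not in table:
        if len(code) >= longest:
            raise ValueError("bit sequence matches no code in this table")
        code += str(reader.read_bit())
    return table[code]
```

For example, with a table `{"0": "A", "10": "B", "11": "C"}` registered for the pair `("Root", "TAG")`, the byte `0b10011000` decodes to `B`, `A`, `C` on three successive calls.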
