Commit 329a3e1

Doc: Clarification on the format
1 parent 19166c7 commit 329a3e1

1 file changed

Lines changed: 90 additions & 29 deletions

format.md

@@ -2,22 +2,32 @@
22

33
This document attempts to document the format currently implemented by binjs-fbssdc, also known as "context 0.1".
44

5-
# Global structure
5+
This format is parameterized by the AST of the host language (as of this writing, the JavaScript ES6 AST, as
6+
defined [here](https://github.com/binast/binjs-ref/blob/master/spec/es6.webidl)).
7+
8+
# Compressed files
9+
10+
## Global structure
611

712
```
813
Stream ::= MagicHeader BrotliBody
914
```
1015

11-
# Magic Header
16+
### Note for future versions
17+
18+
Future versions may add footers containing data that is not necessary for compilation/interpretation
19+
of the program, e.g. license, documentation, sourcemaps.
20+
21+
## Magic Header
1222

1323
The magic header serves to identify the format.
1424

1525
```
1626
MagicHeader ::= "\x89BJS\r\n\0\n" FormatVersion
17-
FormatVersion ::= 2 as varnum
27+
FormatVersion ::= varnum(2)
1828
```
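
As a rough illustration of the grammar above, the following Python sketch checks the magic bytes and reads the format version. The `varnum` encoding is not defined in this part of the document, so `read_varnum` assumes a LEB128-style little-endian base-128 encoding; that choice, and the helper names, are assumptions made for the example only.

```python
MAGIC = b"\x89BJS\r\n\x00\n"  # The 8 magic bytes from the MagicHeader rule.
EXPECTED_VERSION = 2


def read_varnum(data: bytes, pos: int) -> tuple[int, int]:
    """Read one varnum starting at `pos`; return (value, next position).

    ASSUMPTION: LEB128-style encoding (7 payload bits per byte, high bit
    set on every byte except the last). The spec text does not confirm this.
    """
    result = 0
    shift = 0
    while True:
        if pos >= len(data):
            raise ValueError("truncated varnum")
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if byte & 0x80 == 0:
            return result, pos
        shift += 7


def check_header(data: bytes) -> int:
    """Validate MagicHeader and return the offset where BrotliBody starts."""
    if not data.startswith(MAGIC):
        raise ValueError("missing magic bytes")
    version, pos = read_varnum(data, len(MAGIC))
    if version != EXPECTED_VERSION:
        raise ValueError(f"unsupported format version {version}")
    return pos
```

Under that assumption, the version `2` occupies a single byte, so the compressed body would start at offset 9.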
1929

20-
# Brotli content
30+
## Brotli content
2131

2232
With the exception of the header, the entire file is brotli-compressed.
2333

@@ -29,15 +39,19 @@ Body ::= Prelude AST
2939
Where `Brotli(...)` represents data that may be decompressed by the
3040
`brotli` command-line tool or any compatible library.
3141

32-
# Prelude
42+
## Prelude
3343

3444
The prelude defines a dictionary of strings and a dictionary of Huffman tables. The order of both is meaningful.
3545

3646
```
3747
Prelude ::= StringPrelude HuffmanPrelude
3848
```
3949

40-
## String dictionary
50+
### String dictionary
51+
52+
The Prelude string dictionary extends the Shared string dictionary with additional strings used in the file.
53+
These strings may be identifier names, property keys or literal strings. String enums are **not** part of
54+
the Prelude string dictionary, nor of the Shared string dictionary.
4155

4256
```
4357
StringPrelude ::= n=NumberOfStrings StringDefinition{n}
@@ -51,32 +65,64 @@ ZeroByte ::= 0x00
5165

5266
Strings are UTF-8 encoded. Any instance of `0x00` is replaced with `0x01 0x00`, and any instance of `0x01` is replaced with `0x01 0x01`.
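
The escaping rule can be sketched in Python as follows. The function names are hypothetical, and the sketch only covers the byte substitution described above, not the surrounding `StringDefinition` framing.

```python
def escape_string(text: str) -> bytes:
    """UTF-8 encode `text`, prefixing every 0x00 or 0x01 byte with 0x01."""
    out = bytearray()
    for byte in text.encode("utf-8"):
        if byte in (0x00, 0x01):
            out.append(0x01)  # Escape marker.
        out.append(byte)
    return bytes(out)


def unescape_string(data: bytes) -> str:
    """Reverse `escape_string`: drop each 0x01 marker, keep the byte after it."""
    out = bytearray()
    i = 0
    while i < len(data):
        byte = data[i]
        if byte == 0x01:
            if i + 1 >= len(data):
                raise ValueError("dangling escape byte")
            i += 1
            byte = data[i]
        out.append(byte)
        i += 1
    return out.decode("utf-8")
```

Because every literal `0x00` is preceded by `0x01`, a decoder can tell an escaped zero byte apart from an unescaped one.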
5367

68+
#### Note for future versions
5469

70+
Experience has shown that using separate namespaces for literal strings, identifier names and property keys
71+
could substantially improve compression. This remains to be tested with this format.
5572

56-
## Huffman dictionaries
73+
### Huffman dictionaries
74+
75+
The Prelude Huffman dictionaries encode a set of Huffman tables *for the host language
76+
grammar*. Each possible combination of `(parent_tag, child_type)` is assigned one
77+
Huffman table, which may be used to decode sequences of bits into a value of type
78+
`child_type` for a field that is part of an interface with tag `parent_tag`.
79+
80+
The tables are written down in an order extracted from the grammar and define a model
81+
`huffman_at: (parent tag, my type) -> HuffmanTable`.
5785

5886
```
59-
HuffmanPrelude ::= HuffmanTable* # FIXME: How do we determine the number of tables?
87+
HuffmanPrelude ::= HuffmanTable{N} # The number of tables is extracted from the grammar.
6088
HuffmanTable ::= HuffmanTableUnreachable # Compression artifact. A table that needs to appear here but is never used.
6189
| HuffmanTableOptimizedOne # Optimization: A table with a single symbol.
6290
| HuffmanTableExplicitSymbols # Used for strings, numbers.
6391
| HuffmanTableIndexedSymbols # Used for enums, booleans, sums of interfaces.
6492
```
6593

66-
The tables are written down in an order extracted from the grammar and define a model
67-
`huffman_at: (parent type, my type) -> HuffmanTable`.
94+
As all tables need to be expressed, regardless of whether they are used, a number of tables may
95+
contain `HuffmanTableUnreachable`. If only one value is possible at a given `(parent tag, my type)`,
96+
we collapse the Huffman table into a single symbol definition, and its encoding in the AST takes
97+
0 bits.
6898

69-
FIXME: Specify how the order is extracted from the grammar.
99+
Otherwise, we differentiate between tables of Indexed Symbols (used whenever all possible values
100+
at this point form a simple, finite set, specified in the grammar as a sum of constants) and
101+
tables of Explicit Symbols (used whenever the set of values at this point is extensible by
102+
any file).
70103

71104
```
72105
HuffmanTableUnreachable ::= 0x02
73106
HuffmanTableOptimizedOne ::= 0x00 ExplicitSymbolData # Used for strings, numbers.
74107
| 0x00 Index # Used for enums, booleans, sums of interfaces.
75108
HuffmanTableExplicitSymbols ::= 0x01 n=HuffmanTableLen BitLength{n} ExplicitSymbolData{n} # Only list the symbols actually used.
76-
HuffmanTableIndexedSymbols ::= 0x01 BitLength* # List all symbols, in the order extracted from the grammar.
109+
HuffmanTableIndexedSymbols ::= 0x01 BitLength{N} # List all possible symbols. Number and order are extracted from the grammar.
77110
HuffmanTableLen ::= varnum
78111
Index ::= varnum
79112
BitLength ::= u8 # Number of bits needed to decode this symbol.
113+
```
114+
115+
An `Index` is an index in a list of well-known symbols (enums, booleans, sums of interfaces). The list is
116+
extracted statically from the grammar.
117+
118+
FIXME: Specify the order of well-known symbols.
119+
120+
Strings are always represented as indices into the string dictionary.
121+
122+
FIXME: Specify how we interpret an index when there is both a Shared string dictionary and a Prelude string
123+
dictionary.
124+
125+
```
80126
ExplicitSymbolData ::= ExplicitSymbolStringIndex
81127
| ExplicitSymbolOptionalStringIndex
82128
| ExplicitSymbolF64
@@ -91,16 +137,17 @@ ExplicitSymbolU32 ::= u32 (big endian)
91137
ExplicitSymbolI32 ::= i32 (big endian)
92138
```
93139

94-
An `Index` is an index in a list of well-known symbols (enums, booleans, sums of interfaces). The list is
95-
extracted statically from the grammar.
96-
97-
FIXME: Specify the order of well-known symbols.
98-
99140
Both `ExplicitSymbolStringIndex` and `ExplicitSymbolOptionalStringIndex` are indices in the list of strings.
100141
The list is in the order specified by `StringPrelude`. In `ExplicitSymbolOptionalStringIndex`, if the result
101142
is the non-0 value `n`, the actual index is `n - 1`.
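
A minimal Python sketch of the two lookups, assuming the decoded string list is available as a sequence. The function names are hypothetical, and treating `0` as "no string" is an assumption: the text above only states how non-0 values are interpreted.

```python
from typing import Optional, Sequence


def resolve_string_index(index: int, strings: Sequence[str]) -> str:
    """ExplicitSymbolStringIndex: a direct index into the string list."""
    return strings[index]


def resolve_optional_string_index(n: int, strings: Sequence[str]) -> Optional[str]:
    """ExplicitSymbolOptionalStringIndex: a non-0 value n refers to index n - 1.

    ASSUMPTION: the value 0 stands for an absent string.
    """
    if n == 0:
        return None
    return strings[n - 1]
```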
102143

103-
# AST
144+
#### Note for future versions
145+
146+
Experience has shown that nearly all instances of f64 are actually short
147+
integers and may therefore be represented efficiently as `varnum`. This
148+
remains to be tested.
149+
150+
## AST
104151

105152
AST definitions are recursive. Any AST definition may itself contain further definitions,
106153
used to represent lazy functions.
@@ -115,28 +162,42 @@ LazyAST ::= Node
115162
In the definition of `AST`, for each `i`, `LazyPartByteLen[i]` represents the number
116163
of bytes used to store the item of the sub-ast `LazyAST[i]`.
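
The byte lengths let a decoder locate each lazy part without decoding it. The Python sketch below illustrates the idea under the assumption that the lazy parts are stored back to back in a single byte buffer; the exact layout is given by the `AST` grammar, which is not reproduced here, and `split_lazy_parts` is a hypothetical helper.

```python
from typing import List, Sequence


def split_lazy_parts(payload: bytes, byte_lens: Sequence[int]) -> List[bytes]:
    """Slice `payload` into consecutive parts of the given byte lengths.

    ASSUMPTION: part i starts immediately after part i - 1.
    """
    parts = []
    offset = 0
    for length in byte_lens:
        end = offset + length
        if end > len(payload):
            raise ValueError("lazy part extends past end of payload")
        parts.append(payload[offset:end])
        offset = end
    return parts
```

Each slice can then be kept as raw bytes and decoded only when the corresponding function is first needed.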
117164

118-
# Nodes
165+
### Note for future versions
166+
167+
We intend to experiment with splitting the String prelude to ensure that
168+
we do not include in `RootNode` fragments that are only useful in lazy
169+
parts, and which would therefore generally slow startup.
170+
171+
172+
## Nodes
119173

120174
Nodes are stored as sequences of Huffman-encoded values. Note that the encoding uses
121-
numerous distinct Huffman tables. Each `(parent tag, value type)` pair determines the
122-
Huffman table to be used to decode the next few bits in the sequence.
175+
numerous distinct Huffman tables, rather than a single one. Each `(parent tag, value type)`
176+
pair determines the Huffman table to be used to decode the next few bits in the sequence.
177+
178+
At each node, the `parent tag` is determined by decoding the parent node (see `Tag` below),
179+
while the `value type` is specified by the language grammar.
123180

124181
```
125182
RootNode ::= Value(ε)*
126-
Node(parent) ::= t=Tag(parent) Field(t)*
183+
Node(parent) ::= t=Tag(parent) Value(t)* # Number of values is specified by the grammar
127184
Tag(parent) ::= Primitive(parent, TAG)
128-
Value(parent) ::= "" # If field is lazy
129-
| Node(parent) # If field is an interface or sum of interfaces
130-
| List(parent) # If field is a list
131-
| Primitive(parent, U32) # If field is a u32
132-
| Primitive(parent, I32) # If field is a i32
185+
Value(parent) ::= "" # If grammar specifies that field is lazy
186+
| Node(parent) # If grammar specifies that field is an interface or sum of interfaces
187+
| List(parent) # If grammar specifies that field is a list
188+
| Primitive(parent, U32) # If grammar specifies that field is a u32
189+
| Primitive(parent, I32) # If grammar specifies that field is an i32
133190
| Primitive(parent, F64) # ...
134191
| Primitive(parent, StringIndex)
135192
| Primitive(parent, OptionalStringIndex)
136-
List(parent) ::= ListLength(parent) Value(parent)*
193+
List(parent) ::= n=ListLength(parent) Value(parent){n}
137194
ListLength(parent) ::= Primitive(ListLength<parent>, U32) # List lengths are u32 values with a special parent
138195
Primitive(parent, type) ::= bit*
139196
```
140197

141198
In every instance of `Primitive(parent, type)`, we use the Huffman table defined as `huffman_at` (see above)
142-
to both determine the number of bits to read and interpret these bits as a value of the corresponding `type`.
199+
to both determine the number of bits to read and interpret these bits as a value of the corresponding `type`.
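
To make the decoding step concrete, here is a Python sketch that builds a table from a list of bit lengths and reads one value from a bit sequence. It assumes codes are assigned canonically (shorter codes first, ties broken by symbol order), as in DEFLATE; this document does not state how codes are derived from bit lengths, so treat that as an assumption. All names are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

# Maps (code length, code value) to the index of the decoded symbol.
Table = Dict[Tuple[int, int], int]


def canonical_codes(bit_lengths: Sequence[int]) -> Table:
    """Assign canonical Huffman codes to symbols from their bit lengths.

    ASSUMPTION: canonical assignment; symbols with length 0 are unused.
    """
    ordered = sorted(
        (length, symbol)
        for symbol, length in enumerate(bit_lengths)
        if length > 0
    )
    table: Table = {}
    code = 0
    previous_length = 0
    for length, symbol in ordered:
        code <<= length - previous_length  # Widen the code when length grows.
        table[(length, code)] = symbol
        code += 1
        previous_length = length
    return table


def read_primitive(bits: List[int], pos: int, table: Table) -> Tuple[int, int]:
    """Read bits from `pos` until they match a code; return (symbol, next pos)."""
    code = 0
    length = 0
    while pos < len(bits):
        code = (code << 1) | bits[pos]
        pos += 1
        length += 1
        if (length, code) in table:
            return table[(length, code)], pos
    raise ValueError("ran out of bits before matching a code")
```

A `huffman_at` model can then be represented as a dictionary keyed by `(parent tag, type)` whose values are such tables, so that decoding a field amounts to `read_primitive(bits, pos, huffman_at[(parent_tag, value_type)])`.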
200+
201+
# Shared dictionaries
202+
203+
TBD
