22
33This document describes the format currently implemented by binjs-fbssdc, also known as "context 0.1".
44
5- # Global structure
5+ This format is parameterized by the AST of the host language (as of this writing, the JavaScript ES6 AST, as
6+ defined [here](https://github.com/binast/binjs-ref/blob/master/spec/es6.webidl)).
7+
8+ # Compressed files
9+
10+ ## Global structure
611
712```
813Stream ::= MagicHeader BrotliBody
914```
1015
11- # Magic Header
16+ ### Note for future versions
17+
18+ Future versions may add footers containing data that is not necessary for compilation/interpretation
19+ of the program, e.g. license, documentation, sourcemaps.
20+
21+ ## Magic Header
1222
1323The magic header serves to identify the format.
1424
1525```
1626MagicHeader ::= "\x89BJS\r\n\0\n" FormatVersion
17- FormatVersion ::= 2 as varnum
27+ FormatVersion ::= varnum(2)
1828```
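
As a minimal sketch of how a reader might recognize the format, the following Python checks only the fixed 8-byte prefix of `MagicHeader`. The function name `strip_magic_prefix` is hypothetical, and the `FormatVersion` varnum is deliberately left unparsed because the byte-level varnum encoding is not specified in this section.

```python
# Fixed prefix of MagicHeader, as given in the grammar above.
MAGIC_PREFIX = b"\x89BJS\r\n\x00\n"


def strip_magic_prefix(data: bytes) -> bytes:
    """Return the bytes following the fixed magic prefix.

    Hypothetical helper: the returned bytes still start with the
    FormatVersion varnum, which this sketch does not decode.
    """
    if not data.startswith(MAGIC_PREFIX):
        raise ValueError("missing magic header prefix")
    return data[len(MAGIC_PREFIX):]
```
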
1929
20- # Brotli content
30+ ## Brotli content
2131
2232With the exception of the header, the entire file is brotli-compressed.
2333
@@ -29,15 +39,19 @@ Body ::= Prelude AST
2939Where `Brotli(...)` represents data that may be decompressed by the
3040`brotli` command-line tool or any compatible library.
3141
32- # Prelude
42+ ## Prelude
3343
3444The prelude defines a dictionary of strings and a dictionary of Huffman tables. The order of both is meaningful.
3545
3646```
3747Prelude ::= StringPrelude HuffmanPrelude
3848```
3949
40- ## String dictionary
50+ ### String dictionary
51+
52+ The Prelude string dictionary extends the Shared string dictionary with additional strings used in the file.
53+ These strings may be identifier names, property keys or literal strings. String enums are **not** part of
54+ the Prelude string dictionary, nor of the Shared string dictionary.
4155
4256```
4357StringPrelude ::= n=NumberOfStrings StringDefinition{n}
@@ -51,32 +65,64 @@ ZeroByte ::= 0x00
5165
5266Strings are UTF-8 encoded; any instance of `0x00` is replaced with `0x01 0x00`, and any instance of `0x01` is replaced with `0x01 0x01`.
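
The escaping rule above can be sketched in Python as follows. The function names `escape_string` and `unescape_string` are hypothetical. The sketch covers only the byte substitution, not the surrounding `StringDefinition` framing.

```python
def escape_string(text: str) -> bytes:
    """UTF-8 encode text, then escape 0x00 and 0x01 with a 0x01 prefix."""
    out = bytearray()
    for byte in text.encode("utf-8"):
        if byte in (0x00, 0x01):
            out.append(0x01)  # escape marker
        out.append(byte)
    return bytes(out)


def unescape_string(data: bytes) -> str:
    """Reverse escape_string: a 0x01 byte announces a literal 0x00 or 0x01."""
    out = bytearray()
    i = 0
    while i < len(data):
        if data[i] == 0x01:
            if i + 1 >= len(data) or data[i + 1] not in (0x00, 0x01):
                raise ValueError("malformed escape sequence")
            out.append(data[i + 1])
            i += 2
        else:
            out.append(data[i])
            i += 1
    return out.decode("utf-8")
```

Because every literal `0x00` is preceded by `0x01` after escaping, a reader can tell an escaped zero byte apart from an unescaped one.
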
5367
68+ #### Note for future versions
5469
70+ Experience has shown that namespacing differently between literal strings, identifier names and property keys
71+ could seriously improve compression. To be tested with this format.
5572
56- ## Huffman dictionaries
73+ ### Huffman dictionaries
74+
75+ The Prelude Huffman dictionaries encode a set of Huffman tables *for the host language
76+ grammar*. Each possible combination of `(parent_tag, child_type)` is assigned one
77+ Huffman table, which may be used to decode sequences of bits into a value of type
78+ `child_type` for a field that is part of an interface with tag `parent_tag`.
79+
80+ The tables are written down in an order extracted from the grammar and define a model
81+ ` huffman_at: (parent tag, my type) -> HuffmanTable ` .
5785
5886```
59- HuffmanPrelude ::= HuffmanTable* # FIXME: How do we determine the number of tables?
87+ HuffmanPrelude ::= HuffmanTable{N} # The number of tables is extracted from the grammar.
6088HuffmanTable ::= HuffmanTableUnreachable # Compression artifact. A table that needs to appear here but is never used.
6189 | HuffmanTableOptimizedOne # Optimization: A table with a single symbol.
6290 | HuffmanTableExplicitSymbols # Used for strings, numbers.
6391 | HuffmanTableIndexedSymbols # Used for enums, booleans, sums of interfaces.
6492```
6593
66- The tables are written down in an order extracted from the grammar and define a model
67- ` huffman_at: (parent type, my type) -> HuffmanTable ` .
94+ As all tables need to be expressed, regardless of whether they are used, a number of tables may
95+ contain ` HuffmanTableUnreachable ` . If only one value is possible at a given ` (parent tag, my type) ` ,
96+ we collapse the Huffman table into a single symbol definition, and its encoding in the AST takes
97+ 0 bits.
6898
69- FIXME: Specify how the order is extracted from the grammar.
99+ Otherwise, we differentiate between tables of Indexed Symbols (used whenever all possible values
100+ at this point form a simple, finite set, specified in the grammar as a sum of constants) and
101+ tables of Explicit Symbols (used whenever the set of values at this point is extensible by
102+ any file).
70103
71104```
72105HuffmanTableUnreachable ::= 0x02
73106HuffmanTableOptimizedOne ::= 0x00 ExplicitSymbolData # Used for strings, numbers.
74107 | 0x00 Index # Used for enums, booleans, sums of interfaces.
75108HuffmanTableExplicitSymbols ::= 0x01 n=HuffmanTableLen BitLength{n} ExplicitSymbolData{n} # Only list the symbols actually used.
76- HuffmanTableIndexedSymbols ::= 0x01 BitLength* # List all symbols, in the order extracted from the grammar.
109+ HuffmanTableIndexedSymbols ::= 0x01 BitLength{N} # List all possible symbols. Number and order are extracted from the grammar.
77110HuffmanTableLen ::= varnum
78111Index ::= varnum
79112BitLength ::= u8 # Number of bits needed to decode this symbol.
113+ ```
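
The tables above store only a `BitLength` per symbol. One common way to turn such lengths into actual codes is canonical Huffman assignment, sketched below in Python. This is an assumption for illustration: this section does not state which assignment rule the format uses, and both the function name `canonical_codes` and the treatment of length 0 as "symbol unused" are hypothetical.

```python
def canonical_codes(bit_lengths):
    """Assign canonical Huffman codes from per-symbol bit lengths.

    Returns a dict mapping symbol index -> code as a string of '0'/'1'.
    Assumption: a length of 0 means the symbol is unused and gets no code.
    """
    # Shorter codes first; ties are broken by symbol index.
    ordered = sorted(
        (length, index)
        for index, length in enumerate(bit_lengths)
        if length > 0
    )
    codes = {}
    code = 0
    previous_length = 0
    for length, index in ordered:
        # Pad the running code with zeros when the length grows.
        code <<= length - previous_length
        codes[index] = format(code, "0{}b".format(length))
        code += 1
        previous_length = length
    return codes
```

For the lengths `[2, 1, 3, 3]`, this gives symbol 1 the code `0`, symbol 0 the code `10`, and symbols 2 and 3 the codes `110` and `111`, which is prefix-free.
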
114+
115+ An ` Index ` is an index in a list of well-known symbols (enums, booleans, sums of interfaces). The list is
116+ extracted statically from the grammar.
117+
118+ FIXME: Specify the order of well-known symbols.
119+
120+ Strings are always represented as indices into the string dictionary.
121+
122+ FIXME: Specify how we interpret an index when there is both a Shared string dictionary and a Prelude string
123+ dictionary.
124+
125+ ```
80126ExplicitSymbolData ::= ExplicitSymbolStringIndex
81127 | ExplicitSymbolOptionalStringIndex
82128 | ExplicitSymbolF64
@@ -91,16 +137,17 @@ ExplicitSymbolU32 ::= u32 (big endian)
91137ExplicitSymbolI32 ::= i32 (big endian)
92138```
93139
94- An ` Index ` is an index in a list of well-known symbols (enums, booleans, sums of interfaces). The list is
95- extracted statically from the grammar.
96-
97- FIXME: Specify the order of well-known symbols.
98-
99140Both ` ExplicitSymbolStringIndex ` and ` ExplicitSymbolOptionalStringIndex ` are indices in the list of strings.
100141The list is in the order specified by ` StringPrelude ` . In ` ExplicitSymbolOptionalStringIndex ` , if the result
101142is the non-0 value ` n ` , the actual index is ` n - 1 ` .
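
The index adjustment for `ExplicitSymbolOptionalStringIndex` can be sketched in Python as below. The function name is hypothetical, and treating the value 0 as "no string" is an assumption: the text above defines only the non-0 case.

```python
from typing import Optional


def resolve_optional_string_index(n: int) -> Optional[int]:
    """Map a raw ExplicitSymbolOptionalStringIndex value to a string index.

    Assumption: 0 stands for an absent string. Only the non-0 rule
    (actual index is n - 1) comes from the specification text.
    """
    if n == 0:
        return None
    return n - 1
```
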
102143
103- # AST
144+ #### Note for future versions
145+
146+ Experience has shown that nearly all instances of f64 are actually short
147+ integers and may therefore be represented efficiently as `varnum`. This
148+ remains to be tested.
149+
150+ ## AST
104151
105152AST definitions are recursive. Any AST definition may itself contain further definitions,
106153used to represent lazy functions.
@@ -115,28 +162,42 @@ LazyAST ::= Node
115162In the definition of ` AST ` , for each ` i ` , ` LazyPartByteLen[i] ` represents the number
116163of bytes used to store the item of the sub-ast ` LazyAST[i] ` .
117164
118- # Nodes
165+ ### Note for future versions
166+
167+ We intend to experiment with splitting the String prelude to ensure that
168+ we do not include in `RootNode` fragments that are only useful in lazy
169+ parts, and which would therefore generally slow startup.
170+
171+
172+ ## Nodes
119173
120174Nodes are stored as sequences of Huffman-encoded values. Note that the encoding uses
121- numerous distinct Huffman tables. Each ` (parent tag, value type) ` pair determines the
122- Huffman table to be used to decode the next few bits in the sequence.
175+ numerous distinct Huffman tables, rather than a single one. Each ` (parent tag, value type) `
176+ pair determines the Huffman table to be used to decode the next few bits in the sequence.
177+
178+ At each node, the ` parent tag ` is determined by decoding the parent node (see ` Tag ` below),
179+ while the ` value type ` is specified by the language grammar.
123180
124181```
125182RootNode ::= Value(ε)*
126- Node(parent) ::= t=Tag(parent) Field (t)*
183+ Node(parent) ::= t=Tag(parent) Value (t)* # Number of values is specified by the grammar
127184Tag(parent) ::= Primitive(parent, TAG)
128- Value(parent) ::= "" # If field is lazy
129- | Node(parent) # If field is an interface or sum of interfaces
130- | List(parent) # If field is a list
131- | Primitive(parent, U32) # If field is a u32
132- | Primitive(parent, I32) # If field is a i32
185+ Value(parent) ::= "" # If grammar specifies that field is lazy
186+ | Node(parent) # If grammar specifies that field is an interface or sum of interfaces
187+ | List(parent) # If grammar specifies that field is a list
188+ | Primitive(parent, U32) # If grammar specifies that field is a u32
189+ | Primitive(parent, I32) # If grammar specifies that field is a i32
133190 | Primitive(parent, F64) # ...
134191 | Primitive(parent, StringIndex)
135192 | Primitive(parent, OptionalStringIndex)
136- List(parent) ::= ListLength(parent) Value(parent)*
193+ List(parent) ::= n= ListLength(parent) Value(parent){n}
137194ListLength(parent) ::= Primitive(ListLength<parent>, U32) # List lengths are u32 values with a special parent
138195Primitive(parent, type) ::= bit*
139196```
140197
141198In every instance of ` Primitive(parent, type) ` , we use the Huffman table defined as ` huffman_at ` (see above)
142- to both determine the number of bits to read and interpret these bits as a value of the corresponding ` type ` .
199+ to both determine the number of bits to read and interpret these bits as a value of the corresponding ` type ` .
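
A minimal Python sketch of decoding one `Primitive(parent, type)` is shown below. It assumes `huffman_at` is available as a dict from `(parent tag, value type)` to a table mapping bit strings to values. The tag names, the table contents, and the representation of input bits as a string of `'0'`/`'1'` characters are all hypothetical and chosen for readability only.

```python
def decode_primitive(bits, pos, table):
    """Read one Huffman-coded value from bits, starting at pos.

    table maps a code (string of '0'/'1') to a value.
    Returns (value, new_pos). A single-symbol table uses the empty
    code, so its value is decoded from 0 bits.
    """
    prefix = ""
    while True:
        if prefix in table:
            return table[prefix], pos
        if pos >= len(bits):
            raise ValueError("ran out of bits while decoding")
        prefix += bits[pos]
        pos += 1


# Hypothetical model: one table per (parent tag, value type) pair.
huffman_at = {
    ("ExampleParent", "TAG"): {"0": "ExampleA", "10": "ExampleB", "11": "ExampleC"},
    ("ListLength<ExampleParent>", "U32"): {"": 1},  # single symbol, 0 bits
}

tag, pos = decode_primitive("100", 0, huffman_at[("ExampleParent", "TAG")])
length, pos = decode_primitive("100", pos, huffman_at[("ListLength<ExampleParent>", "U32")])
```

Here the first call consumes two bits and the second consumes none, which illustrates how the table selected by `huffman_at` determines both how many bits are read and how they are interpreted.
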
200+
201+ # Shared dictionaries
202+
203+ TBD