Commit 46b03de

Merge branch 'chanicpanic-terminal-priorities-earley'
2 parents 97ce098 + c62a471 commit 46b03de

9 files changed

Lines changed: 186 additions & 76 deletions

CHANGELOG.md

Lines changed: 3 additions & 1 deletion
@@ -8,4 +8,6 @@ v1.0

 - `v_args(meta=True)` now gives meta as the first argument. i.e. `(meta, children)`

-- Renamed TraditionalLexer to BasicLexer, and 'standard' lexer option to 'basic'
+- Renamed TraditionalLexer to BasicLexer, and 'standard' lexer option to 'basic'
+
+- Default priority is now 0, for both terminals and rules (used to be 1 for terminals)

docs/grammar.md

Lines changed: 11 additions & 6 deletions
@@ -99,11 +99,13 @@ num_list: "[" _separated{NUMBER, ","} "]" // Will match "[1, 2, 3]" etc.

 ### Priority

-Terminals can be assigned priority only when using a lexer (future versions may support Earley's dynamic lexing).
+Terminals can be assigned a priority to influence lexing. Terminal priorities
+are signed integers with a default value of 0.

-Priority can be either positive or negative. If not specified for a terminal, it defaults to 1.
+When using a lexer, the highest priority terminals are always matched first.

-Highest priority terminals are always matched first.
+When using Earley's dynamic lexing, terminal priorities are used to prefer
+certain lexings and resolve ambiguity.

 ### Regexp Flags

@@ -228,9 +230,12 @@ four_words: word ~ 4

 ### Priority

-Rules can be assigned priority only when using Earley (future versions may support LALR as well).
+Like terminals, rules can be assigned a priority. Rule priorities are signed
+integers with a default value of 0.

-Priority can be either positive or negative. In not specified for a terminal, it's assumed to be 1 (i.e. the default).
+When using LALR, the highest priority rules are used to resolve collision errors.
+
+When using Earley, rule priorities are used to resolve ambiguity.

 <a name="dirs"></a>
 ## Directives
@@ -321,4 +326,4 @@ Can also be used to implement a plugin system where a core grammar is extended b
 %extend NUMBER: /0x\w+/
 ```

-For both `%extend` and `%override`, there is not requirement for a rule/terminal to come from another file, but that is probably the most common usecase
+For both `%extend` and `%override`, there is not requirement for a rule/terminal to come from another file, but that is probably the most common usecase
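
The new wording in docs/grammar.md ("when using a lexer, the highest priority terminals are always matched first") can be illustrated with a small standalone sketch. This is not Lark's lexer: `Terminal` and `tokenize` are hypothetical names, and the only idea borrowed from the docs is that terminals default to priority 0 and higher priorities are tried first.

```python
import re
from typing import List, NamedTuple, Tuple


class Terminal(NamedTuple):
    # Hypothetical stand-in for a grammar terminal; not Lark's TerminalDef.
    name: str
    pattern: str
    priority: int = 0  # default of 0, matching the updated docs


def tokenize(text: str, terminals: List[Terminal]) -> List[Tuple[str, str]]:
    # Try higher-priority terminals first. sorted() is stable, so terminals
    # with equal priority keep their declaration order.
    ordered = sorted(terminals, key=lambda t: -t.priority)
    pos, out = 0, []
    while pos < len(text):
        if text[pos].isspace():
            pos += 1  # skip whitespace between tokens
            continue
        for term in ordered:
            m = re.compile(term.pattern).match(text, pos)
            if m and m.end() > pos:
                out.append((term.name, m.group()))
                pos = m.end()
                break
        else:
            raise ValueError(f"no terminal matches at position {pos}")
    return out


with_priority = tokenize("if x", [Terminal("NAME", r"[a-z]+"),
                                  Terminal("IF", r"if", priority=1)])
without_priority = tokenize("if x", [Terminal("NAME", r"[a-z]+"),
                                     Terminal("IF", r"if")])
```

With the priority set, `if` is lexed as `IF`; with both terminals at the default priority, the earlier-declared `NAME` wins in this sketch. Lark's real tie-breaking rules are more involved and are not modelled here.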

lark/lark.py

Lines changed: 10 additions & 6 deletions
@@ -73,7 +73,7 @@ class LarkOptions(Serialize):
     start
             The start symbol. Either a string, or a list of strings for multiple possible starts (Default: "start")
     debug
-            Display debug information and extra warnings. Use only when debugging (default: False)
+            Display debug information and extra warnings. Use only when debugging (Default: ``False``)
             When used with Earley, it generates a forest graph as "sppf.png", if 'dot' is installed.
     transformer
             Applies the transformer to every parse tree (equivalent to applying it after the parse, but faster)
@@ -95,7 +95,7 @@ class LarkOptions(Serialize):
     g_regex_flags
             Flags that are applied to all terminals (both regex and strings)
     keep_all_tokens
-            Prevent the tree builder from automagically removing "punctuation" tokens (default: False)
+            Prevent the tree builder from automagically removing "punctuation" tokens (Default: ``False``)
     tree_class
             Lark will produce trees comprised of instances of this class instead of the default ``lark.Tree``.

@@ -123,13 +123,13 @@ class LarkOptions(Serialize):
     **=== Misc. / Domain Specific Options ===**

     postlex
-            Lexer post-processing (Default: None) Only works with the basic and contextual lexers.
+            Lexer post-processing (Default: ``None``) Only works with the basic and contextual lexers.
     priority
-            How priorities should be evaluated - auto, none, normal, invert (Default: auto)
+            How priorities should be evaluated - "auto", ``None``, "normal", "invert" (Default: "auto")
     lexer_callbacks
             Dictionary of callbacks for the lexer. May alter tokens during lexing. Use with caution.
     use_bytes
-            Accept an input of type ``bytes`` instead of ``str`` (Python 3 only).
+            Accept an input of type ``bytes`` instead of ``str``.
     edit_terminals
             A callback for editing the terminals before parse.
     import_paths
@@ -391,13 +391,17 @@ def __init__(self, grammar: 'Union[Grammar, str, IO[str]]', **options) -> None:
             for rule in self.rules:
                 if rule.options.priority is not None:
                     rule.options.priority = -rule.options.priority
+            for term in self.terminals:
+                term.priority = -term.priority
         # Else, if the user asked to disable priorities, strip them from the
-        # rules. This allows the Earley parsers to skip an extra forest walk
+        # rules and terminals. This allows the Earley parsers to skip an extra forest walk
         # for improved performance, if you don't need them (or didn't specify any).
         elif self.options.priority is None:
             for rule in self.rules:
                 if rule.options.priority is not None:
                     rule.options.priority = None
+            for term in self.terminals:
+                term.priority = 0

         # TODO Deprecate lexer_callbacks?
         self.lexer_conf = LexerConf(
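
The lark/lark.py hunk above extends the existing priority handling from rules to terminals. The following is a minimal standalone sketch of that logic, assuming simplified stand-in classes: `RuleStub`, `TermStub` and `normalize_priorities` are hypothetical names, and the rule priority is stored directly on the stub rather than under `rule.options` as in the diff.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RuleStub:
    # Hypothetical stand-in for a rule; priority may be unset (None).
    priority: Optional[int] = None


@dataclass
class TermStub:
    # Hypothetical stand-in for a terminal; priority is always an int.
    priority: int = 0


def normalize_priorities(mode, rules: List[RuleStub], terms: List[TermStub]) -> None:
    """Mirror the branches in the hunk: 'invert' negates, None strips."""
    if mode == "invert":
        for rule in rules:
            if rule.priority is not None:
                rule.priority = -rule.priority
        for term in terms:
            term.priority = -term.priority
    elif mode is None:
        # Rules lose their priority entirely; terminals fall back to 0,
        # since a terminal priority is an int rather than an optional value.
        for rule in rules:
            rule.priority = None
        for term in terms:
            term.priority = 0


rules = [RuleStub(2), RuleStub()]
terms = [TermStub(3), TermStub()]
normalize_priorities("invert", rules, terms)
inverted = ([r.priority for r in rules], [t.priority for t in terms])
normalize_priorities(None, rules, terms)
stripped = ([r.priority for r in rules], [t.priority for t in terms])
```

The asymmetry between `None` for rules and `0` for terminals follows the diff: the new default terminal priority is 0, so resetting to 0 is equivalent to "no priority specified".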

lark/parser_frontends.py

Lines changed: 2 additions & 4 deletions
@@ -162,8 +162,6 @@ class EarleyRegexpMatcher:
     def __init__(self, lexer_conf):
         self.regexps = {}
         for t in lexer_conf.terminals:
-            if t.priority:
-                raise GrammarError("Dynamic Earley doesn't support weights on terminals", t, t.priority)
             regexp = t.pattern.to_regexp()
             try:
                 width = get_regexp_width(regexp)[0]
@@ -186,13 +184,13 @@ def create_earley_parser__dynamic(lexer_conf, parser_conf, options=None, **kw):
         raise GrammarError("Earley's dynamic lexer doesn't support lexer_callbacks.")

     earley_matcher = EarleyRegexpMatcher(lexer_conf)
-    return xearley.Parser(parser_conf, earley_matcher.match, ignore=lexer_conf.ignore, **kw)
+    return xearley.Parser(lexer_conf, parser_conf, earley_matcher.match, **kw)

 def _match_earley_basic(term, token):
     return term.name == token.type

 def create_earley_parser__basic(lexer_conf, parser_conf, options, **kw):
-    return earley.Parser(parser_conf, _match_earley_basic, **kw)
+    return earley.Parser(lexer_conf, parser_conf, _match_earley_basic, **kw)

 def create_earley_parser(lexer_conf, parser_conf, options):
     resolve_ambiguity = options.ambiguity == 'resolve'

lark/parsers/earley.py

Lines changed: 27 additions & 6 deletions
@@ -11,17 +11,19 @@

 from collections import deque

+from ..lexer import Token
 from ..tree import Tree
 from ..exceptions import UnexpectedEOF, UnexpectedToken
 from ..utils import logger
 from .grammar_analysis import GrammarAnalyzer
 from ..grammar import NonTerminal
 from .earley_common import Item, TransitiveItem
-from .earley_forest import ForestSumVisitor, SymbolNode, ForestToParseTree
+from .earley_forest import ForestSumVisitor, SymbolNode, TokenNode, ForestToParseTree

 class Parser:
-    def __init__(self, parser_conf, term_matcher, resolve_ambiguity=True, debug=False, tree_class=Tree):
+    def __init__(self, lexer_conf, parser_conf, term_matcher, resolve_ambiguity=True, debug=False, tree_class=Tree):
         analysis = GrammarAnalyzer(parser_conf)
+        self.lexer_conf = lexer_conf
         self.parser_conf = parser_conf
         self.resolve_ambiguity = resolve_ambiguity
         self.debug = debug
@@ -42,13 +44,21 @@ def __init__(self, parser_conf, term_matcher, resolve_ambiguity=True, debug=Fals
             if rule.origin not in self.predictions:
                 self.predictions[rule.origin] = [x.rule for x in analysis.expand_rule(rule.origin)]

-            ## Detect if any rules have priorities set. If the user specified priority = "none" then
-            # the priorities will be stripped from all rules before they reach us, allowing us to
+            ## Detect if any rules/terminals have priorities set. If the user specified priority = None, then
+            # the priorities will be stripped from all rules/terminals before they reach us, allowing us to
             # skip the extra tree walk. We'll also skip this if the user just didn't specify priorities
-            # on any rules.
+            # on any rules/terminals.
             if self.forest_sum_visitor is None and rule.options.priority is not None:
                 self.forest_sum_visitor = ForestSumVisitor

+        # Check terminals for priorities
+        # Ignore terminal priorities if the basic lexer is used
+        if self.lexer_conf.lexer_type != 'basic' and self.forest_sum_visitor is None:
+            for term in self.lexer_conf.terminals:
+                if term.priority:
+                    self.forest_sum_visitor = ForestSumVisitor
+                    break
+
         self.term_matcher = term_matcher


@@ -232,8 +242,17 @@ def scan(i, token, to_scan):
                 if match(item.expect, token):
                     new_item = item.advance()
                     label = (new_item.s, new_item.start, i)
+                    # 'terminals' may not contain token.type when using %declare
+                    # Additionally, token is not always a Token
+                    # For example, it can be a Tree when using TreeMatcher
+                    term = terminals.get(token.type) if isinstance(token, Token) else None
+                    # Set the priority of the token node to 0 so that the
+                    # terminal priorities do not affect the Tree chosen by
+                    # ForestSumVisitor after the basic lexer has already
+                    # "used up" the terminal priorities
+                    token_node = TokenNode(token, term, priority=0)
                     new_item.node = node_cache[label] if label in node_cache else node_cache.setdefault(label, SymbolNode(*label))
-                    new_item.node.add_family(new_item.s, item.rule, new_item.start, item.node, token)
+                    new_item.node.add_family(new_item.s, item.rule, new_item.start, item.node, token_node)

                     if new_item.expect in self.TERMINALS:
                         # add (B ::= Aai+1.B, h, y) to Q'
@@ -252,6 +271,8 @@ def scan(i, token, to_scan):
         # Define parser functions
         match = self.term_matcher

+        terminals = self.lexer_conf.terminals_by_name
+
         # Cache for nodes & tokens created in a particular parse step.
         transitives = [{}]
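
The checks added to `Parser.__init__` in lark/parsers/earley.py decide whether the extra priority-summing forest walk is needed at all. The decision can be sketched as a pure function, under the assumption that rule priorities are optional integers and terminal priorities are plain integers; `needs_priority_walk` is a hypothetical name and does not exist in Lark.

```python
from typing import Iterable, Optional


def needs_priority_walk(rule_priorities: Iterable[Optional[int]],
                        terminal_priorities: Iterable[int],
                        lexer_type: str) -> bool:
    """Return True if a priority-summing pass over the forest would be set up."""
    # Any rule with an explicit priority (even 0) triggers the walk,
    # matching the `is not None` test in the diff.
    if any(p is not None for p in rule_priorities):
        return True
    # Terminal priorities only matter when the basic lexer is not used,
    # because the basic lexer has already applied them while tokenizing.
    if lexer_type != "basic":
        return any(p != 0 for p in terminal_priorities)
    return False


dynamic_with_term_priority = needs_priority_walk([None], [0, 5], "dynamic")
basic_with_term_priority = needs_priority_walk([None], [0, 5], "basic")
explicit_zero_rule = needs_priority_walk([None, 0], [0], "basic")
nothing_set = needs_priority_walk([None], [0], "dynamic")
```

Note the differing tests: a rule counts as prioritized when its priority is not `None`, while a terminal counts only when its priority is non-zero, since 0 is now the terminal default.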

lark/parsers/earley_forest.py

Lines changed: 73 additions & 45 deletions
@@ -36,13 +36,14 @@ class SymbolNode(ForestNode):

     Hence a Symbol Node with a single child is unambiguous.

-    :ivar s: A Symbol, or a tuple of (rule, ptr) for an intermediate node.
-    :ivar start: The index of the start of the substring matched by this
-        symbol (inclusive).
-    :ivar end: The index of the end of the substring matched by this
-        symbol (exclusive).
-    :ivar is_intermediate: True if this node is an intermediate node.
-    :ivar priority: The priority of the node's symbol.
+    Parameters:
+        s: A Symbol, or a tuple of (rule, ptr) for an intermediate node.
+        start: The index of the start of the substring matched by this symbol (inclusive).
+        end: The index of the end of the substring matched by this symbol (exclusive).
+
+    Properties:
+        is_intermediate: True if this node is an intermediate node.
+        priority: The priority of the node's symbol.
     """
     __slots__ = ('s', 'start', 'end', '_children', 'paths', 'paths_loaded', 'priority', 'is_intermediate', '_hash')
     def __init__(self, s, start, end):
@@ -113,11 +114,12 @@ class PackedNode(ForestNode):
     """
     A Packed Node represents a single derivation in a symbol node.

-    :ivar rule: The rule associated with this node.
-    :ivar parent: The parent of this node.
-    :ivar left: The left child of this node. ``None`` if one does not exist.
-    :ivar right: The right child of this node. ``None`` if one does not exist.
-    :ivar priority: The priority of this node.
+    Parameters:
+        rule: The rule associated with this node.
+        parent: The parent of this node.
+        left: The left child of this node. ``None`` if one does not exist.
+        right: The right child of this node. ``None`` if one does not exist.
+        priority: The priority of this node.
     """
     __slots__ = ('parent', 's', 'rule', 'start', 'left', 'right', 'priority', '_hash')
     def __init__(self, parent, s, rule, start, left, right):
@@ -172,6 +174,36 @@ def __repr__(self):
             symbol = self.s.name
         return "({}, {}, {}, {})".format(symbol, self.start, self.priority, self.rule.order)

+class TokenNode(ForestNode):
+    """
+    A Token Node represents a matched terminal and is always a leaf node.
+
+    Parameters:
+        token: The Token associated with this node.
+        term: The TerminalDef matched by the token.
+        priority: The priority of this node.
+    """
+    __slots__ = ('token', 'term', 'priority', '_hash')
+    def __init__(self, token, term, priority=None):
+        self.token = token
+        self.term = term
+        if priority is not None:
+            self.priority = priority
+        else:
+            self.priority = term.priority if term is not None else 0
+        self._hash = hash(token)
+
+    def __eq__(self, other):
+        if not isinstance(other, TokenNode):
+            return False
+        return self is other or (self.token == other.token)
+
+    def __hash__(self):
+        return self._hash
+
+    def __repr__(self):
+        return repr(self.token)
+
 class ForestVisitor:
     """
     An abstract base class for building forest visitors.
@@ -187,7 +219,8 @@ class ForestVisitor:
     methods. Returning a node(s) will schedule them to be visited. The visitor
     will begin to backtrack if no nodes are returned.

-    :ivar single_visit: If ``True``, non-Token nodes will only be visited once.
+    Parameters:
+        single_visit: If ``True``, non-Token nodes will only be visited once.
     """

     def __init__(self, single_visit=False):
@@ -224,11 +257,12 @@ def visit_packed_node_out(self, node):
     def on_cycle(self, node, path):
         """Called when a cycle is encountered.

-        :param node: The node that causes a cycle.
-        :param path: The list of nodes being visited: nodes that have been
-            entered but not exited. The first element is the root in a forest
-            visit, and the last element is the node visited most recently.
-            ``path`` should be treated as read-only.
+        Parameters:
+            node: The node that causes a cycle.
+            path: The list of nodes being visited: nodes that have been
+                entered but not exited. The first element is the root in a forest
+                visit, and the last element is the node visited most recently.
+                ``path`` should be treated as read-only.
         """
         pass

@@ -291,8 +325,8 @@ def visit(self, root):
                 input_stack.append(next_node)
                 continue

-            if not isinstance(current, ForestNode):
-                vtn(current)
+            if isinstance(current, TokenNode):
+                vtn(current.token)
                 input_stack.pop()
                 continue


@@ -322,8 +356,7 @@ def visit(self, root):
322356
if next_node is None:
323357
continue
324358

325-
if not isinstance(next_node, ForestNode) and \
326-
not isinstance(next_node, Token):
359+
if not isinstance(next_node, ForestNode):
327360
next_node = iter(next_node)
328361
elif id(next_node) in visiting:
329362
oc(next_node, path)
@@ -491,15 +524,13 @@ class ForestToParseTree(ForestTransformer):
     """Used by the earley parser when ambiguity equals 'resolve' or
     'explicit'. Transforms an SPPF into an (ambiguous) parse tree.

-    tree_class: The tree class to use for construction
-    callbacks: A dictionary of rules to functions that output a tree
-    prioritizer: A ``ForestVisitor`` that manipulates the priorities of
-        ForestNodes
-    resolve_ambiguity: If True, ambiguities will be resolved based on
-        priorities. Otherwise, `_ambig` nodes will be in the resulting
-        tree.
-    use_cache: If True, the results of packed node transformations will be
-        cached.
+    Parameters:
+        tree_class: The tree class to use for construction
+        callbacks: A dictionary of rules to functions that output a tree
+        prioritizer: A ``ForestVisitor`` that manipulates the priorities of ForestNodes
+        resolve_ambiguity: If True, ambiguities will be resolved based on
+            priorities. Otherwise, `_ambig` nodes will be in the resulting tree.
+        use_cache: If True, the results of packed node transformations will be cached.
     """

     def __init__(self, tree_class=Tree, callbacks=dict(), prioritizer=ForestSumVisitor(), resolve_ambiguity=True, use_cache=True):
@@ -643,21 +674,18 @@ class TreeForestTransformer(ForestToParseTree):
     Non-tree transformations are made possible by override of
     ``__default__``, ``__default_token__``, and ``__default_ambig__``.

-    .. note::
-
+    Note:
         Tree shaping features such as inlined rules and token filtering are
-        not built into the transformation. Positions are also not
-        propagated.
-
-    :param tree_class: The tree class to use for construction
-    :param prioritizer: A ``ForestVisitor`` that manipulates the priorities of
-        nodes in the SPPF.
-    :param resolve_ambiguity: If True, ambiguities will be resolved based on
-        priorities.
-    :param use_cache: If True, caches the results of some transformations,
-        potentially improving performance when ``resolve_ambiguity==False``.
-        Only use if you know what you are doing: i.e. All transformation
-        functions are pure and referentially transparent.
+        not built into the transformation. Positions are also not propagated.
+
+    Parameters:
+        tree_class: The tree class to use for construction
+        prioritizer: A ``ForestVisitor`` that manipulates the priorities of nodes in the SPPF.
+        resolve_ambiguity: If True, ambiguities will be resolved based on priorities.
+        use_cache (bool): If True, caches the results of some transformations,
+            potentially improving performance when ``resolve_ambiguity==False``.
+            Only use if you know what you are doing: i.e. All transformation
+            functions are pure and referentially transparent.
     """

     def __init__(self, tree_class=Tree, prioritizer=ForestSumVisitor(), resolve_ambiguity=True, use_cache=False):
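
The priority fallback in the new `TokenNode.__init__` has three cases, which the sketch below isolates. It is a standalone illustration, not Lark code: `TermStub` and `resolve_token_priority` are hypothetical names that reproduce only the branch logic visible in the diff.

```python
from typing import Optional


class TermStub:
    """Hypothetical stand-in for a terminal definition carrying a priority."""

    def __init__(self, name: str, priority: int = 0) -> None:
        self.name = name
        self.priority = priority


def resolve_token_priority(term: Optional[TermStub],
                           priority: Optional[int] = None) -> int:
    # 1. An explicit priority always wins (earley.py passes priority=0 so
    #    that terminal priorities are not counted a second time).
    if priority is not None:
        return priority
    # 2. Otherwise inherit the terminal's priority, or
    # 3. fall back to 0 when no terminal definition is available
    #    (the diff mentions %declare and non-Token inputs).
    return term.priority if term is not None else 0


keyword = TermStub("KEYWORD", priority=2)
inherited = resolve_token_priority(keyword)
overridden = resolve_token_priority(keyword, priority=0)
no_term = resolve_token_priority(None)
```

Passing `priority=0` rather than relying on the default is what lets a caller neutralize terminal priorities, because `None` is reserved to mean "inherit from the terminal".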
