README.md: 10 additions & 9 deletions
Original file line number
Diff line number
Diff line change
@@ -5,12 +5,9 @@
5
5
6
6
# Probabilistic Earley parser
7
7
8
-
This is an implementation of a probabilistic Earley parsing algorithm, which can parse any Probabilistic Context Free Grammar (PCFG) (also
9
-
known as Stochastic Context Free Grammar (SCFG)),
10
-
or equivalently any language described in Backus-Naur Form (BNF). In these grammars,
11
-
rewrite rules may be non-deterministic and have a probability attached to them.
12
-
8
+
This is a library for parsing a string of tokens (like words) into parse trees that are weighted by probability. For example: you might want to know the probabilities for all derivations of an English sentence, or the most likely table of contents structure for a list of paragraphs. This library allows you to do so efficiently, as long as you can describe the rules as a [Context-free Grammar](https://en.wikipedia.org/wiki/Context-free_grammar) (CFG).
13
9
10
+
The innovation of this library with respect to the gazillion other parsing libraries is that this one allows the production rules in your grammar to have a probability attached to them. This allows us to make a better choice in case of an ambiguous sentence: just select the derivation with the highest probability (this is called the Viterbi parse). If you do not need probabilities attached to your parse trees, you are probably better off using [nearley](http://nearley.js.org) instead.
14
11
15
12
For a theoretical grounding of this work, refer to [*Stolcke, An Efficient Probabilistic Context-Free
Written in TypeScript, published as a [CommonJS module on NPM](https://www.npmjs.com/package/probabilistic-earley-parser) and a [single-file minified UMD module on GitHub](https://github.com/digitalheir/probabilistic-earley-parser-javascript/releases) in vulgar ES5.
142
139
140
+
This is an implementation of a probabilistic Earley parsing algorithm, which can parse any Probabilistic Context Free Grammar (PCFG) (also
141
+
known as Stochastic Context Free Grammar (SCFG)),
142
+
or equivalently any language described in Backus-Naur Form (BNF). In these grammars,
143
+
rewrite rules may be non-deterministic and have a probability attached to them.
144
+
143
145
The probability of a parse is defined as the product of the probabilities of all the applied rules. Usually,
144
146
we define probability as a number between 0 and 1 inclusive, and use common algebraic notions of addition and
145
147
multiplication.
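
As a minimal sketch of the definition above: the probability of one derivation is the product of the probabilities of the rules it applies. The `AppliedRule` type, the `parseProbability` function and the toy rules are hypothetical names chosen for illustration; they are not part of this library's API.

```typescript
// Hypothetical shape of one rule application (illustration only, not the library's API).
interface AppliedRule {
  rule: string;        // human-readable rule, e.g. "S -> NP VP"
  probability: number; // a number between 0 and 1 inclusive
}

// Probability of a parse: the product of the probabilities of all applied rules.
// An empty derivation yields 1, the identity element of multiplication.
function parseProbability(appliedRules: AppliedRule[]): number {
  return appliedRules.reduce((product, r) => product * r.probability, 1);
}

// A toy derivation of the sentence "she sleeps" (made-up probabilities).
const derivation: AppliedRule[] = [
  { rule: "S -> NP VP", probability: 1.0 },
  { rule: "NP -> she", probability: 0.5 },
  { rule: "VP -> sleeps", probability: 0.25 },
];

const p = parseProbability(derivation); // 1.0 * 0.5 * 0.25 = 0.125
```

With many rule applications this product quickly becomes very small, which is one reason to prefer a log-based representation.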
@@ -151,6 +153,7 @@ semiring which holds the minus log of the probability. So that maps the numbers
151
153
between infinity and zero, skewed towards lower probabilities:
152
154
153
155
#### Graph plot of f(x) = -log(x)
156
+
154
157

155
158
156
159
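
The minus-log mapping described above can be sketched as follows. This assumes the standard log semiring, in which multiplying probabilities becomes adding scores and adding probabilities becomes a "log-sum-exp" of scores. All function names here (`toScore`, `fromScore`, `times`, `plus`) are hypothetical and are not taken from this library.

```typescript
// Map a probability in [0, 1] to a score in [0, Infinity]: f(x) = -log(x).
const toScore = (probability: number): number => -Math.log(probability);

// Inverse mapping: recover the probability from a score.
const fromScore = (score: number): number => Math.exp(-score);

// Semiring "multiplication": -log(x * y) = -log(x) + -log(y).
const times = (a: number, b: number): number => a + b;

// Semiring "addition": -log(e^-a + e^-b), rearranged so that only
// non-positive numbers are exponentiated (avoids overflow).
function plus(a: number, b: number): number {
  if (a === Infinity) return b; // a represents probability 0
  if (b === Infinity) return a; // b represents probability 0
  const min = Math.min(a, b);
  return min - Math.log(1 + Math.exp(-Math.abs(a - b)));
}

// Probability 1 maps to score 0; probability 0 maps to Infinity.
const product = fromScore(times(toScore(0.5), toScore(0.25))); // approximately 0.125
const sum = fromScore(plus(toScore(0.5), toScore(0.25)));      // approximately 0.75
```

Because scores are added rather than multiplied, long derivations do not underflow the way a chain of small floating-point products can.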
@@ -167,12 +170,10 @@ Note that this implementation does not apply innovations such as [Joop Leo's imp
167
170
For a faster parser that works on non-probabilistic grammars, look into [nearley](http://nearley.js.org).
168
171
169
172
### Limitations
170
-
* I have not provisioned for ε-rules
173
+
* I have not provisioned for ε-rules (rules with an empty right-hand side)
171
174
* Rule probability estimation may be performed using the inside-outside algorithm, but is not currently implemented
172
175
* Higher-level concepts such as wildcards, `*` and `+` are not implemented
173
-
* Viterbi parsing (querying the most likely parse tree) only returns one single parse. In the case of an ambiguous sentence, the returned parse is not guaranteed the left-most parse.
174
-
* Behavior for strangely defined grammars is not defined, such as when the same rule is defined multiple times with
175
-
a different probability
176
+
* Viterbi parsing (querying the most likely parse tree) only returns one single parse. In the case of an ambiguous sentence in which multiple derivations share the highest probability, the returned parse is not guaranteed to be the left-most parse (I think).
176
177
177
178
## License
178
179
This software is licensed under a permissive [MIT license](https://opensource.org/licenses/MIT).