| 1 | + |
| 2 | +[](https://opensource.org/licenses/MIT) |
| 3 | +[](https://www.python.org/downloads/release/python-390/) |
| 4 | +[](https://pypi.org/project/reynir-correct/) |
| 5 | +[](https://github.com/mideind/GreynirCorrect/releases) |
| 6 | +[](https://github.com/mideind/GreynirCorrect/actions?query=workflow%3A%22Python+package%22) |
| 7 | + |
| 8 | +# GreynirCorrect: Spelling and grammar correction for Icelandic |
| 9 | + |
| 10 | +## Overview |
| 11 | + |
| 12 | +**GreynirCorrect** is a Python 3 (>=3.9) package and command line tool for |
| 13 | +**checking and correcting spelling and grammar** in Icelandic text. |
| 14 | + |
| 15 | +GreynirCorrect relies on the [Greynir](https://pypi.org/project/reynir/) package, |
| 16 | +by the same authors, to tokenize and parse text. |
| 17 | + |
| 18 | +GreynirCorrect is documented in detail [here](https://yfirlestur.is/doc/). |
| 19 | + |
| 20 | +The software has three main modes of operation, described below. |
| 21 | + |
| 22 | +As a fourth alternative, you can call the JSON REST API |
| 23 | +of [Yfirlestur.is](https://yfirlestur.is) |
| 24 | +to apply the GreynirCorrect spelling and grammar engine to your text, |
| 25 | +as [documented here](https://github.com/mideind/Yfirlestur#https-api). |
| 26 | + |
| 27 | +### Token-level correction |
| 28 | + |
| 29 | +GreynirCorrect can tokenize text and return an automatically corrected token stream. |
| 30 | +This catches token-level errors, such as spelling errors and erroneous |
| 31 | +phrases, but not grammatical errors. Token-level correction is relatively fast. |
| 32 | + |
| 33 | +### Full grammar analysis |
| 34 | + |
| 35 | +GreynirCorrect can analyze text grammatically by attempting to parse |
| 36 | +it, after token-level correction. The parsing is done according to Greynir's |
| 37 | +context-free grammar for Icelandic, augmented with additional production |
| 38 | +rules for common grammatical errors. The analysis returns a set of annotations |
| 39 | +(errors and suggestions) that apply to spans (consecutive tokens) within |
| 40 | +sentences in the resulting token list. Full grammar analysis is considerably |
| 41 | +slower than token-level correction. |
| 42 | + |
| 43 | +### Command-line tool |
| 44 | + |
| 45 | +GreynirCorrect can be invoked as a command-line tool |
| 46 | +to perform token-level correction and, optionally, grammar analysis. |
| 47 | +The command is `correct infile.txt outfile.txt`. |
| 48 | +The command-line tool is further documented below. |
| 49 | + |
| 50 | +## Examples |
| 51 | + |
| 52 | +To perform token-level correction from Python code: |
| 53 | + |
| 54 | +```python |
| 55 | +>>> from reynir_correct import tokenize |
| 56 | +>>> g = tokenize("Af gefnu tilefni fékk fékk daninn vilja sýnum " |
| 57 | +...     "framgengt í auknu mæli.")
| 58 | +>>> for tok in g:
| 59 | +...     print("{0:10} {1}".format(tok.txt or "", tok.error_description))
| 60 | +``` |
| 61 | + |
| 62 | +Output: |
| 63 | + |
| 64 | +``` |
| 65 | +Að Orðasambandið 'Af gefnu tilefni' var leiðrétt í 'að gefnu tilefni' |
| 66 | +gefnu |
| 67 | +tilefni |
| 68 | +fékk Endurtekið orð ('fékk') var fellt burt |
| 69 | +Daninn Orð á að byrja á hástaf: 'daninn' |
| 70 | +vilja Orðasambandið 'vilja sýnum framgengt' var leiðrétt í 'vilja sínum framgengt' |
| 71 | +sínum |
| 72 | +framgengt |
| 73 | +í Orðasambandið 'í auknu mæli' var leiðrétt í 'í auknum mæli' |
| 74 | +auknum |
| 75 | +mæli |
| 76 | +. |
| 77 | +``` |
| 78 | + |
| 79 | +To perform full spelling and grammar analysis of a sentence from Python code: |
| 80 | + |
| 81 | +```python |
| 82 | +from reynir_correct import check_single |
| 83 | +sent = check_single("Páli, vini mínum, langaði að horfa á sjónnvarpið.") |
| 84 | +for annotation in sent.annotations: |
| 85 | +    print("{0}".format(annotation))
| 86 | +``` |
| 87 | + |
| 88 | +Output: |
| 89 | + |
| 90 | +``` |
| 91 | +000-004: P_WRONG_CASE_þgf_þf Á líklega að vera 'Pál, vin minn' / [Pál , vin minn] |
| 92 | +009-009: S004 Orðið 'sjónnvarpið' var leiðrétt í 'sjónvarpið' |
| 93 | +``` |
| 94 | + |
| 95 | +```python |
| 96 | +sent.tidy_text |
| 97 | +``` |
| 98 | + |
| 99 | +Output: |
| 100 | + |
| 101 | +``` |
| 102 | +'Páli, vini mínum, langaði að horfa á sjónvarpið.' |
| 103 | +``` |
| 104 | + |
| 105 | +The `annotation.start` and `annotation.end` properties |
| 106 | +(here `start` is 0 and `end` is 4) contain the 0-based indices of the first |
| 107 | +and last tokens to which the annotation applies. |
| 108 | +The `annotation.start_char` and `annotation.end_char` properties |
| 109 | +contain the indices of the first and last character to which the |
| 110 | +annotation applies, within the original input string. |
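
Because `end` is inclusive, slicing a token list with an annotation's indices needs `end + 1`. The following is a minimal sketch of that detail; the token list is written out by hand for illustration (an assumption, not output from the library), and `span_tokens` is a hypothetical helper name.

```python
from typing import List

# Hand-written tokens for the example sentence (illustrative assumption;
# in practice the tokens come from the analysis result).
tokens = ["Páli", ",", "vini", "mínum", ",", "langaði",
          "að", "horfa", "á", "sjónnvarpið", "."]


def span_tokens(tokens: List[str], start: int, end: int) -> List[str]:
    """Return the tokens covered by an annotation; `end` is inclusive."""
    return tokens[start : end + 1]


print(span_tokens(tokens, 0, 4))  # span of the P_WRONG_CASE_þgf_þf annotation
print(span_tokens(tokens, 9, 9))  # span of the S004 annotation
```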
| 111 | + |
| 112 | +`P_WRONG_CASE_þgf_þf` and `S004` are error codes. |
| 113 | + |
| 114 | +For more detailed, low-level control, the `check_errors()` function |
| 115 | +supports options and can produce various types of output: |
| 116 | + |
| 117 | +```python |
| 118 | +from reynir_correct import check_errors |
| 119 | +x = "Páli, vini mínum, langaði að horfa á sjónnvarpið." |
| 120 | +options = { "input": x, "annotations": True, "format": "text" } |
| 121 | +s = check_errors(**options) |
| 122 | +for i in s.split("\n"): |
| 123 | +    print(i)
| 124 | +``` |
| 125 | + |
| 126 | +Output: |
| 127 | + |
| 128 | +``` |
| 129 | +Pál, vin minn, langaði að horfa á sjónvarpið. |
| 130 | +000-004: P_WRONG_CASE_þgf_þf Á líklega að vera 'Pál, vin minn' | 'Páli, vini mínum,' -> 'Pál, vin minn' | None |
| 131 | +009-009: S004 Orðið 'sjónnvarpið' var leiðrétt í 'sjónvarpið' | 'sjónnvarpið' -> 'sjónvarpið' | None |
| 132 | +``` |
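
If you need the annotations from the text format in structured form, the lines can be picked apart with a regular expression. This is a rough sketch only: the line layout is inferred from the sample output above rather than from a format specification, and `AnnotationLine` and `parse_annotation_line` are hypothetical names. For robust processing, a structured output format is likely a better choice.

```python
import re
from typing import NamedTuple, Optional


class AnnotationLine(NamedTuple):
    start: int  # index of the first token in the span
    end: int    # index of the last token in the span (inclusive)
    code: str   # error code, e.g. "S004"
    rest: str   # remainder of the line, left unparsed


# Assumed layout: "<start>-<end>: <code> <rest>"
_PATTERN = re.compile(r"^(\d+)-(\d+):\s+(\S+)\s+(.*)$")


def parse_annotation_line(line: str) -> Optional[AnnotationLine]:
    """Parse one annotation line; return None for any other line."""
    m = _PATTERN.match(line)
    if m is None:
        return None
    return AnnotationLine(int(m.group(1)), int(m.group(2)), m.group(3), m.group(4))


sample = "\n".join([
    "Pál, vin minn, langaði að horfa á sjónvarpið.",
    "000-004: P_WRONG_CASE_þgf_þf Á líklega að vera 'Pál, vin minn'"
    " | 'Páli, vini mínum,' -> 'Pál, vin minn' | None",
    "009-009: S004 Orðið 'sjónnvarpið' var leiðrétt í 'sjónvarpið'"
    " | 'sjónnvarpið' -> 'sjónvarpið' | None",
])

parsed = [p for p in map(parse_annotation_line, sample.split("\n")) if p is not None]
for p in parsed:
    print(p.start, p.end, p.code)
```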
| 133 | + |
| 134 | +The following options can be specified: |
| 135 | + |
| 136 | +| Option | Description | Default value | |
| 137 | +|---|---|---| |
| 138 | +| `input` | Defines the input. Can be a string or an iterable of strings, such as a file object. | `sys.stdin` | |
| 139 | +| `all_errors` (alias `grammar`) | Defines the level of correction. If False, only token-level annotation is carried out. If True, sentence-level annotation is carried out. | `True` | |
| 140 | +| `annotate_unparsed_sentences` | If True, sentences that cannot be parsed are annotated in their entirety as errors. | `True` | |
| 141 | +| `generate_suggestion_list` | If True, annotations can in certain cases contain a list of possible corrections, for the user to pick from. | `False` | |
| 142 | +| `suppress_suggestions` | If True, more farfetched automatically suggested corrections are suppressed. | `False` | |
| 143 | +| `ignore_wordlist` | The value is a set of strings to whitelist. Each string is a word that should not be marked as an error or corrected. The comparison is case-sensitive. | `set()` | |
| 144 | +| `one_sent` | The input contains a single sentence only. Sentence splitting should not be attempted. | `False` | |
| 145 | +| `ignore_rules` | A set of error codes that should be ignored in the annotation process. | `set()` | |
| 146 | +| `tov_config` | Path to an additional configuration file that may be provided for correcting custom tone-of-voice issues. | `False` | |
| 147 | + |
| 148 | +An overview of error codes is available [here](https://github.com/mideind/GreynirCorrect/blob/master/doc/errorcodes.rst). |
| 149 | + |
| 150 | +## Prerequisites |
| 151 | + |
| 152 | +GreynirCorrect runs on CPython 3.9 or newer, and on PyPy 3.9 or newer. It has |
| 153 | +been tested on Linux, macOS and Windows. The |
| 154 | +[PyPI package](https://pypi.org/project/reynir-correct/)
| 155 | +includes binary wheels for common environments, but if the setup on your OS |
| 156 | +requires compilation from sources, you may need |
| 157 | + |
| 158 | +```bash |
| 159 | +$ sudo apt-get install python3-dev |
| 160 | +``` |
| 161 | + |
| 162 | +...or the equivalent command for your platform.
| 163 | + |
| 164 | +## Installation |
| 165 | + |
| 166 | +To install this package (assuming you have Python >= 3.9 with `pip` installed): |
| 167 | + |
| 168 | +```bash |
| 169 | +$ pip install reynir-correct |
| 170 | +``` |
| 171 | + |
| 172 | +If you want to be able to edit the source, install from a clone instead
| 173 | +(assuming you have `git` installed): |
| 174 | + |
| 175 | +```bash |
| 176 | +$ git clone https://github.com/mideind/GreynirCorrect |
| 177 | +$ cd GreynirCorrect |
| 178 | +$ # [ Activate your virtualenv here if you have one ] |
| 179 | +$ pip install -e . |
| 180 | +``` |
| 181 | + |
| 182 | +The package source code is now in `GreynirCorrect/src/reynir_correct`. |
| 183 | + |
| 184 | +## The command line tool |
| 185 | + |
| 186 | +After installation, the corrector can be invoked directly from the command line: |
| 187 | + |
| 188 | +```bash |
| 189 | +$ correct input.txt output.txt |
| 190 | +``` |
| 191 | + |
| 192 | +...or: |
| 193 | + |
| 194 | +```bash |
| 195 | +$ echo "Þinngið samþikkti tilöguna" | correct |
| 196 | +Þingið samþykkti tillöguna |
| 197 | +``` |
| 198 | + |
| 199 | +Input and output files are encoded in UTF-8. If the files are not |
| 200 | +given explicitly, `stdin` and `stdout` are used for input and output, |
| 201 | +respectively. |
| 202 | + |
| 203 | +Empty lines in the input are treated as sentence boundaries. |
| 204 | + |
| 205 | +By default, the output consists of one sentence per line, where each |
| 206 | +line ends with a single newline character (ASCII LF, `chr(10)`, `"\n"`). |
| 207 | +Within each line, tokens are separated by spaces. |
| 208 | + |
| 209 | +The following (mutually exclusive) options can be specified |
| 210 | +on the command line: |
| 211 | + |
| 212 | +| Option | Description | |
| 213 | +|---|---| |
| 214 | +| `--csv` | Output token objects in CSV format, one per line. Sentences are separated by lines containing `0,"",""` | |
| 215 | +| `--json` | Output token objects in JSON format, one per line.| |
| 216 | +| `--normalize` | Normalize punctuation, causing e.g. quotes to be output in Icelandic form and hyphens to be regularized. | |
| 217 | +| `--grammar` | Output whole-sentence annotations, including corrections and suggestions for spelling and grammar. Each sentence in the input is output as a text line containing a JSON object, terminated by a newline. | |
| 218 | + |
| 219 | +The CSV and JSON formats of token objects are identical to those documented |
| 220 | +for the [Tokenizer package](https://github.com/mideind/Tokenizer). |
| 221 | + |
| 222 | +The JSON format of whole-sentence annotations is identical to the one documented for |
| 223 | +the [Yfirlestur.is HTTPS REST API](https://github.com/mideind/Yfirlestur#https-api). |
| 224 | + |
| 225 | +Type `correct -h` to get a short help message. |
| 226 | + |
| 227 | +### Command Line Examples |
| 228 | + |
| 229 | +```bash |
| 230 | +$ echo "Atvinuleysi jógst um 3%" | correct |
| 231 | +Atvinnuleysi jókst um 3% |
| 232 | +``` |
| 233 | + |
| 234 | +```bash |
| 235 | +$ echo "Barnið vil grænann lit" | correct --csv |
| 236 | +6,"Barnið","" |
| 237 | +6,"vil","" |
| 238 | +6,"grænan","" |
| 239 | +6,"lit","" |
| 240 | +0,"","" |
| 241 | +``` |
| 242 | + |
| 243 | +Note how *vil* is not corrected, as it is a valid and common word, and |
| 244 | +the `correct` command does not perform grammar checking by default. |
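
The `0,"",""` separator rows make it straightforward to regroup CSV output into sentences. The sketch below uses the standard `csv` module on the sample output above; `sentences_from_csv` is a hypothetical helper, and it assumes only that the first column is the token kind and the second is the token text.

```python
import csv
from typing import Iterable, List

csv_output = '''6,"Barnið",""
6,"vil",""
6,"grænan",""
6,"lit",""
0,"",""
'''


def sentences_from_csv(lines: Iterable[str]) -> List[List[str]]:
    """Group token texts into sentences, splitting on 0,"","" rows."""
    sentences: List[List[str]] = []
    current: List[str] = []
    for row in csv.reader(lines):
        if len(row) < 2:
            continue  # skip blank or malformed rows
        kind, text = row[0], row[1]
        if kind == "0" and text == "":
            # Sentence separator: close the current sentence, if any
            if current:
                sentences.append(current)
                current = []
        else:
            current.append(text)
    if current:
        sentences.append(current)
    return sentences


print(sentences_from_csv(csv_output.splitlines()))
```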
| 245 | + |
| 246 | +```bash |
| 247 | +$ echo "Pakkin er fyrir hestin" | correct --json |
| 248 | +{"k":"BEGIN SENT"} |
| 249 | +{"k":"WORD","t":"Pakkinn"} |
| 250 | +{"k":"WORD","t":"er"} |
| 251 | +{"k":"WORD","t":"fyrir"} |
| 252 | +{"k":"WORD","t":"hestinn"} |
| 253 | +{"k":"END SENT"} |
| 254 | +``` |
| 255 | + |
| 256 | +To perform whole-sentence grammar checking and annotation as well as spell checking, |
| 257 | +use the `--grammar` option: |
| 258 | + |
| 259 | +```bash |
| 260 | +$ echo "Ég kláraði verkefnið þrátt fyrir að ég var þreittur." | correct --grammar |
| 261 | +{ |
| 262 | + "original":"Ég kláraði verkefnið þrátt fyrir að ég var þreittur.", |
| 263 | + "corrected":"Ég kláraði verkefnið þrátt fyrir að ég var þreyttur.", |
| 264 | + "tokens":[ |
| 265 | + {"k":6,"x":"Ég","o":"Ég"}, |
| 266 | + {"k":6,"x":"kláraði","o":" kláraði"}, |
| 267 | + {"k":6,"x":"verkefnið","o":" verkefnið"}, |
| 268 | + {"k":6,"x":"þrátt fyrir","o":" þrátt fyrir"}, |
| 269 | + {"k":6,"x":"að","o":" að"}, |
| 270 | + {"k":6,"x":"ég","o":" ég"}, |
| 271 | + {"k":6,"x":"var","o":" var"}, |
| 272 | + {"k":6,"x":"þreyttur","o":" þreittur"}, |
| 273 | + {"k":1,"x":".","o":"."} |
| 274 | + ], |
| 275 | + "annotations":[ |
| 276 | + { |
| 277 | + "start":6, |
| 278 | + "end":6, |
| 279 | + "start_char":35, |
| 280 | + "end_char":37, |
| 281 | + "code":"P_MOOD_ACK", |
| 282 | + "text":"Hér er réttara að nota viðtengingarhátt\n sagnarinnar 'vera', þ.e. 'væri'.", |
| 283 | + "detail":"Í viðurkenningarsetningum á borð við 'Z'\n í dæminu 'X gerði Y þrátt fyrir að Z' á sögnin að vera |
| 284 | + í viðtengingarhætti fremur en framsöguhætti.", |
| 285 | + "suggest":"væri" |
| 286 | + }, |
| 287 | + { |
| 288 | + "start":7, |
| 289 | + "end":7, |
| 290 | + "start_char":38, |
| 291 | + "end_char":41, |
| 292 | + "code":"S004", |
| 293 | + "text":"Orðið 'þreittur' var leiðrétt í 'þreyttur'", |
| 294 | + "detail":"", |
| 295 | + "suggest":"þreyttur" |
| 296 | + } |
| 297 | + ] |
| 298 | +} |
| 299 | +``` |
| 300 | + |
| 301 | +The output has been formatted for legibility: each input sentence is actually
| 302 | +represented by a JSON object in a single line of text, terminated by newline. |
| 303 | + |
| 304 | +Note that the `corrected` field only includes token-level spelling correction |
| 305 | +(in this case *þreittur* `->` *þreyttur*), but no grammar corrections. |
| 306 | +The grammar corrections are found in the `annotations` list. |
| 307 | +To apply corrections and suggestions from the annotations, |
| 308 | +replace source text or tokens (as identified by the `start` and `end`, |
| 309 | +or `start_char` and `end_char` properties) with the `suggest` field, if present. |
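
As a minimal sketch of the token-based approach, the function below rebuilds a sentence from the `tokens` list and substitutes each annotation's `suggest` value for the tokens in its `start`..`end` span. The data is abbreviated from the example above. The function name `apply_suggestions` is hypothetical, and reusing the leading whitespace of each token's `o` field for spacing is an assumption made for illustration. Overlapping annotations are not handled.

```python
from typing import Any, Dict

# Abbreviated from the `--grammar` example output above
result: Dict[str, Any] = {
    "tokens": [
        {"k": 6, "x": "Ég", "o": "Ég"},
        {"k": 6, "x": "kláraði", "o": " kláraði"},
        {"k": 6, "x": "verkefnið", "o": " verkefnið"},
        {"k": 6, "x": "þrátt fyrir", "o": " þrátt fyrir"},
        {"k": 6, "x": "að", "o": " að"},
        {"k": 6, "x": "ég", "o": " ég"},
        {"k": 6, "x": "var", "o": " var"},
        {"k": 6, "x": "þreyttur", "o": " þreittur"},
        {"k": 1, "x": ".", "o": "."},
    ],
    "annotations": [
        {"start": 6, "end": 6, "code": "P_MOOD_ACK", "suggest": "væri"},
        {"start": 7, "end": 7, "code": "S004", "suggest": "þreyttur"},
    ],
}


def apply_suggestions(result: Dict[str, Any]) -> str:
    """Rebuild the sentence, replacing annotated token spans with `suggest`."""
    pieces = []
    for tok in result["tokens"]:
        orig = tok["o"]
        # Assumption: whitespace before the original text gives the spacing
        lead = orig[: len(orig) - len(orig.lstrip())]
        pieces.append((lead, tok["x"]))
    # Work from right to left so earlier token indices stay valid
    ordered = sorted(result["annotations"], key=lambda a: a["start"], reverse=True)
    for ann in ordered:
        suggest = ann.get("suggest")
        if not suggest:
            continue  # annotation without a suggested replacement
        start, end = ann["start"], ann["end"]
        pieces[start : end + 1] = [(pieces[start][0], suggest)]
    return "".join(lead + text for lead, text in pieces)


print(apply_suggestions(result))
```

In an interactive setting you would typically show each suggestion to the user rather than apply all of them unconditionally.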
| 310 | + |
| 311 | +## Tests |
| 312 | + |
| 313 | +To run the built-in tests, install [pytest](https://docs.pytest.org/en/latest/), |
| 314 | +`cd` to your `GreynirCorrect` subdirectory (and optionally activate your |
| 315 | +virtualenv), then run: |
| 316 | + |
| 317 | +```bash |
| 318 | +$ python -m pytest |
| 319 | +``` |
| 320 | + |
| 321 | +## Acknowledgements |
| 322 | + |
| 323 | +Parts of this software are developed under the auspices of the |
| 324 | +Icelandic Government's 5-year Language Technology Programme for Icelandic, |
| 325 | +which is managed by Almannarómur and described |
| 326 | +[here](https://www.stjornarradid.is/lisalib/getfile.aspx?itemid=56f6368e-54f0-11e7-941a-005056bc530c) |
| 327 | +(English version [here](https://clarin.is/media/uploads/mlt-en.pdf)). |
| 328 | + |
| 329 | +## Copyright and License |
| 330 | + |
| 331 | +[](https://mideind.is) |
| 332 | + |
| 333 | +**Copyright © 2018-2025 Miðeind ehf.** |
| 334 | + |
| 335 | +GreynirCorrect's original author is *Vilhjálmur Þorsteinsson*. |
| 336 | + |
| 337 | +This software is licensed under the *MIT License*: |
| 338 | + |
| 339 | + *Permission is hereby granted, free of charge, to any person |
| 340 | + obtaining a copy of this software and associated documentation |
| 341 | + files (the "Software"), to deal in the Software without restriction, |
| 342 | + including without limitation the rights to use, copy, modify, merge, |
| 343 | + publish, distribute, sublicense, and/or sell copies of the Software, |
| 344 | + and to permit persons to whom the Software is furnished to do so, |
| 345 | + subject to the following conditions:* |
| 346 | + |
| 347 | + *The above copyright notice and this permission notice shall be |
| 348 | + included in all copies or substantial portions of the Software.* |
| 349 | + |
| 350 | + *THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, |
| 351 | + EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF |
| 352 | + MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY |
| 353 | + CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE |
| 354 | + SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.* |
| 355 | + |
| 356 | +---- |
| 357 | + |
| 358 | +GreynirCorrect indirectly embeds the [Database of Icelandic Morphology](https://bin.arnastofnun.is) |
| 359 | +([Beygingarlýsing íslensks nútímamáls](https://bin.arnastofnun.is)) |
| 360 | +along with directly using |
| 361 | +[Ritmyndir](https://bin.arnastofnun.is/DMII/LTdata/comp-format/nonstand-form/), |
| 362 | +a collection of non-standard word forms. |
| 363 | +Miðeind does not claim any endorsement by the BÍN authors or copyright holders. |
| 364 | + |
| 365 | +The BÍN source data are publicly available under the |
| 366 | +[CC BY-SA 4.0 license](https://creativecommons.org/licenses/by-sa/4.0/), as further |
| 367 | +detailed [here in English](https://bin.arnastofnun.is/DMII/LTdata/conditions/) |
| 368 | +and [here in Icelandic](https://bin.arnastofnun.is/gogn/mimisbrunnur/). |
| 369 | + |
| 370 | +In accordance with the BÍN license terms, credit is hereby given as follows: |
| 371 | + |
| 372 | +*Beygingarlýsing íslensks nútímamáls. Stofnun Árna Magnússonar í íslenskum fræðum.* |
| 373 | +*Höfundur og ritstjóri Kristín Bjarnadóttir.* |
| 374 | + |