Commit 4f7cdfb

Migrated README from RST to GitHub-flavored Markdown + rm explicit Python 3.14 support in project metadata until Icegrams dependency issue is resolved
1 parent faebcff commit 4f7cdfb

3 files changed

Lines changed: 374 additions & 443 deletions

README.md

Lines changed: 374 additions & 0 deletions
@@ -0,0 +1,374 @@
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![Python 3.9+](https://img.shields.io/badge/python-3.9-blue.svg)](https://www.python.org/downloads/release/python-390/)
[![PyPI version](https://img.shields.io/pypi/v/reynir-correct)](https://pypi.org/project/reynir-correct/)
[![GitHub release](https://shields.io/github/v/release/mideind/GreynirCorrect?display_name=tag)](https://github.com/mideind/GreynirCorrect/releases)
[![Python package](https://github.com/mideind/GreynirCorrect/actions/workflows/python-package.yml/badge.svg)](https://github.com/mideind/GreynirCorrect/actions?query=workflow%3A%22Python+package%22)

# GreynirCorrect: Spelling and grammar correction for Icelandic

## Overview

**GreynirCorrect** is a Python 3 (>=3.9) package and command line tool for
**checking and correcting spelling and grammar** in Icelandic text.

GreynirCorrect relies on the [Greynir](https://pypi.org/project/reynir/) package,
by the same authors, to tokenize and parse text.

GreynirCorrect is documented in detail [here](https://yfirlestur.is/doc/).

The software has three main modes of operation, described below.

As a fourth alternative, you can call the JSON REST API
of [Yfirlestur.is](https://yfirlestur.is)
to apply the GreynirCorrect spelling and grammar engine to your text,
as [documented here](https://github.com/mideind/Yfirlestur#https-api).

### Token-level correction

GreynirCorrect can tokenize text and return an automatically corrected token stream.
This catches token-level errors, such as spelling errors and erroneous
phrases, but not grammatical errors. Token-level correction is relatively fast.

### Full grammar analysis

GreynirCorrect can analyze text grammatically by attempting to parse
it, after token-level correction. The parsing is done according to Greynir's
context-free grammar for Icelandic, augmented with additional production
rules for common grammatical errors. The analysis returns a set of annotations
(errors and suggestions) that apply to spans (consecutive tokens) within
sentences in the resulting token list. Full grammar analysis is considerably
slower than token-level correction.

### Command-line tool

GreynirCorrect can be invoked as a command-line tool
to perform token-level correction and, optionally, grammar analysis.
The command is `correct infile.txt outfile.txt`.
The command-line tool is further documented below.

## Examples

To perform token-level correction from Python code:

```python
>>> from reynir_correct import tokenize
>>> g = tokenize("Af gefnu tilefni fékk fékk daninn vilja sýnum "
...     "framgengt í auknu mæli.")
>>> for tok in g:
...     print("{0:10} {1}".format(tok.txt or "", tok.error_description))
```

Output:

```
Að         Orðasambandið 'Af gefnu tilefni' var leiðrétt í 'að gefnu tilefni'
gefnu
tilefni
fékk       Endurtekið orð ('fékk') var fellt burt
Daninn     Orð á að byrja á hástaf: 'daninn'
vilja      Orðasambandið 'vilja sýnum framgengt' var leiðrétt í 'vilja sínum framgengt'
sínum
framgengt
í          Orðasambandið 'í auknu mæli' var leiðrétt í 'í auknum mæli'
auknum
mæli
.
```

To perform full spelling and grammar analysis of a sentence from Python code:

```python
from reynir_correct import check_single

sent = check_single("Páli, vini mínum, langaði að horfa á sjónnvarpið.")
for annotation in sent.annotations:
    print("{0}".format(annotation))
```

Output:

```
000-004: P_WRONG_CASE_þgf_þf Á líklega að vera 'Pál, vin minn' / [Pál , vin minn]
009-009: S004 Orðið 'sjónnvarpið' var leiðrétt í 'sjónvarpið'
```

```python
sent.tidy_text
```

Output:

```
'Páli, vini mínum, langaði að horfa á sjónvarpið.'
```

The `annotation.start` and `annotation.end` properties
(0 and 4 for the first annotation above) contain the 0-based indices of the first
and last tokens to which the annotation applies.
The `annotation.start_char` and `annotation.end_char` properties
contain the indices of the first and last character to which the
annotation applies, within the original input string.

`P_WRONG_CASE_þgf_þf` and `S004` are error codes.

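Because `start` and `end` name the first and last token, both ends of the span are inclusive. A minimal sketch of selecting the covered tokens follows; the token list is written by hand for the example sentence and `span_tokens` is a hypothetical helper, neither being part of the GreynirCorrect API:

```python
# Hand-written token list for the example sentence (an assumption for
# illustration; in practice the tokens come from the checked sentence)
TOKENS = ["Páli", ",", "vini", "mínum", ",", "langaði",
          "að", "horfa", "á", "sjónnvarpið", "."]


def span_tokens(tokens, start, end):
    """Return the tokens covered by an annotation; `end` is inclusive."""
    return tokens[start : end + 1]


print(span_tokens(TOKENS, 0, 4))  # span flagged by P_WRONG_CASE_þgf_þf
print(span_tokens(TOKENS, 9, 9))  # span flagged by S004
```
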
For more detailed, low-level control, the `check_errors()` function
supports options and can produce various types of output:

```python
from reynir_correct import check_errors

x = "Páli, vini mínum, langaði að horfa á sjónnvarpið."
options = {"input": x, "annotations": True, "format": "text"}
s = check_errors(**options)
for i in s.split("\n"):
    print(i)
```

Output:

```
Pál, vin minn, langaði að horfa á sjónvarpið.
000-004: P_WRONG_CASE_þgf_þf Á líklega að vera 'Pál, vin minn' | 'Páli, vini mínum,' -> 'Pál, vin minn' | None
009-009: S004 Orðið 'sjónnvarpið' var leiðrétt í 'sjónvarpið' | 'sjónnvarpið' -> 'sjónvarpið' | None
```

The following options can be specified:

| Option | Description | Default value |
|---|---|---|
| `input` | Defines the input. Can be a string or an iterable of strings, such as a file object. | `sys.stdin` |
| `all_errors` (alias `grammar`) | Defines the level of correction. If False, only token-level annotation is carried out. If True, sentence-level annotation is carried out. | `True` |
| `annotate_unparsed_sentences` | If True, sentences that cannot be parsed are annotated in their entirety as errors. | `True` |
| `generate_suggestion_list` | If True, annotations can in certain cases contain a list of possible corrections, for the user to pick from. | `False` |
| `suppress_suggestions` | If True, more farfetched automatically suggested corrections are suppressed. | `False` |
| `ignore_wordlist` | The value is a set of strings to whitelist. Each string is a word that should not be marked as an error or corrected. The comparison is case-sensitive. | `set()` |
| `one_sent` | The input contains a single sentence only. Sentence splitting should not be attempted. | `False` |
| `ignore_rules` | A set of error codes that should be ignored in the annotation process. | `set()` |
| `tov_config` | Path to an additional configuration file that may be provided for correcting custom tone-of-voice issues. | `False` |

An overview of error codes is available [here](https://github.com/mideind/GreynirCorrect/blob/master/doc/errorcodes.rst).

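As a sketch of how these options combine, the helper below assembles a keyword dictionary for `check_errors()`. The helper name `build_options` and the choice of ignored code are illustrative assumptions; the option names are taken from the table and from the `check_errors()` example above.

```python
def build_options(text, skip_words=(), skip_codes=()):
    """Build keyword arguments for check_errors() (hypothetical helper)."""
    return {
        "input": text,
        "all_errors": True,                  # sentence-level annotation
        "annotations": True,
        "format": "text",
        "ignore_wordlist": set(skip_words),  # case-sensitive whitelist
        "ignore_rules": set(skip_codes),     # error codes to skip
    }


options = build_options(
    "Páli, vini mínum, langaði að horfa á sjónnvarpið.",
    skip_codes={"S004"},
)
# With the package installed, the call would then be:
#   from reynir_correct import check_errors
#   print(check_errors(**options))
```

Keeping the options in one dictionary makes it easy to reuse the same settings across several calls.
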
## Prerequisites

GreynirCorrect runs on CPython 3.9 or newer, and on PyPy 3.9 or newer. It has
been tested on Linux, macOS and Windows. The
[PyPI package](https://pypi.org/project/reynir-correct/)
includes binary wheels for common environments, but if the setup on your OS
requires compilation from sources, you may first need to install the Python
development headers, for example:

```bash
$ sudo apt-get install python3-dev
```

...or the equivalent for your system.

## Installation

To install this package (assuming you have Python >= 3.9 with `pip` installed):

```bash
$ pip install reynir-correct
```

If you want to be able to edit the source, clone the repository and install it
in editable mode (assuming you have `git` installed):

```bash
$ git clone https://github.com/mideind/GreynirCorrect
$ cd GreynirCorrect
$ # [ Activate your virtualenv here if you have one ]
$ pip install -e .
```

The package source code is now in `GreynirCorrect/src/reynir_correct`.

## The command line tool

After installation, the corrector can be invoked directly from the command line:

```bash
$ correct input.txt output.txt
```

...or:

```bash
$ echo "Þinngið samþikkti tilöguna" | correct
Þingið samþykkti tillöguna
```

Input and output files are encoded in UTF-8. If the files are not
given explicitly, `stdin` and `stdout` are used for input and output,
respectively.

Empty lines in the input are treated as sentence boundaries.

By default, the output consists of one sentence per line, where each
line ends with a single newline character (ASCII LF, `chr(10)`, `"\n"`).
Within each line, tokens are separated by spaces.

The following (mutually exclusive) options can be specified
on the command line:

| Option | Description |
|---|---|
| `--csv` | Output token objects in CSV format, one per line. Sentences are separated by lines containing `0,"",""` |
| `--json` | Output token objects in JSON format, one per line. |
| `--normalize` | Normalize punctuation, causing e.g. quotes to be output in Icelandic form and hyphens to be regularized. |
| `--grammar` | Output whole-sentence annotations, including corrections and suggestions for spelling and grammar. Each sentence in the input is output as a text line containing a JSON object, terminated by a newline. |

The CSV and JSON formats of token objects are identical to those documented
for the [Tokenizer package](https://github.com/mideind/Tokenizer).

The JSON format of whole-sentence annotations is identical to the one documented for
the [Yfirlestur.is HTTPS REST API](https://github.com/mideind/Yfirlestur#https-api).

Type `correct -h` to get a short help message.

### Command Line Examples

```bash
$ echo "Atvinuleysi jógst um 3%" | correct
Atvinnuleysi jókst um 3%
```

```bash
$ echo "Barnið vil grænann lit" | correct --csv
6,"Barnið",""
6,"vil",""
6,"grænan",""
6,"lit",""
0,"",""
```

Note how *vil* is not corrected, as it is a valid and common word, and
the `correct` command does not perform grammar checking by default.

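The CSV output shown above can be regrouped into sentences by splitting on the `0,"",""` separator rows. The following is a minimal sketch using only the standard library; `sentences_from_csv` is a hypothetical helper, and it assumes each row has the three fields seen in the example, with the token text in the second field.

```python
import csv


def sentences_from_csv(lines):
    """Group token texts into sentences, splitting on `0,"",""` rows."""
    sentences, current = [], []
    for row in csv.reader(lines):
        if not row:
            continue  # skip blank lines
        kind, text = row[0], row[1]
        if kind == "0":  # sentence separator row
            if current:
                sentences.append(current)
            current = []
        else:
            current.append(text)
    if current:  # input may end without a final separator
        sentences.append(current)
    return sentences


# The rows from the example above
sample = ['6,"Barnið",""', '6,"vil",""', '6,"grænan",""', '6,"lit",""', '0,"",""']
print(sentences_from_csv(sample))
```

The function accepts any iterable of lines, so a list of strings works as well as an open file.
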
```bash
$ echo "Pakkin er fyrir hestin" | correct --json
{"k":"BEGIN SENT"}
{"k":"WORD","t":"Pakkinn"}
{"k":"WORD","t":"er"}
{"k":"WORD","t":"fyrir"}
{"k":"WORD","t":"hestinn"}
{"k":"END SENT"}
```

To perform whole-sentence grammar checking and annotation as well as spell checking,
use the `--grammar` option:

```bash
$ echo "Ég kláraði verkefnið þrátt fyrir að ég var þreittur." | correct --grammar
{
   "original":"Ég kláraði verkefnið þrátt fyrir að ég var þreittur.",
   "corrected":"Ég kláraði verkefnið þrátt fyrir að ég var þreyttur.",
   "tokens":[
      {"k":6,"x":"Ég","o":"Ég"},
      {"k":6,"x":"kláraði","o":" kláraði"},
      {"k":6,"x":"verkefnið","o":" verkefnið"},
      {"k":6,"x":"þrátt fyrir","o":" þrátt fyrir"},
      {"k":6,"x":"","o":""},
      {"k":6,"x":"ég","o":" ég"},
      {"k":6,"x":"var","o":" var"},
      {"k":6,"x":"þreyttur","o":" þreittur"},
      {"k":1,"x":".","o":"."}
   ],
   "annotations":[
      {
         "start":6,
         "end":6,
         "start_char":35,
         "end_char":37,
         "code":"P_MOOD_ACK",
         "text":"Hér er réttara að nota viðtengingarhátt\n sagnarinnar 'vera', þ.e. 'væri'.",
         "detail":"Í viðurkenningarsetningum á borð við 'Z'\n í dæminu 'X gerði Y þrátt fyrir að Z' á sögnin að vera
            í viðtengingarhætti fremur en framsöguhætti.",
         "suggest":"væri"
      },
      {
         "start":7,
         "end":7,
         "start_char":38,
         "end_char":41,
         "code":"S004",
         "text":"Orðið 'þreittur' var leiðrétt í 'þreyttur'",
         "detail":"",
         "suggest":"þreyttur"
      }
   ]
}
```

The output has been formatted for legibility; each input sentence is actually
represented by a JSON object in a single line of text, terminated by a newline.

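Since each sentence occupies one line, the `--grammar` output can be consumed line by line with a standard JSON parser. A minimal sketch: the sample line is hand-written and abbreviated (not actual tool output), `annotation_codes` is a hypothetical helper, and the field names are those shown in the output above.

```python
import json


def annotation_codes(lines):
    """Yield (original sentence, list of error codes) for each JSON line."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        sentence = json.loads(line)
        codes = [ann["code"] for ann in sentence.get("annotations", [])]
        yield sentence["original"], codes


# Hand-written, abbreviated sample line (an assumption for illustration)
sample = ['{"original":"Ég var þreittur.","annotations":[{"code":"S004"}]}']
for original, codes in annotation_codes(sample):
    print(original, codes)
```
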
Note that the `corrected` field only includes token-level spelling correction
(in this case *þreittur* `->` *þreyttur*), but no grammar corrections.
The grammar corrections are found in the `annotations` list.
To apply corrections and suggestions from the annotations,
replace source text or tokens (as identified by the `start` and `end`,
or `start_char` and `end_char` properties) with the `suggest` field, if present.

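The replacement step can be sketched on character offsets as follows. This is an illustration under stated assumptions: `end_char` is treated as inclusive (following the description of `start_char` and `end_char` as the first and last character), annotations are assumed not to overlap, and the annotation dictionaries are hand-written rather than actual tool output. `apply_suggestions` is a hypothetical helper.

```python
def apply_suggestions(text, annotations):
    """Return `text` with each annotation's `suggest` value substituted."""
    result = text
    # Go from the rightmost annotation to the leftmost, so that offsets
    # of annotations not yet applied remain valid
    for ann in sorted(annotations, key=lambda a: a["start_char"], reverse=True):
        suggestion = ann.get("suggest")
        if not suggestion:
            continue  # annotation carries no replacement text
        # end_char is assumed to be inclusive, hence the + 1
        result = (
            result[: ann["start_char"]] + suggestion + result[ann["end_char"] + 1 :]
        )
    return result


# Hand-written annotations with offsets computed for this sample string
sample_annotations = [
    {"start_char": 0, "end_char": 5, "suggest": "Pakkinn"},
    {"start_char": 16, "end_char": 21, "suggest": "hestinn"},
]
print(apply_suggestions("Pakkin er fyrir hestin", sample_annotations))
```

Working from right to left matters because a replacement may change the length of the text, which would shift the offsets of any annotation further to the right.
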
## Tests

To run the built-in tests, install [pytest](https://docs.pytest.org/en/latest/),
`cd` to your `GreynirCorrect` subdirectory (and optionally activate your
virtualenv), then run:

```bash
$ python -m pytest
```

## Acknowledgements

Parts of this software are developed under the auspices of the
Icelandic Government's 5-year Language Technology Programme for Icelandic,
which is managed by Almannarómur and described
[here](https://www.stjornarradid.is/lisalib/getfile.aspx?itemid=56f6368e-54f0-11e7-941a-005056bc530c)
(English version [here](https://clarin.is/media/uploads/mlt-en.pdf)).

## Copyright and License

[![Miðeind ehf.](https://github.com/mideind/GreynirPackage/raw/master/doc/_static/MideindLogoVert100.png?raw=true)](https://mideind.is)

**Copyright © 2018-2025 Miðeind ehf.**

GreynirCorrect's original author is *Vilhjálmur Þorsteinsson*.

This software is licensed under the *MIT License*:

*Permission is hereby granted, free of charge, to any person
obtaining a copy of this software and associated documentation
files (the "Software"), to deal in the Software without restriction,
including without limitation the rights to use, copy, modify, merge,
publish, distribute, sublicense, and/or sell copies of the Software,
and to permit persons to whom the Software is furnished to do so,
subject to the following conditions:*

*The above copyright notice and this permission notice shall be
included in all copies or substantial portions of the Software.*

*THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.*

----

GreynirCorrect indirectly embeds the [Database of Icelandic Morphology](https://bin.arnastofnun.is)
([Beygingarlýsing íslensks nútímamáls](https://bin.arnastofnun.is))
along with directly using
[Ritmyndir](https://bin.arnastofnun.is/DMII/LTdata/comp-format/nonstand-form/),
a collection of non-standard word forms.
Miðeind does not claim any endorsement by the BÍN authors or copyright holders.

The BÍN source data are publicly available under the
[CC BY-SA 4.0 license](https://creativecommons.org/licenses/by-sa/4.0/), as further
detailed [here in English](https://bin.arnastofnun.is/DMII/LTdata/conditions/)
and [here in Icelandic](https://bin.arnastofnun.is/gogn/mimisbrunnur/).

In accordance with the BÍN license terms, credit is hereby given as follows:

*Beygingarlýsing íslensks nútímamáls. Stofnun Árna Magnússonar í íslenskum fræðum.*
*Höfundur og ritstjóri Kristín Bjarnadóttir.*
