35 changes: 20 additions & 15 deletions README.md
@@ -1,11 +1,11 @@
# Punkt sentence tokenizer

This code is a ruby 1.9.x port of the Punkt sentence tokenizer algorithm implemented by the NLTK Project ([http://www.nltk.org/]). Punkt is a **language-independent**, unsupervised approach to **sentence boundary detection**. It is based on the assumption that a large number of ambiguities in the determination of sentence boundaries can be eliminated once abbreviations have been identified.
This code is a Ruby port of the Punkt sentence tokenizer algorithm implemented by the NLTK Project ([http://www.nltk.org/]). Punkt is a **language-independent**, unsupervised approach to **sentence boundary detection**. It is based on the assumption that a large number of ambiguities in the determination of sentence boundaries can be eliminated once abbreviations have been identified.

The full description of the algorithm is presented in the following academic paper:

> Kiss, Tibor and Strunk, Jan (2006): Unsupervised Multilingual Sentence Boundary Detection.
> Computational Linguistics 32: 485-525.
> Kiss, Tibor and Strunk, Jan (2006): Unsupervised Multilingual Sentence Boundary Detection.
> Computational Linguistics 32: 485-525.
> [Download paper]

Here are the credits for the original implementation:
@@ -21,7 +21,10 @@ I simply did the ruby port and some API changes.

gem install punkt-segmenter

Currently, this gem only runs on ruby 1.9.x (because of unicode_utils dependency)
Currently, this gem only runs on Ruby 2.3 or above because it relies on
Ruby's built-in Unicode string handling. You really should use Ruby 2.4 or
above for non-English languages, since String#downcase only lowercases
ASCII characters in earlier versions.
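
To see why the Ruby version matters for non-English text, here is a small check of `String#downcase` on non-ASCII input. It is only an illustration: `lowercase_types` is a made-up helper, the sample words are invented, and the behavior described in the comments assumes Ruby 2.4 or later.

```ruby
# Illustrative only: lowercase candidate abbreviation types the way a
# tokenizer might before looking them up. Not part of the gem.
def lowercase_types(words)
  words.map(&:downcase)
end

# On Ruby 2.4+ the accented and Greek capitals are lowercased as well;
# on older versions only the ASCII letters would change.
puts lowercase_types(["Dr", "ÉTC", "ΚΑΙ"]).inspect
```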

## How to use

@@ -31,10 +34,10 @@ Let's suppose we have the following text:

You can separate in sentences using the Punkt::SentenceTokenizer object:

tokenizer = Punkt::SentenceTokenizer.new(text)
tokenizer = Punkt::SentenceTokenizer.new(:english)
result = tokenizer.sentences_from_text(text, :output => :sentences_text)

The result will be:
This loads parameters derived from training the tokenizer on the Wall Street Journal (WSJ) English corpus. Other languages are supported; see the data/ directory for the list. The result will be:

result = [
[0] "A minute is a unit of measurement of time or of angle.",
@@ -43,14 +46,19 @@ The result will be:
[3] "The minute is not an SI unit; however, it is accepted for use with SI units.",
[4] "The symbol for minute or minutes is min.",
[5] "The fact that an hour contains 60 minutes is probably due to influences from the Babylonians, who used a base-60 or sexagesimal counting system.",
[6] "Colloquially, a min. may also refer to an indefinite amount of time substantially longer than the standardized length."
[6] "Colloquially, a min.",
[7] "may also refer to an indefinite amount of time substantially longer than the standardized length."
]

The algorithm uses the text passed as parameter to train and tokenize in sentences. Sometimes the size of the input text is not enough to have a well trained set, which may cause some mistakes on the sentences splitting. For these cases you can train the Punkt segmenter:
Notice that "min." was treated as a sentence break. It is not an abbreviation seen in the WSJ corpus, so the tokenizer considers it the end of a sentence. If you have a large enough text to analyze, you can pass it to the constructor to train the tokenizer on that text instead:

tokenizer = Punkt::SentenceTokenizer.new(text)
result = tokenizer.sentences_from_text(text, :output => :sentences_text)

Note that sometimes the input text is too small to produce a well-trained parameter set, which may cause mistakes in sentence splitting. You can also explicitly train the tokenizer on one text and analyze another:

trainer = Punkt::Trainer.new()
trainer.train(training_text)

tokenizer = Punkt::SentenceTokenizer.new(trainer.parameters)
result = tokenizer.sentences_from_text(text, :output => :sentences_text)

@@ -62,9 +70,9 @@ The available options for *sentences_from_text* method are:

- array of sentence indexes (default)
- array of sentence strings (**:output => :sentences_text**)
- array of sentence tokens (**:output => :tokenized_sentences**)
- array of sentence tokens (**:output => :tokenized_sentences**)
- realigned boundaries (**:realign_boundaries => true**): use this if you want to realign sentences that end with, for example, parentheses, quotes, or brackets

If you have a list of tokens, you can use the *sentences_from_tokens* method, which takes only the list of tokens as parameter.

Check the unit tests for more detailed examples in English and Portuguese.
@@ -73,11 +81,8 @@ Check the unit tests for more detailed examples in English and Portuguese.
*This code follows the terms and conditions of Apache License v2 (http://www.apache.org/licenses/LICENSE-2.0)*

*Copyright (C) Luis Cipriani*

[http://www.nltk.org/]: http://www.nltk.org/
[Download paper]: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.85.5017&rep=rep1&type=pdf



[![Bitdeli Badge](https://d2weczhvl823v0.cloudfront.net/lfcipriani/punkt-segmenter/trend.png)](https://bitdeli.com/free "Bitdeli Badge")

78 changes: 78 additions & 0 deletions data/README
@@ -0,0 +1,78 @@
Pretrained Punkt Models -- Jan Strunk (New version trained after issues 313 and 514 had been corrected)

Most models were prepared using the test corpora from Kiss and Strunk (2006). Additional models have
been contributed by various people using NLTK for sentence boundary detection.

For information about how to use these models, please consult the tokenization HOWTO:
http://nltk.googlecode.com/svn/trunk/doc/howto/tokenize.html
and chapter 3.8 of the NLTK book:
http://nltk.googlecode.com/svn/trunk/doc/book/ch03.html#sec-segmentation

There are pretrained tokenizers for the following languages. These were originally provided as part of the Python NLTK package inside a pickled
SentenceTokenizer class. I have extracted the language parameters as JSON data. Here are the training corpora used:

File                Language          Source                               Contents                      Size of training corpus (in tokens)  Model contributed by
=======================================================================================================================================================================
czech.pickle        Czech             Multilingual Corpus 1 (ECI)          Lidove Noviny                 ~345,000                             Jan Strunk / Tibor Kiss
                                                                           Literarni Noviny
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
danish.pickle       Danish            Avisdata CD-Rom Ver. 1.1. 1995       Berlingske Tidende            ~550,000                             Jan Strunk / Tibor Kiss
                                      (Berlingske Avisdata, Copenhagen)    Weekend Avisen
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
dutch.pickle        Dutch             Multilingual Corpus 1 (ECI)          De Limburger                  ~340,000                             Jan Strunk / Tibor Kiss
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
english.pickle      English           Penn Treebank (LDC)                  Wall Street Journal           ~469,000                             Jan Strunk / Tibor Kiss
                    (American)
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
estonian.pickle     Estonian          University of Tartu, Estonia         Eesti Ekspress                ~359,000                             Jan Strunk / Tibor Kiss
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
finnish.pickle      Finnish           Finnish Parole Corpus, Finnish       Books and major national      ~364,000                             Jan Strunk / Tibor Kiss
                                      Text Bank (Suomen Kielen             newspapers
                                      Tekstipankki)
                                      Finnish Center for IT Science
                                      (CSC)
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
french.pickle       French            Multilingual Corpus 1 (ECI)          Le Monde                      ~370,000                             Jan Strunk / Tibor Kiss
                    (European)
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
german.pickle       German            Neue Zürcher Zeitung AG              Neue Zürcher Zeitung          ~847,000                             Jan Strunk / Tibor Kiss
                    (Switzerland)                                          CD-ROM
                                                                           (Uses "ss"
                                                                           instead of "ß")
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
greek.pickle        Greek             Efstathios Stamatatos                To Vima (TO BHMA)             ~227,000                             Jan Strunk / Tibor Kiss
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
italian.pickle      Italian           Multilingual Corpus 1 (ECI)          La Stampa, Il Mattino         ~312,000                             Jan Strunk / Tibor Kiss
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
norwegian.pickle    Norwegian         Centre for Humanities                Bergens Tidende               ~479,000                             Jan Strunk / Tibor Kiss
                    (Bokmål and       Information Technologies,
                    Nynorsk)          Bergen
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
polish.pickle       Polish            Polish National Corpus               Literature, newspapers, etc.  ~1,000,000                           Krzysztof Langner
                                      (http://www.nkjp.pl/)
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
portuguese.pickle   Portuguese        CETENFolha Corpus                    Folha de São Paulo            ~321,000                             Jan Strunk / Tibor Kiss
                    (Brazilian)       (Linguateca)
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
slovene.pickle      Slovene           TRACTOR                              Delo                          ~354,000                             Jan Strunk / Tibor Kiss
                                      Slovene Academy for Arts
                                      and Sciences
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
spanish.pickle      Spanish           Multilingual Corpus 1 (ECI)          Sur                           ~353,000                             Jan Strunk / Tibor Kiss
                    (European)
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
swedish.pickle      Swedish           Multilingual Corpus 1 (ECI)          Dagens Nyheter                ~339,000                             Jan Strunk / Tibor Kiss
                                                                           (and some other texts)
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
turkish.pickle      Turkish           METU Turkish Corpus                  Milliyet                      ~333,000                             Jan Strunk / Tibor Kiss
                                      (Türkçe Derlem Projesi)
                                      University of Ankara
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------

The corpora contained about 400,000 tokens on average and mostly consisted of newspaper text converted to
Unicode using the codecs module.

Kiss, Tibor and Strunk, Jan (2006): Unsupervised Multilingual Sentence Boundary Detection.
Computational Linguistics 32: 485-525.

1 change: 1 addition & 0 deletions data/czech.json

Large diffs are not rendered by default.

1 change: 1 addition & 0 deletions data/danish.json

Large diffs are not rendered by default.

20 changes: 20 additions & 0 deletions data/dump_pickle_json.py
@@ -0,0 +1,20 @@
#!/usr/local/bin/python

# This is the Python script I used to save JSON data from the NLTK pickle files.

import json
import pickle

langs = ('czech', 'danish', 'dutch', 'english', 'estonian', 'finnish', 'french', 'german', 'greek', 'italian', 'norwegian', 'polish', 'portuguese', 'slovene', 'spanish', 'swedish', 'turkish')

# Originally used this for a Go project.
for l in langs:
    print(l)
    src_file = "/nltk_data/tokenizers/punkt/" + l + ".pickle"
    dest_file = "/code/gocode/src/github.com/harrisj/punkt/data/" + l + ".json"

    # Unpickling requires NLTK to be importable, since the pickle holds an NLTK tokenizer object.
    with open(src_file, "rb") as src:
        p = pickle.load(src)

    # Sets are converted to lists so that they can be serialized as JSON.
    data = {"sentence_starters": list(p._params.sent_starters), "collocations": list(p._params.collocations), "abbrev_types": list(p._params.abbrev_types), "ortho_context": p._params.ortho_context}

    with open(dest_file, 'w') as fp:
        json.dump(data, fp)
1 change: 1 addition & 0 deletions data/dutch.json

Large diffs are not rendered by default.

1 change: 1 addition & 0 deletions data/english.json

Large diffs are not rendered by default.

1 change: 1 addition & 0 deletions data/estonian.json

Large diffs are not rendered by default.

1 change: 1 addition & 0 deletions data/finnish.json

Large diffs are not rendered by default.

1 change: 1 addition & 0 deletions data/french.json

Large diffs are not rendered by default.

1 change: 1 addition & 0 deletions data/german.json

Large diffs are not rendered by default.

1 change: 1 addition & 0 deletions data/greek.json

Large diffs are not rendered by default.

1 change: 1 addition & 0 deletions data/italian.json

Large diffs are not rendered by default.

1 change: 1 addition & 0 deletions data/norwegian.json

Large diffs are not rendered by default.

1 change: 1 addition & 0 deletions data/polish.json

Large diffs are not rendered by default.

1 change: 1 addition & 0 deletions data/portuguese.json

Large diffs are not rendered by default.

1 change: 1 addition & 0 deletions data/slovene.json

Large diffs are not rendered by default.

1 change: 1 addition & 0 deletions data/spanish.json

Large diffs are not rendered by default.

1 change: 1 addition & 0 deletions data/swedish.json

Large diffs are not rendered by default.

1 change: 1 addition & 0 deletions data/turkish.json

Large diffs are not rendered by default.

7 changes: 3 additions & 4 deletions lib/punkt-segmenter.rb
@@ -1,13 +1,12 @@
if RUBY_VERSION >= "1.9"
if RUBY_VERSION >= "2.3"
$:.unshift(File.dirname(__FILE__)) unless $:.include?(File.dirname(__FILE__))

# Dependencies
require "unicode_utils"
require "set"

# Lib requires
require "punkt-segmenter/frequency_distribution"
require "punkt-segmenter/punkt"
else
raise "This gem requires Ruby 1.9 or superior."
end
raise "This gem requires Ruby 2.3 or superior."
end
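
One caveat with the version guard in this file: `RUBY_VERSION >= "2.3"` compares strings character by character, so a hypothetical version such as "10.0" would sort before "2.3". A minimal sketch of a numeric comparison using `Gem::Version` (shipped with RubyGems) follows. `ruby_at_least?` is a made-up helper name, not something this gem defines.

```ruby
require "rubygems" # normally loaded already; harmless if so

# Sketch: compare version numbers segment by segment instead of as strings.
def ruby_at_least?(minimum, current = RUBY_VERSION)
  Gem::Version.new(current) >= Gem::Version.new(minimum)
end

puts ruby_at_least?("2.3")          # check for the running interpreter
puts "10.0" >= "2.3"                # string comparison: false
puts ruby_at_least?("2.3", "10.0")  # numeric comparison: true
```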
19 changes: 8 additions & 11 deletions lib/punkt-segmenter/punkt/base.rb
@@ -1,14 +1,14 @@
module Punkt
class Base
def initialize(language_vars = Punkt::LanguageVars.new,
def initialize(language_vars = Punkt::LanguageVars.new,
token_class = Punkt::Token,
parameters = Punkt::Parameters.new)

@parameters = parameters
@language_vars = language_vars
@token_class = token_class
end

def tokenize_words(plain_text, options = {})
return @language_vars.word_tokenize(plain_text) if options[:output] == :string
result = []
@@ -21,17 +21,16 @@ def tokenize_words(plain_text, options = {})
:line_start => true)
paragraph_start = false
line_tokens.map! { |token| @token_class.new(token) }.unshift(first_token)

result += line_tokens
else
paragraph_start = true
end
end
return result
end

private


private
def annotate_first_pass(tokens)
tokens.each do |aug_token|
tok = aug_token.token
@@ -41,17 +40,16 @@ def annotate_first_pass(tokens)
elsif aug_token.is_ellipsis?
aug_token.ellipsis = true
elsif aug_token.ends_with_period? && !tok.end_with?("..")
tok_low = UnicodeUtils.downcase(tok.chop)
tok_low = tok.chop.downcase
if @parameters.abbreviation_types.include?(tok_low) || @parameters.abbreviation_types.include?(tok_low.split("-")[-1])
aug_token.abbr = true
else
aug_token.sentence_break = true
end
end

end
end

def pair_each(list, &block)
previous = list[0]
list[1..list.size-1].each do |item|
@@ -60,6 +58,5 @@ def pair_each(list, &block)
end
yield(previous, nil)
end

end
end
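
The abbreviation lookup in `annotate_first_pass` can be sketched in isolation as follows. This is a simplified stand-in that works on plain strings rather than `Punkt::Token` objects; `classify_period_token` and the sample abbreviation set are hypothetical and exist only for illustration.

```ruby
require "set"

# Hypothetical stand-in for the lookup in annotate_first_pass.
# Expects a token ending in a single period and returns
# :abbreviation or :sentence_break.
def classify_period_token(token, abbreviation_types)
  type = token.chop.downcase        # drop the trailing "." and lowercase
  last_part = type.split("-")[-1]   # e.g. "ex-dr" -> "dr"
  if abbreviation_types.include?(type) || abbreviation_types.include?(last_part)
    :abbreviation
  else
    :sentence_break
  end
end

known = Set.new(["dr", "mr", "u.s"])
puts classify_period_token("Dr.", known)
puts classify_period_token("ex-Dr.", known)
puts classify_period_token("min.", known)
```

With this sample set, "min." falls through to `:sentence_break` because it is not a known abbreviation type, which mirrors the README example where "min." ends a sentence under the pretrained English parameters.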
23 changes: 22 additions & 1 deletion lib/punkt-segmenter/punkt/parameters.rb
@@ -1,3 +1,6 @@
require 'rubygems'
require 'json'

module Punkt
class Parameters

@@ -32,6 +35,24 @@ def clear_orthographic_context
def add_orthographic_context(type, flag)
@orthographic_context[type] |= flag
end


def self.load_language(language)
data_path = File.join(File.dirname(__FILE__), "..", "..", "..", "data", "#{language}.json")

json_body = ""
open(data_path) {|file| json_body = file.read }
json = JSON.parse(json_body)

# let's load
p = new

json["sentence_starters"].each {|s| p.sentence_starters << s}
json["abbrev_types"].each {|a| p.abbreviation_types << a}
json["collocations"].each {|a| p.collocations << a}

json["ortho_context"].each {|k,v| p.orthographic_context[k] = v }

p
end
end
end
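
For reviewers who want to try the loading logic without the gem, here is a self-contained sketch of the same idea. It assumes the JSON layout written by data/dump_pickle_json.py (`sentence_starters`, `abbrev_types`, `collocations`, `ortho_context`). `LoadedParams` and `load_params` are hypothetical names, and the default of 0 for unknown orthographic contexts is an assumption, not confirmed behavior of `Punkt::Parameters`.

```ruby
require "json"
require "set"
require "tempfile"

# Hypothetical container; the real Punkt::Parameters class may differ.
LoadedParams = Struct.new(:sentence_starters, :abbreviation_types,
                          :collocations, :orthographic_context)

# Read a language file and copy its four sections into Ruby collections.
def load_params(path)
  json = JSON.parse(File.read(path))
  ortho = Hash.new(0) # assumption: unseen types have no context flags
  ortho.update(json["ortho_context"])
  LoadedParams.new(Set.new(json["sentence_starters"]),
                   Set.new(json["abbrev_types"]),
                   Set.new(json["collocations"]),
                   ortho)
end

# Usage with a tiny made-up language file.
Tempfile.create(["sample", ".json"]) do |file|
  file.write(JSON.generate("sentence_starters" => ["the"],
                           "abbrev_types" => ["dr", "mr"],
                           "collocations" => [["u.s", "army"]],
                           "ortho_context" => { "the" => 46 }))
  file.flush
  params = load_params(file.path)
  puts params.abbreviation_types.include?("dr")
end
```

Note that JSON has no tuple type, so each collocation arrives as a two-element array.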