lfcipriani · harrisj · Dec 5, 2014 · Jun 10, 2018 · Jun 10, 2018 · Jun 10, 2018
diff --git a/README.md b/README.md
@@ -1,11 +1,11 @@
 # Punkt sentence tokenizer
 
-This code is a ruby 1.9.x port of the Punkt sentence tokenizer algorithm implemented by the NLTK Project ([http://www.nltk.org/]). Punkt is a **language-independent**, unsupervised approach to **sentence boundary detection**. It is based on the assumption that a large number of ambiguities in the determination of sentence boundaries can be eliminated once abbreviations have been identiﬁed.
+This code is a Ruby port of the Punkt sentence tokenizer algorithm implemented by the NLTK Project ([http://www.nltk.org/]). Punkt is a **language-independent**, unsupervised approach to **sentence boundary detection**. It is based on the assumption that a large number of ambiguities in the determination of sentence boundaries can be eliminated once abbreviations have been identiﬁed.
 
 The full description of the algorithm is presented in the following academic paper:
 
-> Kiss, Tibor and Strunk, Jan (2006): Unsupervised Multilingual Sentence Boundary Detection.  
-> Computational Linguistics 32: 485-525.  
+> Kiss, Tibor and Strunk, Jan (2006): Unsupervised Multilingual Sentence Boundary Detection.
+> Computational Linguistics 32: 485-525.
 > [Download paper]
 
 Here are the credits for the original implementation:
@@ -21,7 +21,10 @@ I simply did the ruby port and some API changes.
 
     gem install punkt-segmenter
 
-Currently, this gem only runs on ruby 1.9.x (because of unicode_utils dependency)
+Currently, this gem only runs on Ruby 2.3 or above because it relies on
+Ruby's built-in Unicode string support. Use Ruby 2.4 or above for
+non-English text: before 2.4, String#downcase only converts ASCII
+letters, so non-ASCII abbreviations may not be matched reliably.
 
 ## How to use
 
@@ -31,10 +34,10 @@ Let's suppose we have the following text:
 
 You can separate in sentences using the Punkt::SentenceTokenizer object:
 
-    tokenizer = Punkt::SentenceTokenizer.new(text)
+    tokenizer = Punkt::SentenceTokenizer.new(:english)
     result    = tokenizer.sentences_from_text(text, :output => :sentences_text)
 
-The result will be:
+This loads parameters derived from training the tokenizer on the WSJ English corpus. Other languages are supported (see the data/ directory for a list). The result will be:
 
     result    = [
         [0] "A minute is a unit of measurement of time or of angle.",
@@ -43,14 +46,19 @@ The result will be:
         [3] "The minute is not an SI unit; however, it is accepted for use with SI units.",
         [4] "The symbol for minute or minutes is min.",
         [5] "The fact that an hour contains 60 minutes is probably due to influences from the Babylonians, who used a base-60 or sexagesimal counting system.",
-        [6] "Colloquially, a min. may also refer to an indefinite amount of time substantially longer than the standardized length."
+        [6] "Colloquially, a min.",
+        [7] "may also refer to an indefinite amount of time substantially longer than the standardized length."
     ]
 
-The algorithm uses the text passed as parameter to train and tokenize in sentences. Sometimes the size of the input text is not enough to have a well trained set, which may cause some mistakes on the sentences splitting. For these cases you can train the Punkt segmenter:
+Notice that "min." was treated as a sentence break: it is not an abbreviation seen in the WSJ corpus, so its period is taken to end the sentence. If you have a large enough text to analyze, you can pass it to the constructor to train the tokenizer on it instead:
+
+    tokenizer = Punkt::SentenceTokenizer.new(text)
+    result    = tokenizer.sentences_from_text(text, :output => :sentences_text)
+
+Note that the input text is sometimes too small to produce a well-trained model, which may cause mistakes in sentence splitting. You can also explicitly train the tokenizer on one text and analyze another.
 
     trainer = Punkt::Trainer.new()
     trainer.train(training_text)
-
     tokenizer = Punkt::SentenceTokenizer.new(trainer.parameters)
     result    = tokenizer.sentences_from_text(text, :output => :sentences_text)
 
@@ -62,9 +70,9 @@ The available options for *sentences_from_text* method are:
 
 - array of sentences indexes (default)
 - array of sentences string  (**:output => :sentences_text**)
-- array of sentences tokens  (**:output => :tokenized_sentences**)	
+- array of sentences tokens  (**:output => :tokenized_sentences**)
 - realigned boundaries (**:realign_boundaries => true**): do this if you want to realign sentences that end with, for example, parenthesis, quotes, brackets, etc
-	
+
 If you have a list of tokens, you can use the *sentences_from_tokens* method, which takes only the list of tokens as parameter.
 
 Check the unit tests for more detailed examples in English and Portuguese.
@@ -73,11 +81,8 @@ Check the unit tests for more detailed examples in English and Portuguese.
 *This code follows the terms and conditions of Apache License v2 (http://www.apache.org/licenses/LICENSE-2.0)*
 
 *Copyright (C) Luis Cipriani*
-  
+
   [http://www.nltk.org/]: http://www.nltk.org/
   [Download paper]: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.85.5017&rep=rep1&type=pdf
 
 
-
-[![Bitdeli Badge](https://d2weczhvl823v0.cloudfront.net/lfcipriani/punkt-segmenter/trend.png)](https://bitdeli.com/free "Bitdeli Badge")
-
diff --git a/data/README b/data/README
@@ -0,0 +1,78 @@
+Pretrained Punkt Models -- Jan Strunk (New version trained after issues 313 and 514 had been corrected)
+
+Most models were prepared using the test corpora from Kiss and Strunk (2006). Additional models have
+been contributed by various people using NLTK for sentence boundary detection.
+
+For information about how to use these models, please consult the tokenization HOWTO:
+http://nltk.googlecode.com/svn/trunk/doc/howto/tokenize.html
+and chapter 3.8 of the NLTK book:
+http://nltk.googlecode.com/svn/trunk/doc/book/ch03.html#sec-segmentation
+
+There are pretrained tokenizers for the following languages. These were originally provided as part of the Python NLTK package inside a pickled
+SentenceTokenizer class. I have extracted the language parameters as JSON data. Here are the training corpora used:
+
+File                Language            Source                             Contents                Size of training corpus (in tokens)          Model contributed by
+=======================================================================================================================================================================
+czech.pickle        Czech               Multilingual Corpus 1 (ECI)        Lidove Noviny                   ~345,000                             Jan Strunk / Tibor Kiss
+                                                                           Literarni Noviny
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
+danish.pickle       Danish              Avisdata CD-Rom Ver. 1.1. 1995     Berlingske Tidende              ~550,000                             Jan Strunk / Tibor Kiss
+                                        (Berlingske Avisdata, Copenhagen)  Weekend Avisen
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
+dutch.pickle        Dutch               Multilingual Corpus 1 (ECI)        De Limburger                    ~340,000                             Jan Strunk / Tibor Kiss
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
+english.pickle      English             Penn Treebank (LDC)                Wall Street Journal             ~469,000                             Jan Strunk / Tibor Kiss
+                    (American)
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
+estonian.pickle     Estonian            University of Tartu, Estonia       Eesti Ekspress                  ~359,000                             Jan Strunk / Tibor Kiss
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
+finnish.pickle      Finnish             Finnish Parole Corpus, Finnish     Books and major national        ~364,000                             Jan Strunk / Tibor Kiss
+                                        Text Bank (Suomen Kielen           newspapers
+                                        Tekstipankki)
+                                        Finnish Center for IT Science
+                                        (CSC)
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
+french.pickle       French              Multilingual Corpus 1 (ECI)        Le Monde                        ~370,000                             Jan Strunk / Tibor Kiss
+                    (European)
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
+german.pickle       German              Neue Zürcher Zeitung AG            Neue Zürcher Zeitung            ~847,000                             Jan Strunk / Tibor Kiss
+                    (Switzerland)       CD-ROM
+                    (Uses "ss"
+                     instead of "ß")
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
+greek.pickle        Greek               Efstathios Stamatatos              To Vima (TO BHMA)               ~227,000                             Jan Strunk / Tibor Kiss
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
+italian.pickle      Italian             Multilingual Corpus 1 (ECI)        La Stampa, Il Mattino           ~312,000                             Jan Strunk / Tibor Kiss
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
+norwegian.pickle    Norwegian           Centre for Humanities              Bergens Tidende                 ~479,000                             Jan Strunk / Tibor Kiss
+                    (Bokmål and         Information Technologies,
+                     Nynorsk)           Bergen
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
+polish.pickle       Polish              Polish National Corpus             Literature, newspapers, etc.  ~1,000,000                             Krzysztof Langner
+                                        (http://www.nkjp.pl/)
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
+portuguese.pickle   Portuguese          CETENFolha Corpus                  Folha de São Paulo              ~321,000                             Jan Strunk / Tibor Kiss
+                    (Brazilian)         (Linguateca)
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
+slovene.pickle      Slovene             TRACTOR                            Delo                            ~354,000                             Jan Strunk / Tibor Kiss
+                                        Slovene Academy for Arts
+                                        and Sciences
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
+spanish.pickle      Spanish             Multilingual Corpus 1 (ECI)        Sur                             ~353,000                             Jan Strunk / Tibor Kiss
+                    (European)
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
+swedish.pickle      Swedish             Multilingual Corpus 1 (ECI)        Dagens Nyheter                  ~339,000                             Jan Strunk / Tibor Kiss
+                                                                           (and some other texts)
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
+turkish.pickle      Turkish             METU Turkish Corpus                Milliyet                        ~333,000                             Jan Strunk / Tibor Kiss
+                                        (Türkçe Derlem Projesi)
+                                        University of Ankara
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
+
+The corpora contained about 400,000 tokens on average and mostly consisted of newspaper text converted to
+Unicode using the codecs module.
+
+Kiss, Tibor and Strunk, Jan (2006): Unsupervised Multilingual Sentence Boundary Detection.
+Computational Linguistics 32: 485-525.
+
+-
diff --git a/data/czech.json b/data/czech.json
diff --git a/data/danish.json b/data/danish.json
diff --git a/data/dump_pickle_json.py b/data/dump_pickle_json.py
@@ -0,0 +1,20 @@
+#!/usr/local/bin/python
+
+# This is the Python script I used to save JSON data from the NLTK pickle files. 
+
+import json
+import pickle
+
+langs = ('czech', 'danish', 'dutch', 'english', 'estonian', 'finnish', 'french', 'german', 'greek', 'italian', 'norwegian', 'polish', 'portuguese', 'slovene', 'spanish', 'swedish', 'turkish')
+
+# Originally used this for a Go project.
+for l in langs:
+  print(l)
+  src_file = "/nltk_data/tokenizers/punkt/" + l + ".pickle"
+  dest_file = "/code/gocode/src/github.com/harrisj/punkt/data/" + l + ".json"
+  p = pickle.load(open(src_file,"rb"))
+
+  data = {"sentence_starters": list(p._params.sent_starters), "collocations": list(p._params.collocations), "abbrev_types": list(p._params.abbrev_types), "ortho_context": p._params.ortho_context}
+
+  with open(dest_file, 'w') as fp:
+    json.dump(data, fp)
diff --git a/data/dutch.json b/data/dutch.json
diff --git a/data/english.json b/data/english.json
diff --git a/data/estonian.json b/data/estonian.json
diff --git a/data/finnish.json b/data/finnish.json
diff --git a/data/french.json b/data/french.json
diff --git a/data/german.json b/data/german.json
diff --git a/data/greek.json b/data/greek.json
diff --git a/data/italian.json b/data/italian.json
diff --git a/data/norwegian.json b/data/norwegian.json
diff --git a/data/polish.json b/data/polish.json
diff --git a/data/portuguese.json b/data/portuguese.json
diff --git a/data/slovene.json b/data/slovene.json
diff --git a/data/spanish.json b/data/spanish.json
diff --git a/data/swedish.json b/data/swedish.json
diff --git a/data/turkish.json b/data/turkish.json
diff --git a/lib/punkt-segmenter.rb b/lib/punkt-segmenter.rb
@@ -1,13 +1,12 @@
-if RUBY_VERSION >= "1.9"
+if Gem::Version.new(RUBY_VERSION) >= Gem::Version.new("2.3")
   $:.unshift(File.dirname(__FILE__)) unless $:.include?(File.dirname(__FILE__))
 
   # Dependencies
-  require "unicode_utils"
   require "set"
 
   # Lib requires
   require "punkt-segmenter/frequency_distribution"
   require "punkt-segmenter/punkt"
 else
-  raise "This gem requires Ruby 1.9 or superior."
-end
+  raise "This gem requires Ruby 2.3 or later."
+end
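
A version guard like the one in lib/punkt-segmenter.rb can compare either raw strings or Gem::Version objects. The sketch below shows how the two differ once a version component has more than one digit. `version_at_least?` is a hypothetical helper written for this note, not part of the gem.

```ruby
# Sketch: comparing version strings vs. Gem::Version objects.
# `version_at_least?` is a hypothetical helper, not part of punkt-segmenter.
require "rubygems"

def version_at_least?(current, minimum)
  Gem::Version.new(current) >= Gem::Version.new(minimum)
end

# Plain String comparison is lexicographic, so "10.0" sorts before "2.3".
string_says = "10.0" >= "2.3"

# Gem::Version compares each numeric segment as an integer.
version_says = version_at_least?("10.0", "2.3")

puts "String comparison:       #{string_says}"
puts "Gem::Version comparison: #{version_says}"
```

Both forms agree for the Ruby versions that exist today (2.3 through 3.x), so the difference only matters for future multi-digit major or minor versions.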
diff --git a/lib/punkt-segmenter/punkt/base.rb b/lib/punkt-segmenter/punkt/base.rb
@@ -1,14 +1,14 @@
 module Punkt
   class Base
-    def initialize(language_vars = Punkt::LanguageVars.new, 
+    def initialize(language_vars = Punkt::LanguageVars.new,
                    token_class   = Punkt::Token,
                    parameters    = Punkt::Parameters.new)
-                   
+
       @parameters    = parameters
       @language_vars = language_vars
       @token_class   = token_class
     end
-      
+
     def tokenize_words(plain_text, options = {})
       return @language_vars.word_tokenize(plain_text) if options[:output] == :string
       result = []
@@ -21,17 +21,16 @@ def tokenize_words(plain_text, options = {})
                            :line_start      => true)
           paragraph_start = false
           line_tokens.map! { |token| @token_class.new(token) }.unshift(first_token)
-          
+
           result += line_tokens
         else
           paragraph_start = true
         end
       end
       return result
     end
-
-  private 
-
+
+  private
     def annotate_first_pass(tokens)
       tokens.each do |aug_token|
         tok = aug_token.token
@@ -41,17 +40,16 @@ def annotate_first_pass(tokens)
         elsif aug_token.is_ellipsis?
           aug_token.ellipsis = true
         elsif aug_token.ends_with_period? && !tok.end_with?("..")
-          tok_low = UnicodeUtils.downcase(tok.chop)
+          tok_low = tok.chop.downcase
           if @parameters.abbreviation_types.include?(tok_low) || @parameters.abbreviation_types.include?(tok_low.split("-")[-1])
             aug_token.abbr = true
           else
             aug_token.sentence_break = true
           end
         end
-
       end
     end
-    
+
     def pair_each(list, &block)
       previous = list[0]
       list[1..list.size-1].each do |item|
@@ -60,6 +58,5 @@ def pair_each(list, &block)
       end
       yield(previous, nil)
     end
-
   end
 end
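
The period handling in `annotate_first_pass` can be read as a small decision rule: strip the final period, lowercase the rest, and look it up (whole, or its last hyphen-separated part) in the known abbreviation types. The following is a minimal standalone sketch of that rule, assuming the abbreviation types are held in a Set of lowercase strings. `classify_period_token` and the sample abbreviations are hypothetical and exist only for illustration.

```ruby
require "set"

# Hypothetical stand-in for the first-pass decision on a single token.
# Returns :abbreviation, :sentence_break, or :skipped.
def classify_period_token(token, abbreviation_types)
  # Only tokens ending in exactly one period are considered here.
  return :skipped unless token.end_with?(".")
  return :skipped if token.end_with?("..")

  # On Ruby 2.4+, String#downcase also maps non-ASCII letters.
  low = token.chop.downcase
  if abbreviation_types.include?(low) ||
     abbreviation_types.include?(low.split("-")[-1])
    :abbreviation
  else
    :sentence_break
  end
end

# Made-up abbreviation list; "s\u00E9c" is "séc" written with an escape.
abbreviations = Set.new(["dr", "min", "s\u00E9c"])

["Dr.", "time.", "co-min.", "wait...", "S\u00C9C."].each do |tok|
  puts "#{tok} => #{classify_period_token(tok, abbreviations)}"
end
```

The last sample token only matches on Ruby 2.4 or later, since earlier versions leave the uppercase "É" untouched when downcasing.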
diff --git a/lib/punkt-segmenter/punkt/parameters.rb b/lib/punkt-segmenter/punkt/parameters.rb
@@ -1,3 +1,6 @@
+require 'rubygems'
+require 'json'
+
 module Punkt
   class Parameters
 
@@ -32,6 +35,24 @@ def clear_orthographic_context
     def add_orthographic_context(type, flag)
       @orthographic_context[type] |= flag
     end
-
+
+    def self.load_language(language)
+      data_path = File.join(File.dirname(__FILE__), "..", "..", "..", "data", "#{language}.json")
+
+      # File.read avoids Kernel#open, which treats a leading "|" as a command
+      json_body = File.read(data_path)
+      json = JSON.parse(json_body)
+
+      # Copy the parsed data into a fresh Parameters instance
+      p = new
+
+      json["sentence_starters"].each {|s| p.sentence_starters << s}
+      json["abbrev_types"].each {|a| p.abbreviation_types << a}
+      json["collocations"].each {|a| p.collocations << a}
+
+      json["ortho_context"].each {|k,v| p.orthographic_context[k] = v }
+
+      p
+    end
   end
 end
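
For reviewers who want to try the JSON layout without the gem installed, here is a rough standalone sketch of the same loading idea. It assumes the four keys written by data/dump_pickle_json.py (`sentence_starters`, `abbrev_types`, `collocations`, `ortho_context`). `SketchParameters`, `load_parameters`, the default of 0 for unseen orthographic-context entries, and the sample data are all hypothetical and are not the gem's actual classes or values.

```ruby
require "json"
require "set"
require "tempfile"

# Hypothetical stand-in for Punkt::Parameters, holding only the loaded fields.
SketchParameters = Struct.new(:sentence_starters, :abbreviation_types,
                              :collocations, :orthographic_context)

# Reads a JSON parameter file and copies each key into the struct.
def load_parameters(path)
  json = JSON.parse(File.read(path))

  # Assumption: unseen types default to an orthographic-context flag of 0.
  ortho = Hash.new(0)
  json["ortho_context"].each { |type, flags| ortho[type] = flags }

  SketchParameters.new(
    Set.new(json["sentence_starters"]),
    Set.new(json["abbrev_types"]),
    Set.new(json["collocations"]),   # each collocation is a two-element array
    ortho
  )
end

# Made-up sample data in the same shape as the data/*.json files.
sample = {
  "sentence_starters" => ["however"],
  "abbrev_types"      => ["dr", "u.s"],
  "collocations"      => [["u.s", "government"]],
  "ortho_context"     => { "minute" => 46 }
}

file = Tempfile.new(["sample", ".json"])
file.write(JSON.generate(sample))
file.close

params = load_parameters(file.path)
puts "knows 'dr' as abbreviation: #{params.abbreviation_types.include?('dr')}"
puts "ortho flags for 'minute':   #{params.orthographic_context['minute']}"
```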