Open
47 commits
3654f79
Merge pull request #12 from vi3k6i5/develop
vi3k6i5 Nov 10, 2017
105b24a
Merge pull request #13 from vi3k6i5/develop
vi3k6i5 Nov 10, 2017
43e972a
Merge pull request #25 from vi3k6i5/develop
vi3k6i5 Nov 19, 2017
86f01ef
Added test case to get all keywords from kp
vi3k6i5 Nov 19, 2017
92ebec9
added test case to check term in kp
vi3k6i5 Nov 19, 2017
b5bf2aa
added test case to check term in kp
vi3k6i5 Nov 19, 2017
f2440b9
added test case coverage
vi3k6i5 Nov 19, 2017
5478cfd
Merge pull request #26 from vi3k6i5/develop
vi3k6i5 Nov 19, 2017
90d3cd1
added citation
vi3k6i5 Nov 21, 2017
06066e0
Merge pull request #27 from vi3k6i5/develop
vi3k6i5 Nov 21, 2017
ae2d85d
correccted citation
vi3k6i5 Nov 21, 2017
9adde82
Merge pull request #28 from vi3k6i5/develop
vi3k6i5 Nov 21, 2017
7090989
added feature to return span_info when using extract_keywords
vi3k6i5 Nov 21, 2017
77d6bc7
added docs for span_info
vi3k6i5 Nov 21, 2017
f18c3c8
updated version to 2.5
vi3k6i5 Nov 21, 2017
2255b50
Merge pull request #29 from vi3k6i5/develop
vi3k6i5 Nov 21, 2017
a966304
imporved example of span_info
vi3k6i5 Nov 21, 2017
1c970b5
added example for getting extra info with extract keywords
vi3k6i5 Dec 7, 2017
8adc57d
Fix typos:
delirious-lettuce Dec 15, 2017
c8a01ae
Merge pull request #41 from delirious-lettuce/fix_typos
vi3k6i5 Dec 18, 2017
eb2c6ca
added fix for encoding of file
vi3k6i5 Jan 19, 2018
4d1ed19
Merge branch 'master' of github.com:vi3k6i5/flashtext
vi3k6i5 Jan 19, 2018
599a836
Fix issue with incomplete keyword at the end of the sentence
killfactory Jan 21, 2018
dec72ad
added comment for encoding parameter
vi3k6i5 Jan 26, 2018
33355ce
Merge pull request #45 from killfactory/master
vi3k6i5 Jan 26, 2018
5591859
added bug fix for https://github.com/vi3k6i5/flashtext/issues/47
vi3k6i5 Feb 16, 2018
ac25fc3
Performances improvement for string manipulation
rompom Jun 21, 2018
50c45f1
Merge pull request #55 from rompom/master
vi3k6i5 Nov 9, 2018
241b158
first implementation of fuzzy matching via leventhstein, first tests …
remiadon Jun 1, 2019
9eb55f4
fuzzy extract_keywords : [FIX] mismatch on first character
remiadon Jun 2, 2019
a28b620
fuzzy extract_keywords : small fix on unused variable
remiadon Jun 5, 2019
d4c7e79
fuzzy extract_keywords : [ADD] testing for intermediate matches
remiadon Jun 5, 2019
8b22cf0
fuzzy extract_keywords : [ADD] test for substitutions
remiadon Jun 5, 2019
a61dcbf
fuzzy extract_keywords : simplify code for intermediate matches
remiadon Jun 5, 2019
bb56cb6
fuzzy extract_keywords : improve performance for intermediate matches
remiadon Jun 6, 2019
fec427c
fuzzy extract_keywords : [ADD] examples for levensthein and fix mista…
remiadon Jun 6, 2019
bf3b689
fuzzy matching: [ADD] test for get_next_word
remiadon Jun 6, 2019
e8372c8
fuzzy extract: [ADD] test for intermediate match
remiadon Jun 6, 2019
c3c2ee6
fuzzy matching: [FIX] last test on intermediate match
remiadon Jun 6, 2019
7a4190d
[ADD] fuzzy replacement and tests for it
remiadon Jun 13, 2019
6a0aaaf
fuzzy_matching [FIX] typos and import in tests
remiadon Jun 13, 2019
097cc0d
fuzzy_matching replacement [ADD] test for intermediate match
remiadon Jun 13, 2019
9e1b5f4
fuzzy replacement [ADD] improve performance and [FIX] test for interm…
remiadon Jun 13, 2019
a5d8b8d
[FIX] fuzzy replacement case sensitive when tiggering levenshtein
remiadon Jun 17, 2019
9cc6d0b
[FIX] case for match while remaining subword
remiadon Jan 7, 2020
b316c7e
Merge pull request #84 from Quantmetry/fuzzy_matching
vi3k6i5 May 3, 2020
dff79df
Added add_keywords_from_file_dict function
rudrashisgorai Jul 26, 2021
4 changes: 3 additions & 1 deletion .coveragerc
@@ -1,2 +1,4 @@
[run]
-omit = test/*
+omit =
+    test/*
+    setup.py
40 changes: 37 additions & 3 deletions README.rst
@@ -67,6 +67,24 @@ Case Sensitive example
>>> keywords_found
>>> # ['Bay Area']

Span of keywords extracted
>>> from flashtext import KeywordProcessor
>>> keyword_processor = KeywordProcessor()
>>> keyword_processor.add_keyword('Big Apple', 'New York')
>>> keyword_processor.add_keyword('Bay Area')
>>> keywords_found = keyword_processor.extract_keywords('I love big Apple and Bay Area.', span_info=True)
>>> keywords_found
>>> # [('New York', 7, 16), ('Bay Area', 21, 29)]
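
The offsets returned with ``span_info=True`` appear to be half-open character positions into the input string, so slicing the text with them recovers what was actually matched. A minimal sketch, using the output above as hard-coded data rather than calling flashtext; the helper name ``matched_substrings`` is hypothetical:

```python
# Input text and output copied from the example above (not recomputed here).
text = 'I love big Apple and Bay Area.'
keywords_found = [('New York', 7, 16), ('Bay Area', 21, 29)]


def matched_substrings(text, spans):
    """Pair each clean name with the slice of text its span covers."""
    return [(clean_name, text[start:end]) for clean_name, start, end in spans]


result = matched_substrings(text, keywords_found)
print(result)  # [('New York', 'big Apple'), ('Bay Area', 'Bay Area')]
```

The first pair shows that the clean name (``'New York'``) and the matched text (``'big Apple'``) can differ, which is the case where the span is most useful.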

Get Extra information with keywords extracted
>>> from flashtext import KeywordProcessor
>>> kp = KeywordProcessor()
>>> kp.add_keyword('Taj Mahal', ('Monument', 'Taj Mahal'))
>>> kp.add_keyword('Delhi', ('Location', 'Delhi'))
>>> kp.extract_keywords('Taj Mahal is in Delhi.')
>>> # [('Monument', 'Taj Mahal'), ('Location', 'Delhi')]
>>> # NOTE: replace_keywords feature won't work with this.
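
Because the clean name is a tuple here, the extracted results can be post-processed like ordinary Python data. A small sketch, again using the output above as hard-coded input; ``group_by_label`` is a hypothetical helper, not part of flashtext:

```python
from collections import defaultdict

# Output copied from the example above (not recomputed here).
extracted = [('Monument', 'Taj Mahal'), ('Location', 'Delhi')]


def group_by_label(pairs):
    """Group (label, value) tuples into {label: [value, ...]}."""
    grouped = defaultdict(list)
    for label, value in pairs:
        grouped[label].append(value)
    return dict(grouped)


grouped = group_by_label(extracted)
print(grouped)  # {'Monument': ['Taj Mahal'], 'Location': ['Delhi']}
```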

No clean name for Keywords
>>> from flashtext import KeywordProcessor
>>> keyword_processor = KeywordProcessor()
@@ -134,9 +152,9 @@ Get all keywords in dictionary
>>> from flashtext import KeywordProcessor
>>> keyword_processor = KeywordProcessor()
>>> keyword_processor.add_keyword('j2ee', 'Java')
->>> keyword_processor.add_keyword('onGoing', 'rendom')
+>>> keyword_processor.add_keyword('colour', 'color')
>>> keyword_processor.get_all_keywords()
->>> # output: {'j2ee': 'Java', 'ongoing': 'rendom'}
+>>> # output: {'colour': 'color', 'j2ee': 'Java'}

For detecting Word Boundary currently any character other than this `\\w` `[A-Za-z0-9_]` is considered a word boundary.
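
The rule above can be sketched with the standard library alone. This illustrates the stated rule and is not flashtext's own code; ``WORD_CHARS``, ``is_word_boundary`` and ``boundary_positions`` are names assumed for the example:

```python
import string

# Characters treated as part of a word, per the rule above: [A-Za-z0-9_].
WORD_CHARS = set(string.ascii_letters + string.digits + '_')


def is_word_boundary(ch):
    """Return True if ch is anything other than [A-Za-z0-9_]."""
    return ch not in WORD_CHARS


def boundary_positions(text):
    """Return the indices in text that hold a word-boundary character."""
    return [i for i, ch in enumerate(text) if is_word_boundary(ch)]


print(boundary_positions('Bay Area.'))  # [3, 8]
```

In this sketch the space and the trailing period are the only boundaries in ``'Bay Area.'``. Since the rule is ASCII-only, a non-ASCII letter would also count as a boundary here.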

@@ -199,11 +217,27 @@ The idea for this library came from the following `StackOverflow question
<https://stackoverflow.com/questions/44178449/regex-replace-is-taking-time-for-millions-of-documents-how-to-make-it-faster>`_.


-References
+Citation
----------

The original paper published on `FlashText algorithm <https://arxiv.org/abs/1711.00046>`_.

::

@ARTICLE{2017arXiv171100046S,
author = {{Singh}, V.},
title = "{Replace or Retrieve Keywords In Documents at Scale}",
journal = {ArXiv e-prints},
archivePrefix = "arXiv",
eprint = {1711.00046},
primaryClass = "cs.DS",
keywords = {Computer Science - Data Structures and Algorithms},
year = 2017,
month = oct,
adsurl = {http://adsabs.harvard.edu/abs/2017arXiv171100046S},
adsnote = {Provided by the SAO/NASA Astrophysics Data System}
}

The article published on `Medium freeCodeCamp <https://medium.freecodecamp.org/regex-was-taking-5-days-flashtext-does-it-in-15-minutes-55f04411025f>`_.


40 changes: 37 additions & 3 deletions docs/index.rst
@@ -64,6 +64,24 @@ Case Sensitive example
>>> keywords_found
>>> # ['Bay Area']

Span of keywords extracted
>>> from flashtext import KeywordProcessor
>>> keyword_processor = KeywordProcessor()
>>> keyword_processor.add_keyword('Big Apple', 'New York')
>>> keyword_processor.add_keyword('Bay Area')
>>> keywords_found = keyword_processor.extract_keywords('I love big Apple and Bay Area.', span_info=True)
>>> keywords_found
>>> # [('New York', 7, 16), ('Bay Area', 21, 29)]

Get Extra information with keywords extracted
>>> from flashtext import KeywordProcessor
>>> kp = KeywordProcessor()
>>> kp.add_keyword('Taj Mahal', ('Monument', 'Taj Mahal'))
>>> kp.add_keyword('Delhi', ('Location', 'Delhi'))
>>> kp.extract_keywords('Taj Mahal is in Delhi.')
>>> # [('Monument', 'Taj Mahal'), ('Location', 'Delhi')]
>>> # NOTE: replace_keywords feature won't work with this.

No clean name for Keywords
>>> from flashtext import KeywordProcessor
>>> keyword_processor = KeywordProcessor()
@@ -131,9 +149,9 @@ Get all keywords in dictionary
>>> from flashtext import KeywordProcessor
>>> keyword_processor = KeywordProcessor()
>>> keyword_processor.add_keyword('j2ee', 'Java')
->>> keyword_processor.add_keyword('onGoing', 'rendom')
+>>> keyword_processor.add_keyword('colour', 'color')
>>> keyword_processor.get_all_keywords()
->>> # output: {'j2ee': 'Java', 'ongoing': 'rendom'}
+>>> # output: {'colour': 'color', 'j2ee': 'Java'}

For detecting Word Boundary currently any character other than this `\\w` `[A-Za-z0-9_]` is considered a word boundary.

@@ -207,11 +225,27 @@ The idea for this library came from the following `StackOverflow question
<https://stackoverflow.com/questions/44178449/regex-replace-is-taking-time-for-millions-of-documents-how-to-make-it-faster>`_.


-References
+Citation
----------

The original paper published on `FlashText algorithm <https://arxiv.org/abs/1711.00046>`_.

::

@ARTICLE{2017arXiv171100046S,
author = {{Singh}, V.},
title = "{Replace or Retrieve Keywords In Documents at Scale}",
journal = {ArXiv e-prints},
archivePrefix = "arXiv",
eprint = {1711.00046},
primaryClass = "cs.DS",
keywords = {Computer Science - Data Structures and Algorithms},
year = 2017,
month = oct,
adsurl = {http://adsabs.harvard.edu/abs/2017arXiv171100046S},
adsnote = {Provided by the SAO/NASA Astrophysics Data System}
}

The article published on `Medium freeCodeCamp <https://medium.freecodecamp.org/regex-was-taking-5-days-flashtext-does-it-in-15-minutes-55f04411025f>`_.

