Fixes on tokeniser, normalisation, qualifiers and CI #329
Coverage Report
263 files skipped due to complete coverage.
Coverage failure: total of 97.77% is less than 97.78%.
assert not (max_steps and max_epochs), "Use only steps or epochs"
if max_epochs:
    max_steps = int(0.9 * (4464 / batch_size[0]))


Description
Regarding tokenization:
In texts, long words can be split across lines with a "-". This can impede matching: `dia-\nbete` won't be matched by a simple "diabete" regex. To this end:
- `EDS.Tokenizer` now treats `-\n` as a token by itself
- `eds.pollution` can tag this token as to-be-discarded

Regarding `ignore_space_tokens`
With `ignore_space_tokens=True`, using `edsnlp.utils.doc_to_text.get_text` (which is used under the hood by e.g. the regex matcher) removes line breaks, which can be problematic in texts containing enumerations without trailing spaces. E.g., `get_text("Tabac\nAlcool\nSport", "TEXT", ignore_space_tokens=True)` would output `"TabacAlcoolSport"`. Now, we replace this `\n` with a space when necessary.

Regarding the status mapping of behavior/disorder pipes
For entities matched by those pipes, there is:
- a `_.status` attribute, set to 1 by default, but which can take the value 2
- a `_.detailed_status` attribute, which is actually a getter that uses a mapping dictionary to get the human-readable status
When loading already-annotated docs, a status can be automatically set to None. To avoid a `KeyError`, we now handle this `status=None` case.

Regarding CI
`ubuntu-latest` doesn't support Python 3.7 anymore, so we should use `ubuntu-22`.

Checklist
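To make the tokenization, `ignore_space_tokens` and status-mapping points from the description concrete, here is a minimal plain-Python sketch. It does not use edsnlp and is not the code from this PR: `split_tokens`, `join_tokens`, `text_without_space_tokens`, `detailed_status` and the labels in `STATUS_MAPPING` are hypothetical names chosen for illustration, and returning `None` for an unset status is an assumption about how the `status=None` case could be handled.

```python
import re

# --- 1. Hyphenated line breaks -------------------------------------------
raw = "Patient avec dia-\nbete"
# A naive regex misses the word because "-\n" sits in the middle of it.
naive_match = re.search(r"diabete", raw)


def split_tokens(text):
    """Hypothetical tokenizer: "-\\n" is kept as a single token."""
    return re.findall(r"-\n|\s+|[^\s-]+|-", text)


def join_tokens(tokens, discard=("-\n",)):
    """Rebuild the text while dropping tokens flagged as to-be-discarded."""
    return "".join(t for t in tokens if t not in discard)


cleaned = join_tokens(split_tokens(raw))
cleaned_match = re.search(r"diabete", cleaned)


# --- 2. Ignoring space tokens without gluing words together ----------------
def text_without_space_tokens(tokens, fill_gaps):
    """Drop whitespace-only tokens.

    With fill_gaps=True, a single space is emitted wherever a whitespace
    token was dropped between two kept tokens.
    """
    out = []
    pending_gap = False
    for tok in tokens:
        if tok.isspace():
            pending_gap = True
            continue
        if pending_gap and out and fill_gaps:
            out.append(" ")
        out.append(tok)
        pending_gap = False
    return "".join(out)


enumeration = split_tokens("Tabac\nAlcool\nSport")
glued = text_without_space_tokens(enumeration, fill_gaps=False)
spaced = text_without_space_tokens(enumeration, fill_gaps=True)


# --- 3. Status mapping that tolerates status=None --------------------------
# Hypothetical labels: the description only says the status defaults to 1
# and can also take the value 2.
STATUS_MAPPING = {1: "PRESENT", 2: "ABSENT"}


def detailed_status(status, mapping=STATUS_MAPPING):
    """Return the human-readable status, or None when the status is unset."""
    # dict.get avoids the KeyError that mapping[None] would raise.
    return mapping.get(status)
```

In this sketch, `glued` reproduces the `"TabacAlcoolSport"` problem from the description, while `spaced` keeps the enumerated items separated.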