Commit cb2fbaf

Merge pull request #1038 from KnowledgeCaptureAndDiscovery/dev
Dev
2 parents 82fcf58 + a0a1c1c commit cb2fbaf

4 files changed

Lines changed: 159 additions & 153 deletions

README.md

Lines changed: 6 additions & 3 deletions
@@ -97,9 +97,12 @@ We use different supervised classifiers, header analysis, regular expressions, t
 
 ### Confidence values in header analysis
 
-When extracting metadata using header analysis, SOMEF assigns a confidence value based on the length
-of the header. Shorter headers are more likely to be a good fit for a category, while longer headers
-may contain additional context that makes the classification less reliable:
+When extracting metadata through header analysis, SOMEF filters out headers
+whose confidence is below a certain threshold to avoid false positives.
+For instance, a header with 11+ words receives a confidence of 0.1, which
+is considered too low for a reliable classification — such headers are
+discarded from the results. The filtering ensures that only headers with a
+reasonable match quality are reported in the output.
 
 | Header length | Confidence |
 |---------------|------------|

docs/output.md

Lines changed: 6 additions & 3 deletions
@@ -135,9 +135,12 @@ The following table summarized the properties used to describe a `category`:
 
 ### Confidence values in header analysis
 
-When extracting metadata using header analysis, SOMEF assigns a confidence value based on the length
-of the header. Shorter headers are more likely to be a good fit for a category, while longer headers
-may contain additional context that makes the classification less reliable:
+When extracting metadata through header analysis, SOMEF filters out headers
+whose confidence is below a certain threshold to avoid false positives.
+For instance, a header with 11+ words receives a confidence of 0.1, which
+is considered too low for a reliable classification. Such headers are
+discarded from the results. The filtering ensures that only headers with a
+reasonable match quality are reported in the output.
 
 | Header length | Confidence |
 |---------------|------------|
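
The filtering behaviour described in the added paragraph can be illustrated with a short Python sketch. This is not SOMEF's code. The only values taken from the text are the 11-word boundary and the 0.1 confidence. The function names, the confidence given to shorter headers (1.0) and the cut-off (0.5) are assumptions made for illustration, since the diff does not show the table rows or the actual threshold.

```python
# Hypothetical sketch of length-based header confidence filtering.
# Not SOMEF's implementation: names and most values are assumptions.

LONG_HEADER_WORDS = 11          # from the text: headers with 11+ words
LONG_HEADER_CONFIDENCE = 0.1    # from the text: confidence for such headers
ASSUMED_SHORT_CONFIDENCE = 1.0  # assumption: the real table rows are not shown
ASSUMED_THRESHOLD = 0.5         # assumption: the text only says "a certain threshold"


def header_confidence(header: str) -> float:
    """Return a confidence value based on the number of words in a header."""
    # Drop leading Markdown '#' markers and surrounding spaces before counting.
    words = header.strip("# ").split()
    if len(words) >= LONG_HEADER_WORDS:
        return LONG_HEADER_CONFIDENCE
    return ASSUMED_SHORT_CONFIDENCE


def filter_headers(headers: list[str], threshold: float = ASSUMED_THRESHOLD) -> list[str]:
    """Keep only headers whose confidence reaches the threshold."""
    return [h for h in headers if header_confidence(h) >= threshold]


headers = [
    "## Installation",
    "## How to install and configure this tool on a machine without network access",
]
kept = filter_headers(headers)
print(kept)
```

With these assumed values, the 13-word header is scored 0.1 and dropped, while the short header is kept.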
