Merged

Dev #1038

9 changes: 6 additions & 3 deletions README.md
Expand Up @@ -97,9 +97,12 @@ We use different supervised classifiers, header analysis, regular expressions, t

### Confidence values in header analysis

-When extracting metadata using header analysis, SOMEF assigns a confidence value based on the length
-of the header. Shorter headers are more likely to be a good fit for a category, while longer headers
-may contain additional context that makes the classification less reliable:
+When extracting metadata through header analysis, SOMEF filters out headers
+whose confidence is below a certain threshold to avoid false positives.
+For instance, a header with 11+ words receives a confidence of 0.1, which
+is considered too low for a reliable classification. Such headers are
+discarded from the results. The filtering ensures that only headers with a
+reasonable match quality are reported in the output.

| Header length | Confidence |
|---------------|------------|
9 changes: 6 additions & 3 deletions docs/output.md
Expand Up @@ -135,9 +135,12 @@ The following table summarized the properties used to describe a `category`:

### Confidence values in header analysis

-When extracting metadata using header analysis, SOMEF assigns a confidence value based on the length
-of the header. Shorter headers are more likely to be a good fit for a category, while longer headers
-may contain additional context that makes the classification less reliable:
+When extracting metadata through header analysis, SOMEF filters out headers
+whose confidence is below a certain threshold to avoid false positives.
+For instance, a header with 11+ words receives a confidence of 0.1, which
+is considered too low for a reliable classification. Such headers are
+discarded from the results. The filtering ensures that only headers with a
+reasonable match quality are reported in the output.

| Header length | Confidence |
|---------------|------------|
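The filtering described in the changed paragraphs could be sketched roughly as follows. This is an illustrative sketch, not SOMEF's actual code: the function names, the cutoff `ASSUMED_THRESHOLD`, and the confidence given to shorter headers (`ASSUMED_SHORT_CONFIDENCE`) are assumptions, since the diff truncates the table and does not state the threshold. Only the "11+ words receives 0.1" rule comes from the text.

```python
from typing import List, Tuple

# Taken from the documentation text: headers with 11+ words get 0.1.
LONG_HEADER_WORDS = 11
LONG_HEADER_CONFIDENCE = 0.1

# Assumptions for illustration only; the real values are not shown in the diff.
ASSUMED_SHORT_CONFIDENCE = 1.0
ASSUMED_THRESHOLD = 0.5


def header_confidence(header: str) -> float:
    """Return a length-based confidence for a header (hypothetical helper)."""
    n_words = len(header.split())
    if n_words >= LONG_HEADER_WORDS:
        return LONG_HEADER_CONFIDENCE
    return ASSUMED_SHORT_CONFIDENCE


def filter_headers(
    headers: List[str], threshold: float = ASSUMED_THRESHOLD
) -> List[Tuple[str, float]]:
    """Keep only the headers whose confidence reaches the threshold."""
    scored = [(h, header_confidence(h)) for h in headers]
    return [(h, c) for h, c in scored if c >= threshold]


if __name__ == "__main__":
    sample = [
        "Installation",
        "How to install and configure this tool on your own local machine",
    ]
    for header, confidence in filter_headers(sample):
        print(f"{confidence:.1f}  {header}")
```

With these assumed values, the twelve-word header is scored 0.1 and dropped, while the short one is kept.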