Commit 9d0a024

Merge pull request #1001 from juanjemdIos/fix-529
Solve problem with long headers and improve header analysis confidence. Fixes #529, Fixes #138
2 parents 6a4eeb9 + 5959ffb commit 9d0a024

6 files changed

Lines changed: 139 additions & 5 deletions

File tree

README.md

Lines changed: 13 additions & 0 deletions
@@ -95,6 +95,19 @@ We recognize the following properties:
9595

9696
We use different supervised classifiers, header analysis, regular expressions, the GitHub/Gitlab API to retrieve all these fields (more than one technique may be used for each field) and language specific metadata parsers (e.g., for package files). Each extraction records its provenance, with the confidence and technique used on each step. For more information check the [output format description](https://somef.readthedocs.io/en/latest/output/)
9797

98+
### Confidence values in header analysis
99+
100+
When extracting metadata using header analysis, SOMEF assigns a confidence value based on the length
101+
of the header. Shorter headers are more likely to be a good fit for a category, while longer headers
102+
may contain additional context that makes the classification less reliable:
103+
104+
| Header length | Confidence |
105+
|---------------|------------|
106+
| 1–3 words | 1.0 |
107+
| 4–6 words | 0.8 |
108+
| 7–10 words | 0.5 |
109+
| 11+ words | 0.1 |
110+
98111
## Documentation
99112

100113
See full documentation at [https://somef.readthedocs.io/en/latest/](https://somef.readthedocs.io/en/latest/)
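
The README table above can be read as an ordered lookup over word-count thresholds. The following is a minimal sketch of that idea, assuming the thresholds match the table; the names `THRESHOLDS`, `FALLBACK_CONFIDENCE` and `header_confidence` are illustrative and are not SOMEF identifiers.

```python
from typing import List, Tuple

# (max_words, confidence) pairs checked in order; values copied from the table.
THRESHOLDS: List[Tuple[int, float]] = [(3, 1.0), (6, 0.8), (10, 0.5)]
FALLBACK_CONFIDENCE = 0.1  # headers with 11 or more words


def header_confidence(header: str) -> float:
    """Map a header to a confidence by counting whitespace-separated words."""
    num_words = len(header.split())
    for max_words, confidence in THRESHOLDS:
        if num_words <= max_words:
            return confidence
    return FALLBACK_CONFIDENCE


print(header_confidence("Installation"))                      # 1 word
print(header_confidence("Importing WIDOCO as a dependency"))  # 5 words
```

Note that under this sketch an empty header has zero words and would therefore receive the highest confidence, so callers may want to filter empty headers beforehand.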

docs/output.md

Lines changed: 14 additions & 0 deletions
@@ -133,6 +133,19 @@ The following table summarized the properties used to describe a `category`:
133133
| **source** | No | Url | URL of the source file used for the extraction. |
134134
| **technique** | Yes | String | Technique used for the extraction. One of the following list: Supervised classification, header analysis, regular expression, GitHub API, File exploration, Code parsing |
135135

136+
### Confidence values in header analysis
137+
138+
When extracting metadata using header analysis, SOMEF assigns a confidence value based on the length
139+
of the header. Shorter headers are more likely to be a good fit for a category, while longer headers
140+
may contain additional context that makes the classification less reliable:
141+
142+
| Header length | Confidence |
143+
|---------------|------------|
144+
| 1–3 words | 1.0 |
145+
| 4–6 words | 0.8 |
146+
| 7–10 words | 0.5 |
147+
| 11+ words | 0.1 |
148+
136149
### Result
137150
Field returning the extracted output from the code repository. An example can be seen below for a citation found in BibteX format in a README file of a code repository:
138151

@@ -446,6 +459,7 @@ The table below summarizes the mapping between the SOMEF internal JSON structure
446459
| `logo` | `logo` | Project logo URL |
447460
| `maintainer` | `maintainer` | Project maintainers |
448461
| `name` | `name` | Software name |
462+
| `schema:owner` | `owner` | Software owner |
449463
| `programmingLanguage` | `programming_languages` | Languages used |
450464
| `readme` | `readme_url` | README file URL |
451465
| `referencePublication`| `citation` (Papers) || References to the main publication associated with this software component (as per author preference) *1*|

src/somef/header_analysis.py

Lines changed: 28 additions & 2 deletions
@@ -329,11 +329,20 @@ def is_false_positive_header(text: str, category: str) -> bool:
329329

330330
text_lower = text.lower()
331331

332+
if '?' in text or '!' in text:
333+
return True
334+
332335
# false positives for bibliographic citations
333336
if category == constants.CAT_CITATION:
334337
for pattern in constants.NEGATIVE_PATTERNS_CITATION_HEADERS:
335338
if pattern in text_lower:
336339
return True
340+
341+
if category in constants.MAX_HEADER_WORDS:
342+
num_words = len(text.split())
343+
if num_words > constants.MAX_HEADER_WORDS[category]:
344+
return True
345+
337346
return False
338347

339348

@@ -431,6 +440,13 @@ def extract_categories(repo_data: str, repository_metadata: Result, similarity_t
431440
df.loc[df['Group'].str.len() == 0, 'Group'] = df['ParentGroup']
432441
df = df.drop(columns=['ParentGroup'])
433442

443+
# Installation keywords that wordnet cannot handle correctly
444+
mask = df['Group'].str.len() == 0
445+
df.loc[mask, 'Group'] = df.loc[mask, 'Header'].map(
446+
lambda h: [constants.CAT_INSTALLATION]
447+
if any(kw in h.lower() for kw in constants.INSTALLATION_HEADER_KEYWORDS)
448+
else []
449+
)
434450
# detection for os/platform headers that wordnet cannot handle correctly
435451
mask = df['Group'].str.len() == 0
436452
df.loc[mask, 'Group'] = df.loc[mask, 'Header'].map(
@@ -494,6 +510,7 @@ def extract_categories(repo_data: str, repository_metadata: Result, similarity_t
494510
if row[constants.PROP_PARENT_HEADER]:
495511
result[constants.PROP_PARENT_HEADER] = row[constants.PROP_PARENT_HEADER]
496512

513+
confidence = calculate_header_confidence(row[constants.PROP_ORIGINAL_HEADER])
497514
if row['Group'] == constants.CAT_LICENSE:
498515
license_text = row[constants.PROP_VALUE]
499516
license_info = detect_license_spdx(license_text, 'HEADER')
@@ -507,7 +524,7 @@ def extract_categories(repo_data: str, repository_metadata: Result, similarity_t
507524
repository_metadata.add_result(
508525
row['Group'],
509526
result,
510-
1,
527+
confidence,
511528
constants.TECHNIQUE_HEADER_ANALYSIS,
512529
source,
513530
)
@@ -613,6 +630,15 @@ def build_wordnet_groups() -> Dict[str, List]:
613630
return g
614631

615632

633+
def calculate_header_confidence(header: str) -> float:
634+
"""Returns a confidence value based on the header length."""
635+
num_words = len(header.split())
636+
for max_words, confidence in constants.HEADER_CONFIDENCE_THRESHOLDS:
637+
if num_words <= max_words:
638+
return confidence
639+
return 0.1
640+
641+
616642
def extract_os_from_content(text: str) -> List[dict]:
617643
"""
618644
Scans a text block for mentions of operating systems, platforms or runtime
@@ -655,4 +681,4 @@ def extract_os_from_content(text: str) -> List[dict]:
655681
"value": name,
656682
})
657683

658-
return results
684+
return results
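
The false-positive filter changed in this file combines three checks: question or exclamation marks, negative citation patterns, and a per-category word limit. The sketch below reproduces that order in a self-contained form. The category strings and the single entry in `NEGATIVE_PATTERNS_CITATION` are placeholders chosen for illustration, since the real values of `constants.CAT_*` and `constants.NEGATIVE_PATTERNS_CITATION_HEADERS` are not shown in this diff.

```python
from typing import Dict, List

# Placeholder category names; the real constants live in somef.utils.constants.
CAT_DOCUMENTATION = "documentation"
CAT_REQUIREMENTS = "requirements"
CAT_CITATION = "citation"

# Word limits as added in this commit.
MAX_HEADER_WORDS: Dict[str, int] = {
    CAT_DOCUMENTATION: 5,
    CAT_REQUIREMENTS: 3,
    CAT_CITATION: 5,
}

# Hypothetical pattern, used only to make the example runnable.
NEGATIVE_PATTERNS_CITATION: List[str] = ["citation style"]


def is_false_positive_header(text: str, category: str) -> bool:
    """Return True when a header should be discarded for the given category."""
    # Headers phrased as questions or exclamations are rejected for every category.
    if "?" in text or "!" in text:
        return True
    # Citation headers are rejected when they contain a known negative pattern.
    if category == CAT_CITATION:
        text_lower = text.lower()
        if any(pattern in text_lower for pattern in NEGATIVE_PATTERNS_CITATION):
            return True
    # Categories with a word limit reject headers that exceed it.
    limit = MAX_HEADER_WORDS.get(category)
    return limit is not None and len(text.split()) > limit


header = "Browser issues (Why can't I see the generated documentation / visualization?)"
print(is_false_positive_header(header, CAT_DOCUMENTATION))
```

Categories without an entry in `MAX_HEADER_WORDS` are only subject to the punctuation check, which is why a five-word installation header can still be accepted.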

src/somef/test/test_codemeta_export.py

Lines changed: 6 additions & 3 deletions
@@ -531,12 +531,15 @@ def test_issue_417(self):
531531
json_content = json.loads(data)
532532
issue_tracker = json_content["issueTracker"] # JSON is in Codemeta format
533533

534-
#len(json_content["citation"])
535-
#codemeta category citation is now referencePublication
534+
535+
# buildInstructions was previously generated from the header "Browser issues (Why can't I see
536+
# the generated documentation / visualization?)" which was incorrectly classified as documentation
537+
# due to the word "documentation" in the header. This was a false positive fixed in issue #529
538+
# (long headers with punctuation are now discarded), so buildInstructions is no longer expected here.
539+
# len(json_content["buildInstructions"]) > 0 and
536540
assert issue_tracker == 'https://github.com/dgarijo/Widoco/issues' and len(json_content["referencePublication"]) > 0 and \
537541
len(json_content["name"]) > 0 and len(json_content["identifier"]) > 0 and \
538542
len(json_content["description"]) > 0 and len(json_content["readme"]) > 0 and \
539-
len(json_content["buildInstructions"]) > 0 and \
540543
len(json_content["softwareRequirements"]) > 0 and len(json_content["programmingLanguage"]) > 0 and \
541544
len(json_content["keywords"]) > 0 and len(json_content["logo"]) > 0 and \
542545
len(json_content["license"]) > 0 and len(json_content["dateCreated"]) > 0

src/somef/test/test_header_analysis.py

Lines changed: 48 additions & 0 deletions
@@ -151,6 +151,54 @@ def test_extract_headers_with_separators(self):
151151
assert 'Funding' in headers
152152

153153

154+
def test_issue_529(self):
155+
"""
156+
Test that ensures long headers or headers with punctuation are not incorrectly
157+
classified. 'Browser issues (Why can't I see...)' should not appear in documentation.
158+
"""
159+
with open(test_data_path + "widoco_readme.md", "r") as data_file:
160+
file_text = data_file.read()
161+
json_test, results = extract_categories(file_text, Result())
162+
if constants.CAT_DOCUMENTATION in json_test.results:
163+
headers = [e[constants.PROP_RESULT].get(constants.PROP_ORIGINAL_HEADER, "")
164+
for e in json_test.results[constants.CAT_DOCUMENTATION]]
165+
assert not any("Browser issues" in h for h in headers)
166+
167+
168+
def test_issue_529_installation(self):
169+
"""
170+
Test that ensures long headers or headers with punctuation are not incorrectly
171+
classified. 'Importing WIDOCO as a dependency' should appear in installation.
172+
"""
173+
with open(test_data_path + "widoco_readme.md", "r") as data_file:
174+
file_text = data_file.read()
175+
json_test, results = extract_categories(file_text, Result())
176+
177+
assert constants.CAT_INSTALLATION in json_test.results, "No installation category found"
178+
# print(json_test.results[constants.CAT_INSTALLATION])
179+
if constants.CAT_INSTALLATION in json_test.results:
180+
headers = [e[constants.PROP_RESULT].get(constants.PROP_ORIGINAL_HEADER, "")
181+
for e in json_test.results[constants.CAT_INSTALLATION]]
182+
183+
assert any("Importing WIDOCO as a dependency" in h for h in headers)
184+
185+
def test_issue_138(self):
186+
"""
187+
Test that ensures header analysis returns a confidence lower than 1
188+
for long headers (4+ words).
189+
"""
190+
with open(test_data_path + "widoco_readme.md", "r") as data_file:
191+
file_text = data_file.read()
192+
json_test, results = extract_categories(file_text, Result())
193+
for category, entries in json_test.results.items():
194+
if category == constants.PROP_PROVENANCE:
195+
continue
196+
for entry in entries:
197+
if entry[constants.PROP_TECHNIQUE] == constants.TECHNIQUE_HEADER_ANALYSIS:
198+
header = entry[constants.PROP_RESULT].get(constants.PROP_ORIGINAL_HEADER, "")
199+
if header and len(header.split()) > 3:
200+
print(f"Header: '{header}' | words: {len(header.split())} | confidence: {entry[constants.PROP_CONFIDENCE]}")
201+
assert entry[constants.PROP_CONFIDENCE] < 1.0
154202
def test_issue_112_similarity_threshold(self):
155203
"""
156204
Checks that the similarity_threshold parameter is respected in header analysis.

src/somef/utils/constants.py

Lines changed: 30 additions & 0 deletions
@@ -547,6 +547,22 @@ class RepositoryType(Enum):
547547
DEPENDENCY_TYPE_RUNTIME = "runtime"
548548
DEPENDENCY_TYPE_DEVELOPMENT = "development"
549549

550+
# Open question: should all categories share one limit, or should the limit depend on the category?
551+
# Maximum number of words a header may contain to be accepted for a category in header analysis.
552+
# Headers longer than the limit are discarded to avoid false positives.
553+
MAX_HEADER_WORDS = {
554+
CAT_DOCUMENTATION: 5,
555+
CAT_REQUIREMENTS: 3,
556+
CAT_CITATION: 5,
557+
}
558+
559+
# Confidence thresholds for header analysis based on header length
560+
HEADER_CONFIDENCE_THRESHOLDS = [
561+
(3, 1.0), # 1-3 words -> confidence 1.0
562+
(6, 0.8), # 4-6 words -> confidence 0.8
563+
(10, 0.5), # 7-10 words -> confidence 0.5
564+
(11, 0.1), # 11 words -> confidence 0.1; longer headers use the 0.1 fallback in calculate_header_confidence
565+
]
550566
# in case not exist in config file. But config file has higher priority than this default value.
551567
CONF_SIMILARITY_THRESHOLD = "similarity_threshold"
552568
CONF_DEFAULT_SIMILARITY_THRESHOLD = 0.8
@@ -561,6 +577,20 @@ class RepositoryType(Enum):
561577
"supported platforms", "tested on", "runs on", "environment",
562578
]
563579

580+
INSTALLATION_HEADER_KEYWORDS = [
581+
"importing",
582+
"downloading",
583+
"download",
584+
"as a dependency",
585+
"as dependency",
586+
"via pip",
587+
"via conda",
588+
"via npm",
589+
"via maven",
590+
"getting started",
591+
"quick start",
592+
"quickstart",
593+
]
564594
# Regular expressions for OS/platform detection in header analysis
565595
REGEXP_OS_WINDOWS = r'(?i)\bwindows\s*(\d[\d.]*\d|\d+)?'
566596
REGEXP_OS_MACOS = r'(?i)(?:\bmacos|\bmac\s*os|\bos\s*x|\bosx)\s*([\d.]+)?'
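
The `INSTALLATION_HEADER_KEYWORDS` list is used as a fallback for headers that received no group from the earlier matching step. The sketch below shows that fallback on plain Python lists instead of the pandas DataFrame used in `extract_categories`; `assign_installation_fallback` is a hypothetical helper, and the string `"installation"` stands in for `constants.CAT_INSTALLATION`, whose actual value is not shown in this diff.

```python
from typing import List

# Keyword list as added in this commit.
INSTALLATION_HEADER_KEYWORDS: List[str] = [
    "importing", "downloading", "download", "as a dependency", "as dependency",
    "via pip", "via conda", "via npm", "via maven",
    "getting started", "quick start", "quickstart",
]

CAT_INSTALLATION = "installation"  # placeholder for constants.CAT_INSTALLATION


def assign_installation_fallback(header: str, group: List[str]) -> List[str]:
    """Keep an existing group; otherwise tag the header as installation on a keyword hit."""
    if group:
        return group
    header_lower = header.lower()
    # Plain substring matching, so "download" also matches "downloads".
    if any(keyword in header_lower for keyword in INSTALLATION_HEADER_KEYWORDS):
        return [CAT_INSTALLATION]
    return []


print(assign_installation_fallback("Importing WIDOCO as a dependency", []))
print(assign_installation_fallback("Browser issues", []))
```

Because the match is a substring test, the entry `"download"` already covers `"downloading"`, and headers that merely mention downloads would also be tagged under this sketch.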
