
Commit 26ca8e7

Merge pull request #1028 from KnowledgeCaptureAndDiscovery/dev
Bringing issues for the next release
2 parents 057c688 + 5ab8db1 commit 26ca8e7

46 files changed

Lines changed: 3004 additions & 111 deletions

Lines changed: 33 additions & 0 deletions
@@ -0,0 +1,33 @@
1+
name: Build and Publish Docker Image after Release (view instructions for Docker in readme.md file)
2+
3+
on:
4+
release:
5+
types: [published]
6+
# workflow_dispatch:
7+
8+
jobs:
9+
build-and-push:
10+
runs-on: ubuntu-latest
11+
12+
steps:
13+
- name: Checkout repository
14+
uses: actions/checkout@v4
15+
16+
- name: Set up Docker Buildx
17+
uses: docker/setup-buildx-action@v3
18+
19+
- name: Log in to Docker Hub
20+
uses: docker/login-action@v3
21+
with:
22+
username: ${{ secrets.DOCKERHUB_USERNAME }}
23+
password: ${{ secrets.DOCKERHUB_TOKEN }}
24+
25+
- name: Build and push Docker image
26+
uses: docker/build-push-action@v5
27+
with:
28+
context: .
29+
file: Dockerfile
30+
push: true
31+
tags: |
32+
kcapd/somef:latest
33+
kcapd/somef:${{ github.event.release.tag_name }}

.gitignore

Lines changed: 2 additions & 0 deletions
@@ -24,3 +24,5 @@ repos.txt
2424
!package_neors.json
2525
!package_npm.json
2626
!test_data/api_responses/*.json
27+
uv.lock
28+
.python-version

README.md

Lines changed: 17 additions & 0 deletions
@@ -95,6 +95,19 @@ We recognize the following properties:
9595

9696
We use different supervised classifiers, header analysis, regular expressions, the GitHub/Gitlab API to retrieve all these fields (more than one technique may be used for each field) and language specific metadata parsers (e.g., for package files). Each extraction records its provenance, with the confidence and technique used on each step. For more information check the [output format description](https://somef.readthedocs.io/en/latest/output/)
9797

98+
### Confidence values in header analysis
99+
100+
When extracting metadata using header analysis, SOMEF assigns a confidence value based on the length
101+
of the header. Shorter headers are more likely to be a good fit for a category, while longer headers
102+
may contain additional context that makes the classification less reliable:
103+
104+
| Header length | Confidence |
105+
|---------------|------------|
106+
| 1–3 words | 1.0 |
107+
| 4–6 words | 0.8 |
108+
| 7–10 words | 0.5 |
109+
| 11+ words | 0.1 |
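
The table above can be read as a simple step function over the header's word count. The snippet below is a minimal sketch of that mapping, not SOMEF's actual implementation: the function name `header_confidence` is hypothetical, and counting words by splitting on whitespace is an assumption.

```python
def header_confidence(header: str) -> float:
    """Return the confidence value the table assigns to a header.

    Hypothetical helper: words are counted by splitting on whitespace,
    which is an assumption about how header length is measured.
    """
    n_words = len(header.split())
    if n_words == 0:
        raise ValueError("header must contain at least one word")
    if n_words <= 3:
        return 1.0
    if n_words <= 6:
        return 0.8
    if n_words <= 10:
        return 0.5
    return 0.1


short = header_confidence("Installation")                            # 1 word
medium = header_confidence("How to install the package")             # 5 words
long_ = header_confidence("Step by step guide to running the tool")  # 8 words
```

Under this reading, a one-word header such as "Installation" gets the highest confidence, while a sentence-like header drops to 0.5 or lower.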
110+
98111
## Documentation
99112

100113
See full documentation at [https://somef.readthedocs.io/en/latest/](https://somef.readthedocs.io/en/latest/)
@@ -362,6 +375,10 @@ The following command extracts all metadata available from [https://github.com/d
362375
somef describe -r https://github.com/dgarijo/Widoco/ -o test.json -t 0.8
363376
```
364377

378+
We recommend having a high value for the `threshold` parameter, 0.8 (default) or above.
379+
Additional configuration parameters (such as the `similarity_threshold` for header analysis)
380+
can be set in `~/.somef/config.json`. See the [usage documentation](https://somef.readthedocs.io/en/latest/usage/) for details.
381+
365382
Try SOMEF in Binder with our sample notebook: [![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/KnowledgeCaptureAndDiscovery/somef/HEAD?filepath=notebook%2FSOMEF%20Usage%20Example.ipynb)
366383

367384
## Contribute:

config.json

Lines changed: 2 additions & 1 deletion
@@ -2,5 +2,6 @@
22
"description" : "./models/description.p",
33
"citation" : "./models/citation.p",
44
"installation" : "./models/installation.p",
5-
"invocation" : "./models/invocation.p"
5+
"invocation" : "./models/invocation.p",
6+
"similarity_threshold": 0.8
67
}

docs/output.md

Lines changed: 15 additions & 0 deletions
@@ -133,6 +133,19 @@ The following table summarized the properties used to describe a `category`:
133133
| **source** | No | Url | URL of the source file used for the extraction. |
134134
| **technique** | Yes | String | Technique used for the extraction. One of the following list: Supervised classification, header analysis, regular expression, GitHub API, File exploration, Code parsing |
135135

136+
### Confidence values in header analysis
137+
138+
When extracting metadata using header analysis, SOMEF assigns a confidence value based on the length
139+
of the header. Shorter headers are more likely to be a good fit for a category, while longer headers
140+
may contain additional context that makes the classification less reliable:
141+
142+
| Header length | Confidence |
143+
|---------------|------------|
144+
| 1–3 words | 1.0 |
145+
| 4–6 words | 0.8 |
146+
| 7–10 words | 0.5 |
147+
| 11+ words | 0.1 |
148+
136149
### Result
137150
Field returning the extracted output from the code repository. An example can be seen below for a citation found in BibteX format in a README file of a code repository:
138151

@@ -423,6 +436,7 @@ The table below summarizes the mapping between the SOMEF internal JSON structure
423436

424437
| Codemeta / Schema.org Field | SOMEF Category | Description |
425438
| :--- | :--- | :--- |
439+
| `applicationCategory` | `application_domain` | Categories |
426440
| `author` | `author` | Principal authors |
427441
| `buildInstructions` | `installation` / `documentation` | Installation or build instructions |
428442
| `creditText` | `citation` (Software) | Human-readable citation for the software *1*|
@@ -445,6 +459,7 @@ The table below summarizes the mapping between the SOMEF internal JSON structure
445459
| `logo` | `logo` | Project logo URL |
446460
| `maintainer` | `maintainer` | Project maintainers |
447461
| `name` | `name` | Software name |
462+
| `schema:owner` | `owner` | Software owner |
448463
| `programmingLanguage` | `programming_languages` | Languages used |
449464
| `readme` | `readme_url` | README file URL |
450465
| `referencePublication`| `citation` (Papers) | References to the main publication associated with this software component (as per author preference) *1*|

docs/setupcfg.md

Lines changed: 52 additions & 0 deletions
@@ -0,0 +1,52 @@
1+
The following metadata fields can be extracted from a setup.cfg file.
2+
These fields are defined in the [setuptools declarative configuration specification](https://setuptools.pypa.io/en/latest/userguide/declarative_config.html), and are mapped according to the [CodeMeta crosswalk for Python Distutils](https://github.com/codemeta/codemeta/blob/master/crosswalks/Python%20Distutils%20(PyPI).csv).
3+
4+
| Software metadata category | SOMEF metadata JSON path | SETUP.CFG metadata file field |
5+
|--------------------------------|-----------------------------|----------------------------------------|
6+
| author - value | author[i].result.value | metadata.author |
7+
| author - email | author[i].result.email | metadata.author_email |
8+
| author - name | author[i].result.name | metadata.author |
9+
| code_repository | code_repository[i].result.value | project_urls (source, repository, code) |
10+
| description | description[i].result.value | metadata.description |
11+
| documentation | documentation[i].result.value | project_urls (Documentation, docs) |
12+
| license - value | license[i].result.value | metadata.license or metadata.license_files |
13+
| license - name | license[i].result.name | metadata.license *(1)* |
14+
| license - spdx id | license[i].result.spdx_id | metadata.license if "spdx.org/licenses/" *(1)* |
15+
| has_package_file | has_package_file[i].result.value | URL of the setup.cfg file |
16+
| homepage | homepage[i].result.value | metadata.url or project_urls (Homepage) |
17+
| keywords | keywords[i].result.value | metadata.keywords |
18+
| package_id | package_id[i].result.value | metadata.name |
19+
| requirements - value | requirements[i].result.value | options.install_requires or options.setup_requires *(2)* |
20+
| requirements - name | requirements[i].result.name | options.install_requires or options.setup_requires -> name *(2)* |
21+
| requirements - version | requirements[i].result.version | options.install_requires or options.setup_requires -> version *(2)* |
22+
| runtime_platform - value | runtime_platform[i].result.value | options.python_requires -> "Python" + version *(3)* |
23+
| runtime_platform - name | runtime_platform[i].result.name | options.python_requires -> "Python" *(3)* |
24+
| runtime_platform - version | runtime_platform[i].result.version | options.python_requires *(3)* |
25+
| version - value | version[i].result.value | metadata.version |
26+
| version - tag | version[i].result.tag | metadata.version |
27+
28+
---
29+
30+
*(1)*
31+
- The name and spdx_id are looked up in a local dictionary containing all licenses.
32+
33+
*(2)*
34+
- Examples of requirements
35+
```
36+
[options]
37+
install_requires =
38+
astropy
39+
ctapipe >= 0.12
40+
h5py ~= 3.1.0
41+
42+
setup_requires =
43+
setuptools >= 40.6.0
44+
wheel
45+
46+
```
47+
48+
*(3)*
49+
- Example:
50+
```
51+
python_requires = >= 3.10.0
52+
```
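
To make the mapping in the table more concrete, here is a rough sketch of how a few of these fields could be read with Python's standard `configparser`. It is an illustration under stated assumptions, not SOMEF's parser: the function names, the sample file contents, the flat output dictionary, and the `"Python " + version` formatting of the runtime platform value are all hypothetical, and the requirement regex is deliberately simplified (it ignores extras and environment markers).

```python
import configparser
import re

# Hypothetical setup.cfg content used only for this illustration.
SAMPLE = """\
[metadata]
name = example-package
version = 1.2.0

[options]
python_requires = >= 3.10.0
install_requires =
    astropy
    ctapipe >= 0.12
    h5py ~= 3.1.0
"""


def parse_requirement(line: str) -> dict:
    """Split 'name <specifier>' into value, name and version (simplified)."""
    match = re.match(r"^([A-Za-z0-9][A-Za-z0-9._-]*)\s*(.*)$", line)
    if match is None:
        return {"value": line, "name": None, "version": None}
    name, spec = match.group(1), match.group(2).strip()
    return {"value": line, "name": name, "version": spec or None}


def extract_setup_cfg(text: str) -> dict:
    """Read a few of the fields listed in the table from setup.cfg text."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)

    # Multi-line values come back as one string with embedded newlines.
    raw_requires = parser.get("options", "install_requires", fallback="")
    requirements = [
        parse_requirement(line.strip())
        for line in raw_requires.splitlines()
        if line.strip()
    ]

    result = {
        "package_id": parser.get("metadata", "name", fallback=None),
        "version": parser.get("metadata", "version", fallback=None),
        "requirements": requirements,
    }

    python_requires = parser.get("options", "python_requires", fallback=None)
    if python_requires is not None:
        result["runtime_platform"] = {
            "value": f"Python {python_requires}",  # assumed formatting
            "name": "Python",
            "version": python_requires,
        }
    return result


result = extract_setup_cfg(SAMPLE)
```

Each entry in `result["requirements"]` carries the three sub-fields from the table (`value`, `name`, `version`); a requirement without a version specifier, such as `astropy`, ends up with `version` set to `None` in this sketch.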

docs/supported_languages.md

Lines changed: 1 addition & 1 deletion
@@ -12,7 +12,7 @@ To know more about the extraction details for each type of file, click on it.
1212
| JavaScript | [`package.json`](./packagejson.md), [`bower.json`](./bower.md) |
1313
| Julia | [`Project.toml`](./julia.md) |
1414
| PHP | [`composer.json`](./composer.md) |
15-
| Python | [`setup.py`](./setuppy.md), [`pyproject.toml`](./pyprojecttoml.md), [`requirements.txt`](./requirementstxt.md) |
15+
| Python | [`setup.py`](./setuppy.md), [`setup.cfg`](./setupcfg.md), [`pyproject.toml`](./pyprojecttoml.md), [`requirements.txt`](./requirementstxt.md) |
1616
| R | [`DESCRIPTION`](./description.md) |
1717
| Ruby | [`*.gemspec`](./gemspec.md) |
1818
| Rust | [`Cargo.toml`](./cargo.md) |

docs/supported_metadata_files.md

Lines changed: 1 addition & 0 deletions
@@ -21,6 +21,7 @@ SOMEF can extract metadata from a wide range of files commonly found in software
2121
| `pyproject.toml` | Python | Modern Python project configuration file used by tools like Poetry and Flit | [🔍](./pyprojecttoml.md)| [📄](https://packaging.python.org/en/latest/guides/writing-pyproject-toml/)| [PEP 621](https://peps.python.org/pep-0621/)| [Example](https://github.com/KnowledgeCaptureAndDiscovery/somef/blob/master/pyproject.toml) |
2222
| `requirements.txt` | Python | Lists Python package dependencies | [🔍](./requirementstxt.md)| [📄](https://pip.pypa.io/en/stable/reference/requirements-file-format/)| [Latest](https://pip.pypa.io/en/stable/reference/requirements-file-format/)| [Example](https://github.com/oeg-upm/FAIR-Research-Object/blob/main/requirements.txt) |
2323
| `setup.py` | Python | Package file format used in python projects | [🔍](./setuppy.md)| [📄](https://setuptools.pypa.io/en/latest/references/keywords.html)| [v75.0.0](https://github.com/pypa/setuptools)| [Example](https://github.com/oeg-upm/soca/blob/main/setup.py) |
24+
| `setup.cfg` | Python | Configuration file for setuptools used to define package metadata and options in a declarative way | [🔍](./setupcfg.md)| [📄](https://setuptools.pypa.io/en/latest/userguide/declarative_config.html) | [v75.0.0](https://github.com/pypa/setuptools)|[Example](https://github.com/oeg-upm/soca/blob/main/setup.cfg)|
2425
| `DESCRIPTION` | R | Metadata file for R packages including title, author, and version | [🔍](./description.md) | [📄](https://cran.r-project.org/doc/manuals/R-exts.html#The-DESCRIPTION-file)| [v4.4.1](https://cran.r-project.org/doc/manuals/r-release/R-exts.html) | [Example](https://github.com/cran/ggplot2/blob/master/DESCRIPTION) |
2526
| `*.gemspec` | Ruby | Manifest file serves as the package descriptor used in Ruby gem projects. | [🔍](./gemspec.md)| [📄](https://guides.rubygems.org/specification-reference/)| [v3.5.22](https://github.com/rubygems/rubygems)|[Example](https://github.com/rubygems/rubygems/blob/master/bundler/bundler.gemspec) |
2627
| `cargo.toml` | Rust | Manifest file serves as the package descriptor used in Rust projects | [🔍](./cargo.md) | [📄](https://doc.rust-lang.org/cargo/reference/manifest.html)| [v0.85.0](https://github.com/rust-lang/cargo) | [Example](https://github.com/rust-lang/cargo/blob/master/Cargo.toml) |

docs/usage.md

Lines changed: 26 additions & 1 deletion
@@ -82,8 +82,33 @@ If you prefer to export as a [Codemeta](https://codemeta.github.io/) JSON-LD, ju
8282
somef describe -r https://github.com/dgarijo/Widoco/ -c test.json
8383
```
8484

85-
For more information about the output types supported by SOMEF, please see [the output format help page](https://somef.readthedocs.io/en/latest/output/).
85+
For more information about the output types supported by SOMEF, please see [the output format help page](https://somef.readthedocs.io/en/latest/output/).
8686

8787
We recommend having a high value for the `threshold` parameter, 0.8 (default) or above.
8888

89+
## Configuration parameters
90+
91+
SOMEF uses a configuration file located at `~/.somef/config.json` that can be edited to customize its behavior.
92+
To generate it, run `somef configure`. The following parameters are available:
93+
94+
### Similarity threshold
95+
96+
Controls the minimum similarity score required for a README header to be matched to a
97+
category (e.g., installation, usage, license). SOMEF uses WordNet path similarity to
98+
compare header words against known category terms.
99+
100+
- **Default value**: `0.8`
101+
- **Range**: `0.0` to `1.0` (higher values = stricter matching, lower values = more permissive)
102+
103+
To change it, edit your `~/.somef/config.json`:
104+
105+
```json
106+
{
107+
"similarity_threshold": 0.75
108+
}
109+
```
110+
111+
Note: This parameter is different from the `-t` threshold used in `somef describe`,
112+
which controls the confidence of the supervised classifiers.
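
If you would rather change the value from a script than edit the file by hand, the following sketch shows one way to do it with the standard library. It is not part of SOMEF: the helper `set_similarity_threshold` is hypothetical, and the demo writes to a temporary file rather than the real `~/.somef/config.json`.

```python
import json
import tempfile
from pathlib import Path


def set_similarity_threshold(config_path: Path, value: float) -> dict:
    """Update `similarity_threshold` in a JSON config file, keeping other keys.

    Hypothetical helper; it only assumes the file is a flat JSON object.
    """
    if not 0.0 <= value <= 1.0:
        raise ValueError("similarity_threshold must be between 0.0 and 1.0")

    config = {}
    if config_path.exists():
        config = json.loads(config_path.read_text(encoding="utf-8"))

    config["similarity_threshold"] = value
    config_path.parent.mkdir(parents=True, exist_ok=True)
    config_path.write_text(json.dumps(config, indent=2), encoding="utf-8")
    return config


# Demo on a throwaway file so the real configuration is left untouched.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "config.json"
    demo_path.write_text(
        json.dumps({"description": "./models/description.p"}), encoding="utf-8"
    )
    updated = set_similarity_threshold(demo_path, 0.75)
    reloaded = json.loads(demo_path.read_text(encoding="utf-8"))
```

To apply it to your own setup you would pass `Path.home() / ".somef" / "config.json"` instead of the temporary path, after running `somef configure` to create the file.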
113+
89114
To see a live usage example, try our Binder Notebook: [![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/KnowledgeCaptureAndDiscovery/somef/HEAD?filepath=notebook%2FSOMEF%20Usage%20Example.ipynb)

poetry.lock

Lines changed: 23 additions & 25 deletions
