2 changes: 2 additions & 0 deletions .gitignore
Original file line number Diff line number Diff line change
@@ -24,5 +24,7 @@ repos.txt
!package_neors.json
!package_npm.json
!test_data/api_responses/*.json
!**/test_data/api_responses/codeberg/*.json
!**/test_data/api_responses/bitbucket/*.json
uv.lock
.python-version
12 changes: 6 additions & 6 deletions README.md
@@ -13,7 +13,7 @@ A command line interface for automatically extracting relevant metadata from cod

## Features

Given a readme file (or a GitHub/Gitlab repository) SOMEF will extract the following categories (if present), listed in alphabetical order:
Given a readme file (or a GitHub/Gitlab/Codeberg/Bitbucket repository) SOMEF will extract the following categories (if present), listed in alphabetical order:

- **Acknowledgement**: Text acknowledging funding sources or contributors
- **Application domain**: The application domain of the repository. Currently supported domains include: Astrophysics, Audio, Computer vision, Graphs, Natural language processing, Reinforcement learning, Semantic web, Sequential. Domains are not mutually exclusive. These domains have been extracted from [awesome lists](https://github.com/topics/awesome-list) and [Papers with code](https://paperswithcode.com/). Find more information in our [documentation](https://somef.readthedocs.io/en/latest/)
@@ -38,7 +38,7 @@ We recognize the following properties:
- Year: Year of publication
- Pages: Page range in the journal
- **Code of conduct**: Link to the code of conduct of the project
- **Code repository**: Link to the GitHub/GitLab repository used for the extraction
- **Code repository**: Link to the GitHub/GitLab/Codeberg and Bitbucket repository used for the extraction
- **Contact**: Contact person responsible for maintaining a software component
- **Continuous integration**: Link to continuous integration service(s)
- **Contribution guidelines**: Text indicating how to contribute to this code repository
@@ -72,7 +72,7 @@ We recognize the following properties:
- **Package files**: Links to package files used to wrap the project in a package.
- **Programming languages**: Languages used in the repository
- **Related papers**: URL to possible related papers within the repository stated within the readme file (from Arxiv)
- **Releases** (GitHub only): Pointer to the available versions of a software component. For each release, somef will track the following properties:
- **Releases**: Pointer to the available versions of a software component. For each release, somef will track the following properties:
- Description: Release notes
- Author: Agent responsible of creating the release
- Name: Name of the release
@@ -93,7 +93,7 @@ We recognize the following properties:
- **Usage examples**: Assumptions and considerations recorded by the authors when executing a software component, or examples on how to use it
- **Workflows**: URL and path to the computational workflow files present in the repository

We use different supervised classifiers, header analysis, regular expressions, the GitHub/Gitlab API to retrieve all these fields (more than one technique may be used for each field) and language specific metadata parsers (e.g., for package files). Each extraction records its provenance, with the confidence and technique used on each step. For more information check the [output format description](https://somef.readthedocs.io/en/latest/output/)
We use different supervised classifiers, header analysis, regular expressions, the GitHub/Gitlab/Codeberg and Bitbucket API to retrieve all these fields (more than one technique may be used for each field) and language specific metadata parsers (e.g., for package files). Each extraction records its provenance, with the confidence and technique used on each step. For more information check the [output format description](https://somef.readthedocs.io/en/latest/output/)

## Documentation

@@ -297,10 +297,10 @@ Usage: somef describe [OPTIONS]
Options:
-t, --threshold FLOAT Threshold to classify the text [required]
Input: [mutually_exclusive, required]
-r, --repo_url URL Github/Gitlab Repository URL
-r, --repo_url URL Github/Gitlab/Codeberg/Bitbucket Repository URL
-d, --doc_src PATH Path to the README file source
-i, --in_file PATH A file of newline separated links to GitHub/
Gitlab repositories
Gitlab/Codeberg/Bitbucket repositories
-l, --local_repo PATH Path to the local repository source. No APIs will be used

Output: [required_any]
6 changes: 3 additions & 3 deletions docs/index.md
@@ -46,7 +46,7 @@ We recognize the following properties:
- Year: Year of publication
- Pages: Page range in the journal
- **Code of conduct**: Link to the code of conduct of the project
- **Code repository**: Link to the GitHub/GitLab repository used for the extraction
- **Code repository**: Link to the GitHub/GitLab/Codeberg/Bitbucket repository used for the extraction
- **Contact**: Contact person responsible for maintaining a software component
- **Continuous integration**: Link to continuous integration service(s)
- **Contribution guidelines**: Text indicating how to contribute to this code repository
@@ -80,7 +80,7 @@ We recognize the following properties:
- **Package files**: Links to package files used to wrap the project in a package.
- **Programming languages**: Languages used in the repository
- **Related papers**: URL to possible related papers within the repository stated within the readme file (from Arxiv)
- **Releases** (GitHub and Gitlab): Pointer to the available versions of a software component. For each release, somef will track the following properties:
- **Releases** (GitHub, Gitlab, Codeberg and Bitbucket): Pointer to the available versions of a software component. For each release, somef will track the following properties:
- Assets: files attached to the release
- Description: Release notes
- Author: Agent responsible of creating the release
@@ -102,7 +102,7 @@ We recognize the following properties:
- **Usage examples**: Assumptions and considerations recorded by the authors when executing a software component, or examples on how to use it
- **Workflows**: URL and path to the computational workflow files present in the repository

We use different supervised classifiers, header analysis, regular expressions, the GitHub/Gitlab API to retrieve all these fields (more than one technique may be used for each field) and language specific metadata parsers (e.g., for package files). Each extraction records its provenance, with the confidence and technique used on each step. For more information check the [output format description](https://somef.readthedocs.io/en/latest/output/)
We use different supervised classifiers, header analysis, regular expressions, the GitHub/Gitlab/Codeberg/Bitbucket API to retrieve all these fields (more than one technique may be used for each field) and language specific metadata parsers (e.g., for package files). Each extraction records its provenance, with the confidence and technique used on each step. For more information check the [output format description](https://somef.readthedocs.io/en/latest/output/)

<a name="myfootnote1">1</a> The available application domains currently are:

58 changes: 56 additions & 2 deletions docs/output.md
@@ -73,7 +73,7 @@ SOMEF aims to recognize the following categories (in alphabetical order):
- `code_of_conduct`: Link to the code of conduct file of the project
- `code_repository`: Link to the source code (typically the repository where the readme can be found)
- `contact`: Contact person responsible for maintaining a software component.
- `continuous_integration`: Link to continuous integration service, supported on GitHub as well as in GitLab.
- `continuous_integration`: Link to continuous integration service, supported on GitHub as well as in GitLab, Codeberg and Bitbucket.
- `contributing guidelines`: Guidelines indicating how to contribute to a software component.
- `contributor`: Contributors to this software. Note: Contributor metadata is exported from metadata files (e.g., CodeMeta, CONTRIBUTORS, etc.) not from git logs.
- `copyright_holder`: Entity or individual owning the rights to the software. The year is also extracted, if available.
@@ -167,7 +167,7 @@ Depending on the `type` of the result, additional properties may be found.

The following object `types` are currently supported:

- `Release`: software releases of the current code repository, as available from GitHub.
- `Release`: software releases of the current code repository, as available from GitHub, GitLab, Codeberg and Bitbucket (where repository tags are used as releases).
- `Programming_language`: Programming language used in the repository.
- `License`: object representing all the metadata SOMEF extracts from a license.
- `Agent`: user (typically, a person) or organization responsible for authoring a software release or a paper.
@@ -317,6 +317,8 @@ The techniques can be of several types:
- `file_exploration`: the result comes from an exploration of the files in the repository
- `GitHub_API`: the result was obtained from the GitHub API.
- `GitLab_API`: the result was obtained from the GitLab API.
- `Codeberg_API`: the result was obtained from the Codeberg API.
- `Bitbucket_API`: the result was obtained from the Bitbucket API.
- `regular_expression`: the result was obtained after performing regular expressions on the files in the repository.
- `software_type_heuristics`: the result was obtained from analysis of the repository based on various heuristics from the README, code and extension analysis.
- `supervised_classification`: the results were obtained after running text classifiers trained for detecting that type of header.
@@ -405,6 +407,58 @@ A more detailed explanation is provided in the [wiki](https://github.com/oeg-upm
```
As shown in the Turtle snippet above, SOMEF represents the software as an entity, its relationship with each release (software version), the license found in the repository and the Person who owns it.
-->
## Codeberg API Crosswalk

When analyzing a Codeberg repository, SOMEF uses the [Codeberg API](https://codeberg.org/api/v1/swagger)
(`GET /api/v1/repos/{owner}/{repo}`) to retrieve metadata. The table below shows how Codeberg API
fields map to SOMEF categories:

| SOMEF category | Codeberg API field | Notes |
|---|---|---|
| `name` | `name` | |
| `description` | `description` | |
| `code_repository` | `html_url` | |
| `owner` | `owner.login` | |
| `date_created` | `created_at` | |
| `date_updated` | `updated_at` | |
| `stars` | `stars_count` | In GitHub this field is `stargazers_count` |
| `forks_count` | `forks_count` | |
| `homepage` | `website` | In GitHub this field is `homepage` |
| `keywords` | `topics` | |
| `issue_tracker` | *(constructed)* | Built as `{html_url}/issues` |
| `license` | *(not available)* | Codeberg API does not return license information |
| `programming_languages` | `languages_url` | Additional GET request to the languages endpoint |
| `releases` | `/repos/{owner}/{repo}/releases` | Additional GET request |

For releases, the field mapping is identical to GitHub. The only differences are that Codeberg
uses `attachments` instead of `assets` for release files, and it does not provide
`author.type` (`AGENT_TYPE`) for release authors.
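
The crosswalk above can be sketched as a small mapping function. This is only an illustration of the table, not SOMEF's actual code: the names `codeberg_to_somef` and `release_files` are hypothetical, and the input is assumed to be the already-decoded JSON payload of the repository endpoint.

```python
from typing import Any


def codeberg_to_somef(repo: dict[str, Any]) -> dict[str, Any]:
    """Map a decoded Codeberg repository payload to SOMEF category names.

    Hypothetical helper that mirrors the crosswalk table row by row.
    """
    html_url = repo.get("html_url")
    owner = repo.get("owner") or {}
    return {
        "name": repo.get("name"),
        "description": repo.get("description"),
        "code_repository": html_url,
        "owner": owner.get("login"),
        "date_created": repo.get("created_at"),
        "date_updated": repo.get("updated_at"),
        "stars": repo.get("stars_count"),  # GitHub calls this stargazers_count
        "forks_count": repo.get("forks_count"),
        "homepage": repo.get("website"),  # GitHub calls this homepage
        "keywords": repo.get("topics"),
        # Constructed from html_url, not returned by the API.
        "issue_tracker": f"{html_url}/issues" if html_url else None,
    }


def release_files(release: dict[str, Any]) -> list[Any]:
    """Return release files: `attachments` on Codeberg, `assets` on GitHub."""
    return release.get("attachments") or release.get("assets") or []
```

Languages and releases are left out of the sketch because, as the table notes, each needs an additional GET request.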


## Bitbucket API Crosswalk

When analyzing a Bitbucket repository, SOMEF uses the [Bitbucket Cloud API](https://developer.atlassian.com/cloud/bitbucket/rest/api-group-repositories/)
(`GET /2.0/repositories/{workspace}/{repo_slug}`) to retrieve metadata. The table below shows how Bitbucket API
fields map to SOMEF categories:

| SOMEF category | Bitbucket API field | Notes |
|---|---|---|
| `name` | `slug` | |
| `description` | `description` | |
| `full_name` | `full_name` | Format: `{workspace}/{slug}` |
| `code_repository` | `links.html.href` | |
| `owner` | `owner.nickname` | Falls back to `owner.username` for team workspaces |
| `date_created` | `created_on` | |
| `date_updated` | `updated_on` | |
| `homepage` | `website` | |
| `forks_url` | `links.forks.href` | |
| `download_url` | *(constructed)* | Built as `{html_url}/downloads` |
| `issue_tracker` | *(constructed)* | Built as `{html_url}/issues` when `has_issues` is true |
| `programming_languages` | `language` | Single string, not a dictionary with sizes |
| `releases` | `/refs/tags` | Bitbucket has no dedicated releases endpoint; uses the tags endpoint |
| `stars` | *(not available)* | Bitbucket does not have a stargazers feature |
| `forks_count` | *(not available)* | Bitbucket does not expose fork counts in its API |
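
As with Codeberg, the table can be read as a mapping over the decoded JSON payload. The sketch below is illustrative only: `bitbucket_to_somef` is a hypothetical name, not a function in SOMEF, and it covers only the fields that come from the single repository request.

```python
from typing import Any


def bitbucket_to_somef(repo: dict[str, Any]) -> dict[str, Any]:
    """Map a decoded Bitbucket repository payload to SOMEF category names.

    Hypothetical helper that mirrors the crosswalk table row by row.
    """
    links = repo.get("links") or {}
    html_url = (links.get("html") or {}).get("href")
    owner = repo.get("owner") or {}
    result = {
        "name": repo.get("slug"),
        "description": repo.get("description"),
        "full_name": repo.get("full_name"),
        "code_repository": html_url,
        # Fall back to username when nickname is absent (team workspaces).
        "owner": owner.get("nickname") or owner.get("username"),
        "date_created": repo.get("created_on"),
        "date_updated": repo.get("updated_on"),
        "homepage": repo.get("website"),
        "forks_url": (links.get("forks") or {}).get("href"),
        # A single string, not a dictionary of languages with sizes.
        "programming_languages": repo.get("language"),
    }
    if html_url:
        result["download_url"] = f"{html_url}/downloads"
        # The issue tracker link is only built when issues are enabled.
        if repo.get("has_issues"):
            result["issue_tracker"] = f"{html_url}/issues"
    return result
```

`stars` and `forks_count` are deliberately absent from the result, matching the two *(not available)* rows.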


## Citation Reconciliation

58 changes: 44 additions & 14 deletions src/somef/process_files.py
@@ -328,6 +328,7 @@ def process_repository_files(repo_dir, metadata_result: Result, repo_type, owner

# if repo_type == constants.RepositoryType.GITLAB:
if filename.endswith(".yml"):
category = None
if repo_type == constants.RepositoryType.GITLAB:
analysis = extract_workflows.is_file_continuous_integration_gitlab(os.path.join(repo_dir, file_path))
if analysis:
@@ -345,26 +346,29 @@ def process_repository_files(repo_dir, metadata_result: Result, repo_type, owner
{
constants.PROP_VALUE: workflow_url,
constants.PROP_TYPE: constants.URL
}, 1, constants.TECHNIQUE_FILE_EXPLORATION)
}, 1, constants.TECHNIQUE_FILE_EXPLORATION)
elif repo_type == constants.RepositoryType.CODEBERG:
if (file_path.startswith(".forgejo/workflows/") or file_path.startswith(".gitea/workflows/")):
category = constants.CAT_CONTINUOUS_INTEGRATION
else:
category = None
elif repo_type == constants.RepositoryType.BITBUCKET:
if os.path.basename(file_path) == "bitbucket-pipelines.yml":
category = constants.CAT_CONTINUOUS_INTEGRATION
else:
category = None
elif repo_type == constants.RepositoryType.GITHUB:
# if file_path.startswith(".github/workflows/"):
# category = constants.CAT_WORKFLOWS
# elif filename in [".travis.yml", "azure-pipelines.yml", "jenkinsfile"] or file_path.startswith(".circleci/"):
# category = constants.CAT_CONTINUOUS_INTEGRATION
# else:
# category = None
if file_path.startswith(".github/workflows/"):
category = constants.CAT_CONTINUOUS_INTEGRATION
else:
category = None

if category:
workflow_url = get_file_link(repo_type, file_path, owner, repo_name, repo_default_branch,
repo_dir, repo_relative_path, filename)
metadata_result.add_result(category,
{constants.PROP_VALUE: workflow_url, constants.PROP_TYPE: constants.URL},
1, constants.TECHNIQUE_FILE_EXPLORATION)

if category:
workflow_url = get_file_link(repo_type, file_path, owner, repo_name, repo_default_branch,
repo_dir, repo_relative_path, filename)
metadata_result.add_result(category,
{constants.PROP_VALUE: workflow_url, constants.PROP_TYPE: constants.URL},
1, constants.TECHNIQUE_FILE_EXPLORATION)
if filename.endswith(".ga") or filename.endswith(".cwl") or filename.endswith(".nf") or (
filename.endswith(".snake") or filename.endswith(
".smk") or "Snakefile" == filename_no_ext) or filename.endswith(".knwf") or filename.endswith(
@@ -413,6 +417,10 @@ def process_repository_files(repo_dir, metadata_result: Result, repo_type, owner
docs_url = f"https://github.com/{owner}/{repo_name}/tree/{urllib.parse.quote(repo_default_branch)}/{docs_path}"
elif repo_type == constants.RepositoryType.GITLAB:
docs_url = f"https://{domain_gitlab}/{owner}/{repo_name}/-/tree/{urllib.parse.quote(repo_default_branch)}/{docs_path}"
elif repo_type == constants.RepositoryType.CODEBERG:
docs_url = f"https://codeberg.org/{owner}/{repo_name}/src/branch/{urllib.parse.quote(repo_default_branch)}/{docs_path}"
elif repo_type == constants.RepositoryType.BITBUCKET:
docs_url = f"https://bitbucket.org/{owner}/{repo_name}/src/{urllib.parse.quote(repo_default_branch)}/{docs_path}"
else:
docs_url = os.path.join(repo_dir, docs_path)
# docs.append(docs_url)
@@ -452,6 +460,10 @@ def get_file_link(repo_type, file_path, owner, repo_name, repo_default_branch, r
return convert_to_raw_user_content_github(file_path, owner, repo_name, repo_default_branch)
elif repo_type == constants.RepositoryType.GITLAB:
return convert_to_raw_user_content_gitlab(file_path, owner, repo_name, repo_default_branch)
elif repo_type == constants.RepositoryType.CODEBERG:
return convert_to_raw_user_content_codeberg(file_path, owner, repo_name, repo_default_branch)
elif repo_type == constants.RepositoryType.BITBUCKET:
return convert_to_raw_user_content_bitbucket(file_path, owner, repo_name, repo_default_branch)
else:
return os.path.join(repo_dir, repo_relative_path, filename)

@@ -695,6 +707,24 @@ def convert_to_raw_user_content_github(partial, owner, repo_name, repo_ref):
return f"https://raw.githubusercontent.com/{owner}/{repo_name}/{repo_ref}/{urllib.parse.quote(partial)}"


def convert_to_raw_user_content_codeberg(partial, owner, repo_name, repo_ref):
    """Converts Codeberg paths into raw content URLs"""
    # Strip only the leading "./" or ".\" prefix; str.replace would also remove later occurrences in the path.
    if partial.startswith("./"):
        partial = partial[2:]
    if partial.startswith(".\\"):
        partial = partial[2:]
    return f"https://codeberg.org/{owner}/{repo_name}/raw/branch/{repo_ref}/{urllib.parse.quote(partial)}"


def convert_to_raw_user_content_bitbucket(partial, owner, repo_name, repo_ref):
    """Converts Bitbucket paths into raw content URLs"""
    # Strip only the leading "./" or ".\" prefix; str.replace would also remove later occurrences in the path.
    if partial.startswith("./"):
        partial = partial[2:]
    if partial.startswith(".\\"):
        partial = partial[2:]
    return f"https://bitbucket.org/{owner}/{repo_name}/raw/{repo_ref}/{urllib.parse.quote(partial)}"
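
The Codeberg and Bitbucket helpers above differ only in their URL pattern, so one possible consolidation is a single table-driven function. The sketch below is a suggestion, not part of this change: `RAW_URL_TEMPLATES` and `raw_content_url` are hypothetical names, and the URL patterns are copied from the two helpers as they appear in this diff.

```python
import urllib.parse

# URL patterns taken from the two helpers in this diff; keys are illustrative.
RAW_URL_TEMPLATES = {
    "codeberg": "https://codeberg.org/{owner}/{repo}/raw/branch/{ref}/{path}",
    "bitbucket": "https://bitbucket.org/{owner}/{repo}/raw/{ref}/{path}",
}


def raw_content_url(host, partial, owner, repo_name, repo_ref):
    """Build a raw-content URL for a repository-relative path (hypothetical helper)."""
    # Drop a leading "./" or ".\" only; the rest of the path is left untouched.
    for prefix in ("./", ".\\"):
        if partial.startswith(prefix):
            partial = partial[len(prefix):]
    return RAW_URL_TEMPLATES[host].format(
        owner=owner,
        repo=repo_name,
        ref=repo_ref,
        path=urllib.parse.quote(partial),  # "/" stays unescaped by default
    )


print(raw_content_url("codeberg", "./docs/read me.md", "alice", "demo", "main"))
# → https://codeberg.org/alice/demo/raw/branch/main/docs/read%20me.md
```

Slicing off the prefix instead of calling `str.replace` keeps later `./` sequences in the path intact, for example in a directory whose name ends with a dot.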


def convert_to_raw_user_content_gitlab(partial, owner, repo_name, repo_ref):
"""Converts GitLab paths into raw content URLs, accessible by users"""
if partial.startswith("./"):