Skip to content

Improve PURL creation logic for SBoM#782

Merged
spoorcc merged 15 commits intomainfrom
spoorcc/issue780
Sep 29, 2025
Merged

Improve PURL creation logic for SBoM#782
spoorcc merged 15 commits intomainfrom
spoorcc/issue780

Conversation

@spoorcc
Copy link
Copy Markdown
Contributor

@spoorcc spoorcc commented Sep 13, 2025

Move out the purl logic and add tests for it. Also add support for Bitbucket PURLs

To be tested on dependency-track

Description by Korbit AI

What change is being made?

Improve PURL creation logic for SBoM by centralizing remote URL to PURL conversion, enriching SBOM components with accurate VCS references, and expanding support for various VCS styles (GitHub, Bitbucket, SVN, SSH paths, and generic VCS URLs); add tests and minor workflow/readme updates to reflect the changes.

Why are these changes being made?

Provide a robust, consistent mapping from remote URLs to PURLs, ensuring accurate type, namespace, name, and external VCS references across common VCS providers and edge cases; align SBOM generation with the new PURL logic and expand test coverage.

Is this description stale? Ask me to generate a new description by commenting /korbit-generate-pr-description

Copy link
Copy Markdown

@korbit-ai korbit-ai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review by Korbit AI

Korbit automatically attempts to detect when you fix issues in new commits.
Category Issue Status
Readability Hardcoded domain list without explanation ▹ view
Readability Hardcoded fallback string 'unknown' ▹ view
Functionality Invalid empty VCS reference creation ▹ view
Security Credential leakage via VCS URL in SBOM ▹ view
Error Handling Lack of error handling around PURL creation ▹ view
Design Rigid handler registry (Open-Closed violation) ▹ view
Documentation Internal helper docstrings lack return value details ▹ view
Files scanned
File Path Reviewed
dfetch/reporting/sbom_reporter.py
dfetch/util/purl.py

Explore our documentation to understand the languages and file types we support and the files we ignore.

Check out our docs on how you can make Korbit work best for you and your team.

Loving Korbit!? Share us on LinkedIn Reddit and X

Comment thread dfetch/util/purl.py Outdated
Comment thread dfetch/util/purl.py Outdated
Comment thread dfetch/reporting/sbom_reporter.py
Comment thread dfetch/reporting/sbom_reporter.py Outdated
Comment thread dfetch/reporting/sbom_reporter.py
Comment thread dfetch/util/purl.py Outdated
Comment thread dfetch/util/purl.py Outdated
@spoorcc
Copy link
Copy Markdown
Contributor Author

spoorcc commented Sep 15, 2025

/korbit-review

Copy link
Copy Markdown

@korbit-ai korbit-ai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review by Korbit AI

Korbit automatically attempts to detect when you fix issues in new commits.
Category Issue Status
Readability Undocumented Complex Regex Pattern ▹ view ✅ Fix detected
Files scanned
File Path Reviewed
dfetch/util/purl.py
dfetch/reporting/sbom_reporter.py

Explore our documentation to understand the languages and file types we support and the files we ignore.

Check out our docs on how you can make Korbit work best for you and your team.

Loving Korbit!? Share us on LinkedIn Reddit and X

Comment thread dfetch/util/purl.py
@spoorcc spoorcc linked an issue Sep 15, 2025 that may be closed by this pull request
@ben-edna
Copy link
Copy Markdown
Contributor

Using https://github.com/CycloneDX/sbom-utility with

$ sbom-utility validate --input-file my-sbom.json

shows that PURL seems ok, but externalReferences url still uses ssh

 {
        "type": "format",
        "field": "components.0.externalReferences.0.url",
        "context": "(root).components.0.externalReferences.0.url",
        "description": "Does not match format 'iri-reference'",
        "value": "git@git.mycompany.com:mycompany/some-repo"
  }

@spoorcc
Copy link
Copy Markdown
Contributor Author

spoorcc commented Sep 19, 2025

Verified SBoM with:

 wget https://github.com/CycloneDX/sbom-utility/releases/download/v0.18.1/sbom-utility-v0.18.1-linux-amd64.tar.gz
tar -xvzf sbom-utility-v0.18.1-linux-amd64.tar.gz 
./sbom-utility validate --input-file report.json 

@spoorcc
Copy link
Copy Markdown
Contributor Author

spoorcc commented Sep 19, 2025

/korbit-review

Copy link
Copy Markdown

@korbit-ai korbit-ai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review by Korbit AI

Korbit automatically attempts to detect when you fix issues in new commits.
Category Issue Status
Performance Inefficient TLD extraction without caching ▹ view
Readability Unexplained SSH URL exclusion ▹ view
Functionality Python version compatibility issue with removeprefix ▹ view
Error Handling Incorrect unpacking of actual_content in assertion message ▹ view
Files scanned
File Path Reviewed
dfetch/reporting/sbom_reporter.py
dfetch/util/purl.py
features/steps/generic_steps.py

Explore our documentation to understand the languages and file types we support and the files we ignore.

Check out our docs on how you can make Korbit work best for you and your team.

Loving Korbit!? Share us on LinkedIn Reddit and X

Comment thread dfetch/util/purl.py
from packageurl import PackageURL
from tldextract import TLDExtract

NO_FETCH_EXTRACT = TLDExtract(suffix_list_urls=(), extra_suffixes=("local",))
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Inefficient TLD extraction without caching category Performance

Tell me more
What is the issue?

TLDExtract is instantiated at module level with network-disabled configuration, creating a global object that performs domain parsing on every call.

Why this matters

This creates unnecessary overhead when the same domains are parsed repeatedly, and the object could be reused more efficiently or cached results could be stored.

Suggested change ∙ Feature Preview

Consider caching the results of domain extraction or using a lazy initialization pattern if the TLDExtract object is expensive to create. Alternatively, cache the results of NO_FETCH_EXTRACT(domain).domain calls within the _namespace_and_name_from_domain_and_path function using functools.lru_cache.

Provide feedback to improve future suggestions

Nice Catch Incorrect Not in Scope Not in coding standard Other

💬 Looking for more details? Reply to this comment to chat with Korbit.

component.group = purl.namespace

vcs_url = purl.qualifiers.get("vcs_url", "")
if vcs_url and "ssh://" not in vcs_url:
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unexplained SSH URL exclusion category Readability

Tell me more
What is the issue?

The string literal 'ssh://' is hardcoded in the condition without explanation of why SSH URLs are excluded.

Why this matters

Without context, developers need to guess why SSH URLs are being filtered out, making the code harder to maintain.

Suggested change ∙ Feature Preview
# Define a constant at module level
SSH_URL_PREFIX = 'ssh://'

# Add a comment explaining the exclusion
# Skip SSH URLs as they are not publicly accessible
if vcs_url and SSH_URL_PREFIX not in vcs_url:
Provide feedback to improve future suggestions

Nice Catch Incorrect Not in Scope Not in coding standard Other

💬 Looking for more details? Reply to this comment to chat with Korbit.

Comment thread dfetch/util/purl.py Outdated

if "svn" in parsed.scheme or "svn." in parsed.netloc:
namespace, name = _namespace_and_name_from_domain_and_path(parsed.netloc, path)
namespace = namespace.replace("/svn/", "/").removeprefix("p/")
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Python version compatibility issue with removeprefix category Functionality

Tell me more
What is the issue?

The removeprefix method is only available in Python 3.9+, but the code doesn't specify this requirement.

Why this matters

This will cause AttributeError on Python versions below 3.9, making the code incompatible with older Python versions.

Suggested change ∙ Feature Preview

Use compatible string slicing or add version requirement:

# Option 1: Compatible approach
namespace = namespace.replace("/svn/", "/")
if namespace.startswith("p/"):
    namespace = namespace[2:]

# Option 2: Or ensure Python 3.9+ requirement is documented
Provide feedback to improve future suggestions

Nice Catch Incorrect Not in Scope Not in coding standard Other

💬 Looking for more details? Reply to this comment to chat with Korbit.

Comment thread features/steps/generic_steps.py Outdated
@ben-edna
Copy link
Copy Markdown
Contributor

SBoM is now accepted by Dependency track. A thing to document or be aware of is that the name used is the project name in the manifest and not necessarily the authoritative name

@spoorcc spoorcc force-pushed the spoorcc/issue780 branch 4 times, most recently from 629c622 to 6bf70cd Compare September 29, 2025 21:07
@spoorcc spoorcc merged commit 33638c1 into main Sep 29, 2025
25 checks passed
@spoorcc spoorcc deleted the spoorcc/issue780 branch September 29, 2025 21:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Invalid SBoM created when projects uses ssh

2 participants