Skip to content

Feature/allow remote href blobs#683

Closed
a2geek wants to merge 9 commits into
cloudfoundry:mainfrom
a2geek:feature/allow-remote-href-blobs
Closed

Feature/allow remote href blobs#683
a2geek wants to merge 9 commits into
cloudfoundry:mainfrom
a2geek:feature/allow-remote-href-blobs

Conversation

@a2geek
Copy link
Copy Markdown

@a2geek a2geek commented Feb 27, 2025

This stems from a question I asked in the Cloud Foundry BOSH Slack channel. I've been wondering why the BOSH CLI for development must use some S3 (or similar) buckets for storing blobs. Why not just allow URL references and download from the source? Many BOSH related artifacts are hosted in Github. Why duplicate that somewhere?

This was just a quick spike to prove the idea.

Note that this only impacts CLI for development usage.

Since this was more of a spike, the HREF uses the Go http.Get (etc) directly. From the rest of the code base, this likely should be behind some interface so it can be faked for direct unit tests.

Is this of interest?
What needs to be changed to meet standards?
Is there any existing code that should be used? (For example, around the URL handling.)

The following commands were modified:

  • bosh add-blob now accepts a URL, stored as HREF in the blob structures and yaml file.
  • bosh blobs lists the HREF. (This ended up a bit too wide, and I could easily be convinced this is superfluous).
  • bosh sync-blobs defers to the blobstore configuration -- if there is none and the HREF exists, the blob is simply downloaded from the source.

Example repository is here: https://github.com/a2geek/test-release
(feel free to compile this BOSH CLI, clone that test release, and run the bosh sync-blobs...)

$ bosh add-blob https://nodejs.org/dist/v20.18.3/node-v20.18.3-linux-x64.tar.xz node/node-v20.18.3-linux-x64.tar.xz
Added blob 'node/node-v20.18.3-linux-x64.tar.xz'

Succeeded
 bosh add-blob https://github.com/adoptium/temurin21-binaries/releases/download/jdk-21.0.6%2B7/OpenJDK21U-jdk_x64_linux_hotspot_21.0.6_7.tar.gz java/OpenJDK21U-jdk_x64_linux_hotspot_21.0.6_7.tar.gz
Added blob 'java/OpenJDK21U-jdk_x64_linux_hotspot_21.0.6_7.tar.gz'

Succeeded
$ bosh blobs
Path                                                   Size     Blobstore ID  Digest                                                                   HREF  
java/OpenJDK21U-jdk_x64_linux_hotspot_21.0.6_7.tar.gz  197 MiB  (local)       sha256:a2650fba422283fbed20d936ce5d2a52906a5414ec17b2f7676dddb87201dbae  https://github.com/adoptium/temurin21-binaries/releases/download/jdk-21.0.6%2B7/OpenJDK21U-jdk_x64_linux_hotspot_21.0.6_7.tar.gz  
nodejs/node-v22.14.0-linux-x64.tar.xz                  28 MiB   (local)       sha256:69b09dba5c8dcb05c4e4273a4340db1005abeafe3927efda2bc5b249e80437ec  https://nodejs.org/dist/v22.14.0/node-v22.14.0-linux-x64.tar.xz  

2 blobs

Succeeded
$ tree .
.
├── blobs
│   ├── java
│   │   └── OpenJDK21U-jdk_x64_linux_hotspot_21.0.6_7.tar.gz
│   └── nodejs
│       └── node-v22.14.0-linux-x64.tar.xz
├── config
│   ├── blobs.yml
│   └── final.yml
├── jobs
├── packages
└── src

8 directories, 4 files
$ rm -rf blobs/*
$ tree .
.
├── blobs
├── config
│   ├── blobs.yml
│   └── final.yml
├── jobs
├── packages
└── src

6 directories, 2 files
$ bosh sync-blobs 
Blob download 'java/OpenJDK21U-jdk_x64_linux_hotspot_21.0.6_7.tar.gz' (207 MB) (id: - sha1: sha256:a2650fba422283fbed20d936ce5d2a52906a5414ec17b2f7676dddb87201dbae) started
Blob download 'nodejs/node-v22.14.0-linux-x64.tar.xz' (30 MB) (id: - sha1: sha256:69b09dba5c8dcb05c4e4273a4340db1005abeafe3927efda2bc5b249e80437ec) started
Blob download 'nodejs/node-v22.14.0-linux-x64.tar.xz' (id: -) finished
Blob download 'java/OpenJDK21U-jdk_x64_linux_hotspot_21.0.6_7.tar.gz' (id: -) finished

Succeeded
$ tree .
.
├── blobs
│   ├── java
│   │   └── OpenJDK21U-jdk_x64_linux_hotspot_21.0.6_7.tar.gz
│   └── nodejs
│       └── node-v22.14.0-linux-x64.tar.xz
├── config
│   ├── blobs.yml
│   └── final.yml
├── jobs
├── packages
└── src

8 directories, 4 files

@rkoster
Copy link
Copy Markdown
Contributor

rkoster commented Mar 6, 2025

Question why not just use the local provider and add a dev script to download the blobs. Given this feature won't work with final releases anyway.

@rkoster rkoster moved this from Inbox to Pending Review | Discussion in Foundational Infrastructure Working Group Mar 6, 2025
@a2geek
Copy link
Copy Markdown
Author

a2geek commented Mar 6, 2025

If you're using local blobs, you need to create some system to identify where those blobs are stored/sourced -- either a custom script for every repository, or something generic with the metadata that is contributed to each repository and can be reused. Seems that 'bosh' itself would be correct place to reuse that code instead of every developer having to make up some mechanism.

I see what you mean for the final release. You must declare a provider. So I setup the local provider, and it's good. My goal is to simply change the way we retrieve blobs when developing. Maybe an enhancement to the local provider would be better?

@rkoster
Copy link
Copy Markdown
Contributor

rkoster commented Mar 6, 2025

The final release blobs will sill contain the packages and all there blobs dependencies.

I do really like the idea of first class support for references to the original upstream artifacts. But then more from a provenance point of view. Ideally this could be used to verify that blobs match with upstream.

@a2geek
Copy link
Copy Markdown
Author

a2geek commented Mar 6, 2025

But then more from a provenance point of view. Ideally this could be used to verify that blobs match with upstream.

node/node-v20.18.3-linux-x64.tar.xz:
  size: 25810368
  sha: sha256:595bcc9a28e6d1ee5fc7277b5c3cb029275b98ec0524e162a0c566c992a7ee5c

Wouldn't this be one of the purposes of the embedded SHA256? Just trying to clarify.

Poking through the code, switching the local provider over to being the rebuild looks to be a larger change than I'd expect. Everything works off the blob id and not the Blob structure. (Meaning the URL gets dropped somewhere along the way.)

@rkoster
Copy link
Copy Markdown
Contributor

rkoster commented Mar 7, 2025

Correct, the SHA is there to verify that, but sometimes it's difficult to know where a blob actually came from. Basically it would be great if a build system could independently verify that a blob when downloaded from the original source produces the same SHA.

@a2geek
Copy link
Copy Markdown
Author

a2geek commented Mar 14, 2025

Hoping you can clarify what you're thinking of from the CLI. I assume bosh add-blob http://... blobref to identify the URL. How would the validation work? (Obviously not for every sync-blobs command... maybe a flag of some sort?)

@rkoster
Copy link
Copy Markdown
Contributor

rkoster commented Mar 25, 2025

Yeah some flag on bosh create-release --final --validate-blob-origin, which would validate that the blobs match their origins.

@a2geek
Copy link
Copy Markdown
Author

a2geek commented Mar 26, 2025

First stab:

$ bosh-cli create-release --validate-blob-origin
Blob download 'java/OpenJDK21U-jdk_x64_linux_hotspot_21.0.6_7.tar.gz' (207 MB) (id: - sha1: sha256:a2650fba422283fbed20d936ce5d2a52906a5414ec17b2f7676dddb87201dbae) started
Blob download 'java/OpenJDK21U-jdk_x64_linux_hotspot_21.0.6_7.tar.gz' (id: -) finished
Blob download 'node/node-v20.18.3-linux-x64.tar.xz' (26 MB) (id: - sha1: sha256:595bcc9a28e6d1ee5fc7277b5c3cb029275b98ec0524e162a0c566c992a7ee5c) started
Blob download 'node/node-v20.18.3-linux-x64.tar.xz' (id: -) finished

Added dev release 'test/v2+dev.1'

Name         test  
Version      v2+dev.1  
Commit Hash  2c76fc1  

Job                                                                            Digest                                                                   Packages  
does-nothing/1ca7516f74d7a497f78785b924bc3691c14a9e63a5905134ffb6d0b6158f4687  sha256:c1019226ae0a52a442c05eed96d3f90b8b2942d20e67eacb031d892c136c360a  java  
                                                                                                                                                        node  

1 jobs

Package                                                                Digest                                                                   Dependencies  
java/889c392818ee6efc9e38f3db86b55d757d068d70a9087f95ee628eefc3751fec  sha256:6b0c68cb8ea090112af7b2948ef9843cd23a9c85c98710758ecd1e4595e04d35  -  
node/1dc0f3b044375b6865258b9207d6a12695baac376834090e60d86e1f3a5ee231  sha256:7e66dc70d7c9001346bc7375273c586455301c4f40407c79d315e04dddd994d1  -  

2 packages

Succeeded

... and manually botched one of the SHA codes:

$ bosh-cli create-release --validate-blob-origin
Blob download 'java/OpenJDK21U-jdk_x64_linux_hotspot_21.0.6_7.tar.gz' (207 MB) (id: - sha1: sha256:a2650fba422283fbed20d936ce5d2a52906a5414ec17b2f7676dddb87201dbaf) started

Blob download 'java/OpenJDK21U-jdk_x64_linux_hotspot_21.0.6_7.tar.gz' (id: -) failed

Validating SHA for 'https://github.com/adoptium/temurin21-binaries/releases/download/jdk-21.0.6%!B(MISSING)7/OpenJDK21U-jdk_x64_linux_hotspot_21.0.6_7.tar.gz':
  Expected stream to have digest 'sha256:a2650fba422283fbed20d936ce5d2a52906a5414ec17b2f7676dddb87201dbaf' but was 'sha256:a2650fba422283fbed20d936ce5d2a52906a5414ec17b2f7676dddb87201dbae'

Exit code 1

I'm just using the current reporting code. I removed an HREF from one blob and it's currently listing as an error. I'll try to make that more of a warning instead. If you want to validate the source, but the source isn't configured, thinking it's not an error, just a warning that it can't be validated.

@rkoster
Copy link
Copy Markdown
Contributor

rkoster commented Mar 27, 2025

Great progress!! Yeah, it makes sense to start with a warning, once this feature has been adopted more broadly, we can always add another flag (like --expect-all-origins).

At some point, it would also be good to think about how to express the fact that blob origin was checked during release creation in the final release metadata.

@aramprice aramprice requested review from a team and ragaskar and removed request for a team and klakin-pivotal April 17, 2025 14:44
@beyhan
Copy link
Copy Markdown
Member

beyhan commented Apr 24, 2025

This should be ready for review.

@a2geek would be great if you could resolve the merge conflicts.

@beyhan beyhan requested a review from aramprice April 24, 2025 14:58
@a2geek
Copy link
Copy Markdown
Author

a2geek commented Apr 28, 2025

I think I got it resolved!

Copy link
Copy Markdown
Contributor

@ramonskie ramonskie left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@github-project-automation github-project-automation Bot moved this from Pending Review | Discussion to Pending Merge | Prioritized in Foundational Infrastructure Working Group May 26, 2025

BlobstoreID string `yaml:"object_id,omitempty"`
SHA1 string `yaml:"sha"`
HREF string `yaml:"href,omitempty"`
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As this change has been refocused on the notion of validating Blob origin, I would propose we change HREF to URI, even if we only support HTTP URI's at this point.

This would allow additional URI types to be added in the future.

@beyhan beyhan moved this from Pending Merge | Prioritized to Waiting for Changes | Open for Contribution in Foundational Infrastructure Working Group Jun 12, 2025
Comment thread cmd/add_blob.go
Comment on lines -25 to +49
file, err := c.fs.OpenFile(opts.Args.Path, os.O_RDONLY, 0)
if err != nil {
return bosherr.WrapErrorf(err, "Opening blob")
var file io.ReadCloser
var err error
href := ""
if u, err := url.ParseRequestURI(opts.Args.Path); err == nil && u.Scheme != "" && u.Host != "" {
resp, err := http.Get(opts.Args.Path)
if err != nil {
return bosherr.WrapErrorf(err, "Downloading blob")
}
defer resp.Body.Close() //nolint:errcheck
file = resp.Body
href = opts.Args.Path
} else {
file, err = c.fs.OpenFile(opts.Args.Path, os.O_RDONLY, 0)
if err != nil {
return bosherr.WrapErrorf(err, "Opening blob")
}
defer file.Close() //nolint:errcheck
}

defer file.Close() //nolint:errcheck

blob, err := c.blobsDir.TrackBlob(opts.Args.BlobsPath, file)
blob, err := c.blobsDir.TrackBlob(opts.Args.BlobsPath, file, href)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than handling the HREF -> File conversion here, I think it would make more sense either

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If TrackBlob() is handling the fetching, perhaps a better interface would be to pass a Blob{} struct explicitly and allow that function to fill in sha2 and size?

@aramprice
Copy link
Copy Markdown
Member

aramprice commented Jun 13, 2025

moved to review section

@github-project-automation github-project-automation Bot moved this from Waiting for Changes | Open for Contribution to Pending Merge | Prioritized in Foundational Infrastructure Working Group Jun 17, 2025
@beyhan beyhan moved this from Pending Merge | Prioritized to Waiting for Changes | Open for Contribution in Foundational Infrastructure Working Group Jun 26, 2025
Copy link
Copy Markdown
Member

@aramprice aramprice left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the contribution, I appreciate the value of being able to create a bosh release by referencing blobs hosted remotely on a web server.

I'm worried that this change, as is, doesn't provide a full implementation of BOSH's blobs lifecycle (add, list, remove, sync, upload).

I do like the idea of being able to provide a URI for a blob which can be used to attest to, and (re)validate provenance of a blob, though this is something that I'd like to see as a feature which is added to the existing blob lifecycle, and available to any blobstore implementation.

In short it seems like PR is conflating two independent ideas:

  • having blobs.yml which references remote URL's, not a blobstore (in bosh's sense)
  • adding metadata about the provenance of a blob to blobs.yml

The BOSH CLI has historically expected blob storage to be read-write and this change will presumably break bosh upload-blobs. There is also a question about what one should expect to find the the blobstore config. Do we allow, for example, both a final.yml containing blobstore: {provider: gcs}, as well as a blobs to YAML which contains "HREF" and an object_id?

What happens if the HREF file doesn't match the checksum but the file in the GCS bucket does? What about the inverse?


A possible way forward would be to build out the "adding upstream/origin URI metadata for any blob" such that it fits in with all other blobstore implementations.

Separately perhaps opening an issue to discuss what it might look like for there to be aread-only blobstore lifecycle.

@a2geek
Copy link
Copy Markdown
Author

a2geek commented Jun 26, 2025

My original intent with this change was inspired by (what I thought) was the fact that the only blobs that go in the blob store were the binaries that we attach while developing. I have found that this is not true. In shifting between computers, I realized all the "internal" packages and jobs are also blobs that are stored -- and that the BOSH CLI cannot recover (at least not easily) if those blobs don't exist; even for prior versions of those blobs. Due to these realizations, I don't think this change could have ever worked. We all need to use a blob store, and a commercial one if we are doing public work.

Sorry for the chase!

Cancelling the PR.

@a2geek a2geek closed this Jun 26, 2025
@github-project-automation github-project-automation Bot moved this from Waiting for Changes | Open for Contribution to Done in Foundational Infrastructure Working Group Jun 26, 2025
@aramprice
Copy link
Copy Markdown
Member

@a2geek - no worries, thank you for the reply and for digging into this.

If you ever feel like working on the "metadata pointing to the upstream source" aspect that is part of this PR that would definitely be valuable.

In any case welcome to the bosh world, and thanks for raising this!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Development

Successfully merging this pull request may close these issues.

6 participants