Feature/allow remote href blobs by a2geek · Pull Request #683 · cloudfoundry/bosh-cli

a2geek · 2025-02-27T17:55:45Z

This stems from a question I asked in the Cloud Foundry BOSH Slack channel. I've been wondering why the BOSH CLI must use an S3 (or similar) bucket to store blobs during development. Why not just allow URL references and download from the source? Many BOSH-related artifacts are hosted on GitHub. Why duplicate them somewhere else?

This was just a quick spike to prove the idea.

Note that this only impacts CLI for development usage.

Since this was more of a spike, the HREF download calls Go's http.Get (etc.) directly. Judging from the rest of the code base, this should likely sit behind an interface so it can be faked in unit tests.

Is this of interest?
What needs to be changed to meet standards?
Is there any existing code that should be used? (For example, around the URL handling.)

The following commands were modified:

bosh add-blob now accepts a URL, stored as HREF in the blob structures and yaml file.
bosh blobs lists the HREF. (This ended up a bit too wide, and I could easily be convinced this is superfluous).
bosh sync-blobs defers to the blobstore configuration -- if there is none and the HREF exists, the blob is simply downloaded from the source.
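
The sync-blobs rule in the list above can be sketched as a small decision function. This is only an illustration of the described precedence: the Blob fields shown, the Source type, and chooseSource are hypothetical names, not the PR's actual code.

```go
package main

// Blob holds only the fields relevant to this sketch; the real struct differs.
type Blob struct {
	Path        string
	BlobstoreID string
	HREF        string
}

// Source says where sync-blobs would fetch a blob from.
type Source int

const (
	SourceNone      Source = iota // nothing to download from
	SourceBlobstore               // a blobstore is configured, so it wins
	SourceHREF                    // no blobstore, but the blob records an HREF
)

// chooseSource defers to the blobstore configuration when one exists and
// falls back to the recorded HREF only when there is none.
func chooseSource(b Blob, blobstoreConfigured bool) Source {
	switch {
	case blobstoreConfigured:
		return SourceBlobstore
	case b.HREF != "":
		return SourceHREF
	default:
		return SourceNone
	}
}
```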

Example repository is here: https://github.com/a2geek/test-release
(feel free to compile this BOSH CLI, clone that test release, and run the bosh sync-blobs...)

$ bosh add-blob https://nodejs.org/dist/v20.18.3/node-v20.18.3-linux-x64.tar.xz node/node-v20.18.3-linux-x64.tar.xz
Added blob 'node/node-v20.18.3-linux-x64.tar.xz'

Succeeded
$ bosh add-blob https://github.com/adoptium/temurin21-binaries/releases/download/jdk-21.0.6%2B7/OpenJDK21U-jdk_x64_linux_hotspot_21.0.6_7.tar.gz java/OpenJDK21U-jdk_x64_linux_hotspot_21.0.6_7.tar.gz
Added blob 'java/OpenJDK21U-jdk_x64_linux_hotspot_21.0.6_7.tar.gz'

Succeeded
$ bosh blobs
Path                                                   Size     Blobstore ID  Digest                                                                   HREF  
java/OpenJDK21U-jdk_x64_linux_hotspot_21.0.6_7.tar.gz  197 MiB  (local)       sha256:a2650fba422283fbed20d936ce5d2a52906a5414ec17b2f7676dddb87201dbae  https://github.com/adoptium/temurin21-binaries/releases/download/jdk-21.0.6%2B7/OpenJDK21U-jdk_x64_linux_hotspot_21.0.6_7.tar.gz  
nodejs/node-v22.14.0-linux-x64.tar.xz                  28 MiB   (local)       sha256:69b09dba5c8dcb05c4e4273a4340db1005abeafe3927efda2bc5b249e80437ec  https://nodejs.org/dist/v22.14.0/node-v22.14.0-linux-x64.tar.xz  

2 blobs

Succeeded
$ tree .
.
├── blobs
│   ├── java
│   │   └── OpenJDK21U-jdk_x64_linux_hotspot_21.0.6_7.tar.gz
│   └── nodejs
│       └── node-v22.14.0-linux-x64.tar.xz
├── config
│   ├── blobs.yml
│   └── final.yml
├── jobs
├── packages
└── src

8 directories, 4 files
$ rm -rf blobs/*
$ tree .
.
├── blobs
├── config
│   ├── blobs.yml
│   └── final.yml
├── jobs
├── packages
└── src

6 directories, 2 files
$ bosh sync-blobs 
Blob download 'java/OpenJDK21U-jdk_x64_linux_hotspot_21.0.6_7.tar.gz' (207 MB) (id: - sha1: sha256:a2650fba422283fbed20d936ce5d2a52906a5414ec17b2f7676dddb87201dbae) started
Blob download 'nodejs/node-v22.14.0-linux-x64.tar.xz' (30 MB) (id: - sha1: sha256:69b09dba5c8dcb05c4e4273a4340db1005abeafe3927efda2bc5b249e80437ec) started
Blob download 'nodejs/node-v22.14.0-linux-x64.tar.xz' (id: -) finished
Blob download 'java/OpenJDK21U-jdk_x64_linux_hotspot_21.0.6_7.tar.gz' (id: -) finished

Succeeded
$ tree .
.
├── blobs
│   ├── java
│   │   └── OpenJDK21U-jdk_x64_linux_hotspot_21.0.6_7.tar.gz
│   └── nodejs
│       └── node-v22.14.0-linux-x64.tar.xz
├── config
│   ├── blobs.yml
│   └── final.yml
├── jobs
├── packages
└── src

8 directories, 4 files

rkoster · 2025-03-06T16:07:13Z

Question: why not just use the local provider and add a dev script to download the blobs, given that this feature won't work with final releases anyway?

a2geek · 2025-03-06T18:28:18Z

If you're using local blobs, you need to create some system to identify where those blobs are stored/sourced -- either a custom script for every repository, or something generic with the metadata that is contributed to each repository and can be reused. It seems that 'bosh' itself would be the correct place to reuse that code, instead of every developer having to make up some mechanism.

I see what you mean for the final release. You must declare a provider. So I set up the local provider, and it's good. My goal is simply to change the way we retrieve blobs when developing. Maybe an enhancement to the local provider would be better?

rkoster · 2025-03-06T18:52:16Z

The final release blobs will still contain the packages and all their blob dependencies.

I do really like the idea of first class support for references to the original upstream artifacts. But then more from a provenance point of view. Ideally this could be used to verify that blobs match with upstream.

a2geek · 2025-03-06T19:43:47Z

> But then more from a provenance point of view. Ideally this could be used to verify that blobs match with upstream.

node/node-v20.18.3-linux-x64.tar.xz:
  size: 25810368
  sha: sha256:595bcc9a28e6d1ee5fc7277b5c3cb029275b98ec0524e162a0c566c992a7ee5c

Wouldn't this be one of the purposes of the embedded SHA256? Just trying to clarify.

Poking through the code, switching the local provider over to being the rebuild looks to be a larger change than I'd expect. Everything works off the blob id and not the Blob structure. (Meaning the URL gets dropped somewhere along the way.)

rkoster · 2025-03-07T08:51:19Z

Correct, the SHA is there to verify that, but sometimes it's difficult to know where a blob actually came from. Basically it would be great if a build system could independently verify that a blob when downloaded from the original source produces the same SHA.

a2geek · 2025-03-14T14:25:02Z

Hoping you can clarify what you're thinking of from the CLI. I assume bosh add-blob http://... blobref to identify the URL. How would the validation work? (Obviously not for every sync-blobs command... maybe a flag of some sort?)

rkoster · 2025-03-25T08:27:35Z

Yeah, some flag such as bosh create-release --final --validate-blob-origin, which would validate that the blobs match their origins.

a2geek · 2025-03-26T15:43:37Z

First stab:

$ bosh-cli create-release --validate-blob-origin
Blob download 'java/OpenJDK21U-jdk_x64_linux_hotspot_21.0.6_7.tar.gz' (207 MB) (id: - sha1: sha256:a2650fba422283fbed20d936ce5d2a52906a5414ec17b2f7676dddb87201dbae) started
Blob download 'java/OpenJDK21U-jdk_x64_linux_hotspot_21.0.6_7.tar.gz' (id: -) finished
Blob download 'node/node-v20.18.3-linux-x64.tar.xz' (26 MB) (id: - sha1: sha256:595bcc9a28e6d1ee5fc7277b5c3cb029275b98ec0524e162a0c566c992a7ee5c) started
Blob download 'node/node-v20.18.3-linux-x64.tar.xz' (id: -) finished

Added dev release 'test/v2+dev.1'

Name         test  
Version      v2+dev.1  
Commit Hash  2c76fc1  

Job                                                                            Digest                                                                   Packages  
does-nothing/1ca7516f74d7a497f78785b924bc3691c14a9e63a5905134ffb6d0b6158f4687  sha256:c1019226ae0a52a442c05eed96d3f90b8b2942d20e67eacb031d892c136c360a  java  
                                                                                                                                                        node  

1 jobs

Package                                                                Digest                                                                   Dependencies  
java/889c392818ee6efc9e38f3db86b55d757d068d70a9087f95ee628eefc3751fec  sha256:6b0c68cb8ea090112af7b2948ef9843cd23a9c85c98710758ecd1e4595e04d35  -  
node/1dc0f3b044375b6865258b9207d6a12695baac376834090e60d86e1f3a5ee231  sha256:7e66dc70d7c9001346bc7375273c586455301c4f40407c79d315e04dddd994d1  -  

2 packages

Succeeded

... and manually botched one of the SHA codes:

$ bosh-cli create-release --validate-blob-origin
Blob download 'java/OpenJDK21U-jdk_x64_linux_hotspot_21.0.6_7.tar.gz' (207 MB) (id: - sha1: sha256:a2650fba422283fbed20d936ce5d2a52906a5414ec17b2f7676dddb87201dbaf) started

Blob download 'java/OpenJDK21U-jdk_x64_linux_hotspot_21.0.6_7.tar.gz' (id: -) failed

Validating SHA for 'https://github.com/adoptium/temurin21-binaries/releases/download/jdk-21.0.6%!B(MISSING)7/OpenJDK21U-jdk_x64_linux_hotspot_21.0.6_7.tar.gz':
  Expected stream to have digest 'sha256:a2650fba422283fbed20d936ce5d2a52906a5414ec17b2f7676dddb87201dbaf' but was 'sha256:a2650fba422283fbed20d936ce5d2a52906a5414ec17b2f7676dddb87201dbae'

Exit code 1

I'm just using the current reporting code. I removed an HREF from one blob and it's currently reported as an error. I'll try to make that a warning instead. If you ask to validate the source but no source is configured, I'm thinking that's not an error, just a warning that the blob can't be validated.
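
A sketch of the warning-versus-error split described above, under stated assumptions: checkOrigin and its actualDigest callback are hypothetical names, and the callback stands in for "download the HREF and compute its digest". The sketch passes the URL to fmt as an argument. The `%!B(MISSING)` in the failure output above is what Go's fmt package prints when a verb such as `%2B` has no operand, which suggests the URL ended up inside a format string.

```go
package main

import "fmt"

// checkOrigin classifies one blob:
//   - no HREF recorded: a warning, since there is nothing to validate against
//   - download failure or digest mismatch: an error
//   - digest matches: neither
//
// actualDigest is a hypothetical callback that downloads href and returns
// the digest of what it received.
func checkOrigin(href, expected string, actualDigest func(string) (string, error)) (warning string, err error) {
	if href == "" {
		return "no origin recorded; skipping validation", nil
	}
	got, err := actualDigest(href)
	if err != nil {
		return "", fmt.Errorf("downloading %q: %w", href, err)
	}
	if got != expected {
		// href is an argument here, never part of the format string.
		return "", fmt.Errorf("validating %q: expected digest %q but was %q", href, expected, got)
	}
	return "", nil
}
```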

rkoster · 2025-03-27T08:37:46Z

Great progress!! Yeah, it makes sense to start with a warning. Once this feature has been adopted more broadly, we can always add another flag (like --expect-all-origins).

At some point, it would also be good to think about how to express the fact that blob origin was checked during release creation in the final release metadata.

beyhan · 2025-04-24T14:58:18Z

This should be ready for review.

@a2geek would be great if you could resolve the merge conflicts.

a2geek · 2025-04-28T21:42:16Z

I think I got it resolved!

ramonskie

lgtm

aramprice · 2025-06-12T14:36:07Z


 	BlobstoreID string `yaml:"object_id,omitempty"`
 	SHA1        string `yaml:"sha"`
+	HREF        string `yaml:"href,omitempty"`


As this change has been refocused on the notion of validating Blob origin, I would propose we change HREF to URI, even if we only support HTTP URIs at this point.

This would allow additional URI types to be added in the future.
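
A sketch of why a URI field leaves room for growth, assuming validation dispatches on the scheme. The function originScheme is a hypothetical name. Only http and https are accepted here, and any other scheme is rejected explicitly until support for it is added.

```go
package main

import (
	"fmt"
	"net/url"
)

// originScheme parses an origin URI and returns its scheme if it is one this
// sketch knows how to fetch. New schemes would get their own case.
func originScheme(raw string) (string, error) {
	u, err := url.Parse(raw)
	if err != nil {
		return "", fmt.Errorf("parsing origin URI: %w", err)
	}
	switch u.Scheme {
	case "http", "https":
		if u.Host == "" {
			return "", fmt.Errorf("origin URI %q has no host", raw)
		}
		return u.Scheme, nil
	default:
		return "", fmt.Errorf("unsupported origin URI scheme %q", u.Scheme)
	}
}
```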

aramprice · 2025-06-13T19:24:58Z

-	file, err := c.fs.OpenFile(opts.Args.Path, os.O_RDONLY, 0)
-	if err != nil {
-		return bosherr.WrapErrorf(err, "Opening blob")
+	var file io.ReadCloser
+	var err error
+	href := ""
+	if u, err := url.ParseRequestURI(opts.Args.Path); err == nil && u.Scheme != "" && u.Host != "" {
+		resp, err := http.Get(opts.Args.Path)
+		if err != nil {
+			return bosherr.WrapErrorf(err, "Downloading blob")
+		}
+		defer resp.Body.Close() //nolint:errcheck
+		file = resp.Body
+		href = opts.Args.Path
+	} else {
+		file, err = c.fs.OpenFile(opts.Args.Path, os.O_RDONLY, 0)
+		if err != nil {
+			return bosherr.WrapErrorf(err, "Opening blob")
+		}
+		defer file.Close() //nolint:errcheck
 	}

 	defer file.Close() //nolint:errcheck

-	blob, err := c.blobsDir.TrackBlob(opts.Args.BlobsPath, file)
+	blob, err := c.blobsDir.TrackBlob(opts.Args.BlobsPath, file, href)


Rather than handling the HREF -> file conversion here, I think it would make more sense to do one of the following:

Have TrackBlob() take opts.Args.Path itself, and handle writing the remote file into tempFile (https://github.com/cloudfoundry/bosh-cli/pull/683/files?diff=unified&w=1#diff-17144e6602ffb3d080986c8a0843436cd75cd11990789e6ba43d64ad98e9ed41L195)

Have an explicit flag (ex: --href or --uri) that indicates that a remote file is being added

If the local file exists and a --uri flag is passed, the URI would be stored for later verification

If TrackBlob() is handling the fetching, perhaps a better interface would be to pass a Blob{} struct explicitly and allow that function to fill in sha2 and size?

aramprice · 2025-06-13T19:44:49Z

moved to review section

aramprice

Thank you for the contribution, I appreciate the value of being able to create a bosh release by referencing blobs hosted remotely on a web server.

I'm worried that this change, as is, doesn't provide a full implementation of BOSH's blobs lifecycle (add, list, remove, sync, upload).

I do like the idea of being able to provide a URI for a blob which can be used to attest to, and (re)validate provenance of a blob, though this is something that I'd like to see as a feature which is added to the existing blob lifecycle, and available to any blobstore implementation.

In short, it seems like this PR is conflating two independent ideas:

having a blobs.yml which references remote URLs, not a blobstore (in bosh's sense)
adding metadata about the provenance of a blob to blobs.yml

The BOSH CLI has historically expected blob storage to be read-write, and this change will presumably break bosh upload-blobs. There is also a question about what one should expect to find in the blobstore config. Do we allow, for example, both a final.yml containing blobstore: {provider: gcs} and a blobs.yml which contains "HREF" and an object_id?

What happens if the HREF file doesn't match the checksum but the file in the GCS bucket does? What about the inverse?

A possible way forward would be to build out the "adding upstream/origin URI metadata for any blob" such that it fits in with all other blobstore implementations.

Separately, perhaps open an issue to discuss what a read-only blobstore lifecycle might look like.

a2geek · 2025-06-26T16:36:38Z

My original intent with this change was inspired by what I thought was the fact that the only blobs that go in the blob store are the binaries that we attach while developing. I have found that this is not true. In shifting between computers, I realized all the "internal" packages and jobs are also blobs that are stored -- and that the BOSH CLI cannot recover (at least not easily) if those blobs don't exist, even for prior versions of those blobs. Given these realizations, I don't think this change could have ever worked. We all need to use a blob store, and a commercial one if we are doing public work.

Sorry for the chase!

Cancelling the PR.

aramprice · 2025-06-26T16:45:00Z

@a2geek - no worries, thank you for the reply and for digging into this.

If you ever feel like working on the "metadata pointing to the upstream source" aspect that is part of this PR that would definitely be valuable.

In any case welcome to the bosh world, and thanks for raising this!

a2geek added 5 commits February 26, 2025 19:05

Updating add-blob command and related components to allow HREF.

6a42280

Updating blobs command to show HREF.

25338d3

Adding sync-blobs to use HREF when blobstore not specified.

568081e

Omitting empty HREF's in blob listing.

a06a564

Forcing blobstore ID to be "-" since it doesn't apply for HREF downloads.

e680e89

cf-foundation-community-automation Bot added this to Foundational Infrastructure Working Group Feb 27, 2025

cf-foundation-community-automation Bot moved this to Inbox in Foundational Infrastructure Working Group Feb 27, 2025

rkoster requested review from a team, anshrupani and klakin-pivotal and removed request for a team March 6, 2025 16:05

rkoster moved this from Inbox to Pending Review | Discussion in Foundational Infrastructure Working Group Mar 6, 2025

Adding --validate-blob-origin to create-release to re-download from source (if present) to check SHA.

8f41766

Adding a simple message to BlobsDirReporter.

afa1d22

aramprice requested review from a team and ragaskar and removed request for a team and klakin-pivotal April 17, 2025 14:44

beyhan requested a review from aramprice April 24, 2025 14:58

Merge remote-tracking branch 'upstream/main' into feature/allow-remote-href-blobs

6d95f6b

Updates to ignore deferred closes.

bead4e8

ramonskie approved these changes May 26, 2025

github-project-automation Bot moved this from Pending Review | Discussion to Pending Merge | Prioritized in Foundational Infrastructure Working Group May 26, 2025

aramprice reviewed Jun 12, 2025

beyhan moved this from Pending Merge | Prioritized to Waiting for Changes | Open for Contribution in Foundational Infrastructure Working Group Jun 12, 2025

aramprice reviewed Jun 13, 2025

lnguyen approved these changes Jun 17, 2025

github-project-automation Bot moved this from Waiting for Changes | Open for Contribution to Pending Merge | Prioritized in Foundational Infrastructure Working Group Jun 17, 2025

beyhan moved this from Pending Merge | Prioritized to Waiting for Changes | Open for Contribution in Foundational Infrastructure Working Group Jun 26, 2025

aramprice requested changes Jun 26, 2025

a2geek closed this Jun 26, 2025

github-project-automation Bot moved this from Waiting for Changes | Open for Contribution to Done in Foundational Infrastructure Working Group Jun 26, 2025
