Feature/allow remote href blobs#683
Conversation
|
Question why not just use the local provider and add a dev script to download the blobs. Given this feature won't work with final releases anyway. |
|
If you're using local blobs, you need to create some system to identify where those blobs are stored/sourced -- either a custom script for every repository, or something generic with the metadata that is contributed to each repository and can be reused. Seems that 'bosh' itself would be correct place to reuse that code instead of every developer having to make up some mechanism. I see what you mean for the final release. You must declare a provider. So I setup the local provider, and it's good. My goal is to simply change the way we retrieve blobs when developing. Maybe an enhancement to the local provider would be better? |
|
The final release blobs will sill contain the packages and all there blobs dependencies. I do really like the idea of first class support for references to the original upstream artifacts. But then more from a provenance point of view. Ideally this could be used to verify that blobs match with upstream. |
node/node-v20.18.3-linux-x64.tar.xz:
size: 25810368
sha: sha256:595bcc9a28e6d1ee5fc7277b5c3cb029275b98ec0524e162a0c566c992a7ee5cWouldn't this be one of the purposes of the embedded SHA256? Just trying to clarify. Poking through the code, switching the |
|
Correct, the SHA is there to verify that, but sometimes it's difficult to know where a blob actually came from. Basically it would be great if a build system could independently verify that a blob when downloaded from the original source produces the same SHA. |
|
Hoping you can clarify what you're thinking of from the CLI. I assume |
|
Yeah some flag on |
…ource (if present) to check SHA.
|
First stab: $ bosh-cli create-release --validate-blob-origin
Blob download 'java/OpenJDK21U-jdk_x64_linux_hotspot_21.0.6_7.tar.gz' (207 MB) (id: - sha1: sha256:a2650fba422283fbed20d936ce5d2a52906a5414ec17b2f7676dddb87201dbae) started
Blob download 'java/OpenJDK21U-jdk_x64_linux_hotspot_21.0.6_7.tar.gz' (id: -) finished
Blob download 'node/node-v20.18.3-linux-x64.tar.xz' (26 MB) (id: - sha1: sha256:595bcc9a28e6d1ee5fc7277b5c3cb029275b98ec0524e162a0c566c992a7ee5c) started
Blob download 'node/node-v20.18.3-linux-x64.tar.xz' (id: -) finished
Added dev release 'test/v2+dev.1'
Name test
Version v2+dev.1
Commit Hash 2c76fc1
Job Digest Packages
does-nothing/1ca7516f74d7a497f78785b924bc3691c14a9e63a5905134ffb6d0b6158f4687 sha256:c1019226ae0a52a442c05eed96d3f90b8b2942d20e67eacb031d892c136c360a java
node
1 jobs
Package Digest Dependencies
java/889c392818ee6efc9e38f3db86b55d757d068d70a9087f95ee628eefc3751fec sha256:6b0c68cb8ea090112af7b2948ef9843cd23a9c85c98710758ecd1e4595e04d35 -
node/1dc0f3b044375b6865258b9207d6a12695baac376834090e60d86e1f3a5ee231 sha256:7e66dc70d7c9001346bc7375273c586455301c4f40407c79d315e04dddd994d1 -
2 packages
Succeeded... and manually botched one of the SHA codes: $ bosh-cli create-release --validate-blob-origin
Blob download 'java/OpenJDK21U-jdk_x64_linux_hotspot_21.0.6_7.tar.gz' (207 MB) (id: - sha1: sha256:a2650fba422283fbed20d936ce5d2a52906a5414ec17b2f7676dddb87201dbaf) started
Blob download 'java/OpenJDK21U-jdk_x64_linux_hotspot_21.0.6_7.tar.gz' (id: -) failed
Validating SHA for 'https://github.com/adoptium/temurin21-binaries/releases/download/jdk-21.0.6%!B(MISSING)7/OpenJDK21U-jdk_x64_linux_hotspot_21.0.6_7.tar.gz':
Expected stream to have digest 'sha256:a2650fba422283fbed20d936ce5d2a52906a5414ec17b2f7676dddb87201dbaf' but was 'sha256:a2650fba422283fbed20d936ce5d2a52906a5414ec17b2f7676dddb87201dbae'
Exit code 1I'm just using the current reporting code. I removed an HREF from one blob and it's currently listing as an error. I'll try to make that more of a warning instead. If you want to validate the source, but the source isn't configured, thinking it's not an error, just a warning that it can't be validated. |
|
Great progress!! Yeah, it makes sense to start with a warning, once this feature has been adopted more broadly, we can always add another flag (like --expect-all-origins). At some point, it would also be good to think about how to express the fact that blob origin was checked during release creation in the final release metadata. |
|
This should be ready for review. @a2geek would be great if you could resolve the merge conflicts. |
|
I think I got it resolved! |
|
|
||
| BlobstoreID string `yaml:"object_id,omitempty"` | ||
| SHA1 string `yaml:"sha"` | ||
| HREF string `yaml:"href,omitempty"` |
There was a problem hiding this comment.
As this change has been refocused on the notion of validating Blob origin, I would propose we change HREF to URI, even if we only support HTTP URI's at this point.
This would allow additional URI types to be added in the future.
| file, err := c.fs.OpenFile(opts.Args.Path, os.O_RDONLY, 0) | ||
| if err != nil { | ||
| return bosherr.WrapErrorf(err, "Opening blob") | ||
| var file io.ReadCloser | ||
| var err error | ||
| href := "" | ||
| if u, err := url.ParseRequestURI(opts.Args.Path); err == nil && u.Scheme != "" && u.Host != "" { | ||
| resp, err := http.Get(opts.Args.Path) | ||
| if err != nil { | ||
| return bosherr.WrapErrorf(err, "Downloading blob") | ||
| } | ||
| defer resp.Body.Close() //nolint:errcheck | ||
| file = resp.Body | ||
| href = opts.Args.Path | ||
| } else { | ||
| file, err = c.fs.OpenFile(opts.Args.Path, os.O_RDONLY, 0) | ||
| if err != nil { | ||
| return bosherr.WrapErrorf(err, "Opening blob") | ||
| } | ||
| defer file.Close() //nolint:errcheck | ||
| } | ||
|
|
||
| defer file.Close() //nolint:errcheck | ||
|
|
||
| blob, err := c.blobsDir.TrackBlob(opts.Args.BlobsPath, file) | ||
| blob, err := c.blobsDir.TrackBlob(opts.Args.BlobsPath, file, href) |
There was a problem hiding this comment.
Rather than handling the HREF -> File conversion here, I think it would make more sense either
TrackBlob()to takeopt.ArgsPathitself, and handle writing the remote file intotempFile(https://github.com/cloudfoundry/bosh-cli/pull/683/files?diff=unified&w=1#diff-17144e6602ffb3d080986c8a0843436cd75cd11990789e6ba43d64ad98e9ed41L195)- Have an explicit flag (ex:
--hrefor--uri) that indicates that a remote file is being added- If the local file exists and there is a
--uriflag passed this would be stored for later verification
- If the local file exists and there is a
There was a problem hiding this comment.
If TrackBlob() is handling the fetching, perhaps a better interface would be to pass a Blob{} struct explicitly and allow that function to fill in sha2 and size?
|
moved to review section |
aramprice
left a comment
There was a problem hiding this comment.
Thank you for the contribution, I appreciate the value of being able to create a bosh release by referencing blobs hosted remotely on a web server.
I'm worried that this change, as is, doesn't provide a full implementation of BOSH's blobs lifecycle (add, list, remove, sync, upload).
I do like the idea of being able to provide a URI for a blob which can be used to attest to, and (re)validate provenance of a blob, though this is something that I'd like to see as a feature which is added to the existing blob lifecycle, and available to any blobstore implementation.
In short it seems like PR is conflating two independent ideas:
- having
blobs.ymlwhich references remote URL's, not a blobstore (in bosh's sense) - adding metadata about the provenance of a blob to
blobs.yml
The BOSH CLI has historically expected blob storage to be read-write and this change will presumably break bosh upload-blobs. There is also a question about what one should expect to find the the blobstore config. Do we allow, for example, both a final.yml containing blobstore: {provider: gcs}, as well as a blobs to YAML which contains "HREF" and an object_id?
What happens if the HREF file doesn't match the checksum but the file in the GCS bucket does? What about the inverse?
A possible way forward would be to build out the "adding upstream/origin URI metadata for any blob" such that it fits in with all other blobstore implementations.
Separately perhaps opening an issue to discuss what it might look like for there to be aread-only blobstore lifecycle.
|
My original intent with this change was inspired by (what I thought) was the fact that the only blobs that go in the blob store were the binaries that we attach while developing. I have found that this is not true. In shifting between computers, I realized all the "internal" packages and jobs are also blobs that are stored -- and that the BOSH CLI cannot recover (at least not easily) if those blobs don't exist; even for prior versions of those blobs. Due to these realizations, I don't think this change could have ever worked. We all need to use a blob store, and a commercial one if we are doing public work. Sorry for the chase! Cancelling the PR. |
|
@a2geek - no worries, thank you for the reply and for digging into this. If you ever feel like working on the "metadata pointing to the upstream source" aspect that is part of this PR that would definitely be valuable. In any case welcome to the bosh world, and thanks for raising this! |
This stems from a question I asked in the Cloud Foundry BOSH Slack channel. I've been wondering why the BOSH CLI for development must use some S3 (or similar) buckets for storing blobs. Why not just allow URL references and download from the source? Many BOSH related artifacts are hosted in Github. Why duplicate that somewhere?
This was just a quick spike to prove the idea.
Note that this only impacts CLI for development usage.
Since this was more of a spike, the HREF uses the Go
http.Get(etc) directly. From the rest of the code base, this likely should be behind some interface so it can be faked for direct unit tests.Is this of interest?
What needs to be changed to meet standards?
Is there any existing code that should be used? (For example, around the URL handling.)
The following commands were modified:
bosh add-blobnow accepts a URL, stored as HREF in the blob structures and yaml file.bosh blobslists the HREF. (This ended up a bit too wide, and I could easily be convinced this is superfluous).bosh sync-blobsdefers to the blobstore configuration -- if there is none and the HREF exists, the blob is simply downloaded from the source.Example repository is here: https://github.com/a2geek/test-release
(feel free to compile this BOSH CLI, clone that test release, and run the
bosh sync-blobs...)