[v2] Validate full object checksum on multipart downloads#10180
Open
[v2] Validate full object checksum on multipart downloads#10180
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Current State
s3transferdoesn't perform any checksum calculation or validation. This is handled bybotocoreunder the following conditions (assuming checksum mode is enabled):botocorecalculates and validates the full object checksum.botocorecalculates and validates the part-level checksum for each part retrieved.There are 2 gaps here:
botocoreperforms no validation.botocoreperforms no validation. There's not much the client can do here. It would need to know the exact part sizes the object was originally uploaded with and then use those offsets to independently calculate the composite checksum.This PR is designed to address the first gap by having
s3transfercalculate and validate the full object checksum.Solution
Calculating and validating full object checksums is constrained to CRC-based algorithms because because CRC algorithms have a property where combining multiple part-level checksums produces the same final value as if the checksum was calculated in a single, serial stream. This is important because as the transfer manager downloads multiple parts in parallel, the part bodies won't be blocked from being released after write. If full object checksums had to be calculated in a single, serial stream, then the part bodies would be blocked waiting to update a single checksum object. We use CRT's CRC combine functions for this.
The design makes calculating part-level checksums the responsibility of
botocoreand combining part-level checksums into a single full object checksum the responsibility ofs3transfer.botocore'sStreamingChecksumBodyalready calculates the checksum as the body is read from stream, and returns it tos3transfer. However, if the object was downloaded without checksum mode enabled, then it won't be returned as aStreamingChecksumBodyobject. In this case,s3transferwill wrap the body intoStreamingChecksumBodyso the checksum is calculated. This prevents double-computation of checksums. One tradeoff here is that checking to see if the returned body has a checksum attribute creates some coupling betweenbotocoreands3transfer, but I couldn't think of a way to cleanly separate responsibilities without introducing any coupling.When
s3transferinitiates a multipart download, it decides from theHeadObjectresponse whether or not it should calculate the full object checksum. If yes, then it creates aFullObjectChecksumCombinerobject. As each part is downloaded, the part-level checksums are downloaded and stored in theFullObjectChecksumCombinerobject. Once all parts have been downloaded,FullObjectChecksumCombinercombines all the part-level checksums and validates the full object checksum against the stored value returned from the initialHeadObject.Manual Testing
Reviewer should also sanity check here.
s3transfercalculates part checksum since S3 doesn't return part-level checksums when object has full object checksum, sobotocoredoesn't do any calculation.s3transferalso combines part checksums and validates full object checksum.botocorecalculates and validates full object checksum.s3transfercalculates part checksum and validates full object checksum.botocorecalculates and validates full object checksum.botocorecalculates part checksum and validates part checksums. No full object checksum validation is performed.Current Statesection.s3transfercalculates part checksum and validates full object checksum.botocorecalculates and validates full object checksum.