Commit 4dfec8d

The original problem, reported by one customer, was that they got non-retryable
errors during CompleteMPU calls. In that case, their client would logically
follow up with a call to the AbortMPU API.

We made a mistake during the last bugfix:
- We incorrectly thought that the error was caused by a failure during the
  extra part deletion, and fixed it by cleaning everything up during AbortMPU.
  But this is wrong: when an error happens at that step of the CompleteMPU API,
  the S3 object was properly created AND the overview keys and parts metadata
  were already cleaned up. Aborting in such a case would just return a 404
  error and would not do anything.

Instead, the state the customer encountered was different: they failed at the
`deletePartsMetadata` step. This is something we support with the
`CompleteMultiPartUpload` retry logic. The cleanup in the AbortMPU API is
actually useful here: when we end up in this situation, the client has two
available (and valid) choices:
- They retry the CompleteMultiPartUpload.
- They call the AbortMPU API. The latter is now possible since the cleanup was
  introduced: we detect that the object is in an inconsistent state and remove
  everything (both in the S3 bucket and the mpuShadowBucket), so the result is
  consistent: the MPU is aborted and the customer must re-upload the whole
  object.

The first option is preferred, as it keeps the object in the bucket and avoids
the need to re-upload (a client-side sketch of this decision follows below).
The problem is that we do not guarantee that a retryable error is returned to
the client: when the metadata update fails, the returned error might not be
retryable (DeleteConflict or NoSuchKey), while others usually are
(InternalError, etc.).

So we want to ensure two behaviors:
- In the batchDeleteObjectMetadata error callback, return an error with the
  right retry semantics: a DeleteConflict must be surfaced as a retryable error
  so the client retries, while a NoSuchKey stays non-retryable so the client
  does not retry.
- During batchDeleteExtraParts, the MPU bucket has already been cleaned up, so
  returning an error to the client when this step fails would wrongly lead to a
  CompleteMPU retry or an Abort, even though the operation succeeded and we
  only created ghosts. In any case, the ghosts are permanent: without the MPU
  bucket, we cannot clean them up in subsequent calls.

Issue: CLDSRV-669
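To make the two client-side options above concrete, here is a minimal sketch using the AWS SDK for JavaScript v3: retry CompleteMultipartUpload on errors treated as retryable, and fall back to AbortMultipartUpload otherwise. The retryable-error set, the retry count, and the parameter shape are assumptions for the example, not part of this change.

```js
// Minimal client-side sketch (not part of this change): prefer retrying
// CompleteMPU, fall back to AbortMPU when the error is not retryable.
const {
    S3Client,
    CompleteMultipartUploadCommand,
    AbortMultipartUploadCommand,
} = require('@aws-sdk/client-s3');

const s3 = new S3Client({});

// Assumed set of error codes the client treats as retryable.
const RETRYABLE = new Set(['InternalError', 'ServiceUnavailable', 'SlowDown']);

async function completeOrAbort(params, maxAttempts = 3) {
    // params: { Bucket, Key, UploadId, MultipartUpload: { Parts } }
    for (let attempt = 1; attempt <= maxAttempts; attempt += 1) {
        try {
            // Preferred option: keeps the object and avoids re-uploading parts.
            return await s3.send(new CompleteMultipartUploadCommand(params));
        } catch (err) {
            if (RETRYABLE.has(err.name) && attempt < maxAttempts) {
                continue;
            }
            // Fallback option: abort the MPU; with the AbortMPU cleanup, the
            // state is consistent and the whole object must be re-uploaded.
            await s3.send(new AbortMultipartUploadCommand({
                Bucket: params.Bucket,
                Key: params.Key,
                UploadId: params.UploadId,
            }));
            throw err;
        }
    }
    return undefined;
}
```

Whether the fallback abort is issued automatically or left to an operator is a client policy choice; the point is that only retryable errors should drive the retry path.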
1 parent 34ee809 commit 4dfec8d

1 file changed: lib/api/completeMultipartUpload.js (52 additions, 3 deletions)
```diff
@@ -524,15 +524,64 @@ function completeMultipartUpload(authInfo, request, log, callback) {
         function deletePartsMetadata(mpuBucket, keysToDelete, aggregateETag,
             extraPartLocations, destinationBucket, generatedVersionId, droppedMPUSize, next) {
             services.batchDeleteObjectMetadata(mpuBucket.getName(),
-                keysToDelete, log, err => next(err, extraPartLocations,
-                    destinationBucket, aggregateETag, generatedVersionId, droppedMPUSize));
+                keysToDelete, log, err => {
+                    if (err) {
+                        log.error('error deleting MPU metadata during completeMPU', {
+                            method: 'completeMultipartUpload',
+                            mpuBucket: mpuBucket.getName(),
+                            keysToDeleteCount: keysToDelete.length,
+                            error: err,
+                        });
+
+                        // Handle specific error cases according to retry strategy
+                        if (err.is?.DeleteConflict) {
+                            // DeleteConflict should trigger automatic retry
+                            // Convert to InternalError to make it retryable
+                            return next(errors.InternalError, extraPartLocations,
+                                destinationBucket, aggregateETag, generatedVersionId, droppedMPUSize);
+                        }
+
+                        // For NoSuchKey and other errors, return them as-is
+                        // NoSuchKey is non-retryable, InternalError and others are retryable
+                        return next(err, extraPartLocations,
+                            destinationBucket, aggregateETag, generatedVersionId, droppedMPUSize);
+                    }
+
+                    log.debug('successfully deleted MPU metadata during completeMPU', {
+                        method: 'completeMultipartUpload',
+                        mpuBucket: mpuBucket.getName(),
+                        keysToDeleteCount: keysToDelete.length,
+                    });
+                    return next(null, extraPartLocations,
+                        destinationBucket, aggregateETag, generatedVersionId, droppedMPUSize);
+                });
         },
         function batchDeleteExtraParts(extraPartLocations, destinationBucket,
             aggregateETag, generatedVersionId, droppedMPUSize, next) {
             if (extraPartLocations && extraPartLocations.length > 0) {
                 return data.batchDelete(extraPartLocations, request.method, null, log, err => {
                     if (err) {
-                        return next(err);
+                        // Extra part deletion failure should not fail the operation
+                        // The S3 object was created successfully and MPU metadata was cleaned up
+                        // Orphaned extra parts are acceptable since the main operation succeeded
+                        log.warn('failed to delete extra parts, keeping orphan but returning success', {
+                            method: 'completeMultipartUpload',
+                            extraPartLocationsCount: extraPartLocations.length,
+                            error: err,
+                        });
+
+                        // Continue with quota validation but don't fail if it errors either
+                        return validateQuotas(request, destinationBucket, request.accountQuotas,
+                            ['objectDelete'], 'objectDelete', -droppedMPUSize, false, log, quotaErr => {
+                                if (quotaErr) {
+                                    log.warn('failed to update inflights after extra part deletion failure', {
+                                        method: 'completeMultipartUpload',
+                                        error: quotaErr,
+                                    });
+                                }
+                                // Always succeed - the core operation was successful
+                                return next(null, destinationBucket, aggregateETag, generatedVersionId);
+                            });
                     }
 
                     return validateQuotas(request, destinationBucket, request.accountQuotas,
```
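For readability, the new error mapping in deletePartsMetadata can be restated as a small standalone helper. This is an illustration only: the helper name is hypothetical, and it assumes Arsenal's `errors` map and the `err.is` flags already used in the diff above.

```js
const { errors } = require('arsenal');

// Hypothetical helper mirroring the diff: a DeleteConflict from the metadata
// layer is surfaced as a retryable InternalError so the client retries the
// CompleteMPU; everything else (e.g. the non-retryable NoSuchKey) is passed
// through unchanged.
function mapPartsMetadataDeleteError(err) {
    if (!err) {
        return null;
    }
    if (err.is?.DeleteConflict) {
        return errors.InternalError;
    }
    return err;
}
```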
