Commit f7b886f

Refine design and specify If-Match ETag validation in OM write path (no pre-flight RPC)
- Add If-Match implementation: validate ETag in OM validateAndUpdateCache, avoiding GetS3KeyDetails pre-flight check to optimize happy path
- Document If-None-Match using EXPECTED_DATA_GENERATION_CREATE_IF_NOT_EXISTS=-1 constant for atomic create-if-not-exists semantics
- Reorganize spec sections: separate Write/Read/Copy specifications
- Clarify OM validation logic: locking, key lookup, ETag comparison, error cases
- Update error mapping: add PRECONDITION_FAILED for missing ETag scenarios
- Add HDDS-13963 reference for Create-If-Not-Exists capability
1 parent dc046a1 commit f7b886f

1 file changed

Lines changed: 69 additions & 38 deletions

File tree

hadoop-hdds/docs/content/design/s3-conditional-requests.md

@@ -26,23 +26,26 @@ AWS S3 supports conditional requests using HTTP conditional headers, enabling at
2626
## Use Cases
2727

2828
### Conditional Writes
29+
2930
- **Atomic key rewrites**: Prevent race conditions when updating existing objects
3031
- **Create-only semantics**: Prevent accidental overwrites (`If-None-Match: *`)
3132
- **Optimistic locking**: Enable concurrent access with conflict detection
3233
- **Leader election**: Implement distributed coordination using S3 as backing store
3334

3435
### Conditional Reads
36+
3537
- **Bandwidth optimization**: Avoid downloading unchanged objects (304 Not Modified)
3638
- **HTTP caching**: Support standard browser/CDN caching semantics
3739
- **Conditional processing**: Only process objects that meet specific criteria
3840

3941
### Conditional Copy
42+
4043
- **Atomic copy operations**: Copy only if source/destination meets specific conditions
4144
- **Prevent overwrite**: Copy only if destination doesn't exist
4245

43-
## AWS S3 Conditional Write
46+
## Specification
4447

45-
### Specification
48+
### AWS S3 Conditional Write Specification
4649

4750
#### If-None-Match Header
4851

@@ -69,75 +72,101 @@ If-Match: "<etag>"
6972
- Cannot use both headers together in same request
7073
- No additional charges for failed conditional requests
7174

72-
### Implementation
75+
### AWS S3 Conditional Read Specification
76+
77+
TODO
78+
79+
### AWS S3 Conditional Copy Specification
80+
81+
TODO
7382

74-
#### Architecture Overview
83+
## Implementation
84+
85+
### AWS S3 Conditional Write Implementation
86+
87+
The implementation aims to minimize redundant RPCs (and the network round trips they cost) while ensuring strict atomicity for conditional operations.
88+
89+
- **If-None-Match** utilizes the atomic "Create-If-Not-Exists" capability ([HDDS-13963](https://issues.apache.org/jira/browse/HDDS-13963)).
90+
- **If-Match** optimizes the happy path by pushing ETag validation directly into the Ozone Manager's write path, avoiding preliminary read operations.
7591

7692
#### If-None-Match Implementation
7793

94+
This implementation ensures strict create-only semantics by utilizing a specific generation ID marker.
95+
96+
In `OzoneConsts.java`, add `-1` as a named constant for readability:
97+
```java
98+
/**
99+
* Special value for expectedDataGeneration to indicate "Create-If-Not-Exists" semantics.
100+
* When used with If-None-Match conditional requests, this ensures atomicity:
101+
* if a concurrent write commits between Create and Commit phases, the commit
102+
* fails the validation check, preserving strict create-if-not-exists semantics.
103+
*/
104+
public static final long EXPECTED_DATA_GENERATION_CREATE_IF_NOT_EXISTS = -1L;
105+
```
106+
78107
##### S3 Gateway Layer
79108

80109
1. Parse `If-None-Match: *`.
81-
2. Set `existingKeyGeneration = -1`.
110+
2. Set `existingKeyGeneration = OzoneConsts.EXPECTED_DATA_GENERATION_CREATE_IF_NOT_EXISTS`.
82111
3. Call `RpcClient.rewriteKey()`.
83112

84113
##### OM Create Phase
85114

86-
1. Validate `expectedDataGeneration == -1`.
87-
2. If key exists → throw `KEY_ALREADY_EXISTS`.
88-
3. Store `-1` in open key metadata.
115+
1. OM receives request with `expectedDataGeneration == OzoneConsts.EXPECTED_DATA_GENERATION_CREATE_IF_NOT_EXISTS`.
116+
2. **Pre-check**: If key is already in the OpenKeyTable or KeyTable, throw `KEY_ALREADY_EXISTS`.
117+
3. If not exists, proceed to create the open key entry.
89118

90-
##### OM Commit Phase
119+
##### OM Commit Phase (Atomicity)
91120

92-
1. Check `expectedDataGeneration == -1` from open key.
93-
2. If key now exists (race condition) → throw `KEY_ALREADY_EXISTS`.
94-
3. Commit key.
121+
1. During the commit phase (or strict atomic create), the OM validates that the key still does not exist.
122+
2. If a concurrent client created the key between the Create and Commit phases, the transaction fails with `KEY_ALREADY_EXISTS`.
95123

96124
##### Race Condition Handling
97125

98-
Using `-1` ensures atomicity. If a concurrent write (Client B) commits between Client A's Create and Commit, Client A's commit fails the `-1` validation check (key now exists), preserving strict create-if-not-exists semantics.
126+
Using `OzoneConsts.EXPECTED_DATA_GENERATION_CREATE_IF_NOT_EXISTS = -1` ensures atomicity. If a concurrent write (Client B) commits between Client A's Create and Commit,
127+
Client A's commit fails the `CREATE IF NOT EXISTS` validation check, preserving strict create-if-not-exists semantics.
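
The race described above can be illustrated with a toy in-memory model. This is only a sketch under stated assumptions, not Ozone code: the class `CreateIfNotExistsSketch`, its `create`/`commit`/`get` methods, the session IDs, and the use of `null` for an unconditional write are all hypothetical. For brevity the pre-check here looks only at committed keys, whereas the design also consults the OpenKeyTable.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy in-memory model of the Create and Commit phases; not Ozone code. */
class CreateIfNotExistsSketch {
  // Mirrors the constant from the design; every other name here is hypothetical.
  static final long EXPECTED_DATA_GENERATION_CREATE_IF_NOT_EXISTS = -1L;

  /** Open-key entry: target key plus the expected generation (null = unconditional write). */
  private static final class OpenKey {
    final String key;
    final Long expectedGeneration;

    OpenKey(String key, Long expectedGeneration) {
      this.key = key;
      this.expectedGeneration = expectedGeneration;
    }

    boolean isCreateOnly() {
      return expectedGeneration != null
          && expectedGeneration == EXPECTED_DATA_GENERATION_CREATE_IF_NOT_EXISTS;
    }
  }

  private final Map<String, String> keyTable = new HashMap<>();    // committed keys
  private final Map<Long, OpenKey> openKeyTable = new HashMap<>(); // in-flight writes
  private long nextSessionId = 0;

  /** Create phase: reject early if a create-only write targets a committed key. */
  synchronized long create(String key, Long expectedGeneration) {
    OpenKey open = new OpenKey(key, expectedGeneration);
    if (open.isCreateOnly() && keyTable.containsKey(key)) {
      throw new IllegalStateException("KEY_ALREADY_EXISTS");
    }
    long sessionId = nextSessionId++;
    openKeyTable.put(sessionId, open);
    return sessionId;
  }

  /** Commit phase: re-check existence so a write committed in between is detected. */
  synchronized void commit(long sessionId, String data) {
    OpenKey open = openKeyTable.remove(sessionId);
    if (open == null) {
      throw new IllegalArgumentException("unknown open key session " + sessionId);
    }
    if (open.isCreateOnly() && keyTable.containsKey(open.key)) {
      throw new IllegalStateException("KEY_ALREADY_EXISTS");
    }
    keyTable.put(open.key, data);
  }

  synchronized String get(String key) {
    return keyTable.get(key);
  }
}
```

In this model, if Client A opens a key with the `-1` marker and Client B commits the same key before A commits, A's commit is rejected and B's data stays in place.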
99128

100129
#### If-Match Implementation
101130

102-
Leverages existing `expectedDataGeneration` from HDDS-10656:
131+
To optimize performance and reduce latency, we avoid a pre-flight check (GetS3KeyDetails) and instead validate the ETag during the OM Write operation.
132+
This requires adding an optional `expectedETag` field to `KeyArgs`. This approach optimizes the "happy path" (successful match) by removing an extra network round trip.
133+
Failing requests still incur the cost of a write RPC and a Raft log entry, but this is acceptable under optimistic concurrency control, which assumes conflicts are rare.
103134

104135
##### S3 Gateway Layer
105136

106-
1. Parse `If-Match: "<etag>"` header
107-
2. Look up existing key via `getS3KeyDetails()`
108-
3. Validate ETag matches, else throw `PRECOND_FAILED` (412)
109-
4. Extract `expectedGeneration` from existing key
110-
5. Pass `expectedGeneration` to RpcClient
137+
1. Parse `If-Match: "<etag>"` header.
138+
2. Populate `KeyArgs` with the parsed `expectedETag`.
139+
3. Send the write request (CreateKey/OpenKey) to OM.
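
As a rough illustration of the header-parsing step, the sketch below extracts the ETag value from an `If-Match` header. It rests on assumptions not stated in the design: that the gateway strips the surrounding double quotes before populating `expectedETag`, and that a missing or empty header means "no condition". The class and method names are hypothetical.

```java
/** Hypothetical helper for turning an If-Match header value into an expected ETag. */
class IfMatchHeaderSketch {
  /**
   * Returns the ETag without surrounding double quotes,
   * or null when the header is absent or empty (assumed to mean "no condition").
   */
  static String parseExpectedETag(String headerValue) {
    if (headerValue == null) {
      return null;
    }
    String value = headerValue.trim();
    // Strip one pair of surrounding quotes, e.g. "\"abc\"" becomes "abc".
    if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
      value = value.substring(1, value.length() - 1);
    }
    return value.isEmpty() ? null : value;
  }
}
```

Whether the quotes should be stripped depends on how ETags are stored on the key; the comparison in the OM only works if both sides use the same form.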
111140

112-
##### OM Create Phase
141+
##### OM Layer (Validation Logic)
142+
143+
Validation is performed within the `validateAndUpdateCache` method to ensure atomicity within the Ratis state machine application.
113144

114-
1. Receive `expectedDataGeneration` parameter
115-
2. Look up current key and validate exists
116-
3. Extract current key's `updateID` value
117-
4. Create open key with `expectedDataGeneration = updateID`
118-
5. Return stream to S3 gateway
145+
1. **Locking**: The OM acquires the write lock for the bucket/key.
146+
2. **Key Lookup**: Retrieve the existing key from `KeyTable`.
147+
3. **Validation**:
119148

120-
##### OM Commit Phase
149+
- **Key Not Found**: If the key does not exist, throw `KEY_NOT_FOUND` (maps to S3 412).
150+
- **No ETag Metadata**: If the existing key (e.g., uploaded via OFS) has no ETag property, validation fails with `PRECONDITION_FAILED`. The ETag is **not** computed on the spot, to avoid performance overhead on the applier thread.
151+
- **ETag Mismatch**: Compare `existingKey.ETag` with `expectedETag`. If they do not match, throw `PRECONDITION_FAILED` (maps to S3 412).
121152

122-
1. Read open key (contains `expectedDataGeneration`)
123-
2. Read current committed key
124-
3. Validate `current.updateID == openKey.expectedDataGeneration`
125-
4. Commit if match, reject if mismatch (existing HDDS-10656 logic)
153+
4. **Execution**: If validation passes, proceed with the operation (adding to OpenKeyTable).
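
The validation steps above can be sketched as a small in-memory model. This is not the actual `validateAndUpdateCache` code: the class `IfMatchValidationSketch`, the `Result` enum, the single lock, and the map standing in for `KeyTable` are assumptions made for illustration, with a `null` map value representing a key that has no ETag metadata.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.locks.ReentrantReadWriteLock;

/** Toy model of the If-Match check performed under the write lock; not Ozone code. */
class IfMatchValidationSketch {
  enum Result { OK, KEY_NOT_FOUND, PRECONDITION_FAILED }

  // Stand-in for KeyTable: key name -> stored ETag (null when the key has no ETag).
  private final Map<String, String> keyTable = new HashMap<>();
  private final ReentrantReadWriteLock bucketLock = new ReentrantReadWriteLock();

  void putCommittedKey(String key, String etag) {
    bucketLock.writeLock().lock();
    try {
      keyTable.put(key, etag);
    } finally {
      bucketLock.writeLock().unlock();
    }
  }

  Result validateIfMatch(String key, String expectedETag) {
    bucketLock.writeLock().lock();            // 1. Locking
    try {
      if (!keyTable.containsKey(key)) {       // 2. Key lookup
        return Result.KEY_NOT_FOUND;
      }
      String existingETag = keyTable.get(key);
      if (existingETag == null) {             // 3a. No ETag metadata: fail, do not compute
        return Result.PRECONDITION_FAILED;
      }
      return existingETag.equals(expectedETag) // 3b. ETag comparison
          ? Result.OK                          // 4. Caller proceeds to open the key
          : Result.PRECONDITION_FAILED;
    } finally {
      bucketLock.writeLock().unlock();
    }
  }
}
```

Because lookup and comparison happen while the lock is held, no other writer can change the key between the check and the subsequent open-key step in this model.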
126154

127155
#### Error Mapping
128156

129-
| OM Error | S3 Status | S3 Error Code | Scenario |
130-
|----------|-----------|---------------|----------|
131-
| `KEY_ALREADY_EXISTS` | 412 | PreconditionFailed | If-None-Match failed |
132-
| `KEY_NOT_FOUND` | 412 | PreconditionFailed | If-Match failed (key missing) |
133-
| `ETAG_MISMATCH` | 412 | PreconditionFailed | If-Match failed (ETag mismatch) |
134-
| `GENERATION_MISMATCH` | 412 | PreconditionFailed | If-Match failed (concurrent modification) |
157+
| OM Error | S3 Status | S3 Error Code | Scenario |
158+
|---|---|---|---|
160+
|`KEY_ALREADY_EXISTS`|412|PreconditionFailed|If-None-Match failed|
161+
|`KEY_NOT_FOUND`|412|PreconditionFailed|If-Match failed (key missing)|
162+
|`ETAG_MISMATCH`|412|PreconditionFailed|If-Match failed (ETag mismatch)|
163+
|`PRECONDITION_FAILED`|412|PreconditionFailed|If-Match failed (General/No ETag)|
135164

136-
## AWS S3 Conditional Read
165+
## AWS S3 Conditional Read Implementation
137166

138167
TODO
139168

140-
## AWS S3 Conditional Copy
169+
## AWS S3 Conditional Copy Implementation
141170

142171
TODO
143172

@@ -146,4 +175,6 @@ TODO
146175
- [AWS S3 Conditional Requests](https://docs.aws.amazon.com/AmazonS3/latest/userguide/conditional-requests.html)
147176
- [RFC 7232 - HTTP Conditional Requests](https://tools.ietf.org/html/rfc7232)
148177
- [HDDS-10656 - Atomic Rewrite Key](https://issues.apache.org/jira/browse/HDDS-10656)
178+
- [HDDS-13963 - Atomic Create-If-Not-Exists](https://issues.apache.org/jira/browse/HDDS-13963)
149179
- [Leader Election with S3 Conditional Writes](https://www.morling.dev/blog/leader-election-with-s3-conditional-writes/)
180+
- [An MVCC-like columnar table on S3 with constant-time deletes](https://simonwillison.net/2025/Oct/11/mvcc-s3/)
