Conversation
Coveralls — Coverage Report for CI Build 24348647476: coverage increased (+0.02%) to 84.357%.
Coverage regressions: 1 previously-covered line in 1 file lost coverage.
nothingmuch
left a comment
max-age handling as implemented will not be correct without an Age (or possibly Date) header. i also have a bunch of suggestions to simplify, some more to do with the code itself, others with the approach
| listener: "[::]:8080".parse().expect("valid default listener address"), | ||
| storage_dir: PathBuf::from("./data"), | ||
| timeout: Duration::from_secs(30), | ||
| ohttp_keys_max_age: Some(Duration::from_secs(604800)), |
| ohttp_keys_max_age: Some(Duration::from_secs(604800)), | |
| ohttp_keys_max_age: Some(Duration::from_secs(7 * 24 * 60 * 60)), |
easier to read IMO. Duration::from_weeks also exists but only in nightly
| { | ||
| // duration value of 0 isn't accepted | ||
| let secs: Option<u64> = Option::deserialize(deserializer)?; | ||
| Ok(secs.filter(|&s| s > 0).map(Duration::from_secs)) |
if an invalid duration is given this will silently deserialize as Ok(None) instead of giving an error
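A minimal sketch of the stricter behavior (hypothetical helper name; the real code does this inside a serde deserializer, here it is shown as a plain function) that rejects an explicit zero instead of silently mapping it to `None`:

```rust
use std::time::Duration;

// Hypothetical helper: an explicit 0 is a configuration error,
// not "no max age", so it surfaces as Err instead of Ok(None).
fn parse_max_age(secs: Option<u64>) -> Result<Option<Duration>, String> {
    match secs {
        None => Ok(None),
        Some(0) => Err("ohttp_keys_max_age must be greater than zero".to_string()),
        Some(s) => Ok(Some(Duration::from_secs(s))),
    }
}
```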
| #[derive(Debug)] | ||
| pub struct OhttpKeySet { | ||
| pub current_key_id: u8, | ||
| pub servers: BTreeMap<u8, ohttp::Server>, |
why is this pub and named servers which seems like a reflection of an implementation detail?
also BTreeMap seems overkill if this is constrained to have only IDs 1 and 2. an [Option<ohttp::Server>; 2] could be used and key ID used to derive an index by subtracting 1, seems simpler?
True! my initial implementation wasn't deleting old keys automatically and would load them back into memory on restart, but only accept the one with the largest id number. i'll change this
| /// All keys in the `servers` map are accepted for decapsulation, allowing clients | ||
| /// with cached older keys to continue working during the overlap window. | ||
| #[derive(Debug)] | ||
| pub struct OhttpKeySet { |
the name of this struct seems a bit at odds with its functionality, which handles OHTTP decapsulation as well, maybe KeyRotatingServer is more appropriate?
| if let Some(max_age) = self.ohttp_keys_max_age { | ||
| res.headers_mut().insert( | ||
| CACHE_CONTROL, | ||
| HeaderValue::from_str(&format!("public, max-age={}", max_age.as_secs())) |
this will specify the max age of the HTTP resource relative to the time at which the response was generated, but the semantics of the key expiration and self.ohttp_keys_max_age are relative to when the key was initialized
the Age header can be used to specify the age of the response, which is to be subtracted from max_age, or this could be done server-side
s-maxage is arguably the more appropriate cache control directive to use since this can be shared (c.f. key consistency expired draft), or the equivalent public, max-age=... can be used. immutable can also be set.
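Doing the subtraction server-side might look like this sketch (names hypothetical; `key_created_at` stands for whenever the key entered service):

```rust
use std::time::{Duration, Instant};

// Sketch: make the cache directive reflect the key's remaining lifetime,
// rather than max_age counted from when this response was generated.
fn cache_control_value(key_created_at: Instant, max_age: Duration) -> String {
    let remaining = max_age.saturating_sub(key_created_at.elapsed());
    format!("public, s-maxage={}, immutable", remaining.as_secs())
}
```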
This was my first time seeing the "immutable" cache header.
implementing this.
| /// A new key with the next key_id is generated and persisted to disk. | ||
| /// The new key becomes the current key served to clients. | ||
| /// After `max_age` elapses the old key is dropped from memory and its `.ikm` file is deleted from disk. | ||
| pub fn spawn_key_rotation( |
perhaps it'd be simpler to do the keygen logic in the request handling path instead of as a background task?
another way to potentially simplify this is to use a KDF to generate keys as a function of time, i.e. divide the current time by the max key age, use that as a nonce/counter in a KDF based on a single persisted master key, and just generate the per-epoch key deterministically, that would remove the need for persisting the key material on disk (and so avoid any consistency issues and the need for a sync() call while the rwlock is held to ensure that key material is in durable storage)
another way to potentially simplify this is to use a KDF to generate keys as a function of time, i.e. divide the current time by the max key age, use that as a nonce/counter in a KDF based on a single persisted master key, and just generate the per-epoch key deterministically, that would remove the need for persisting the key material on disk (and so avoid any consistency issues and the need for a sync() call while the rwlock is held to ensure that key material is in durable storage)
not sure i completely understand what you mean here. could this be elaborated or discussed in the upcoming mob programming session if you have the time?
i think my suggestion was a mistake, see below for reasoning. at any rate to answer the question:
simplest version:
generate a root secret, save it on disk
current epoch number = current unix time / epoch length
epoch key = KDF(root secret, current epoch number)
so at every point in time there's a deterministically derived epoch key
ratcheting version:
generate initial epoch key, save it on disk
epoch key = H(previous epoch key), overwrite previous saved epoch key
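The "simplest version" above can be sketched as a toy (DefaultHasher is NOT a KDF; a real implementation would use something like HKDF over the persisted root secret, and would derive actual HPKE key material):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Toy illustration of deterministic per-epoch keys: the caller passes the
// current unix time; dividing by the epoch length yields an epoch number
// that is mixed with the persisted root secret.
fn epoch_key(root_secret: u64, epoch_len_secs: u64, now_unix_secs: u64) -> u64 {
    let epoch_number = now_unix_secs / epoch_len_secs;
    let mut h = DefaultHasher::new();
    (root_secret, epoch_number).hash(&mut h);
    h.finish()
}
```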
but thinking about this more, assuming a temporary compromise (same as in forward secrecy) where a key from an earlier epoch was leaked, and after that the adversary loses access and then a new randomly generated key is introduced, then secrecy will be recovered. this is no longer the case with a deterministic approach.
it could be argued that this model is too optimistic for real world settings. once key material is exfiltrated, supposing the compromise is detected and a new host is provisioned... what is gained by copying the key to the new instance? i think it's better to revoke it entirely (i.e. respond with 422 and force re-configuration).
on the other hand, if the compromise isn't detected so no new host is provisioned, why would a real world adversary lose the ability to compromise the host again if they had enough access to exfiltrate key material at an earlier time? that seems like a pretty contrived scenario. it could be realized using scheduled reboots and an ephemeral root filesystem combined to ensure that persistent access can't be set up, and with regular updates that might patch the vulnerability, recovering from the compromise... anyway the point is actually realizing this threat model takes significant effort.
so, if this reason is not strong enough to want to do this, what other reasons are there to rotate keys? i can't think of one... and if it is strong enough, we definitely need the stateful version you've implemented, not a deterministic approach since that seems to invalidate the only justification
force-pushed 71abcf7 to aa33a74
| .and_then(|i| self.keys[i as usize].as_ref()); | ||
| match server { | ||
| Some(s) => s.decapsulate(ohttp_body), | ||
| None => Err(ohttp::Error::Truncated), |
| None => Err(ohttp::Error::Truncated), | |
| None => Err(ohttp::Error::KeyId), |
why truncated?
| if self.ohttp.read().await.current_key_created_at().elapsed() < max_age { | ||
| return Ok(()); | ||
| } | ||
| let mut keyset = self.ohttp.write().await; |
both keyset and self.ohttp are hard to read... i would consider renaming, maybe self.servers_by_keyid?
| let new_key_id = keyset.next_key_id(); | ||
| if let Some(dir) = &self.ohttp_keys_dir { | ||
| let _ = tokio::fs::remove_file(dir.join(format!("{new_key_id}.ikm"))).await; | ||
| } | ||
| let config = crate::key_config::gen_ohttp_server_config_with_id(new_key_id) | ||
| .map_err(HandlerError::InternalServerError)?; | ||
| if let Some(dir) = &self.ohttp_keys_dir { | ||
| crate::key_config::persist_key_config(&config, dir) | ||
| .map_err(HandlerError::InternalServerError)?; | ||
| } |
seems cleaner to try generating before zapping the old key
| let new_key_id = keyset.next_key_id(); | |
| if let Some(dir) = &self.ohttp_keys_dir { | |
| let _ = tokio::fs::remove_file(dir.join(format!("{new_key_id}.ikm"))).await; | |
| } | |
| let config = crate::key_config::gen_ohttp_server_config_with_id(new_key_id) | |
| .map_err(HandlerError::InternalServerError)?; | |
| if let Some(dir) = &self.ohttp_keys_dir { | |
| crate::key_config::persist_key_config(&config, dir) | |
| .map_err(HandlerError::InternalServerError)?; | |
| } | |
| let new_key_id = keyset.next_key_id(); | |
| let config = crate::key_config::gen_ohttp_server_config_with_id(new_key_id) | |
| .map_err(HandlerError::InternalServerError)?; | |
| if let Some(dir) = &self.ohttp_keys_dir { | |
| let _ = tokio::fs::remove_file(dir.join(format!("{new_key_id}.ikm"))).await; | |
| crate::key_config::persist_key_config(&config, dir) | |
| .map_err(HandlerError::InternalServerError)?; | |
| } |
but actually what i would prefer even more is if the old key was cleared sooner, so that the array contains an Option and a None most of the time, and the two keyids are only valid for a grace period accounting for reasonable clock skew and latency (so on the order of a few seconds)
if two keyids are always available for use, then clients using the old key leak information about their local clock, which is a strong fingerprint
that said we definitely should support two keyids during a short time window: if only one key is ever available, then clients who are slightly ahead will expire their key but then fetch the same key, effectively needing to poll the ohttp configuration until it changes, which is wasteful; and clients who are behind will make requests with a stale key, get an error, need to redo key configuration, and then submit the request under the new key. since this is wasteful and degrades ux, some allowance is necessary.
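The acceptance window described above reduces to a small predicate (names hypothetical; `grace` is the short allowance for clock skew and in-flight requests):

```rust
use std::time::{Duration, Instant};

// Sketch: a key id is accepted from its activation until its ttl plus a
// short grace period; outside that window the request is rejected.
fn key_accepted(valid_from: Instant, ttl: Duration, grace: Duration, now: Instant) -> bool {
    now >= valid_from && now.duration_since(valid_from) <= ttl + grace
}
```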
| } | ||
| } | ||
| async fn maybe_rotate_keys(&self) -> Result<(), HandlerError> { |
since i retracted my suggestion to do deterministic keys, i think it also no longer makes sense to do this on the request path, and it should go back to being done in a background loop
the deterministic approach would not have required an RwLock or any IO... the combination of both of these things including IO that happens while holding the RwLock will now potentially block long running requests, and long polling requests which will hold a read() lock for their entire duration.
rwlock does not define reader/writer priority, so depending on the OS, many long polls may block key rotation until all long poll requests expire (though this can be fixed, see other comment), and any new requests will likewise block while attempting to obtain the rwlock because the current key expired
Dropping the read guard just after decapsulation fixes this in a way, although i'm dropping the function entirely in favour of the background task one
| let (bhttp_req, res_ctx) = self | ||
| .ohttp | ||
| // Decapsulate OHTTP request using the key matching the message's key_id | ||
| let keyset = self.ohttp.read().await; |
this holds the rwlock for read for the entirety of the request handling, but it could be dropped right after decapsulation since res_ctx contains all the information required to respond
a simpler way to support two keyids only during the transition window is to always have two of them; then, after valid_from + ttl + grace period, a background loop can overwrite the inactive key with a new one.

when decapsulating, if the key is stale but hasn't been overwritten yet, it's still in the grace period so the request should be accepted. once it has been overwritten, the keyid now represents the key of the next epoch, but that hasn't been served via config yet, so it's impossible for any client to use it.

the config can then start serving the next key ~1/2 of the grace period before the current key expires (valid_from + ttl). it should already be usable because it was overwritten shortly after the end of the last epoch, and serving it slightly before expiry means that clients whose clock is slightly ahead, and have therefore expired their local copy, won't refetch the same exact key with a really short remaining time to live.

finally, if the RwLock is inside the array, then the two locks should generally not be in contention either (assuming they aren't on the same cache line, which may need some alignment pragma), since the rwlock on the stale key should only be taken after requests involving that keyid die out. this means the loop can freely hold the rwlock while it does IO without introducing any jitter to requests that wouldn't have generated an error anyway.
| let mut iter = configs.into_iter(); | ||
| let (current_cfg, current_mtime) = iter.next().expect("non-empty"); | ||
| let current_age = | ||
| std::time::SystemTime::now().duration_since(current_mtime).unwrap_or_default(); |
should .expect(), not default; a negative duration indicates the system clock was reset
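For reference, `SystemTime::duration_since` returns an `Err` when the anchor is in the future, so `.expect()` surfaces a reset clock instead of masking it as zero age; a sketch:

```rust
use std::time::{Duration, SystemTime};

// Sketch: crash loudly on a key file dated in the future instead of
// treating it as age zero via unwrap_or_default().
fn key_age(mtime: SystemTime) -> Duration {
    SystemTime::now()
        .duration_since(mtime)
        .expect("key mtime in the future: system clock was reset")
}
```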
force-pushed add6ced to 93eeb8b
| impl KeyRotatingServer { | ||
| pub fn from_single(server: ohttp::Server, key_id: u8) -> Self { | ||
| assert!(key_id == 1 || key_id == 2, "key_id must be 1 or 2"); |
hmm, key identifiers are an arbitrary u8: https://www.ietf.org/rfc/rfc9458.html#section-3.1
so probably easier to use key ids 0 and 1 instead of 1 and 2? i think i said 1 and 2 somewhere so sorry if i made it seem like that was an important detail
This needs to be addressed. Ohttp targets can set whatever key id they want
but the mailroom only deals with mailroom ohttp targets and the only reason to support more than 1 key is to do graceful rotation
this may change if say a PQ KEM is added to the OHTTP layer or something like that, but right now supporting more than two distinct keyids serves no purpose that i can see and neither rfc 9458 nor our gateway opt-in extension allows for arbitrary targets, only the mailbox service
| pub fn from_single(server: ohttp::Server, key_id: u8) -> Self { | ||
| assert!(key_id == 1 || key_id == 2, "key_id must be 1 or 2"); | ||
| let mut keys = [None, None]; | ||
| keys[(key_id - 1) as usize] = Some(server); |
it'd be simpler to just initialize both and avoid the Option IMO, no need to pattern match or expect everywhere
btw that's also why i suggested changing "created_at" to "valid_from", since both are created at the same time but only one is valid from Instant::now()
| assert!(id == 1 || id == 2, "key_id must be 1 or 2"); | ||
| keys[(id - 1) as usize] = Some(server); | ||
| } | ||
| let created_at = Instant::now().checked_sub(current_key_age).unwrap_or_else(Instant::now); |
this should .expect(), not `.unwrap_or_else()`. if there's a key claimed to be from the future it's best to crash the server because the host machine is clearly misconfigured
| self.current_key_id = new_key_id; | ||
| self.current_key_created_at = Instant::now(); |
| self.current_key_id = new_key_id; | |
| self.current_key_created_at = Instant::now(); |
best if this rotates the next key but doesn't yet activate it; see my previous comments discussing why we want two keys and why we want a grace period. if clients whose clock is running a bit fast expire their local key, they will request the same key repeatedly, so it's better to already offer the next key (and decapsulate with it) a few seconds before the current key expires, and to still accept an expired key a few seconds after expiry. then, after this grace period, already generate the new key to invalidate the old one, removing the need for retire as well
| } | ||
| }; | ||
| let _ = tokio::fs::remove_file(keys_dir.join(format!("{new_key_id}.ikm"))).await; | ||
| if let Err(e) = crate::key_config::persist_key_config(&config, &keys_dir) { |
IO, especially blocking IO, should not be done while holding a write lock on the whole keyset, since that prevents requests from being handled
see also my suggestion to use an Arc<[Box<RwLock<Server>>; 2]> inside of keyset, as that eliminates this concern; i think it's perfectly fine to write the key while holding a lock there, but persist_key_config should be async
i opted to make the keyid an atomic u8, as a plain u8 isn't safe for concurrent reads and writes; if switch() writes to it while decapsulate() reads it, that might lead to data races
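A minimal sketch of that approach (hypothetical struct; with only slots 0 and 1, switching is an atomic XOR and readers take a plain load, no lock needed for the id itself):

```rust
use std::sync::atomic::{AtomicU8, Ordering};

// Sketch: the current key id flips between 0 and 1 atomically.
struct Current {
    key_id: AtomicU8,
}

impl Current {
    fn switch(&self) {
        self.key_id.fetch_xor(1, Ordering::Relaxed);
    }
    fn load(&self) -> u8 {
        self.key_id.load(Ordering::Relaxed)
    }
}
```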
There are a couple of unaddressed comments from the previous review. In general the hardcoded key ids and array indexing seem like a footgun. See my comment about creating two fields in the key config struct.
Edit: A fixed-size array with indexing is fine. However, i would still like to understand why the fields are optional. The primary key should never be None FWIU, and the secondary key should always be present as well -- only used during the grace period. @zealsham
| impl KeyRotatingServer { | ||
| pub fn from_single(server: ohttp::Server, key_id: u8) -> Self { | ||
| assert!(key_id == 1 || key_id == 2, "key_id must be 1 or 2"); |
This needs to be addressed. Ohttp targets can set whatever key id they want
| // with a cached previous key still work during the overlap window. | ||
| #[derive(Debug)] | ||
| pub struct KeyRotatingServer { | ||
| keys: [Option<ohttp::Server>; 2], |
Would we ever have more than 2 active keys? If not, I would suggest having two fields in this struct instead of indexing
{
    current_key: ohttp::Server,
    previous_key: Option<ohttp::Server>,
    current_key_created_at: Instant,
}

This at least ensures that the primary key is always Some. And may obviate the need for key ids to be in {1,2}
Yes, we would have two keys for a short while, during the grace period of the previous key before it is retired
I understand. However, it seems odd to me that they are optional. Shouldn't the current key always be Some?
both keys should always be Some #1449 (comment)
| None | ||
| }; | ||
| Ok(crate::directory::Service::new(db, ohttp_config.into(), sentinel_tag, v1)) | ||
| let svc = |
Nit: unnecessary abbreviation
| let svc = | |
| let service = |
nothingmuch
left a comment
mtime of the key file corresponds to when the key was generated, as does some of the usage of the valid_from field; it's still semantically a created_at field
i think the math would be a bit simpler if it was valid_until instead of valid_from, see suggestions. but the main thing is that when rotating the key, the mtime has to be bumped or the key rotation will switch between one long-lived key and another key which nominally lasts a few microseconds longer
| #[derive(Debug)] | ||
| pub struct KeyRotatingServer { | ||
| keys: [RwLock<KeySlot>; 2], |
i believe this needs to be a Box or to have some alignment declaration, or these are effectively one RwLock since locking works on a cache line granularity
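The concern is false sharing: two locks sharing one cache line ping-pong the line between cores under contention. One way to sketch the separation without a Box is an explicit alignment attribute (illustrative only; 64 bytes is a common x86_64 line size, not something the crate guarantees):

```rust
use std::sync::RwLock;

// Sketch: force each lock onto its own cache line so reads of one slot
// don't contend with writes to the other through false sharing.
#[repr(align(64))]
struct CacheAligned<T>(T);

struct KeyPair {
    slots: [CacheAligned<RwLock<u8>>; 2],
}
```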
| #[derive(Debug)] | ||
| pub(crate) struct KeySlot { | ||
| pub(crate) server: ohttp::Server, | ||
| pub(crate) valid_from: Instant, |
since the relationship between these is always a difference of epoch length, it's enough to still keep that as a top level field and bump it along with the current keyid as it was before...
also seems like making it valid_until would simplify some of the code to deal with this?
| let config = match crate::key_config::gen_ohttp_server_config_with_id(old_key_id) { | ||
| Ok(c) => c, | ||
| Err(e) => { | ||
| tracing::error!("Failed to generate OHTTP key: {e}"); |
this should panic, not just log, if there's no ability to generate key material something has gone terribly wrong
| }; | ||
| let _ = tokio::fs::remove_file(keys_dir.join(format!("{old_key_id}.ikm"))).await; | ||
| if let Err(e) = crate::key_config::persist_key_config(&config, &keys_dir).await { | ||
| tracing::error!("Failed to persist OHTTP key: {e}"); |
this should also panic, if it's impossible to write to disk then again something has gone terribly wrong
| // Replace a slot with fresh key material. | ||
| pub async fn overwrite(&self, key_id: u8, server: ohttp::Server) { | ||
| assert!(key_id <= 1, "key_id must be 0 or 1"); | ||
| *self.keys[key_id as usize].write().await = KeySlot { server, valid_from: Instant::now() }; |
it's not valid from now, it's valid from the end of the previous epoch
the point of renaming it is to indicate that this represents when the key starts being active not when it is generated
| let mut res = Response::new(full(ohttp_keys)); | ||
| res.headers_mut().insert(CONTENT_TYPE, HeaderValue::from_static("application/ohttp-keys")); | ||
| if let Some(max_age) = self.ohttp_keys_max_age { | ||
| let remaining = max_age.saturating_sub(self.ohttp.current_valid_from().await.elapsed()); |
in line with the other simplification of the delays, adding something like ROTATION_GRACE / 3 to this duration should be done here
| let switch_delay = { | ||
| let valid_from = keyset.current_valid_from().await; | ||
| let switch_at = valid_from + interval - ROTATION_GRACE / 2; | ||
| switch_at.saturating_duration_since(Instant::now()) | ||
| }; | ||
| tokio::time::sleep(switch_delay).await; |
| let switch_delay = { | |
| let valid_from = keyset.current_valid_from().await; | |
| let switch_at = valid_from + interval - ROTATION_GRACE / 2; | |
| switch_at.saturating_duration_since(Instant::now()) | |
| }; | |
| tokio::time::sleep(switch_delay).await; | |
| let switch_delay = { | |
| let switch_at = keyset.current_valid_until().await; | |
| switch_at.saturating_duration_since(Instant::now()) | |
| }; | |
| tokio::time::sleep(switch_delay).await; |
i think all the math would be easier to follow if the field was defined like this
tokio::time::sleep_until exists, no need to calculate the difference from now
| let overwrite_delay = { | ||
| let valid_from = keyset.valid_from(old_key_id).await; | ||
| let overwrite_at = valid_from + interval + ROTATION_GRACE; | ||
| overwrite_at.saturating_duration_since(Instant::now()) | ||
| }; | ||
| tokio::time::sleep(overwrite_delay).await; |
| let overwrite_delay = { | |
| let valid_from = keyset.valid_from(old_key_id).await; | |
| let overwrite_at = valid_from + interval + ROTATION_GRACE; | |
| overwrite_at.saturating_duration_since(Instant::now()) | |
| }; | |
| tokio::time::sleep(overwrite_delay).await; | |
| tokio::time::sleep(ROTATION_GRACE).await; |
| let valid_from = now.checked_sub(age).unwrap_or(now); | ||
| let key_id = cfg.key_id(); | ||
| slots[key_id as usize] = | ||
| Some(crate::directory::KeySlot { server: cfg.into_server(), valid_from }); |
valid_from only makes sense as mtime for the initial key
secondly, if one or both of these keys are expired, then the key generation logic should trigger here during initialization
| }; | ||
| tokio::time::sleep(switch_delay).await; | ||
| let old_key_id = keyset.current_key_id(); |
the mtime of the key we're about to switch to should be updated at this moment
| // Grace period after a switch during which the old key is still | ||
| // accepted for decapsulation. | ||
| const ROTATION_GRACE: Duration = Duration::from_secs(30); |
| const ROTATION_GRACE: Duration = Duration::from_secs(30); | |
| const ROTATION_GRACE: Duration = Duration::from_secs(15); |
this delay should be reduced somewhat; it should only really account for end-to-end latency, and even then the expected delay, not the worst case, so thinking about it more 30 seconds seems a bit excessive
unfortunately this suggestion is completely wrong, we need the grace period to be roughly 1 week (!!!) long because of the way OHTTP configs and expiry are handled in BIP 77 URIs, see later review comments
| } | ||
| } | ||
| let configs = crate::key_config::read_server_configs_by_mtime(ohttp_keys_dir)?; |
this sorts by mtime, but then they are installed by keyid... isn't it simpler to just return them in keyid order? the older mtime would correspond to the valid_from of the earlier epoch, allowing the later epoch's valid_from to be calculated from that even if it hadn't yet started (server was shut down after a key was created but before it was rotated into use)
| current.0 = 1 - current.0; | ||
| current.1 = Instant::now() + interval; |
in my suggestion i believe i wrote RwLock<(u8, Instant)> but i didn't mean that literally, just a tuple that a struct would be equivalent to; IMO this should have named fields
| .await | ||
| .saturating_duration_since(Instant::now()) | ||
| .min(max_age) | ||
| .saturating_sub(ROTATION_GRACE / 3); |
this should add a few seconds, not subtract; the purpose is to avoid clients expiring this before the key is actually rotated (a client whose clock is running fast) only to fetch the same key and have it expire basically immediately
| loop { | ||
| // Sleep until just before the current key expires. | ||
| let valid_until = keyset.valid_until().await; | ||
| let switch_at = valid_until.checked_sub(ROTATION_GRACE / 2).unwrap_or(valid_until); |
since the only meaning of the valid_until field is to time this, why not just set it to the time things should be rotated (no grace period related adjustment)
i made these adjustments
| let valid_until = keyset.valid_until().await; | ||
| let overwrite_at = | ||
| valid_until.checked_sub(interval).unwrap_or(valid_until) + ROTATION_GRACE; | ||
| tokio::time::sleep_until(overwrite_at.into()).await; |
... then this can just be tokio::time::sleep(ROTATION_GRACE)
| let slot1 = slots[1].take().expect("slot 1 missing after init"); | ||
| let valid_until = match interval { | ||
| Some(ivl) => now + ivl.saturating_sub(current_age), | ||
| None => now + std::time::Duration::from_secs(365 * 24 * 3600), |
doesn't this force rotation on a yearly schedule even if not configured?
Yes it does, but this part of the code is never reached as it's dependent on max-age being `Some`. i set it to 10 years now
| // Capture old key id before switching, then switch. | ||
| // `switch` stamps valid_until = Instant::now() + interval, anchored | ||
| // to the actual moment the new key goes live. | ||
| let old_key_id = keyset.current_key_id().await; |
the new key file's mtime needs to be updated here before switching, otherwise it may be served in response to a key config GET request before the server shuts down, and on the next startup it will be invalidated, causing the client to get a 422 response for their cached key
please consider all scenarios and make sure they are handled correctly: as far as i can tell only the last one is, and even that is buggy due to not updating mtime. there should be tests covering this
This PR addresses payjoin#445. It implements OHTTP key rotation in payjoin-mailroom. Mailroom operators can now decide the time interval for keys to be rotated. Also, if a key has expired, a 422 error is returned to clients. Clients can handle the key rotation via the Cache-Control header returned by the directory.
implemented this
nothingmuch
left a comment
the code is looking good. i have a few suggestions
but more importantly, i also realized there's a problem with the design i proposed...
a pj URI will contain the OHTTP config from the receiver to the sender. the receiver can set expiry to be before the OHTTP config expires, but that just shortens the time window for the URI to be useful. if a longer duration is required the receiver can only wait until the config expires and then generate a new one.
in addition to cache control, we should probably send a last modified header as well.
one potential workaround for this is to allow post dating an If-Modified-Since, i.e. have it contain a date after the expiry of the current cache control, specifically to request the next key. if the interval is long this should probably be capped at ~1 week or 24 hours, or whatever we decide on in #1457
however this can be a strong fingerprint: it leaks information about the expiry that a receiver is setting up, and allows linking the sender to the receiver more strongly based on this... i think this means that the design of the URI including the OHTTP config is inherently at odds with privacy preserving key rotation
| slot0: KeySlot, | ||
| slot1: KeySlot, |
KeySlot wrapper contains no additional information, so i think it just imposes boilerplate on the calling code
there's a better use for it IMO see other comment
| slot0: KeySlot, | |
| slot1: KeySlot, | |
| slot0: ohttp::Server, | |
| slot1: ohttp::Server, |
| pub(crate) struct KeySlot { | ||
| pub(crate) server: ohttp::Server, | ||
| } |
i think this type can be removed as a pub type, it's a newtype wrapper for ohttp::Server but doesn't have an impl block...
it can be instead a private struct that wraps the boxed mutex and implements overwrite and decapsulate methods that take the lock internally.
| pub(crate) struct KeySlot { | |
| pub(crate) server: ohttp::Server, | |
| } | |
| struct KeySlot { | |
| pub(crate) server: Box<RwLock<ohttp::Server>>, | |
| } |
| let key_id = ohttp_body.first().copied().ok_or(ohttp::Error::Truncated)?; | ||
| match self.keys.get(key_id as usize) { | ||
| Some(slot) => slot.read().await.server.decapsulate(ohttp_body), | ||
| None => Err(ohttp::Error::KeyId), | ||
| } |
nit: i find more ? heavy code easier to read because of early exit, and this would be more consistent. not a big difference a match block is still readable but maybe you also prefer this approach
| let key_id = ohttp_body.first().copied().ok_or(ohttp::Error::Truncated)?; | |
| match self.keys.get(key_id as usize) { | |
| Some(slot) => slot.read().await.server.decapsulate(ohttp_body), | |
| None => Err(ohttp::Error::KeyId), | |
| } | |
| let key_id = ohttp_body.first().copied().ok_or(ohttp::Error::Truncated)? as usize; | |
| let slot = self.keys.get(key_id).ok_or(ohttp::Error::KeyId)?; | |
| slot.read().await.server.decapsulate(ohttp_body) |
or even
| let key_id = ohttp_body.first().copied().ok_or(ohttp::Error::Truncated)?; | |
| match self.keys.get(key_id as usize) { | |
| Some(slot) => slot.read().await.server.decapsulate(ohttp_body), | |
| None => Err(ohttp::Error::KeyId), | |
| } | |
| let key_id = ohttp_body.first().copied().ok_or(ohttp::Error::Truncated)? as usize; | |
| self.keys.get(key_id).ok_or(ohttp::Error::KeyId)?.decapsulate(ohttp_body) |
if impl KeySlot gets a decapsulate
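For reference, the early-exit shape with a `KeySlot::decapsulate` could look roughly like this (stub types below; the error variants mirror `ohttp::Error` names but are placeholders):

```rust
#[derive(Debug, PartialEq)]
enum OhttpError {
    Truncated,
    KeyId,
}

// Stub slot; the real one would lock and delegate to ohttp::Server.
struct KeySlot;

impl KeySlot {
    fn decapsulate(&self, body: &[u8]) -> Result<Vec<u8>, OhttpError> {
        // Placeholder: strip the key id byte and return the rest.
        Ok(body[1..].to_vec())
    }
}

// The `?`-based lookup: truncation and unknown-key cases exit early.
fn decapsulate(keys: &[KeySlot], body: &[u8]) -> Result<Vec<u8>, OhttpError> {
    let key_id = *body.first().ok_or(OhttpError::Truncated)? as usize;
    keys.get(key_id).ok_or(OhttpError::KeyId)?.decapsulate(body)
}
```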
| let ohttp_keys = self | ||
| .ohttp | ||
| .config() | ||
| .encode() | ||
| .encode_current() | ||
| .await | ||
| .map_err(|e| HandlerError::InternalServerError(e.into()))?; | ||
| let mut res = Response::new(full(ohttp_keys)); | ||
| res.headers_mut().insert(CONTENT_TYPE, HeaderValue::from_static("application/ohttp-keys")); | ||
| if let Some(max_age) = self.ohttp_keys_max_age { | ||
| // Subtract ROTATION_GRACE / 3 so clients refresh their cached key | ||
| // slightly before the rotation boundary, staying well within the | ||
| // grace window where the old key is still accepted. | ||
| let remaining = self | ||
| .ohttp | ||
| .valid_until() | ||
| .await | ||
| .saturating_duration_since(Instant::now()) | ||
| .min(max_age) | ||
| .saturating_add(ROTATION_GRACE / 3); | ||
| res.headers_mut().insert( | ||
| CACHE_CONTROL, | ||
| HeaderValue::from_str(&format!( | ||
| "public, s-maxage={}, immutable", | ||
| remaining.as_secs() | ||
| )) | ||
| .expect("valid header value"), | ||
| ); | ||
| } |
There's a race condition here.
ohttp_keys is the serialized key config, but ohttp.valid_until() is only called afterwards, while the response is being constructed, which may race with a rotation.
encode_current should probably just return a pair, the config and how long it's valid: hold a lock on the key slot while reading both values out of the active key struct, and encode that slot's config before dropping the lock.
The instant-to-duration conversion logic can also be done inside encode_current, and I don't think the min(max_age) is required: the max-age parameter shapes the active key structure, and the valid-until data is set up by the active key rotation, so this part of the code can just trust it instead of redundantly re-enforcing the config parameter.
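A sketch of that shape: one read lock covers both the encoded config and the remaining validity, so a rotation cannot slip in between the two reads (stub types; the real struct would hold the ohttp state and an async lock):

```rust
use std::sync::RwLock;
use std::time::{Duration, Instant};

// Stand-in for the active key slot's state.
struct ActiveKey {
    encoded_config: Vec<u8>,
    valid_until: Instant,
}

struct OhttpKeySet {
    active: RwLock<ActiveKey>,
}

impl OhttpKeySet {
    // Read the serialized config and its remaining lifetime under a
    // single read lock, so a concurrent rotation cannot interleave.
    fn encode_current(&self) -> (Vec<u8>, Duration) {
        let key = self.active.read().expect("lock poisoned");
        let remaining = key.valid_until.saturating_duration_since(Instant::now());
        (key.encoded_config.clone(), remaining)
    }
}
```

The handler then uses the returned `Duration` directly for the Cache-Control header instead of calling back into the key set a second time.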
| // Sleep until just before the current key expires. | ||
| let valid_until = keyset.valid_until().await; | ||
| tracing::info!("Sleeping until {:?}", valid_until); | ||
| //let switch_at = valid_until.checked_sub(ROTATION_GRACE / 2).unwrap_or(valid_until); |
| //let switch_at = valid_until.checked_sub(ROTATION_GRACE / 2).unwrap_or(valid_until); |
| tracing::info!( | ||
| "---------------------------------------------------------------------------" | ||
| ); | ||
|
|
| tracing::info!( | |
| "---------------------------------------------------------------------------" | |
| ); |
Oops, I mistakenly left a line I used for separating outputs.
| "---------------------------------------------------------------------------" | ||
| ); | ||
|
|
||
| // Touch the new active key file *after* overwriting the old slot so |
| // Touch the new active key file *after* overwriting the old slot so | |
| // Touch the new active key file *before* overwriting the old slot so |
| // its mtime is newest on disk. On restart, | ||
| // and derives valid_until from its age. | ||
| let active_path = keys_dir.join(format!("{new_key_id}.ikm")); | ||
| let times = std::fs::FileTimes::new().set_modified(std::time::SystemTime::now()); |
Should be tokio::fs, not std::fs.
True! This will block the async runtime.
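One wrinkle: as far as I can tell, std's `FileTimes` API has no direct tokio::fs equivalent, so an option is to keep the blocking call in a small helper and run it via `tokio::task::spawn_blocking` from the rotation task. A std-only sketch of the helper (the spawn_blocking wrapper is left to the caller; names are illustrative):

```rust
use std::fs::{FileTimes, OpenOptions};
use std::io;
use std::path::Path;
use std::time::SystemTime;

// Blocking mtime touch; from async code, run this inside
// tokio::task::spawn_blocking so it doesn't stall the runtime.
fn touch_mtime(path: &Path) -> io::Result<()> {
    let file = OpenOptions::new().write(true).open(path)?;
    file.set_times(FileTimes::new().set_modified(SystemTime::now()))
}
```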
| match secs { | ||
| None => Ok(None), | ||
| Some(0) => Err(<D::Error as serde::de::Error>::custom( | ||
| "ohttp_keys_max_age must be greater than 0 seconds when set", |
Not sure where the right place to validate is, though I guess this could be renamed to deserialize_key_epoch_duration_secs or something to make it clear that it's not for arbitrary durations...
Either way, the value should probably be at least 2*GRACE_PERIOD, and we might want to warn for unreasonably short values.
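Whether the floor check lives in the deserializer or in a post-parse validation step, it could look roughly like this (the ROTATION_GRACE value and error wording are placeholders, not the PR's constants):

```rust
use std::time::Duration;

// Placeholder grace period; the real constant lives in the rotation code.
const ROTATION_GRACE: Duration = Duration::from_secs(24 * 60 * 60);

// Reject zero and anything shorter than twice the grace period,
// instead of silently filtering invalid values to None.
fn validate_max_age(secs: Option<u64>) -> Result<Option<Duration>, String> {
    let min = ROTATION_GRACE * 2;
    match secs {
        None => Ok(None),
        Some(0) => Err("ohttp_keys_max_age must be greater than 0 seconds".into()),
        Some(s) if Duration::from_secs(s) < min => Err(format!(
            "ohttp_keys_max_age must be at least {} seconds",
            min.as_secs()
        )),
        Some(s) => Ok(Some(Duration::from_secs(s))),
    }
}
```

Returning an error here surfaces a bad config at startup rather than quietly disabling the max-age behavior.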
|
Oh, hmm. Turns out this isn't the first time I realized that; I even came up with the same hack lol: #445 (comment). Anyway, I think past me was less of an idiot than present me, and was correct in saying this isn't a big deal if the no-collusion assumption of the OHTTP privacy model holds: although this can link sessions' mailboxes, there's not much a directory can do with that info, and temporal leaks likely reveal it anyway. |
|
Hmm, doing weird tricks with asking for the next key seems overcomplicated, so again I think my past self was less of an idiot. Unfortunately this means that the grace period needs to be at least as long as we expect a session to last (24h? a week? ...), instead of on the order of seconds =( |
This PR addresses #445. It implements OHTTP key rotation for payjoin-mailroom. Mailroom operators can now configure the interval at which keys are rotated. If a key has expired, a 422 error is returned to clients. Clients can handle the key rotation via the Cache-Control header returned by the directory.
Pull Request Checklist
Please confirm the following before requesting review:
AI
in the body of this PR.