
OHTTP keys should be rotated#1449

Open
zealsham wants to merge 1 commit into payjoin:master from zealsham:key-rotation

Conversation

Collaborator

@zealsham zealsham commented Mar 29, 2026

This PR addresses #445. It implements OHTTP key rotation for payjoin-mailroom. Mailroom operators can now decide the time interval at which keys are rotated. Also, if a key has expired, a 422 error is returned to clients. Clients can handle key rotation via the Cache-Control header returned by the directory.

Pull Request Checklist

Please confirm the following before requesting review:

@coveralls
Collaborator

coveralls commented Mar 29, 2026

Coverage Report for CI Build 24348647476

Coverage increased (+0.02%) to 84.357%

Details

  • Coverage increased (+0.02%) from the base build.
  • Patch coverage: 43 uncovered changes across 3 files (281 of 324 lines covered, 86.73%).
  • 1 coverage regression across 1 file.

Uncovered Changes

File | Changed | Covered | %
payjoin-mailroom/src/directory.rs | 107 | 77 | 71.96%
payjoin-mailroom/src/config.rs | 14 | 2 | 14.29%
payjoin-mailroom/src/key_config.rs | 62 | 61 | 98.39%

Coverage Regressions

1 previously-covered line in 1 file lost coverage.

File | Lines Losing Coverage | Coverage
payjoin-mailroom/src/key_config.rs | 1 | 97.7%

Coverage Stats

Coverage Status
Relevant Lines: 13092
Covered Lines: 11044
Line Coverage: 84.36%
Coverage Strength: 403.75 hits per line

💛 - Coveralls

Contributor

@nothingmuch nothingmuch left a comment

max-age handling as implemented will not be correct without an Age (or possibly Date) header. i also have a bunch of suggestions to simplify; some more to do with the code itself, others with the approach

Comment thread payjoin-mailroom/src/config.rs Outdated
listener: "[::]:8080".parse().expect("valid default listener address"),
storage_dir: PathBuf::from("./data"),
timeout: Duration::from_secs(30),
ohttp_keys_max_age: Some(Duration::from_secs(604800)),
Contributor

Suggested change
ohttp_keys_max_age: Some(Duration::from_secs(604800)),
ohttp_keys_max_age: Some(Duration::from_secs(7 * 24 * 60 * 60)),

easier to read IMO. Duration::from_weeks also exists but only in nightly

Comment thread payjoin-mailroom/src/config.rs Outdated
{
// duration value of 0 isn't accepted
let secs: Option<u64> = Option::deserialize(deserializer)?;
Ok(secs.filter(|&s| s > 0).map(Duration::from_secs))
Contributor

if an invalid duration is given this will silently deserialize as Ok(None) instead of giving an error
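For illustration, the validation could reject a zero duration with an error instead of silently filtering it to `None`. A minimal sketch with a hypothetical helper (in the actual serde deserializer the `Err` would be mapped to `D::Error::custom`):

```rust
use std::time::Duration;

// Hypothetical helper: validate an optional max-age in seconds,
// treating 0 as an error rather than silently mapping it to None.
fn validate_max_age(secs: Option<u64>) -> Result<Option<Duration>, String> {
    match secs {
        None => Ok(None),
        Some(0) => Err("ohttp_keys_max_age must be greater than 0".to_string()),
        Some(s) => Ok(Some(Duration::from_secs(s))),
    }
}

fn main() {
    assert_eq!(validate_max_age(Some(60)), Ok(Some(Duration::from_secs(60))));
    // 0 now surfaces as an error the operator can see, not a silent None.
    assert!(validate_max_age(Some(0)).is_err());
    assert_eq!(validate_max_age(None), Ok(None));
}
```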

Comment thread payjoin-mailroom/src/directory.rs Outdated
#[derive(Debug)]
pub struct OhttpKeySet {
pub current_key_id: u8,
pub servers: BTreeMap<u8, ohttp::Server>,
Contributor

why is this pub and named servers which seems like a reflection of an implementation detail?

also BTreeMap seems overkill if this is constrained to have only IDs 1 and 2. an [Option<ohttp::Server>; 2] could be used and key ID used to derive an index by subtracting 1, seems simpler?

Collaborator Author

True! My initial implementation wasn't deleting old keys automatically; it would load them back into memory on restart but only accept the one with the largest id number. i'll change this

Comment thread payjoin-mailroom/src/directory.rs Outdated
/// All keys in the `servers` map are accepted for decapsulation, allowing clients
/// with cached older keys to continue working during the overlap window.
#[derive(Debug)]
pub struct OhttpKeySet {
Contributor

the name of this struct seems a bit at odds with its functionality, which handles OHTTP decapsulation as well, maybe KeyRotatingServer is more appropriate?

Comment thread payjoin-mailroom/src/directory.rs Outdated
if let Some(max_age) = self.ohttp_keys_max_age {
res.headers_mut().insert(
CACHE_CONTROL,
HeaderValue::from_str(&format!("public, max-age={}", max_age.as_secs()))
Contributor

this will specify the max age of the HTTP resource relative to the time at which the response was generated, but the semantics for the key expiration and self.ohttp_keys_max_age is relative to when the key was initialized

the Age header can be used to specify the age of the response, which is to be subtracted from max_age or this could be done server side

s-maxage is arguably the more appropriate cache control directive to use since this can be shared (c.f. key consistency expired draft), or the equivalent public, max-age=... can be used. immutable can also be set.
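A sketch of the server-side variant mentioned above: subtract the key's current age from the configured max age before emitting the header, so the advertised freshness is relative to key creation rather than response generation (function name hypothetical):

```rust
use std::time::Duration;

// Hypothetical helper: build the Cache-Control value so the advertised
// freshness lifetime counts down from when the key was created.
fn cache_control_value(max_age: Duration, key_age: Duration) -> String {
    let remaining = max_age.saturating_sub(key_age);
    format!("public, max-age={}, immutable", remaining.as_secs())
}

fn main() {
    let week = Duration::from_secs(7 * 24 * 60 * 60);
    let two_days = Duration::from_secs(2 * 24 * 60 * 60);
    // A key that is already two days old advertises only the remaining five days.
    assert_eq!(
        cache_control_value(week, two_days),
        "public, max-age=432000, immutable"
    );
}
```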

Collaborator Author

This was my first time seeing the "immutable" cache header.

Implementing this.

Comment thread payjoin-mailroom/src/directory.rs Outdated
/// A new key with the next key_id is generated and persisted to disk.
/// The new key becomes the current key served to clients.
/// After `max_age` elapses the old key is dropped from memory and its `.ikm` file is deleted from disk.
pub fn spawn_key_rotation(
Contributor

perhaps it'd be simpler to do the keygen logic in the request handling path instead of as a background task?

another way to potentially simplify this is to use a KDF to generate keys as a function of time, i.e. divide the current time by the max key age, use that as a nonce/counter in a KDF based on a single persisted master key, and just generate the per-epoch key deterministically, that would remove the need for persisting the key material on disk (and so avoid any consistency issues and the need for a sync() call while the rwlock is held to ensure that key material is in durable storage)

Collaborator Author

another way to potentially simplify this is to use a KDF to generate keys as a function of time, i.e. divide the current time by the max key age, use that as a nonce/counter in a KDF based on a single persisted master key, and just generate the per-epoch key deterministically, that would remove the need for persisting the key material on disk (and so avoid any consistency issues and the need for a sync() call while the rwlock is held to ensure that key material is in durable storage)

not sure i completely understand what you mean here. could this be elaborated or discussed in the upcoming mob programming session if you've got the time?

Contributor

i think my suggestion was a mistake, see below for reasoning. at any rate to answer the question:

simplest version:

generate a root secret, save it on disk

current epoch number = current unix time / epoch length

epoch key = KDF(root secret, current epoch number)

so at every point in time there's a deterministically derived epoch key

ratcheting version:

generate initial epoch key, save it on disk

epoch key = H(previous epoch key), overwrite previous saved epoch key
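The two versions above, sketched minimally. DefaultHasher is only a stand-in for a real KDF or hash (e.g. HKDF or SHA-256) and is NOT cryptographically secure; all names here are illustrative:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Stand-in KDF for illustration only; a real implementation would use
// a cryptographic KDF such as HKDF.
fn kdf(input: u64, context: u64) -> u64 {
    let mut h = DefaultHasher::new();
    (input, context).hash(&mut h);
    h.finish()
}

// Simplest version: derive the epoch key from a persisted root secret
// and the current epoch number (unix time / epoch length).
fn epoch_key(root_secret: u64, unix_time: u64, epoch_len: u64) -> u64 {
    let epoch_number = unix_time / epoch_len;
    kdf(root_secret, epoch_number)
}

// Ratcheting version: each epoch key is a one-way function of the
// previous one; the previous key is overwritten on disk.
fn ratchet(previous_epoch_key: u64) -> u64 {
    kdf(previous_epoch_key, 0)
}

fn main() {
    let epoch_len: u64 = 7 * 24 * 60 * 60;
    // Any two times within the same epoch derive the same key.
    assert_eq!(
        epoch_key(42, 1_000_000, epoch_len),
        epoch_key(42, 1_000_001, epoch_len)
    );
    // The ratchet is deterministic going forward.
    assert_eq!(ratchet(7), ratchet(7));
}
```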

but thinking about this more, assuming a temporary compromise (same as in forward secrecy) where a key from an earlier epoch was leaked, and after that the adversary loses access and then a new randomly generated key is introduced, then secrecy will be recovered. this is no longer the case with a deterministic approach.

it could be argued that this model is too optimistic for real world settings. once key material is exfiltrated, supposing the compromise is detected and a new host is provisioned... what is gained by copying the key to the new instance? i think it's better to revoke it entirely (i.e. respond with 422 and force re-configuration).

on the other hand, if the compromise isn't detected so no new host is provisioned, why would a real world attacker lose the ability to compromise the host again if they have enough access to exfiltrate key material at an earlier time? that seems like a pretty contrived scenario. it could be realized using scheduled reboots and an ephemeral root filesystem combined to ensure that persistent access can't be set up, and with regular updates that might patch the vulnerability, recovering from the compromise... anyway the point is actually realizing this threat model takes significant effort.

so, if this reason is not strong enough to want to do this, what other reasons are there to rotate keys? i can't think of one... and if it is strong enough, we definitely need the stateful version you've implemented, not a deterministic approach since that seems to invalidate the only justification

@zealsham zealsham force-pushed the key-rotation branch 2 times, most recently from 71abcf7 to aa33a74 on April 2, 2026 at 16:59
Comment thread payjoin-mailroom/src/directory.rs Outdated
.and_then(|i| self.keys[i as usize].as_ref());
match server {
Some(s) => s.decapsulate(ohttp_body),
None => Err(ohttp::Error::Truncated),
Contributor

Suggested change
None => Err(ohttp::Error::Truncated),
None => Err(ohttp::Error::KeyId),

why truncated?

Comment thread payjoin-mailroom/src/directory.rs Outdated
if self.ohttp.read().await.current_key_created_at().elapsed() < max_age {
return Ok(());
}
let mut keyset = self.ohttp.write().await;
Contributor

both keyset and self.ohttp are hard to read... i would consider renaming, maybe self.servers_by_keyid?

Comment thread payjoin-mailroom/src/directory.rs Outdated
Comment on lines +488 to +497
let new_key_id = keyset.next_key_id();
if let Some(dir) = &self.ohttp_keys_dir {
let _ = tokio::fs::remove_file(dir.join(format!("{new_key_id}.ikm"))).await;
}
let config = crate::key_config::gen_ohttp_server_config_with_id(new_key_id)
.map_err(HandlerError::InternalServerError)?;
if let Some(dir) = &self.ohttp_keys_dir {
crate::key_config::persist_key_config(&config, dir)
.map_err(HandlerError::InternalServerError)?;
}
Contributor

seems cleaner to try generating before zapping the old key

Suggested change
let new_key_id = keyset.next_key_id();
if let Some(dir) = &self.ohttp_keys_dir {
let _ = tokio::fs::remove_file(dir.join(format!("{new_key_id}.ikm"))).await;
}
let config = crate::key_config::gen_ohttp_server_config_with_id(new_key_id)
.map_err(HandlerError::InternalServerError)?;
if let Some(dir) = &self.ohttp_keys_dir {
crate::key_config::persist_key_config(&config, dir)
.map_err(HandlerError::InternalServerError)?;
}
let new_key_id = keyset.next_key_id();
let config = crate::key_config::gen_ohttp_server_config_with_id(new_key_id)
.map_err(HandlerError::InternalServerError)?;
if let Some(dir) = &self.ohttp_keys_dir {
let _ = tokio::fs::remove_file(dir.join(format!("{new_key_id}.ikm"))).await;
crate::key_config::persist_key_config(&config, dir)
.map_err(HandlerError::InternalServerError)?;
}

but actually what i would prefer even more is if the old key was cleared sooner, so that the array contains an Option and a None most of the time, and the two keyids are only valid for a grace period accounting for reasonable clock skew and latency (so on the order of a few seconds)

if two keyids are always available for use, then clients using the old key leak information about their local clock, which is a strong fingerprint

that said we definitely should support two keyids during a short time window, because if only one key is ever available then clients who are slightly ahead will expire their key but then fetch the same key and effectively need to poll the ohttp configuration until it changes, which is wasteful, or clients who are behind will make requests with a stale key, get an error, need to do key configuration, and then submit the request under the new key. since this is wasteful and degrades ux, some allowance is necessary.
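The acceptance window described here can be sketched as a pure time check (names and durations hypothetical): a key is current until `valid_from + ttl`, and still accepted for a short grace period after that.

```rust
use std::time::{Duration, Instant};

// Hypothetical sketch of the decapsulation acceptance window:
// accept a key from its activation time until ttl + grace has elapsed.
fn key_accepted(valid_from: Instant, ttl: Duration, grace: Duration, now: Instant) -> bool {
    now >= valid_from && now < valid_from + ttl + grace
}

fn main() {
    let ttl = Duration::from_secs(60);
    let grace = Duration::from_secs(5);
    let valid_from = Instant::now();
    // Well within the ttl: accepted.
    assert!(key_accepted(valid_from, ttl, grace, valid_from + Duration::from_secs(30)));
    // Just past the ttl but inside the grace period: still accepted.
    assert!(key_accepted(valid_from, ttl, grace, valid_from + Duration::from_secs(62)));
    // Past ttl + grace: rejected.
    assert!(!key_accepted(valid_from, ttl, grace, valid_from + Duration::from_secs(70)));
}
```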

Comment thread payjoin-mailroom/src/directory.rs Outdated
}
}

async fn maybe_rotate_keys(&self) -> Result<(), HandlerError> {
Contributor

@nothingmuch nothingmuch Apr 3, 2026

since i retracted my suggestion to do deterministic keys, i think it also no longer makes sense to do this on the request path, and it should go back to being done in a background loop

the deterministic approach would not have required an RwLock or any IO... the combination of both of these things including IO that happens while holding the RwLock will now potentially block long running requests, and long polling requests which will hold a read() lock for their entire duration.

rwlock does not define reader/writer priority, so depending on the OS many long polls will block key rotation until all long poll requests expire (though this can be fixed, see other comment), and any new requests will likewise block while attempting to obtain the rwlock because the current key expired

Collaborator Author

Dropping the read guard just after decapsulation fixes this in a way, although i'm dropping the function entirely in favour of the background-task one

Comment thread payjoin-mailroom/src/directory.rs Outdated
let (bhttp_req, res_ctx) = self
.ohttp
// Decapsulate OHTTP request using the key matching the message's key_id
let keyset = self.ohttp.read().await;
Contributor

this holds the rwlock with read for entirety of the request handling, but it could be dropped right after decapsulation since res_ctx contains all the information required to respond

@nothingmuch
Contributor

a simpler way to support two keyids only during transition window is to always have two of them, no Option, instead keep a valid_from time for each separately.

after valid_from + ttl + grace period, a background loop can overwrite the inactive key with a new one.

when decapsulating, if the key is stale but hasn't been overwritten yet, it's still in the grace period so the request should be accepted. if the key is stale then the keyid is now representing the key of the next epoch, but that hasn't been served via config yet so it's impossible for any client to use it.

then the config can start serving the next key ~1/2 of the grace period before the current key expires (valid_from + ttl); it should already be usable because it was overwritten shortly after the end of the last epoch, and serving it slightly before expiry means that clients whose clock is slightly ahead and therefore have expired their local copy won't refetch the same exact key with a really short remaining time to live

finally, if the RwLock is inside of the array, then the two locks should generally not be in contention either (assuming they aren't on the same cache line, which may need some alignment pragma), since the rwlock on the stale key should only be taken after requests involving that keyid kinda die out. this means the loop can freely hold the rwlock while it does IO without introducing any jitter to requests that wouldn't have generated an error anyway
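A rough sketch of this layout, with `std::sync::RwLock` standing in for `tokio::sync::RwLock` and a `u64` standing in for `ohttp::Server` (all names hypothetical): two always-populated slots, each with its own lock and its own `valid_from`.

```rust
use std::sync::RwLock;
use std::time::Instant;

// Placeholder for ohttp::Server in this sketch.
struct KeySlot {
    server: u64,
    valid_from: Instant,
}

// Two slots, each behind its own lock, so overwriting the stale slot
// does not contend with requests using the current one.
struct KeyRotatingServer {
    keys: [RwLock<KeySlot>; 2],
}

impl KeyRotatingServer {
    // Overwrite a slot with fresh material; only the background rotation
    // loop calls this, so contention with request handling is rare.
    fn overwrite(&self, key_id: u8, server: u64, valid_from: Instant) {
        let mut slot = self.keys[key_id as usize].write().unwrap();
        *slot = KeySlot { server, valid_from };
    }
}

fn main() {
    let now = Instant::now();
    let krs = KeyRotatingServer {
        keys: [
            RwLock::new(KeySlot { server: 1, valid_from: now }),
            RwLock::new(KeySlot { server: 2, valid_from: now }),
        ],
    };
    krs.overwrite(1, 3, now);
    assert_eq!(krs.keys[1].read().unwrap().server, 3);
    assert!(krs.keys[0].read().unwrap().valid_from <= Instant::now());
}
```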

Comment thread payjoin-mailroom/src/lib.rs Outdated
let mut iter = configs.into_iter();
let (current_cfg, current_mtime) = iter.next().expect("non-empty");
let current_age =
std::time::SystemTime::now().duration_since(current_mtime).unwrap_or_default();
Contributor

should .expect(), not default; a negative duration indicates the system clock was reset

@zealsham zealsham force-pushed the key-rotation branch 3 times, most recently from add6ced to 93eeb8b on April 3, 2026 at 19:16
Comment thread payjoin-mailroom/src/directory.rs Outdated

impl KeyRotatingServer {
pub fn from_single(server: ohttp::Server, key_id: u8) -> Self {
assert!(key_id == 1 || key_id == 2, "key_id must be 1 or 2");
Contributor

hmm, key identifiers are an arbitrary u8: https://www.ietf.org/rfc/rfc9458.html#section-3.1

so probably easier to use key ids 0 and 1 instead of 1 and 2? i think i said 1 and 2 somewhere so sorry if i made it seem like that was an important detail

Collaborator

This needs to be addressed. Ohttp targets can set whatever key id they want

Contributor

but the mailroom only deals with mailroom ohttp targets and the only reason to support more than 1 key is to do graceful rotation

this may change if say a PQ KEM is added to the OHTTP layer or something like that, but right now supporting more than two distinct keyids serves no purpose that i can see and neither rfc 9458 nor our gateway opt-in extension allows for arbitrary targets, only the mailbox service

Comment thread payjoin-mailroom/src/directory.rs Outdated
pub fn from_single(server: ohttp::Server, key_id: u8) -> Self {
assert!(key_id == 1 || key_id == 2, "key_id must be 1 or 2");
let mut keys = [None, None];
keys[(key_id - 1) as usize] = Some(server);
Contributor

@nothingmuch nothingmuch Apr 3, 2026

it'd be simpler to just initialize both and avoid the Option IMO, no need to pattern match or expect everywhere

btw that's also why i suggested changing "created_at" to "valid_from", since both are created at the same time but only one is valid from Instant::now()

Comment thread payjoin-mailroom/src/directory.rs Outdated
assert!(id == 1 || id == 2, "key_id must be 1 or 2");
keys[(id - 1) as usize] = Some(server);
}
let created_at = Instant::now().checked_sub(current_key_age).unwrap_or_else(Instant::now);
Contributor

this should .expect(), not .unwrap_or_else(). if there's a key claimed to be from the future it's best to crash the server because the host machine is clearly misconfigured

Comment thread payjoin-mailroom/src/directory.rs Outdated
Comment on lines +113 to +114
self.current_key_id = new_key_id;
self.current_key_created_at = Instant::now();
Contributor

Suggested change
self.current_key_id = new_key_id;
self.current_key_created_at = Instant::now();

best if this rotates the next key but doesn't yet activate it; see my previous comments discussing why we want two keys and why we want a grace period. if clients whose clock is running a bit fast expire their local key they will request the same key repeatedly, so it's better to already offer the next key (and decapsulate with it) a few seconds before the current key expires, and to still accept an expired key a few seconds after expiry, but then after this grace period generate the new key to invalidate the old one, removing the need for retire as well

Comment thread payjoin-mailroom/src/directory.rs Outdated
}
};
let _ = tokio::fs::remove_file(keys_dir.join(format!("{new_key_id}.ikm"))).await;
if let Err(e) = crate::key_config::persist_key_config(&config, &keys_dir) {
Contributor

IO, especially blocking should not be done while holding a write lock on the whole keyset, since that prevents requests from being handled

see also my suggestion to use an Arc<[Box<RwLock<Server>>; 2]> inside of keyset, as that eliminates this concern. i think it's perfectly fine to write the key while holding a lock there, but persist_key_config should be async

Collaborator Author

i opted to make the keyid an atomic u8, as a plain u8 isn't safe for concurrent reads and writes; if switch() writes to it while decapsulate() reads it, that might lead to data races
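A sketch of that approach (names hypothetical): an `AtomicU8` lets `switch()` publish the new current key id while `decapsulate()` reads it concurrently, without taking the per-slot locks.

```rust
use std::sync::atomic::{AtomicU8, Ordering};

// Hypothetical wrapper around the current key id, shared between the
// rotation task (writer) and request handlers (readers).
struct CurrentKeyId(AtomicU8);

impl CurrentKeyId {
    // Flip between slot 0 and slot 1 and return the new id.
    fn switch(&self) -> u8 {
        let old = self.0.load(Ordering::Acquire);
        let new = old ^ 1;
        self.0.store(new, Ordering::Release);
        new
    }

    fn current(&self) -> u8 {
        self.0.load(Ordering::Acquire)
    }
}

fn main() {
    let id = CurrentKeyId(AtomicU8::new(0));
    assert_eq!(id.current(), 0);
    assert_eq!(id.switch(), 1);
    assert_eq!(id.current(), 1);
    assert_eq!(id.switch(), 0);
}
```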

Collaborator

@arminsabouri arminsabouri left a comment

There are a couple of unaddressed comments from the previous review. In general the hardcoded key ids and array indexing seem like a footgun. See my comment about creating two fields in the key config struct.

Edit: A fixed-size array with indexing is fine. However, I would still like to understand why the fields are optional. The primary key should never be None FWIU, and the secondary key should always be present as well -- only used during the grace period. @zealsham

Comment thread payjoin-mailroom/src/directory.rs Outdated

impl KeyRotatingServer {
pub fn from_single(server: ohttp::Server, key_id: u8) -> Self {
assert!(key_id == 1 || key_id == 2, "key_id must be 1 or 2");
Collaborator

This needs to be addressed. Ohttp targets can set whatever key id they want

Comment thread payjoin-mailroom/src/directory.rs Outdated
// with a cached previous key still work during the overlap window.
#[derive(Debug)]
pub struct KeyRotatingServer {
keys: [Option<ohttp::Server>; 2],
Collaborator

@arminsabouri arminsabouri Apr 3, 2026

Would we ever have more than 2 active keys? If not, I would suggest having two fields in this struct instead of indexing

{
current_key: ohttp::Server,
previous_key: Option<ohttp::Server>,
current_key_created_at: Instant,
}

This at least ensures that the primary key is always Some. And may obviate the need for key ids to be in {1,2}

Collaborator Author

Yes, we would have two keys for a short while: the grace period of the previous key before it is retired

Collaborator

I understand. However, it seems odd to me that they are optional. Shouldn't the current key always be Some?

Contributor

both keys should always be Some #1449 (comment)

Comment thread payjoin-mailroom/src/lib.rs Outdated
None
};
Ok(crate::directory::Service::new(db, ohttp_config.into(), sentinel_tag, v1))
let svc =
Collaborator

Nit: unnecessary abbreviation

Suggested change
let svc =
let service =

Contributor

@nothingmuch nothingmuch left a comment

mtime of the key file corresponds to when the key was generated, as does some of the usage of the valid_from field; it's still semantically a created_at field

i think the math would be a bit simpler if it was valid_until instead of valid_from, see suggestions. but the main thing is that when rotating, the key's mtime has to be bumped or the key rotation will switch between one long-lived key and another key which nominally lasts a few microseconds longer

Comment thread payjoin-mailroom/src/directory.rs Outdated

#[derive(Debug)]
pub struct KeyRotatingServer {
keys: [RwLock<KeySlot>; 2],
Contributor

i believe this needs to be a Box or to have some alignment declaration, or these are effectively one RwLock since locking works on a cache line granularity
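A minimal sketch of the alignment idea, assuming a 64-byte cache line (the wrapper type is hypothetical): padding each lock to its own cache line ensures the two `RwLock`s in the array don't share one.

```rust
use std::sync::RwLock;

// Pad each lock out to a full cache line; 64 bytes is a common
// cache-line size (an assumption, not universal).
#[repr(align(64))]
struct CacheLinePadded<T>(RwLock<T>);

fn main() {
    let keys: [CacheLinePadded<u8>; 2] =
        [CacheLinePadded(RwLock::new(0)), CacheLinePadded(RwLock::new(1))];
    // The alignment forces each element onto its own cache line,
    // so the two locks never occupy the same one.
    assert_eq!(std::mem::align_of::<CacheLinePadded<u8>>(), 64);
    assert!(std::mem::size_of::<CacheLinePadded<u8>>() >= 64);
    assert_eq!(*keys[1].0.read().unwrap(), 1);
}
```

Boxing each `RwLock` separately achieves a similar separation at the cost of a pointer chase per access.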

Comment thread payjoin-mailroom/src/directory.rs Outdated
#[derive(Debug)]
pub(crate) struct KeySlot {
pub(crate) server: ohttp::Server,
pub(crate) valid_from: Instant,
Contributor

since the relationship between these is always a difference of epoch length, it's enough to still keep that as a top level field and bump it along with the current keyid as it was before...

also seems like making it valid_until would simplify some of the code to deal with this?

Comment thread payjoin-mailroom/src/directory.rs Outdated
let config = match crate::key_config::gen_ohttp_server_config_with_id(old_key_id) {
Ok(c) => c,
Err(e) => {
tracing::error!("Failed to generate OHTTP key: {e}");
Contributor

this should panic, not just log, if there's no ability to generate key material something has gone terribly wrong

Comment thread payjoin-mailroom/src/directory.rs Outdated
};
let _ = tokio::fs::remove_file(keys_dir.join(format!("{old_key_id}.ikm"))).await;
if let Err(e) = crate::key_config::persist_key_config(&config, &keys_dir).await {
tracing::error!("Failed to persist OHTTP key: {e}");
Contributor

this should also panic, if it's impossible to write to disk then again something has gone terribly wrong

Comment thread payjoin-mailroom/src/directory.rs Outdated
// Replace a slot with fresh key material.
pub async fn overwrite(&self, key_id: u8, server: ohttp::Server) {
assert!(key_id <= 1, "key_id must be 0 or 1");
*self.keys[key_id as usize].write().await = KeySlot { server, valid_from: Instant::now() };
Contributor

it's not valid from now, it's valid from the end of the previous epoch

the point of renaming it is to indicate that this represents when the key starts being active not when it is generated

Comment thread payjoin-mailroom/src/directory.rs Outdated
let mut res = Response::new(full(ohttp_keys));
res.headers_mut().insert(CONTENT_TYPE, HeaderValue::from_static("application/ohttp-keys"));
if let Some(max_age) = self.ohttp_keys_max_age {
let remaining = max_age.saturating_sub(self.ohttp.current_valid_from().await.elapsed());
Contributor

in line with the other simplification of the delays, adding something like ROTATION_GRACE / 3 to this duration should be done here

Comment thread payjoin-mailroom/src/directory.rs Outdated
Comment on lines +524 to +529
let switch_delay = {
let valid_from = keyset.current_valid_from().await;
let switch_at = valid_from + interval - ROTATION_GRACE / 2;
switch_at.saturating_duration_since(Instant::now())
};
tokio::time::sleep(switch_delay).await;
Contributor

Suggested change
let switch_delay = {
let valid_from = keyset.current_valid_from().await;
let switch_at = valid_from + interval - ROTATION_GRACE / 2;
switch_at.saturating_duration_since(Instant::now())
};
tokio::time::sleep(switch_delay).await;
let switch_delay = {
let switch_at = keyset.current_valid_until().await;
switch_at.saturating_duration_since(Instant::now())
};
tokio::time::sleep(switch_delay).await;

i think all the math would be easier to follow if the field was defined like this

Contributor

tokio::time::sleep_until exists, no need to calculate the difference from now

Comment thread payjoin-mailroom/src/directory.rs Outdated
Comment on lines +536 to +541
let overwrite_delay = {
let valid_from = keyset.valid_from(old_key_id).await;
let overwrite_at = valid_from + interval + ROTATION_GRACE;
overwrite_at.saturating_duration_since(Instant::now())
};
tokio::time::sleep(overwrite_delay).await;
Contributor

Suggested change
let overwrite_delay = {
let valid_from = keyset.valid_from(old_key_id).await;
let overwrite_at = valid_from + interval + ROTATION_GRACE;
overwrite_at.saturating_duration_since(Instant::now())
};
tokio::time::sleep(overwrite_delay).await;
tokio::time::sleep(ROTATION_GRACE).await;

Comment thread payjoin-mailroom/src/lib.rs Outdated
let valid_from = now.checked_sub(age).unwrap_or(now);
let key_id = cfg.key_id();
slots[key_id as usize] =
Some(crate::directory::KeySlot { server: cfg.into_server(), valid_from });
Contributor

valid_from only makes sense as mtime for the initial key

secondly, if one or both of these keys are expired, then the key generation logic should trigger here during initialization

Comment thread payjoin-mailroom/src/directory.rs Outdated
};
tokio::time::sleep(switch_delay).await;

let old_key_id = keyset.current_key_id();
Contributor

the mtime of the key we're about to switch to should be updated at this moment


// Grace period after a switch during which the old key is still
// accepted for decapsulation.
const ROTATION_GRACE: Duration = Duration::from_secs(30);
Contributor

Suggested change
const ROTATION_GRACE: Duration = Duration::from_secs(30);
const ROTATION_GRACE: Duration = Duration::from_secs(15);

this delay should be reduced somewhat, it should only really account for end to end latency, and even then the expected delays, not the worst case. so thinking about it more, 30 seconds seems a bit excessive

Contributor

unfortunately this suggestion is completely wrong, we need the grace period to be roughly 1 week (!!!) long because of the way OHTTP configs and expiry is handled in BIP 77 URIs, see later review comments

Comment thread payjoin-mailroom/src/lib.rs Outdated
}
}

let configs = crate::key_config::read_server_configs_by_mtime(ohttp_keys_dir)?;
Contributor

this sorts by mtime, but then they are installed by keyid... isn't it simpler to just return them in keyid order? the older mtime would correspond to the valid_from of the earlier epoch, allowing the later epoch's valid_from to be calculated from that even if it hadn't yet started (server was shut down after a key was created but before it was rotated into use)

Comment thread payjoin-mailroom/src/directory.rs Outdated
Comment on lines +94 to +95
current.0 = 1 - current.0;
current.1 = Instant::now() + interval;
Contributor

in my suggestion i believe i wrote RwLock<(u8, Instant)> but i didn't mean that literally, just the tuple a struct would be equivalent to, IMO this should have named fields
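a std-only sketch of that suggestion, with named fields instead of the tuple; the `ActiveKey`/`KeySet` names are illustrative, and `std::sync::RwLock` stands in for the tokio lock used in the actual code:

```rust
use std::sync::RwLock;
use std::time::{Duration, Instant};

// Illustrative stand-in for the RwLock<(u8, Instant)> tuple: the same two
// values, but with names that make the rotation math self-documenting.
struct ActiveKey {
    key_id: u8,           // which of the two slots is currently live
    valid_until: Instant, // when the next rotation should happen
}

struct KeySet {
    active: RwLock<ActiveKey>,
}

impl KeySet {
    fn switch(&self, interval: Duration) {
        let mut active = self.active.write().expect("lock poisoned");
        active.key_id = 1 - active.key_id; // flip between slot 0 and slot 1
        active.valid_until = Instant::now() + interval;
    }
}
```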

Comment thread payjoin-mailroom/src/directory.rs Outdated
.await
.saturating_duration_since(Instant::now())
.min(max_age)
.saturating_sub(ROTATION_GRACE / 3);
Contributor

this should add a few seconds, not subtract, the purpose is to avoid clients expiring this before the key is actually rotated (a client whose clock is running fast) only to fetch the same key and have it expire basically immediately
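a tiny sketch of the arithmetic being argued for, assuming a hypothetical `CLIENT_SKEW_MARGIN` constant (the exact value is not from the PR, only the sign of the adjustment matters):

```rust
use std::time::{Duration, Instant};

// Assumed margin; the point is only the sign: pad the client's cache
// lifetime slightly past the rotation boundary rather than ending it early.
const CLIENT_SKEW_MARGIN: Duration = Duration::from_secs(5);

// A client whose clock runs fast and expires its cached key early would just
// refetch the same key and see it expire almost immediately, so the margin
// is added, not subtracted.
fn cache_max_age(valid_until: Instant) -> Duration {
    valid_until
        .saturating_duration_since(Instant::now())
        .saturating_add(CLIENT_SKEW_MARGIN)
}
```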

Comment thread payjoin-mailroom/src/directory.rs Outdated
loop {
// Sleep until just before the current key expires.
let valid_until = keyset.valid_until().await;
let switch_at = valid_until.checked_sub(ROTATION_GRACE / 2).unwrap_or(valid_until);
Contributor

since the only meaning of the valid_until field is to time this, why not just set it to the time things should be rotated (no grace period related adjustment)

Collaborator Author

i made these adjustments

Comment thread payjoin-mailroom/src/directory.rs Outdated
let valid_until = keyset.valid_until().await;
let overwrite_at =
valid_until.checked_sub(interval).unwrap_or(valid_until) + ROTATION_GRACE;
tokio::time::sleep_until(overwrite_at.into()).await;
Contributor

... then this can just be tokio::time::sleep(ROTATION_GRACE)

Comment thread payjoin-mailroom/src/lib.rs Outdated
let slot1 = slots[1].take().expect("slot 1 missing after init");
let valid_until = match interval {
Some(ivl) => now + ivl.saturating_sub(current_age),
None => now + std::time::Duration::from_secs(365 * 24 * 3600),
Contributor

doesn't this force rotation on a yearly schedule even if not configured?

Collaborator Author

Yes it does, but this part of the code is never reached as it's dependent on max-age being "some". i set it to 10 years now

Comment thread payjoin-mailroom/src/directory.rs Outdated
// Capture old key id before switching, then switch.
// `switch` stamps valid_until = Instant::now() + interval, anchored
// to the actual moment the new key goes live.
let old_key_id = keyset.current_key_id().await;
Contributor

the new key file's mtime needs to be updated here before switching, otherwise it may be served in response to a key config GET request before the server shuts down, and on the next startup it will be invalidated, causing the client to get a 422 response for their cached key

@nothingmuch
Contributor

please consider all scenarios and make sure they are handled correctly:

  • server initializes with two expired keys
  • server initializes with one valid key and one expired key
  • server initializes with two valid keys

as far as i can tell only the last one is, and even that is buggy due to not updating mtime

there should be tests covering this
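those three cases reduce to a small decision that could be unit tested without touching the filesystem; a hypothetical sketch (the names and the `Option`-based encoding of "expired" are assumptions, not the PR's actual API):

```rust
use std::time::Duration;

// Hypothetical classification of the startup scenarios: each slot's
// remaining lifetime is None if the key on disk has already expired.
#[derive(Debug, PartialEq)]
enum InitAction {
    RegenerateBoth,    // both keys expired: mint two fresh keys
    RegenerateExpired, // one valid key: keep it active, replace the other
    UseOlderAsActive,  // both valid: activate the one with the older mtime
}

fn init_action(remaining: [Option<Duration>; 2]) -> InitAction {
    match remaining {
        [None, None] => InitAction::RegenerateBoth,
        [Some(_), None] | [None, Some(_)] => InitAction::RegenerateExpired,
        [Some(_), Some(_)] => InitAction::UseOlderAsActive,
    }
}
```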

This PR addresses payjoin#445. It adds OHTTP key rotation to payjoin-mailroom.
Mailroom operators can now decide the time interval for keys to be rotated.
Also, if a key has expired, a 422 error is returned to clients. Clients can
handle the key rotation via the Cache-Control header returned by the directory.

fix spawn rotation
@zealsham
Collaborator Author

please consider all scenarios and make sure they are handled correctly:

  • server initializes with two expired keys
  • server initializes with one valid key and one expired key
  • server initializes with two valid keys

as far as i can tell only the last one is, and even that is buggy due to not updating mtime

there should be tests covering this

implemented this
when the server initializes with two expired keys, new keys are generated and an active key assigned to the key with id=0
when the server initializes with one valid key, the remaining interval is calculated from the mtime of the key, the key is then set as the active key and valid_until set to the remaining time
if the server initializes with two valid keys, the key with the older mtime is set as the active key and the remaining time calculated for rotation

Contributor

@nothingmuch nothingmuch left a comment

the code is looking good. i have a few suggestions

but more importantly, i also realized there's a problem with the design i proposed...

a pj URI will contain the OHTTP config from the receiver to the sender. the receiver can set expiry to be before the OHTTP config expires, but that just shortens the time window for the URI to be useful. if a longer duration is required the receiver can only wait until the config expires and then generate a new one.

in addition to cache control, we should probably send a Last-Modified header as well.

one potential workaround for this is to allow post dating an If-Modified-Since, i.e. have it contain a date after the expiry of the current cache control, specifically to request the next key. if the interval is long this should probably be capped at ~1 week or 24 hours, or whatever we decide on in #1457

however this can be a strong fingerprint, it leaks information about the expiry that a receiver is setting up, and allows linking the sender to the receiver more strongly based on this... i think this means that the design of the URI including the OHTTP config is inherently at odds with privacy preserving key rotation

Comment on lines +59 to +60
slot0: KeySlot,
slot1: KeySlot,
Contributor

KeySlot wrapper contains no additional information, so i think it just imposes boilerplate on the calling code

there's a better use for it IMO see other comment

Suggested change
slot0: KeySlot,
slot1: KeySlot,
slot0: ohttp::Server,
slot1: ohttp::Server,

Comment on lines +41 to +43
pub(crate) struct KeySlot {
pub(crate) server: ohttp::Server,
}
Contributor

i think this type can be removed as a pub type, it's a newtype wrapper for ohttp::Server but doesn't have an impl block...

it can be instead a private struct that wraps the boxed mutex and implements overwrite and decapsulate methods that take the lock internally.

Suggested change
pub(crate) struct KeySlot {
pub(crate) server: ohttp::Server,
}
struct KeySlot {
pub(crate) server: Box<RwLock<ohttp::Server>>,
}

Comment on lines +82 to +86
let key_id = ohttp_body.first().copied().ok_or(ohttp::Error::Truncated)?;
match self.keys.get(key_id as usize) {
Some(slot) => slot.read().await.server.decapsulate(ohttp_body),
None => Err(ohttp::Error::KeyId),
}
Contributor

nit: i find more ?-heavy code easier to read because of early exit, and this would be more consistent. not a big difference, a match block is still readable, but maybe you also prefer this approach

Suggested change
let key_id = ohttp_body.first().copied().ok_or(ohttp::Error::Truncated)?;
match self.keys.get(key_id as usize) {
Some(slot) => slot.read().await.server.decapsulate(ohttp_body),
None => Err(ohttp::Error::KeyId),
}
let key_id = ohttp_body.first().copied().ok_or(ohttp::Error::Truncated)? as usize;
let slot = self.keys.get(key_id).ok_or(ohttp::Error::KeyId)?;
slot.read().await.server.decapsulate(ohttp_body)

or even

Suggested change
let key_id = ohttp_body.first().copied().ok_or(ohttp::Error::Truncated)?;
match self.keys.get(key_id as usize) {
Some(slot) => slot.read().await.server.decapsulate(ohttp_body),
None => Err(ohttp::Error::KeyId),
}
let key_id = ohttp_body.first().copied().ok_or(ohttp::Error::Truncated)? as usize;
self.keys.get(key_id).ok_or(ohttp::Error::KeyId)?.decapsulate(ohttp_body)

if impl KeySlot gets a decapsulate

Comment on lines 470 to +496
let ohttp_keys = self
.ohttp
.config()
.encode()
.encode_current()
.await
.map_err(|e| HandlerError::InternalServerError(e.into()))?;
let mut res = Response::new(full(ohttp_keys));
res.headers_mut().insert(CONTENT_TYPE, HeaderValue::from_static("application/ohttp-keys"));
if let Some(max_age) = self.ohttp_keys_max_age {
// Subtract ROTATION_GRACE / 3 so clients refresh their cached key
// slightly before the rotation boundary, staying well within the
// grace window where the old key is still accepted.
let remaining = self
.ohttp
.valid_until()
.await
.saturating_duration_since(Instant::now())
.min(max_age)
.saturating_add(ROTATION_GRACE / 3);
res.headers_mut().insert(
CACHE_CONTROL,
HeaderValue::from_str(&format!(
"public, s-maxage={}, immutable",
remaining.as_secs()
))
.expect("valid header value"),
);
}
Contributor

there's a race condition here.

ohttp_keys is the serialized keys but then ohttp.valid_until() is only called after constructing the response, which may race with the rotation.

encode_current should probably just return a pair, the config and how long it's valid, from a method that holds a lock on the keyslot while reading both values out of the active key struct, then encodes that keyslot's config before dropping the lock

Contributor

the duration-instead-of-instant conversion logic can also be done inside encode_current, and i don't think the min(max_age) is required; the max age param affects the active key structure, and the valid until data is set up by the active key rotation, so this part of the code can just trust it instead of redundantly enforcing the config parameter
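a simplified std-only sketch of an encode_current that returns the config together with its remaining validity under a single lock acquisition (the types here are illustrative stand-ins for the real ohttp::Server state, and `std::sync::RwLock` stands in for the tokio lock):

```rust
use std::sync::RwLock;
use std::time::{Duration, Instant};

// Simplified stand-in for the active key: already-encoded config bytes plus
// the rotation deadline.
struct ActiveKey {
    encoded_config: Vec<u8>,
    valid_until: Instant,
}

struct KeySet {
    active: RwLock<ActiveKey>,
}

impl KeySet {
    // Read the config and its remaining validity under one lock acquisition,
    // so a concurrent rotation can't change one value but not the other.
    fn encode_current(&self) -> (Vec<u8>, Duration) {
        let active = self.active.read().expect("lock poisoned");
        let remaining = active.valid_until.saturating_duration_since(Instant::now());
        (active.encoded_config.clone(), remaining)
    }
}
```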

// Sleep until just before the current key expires.
let valid_until = keyset.valid_until().await;
tracing::info!("Sleeping until {:?}", valid_until);
//let switch_at = valid_until.checked_sub(ROTATION_GRACE / 2).unwrap_or(valid_until);
Contributor

Suggested change
//let switch_at = valid_until.checked_sub(ROTATION_GRACE / 2).unwrap_or(valid_until);

Comment on lines +547 to +550
tracing::info!(
"---------------------------------------------------------------------------"
);

Contributor

Suggested change
tracing::info!(
"---------------------------------------------------------------------------"
);

Collaborator Author

oops, mistakenly left a line i used for separating outputs

"---------------------------------------------------------------------------"
);

// Touch the new active key file *after* overwriting the old slot so
Contributor

Suggested change
// Touch the new active key file *after* overwriting the old slot so
// Touch the new active key file *before* overwriting the old slot so

// its mtime is newest on disk. On restart, initialization reads this mtime
// and derives valid_until from its age.
let active_path = keys_dir.join(format!("{new_key_id}.ikm"));
let times = std::fs::FileTimes::new().set_modified(std::time::SystemTime::now());
Contributor

should be tokio::fs not std::fs

Collaborator Author

True! This will block the async runtime
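for reference, the touch itself is small; a sketch using std::fs (in the async path this blocking call would be wrapped in tokio::task::spawn_blocking or routed through tokio's fs helpers rather than run directly on the runtime; the function name is illustrative):

```rust
use std::fs::{FileTimes, OpenOptions};
use std::io;
use std::path::Path;
use std::time::SystemTime;

// Bump a key file's mtime to "now" so restart logic can derive the key's
// age from it. This is a blocking std::fs call: in the server it belongs
// inside tokio::task::spawn_blocking, not directly on the async runtime.
fn touch_mtime(path: &Path) -> io::Result<()> {
    let file = OpenOptions::new().write(true).open(path)?;
    file.set_times(FileTimes::new().set_modified(SystemTime::now()))
}
```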

match secs {
None => Ok(None),
Some(0) => Err(<D::Error as serde::de::Error>::custom(
"ohttp_keys_max_age must be greater than 0 seconds when set",
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure where the right place to validate is, though i guess this could be renamed to deserialize_key_epoch_duration_secs or something to make it clear that it's not for arbitrary durations...

but either way, the value should probably be at least 2*GRACE_PERIOD, and we might want to warn for unreasonably short values

@nothingmuch
Contributor

oh.. hmm. turns out this isn't the first time i realized that, even came up with the same hack lol: #445 (comment)

anyway i think past me was less of an idiot than present me, and was correct in saying this isn't a big deal if the no-collusion assumption of the OHTTP privacy model holds; although this can link sessions' mailboxes, there's not much a directory can do with that info, and temporal leaks likely reveal that anyway

@nothingmuch
Contributor

hmm, doing weird tricks with asking for the next key seems overcomplicated, so again i think past self was less of an idiot, but unfortunately this means that the grace period needs to be at least as long as we expect a session to be (24H? a week? ...), instead of on the order of seconds =(
