Add option to allocate large MLDSA objects from a global buffer#786
Add option to allocate large MLDSA objects from a global buffer#786bensze01 wants to merge 21 commits into
Conversation
d3e85ab to
8b57228
Compare
cfee876 to
b937dbf
Compare
39a6e29 to
e46c386
Compare
There was a problem hiding this comment.
Pull request overview
Adds an experimental option TF_PSA_CRYPTO_PQCP_BUFFER_ALLOC that makes mldsa-native allocate its large temporary objects from a single, global, statically-allocated bump buffer instead of the stack. When MBEDTLS_THREADING_C is enabled, access to the buffer is serialized by a new dedicated mutex.
Changes:
- New
pqcp_buffer_alloc.{h,c}defines the global buffer andMLD_CUSTOM_ALLOC/MLD_CUSTOM_FREEmacros plumbed into mldsa-native viapqcp-config.hand a per-call empty context stuffed by macros inwrap_mldsa_native.h. - A new global mutex
mbedtls_threading_pqcp_buffer_alloc_mutexis declared/initialised inplatform/threading.{c,h}to guard the buffer. - New configuration option (config metadata + scripts + two new CI components exercising the option with and without pthreads).
Reviewed changes
Copilot reviewed 12 out of 12 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| include/psa/crypto_config.h | Adds docblock + define for TF_PSA_CRYPTO_PQCP_BUFFER_ALLOC (currently enabled by default; module path also wrong). |
| scripts/data_files/config-options-current.txt | Registers the new option name. |
| scripts/config.py | Excludes the new option from blanket "full" configs. |
| drivers/pqcp/src/pqcp-config.h | Wires MLD_CONFIG_CONTEXT_PARAMETER, MLD_CONFIG_CUSTOM_ALLOC_FREE, and the alloc/free macros into mldsa-native. |
| drivers/pqcp/src/pqcp_buffer_alloc.h | Defines the global buffer, context struct and CUSTOM_ALLOC/CUSTOM_FREE bump-allocator macros. |
| drivers/pqcp/src/pqcp_buffer_alloc.c | Provides the storage for the global buffer. |
| drivers/pqcp/src/wrap_mldsa_native.h | Adds per-function macro wrappers that inject a zero-initialised context argument when context-stuffing is enabled. |
| drivers/pqcp/src/wrap_mldsa_native.c | Disables the new wrapper macros locally so the real function definitions are not affected, and tightens the guard with MBEDTLS_PSA_CRYPTO_C. |
| platform/threading.c, platform/threading_internal.h | Declares/initialises/destroys the new mutex when the option is enabled. |
| tests/scripts/components-configuration-crypto.sh | Adds two CI components covering the new option with and without pthread. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
gilles-peskine-arm
left a comment
There was a problem hiding this comment.
I've done a design review, not yet a full code review.
The design feels more complex and fragile than it needs to be. So please document the rationale. I especially want to understand why a context parameter is needed at all: can't we just make the offset a global variable, given that we end up having only one mldsa-native function active at time anyway?
| //#define TF_PSA_CRYPTO_PQCP_OWN_SHAKE | ||
|
|
||
| /** | ||
| * \def TF_PSA_CRYPTO_PQCP_BUFFER_ALLOC |
There was a problem hiding this comment.
Please add interesting entries to tests/suites/test_suite_config.crypto_combinations.data: at least
depends_on:MBEDTLS_PSA_CRYPTO_C:TF_PSA_CRYPTO_PQCP_MLDSA_ENABLED:!TF_PSA_CRYPTO_PQCP_BUFFER_ALLOC
or maybe with/without threads.
| #endif | ||
|
|
||
| // Sufficient for signing with TF_PSA_CRYPTO_PQCP_MLDSA_87_ENABLED | ||
| #define TF_PSA_CRYPTO_PQCP_ALLOC_BUFFER_SIZE 123200 |
There was a problem hiding this comment.
Where does this number come from?
I would like to have some validation somewhere. Presumably as a static assertion that compares our value with some value defined by mldsa-native.
| #define TF_PSA_CRYPTO_PQCP_ALLOC_LOCK() 0 | ||
| #endif | ||
|
|
||
| #define TF_PSA_CRYPTO_PQCP_CUSTOM_ALLOC(v, T, N, context) \ |
There was a problem hiding this comment.
Can we write this whole macro as T *v = tf_psa_crypto_pqcp_custom_alloc(sizeof(T), N, &context) where tf_psa_crypto_pqcp_custom_alloc is a static inline function? It would be more readable and easier to debug.
There was a problem hiding this comment.
I've kept the overflow-prevention check as part of the macro, just so we don't have to figure out always_inline for all the compilers we support. (GCC refused to inline the check otherwise)
There was a problem hiding this comment.
Looking around a bit, I see that we have always_inline support for GCC (, Clang?) and IAR in alignment.h, while mldsa-native has support for GCC, Clang and MSVC.
If IAR supports _Pragma(), we could define an ALWAYS_INLINE macro, and not have to mess around with #if/#else.
| This should be optimized away at compile-time */ \ | ||
| if ((N) > 0 && (N) <= TF_PSA_CRYPTO_PQCP_ALLOC_BUFFER_SIZE / sizeof(T)) { \ | ||
| if ((context).alloc_offset == 0) { \ | ||
| (context).alloc_offset = TF_PSA_CRYPTO_PQCP_ALLOC_LOCK(); \ |
There was a problem hiding this comment.
I don't like having alloc_offset double as an error code. It's error-prone. Here we could use a local variable. Where else is alloc_offset checked for negative values?
There was a problem hiding this comment.
I introduced the error-code usage to have a way to fail all further allocations, as that reduced code size considerably. Without that, the generated code has to check each allocation individually to see if it succeeded, then calls mbedtls_zeroize for allocations.
| if ((context).alloc_offset == 0) { \ | ||
| (context).alloc_offset = TF_PSA_CRYPTO_PQCP_ALLOC_LOCK(); \ |
There was a problem hiding this comment.
I'm trying to understand the concurrency behavior. If I understand correctly:
- When a thread calls one of the mldsa-native entry points, it passes a context with
alloc_offset=0. Thus the first allocation using that context object locks the buffer, which is then for exclusive use inside this thread. - When that first allocation is freed, we release the lock. Assuming that allocations are arranged in a FILO order (i.e. like a stack), the first allocation is the last one to be freed, so that means the thread no longer needs the global buffer, so it's correct to release the lock.
This works, but why not instead have our code call
mbedtls_mutex_lock(&mbedtls_threading_pqcp_buffer_alloc_mutex);
the_mldsa_native_function(...);
mbedtls_mutex_unlock(&mbedtls_threading_pqcp_buffer_alloc_mutex);
? That would be simpler and wouldn't rely on the FILO order of allocations. I believe that the FILO order is likely to remain true as mldsa-native evolves, but it isn't currently guaranteed by the mldsa-native documentation.
There was a problem hiding this comment.
Yeah, i guess now that I'm defining macros for each entry point, this approach makes more sense.
| if ((size_t) (context).alloc_offset <= \ | ||
| TF_PSA_CRYPTO_PQCP_ALLOC_BUFFER_SIZE - MLD_ALIGN_UP(sizeof(T) * (N))) { \ | ||
| (v) = (T *) (tf_psa_crypto_pqcp_alloc_buffer + (context).alloc_offset); \ | ||
| (context).alloc_offset += MLD_ALIGN_UP(sizeof(T) * (N)); \ |
There was a problem hiding this comment.
The value of context.alloc_offset is how the different allocations made from one mldsa-native call don't overlap, so we rely on the context not being copied.
So why not make it a global variable?
There was a problem hiding this comment.
For code size. GCC doesn't know that the value of the global variable alloc_offset is guaranteed to be the same after a call to any function - even one that allocates from the buffer - since the default implementation uses on-stack allocation.
There was a problem hiding this comment.
Making alloc_offset a pass-by-value parameter makes this clear for the compiler.
There was a problem hiding this comment.
I tried making the offset global and I see a size reduction: (positive = size gain for my version)
drivers/pqcp/src/pqcp_buffer_alloc.o: 0 -> 136 (diff: -136)
drivers/pqcp/src/psa_crypto_mldsa.o: 464 -> 488 (diff: -24)
drivers/pqcp/src/wrap_mldsa_native.o: 9612 -> 9188 (diff: 424)
https://github.com/gilles-peskine-arm/TF-PSA-Crypto/tree/mldsa-buffer-alloc-global
What downsides do you see to my version (in terms of general design, it would need some cleanup and documentation)?
| int alloc_offset; // Negative values indicate allocator errors | ||
| }; | ||
|
|
||
| #if defined(MBEDTLS_THREADING_C) |
There was a problem hiding this comment.
I wouldn't mind just saying that TF_PSA_CRYPTO_PQCP_BUFFER_ALLOC is incompatible with multithreading for now, since it's intended for constrained environments. But if we do claim that it's compatible, I would like it to be tested.
| /* mld_poly */ 3*(1024) \ | ||
| ) /* == 123200 */ | ||
|
|
||
| #if defined(MBEDTLS_THREADING_C) |
There was a problem hiding this comment.
Nothing includes wrap_mldsa_native.h if PSA_CRYPTO_C is disabled, so this should be fine.
Signed-off-by: Bence Szépkúti <bence.szepkuti@arm.com>
Signed-off-by: Bence Szépkúti <bence.szepkuti@arm.com>
Signed-off-by: Bence Szépkúti <bence.szepkuti@arm.com>
Signed-off-by: Bence Szépkúti <bence.szepkuti@arm.com>
Signed-off-by: Bence Szépkúti <bence.szepkuti@arm.com>
Signed-off-by: Bence Szépkúti <bence.szepkuti@arm.com>
Signed-off-by: Gilles Peskine <Gilles.Peskine@arm.com>
Signed-off-by: Bence Szépkúti <bence.szepkuti@arm.com>
MLD_FREE() will call mbedtls_platform_zeroize for us via mld_zeroize. Signed-off-by: Bence Szépkúti <bence.szepkuti@arm.com>
1ce1d20 to
4bbd5cd
Compare
This matches the normal behaviour of mldsa-native more closely. Signed-off-by: Bence Szépkúti <bence.szepkuti@arm.com>
Signed-off-by: Bence Szépkúti <bence.szepkuti@arm.com>
Signed-off-by: Bence Szépkúti <bence.szepkuti@arm.com>
6bbe397 to
abeb825
Compare
Signed-off-by: Bence Szépkúti <bence.szepkuti@arm.com>
Signed-off-by: Bence Szépkúti <bence.szepkuti@arm.com>
Signed-off-by: Bence Szépkúti <bence.szepkuti@arm.com>
Signed-off-by: Bence Szépkúti <bence.szepkuti@arm.com>
Signed-off-by: Bence Szépkúti <bence.szepkuti@arm.com>
Signed-off-by: Bence Szépkúti <bence.szepkuti@arm.com>
This only affects mldsa65, as all other parameter sets were already divisible by 32. Signed-off-by: Bence Szépkúti <bence.szepkuti@arm.com>
bjwtaylor
left a comment
There was a problem hiding this comment.
Still need to do a full review, however an initial comment/question?
| sig, siglen, m, mlen, pre, prelen, rnd, sk, externalmu); | ||
| int cleanup_ret = tf_psa_crypto_pqcp_alloc_done(); | ||
| if (ret != 0) { | ||
| return ret; |
There was a problem hiding this comment.
Currently the returns from this function are ignored. As drivers/pqcp/src/psa_crypto_mldsa.c:151 immediately overrides the return with 0, is this ok?
There was a problem hiding this comment.
No, that doesn't sound correct - this looks like I messed up during my last rebase.
Signed-off-by: Bence Szépkúti <bence.szepkuti@arm.com>
Resolves #645
PR checklist