Skip to content

Add option to allocate large MLDSA objects from a global buffer#786

Open
bensze01 wants to merge 21 commits into
Mbed-TLS:developmentfrom
bensze01:mldsa-buffer-alloc
Open

Add option to allocate large MLDSA objects from a global buffer#786
bensze01 wants to merge 21 commits into
Mbed-TLS:developmentfrom
bensze01:mldsa-buffer-alloc

Conversation

@bensze01
Copy link
Copy Markdown
Contributor

@bensze01 bensze01 commented Apr 30, 2026

Resolves #645

PR checklist

@bensze01 bensze01 changed the title Mldsa buffer alloc Add option to allocate large MLDSA objects from a global buffer Apr 30, 2026
@bensze01 bensze01 force-pushed the mldsa-buffer-alloc branch 8 times, most recently from d3e85ab to 8b57228 Compare May 6, 2026 16:06
@bensze01 bensze01 force-pushed the mldsa-buffer-alloc branch 5 times, most recently from cfee876 to b937dbf Compare May 13, 2026 16:17
@bensze01 bensze01 force-pushed the mldsa-buffer-alloc branch 3 times, most recently from 39a6e29 to e46c386 Compare May 14, 2026 18:47
@bensze01 bensze01 added enhancement New feature or request size-s Estimated task size: small (~2d) priority-high High priority - will be reviewed soon labels May 18, 2026
@bensze01 bensze01 marked this pull request as ready for review May 18, 2026 15:16
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds an experimental option TF_PSA_CRYPTO_PQCP_BUFFER_ALLOC that makes mldsa-native allocate its large temporary objects from a single, global, statically-allocated bump buffer instead of the stack. When MBEDTLS_THREADING_C is enabled, access to the buffer is serialized by a new dedicated mutex.

Changes:

  • New pqcp_buffer_alloc.{h,c} defines the global buffer and MLD_CUSTOM_ALLOC/MLD_CUSTOM_FREE macros plumbed into mldsa-native via pqcp-config.h and a per-call empty context stuffed by macros in wrap_mldsa_native.h.
  • A new global mutex mbedtls_threading_pqcp_buffer_alloc_mutex is declared/initialised in platform/threading.{c,h} to guard the buffer.
  • New configuration option (config metadata + scripts + two new CI components exercising the option with and without pthreads).

Reviewed changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
include/psa/crypto_config.h Adds docblock + define for TF_PSA_CRYPTO_PQCP_BUFFER_ALLOC (currently enabled by default; module path also wrong).
scripts/data_files/config-options-current.txt Registers the new option name.
scripts/config.py Excludes the new option from blanket "full" configs.
drivers/pqcp/src/pqcp-config.h Wires MLD_CONFIG_CONTEXT_PARAMETER, MLD_CONFIG_CUSTOM_ALLOC_FREE, and the alloc/free macros into mldsa-native.
drivers/pqcp/src/pqcp_buffer_alloc.h Defines the global buffer, context struct and CUSTOM_ALLOC/CUSTOM_FREE bump-allocator macros.
drivers/pqcp/src/pqcp_buffer_alloc.c Provides the storage for the global buffer.
drivers/pqcp/src/wrap_mldsa_native.h Adds per-function macro wrappers that inject a zero-initialised context argument when context-stuffing is enabled.
drivers/pqcp/src/wrap_mldsa_native.c Disables the new wrapper macros locally so the real function definitions are not affected, and tightens the guard with MBEDTLS_PSA_CRYPTO_C.
platform/threading.c, platform/threading_internal.h Declares/initialises/destroys the new mutex when the option is enabled.
tests/scripts/components-configuration-crypto.sh Adds two CI components covering the new option with and without pthread.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread include/psa/crypto_config.h Outdated
Comment thread include/psa/crypto_config.h Outdated
Comment thread drivers/pqcp/src/pqcp_buffer_alloc.h
Comment thread drivers/pqcp/src/pqcp_buffer_alloc.h Outdated
@gilles-peskine-arm gilles-peskine-arm added the needs-review Every commit must be reviewed by at least two team members label May 18, 2026
Copy link
Copy Markdown
Contributor

@gilles-peskine-arm gilles-peskine-arm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've done a design review, not yet a full code review.

The design feels more complex and fragile than it needs to be. So please document the rationale. I especially want to understand why a context parameter is needed at all: can't we just make the offset a global variable, given that we end up having only one mldsa-native function active at time anyway?

//#define TF_PSA_CRYPTO_PQCP_OWN_SHAKE

/**
* \def TF_PSA_CRYPTO_PQCP_BUFFER_ALLOC
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add interesting entries to tests/suites/test_suite_config.crypto_combinations.data: at least

depends_on:MBEDTLS_PSA_CRYPTO_C:TF_PSA_CRYPTO_PQCP_MLDSA_ENABLED:!TF_PSA_CRYPTO_PQCP_BUFFER_ALLOC

or maybe with/without threads.

Comment thread drivers/pqcp/src/pqcp_buffer_alloc.h Outdated
#endif

// Sufficient for signing with TF_PSA_CRYPTO_PQCP_MLDSA_87_ENABLED
#define TF_PSA_CRYPTO_PQCP_ALLOC_BUFFER_SIZE 123200
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where does this number come from?

I would like to have some validation somewhere. Presumably as a static assertion that compares our value with some value defined by mldsa-native.

Comment thread drivers/pqcp/src/pqcp_buffer_alloc.h Outdated
#define TF_PSA_CRYPTO_PQCP_ALLOC_LOCK() 0
#endif

#define TF_PSA_CRYPTO_PQCP_CUSTOM_ALLOC(v, T, N, context) \
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we write this whole macro as T *v = tf_psa_crypto_pqcp_custom_alloc(sizeof(T), N, &context) where tf_psa_crypto_pqcp_custom_alloc is a static inline function? It would be more readable and easier to debug.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've kept the overflow-prevention check as part of the macro, just so we don't have to figure out always_inline for all the compilers we support. (GCC refused to inline the check otherwise)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking around a bit, I see that we have always_inline support for GCC (, Clang?) and IAR in alignment.h, while mldsa-native has support for GCC, Clang and MSVC.

If IAR supports _Pragma(), we could define an ALWAYS_INLINE macro, and not have to mess around with #if/#else.

Comment thread drivers/pqcp/src/pqcp_buffer_alloc.h Outdated
This should be optimized away at compile-time */ \
if ((N) > 0 && (N) <= TF_PSA_CRYPTO_PQCP_ALLOC_BUFFER_SIZE / sizeof(T)) { \
if ((context).alloc_offset == 0) { \
(context).alloc_offset = TF_PSA_CRYPTO_PQCP_ALLOC_LOCK(); \
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't like having alloc_offset double as an error code. It's error-prone. Here we could use a local variable. Where else is alloc_offset checked for negative values?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I introduced the error-code usage to have a way to fail all further allocations, as that reduced code size considerably. Without that, the generated code has to check each allocation individually to see if it succeeded, then calls mbedtls_zeroize for allocations.

Comment thread drivers/pqcp/src/pqcp_buffer_alloc.h Outdated
Comment on lines +40 to +41
if ((context).alloc_offset == 0) { \
(context).alloc_offset = TF_PSA_CRYPTO_PQCP_ALLOC_LOCK(); \
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm trying to understand the concurrency behavior. If I understand correctly:

  • When a thread calls one of the mldsa-native entry points, it passes a context with alloc_offset=0. Thus the first allocation using that context object locks the buffer, which is then for exclusive use inside this thread.
  • When that first allocation is freed, we release the lock. Assuming that allocations are arranged in a FILO order (i.e. like a stack), the first allocation is the last one to be freed, so that means the thread no longer needs the global buffer, so it's correct to release the lock.

This works, but why not instead have our code call

mbedtls_mutex_lock(&mbedtls_threading_pqcp_buffer_alloc_mutex);
the_mldsa_native_function(...);
mbedtls_mutex_unlock(&mbedtls_threading_pqcp_buffer_alloc_mutex);

? That would be simpler and wouldn't rely on the FILO order of allocations. I believe that the FILO order is likely to remain true as mldsa-native evolves, but it isn't currently guaranteed by the mldsa-native documentation.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, i guess now that I'm defining macros for each entry point, this approach makes more sense.

Comment thread drivers/pqcp/src/pqcp_buffer_alloc.h Outdated
if ((size_t) (context).alloc_offset <= \
TF_PSA_CRYPTO_PQCP_ALLOC_BUFFER_SIZE - MLD_ALIGN_UP(sizeof(T) * (N))) { \
(v) = (T *) (tf_psa_crypto_pqcp_alloc_buffer + (context).alloc_offset); \
(context).alloc_offset += MLD_ALIGN_UP(sizeof(T) * (N)); \
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The value of context.alloc_offset is how the different allocations made from one mldsa-native call don't overlap, so we rely on the context not being copied.

So why not make it a global variable?

Copy link
Copy Markdown
Contributor Author

@bensze01 bensze01 May 19, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For code size. GCC doesn't know that the value of the global variable alloc_offset is guaranteed to be the same after a call to any function - even one that allocates from the buffer - since the default implementation uses on-stack allocation.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Making alloc_offset a pass-by-value parameter makes this clear for the compiler.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried making the offset global and I see a size reduction: (positive = size gain for my version)

drivers/pqcp/src/pqcp_buffer_alloc.o: 0 -> 136 (diff: -136)
drivers/pqcp/src/psa_crypto_mldsa.o: 464 -> 488 (diff: -24)
drivers/pqcp/src/wrap_mldsa_native.o: 9612 -> 9188 (diff: 424)

https://github.com/gilles-peskine-arm/TF-PSA-Crypto/tree/mldsa-buffer-alloc-global

What downsides do you see to my version (in terms of general design, it would need some cleanup and documentation)?

int alloc_offset; // Negative values indicate allocator errors
};

#if defined(MBEDTLS_THREADING_C)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wouldn't mind just saying that TF_PSA_CRYPTO_PQCP_BUFFER_ALLOC is incompatible with multithreading for now, since it's intended for constrained environments. But if we do claim that it's compatible, I would like it to be tested.

@gilles-peskine-arm gilles-peskine-arm removed the needs-review Every commit must be reviewed by at least two team members label May 19, 2026
@bensze01 bensze01 requested review from bjwtaylor and Copilot June 3, 2026 09:11
@bensze01 bensze01 added the needs-review Every commit must be reviewed by at least two team members label Jun 3, 2026
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 12 out of 12 changed files in this pull request and generated 5 comments.

Comment thread drivers/pqcp/src/wrap_mldsa_native.c Outdated
Comment thread drivers/pqcp/src/pqcp_buffer_alloc.h Outdated
/* mld_poly */ 3*(1024) \
) /* == 123200 */

#if defined(MBEDTLS_THREADING_C)
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nothing includes wrap_mldsa_native.h if PSA_CRYPTO_C is disabled, so this should be fine.

Comment thread drivers/pqcp/src/pqcp_buffer_alloc.c Outdated
Comment thread drivers/pqcp/src/pqcp_buffer_alloc.c Outdated
bensze01 and others added 9 commits June 3, 2026 14:36
Signed-off-by: Bence Szépkúti <bence.szepkuti@arm.com>
Signed-off-by: Bence Szépkúti <bence.szepkuti@arm.com>
Signed-off-by: Bence Szépkúti <bence.szepkuti@arm.com>
Signed-off-by: Bence Szépkúti <bence.szepkuti@arm.com>
Signed-off-by: Bence Szépkúti <bence.szepkuti@arm.com>
Signed-off-by: Bence Szépkúti <bence.szepkuti@arm.com>
Signed-off-by: Gilles Peskine <Gilles.Peskine@arm.com>
Signed-off-by: Bence Szépkúti <bence.szepkuti@arm.com>
MLD_FREE() will call mbedtls_platform_zeroize for us via mld_zeroize.

Signed-off-by: Bence Szépkúti <bence.szepkuti@arm.com>
@bensze01 bensze01 force-pushed the mldsa-buffer-alloc branch from 1ce1d20 to 4bbd5cd Compare June 3, 2026 12:37
bensze01 added 3 commits June 3, 2026 18:07
This matches the normal behaviour of mldsa-native more closely.

Signed-off-by: Bence Szépkúti <bence.szepkuti@arm.com>
Signed-off-by: Bence Szépkúti <bence.szepkuti@arm.com>
Signed-off-by: Bence Szépkúti <bence.szepkuti@arm.com>
@bensze01 bensze01 force-pushed the mldsa-buffer-alloc branch from 6bbe397 to abeb825 Compare June 3, 2026 16:09
bensze01 added 8 commits June 3, 2026 18:09
Signed-off-by: Bence Szépkúti <bence.szepkuti@arm.com>
Signed-off-by: Bence Szépkúti <bence.szepkuti@arm.com>
Signed-off-by: Bence Szépkúti <bence.szepkuti@arm.com>
Signed-off-by: Bence Szépkúti <bence.szepkuti@arm.com>
Signed-off-by: Bence Szépkúti <bence.szepkuti@arm.com>
Signed-off-by: Bence Szépkúti <bence.szepkuti@arm.com>
Signed-off-by: Bence Szépkúti <bence.szepkuti@arm.com>
This only affects mldsa65, as all other parameter sets were already
divisible by 32.

Signed-off-by: Bence Szépkúti <bence.szepkuti@arm.com>
Copy link
Copy Markdown
Contributor

@bjwtaylor bjwtaylor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still need to do a full review, however an initial comment/question?

sig, siglen, m, mlen, pre, prelen, rnd, sk, externalmu);
int cleanup_ret = tf_psa_crypto_pqcp_alloc_done();
if (ret != 0) {
return ret;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently the returns from this function are ignored. As drivers/pqcp/src/psa_crypto_mldsa.c:151 immediately overrides the return with 0, is this ok?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, that doesn't sound correct - this looks like I messed up during my last rebase.

Signed-off-by: Bence Szépkúti <bence.szepkuti@arm.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request needs-review Every commit must be reviewed by at least two team members priority-high High priority - will be reviewed soon size-s Estimated task size: small (~2d)

Projects

Development

Successfully merging this pull request may close these issues.

Configuration option for using a global variable as the ML-DSA workspace

4 participants