Skip to content

[OTBN] Integrate compression option for imem and dmem#30163

Merged
nasahlpa merged 3 commits into
lowRISC:earlgrey_1.0.0from
siemen11:otbn_code_saving
Jun 19, 2026
Merged

[OTBN] Integrate compression option for imem and dmem#30163
nasahlpa merged 3 commits into
lowRISC:earlgrey_1.0.0from
siemen11:otbn_code_saving

Conversation

@siemen11

@siemen11 siemen11 commented May 20, 2026

Copy link
Copy Markdown
Contributor

Use LZ4 to add a compression option in otbn_binary (activated by setting
compress = True). If this is set, the imem and dmem of the otbn
application are compressed before they are stored.
By default, binaries are not compressed.

Change the otbn.c and otbn.h in the cryptolib to use LZ4 from lib/base
in order to decompress the loaded imem and dmem from an application.

The compression option is therefore mandatory in the cryptolib
(lib/crypto), however, the applications from silicon_creator (using
silicon_creator/lib/drivers/otbn.c/h) still use the uncompressed version
(and will not work with the compressed version).

Concerning the CRC: The CRC for the imem/dmem is calculated before
compression. For loading to OTBN, the OTBN is given the CRC of the
uncompressed payload before and checks itself the CRC of what it
receives. I.e., the CRC still provides end-2-end protection.

@siemen11 siemen11 requested a review from johannheyszl May 20, 2026 20:36
@johannheyszl

Copy link
Copy Markdown
Contributor

interesting idea.

kindly, what is the code size overhead of the addition. I'd assume significantly smaller than the saving.

@siemen11

siemen11 commented May 21, 2026

Copy link
Copy Markdown
Contributor Author

interesting idea.

kindly, what is the code size overhead of the addition. I'd assume significantly smaller than the saving.

We require an additional ~400 bytes for the decompression algorithm but save ~5400 bytes

@siemen11 siemen11 force-pushed the otbn_code_saving branch from cd68be9 to 506e010 Compare May 21, 2026 09:37
@siemen11 siemen11 changed the title [test] OTBN flash size saving test [OTBN] Integrate compression option for imem and dmem May 21, 2026
@siemen11 siemen11 force-pushed the otbn_code_saving branch from 506e010 to 0f5b321 Compare May 21, 2026 09:54
@siemen11

siemen11 commented May 21, 2026

Copy link
Copy Markdown
Contributor Author

Some quick overview for the choice of compression algorithms provided by Gemini:

Algorithm RAM (KB) Flash (KB) Speed Ratio %
LZ4 0.00 0.5 100 50%
Heatshrink 0.05 0.8 30 55%
FastLZ 0.20 1.0 90 52%
DEFLATE 4.00 3.0 40 65%
Zstandard 24.00 15.0 80 70%

Currently, LZ4 was chosen since it saves already quite a bit and it would be the least invasive in speed or RAM usage

@siemen11 siemen11 force-pushed the otbn_code_saving branch 8 times, most recently from 673988c to 7a55643 Compare May 21, 2026 14:20
@johannheyszl

Copy link
Copy Markdown
Contributor

Thanks @siemen11 - have to say that I don't see any immediate issues with this approach, and it clearly helps save code size.

Any quick thoughts on this @cfrantz @moidx ?

@siemen11 siemen11 marked this pull request as ready for review May 21, 2026 15:36
@siemen11 siemen11 requested review from a team and cfrantz as code owners May 21, 2026 15:36
@siemen11 siemen11 requested review from andrea-caforio, engdoreis and nasahlpa and removed request for a team, cfrantz and engdoreis May 21, 2026 15:36
Comment thread python-requirements.txt Outdated
attrs==25.3.0 \
--hash=sha256:427318ce031701fea540783410126f03899a97ffc6f61596ad581ac2e40e3bc3 \
--hash=sha256:75d7cefc7fb576747b2c81b4442d4d4a1ce0900973527c011d1030fd3bf4af1b
argcomplete==3.6.3 \

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is the update needed?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unsure... hence why I left it as a completely separate commit, but using the exact right version of python (3.9) and uv, I got differences when remaking python-requirements :s

*/
status_t read_otbn_load_checksum(uint32_t *checksum) {
TRY(dif_otbn_get_load_checksum(&otbn, checksum));
*checksum = abs_mmio_read32(TOP_EARLGREY_OTBN_BASE_ADDR +

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please provide more details why we shouldn't use DIF functions in the pentest framework?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually you were completely correct, this was not needed... I removed the commit entirely. Should be cleaner now, thanks 🙏

uint32_t uncomp_len = (uint32_t)(src[2] | (src[3] << 8));
src += 4;

if (LZ4_decompress((const char *)src, (char *)local_chunk_buf,

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How much runtime overhead adds the decompression and do we actually care about the runtime penalty? I assume we anyways only load the cryptolib once?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the runtime penatly applies upon every load of an OTBN binary.
Would be good to understand time overhead - agree. Suppose significantly lower than asymm crypto exec.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So running ecdsa_p256_functest_fpga_cw340_sival_rom_ext for example with this commit gives:

Successfully finished test sign_kat in 1693294 cycles or 70553 us @ 24 MHz.

This is some fluff, a sha2, and a otcrypto_ecdsa_p256_sign_config_k

Before we had

Successfully finished test sign_kat in 1598898 cycles or 66620 us @ 24 MHz.

So looks like 1 ms of time, i.e., 94396 cycles or around 6% for p256.

For ed25519 for example I see

Successfully finished test sign_verify_test in 1600837 cycles or 66701 us @ 24 MHz.

Versus before

Successfully finished test sign_verify_test in 1419775 cycles or 59157 us @ 24 MHz.

which is a 181062 cycles or 1.8 ms or a 12.7% overhead, this incudes both a sign and a verify.

And then for p384

Successfully finished test sign_kat in 3558983 cycles or 148290 us @ 24 MHz.

versus before

Successfully finished test sign_kat in 3438653 cycles or 143277 us @ 24 MHz.

So that is 120330 cycles or 1.2ms or 3.5% overhead in latency.

We always clean and load the OTBN binary, we never re-use the binary between executions. Hence the addition of ~1ms is on every operation.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TL;DR: We have a rough 25% in flash savings and a 10% in latency overhead for LZ4

@siemen11 siemen11 May 31, 2026

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did some more experiments and optimizations. We have around a 20-24% in flash savings of OTBN binaries.
The latency increase is 4-5% for p256/p384 (lower for RSA), but 14% for 25519. This is because 25519 is simply just fast...
Hence, we have 20% flash improvement and 4% latency increase

Comment thread sw/device/lib/testing/BUILD Outdated
"//sw/device/lib/base:status",
"//sw/device/lib/dif:otbn",
"//sw/otbn/crypto:run_rsa",
"//sw/otbn/crypto:test_run_rsa",

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why this change?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We now have two "types" of OTBN binaries: compressed and non-compressed.
The cryptolib will move to using compresssed binaries (run_rsa). But the otbn.c from dif, testutils, or silicon_creator will load applications as if uncompressed whereas otbn drivers loads them compressed (so decompresses first). Hence, we need to distinguish which one we are using

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess it's possible to let otbn_build emit both compressed and non-compressed blob into the same archive (either put into different sub-sections, or put multiple objects to a archive), and relies on linker gc to keep the one used.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was a bit scary to do because if we mess something up you have both binaries in there, but no, works pretty well and removes the ugly Bazel targets, thank you!

Comment thread pyproject.toml
"libclang==16.0.0",
"libcst == 1.1.0", # 1.2.0+ needs Python 3.9+
"lxml ~= 5.0",
"lz4 == 4.4.5",

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@luismarques do we have a policy regarding including new python packages? Do you know about license?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@siemen11 siemen11 force-pushed the otbn_code_saving branch 2 times, most recently from fbcf2b4 to 5953d59 Compare May 22, 2026 08:48
@anchivuk

Copy link
Copy Markdown

@siemen11 , you can also consider do this for the KATS (as discussed offline), provided the entire decompression algorithm is included inside the cryptolib

@siemen11

Copy link
Copy Markdown
Contributor Author

@siemen11 , you can also consider do this for the KATS (as discussed offline), provided the entire decompression algorithm is included inside the cryptolib

Indeed super nice idea! Though sadly trying it out since the vectors are full entropy, LZ4 seems to increase the size 😆 was worth the try!

@siemen11 siemen11 force-pushed the otbn_code_saving branch 4 times, most recently from a83f2ef to 9f0cfe5 Compare May 31, 2026 10:34
@siemen11 siemen11 added the CI:Rerun Rerun failed CI jobs label May 31, 2026
@github-actions github-actions Bot removed the CI:Rerun Rerun failed CI jobs label May 31, 2026
@siemen11 siemen11 force-pushed the otbn_code_saving branch from 9f0cfe5 to 923590d Compare May 31, 2026 11:14

@johannheyszl johannheyszl left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am in favour due to the code size savings and not seeing downsides - thanks @siemen11 !

@johannheyszl johannheyszl requested review from cfrantz and moidx June 11, 2026 09:33
// Write to IMEM. Always starts at zero on the OTBN side.
uint32_t words = uncomp_len / (uint32_t)sizeof(uint32_t);
for (uint32_t i = 0; launder32(i) < words; i++) {
abs_mmio_write32(mmio_addr + i * sizeof(uint32_t), local_chunk_buf[i]);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

missing boundary check? (depends on whether the uncompressed size metadata is considered trusted or not)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added them, best to be on the safe side, thank you!

Comment thread sw/device/lib/crypto/drivers/otbn.c Outdated
Comment thread sw/device/lib/base/lz4.c Outdated
if (offset >= match_len) {
// Safe from overlap
if (match_len >= FAST_PATH_THRESH) {
memcpy(dst_ptr, match_ptr, match_len);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

note that memcpy (not memmove) with overlapped dst/src is UB in C. How much speedup does these fast paths provide?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

0.3% :)
So I simplified it back again, I got carried away optimizing it


while (src < app.imem_compressed_end) {
// Parse the 4-byte chunk header as unsigned 32-bit integers
uint32_t comp_len = (uint32_t)(src[0] | (src[1] << 8));

@sasdf sasdf Jun 17, 2026

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if src is considered trusted, we doesn't check if src has enough bytes for this 4 byte header.

Similar situation for reading comp_len bytes from src below, and also when decompressing dmem.

Maybe we can extract the decompress routine as a common helper to share between imem/dmem loader?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a check on the header size, but in general I made the logic closer to what we had before and implemented your suggestion of a decompress routine

siemen11 added 2 commits June 17, 2026 23:59
Add lz4 to the python dependencies.

Signed-off-by: Siemen Dhooghe <sdhooghe@google.com>
Add the lz4 decompression algorithm. This algorithm costs around 400
bytes.

Signed-off-by: Siemen Dhooghe <sdhooghe@google.com>
Use LZ4 to create a compressed OTBN binary (both imem and dmem). The OTBN build script
creates both the uncompressed and compressed binary relying on garbage
collection to filter out the unused binary.

Change the otbn.c and otbn.h in the cryptolib to use LZ4 from lib/base
in order to decompress the loaded imem and dmem from an application.

The cryptolib (lib/crypto) uses the compressed binaries, however, the applications from silicon_creator (using
silicon_creator/lib/drivers/otbn.c/h) still use the uncompressed version.

This option saves the cryptolib ~5400bytes while increasing ~400bytes to
load LZ4.c while providing backwards compatability for the boot.

Concerning the CRC: The CRC for the imem/dmem is calculated before
compression. For loading to OTBN, the OTBN is given the CRC of the
uncompressed payload before and checks itself the CRC of what it
receives. I.e., the CRC still provides end-2-end protection.

Signed-off-by: Siemen Dhooghe <sdhooghe@google.com>

@nasahlpa nasahlpa left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Siemen. I had a look into the code - LGMT!

@nasahlpa nasahlpa merged commit e735550 into lowRISC:earlgrey_1.0.0 Jun 19, 2026
34 of 35 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants