[OTBN] Integrate compression option for imem and dmem#30163
Conversation
|
interesting idea. kindly, what is the code size overhead of the addition. I'd assume significantly smaller than the saving. |
We require an additional ~400 bytes for the decompression algorithm but save ~5400 bytes |
cd68be9 to
506e010
Compare
506e010 to
0f5b321
Compare
|
Some quick overview for the choice of compression algorithms provided by Gemini:
Currently, LZ4 was chosen since it saves already quite a bit and it would be the least invasive in speed or RAM usage |
673988c to
7a55643
Compare
| attrs==25.3.0 \ | ||
| --hash=sha256:427318ce031701fea540783410126f03899a97ffc6f61596ad581ac2e40e3bc3 \ | ||
| --hash=sha256:75d7cefc7fb576747b2c81b4442d4d4a1ce0900973527c011d1030fd3bf4af1b | ||
| argcomplete==3.6.3 \ |
There was a problem hiding this comment.
Why is the update needed?
There was a problem hiding this comment.
Unsure... hence why I left it as a completely separate commit, but using the exact right version of python (3.9) and uv, I got differences when remaking python-requirements :s
| */ | ||
| status_t read_otbn_load_checksum(uint32_t *checksum) { | ||
| TRY(dif_otbn_get_load_checksum(&otbn, checksum)); | ||
| *checksum = abs_mmio_read32(TOP_EARLGREY_OTBN_BASE_ADDR + |
There was a problem hiding this comment.
Can you please provide more details why we shouldn't use DIF functions in the pentest framework?
There was a problem hiding this comment.
Actually you were completely correct, this was not needed... I removed the commit entirely. Should be cleaner now, thanks 🙏
| uint32_t uncomp_len = (uint32_t)(src[2] | (src[3] << 8)); | ||
| src += 4; | ||
|
|
||
| if (LZ4_decompress((const char *)src, (char *)local_chunk_buf, |
There was a problem hiding this comment.
How much runtime overhead adds the decompression and do we actually care about the runtime penalty? I assume we anyways only load the cryptolib once?
There was a problem hiding this comment.
I think the runtime penatly applies upon every load of an OTBN binary.
Would be good to understand time overhead - agree. Suppose significantly lower than asymm crypto exec.
There was a problem hiding this comment.
So running ecdsa_p256_functest_fpga_cw340_sival_rom_ext for example with this commit gives:
Successfully finished test sign_kat in 1693294 cycles or 70553 us @ 24 MHz.
This is some fluff, a sha2, and a otcrypto_ecdsa_p256_sign_config_k
Before we had
Successfully finished test sign_kat in 1598898 cycles or 66620 us @ 24 MHz.
So looks like 1 ms of time, i.e., 94396 cycles or around 6% for p256.
For ed25519 for example I see
Successfully finished test sign_verify_test in 1600837 cycles or 66701 us @ 24 MHz.
Versus before
Successfully finished test sign_verify_test in 1419775 cycles or 59157 us @ 24 MHz.
which is a 181062 cycles or 1.8 ms or a 12.7% overhead, this incudes both a sign and a verify.
And then for p384
Successfully finished test sign_kat in 3558983 cycles or 148290 us @ 24 MHz.
versus before
Successfully finished test sign_kat in 3438653 cycles or 143277 us @ 24 MHz.
So that is 120330 cycles or 1.2ms or 3.5% overhead in latency.
We always clean and load the OTBN binary, we never re-use the binary between executions. Hence the addition of ~1ms is on every operation.
There was a problem hiding this comment.
TL;DR: We have a rough 25% in flash savings and a 10% in latency overhead for LZ4
There was a problem hiding this comment.
Did some more experiments and optimizations. We have around a 20-24% in flash savings of OTBN binaries.
The latency increase is 4-5% for p256/p384 (lower for RSA), but 14% for 25519. This is because 25519 is simply just fast...
Hence, we have 20% flash improvement and 4% latency increase
| "//sw/device/lib/base:status", | ||
| "//sw/device/lib/dif:otbn", | ||
| "//sw/otbn/crypto:run_rsa", | ||
| "//sw/otbn/crypto:test_run_rsa", |
There was a problem hiding this comment.
We now have two "types" of OTBN binaries: compressed and non-compressed.
The cryptolib will move to using compresssed binaries (run_rsa). But the otbn.c from dif, testutils, or silicon_creator will load applications as if uncompressed whereas otbn drivers loads them compressed (so decompresses first). Hence, we need to distinguish which one we are using
There was a problem hiding this comment.
I guess it's possible to let otbn_build emit both compressed and non-compressed blob into the same archive (either put into different sub-sections, or put multiple objects to a archive), and relies on linker gc to keep the one used.
There was a problem hiding this comment.
It was a bit scary to do because if we mess something up you have both binaries in there, but no, works pretty well and removes the ugly Bazel targets, thank you!
| "libclang==16.0.0", | ||
| "libcst == 1.1.0", # 1.2.0+ needs Python 3.9+ | ||
| "lxml ~= 5.0", | ||
| "lz4 == 4.4.5", |
There was a problem hiding this comment.
@luismarques do we have a policy regarding including new python packages? Do you know about license?
There was a problem hiding this comment.
fbcf2b4 to
5953d59
Compare
|
@siemen11 , you can also consider do this for the KATS (as discussed offline), provided the entire decompression algorithm is included inside the cryptolib |
Indeed super nice idea! Though sadly trying it out since the vectors are full entropy, LZ4 seems to increase the size 😆 was worth the try! |
a83f2ef to
9f0cfe5
Compare
9f0cfe5 to
923590d
Compare
johannheyszl
left a comment
There was a problem hiding this comment.
I am in favour due to the code size savings and not seeing downsides - thanks @siemen11 !
| // Write to IMEM. Always starts at zero on the OTBN side. | ||
| uint32_t words = uncomp_len / (uint32_t)sizeof(uint32_t); | ||
| for (uint32_t i = 0; launder32(i) < words; i++) { | ||
| abs_mmio_write32(mmio_addr + i * sizeof(uint32_t), local_chunk_buf[i]); |
There was a problem hiding this comment.
missing boundary check? (depends on whether the uncompressed size metadata is considered trusted or not)
There was a problem hiding this comment.
Added them, best to be on the safe side, thank you!
| if (offset >= match_len) { | ||
| // Safe from overlap | ||
| if (match_len >= FAST_PATH_THRESH) { | ||
| memcpy(dst_ptr, match_ptr, match_len); |
There was a problem hiding this comment.
note that memcpy (not memmove) with overlapped dst/src is UB in C. How much speedup does these fast paths provide?
There was a problem hiding this comment.
0.3% :)
So I simplified it back again, I got carried away optimizing it
923590d to
0a17be0
Compare
|
|
||
| while (src < app.imem_compressed_end) { | ||
| // Parse the 4-byte chunk header as unsigned 32-bit integers | ||
| uint32_t comp_len = (uint32_t)(src[0] | (src[1] << 8)); |
There was a problem hiding this comment.
Not sure if src is considered trusted, we doesn't check if src has enough bytes for this 4 byte header.
Similar situation for reading comp_len bytes from src below, and also when decompressing dmem.
Maybe we can extract the decompress routine as a common helper to share between imem/dmem loader?
There was a problem hiding this comment.
Added a check on the header size, but in general I made the logic closer to what we had before and implemented your suggestion of a decompress routine
0a17be0 to
a21369a
Compare
Add lz4 to the python dependencies. Signed-off-by: Siemen Dhooghe <sdhooghe@google.com>
Add the lz4 decompression algorithm. This algorithm costs around 400 bytes. Signed-off-by: Siemen Dhooghe <sdhooghe@google.com>
a21369a to
cea92c4
Compare
Use LZ4 to create a compressed OTBN binary (both imem and dmem). The OTBN build script creates both the uncompressed and compressed binary relying on garbage collection to filter out the unused binary. Change the otbn.c and otbn.h in the cryptolib to use LZ4 from lib/base in order to decompress the loaded imem and dmem from an application. The cryptolib (lib/crypto) uses the compressed binaries, however, the applications from silicon_creator (using silicon_creator/lib/drivers/otbn.c/h) still use the uncompressed version. This option saves the cryptolib ~5400bytes while increasing ~400bytes to load LZ4.c while providing backwards compatability for the boot. Concerning the CRC: The CRC for the imem/dmem is calculated before compression. For loading to OTBN, the OTBN is given the CRC of the uncompressed payload before and checks itself the CRC of what it receives. I.e., the CRC still provides end-2-end protection. Signed-off-by: Siemen Dhooghe <sdhooghe@google.com>
cea92c4 to
14210ed
Compare
nasahlpa
left a comment
There was a problem hiding this comment.
Thanks Siemen. I had a look into the code - LGMT!
Use LZ4 to add a compression option in otbn_binary (activated by setting
compress = True). If this is set, the imem and dmem of the otbn
application are compressed before they are stored.
By default, binaries are not compressed.
Change the otbn.c and otbn.h in the cryptolib to use LZ4 from lib/base
in order to decompress the loaded imem and dmem from an application.
The compression option is therefore mandatory in the cryptolib
(lib/crypto), however, the applications from silicon_creator (using
silicon_creator/lib/drivers/otbn.c/h) still use the uncompressed version
(and will not work with the compressed version).
Concerning the CRC: The CRC for the imem/dmem is calculated before
compression. For loading to OTBN, the OTBN is given the CRC of the
uncompressed payload before and checks itself the CRC of what it
receives. I.e., the CRC still provides end-2-end protection.