[盗火者计划] Task 5 - CALAS feat(sparse): CKKS sparse-packing bootstrap (CPU + GPU) #33
D4rkCrypto wants to merge 1 commit into
4130ccc to 0e611f4
def __init__(self, type=DataType.Ciphertext, id='', degree=1, level=DEFAULT_LEVEL) -> None:
    super().__init__(type, id, degree, level)
def __init__(
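slots (or log_slots) is an attribute of CkksParam. Once a computation task has set its homomorphic parameters (that is, g_param), CKKS data structures such as CkksCiphertextNode do not need to receive this attribute again through their constructors, and it does not need to be propagated through the homomorphic operators.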
context->set_poly_modulus_degree(n);
int slots = param_json["slots"].get<int>();
// Accept either `slots` (new) or `log_slots` (legacy fixtures) in JSON.
The frontend should standardize on "slots" here, so this kind of check is unnecessary.
for (int i = 0; i < p.size(); i++) {
    P.push_back(Data64(p[i]));
std::vector<Data64> Q, P;
The HEonGPU library already supports CKKS bootstrapping (including sparse-encoded ciphertexts). Please use the existing interface and keep the code simple.
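1. Frontend logic simplification: slots (or log_slots) should be an intrinsic attribute of CkksParam. In the computation graph, data nodes such as CkksCiphertextNode do not need to hold this attribute again, and unnecessary parameter propagation between homomorphic operators should be avoided, to keep the frontend simple.
2. GPU runtime interface alignment: the low-level GPU implementation already supports CKKS bootstrapping, including sparse encoding, through HEonGPU. To keep the architecture consistent, the existing interfaces should be reused directly rather than modified.
3. Next focus: the remaining gap is mainly the parameter-configuration mapping on the Client side. The effort should go into completing the relevant interface configuration so that the Client can correctly trigger HEonGPU's sparse bootstrapping logic and complete an end-to-end call of an AI model.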
0e611f4 to bcaf517
Squashed and force-pushed.
1. Frontend logic simplification:
User opts into sparse via
2. GPU runtime interface alignment:
Carries submodules: HEonGPU
3. Next focus (Client-side AI model E2E): out of scope for this PR; tracked separately.
Verification:
bcaf517 to d685826
Amended →
d685826 to c180e91
Amended →
c180e91 to 9c7b415
Amend ( The only branch that remains is the 4-arg vs 3-arg Net change in Issue 3 (Client-side AI E2E parameter mapping) is intentionally left for a follow-up PR.
9c7b415 to be4653c
Adds end-to-end sparse-packing CKKS bootstrap. The user opts into a sparse configuration with CkksBtpParam.create_sparse_param(log_slots) (or .create_toy_sparse_param for N=8192 dev runs), and the existing bootstrap() op in the DAG goes through:
- CPU: lattigo's native sparse path (LogSlots < LogN-1) via the new SetCkksParameterLogSlots SDK export. The ABI bridge encodes/decodes at the param's LogSlots, so sparse plaintexts pack/unpack correctly.
- GPU: HEonGPU's existing regular_bootstrapping_v2 once context.set_slot_count puts the encoder in gap_>1 mode. mega_ag_runners/gpu/gpu_wrapper.cu uses the static prime-chain helper ckks_sparse_bootstrap_chain_n16 (semi-public in HEonGPU; included explicitly here) to translate the frontend's dense CkksBtpParam into the sparse Q + level_starts that BootstrappingConfigV2's 4-arg ctor expects. This is a translator, not a duplicate code path: once the frontend learns to emit sparse-correct values directly, the branch collapses into the dense path.
Tests: test_sparse_bootstrap (CPU lattigo path) demonstrates the 1.32-1.59x speedup at log_slots in [8, 14] vs the dense LogSlots=LogN-1 baseline. test_gpu_ckks/cpu_ckks add CKKS smoke coverage. The example/ckks_sparse_bootstrap_cpu/ shows the C++ and Python entry points for users.
Carries submodule pointers: HEonGPU 6675bee (sparse-packing CKKS bootstrap), lattigo bb1b0bb (SetCkksParameterLogSlots + decoupled encode/decode).
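The relationship between LogN, LogSlots, and the encoder gap mentioned in the commit message can be sketched as below. This is an illustrative helper, not code from the PR: the function name `sparse_packing_layout` is hypothetical, and it assumes the standard CKKS convention that full packing uses N/2 slots and that the gap is the ratio of the full slot count to the requested one.

```python
def sparse_packing_layout(log_n: int, log_slots: int) -> dict:
    """Describe the slot layout for a CKKS ring of degree N = 2**log_n.

    Assumption: full packing uses N/2 slots, so the largest valid
    log_slots is log_n - 1, and anything smaller is "sparse".
    """
    max_log_slots = log_n - 1
    if not 0 <= log_slots <= max_log_slots:
        raise ValueError(
            f"log_slots must be in [0, {max_log_slots}], got {log_slots}"
        )
    slots = 1 << log_slots
    # gap = (N/2) / slots; gap == 1 means dense packing.
    gap = 1 << (max_log_slots - log_slots)
    return {"slots": slots, "gap": gap, "is_sparse": gap > 1}


if __name__ == "__main__":
    # N = 2^16 with log_slots = 11, one of the settings discussed in this PR.
    print(sparse_packing_layout(16, 11))
    # Dense packing at the toy size N = 2^13.
    print(sparse_packing_layout(13, 12))
```

Under these assumptions a gap greater than 1 corresponds to the `gap_>1` encoder mode described above, and `log_slots == log_n - 1` is the dense baseline.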
be4653c to 0cff25f
heongpu::BootstrappingConfigV2 boot_config(stc_config, eval_mod_config, cts_config);
heongpu::EncodingMatrixConfig cts_config(
feat(sparse): CKKS sparse-packing bootstrap (CPU + GPU)
Branch: feat/sparse-bootstrap → main
HEAD: d239ba4
Base: a784b50 (= upstream cipherflow-fhe/lattisense main)
Summary
End-to-end CKKS sparse-packing bootstrap support, dispatchable from a single frontend API and routed at the mega_ag.json level to either the CPU runner (Lattigo) or the GPU runner (HEonGPU sparse_bootstrapping_v2). The frontend adds an auto-slot-inference pass over the DAG so users who set used_slots hints get the right log_slots selected without manual parameter tuning.
The change is a single D4rkCrypto commit on top of upstream main, with companion PRs in two submodules (HEonGPU, lattigo) referenced via submodule-pin updates.
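As a rough illustration of the auto-slot-inference rule this description gives for the frontend (pick the smallest log_slots that covers max(used_slots), and never override a user-set log_slots), here is a minimal sketch. It is not the PR's `_infer_log_slots`: the function name `infer_log_slots`, its arguments, and the fallback to dense packing when no hints exist are all assumptions made for the example.

```python
from typing import Iterable, Optional


def infer_log_slots(
    used_slots: Iterable[int],
    user_log_slots: Optional[int] = None,
    max_log_slots: int = 15,  # assumption: N = 2^16, so dense is log_slots = 15
) -> int:
    """Pick the smallest log_slots with 2**log_slots >= max(used_slots)."""
    # A user-set value always wins.
    if user_log_slots is not None:
        return user_log_slots

    needed = max(used_slots, default=0)
    if needed <= 0:
        # Assumption: with no hints at all, fall back to dense packing.
        return max_log_slots

    # Smallest k such that 2**k >= needed.
    log_slots = (needed - 1).bit_length()
    if log_slots > max_log_slots:
        raise ValueError(
            f"used_slots={needed} does not fit in 2**{max_log_slots} slots"
        )
    return log_slots


if __name__ == "__main__":
    print(infer_log_slots([100, 256]))                 # hints from two bootstrap outputs
    print(infer_log_slots([100], user_log_slots=12))   # user choice is kept
```

Taking the inputs as arguments and returning a value, instead of writing into a shared parameter object, keeps the sketch free of side effects across tasks.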
Components changed
Submodule pins
- backends/HEonGPU → 2d9a0c8 (sparse_bootstrapping_v2 + reduced-dim design)
- fhe_ops_lib/lattigo → 6a17cfa (sparse trace + CkksBootstrap)
CPU runner (Lattigo bridge)
- init_empty_context branches on param_json.contains("log_slots") → dispatches CkksBtpParameter::create_sparse_parameter (or create_toy_sparse_parameter for N=2^13).
- examples/ckks_sparse_bootstrap_cpu/ shows the end-to-end flow.
GPU runner (HEonGPU bridge)
- mega_ag_runners/gpu/gpu_wrapper.cu: in the sparse path, calls context->set_slot_count(1 << log_slots) before set_coeff_modulus_values so the encoder produces sparse-NTT'd plaintexts and regular_bootstrapping_v2 takes the doubled-mode CtS fuse path (single EvalMod, ~14 ms saved per bootstrap).
- mega_ag_executors_gpu dispatches GPU sparse bootstrap via HEArithmeticOperator::sparse_bootstrapping_v2.
Frontend (auto-slot inference)
- frontend/custom_task.py: ct_pt_mult_accumulate_slice now propagates used_slots in the dot-product reduction (matches the add_ct_slice variant). Without this, sparsity hints were lost in ct-pt dot products.
- _infer_log_slots: graph scan picks the smallest log_slots that covers max(used_slots) across all bootstrap outputs. User-set log_slots is never overridden.
- bootstrap() adds positive Trace rotations to the Galois set when is_sparse() (HEonGPU's sparse Trace projection step needs 2^i rotations for i ∈ [log_slots, logN-2]).
Build (CUDA 13 compatibility)
- cmake/patches/{gpu-ntt-cstdint, gpu-ntt-uint128-shift, heongpu-findthrust}-cuda13.patch carry forward fixes that GPU-NTT and HEonGPU upstreams have not released yet. Applied at configure time by cmake/ApplySubmodulePatches.cmake so the submodule SHAs stay clean.
Tests added
- unittests/test_sparse_bootstrap.cpp:
  - [sparse][bootstrap]: correctness + speedup vs full packing
  - [.sparse][.bootstrap][.benchmark]: log_slots sweep (gated)
  - [sparse][bootstrap][integration]: JSON-pipeline dispatch
- unittests/test_auto_slot_inference.py: covers the dot-product family and seal_advanced_rotate_cols saturation behavior.
Environment Setup
Dependencies
System packages (Ubuntu names):
Python (frontend, in a venv):
Submodules are managed via the standard git workflow:
Execution Steps
CPU-only build with tests
    cmake -B build -DLATTISENSE_BUILD_TESTS=ON
    cmake --build build -j$(nproc)
GPU build with tests (Blackwell / sm_120)
    cmake -B build \
        -DLATTISENSE_BUILD_TESTS=ON \
        -DLATTISENSE_ENABLE_GPU=ON \
        -DLATTISENSE_CUDA_ARCH=120
    cmake --build build -j$(nproc)
For other GPUs, replace 120 with your sm_* value (e.g. 90 for H100, 89 for L40S, 80 for A100).
Run the CPU sparse bootstrap test (correctness + speedup vs dense)
    build/unittests/test_sparse_bootstrap '[sparse]'
Run the CPU sparse log_slots sweep (the headline benchmark)
    build/unittests/test_sparse_bootstrap '[.benchmark]'
This sweep is [.]-gated so it only runs when explicitly requested. Output is a single sweep table to stdout, captured here as bench_cpu_2026-04-30.txt.
Run the auto-slot-inference unit tests (Python frontend)
    cd unittests
    python -m pytest test_auto_slot_inference.py -v
Run the GPU sparse bootstrap example end-to-end
(Run from this repo's HEonGPU submodule build, or from a separately built HEonGPU tree; the binary takes log_slots as argv[1].)
Run the end-to-end CPU sparse bootstrap example (Python → JSON → C++ runner)
The Python step emits mega_ag.json with log_slots = 8 set; the C++ step loads it and runs the sparse path through Lattigo.
Results
Reference hardware
CPU sweep — Lattigo bridge, toy N=2^13
Reproducible via test_sparse_bootstrap '[.benchmark]':
Saturation at log_slots ≈ 5, peak ~1.59× speedup. Curve is monotonic: there is no GPU-style cliff because Lattigo runs everything on CPU.
GPU sweep — HEonGPU sparse_bootstrapping_v2, production N=2^16
Reproducible via the HEonGPU example 6 binary. Dense baseline from 5_ckks_regular_bootstrapping_v2:
Headline: log_slots=11 gives the best speedup (1.59×). It is the smallest log_slots that still avoids the GPU-FFT-vs-CPU-FFT-fallback cliff at log_slots ∈ [8, 10].
Why log_slots ∈ [8, 10] is slower than [11, 14] on GPU
GPU Special_FFT rejects n_power ≤ 10. For log_slots ∈ [8, 10] the HEonGPU PR's sparse_fft_util.cuh falls back to CPU butterflies via cudaMemcpy D→H + compute + cudaMemcpy H→D. Per-call overhead is ~15–20 ms, which exceeds the matrix-mul savings at these sizes. Trade-off accepted: upstream throws here; we offer a slow-but-correct path. See HEonGPU PR.
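The GPU regimes described in this Results section can be summarized as a small lookup, sketched below for N=2^16. It only restates the ranges given in this description ([4, 7] not supported on GPU, [8, 10] CPU-FFT fallback, [11, 14] native GPU FFT). The function name `gpu_sparse_fft_path` and the returned labels are hypothetical, and the sketch assumes the FFT's n_power equals log_slots.

```python
def gpu_sparse_fft_path(log_slots: int) -> str:
    """Classify a sparse log_slots value for the GPU runner at N = 2^16.

    Assumption: Special_FFT's n_power is the same number as log_slots.
    """
    if not 4 <= log_slots <= 14:
        raise ValueError(f"log_slots={log_slots} is outside the sparse range [4, 14]")
    if log_slots <= 7:
        # Kernel grid formulas assume slot_count >= 256; use the CPU runner.
        return "unsupported_on_gpu"
    if log_slots <= 10:
        # Special_FFT rejects n_power <= 10: host butterflies plus two copies.
        return "cpu_fft_fallback"
    return "gpu_fft"


if __name__ == "__main__":
    for k in (6, 8, 10, 11, 14):
        print(k, gpu_sparse_fft_path(k))
```

A caller could use such a check to warn when a requested log_slots lands in the slow-but-correct fallback range.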
Known limitation: GPU log_slots ∈ [4, 7]
HEonGPU's kernel grid formulas assume slot_count ≥ 256. Tracked but deferred: practical workloads on N=2^16 use log_slots ≥ 8. Users who genuinely need extreme sparsity get it via the CPU runner (Lattigo handles log_slots ∈ [4, 14] end-to-end).
Self-composability gap (open, gated)
A preventive test catches bootstrap → drop_level(9) → bootstrap on a sparse ciphertext giving ~−3 bits precision:
    build/unittests/test_cpu_ckks '[.composability]'
Real workloads always interleave arithmetic between bootstraps, so this is preventive coverage rather than a blocker. Documented as a known limitation; the gated test serves as a tracked reproducer for if/when a fix lands.
Notes for reviewers
- ApplySubmodulePatches.cmake runs at configure time and is idempotent: re-running configure on an already-patched submodule is a no-op. This keeps the submodule SHAs clean (no in-place commits in the submodule worktrees) while still letting us carry CUDA 13 fixes that GPU-NTT and HEonGPU haven't released upstream yet.
- g_param.log_slots. An earlier design did, which broke the case where a single param object was reused across tasks with different sparsity hints. Tests in test_auto_slot_inference.py::TestEdgeCaseHardening pin this down explicitly.
- examples/ckks_sparse_bootstrap_cpu/ is the recommended starting point for any user who wants to add sparse bootstrap to their workload. It's the smallest end-to-end pipeline that exercises the JSON dispatch path.