[盗火者计划 (Fire Stealer Program)] Task 5 - CALAS feat(sparse): CKKS sparse-packing bootstrap (CPU + GPU) #33

Open
D4rkCrypto wants to merge 1 commit into cipherflow-fhe:main from CityUHK-CALAS:feat/sparse-bootstrap

@D4rkCrypto

feat(sparse): CKKS sparse-packing bootstrap (CPU + GPU)

Branch: feat/sparse-bootstrap → main
HEAD: d239ba4
Base: a784b50 (= upstream cipherflow-fhe/lattisense main)

Summary

End-to-end CKKS sparse-packing bootstrap support, dispatchable from a single
frontend API and routed at the mega_ag.json level to either the CPU runner
(Lattigo) or the GPU runner (HEonGPU sparse_bootstrapping_v2). Frontend
adds an auto-slot-inference pass over the DAG so users who set
used_slots hints get the right log_slots selected without manual
parameter tuning.

The change is a single D4rkCrypto commit on top of upstream main, with
companion PRs in two submodules (HEonGPU, lattigo) referenced via
submodule-pin updates.

Components changed

Submodule pins

  • backends/HEonGPU → 2d9a0c8 (sparse_bootstrapping_v2 + reduced-dim design)
  • fhe_ops_lib/lattigo → 6a17cfa (sparse trace + CkksBootstrap)

CPU runner (Lattigo bridge)

  • init_empty_context branches on param_json.contains("log_slots") and,
    when the key is present, dispatches to
    CkksBtpParameter::create_sparse_parameter (or
    create_toy_sparse_parameter for N=2^13).
  • New examples/ckks_sparse_bootstrap_cpu/ shows the end-to-end flow.
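
The dispatch rule in the first bullet can be sketched in Python as follows. This is an illustration only: the real logic lives in the C++ init_empty_context, the JSON key "n" for the ring degree is an assumed field name, and the "dense" return value simply stands for whatever the runner did before this PR.

```python
TOY_N = 1 << 13  # N = 2^13, the toy ring degree named in the PR


def pick_parameter_factory(param_json: dict) -> str:
    """Return the name of the factory the CPU runner would call.

    "n" is a hypothetical key for the ring degree; only "log_slots"
    is named in the PR description.
    """
    if "log_slots" not in param_json:
        # No sparsity hint: stay on the pre-existing dense path.
        return "dense"
    if param_json["n"] == TOY_N:
        return "create_toy_sparse_parameter"
    return "create_sparse_parameter"
```

Keying the branch on the presence of "log_slots" means existing dense fixtures keep working unchanged.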

GPU runner (HEonGPU bridge)

  • mega_ag_runners/gpu/gpu_wrapper.cu: in the sparse path, calls
    context->set_slot_count(1 << log_slots) before
    set_coeff_modulus_values so the encoder produces sparse-NTT'd
    plaintexts and regular_bootstrapping_v2 takes the doubled-mode CtS
    fuse path (single EvalMod, ~14 ms saved per bootstrap).
  • mega_ag_executors_gpu dispatches GPU sparse bootstrap via
    HEArithmeticOperator::sparse_bootstrapping_v2.

Frontend (auto-slot inference)

  • frontend/custom_task.py:
    • ct_pt_mult_accumulate_slice now propagates used_slots in the
      dot-product reduction (matches the add_ct_slice variant). Without
      this, sparsity hints were lost in ct-pt dot products.
    • _infer_log_slots: graph scan picks the smallest log_slots that
      covers max(used_slots) across all bootstrap outputs. User-set
      log_slots is never overridden.
  • bootstrap() adds positive Trace rotations to the Galois set when
    is_sparse() (HEonGPU's sparse Trace projection step needs
    2^i rotations for i ∈ [log_slots, logN-2]).
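
A minimal sketch of the two rules above, assuming used_slots hints are plain integers. The function names and signatures are illustrative and are not the ones in frontend/custom_task.py.

```python
from typing import Iterable, List, Optional


def infer_log_slots(used_slots: Iterable[int], max_log_slots: int,
                    user_log_slots: Optional[int] = None) -> int:
    """Smallest log_slots with 2**log_slots >= max(used_slots)."""
    if user_log_slots is not None:
        # A user-set value is never overridden.
        return user_log_slots
    hints = list(used_slots)
    if not hints:
        # Assumption: with no hints, fall back to dense packing.
        return max_log_slots
    needed = max(hints)
    log_slots = max(needed - 1, 0).bit_length()  # ceil(log2(needed))
    return min(log_slots, max_log_slots)


def sparse_trace_rotations(log_slots: int, log_n: int) -> List[int]:
    """Positive rotations 2**i for i in [log_slots, log_n - 2]."""
    return [1 << i for i in range(log_slots, log_n - 1)]
```

For N=2^16 and log_slots=11 the second helper yields four rotations (2^11 through 2^14); in the dense case (log_slots = logN-1) it yields none.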

Build (CUDA 13 compatibility)

  • cmake/patches/{gpu-ntt-cstdint, gpu-ntt-uint128-shift, heongpu-findthrust}-cuda13.patch
    carry forward fixes that GPU-NTT and HEonGPU upstreams have not
    released yet. Applied at configure time by
    cmake/ApplySubmodulePatches.cmake so the submodule SHAs stay clean.

Tests added

  • unittests/test_sparse_bootstrap.cpp:
    • [sparse][bootstrap] — correctness + speedup vs full packing
    • [.sparse][.bootstrap][.benchmark] — log_slots sweep (gated)
    • [sparse][bootstrap][integration] — JSON-pipeline dispatch
  • unittests/test_auto_slot_inference.py — covers the dot-product family
    and seal_advanced_rotate_cols saturation behavior.

Environment Setup

| Component | Version / Notes |
| --- | --- |
| OS | Linux x86_64 (Ubuntu 24, kernel 6.17 tested) |
| Compiler | gcc 14 / clang 19, C++20 |
| CMake | 3.20+ (3.28 tested) |
| CUDA (GPU build) | 13.x (the patches target CUDA 13 specifically) |
| GPU (GPU build) | NVIDIA Blackwell (RTX 5090, sm_120) for reference numbers; any sm_70+ should work |
| Python | 3.10+ for the frontend |
| Go | 1.21+ for the lattigo cgo build |

Dependencies

System packages (Ubuntu names):

build-essential cmake ninja-build pkg-config
libgmp-dev libntl-dev libgsl-dev
golang-go                # for lattigo cgo bindings
nvidia-cuda-toolkit-13   # for GPU build only

Python (frontend, in a venv):

pip install -r requirements.txt   # nlohmann-json, pybind, etc.

Submodules are managed via the standard git workflow:

git submodule update --init --recursive

Execution Steps

CPU-only build with tests

cmake -B build -DLATTISENSE_BUILD_TESTS=ON
cmake --build build -j$(nproc)

GPU build with tests (Blackwell / sm_120)

cmake -B build \
  -DLATTISENSE_BUILD_TESTS=ON \
  -DLATTISENSE_ENABLE_GPU=ON \
  -DLATTISENSE_CUDA_ARCH=120
cmake --build build -j$(nproc)

For other GPUs, replace 120 with your sm_* value (e.g. 90 for
H100, 89 for L40S, 80 for A100).
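
As a convenience, the mapping from GPU model to architecture flag can be kept in a small helper. This is a sketch, not part of the PR: the helper name is made up, the compute-capability values are NVIDIA's published ones, and the CMake options are the ones shown in the commands above.

```python
from typing import List

# Compute capability (sm_*) per GPU model.
CUDA_ARCH = {"RTX 5090": 120, "H100": 90, "L40S": 89, "A100": 80}


def cmake_configure_args(gpu: str, build_dir: str = "build") -> List[str]:
    """Build the configure command line for a GPU build with tests."""
    arch = CUDA_ARCH[gpu]  # raises KeyError for models not listed here
    return [
        "cmake", "-B", build_dir,
        "-DLATTISENSE_BUILD_TESTS=ON",
        "-DLATTISENSE_ENABLE_GPU=ON",
        f"-DLATTISENSE_CUDA_ARCH={arch}",
    ]
```

The returned list can be passed directly to subprocess.run.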

Run the CPU sparse bootstrap test (correctness + speedup vs dense)

build/unittests/test_sparse_bootstrap '[sparse]'

Run the CPU sparse log_slots sweep (the headline benchmark)

build/unittests/test_sparse_bootstrap '[.benchmark]'

This sweep is [.]-gated so it only runs when explicitly requested. Output
is a single sweep table to stdout, captured here as
bench_cpu_2026-04-30.txt.

Run the auto-slot-inference unit tests (Python frontend)

cd unittests
python -m pytest test_auto_slot_inference.py -v

Run the GPU sparse bootstrap example end-to-end

backends/HEonGPU/build/bin/examples/bootstrapping/6_ckks_sparse_bootstrapping_v2 11

(Run from this repo's HEonGPU submodule build, or from a separately built
HEonGPU tree; the binary takes log_slots as argv[1].)

Run the end-to-end CPU sparse bootstrap example (Python → JSON → C++ runner)

python examples/ckks_sparse_bootstrap_cpu/ckks_sparse_bootstrap_cpu.py
build/examples/ckks_sparse_bootstrap_cpu/ckks_sparse_bootstrap_cpu

The Python step emits mega_ag.json with log_slots = 8 set; the C++ step
loads it and runs the sparse path through Lattigo.

Results

Reference hardware

  • CPU sweeps: AMD Ryzen 9 7950X (single-threaded Lattigo bootstrap)
  • GPU sweeps: NVIDIA RTX 5090 (Blackwell, sm_120, CUDA 13)

CPU sweep — Lattigo bridge, toy N=2^13

Reproducible via test_sparse_bootstrap '[.benchmark]':

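| log_slots | Active slots | Bootstrap time | Max err | Speedup |
| --- | --- | --- | --- | --- |
| 4 | 16 | 570 ms | 3.22e-08 | 1.59× |
| 5 | 32 | 577 ms | 2.79e-08 | 1.57× |
| 6 | 64 | 614 ms | 1.72e-08 | 1.47× |
| 7 | 128 | 628 ms | 1.67e-08 | 1.44× |
| 8 | 256 | 648 ms | 1.28e-08 | 1.39× |
| 9 | 512 | 685 ms | 8.50e-09 | 1.32× |
| 10 | 1024 | 683 ms | 9.56e-09 | 1.32× |
| 11 | 2048 | 701 ms | 1.45e-08 | 1.29× |
| dense (12) | 4096 | 904 ms | 2.88e-08 | 1.00× |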

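The speedup saturates by log_slots ≈ 5 and peaks at about 1.59× (log_slots = 4).
Bootstrap time grows almost monotonically with log_slots (the 9 → 10 step is
flat to within 2 ms), with no GPU-style cliff because Lattigo runs everything
on the CPU.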
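
The Speedup column is the dense bootstrap time divided by the sparse bootstrap time. The short script below recomputes it from the timings in the table; the variable names are illustrative.

```python
DENSE_MS = 904  # dense baseline (log_slots = 12) from the table

# Sparse bootstrap time in ms, keyed by log_slots, copied from the table.
SPARSE_MS = {4: 570, 5: 577, 6: 614, 7: 628, 8: 648, 9: 685, 10: 683, 11: 701}


def speedup(log_slots: int) -> float:
    """Speedup of the sparse bootstrap over the dense baseline."""
    return DENSE_MS / SPARSE_MS[log_slots]


if __name__ == "__main__":
    for k in sorted(SPARSE_MS):
        print(f"log_slots={k:2d}  speedup={speedup(k):.3f}x")
```

The recomputed ratios agree with the table to within rounding of the last digit.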

GPU sweep — HEonGPU sparse_bootstrapping_v2, production N=2^16

Reproducible via the HEonGPU example 6 binary. Dense baseline from
5_ckks_regular_bootstrapping_v2:

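| log_slots | Active slots | Bootstrap time | Avg. precision | Speedup |
| --- | --- | --- | --- | --- |
| 8 | 256 | 72.6 ms | +20.22 | 1.23× |
| 9 | 512 | 75.0 ms | +20.10 | 1.20× |
| 10 | 1024 | 77.6 ms | +20.04 | 1.15× |
| 11 | 2048 | 56.2 ms | +19.81 | 1.59× |
| 12 | 4096 | 59.4 ms | +19.43 | 1.51× |
| 13 | 8192 | 63.9 ms | +18.98 | 1.40× |
| 14 | 16384 | 70.7 ms | +18.51 | 1.27× |
| 15 (dense) | 32768 | 89.6 ms | +17.94 | 1.00× |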

Headline: log_slots=11 gives the best speedup (1.59×). It is the smallest
log_slots that avoids the CPU-FFT fallback cliff affecting log_slots ∈ [8, 10].
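
One practical consequence: on this hardware, a workload that needs only 256 slots still bootstraps faster at log_slots=11 than at log_slots=8. The sketch below picks the fastest measured setting that still holds the required number of slots. It uses only the timings from the table, and the helper name is made up for illustration.

```python
# GPU bootstrap time in ms, keyed by log_slots, copied from the table.
GPU_MS = {8: 72.6, 9: 75.0, 10: 77.6, 11: 56.2,
          12: 59.4, 13: 63.9, 14: 70.7, 15: 89.6}


def fastest_log_slots(required_slots: int) -> int:
    """Fastest measured log_slots whose slot count covers the requirement."""
    candidates = {k: ms for k, ms in GPU_MS.items()
                  if (1 << k) >= required_slots}
    if not candidates:
        raise ValueError(f"no measured setting holds {required_slots} slots")
    return min(candidates, key=candidates.get)
```

These numbers are specific to the RTX 5090 reference run, so the choice should be re-measured on other GPUs.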

Why log_slots ∈ [8, 10] is slower than [11, 14] on GPU

GPU Special_FFT rejects n_power ≤ 10. For log_slots ∈ [8, 10] the
HEonGPU PR's sparse_fft_util.cuh falls back to CPU butterflies via
cudaMemcpy D→H + compute + cudaMemcpy H→D. This adds roughly 15–20 ms per
call, which exceeds the matrix-mul savings at these sizes. We accept the
trade-off: upstream throws here, whereas this PR offers a slow-but-correct
path. See the HEonGPU PR.

Known limitation: GPU log_slots ∈ [4, 7]

$ ./6_ckks_sparse_bootstrapping_v2 4
terminate called after throwing an instance of 'heongpu::CudaException'
  what():  CUDA Error in encoder.cu at line 432: invalid argument

HEonGPU's kernel grid formulas assume slot_count ≥ 256. Tracked but
deferred — practical workloads on N=2^16 use log_slots ≥ 8. Users who
genuinely need extreme sparsity get it via the CPU runner (Lattigo handles
log_slots ∈ [4, 14] end-to-end).
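
The support boundaries described above for production N=2^16 can be summarized as a small routing sketch. The function names and return labels are illustrative only; the thresholds are the ones stated in this description.

```python
def gpu_sparse_path(log_slots: int) -> str:
    """Classify how the GPU runner would handle a given log_slots."""
    if log_slots < 8:
        # Kernel grid formulas assume slot_count >= 256.
        return "unsupported"
    if log_slots <= 10:
        # Special_FFT rejects n_power <= 10, so CPU butterflies are used.
        return "cpu_fft_fallback"
    return "gpu_fft"


def choose_runner(log_slots: int, prefer_gpu: bool = True) -> str:
    """Pick a runner, falling back to CPU when the GPU cannot serve."""
    if prefer_gpu and gpu_sparse_path(log_slots) != "unsupported":
        return "gpu"
    return "cpu"
```

For example, log_slots=4 routes to the CPU runner even when the GPU is preferred.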

Self-composability gap (open, gated)

A preventive test captures a known precision failure: running
bootstrap → drop_level(9) → bootstrap on a sparse ciphertext yields about
−3 bits of precision:

build/unittests/test_cpu_ckks '[.composability]'

Real workloads always interleave arithmetic between bootstraps, so this is
preventive coverage rather than a blocker. Documented as a known
limitation; the gated test serves as a tracked reproducer for if/when a
fix lands.

Notes for reviewers

  • The CMake ApplySubmodulePatches.cmake runs at configure time and is
    idempotent — re-running configure on an already-patched submodule is a
    no-op. This keeps the submodule SHAs clean (no in-place commits in the
    submodule worktrees) while still letting us carry CUDA 13 fixes that
    GPU-NTT and HEonGPU haven't released upstream yet.
  • The auto-slot-inference pass never mutates g_param.log_slots. An
    earlier design did, which broke the case where a single param object was
    reused across tasks with different sparsity hints. Tests in
    test_auto_slot_inference.py::TestEdgeCaseHardening pin this down
    explicitly.
  • The new examples/ckks_sparse_bootstrap_cpu/ is the recommended starting
    point for any user who wants to add sparse bootstrap to their workload —
    it's the smallest end-to-end pipeline that exercises the JSON dispatch
    path.
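
The non-mutation guarantee in the second note can be illustrated with a frozen parameter object and a pure inference function. Param and effective_log_slots are hypothetical stand-ins, not the frontend's actual classes.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Param:
    """Hypothetical stand-in for g_param; frozen so it cannot be mutated."""
    log_n: int
    log_slots: Optional[int] = None  # None means "not set by the user"


def effective_log_slots(param: Param, hinted_slots: int) -> int:
    """Return the log_slots for one task without touching param."""
    if param.log_slots is not None:
        return param.log_slots          # the user's choice wins
    dense = param.log_n - 1             # dense packing uses N/2 slots
    needed = max(hinted_slots - 1, 0).bit_length()
    return min(needed, dense)
```

Because the result is returned rather than stored, one Param instance can be reused across tasks with different sparsity hints and each task gets its own value.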

D4rkCrypto force-pushed the feat/sparse-bootstrap branch 2 times, most recently from 4130ccc to 0e611f4 (May 6, 2026 13:29)
Comment thread frontend/custom_task.py Outdated

def __init__(self, type=DataType.Ciphertext, id='', degree=1, level=DEFAULT_LEVEL) -> None:
super().__init__(type, id, degree, level)
def __init__(
Collaborator

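slots (or log_slots) is an attribute of CkksParam. Once a computation task has set its homomorphic parameters (i.e. g_param), CKKS data structures such as CkksCiphertextNode do not need to receive this attribute again through their constructors, and it does not need to be propagated through the homomorphic operators either.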

Comment thread mega_ag_runners/gpu/gpu_wrapper.cu Outdated
context->set_poly_modulus_degree(n);

int slots = param_json["slots"].get<int>();
// Accept either `slots` (new) or `log_slots` (legacy fixtures) in JSON.
Collaborator

The frontend should standardize on "slots" here, so this kind of check is unnecessary.


for (int i = 0; i < p.size(); i++) {
P.push_back(Data64(p[i]));
std::vector<Data64> Q, P;
Collaborator

The HEonGPU library already supports CKKS bootstrapping (including sparse-encoded ciphertexts). Please use the existing interface and keep the code simple.

@Yanbin-Li-Oct
Collaborator

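1. Frontend simplification: slots (or log_slots) should be an intrinsic attribute of CkksParam. In the computation graph, data nodes such as CkksCiphertextNode do not need to hold this attribute again, and the homomorphic operators should avoid propagating it unnecessarily, so that the frontend stays simple.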

@Yanbin-Li-Oct
Collaborator

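2. GPU runtime interface alignment: the GPU-side implementation already supports CKKS bootstrapping, including sparse encoding, through HEonGPU. To keep the architecture consistent, the existing interface should not be modified; reuse it directly.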

@Yanbin-Li-Oct
Collaborator

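3. Next-step focus: the current gap is mainly the parameter-configuration mapping on the client side. The priority should be filling in the relevant interface configuration so that the client can correctly trigger HEonGPU's sparse bootstrapping logic, and completing an end-to-end run of an AI model.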

D4rkCrypto force-pushed the feat/sparse-bootstrap branch from 0e611f4 to bcaf517 (May 7, 2026 04:09)
@D4rkCrypto
Author

Squashed and force-pushed bcaf517. Diff: 20 files / +784 / -48 (down from 22 / +1649 / -56).

1. Frontend simplification (frontend/custom_task.py)

used_slots removed entirely. slots now derives solely from g_param:

  • Constructors no longer take used_slots (FheDataNode, CiphertextNode, CkksCiphertextNode).
  • Helpers _validate_used_slots, _propagate_used_slots, _propagate_used_slots_mult, _saturate_rotation_used_slots, _infer_log_slots, _inject_sparse_bootstrap_rotation_keys deleted.
  • All z.used_slots = ... propagation lines in operators removed.
  • Auto-slot-inference call site in process_custom_task collapsed to slots_for_task = g_param.slots if isinstance(g_param, CkksParam) else None.
  • unittests/test_auto_slot_inference.py (647 LOC) deleted.

User opts into sparse via CkksBtpParam.create_sparse_param(log_slots) or set_slots(...) directly.
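
A minimal sketch of the resulting flow, using stand-in classes. The class bodies below are hypothetical; only the names CkksParam, CkksBtpParam, create_sparse_param, and the slots_for_task expression come from this PR, and the choice of N=2^16 for the sparse factory is an assumption based on it being the production ring degree.

```python
from typing import Optional


class CkksParam:
    """Stand-in for the frontend's CKKS parameter class."""

    def __init__(self, n: int, slots: Optional[int] = None) -> None:
        self.n = n
        # Dense packing uses N/2 slots unless the user asks for fewer.
        self.slots = n // 2 if slots is None else slots


class CkksBtpParam(CkksParam):
    """Stand-in for the bootstrapping parameter class."""

    @classmethod
    def create_sparse_param(cls, log_slots: int) -> "CkksBtpParam":
        # Assumption: production ring degree N = 2^16.
        return cls(n=1 << 16, slots=1 << log_slots)


class BfvParam:
    """Stand-in for a non-CKKS scheme, which has no slots attribute."""


def slots_for_task(g_param: object) -> Optional[int]:
    """The collapsed call site: slots derive solely from g_param."""
    return g_param.slots if isinstance(g_param, CkksParam) else None
```

With this shape, ciphertext nodes and operators never carry a per-node slot count; the parameter object is the single source of truth.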

2. GPU runtime interface alignment (mega_ag_runners/gpu/)

  • mega_ag_runners/gpu/gpu_wrapper.cu: removed log_slots JSON backward-compat — reads param_json["slots"] directly. The remaining is_sparse branch is now a static-prime-chain translator (uses ckks_sparse_bootstrap_chain_n16, included explicitly from bootstrap_helper.cuh) that maps the frontend's dense CkksBtpParam layout to the sparse Q + level_starts that HEonGPU's BootstrappingConfigV2(log_slots) ctor expects. Once the frontend learns to emit sparse-correct values directly, the branch collapses into the dense path.
  • mega_ag_runners/cpu_task_utils.h: same backward-compat removed.
  • mega_ag_runners/gpu/mega_ag_executors_gpu.cu + mega_ag_runners/mega_ag.h: ExecutionContext.log_slots removed; bind_gpu_bootstrap dispatches uniformly to regular_bootstrapping_v2 (HEonGPU now handles sparse internally via gap_ > 1).

Carries submodules: HEonGPU 6675bee (matching PR cipherflow-fhe/HEonGPU#3), lattigo bb1b0bb (matching PR cipherflow-fhe/lattigo#10).

3. Next-step focus (client-side AI model E2E): out of scope for this PR; tracked separately.

Verification:

  • test_sparse_bootstrap (CPU lattigo path): 1.44× speedup at log_slots=8 on toy N=8192, max_err 3.95e-7. ✅
  • test_gpu_ckks dense bootstrap: passing. ✅
  • HEonGPU example 6 (log_slots=8 production): 66.8 ms, 19-26 bit precision. ✅

D4rkCrypto force-pushed the feat/sparse-bootstrap branch from bcaf517 to d685826 (May 7, 2026 04:19)
@D4rkCrypto
Author

Amended → d685826 (force-pushed). Addresses an inconsistency between the public C++ API and the GPU runtime path:

  • mega_ag_runners/gpu/gpu_wrapper.cu: when slots < n/2 but n != 2^16, the GPU runner now throws a clear std::runtime_error ("GPU sparse bootstrap currently supports N=2^16 only ...") instead of silently falling through to the dense path and surfacing a downstream CUDA illegal-access. The ckks_sparse_bootstrap_chain_n16 static prime table is hardcoded for N=2^16, so toy-N sparse on GPU isn't yet wired up; the CPU runner (lattigo) still handles toy sparse correctly.

  • fhe_ops_lib/fhe_lib_v2.h: documented the CPU/GPU support boundary on create_sparse_parameter / create_toy_sparse_parameter so users know create_toy_sparse_parameter is CPU-only and create_sparse_parameter is the production-N entry point that works end-to-end on both backends.
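
The guard described in the first bullet amounts to the following check, shown here as a Python sketch of the C++ behavior. The function name is illustrative, and the error message reproduces only the prefix quoted above, since the rest is elided in this description.

```python
PRODUCTION_N = 1 << 16  # the only ring degree wired up for GPU sparse


def check_gpu_sparse_supported(n: int, slots: int) -> None:
    """Fail early instead of falling through to the dense path."""
    is_sparse = slots < n // 2
    if is_sparse and n != PRODUCTION_N:
        raise RuntimeError(
            "GPU sparse bootstrap currently supports N=2^16 only"
        )
```

Raising at parameter-setup time gives the user an actionable message instead of a CUDA illegal-access later in the pipeline.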

D4rkCrypto force-pushed the feat/sparse-bootstrap branch from d685826 to c180e91 (May 7, 2026 04:24)
@D4rkCrypto
Author

Amended → c180e91. Removed the aspirational hidden-tag toy GPU sparse tests ([.sparse][.bootstrap][.gpu]) — they targeted an N=2^13 sparse path that isn't wired up, so they fail deterministically when un-gated. Tracking GPU toy sparse as future work; CPU toy sparse (via test_sparse_bootstrap and lattigo) and N=2^16 GPU sparse (via the chain helper translator) continue to work.

D4rkCrypto force-pushed the feat/sparse-bootstrap branch from c180e91 to 9c7b415 (May 7, 2026 04:57)
@D4rkCrypto
Author

Amend (c180e91 → 9c7b415): collapsed the is_sparse translator branch in mega_ag_runners/gpu/gpu_wrapper.cu. Q/P, level_starts, depths, and EvalMod params are now read uniformly from the JSON the Python frontend already emits; the sparse path no longer regenerates them from ckks_sparse_bootstrap_chain_n16 (which has been deleted on the HEonGPU side, see cipherflow-fhe/HEonGPU#3).

The only branch that remains is the 4-arg vs 3-arg BootstrappingConfigV2 ctor — that's HEonGPU's own sparse-packing hint flag, and it's a single line.

Net change in gpu_wrapper.cu: +30/-64. The submodule pointer is bumped to the new HEonGPU SHA. CPU sparse, GPU dense, and HEonGPU example 6 sparse all still pass.

Issue 3 (Client-side AI E2E parameter mapping) is intentionally left for a follow-up PR.

D4rkCrypto force-pushed the feat/sparse-bootstrap branch from 9c7b415 to be4653c (May 7, 2026 05:29)
Adds end-to-end sparse-packing CKKS bootstrap. The user opts into a sparse
configuration with CkksBtpParam.create_sparse_param(log_slots) (or
.create_toy_sparse_param for N=8192 dev runs), and the existing bootstrap()
op in the DAG goes through:

- CPU: lattigo's native sparse path (LogSlots < LogN-1) via the new
  SetCkksParameterLogSlots SDK export. The ABI bridge encodes/decodes at
  the param's LogSlots, so sparse plaintexts pack/unpack correctly.
- GPU: HEonGPU's existing regular_bootstrapping_v2 once context.set_slot_count
  puts the encoder in gap_>1 mode. mega_ag_runners/gpu/gpu_wrapper.cu uses
  the static prime-chain helper ckks_sparse_bootstrap_chain_n16 (semi-public
  in HEonGPU; included explicitly here) to translate the frontend's dense
  CkksBtpParam into the sparse Q + level_starts that BootstrappingConfigV2's
  4-arg ctor expects. This is a translator, not a duplicate code path -- once
  the frontend learns to emit sparse-correct values directly, the branch
  collapses into the dense path.

Tests: test_sparse_bootstrap (CPU lattigo path) demonstrates the 1.32-1.59x
speedup at log_slots in [8, 14] vs the dense LogSlots=LogN-1 baseline.
test_gpu_ckks/cpu_ckks add CKKS smoke coverage. The
examples/ckks_sparse_bootstrap_cpu/ directory shows the C++ and Python entry
points for users.

Carries submodule pointers: HEonGPU 6675bee (sparse-packing CKKS bootstrap),
lattigo bb1b0bb (SetCkksParameterLogSlots + decoupled encode/decode).

D4rkCrypto force-pushed the feat/sparse-bootstrap branch from be4653c to 0cff25f (May 7, 2026 05:46)

heongpu::BootstrappingConfigV2 boot_config(stc_config, eval_mod_config, cts_config);

heongpu::EncodingMatrixConfig cts_config(
Collaborator

Please avoid changes like this.
