Commit bf90b49
Add multithreading to table lookup (#5849)
Summary:
X-link: facebookresearch/FBGEMM#2767
## What
Parallelizes the per-table loop in the CPU TBE forward kernel
(IntNBitTableBatchedEmbeddingBagsCodegen::forward) across tables. The
per-table loop is embarrassingly parallel: each table reads its own weight
slice and writes a disjoint slice of the output, so tables can be fanned out
across threads. The speedup is sublinear in practice (1.38x-1.69x at 2 threads
and up to 2.94x at 8 threads in the benchmark below).
Gated by the TBE_TABLE_THREADS env var:
- TBE_TABLE_THREADS=1 (default): unchanged sequential behavior.
- TBE_TABLE_THREADS=N>1: tables are distributed over N OpenMP threads with
dynamic scheduling (good load balancing when table sizes are skewed).
## Default behavior is unchanged
When TBE_TABLE_THREADS<=1 the helper takes an early return and runs the loop
body sequentially with no try/catch wrapper and no thread-local-state guard, so
the default path is functionally identical to the pre-change code: same
iteration order, the DEVICE-placement TORCH_CHECK in its original per-table
position, same error semantics, and the same generated machine code for the
body. The only always-on changes are mechanical and behavior-preserving
(loop-local `weights` pointer, int64 loop index).
## Design
- Raw OpenMP (#pragma omp parallel + omp for, schedule(dynamic)) rather than
at::parallel_for, so TBE gets its own thread count independent of the global
intra-op pool / OMP_NUM_THREADS (predictors run with OMP_NUM_THREADS=1).
- Thread count is read once from the env var and cached (thread-safe static
init), and clamped to the number of tables.
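
The gating and dispatch described above can be sketched roughly as follows. This is a minimal illustration, not the code from this commit: `parse_table_threads`, `table_threads`, and `for_each_table` are hypothetical names, and the combined `parallel for` pragma stands in for the separate `omp parallel` + `omp for` that the commit uses. How the real helper treats malformed values of the env var is not stated in the commit, so mapping them to 1 is an assumption.

```cpp
#include <algorithm>
#include <cstdint>
#include <cstdlib>

// Hypothetical parser: unset, unparsable, or < 1 all fall back to 1
// (assumed policy; the commit only states that 1 is the default).
inline int64_t parse_table_threads(const char* value) {
  if (value == nullptr) return 1;
  char* end = nullptr;
  const long parsed = std::strtol(value, &end, 10);
  if (end == value || parsed < 1) return 1;
  return static_cast<int64_t>(parsed);
}

// Reads the env var once. A function-local static is initialized exactly
// once and that initialization is thread-safe since C++11.
inline int64_t table_threads() {
  static const int64_t cached =
      parse_table_threads(std::getenv("TBE_TABLE_THREADS"));
  return cached;
}

// Runs body(t) for every table index t in [0, num_tables).
template <typename Body>
void for_each_table(int64_t num_tables, int64_t requested_threads,
                    const Body& body) {
  // Clamp to the number of tables so no thread is created without work.
  const int64_t threads = std::min(requested_threads, num_tables);
  if (threads <= 1) {
    // Early return: plain sequential loop, no OpenMP region at all.
    for (int64_t t = 0; t < num_tables; ++t) body(t);
    return;
  }
  const int n = static_cast<int>(threads);
  // The num_threads clause takes precedence over OMP_NUM_THREADS, which is
  // what gives this loop its own thread count. schedule(dynamic) hands out
  // iterations as threads become free, which helps with skewed table sizes.
#pragma omp parallel for num_threads(n) schedule(dynamic)
  for (int64_t t = 0; t < num_tables; ++t) {
    body(t);
  }
}
```

A caller would pass `table_threads()` as `requested_threads`. If the file is compiled without OpenMP support the pragma is ignored and the loop simply runs sequentially.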
## Correctness
- Removed the function-scoped `weights_acc` pointer, which every iteration
overwrote — a data race once the loop is parallel. Replaced with a loop-local
pointer (identical pointer value). Every other variable in the loop body is
already loop-local, and each table writes a disjoint output slice
(output_acc + D_start), so results are bitwise-identical to the sequential
path.
- The per-table DEVICE-placement TORCH_CHECK stays in its original position. In
the threaded path it — like any other throw from the loop body (kernel
errors, at::arange checks) — is captured (first one wins) and rethrown after
the join, so no exception escapes the OpenMP region.
- Worker threads restore the caller's at::ThreadLocalState (dispatch keys,
grad/inference mode, autocast, ...), so ATen calls inside the loop (e.g.
at::arange in the nobag path) run with the correct thread-local context.
## Verification
- Builds clean (mode/opt); confirmed OpenMP is actually enabled for this
target by inspecting the compiled object — gen_*_codegen_cpu.cpp.pic.o
references __kmpc_fork_call / omp_get_num_threads (the pragma is NOT a no-op,
even though no -fopenmp appears in the TARGETS; the fbcode default toolchain
supplies it).
- nbit_forward CPU unit tests pass with TBE_TABLE_THREADS=4, including
test_nbit_forward_cpu_with_table_sharing (non-monotonic weights_offsets).
## Benchmark (INT4, B=512, D=128, E=100K, L=20, FP16/SUM, iters=100, 3-run avg)
| Tables | Threads | Avg (us) | BW (GB/s) | Speedup | Efficiency |
|---|---|---|---|---|---|
| 8 | 1 | 1,436 | 4.38 | --- | --- |
| 8 | 2 | 1,042 | 6.03 | 1.38x | 69% |
| 8 | 4 | 844 | 7.46 | 1.70x | 43% |
| 8 | 8 | 773 | 8.14 | 1.86x | 23% |
| 32 | 1 | 4,797 | 5.25 | --- | --- |
| 32 | 2 | 2,830 | 8.90 | 1.69x | 85% |
| 32 | 4 | 2,003 | 12.60 | 2.40x | 60% |
| 32 | 8 | 1,633 | 15.49 | 2.94x | 37% |
| 64 | 1 | 10,132 | 4.97 | --- | --- |
| 64 | 2 | 6,767 | 7.44 | 1.50x | 75% |
| 64 | 4 | 4,864 | 10.35 | 2.08x | 52% |
| 64 | 8 | 4,033 | 12.50 | 2.51x | 31% |
2 threads is the efficiency sweet spot (69-85%); efficiency falls off at higher
counts due to fixed fork/join overhead per call. This matches the production
recommendation TBE_TABLE_THREADS=2 (+7.3% QPS measured in ICE). Microbenchmark
kernel speedups overpredict end-to-end QPS (Amdahl: CPU TBE is a small fraction
of total inference latency).
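
For reference, the Speedup and Efficiency columns appear to follow the usual definitions (single-thread time divided by threaded time, and speedup divided by thread count). That reading is inferred from the table values rather than stated in the commit. A small sketch with hypothetical helper names:

```cpp
// Speedup relative to the single-thread run of the same table count.
constexpr double speedup(double base_us, double threaded_us) {
  return base_us / threaded_us;
}

// Parallel efficiency: fraction of ideal linear scaling achieved.
constexpr double efficiency(double base_us, double threaded_us, int threads) {
  return speedup(base_us, threaded_us) / threads;
}

// 8 tables, 2 threads: 1436 / 1042 is about 1.38x, i.e. about 69% efficiency.
// 8 tables, 8 threads: 1436 / 773 is about 1.86x, i.e. about 23% efficiency.
```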
Reviewed By: helloguo, q10
Differential Revision: D102867249
1 parent 8744183 commit bf90b49
1 file changed
Lines changed: 75 additions & 6 deletions
(Diff content not captured. Hunks by new-file line: 28 and 30 added; 42-109 added; old 243-244 removed; old 283 replaced by 351-352; old 297-298 replaced by 366; old 454 replaced by 522-523.)