Summary
Profiling on production traffic shows that TFPredictor.predict() (the autoregressive decode loop in tablemodel04_rs.py) leaves the GPU mostly idle even when the model is loaded on CUDA. Per-table wall time is on the order of seconds while the per-step GPU forward appears to be much shorter. The bottleneck looks like per-token Python overhead — primarily a .item() call that forces a CUDA→CPU sync every token plus Python-side structure-correction rules between tokens.
Related but distinct from #115 (kernel-level optimization via need_weights=False) and docling-project/docling#1521 (kv_proj elimination). Those reduce per-step cost; this issue is about per-step launch overhead dominating compute, which becomes the bottleneck once kernel-level work is small.
Evidence
Setup: 8× docling-server VMs, each with 1× NVIDIA L4, torch 2.11.0+cu130. Models confirmed on cuda:0 (verified next(model.parameters()).device == cuda:0).
While processing a ~70-page PDF with 64 tables (mean ~12s/table inside convert):
- Total wall time inside doc_converter.convert(...): 780s
- Python process CPU usage: ~170% (≈1.7 of 32 cores active)
- nvidia-smi dmon -s u -c 10 during this load: GPU sm% = 0,0,0,0,0,1,0,0,0,0 over 10 consecutive seconds
- 8.3 GB of weights loaded into GPU memory (model is on device)
Watchdog stack dump (caught the worker mid-decode):
torch/nn/modules/module.py:1790 _call_impl
docling_ibm_models/tableformer/models/table04_rs/transformer_rs.py:58 forward
docling_ibm_models/tableformer/models/table04_rs/tablemodel04_rs.py:181 predict
docling_ibm_models/tableformer/data_management/tf_predictor.py:746 predict
docling_ibm_models/tableformer/data_management/tf_predictor.py:490 multi_table_predict
docling/models/stages/table_structure/table_structure_model.py:258 predict_tables
docling/pipeline/standard_pdf_pipeline.py:296 _process_batch
All other Python threads were idle ThreadPoolExecutor workers.
Root cause
In tablemodel04_rs.py predict() (around line 180+):
while len(output_tags) < self._max_pred_len:
    decoded_embedding = self._tag_transformer._embedding(decoded_tags)
    decoded_embedding = self._tag_transformer._positional_encoding(decoded_embedding)
    decoded, cache = self._tag_transformer._decoder(...)
    logits = self._tag_transformer._fc(decoded[-1, :, :])
    new_tag = logits.argmax(1).item()  # <-- GPU→CPU sync every token
    if line_num == 0 and new_tag == word_map["xcel"]:
        new_tag = word_map["lcel"]  # Python branch on a scalar
    if prev_tag_ucel and new_tag == word_map["lcel"]:
        new_tag = word_map["fcel"]
    if new_tag == word_map["<end>"]:
        ...
        break  # Python control flow exits the loop
For each generated token: small forward on GPU → .item() blocks until GPU finishes and copies one int back → Python evaluates 2-3 conditionals → builds next input → next forward. On a small model the GPU kernel completes in microseconds, so the sync + Python overhead dominates total step time.
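To put a number on the per-token sync cost in isolation, a toy comparison along these lines might help. It is only a sketch: the embedding and linear layer are hypothetical stand-ins for one decoder step, the sizes are invented, and nothing here calls into docling-ibm-models. On a CPU-only machine both loops should time about the same, so the comparison is only meaningful on a CUDA device.

```python
import time
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
torch.manual_seed(0)

VOCAB, DIM = 16, 32  # hypothetical sizes, not TableFormer's
embed = torch.nn.Embedding(VOCAB, DIM).to(device)
head = torch.nn.Linear(DIM, VOCAB).to(device)


@torch.no_grad()
def step(tok):
    """Toy stand-in for one decoder step: (1,) token ids -> (1, VOCAB) logits."""
    return head(embed(tok))


def decode_with_sync(start, n_steps):
    tags, tok = [], start
    for _ in range(n_steps):
        tok_id = step(tok).argmax(1).item()  # host waits for the device here
        tags.append(tok_id)
        tok = torch.tensor([tok_id], device=device)
    return tags


def decode_without_sync(start, n_steps):
    steps, tok = [], start
    for _ in range(n_steps):
        tok = step(tok).argmax(1)  # stays on the device, no host round trip
        steps.append(tok)
    return torch.cat(steps).tolist()  # one transfer after the loop


def timed(fn, *args):
    """Wall-clock a call, synchronizing around it when running on CUDA."""
    if device.type == "cuda":
        torch.cuda.synchronize()
    t0 = time.perf_counter()
    result = fn(*args)
    if device.type == "cuda":
        torch.cuda.synchronize()
    return result, time.perf_counter() - t0


start = torch.zeros(1, dtype=torch.long, device=device)
tags_sync, t_sync = timed(decode_with_sync, start, 256)
tags_nosync, t_nosync = timed(decode_without_sync, start, 256)
print(f"with .item(): {t_sync * 1e3:.1f} ms, without: {t_nosync * 1e3:.1f} ms")
```

The second loop still pays Python and kernel-launch overhead on every step; removing the sync is only the first part of the cost, which is why CUDA graph capture is proposed as a separate step.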
Proposed optimization
Make the decode loop GPU-resident end-to-end so it can be wrapped in CUDA graphs (torch.compile(mode="reduce-overhead") or torch.cuda.CUDAGraph):
- Drop .item() from the hot loop. Keep new_tag as a 0-D GPU tensor.
- Vectorize the structure-correction rules as torch.where ops on the tensor.
- Generate to a fixed max_pred_len unconditionally, then trim at the first <end> token in a single post-pass. Trades some wasted compute for a Python-free loop.
- Pad or bucket the KV cache so step input/output shapes are stable enough for CUDA graph capture.
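A rough sketch of how the first three bullets could fit together is below. Everything in it is illustrative: the tag ids in WORD_MAP are made up, step_fn stands in for the embedding/decoder/fc chain, and the state updates (first_line mirroring line_num == 0, cleared after an "nl" tag; prev_ucel set from the corrected tag) are assumptions about the surrounding code that the snippet above does not show. Dropping <end> from the trimmed result is also an assumption.

```python
import torch

# Hypothetical tag ids; the real ones come from word_map.
WORD_MAP = {"<pad>": 0, "<start>": 1, "<end>": 2, "fcel": 3,
            "lcel": 4, "ucel": 5, "xcel": 6, "nl": 7}


def correct_tag(new_tag, first_line, prev_ucel):
    """Tensor form of the two correction rules, applied in the same order."""
    lcel = torch.full_like(new_tag, WORD_MAP["lcel"])
    fcel = torch.full_like(new_tag, WORD_MAP["fcel"])
    # Rule 1: xcel on the first line becomes lcel.
    new_tag = torch.where(first_line & (new_tag == WORD_MAP["xcel"]), lcel, new_tag)
    # Rule 2: lcel right after ucel becomes fcel.
    new_tag = torch.where(prev_ucel & (new_tag == WORD_MAP["lcel"]), fcel, new_tag)
    return new_tag


@torch.no_grad()
def decode_fixed_length(step_fn, start_tag, max_pred_len):
    """Always run max_pred_len steps; no .item() and no data-dependent break."""
    device = start_tag.device
    first_line = torch.ones(1, dtype=torch.bool, device=device)
    prev_ucel = torch.zeros(1, dtype=torch.bool, device=device)
    tag, steps = start_tag, []
    for _ in range(max_pred_len):
        logits = step_fn(tag)  # (1, vocab)
        tag = correct_tag(logits.argmax(1), first_line, prev_ucel)
        steps.append(tag)
        # Assumed state updates, kept as tensors so nothing syncs to the host.
        prev_ucel = tag == WORD_MAP["ucel"]
        first_line = first_line & (tag != WORD_MAP["nl"])
    return torch.cat(steps)


def trim_at_end(tags):
    """Single post-pass: keep everything before the first <end>, if any."""
    end_pos = (tags == WORD_MAP["<end>"]).nonzero().flatten()
    n = int(end_pos[0]) if end_pos.numel() > 0 else tags.numel()
    return tags[:n].tolist()


def make_scripted_step(script):
    """Test double: ignores its input and emits one-hot logits from a script."""
    it = iter(script)

    def step_fn(_prev_tag):
        tag = torch.tensor([next(it)])
        return torch.nn.functional.one_hot(tag, len(WORD_MAP)).float()

    return step_fn


script = [WORD_MAP[t] for t in
          ["xcel", "ucel", "lcel", "nl", "xcel", "<end>", "fcel", "fcel"]]
raw = decode_fixed_length(make_scripted_step(script),
                          torch.tensor([WORD_MAP["<start>"]]), len(script))
print(trim_at_end(raw))
```

The loop body contains no host-side branching on tensor values, which is the property CUDA graph capture needs; the only host transfer happens once, in trim_at_end.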
Haven't benchmarked the proposed change yet, so the magnitude of improvement is unverified. The expectation is that this targets a different regime than #115 and docling-project/docling#1521: those reduce per-step kernel cost, while the change proposed here targets per-step Python overhead — which, based on the profiling above, looks like it currently dominates. Happy to prototype and post measured numbers before any PR if that would help.
Trade-offs
- Wasted compute at the tail of short tables (generating to max_pred_len instead of breaking at <end>). Net win for typical tables, but worth measuring.
- The structure-correction rules are easier to maintain as Python ifs than vectorized ops; adding new rules later would require updating both forms (or keeping the Python path as a fallback flag).
- KV cache shape handling: padding adds memory; bucketing adds graph cache complexity. Most likely a single padded length covers the common case well.
- Compatibility with TableFormerMode.FAST and ACCURATE: both have separate predict paths.
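On the KV cache point, one way to keep shapes fixed is a preallocated buffer written in place, plus a validity mask so the padded slots get zero attention weight. The sketch below is generic and does not mirror the actual cache layout in transformer_rs.py; the buffer shape, the helper names, and the single-head dot-product attention are all assumptions chosen to keep the example small.

```python
import math
import torch


def write_step(cache, step, new_entry):
    """Write one step into a preallocated (max_len, B, D) buffer in place.

    step is a 1-element long tensor, so the write position is data on the
    device rather than a Python int baked into the captured graph.
    """
    cache.index_copy_(0, step, new_entry)  # new_entry: (1, B, D)
    return cache


def valid_mask(max_len, step):
    """True for slots 0..step, False for the still-empty padding."""
    return torch.arange(max_len, device=step.device) <= step


def attend(query, keys, mask):
    """Single-head dot-product attention over a possibly padded cache.

    query: (B, D), keys: (L, B, D), mask: (L,) bool.
    """
    scores = torch.einsum("bd,lbd->bl", query, keys) / math.sqrt(query.shape[-1])
    scores = scores.masked_fill(~mask, float("-inf"))
    weights = scores.softmax(dim=-1)
    return torch.einsum("bl,lbd->bd", weights, keys)


torch.manual_seed(0)
MAX_LEN, B, D = 8, 2, 4  # hypothetical sizes
cache = torch.zeros(MAX_LEN, B, D)
for i in range(3):
    step = torch.tensor([i])
    write_step(cache, step, torch.randn(1, B, D))

query = torch.randn(B, D)
padded_out = attend(query, cache, valid_mask(MAX_LEN, step))
exact_out = attend(query, cache[:3], torch.ones(3, dtype=torch.bool))
print(torch.allclose(padded_out, exact_out, atol=1e-6))
```

The memory cost is the full max_len buffer per table regardless of actual length, which is the padding trade-off named in the list above.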
Asks
- Is this optimization in scope / wanted by the maintainers?
- Are there existing test PDFs with diverse table structures (simple grids, multi-span, nested) we should validate against to catch regressions during the refactor?
- Any prior work on this internally we should be aware of before opening a PR?
If welcome, we'd be glad to contribute a draft PR. Would start with the .item() removal + simple vectorization (no CUDA graphs) so the diff is reviewable, then layer CUDA graph capture as a follow-up.
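For the first stage, a parity check between the existing Python rules and a tensor version could guard the refactor independently of any test PDFs. The sketch below enumerates every (tag, first_line, prev_ucel) combination for a small made-up vocabulary; the tag ids and both function names are hypothetical, and first_line stands for the line_num == 0 condition in the current code.

```python
import itertools
import torch

# Hypothetical tag ids; the real ones come from word_map.
WORD_MAP = {"<end>": 2, "fcel": 3, "lcel": 4, "ucel": 5, "xcel": 6, "nl": 7}


def rules_python(new_tag, first_line, prev_ucel):
    """Reference: the two rules as plain Python branches on scalars."""
    if first_line and new_tag == WORD_MAP["xcel"]:
        new_tag = WORD_MAP["lcel"]
    if prev_ucel and new_tag == WORD_MAP["lcel"]:
        new_tag = WORD_MAP["fcel"]
    return new_tag


def rules_tensor(new_tag, first_line, prev_ucel):
    """Candidate: the same two rules as torch.where ops on tensors."""
    lcel = torch.full_like(new_tag, WORD_MAP["lcel"])
    fcel = torch.full_like(new_tag, WORD_MAP["fcel"])
    new_tag = torch.where(first_line & (new_tag == WORD_MAP["xcel"]), lcel, new_tag)
    new_tag = torch.where(prev_ucel & (new_tag == WORD_MAP["lcel"]), fcel, new_tag)
    return new_tag


def check_parity():
    """Return every input combination where the two forms disagree."""
    mismatches = []
    combos = itertools.product(WORD_MAP.values(), (False, True), (False, True))
    for tag, first_line, prev_ucel in combos:
        expected = rules_python(tag, first_line, prev_ucel)
        got = int(rules_tensor(torch.tensor([tag]),
                               torch.tensor([first_line]),
                               torch.tensor([prev_ucel]))[0])
        if got != expected:
            mismatches.append((tag, first_line, prev_ucel, expected, got))
    return mismatches


print(check_parity())
```

Because the state space per step is tiny, the check is exhaustive rather than sampled; any rule added later would need an entry in both functions for the check to keep passing.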
Environment
- docling: 2.92.0
- docling-ibm-models: latest as of 2026-05
- PyTorch: 2.11.0+cu130
- GPU: NVIDIA L4