perf(tableformer): autoregressive decode bottlenecked by per-token Python overhead (GPU mostly idle) #161

@utkarsh-611

Summary

Profiling on production traffic shows that TFPredictor.predict(), which drives the autoregressive decode loop in tablemodel04_rs.py, leaves the GPU mostly idle even when the model is loaded on CUDA. Per-table wall time averages roughly 12 s (see Evidence), while each per-step GPU forward appears to be far shorter. The bottleneck looks like per-token Python overhead: primarily a .item() call that forces a CUDA→CPU sync on every token, plus Python-side structure-correction rules between tokens.

Related but distinct from #115 (kernel-level optimization via need_weights=False) and docling-project/docling#1521 (kv_proj elimination). Those reduce per-step cost; this issue is about per-step launch overhead dominating compute, which becomes the bottleneck once kernel-level work is small.

Evidence

Setup: 8× docling-server VMs, each with 1× NVIDIA L4, torch 2.11.0+cu130. Models confirmed on cuda:0 (verified next(model.parameters()).device == cuda:0).

While processing a ~70-page PDF with 64 tables (mean ~12s/table inside convert):

  • Total wall time inside doc_converter.convert(...): 780s
  • Python process CPU usage: ~170% (≈1.7 of 32 cores active)
  • nvidia-smi dmon -s u -c 10 during this load: GPU sm% = 0,0,0,0,0,1,0,0,0,0 over 10 consecutive seconds
  • 8.3 GB of weights loaded into GPU memory (model is on device)

Watchdog stack dump (caught the worker mid-decode):

torch/nn/modules/module.py:1790 _call_impl
docling_ibm_models/tableformer/models/table04_rs/transformer_rs.py:58 forward
docling_ibm_models/tableformer/models/table04_rs/tablemodel04_rs.py:181 predict
docling_ibm_models/tableformer/data_management/tf_predictor.py:746 predict
docling_ibm_models/tableformer/data_management/tf_predictor.py:490 multi_table_predict
docling/models/stages/table_structure/table_structure_model.py:258 predict_tables
docling/pipeline/standard_pdf_pipeline.py:296 _process_batch

All other Python threads were idle ThreadPoolExecutor workers.

Root cause

In tablemodel04_rs.py predict() (around line 180+):

while len(output_tags) < self._max_pred_len:
    decoded_embedding = self._tag_transformer._embedding(decoded_tags)
    decoded_embedding = self._tag_transformer._positional_encoding(decoded_embedding)
    decoded, cache = self._tag_transformer._decoder(...)
    logits = self._tag_transformer._fc(decoded[-1, :, :])
    new_tag = logits.argmax(1).item()       # <-- GPU→CPU sync every token

    if line_num == 0 and new_tag == word_map["xcel"]:
        new_tag = word_map["lcel"]           # Python branch on a scalar
    if prev_tag_ucel and new_tag == word_map["lcel"]:
        new_tag = word_map["fcel"]
    if new_tag == word_map["<end>"]:
        ...
        break                                # Python control flow exits the loop

For each generated token: small forward on GPU → .item() blocks until GPU finishes and copies one int back → Python evaluates 2-3 conditionals → builds next input → next forward. On a small model the GPU kernel completes in microseconds, so the sync + Python overhead dominates total step time.
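One way to check this claim in isolation is a toy loop that runs the same tiny forward with and without a per-step .item(). This is a minimal sketch, not the TableFormer code: the function name time_decode_loop, the layer sizes, and the step count are all made up for illustration, and the toy forward does not feed tags back in. No numbers are quoted because they depend on hardware; on CPU the two variants should be close, since there is no device sync to pay for.

```python
import time
import torch

def time_decode_loop(steps: int, sync_every_step: bool, device: torch.device) -> float:
    """Return wall-clock seconds per step for a toy decode loop (hypothetical stand-in)."""
    torch.manual_seed(0)
    fc = torch.nn.Linear(512, 13).to(device)      # toy logits head; sizes are arbitrary
    hidden = torch.randn(1, 512, device=device)
    tags = []
    if device.type == "cuda":
        torch.cuda.synchronize()                  # start from an empty GPU queue
    start = time.perf_counter()
    with torch.no_grad():
        for _ in range(steps):
            new_tag = fc(hidden).argmax(1)
            if sync_every_step:
                tags.append(new_tag.item())       # blocks until the GPU has finished this step
            else:
                tags.append(new_tag)              # result stays on the device, no host sync
    if device.type == "cuda":
        torch.cuda.synchronize()                  # wait for queued work before stopping the clock
    return (time.perf_counter() - start) / steps

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
for sync in (True, False):
    per_step = time_decode_loop(200, sync, device)
    print(f"sync_every_step={sync}: {per_step * 1e6:.1f} us/step")
```

Removing the sync alone does not remove kernel launch overhead, so this only isolates the .item() part of the per-step cost.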

Proposed optimization

Make the decode loop GPU-resident end-to-end so it can be wrapped in CUDA graphs (torch.compile(mode="reduce-overhead") or torch.cuda.CUDAGraph):

  1. Drop .item() from the hot loop. Keep new_tag as a single-element GPU tensor (logits.argmax(1) already returns one, with shape (1,) at batch size 1).
  2. Vectorize the structure-correction rules as torch.where ops on the tensor.
  3. Generate to a fixed max_pred_len unconditionally, then trim at the first <end> token in a single post-pass. Trades some wasted compute for a Python-free loop.
  4. Pad or bucket the KV cache so step input/output shapes are stable enough for CUDA graph capture.
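Steps 1 to 3 could look roughly like the sketch below. It is a sketch under stated assumptions, not the library's implementation: the WORD_MAP ids are hypothetical, the per-step logits are precomputed instead of coming from the decoder, the updates to line_num and prev_tag_ucel are guesses at what the elided part of the loop does, and keeping the `<end>` token in the output is also an assumption.

```python
import torch

# Hypothetical tag ids; the real values come from the model's word_map.
WORD_MAP = {"<end>": 0, "xcel": 1, "lcel": 2, "fcel": 3, "ucel": 4, "nl": 5}

def correct_tag(new_tag, line_num, prev_tag_ucel, wm=WORD_MAP):
    """Tensor form of the two correction rules: no .item(), no Python branch on data."""
    # Rule 1: on the first line, xcel becomes lcel.
    new_tag = torch.where((line_num == 0) & (new_tag == wm["xcel"]),
                          torch.full_like(new_tag, wm["lcel"]), new_tag)
    # Rule 2: directly after ucel, lcel becomes fcel (applied after rule 1, as in the Python ifs).
    new_tag = torch.where(prev_tag_ucel & (new_tag == wm["lcel"]),
                          torch.full_like(new_tag, wm["fcel"]), new_tag)
    return new_tag

def trim_at_end(tags, end_id):
    """Keep tokens up to and including the first <end>; the only host sync, after the loop."""
    is_end = tags == end_id
    if bool(is_end.any()):
        # argmax returns the index of the first maximal value
        return tags[: int(is_end.int().argmax()) + 1]
    return tags

def decode_fixed_length(step_logits, wm=WORD_MAP):
    """step_logits: (max_pred_len, vocab), a stand-in for per-step decoder output."""
    device = step_logits.device
    line_num = torch.zeros((), dtype=torch.int64, device=device)
    prev_tag_ucel = torch.zeros((), dtype=torch.bool, device=device)
    out = torch.empty(step_logits.shape[0], dtype=torch.int64, device=device)
    for t in range(step_logits.shape[0]):         # fixed trip count, no data-dependent break
        new_tag = correct_tag(step_logits[t].argmax(0), line_num, prev_tag_ucel, wm)
        out[t] = new_tag
        # Assumed state updates (elided in the excerpt above).
        line_num = line_num + (new_tag == wm["nl"]).to(torch.int64)
        prev_tag_ucel = new_tag == wm["ucel"]
    return trim_at_end(out, wm["<end>"])

# Raw argmax sequence: xcel, ucel, lcel, nl, xcel, <end>, fcel
raw = torch.tensor([1, 4, 2, 5, 1, 0, 3])
logits = torch.nn.functional.one_hot(raw, num_classes=6).float()
print(decode_fixed_length(logits).tolist())      # [2, 4, 3, 5, 1, 0]
```

In the example, the first xcel is rewritten to lcel (first line), the lcel after ucel is rewritten to fcel, the second xcel is left alone because it is past the first line, and everything after `<end>` is dropped in the post-pass.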

Haven't benchmarked the proposed change yet, so the magnitude of improvement is unverified. The expectation is that this targets a different regime than #115 and docling-project/docling#1521: those reduce per-step kernel cost, while the change proposed here targets per-step Python overhead — which, based on the profiling above, looks like it currently dominates. Happy to prototype and post measured numbers before any PR if that would help.

Trade-offs

  • Wasted compute at the tail of short tables (generating to max_pred_len instead of breaking at <end>). Expected to be a net win for typical tables, but this is unmeasured.
  • The structure-correction rules are easier to maintain as Python ifs than vectorized ops; adding new rules later would require updating both forms (or keeping the Python path as a fallback flag).
  • KV cache shape handling: padding adds memory; bucketing adds graph cache complexity. Most likely a single padded length covers the common case well.
  • Compatibility with TableFormerMode.FAST and ACCURATE — both have separate predict paths.
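To make the KV cache trade-off concrete, here is a minimal sketch of the "single padded length" option: a preallocated buffer written in place, with a mask hiding the unused slots. StaticCache and attend are hypothetical names, the single-head dot-product attention is a simplification, and the buffer layout does not mirror the actual cache in transformer_rs.py. Whether this exact form captures cleanly under CUDA graphs is untested.

```python
import torch

class StaticCache:
    """Fixed-shape buffer of past decoder states: (max_len, batch, d_model)."""

    def __init__(self, max_len, batch, d_model, device="cpu"):
        self.buf = torch.zeros(max_len, batch, d_model, device=device)
        # Next write index, kept as a device tensor so no host value is needed per step.
        self.pos = torch.zeros(1, dtype=torch.int64, device=device)

    def append(self, step_state):
        # step_state: (batch, d_model). In-place write keeps shape and storage stable.
        self.buf.index_copy_(0, self.pos, step_state.unsqueeze(0))
        self.pos += 1

    def padding_mask(self):
        # True where a slot has not been written yet; shape (max_len,).
        return torch.arange(self.buf.shape[0], device=self.buf.device) >= self.pos

def attend(query, cache):
    """Single-head attention over the padded buffer; needs at least one written slot."""
    keys = cache.buf.transpose(0, 1)                          # (batch, max_len, d_model)
    scores = torch.einsum("bd,bld->bl", query, keys) / keys.shape[-1] ** 0.5
    scores = scores.masked_fill(cache.padding_mask(), float("-inf"))
    weights = scores.softmax(dim=-1)                          # padded slots get weight 0
    return torch.einsum("bl,bld->bd", weights, keys)

cache = StaticCache(max_len=8, batch=2, d_model=4)
for _ in range(3):
    cache.append(torch.randn(2, 4))
print(attend(torch.randn(2, 4), cache).shape)
```

The memory cost is max_len × batch × d_model per cached tensor regardless of table length, which is the padding overhead noted above; bucketing would replace the single max_len with a few sizes at the cost of one captured graph per bucket.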

Asks

  1. Is this optimization in scope / wanted by the maintainers?
  2. Are there existing test PDFs with diverse table structures (simple grids, multi-span, nested) we should validate against to catch regressions during the refactor?
  3. Any prior work on this internally we should be aware of before opening a PR?

If welcome, we'd be glad to contribute a draft PR. We would start with the .item() removal plus simple vectorization (no CUDA graphs) so the diff is reviewable, then add CUDA graph capture as a follow-up.

Environment

  • docling: 2.92.0
  • docling-ibm-models: latest as of 2026-05
  • PyTorch: 2.11.0+cu130
  • GPU: NVIDIA L4
