urchade
diff --git a/‎README.md‎
Lines changed: 13 additions & 1 deletion b/‎README.md‎
Lines changed: 13 additions & 1 deletion
diff --git a/‎benchmarks/BENCHMARK_batch_level_decoding.md‎
Lines changed: 164 additions & 0 deletions b/‎benchmarks/BENCHMARK_batch_level_decoding.md‎
Lines changed: 164 additions & 0 deletions
@@ -20,7 +20,7 @@
     <br>
 </div>
 
-GLiNER is a framework for training and deploying Named Entity Recognition (NER) models that can identify any entity type using bidirectional transformer encoders (BERT-like). Beyond standard NER, GLiNER supports multiple tasks including joint entity and relation extraction through specialized architectures. It provides a practical alternative to both traditional NER models, which are limited to predefined entity types, and Large Language Models (LLMs), which offer flexibility but require significant computational resources.
+GLiNER is a framework for training and deploying small Named Entity Recognition (NER) models with zero-shot capabilities. In addition to tradition NER, it also supports joint entity and relation extraction. GLiNER is fine-tunable, optimized to run on CPUs and consumer hardware, and has performance competitive with LLMs several times its size, like ChatGPT and UniNER.
 
 
 ## Example Notebooks
@@ -142,6 +142,18 @@ If you find GLiNER useful in your research, please consider citing our papers:
       url={https://arxiv.org/abs/2406.12925}, 
 }
 ```
+
+```bibtex
+@misc{stepanov2026millionlabelnerbreakingscale,
+      title={The Million-Label NER: Breaking Scale Barriers with GLiNER bi-encoder}, 
+      author={Ihor Stepanov and Mykhailo Shtopko and Dmytro Vodianytskyi and Oleksandr Lukashov},
+      year={2026},
+      eprint={2602.18487},
+      archivePrefix={arXiv},
+      primaryClass={cs.CL},
+      url={https://arxiv.org/abs/2602.18487}, 
+}
+```
 ## Support and funding
 
 This project has been supported and funded by **F.initiatives** and **Laboratoire Informatique de Paris Nord**.
 
@@ -0,0 +1,164 @@
+# Benchmark: Batch-Level Decoding
+
+## Summary
+
+Replaces the per-item `for i in range(B)` decoding loop with a single set of
+batch-wide tensor operations. This reduces CUDA kernel launches from `B * 8` to
+`~8` regardless of batch size, eliminating the dominant overhead in GPU decoding.
+
+**GPU (bs>=8):** 63-95% decoder speedup (median 85%), all statistically significant.
+**GPU (bs=1):** Neutral (fast path delegates to per-item decoder).
+**CPU (bs>=8):** Mixed. Improvements at very_long inputs (24-42%); regressions at
+short/medium inputs where 4D `torch.where` has higher fixed overhead than B
+separate 3D calls. Absolute CPU regressions are 3-5ms.
+
+## Environment
+
+| | |
+|---|---|
+| Python | 3.13.7 |
+| PyTorch | 2.8.0+cu128 |
+| OS | Linux 6.6.87.2 (WSL2) |
+| GPU | NVIDIA GeForce RTX 5090 |
+| CPU | AMD (via WSL2) |
+
+## Methodology
+
+- **Interleaved A/B**: Old and new paths alternate within each iteration to
+  avoid warm-cache bias. Execution order flips on odd/even iterations.
+- **Replications**: 10 warmup + 50 measured iterations per condition (n=50).
+- **Conditions**: 4 batch sizes (1, 8, 16, 32) x 4 input lengths (20, 80, 200,
+  500 tokens) x 2 devices (CPU, GPU) = 32 conditions.
+- **Statistical test**: Welch's t-test (two-sided, unequal variance), p<0.05.
+- **Input realism**: Logits biased to -3.0 so ~5-50 spans per item pass
+  threshold=0.5, matching real GLiNER inference. max_width=12, num_classes=8.
+- **Correctness**: Bit-identical output verified for every condition before
+  benchmarking.
+
+## GPU Results
+
+The batch-level approach eliminates per-item CUDA kernel launches. At bs=1, a
+fast path delegates to the original per-item decoder to avoid the overhead of
+4D `torch.where` without amortization.
+
+| Batch Size | Input Length | Spans | Old (ms) | New (ms) | Change | p-value |
+|-----------:|:------------|------:|---------:|---------:|-------:|--------:|
+| 1 | short (20) | 3 | 0.862 | 0.889 | -3.2% | 0.42 |
+| 8 | short (20) | 24 | 6.782 | 1.124 | **+83.4%** | <0.001 |
+| 16 | short (20) | 48 | 13.539 | 1.177 | **+91.3%** | <0.001 |
+| 32 | short (20) | 102 | 22.382 | 1.074 | **+95.2%** | <0.001 |
+| 1 | medium (80) | 9 | 0.829 | 0.790 | +4.8% | 0.24 |
+| 8 | medium (80) | 63 | 6.878 | 1.224 | **+82.2%** | <0.001 |
+| 16 | medium (80) | 121 | 13.711 | 1.240 | **+91.0%** | <0.001 |
+| 32 | medium (80) | 273 | 27.550 | 1.613 | **+94.1%** | <0.001 |
+| 1 | long (200) | 19 | 0.789 | 0.764 | +3.1% | 0.39 |
+| 8 | long (200) | 158 | 7.739 | 1.703 | **+78.0%** | <0.001 |
+| 16 | long (200) | 330 | 16.388 | 2.384 | **+85.5%** | <0.001 |
+| 32 | long (200) | 629 | 31.173 | 3.442 | **+89.0%** | <0.001 |
+| 1 | very_long (500) | 50 | 1.303 | 1.349 | -3.5% | 0.12 |
+| 8 | very_long (500) | 366 | 10.149 | 3.716 | **+63.4%** | <0.001 |
+| 16 | very_long (500) | 771 | 18.791 | 6.670 | **+64.5%** | <0.001 |
+| 32 | very_long (500) | 1591 | 38.142 | 12.907 | **+66.2%** | <0.001 |
+
+All values are median wall-clock time over 50 interleaved iterations. Bold
+entries are statistically significant (p<0.05).
+
+### GPU scaling characteristics
+
+The new decoder time is nearly constant across batch sizes for short/medium
+inputs (~1ms), confirming that the fixed overhead is paid once:
+
+```
+GPU short input:   bs=8 → 1.1ms,  bs=16 → 1.2ms,  bs=32 → 1.1ms
+GPU medium input:  bs=8 → 1.2ms,  bs=16 → 1.2ms,  bs=32 → 1.6ms
+```
+
+The old path scales linearly with batch size (B kernel launches each):
+
+```
+Old short input:   bs=8 → 6.8ms,  bs=16 → 13.5ms, bs=32 → 22.4ms
+Old medium input:  bs=8 → 6.9ms,  bs=16 → 13.7ms, bs=32 → 27.6ms
+```
+
+## CPU Results
+
+On CPU, `torch.where` on a 4D tensor has ~3-5ms fixed overhead that doesn't
+exist when calling it B times on 3D slices. This makes the batch path slower
+for short/medium inputs where the per-item cost is already low. At very_long
+inputs, the per-item cost is high enough that batching still wins.
+
+| Batch Size | Input Length | Spans | Old (ms) | New (ms) | Change | p-value |
+|-----------:|:------------|------:|---------:|---------:|-------:|--------:|
+| 1 | short (20) | 2 | 0.027 | 0.027 | -1.1% | 0.31 |
+| 8 | short (20) | 25 | 0.221 | 0.083 | **+62.4%** | <0.001 |
+| 16 | short (20) | 51 | 0.492 | 0.148 | **+69.8%** | <0.001 |
+| 32 | short (20) | 89 | 1.008 | 5.427 | **-438.5%** | <0.001 |
+| 1 | medium (80) | 5 | 0.040 | 0.040 | -0.9% | 0.95 |
+| 8 | medium (80) | 68 | 0.465 | 5.301 | **-1038.8%** | <0.001 |
+| 16 | medium (80) | 127 | 0.799 | 3.790 | **-374.2%** | <0.001 |
+| 32 | medium (80) | 259 | 1.607 | 4.311 | **-168.2%** | <0.001 |
+| 1 | long (200) | 20 | 0.129 | 0.129 | +0.4% | 0.85 |
+| 8 | long (200) | 154 | 0.998 | 6.177 | **-519.0%** | <0.001 |
+| 16 | long (200) | 323 | 2.065 | 1.714 | **+17.0%** | <0.001 |
+| 32 | long (200) | 638 | 4.193 | 3.760 | +10.3% | 0.08 |
+| 1 | very_long (500) | 56 | 0.455 | 0.447 | +1.6% | 0.88 |
+| 8 | very_long (500) | 384 | 5.389 | 3.115 | **+42.2%** | <0.001 |
+| 16 | very_long (500) | 767 | 8.085 | 6.133 | **+24.2%** | <0.001 |
+| 32 | very_long (500) | 1572 | 17.120 | 12.168 | **+28.9%** | <0.001 |
+
+### CPU regression analysis
+
+The CPU regressions share a pattern: the new path's absolute time clusters
+around 3-6ms regardless of batch size or input length, suggesting a fixed floor
+in PyTorch's 4D `torch.where` / `nonzero` implementation on CPU. The per-item
+path avoids this by calling 3D `torch.where` on small tensors (each <50K
+elements), which stays under 0.1ms per call.
+
+The regressions are limited to conditions where the old decoder was already
+fast (<2ms). In absolute terms the worst regression adds ~5ms. At very_long
+inputs where the decoder is the bottleneck (old path 5-17ms), batching
+delivers 24-42% improvement.
+
+Note: CPU benchmarks ran under WSL2, which adds scheduling variance. The
+high stdev on some CPU conditions (30-60% of mean) partly reflects this.
+
+## Why the improvement
+
+The old per-item loop calls `_decode_batch_item` B times, each paying:
+- 1 `torch.where` on (L, K, C)
+- 1 boolean mask + 3 indexing ops
+- 1 score extraction via advanced indexing
+- 5 `.tolist()` GPU→CPU transfers
+
+At bs=32, that's **256 CUDA kernel launches + 160 GPU→CPU transfers**.
+
+The batch-level approach does this once on the full (B, L, K, C) tensor:
+- 1 `torch.where`
+- 1 boolean mask + 3 indexing ops
+- 1 score extraction
+- 6 `.tolist()` transfers
+
+Total: **~8 CUDA ops** regardless of batch size.
+
+## Design decisions
+
+### bs=1 fast path
+
+At batch size 1, there's nothing to amortize — the 4D `torch.where` has
+strictly more overhead than the 3D version. `_decode_batch` detects bs=1 and
+delegates directly to `_decode_batch_item`, matching the old performance
+exactly. Benchmarks confirm bs=1 is neutral on both CPU and GPU (all p>0.1).
+
+### Batched `return_class_probs`
+
+When class probabilities are requested, the old path called `torch.topk` per
+span (N_total kernel launches). The batch path gathers all probability vectors
+in one advanced indexing op, then does one batched `topk` on the (N_total, C)
+matrix.
+
+### Correctness
+
+Output is verified bit-identical for all 32 benchmark conditions. The batch
+`torch.where` returns indices in row-major order (batch, start, width, class),
+so within each batch item the span ordering is identical to the per-item path.
+`greedy_search` sorts by score regardless, so final output is deterministic.