Commit a786902
perf: remove sentence sorting from embedding pipeline
The _order_invariant_text() function previously sorted sentences
alphabetically before embedding to achieve order-invariant embeddings
for the REORDER scenario. This is no longer needed because:
1. Full-document segmented mean pooling (added in v0.2.0) is inherently
   near-order-invariant (0.996 cosine similarity for reordered documents)
2. Sorting broke on code, markdown, and non-English text (sentences were
   split on the literal ". ")
3. Sorting slightly hurt cross-instruction similarity (-0.002)
4. Sorting destroyed instruction-document alignment in long contexts
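The points above can be illustrated with a minimal sketch. It is not the removed code: `order_invariant_text_sketch`, `toy_embed`, and `mean_pool` are hypothetical names, the sorter is only a guess at the split-on-". " behaviour described above, and the character-count "embedding" is a stand-in for a real model. It shows sorting normalizing plain prose, scrambling a markdown list, and mean pooling giving the same vector for reordered segments without any sorting.

```python
def order_invariant_text_sketch(text):
    # Hypothetical reconstruction: split on ". ", sort alphabetically, rejoin.
    parts = [s.strip().rstrip(".") for s in text.split(". ")]
    return ". ".join(sorted(p for p in parts if p)) + "."

# Plain English prose: both orderings normalize to the same string.
prose_a = "Cats purr. Dogs bark. Birds sing."
prose_b = "Dogs bark. Birds sing. Cats purr."
same_after_sort = (
    order_invariant_text_sketch(prose_a) == order_invariant_text_sketch(prose_b)
)

# Markdown numbered list: ". " also follows the item numbers, so the
# "sentences" are fragments and sorting detaches items from their numbers.
markdown = "1. Zip the files\n2. Add the tag"
scrambled = order_invariant_text_sketch(markdown)

def toy_embed(segment, dim=8):
    # Stand-in for an embedding model: bucketed character counts.
    vec = [0.0] * dim
    for ch in segment:
        vec[ord(ch) % dim] += 1.0
    return vec

def mean_pool(segments):
    # Average the per-segment vectors; the mean ignores segment order.
    vecs = [toy_embed(s) for s in segments]
    return [sum(col) / len(vecs) for col in zip(*vecs)]

segments = ["Cats purr.", "Dogs bark.", "Birds sing."]
pooled_original = mean_pool(segments)
pooled_reordered = mean_pool(list(reversed(segments)))
```

In this toy setup the two pooled vectors are identical because the segments themselves are unchanged. In a real pipeline the segment boundaries presumably shift when a document is reordered, which would be consistent with the reported similarity being 0.996 rather than 1.0.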
Benchmark hit rates (SGLang + SemBlend, Qwen2.5-7B-Instruct, A10G):
| Dataset     | WITH sorting | WITHOUT sorting | Delta   |
|-------------|--------------|-----------------|---------|
| TriviaQA    | 3.5%         | 22.6%           | +19.1pp |
| SCBench     | 4.0%         | 13.6%           | +9.6pp  |
| WikiText103 | 10.0%        | 15.7%           | +5.7pp  |
| LongEval    | 9.7%         | 15.2%           | +5.5pp  |
| NarrativeQA | 16.7%        | 17.4%           | +0.7pp  |
Hit rates improved on every dataset. Average improvement: +8.1pp.
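The deltas and the average can be re-derived from the table with a short check. The numbers are copied from the table; the variable names are illustrative only.

```python
# (hit rate WITH sorting, hit rate WITHOUT sorting), in percent
results = {
    "TriviaQA": (3.5, 22.6),
    "SCBench": (4.0, 13.6),
    "WikiText103": (10.0, 15.7),
    "LongEval": (9.7, 15.2),
    "NarrativeQA": (16.7, 17.4),
}

# Delta in percentage points, rounded to one decimal as in the table.
deltas = {
    name: round(without_sort - with_sort, 1)
    for name, (with_sort, without_sort) in results.items()
}

# Unweighted mean over the five datasets.
average = round(sum(deltas.values()) / len(deltas), 1)
```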
Signed-off-by: Zach Bennett <zach@worldflowai.com>
1 parent 8a7901b commit a786902
7 files changed
Lines changed: 2488 additions & 13 deletions
File tree
- deploy/benchmarks/sglang_nosort
- semblend_core
- semblend/integration/dynamo