
Commit a786902

perf: remove sentence sorting from embedding pipeline
The _order_invariant_text() function previously sorted sentences alphabetically before embedding to achieve order-invariant embeddings for the REORDER scenario. This is no longer needed because:

1. Full-document segmented mean pooling (added in v0.2.0) is inherently near-order-invariant (0.996 cosine for reordered documents)
2. Sorting broke on code, markdown, and non-English text (split on ". ")
3. Sorting slightly hurt cross-instruction similarity (-0.002)
4. Sorting destroyed instruction-document alignment in long contexts

Benchmark results (SGLang + SemBlend, Qwen2.5-7B-Instruct, A10G):

| Dataset     | WITH sorting | WITHOUT sorting | Delta   |
|-------------|--------------|-----------------|---------|
| TriviaQA    | 3.5%         | 22.6%           | +19.1pp |
| SCBench     | 4.0%         | 13.6%           | +9.6pp  |
| WikiText103 | 10.0%        | 15.7%           | +5.7pp  |
| LongEval    | 9.7%         | 15.2%           | +5.5pp  |
| NarrativeQA | 16.7%        | 17.4%           | +0.7pp  |

Hit rates improved on every dataset. Average improvement: +8.1pp.

Signed-off-by: Zach Bennett <zach@worldflowai.com>
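Reasons 1 and 2 in the commit message can be illustrated with a small toy sketch. It is not the project's code: `naive_sorted_text`, `toy_embed`, `mean_pool`, and `cosine` are hypothetical names, the sorter is only a guess at what splitting on ". " and sorting looked like, and the hash-based embedding stands in for a real model. It shows the ideal case where reordering moves whole segments, which is why the toy cosine comes out at essentially 1.0 rather than the measured 0.996.

```python
import hashlib
import math


def naive_sorted_text(text):
    """Hypothetical reconstruction of the removed step: split on '. ' and sort."""
    return ". ".join(sorted(text.split(". ")))


def toy_embed(segment, dim=8):
    """Stand-in for an embedding model: a deterministic vector derived from a hash."""
    digest = hashlib.sha256(segment.encode("utf-8")).digest()
    return [byte / 255.0 for byte in digest[:dim]]


def mean_pool(segments, dim=8):
    """Average the per-segment vectors; a mean does not depend on segment order."""
    vectors = [toy_embed(segment, dim) for segment in segments]
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Splitting on ". " treats markdown list numbers as sentences and scrambles them.
markdown = "## Setup\n1. Install the package. 2. Run the tests."
print(repr(naive_sorted_text(markdown)))

# Pooling the same segments in a different order gives (numerically) the same vector.
segments = ["first paragraph", "second paragraph", "third paragraph"]
print(cosine(mean_pool(segments), mean_pool(segments[::-1])))
```

In a real pipeline, segments are more likely fixed-size windows over the document, so a reorder changes what each window contains and the similarity drops slightly below 1.0.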
1 parent 8a7901b commit a786902

7 files changed

Lines changed: 2488 additions & 13 deletions
