
Q4 — Embedding model bottleneck: GTE-Large FP16 → Qwen3-Embedding-0.6B #115

@longobucco

Description


Problem

GTE-Large FP16 (~670 MB, 512-token context, English-centric) is outperformed by Qwen3-Embedding-0.6B as of April 2026:

| Model | MTEB avg | Max context | VRAM |
|---|---|---|---|
| GTE-Large FP16 | 63.1 | 512 tokens | ~670 MB |
| Qwen3-Embedding-0.6B | 68.4 | 8192 tokens | ~360 MB |
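The VRAM column is easier to interpret alongside a weight-memory estimate (parameters × bits per weight / 8). The following is a minimal sketch that assumes approximate parameter counts (about 335M for GTE-Large, about 0.6B for Qwen3-Embedding-0.6B), which this issue does not state. Under that assumption, GTE-Large's ~670 MB matches FP16 (16 bits per weight). By contrast, ~360 MB for a 0.6B-parameter model works out to roughly 4.8 bits per weight, which suggests that figure refers to a quantized build rather than FP16.

```javascript
// Rough weight-only memory estimate: parameters × bits per weight / 8.
// Ignores activations, attention buffers and runtime overhead.
// ASSUMPTION: parameter counts are approximate public figures,
// not values read from the qvac model files.
const MODELS = {
  "GTE-Large": { params: 335e6, vramMB: 670 },
  "Qwen3-Embedding-0.6B": { params: 0.6e9, vramMB: 360 },
};

/** Weight memory in MB (1 MB = 1e6 bytes) at a given precision. */
function weightMB(params, bitsPerWeight) {
  return (params * bitsPerWeight) / 8 / 1e6;
}

/** Bits per weight implied by a reported VRAM figure. */
function impliedBits(params, vramMB) {
  return (vramMB * 1e6 * 8) / params;
}

for (const [name, { params, vramMB }] of Object.entries(MODELS)) {
  console.log(
    `${name}: FP16 weights ≈ ${weightMB(params, 16).toFixed(0)} MB, ` +
      `reported ${vramMB} MB ⇒ ≈ ${impliedBits(params, vramMB).toFixed(1)} bits/weight`
  );
}
```

This estimate covers weights only. Activation memory grows with sequence length, so embedding chunks near the full 8192-token context will need more than the weight figure alone.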

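Qwen3-Embedding-0.6B is smaller, scores higher, and is multilingual. The multilingual support matters for a corpus that mixes Italian lecture notes with English whitepapers. Its 8192-token context also removes the child-chunk truncation issue (R4): any child chunk up to 8192 tokens is embedded in full, compared with 512 tokens today.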
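To confirm that R4 actually disappears for this corpus, real child-chunk lengths can be checked against each model's context window. This is a minimal sketch that assumes chunk token counts are already available from the target model's tokenizer. The function name and sample lengths are hypothetical and are included for illustration only.

```javascript
// Sketch: which child chunks would be truncated at a given embedding
// context limit? ASSUMPTION: token counts come from the target model's own
// tokenizer; counts from a different tokenizer are only approximate.
const CONTEXT_LIMITS = {
  "GTE-Large": 512,
  "Qwen3-Embedding-0.6B": 8192,
};

/**
 * @param {number[]} chunkTokenCounts token length of each child chunk
 * @param {number} limit model context window in tokens
 * @returns {{truncated: number, maxLost: number}} number of chunks over the
 *          limit, and the most tokens dropped from any single chunk
 */
function truncationReport(chunkTokenCounts, limit) {
  let truncated = 0;
  let maxLost = 0;
  for (const n of chunkTokenCounts) {
    if (n > limit) {
      truncated += 1;
      maxLost = Math.max(maxLost, n - limit);
    }
  }
  return { truncated, maxLost };
}

// Hypothetical chunk lengths, for illustration only.
const chunks = [380, 640, 1200, 450, 2900];
for (const [model, limit] of Object.entries(CONTEXT_LIMITS)) {
  console.log(model, truncationReport(chunks, limit));
}
```

Special tokens added by the tokenizer also count against the limit, so a chunk that sits right at the boundary may still lose a few tokens.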

Fix

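When `@qvac/sdk ≥ 1.0` ships `QWEN3_EMBEDDING_0_6B`, point `QVAC_EMB_SRC` in `workers/qvac-service/src/models.js` at it. Then re-ingest all documents. Vectors from GTE-Large and Qwen3-Embedding-0.6B live in different embedding spaces, so existing document vectors cannot be meaningfully compared with query vectors from the new model.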
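To make the re-ingest step harder to skip, the service could stamp the index with the model that produced its vectors. It would then refuse to serve queries when the stamp does not match the configured model. This is a minimal sketch. `needsReingest`, `stampIndex`, the `embeddingModel` metadata field, and the model ID strings are all hypothetical, not part of `@qvac/sdk`. The sketch also assumes the vector store can persist a small metadata record.

```javascript
// Sketch: refuse to mix vectors from different embedding models.
// ASSUMPTION: the vector store can persist a small metadata record; the
// field name `embeddingModel` and these helpers are hypothetical.

/**
 * Decide whether the store must be rebuilt before serving queries.
 * @param {{embeddingModel?: string} | null} indexMeta metadata saved at ingest
 *        time, or null when no index exists yet
 * @param {string} currentModelId identifier of the model behind QVAC_EMB_SRC
 */
function needsReingest(indexMeta, currentModelId) {
  // No index yet: nothing stale to rebuild.
  if (indexMeta === null) return false;
  // Indexes that predate the stamp (no embeddingModel field) count as stale.
  return indexMeta.embeddingModel !== currentModelId;
}

/** Metadata to persist after a full re-ingest with the current model. */
function stampIndex(currentModelId) {
  return { embeddingModel: currentModelId, ingestedAt: new Date().toISOString() };
}

// Usage with hypothetical model IDs.
const current = "Qwen3-Embedding-0.6B";
console.log(needsReingest({ embeddingModel: "GTE-Large-FP16" }, current)); // true
console.log(needsReingest(stampIndex(current), current)); // false
```

Comparing a stored model ID is deliberately stricter than comparing vector dimensions. Two models can emit vectors of the same length that are still not comparable.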

Effort: Small (config change + re-ingest)
Impact: High (better embedding quality, multilingual coverage, smaller memory footprint)
