ViR2 (Vietnamese Retrieval and Re-ranking) is a few-shot example selection method for Text-to-SQL that combines semantic retrieval with syntactic matching and diversity optimization.
| Method | Limitations |
|---|---|
| Random | No semantic understanding |
| DICL | Relies solely on semantic similarity, ignores grammatical structure |
| ASTRES | AST matching not suitable for Vietnamese |
| Skill-KNN | Requires LLM for skill extraction → expensive |
ViR2 combines three components:
- Semantic Similarity - PhoBERT embeddings
- Syntactic Matching - POS tagging (Vietnamese-aware)
- Diversity - Avoids redundant examples
Input: Question q, Training Pool P, Parameters (M, B, k, λ)
Stage 1: Semantic Retrieval
│
├─ Encode q using PhoBERT → embedding e_q
├─ Compute cosine similarity with all examples in P
└─ Select top-M candidates → C
│
↓
│
Stage 2: Beam Search Re-ranking
│
├─ Initialize beams = [[]]
├─ For each position i ∈ [1, k]:
│ ├─ For each beam:
│ │ ├─ For each candidate c ∈ C:
│ │ │ ├─ Create new_beam = beam + [c]
│ │ │ └─ Compute score(new_beam, q)
│ │ └─ Collect all candidate beams
│ └─ Keep top-B beams by score
└─ Return best beam (k examples)
Filter top-M candidates with high semantic similarity to the new question.
Input:
- Question $q$ (the new question)
- Meaning Pool $P = \{(q_1, s_1), (q_2, s_2), \ldots, (q_n, s_n)\}$
- Pool size $M$ (default: 50)
Process:
1. Encode question:
$$\mathbf{e}_q = \text{PhoBERT}(q)$$
2. Compute similarities:
$$\text{sim}(q, q_i) = \frac{\mathbf{e}_q \cdot \mathbf{e}_{q_i}}{\|\mathbf{e}_q\| \, \|\mathbf{e}_{q_i}\|}$$
3. Select top-M:
$$C = \text{TopK}(P, M, \text{by}=\text{sim})$$
Output: Candidate set $C$ of $M$ examples
PhoBERT Encoding:
- Model: vinai/phobert-base-v2
- Embedding dimension: 768
- Pooling: mean pooling over all token embeddings

Optimization:
- Pre-compute embeddings for the training pool → saved in dicl_candidates.json
- Only the new question is encoded at runtime
- Complexity: $O(|P|)$ similarity computations
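The Stage 1 flow above can be sketched with plain NumPy, assuming the pool embeddings are already precomputed (in practice both the query and pool vectors would come from PhoBERT mean pooling; the function name `top_m_candidates` is illustrative):

```python
import numpy as np

def top_m_candidates(e_q: np.ndarray, pool: np.ndarray, m: int = 50) -> np.ndarray:
    """Return indices of the m pool questions most similar to the query.

    e_q:  query embedding, shape (D,) -- e.g. D = 768 for PhoBERT
    pool: precomputed pool embeddings, shape (N, D)
    """
    # Cosine similarity against every pool example in one vectorized pass
    sims = pool @ e_q / (np.linalg.norm(pool, axis=1) * np.linalg.norm(e_q) + 1e-12)
    # Indices of the m largest similarities, highest first
    return np.argsort(sims)[::-1][:m]
```

Because the pool embeddings are cached, each query costs one encoder call plus a single matrix-vector product.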
From M candidates, select k optimal examples using beam search with a scoring function that combines POS matching and diversity.
Input:
- Candidates $C = \{c_1, \ldots, c_M\}$
- Question $q$
- Beam size $B$ (default: 5)
- Number of examples $k$ (default: 3)
Process:
1. Initialize: $\text{beams} = [\,[\,]\,]$ (one empty beam)
2. For each position $i \in [1, k]$:
   - $\text{candidates} = \emptyset$
   - For each $\text{beam} \in \text{beams}$:
     - For each $c \in C$:
       - $\text{new\_beam} = \text{beam} \cup \{c\}$
       - $\text{score} = f(\text{new\_beam}, q)$
       - $\text{candidates} = \text{candidates} \cup \{(\text{new\_beam}, \text{score})\}$
   - $\text{beams} = \text{TopB}(\text{candidates}, B)$
3. Return: the best beam (highest score)
Scoring function:
$$f(E, q) = (1 - \lambda) \cdot \text{POSMatch}(E, q) + \lambda \cdot \text{Diversity}(E)$$
where:
- $E = \{e_1, e_2, \ldots, e_k\}$ is the set of selected examples
- $\lambda$ is the diversity weight (default: 0.3)
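The Stage 2 loop can be sketched as follows. This is a minimal sketch with a pluggable `score_fn` standing in for $f$ (in ViR2, $f$ combines POS matching and diversity, weighted by $\lambda$); the helper name is illustrative:

```python
def beam_search_select(candidates, question, score_fn, k=3, beam_size=5):
    """Pick k examples from `candidates` maximizing score_fn(examples, question).

    score_fn: scores a partial list of examples against the question
    (in ViR2 this combines POS matching and diversity).
    """
    beams = [([], 0.0)]  # each beam: (selected examples, score)
    for _ in range(k):
        expanded = []
        for examples, _ in beams:
            for c in candidates:
                if c in examples:  # don't select the same example twice
                    continue
                new_examples = examples + [c]
                expanded.append((new_examples, score_fn(new_examples, question)))
        # Keep only the top-B partial selections
        beams = sorted(expanded, key=lambda b: b[1], reverse=True)[:beam_size]
    return beams[0]  # (best k examples, best score)
```

With a purely additive toy score the search degenerates to picking the k highest-scoring candidates; the beam only pays off when the score depends on the whole set, as the diversity term does.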
POS matching measures the grammatical-structure similarity between the new question and the example questions.
Input: Two questions $q_1$, $q_2$
Step 1: POS Tagging
Using underthesea (Vietnamese-specific):
q_1 = "Có bao nhiêu học sinh trong lớp 10A?" ("How many students are in class 10A?")
POS tags: [('Có', 'V'), ('bao nhiêu', 'M'), ('học sinh', 'N'), ('trong', 'E'), ('lớp', 'N'), ('10A', 'M')]
Step 2: Extract POS Distribution
Count tag frequencies and normalize:
$$P_j(t) = \frac{\text{count}(t, q_j)}{\sum_{t'} \text{count}(t', q_j)}, \quad j \in \{1, 2\}$$
Step 3: Compute Jensen-Shannon Divergence
$$\text{JSD}(P_1 \| P_2) = \frac{1}{2} \text{KL}(P_1 \| M) + \frac{1}{2} \text{KL}(P_2 \| M)$$
where $M = \frac{1}{2}(P_1 + P_2)$.
Step 4: POS Match Score
$$\text{POSMatch}(q_1, q_2) = 1 - \text{JSD}(P_1 \| P_2)$$
Range: $[0, 1]$ with base-2 logarithms; identical POS distributions score 1.
For a set of examples $E$, the per-example scores are averaged:
$$\text{POSMatch}(E, q) = \frac{1}{k} \sum_{e_i \in E} \text{POSMatch}(q, q_{e_i})$$
Why a Vietnamese-specific tagger:
- Vietnamese word order differs from English
- Complex grammatical structure (isolating language)
- underthesea understands Vietnamese-specific patterns
- POS tags: N (noun), V (verb), M (number), E (preposition), A (adjective), etc.
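The steps above can be sketched in pure Python. The tag sequences would come from underthesea's `pos_tag` in practice; the helper names, the base-2 logarithm (which keeps JSD in $[0, 1]$), and the shared tag vocabulary are implementation assumptions:

```python
import math
from collections import Counter

def pos_distribution(tags, vocab):
    """Step 2: normalized frequency of each POS tag over a shared vocabulary."""
    counts = Counter(tags)
    total = sum(counts.values())
    return [counts[t] / total for t in vocab]

def js_divergence(p, q):
    """Step 3: Jensen-Shannon divergence (base-2 logs keep it in [0, 1])."""
    def kl(a, b):
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)
    m = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def pos_match(tags_a, tags_b):
    """Step 4: POS match score = 1 - JSD; identical distributions score 1."""
    vocab = sorted(set(tags_a) | set(tags_b))
    return 1.0 - js_divergence(pos_distribution(tags_a, vocab),
                               pos_distribution(tags_b, vocab))
```

For example, the tag sequence of the question above is `['V', 'M', 'N', 'E', 'N', 'M']`; comparing it against itself yields a score of 1, while a question with a disjoint tag set scores 0.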
Diversity ensures the k examples are not too similar and cover multiple SQL patterns.
Input: Set of examples $E = \{e_1, \ldots, e_k\}$
Step 1: Compute Pairwise Similarities
Using the embeddings from Stage 1:
$$\text{sim}(e_i, e_j) = \frac{\mathbf{e}_{i} \cdot \mathbf{e}_{j}}{\|\mathbf{e}_{i}\| \, \|\mathbf{e}_{j}\|}$$
Step 2: Average Pairwise Similarity
$$\overline{\text{sim}}(E) = \frac{2}{k(k-1)} \sum_{i < j} \text{sim}(e_i, e_j)$$
Step 3: Diversity Score
$$\text{Diversity}(E) = 1 - \overline{\text{sim}}(E)$$
Range: $[0, 1]$ for non-negative similarities
- High diversity = semantically different examples
- Low diversity = overly similar (redundant) examples
- Balanced against POS matching via $\lambda$
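The diversity computation can be sketched with NumPy, assuming the 1-minus-average-pairwise-cosine form described above (the helper name is illustrative):

```python
import numpy as np

def diversity_score(embeddings: np.ndarray) -> float:
    """1 minus the average pairwise cosine similarity of the selected set.

    embeddings: shape (k, D), one Stage 1 embedding per selected example.
    """
    k = len(embeddings)
    if k < 2:
        return 1.0  # a single example is trivially diverse
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = normed @ normed.T
    # Average over the k*(k-1)/2 distinct pairs (strict upper triangle)
    i, j = np.triu_indices(k, k=1)
    return float(1.0 - sims[i, j].mean())
```

Identical embeddings give diversity 0 (fully redundant), mutually orthogonal embeddings give diversity 1.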
| Parameter | Symbol | Default | Range | Description |
|---|---|---|---|---|
| Candidate pool size | $M$ | 50 | [10, 200] | Stage 1 retrieval size |
| Beam size | $B$ | 5 | [1, 20] | Beam search width |
| Diversity weight | $\lambda$ | 0.3 | [0, 1] | Balance POS vs diversity |
| Examples | $k$ | 3 | [1, 10] | Final selection count |
Candidate pool size $M$:
- Large → better coverage but slower Stage 2
- Small → faster but may miss good examples
- Recommended: 50-100 for typical datasets

Beam size $B$:
- Large → better optimization but slower
- Small → faster but sub-optimal
- Recommended: 5-10

Diversity weight $\lambda$:
- High → prefer diverse examples (multiple patterns)
- Low → prefer similar structure (consistent style)
- Recommended: 0.2-0.5

Number of examples $k$:
- Many → better guidance but longer prompts, more tokens
- Few → shorter prompts but less guidance
- Recommended: 3-5
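The four hyperparameters and their defaults could be wired up as CLI flags along these lines (a sketch mirroring the flag names used in the usage examples; the parser structure itself is an assumption):

```python
import argparse

def vir2_arg_parser() -> argparse.ArgumentParser:
    """CLI flags for the four ViR2 hyperparameters, with the documented defaults."""
    parser = argparse.ArgumentParser(description="ViR2 example selection")
    parser.add_argument("--vir2-candidate-pool-size", type=int, default=50,
                        help="M: Stage 1 retrieval size (recommended 50-100)")
    parser.add_argument("--vir2-beam-size", type=int, default=5,
                        help="B: beam search width (recommended 5-10)")
    parser.add_argument("--vir2-diversity-weight", type=float, default=0.3,
                        help="lambda: POS matching vs diversity balance, in [0, 1]")
    parser.add_argument("--few-shot-examples", type=int, default=3,
                        help="k: number of selected examples (recommended 3-5)")
    return parser
```

Running with no flags reproduces the defaults from the table above.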
Stage 1 (Semantic Retrieval)
Time Complexity:
- Encoding the question: $O(L)$, where $L$ = question length
- Similarity computation: $O(|P|)$, where $|P|$ = pool size
- Total: $O(L + |P|)$

Space Complexity:
- Stored embeddings: $O(|P| \times D)$, where $D = 768$

Optimization:
- Pre-compute pool embeddings → saves $O(|P| \times L)$ encoding work per query
Stage 2 (Beam Search Re-ranking)
Time Complexity:
- For each position $i \in [1, k]$:
  - Expand $B$ beams with $M$ candidates → $B \times M$ new beams
  - Score computation per beam:
    - POS matching: $O(k \times L_{\text{avg}})$
    - Diversity: $O(k^2)$
  - Keep top-$B$: $O(B \times M \times \log B)$

Total: $O(k \times B \times M \times (k L_{\text{avg}} + k^2))$

Practical: with the defaults $k = 3$, $B = 5$, $M = 50$:
- Iterations: $3 \times 5 \times 50 = 750$ beam evaluations
- Fast on modern CPUs (~1-2 seconds)
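As a sanity check on the arithmetic, the evaluation count is just the product of the three loop bounds (an upper bound, since it ignores the single initial beam and the duplicate-candidate filter):

```python
def beam_evaluations(k: int, beam_size: int, pool_size: int) -> int:
    """Upper bound on candidate-beam scorings in Stage 2: k positions,
    each expanding beam_size beams by pool_size candidates."""
    return k * beam_size * pool_size

print(beam_evaluations(3, 5, 50))  # → 750 with the default settings
```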
To test the contribution of each component:

vir2-no-pos
Modification: remove the POS matching component
Usage:
python vipersql.py --strategy few-shot --example-selection-strategy vir2-no-pos

vir2-no-diversity
Modification: remove the diversity optimization (set $\lambda = 0$)
Usage:
python vipersql.py --strategy few-shot --example-selection-strategy vir2-no-diversity

vir2-no-beam-search
Modification: replace beam search with greedy top-$k$ selection
Algorithm:
- Score all $M$ candidates individually
- Select the top-$k$ by score
Usage:
python vipersql.py --strategy few-shot --example-selection-strategy vir2-no-beam-search

vs Random (completely random → no relevance). ViR2 advantages:
- Semantic relevance (Stage 1)
- Syntactic similarity (POS matching)
- Diversity optimization

vs DICL (only semantic similarity). ViR2 advantages:
- POS matching (syntactic structure)
- Diversity (avoids redundancy)

vs ASTRES (AST matching designed for English; requires SQL parsing). ViR2 advantages:
- Vietnamese-specific (PhoBERT + underthesea)
- POS matching suited to Vietnamese
- No SQL parsing required

vs Skill-KNN (requires an LLM to extract skills → expensive). ViR2 advantages:
- No LLM calls for preprocessing → faster and cheaper
- Direct POS matching (no intermediate skill extraction)
Question: "Có bao nhiêu học sinh trong lớp 10A?" ("How many students are in class 10A?")
Meaning Pool: 1000 training examples
Parameters: defaults ($M = 50$, $B = 5$, $k = 3$, $\lambda = 0.3$)
Stage 1: Semantic Retrieval
- Encode the question with PhoBERT → $\mathbf{e}_q \in \mathbb{R}^{768}$
- Compute similarity with all 1000 examples
- Select the top-50 candidates:
  - "Có bao nhiêu giáo viên?" ("How many teachers?") (sim=0.92)
  - "Liệt kê học sinh lớp 10B" ("List the students of class 10B") (sim=0.88)
  - "Số lượng học sinh toàn trường" ("Number of students in the whole school") (sim=0.85)
  - ...
Stage 2: Beam Search Re-ranking
Iteration 1 (select the 1st example):
- Try all 50 candidates as 1st example
- Compute score (only POS since no diversity yet)
- Keep top-5 beams:
- Beam 1: ["Có bao nhiêu giáo viên?"] score=0.94
- Beam 2: ["Số lượng học sinh toàn trường"] score=0.89
- ...
Iteration 2 (select 2nd example):
- For each beam, try adding each of 50 candidates
- Total: 5 × 50 = 250 combinations
- Compute score (POS + diversity now)
- Keep top-5 beams:
- Beam 1: [ex1, ex7] score=0.91
- Beam 2: [ex1, ex12] score=0.88
- ...
Iteration 3 (select 3rd example):
- Similar process → 250 combinations
- Keep top-5 beams
- Final best beam: [ex1, ex7, ex23] score=0.87
Output: 3 selected examples
python vipersql.py \
--strategy few-shot \
--example-selection-strategy vir2 \
--vir2-candidate-pool-size 50 \
--vir2-beam-size 5 \
--vir2-diversity-weight 0.3 \
  --few-shot-examples 3

ViR2 parameters (environment variables):
VIR2_CANDIDATE_POOL_SIZE=50
VIR2_BEAM_SIZE=5
VIR2_DIVERSITY_WEIGHT=0.3
FEW_SHOT_EXAMPLES=3

PhoBERT:
- Nguyen, D. Q., & Nguyen, A. T. (2020). PhoBERT: Pre-trained language models for Vietnamese.
underthesea:
- Vietnamese NLP toolkit: https://github.com/undertheseanlp/underthesea
Beam Search:
- Freitag, M., & Al-Onaizan, Y. (2017). Beam search strategies for neural machine translation.
Jensen-Shannon Divergence:
- Lin, J. (1991). Divergence measures based on the Shannon entropy.