
Commit d1da221

isPANN and claude committed
Add ClosestSubstring -> ILP reduction (#1035)
Position-character ILP encoding for ClosestSubstring: binary variables x_{r,a} for the center substring's character at each position, window-choice indicators y_{i,p} (exactly one window per source string), and an integer radius R bounded below by per-window Hamming-distance constraints, plus a tight R <= ell upper-bound constraint that is critical for ILP solver performance.

Closes #1035.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
1 parent 2842578 commit d1da221
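
The three variable families named in the commit message (x_{r,a}, y_{i,p}, and R) have to share one flat index space in the ILP. The following is a minimal sketch of such a layout, not the committed code: `VarLayout` and its method names are hypothetical and exist only for illustration, while the index formulas follow the ones used in the diff of `src/rules/closestsubstring_ilp.rs`.

```rust
/// Hypothetical helper (not part of the commit) that maps the three variable
/// families onto one flat index space: all x first, then all y, then R.
struct VarLayout {
    q: usize,
    y_base: usize,
    window_offsets: Vec<usize>,
    total_windows: usize,
}

impl VarLayout {
    /// `q` is the alphabet size, `ell` the window length, and `string_lens`
    /// the lengths of the input strings.
    fn new(q: usize, ell: usize, string_lens: &[usize]) -> Self {
        let mut window_offsets = Vec::with_capacity(string_lens.len());
        let mut acc = 0;
        for &len in string_lens {
            // Assumption: every string is long enough to hold one window.
            assert!(len >= ell, "every string must contain at least one window");
            window_offsets.push(acc);
            acc += len - ell + 1; // W_i = |s_i| - ell + 1
        }
        VarLayout { q, y_base: q * ell, window_offsets, total_windows: acc }
    }

    /// Index of x_{r, a}.
    fn x(&self, r: usize, a: usize) -> usize {
        r * self.q + a
    }

    /// Index of y_{i, p}.
    fn y(&self, i: usize, p: usize) -> usize {
        self.y_base + self.window_offsets[i] + p
    }

    /// Index of the radius variable R.
    fn radius(&self) -> usize {
        self.y_base + self.total_windows
    }

    /// Total number of ILP variables.
    fn num_vars(&self) -> usize {
        self.radius() + 1
    }
}
```

For a binary alphabet, window length 3, and three strings of length 5, this layout has 6 x-variables, 9 y-variables, and one radius variable.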

4 files changed

Lines changed: 416 additions & 0 deletions


docs/paper/reductions.typ

Lines changed: 25 additions & 0 deletions
@@ -13893,6 +13893,31 @@ The following reductions to Integer Linear Programming are straightforward formu
   _Solution extraction._ For each position $j$, read the unique symbol $a$ with $x_(j, a) = 1$; the resulting length-$m$ vector is the source center.
 ]
 
+#reduction-rule("ClosestSubstring", "ILP")[
+  Integer variables select one alphabet symbol at each center position and one window start per input string. A conditional radius constraint is activated by the window-choice indicator and upper-bounds the Hamming distance between the center and the selected window of each string.
+][
+  _Construction._ Given alphabet $Sigma$ of size $q$, $n$ input strings $s_1, dots, s_n$ over $Sigma$, and window length $ell$ with $W_i = |s_i| - ell + 1$:
+
+  _Variables:_ (1) $x_(r, a) in {0, 1}$ for $r in {0, dots, ell - 1}$ and $a in {0, dots, q - 1}$: $x_(r, a) = 1$ iff the center has symbol $a$ at position $r$. (2) $y_(i, p) in {0, 1}$ for input string $s_i$ and window start $p in {0, dots, W_i - 1}$: $y_(i, p) = 1$ iff window $p$ is selected from $s_i$. (3) Nonnegative integer $R$: an upper bound on the worst-case Hamming distance.
+
+  _Constraints:_ (1) Center assignment: $sum_(a = 0)^(q - 1) x_(r, a) = 1$ for every position $r$. (2) Window choice: $sum_(p = 0)^(W_i - 1) y_(i, p) = 1$ for every input string $s_i$. (3) Conditional radius: $R + sum_(r = 0)^(ell - 1) x_(r, s_i [p + r]) - ell dot y_(i, p) >= 0$ for every $(i, p)$. When $y_(i, p) = 1$, this is equivalent to $R >= ell - sum_r x_(r, s_i [p + r]) = d_H (c, s_i [p .. p + ell))$; when $y_(i, p) = 0$, the constraint reduces to $R + (text("nonneg match count")) >= 0$, which holds automatically.
+
+  _Objective:_ Minimize $R$.
+
+  The ILP is:
+  $
+    "minimize" quad & R \
+    "subject to" quad & sum_(a = 0)^(q - 1) x_(r, a) = 1 quad forall r in {0, dots, ell - 1} \
+    & sum_(p = 0)^(W_i - 1) y_(i, p) = 1 quad forall i in {1, dots, n} \
+    & R + sum_(r = 0)^(ell - 1) x_(r, s_i [p + r]) - ell dot y_(i, p) >= 0 quad forall i, p \
+    & x_(r, a), y_(i, p) in {0, 1}, quad R in ZZ_(>= 0).
+  $
+
+  _Correctness._ ($arrow.r.double$) Given an optimal center $c^* in Sigma^ell$ and optimal window starts $p_1^*, dots, p_n^*$, set $x_(r, c^*[r]) = 1$, $y_(i, p_i^*) = 1$, and $R = max_i d_H (c^*, s_i [p_i^* .. p_i^* + ell))$. The assignment and window-choice constraints hold by construction. For each pair $(i, p_i^*)$ the radius constraint becomes $R >= d_H (c^*, s_i [p_i^* .. p_i^* + ell))$, which holds with equality at the worst case; for every other $(i, p)$ with $y_(i, p) = 0$ the constraint is redundant. ($arrow.l.double$) The assignment and window-choice constraints force each block of $x$ and each block of $y$ to be one-hot, encoding a center $c$ and one window per input string. The conditional radius constraint is active exactly on the selected windows and forces $R >= d_H (c, s_i [p_i .. p_i + ell))$ for every $i$, so $R$ is at least the worst-case selected Hamming distance. Minimizing $R$ therefore minimizes the maximum Hamming distance over chosen windows.
+
+  _Solution extraction._ For each position $r$, read the unique symbol $a$ with $x_(r, a) = 1$ as the center symbol; for each input string $s_i$, read the unique $p$ with $y_(i, p) = 1$ as the selected window start.
+]
+
 #reduction-rule("LongestCommonSubsequence", "ILP")[
   An optimization ILP formulation maximizes the length of a common subsequence. Binary variables choose a symbol (or padding) at each witness position. Match variables link active positions to source string indices, and the objective maximizes the number of non-padding positions.
 ][
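
The conditional radius constraint of the ClosestSubstring rule above can be checked numerically. The sketch below is illustrative only: `radius_lhs` is a hypothetical helper, not part of the repository, and it assumes the center is already one-hot, so that the sum of `x` terms equals the number of positions where the center matches the window.

```rust
/// Left-hand side of the conditional radius constraint for one
/// (string, window) pair: R + (matching positions) - ell * y.
/// The constraint is satisfied when the returned value is >= 0.
fn radius_lhs(center: &[usize], window: &[usize], y: i64, radius: i64) -> i64 {
    assert_eq!(center.len(), window.len());
    let ell = center.len() as i64;
    // With a one-hot center, sum_r x_{r, window[r]} counts the matches.
    let matches = center.iter().zip(window).filter(|(c, w)| c == w).count() as i64;
    radius + matches - ell * y
}
```

With center `010` and window `000` the Hamming distance is 1. When the window is selected (`y = 1`), the left-hand side is `R - 1`, so `R = 1` satisfies the constraint and `R = 0` violates it. When the window is not selected (`y = 0`), the left-hand side is nonnegative for any `R >= 0`.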

src/rules/closestsubstring_ilp.rs

Lines changed: 225 additions & 0 deletions
@@ -0,0 +1,225 @@
+//! Reduction from ClosestSubstring to ILP (Integer Linear Programming).
+//!
+//! Given an alphabet of size `q`, `n` input strings `s_1, ..., s_n` (not
+//! necessarily of equal length), and a window length `ell`, the goal is to
+//! pick a center `c in Sigma^ell` and one length-`ell` window from each input
+//! string that together minimize the worst-case Hamming distance between the
+//! center and any chosen window. The ILP encoding combines the
+//! center-selection variables of ClosestString with one-hot window-choice
+//! indicators, plus a radius variable that is active only on each selected
+//! window.
+//!
+//! - Integer `x_{r, a}` for `r in {0, ..., ell - 1}` and
+//!   `a in {0, ..., q - 1}`: `x_{r, a} = 1` iff the center has symbol `a` at
+//!   position `r`. The non-negativity of ILP variables together with the
+//!   assignment constraint forces every `x_{r, a} in {0, 1}`.
+//! - Integer `y_{i, p}` for input string `s_i` and window start
+//!   `p in {0, ..., W_i - 1}` where `W_i = |s_i| - ell + 1`: `y_{i, p} = 1` iff
+//!   window `p` is selected from string `s_i`.
+//! - Nonnegative integer radius variable `R`.
+//! - Assignment constraint: `sum_a x_{r, a} = 1` for every position `r`.
+//! - Window-choice constraint: `sum_p y_{i, p} = 1` for every input string.
+//! - Conditional radius constraint per `(i, p)`:
+//!   `R + sum_{r} x_{r, s_i[p + r]} - ell * y_{i, p} >= 0`.
+//!   When `y_{i, p} = 1`, this becomes `R >= d_H(c, s_i[p..p + ell))`; when
+//!   `y_{i, p} = 0`, the constraint is automatically satisfied.
+//! - Objective: minimize `R`.
+//!
+//! Reference: Ming Li, Bin Ma, and Lusheng Wang, "On the closest string and
+//! substring problems," Journal of the ACM 49(2):157-171, 2002.
+//! <https://doi.org/10.1145/506147.506150>
+
+use crate::models::algebraic::{LinearConstraint, ObjectiveSense, ILP};
+use crate::models::misc::ClosestSubstring;
+use crate::reduction;
+use crate::rules::traits::{ReduceTo, ReductionResult};
+
+/// Result of reducing ClosestSubstring to ILP.
+///
+/// Variable layout (`ILP<i32>`, all non-negative):
+/// - `x_{r, a}` at index `r * alphabet_size + a` for `r in [0, ell)` and
+///   `a in [0, q)`, forced into `{0, 1}` by the assignment constraints.
+/// - `y_{i, p}` at index `q * ell + window_offsets[i] + p` for input string
+///   `s_i` and window start `p in [0, W_i)`, forced into `{0, 1}` by the
+///   window-choice constraints.
+/// - `R` (radius) at index `q * ell + total_num_windows`, a non-negative
+///   integer in `[0, ell]`.
+#[derive(Debug, Clone)]
+pub struct ReductionClosestSubstringToILP {
+    target: ILP<i32>,
+    alphabet_size: usize,
+    substring_length: usize,
+    /// Prefix sums of per-string window counts: `window_offsets[i]` is the
+    /// number of `y_{j, p}` variables for `j < i`. Has length `num_strings`.
+    window_offsets: Vec<usize>,
+    /// `window_counts[i] = W_i = |s_i| - ell + 1`.
+    window_counts: Vec<usize>,
+}
+
+impl ReductionResult for ReductionClosestSubstringToILP {
+    type Source = ClosestSubstring;
+    type Target = ILP<i32>;
+
+    fn target_problem(&self) -> &ILP<i32> {
+        &self.target
+    }
+
+    /// Decode the integer ILP assignment into the source config layout.
+    ///
+    /// `ClosestSubstring::evaluate` expects `config` of length `ell + n`: the
+    /// first `ell` entries are the center symbols, the remaining `n` entries
+    /// are per-string window starts. For each center position `r`, we pick the
+    /// unique alphabet symbol `a` with `x_{r, a} = 1`; for each input string
+    /// `s_i`, we pick the unique window start `p` with `y_{i, p} = 1`. When no
+    /// indicator is set to 1 in some block (which only happens on partial /
+    /// infeasible ILP solutions), we fall back to 0 so the returned vector
+    /// still has the expected shape.
+    fn extract_solution(&self, target_solution: &[usize]) -> Vec<usize> {
+        let q = self.alphabet_size;
+        let ell = self.substring_length;
+        let y_base = q * ell;
+
+        let mut out = Vec::with_capacity(ell + self.window_counts.len());
+
+        // Center symbols.
+        for r in 0..ell {
+            let symbol = (0..q)
+                .find(|&a| target_solution.get(r * q + a).copied().unwrap_or(0) == 1)
+                .unwrap_or(0);
+            out.push(symbol);
+        }
+
+        // Window starts.
+        for (i, &w_i) in self.window_counts.iter().enumerate() {
+            let start = (0..w_i)
+                .find(|&p| {
+                    target_solution
+                        .get(y_base + self.window_offsets[i] + p)
+                        .copied()
+                        .unwrap_or(0)
+                        == 1
+                })
+                .unwrap_or(0);
+            out.push(start);
+        }
+
+        out
+    }
+}
+
+#[reduction(
+    overhead = {
+        num_vars = "alphabet_size * substring_length + total_num_windows + 1",
+        num_constraints = "substring_length + num_strings + total_num_windows + 1",
+    }
+)]
+impl ReduceTo<ILP<i32>> for ClosestSubstring {
+    type Result = ReductionClosestSubstringToILP;
+
+    fn reduce_to(&self) -> Self::Result {
+        let q = self.alphabet_size();
+        let ell = self.substring_length();
+        let strings = self.strings();
+        let n = strings.len();
+
+        let window_counts: Vec<usize> = strings.iter().map(|s| s.len() - ell + 1).collect();
+        let mut window_offsets: Vec<usize> = Vec::with_capacity(n);
+        {
+            let mut acc = 0usize;
+            for &w in &window_counts {
+                window_offsets.push(acc);
+                acc += w;
+            }
+        }
+        let total_windows: usize = window_counts.iter().sum();
+
+        let x_idx = |r: usize, a: usize| -> usize { r * q + a };
+        let y_base = q * ell;
+        let y_idx = |i: usize, p: usize| -> usize { y_base + window_offsets[i] + p };
+        let r_idx = y_base + total_windows;
+        let num_vars = r_idx + 1;
+
+        let mut constraints: Vec<LinearConstraint> =
+            Vec::with_capacity(ell + n + total_windows + 1);
+
+        // Assignment constraints: exactly one symbol per center position.
+        // Together with the non-negativity built into ILP<i32>, this also
+        // forces every x_{r, a} to lie in {0, 1}.
+        for r in 0..ell {
+            let terms: Vec<(usize, f64)> = (0..q).map(|a| (x_idx(r, a), 1.0)).collect();
+            constraints.push(LinearConstraint::eq(terms, 1.0));
+        }
+
+        // Tight upper bound on R: the worst-case Hamming distance over a
+        // length-ell window is at most ell. Added as a single-term `<=`
+        // constraint so the solver's bound-tightening pass (which scans for
+        // exactly this pattern) picks it up. Without this, R defaults to the
+        // full i32 domain, which severely degrades HiGHS performance even on
+        // tiny instances.
+        constraints.push(LinearConstraint::le(vec![(r_idx, 1.0)], ell as f64));
+
+        // Window-choice constraints: exactly one window per input string.
+        // Combined with non-negativity, this forces every y_{i, p} in {0, 1}.
+        for (i, &w_i) in window_counts.iter().enumerate() {
+            let terms: Vec<(usize, f64)> = (0..w_i).map(|p| (y_idx(i, p), 1.0)).collect();
+            constraints.push(LinearConstraint::eq(terms, 1.0));
+        }
+
+        // Conditional radius constraints: for every (input string, window
+        // start) pair, R + sum_r x_{r, s_i[p + r]} - ell * y_{i, p} >= 0.
+        //   - If y_{i, p} = 1: R >= ell - sum_r x_{r, s_i[p + r]} = d_H(c, window).
+        //   - If y_{i, p} = 0: the LHS is R + (nonneg match count) >= 0,
+        //     automatically satisfied because R >= 0.
+        for (i, s) in strings.iter().enumerate() {
+            for p in 0..window_counts[i] {
+                let mut terms: Vec<(usize, f64)> = Vec::with_capacity(ell + 2);
+                terms.push((r_idx, 1.0));
+                for r in 0..ell {
+                    terms.push((x_idx(r, s[p + r]), 1.0));
+                }
+                terms.push((y_idx(i, p), -(ell as f64)));
+                constraints.push(LinearConstraint::ge(terms, 0.0));
+            }
+        }
+
+        // Objective: minimize R.
+        let objective = vec![(r_idx, 1.0)];
+
+        let target = ILP::new(num_vars, constraints, objective, ObjectiveSense::Minimize);
+
+        ReductionClosestSubstringToILP {
+            target,
+            alphabet_size: q,
+            substring_length: ell,
+            window_offsets,
+            window_counts,
+        }
+    }
+}
+
+#[cfg(feature = "example-db")]
+pub(crate) fn canonical_rule_example_specs() -> Vec<crate::example_db::specs::RuleExampleSpec> {
+    vec![crate::example_db::specs::RuleExampleSpec {
+        id: "closestsubstring_to_ilp",
+        build: || {
+            // Canonical issue #1033 instance: binary alphabet, length-3
+            // windows on three length-5 strings. Optimum radius is 1; one
+            // optimal center is 010 with windows (0, 1, 0) selecting 000,
+            // 010, 110 from s_1, s_2, s_3 respectively.
+            let source = ClosestSubstring::new(
+                2,
+                vec![
+                    vec![0, 0, 0, 1, 1],
+                    vec![1, 0, 1, 0, 0],
+                    vec![1, 1, 0, 0, 1],
+                ],
+                3,
+            );
+            crate::example_db::specs::rule_example_via_ilp::<_, i32>(source)
+        },
+    }]
+}
+
+#[cfg(test)]
+#[path = "../unit_tests/rules/closestsubstring_ilp.rs"]
+mod tests;
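
The canonical instance in `canonical_rule_example_specs` claims an optimum radius of 1 with center `010`. That claim can be cross-checked without an ILP solver by exhaustive search. The sketch below is an independent illustration, not code from the commit: `hamming`, `center_radius`, and `brute_force_radius` are hypothetical names, and the search enumerates all `q^ell` centers, so it is only suitable for tiny instances.

```rust
/// Hamming distance between two equal-length symbol slices.
fn hamming(a: &[usize], b: &[usize]) -> usize {
    a.iter().zip(b).filter(|(x, y)| x != y).count()
}

/// Radius of a fixed center: best window per string, then the worst string.
/// Assumes a non-empty center; a string shorter than the center has no window.
fn center_radius(center: &[usize], strings: &[Vec<usize>]) -> usize {
    let ell = center.len();
    strings
        .iter()
        .map(|s| {
            s.windows(ell)
                .map(|w| hamming(center, w))
                .min()
                .unwrap_or(usize::MAX) // no window available in this string
        })
        .max()
        .unwrap_or(0)
}

/// Exhaustive optimum over all q^ell centers.
fn brute_force_radius(q: usize, strings: &[Vec<usize>], ell: usize) -> usize {
    assert!(q >= 1 && ell >= 1, "sketch assumes a non-empty alphabet and window");
    let mut best = usize::MAX;
    let mut center = vec![0usize; ell];
    loop {
        best = best.min(center_radius(&center, strings));
        // Advance `center` as a base-q counter, least significant digit first.
        let mut pos = 0;
        while pos < ell {
            center[pos] += 1;
            if center[pos] < q {
                break;
            }
            center[pos] = 0;
            pos += 1;
        }
        if pos == ell {
            return best; // the counter wrapped around: every center was tried
        }
    }
}
```

On the three strings `00011`, `10100`, `11001` with window length 3, no length-3 substring is shared by all three strings, so radius 0 is impossible, and center `010` reaches radius 1 via the windows `000`, `010`, `110`.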

src/rules/mod.rs

Lines changed: 3 additions & 0 deletions
@@ -166,6 +166,8 @@ pub(crate) mod circuit_ilp;
 #[cfg(feature = "ilp-solver")]
 pub(crate) mod closeststring_ilp;
 #[cfg(feature = "ilp-solver")]
+pub(crate) mod closestsubstring_ilp;
+#[cfg(feature = "ilp-solver")]
 pub(crate) mod clustering_ilp;
 #[cfg(feature = "ilp-solver")]
 pub(crate) mod coloring_ilp;
@@ -551,6 +553,7 @@ pub(crate) fn canonical_rule_example_specs() -> Vec<crate::example_db::specs::Ru
     specs.extend(capacityassignment_ilp::canonical_rule_example_specs());
     specs.extend(circuit_ilp::canonical_rule_example_specs());
     specs.extend(closeststring_ilp::canonical_rule_example_specs());
+    specs.extend(closestsubstring_ilp::canonical_rule_example_specs());
     specs.extend(clustering_ilp::canonical_rule_example_specs());
     specs.extend(coloring_ilp::canonical_rule_example_specs());
     specs.extend(consecutiveblockminimization_ilp::canonical_rule_example_specs());
