Skip to content

[Model] LongestCommonSubsequence #108

@zazabap

Description

@zazabap

Motivation

Foundational string problem underlying diff/version control, bioinformatics (DNA/protein alignment), and data compression. Poly-time for 2 strings via DP, but NP-hard for $k$ strings. Has known reductions to ILP and SAT. Garey & Johnson SR10. This problem is also the core of the git diff.

Definition

Name: LongestCommonSubsequence
Reference: Garey & Johnson, 1979 (SR10); Maier, 1978

Given a finite alphabet $\Sigma$ and a set of $k$ strings $R = {s_1, \ldots, s_k}$ over $\Sigma$, find a longest string $w$ that is a subsequence of every $s_i \in R$.

A string $w$ is a subsequence of $s$ if $w$ can be obtained by deleting zero or more characters from $s$ without changing the order of the remaining characters.

Variables

  • Count: $m$ (length of the shortest string in $R$; upper bound on solution length)
  • Per-variable domain: $\Sigma$ (alphabet symbols)
  • Meaning: $w_j$ is the $j$-th character of the common subsequence $w$

Schema (data type)

Type name: LongestCommonSubsequence

Field Type Description
strings Vec<Vec<u8>> The input strings $R = {s_1, \ldots, s_k}$

Problem Size

Metric Expression Description
num_strings $k$ Number of input strings
total_length $\sum_i \lvert s_i \rvert$ Sum of all string lengths

Complexity

  • Two strings: $O(mn)$ via dynamic programming (Wagner & Fischer, 1974)
  • $k$ strings: NP-hard (Maier, 1978)
  • Approximation: Not approximable within $n^{1/4-\varepsilon}$ for any $\varepsilon &gt; 0$; approximable within $O(m / \log m)$ where $m = \min_i |s_i|$

How to solve

  • It can be solved by (existing) bruteforce.
  • It can be solved by reducing to integer programming, through a new LCS → ILP rule issue (to be filed).

Bruteforce: enumerate all $2^m$ subsequences of the shortest string $s_1$, check each against all other strings, return the longest common one.

Example Instance

$k = 3$ strings over alphabet $\Sigma = {A, B, C, D}$:

String Value
$s_1$ ABCDAB
$s_2$ BDCABA
$s_3$ BCADBA

Optimal solution: $w$ = BCAB, length 4.

Verification ($w$ is a subsequence of each $s_i$):

String Matched positions Trace
ABCDAB 1, 2, 4, 5 ABCDAB
BDCABA 0, 2, 3, 4 BDCABA
BCADBA 0, 1, 2, 4 BCADBA

Note: pairwise LCS lengths are 4, 4, and 5 respectively — the 3-string LCS is harder than any pairwise LCS. For $k = 2$ this problem is solvable in $O(mn)$ time, but the $k$-string generalization is NP-hard.

Metadata

Metadata

Assignees

No one assigned

    Labels

    GoodAn issue passed all checks.

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions