Commit 7952b3d

Fix #452: [Model] ConsistencyOfDatabaseFrequencyTables (#718)
* Add plan for #452: [Model] ConsistencyOfDatabaseFrequencyTables
* Add ConsistencyOfDatabaseFrequencyTables model
* Add CLI creation for frequency table consistency
* Add ILP reduction for frequency table consistency
* Add paper entry for frequency table consistency
* chore: remove plan file after implementation
* Fix formatting after merge with main
* Add reduction-rule paper entry for ConsistencyOfDatabaseFrequencyTables -> ILP
* Add reduction tests using full issue instance (6 objects, 3 attrs, 2 tables)
1 parent 2fec9f9 commit 7952b3d

12 files changed

Lines changed: 1286 additions & 23 deletions

docs/paper/reductions.typ

Lines changed: 72 additions & 0 deletions
@@ -108,6 +108,7 @@
 "BoundedComponentSpanningForest": [Bounded Component Spanning Forest],
 "BinPacking": [Bin Packing],
 "BoyceCoddNormalFormViolation": [Boyce-Codd Normal Form Violation],
+"ConsistencyOfDatabaseFrequencyTables": [Consistency of Database Frequency Tables],
 "ClosestVectorProblem": [Closest Vector Problem],
 "ConsecutiveSets": [Consecutive Sets],
 "MinimumMultiwayCut": [Minimum Multiway Cut],
@@ -3201,6 +3202,77 @@ A classical NP-complete problem from Garey and Johnson @garey1979[Ch.~3, p.~76],
 A relation satisfies _Boyce-Codd Normal Form_ (BCNF) if every non-trivial functional dependency $X arrow.r Y$ has $X$ as a superkey --- that is, $X^+$ = $A'$. This classical NP-complete problem from database theory asks whether the given attribute subset $A'$ violates BCNF. The NP-completeness was established by Beeri and Bernstein (1979) via reduction from Hitting Set. It appears as problem SR29 in Garey and Johnson's compendium (category A4: Storage and Retrieval).
 ]

+#{
+let x = load-model-example("ConsistencyOfDatabaseFrequencyTables")
+let num_objects = x.instance.num_objects
+let num_attrs = x.instance.attribute_domains.len()
+let domains = x.instance.attribute_domains
+let table01 = x.instance.frequency_tables.at(0).counts
+let table12 = x.instance.frequency_tables.at(1).counts
+let config = x.optimal_config
+let value = (object, attr) => config.at(object * num_attrs + attr)
+[
+#problem-def("ConsistencyOfDatabaseFrequencyTables")[
+Given a finite set $V$ of objects, a finite set $A$ of attributes, a domain $D_a$ for each $a in A$, a collection of pairwise frequency tables $f_(a,b): D_a times D_b -> ZZ^(>=0)$ whose entries sum to $|V|$, and a set $K subset.eq V times A times union_(a in A) D_a$ of known triples $(v, a, x)$, determine whether there exist functions $g_a: V -> D_a$ such that $g_a(v) = x$ for every $(v, a, x) in K$ and, for every published table $f_(a,b)$, exactly $f_(a,b)(x, y)$ objects satisfy $(g_a(v), g_b(v)) = (x, y)$.
+][
+Consistency of Database Frequency Tables is Garey and Johnson's storage-and-retrieval problem SR35 @garey1979. It asks whether released pairwise marginals can come from some hidden microdata table while respecting already known individual attribute values, making it a natural decision problem in statistical disclosure control. The direct witness space implemented in this crate assigns one categorical variable to each object-attribute pair, so exhaustive search runs in $O^*((product_(a in A) |D_a|)^(|V|))$. #footnote[This is the exact search bound induced by the implementation's configuration space; no faster general exact worst-case algorithm is claimed here.]
+
+*Example.* Let $|V| = #num_objects$ with attributes $a_0, a_1, a_2$ having domain sizes $#domains.at(0)$, $#domains.at(1)$, and $#domains.at(2)$ respectively. Publish the pairwise tables
+
+#align(center, table(
+columns: 4,
+align: center,
+table.header([$f_(a_0, a_1)$], [$0$], [$1$], [$2$]),
+[$0$], [#table01.at(0).at(0)], [#table01.at(0).at(1)], [#table01.at(0).at(2)],
+[$1$], [#table01.at(1).at(0)], [#table01.at(1).at(1)], [#table01.at(1).at(2)],
+))
+
+and
+
+#align(center, table(
+columns: 3,
+align: center,
+table.header([$f_(a_1, a_2)$], [$0$], [$1$]),
+[$0$], [#table12.at(0).at(0)], [#table12.at(0).at(1)],
+[$1$], [#table12.at(1).at(0)], [#table12.at(1).at(1)],
+[$2$], [#table12.at(2).at(0)], [#table12.at(2).at(1)],
+))
+
+together with the known values $K = {(v_0, a_0, 0), (v_3, a_0, 1), (v_1, a_2, 1)}$. One consistent completion is:
+
+#align(center, table(
+columns: 4,
+align: center,
+table.header([object], [$a_0$], [$a_1$], [$a_2$]),
+[$v_0$], [#value(0, 0)], [#value(0, 1)], [#value(0, 2)],
+[$v_1$], [#value(1, 0)], [#value(1, 1)], [#value(1, 2)],
+[$v_2$], [#value(2, 0)], [#value(2, 1)], [#value(2, 2)],
+[$v_3$], [#value(3, 0)], [#value(3, 1)], [#value(3, 2)],
+[$v_4$], [#value(4, 0)], [#value(4, 1)], [#value(4, 2)],
+[$v_5$], [#value(5, 0)], [#value(5, 1)], [#value(5, 2)],
+))
+
+This witness satisfies every published count: in $f_(a_0, a_1)$ each of the six cells appears exactly once, while in $f_(a_1, a_2)$ the five occupied cells have multiplicities $1, 1, 2, 1, 1$ exactly as listed above. It also respects all three known triples, so the answer is YES.
+]
+]
+}
+
+#reduction-rule("ConsistencyOfDatabaseFrequencyTables", "ILP")[
+Each object-attribute pair is encoded by a one-hot binary vector over its domain, and each pairwise frequency count becomes a linear equality over McCormick auxiliary variables that linearize the product of two one-hot indicators. Known values are fixed by pinning the corresponding indicator to 1. The resulting ILP is a pure feasibility problem (trivial objective).
+][
+_Construction._ Let $V$ be the set of objects, $A$ the set of attributes with domains $D_a$, $cal(T)$ the set of published frequency tables, and $K$ the set of known triples $(v, a, x)$.
+
+_Variables:_ (1) Binary one-hot indicators $y_(v,a,x) in {0, 1}$ for each object $v in V$, attribute $a in A$, and value $x in D_a$: $y_(v,a,x) = 1$ iff object $v$ takes value $x$ for attribute $a$. (2) Binary auxiliary variables $z_(t,v,x,x') in {0, 1}$ for each table $t in cal(T)$ (with attribute pair $(a, b)$), object $v in V$, and cell $(x, x') in D_a times D_b$: $z_(t,v,x,x') = 1$ iff object $v$ realizes cell $(x, x')$ in table $t$.
+
+_Constraints:_ (1) One-hot: $sum_(x in D_a) y_(v,a,x) = 1$ for all $v in V$, $a in A$. (2) Known values: $y_(v,a,x) = 1$ for each $(v, a, x) in K$. (3) McCormick linearization for $z_(t,v,x,x') = y_(v,a,x) dot y_(v,b,x')$: $z_(t,v,x,x') lt.eq y_(v,a,x)$, $z_(t,v,x,x') lt.eq y_(v,b,x')$, $z_(t,v,x,x') gt.eq y_(v,a,x) + y_(v,b,x') - 1$. (4) Frequency counts: $sum_(v in V) z_(t,v,x,x') = f_t (x, x')$ for each table $t$ and cell $(x, x')$.
+
+_Objective:_ Minimize $0$ (feasibility problem).
+
+_Correctness._ ($arrow.r.double$) A consistent assignment defines one-hot indicators and their products; all constraints hold by construction, and the frequency equalities match the published counts. ($arrow.l.double$) Any feasible binary solution assigns exactly one value per object-attribute (one-hot), respects known values, and the McCormick constraints force $z_(t,v,x,x') = y_(v,a,x) dot y_(v,b,x')$ for binary variables, so the frequency equalities certify consistency.
+
+_Solution extraction._ For each object $v$ and attribute $a$, find $x$ with $y_(v,a,x) = 1$; assign value $x$ to $(v, a)$.
+]
+
 #problem-def("SumOfSquaresPartition")[
 Given a finite set $A = {a_0, dots, a_(n-1)}$ with sizes $s(a_i) in ZZ^+$, a positive integer $K lt.eq |A|$ (number of groups), and a positive integer $J$ (bound), determine whether $A$ can be partitioned into $K$ disjoint sets $A_1, dots, A_K$ such that $sum_(i=1)^K (sum_(a in A_i) s(a))^2 lt.eq J$.
 ][
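
Note on the reduction-rule entry in this file: the forward direction of the correctness argument can be illustrated with a small standalone Rust sketch. It is not code from this commit. The names `Table`, `num_ilp_vars`, `satisfies_ilp`, and `example_is_feasible` are hypothetical, and the witness below was derived by hand from the instance in the CLI example (6 objects, domains 2,3,2), so it may differ from the crate's stored `optimal_config`. The sketch sets y from an assignment, takes z as the product of two indicators, and checks constraints (1) to (4).

```rust
/// Hypothetical container for one pairwise table: counts[x][y] = f_{a,b}(x, y).
struct Table {
    a: usize,
    b: usize,
    counts: Vec<Vec<usize>>,
}

/// Variable count of the construction: |V| * sum_a |D_a| indicators y,
/// plus |V| * |D_a| * |D_b| auxiliaries z for each table over (a, b).
/// Assumes every table's attribute indices are valid for `domains`.
fn num_ilp_vars(num_objects: usize, domains: &[usize], tables: &[Table]) -> usize {
    let y = num_objects * domains.iter().sum::<usize>();
    let z: usize = tables
        .iter()
        .map(|t| num_objects * domains[t.a] * domains[t.b])
        .sum();
    y + z
}

/// Derives y and z from `assign` (assign[v][a] = value of attribute a for
/// object v) and returns true iff constraints (1)-(4) all hold.
fn satisfies_ilp(
    assign: &[Vec<usize>],
    domains: &[usize],
    tables: &[Table],
    known: &[(usize, usize, usize)],
) -> bool {
    if assign.iter().any(|row| row.len() != domains.len()) {
        return false;
    }
    // y(v, a, x) = 1 iff object v takes value x for attribute a.
    let y = |v: usize, a: usize, x: usize| -> i64 { (assign[v][a] == x) as i64 };

    // (1) One-hot: exactly one in-domain value per object-attribute pair.
    for v in 0..assign.len() {
        for (a, &d) in domains.iter().enumerate() {
            if (0..d).map(|x| y(v, a, x)).sum::<i64>() != 1 {
                return false;
            }
        }
    }
    // (2) Known values: the pinned indicator must equal 1.
    if known
        .iter()
        .any(|&(v, a, x)| v >= assign.len() || a >= domains.len() || y(v, a, x) != 1)
    {
        return false;
    }
    for t in tables {
        if t.a >= domains.len() || t.b >= domains.len() {
            return false;
        }
        for x in 0..domains[t.a] {
            for xp in 0..domains[t.b] {
                let mut total = 0usize;
                for v in 0..assign.len() {
                    let (ya, yb) = (y(v, t.a, x), y(v, t.b, xp));
                    let z = ya * yb; // the product that McCormick linearizes
                    // (3) McCormick inequalities.
                    if z > ya || z > yb || z < ya + yb - 1 {
                        return false;
                    }
                    total += z as usize;
                }
                // (4) Frequency count for cell (x, x') of table t.
                if t.counts.get(x).and_then(|row| row.get(xp)) != Some(&total) {
                    return false;
                }
            }
        }
    }
    true
}

/// Instance from the CLI example with a hand-derived witness.
fn example_is_feasible() -> bool {
    let domains: Vec<usize> = vec![2, 3, 2];
    let tables = vec![
        Table { a: 0, b: 1, counts: vec![vec![1, 1, 1], vec![1, 1, 1]] },
        Table { a: 1, b: 2, counts: vec![vec![1, 1], vec![0, 2], vec![1, 1]] },
    ];
    let known: Vec<(usize, usize, usize)> = vec![(0, 0, 0), (3, 0, 1), (1, 2, 1)];
    let assign: Vec<Vec<usize>> = vec![
        vec![0, 0, 0], vec![0, 1, 1], vec![0, 2, 0],
        vec![1, 0, 1], vec![1, 1, 1], vec![1, 2, 1],
    ];
    satisfies_ilp(&assign, &domains, &tables, &known)
}
```

For this instance the construction has 6 * 7 = 42 indicators y and 36 auxiliaries z per table, 114 binary variables in total. Because z is computed as an exact product here, the McCormick checks can only pass; they are kept to mirror constraint (3) as written.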

problemreductions-cli/src/cli.rs

Lines changed: 14 additions & 0 deletions
@@ -265,6 +265,7 @@ Flags by problem type:
 RuralPostman (RPP) --graph, --edge-weights, --required-edges, --bound
 MultipleChoiceBranching --arcs [--weights] --partition --bound [--num-vertices]
 AdditionalKey --num-attributes, --dependencies, --relation-attrs [--known-keys]
+ConsistencyOfDatabaseFrequencyTables --num-objects, --attribute-domains, --frequency-tables [--known-values]
 SubgraphIsomorphism --graph (host), --pattern (pattern)
 LCS --strings, --bound [--alphabet-size]
 FAS --arcs [--weights] [--num-vertices]
@@ -312,6 +313,7 @@ Examples:
 pred create MIS/UnitDiskGraph --positions \"0,0;1,0;0.5,0.8\" --radius 1.5
 pred create MIS --random --num-vertices 10 --edge-prob 0.3
 pred create MultiprocessorScheduling --lengths 4,5,3,2,6 --num-processors 2 --deadline 10
+pred create ConsistencyOfDatabaseFrequencyTables --num-objects 6 --attribute-domains \"2,3,2\" --frequency-tables \"0,1:1,1,1|1,1,1;1,2:1,1|0,2|1,1\" --known-values \"0,0,0;3,0,1;1,2,1\"
 pred create BiconnectivityAugmentation --graph 0-1,1-2,2-3 --potential-edges 0-2:3,0-3:4,1-3:2 --budget 5
 pred create FVS --arcs \"0>1,1>2,2>0\" --weights 1,1,1
 pred create UndirectedTwoCommodityIntegralFlow --graph 0-2,1-2,2-3 --capacities 1,1,2 --source-1 0 --sink-1 3 --source-2 1 --sink-2 3 --requirement-1 1 --requirement-2 1
@@ -608,6 +610,18 @@ pub struct CreateArgs {
 /// Known candidate keys for AdditionalKey (e.g., "0,1;2,3")
 #[arg(long)]
 pub known_keys: Option<String>,
+/// Number of objects for ConsistencyOfDatabaseFrequencyTables
+#[arg(long)]
+pub num_objects: Option<usize>,
+/// Attribute-domain sizes for ConsistencyOfDatabaseFrequencyTables (comma-separated, e.g., "2,3,2")
+#[arg(long)]
+pub attribute_domains: Option<String>,
+/// Pairwise frequency tables for ConsistencyOfDatabaseFrequencyTables (e.g., "0,1:1,1|0,1;1,2:1,0|0,1")
+#[arg(long)]
+pub frequency_tables: Option<String>,
+/// Known value triples for ConsistencyOfDatabaseFrequencyTables (e.g., "0,0,0;3,1,2")
+#[arg(long)]
+pub known_values: Option<String>,
 /// Domain size for ConjunctiveBooleanQuery
 #[arg(long)]
 pub domain_size: Option<usize>,
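
Note on the new flags: the string formats can be read off the help text and the `pred create` example above. Tables appear to be separated by `;`, each written as `a,b:` followed by rows separated by `|`, with comma-separated counts. Known values appear to be `object,attr,value` triples separated by `;`. The Rust sketch below parses strings under that reading. It is an assumption-based illustration, not the parser added by this commit, and the function names `parse_usize_list`, `parse_frequency_tables`, and `parse_known_values` are hypothetical.

```rust
/// Parses a comma-separated list of non-negative integers, e.g. "2,3,2".
fn parse_usize_list(s: &str) -> Result<Vec<usize>, String> {
    s.split(',')
        .map(|tok| {
            tok.trim()
                .parse::<usize>()
                .map_err(|e| format!("bad integer '{}': {}", tok, e))
        })
        .collect()
}

/// Parses "a,b:r0|r1|...;..." into (a, b, counts) per table.
/// The separator roles are inferred from the CLI example, not from the crate.
fn parse_frequency_tables(s: &str) -> Result<Vec<(usize, usize, Vec<Vec<usize>>)>, String> {
    s.split(';')
        .map(|tbl| -> Result<(usize, usize, Vec<Vec<usize>>), String> {
            let (pair, rows) = tbl
                .split_once(':')
                .ok_or_else(|| format!("missing ':' in table '{}'", tbl))?;
            let attrs = parse_usize_list(pair)?;
            if attrs.len() != 2 {
                return Err(format!("expected 'a,b' before ':', got '{}'", pair));
            }
            let counts = rows
                .split('|')
                .map(parse_usize_list)
                .collect::<Result<Vec<_>, _>>()?;
            Ok((attrs[0], attrs[1], counts))
        })
        .collect()
}

/// Parses "object,attr,value;..." into triples.
fn parse_known_values(s: &str) -> Result<Vec<(usize, usize, usize)>, String> {
    s.split(';')
        .map(|t| -> Result<(usize, usize, usize), String> {
            let v = parse_usize_list(t)?;
            match v.as_slice() {
                [o, a, x] => Ok((*o, *a, *x)),
                _ => Err(format!("expected 'object,attr,value', got '{}'", t)),
            }
        })
        .collect()
}
```

Under this reading, the `--frequency-tables` value from the example yields two tables, the second being a 3 by 2 table over attributes 1 and 2 with rows 1,1 then 0,2 then 1,1. A real parser would presumably also check row lengths against `--attribute-domains` and that each table sums to `--num-objects`; the sketch leaves that out.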
