Replace Java DUKE with a pure-Python rapidfuzz matching engine#301

Draft
FabianHofmann wants to merge 4 commits into master from feat/rapidfuzz-matching-engine

FabianHofmann commented Jun 22, 2026

Contributor

I am running some side experiments on replacing the old, slow DUKE binary as the backend for deduplication and matching.

Note

AI generated draft
Fully experimental

Summary

Replaces the Java/JVM DUKE record-linkage engine with a pure-Python, vectorised backend (powerplantmatching.linkage, built on rapidfuzz + numpy) and makes it the sole matching engine. Matching no longer requires a Java installation or the bundled DUKE .jar binaries.

The new engine encodes the former DUKE field weights and thresholds verbatim (from Comparison.xml / Deleteduplicates.xml) and uses the same Fellegi-Sunter belief-update scoring, so matching behaviour is preserved — but it is substantially faster (~17× linkage, ~10× dedup) at equal-or-better quality on an objective ground truth recovered from powerplants.csv. Rationale and benchmark numbers are in analysis/linkage_findings.md.
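For reviewers unfamiliar with the scoring, here is a minimal sketch of a DUKE-style belief update as I understand it: each field's similarity is mapped to a probability bounded by that field's low and high weights, and the per-field probabilities are folded into a single score with a naive-Bayes combination. The function names, field names, and weights below are placeholders for illustration. They are not the values from Comparison.xml and not the linkage.py API.

```python
from functools import reduce


def field_probability(similarity: float, low: float, high: float) -> float:
    """Map a similarity in [0, 1] to a match probability.

    Assumed DUKE-style mapping: a similarity below 0.5 counts as evidence
    against a match (`low`); otherwise the probability rises
    quadratically from 0.5 towards `high`.
    """
    if similarity < 0.5:
        return low
    return (high - 0.5) * similarity**2 + 0.5


def combine(p1: float, p2: float) -> float:
    """Naive-Bayes combination of two independent match probabilities."""
    return (p1 * p2) / (p1 * p2 + (1.0 - p1) * (1.0 - p2))


def score(similarities: dict, weights: dict) -> float:
    """Fold per-field probabilities into one belief, starting from 0.5."""
    probs = (
        field_probability(similarities[name], *weights[name])
        for name in weights
    )
    return reduce(combine, probs, 0.5)


# Placeholder (low, high) weights per field, for illustration only.
WEIGHTS = {"Name": (0.1, 0.9), "Capacity": (0.3, 0.8)}
```

A pair whose score exceeds the configured threshold is treated as a match. Because the fold starts from a neutral 0.5, a field whose probability is exactly 0.5 leaves the belief unchanged.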

Changes

New engine

  • powerplantmatching/linkage.py with match() — list of two frames → record linkage, single frame → deduplication.
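The dispatch rule above can be sketched as follows. `resolve_mode` is a hypothetical helper written only to illustrate the rule and is not part of linkage.py. How the real `match()` handles a list of any other length is not stated here, so raising an error is an assumption.

```python
def resolve_mode(data) -> str:
    """Return which path a match()-style entry point would take.

    Hypothetical helper: a list or tuple of exactly two frames means
    record linkage; any other object is treated as a single frame to
    deduplicate.
    """
    if isinstance(data, (list, tuple)):
        if len(data) != 2:
            # Assumption: anything but a pair is rejected.
            raise ValueError("record linkage expects exactly two frames")
        return "linkage"
    return "dedup"
```

With two DataFrames `a` and `b`, `resolve_mode([a, b])` selects linkage and `resolve_mode(a)` selects deduplication.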

Removed

  • duke.py (Java subprocess, get_matcher resolver, add_geoposition_for_duke) and duke_recordlinkage.py
  • package_data/duke_binaries/*.jar (520K), Comparison.xml, Deleteduplicates.xml
  • LICENSES/Apache-2.0.txt (only the DUKE jars used it) and its REUSE / MANIFEST / docs references
  • test/test_duke.py and the DUKE-vs-engine benchmark script

Wiring & config

  • matching.py / cleaning.py call match() directly (no resolver indirection); **dukeargs → **kwargs across matching.py, collection.py, accessor.py
  • Drop the matching_backend config option; rename parallel_duke_processes → parallel_processes
  • Docs: API reference, license notes (README + docs/license.md), release notes

Compatibility

⚠️ Breaking:

  • Java is no longer required (and no longer used).
  • The matching_backend config key is removed; parallel_duke_processes is renamed to parallel_processes — existing user configs setting the old key must update it.
  • Direct imports of powerplantmatching.duke no longer resolve.
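For users with a custom config, the key change can be sketched as below. `migrate_config` is a hypothetical helper for illustration, not something this PR ships, and it assumes the config is available as a plain dict.

```python
def migrate_config(config: dict) -> dict:
    """Return a copy of `config` with the DUKE-era keys updated."""
    migrated = dict(config)
    # The backend is no longer selectable, so the key is simply dropped.
    migrated.pop("matching_backend", None)
    # Carry the old worker count over unless the new key is already set.
    if "parallel_duke_processes" in migrated:
        old_value = migrated.pop("parallel_duke_processes")
        migrated.setdefault("parallel_processes", old_value)
    return migrated
```

The input dict is left untouched, so the helper is safe to call on a config that is still in use elsewhere.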

Testing

  • test/test_linkage.py: 6 passed (format parity, identical-row matching, singlematch uniqueness, empty input, symmetric dedup pairs, geo falloff)
  • test/test_cleaning.py: 3 passed (real-data aggregate_units → match() dedup path)
  • ruff check clean; reuse lint compliant (used licenses: MIT, CC0-1.0, CC-BY-4.0)

Draft — rapidfuzz is now a hard dependency (already declared in pyproject.toml). Opening for review of the API rename (linkage.match) and the config-key change before finalising.

FabianHofmann and others added 4 commits June 22, 2026 08:51
Add a pure-Python recordlinkage backend mirroring Comparison.xml
(jarowinkler/qgram/numeric/geo + Fellegi-Sunter scoring) and a
GEO x GPD benchmark vs DUKE. ~70% recall / 84% precision at threshold
0.965; findings in analysis/recordlinkage_findings.md.
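As an illustration of the geo comparator mentioned above, here is a minimal sketch of a distance-based similarity with a linear falloff. The falloff shape (1.0 at zero distance, 0.5 at the maximum distance, 0.0 beyond it) is my understanding of the DUKE-style behaviour and should be read as an assumption. The 10 km default and the function names are placeholders, not values from Comparison.xml.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius in metres


def haversine_m(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in metres between two points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (
        math.sin(dphi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    )
    # Clamp to guard against rounding slightly above 1 near antipodes.
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(min(1.0, a)))


def geo_similarity(
    lat1: float,
    lon1: float,
    lat2: float,
    lon2: float,
    max_distance_m: float = 10_000.0,  # placeholder cut-off
) -> float:
    """Similarity in [0, 1]: linear falloff up to the cut-off, 0 beyond."""
    dist = haversine_m(lat1, lon1, lat2, lon2)
    if dist > max_distance_m:
        return 0.0
    return 0.5 + 0.5 * (1.0 - dist / max_distance_m)
```

Keeping in-range pairs at 0.5 or above means a nearby location never counts as evidence against a match; only pairs beyond the cut-off do.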
…ernative

Vectorised pure-Python matcher (rapidfuzz + numpy) mirroring both DUKE
configs (Comparison.xml linkage, Deleteduplicates.xml dedup) with
Fellegi-Sunter scoring. Selectable via config['matching_backend']
(default 'duke', unchanged); wired into compare_two_datasets and
aggregate_units through duke.get_matcher.

Validated against production-derived ground truth from powerplants.csv:
- linkage (GEO x GPD): F1 0.394 vs DUKE 0.348, ~17x faster
- dedup: 95% of merges principled, correctly collapses multi-unit
  plants DUKE misses, ~10x faster

Adds rapidfuzz dependency, tests, and analysis/benchmark + findings.

Make the vectorised rapidfuzz + numpy record-linkage backend the sole
matching engine and remove DUKE entirely. Matching no longer requires a
Java installation or the bundled DUKE binaries, and is substantially
faster (~17x linkage, ~10x dedup) at equal-or-better quality on an
objective ground truth (see analysis/linkage_findings.md).

- New powerplantmatching/linkage.py with match() (was duke_recordlinkage.duke)
- Remove duke.py, get_matcher resolver, add_geoposition_for_duke
- Remove duke_binaries/*.jar, Comparison.xml, Deleteduplicates.xml,
  LICENSES/Apache-2.0.txt and its REUSE/MANIFEST/doc references
- matching.py/cleaning.py call match() directly; **dukeargs -> **kwargs
- config: drop matching_backend; rename parallel_duke_processes -> parallel_processes
- Rename tests/findings; drop benchmark that compared against DUKE

  query = " and ".join(filter(None, [agg_query, block_query, country_query]))
  duplicates = pd.concat(
-     [duke(df.query(query), threads=threads) for c in countries]
+     [match(df.query(query), threads=threads) for c in countries]