Replace Java DUKE with a pure-Python rapidfuzz matching engine by FabianHofmann · Pull Request #301 · PyPSA/powerplantmatching

FabianHofmann · 2026-06-22T09:18:02Z

I am experimenting on the side with replacing the old, slow DUKE binary as the backend for deduplication and matching.

Note

AI generated draft
Fully experimental

Summary

Replaces the Java/JVM DUKE record-linkage engine with a pure-Python, vectorised backend (powerplantmatching.linkage, built on rapidfuzz + numpy) and makes it the sole matching engine. Matching no longer requires a Java installation or the bundled DUKE .jar binaries.

The new engine encodes the former DUKE field weights and thresholds verbatim (from Comparison.xml / Deleteduplicates.xml) and uses the same Fellegi-Sunter belief-update scoring, so matching behaviour is preserved — but it is substantially faster (~17× linkage, ~10× dedup) at equal-or-better quality on an objective ground truth recovered from powerplants.csv. Rationale and benchmark numbers are in analysis/linkage_findings.md.
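As a rough illustration of the belief-update scoring mentioned above, the sketch below combines per-field match probabilities with a naive Bayes update, starting from a neutral prior of 0.5. This is an assumption about what the scoring looks like, based on how DUKE is generally understood to combine field evidence; it is not taken from linkage.py. The function names and field probabilities are hypothetical, and the 0.965 threshold is the one quoted in the commit message for the earlier recordlinkage prototype, so the final engine may use a different value.

```python
from functools import reduce


def bayes_update(prior: float, evidence: float) -> float:
    """Combine two match probabilities with a naive Bayes update.

    Assumes both values lie strictly between 0 and 1; the degenerate
    pair (1.0, 0.0) would divide by zero.
    """
    num = prior * evidence
    return num / (num + (1.0 - prior) * (1.0 - evidence))


def combined_score(field_probs: list[float]) -> float:
    """Fold per-field probabilities into one score, starting at 0.5."""
    return reduce(bayes_update, field_probs, 0.5)


def is_match(field_probs: list[float], threshold: float = 0.965) -> bool:
    """Hypothetical decision rule: accept a pair above the threshold."""
    return combined_score(field_probs) >= threshold


# Invented per-field probabilities (e.g. name, capacity, position).
agreeing = [0.9, 0.9, 0.9]
conflicting = [0.9, 0.1]
print(combined_score(agreeing), is_match(agreeing))
print(combined_score(conflicting), is_match(conflicting))
```

Fields that agree push the score towards 1, while equally strong conflicting evidence cancels out and returns the score to the 0.5 prior.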

Changes

New engine

powerplantmatching/linkage.py with match() — list of two frames → record linkage, single frame → deduplication.
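The dispatch rule described above (two frames for linkage, one frame for deduplication) can be sketched as follows. This is a toy stand-in rather than the actual match() implementation: candidate_pairs is a hypothetical helper, and plain dicts of {record_id: name} stand in for pandas DataFrames.

```python
from itertools import combinations, product


def candidate_pairs(data):
    """Enumerate the record pairs a match()-style entry point would compare.

    `data` is either a single frame (deduplication) or a list of two
    frames (record linkage). Frames are plain dicts here.
    """
    if isinstance(data, list):
        if len(data) != 2:
            raise ValueError("record linkage expects exactly two frames")
        left, right = data
        # Linkage: every left record is compared with every right record.
        return list(product(left, right))
    # Deduplication: each unordered pair within the frame is compared once.
    return list(combinations(data, 2))


plants = {"a": "Plant A", "b": "Plant A unit 2", "c": "Plant C"}
other = {"x": "Plant A", "y": "Plant D"}

print(candidate_pairs(plants))           # dedup within one frame
print(candidate_pairs([plants, other]))  # linkage across two frames
```

In the deduplication case each pair appears once, so (a, b) and (b, a) are not scored separately.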

Removed

duke.py (Java subprocess, get_matcher resolver, add_geoposition_for_duke) and duke_recordlinkage.py
package_data/duke_binaries/*.jar (520K), Comparison.xml, Deleteduplicates.xml
LICENSES/Apache-2.0.txt (only the DUKE jars used it) and its REUSE / MANIFEST / docs references
test/test_duke.py and the DUKE-vs-engine benchmark script

Wiring & config

matching.py / cleaning.py call match() directly (no resolver indirection); **dukeargs → **kwargs across matching.py, collection.py, accessor.py
Drop the matching_backend config option; rename parallel_duke_processes → parallel_processes
Docs: API reference, license notes (README + docs/license.md), release notes

Compatibility

⚠️ Breaking:

Java is no longer required (and no longer used).
The matching_backend config key is removed; parallel_duke_processes is renamed to parallel_processes — existing user configs setting the old key must update it.
Direct imports of powerplantmatching.duke no longer resolve.
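For users with an existing custom config, the key changes listed above could be applied with a small helper along these lines. migrate_config is a hypothetical function written for illustration and is not part of powerplantmatching; it assumes the config has already been loaded into a plain dict.

```python
def migrate_config(config: dict) -> dict:
    """Return a copy of `config` with the DUKE-era keys updated."""
    new = dict(config)
    # The backend is no longer selectable, so the key is simply dropped.
    new.pop("matching_backend", None)
    # Carry the old value over unless the new key is already set.
    if "parallel_duke_processes" in new:
        old_value = new.pop("parallel_duke_processes")
        new.setdefault("parallel_processes", old_value)
    return new


old_config = {"matching_backend": "duke", "parallel_duke_processes": 4}
print(migrate_config(old_config))
```

The input dict is left unmodified, so the helper can be run against a loaded config without side effects.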

Testing

test/test_linkage.py — 6 passed (format parity, identical-row matching, singlematch uniqueness, empty input, symmetric dedup pairs, geo falloff)
test/test_cleaning.py — 3 passed (real-data aggregate_units → match() dedup path)
ruff check clean; reuse lint compliant (used licenses: MIT, CC0-1.0, CC-BY-4.0)
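To make the "geo falloff" test case more concrete, here is a minimal sketch of a distance-based similarity that falls off linearly to zero at a maximum distance. It assumes a haversine distance and a linear falloff, which matches how DUKE's geoposition comparison is commonly described; the function names and the max_distance_m parameter are hypothetical and not taken from linkage.py.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius in metres


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    # Clamp to 1.0 to guard against rounding just above the asin domain.
    return 2 * EARTH_RADIUS_M * math.asin(min(1.0, math.sqrt(a)))


def geo_similarity(p, q, max_distance_m):
    """1.0 at zero distance, falling linearly to 0.0 at max_distance_m and beyond."""
    d = haversine_m(p[0], p[1], q[0], q[1])
    return max(0.0, 1.0 - d / max_distance_m)


site = (50.0, 8.0)
print(geo_similarity(site, site, 10_000))          # identical position
print(geo_similarity(site, (50.01, 8.0), 10_000))  # roughly 1 km apart
print(geo_similarity(site, (51.0, 8.0), 10_000))   # far beyond the cutoff
```

A test in the spirit of "geo falloff" would check that the similarity is 1.0 for identical coordinates, decreases with distance, and is 0.0 past the cutoff.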

Draft — rapidfuzz is now a hard dependency (already declared in pyproject.toml). Opening for review of the API rename (linkage.match) and the config-key change before finalising.

Add a pure-Python recordlinkage backend mirroring Comparison.xml (jarowinkler/qgram/numeric/geo + Fellegi-Sunter scoring) and a GEO x GPD benchmark vs DUKE. ~70% recall / 84% precision at threshold 0.965; findings in analysis/recordlinkage_findings.md.

…ernative

Vectorised pure-Python matcher (rapidfuzz + numpy) mirroring both DUKE configs (Comparison.xml linkage, Deleteduplicates.xml dedup) with Fellegi-Sunter scoring. Selectable via config['matching_backend'] (default 'duke', unchanged); wired into compare_two_datasets and aggregate_units through duke.get_matcher.

Validated against production-derived ground truth from powerplants.csv:
- linkage (GEO x GPD): F1 0.394 vs DUKE 0.348, ~17x faster
- dedup: 95% of merges principled, correctly collapses multi-unit plants DUKE misses, ~10x faster

Adds rapidfuzz dependency, tests, and analysis/benchmark + findings.

Make the vectorised rapidfuzz + numpy record-linkage backend the sole matching engine and remove DUKE entirely. Matching no longer requires a Java installation or the bundled DUKE binaries, and is substantially faster (~17x linkage, ~10x dedup) at equal-or-better quality on an objective ground truth (see analysis/linkage_findings.md).

- New powerplantmatching/linkage.py with match() (was duke_recordlinkage.duke)
- Remove duke.py, get_matcher resolver, add_geoposition_for_duke
- Remove duke_binaries/*.jar, Comparison.xml, Deleteduplicates.xml, LICENSES/Apache-2.0.txt and its REUSE/MANIFEST/doc references
- matching.py/cleaning.py call match() directly; **dukeargs -> **kwargs
- config: drop matching_backend; rename parallel_duke_processes -> parallel_processes
- Rename tests/findings; drop benchmark that compared against DUKE

for more information, see https://pre-commit.ci

        query = " and ".join(filter(None, [agg_query, block_query, country_query]))
        duplicates = pd.concat(
-            [duke(df.query(query), threads=threads) for c in countries]
+            [match(df.query(query), threads=threads) for c in countries]


FabianHofmann and others added 4 commits June 22, 2026 08:51

[pre-commit.ci] auto fixes from pre-commit.com hooks

f469538


github-code-quality Bot found potential problems Jun 22, 2026


Comment thread powerplantmatching/cleaning.py

        query = " and ".join(filter(None, [agg_query, block_query, country_query]))
        duplicates = pd.concat(
-            [duke(df.query(query), threads=threads) for c in countries]
+            [match(df.query(query), threads=threads) for c in countries]

FabianHofmann mentioned this pull request Jun 22, 2026

Replace Java DUKE with a Splink matching engine #302

Closed

FabianHofmann wants to merge 4 commits into master from feat/rapidfuzz-matching-engine
