Replace Java DUKE with a pure-Python rapidfuzz matching engine #301
Draft
FabianHofmann wants to merge 4 commits into
Add a pure-Python recordlinkage backend mirroring Comparison.xml (jarowinkler/qgram/numeric/geo + Fellegi-Sunter scoring) and a GEO x GPD benchmark vs DUKE. ~70% recall / 84% precision at threshold 0.965; findings in analysis/recordlinkage_findings.md.
…ernative
Vectorised pure-Python matcher (rapidfuzz + numpy) mirroring both DUKE configs (Comparison.xml linkage, Deleteduplicates.xml dedup) with Fellegi-Sunter scoring. Selectable via config['matching_backend'] (default 'duke', unchanged); wired into compare_two_datasets and aggregate_units through duke.get_matcher.
Validated against production-derived ground truth from powerplants.csv:
- linkage (GEO x GPD): F1 0.394 vs DUKE 0.348, ~17x faster
- dedup: 95% of merges principled, correctly collapses multi-unit plants DUKE misses, ~10x faster
Adds rapidfuzz dependency, tests, and analysis/benchmark + findings.

Make the vectorised rapidfuzz + numpy record-linkage backend the sole matching engine and remove DUKE entirely. Matching no longer requires a Java installation or the bundled DUKE binaries, and is substantially faster (~17x linkage, ~10x dedup) at equal-or-better quality on an objective ground truth (see analysis/linkage_findings.md).
- New powerplantmatching/linkage.py with match() (was duke_recordlinkage.duke)
- Remove duke.py, get_matcher resolver, add_geoposition_for_duke
- Remove duke_binaries/*.jar, Comparison.xml, Deleteduplicates.xml, LICENSES/Apache-2.0.txt and its REUSE/MANIFEST/doc references
- matching.py/cleaning.py call match() directly; **dukeargs -> **kwargs
- config: drop matching_backend; rename parallel_duke_processes -> parallel_processes
- Rename tests/findings; drop benchmark that compared against DUKE
for more information, see https://pre-commit.ci
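The commit messages above mention Fellegi-Sunter style scoring carried over from the DUKE configs. As a rough illustration only, the sketch below combines per-field match probabilities with a naive Bayes belief update of the kind DUKE is known for. The function names, the similarity-to-probability mapping, and the example low/high weights are assumptions made for this sketch; they are not taken from `Comparison.xml` or from `powerplantmatching/linkage.py`. The 0.965 cut-off is borrowed from the first commit message purely as an example threshold.

```python
def field_probability(similarity, low, high):
    """Map a field similarity in [0, 1] to a match probability.

    Assumed mapping (illustrative): similarities below 0.5 count as
    evidence against a match (`low`); above that, the probability rises
    quadratically from 0.5 towards `high`.
    """
    if similarity >= 0.5:
        return (high - 0.5) * similarity**2 + 0.5
    return low


def bayes_update(prior, evidence):
    """Combine a running belief with one more field probability.

    Both arguments must lie strictly between 0 and 1, otherwise the
    denominator can become zero.
    """
    numerator = prior * evidence
    return numerator / (numerator + (1.0 - prior) * (1.0 - evidence))


def combined_score(field_probabilities):
    """Start from an uninformative belief of 0.5 and fold in every field."""
    score = 0.5
    for probability in field_probabilities:
        score = bayes_update(score, probability)
    return score


# Hypothetical pair of records: (similarity, low, high) per compared field.
fields = [
    (0.95, 0.09, 0.99),  # e.g. a name comparison
    (0.80, 0.10, 0.80),  # e.g. a geographic comparison
]
score = combined_score(field_probability(s, lo, hi) for s, lo, hi in fields)
is_match = score >= 0.965  # example threshold, see lead-in
```

One property of this update rule is that a field probability of exactly 0.5 leaves the running score unchanged, so a field only moves the result when it carries evidence in either direction.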
  query = " and ".join(filter(None, [agg_query, block_query, country_query]))
  duplicates = pd.concat(
-     [duke(df.query(query), threads=threads) for c in countries]
+     [match(df.query(query), threads=threads) for c in countries]
I am running some side experiments on replacing the old, slow DUKE binary as the backend for deduplication and matching.
Note
AI generated draft
Fully experimental
Summary
Replaces the Java/JVM DUKE record-linkage engine with a pure-Python, vectorised backend (`powerplantmatching.linkage`, built on `rapidfuzz` + `numpy`) and makes it the sole matching engine. Matching no longer requires a Java installation or the bundled DUKE `.jar` binaries.

The new engine encodes the former DUKE field weights and thresholds verbatim (from `Comparison.xml` / `Deleteduplicates.xml`) and uses the same Fellegi-Sunter belief-update scoring, so matching behaviour is preserved, but it is substantially faster (~17× linkage, ~10× dedup) at equal-or-better quality on an objective ground truth recovered from `powerplants.csv`. Rationale and benchmark numbers are in `analysis/linkage_findings.md`.

Changes

New engine
- `powerplantmatching/linkage.py` with `match()`: a list of two frames → record linkage, a single frame → deduplication.

Removed
- `duke.py` (Java subprocess, `get_matcher` resolver, `add_geoposition_for_duke`) and `duke_recordlinkage.py`
- `package_data/duke_binaries/*.jar` (520K), `Comparison.xml`, `Deleteduplicates.xml`
- `LICENSES/Apache-2.0.txt` (only the DUKE jars used it) and its REUSE / MANIFEST / docs references
- `test/test_duke.py` and the DUKE-vs-engine benchmark script

Wiring & config
- `matching.py` / `cleaning.py` call `match()` directly (no resolver indirection); `**dukeargs` → `**kwargs` across `matching.py`, `collection.py`, `accessor.py`
- Drop the `matching_backend` config option; rename `parallel_duke_processes` → `parallel_processes`
- … `docs/license.md`), release notes

Compatibility
- The `matching_backend` config key is removed; `parallel_duke_processes` is renamed to `parallel_processes`. Existing user configs that set the old key must update it.
- … `powerplantmatching.duke` no longer resolve.

Testing
- `test/test_linkage.py`: 6 passed (format parity, identical-row matching, singlematch uniqueness, empty input, symmetric dedup pairs, geo falloff)
- `test/test_cleaning.py`: 3 passed (real-data `aggregate_units` → `match()` dedup path)
- `ruff check` clean; `reuse lint` compliant (used licenses: MIT, CC0-1.0, CC-BY-4.0)
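For users affected by the config key changes described under Compatibility, the update could look roughly like the sketch below. `migrate_config` is a hypothetical helper written for this illustration, not part of powerplantmatching; it assumes the config has already been loaded into a plain `dict` and only touches the two keys named in this PR.

```python
import copy


def migrate_config(config):
    """Return an updated copy of a config dict (hypothetical helper).

    - drops the removed 'matching_backend' key
    - renames 'parallel_duke_processes' to 'parallel_processes',
      keeping an already present 'parallel_processes' value if both exist
    """
    migrated = copy.deepcopy(config)  # leave the caller's dict untouched
    migrated.pop("matching_backend", None)
    if "parallel_duke_processes" in migrated:
        old_value = migrated.pop("parallel_duke_processes")
        migrated.setdefault("parallel_processes", old_value)
    return migrated


# Example with made-up values:
old_config = {"matching_backend": "duke", "parallel_duke_processes": 4}
new_config = migrate_config(old_config)
```

Returning a copy rather than mutating in place keeps the original dict available for comparison or logging; whether that matters depends on how the config is loaded in practice.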