Add EIC-first deterministic matching before Duke fuzzy matching by MaykThewessen · Pull Request #289 · PyPSA/powerplantmatching

MaykThewessen · 2026-03-25T13:33:09Z

Summary

Implements the architecture proposed in #287: deterministic matching on EIC (Energy Identification Code) runs before Duke fuzzy matching in compare_two_datasets(). Plants that are linked one-to-one by a shared EIC code are matched with certainty and removed from the Duke input, so the fuzzy matcher only handles the residual.

Before:
  All sources ──→ Duke fuzzy match ──→ reduce

After:
  All sources ──→ Deterministic EIC match (1-to-1 only) ──→ matched set 1 (high confidence)
                  │
                  └→ residual ──→ Duke fuzzy match ──→ matched set 2

                  matched set 1 + matched set 2 ──→ reduce

Motivation

EIC codes are unique European plant identifiers already loaded from ENTSOE, OPSD, and GEM, but previously unused in the matching decision. Using them as a first-pass exact match provides a robustness guarantee against co-located plant confusion.

Example: Eemshaven harbour, Netherlands

Source	Plant	Fuel	Capacity	EIC
OPSD	Eemscentrale Ec	Natural Gas	1929 MW	6 shared codes
ENTSOE	Eems	Natural Gas	1931 MW	same 6 codes
ENTSOE	Eemshaven	Hard Coal	1580 MW	`49W000000000066Q`
ENTSOE	Eemshaven	Natural Gas (Magnum)	1410 MW	`49W000000000119V`

EIC matching deterministically pairs the gas plants via their shared codes, preventing Duke from merging them with the nearby coal plant based on name/geo similarity (~0.86 JaroWinkler, ~400 m apart).

Example: Borssele, Netherlands

Borssele nuclear (OPSD: 492 MW, ENTSOE: 485 MW) is locked to its correct cross-source pair via EIC 49W000000000054X, independent of fuzzy name matching against nearby Borssele wind/coal/solar entries.

Is "shared EIC code ⇒ same plant" safe? (review follow-up)

@FabianHofmann raised the key question: can two entries be assumed identical as soon as they share one EIC code? Assessed on the real OPSD/ENTSOE slice by modelling the EIC relationship as a bipartite graph (rows = nodes, shared code = edge) and inspecting the connected components:

Category	Clusters	Breakdown
EIC-linked clusters	410
clean 1-to-1	371 (90.5%)	358 identical-set, 13 subset/superset, 0 partial-overlap
ambiguous	39 (9.5%)	one row on one side, several on the other

So a single shared code does not prove identity in ~9.5% of cases. These are almost all Alpine hydro schemes: ENTSOE reports the aggregated scheme under one scheme-level EIC (e.g. "Oberhasli Ag Kwo", 1307 MW, 12W-0000000031-O) while OPSD splits it into eight stations that all inherit that same code.

The fix is therefore to accept a pair only when its shared codes link the two rows to no other row on either side (degree 1 on both ends of the bipartite graph, which is exactly a size-2 connected component, verified identical on the slice: 371/371). Ambiguous clusters fall through to Duke.

Note on the exact-set-equality alternative discussed in review: it would not fix this (the eight split stations each carry exactly the same single code as the aggregate, so they remain set-equal and still 1-to-many) and it would additionally drop the 13 genuine subset matches (e.g. one unit code in OPSD vs two in ENTSOE). 1-to-1 uniqueness is the discriminator that actually works.
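To make the point about exact set equality concrete, here is a minimal sketch on toy data. The row names and EIC codes (`SCHEME`, `U1`, `U2`) are hypothetical placeholders, not values from the OPSD/ENTSOE slice, and the code is not taken from the PR.

```python
from collections import Counter

# Hypothetical EIC sets per row: one source splits a scheme into stations,
# the other reports the aggregate under the same scheme-level code.
opsd = {
    "station_a": {"SCHEME"},
    "station_b": {"SCHEME"},
    "unit_plant": {"U1"},
}
entsoe = {
    "scheme_total": {"SCHEME"},
    "unit_plant": {"U1", "U2"},
}

# Exact set equality: still 1-to-many for the scheme, and drops the subset match.
equal_pairs = [(i, j) for i, a in opsd.items() for j, b in entsoe.items() if a == b]

# Shared-code pairs, then keep only rows linked to exactly one partner.
shared_pairs = [(i, j) for i, a in opsd.items() for j, b in entsoe.items() if a & b]
deg0 = Counter(i for i, _ in shared_pairs)
deg1 = Counter(j for _, j in shared_pairs)
unique_pairs = [(i, j) for i, j in shared_pairs if deg0[i] == 1 and deg1[j] == 1]

print(equal_pairs)   # two candidates for the single aggregate row
print(unique_pairs)  # only the unambiguous subset/superset pair
```

Under these assumptions, set equality keeps both station candidates for the one aggregate row and loses the subset match, while the uniqueness filter does the opposite.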

Implementation

_match_by_eic() in matching.py: explode() the per-row EIC collections, merge on the code to find shared-code row pairs, then keep only pairs that are degree-1 on both sides via two groupby(...).transform("size").eq(1) masks. Pure pandas, no Python loops or set operations.
DirectMatcher class: holds an ordered list of deterministic matchers (default [_match_by_eic]); run() applies them, peels matched rows from the residual, and returns (matches, [residual0, residual1]). This keeps the deterministic phase cleanly separated from Duke and extensible (a name+country or project-id matcher can join the list without touching the fuzzy path).
compare_two_datasets(): runs DirectMatcher().run(...) first, then Duke on the residual, then concatenates both match sets.
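The steps listed above can be sketched roughly as follows. This is an illustration of the explode + merge + degree-1 filter idea, not the code shipped in matching.py; the function name `match_by_eic_sketch`, the toy frames, and the assumption that the `EIC` column holds list-likes (or None) are all hypothetical.

```python
import pandas as pd


def match_by_eic_sketch(df0, df1, label0, label1):
    """Return 1-to-1 EIC-linked index pairs (illustrative sketch only)."""
    # One row per (row index, EIC code); rows without codes drop out.
    codes0 = df0["EIC"].explode().dropna().rename("code").rename_axis(label0).reset_index()
    codes1 = df1["EIC"].explode().dropna().rename("code").rename_axis(label1).reset_index()

    # Row pairs that share at least one code, de-duplicated across codes.
    pairs = codes0.merge(codes1, on="code")[[label0, label1]].drop_duplicates()

    # Keep a pair only if each row is linked to exactly one row on the other side.
    deg0 = pairs.groupby(label0)[label1].transform("size")
    deg1 = pairs.groupby(label1)[label0].transform("size")
    return pairs[deg0.eq(1) & deg1.eq(1)].reset_index(drop=True)


# Toy data: a multi-code pair, a subset/superset pair, a split scheme, a row without EIC.
df0 = pd.DataFrame({"EIC": [["A", "B"], ["C"], ["H"], ["H"], None]})
df1 = pd.DataFrame({"EIC": [["A", "B"], ["C", "D"], ["H"]]})
matches = match_by_eic_sketch(df0, df1, "OPSD", "ENTSOE")
print(matches)
```

In this toy case rows 2 and 3 of the first frame both link to row 2 of the second, so that cluster is left out of `matches` and would go to the fuzzy matcher.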

Results (OPSD ↔ ENTSOE)

371 deterministic 1-to-1 EIC matches resolved without fuzzy matching and removed from the Duke input.
Rows in ambiguous clusters are no longer force-matched; they are deferred to fuzzy matching.
Gracefully falls back to Duke-only when EIC data is unavailable (GEM, GEO, GPD have no EIC codes).

Test plan

pytest test/test_matching.py — 13 tests pass (basic matching, missing EIC column, empty/NaN/None sets, raw-string EIC, multi-code 1-to-1, subset/superset 1-to-1, and regression cases for the 1-to-many and Alpine-hydro-scheme ambiguities, plus DirectMatcher residual + pluggability).
Assessment on real OPSD + ENTSOE data — 410 EIC-linked clusters, 371 deterministic 1-to-1 matches, _match_by_eic output verified identical to the size-2 connected components.
Full pipeline run with powerplants(update=True) to verify end-to-end output.

Closes #287

🤖 Generated with Claude Code

MaykThewessen · 2026-06-12T21:57:46Z

Two developments that raise the value of this PR:

Since #292 (feat(OSM): upgrade source from frozen Europe snapshot to live global dataset), OSM is a live, monthly refreshed source. The Dutch thermal fleet now carries ref:EU:EIC tags in OSM (Amer, Hemweg, Maasstroom, Rijnmond, Schoonebeek, PerGen, ...), and open-energy-transition/osm-powerplants#13 (Extract ref:EU:EIC tags into an EIC column for deterministic dataset matching) proposes passing those through to an EIC column in osm_global.csv. With that in place, EIC-first matching would cover the OSM source as well: an end-to-end deterministic path from an OSM tag to the matched output.
The NL slice provides concrete test cases where Duke fuzzy matching currently merges distinct plants that EIC-first matching would keep separate: Maasvlakte Uniper (2306 MW = MPP3 1070 + the retired MV1/2) and the duplicate Eem/Eemscentrale group (six ENTSOE EICs).

Happy to rebase/extend if there's interest in moving this forward.

MaykThewessen · 2026-06-18T15:51:22Z

Gentle nudge: still mergeable, implements #287, and ships with tests in test/test_matching.py. Could a maintainer review when there's time? Happy to address any feedback.

FabianHofmann

Thanks @MaykThewessen, I think the general idea is good, and it makes sense to introduce a direct matching phase that circumvents the fuzzy Duke matching. Some feedback below.

FabianHofmann · 2026-06-22T06:30:53Z

@MaykThewessen I forgot to mention:

This approach fully relies on the assumption that power plant entries in df0 and df1 are the same as soon as they share one common EIC code. Can you please assess whether this assumption is fair? If not, we should make the direct matching stricter and match on exactly equal EIC entries, comparing either the (sorted) lists of strings or the single strings. Only in case of an exact match would the values be mapped.

Addresses review feedback on PyPSA#289.

- Rewrite _match_by_eic in pandas (explode + merge). Accept a pair only when its shared EIC codes link the two rows to no other row on either side (degree-1 on both ends of the shared-code bipartite graph), verified identical to scipy connected-components on the OPSD/ENTSOE slice (371/371). Ambiguous clusters, where one source aggregates a scheme under a single scheme-level EIC and the other splits it into stations carrying that same code (e.g. Alpine hydro), now fall through to Duke instead of being force-matched arbitrarily.
- Make label0/label1 explicit args; return a single matches frame.
- Add DirectMatcher class with run() to separate the deterministic step and host future direct matchers.
- Expand test/test_matching.py to 13 tests (1-to-many, hydro scheme, subset/superset, raw-string, DirectMatcher.run).

MaykThewessen · 2026-06-22T20:19:54Z

Thanks @FabianHofmann, that was the right thing to check. I ran the assessment on the real OPSD/ENTSOE slice (cached bulk data, frames as compare_two_datasets sees them after aggregate_units), modelling the EIC relationship as a bipartite graph (rows = nodes, shared code = edge) and looking at the connected components:

410 EIC-linked clusters
371 (90.5%) clean 1-to-1: 358 with identical sets, 13 subset/superset, 0 partial-overlap
39 (9.5%) ambiguous (one row on one side, several on the other)

So the "one shared code implies same entry" assumption is unsafe in about 9.5% of cases. These are almost all Alpine hydro schemes: ENTSOE reports the aggregated scheme under one scheme-level EIC (Oberhasli Ag Kwo, 1307 MW, 12W-0000000031-O) while OPSD splits it into eight stations (Gental, Grimsel, Handeck, ...) that all inherit that same code. The old greedy code paired the 1307 MW aggregate with whichever station it hit first.

One subtlety on the exact-equality proposal: it would not fully fix this. In those clusters all eight OPSD stations carry exactly {12W-0000000031-O} and ENTSOE carries exactly {12W-0000000031-O}, so they are set-equal and you would still get eight equal-set candidates for one ENTSOE row. Exact equality would also drop the 13 genuine subset matches (e.g. Ballylumford has one unit code in OPSD and two in ENTSOE). What actually discriminates is 1-to-1 uniqueness of the linkage.

So I reworked _match_by_eic to accept a pair only when its shared codes link the two rows to no other row on either side (degree 1 on both ends of the bipartite graph). I verified this is identical to taking the size-2 connected components: 371/371, zero difference. Ambiguous clusters fall through to Duke.
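For readers who want to reproduce that equivalence check, a minimal sketch on toy data could look like the following. The pair list and node counts are hypothetical; only `scipy.sparse.csgraph.connected_components` is a real library call, and this is not the evaluation code used for the 371/371 figure.

```python
import numpy as np
from scipy.sparse import coo_matrix
from scipy.sparse.csgraph import connected_components

# Hypothetical de-duplicated shared-code pairs: (row in df0, row in df1).
pairs = np.array([(0, 0), (1, 1), (2, 2), (3, 2)])
n0, n1 = 4, 3  # number of rows on each side

# Bipartite graph: df0 rows are nodes 0..n0-1, df1 rows are nodes n0..n0+n1-1.
adj = coo_matrix(
    (np.ones(len(pairs)), (pairs[:, 0], pairs[:, 1] + n0)),
    shape=(n0 + n1, n0 + n1),
)
_, labels = connected_components(adj, directed=False)
component_size = np.bincount(labels)

# Pairs that form a connected component of exactly two nodes.
size2_pairs = {(int(l), int(r)) for l, r in pairs if component_size[labels[l]] == 2}

# Pairs whose endpoints both have degree 1.
deg0 = np.bincount(pairs[:, 0], minlength=n0)
deg1 = np.bincount(pairs[:, 1], minlength=n1)
degree1_pairs = {(int(l), int(r)) for l, r in pairs if deg0[l] == 1 and deg1[r] == 1}

print(size2_pairs == degree1_pairs)
```

The two sets coincide because an edge whose endpoints have no other neighbours is, by construction, an isolated two-node component.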

Pushed in 475ac2c. On the inline comments:

explode: done. The index is built with df["EIC"].explode() and everything stays in pandas Series/DataFrame.
pandas-style 1-to-1: the greedy set/loop is gone. The uniqueness test is two groupby(...).transform("size").eq(1) masks, no Python loops.
explicit label0/label1: done, and the return is now a single matches frame (the caller derives the matched indices from the columns).
DirectMatcher: added. It holds the list of direct matchers and exposes run(df0, df1, label0, label1) returning (matches, [residual0, residual1]), so compare_two_datasets runs the deterministic step then hands the residual to Duke. _match_by_eic stays a standalone function and is the default matcher, so a future name+country or project-id matcher just joins the list.
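As a rough illustration of the structure described in these points, the sketch below shows an ordered list of matchers whose matched rows are peeled off before the next step. The class name `DirectMatcherSketch` and the toy `match_by_exact_name` matcher are hypothetical and only mirror the `run(df0, df1, label0, label1)` shape described above; the class in the PR may differ.

```python
import pandas as pd


def match_by_exact_name(df0, df1, label0, label1):
    """Hypothetical toy matcher: pair rows whose Name is identical."""
    left = df0["Name"].rename_axis(label0).reset_index()
    right = df1["Name"].rename_axis(label1).reset_index()
    return left.merge(right, on="Name")[[label0, label1]]


class DirectMatcherSketch:
    """Apply deterministic matchers in order and peel off matched rows."""

    def __init__(self, matchers):
        self.matchers = list(matchers)

    def run(self, df0, df1, label0, label1):
        residual0, residual1 = df0, df1
        found = [pd.DataFrame(columns=[label0, label1], dtype="int64")]
        for matcher in self.matchers:
            m = matcher(residual0, residual1, label0, label1)
            found.append(m)
            # Matched rows are removed so later steps only see the residual.
            residual0 = residual0.drop(index=m[label0])
            residual1 = residual1.drop(index=m[label1])
        matches = pd.concat(found, ignore_index=True)
        return matches, [residual0, residual1]


df0 = pd.DataFrame({"Name": ["Alpha", "Beta"]})
df1 = pd.DataFrame({"Name": ["Beta", "Gamma"]})
matches, (residual0, residual1) = DirectMatcherSketch([match_by_exact_name]).run(
    df0, df1, "OPSD", "ENTSOE"
)
print(matches)
```

The residual frames returned here are what a caller would hand to the fuzzy matcher in this design.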

Net effect on the integration numbers: deterministic matches are now 371 high-confidence 1-to-1 pairs for OPSD/ENTSOE; rows in ambiguous clusters are no longer force-matched and go to fuzzy matching instead. test/test_matching.py is up to 13 tests, including regression cases for the 1-to-many and hydro-scheme situations.

MaykThewessen · 2026-06-30T15:11:09Z

@FabianHofmann all the review points are now addressed and I've resolved the three inline threads:

"shared EIC ⇒ same plant" assumption: assessed in the comment above and since independently re-verified on the real OPSD/ENTSOE slice. Of 410 EIC-linked clusters, 371 (90.5%) are clean 1-to-1 and 39 (9.5%) are ambiguous (almost all Alpine hydro schemes, e.g. Oberhasli Ag Kwo aggregated under 12W-0000000031-O vs 8 OPSD stations). _match_by_eic now accepts only degree-1 (1-to-1) links, which I confirmed is exactly the set of size-2 connected components (371/371, identical index pairs). Ambiguous clusters fall through to Duke.
DirectMatcher class, explode() + pandas-style, explicit label0/label1: done in 475ac2c.
I also refreshed the PR description, which had drifted to the old greedy write-up ("431 matches / greedy 1-to-1 / 7 tests"); it now matches the shipped code.

test/test_matching.py is at 13 tests, including regressions for the 1-to-many and hydro-scheme cases. Would appreciate a re-review when you have a moment.

Implements the architecture proposed in PyPSA#287: deterministic matching on EIC (Energy Identification Code) runs before Duke fuzzy matching in compare_two_datasets(). Plants sharing an EIC code are matched with certainty, then removed from the Duke input so the fuzzy matcher only handles the residual.

This provides a robustness guarantee against co-located plant confusion. For example, in the Eemshaven harbour area (Netherlands):

- OPSD "Eemscentrale Ec" (Natural Gas, 1929 MW, 6 EIC codes)
- ENTSOE "Eems" (Natural Gas, 1931 MW, same 6 EIC codes)
- ENTSOE "Eemshaven" (Hard Coal, 1580 MW, different EIC)
- ENTSOE "Eemshaven" (Natural Gas, 1410 MW, different EIC)

EIC matching deterministically pairs the gas plants via their shared codes, preventing any possibility of Duke merging them with the nearby coal plant based on name/geo similarity.

Similarly, Borssele nuclear (OPSD: 492 MW, ENTSOE: 485 MW) is locked to its correct cross-source pair via EIC 49W000000000054X, independent of fuzzy name matching against nearby Borssele wind/coal entries.

Integration test: 431 deterministic EIC matches between OPSD and ENTSOE (28% of ENTSOE), reducing Duke's workload for those pairs to zero.

Closes PyPSA#287

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>


MaykThewessen · 2026-07-02T20:29:13Z

Two updates.

1. Branch rebased onto current master (picks up the pandas-3 fixes from #294; the branch previously crashed under pandas 3 in best_matches/duke.py, which predated that merge). Linear history, test/test_matching.py 13 passed on pandas 3.0.3.

2. Quantified what the deterministic EIC pass buys, by scoring pure Duke fuzzy matching (current static Comparison.xml, no EIC pre-step, i.e. this repo before this PR) against the 371 unambiguous 1-to-1 EIC pairs from the assessment above, on the same OPSD/ENTSOE frames (3325/1521 rows post-aggregate_units):

Metric	Pure Duke (fuzzy only)
Recall on the 371 EIC-confirmed pairs	85.2% (316 recovered; 55 missed = 34.6 GW)
Precision on pairs where both rows carry EIC codes	89.1% (41 of 376 EIC-inconsistent)

Structure of the failures:

Misses concentrate in the UK (30 of 55): ENTSO-E registers UK units under identifier-style names that defeat string similarity: West Burton = "Wbups", Connahs Quay = "Cnqps", Peterhead = "Pehe", Heysham 2 = "Heym2". These are multi-GW plants silently dropped from the matched set today.
Confirmed wrong matches are the co-located-plant failure mode this PR targets: Duke matches Fiddler's Ferry (1961 MW) to "Ferr" while its EIC says "Fidl", and South Humber Bank (1365 MW) to "Humr", which the EIC registry identifies as the adjacent VPI Immingham plant. The Eemshaven example from the PR description generalizes.
Duke's scores cannot flag any of this: correct matches (median score 0.9987) and EIC-contradicted matches (median 0.9971) overlap almost completely and saturate at the same ceiling, so no threshold separates them. A wrong 695 MW match (Gud / Gud Ludwigshafen Mitte) scores 0.998, above hundreds of correct pairs.

So the deterministic pass is not just a tie-breaker for lookalike neighbours: it recovers the ~15% of EIC-confirmed pairs (55 of 371, 34.6 GW) that fuzzy matching misses outright, and it removes multi-GW mismatches that are invisible to score-based filtering. Happy to share the evaluation script if useful.
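For clarity on how such figures can be computed, here is a minimal sketch. The metric definitions are assumptions inferred from the table labels (recall against the EIC-confirmed reference pairs; precision as the share of fuzzy pairs with EIC codes on both rows that share at least one code), the toy inputs are hypothetical, and this is not the evaluation script mentioned above.

```python
def recall(reference_pairs, predicted_pairs):
    """Share of reference pairs that the fuzzy matcher also produced."""
    return len(reference_pairs & predicted_pairs) / len(reference_pairs)


def eic_precision(predicted_pairs, eic0, eic1):
    """Among predicted pairs with EIC codes on both rows, share that agree."""
    both_coded = [(i, j) for i, j in predicted_pairs if eic0.get(i) and eic1.get(j)]
    consistent = [(i, j) for i, j in both_coded if eic0[i] & eic1[j]]
    return len(consistent) / len(both_coded)


# Hypothetical toy inputs.
reference = {(0, 0), (1, 1), (2, 2)}
predicted = {(0, 0), (1, 1), (3, 2), (4, 5)}
eic0 = {0: {"A"}, 1: {"B"}, 2: {"C"}, 3: {"D"}}
eic1 = {0: {"A"}, 1: {"B"}, 2: {"C"}}

toy_recall = recall(reference, predicted)
toy_precision = eic_precision(predicted, eic0, eic1)

# Arithmetic behind the percentages reported in the table.
reported_recall = round(316 / 371 * 100, 1)
reported_precision = round((376 - 41) / 376 * 100, 1)
print(toy_recall, toy_precision, reported_recall, reported_precision)
```

The last two lines only restate the counts from the table (316 of 371 recovered, 41 of 376 EIC-inconsistent) as percentages.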

MaykThewessen mentioned this pull request Jun 12, 2026

Extract ref:EU:EIC tags into an EIC column for deterministic dataset matching open-energy-transition/osm-powerplants#13

Open

MaykThewessen force-pushed the feature/eic-deterministic-matching branch from ec4364a to 25d6a4e, June 12, 2026 22:08

MaykThewessen mentioned this pull request Jun 14, 2026

Extract ref:EU:EIC tags into an EIC column open-energy-transition/osm-powerplants#14

Open

FabianHofmann requested changes Jun 22, 2026


Comment thread powerplantmatching/matching.py Outdated

Comment thread powerplantmatching/matching.py Outdated

Comment thread powerplantmatching/matching.py Outdated

MaykThewessen and others added 5 commits July 2, 2026 22:27

[pre-commit.ci] auto fixes from pre-commit.com hooks

7ff5e62

for more information, see https://pre-commit.ci

[pre-commit.ci] auto fixes from pre-commit.com hooks

3712094

for more information, see https://pre-commit.ci

test: rename station to avoid codespell false positive (Gental->Innertkirchen)

65fda1c

MaykThewessen force-pushed the feature/eic-deterministic-matching branch from c611e46 to 65fda1c, July 2, 2026 20:28

MaykThewessen wants to merge 5 commits into PyPSA:master from MaykThewessen:feature/eic-deterministic-matching
