
Add EIC-first deterministic matching before Duke fuzzy matching #289

Open
MaykThewessen wants to merge 5 commits into PyPSA:master from MaykThewessen:feature/eic-deterministic-matching

Conversation

@MaykThewessen

@MaykThewessen MaykThewessen commented Mar 25, 2026

Contributor

Summary

Implements the architecture proposed in #287: deterministic matching on EIC (Energy Identification Code) runs before Duke fuzzy matching in compare_two_datasets(). Plants that are linked one-to-one by a shared EIC code are matched with certainty and removed from the Duke input, so the fuzzy matcher only handles the residual.

Before:
  All sources ──→ Duke fuzzy match ──→ reduce

After:
  All sources ──→ Deterministic EIC match (1-to-1 only) ──→ matched set 1 (high confidence)
                  │
                  └→ residual ──→ Duke fuzzy match ──→ matched set 2

                  matched set 1 + matched set 2 ──→ reduce

Motivation

EIC codes are unique European plant identifiers already loaded from ENTSOE, OPSD, and GEM, but previously unused in the matching decision. Using them as a first-pass exact match provides a robustness guarantee against co-located plant confusion.

Example: Eemshaven harbour, Netherlands

| Source | Plant | Fuel | Capacity | EIC |
|---|---|---|---|---|
| OPSD | Eemscentrale Ec | Natural Gas | 1929 MW | 6 shared codes |
| ENTSOE | Eems | Natural Gas | 1931 MW | same 6 codes |
| ENTSOE | Eemshaven | Hard Coal | 1580 MW | 49W000000000066Q |
| ENTSOE | Eemshaven | Natural Gas (Magnum) | 1410 MW | 49W000000000119V |

EIC matching deterministically pairs the gas plants via their shared codes, preventing Duke from merging them with the nearby coal plant based on name/geo similarity (~0.86 JaroWinkler, ~400 m apart).

Example: Borssele, Netherlands

Borssele nuclear (OPSD: 492 MW, ENTSOE: 485 MW) is locked to its correct cross-source pair via EIC 49W000000000054X, independent of fuzzy name matching against nearby Borssele wind/coal/solar entries.

Is "shared EIC code ⇒ same plant" safe? (review follow-up)

@FabianHofmann raised the key question: can two entries be assumed identical as soon as they share one EIC code? I assessed this on the real OPSD/ENTSOE slice by modelling the EIC relationship as a bipartite graph (rows = nodes, shared code = edge) and inspecting the connected components:

| clusters | count | breakdown |
|---|---|---|
| EIC-linked clusters | 410 | |
| clean 1-to-1 | 371 (90.5%) | 358 identical-set, 13 subset/superset, 0 partial-overlap |
| ambiguous | 39 (9.5%) | one row on one side, several on the other |

So a single shared code does not prove identity in ~9.5% of cases. These are almost all Alpine hydro schemes: ENTSOE reports the aggregated scheme under one scheme-level EIC (e.g. "Oberhasli Ag Kwo", 1307 MW, 12W-0000000031-O) while OPSD splits it into eight stations that all inherit that same code.

The fix is therefore to accept a pair only when its shared codes link the two rows to no other row on either side (degree 1 on both ends of the bipartite graph, which is exactly a size-2 connected component, verified identical on the slice: 371/371). Ambiguous clusters fall through to Duke.
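To make the cluster assessment concrete, here is a minimal sketch on toy data. It is not the evaluation script behind the numbers above: the frames, the codes and the helper name `eic_clusters` are invented for illustration, and a small union-find stands in for a graph library.

```python
import pandas as pd

# Toy frames; names and codes are placeholders, not real registry entries.
opsd = pd.DataFrame({
    "Name": ["Plant A", "Station 1", "Station 2", "Plant B"],
    "EIC": [{"X1"}, {"S1"}, {"S1"}, {"Y1"}],
})
entsoe = pd.DataFrame({
    "Name": ["Plant A", "Scheme", "Plant B"],
    "EIC": [{"X1", "X2"}, {"S1"}, {"Y1"}],
})


def eic_clusters(df0, df1, col="EIC"):
    """Connected components of the shared-code bipartite graph (hypothetical helper)."""
    # One row per (row index, code) on each side.
    e0 = df0[col].explode().dropna().astype(str).rename("code").rename_axis("i0").reset_index()
    e1 = df1[col].explode().dropna().astype(str).rename("code").rename_axis("i1").reset_index()
    edges = e0.merge(e1, on="code")[["i0", "i1"]].drop_duplicates()

    parent = {}

    def find(node):
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    # Tag each node with its side so equal index labels do not collide.
    for i0, i1 in edges.itertuples(index=False):
        parent[find((0, i0))] = find((1, i1))

    clusters = {}
    for node in list(parent):
        clusters.setdefault(find(node), set()).add(node)
    return list(clusters.values())


clusters = eic_clusters(opsd, entsoe)
one_to_one = [c for c in clusters if len(c) == 2]  # one row on each side
ambiguous = [c for c in clusters if len(c) > 2]    # 1-to-many or many-to-many
```

On this toy input the subset pair ("Plant A") and the identical-set pair ("Plant B") land in size-2 components, while the scheme and its two stations form one ambiguous component of three rows.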

Note on the exact-set-equality alternative discussed in review: it would not fix this (the eight split stations each carry exactly the same single code as the aggregate, so they remain set-equal and still 1-to-many) and it would additionally drop the 13 genuine subset matches (e.g. one unit code in OPSD vs two in ENTSOE). 1-to-1 uniqueness is the discriminator that actually works.

Implementation

  • _match_by_eic() in matching.py: explode() the per-row EIC collections, merge on the code to find shared-code row pairs, then keep only pairs that are degree-1 on both sides via two groupby(...).transform("size").eq(1) masks. Pure pandas, no Python loops or set operations.
  • DirectMatcher class: holds an ordered list of deterministic matchers (default [_match_by_eic]); run() applies them, peels matched rows from the residual, and returns (matches, [residual0, residual1]). This keeps the deterministic phase cleanly separated from Duke and extensible (a name+country or project-id matcher can join the list without touching the fuzzy path).
  • compare_two_datasets(): runs DirectMatcher().run(...) first, then Duke on the residual, then concatenates both match sets.
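The explode/merge/degree-1 steps can be sketched as follows. This is an illustrative reconstruction from the description above, not the code in matching.py: the function name `match_by_eic_sketch`, its signature and the toy frames are assumptions.

```python
import pandas as pd


def match_by_eic_sketch(df0, df1, label0="OPSD", label1="ENTSOE", col="EIC"):
    """Return index pairs linked 1-to-1 by shared codes (illustrative only)."""
    empty = pd.DataFrame(columns=[label0, label1])
    if col not in df0.columns or col not in df1.columns:
        return empty  # no EIC data: everything is left to the fuzzy matcher

    def exploded(df, name):
        # One row per (row index, code); rows without codes drop out.
        codes = df[col].explode().dropna().astype(str)
        return codes.rename("code").rename_axis(name).reset_index()

    pairs = (
        exploded(df0, label0)
        .merge(exploded(df1, label1), on="code")[[label0, label1]]
        .drop_duplicates()  # several shared codes still count as one link
    )
    if pairs.empty:
        return empty

    # Degree 1 on both sides: each row has exactly one partner.
    unique0 = pairs.groupby(label0)[label1].transform("size").eq(1)
    unique1 = pairs.groupby(label1)[label0].transform("size").eq(1)
    return pairs[unique0 & unique1].reset_index(drop=True)


# Toy frames with placeholder codes.
opsd = pd.DataFrame({"EIC": [{"X1"}, {"S1"}, {"S1"}, {"Y1"}]})
entsoe = pd.DataFrame({"EIC": [{"X1", "X2"}, {"S1"}, {"Y1"}]})
matches = match_by_eic_sketch(opsd, entsoe)
```

Here the two rows sharing `S1` with a single counterpart are rejected by the second mask, so only the pairs (0, 0) and (3, 2) are returned and the `S1` rows stay in the residual.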

Results (OPSD ↔ ENTSOE)

  • 371 deterministic 1-to-1 EIC matches resolved without fuzzy matching and removed from the Duke input.
  • Rows in ambiguous clusters are no longer force-matched; they are deferred to fuzzy matching.
  • Gracefully falls back to Duke-only when EIC data is unavailable (GEM, GEO, GPD have no EIC codes).

Test plan

  • pytest test/test_matching.py: 13 tests pass (basic matching, missing EIC column, empty/NaN/None sets, raw-string EIC, multi-code 1-to-1, subset/superset 1-to-1, and regression cases for the 1-to-many and Alpine-hydro-scheme ambiguities, plus DirectMatcher residual + pluggability).
  • Assessment on real OPSD + ENTSOE data — 410 EIC-linked clusters, 371 deterministic 1-to-1 matches, _match_by_eic output verified identical to the size-2 connected components.
  • Full pipeline run with powerplants(update=True) to verify end-to-end output.

Closes #287

🤖 Generated with Claude Code

@MaykThewessen

Contributor Author

Two developments that raise the value of this PR:

  1. Since #292 ("feat(OSM): upgrade source from frozen Europe snapshot to live global dataset"), OSM is a live, monthly refreshed source. The Dutch thermal fleet now carries ref:EU:EIC tags in OSM (Amer, Hemweg, Maasstroom, Rijnmond, Schoonebeek, PerGen, ...), and open-energy-transition/osm-powerplants#13 ("Extract ref:EU:EIC tags into an EIC column for deterministic dataset matching") proposes passing those through to an EIC column in osm_global.csv. With that in place, EIC-first matching would cover the OSM source as well: an end-to-end deterministic path from an OSM tag to the matched output.

  2. The NL slice provides concrete test cases where Duke fuzzy matching currently merges distinct plants that EIC-first matching would keep separate: Maasvlakte Uniper (2306 MW = MPP3 1070 + the retired MV1/2) and the duplicate Eem/Eemscentrale group (six ENTSOE EICs).

Happy to rebase/extend if there's interest in moving this forward.

@MaykThewessen

Contributor Author

Gentle nudge: still mergeable, implements #287, and ships with tests in test/test_matching.py. Could a maintainer review when there's time? Happy to address any feedback.

@FabianHofmann FabianHofmann left a comment

Contributor


Thanks @MaykThewessen, I think the general idea is good, and it makes sense to introduce a direct matching phase that circumvents the fuzzy Duke matching. Some feedback below.

Comment thread powerplantmatching/matching.py Outdated
Comment thread powerplantmatching/matching.py Outdated
Comment thread powerplantmatching/matching.py Outdated
@FabianHofmann

Contributor

@MaykThewessen I forgot to mention:

This approach fully relies on the assumption that powerplant entries in df0 and df1 are the same as soon as they share one common EIC code. Can you please assess whether this assumption is fair? If not, we should make the direct matching stricter and match on exactly equal EIC entries, comparing either the (sorted) lists of strings or the single strings. Only in case of an exact match would the values be mapped.

MaykThewessen added a commit to MaykThewessen/powerplantmatching that referenced this pull request Jun 22, 2026
Addresses review feedback on PyPSA#289.

- Rewrite _match_by_eic in pandas (explode + merge). Accept a pair only
  when its shared EIC codes link the two rows to no other row on either
  side (degree-1 on both ends of the shared-code bipartite graph),
  verified identical to scipy connected-components on the OPSD/ENTSOE
  slice (371/371). Ambiguous clusters, where one source aggregates a
  scheme under a single scheme-level EIC and the other splits it into
  stations carrying that same code (e.g. Alpine hydro), now fall through
  to Duke instead of being force-matched arbitrarily.
- Make label0/label1 explicit args; return a single matches frame.
- Add DirectMatcher class with run() to separate the deterministic step
  and host future direct matchers.
- Expand test/test_matching.py to 13 tests (1-to-many, hydro scheme,
  subset/superset, raw-string, DirectMatcher.run).
@MaykThewessen

Contributor Author

Thanks @FabianHofmann, that was the right thing to check. I ran the assessment on the real OPSD/ENTSOE slice (cached bulk data, frames as compare_two_datasets sees them after aggregate_units), modelling the EIC relationship as a bipartite graph (rows = nodes, shared code = edge) and looking at the connected components:

  • 410 EIC-linked clusters
  • 371 (90.5%) clean 1-to-1: 358 with identical sets, 13 subset/superset, 0 partial-overlap
  • 39 (9.5%) ambiguous (one row on one side, several on the other)

So the "one shared code implies same entry" assumption is unsafe in about 9.5% of cases. These are almost all Alpine hydro schemes: ENTSOE reports the aggregated scheme under one scheme-level EIC (Oberhasli Ag Kwo, 1307 MW, 12W-0000000031-O) while OPSD splits it into eight stations (Gental, Grimsel, Handeck, ...) that all inherit that same code. The old greedy code paired the 1307 MW aggregate with whichever station it hit first.

One subtlety on the exact-equality proposal: it would not fully fix this. In those clusters all eight OPSD stations carry exactly {12W-0000000031-O} and ENTSOE carries exactly {12W-0000000031-O}, so they are set-equal and you would still get eight equal-set candidates for one ENTSOE row. Exact equality would also drop the 13 genuine subset matches (e.g. Ballylumford has one unit code in OPSD and two in ENTSOE). What actually discriminates is 1-to-1 uniqueness of the linkage.
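A small sketch of why exact set equality does not discriminate here. The scheme code and the station names are the ones quoted above; the Ballylumford unit codes are placeholders, and `set_key` is a hypothetical helper that turns a code set into a sorted string key.

```python
import pandas as pd

SCHEME = "12W-0000000031-O"  # scheme-level code quoted in this thread

opsd = pd.DataFrame({
    "Name": ["Gental", "Grimsel", "Handeck", "Ballylumford"],
    # "B-UNIT-1"/"B-UNIT-2" are made-up stand-ins for the real unit codes.
    "EIC": [{SCHEME}, {SCHEME}, {SCHEME}, {"B-UNIT-1"}],
})
entsoe = pd.DataFrame({
    "Name": ["Oberhasli Ag Kwo", "Ballylumford"],
    "EIC": [{SCHEME}, {"B-UNIT-1", "B-UNIT-2"}],
})


def set_key(codes):
    """Canonical string for a code set, so equal sets compare equal."""
    return "|".join(sorted(codes))


left = opsd.assign(key=opsd["EIC"].map(set_key)).rename_axis("i0").reset_index()
right = entsoe.assign(key=entsoe["EIC"].map(set_key)).rename_axis("i1").reset_index()

# Exact-set-equality matching: join on the whole code set.
equal = left.merge(right, on="key", suffixes=("_opsd", "_entsoe"))
```

In this toy case `equal` still pairs all three stations with the single aggregate row, and the Ballylumford subset pair is lost because its two keys differ.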

So I reworked _match_by_eic to accept a pair only when its shared codes link the two rows to no other row on either side (degree 1 on both ends of the bipartite graph). I verified this is identical to taking the size-2 connected components: 371/371, zero difference. Ambiguous clusters fall through to Duke.

Pushed in 475ac2c. On the inline comments:

  • explode: done. The index is built with df["EIC"].explode() and everything stays in pandas Series/DataFrame.
  • pandas-style 1-to-1: the greedy set/loop is gone. The uniqueness test is two groupby(...).transform("size").eq(1) masks, no Python loops.
  • explicit label0/label1: done, and the return is now a single matches frame (the caller derives the matched indices from the columns).
  • DirectMatcher: added. It holds the list of direct matchers and exposes run(df0, df1, label0, label1) returning (matches, [residual0, residual1]), so compare_two_datasets runs the deterministic step then hands the residual to Duke. _match_by_eic stays a standalone function and is the default matcher, so a future name+country or project-id matcher just joins the list.
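For readers skimming the thread, the run-and-peel pattern described in the last bullet could look roughly like this. It is a sketch under assumptions, not the class shipped in 475ac2c: `DirectMatcherSketch`, `match_on_column` and the `ProjectID` column are hypothetical, and the toy matcher does no 1-to-1 check.

```python
import pandas as pd


def match_on_column(col):
    """Toy direct matcher: exact equality on one column (hypothetical)."""
    def matcher(df0, df1, label0, label1):
        left = df0[[col]].dropna().rename_axis(label0).reset_index()
        right = df1[[col]].dropna().rename_axis(label1).reset_index()
        return left.merge(right, on=col)[[label0, label1]]
    return matcher


class DirectMatcherSketch:
    """Apply deterministic matchers in order, peeling matched rows each time."""

    def __init__(self, matchers):
        self.matchers = list(matchers)

    def run(self, df0, df1, label0, label1):
        res0, res1 = df0, df1
        found = [pd.DataFrame(columns=[label0, label1])]
        for matcher in self.matchers:
            m = matcher(res0, res1, label0, label1)
            found.append(m)
            # Matched rows leave the residual before the next matcher runs.
            res0 = res0.drop(index=m[label0])
            res1 = res1.drop(index=m[label1])
        return pd.concat(found, ignore_index=True), [res0, res1]


df0 = pd.DataFrame({"Name": ["a", "b", "c", "d"],
                    "ProjectID": ["p1", None, "p3", None]})
df1 = pd.DataFrame({"Name": ["x", "y", "b", "z"],
                    "ProjectID": ["p3", "p1", None, None]})

dm = DirectMatcherSketch([match_on_column("ProjectID"), match_on_column("Name")])
matches, (residual0, residual1) = dm.run(df0, df1, "A", "B")
```

The second matcher only ever sees rows the first one left behind, which is the property that lets the residual be handed to Duke unchanged.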

Net effect on the integration numbers: deterministic matches are now 371 high-confidence 1-to-1 pairs for OPSD/ENTSOE; rows in ambiguous clusters are no longer force-matched and go to fuzzy matching instead. test/test_matching.py is up to 13 tests, including regression cases for the 1-to-many and hydro-scheme situations.

@MaykThewessen

Contributor Author

@FabianHofmann all the review points are now addressed and I've resolved the three inline threads:

  • "shared EIC ⇒ same plant" assumption: assessed in the comment above and since independently re-verified on the real OPSD/ENTSOE slice. Of 410 EIC-linked clusters, 371 (90.5%) are clean 1-to-1 and 39 (9.5%) are ambiguous (almost all Alpine hydro schemes, e.g. Oberhasli Ag Kwo aggregated under 12W-0000000031-O vs 8 OPSD stations). _match_by_eic now accepts only degree-1 (1-to-1) links, which I confirmed is exactly the set of size-2 connected components (371/371, identical index pairs). Ambiguous clusters fall through to Duke.
  • DirectMatcher class, explode() + pandas-style, explicit label0/label1: done in 475ac2c.
  • I also refreshed the PR description, which had drifted to the old greedy write-up ("431 matches / greedy 1-to-1 / 7 tests"); it now matches the shipped code.

test/test_matching.py is at 13 tests, including regressions for the 1-to-many and hydro-scheme cases. Would appreciate a re-review when you have a moment.

MaykThewessen and others added 5 commits July 2, 2026 22:27
Implements the architecture proposed in PyPSA#287: deterministic matching on
EIC (Energy Identification Code) runs before Duke fuzzy matching in
compare_two_datasets(). Plants sharing an EIC code are matched with
certainty, then removed from the Duke input so the fuzzy matcher only
handles the residual.

This provides a robustness guarantee against co-located plant confusion.
For example, in the Eemshaven harbour area (Netherlands):

  OPSD "Eemscentrale Ec"  (Natural Gas, 1929 MW, 6 EIC codes)
  ENTSOE "Eems"           (Natural Gas, 1931 MW, same 6 EIC codes)
  ENTSOE "Eemshaven"      (Hard Coal,   1580 MW, different EIC)
  ENTSOE "Eemshaven"      (Natural Gas, 1410 MW, different EIC)

EIC matching deterministically pairs the gas plants via their shared
codes, preventing any possibility of Duke merging them with the nearby
coal plant based on name/geo similarity.

Similarly, Borssele nuclear (OPSD: 492 MW, ENTSOE: 485 MW) is locked
to its correct cross-source pair via EIC 49W000000000054X, independent
of fuzzy name matching against nearby Borssele wind/coal entries.

Integration test: 431 deterministic EIC matches between OPSD and ENTSOE
(28% of ENTSOE), reducing Duke's workload for those pairs to zero.

Closes PyPSA#287

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@MaykThewessen MaykThewessen force-pushed the feature/eic-deterministic-matching branch from c611e46 to 65fda1c Compare July 2, 2026 20:28
@MaykThewessen

Contributor Author

Two updates.

1. Branch rebased onto current master (picks up the pandas-3 fixes from #294; the branch previously crashed under pandas 3 in best_matches/duke.py, which predated that merge). Linear history, test/test_matching.py 13 passed on pandas 3.0.3.

2. Quantified what the deterministic EIC pass buys, by scoring pure Duke fuzzy matching (current static Comparison.xml, no EIC pre-step, i.e. this repo before this PR) against the 371 unambiguous 1-to-1 EIC pairs from the assessment above, on the same OPSD/ENTSOE frames (3325/1521 rows post-aggregate_units):

| Metric | Pure Duke (fuzzy only) |
|---|---|
| Recall on the 371 EIC-confirmed pairs | 85.2% (316 recovered; 55 missed = 34.6 GW) |
| Precision on pairs where both rows carry EIC codes | 89.1% (41 of 376 EIC-inconsistent) |
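The two metrics can be reproduced from sets of index pairs along these lines. This is a sketch, not the evaluation script itself: the function name and the toy inputs are invented, and it assumes that a fuzzy pair counts as EIC-inconsistent when both rows carry codes but share none.

```python
def score_against_eic(fuzzy_pairs, eic_pairs, codes0, codes1):
    """Recall and precision of fuzzy pairs against EIC evidence (illustrative)."""
    fuzzy_pairs, eic_pairs = set(fuzzy_pairs), set(eic_pairs)

    # Recall: share of EIC-confirmed pairs that the fuzzy matcher also found.
    recall = len(fuzzy_pairs & eic_pairs) / len(eic_pairs)

    # Precision is only checkable where both rows carry at least one code.
    checkable = [(i, j) for i, j in fuzzy_pairs
                 if codes0.get(i) and codes1.get(j)]
    # Assumption: "consistent" means the two code sets overlap.
    consistent = [(i, j) for i, j in checkable if codes0[i] & codes1[j]]
    precision = len(consistent) / len(checkable)
    return recall, precision


# Toy inputs with placeholder codes.
codes0 = {0: {"A"}, 1: {"B"}, 2: {"C"}, 3: set()}
codes1 = {0: {"A"}, 1: {"B"}, 2: {"D"}, 3: {"C"}}
eic_pairs = {(0, 0), (1, 1), (2, 3)}
fuzzy_pairs = {(0, 0), (2, 2), (3, 1)}

recall, precision = score_against_eic(fuzzy_pairs, eic_pairs, codes0, codes1)

# The percentages in the table follow from the reported counts.
table_recall = round(316 / 371, 3)            # 0.852
table_precision = round((376 - 41) / 376, 3)  # 0.891
```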

Structure of the failures:

  • Misses concentrate in the UK (30 of 55), where ENTSO-E registers units under identifier-style names that defeat string similarity: West Burton = "Wbups", Connahs Quay = "Cnqps", Peterhead = "Pehe", Heysham 2 = "Heym2". These are multi-GW plants silently dropped from the matched set today.
  • Confirmed wrong matches are the co-located-plant failure mode this PR targets: Duke matches Fiddler's Ferry (1961 MW) to "Ferr" while its EIC says "Fidl", and South Humber Bank (1365 MW) to "Humr", which the EIC registry identifies as the adjacent VPI Immingham plant. The Eemshaven example from the PR description generalizes.
  • Duke's scores cannot flag any of this: correct matches (median score 0.9987) and EIC-contradicted matches (median 0.9971) overlap almost completely and saturate at the same ceiling, so no threshold separates them. A wrong 695 MW match (Gud / Gud Ludwigshafen Mitte) scores 0.998, above hundreds of correct pairs.

So the deterministic pass is not just a tie-breaker for lookalike neighbours: it recovers the ~15% of EIC-confirmed pairs (55 of 371, 34.6 GW) that fuzzy matching misses outright, and it removes multi-GW mismatches that are invisible to score-based filtering. Happy to share the evaluation script if useful.



Development

Successfully merging this pull request may close these issues.

Use EIC codes as deterministic matching key before Duke fuzzy matching

2 participants