Add EIC-first deterministic matching before Duke fuzzy matching #289
MaykThewessen wants to merge 5 commits into
|
Two developments that raise the value of this PR:
Happy to rebase/extend if there's interest in moving this forward. |
ec4364a to 25d6a4e
|
Gentle nudge: still mergeable, implements #287, and ships with tests in |
FabianHofmann
left a comment
thanks @MaykThewessen, I think the general idea is good and it makes sense to introduce a direct matching phase that circumvents the fuzzy Duke matching. some feedback below
|
@MaykThewessen I forgot to mention: this approach fully relies on the assumption that powerplant entries are the same in df0 and df1 as soon as they share one common EIC code. can you please make an assessment whether this assumption is fair? if not, we should make the direct matching stricter and match only on exactly equal EIC entries, comparing either the (sorted) lists of strings or the strings themselves. only in case of an exact match would the values be mapped. |
Addresses review feedback on PyPSA#289.
- Rewrite _match_by_eic in pandas (explode + merge). Accept a pair only when its shared EIC codes link the two rows to no other row on either side (degree-1 on both ends of the shared-code bipartite graph), verified identical to scipy connected-components on the OPSD/ENTSOE slice (371/371). Ambiguous clusters, where one source aggregates a scheme under a single scheme-level EIC and the other splits it into stations carrying that same code (e.g. Alpine hydro), now fall through to Duke instead of being force-matched arbitrarily.
- Make label0/label1 explicit args; return a single matches frame.
- Add DirectMatcher class with run() to separate the deterministic step and host future direct matchers.
- Expand test/test_matching.py to 13 tests (1-to-many, hydro scheme, subset/superset, raw-string, DirectMatcher.run).
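For orientation, a DirectMatcher along the lines this commit describes might look roughly like the sketch below. It is not the code from the PR: the class name DirectMatcherSketch, the toy matcher match_on_key, the matcher call signature and the sample rows are all assumptions made for illustration. Only the overall shape (an ordered list of deterministic matchers, and run() returning the matches plus the two residual frames) follows the PR description.

```python
import pandas as pd


def match_on_key(df0, df1, label0, label1, key="EIC"):
    """Toy matcher (hypothetical): pair rows whose scalar `key` is unique on both sides."""
    left = df0[[key]].dropna().drop_duplicates(key, keep=False)
    right = df1[[key]].dropna().drop_duplicates(key, keep=False)
    merged = left.rename_axis(label0).reset_index().merge(
        right.rename_axis(label1).reset_index(), on=key
    )
    return merged[[label0, label1]]


class DirectMatcherSketch:
    """Applies deterministic matchers in order and peels matched rows off the residual."""

    def __init__(self, matchers=None):
        # Assumed default; the PR's default list is [_match_by_eic].
        self.matchers = list(matchers) if matchers is not None else [match_on_key]

    def run(self, df0, df1, label0, label1):
        found = []
        for matcher in self.matchers:
            m = matcher(df0, df1, label0, label1)
            found.append(m)
            # Later matchers (and the fuzzy step) only see unmatched rows.
            df0 = df0.drop(index=m[label0])
            df1 = df1.drop(index=m[label1])
        if found:
            matches = pd.concat(found, ignore_index=True)
        else:
            matches = pd.DataFrame(columns=[label0, label1])
        return matches, [df0, df1]


# Made-up rows with placeholder codes, not real EIC values.
df0 = pd.DataFrame({"Name": ["Plant A", "Plant B", "Plant C"], "EIC": ["X1", "X2", None]})
df1 = pd.DataFrame({"Name": ["Plant a", "Plant b", "Plant z"], "EIC": ["X1", "X2", "X9"]})

matches, (residual0, residual1) = DirectMatcherSketch().run(df0, df1, "OPSD", "ENTSOE")
print(matches)
print(residual0.index.tolist(), residual1.index.tolist())
```

In this toy run the two rows with a shared code are paired and only the remaining row on each side would be handed to the fuzzy step. A further matcher can be appended to the list without touching that step.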
|
Thanks @FabianHofmann, that was the right thing to check. I ran the assessment on the real OPSD/ENTSOE slice (cached bulk data, frames as
So the "one shared code implies same entry" assumption is unsafe in about 9.5% of cases. These are almost all Alpine hydro schemes: ENTSOE reports the aggregated scheme under one scheme-level EIC (
One subtlety on the exact-equality proposal: it would not fully fix this. In those clusters all eight OPSD stations carry exactly
So I reworked
Pushed in 475ac2c. On the inline comments:
Net effect on the integration numbers: deterministic matches are now 371 high-confidence 1-to-1 pairs for OPSD/ENTSOE; rows in ambiguous clusters are no longer force-matched and go to fuzzy matching instead. |
|
@FabianHofmann all the review points are now addressed and I've resolved the three inline threads:
|
Implements the architecture proposed in PyPSA#287: deterministic matching on EIC (Energy Identification Code) runs before Duke fuzzy matching in compare_two_datasets(). Plants sharing an EIC code are matched with certainty, then removed from the Duke input so the fuzzy matcher only handles the residual.
This provides a robustness guarantee against co-located plant confusion. For example, in the Eemshaven harbour area (Netherlands):
- OPSD "Eemscentrale Ec" (Natural Gas, 1929 MW, 6 EIC codes)
- ENTSOE "Eems" (Natural Gas, 1931 MW, same 6 EIC codes)
- ENTSOE "Eemshaven" (Hard Coal, 1580 MW, different EIC)
- ENTSOE "Eemshaven" (Natural Gas, 1410 MW, different EIC)
EIC matching deterministically pairs the gas plants via their shared codes, preventing any possibility of Duke merging them with the nearby coal plant based on name/geo similarity.
Similarly, Borssele nuclear (OPSD: 492 MW, ENTSOE: 485 MW) is locked to its correct cross-source pair via EIC 49W000000000054X, independent of fuzzy name matching against nearby Borssele wind/coal entries.
Integration test: 431 deterministic EIC matches between OPSD and ENTSOE (28% of ENTSOE), reducing Duke's workload for those pairs to zero.
Closes PyPSA#287
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
for more information, see https://pre-commit.ci
c611e46 to 65fda1c
|
Two updates.
1. Branch rebased onto current
2. Quantified what the deterministic EIC pass buys, by scoring pure Duke fuzzy matching (current static
Structure of the failures:
So the deterministic pass is not just a tie-breaker for lookalike neighbours: it recovers ~15% of confirmed-matchable capacity that fuzzy matching misses outright, and removes multi-GW mismatches that are invisible to score-based filtering. Happy to share the evaluation script if useful. |
Summary
Implements the architecture proposed in #287: deterministic matching on EIC (Energy Identification Code) runs before Duke fuzzy matching in compare_two_datasets(). Plants that are linked one-to-one by a shared EIC code are matched with certainty and removed from the Duke input, so the fuzzy matcher only handles the residual.
Motivation
EIC codes are unique European plant identifiers already loaded from ENTSOE, OPSD, and GEM, but previously unused in the matching decision. Using them as a first-pass exact match provides a robustness guarantee against co-located plant confusion.
Example: Eemshaven harbour, Netherlands
49W000000000066Q
49W000000000119V
EIC matching deterministically pairs the gas plants via their shared codes, preventing Duke from merging them with the nearby coal plant based on name/geo similarity (~0.86 JaroWinkler, ~400 m apart).
Example: Borssele, Netherlands
Borssele nuclear (OPSD: 492 MW, ENTSOE: 485 MW) is locked to its correct cross-source pair via EIC 49W000000000054X, independent of fuzzy name matching against nearby Borssele wind/coal/solar entries.
Is "shared EIC code ⇒ same plant" safe? (review follow-up)
@FabianHofmann raised the key question: can two entries be assumed identical as soon as they share one EIC code? Assessed on the real OPSD/ENTSOE slice by modelling the EIC relationship as a bipartite graph (rows = nodes, shared code = edge) and inspecting the connected components:
So a single shared code does not prove identity in ~9.5% of cases. These are almost all Alpine hydro schemes: ENTSOE reports the aggregated scheme under one scheme-level EIC (e.g. "Oberhasli Ag Kwo", 1307 MW, 12W-0000000031-O) while OPSD splits it into eight stations that all inherit that same code.
The fix is therefore to accept a pair only when its shared codes link the two rows to no other row on either side (degree 1 on both ends of the bipartite graph, which is exactly a size-2 connected component; verified identical on the slice: 371/371). Ambiguous clusters fall through to Duke.
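The claimed equivalence (degree 1 on both ends is the same as a size-2 connected component) can be checked on a toy graph. The sketch below is plain Python with a small union-find standing in for scipy's connected-components routine. The function names and row labels are hypothetical, and edges are assumed to be distinct row pairs, so two rows sharing several codes count as one edge.

```python
from collections import defaultdict


def degree_one_pairs(edges):
    """Pairs whose two endpoints each have exactly one distinct partner."""
    edges = set(edges)
    left, right = defaultdict(set), defaultdict(set)
    for a, b in edges:
        left[a].add(b)
        right[b].add(a)
    return {(a, b) for a, b in edges if len(left[a]) == 1 and len(right[b]) == 1}


def size_two_components(edges):
    """Pairs that form a connected component of exactly two nodes."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    # Tag nodes with their side so equal labels on both sides never collide.
    for a, b in edges:
        parent[find(("L", a))] = find(("R", b))

    members = defaultdict(set)
    for node in list(parent):
        members[find(node)].add(node)
    return {(a, b) for a, b in set(edges) if len(members[find(("L", a))]) == 2}


# Toy edges: one clean pair, one scheme split into two stations, one clean pair.
edges = [
    ("opsd0", "entsoe0"),
    ("opsd1", "entsoe1"),
    ("opsd2", "entsoe1"),
    ("opsd3", "entsoe2"),
]

by_degree = degree_one_pairs(edges)
by_component = size_two_components(edges)
print(sorted(by_degree))
print(by_degree == by_component)
```

Both functions keep the two clean pairs and drop the cluster in which one row on the right is linked to two rows on the left.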
Note on the exact-set-equality alternative discussed in review: it would not fix this (the eight split stations each carry exactly the same single code as the aggregate, so they remain set-equal and still 1-to-many) and it would additionally drop the 13 genuine subset matches (e.g. one unit code in OPSD vs two in ENTSOE). 1-to-1 uniqueness is the discriminator that actually works.
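The 1-to-1 uniqueness rule can be sketched in pandas as follows. This is a minimal illustration under stated assumptions, not the PR's _match_by_eic: the function name match_by_eic_sketch, the "EIC" column holding a list of codes per row, and the placeholder codes (A1, H0, U1, ...) are all made up for the example.

```python
import pandas as pd


def match_by_eic_sketch(df0, df1, label0="OPSD", label1="ENTSOE"):
    """Return row pairs linked 1-to-1 by shared EIC codes (illustrative only)."""

    def long_form(df, label):
        # One row per (row id, code); empty lists and missing values drop out.
        s = df["EIC"].explode().dropna()
        return pd.DataFrame({label: s.index, "code": s.to_numpy()})

    pairs = (
        long_form(df0, label0)
        .merge(long_form(df1, label1), on="code")[[label0, label1]]
        .drop_duplicates()  # rows sharing several codes still form one pair
    )
    # Number of distinct partners of each row in the shared-code graph.
    deg0 = pairs.groupby(label0)[label1].transform("size")
    deg1 = pairs.groupby(label1)[label0].transform("size")
    return pairs[deg0.eq(1) & deg1.eq(1)].reset_index(drop=True)


# Hypothetical rows with placeholder codes (not real EIC values).
df0 = pd.DataFrame(
    {
        "Name": ["Gas plant", "Hydro station 1", "Hydro station 2", "Unit", "No code"],
        "EIC": [["A1", "A2"], ["H0"], ["H0"], ["U1"], []],
    }
)
df1 = pd.DataFrame(
    {
        "Name": ["Gas plant", "Hydro scheme", "Unit", "No code"],
        "EIC": [["A1", "A2"], ["H0"], ["U1", "U2"], None],
    }
)

matches = match_by_eic_sketch(df0, df1)
print(matches)
```

On these rows the multi-code gas pair and the subset/superset unit pair are kept, while the two hydro stations that share the scheme's single code are left out of the result and would go on to fuzzy matching. A set-equality test would instead have accepted both hydro stations and rejected the unit pair.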
Implementation
- _match_by_eic() in matching.py: explode() the per-row EIC collections, merge on the code to find shared-code row pairs, then keep only pairs that are degree-1 on both sides via two groupby(...).transform("size").eq(1) masks. Pure pandas, no Python loops or set operations.
- DirectMatcher class: holds an ordered list of deterministic matchers (default [_match_by_eic]); run() applies them, peels matched rows from the residual, and returns (matches, [residual0, residual1]). This keeps the deterministic phase cleanly separated from Duke and extensible (a name+country or project-id matcher can join the list without touching the fuzzy path).
- compare_two_datasets(): runs DirectMatcher().run(...) first, then Duke on the residual, then concatenates both match sets.
Results (OPSD ↔ ENTSOE)
Test plan
- pytest test/test_matching.py: 13 tests pass (basic matching, missing EIC column, empty/NaN/None sets, raw-string EIC, multi-code 1-to-1, subset/superset 1-to-1, and regression cases for the 1-to-many and Alpine-hydro-scheme ambiguities, plus DirectMatcher residual + pluggability).
- _match_by_eic output verified identical to the size-2 connected components.
- powerplants(update=True) to verify end-to-end output.
Closes #287
🤖 Generated with Claude Code