Describe the bug
In soccerdata==1.8.8, WhoScored.read_events(output_fmt="events") returns a dataframe that includes related_event_id, but it does not expose the event's own identifier in the formatted output.
This makes related_event_id difficult to use, because users cannot easily link a related event back to the original event row within the same dataframe.
The expected behavior is that the formatted events output should preserve the event identifier, either as an index level or as a regular column.
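To illustrate the linking that an exposed identifier would allow, here is a minimal sketch on a synthetic dataframe. It does not call soccerdata; the rows and values are made up, and it assumes the identifier is exposed as a regular column named event_id.

```python
import pandas as pd

# Synthetic rows; column names follow this issue, values are hypothetical.
events = pd.DataFrame(
    {
        "game_id": [1, 1, 1],
        "event_id": [10, 11, 12],  # assumed to be exposed as a regular column
        "type": ["Pass", "Foul", "Card"],
        "related_event_id": [None, None, 11],
    }
)

# Keep only rows that point at another event, then join each one
# to the row it refers to within the same game.
linked = (
    events.dropna(subset=["related_event_id"])
    .astype({"related_event_id": "int64"})
    .merge(
        events[["game_id", "event_id", "type"]],
        left_on=["game_id", "related_event_id"],
        right_on=["game_id", "event_id"],
        suffixes=("", "_related"),
    )
)

print(linked[["event_id", "type", "related_event_id", "type_related"]])
```

Without an event_id column (or index level) on the right-hand side, this join has nothing to match related_event_id against.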
Python version: 3.12.13
Affected scrapers
This affects the following scrapers:
ClubElo
ESPN
FBref
FiveThirtyEight
Match History
SoFIFA
Understat
WhoScored
Code example
A minimal code example that reproduces the problem in soccerdata==1.8.8:
No exception is raised. The issue is that the formatted dataframe exposes related_event_id, but does not expose the event's own identifier, which makes related_event_id difficult or impossible to use for linking related events within the same formatted output.
Additional context
I want to be explicit about the version behavior:
In soccerdata==1.8.8, WhoScored.read_events() works, but the formatted events output is missing the event identifier.
In soccerdata==1.9.0, I can no longer validate this in the same workflow, because WhoScored currently fails earlier: read_schedule() raises JSONDecodeError, a separate regression reported in #940 ([WhoScored] read_schedule() fails with JSONDecodeError in 1.9.0).
So this issue is specifically about the formatted read_events(output_fmt="events") output schema.
I am attaching the Jupyter notebook that I used to reproduce the 1.8.8 behavior (Guía SoccerData (1.8.8).ipynb). In that notebook, read_events(match_id=1485184) returns a dataframe indexed by league, season, and game, and the rendered dataframe shows game_id and related_event_id, but not the event's own identifier.
This may be a mismatch between the implementation and the documentation.
Local workaround
I found a local workaround and am sharing it here for reference.
The raw WhoScored event data appears to contain both eventId and id. After standardize_colnames, these become event_id and id.
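As a rough illustration of that renaming, the sketch below converts camelCase keys to snake_case. The helper to_snake_case is hypothetical and only approximates the behavior described here; it is not soccerdata's standardize_colnames. The raw key relatedEventId is also an assumption, inferred from the related_event_id column in the formatted output.

```python
import re


def to_snake_case(name: str) -> str:
    """Approximate camelCase -> snake_case conversion (hypothetical helper)."""
    # Insert an underscore before any capital that follows a lowercase
    # letter or digit, then lowercase the whole name.
    return re.sub(r"(?<=[a-z0-9])([A-Z])", r"_\1", name).lower()


# "relatedEventId" is an assumed raw key name, not confirmed by the issue.
raw_keys = ["eventId", "id", "relatedEventId"]
renamed = {key: to_snake_case(key) for key in raw_keys}
print(renamed)
```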
There is a design question here: should the formatted dataframe expose event_id, id, or both? There is also a second question: should the chosen identifier be part of the index, or should it remain a regular column?
In the local patch I tested, I used event_id as an additional index level and kept id as a regular column.
The reasoning was:
event_id comes from WhoScored's eventId;
related_event_id appears to refer to this event-level identifier;
using event_id in the index makes it possible to link related_event_id back to an event row within the same match;
keeping id as a column avoids dropping the other identifier from the formatted output.
I understand that the maintainers may prefer a different schema, for example keeping the current index and exposing event_id as a regular column instead. The main point is that the formatted read_events(output_fmt="events") output should preserve an event identifier so that related_event_id can be used reliably.
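The schema described above can be sketched on synthetic data as follows. This is not the attached patch: the frame, its values, and the helper related_row are hypothetical, and the only assumption carried over from the issue is the index layout (league, season, game) with event_id appended as a fourth level and id kept as a column.

```python
import pandas as pd

# Synthetic stand-in for the formatted output; all values are made up.
formatted = pd.DataFrame(
    {
        "league": ["L1", "L1", "L1"],
        "season": ["2526", "2526", "2526"],
        "game": ["A-B", "A-B", "A-B"],
        "event_id": [10, 11, 12],
        "id": [9001, 9002, 9003],
        "type": ["Pass", "Foul", "Card"],
        "related_event_id": [None, None, 11],
    }
).set_index(["league", "season", "game"])

# Append event_id as an extra index level; id stays a regular column.
events = formatted.set_index("event_id", append=True)


def related_row(df, key):
    """Return the row that `key` points at via related_event_id, or None."""
    related_id = df.loc[key, "related_event_id"]
    if pd.isna(related_id):
        return None
    # Same league/season/game, different event_id.
    return df.loc[key[:3] + (int(related_id),)]


card_key = ("L1", "2526", "A-B", 12)
print(related_row(events, card_key)["type"])
```

With event_id as a regular column instead, the same lookup would be a merge on the game and event_id columns, so either schema makes related_event_id usable.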
I am attaching the modified whoscored.py file for reference (whoscored_issue_941_local_patch.py).