
Commit b5aff34

Update analyze-baseball-stats-with-pandas-and-matplotlib.mdx
1 parent aa66d1b commit b5aff34

1 file changed

Lines changed: 89 additions & 53 deletions


projects/analyze-baseball-stats-with-pandas-and-matplotlib/analyze-baseball-stats-with-pandas-and-matplotlib.mdx

@@ -104,6 +104,7 @@ active_players = batting[batting['AB'] > 0]
104104
active_players.describe()
105105
```
106106

107+
![At least one at bat](https://raw.githubusercontent.com/codedex-io/projects/refs/heads/main/projects/analyze-baseball-stats-with-pandas-and-matplotlib/at-least-one-at-bat.png)
107108

108109
As expected, the average number of runs scored by a player shot up to `20.88` after filtering out all of the players who were always on the bench.
109110

@@ -117,38 +118,70 @@ If we use `.groupby()` we can find statistics for a particular player, year, or
117118
batting.groupby('playerID')['HR'].sum().sort_values(ascending=False)
118119
```
119120

121+
The output:
122+
123+
```text
124+
playerID
125+
bondsba01 762
126+
aaronha01 755
127+
ruthba01 714
128+
pujola101 703
129+
rodria101 696
130+
...
131+
abadijo01 0
132+
abadfe01 0
133+
zmiched01 0
134+
abbotky01 0
135+
zettlege01 0
136+
Name: HR, Length: 24011, dtype: int64
137+
```
120138

121-
You might recognize some familiar names here! These are the all time home run hitters: Barry Bonds, Hank Arron, Babe Ruth, and so on.
139+
You might recognize some familiar names here! These are the all-time home run leaders: Barry Bonds, Hank Aaron, Babe Ruth, and so on.
122140

123141
This line of code chains four operations together, so let’s break it down step by step:
124142

125143
- `.groupby('playerID')` groups all rows that belong to the same player. Since each row represents one season, this grouping effectively collects every season of a player’s career together. If you print this object by itself, Pandas will show a `DataFrameGroupBy object`, because it doesn’t yet know how you want to summarize the data.
126-
['HR'] selects only the home runs column from each group.
144+
- `['HR']` selects only the home runs column from each group.
127145
- `.sum()` is an aggregate function. It adds up the home runs within each player’s group, giving us each player’s career total. We could use other aggregate functions if we wanted other statistics. For example, if we wanted the average number of home runs per year, we could use `.mean()`.
128146
- `.sort_values(ascending=False)` sorts the results so the players with the most home runs appear first.
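
To see each link of the chain on its own, here is a minimal sketch that runs the same four steps on a tiny made-up table. The `toy` DataFrame and its player IDs are hypothetical stand-ins, not rows from the real Batting table:

```py
import pandas as pd

# Hypothetical toy data: one row per player per season (not real statistics)
toy = pd.DataFrame({
    'playerID': ['a01', 'a01', 'b01', 'b01', 'c01'],
    'yearID':   [2001, 2002, 2001, 2002, 2001],
    'HR':       [10, 20, 5, 0, 7],
})

grouped = toy.groupby('playerID')                 # DataFrameGroupBy: rows bucketed by player
hr_only = grouped['HR']                           # SeriesGroupBy: only the HR column per bucket
career_hr = hr_only.sum()                         # one total per player, indexed by playerID
ranked = career_hr.sort_values(ascending=False)   # biggest totals first

print(ranked)
```

Storing each intermediate result in its own variable is only for illustration; the one-line chained version does the same work.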
129147

130148
If we wanted to filter our dataset by some condition, we could do that before grouping. For example, if we wanted to see how dominant Babe Ruth was compared to his peers, we could filter for only the players who played during the same years he did. To do this, we'll need to find the years he started and ended his career:
131149

132150
```py
133151
# Filtering for only Babe Ruth's rows
134-
babe_ruth = batting[batting["playerID"] == "ruthba01"]
152+
babe_ruth = batting[batting['playerID'] == 'ruthba01']
135153

136154
# Finding the starting and ending years
137-
start_year = babe_ruth["yearID"].min()
138-
end_year = babe_ruth["yearID"].max()
155+
start_year = babe_ruth['yearID'].min()
156+
end_year = babe_ruth['yearID'].max()
139157
```
140158
Now we can use these years as a filter for our entire dataset and find the total number of home runs per player for just those years:
141159

142160
```py
143161
# Filtering for players that played the same years as Babe
144-
ruth_years = batting[(batting["yearID"] >= start_year) & (batting["yearID"] <= end_year)]
162+
ruth_years = batting[(batting['yearID'] >= start_year) & (batting['yearID'] <= end_year)]
145163

146164
# Finding the home run leaders during that time
147165
ruth_years.groupby('playerID')['HR'].sum().sort_values(ascending=False)
148166
```
149167

168+
The output:
169+
170+
```text
171+
playerID
172+
ruthba01 714
173+
gehrilo01 378
174+
foxxji01 302
175+
...
176+
zapusjo01 0
177+
zahnipa01 0
178+
zabelzi01 0
179+
youngch02 0
180+
youngch01 0
181+
Name: HR, Length: 4689, dtype: int64
182+
```
150183

151-
Wow, Babe Ruth had almost twice the number of home runs as the next best player!
184+
Wow, Babe Ruth had almost twice the number of home runs as the next best player! 😮
152185

153186
## Graphing Data By Year
154187

@@ -169,75 +202,76 @@ plt.figure(figsize=(10,6))
169202
plt.plot(avg_hr_by_year.index, avg_hr_by_year.values)
170203

171204
# Adding labels
172-
plt.title('Average Home Runs per Player by Year')
205+
plt.title('Average Home Runs per Player Over Time')
173206
plt.xlabel('Year')
174207
plt.ylabel('Avg Home Runs')
208+
175209
plt.show()
176210
```
177211

178212
It's interesting that you can see the abbreviated 2020 season in this graph! Teams played only 60 games that year, instead of the usual 162, due to COVID.
179213

180-
## Your Favorite Team Vs. The League
214+
## Your Favorite Team vs. The League
181215

182-
Another fun visualization that you can create is a comparison of your favorite team's stats to the rest of the league. I grew up in Denver, so my team is the Colorado Rockies, who have been laughably bad for the majority of my life 😅.
216+
Another fun visualization that you can create is a comparison of your favorite team's stats to the rest of the league. I grew up in Denver, so my team is the Colorado Rockies, who have been laughably bad for the majority of my life. 😅
183217

184218
That being said, there is an interesting phenomenon with the Rockies: it is very easy to hit home runs in Denver because of the altitude. As a result, even though the Rockies are generally a fairly weak team, they end up hitting more home runs than their competitors. Let's graph the Rockies' home runs every year against the league average!
185219

186220
First, we can group the data by year and team to find each team's home runs for a given year:
187221

188-
```
222+
```py
189223
team_hr_per_year = (
190-
batting
191-
.groupby(["yearID", "teamID"])["HR"]
192-
.sum()
193-
.reset_index()
224+
batting
225+
.groupby(['yearID', 'teamID'])['HR']
226+
.sum()
227+
.reset_index()
194228
)
195229
```
196230

197231
Then, to find the Rockies specifically, we can filter by `teamID`:
198232

199-
```
233+
```py
200234
rockies_hr = (
201-
team_hr_per_year
202-
[team_hr_per_year["teamID"] == "COL"]
203-
.set_index("yearID")["HR"]
235+
team_hr_per_year
236+
[team_hr_per_year['teamID'] == 'COL']
237+
.set_index('yearID')['HR']
204238
)
205239
```
206240

207241
We can also find the league average for each year. We'll filter out anything before 1993, since that is the year the Rockies played their first season:
208242

209-
```
243+
```py
210244
league_avg_hr = (
211-
team_hr_per_year
212-
.groupby("yearID")["HR"]
213-
.mean()
245+
team_hr_per_year
246+
.groupby('yearID')['HR']
247+
.mean()
214248
)
215-
league_avg_hr = league_avg_hr[league_avg_hr.index >= 1993]
216249

250+
league_avg_hr = league_avg_hr[league_avg_hr.index >= 1993]
217251
```
218252

219253
Finally, we can put this all together by graphing these two lines:
220254

221-
```
255+
```py
222256
plt.figure(figsize=(10, 6))
223257

224-
plt.plot(
225-
league_avg_hr.index,
226-
league_avg_hr.values,
227-
label="League Average",
228-
linestyle="--"
258+
plt.plot(league_avg_hr.index,
259+
league_avg_hr.values,
260+
label='League Average',
261+
linestyle='--'
229262
)
230263

231264
plt.plot(
232-
rockies_hr.index,
233-
rockies_hr.values,
234-
label="Colorado Rockies"
265+
rockies_hr.index,
266+
rockies_hr.values,
267+
label='Colorado Rockies'
235268
)
236269

237-
plt.title("Colorado Rockies Home Runs vs League Average")
238-
plt.xlabel("Year")
239-
plt.ylabel("Home Runs")
270+
plt.title('Colorado Rockies Home Runs vs League Average')
271+
plt.xlabel('Year')
272+
plt.ylabel('Home Runs')
240273
plt.legend()
274+
241275
plt.show()
242276
```
243277

@@ -248,9 +282,10 @@ As expected, the altitude in Denver has caused some pretty high home run numbers
248282
Finally, let's take on the challenge of replicating the work Billy Beane and Peter Brand did for the Oakland A's in _Moneyball_. While they almost certainly considered many statistics, they are most famous for finding players with a high **on-base percentage** (OBP) relative to their cost.
249283

250284
Let's try to identify some of those players! Here's how we'll tackle this problem:
251-
OBP is not currently listed in the Batting table, but we can calculate that value ourselves based on other information and add a new OBP column to the table.
252-
We can find a player's salary for a given year in the Salaries table. We'll need to use a join to combine the Batting and Salaries tables.
253-
Once OBP and salary are in the same table, we can find the ratio of those two values. If we sort by that ratio, we can find the players who have the best OBP for their cost!
285+
286+
- OBP is not currently listed in the Batting table, but we can calculate that value ourselves based on other information and add a new OBP column to the table.
287+
- We can find a player's salary for a given year in the Salaries table. We'll need to use a join to combine the Batting and Salaries tables.
288+
- Once OBP and salary are in the same table, we can find the ratio of those two values. If we sort by that ratio, we can find the players who have the best OBP for their cost!
254289

255290
### Calculating OBP
256291

@@ -262,21 +297,21 @@ Let's add a new column to our Batting table:
262297

263298
```py
264299
# Fill missing values to avoid NaN issues
265-
batting[["BB", "HBP", "SF"]] = batting[["BB", "HBP", "SF"]].fillna(0)
300+
batting[['BB', 'HBP', 'SF']] = batting[['BB', 'HBP', 'SF']].fillna(0)
266301

267302
# Create OBP column
268-
batting["OBP"] = (
269-
(batting["H"] + batting["BB"] + batting["HBP"]) /
270-
(batting["AB"] + batting["BB"] + batting["HBP"] + batting["SF"])
303+
batting['OBP'] = (
304+
(batting['H'] + batting['BB'] + batting['HBP']) /
305+
(batting['AB'] + batting['BB'] + batting['HBP'] + batting['SF'])
271306
)
272307

273308
# If we got NaN due to no plate appearances, then fill with 0.
274-
batting["OBP"] = batting["OBP"].fillna(0)
309+
batting['OBP'] = batting['OBP'].fillna(0)
275310
```
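
As a quick sanity check of the formula, here is a small sketch on two made-up rows: one regular hitter and one player with no plate appearances. The numbers are hypothetical and only meant to show why the final `.fillna(0)` is needed:

```py
import pandas as pd

# Hypothetical rows (made-up numbers, not real players)
toy = pd.DataFrame({
    'H':   [150, 0],
    'BB':  [50, 0],
    'HBP': [5, 0],
    'AB':  [500, 0],
    'SF':  [5, 0],
})

obp = (
    (toy['H'] + toy['BB'] + toy['HBP']) /
    (toy['AB'] + toy['BB'] + toy['HBP'] + toy['SF'])
)

print(obp)             # the second row is NaN because 0 / 0 is undefined
print(obp.fillna(0))   # the NaN is replaced with 0
```

The first row works out to 205 / 560, which lands inside the `.3` to `.4` range.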
276311
A decent OBP is around `.3` to `.4`, and if we do a quick scan of the data, we can see values around that range:
277312

278313

279-
#### Joining with the Salaries Table
314+
### Joining with the Salaries Table
280315

281316
Every row in the Batting table contains information about a player (`playerID`) in a given year (`yearID`) on a given team (`teamID`). We'll want to match all three of those pieces of information with the rows in the Salaries table.
282317

@@ -285,28 +320,29 @@ We'll also want to consider what to do in the case that we can't find a salary f
285320
We'll use a left join so that we get all of the rows from the first table (in this case, Batting) regardless of whether we find a matching player in the Salaries table.
286321

287322
```py
288-
289323
# Load data
290-
salaries = pd.read_csv("Salaries.csv")
324+
salaries = pd.read_csv('Salaries.csv')
291325

292326
# Merge salary data into batting data
293327
batting_with_salary = batting.merge(
294-
salaries,
295-
on=["playerID", "yearID", "teamID"],
296-
how="left"
328+
salaries,
329+
on=['playerID', 'yearID', 'teamID'],
330+
how='left'
297331
)
298332

299333
batting_with_salary.head()
300334
```
335+
301336
As expected, after performing this join, we're missing salary data for some of the more old-school players.
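
One way to check how many rows came back without a salary is sketched below on two tiny made-up tables. The `toy_batting` and `toy_salaries` frames are hypothetical stand-ins for Batting and Salaries, and `indicator=True` is an optional `merge` argument that the code above does not use:

```py
import pandas as pd

# Hypothetical stand-ins for the Batting and Salaries tables (made-up values)
toy_batting = pd.DataFrame({
    'playerID': ['old01', 'new01'],
    'yearID':   [1927, 2004],
    'teamID':   ['NYA', 'COL'],
    'HR':       [60, 30],
})
toy_salaries = pd.DataFrame({
    'playerID': ['new01'],
    'yearID':   [2004],
    'teamID':   ['COL'],
    'salary':   [1000000],
})

merged = toy_batting.merge(
    toy_salaries,
    on=['playerID', 'yearID', 'teamID'],
    how='left',
    indicator=True,   # adds a _merge column: 'both' or 'left_only'
)

print(merged[['playerID', 'salary', '_merge']])
print(merged['salary'].isna().sum())   # number of batting rows with no salary match
```

Because this is a left join, every batting row is kept; rows without a match simply get `NaN` in the `salary` column.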
302337

303338

304339

305-
#### Calculating Value
340+
### Calculating Value
306341

307342
We can now find the "value" of a player by calculating their OBP divided by their salary. There are a few edge cases to consider before finding our final values:
308-
We'll want to remove anyone with a salary or OBP of `0`
309-
We'll want to remove anyone with under 200 at bats. This will help us filter out any players who are outliers due to lack of playtime.
343+
344+
- We'll want to remove anyone with a salary or OBP of `0`.
345+
- We'll want to remove anyone with under 200 at bats. This will help us filter out any players who are outliers due to lack of playtime.
310346

311347
```py
312348
value_df = batting_with_salary[
