projects/analyze-baseball-stats-with-pandas-and-matplotlib/analyze-baseball-stats-with-pandas-and-matplotlib.mdx

 As expected, the average number of runs scored by a player shot up to `20.88` after filtering out all of the players who were always on the bench.

@@ -117,38 +118,70 @@ If we use `.groupby()` we can find statistics for a particular player, year, or
-You might recognize some familiar names here! These are the alltime home run hitters: Barry Bonds, Hank Aaron, Babe Ruth, and so on.
+You might recognize some familiar names here! These are the all-time home run leaders: Barry Bonds, Hank Aaron, Babe Ruth, and so on.

 This line of code chains four operations together, so let’s break it down step by step:

 - `.groupby('playerID')` groups all rows that belong to the same player. Since each row represents one season, this grouping effectively collects every season of a player’s career together. If you print this object by itself, Pandas will show a `DataFrameGroupBy` object, because it doesn’t yet know how you want to summarize the data.
-['HR'] selects only the home runs column from each group.
+- `['HR']` selects only the home runs column from each group.
 - `.sum()` is an aggregate function. It adds up the home runs within each player’s group, giving us each player’s career total. We could use other aggregate functions if we wanted other statistics. For example, if we wanted the average number of home runs per year, we could use `.mean()`.
 - `.sort_values(ascending=False)` sorts the results so the players with the most home runs appear first.
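To see what each link in the chain produces, here is a minimal sketch that runs the same four operations one at a time. The `batting_demo` table and its player IDs are made up for illustration; only the `playerID`, `yearID`, and `HR` column names come from the walkthrough above.

```python
import pandas as pd

# Toy stand-in for the Batting table: one row per player per season.
# The player IDs and numbers are invented for this example.
batting_demo = pd.DataFrame({
    'playerID': ['aaa01', 'aaa01', 'bbb02', 'bbb02', 'ccc03'],
    'yearID':   [2001, 2002, 2001, 2002, 2002],
    'HR':       [10, 20, 5, 40, 0],
})

grouped = batting_demo.groupby('playerID')   # DataFrameGroupBy: nothing summarized yet
hr_by_player = grouped['HR']                 # only the HR column within each group
career_hr = hr_by_player.sum()               # one career total per playerID
career_hr_sorted = career_hr.sort_values(ascending=False)

print(career_hr_sorted)

# Swapping the aggregate function gives a different statistic,
# such as the average number of home runs per season.
avg_hr_per_season = batting_demo.groupby('playerID')['HR'].mean()
print(avg_hr_per_season)
```

Splitting the chain into named steps like this is only for inspection; the one-line chained version does the same work.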

 If we wanted to filter our dataset by some condition, we could do that before grouping. For example, if we wanted to see how dominant Babe Ruth was compared to his peers, we could filter for only players who played during the same years as him. To do this, we'll need to find what years he started and ended his career:
 plt.title('Average Home Runs per Player Over Time')
 plt.xlabel('Year')
 plt.ylabel('Avg Home Runs')
+
 plt.show()
 ```

 It's interesting that you can see the abbreviated 2020 season in this graph! Teams played only 60 games that year, instead of the usual 162, due to COVID.

-## Your Favorite Team Vs. The League
+## Your Favorite Team vs. The League

-Another fun visualization that you can create is a comparison of your favorite team's stats to the rest of the league. I grew up in Denver, so my team is the Colorado Rockies, who have been laughably bad for the majority of my life 😅.
+Another fun visualization that you can create is a comparison of your favorite team's stats to the rest of the league. I grew up in Denver, so my team is the Colorado Rockies, who have been laughably bad for the majority of my life. 😅

 That being said, there is an interesting phenomenon with the Rockies: it is very easy to hit home runs in Denver because of the altitude. As a result, even though the Rockies are generally a fairly weak team, they end up hitting more home runs than their competitors. Let's graph the Rockies' home runs every year against the league average!

 First, we can group the data by year and team to find each team's home runs for a given year:

-```
+```py
 team_hr_per_year = (
-    batting
-    .groupby(["yearID", "teamID"])["HR"]
-    .sum()
-    .reset_index()
+  batting
+  .groupby(['yearID', 'teamID'])['HR']
+  .sum()
+  .reset_index()
 )
 ```

 Then, to find the Rockies specifically, we can filter by `teamID`:

-```
+```py
 rockies_hr = (
-    team_hr_per_year
-    [team_hr_per_year["teamID"] == "COL"]
-    .set_index("yearID")["HR"]
+  team_hr_per_year
+  [team_hr_per_year['teamID'] == 'COL']
+  .set_index('yearID')['HR']
 )
 ```

 We can also find the league average for each year. We'll filter out anything before 1993, since that was the Rockies' first season:
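As a rough sketch of this step on made-up data, assuming the league average means the mean of every team's yearly home run total (Rockies included), it could look something like this. The `team_hr_demo` table, the `FIRST_ROCKIES_YEAR` constant, and the `league_avg_demo` name are all hypothetical.

```python
import pandas as pd

# Invented team totals in the same shape as team_hr_per_year:
# one row per (yearID, teamID) with that team's home runs.
team_hr_demo = pd.DataFrame({
    'yearID': [1992, 1992, 1993, 1993, 1994, 1994],
    'teamID': ['AAA', 'BBB', 'AAA', 'COL', 'AAA', 'COL'],
    'HR':     [100, 120, 110, 142, 90, 125],
})

FIRST_ROCKIES_YEAR = 1993  # hypothetical constant for the cutoff year

league_avg_demo = (
    team_hr_demo[team_hr_demo['yearID'] >= FIRST_ROCKIES_YEAR]  # filter first
    .groupby('yearID')['HR']                                    # then group by year
    .mean()                                                     # average across teams
)

print(league_avg_demo)
```

Filtering before grouping keeps the pre-1993 seasons out of the result entirely, so the two lines on the final graph cover the same years.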
 Finally, we can put this all together by graphing these two lines:

-```
+```py
 plt.figure(figsize=(10, 6))

-plt.plot(
-    league_avg_hr.index,
-    league_avg_hr.values,
-    label="League Average",
-    linestyle="--"
+plt.plot(league_avg_hr.index,
+  league_avg_hr.values,
+  label='League Average',
+  linestyle='--'
 )

 plt.plot(
-    rockies_hr.index,
-    rockies_hr.values,
-    label="Colorado Rockies"
+  rockies_hr.index,
+  rockies_hr.values,
+  label='Colorado Rockies'
 )

-plt.title("Colorado Rockies Home Runs vs League Average")
-plt.xlabel("Year")
-plt.ylabel("Home Runs")
+plt.title('Colorado Rockies Home Runs vs League Average')
+plt.xlabel('Year')
+plt.ylabel('Home Runs')
 plt.legend()
+
 plt.show()
 ```

@@ -248,9 +282,10 @@ As expected, the altitude in Denver has caused some pretty high home run numbers
 Finally, let's take on the challenge of replicating the work Billy Beane and Peter Brand did for the Oakland A's in _Moneyball_. While they almost certainly considered many statistics, they are most famous for finding players with a high **on-base percentage** (OBP) relative to their cost.

 Let's try to identify some of those players! Here's how we'll tackle this problem:
-OBP is not currently listed in the Batting table, but we can calculate that value ourselves based on other information and add a new OBP column to the table.
-We can find a player's salary for a given year in the Salaries table. We'll need to use a join to combine the Batting and Salaries tables.
-Once OBP and salary are in the same table, we can find the ratio of those two values. If we sort by that ratio, we can find the players who have the best OBP for their cost!
+
+- OBP is not currently listed in the Batting table, but we can calculate that value ourselves based on other information and add a new OBP column to the table.
+- We can find a player's salary for a given year in the Salaries table. We'll need to use a join to combine the Batting and Salaries tables.
+- Once OBP and salary are in the same table, we can find the ratio of those two values. If we sort by that ratio, we can find the players who have the best OBP for their cost!
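For the first step in that plan, here is a small sketch of the standard OBP formula, (H + BB + HBP) / (AB + BB + HBP + SF), applied to invented numbers. It assumes the table has `H`, `BB`, `HBP`, `AB`, and `SF` columns; the `obp_demo` table and the intermediate variable names are hypothetical.

```python
import pandas as pd

# Made-up season lines; the third player never came to the plate.
obp_demo = pd.DataFrame({
    'playerID': ['aaa01', 'bbb02', 'ccc03'],
    'AB':  [500, 200, 0],   # at-bats
    'H':   [150, 40, 0],    # hits
    'BB':  [60, 10, 0],     # walks
    'HBP': [5, 0, 0],       # hit by pitch
    'SF':  [5, 0, 0],       # sacrifice flies
})

times_on_base = obp_demo['H'] + obp_demo['BB'] + obp_demo['HBP']
obp_denominator = obp_demo['AB'] + obp_demo['BB'] + obp_demo['HBP'] + obp_demo['SF']

# 0 / 0 yields NaN for players with no plate appearances, so fill those with 0.
obp_demo['OBP'] = (times_on_base / obp_denominator).fillna(0)

print(obp_demo[['playerID', 'OBP']])
```

Because the arithmetic works on whole columns at once, the same few lines scale from three rows to the full table.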

 ### Calculating OBP

@@ -262,21 +297,21 @@ Let's add a new column to our Batting table:
 # If we got NaN due to no plate appearances, then fill with 0.
-batting["OBP"] = batting["OBP"].fillna(0)
+batting['OBP'] = batting['OBP'].fillna(0)
 ```
 A decent OBP is around `.3` to `.4`, and if we do a quick scan of the data, we can see values around that range:


-####Joining with the Salaries Table
+### Joining with the Salaries Table

 Every row in the Batting table contains information about a player (`playerID`) in a given year (`yearID`) on a given team (`teamID`). We'll want to match all three of those pieces of information with the rows in the Salaries table.

@@ -285,28 +320,29 @@ We'll also want to consider what to do in the case that we can't find a salary f
 We'll use a left join so that we get all of the rows from the first table (in this case, Batting) regardless of whether we find a matching player in the Salaries table.

 ```py
-
 # Load data
-salaries = pd.read_csv("Salaries.csv")
+salaries = pd.read_csv('Salaries.csv')

 # Merge salary data into batting data
 batting_with_salary = batting.merge(
-    salaries,
-    on=["playerID", "yearID", "teamID"],
-    how="left"
+  salaries,
+  on=['playerID', 'yearID', 'teamID'],
+  how='left'
 )

 batting_with_salary.head()
 ```
+
 As expected, after performing this join, we're missing salary data about some of the more old-school players.
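One way to check how many rows came through the left join without a salary is to count the missing values afterwards. The sketch below uses two invented tables and assumes the salary column is named `salary`; the `indicator=True` option is an extra not used in the merge above, and it adds a `_merge` column showing where each row came from.

```python
import pandas as pd

# Invented rows: one player from long before salary records, two modern ones.
batting_demo = pd.DataFrame({
    'playerID': ['old01', 'new01', 'new02'],
    'yearID':   [1927, 2001, 2001],
    'teamID':   ['NYA', 'OAK', 'COL'],
    'OBP':      [0.486, 0.350, 0.310],
})
salaries_demo = pd.DataFrame({
    'playerID': ['new01', 'new02'],
    'yearID':   [2001, 2001],
    'teamID':   ['OAK', 'COL'],
    'salary':   [500000, 2000000],
})

merged = batting_demo.merge(
    salaries_demo,
    on=['playerID', 'yearID', 'teamID'],
    how='left',
    indicator=True,  # _merge is 'both' for matches, 'left_only' otherwise
)

missing = merged['salary'].isna()
print(missing.sum(), 'of', len(merged), 'rows have no salary')
print(merged.loc[missing, ['playerID', 'yearID', '_merge']])
```

A left join never drops rows from the first table, so the merged table keeps all three batting rows and the unmatched one simply gets `NaN` for `salary`.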



-####Calculating Value
+### Calculating Value

 We can now find the "value" of a player by calculating their OBP divided by their salary. There are a few edge cases to consider before finding our final values:
-We'll want to remove anyone with a salary or OBP of `0`
-We'll want to remove anyone with under 200 at bats. This will help us filter out any players who are outliers due to lack of playtime.
+
+- We'll want to remove anyone with a salary or OBP of `0`.
+- We'll want to remove anyone with fewer than 200 at-bats. This will help us filter out any players who are outliers due to a lack of playing time.
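Those filters and the ratio could be sketched like this on made-up data. The `value_demo` table, the `MIN_AT_BATS` constant, and the `value` column name are all hypothetical, and the sketch assumes `AB`, `OBP`, and `salary` columns are present after the join.

```python
import pandas as pd

# Invented rows covering each edge case from the list above.
value_demo = pd.DataFrame({
    'playerID': ['aaa01', 'bbb02', 'ccc03', 'ddd04', 'eee05'],
    'AB':       [550, 480, 20, 300, 400],
    'OBP':      [0.380, 0.340, 0.600, 0.330, 0.000],
    'salary':   [4000000, 500000, 300000, float('nan'), 1000000],
})

MIN_AT_BATS = 200  # hypothetical name for the playing-time cutoff

qualified = value_demo[
    (value_demo['salary'] > 0)          # NaN > 0 is False, so missing salaries drop out too
    & (value_demo['OBP'] > 0)
    & (value_demo['AB'] >= MIN_AT_BATS)
].copy()

# OBP per dollar of salary: higher means more on-base production for the cost.
qualified['value'] = qualified['OBP'] / qualified['salary']
best_value = qualified.sort_values('value', ascending=False)

print(best_value[['playerID', 'OBP', 'salary', 'value']])
```

The raw ratio is a tiny number because salaries are in whole dollars, so only the ordering matters here, not the magnitude.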