
Commit 7f904fc

Merge pull request #166 from LaunchCodeEducation/audit-data-manipulation
Audit for data manipulation
2 parents 92a6a80 + 81bf43e commit 7f904fc

File tree: 4 files changed, +77 -11 lines changed

4 files changed

+77
-11
lines changed

content/data-manipulation/_index.md

Lines changed: 2 additions & 1 deletion

@@ -9,7 +9,7 @@ hidden = false

 ## Learning Objectives
 After completing all of the content in this chapter, you should be able to do the following:
-1. Aggregate data accross multiple columns (mean, median, mode)
+1. Aggregate data across multiple columns (mean, median, mode)
 1. Append data: stack or concatenate multiple datasets with the `.concat` function
 1. Recode and map values within a column to new values by providing conditional formatting
 1. Group data together with the `.groupby` function

@@ -28,6 +28,7 @@ After completing all of the content in this chapter, you should be able to do th
 ### Reshaping Tables
 1. `.melt()`
 1. `.concat()`
+1. `.merge()`
 1. `.sort_values()`
 1. wide format
 1. long format

content/data-manipulation/reading/aggregation/_index.md

Lines changed: 7 additions & 3 deletions

@@ -11,7 +11,7 @@ This reading, and following readings, will provide examples from the `titanic.cs

 ## Groupby

-The `.groupby()` function groups data together from one or more columns. As we group the data together, it forms a new **GroupBy** object. The offical [pandas documenation](https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html) states that a "group by" accomplshes the following:
+The `.groupby()` function groups data together from one or more columns. As we group the data together, it forms a new **GroupBy** object. The official [pandas documentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html) states that a "group by" accomplishes the following:
 1. Splitting: Split the data based on the criteria provided.
 1. Applying: Provide an applicable function to the groups that were split.
 1. Combining: Combine the results from the function into a new data structure.

@@ -43,7 +43,7 @@ grouping_variable = your_data.groupby(["column_one", "column_two", "etc.."])
 ```

 {{% notice blue Example "rocket" %}}
-Applying an aggregate function to multipled grouped columns can also be accomplished with method chaining. The following image uses columns from the titanic dataset as an example.
+Applying an aggregate function to multiple grouped columns can also be accomplished with method chaining. The following image uses columns from the titanic dataset as an example.

 ![Creating a new groupby object from the columns "embark_town" and "alone" and applying the sum aggregate function](pictures/grouped-titanic.png?classes=border)

@@ -65,6 +65,10 @@ data.agg(['mean', 'median', 'mode'])
 ```
 {{% /notice %}}

+{{% notice orange Warning "rocket" %}}
+Note that the `mode()` function can return multiple values per column if there are multiple modes (values that appear with equal frequency). This may result in a DataFrame with more rows than expected. If you need only one mode value, you may want to use `mode()[0]` or apply mode to specific columns individually.
+{{% /notice %}}
+
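The warning added above can be illustrated with a minimal runnable sketch; the DataFrame and column name below are invented for illustration, not taken from titanic.csv:

```python
import pandas as pd

# Toy column in which 25 and 100 are tied for most frequent value
data = pd.DataFrame({"fare": [25, 25, 100, 100, 50]})

# mode() returns every tied value, so this Series has two rows
all_modes = data["fare"].mode()
print(all_modes.tolist())  # [25, 100]

# Indexing with [0] keeps only the first (smallest) mode
single_mode = data["fare"].mode()[0]
print(single_mode)  # 25
```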
 ### Aggregation Using a Dictionary

 pandas also allows the ability to provide a dictionary with columns as a key and aggregate functions as an associated value.

@@ -79,7 +83,7 @@ aggregate_dictionary_example = {
 dictionary_aggregate = data.agg(aggregate_dictionary_example)
 ```

-This dictionary object has now become a tempate for the aggregations we want to preform. However, on it's own, it does nothing. Once passed to the agg() method, it will pick out the specific location of data we want to examine. Making a subset table.
+This dictionary object has now become a template for the aggregations we want to perform. However, on its own, it does nothing. Once passed to the `.agg()` method, it will pick out the specific locations of data we want to examine, making a subset table.
 {{% /notice %}}

 ## Groupby and Multiple Aggregations
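The dictionary-based aggregation described in the hunk above can be sketched end to end; the DataFrame and column names here are invented for illustration:

```python
import pandas as pd

data = pd.DataFrame({
    "age": [22, 38, 26, 35],
    "fare": [7.25, 71.28, 7.92, 53.10],
})

# Keys are column names; values are the aggregate function to apply to each
aggregate_dictionary_example = {
    "age": "mean",
    "fare": "max",
}

# Passing the dictionary to .agg() produces one result per column
dictionary_aggregate = data.agg(aggregate_dictionary_example)
print(dictionary_aggregate["age"])   # 30.25
print(dictionary_aggregate["fare"])  # 71.28
```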

content/data-manipulation/reading/recoding-data/_index.md

Lines changed: 5 additions & 5 deletions

@@ -46,14 +46,14 @@ data["survived"] = data["survived"].replace(to_replace={0: False, 1: True})
 Creating a function to aggregate data or create new columns is another common practice used when analyzing data. Pandas utilizes the `.apply()` method to execute a function on a pandas Series or DataFrame.

 {{% notice blue Example "rocket" %}}
-Suppose you wanted to know how many survivors under the age of 20 are still alive from the titanic dataset:
+Suppose you wanted to know how many survivors age 20 and under are still alive from the titanic dataset:

 ```python
 import pandas as pd

 data = pd.read_csv("titanic.csv")

-def under_age_21_survivors(data):
+def age_20_and_under_survivors(data):
     age = data['age']
     alive = data['alive']

@@ -62,8 +62,8 @@ def under_age_21_survivors(data):
     else:
         return False

-data["under_21_survivors"] = data.apply(under_age_21_survivors, axis=1)
-print(data["under_21_survivors"].value_counts())
+data["age_20_and_under_survivors"] = data.apply(age_20_and_under_survivors, axis=1)
+print(data["age_20_and_under_survivors"].value_counts())
 ```

 **Output**

@@ -75,5 +75,5 @@ print(data["under_21_survivors"].value_counts())

 When recoding your data there are some things you should think about:
 1. Does the original data need to remain intact?
-1. What data tyes should be replaced with new values, and what type of data should the new value be?
+1. What data types should be replaced with new values, and what type of data should the new value be?
 1. Would a function be useful for repetitive tasks and manipulation?
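The `.replace()` call that appears in the hunk context above can be run standalone; the four-row DataFrame below is a toy stand-in for titanic.csv:

```python
import pandas as pd

# Toy stand-in for the titanic "survived" column (0 = no, 1 = yes)
data = pd.DataFrame({"survived": [0, 1, 1, 0]})

# Recode the integer codes to booleans, as in the reading
data["survived"] = data["survived"].replace(to_replace={0: False, 1: True})

# The column now holds False/True flags instead of 0/1 codes
print(data["survived"].tolist())
```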

content/data-manipulation/reading/reshaping-tables/_index.md

Lines changed: 63 additions & 2 deletions

@@ -48,7 +48,7 @@ movie_genre_dataframe = pd.concat([movie_dataframe, genre_rating_dataframe])
 Note in the output image above the inclusion of the `axis` parameter when printing the dataframe a second time. The axis parameter specifies that the two DataFrames should be joined along the columns instead of rows, providing a cleaner dataset.
 {{% /notice %}}

-In the lesson on exploring data with python we covered how to create a DataFrame using the `.concat()` method by providing two Series as parameters. The `.concat` function can alse be used to add a Series within an existing DataFrame!
+In the lesson on exploring data with python we covered how to create a DataFrame using the `.concat()` method by providing two Series as parameters. The `.concat` function can also be used to add a Series within an existing DataFrame!

 {{% notice blue Example "rocket" %}}
 ```python

@@ -64,6 +64,67 @@ concat_series_dataframe = pd.concat([example_dataframe, example_series], axis=1)
 ```
 {{% /notice %}}

+## Merging DataFrames
+
+The `.merge()` function is used to combine two DataFrames based on common columns or indices, similar to SQL joins. Unlike `.concat()` which simply stacks DataFrames, `.merge()` intelligently combines rows based on matching values in specified columns.
+
+### Common Merge Types
+
+There are four main types of merges:
+1. `inner`: Returns only rows with matching values in both DataFrames (default)
+1. `left`: Returns all rows from the left DataFrame and matching rows from the right
+1. `right`: Returns all rows from the right DataFrame and matching rows from the left
+1. `outer`: Returns all rows from both DataFrames, filling in missing values with NaN
+
+### Syntax
+
+```python
+merged_dataframe = pd.merge(left_dataframe, right_dataframe, on="column_name", how="inner")
+```
+
+{{% notice blue Example "rocket" %}}
+```python
+import pandas as pd
+
+# Create two DataFrames with a common column
+passengers = pd.DataFrame({
+    'passenger_id': [1, 2, 3, 4],
+    'name': ['John', 'Jane', 'Bob', 'Alice'],
+    'age': [25, 30, 35, 28]
+})
+
+tickets = pd.DataFrame({
+    'passenger_id': [1, 2, 3, 5],
+    'ticket_class': ['First', 'Second', 'Third', 'First'],
+    'fare': [100, 50, 25, 100]
+})
+
+# Inner merge - only passengers with tickets
+inner_merged = pd.merge(passengers, tickets, on='passenger_id', how='inner')
+
+# Left merge - all passengers, with ticket info if available
+left_merged = pd.merge(passengers, tickets, on='passenger_id', how='left')
+```
+
+The `inner` merge will return only the 3 passengers (IDs 1, 2, 3) that exist in both DataFrames. The `left` merge will return all 4 passengers, with NaN values for the ticket information of passenger 4 who doesn't have a ticket.
+{{% /notice %}}
+
+### Merging on Multiple Columns
+
+You can merge on multiple columns by passing a list to the `on` parameter:
+
+```python
+merged_dataframe = pd.merge(df1, df2, on=['column1', 'column2'], how='inner')
+```
+
+### Merging with Different Column Names
+
+If the columns to merge on have different names in each DataFrame, use `left_on` and `right_on`:
+
+```python
+merged_dataframe = pd.merge(df1, df2, left_on='id', right_on='passenger_id', how='inner')
+```
+
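The one-line `left_on`/`right_on` snippet above can be expanded into a self-contained sketch; both DataFrames are invented for illustration:

```python
import pandas as pd

# Key column is named "id" on the left and "passenger_id" on the right
passengers = pd.DataFrame({"id": [1, 2, 3], "name": ["John", "Jane", "Bob"]})
tickets = pd.DataFrame({"passenger_id": [1, 2, 4], "fare": [100, 50, 25]})

# Only ids 1 and 2 appear on both sides, so the inner merge keeps two rows;
# both key columns are retained in the result
merged = pd.merge(passengers, tickets, left_on="id", right_on="passenger_id", how="inner")
print(merged.columns.tolist())  # ['id', 'name', 'passenger_id', 'fare']
print(len(merged))              # 2
```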
 ## Sorting Values

 The `.sort_values()` function allows you to reshape data to your specific use case. The below parameters are some of the more common:

@@ -78,7 +139,7 @@ The above parameters are not the only ones available when using the `.sort_value
 ### Syntax

 ```python
-data.sort_values(by="column_name", axis=1, ascending=True)
+data.sort_values(by="column_name", ascending=True)
 ```

 {{% notice blue Example %}}
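The corrected `.sort_values()` call above can be exercised on a toy DataFrame (column values invented):

```python
import pandas as pd

data = pd.DataFrame({"fare": [50, 25, 100]})

# Sort rows by the "fare" column, smallest value first
sorted_data = data.sort_values(by="fare", ascending=True)
print(sorted_data["fare"].tolist())  # [25, 50, 100]
```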
