
Commit 7f904fc

Merge pull request #166 from LaunchCodeEducation/audit-data-manipulation
Audit for data manipulation
2 parents 92a6a80 + 81bf43e commit 7f904fc

File tree: 4 files changed, +77 -11 lines changed

4 files changed

+77
-11
lines changed

content/data-manipulation/_index.md

Lines changed: 2 additions & 1 deletion

@@ -9,7 +9,7 @@ hidden = false

 ## Learning Objectives
 After completing all of the content in this chapter, you should be able to do the following:
-1. Aggregate data accross multiple columns (mean, median, mode)
+1. Aggregate data across multiple columns (mean, median, mode)
 1. Append data: stack or concatenate multiple datasets with the `.concat` function
 1. Recode and map values within a column to new values by providing conditional formatting
 1. Group data together with the `.groupby` function

@@ -28,6 +28,7 @@ After completing all of the content in this chapter, you should be able to do th
 ### Reshaping Tables
 1. `.melt()`
 1. `.concat()`
+1. `.merge()`
 1. `.sort_values()`
 1. wide format
 1. long format

content/data-manipulation/reading/aggregation/_index.md

Lines changed: 7 additions & 3 deletions

@@ -11,7 +11,7 @@ This reading, and following readings, will provide examples from the `titanic.cs

 ## Groupby

-The `.groupby()` function groups data together from one or more columns. As we group the data together, it forms a new **GroupBy** object. The offical [pandas documenation](https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html) states that a "group by" accomplshes the following:
+The `.groupby()` function groups data together from one or more columns. As we group the data together, it forms a new **GroupBy** object. The official [pandas documentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html) states that a "group by" accomplishes the following:
 1. Splitting: Split the data based on the criteria provided.
 1. Applying: Provide an applicable function to the groups that were split.
 1. Combining: Combine the results from the function into a new data structure.

@@ -43,7 +43,7 @@ grouping_variable = your_data.groupby(["column_one", "column_two", "etc.."])
 ```

 {{% notice blue Example "rocket" %}}
-Applying an aggregate function to multipled grouped columns can also be accomplished with method chaining. The following image uses columns from the titanic dataset as an example.
+Applying an aggregate function to multiple grouped columns can also be accomplished with method chaining. The following image uses columns from the titanic dataset as an example.

 ![Creating a new groupby object from the columns "embark_town" and "alone" and applying the sum aggregate function](pictures/grouped-titanic.png?classes=border)

@@ -65,6 +65,10 @@ data.agg(['mean', 'median', 'mode'])
 ```
 {{% /notice %}}

+{{% notice orange Warning "rocket" %}}
+Note that the `mode()` function can return multiple values per column if there are multiple modes (values that appear with equal frequency). This may result in a DataFrame with more rows than expected. If you need only one mode value, you may want to use `mode()[0]` or apply mode to specific columns individually.
+{{% /notice %}}
+
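The warning added above can be illustrated with a minimal runnable sketch; the DataFrame and column name below are invented for illustration, not taken from titanic.csv:

```python
import pandas as pd

# Toy column in which 25 and 100 are tied for most frequent value
data = pd.DataFrame({"fare": [25, 25, 100, 100, 50]})

# mode() returns every tied value, so this Series has two rows
all_modes = data["fare"].mode()
print(all_modes.tolist())  # [25, 100]

# Indexing with [0] keeps only the first (smallest) mode
single_mode = data["fare"].mode()[0]
print(single_mode)  # 25
```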
 ### Aggregation Using a Dictionary

 pandas also allows the ability to provide a dictionary with columns as a key and aggregate functions as an associated value.

@@ -79,7 +83,7 @@ aggregate_dictionary_example = {
 dictionary_aggregate = data.agg(aggregate_dictionary_example)
 ```

-This dictionary object has now become a tempate for the aggregations we want to preform. However, on it's own, it does nothing. Once passed to the agg() method, it will pick out the specific location of data we want to examine. Making a subset table.
+This dictionary object has now become a template for the aggregations we want to perform. However, on its own, it does nothing. Once passed to the `.agg()` method, it will pick out the specific locations of data we want to examine, making a subset table.
 {{% /notice %}}

 ## Groupby and Multiple Aggregations
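The dictionary-based aggregation described in the hunk above can be sketched end to end; the DataFrame and column names here are invented for illustration:

```python
import pandas as pd

data = pd.DataFrame({
    "age": [22, 38, 26, 35],
    "fare": [7.25, 71.28, 7.92, 53.10],
})

# Keys are column names; values are the aggregate function to apply to each
aggregate_dictionary_example = {
    "age": "mean",
    "fare": "max",
}

# Passing the dictionary to .agg() produces one result per column
dictionary_aggregate = data.agg(aggregate_dictionary_example)
print(dictionary_aggregate["age"])   # 30.25
print(dictionary_aggregate["fare"])  # 71.28
```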

content/data-manipulation/reading/recoding-data/_index.md

Lines changed: 5 additions & 5 deletions

@@ -46,14 +46,14 @@ data["survived"] = data["survived"].replace(to_replace={0: False, 1: True})
 Creating a function to aggregate data or create new columns is another common practice used when analyzing data. Pandas utilizes the `.apply()` method to execute a function on a pandas Series or DataFrame.

 {{% notice blue Example "rocket" %}}
-Suppose you wanted to know how many survivors under the age of 20 are still alive from the titanic dataset:
+Suppose you wanted to know how many survivors age 20 and under are still alive from the titanic dataset:

 ```python
 import pandas as pd

 data = pd.read_csv("titanic.csv")

-def under_age_21_survivors(data):
+def age_20_and_under_survivors(data):
     age = data['age']
     alive = data['alive']

@@ -62,8 +62,8 @@ def under_age_21_survivors(data):
     else:
         return False

-data["under_21_survivors"] = data.apply(under_age_21_survivors, axis=1)
-print(data["under_21_survivors"].value_counts())
+data["age_20_and_under_survivors"] = data.apply(age_20_and_under_survivors, axis=1)
+print(data["age_20_and_under_survivors"].value_counts())
 ```

 **Output**

@@ -75,5 +75,5 @@ print(data["under_21_survivors"].value_counts())

 When recoding your data there are some things you should think about:
 1. Does the original data need to remain intact?
-1. What data tyes should be replaced with new values, and what type of data should the new value be?
+1. What data types should be replaced with new values, and what type of data should the new value be?
 1. Would a function be useful for repetitive tasks and manipulation?
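The `.replace()` call that appears in the hunk context above can be run standalone; the four-row DataFrame below is a toy stand-in for titanic.csv:

```python
import pandas as pd

# Toy stand-in for the titanic "survived" column (0 = no, 1 = yes)
data = pd.DataFrame({"survived": [0, 1, 1, 0]})

# Recode the integer codes to booleans, as in the reading
data["survived"] = data["survived"].replace(to_replace={0: False, 1: True})

# The column now holds False/True flags instead of 0/1 codes
print(data["survived"].tolist())
```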

content/data-manipulation/reading/reshaping-tables/_index.md

Lines changed: 63 additions & 2 deletions

@@ -48,7 +48,7 @@ movie_genre_dataframe = pd.concat([movie_dataframe, genre_rating_dataframe])
 Note in the output image above the inclusion of the `axis` parameter when printing the dataframe a second time. The axis parameter specifies that the two DataFrames should be joined along the columns instead of rows, providing a cleaner dataset.
 {{% /notice %}}

-In the lesson on exploring data with python we covered how to create a DataFrame using the `.concat()` method by providing two Series as parameters. The `.concat` function can alse be used to add a Series within an existing DataFrame!
+In the lesson on exploring data with python we covered how to create a DataFrame using the `.concat()` method by providing two Series as parameters. The `.concat` function can also be used to add a Series within an existing DataFrame!

 {{% notice blue Example "rocket" %}}
 ```python

@@ -64,6 +64,67 @@ concat_series_dataframe = pd.concat([example_dataframe, example_series], axis=1)
 ```
 {{% /notice %}}

+## Merging DataFrames
+
+The `.merge()` function is used to combine two DataFrames based on common columns or indices, similar to SQL joins. Unlike `.concat()` which simply stacks DataFrames, `.merge()` intelligently combines rows based on matching values in specified columns.
+
+### Common Merge Types
+
+There are four main types of merges:
+1. `inner`: Returns only rows with matching values in both DataFrames (default)
+1. `left`: Returns all rows from the left DataFrame and matching rows from the right
+1. `right`: Returns all rows from the right DataFrame and matching rows from the left
+1. `outer`: Returns all rows from both DataFrames, filling in missing values with NaN
+
+### Syntax
+
+```python
+merged_dataframe = pd.merge(left_dataframe, right_dataframe, on="column_name", how="inner")
+```
+
+{{% notice blue Example "rocket" %}}
+```python
+import pandas as pd
+
+# Create two DataFrames with a common column
+passengers = pd.DataFrame({
+    'passenger_id': [1, 2, 3, 4],
+    'name': ['John', 'Jane', 'Bob', 'Alice'],
+    'age': [25, 30, 35, 28]
+})
+
+tickets = pd.DataFrame({
+    'passenger_id': [1, 2, 3, 5],
+    'ticket_class': ['First', 'Second', 'Third', 'First'],
+    'fare': [100, 50, 25, 100]
+})
+
+# Inner merge - only passengers with tickets
+inner_merged = pd.merge(passengers, tickets, on='passenger_id', how='inner')
+
+# Left merge - all passengers, with ticket info if available
+left_merged = pd.merge(passengers, tickets, on='passenger_id', how='left')
+```
+
+The `inner` merge will return only the 3 passengers (IDs 1, 2, 3) that exist in both DataFrames. The `left` merge will return all 4 passengers, with NaN values for the ticket information of passenger 4 who doesn't have a ticket.
+{{% /notice %}}
+
+### Merging on Multiple Columns
+
+You can merge on multiple columns by passing a list to the `on` parameter:
+
+```python
+merged_dataframe = pd.merge(df1, df2, on=['column1', 'column2'], how='inner')
+```
+
+### Merging with Different Column Names
+
+If the columns to merge on have different names in each DataFrame, use `left_on` and `right_on`:
+
+```python
+merged_dataframe = pd.merge(df1, df2, left_on='id', right_on='passenger_id', how='inner')
+```
+
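The one-line `left_on`/`right_on` snippet above can be expanded into a self-contained sketch; both DataFrames are invented for illustration:

```python
import pandas as pd

# Key column is named "id" on the left and "passenger_id" on the right
passengers = pd.DataFrame({"id": [1, 2, 3], "name": ["John", "Jane", "Bob"]})
tickets = pd.DataFrame({"passenger_id": [1, 2, 4], "fare": [100, 50, 25]})

# Only ids 1 and 2 appear on both sides, so the inner merge keeps two rows;
# both key columns are retained in the result
merged = pd.merge(passengers, tickets, left_on="id", right_on="passenger_id", how="inner")
print(merged.columns.tolist())  # ['id', 'name', 'passenger_id', 'fare']
print(len(merged))              # 2
```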
 ## Sorting Values

 The `.sort_values()` function allows you to reshape data to your specific use case. The below parameters are some of the more common:

@@ -78,7 +139,7 @@ The above parameters are not the only ones available when using the `.sort_value
 ### Syntax

 ```python
-data.sort_values(by="column_name", axis=1, ascending=True)
+data.sort_values(by="column_name", ascending=True)
 ```

 {{% notice blue Example %}}
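The corrected `.sort_values()` call above can be exercised on a toy DataFrame (column values invented):

```python
import pandas as pd

data = pd.DataFrame({"fare": [50, 25, 100]})

# Sort rows by the "fare" column, smallest value first
sorted_data = data.sort_values(by="fare", ascending=True)
print(sorted_data["fare"].tolist())  # [25, 50, 100]
```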
