tutorial/ai-ml/machine-learning/programming-fundamentals/essential-libraries/pandas.mdx at b098e2c66e44a01782a3b783cf790e36f4a5f30e · codeharborhub/tutorial

title

Pandas: Data Manipulation

sidebar_label

Pandas

description

Mastering DataFrames, Series, and data cleaning techniques: the essential toolkit for exploratory data analysis (EDA).

1. Core Data Structures

Pandas is built on top of NumPy, but it adds labels (indices and column names) to the data.

graph TD
    Data[Pandas Data Structures] --> Series["Series (1D)"]
    Data --> DF["DataFrame (2D)"]
    
    Series --> S_Desc["A single column of data with an index"]
    DF --> DF_Desc["A table with rows and columns (The 'Excel' of Python)"]

The DataFrame

A DataFrame is essentially a dictionary of Series objects that share one index. It is the primary object you will use to store your features (X) and targets (y).

import pandas as pd

# Creating a DataFrame from a dictionary
df = pd.DataFrame({
    'Age': [25, 30, 35],
    'Salary': [50000, 60000, 70000]
})
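To make the "dictionary of Series" idea concrete, here is a minimal sketch that reuses the toy data above. Treating 'Salary' as the target is an assumption made only for illustration, and the names X and y are just the usual convention.

```python
import pandas as pd

# Same toy data as the example above
df = pd.DataFrame({
    'Age': [25, 30, 35],
    'Salary': [50000, 60000, 70000]
})

# Each column is a Series that shares the DataFrame's row index
age = df['Age']
print(type(age).__name__)
print(df.shape)

# Assumption for illustration: 'Salary' is the target, the rest are features
X = df.drop(columns=['Salary'])  # features, still a DataFrame
y = df['Salary']                 # target, a Series
```

Keeping X as a DataFrame and y as a Series matches the shapes most ML libraries expect for features and targets.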

2. Loading and Inspecting Data

Pandas can read many common formats, including CSV, Excel, JSON, SQL tables, and Parquet. Once loaded, we use specific methods to "peek" into the data.

pd.read_csv('data.csv'): The most common way to load data.
df.head(): View the first 5 rows.
df.info(): Check data types, non-null counts, and memory usage.
df.describe(): Get statistical summaries (count, mean, std, min, quartiles, max) for numeric columns.
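The methods listed above can be tried without a file on disk. This sketch assumes a small, made-up CSV held in a string (a stand-in for 'data.csv'); the column names and values are hypothetical.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for 'data.csv'
csv_text = """Age,Salary,City
25,50000,Paris
30,60000,London
35,70000,Paris
40,,London
"""

# read_csv accepts a path or any file-like object
df = pd.read_csv(io.StringIO(csv_text))

print(df.head())         # first 5 rows (only 4 exist here)
df.info()                # dtypes, non-null counts, memory usage
summary = df.describe()  # numeric columns only by default
print(summary)
```

The empty Salary cell is read as NaN, so df.info() reports one fewer non-null value for that column.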

3. Selecting and Filtering Data

In ML, we often need to separate our target variable from our features. Plain bracket indexing handles single columns and boolean filters, while .loc (label-based) and .iloc (integer position-based) select rows and columns together.

# Select all rows, but only the 'Salary' column
target = df['Salary']

# Select rows where Age is greater than 30
seniors = df[df['Age'] > 30]
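The snippet above uses plain bracket indexing. For comparison, here is a small sketch of .loc and .iloc on the same toy data; the row labels 'a', 'b', 'c' are hypothetical and added only to make the label/position difference visible.

```python
import pandas as pd

df = pd.DataFrame(
    {'Age': [25, 30, 35], 'Salary': [50000, 60000, 70000]},
    index=['a', 'b', 'c'],  # hypothetical row labels
)

# .loc is label-based: the end label 'b' is included in the slice
by_label = df.loc['a':'b', ['Age']]

# .iloc is position-based: the end position 2 is excluded
by_position = df.iloc[0:2, 0:1]

# .loc also accepts a boolean mask plus a column label
senior_salaries = df.loc[df['Age'] > 30, 'Salary']
print(senior_salaries)
```

Note the slicing difference: label slices with .loc include both endpoints, while .iloc follows Python's usual half-open ranges.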

4. Data Cleaning: The "ML Pre-processing" Step

Before a model can learn, the data must be "clean." Pandas provides high-level functions for the most common cleaning tasks:

A. Handling Missing Values

Most ML algorithms cannot handle NaN (Not a Number) values.

df.isnull().sum(): Count missing values.
df.dropna(): Remove rows with missing values.
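df.fillna(df.mean(numeric_only=True)): Fill missing numeric values with the column mean (imputation). Without numeric_only=True, recent pandas versions raise an error when the DataFrame contains text columns.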
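The three calls above can be compared side by side on a small frame. The data below is made up for illustration, and mean imputation is only one possible strategy.

```python
import numpy as np
import pandas as pd

# Hypothetical data with gaps in both numeric and text columns
df = pd.DataFrame({
    'Age': [25, np.nan, 35, 40],
    'Salary': [50000, 60000, np.nan, 70000],
    'City': ['Paris', 'London', None, 'Paris'],
})

# Count missing values per column
missing_per_column = df.isnull().sum()

# Option 1: drop every row that has at least one missing value
dropped = df.dropna()

# Option 2: fill numeric gaps with each column's mean;
# the text column 'City' is left untouched
filled = df.fillna(df.mean(numeric_only=True))

print(missing_per_column)
print(len(df), len(dropped))
print(filled)
```

Dropping rows is simple but discards data (half the rows here), while mean imputation keeps every row at the cost of slightly distorting the column's distribution.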

B. Handling Categorical Data

ML models require numbers. We use Pandas to convert categorical text into numeric columns.

pd.get_dummies(df['City'], dtype=int): One-Hot Encoding (turns "City" into multiple 0/1 columns). Without dtype=int, recent pandas versions return True/False columns instead.
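As a rough sketch of how the encoded columns rejoin the rest of the data, assume a hypothetical frame with 'City' and 'Salary' columns. The prefix argument and the concat step are illustrative choices, not the only way to do this.

```python
import pandas as pd

# Hypothetical data with one categorical column
df = pd.DataFrame({
    'City': ['Paris', 'London', 'Paris'],
    'Salary': [50000, 60000, 70000],
})

# One column per distinct city; dtype=int gives 0/1 instead of True/False
city_dummies = pd.get_dummies(df['City'], prefix='City', dtype=int)

# Replace the original text column with its encoded version
encoded = pd.concat([df.drop(columns=['City']), city_dummies], axis=1)
print(encoded)
```

Passing the whole DataFrame with pd.get_dummies(df, columns=['City'], dtype=int) produces a comparable result in one call.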

5. Grouping and Aggregation

Commonly used in Exploratory Data Analysis (EDA) to find patterns.

# Calculate the average salary per city (assumes df has a 'City' column)
avg_sal = df.groupby('City')['Salary'].mean()

flowchart LR
    A[Raw DataFrame] --> B["Split (by Category)"]
    B --> C["Apply (Mean/Sum)"]
    C --> D["Combine (New Table)"]
    style B fill:#e1f5fe,stroke:#01579b,color:#333
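The split, apply, combine flow can be run end to end on a small frame. The data here is hypothetical, and the agg call with several statistics is an extra illustration beyond the single mean shown earlier.

```python
import pandas as pd

# Hypothetical data: two cities, two salaries each
df = pd.DataFrame({
    'City': ['Paris', 'London', 'Paris', 'London'],
    'Salary': [50000, 60000, 70000, 80000],
})

# Split by 'City', apply mean to 'Salary', combine into a Series
avg_sal = df.groupby('City')['Salary'].mean()

# Several aggregations at once produce a DataFrame indexed by city
stats = df.groupby('City')['Salary'].agg(['mean', 'max', 'count'])

print(avg_sal)
print(stats)
```

The result is indexed by the group key, so a single city's average can be read with avg_sal['Paris'].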

6. Vectorized String Operations

Pandas allows you to perform operations on entire text columns without writing loops, which is essential for Natural Language Processing (NLP).

# Lowercase all text in a 'Reviews' column
df['Reviews'] = df['Reviews'].str.lower()
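Beyond lowercasing, the .str accessor chains several common text-cleaning steps. The sketch below assumes a hypothetical 'Reviews' column with made-up values, including one missing entry.

```python
import pandas as pd

# Hypothetical review text, including a missing entry
df = pd.DataFrame({'Reviews': ['  Great product!  ', 'BAD service', None]})

# Chain string methods: trim whitespace, then lowercase
cleaned = df['Reviews'].str.strip().str.lower()

# Flag reviews that mention a keyword; na=False treats missing text as no match
has_great = cleaned.str.contains('great', na=False)

# Count whitespace-separated tokens per review (missing stays missing)
word_counts = cleaned.str.split().str.len()

print(cleaned)
print(has_great)
print(word_counts)
```

Missing values pass through .str methods without raising errors, which is why the na argument matters when a clean boolean mask is needed.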

References for More Details

Pandas Official "10 Minutes to Pandas": Best for a quick syntax cheat sheet.
Kaggle - Data Cleaning Course: Best for practical, hands-on experience with messy real-world data.

Pandas helps us clean the data, but "seeing is believing." To truly understand our dataset, we need to visualize the relationships between variables.
