| id | pandas-introduction |
|---|---|
| title | Introduction to Pandas |
| sidebar_label | Pandas |
| description | Learn the basics of the Pandas Python library, including Series, DataFrame, data input/output, and basic data analysis, to kickstart your ML/DS workflow. |
| sidebar_position | 20 |
| tags | |
| slug | /python/pandas-introduction |
Pandas is one of the most essential libraries in the Python data ecosystem.
It provides rich, high-level data structures and tools designed for fast and flexible data manipulation, analysis, and visualization.
If you're working in data science, machine learning, or analytics, Pandas is your foundation for cleaning, transforming, and understanding data.
It is built on top of NumPy and integrates with other libraries such as Matplotlib, Seaborn, and Scikit-learn.
Working with raw data in Python used to mean juggling lists, dictionaries, and loops.
Pandas simplifies all that with two data structures, the Series and the DataFrame, which behave much like a spreadsheet column and a spreadsheet or SQL table, respectively.
Some reasons Pandas is so popular:
- Handles large in-memory datasets efficiently through vectorized operations.
- Provides built-in methods for aggregation, cleaning, and reshaping.
- Easily reads and writes data from multiple sources like CSV, Excel, JSON, and SQL.
- Integrates tightly with visualization and machine learning libraries.
If Pandas isn’t already installed, you can add it via pip:
pip install pandas
You can also install it with Anaconda (which includes Pandas by default):
conda install pandas
A Series is a one-dimensional labeled array. You can think of it as a single column in a spreadsheet.
import pandas as pd
# Create a simple Series
s = pd.Series([100, 200, 300, 400])
print(s)
Output:
0    100
1    200
2    300
3    400
dtype: int64
Each element has an index (on the left) and a value (on the right).
You can assign your own custom index too:
s = pd.Series([10, 20, 30], index=['A', 'B', 'C'])
print(s['B'])  # Accessing by label → 20
A DataFrame is a two-dimensional labeled data structure, essentially a table with rows and columns.
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, 35, 40],
    'City': ['Delhi', 'Mumbai', 'Chennai', 'Kolkata']
}
df = pd.DataFrame(data)
print(df)
Output:
      Name  Age     City
0    Alice   25    Delhi
1      Bob   30   Mumbai
2  Charlie   35  Chennai
3    David   40  Kolkata
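A dictionary of columns is not the only way in. As a minimal sketch, assuming the same four sample people, the table can also be built row by row from a list of dictionaries. The names `rows` and `df_rows` are only illustrative.

```python
import pandas as pd

# Each dictionary is one row; the keys become column labels
rows = [
    {'Name': 'Alice', 'Age': 25, 'City': 'Delhi'},
    {'Name': 'Bob', 'Age': 30, 'City': 'Mumbai'},
    {'Name': 'Charlie', 'Age': 35, 'City': 'Chennai'},
    {'Name': 'David', 'Age': 40, 'City': 'Kolkata'},
]
df_rows = pd.DataFrame(rows)

print(df_rows.shape)          # (rows, columns)
print(list(df_rows.columns))  # column labels
print(list(df_rows.index))    # default integer row labels
```

Both construction styles give the same table here; which one is more convenient usually depends on whether your source data arrives column-wise or record-wise.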
Each column in a DataFrame is actually a Series.
You can access columns and rows individually:
df['Name']           # Access a column
df[['Name', 'Age']]  # Access multiple columns
df.loc[2]            # Access a row by label
df.iloc[0]           # Access a row by position
One of Pandas’ greatest strengths is its ability to easily load data from many file formats.
Here are some commonly used functions:
| Format | Read | Write |
|---|---|---|
| CSV | pd.read_csv() | DataFrame.to_csv() |
| Excel | pd.read_excel() | DataFrame.to_excel() |
| JSON | pd.read_json() | DataFrame.to_json() |
| SQL | pd.read_sql() | DataFrame.to_sql() |
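The read and write functions come in pairs, so a round trip is a quick way to see how they fit together. The sketch below is one possible illustration using JSON and an in-memory string instead of a file. The two-row DataFrame and the variable names are made up for the example.

```python
import io
import pandas as pd

df_small = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Age': [25, 30]})

# With no path argument, to_json() returns the JSON text as a string
json_text = df_small.to_json(orient='records')
print(json_text)

# Read it back; StringIO makes the string behave like a file for read_json()
restored = pd.read_json(io.StringIO(json_text), orient='records')
print(restored)
```

The same pattern applies to the other formats in the table, although Excel and SQL need extra dependencies or a database connection.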
# Reading from a CSV file
df = pd.read_csv('employees.csv')
# Writing to a CSV file
df.to_csv('employees_cleaned.csv', index=False)
By default, Pandas assumes that the first row of your CSV file contains column names.
You can customize this behavior with parameters like header=None or names=[...].
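To see what those two parameters do, here is a small sketch that reads a headerless CSV from an in-memory string rather than a file. The CSV text and variable names are assumptions made for the example.

```python
import io
import pandas as pd

# Hypothetical CSV content with no header row
raw = "Alice,25,Delhi\nBob,30,Mumbai\n"

# header=None: the first line is treated as data, columns are numbered 0, 1, 2
no_header = pd.read_csv(io.StringIO(raw), header=None)
print(list(no_header.columns))

# names=[...]: supply the column labels yourself
named = pd.read_csv(io.StringIO(raw), names=['Name', 'Age', 'City'])
print(named)
```

Without either parameter, the line `Alice,25,Delhi` would be consumed as the header and the first record would be lost.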
Once your data is loaded into a DataFrame, Pandas provides a variety of methods for quick exploration.
df.head() # Displays the first 5 rows
df.tail() # Displays the last 5 rows
df.shape # Returns (rows, columns)
df.columns # Lists all column names
df.dtypes # Shows data types for each column
df.info() # Summary: column names, types, nulls, memory usage
df.describe()  # Statistical summary of numeric columns
Example:
print(df.describe())
Output:
             Age
count   4.000000
mean   32.500000
std     6.454972
min    25.000000
25%    28.750000
50%    32.500000
75%    36.250000
max    40.000000
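The result of `describe()` is itself a DataFrame, so individual statistics can be looked up by label. A minimal sketch, assuming the four-person sample table from earlier (rebuilt here as `df_sample` so the block runs on its own):

```python
import pandas as pd

df_sample = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, 35, 40],
    'City': ['Delhi', 'Mumbai', 'Chennai', 'Kolkata'],
})

summary = df_sample.describe()

# By default only numeric columns are summarized
print(list(summary.columns))

# Rows are labeled by statistic, columns by the original column name
print(summary.loc['mean', 'Age'])
print(summary.loc['count', 'Age'])
```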
Pandas allows flexible data filtering using both labels and conditions.
# Select a single column
df['Age']
# Select multiple columns
df[['Name', 'City']]
# Conditional filtering
df[df['Age'] > 30]
# Combining multiple conditions
df[(df['Age'] > 25) & (df['City'] == 'Delhi')]
You can also use .loc[] for label-based selection or .iloc[] for position-based selection:
df.loc[1:3, ['Name', 'City']]
df.iloc[0:2, 0:2]
Real-world data is messy. Pandas makes cleaning painless.
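To see missing-value handling end to end, the sketch below assumes a small hypothetical DataFrame with two gaps, counts the missing values per column, and then drops or fills them. The data and the fill choices (column mean for `Age`, the string `'Unknown'` for `City`) are illustrative assumptions, not recommendations.

```python
import pandas as pd

# Hypothetical data with one missing age and one missing city
df_raw = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, None, 35],
    'City': ['Delhi', 'Mumbai', None],
})

# Count missing values per column
missing = df_raw.isnull().sum()
print(missing)

# Drop every row that has at least one missing value
dropped = df_raw.dropna()
print(dropped)

# Fill gaps column by column; fillna returns a new DataFrame
filled = df_raw.fillna({'Age': df_raw['Age'].mean(), 'City': 'Unknown'})
print(filled)
```

Note that `dropna()` and `fillna()` return new objects and leave `df_raw` unchanged unless you assign the result back.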
df.isnull() # Check for missing values
df.dropna() # Drop rows with missing values
df.fillna(0)  # Fill missing values with a placeholder
Renaming columns:
df.rename(columns={'Name': 'Employee_Name'}, inplace=True)
Changing data types:
df['Age'] = df['Age'].astype(float)
Sorting your data:
df.sort_values(by='Age', ascending=False)
Grouping (e.g., aggregating data by a category):
grouped = df.groupby('City')['Age'].mean()
print(grouped)
Output:
City
Chennai    35.0
Delhi      25.0
Kolkata    40.0
Mumbai     30.0
Name: Age, dtype: float64
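Grouping becomes more informative when a category has several rows. The sketch below assumes a hypothetical table in which some cities repeat, and uses `agg()` to compute several statistics per group in one call.

```python
import pandas as pd

# Hypothetical data where cities appear more than once
df_people = pd.DataFrame({
    'City': ['Delhi', 'Mumbai', 'Delhi', 'Mumbai', 'Chennai'],
    'Age': [25, 30, 35, 40, 45],
})

# One row per city, one column per aggregation
stats = df_people.groupby('City')['Age'].agg(['mean', 'min', 'max', 'count'])
print(stats)
```

The group labels become the index of the result, sorted alphabetically by default.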
Let’s see some quick examples of what you can do once your data is cleaned:
# Mean age
df['Age'].mean()
# Count how many from each city
df['City'].value_counts()
# Filter and sort together
df[df['Age'] > 30].sort_values(by='Age', ascending=False)
Pandas integrates with Matplotlib, allowing quick visualization directly from your DataFrame.
import matplotlib.pyplot as plt
df['Age'].plot(kind='bar', title='Age Distribution')
plt.xlabel('Index')
plt.ylabel('Age')
plt.show()
For more advanced visualizations, you can use libraries like Seaborn or Plotly with your Pandas data.
Pandas provides a clean, efficient interface for everything from data cleaning to basic analysis.
It’s one of the first libraries every data professional should master because it forms the backbone of nearly every ML and data science workflow in Python.
Next Steps:
- Explore advanced Pandas operations (merging, reshaping, pivoting)
- Learn how Pandas integrates with NumPy and visualization libraries
- Try using Pandas in a small data project — like analyzing a CSV dataset from Kaggle