Commit 1b448e6

Merge pull request #372 from shweta-016/add-data-cleaning-guide
Added Data Cleaning Best Practices and Python preprocessing example
2 parents 8a58967 + 9a99e60 commit 1b448e6

1 file changed

Lines changed: 24 additions & 0 deletions

data_cleaning.md

@@ -0,0 +1,24 @@
## Data Cleaning Best Practices
# Data Cleaning Best Practices

- Remove duplicate rows to avoid data leakage.
- Standardize column names (lowercase, underscores).
- Handle missing values using median/mean or domain logic.
- Convert date columns to proper datetime format.
- Validate data types before modeling.

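Before applying the practices listed above, it can help to measure how much each one matters for a given dataset. The following is a minimal sketch and not part of the commit: `audit_frame` is a hypothetical helper name, the sample frame is made up for illustration, and column names are assumed to be strings.

```python
import pandas as pd


def audit_frame(df: pd.DataFrame) -> dict:
    """Summarize common data-quality issues without modifying df."""
    return {
        # Fully identical rows; every occurrence after the first is counted.
        "duplicate_rows": int(df.duplicated().sum()),
        # Column names that are not already lowercase with underscores.
        "nonstandard_columns": [
            c for c in df.columns if c != c.lower().replace(" ", "_")
        ],
        # Missing values per column, keeping only columns that have any.
        "missing_values": {
            c: int(n) for c, n in df.isna().sum().items() if n > 0
        },
    }


# Hypothetical sample data, for illustration only.
sample = pd.DataFrame({"User ID": [1, 1, 2], "score": [0.5, 0.5, None]})
report = audit_frame(sample)
print(report)
```

Running the audit before and after cleaning gives a quick check that each step did what was intended.
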
## Python Example

import pandas as pd

df = pd.read_csv("data.csv")

df = df.drop_duplicates()
df.columns = [c.lower().replace(" ", "_") for c in df.columns]

num_cols = df.select_dtypes(include="number").columns
df[num_cols] = df[num_cols].fillna(df[num_cols].median())

if "date" in df.columns:
    df["date"] = pd.to_datetime(df["date"])

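The example above does not cover the last practice in the list, validating data types before modeling. One possible way to do that is sketched below. `find_dtype_mismatches`, the `"numeric"`/`"datetime"` labels, and the sample columns are all hypothetical and chosen for illustration; only the `pd.api.types` checks come from pandas itself.

```python
import pandas as pd

# Labels accepted by this sketch, mapped to pandas dtype checks.
DTYPE_CHECKS = {
    "numeric": pd.api.types.is_numeric_dtype,
    "datetime": pd.api.types.is_datetime64_any_dtype,
}


def find_dtype_mismatches(df: pd.DataFrame, expected: dict) -> dict:
    """Return {column: problem} for columns that are missing or mistyped."""
    problems = {}
    for col, kind in expected.items():
        if col not in df.columns:
            problems[col] = "missing column"
        elif not DTYPE_CHECKS[kind](df[col]):
            problems[col] = f"expected {kind}, got {df[col].dtype}"
    return problems


# Hypothetical sample data: "price" holds numbers stored as text.
sample = pd.DataFrame(
    {
        "price": ["10", "12"],
        "date": pd.to_datetime(["2024-01-01", "2024-01-02"]),
    }
)
problems = find_dtype_mismatches(
    sample, {"price": "numeric", "date": "datetime", "qty": "numeric"}
)
print(problems)
```

Here `price` and `qty` are reported while `date` passes. Returning the problems instead of raising lets the caller decide whether a mismatch should stop the pipeline or only be logged.
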
