Skip to content

Latest commit

 

History

History
38 lines (26 loc) · 2.43 KB

File metadata and controls

38 lines (26 loc) · 2.43 KB

Data_Cleaning_Football_Data_SQL

📌 Project Overview

End-to-End Data Cleaning and Preparation on a raw FIFA players dataset using PostgreSQL and pgAdmin 4.

🛠️ Tools & Technologies Used

Database:PostgreSQL

Query Language: SQL

Database Management: pgAdmin 4

Data Manipulation & String Matching: PostgreSQL built-in string functions, Regular Expressions (Regex), fuzzystrmatch extension

🧹 Methodology: Data Cleaning & Processing

To ensure data integrity before analysis, the following steps were performed:

Database Setup:

Imported the raw CSV file into a base table (fifa_ratings) and immediately created a working copy (fifa_rt) to safely perform all transformations.

Duplicate Removal & Integrity:

Verified the uniqueness of the player id and checked for duplicate player entries (based on name, age, and nationality) using Window Functions like ROW_NUMBER() OVER()

Text Normalization & Fuzzy Matching:

Trimmed whitespaces from player names and nationalities. Standardized text by converting nationalities to uppercase and used REGEXP_REPLACE to strip hidden characters (\n, \r, \t) from club names. Employed the fuzzystrmatch extension, using levenshtein distance and similarity scores, to detect and verify spelling inconsistencies in nationalities and club names.

Data Parsing & Feature Engineering:

Split the complex text column contract into four distinct, functional columns: contract_start, contract_end, contract_type (e.g., 'Regular', 'On Loan', 'Free'), and loan_end_date. Extracted the specific best_position from a comma-separated list of positions using split_part.

Data Type Conversion:

Converted date strings (like joined and loan_end_date) into proper DATE format using to_date for accurate temporal analysis.

Unit Standardization:

Converted mixed height measurements (feet/inches) into a uniform centimeter (cm) scale. Converted imperial weight measurements (lbs) into metric kilograms (kg) and cast them to integers.

Financial Data Cleaning:

Standardized monetary columns (value, wage, release_clause) by stripping the Euro symbol ('€') and mathematically converting text abbreviations ('M' for millions, 'K' for thousands) into uniform numeric types

Range Validations:

Performed logical checks on numerical columns (age, ova, pot, bov) to ensure all values fell within realistic limits (e.g., confirming max age was 53 and minimum overall rating was 47) and contained no unexpected nulls.