tutorial/ai-ml/machine-learning/data-engineering-basics/data-formats/excel.mdx at b098e2c66e44a01782a3b783cf790e36f4a5f30e · codeharborhub/tutorial

title

Excel: The Business Data Standard

sidebar_label

Excel

description

Handling .xlsx and .xls files in ML pipelines: managing multi-sheet workbooks, data types, and conversion pitfalls.

1. Excel vs. CSV: More Than Just Tables

Unlike a CSV, which is a single flat text file, an Excel file is actually a compressed collection of XML files (for .xlsx).

Feature	CSV	Excel (.xlsx)
Capacity	Unlimited rows (limited by disk)	1,048,576 rows per sheet
Structure	Single table	Multiple sheets (Workbooks)
Metadata	None	Includes formatting, formulas, and charts
Data Types	Everything is a string	Explicit types (Date, Currency, Number)

2. The Multi-Sheet Challenge

A single Excel workbook often contains multiple datasets spread across different tabs. In an ML pipeline, you must specify which sheet contains your features.

graph TD
    Workbook[Workbook: Sales_Data.xlsx] --> S1[Sheet: Q1_Sales]
    Workbook --> S2[Sheet: Q2_Sales]
    Workbook --> S3[Sheet: Reference_Codes]
    
    S1 --> Process[ML Ingestion Pipeline]
    style S1 fill:#e1f5fe,stroke:#01579b,color:#333

3. Reading Excel with Python

To process Excel files, Python requires an engine like openpyxl or xlrd. Pandas wraps these to provide a simple interface.

import pandas as pd

# Load a specific sheet
df = pd.read_excel('company_data.xlsx', sheet_name='Training_Data')

# Loading multiple sheets into a dictionary of DataFrames
all_sheets = pd.read_excel('company_data.xlsx', sheet_name=None)
# Access the 'Metadata' sheet
metadata = all_sheets['Metadata']

4. Common Pitfalls in Excel Ingestion

Excel allows for "human-friendly" formatting that is "machine-hostile." Watch out for these:

A. Merged Cells

Merged cells create NaN (null) values for all but the first cell in the merge.

Fix: Use df.fillna(method='ffill') to propagate values downward or across.

B. Header Offsets

Business reports often have titles or empty rows at the top before the data starts.

Fix: Use the skiprows parameter: pd.read_csv(..., skiprows=3).

C. Formulas vs. Values

If a cell contains =SUM(A1:A10), a basic parser might read the formula string instead of the calculated result ().

Note: Standard Pandas readers fetch the calculated value, but if the file hasn't been saved recently, the values might be stale.

5. Performance Note

Excel files are significantly slower to read than CSV or Parquet.

Benchmarking: Reading a 100MB Excel file can take 10–20x longer than reading the same data from a CSV because the engine must decompress the XML and parse formatting.
Best Practice: If you are running an iterative ML experiment, convert the Excel file to Parquet or Pickle once, then use that faster file for the rest of your work.

6. Summary: When to Use Excel

Use Excel if: You are receiving data from non-technical departments, need to preserve multi-sheet relationships, or require the data types (like Dates) to be pre-defined.
Avoid Excel if: You have more than 1 million rows, or you need to read data in a high-speed production environment.

References for More Details

Pandas read_excel Guide: Advanced parameters like usecols and converters.
Openpyxl Documentation: Modifying Excel files (writing formulas, adding colors) programmatically.

Excel is great for humans, but for web-based data and highly nested structures, we use a format that looks more like a Python dictionary.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

1. Excel vs. CSV: More Than Just Tables

2. The Multi-Sheet Challenge

3. Reading Excel with Python

4. Common Pitfalls in Excel Ingestion

A. Merged Cells

B. Header Offsets

C. Formulas vs. Values

5. Performance Note

6. Summary: When to Use Excel

References for More Details

Uh oh!

FilesExpand file tree

excel.mdx

Latest commit

History

excel.mdx

File metadata and controls

1. Excel vs. CSV: More Than Just Tables

2. The Multi-Sheet Challenge

3. Reading Excel with Python

4. Common Pitfalls in Excel Ingestion

A. Merged Cells

B. Header Offsets

C. Formulas vs. Values

5. Performance Note

6. Summary: When to Use Excel

References for More Details