Audience:
- Learners ready to produce business-value automation from spreadsheets.
- Teams that need repeatable reporting with validation and auditability.
A full Excel ingestion pipeline that:
- scans `input/` for `.xlsx` files,
- normalizes headers,
- validates rows,
- writes `Master_Report.xlsx`, `Master_Report.csv`, and `rejects.csv`,
- writes a run log with summary counts.
- Foundations phase complete.
- Quality tooling baseline from `09_QUALITY_TOOLING.md`.
- Installed packages: `openpyxl`; optional: `pandas`.
Create structure:
```
excel_merger/
  input/
  output/
  logs/
  src/
    excel_merger/
      __init__.py
      main.py
      schema.py
      normalize.py
      validate.py
      io_excel.py
      io_csv.py
  tests/
```
Required columns: `Customer`, `Site`, `Status`, `Opened`, `TicketID`.
Define allowed status values and date rules.
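A minimal `schema.py` sketch; the specific status values and the date format here are illustrative assumptions, not requirements:

```python
# schema.py -- central definitions the rest of the pipeline imports.
# The status values and date format below are illustrative assumptions.

REQUIRED_COLUMNS = ["customer", "site", "status", "opened", "ticketid"]

# Allowed values for the Status column (assumed set; adjust to real data).
ALLOWED_STATUSES = {"open", "closed", "pending", "critical"}

# Dates in the Opened column must parse with this format (assumed).
DATE_FORMAT = "%Y-%m-%d"
```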
Implement normalize_header(text) rules:
- trim whitespace,
- lowercase,
- remove `_`, `-`, and extra spaces,
- map known aliases (for example `ticket id`, `ticket_id`, `ticketid`).
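A sketch of `normalize_header` under these rules. Stripping separators already collapses `ticket id`, `ticket_id`, and `ticketid` to one name; the alias table covers renames that normalization alone cannot catch (the `ticketnumber` entry is an illustrative assumption):

```python
import re

# Aliases for headers that normalization alone cannot collapse
# (the "ticketnumber" entry is an illustrative assumption).
HEADER_ALIASES = {
    "ticketnumber": "ticketid",
}

def normalize_header(text: str) -> str:
    """Trim, lowercase, remove `_`, `-`, and spaces, then map known aliases."""
    cleaned = re.sub(r"[\s_\-]+", "", text.strip().lower())
    return HEADER_ALIASES.get(cleaned, cleaned)
```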
- Detect header row.
- Build row dictionaries by normalized headers.
- Capture source filename and row number.
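The reading steps above could be sketched like this; the header-detection rule (first row with no empty cells) is a simplifying assumption, and the pure row-building logic is separated from openpyxl so it stays testable:

```python
def rows_from_values(values, source_file, normalize=lambda c: str(c).strip().lower()):
    """Build row dicts keyed by normalized headers from an iterable of
    value tuples, tagging each with source filename and row number."""
    headers = None
    for row_number, row in enumerate(values, start=1):
        if headers is None:
            # Header detection (simple assumption): first row with no empty cells.
            if row and all(cell is not None for cell in row):
                headers = [normalize(cell) for cell in row]
            continue
        record = dict(zip(headers, row))
        record["_source_file"] = source_file
        record["_row_number"] = row_number
        yield record

def read_rows(path):
    """Read one workbook and yield row dicts from its active sheet."""
    from openpyxl import load_workbook  # local import keeps the logic above Excel-free
    wb = load_workbook(path, read_only=True, data_only=True)
    try:
        yield from rows_from_values(wb.active.iter_rows(values_only=True), str(path))
    finally:
        wb.close()
```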
Implement validate_row(row) checks:
- missing required fields,
- invalid status,
- invalid date format,
- duplicate TicketID in same file.
Write failures to rejects.csv with reason codes.
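A possible shape for `validate_row`, returning reason codes instead of a boolean so `rejects.csv` can explain each failure; the status set, date format, and reason-code names are assumptions:

```python
from datetime import datetime

REQUIRED = ("customer", "site", "status", "opened", "ticketid")
ALLOWED_STATUSES = {"open", "closed", "pending", "critical"}  # assumed set
DATE_FORMAT = "%Y-%m-%d"  # assumed format

def validate_row(row, seen_ticket_ids):
    """Return a list of reason codes; an empty list means the row is valid.
    seen_ticket_ids is the set of TicketIDs already seen in this file."""
    reasons = []
    for field in REQUIRED:
        if not row.get(field):
            reasons.append(f"missing_{field}")
    status = str(row.get("status", "")).strip().lower()
    if status and status not in ALLOWED_STATUSES:
        reasons.append("invalid_status")
    opened = row.get("opened")
    if opened:
        try:
            datetime.strptime(str(opened), DATE_FORMAT)
        except ValueError:
            reasons.append("invalid_date")
    ticket_id = row.get("ticketid")
    if ticket_id:
        if ticket_id in seen_ticket_ids:
            reasons.append("duplicate_ticketid")
        seen_ticket_ids.add(ticket_id)
    return reasons
```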
- Combine valid rows from all files.
- Add metadata fields: `source_file`, `ingested_at_utc`.
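The merge step might look like this; `merge_rows` and its `(filename, rows)` input shape are illustrative, not a required interface:

```python
from datetime import datetime, timezone

def merge_rows(per_file_rows):
    """Combine valid rows from all files, stamping provenance metadata.
    per_file_rows: iterable of (filename, rows) pairs."""
    ingested_at = datetime.now(timezone.utc).isoformat()
    merged = []
    for filename, rows in per_file_rows:
        for row in rows:
            merged.append({**row,
                           "source_file": filename,
                           "ingested_at_utc": ingested_at})
    return merged
```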
Outputs:
- `output/Master_Report.csv`
- `output/Master_Report.xlsx` with worksheets:
  - `AllRows`
  - `CriticalRows` (highlight critical rows)
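One way to write the workbook with openpyxl; treating `status == "critical"` as the critical-row rule, and the fill color, are assumptions:

```python
from openpyxl import Workbook
from openpyxl.styles import PatternFill

# Light red fill for highlighted rows (color choice is arbitrary).
CRITICAL_FILL = PatternFill(start_color="FFC7CE", end_color="FFC7CE",
                            fill_type="solid")

def write_master_xlsx(rows, columns, path):
    """Write an AllRows sheet plus a CriticalRows sheet with highlighted rows.
    Rows with status == 'critical' are treated as critical (assumption)."""
    wb = Workbook()
    all_ws = wb.active
    all_ws.title = "AllRows"
    crit_ws = wb.create_sheet("CriticalRows")
    for ws in (all_ws, crit_ws):
        ws.append(columns)
    for row in rows:
        values = [row.get(col) for col in columns]
        all_ws.append(values)
        if str(row.get("status", "")).lower() == "critical":
            crit_ws.append(values)
            for cell in crit_ws[crit_ws.max_row]:
                cell.fill = CRITICAL_FILL
    wb.save(path)
```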
- one log file per run: `logs/run_YYYYMMDD_HHMMSS.log`
- include:
- files discovered,
- rows accepted/rejected,
- final output paths,
- fatal errors with traceback.
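A per-run logger sketch using only the standard library; the logger name and message format are arbitrary choices:

```python
import logging
from datetime import datetime, timezone
from pathlib import Path

def setup_run_logger(log_dir):
    """Create one log file per run, named with a UTC timestamp."""
    Path(log_dir).mkdir(parents=True, exist_ok=True)
    stamp = datetime.now(timezone.utc).strftime("%Y%m%d_%H%M%S")
    log_path = Path(log_dir) / f"run_{stamp}.log"
    # Namespaced per run so repeated runs do not stack handlers.
    logger = logging.getLogger(f"excel_merger.{stamp}")
    logger.setLevel(logging.INFO)
    handler = logging.FileHandler(log_path)
    handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    logger.addHandler(handler)
    return logger, log_path
```

Fatal errors can then be recorded with `logger.exception(...)`, which appends the traceback automatically.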
After the openpyxl baseline works:
- load dataframes,
- apply vectorized transforms,
- compare results to baseline outputs.
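A sketch of the pandas variant, assuming `pandas` is installed; `normalize_headers_df` and `frames_match` are hypothetical helper names:

```python
import pandas as pd

def normalize_headers_df(df):
    """Vectorized equivalent of per-header normalization."""
    df = df.copy()
    df.columns = (
        df.columns.str.strip()
        .str.lower()
        .str.replace(r"[\s_\-]+", "", regex=True)
    )
    return df

def frames_match(pandas_df, baseline_records):
    """Compare the pandas result to the openpyxl baseline row dicts."""
    return pandas_df.to_dict(orient="records") == baseline_records
```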
Implement command:
`python -m excel_merger.main --input ./input --output ./output --log-dir ./logs`
- A rerunnable tool that handles malformed files safely.
- Clear output artifacts and rejects report.
- Deterministic behavior when rerun on same input.
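The CLI surface could be sketched with `argparse`; the defaults mirror the command above:

```python
import argparse
from pathlib import Path

def parse_args(argv=None):
    """Argument parsing for python -m excel_merger.main."""
    parser = argparse.ArgumentParser(prog="excel_merger")
    parser.add_argument("--input", type=Path, default=Path("./input"),
                        help="directory scanned for .xlsx files")
    parser.add_argument("--output", type=Path, default=Path("./output"),
                        help="directory for master reports and rejects.csv")
    parser.add_argument("--log-dir", type=Path, default=Path("./logs"),
                        help="directory for per-run log files")
    return parser.parse_args(argv)
```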
- Remove the `TicketID` column in one file and confirm those rows land in rejects.
- Introduce mixed header styles and confirm normalization works.
- Add a corrupted workbook and confirm pipeline continues with logging.
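Continuing past a corrupted workbook might look like this; `load_rows_safely` is a hypothetical helper name:

```python
def load_rows_safely(path, logger):
    """Load one workbook; on failure, log with traceback and return []
    so the rest of the run continues."""
    from openpyxl import load_workbook  # local import keeps the failure contained
    try:
        wb = load_workbook(path, read_only=True, data_only=True)
    except Exception:
        logger.exception("skipping unreadable workbook: %s", path)
        return []
    try:
        return list(wb.active.iter_rows(values_only=True))
    finally:
        wb.close()
```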
- Missing package errors:
- activate `.venv` and reinstall dependencies.
- Date parsing errors:
- normalize date formats before validation.
- Excel formatting issues:
- verify workbook writes happen after data transforms, not before.
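A date-normalization sketch for the troubleshooting step above; the candidate format list is an assumption to extend against real data:

```python
from datetime import datetime

# Input formats tried in order (assumed list; extend per real data).
CANDIDATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%m/%d/%Y", "%Y/%m/%d")

def normalize_date(value):
    """Return an ISO yyyy-mm-dd string, or None if no format matches."""
    if isinstance(value, datetime):  # openpyxl often yields datetimes directly
        return value.strftime("%Y-%m-%d")
    text = str(value).strip()
    for fmt in CANDIDATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    return None
```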
You are ready for SQL integration when you can:
- process 20 files with mixed quality,
- produce clean master outputs,
- explain every reject reason,
- rerun without duplicate or conflicting outputs.
- Play: modify header aliases and test edge cases.
- Build: follow steps exactly and track completion.
- Dissect: inspect one malformed file and explain why it failed.
- Teach-back: present schema and validation rules to a teammate.
Pass criteria:
- functional: all required outputs generated.
- reliability: malformed data does not crash full run.
- traceability: each rejected row has a reason.
- maintainability: tests exist for normalization and validation.
Scoring rubric (0-2 each):
- correctness,
- resilience,
- logging quality,
- test coverage,
- usability of CLI.