Conversation
- New extract/common.py centralizes get_tracker_year, find_month_sheets, extract_tracker_month, and clean_excel_errors so the upcoming product extractor can reuse them without duplicating patient.py's logic.
- extract/patient.py re-exports the shared helpers; behavior unchanged.
- tables/patient.py accepts a pre-loaded DataFrame to avoid re-reading the cleaned parquet for each table (static/monthly/annual).
- pipeline/patient.py loads the cleaned parquet once and threads it through; drops the vestigial `force` kwarg.
- clean/patient.py: tuple-unpack rewrite of the date-validation row loop (skips per-row dict construction). Behavior-preserving.
- config.py: surfaces min_tracker_year/max_tracker_year as Pydantic defaults (2017/2030) so the new tests can override them without breaking existing runs.
- tests/test_tables/test_patient.py updated for the new DataFrame-input signatures.

Patient parquets remain byte-identical to migration; this is a pure refactor.
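The tuple-unpack rewrite can be sketched in miniature like this (row shapes and the validation rule are illustrative, not the actual clean/patient.py code):

```python
# Rows as the extractor might hand them over: (patient_id, raw_date) tuples.
rows = [("P1", "2023-01-05"), ("P2", "2035-02-01"), ("P3", None)]

# Unpacking tuples in the loop header replaces the old pattern of building
# a {"patient_id": ..., "raw_date": ...} dict for every row before use.
invalid = []
for patient_id, raw_date in rows:
    if raw_date is None:
        continue
    year = int(raw_date[:4])
    if not (2017 <= year <= 2030):  # the min/max tracker-year defaults
        invalid.append(patient_id)
```

Skipping the per-row dict is a small constant-factor win, but it adds up over the full corpus.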
Three intentional patient-output deltas vs. migration:

1. Date typo rescue: clean/date_parser.py adds rescue_date_typos() and TYPO_REPLACEMENTS so common Excel-date typos (e.g., misspelled month strings) round-trip to a valid date instead of becoming Undefined. clean/converters.py wires the rescue through parse_date_column. errors.py registers the "typo_rescued" log code.
2. Multi-insulin CSV acceptance: clean/validators.py adds the allow_csv_subset flag so 2024+ trackers' comma-separated insulin_subtype values (e.g., "pre-mixed,rapid-acting") survive validation in canonical case instead of being replaced by Undefined. reference_data/validation_rules.yaml flips insulin_subtype on. clean/patient.py: docstring update on _derive_insulin_fields.
3. load_numeric_ranges helper + numeric_ranges: YAML block (consumed by the validate/ tree shipped in a later commit). Carrying both now keeps the YAML self-consistent.

Plus a small per-row dedup optimization in clean/converters.py and the matching test updates.
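The typo-rescue idea could look roughly like this (the replacement table and the date format are assumptions; the real TYPO_REPLACEMENTS in clean/date_parser.py covers more cases and is wired through parse_date_column):

```python
from datetime import datetime

# Hypothetical subset of the typo table: misspelled month -> correct month.
TYPO_REPLACEMENTS = {
    "Janaury": "January",
    "Feburary": "February",
    "Octber": "October",
}

def rescue_date_typos(raw: str) -> str:
    """Apply known typo fixes so the normal parser gets a second chance."""
    for typo, fix in TYPO_REPLACEMENTS.items():
        if typo in raw:
            raw = raw.replace(typo, fix)
    return raw

def parse_with_rescue(raw: str):
    # Try the raw value first, then the rescued form; a rescue that
    # succeeds would emit the "typo_rescued" log code (omitted here).
    for candidate in (raw, rescue_date_typos(raw)):
        try:
            return datetime.strptime(candidate, "%d %B %Y").date()
        except ValueError:
            continue
    return None  # falls through to the Undefined path
```

The two-pass structure matters: values that already parse are never touched, so the rescue cannot corrupt clean input.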
Lays the groundwork the product cleaning pipeline (next commit) needs:

- extract/product.py: reads the Stock_Summary section across month sheets (parallels extract/patient.py and reuses extract/common helpers).
- extract/wide_format.py: Mandalay-specific wide-format handling (column expansion for 2020-21; cell splitting for 2017-19).
- clean/schema_product.py: 19-column product meta schema + helpers.
- reference/products.py: known-product reference loader (categories, canonical names) backing product validation.

Reusable patient-side polish that the product layer now depends on:

- reference/synonyms.py: caches load_*_mapper with @lru_cache and fixes the scalar-string -> list normalization so single-string YAML entries no longer iterate as characters.
- reference/loaders.py: explicit UTF-8 in load_yaml.
- tables/logs.py: schema_overrides on parse_log_file.
- clean/schema.py: docstring touch-up.
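The synonyms.py fix guards against a classic Python footgun: iterating a bare string walks it character by character. A minimal sketch (the inlined YAML data and mapper name are illustrative; the real loaders read reference_data via load_yaml):

```python
from functools import lru_cache

def _as_list(value):
    """YAML entries may be a single string or a list of strings.
    Without this guard, a bare "nph" would iterate as "n", "p", "h"."""
    if isinstance(value, str):
        return [value]
    return list(value)

@lru_cache(maxsize=None)
def load_insulin_mapper():
    # Stand-in for the YAML read: canonical name -> synonym(s).
    raw = {"NPH": "nph", "Rapid-acting": ["rapid", "novorapid"]}
    return {
        syn.lower(): canonical
        for canonical, syns in raw.items()
        for syn in _as_list(syns)
    }
```

@lru_cache means repeat callers share one dict instead of re-reading the YAML per tracker.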
- clean/product.py implements R steps 2.0-2.21 from helper_product_data.R: synonym mapping, type conversion, patient-id repair, supplier-name preservation, units/balance reconciliation, end-row zeroing, and the per-sheet partition required for the .over() rolling balance to match R's per-sheet loop. Reuses safe_convert_column, fix_patient_id, and load_product_mapper from migration.
- tables/product.py aggregates cleaned product parquets into the single product_data table.
- tests/test_clean/test_product.py covers the cleaning steps; tests/test_tables/test_link_product_patient.py exercises the cross-product/patient link assumptions used by downstream tables.
Adds the manifest-driven incremental skip path that future commits will expose behind --incremental:

- state/source.py + manifest.py: load the previous run's manifest from BigQuery, falling back to local parquet, then to an empty manifest.
- state/filter.py: skips trackers whose MD5 + per-pipeline completion flags match the previous manifest.
- tables/metadata.py: builds tracker_metadata (MD5 + per-tracker output-presence flags) at table-creation time; md5_file is reused by filter.py.
- gcp/bigquery.py: adds select_tracker_metadata and registers product_data + tracker_metadata in PARQUET_TO_TABLE.
- state/__init__.py: additive re-exports for the new helpers.

Tests cover each state module, the metadata table builder, and the new BigQuery accessor.
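The skip decision reduces to "same bytes, and both arms finished last time". A sketch under an assumed manifest shape (the real manifest schema and flag names may differ):

```python
import hashlib
from pathlib import Path

def md5_file(path: Path) -> str:
    """Stream in 1 MiB chunks so large trackers aren't loaded whole."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

def should_skip(tracker: Path, manifest: dict) -> bool:
    """Skip only when the MD5 matches AND every per-pipeline completion
    flag is set; a changed file or a half-finished prior run reprocesses.
    `manifest` maps tracker name -> {"md5", "patient_done", "product_done"}
    (an assumed shape, not the real tracker_metadata schema)."""
    entry = manifest.get(tracker.name)
    if entry is None:
        return False
    return (
        entry["md5"] == md5_file(tracker)
        and entry.get("patient_done", False)
        and entry.get("product_done", False)
    )
```

Requiring both flags is the conservative choice: a tracker whose product arm failed last run is re-run end to end rather than trusted partially.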
Wires the product cleaning + state machinery into a runnable pipeline:

- pipeline/product.py: orchestrates product extract + clean + table for the full corpus, mirroring pipeline/patient.py.
- pipeline/tracker.py: adds process_tracker_product so the per-tracker worker can run the patient and product arms together.
- cli.py: introduces process-product, create-product-tables, and the unified run-pipeline command (patient + product + drive/GCS/BigQuery); adds --force and --incremental flags across process-patient, process-product, and run-pipeline. Imports the new state helpers and product entry points. The full ~810-line diff lands in one commit because the new commands and flags interleave with existing patient command bodies.
- tests/test_cli/test_force_incremental.py covers --force/--incremental for both process-* commands and run-pipeline.

Breaking change:

- tables/__init__.py drops the wildcard re-exports from .patient, .clinic, .logs, .metadata, and .product. Callers must now import from the specific submodule (a4d.tables.patient, a4d.tables.product, ...). Verified zero affected call sites in src/, tests/, and scripts/.

Plus the doc/config drag-along:

- justfile: new product + run-pipeline recipes.
- readme.md, docs/CLAUDE.md, docs/migration/MIGRATION_GUIDE.md, docs/migration/PYTHON_IMPROVEMENTS.md: product pipeline + state docs.
- docs/PRODUCT_DATA_PIPELINE_FEATURE.md moved into docs/migration/.
- .gitignore: ignores .Rproj.user, .Rhistory, CLAUDE.local.md, and r-archive/config.yml.
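One way to read the --force/--incremental interaction, sketched with stdlib argparse (the real cli.py presumably uses its own framework; command names match the text, but the precedence rule shown is an assumption):

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="a4d")
    sub = parser.add_subparsers(dest="command", required=True)
    for name in ("process-patient", "process-product", "run-pipeline"):
        cmd = sub.add_parser(name)
        cmd.add_argument("--force", action="store_true",
                         help="Reprocess every tracker, ignoring the manifest.")
        cmd.add_argument("--incremental", action="store_true",
                         help="Skip trackers unchanged since the last manifest.")
    return parser

def effective_mode(args) -> str:
    # Assumed precedence: --force wins, so an explicit full rebuild is
    # never silently narrowed by the manifest skip filter.
    if args.force:
        return "full"
    return "incremental" if args.incremental else "full"
```

Registering the same two flags on all three commands keeps the test matrix in test_force_incremental.py uniform.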
Adds the source-vs-output validators that re-read each tracker's source cells and compare them against the cleaned parquet to flag silent data loss or schema drift before tables are uploaded.

- validate/source_vs_output_patient.py + source_vs_output_product.py walk every patient/product row in the cleaned output, locate the matching source cell, and emit a structured diff (typed as match/converted/missing/unparseable). Activates the load_numeric_ranges helper and numeric_ranges: YAML block carried in the typo-rescue commit.
- validate/common.py holds the shared diff and IO primitives.
- Tests cover both patient and product paths against fixture trackers.
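The four diff types could be assigned with a decision ladder along these lines (a hypothetical reduction; the real validators carry more context per cell):

```python
def classify_cell(source_value, output_value, parsed_ok: bool) -> str:
    """Classify one source-vs-output comparison:
      match       - output equals the raw source value
      unparseable - the source value never produced a usable output
      missing     - source parsed fine, but the output lost the value
      converted   - source parsed into a different (valid) representation
    """
    if source_value == output_value:
        return "match"
    if not parsed_ok:
        return "unparseable"
    if output_value is None:
        return "missing"
    return "converted"
```

The ordering matters: "missing" is only reported when parsing succeeded, so genuine data loss is separated from input that was never recoverable.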
Read query rows directly and build the polars DataFrame with an explicit schema instead of going through pandas. Removes a hidden dependency on pandas (not declared in pyproject) and makes the schema deterministic when the result set is empty. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Remove explicit encoding="utf-8" from open() calls and the env_file_encoding setting (utf-8 is the default on the platforms we target), and drop the Windows backslash-to-forward-slash conversion in GCS blob names since Path.relative_to already yields POSIX separators on macOS/Linux. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
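The blob-name simplification rests on a small observable fact, sketched here (the function name is illustrative):

```python
from pathlib import Path

def blob_name(local_path: Path, root: Path) -> str:
    # On macOS/Linux, Path stringifies with "/" already, so the former
    # str(...).replace("\\", "/") step was a no-op and has been dropped.
    return str(local_path.relative_to(root))
```

On the targeted POSIX platforms the separator is always "/", so the Windows-only conversion carried no weight.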
Removes the windows-shell directive, [unix]/[windows] recipe attributes, and the [windows] PowerShell clean recipe. The project targets macOS (dev) + Linux (Cloud Run / Docker prod); these shims were drag-along noise from 85891e7 and were never present on migration. Recipes that rely on bash heredocs will now fail loudly on Windows, which is correct for an unsupported platform. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Summary

Adds the product data pipeline (extraction → cleaning → tables) alongside finalization work on the patient pipeline. Output now includes a product_data parquet/BQ table and an incremental processing flow keyed off tracker_metadata. This branch represents the cumulative Python migration on top of dev. The recent (product-focused) commits are the primary delta worth reviewing; earlier commits are part of the established migration baseline.

What's new

Product pipeline (most recent commits):

- Product extraction (src/a4d/extract/product.py)
- Product cleaning (src/a4d/clean/product.py, partitions by [sheet, product] via .over())
- Product tables (src/a4d/tables/product.py)
- tracker_metadata state filter (src/a4d/state/)
- a4d run-pipeline runs both arms; --skip-product available for patient-only runs
- Source-vs-output validators (src/a4d/validate/source_vs_output_*.py)

Patient pipeline:

- src/a4d/pipeline/tracker.py

Reference data / config:

- reference_data/validation_rules.yaml (centralized validation thresholds)
- clinic_data.xlsx

Test plan

- uv sync && uv run pytest
- just run (full pipeline, both arms) on a small tracker subset
- just run --skip-product (patient-only path still works)
- Compared product_data.parquet against R reference output