|
| 1 | +# Plan: Echodata Module Parity Audit & Remediation |
| 2 | + |
| 3 | +**TL;DR**: Deep comparison of `oceanstream/echodata/` against the `saildrone-data/saildrone/` reference reveals **6 confirmed bugs**, **~8 functional gaps**, and several areas where oceanstream already exceeds the reference. Priority: fix bugs → close feature gaps → verify parity. |
| 4 | + |
| 5 | +--- |
| 6 | + |
| 7 | +## Phase 1: Bug Fixes (Critical — blocks correctness) |
| 8 | + |
| 9 | +**Bug 1 — Missing `_save_intermediate` method in `processor.py`** |
| 10 | +Called at ~3 locations but never defined. `save_intermediate=True` will raise `AttributeError`. Fix: implement the method or remove the calls. |
| 11 | + |
| 12 | +**Bug 2 — Missing `_select_channel` in `seabed/detection.py`** |
| 13 | +`detect_seabed_blackwell` calls `_select_channel(ds, channel)` — function doesn't exist. The `"blackwell"` method is also missing from the `Literal` type hint. The entire blackwell path is broken. |
| 14 | + |
| 15 | +**Bug 3 — Signature mismatch: `environment/geoparquet.py` → `environment/blended.py`** |
| 16 | +Caller passes `insitu_temp=, insitu_sal=, channel_ids=, target_depth_m=` but the callee expects `insitu_df=, channels=, target_depth=, frequency_hz=`. This path raises `TypeError`. |
| 17 | + |
| 18 | +**Bug 4 — Unreachable return in `concat.py`** |
| 19 | +`return dt` after `return merged` in `merge_location_data` — dead code. Remove it. |
| 20 | + |
| 21 | +**Bug 5 — Calibration key type inconsistency in `calibrate/calibration.py`** |
| 22 | +`validate_calibration_params` expects numeric frequency keys but loaders produce string keys like `"38kHz"` or `"38k_short"`. Validation will reject valid calibration data. |
| 23 | + |
| 24 | +**Bug 6 — `detect_sonar_model` is a stub in `convert.py`** |
| 25 | +Always returns `"EK80"`. Non-critical but incorrect for non-EK80 data. Implement basic header detection. |
| 26 | + |
| 27 | +--- |
| 28 | + |
| 29 | +## Phase 2: Feature Gaps (needed for saildrone-data parity) |
| 30 | + |
| 31 | +| # | Gap | Reference (saildrone-data) | Status in oceanstream | Fix | |
| 32 | +|---|-----|---------------------------|----------------------|-----| |
| 33 | +| 1 | `depth_offset` not wired in main `compute_sv` | `sv_dataset.py` `compute_sv(depth_offset=0)` | Helper exists in `enrich_sv_dataset` but primary `compute_sv()` skips it | Add param and wire through | |
| 34 | +| 2 | No Pydantic denoise parameter models | `pydantic_models/denoise.py` — per-freq, per-pulse, 38kHz inheritance | Has `DenoiseConfig` dataclass + `FREQUENCY_PRESETS` but not Pydantic | Create Pydantic models with same per-freq/pulse-length schema | |
| 35 | +| 3 | No pulse-category splitting for concat/export | `export.py` / concat flow separates `short_pulse`/`long_pulse` | `concat.py` groups by day only | Add pulse-mode detection + category grouping | |
| 36 | +| 4 | No batch time-window grouping for concatenation | Prefect flow `_batch_key` day-window logic | Basic concat only | Add time-window batch utility | |
| 37 | +| 5 | No standalone `compute_and_save_nasc`/`mvbs` wrappers | `workflow.py` zarr-open→compute→save helpers | Has compute functions but no zarr-to-zarr wrappers | Add thin wrappers or ensure processor covers this | |
| 38 | +| 6 | No bathymetry gating for seabed step | `workflow.py` skips seabed detection if depth > instrument range | Processor always runs seabed detection | Add bathymetry check before seabed step | |
| 39 | +| 7 | No NASC DB persistence | `NASCPointService` PostgreSQL | Has NASC GeoParquet export instead | **Parity via different approach** — exclude | |
| 40 | +| 8 | No Azure IoT integration | `azure_iot/` module | Not present | **Out of scope** — edge deployment feature | |
| 41 | + |
| 42 | +--- |
| 43 | + |
| 44 | +## Phase 3: Consistency & Quality Improvements |
| 45 | + |
| 46 | +1. **Two incompatible ECS parsers** in `calibrate/calibration.py`: `_load_ecs_calibration` uses INI-style parsing; `parse_ecs_file` uses XML. Consolidate to one correct parser. |
| 47 | +2. **Calibration write inconsistency**: Generic `apply_calibration` and Saildrone-specific `calibrate_saildrone` write to echodata groups differently. Ensure consistency. |
| 48 | +3. **Test coverage**: Verify unit tests exist for all denoise algorithms, all seabed detection methods (including blackwell), calibration paths, and compute paths. |
| 49 | + |
| 50 | +--- |
| 51 | + |
| 52 | +## What Oceanstream Already Has Beyond Saildrone-Data |
| 53 | + |
| 54 | +Not gaps — these are advancements: |
| 55 | +- **Environment module** — Copernicus CDS integration, blended profiles, absorption/sound-speed equations (nothing equivalent in saildrone-data) |
| 56 | +- **STAC metadata** — Collection/item emission for echodata products |
| 57 | +- **NASC GeoParquet export** — Spatial Hive-partitioned output |
| 58 | +- **TOML config with frequency presets** — Centralized configuration |
| 59 | +- **Provider architecture** — Pluggable data source adapters |
| 60 | +- **Geotrack module** — Replaces and exceeds `process_gps.py` |
| 61 | +- **Composite/deltaSv seabed detection** — Additional algorithms beyond saildrone-data's main path |
| 62 | + |
| 63 | +--- |
| 64 | + |
| 65 | +## Verification |
| 66 | + |
| 67 | +1. After bugs: `make test-unit` + `ruff check . && mypy oceanstream` pass clean |
| 68 | +2. After gaps: Integration test running full pipeline (convert → calibrate → Sv → denoise → seabed → MVBS/NASC → echogram) |
| 69 | +3. Specific: `detect_seabed(method="blackwell")` runs without error; `.ecs` and `.xlsx` calibration pass validation; geoparquet→blended path completes; processor with `save_intermediate=True` writes Zarr checkpoints |
| 70 | +4. Cross-validation: Same denoise config on same test file in both projects, compare output masks |
| 71 | + |
| 72 | +--- |
| 73 | + |
| 74 | +## Scope Decisions |
| 75 | + |
| 76 | +- **In scope**: All 6 bugs, gaps 1-6, quality improvements |
| 77 | +- **Out of scope**: Azure IoT (edge feature), PostgreSQL NASC persistence (GeoParquet approach is better), Prefect orchestration (oceanstream uses class-based processor) |
| 78 | +- **Assumption**: Core denoise *algorithms* are at parity — gaps are in plumbing/config/integration |
| 79 | +- **Assumption**: Geotrack module adequately replaces `process_gps.py` — not re-audited separately |
| 80 | + |
| 81 | +--- |
| 82 | + |
| 83 | +## Further Considerations |
| 84 | + |
| 85 | +1. **Pydantic models location**: Should they live in `oceanstream/echodata/config.py` alongside existing dataclasses, or in a new `oceanstream/echodata/models.py`? Recommend new `models.py` to keep concerns separated. |
| 86 | +2. **Batch concatenation complexity**: The saildrone-data time-window grouping is tightly coupled to Prefect. Oceanstream could implement this as a pure library utility in `concat.py` that the processor class calls, making it usable outside any orchestration framework. |
| 87 | +3. **ECS parser consolidation**: ECS files from Simrad are INI-style (`.ini` variant). The XML parser path may have been added for a different calibration file format — need to verify before removing it. |
0 commit comments