Duckdb migration#2
Open
maxmalynowsky wants to merge 51 commits into
Open
Conversation
5d24fc3 to
746bf10
Compare
The _05 cell-to-fid join now asks "where does this cell live?" against both _01 (primary) and _04 (fallback) using the cell's own interior point, instead of the prior "does any _01 interior point fall inside this cell?" with a QUALIFY tiebreak. This fixes routing for cells that sit inside _01 but don't happen to contain _01's interior point — concave shapes, multipart polygons, sub-cells from polygonization. On COL ADM3 the missing-area check goes from 17 bad fids to 1 (residual case is a polygonization fusion, symmetric in both join logics). Precomputed bbox columns on _05_tmp1 (per-part _01) and _04 let downstream joins use plain numeric comparisons instead of recomputing ST_XMin/etc per candidate pair. Direct measurement on COL ADM3: _05 query 146.7s → 10.9s (13×) on the same machine and settings. Tables renamed to reflect creation order in merge.py: _05_tmp4 (parts) → _05_tmp1 _05_tmp1 (extension lines) → _05_tmp2 _05_tmp2 (snapped lines) → _05_tmp3 _05_tmp3 (polygonized cells) → _05_tmp4 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replace _check_missing_area with _check_input_preserved: ST_Difference on each input vs same-fid output catches the case where outward extension masks a feature losing area to a neighbour at an internal boundary.
Re-polygonize fids whose interior boundary was reassigned during merge,
injecting their _02a exterior rings into the line union. Surgical pass
first (flagged fids only), global if any remain. Tighten reassignment
threshold from 0.1% to 0.01%; defer drops of _02a/_02b/_04/_05_tmp1/
_05_tmp3/_05_tmp4 to outputs.py so repair has the tables it needs.
Also tee per-run logs to tmp/{name}.log via log_file() context manager.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
merge.py was getting long; the polygonize/repair half is a self- contained unit that lives better in its own file. While here, normalize the rest of the stage modules to public-first ordering for consistency with merge.py and polygonize.py.
attempt.py stays unnumbered since it's an orchestrator wrapping points + voronoi with retry, not a stage of its own. clean is _01a because it's an optional sub-step that only runs on coverage violations.
Use uv sync with pyproject.toml/uv.lock instead of pinned pip install, drop OCI labels, .dockerignore, and redundant env vars. Add py3-psutil and edge/community repo to the Alpine variant.
Drops the multi-stage GEOS-from-source Docker build and the ctypes binding in _01a_clean.py now that DuckDB spatial 1.5.3 exposes ST_CoverageClean natively. Updates docs and .gitignore accordingly.
…Clean Byte-exact preservation of original polygon vertices is no longer a requirement (inputs are cleaned upstream by the caller), which removes the reason the ST_Node/ST_Polygonize planar-partition rebuild and its surgical-to-global repair escalation existed in the first place — that machinery was a workaround for DuckDB spatial having no native ST_CoverageClean when it was built, not a deliberate correctness choice. - _05_merge.py: per-fid ST_Difference against a bbox-prefiltered neighbor union, then a single whole-table ST_CoverageClean pass; _05b_polygonize.py deleted. Mirrors the original PostGIS implementation's union+difference+ coverage-clean pattern (recovered via `git show f6a3b67:app_postgis/merge.py`), now viable natively since DuckDB spatial 1.5.3. - _01_inputs.py: folds in the former _01a_clean.py stage (one fewer module, one fewer pipeline step). Coverage-clean now runs unconditionally over the whole table when violations are found, with no sliver-vs-real-hole detection — narrow gaps flow through to the Voronoi-extension stage like any other hole, which is the pipeline's actual job. - _02_lines.py: drops _02b (interior/shared edges), only ever needed for ST_Node. - _06_outputs.py: drops the byte-exactness check. - utils.py: extracts the coverage_clean helper shared by inputs and merge. Verified end-to-end on Burundi and Chile (the canonical stress test); fixed an OOM along the way from an early draft that used a single global ST_Union_Agg(_01) as a per-row ST_Difference operand instead of a bbox-prefiltered neighbor union.
Looping over input_dir inside one long-lived process let RSS grow unbounded across files (observed ~20GB) even though every file peaks under 3GB alone — GEOS's native heap isn't returned to the OS between files despite each DuckDB connection being closed. main() now shells out to a fresh `uv run -m app --input-file=X` subprocess per file.
…emory budget _03_points.py now decomposes boundary lines into real vertex-to-vertex segments and caps interpolation on segments longer than DISTANCE*MAX_POINTS_PER_SEGMENT, bounding the largest exactly-collinear point cluster fed to ST_VoronoiDiagram independent of that segment's raw length. Fixes Chad/Algeria's multi-minute-to-hours pathological slowdowns (confirmed: Chad's full pipeline 1115s -> ~20s) without altering any geometry — no simplification, every generated point stays on the true digitized boundary. attempt.py now derives a starting DISTANCE per file instead of always using the flat config default, combining the file's own natural vertex resolution with a point-volume budget so files never generate more points than --memory-gb can safely hold. That budget itself is fitted from probes run inside a real --memory=4g --memory-swap=4g (no swap) Docker container rather than RSS inference on an unconstrained dev machine, since testing there showed the previous flat calibration (5.5M points, based on Chile's dev-machine behavior) actually OOMs for real under the stated 2-4GB deployment target. Confirmed via docker inspect .State.OOMKilled that this class of failure is a hard kernel SIGKILL, not a catchable DuckDBError, so the retry-doubling loop can't rescue it after the fact — the budget has to avoid the ceiling going in. A raw-vertex-count floor (segment decompose+remerge cost, measured independent of DISTANCE) is checked up front so files whose vertex count alone exceeds the budget fail fast with a clear message instead of wasting 10 retries on a problem DISTANCE can't fix. Also found along the way: an O(n^2) ST_PointN-based decomposition, a per-segment (vs per-fid) differencing blowup, and a per-real-segment point floor that broke Philippines' 13M-vertex boundary — all fixed as part of the same change. Two more OOM bottlenecks (inputs.py's coverage-clean pass, lines.py's neighbor-union self-join) were found during 4GB-container verification but are out of scope here; documented in docs/voronoi-memory.md for follow-up.
… ceilings attempt.py's remerge-floor check now warns and falls back to the plain default distance instead of raising, since a real deployment may have swap headroom beyond the stated --memory-gb. A stricter fail-fast gate was also tried for _01_inputs.py's ST_CoverageClean and _02_lines.py's neighbor-union self-join but reverted: with no DISTANCE-style parameter to fall back to there, blocking execution outright wasn't worth two more probe-fitted constants, especially given the multi-file batch path already contains an OOM's blast radius to one subprocess. Also corrects CLAUDE.md's inaccurate claim that _02_lines.py already explodes multipolygon fids into per-part bboxes like _05_merge.py — it doesn't, and doing so was tested and found to badly regress Chile (one fid with 3,796 tightly-clustered parts) while only helping Indonesia's opposite shape (many fids with a few widely-scattered parts). Philippines (~5.9GB) and Indonesia (~5.4GB) genuinely don't fit a 4GB deployment for these two operations — documented in docs/voronoi-memory.md as a known, real gap rather than papered over with a threshold.
…nvention The DISTANCE-budget constants' derivation (measured data points, fitted formulas) was duplicated almost verbatim between config.py and docs/voronoi-memory.md, and the config.py copy had drifted stale after the prior commit changed attempt.py's remerge-floor check from fail-fast to warn-and-fallback. Trimmed config.py to terse pointers and moved the unique MAX_POINTS_PER_SEGMENT timing table into docs/voronoi-memory.md, which is the actual authoritative source. Also drops the leading underscore from REMERGE_BYTES_PER_RAW_SEGMENT, BASELINE_OVERHEAD_MB, BYTES_PER_POINT, and SAFETY_MARGIN — they're imported into attempt.py, so the underscore (reserved in this file for genuinely config.py-internal names like _BOOL_VALS/_PARQUET_EXPORT) was misleading.
candidate = min(distance, natural_res) means a user-supplied --distance can only ever floor resolution for naturally-coarse boundaries (natural_res always wins when finer) — it can never coarsen an already-detailed file. README.md documented "use a larger value for the entire world" as a working use case, but that never actually applied once real coastline detail exists anywhere in the file. With all meaningful per-file adaptivity now coming from --memory-gb + natural_res automatically, the flag added configurability with no corresponding real effect. 0.0002 now lives only as DEFAULT_DISTANCE, a fixed internal constant in config.py with no CLI flag or env var. Updates README/CLAUDE.md/performance docs accordingly.
…ds/--debug Pre-existing staleness, unrelated to the CLI's actual argparse definitions in config.py. --num-threads was never a real flag (the actual name is --threads); --quiet doesn't exist at all and --debug is the closest real equivalent, with opposite semantics (adds verbosity, doesn't suppress it).
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
No description provided.