Duckdb migration by maxmalynowsky · Pull Request #2 · fieldmaps/topo-tools-py

maxmalynowsky · 2026-04-15T07:18:01Z

No description provided.

The _05 cell-to-fid join now asks "where does this cell live?" against both _01 (primary) and _04 (fallback) using the cell's own interior point, instead of the prior "does any _01 interior point fall inside this cell?" with a QUALIFY tiebreak. This fixes routing for cells that sit inside _01 but don't happen to contain _01's interior point — concave shapes, multipart polygons, sub-cells from polygonization. On COL ADM3 the missing-area check goes from 17 bad fids to 1 (residual case is a polygonization fusion, symmetric in both join logics). Precomputed bbox columns on _05_tmp1 (per-part _01) and _04 let downstream joins use plain numeric comparisons instead of recomputing ST_XMin/etc per candidate pair. Direct measurement on COL ADM3: _05 query 146.7s → 10.9s (13×) on the same machine and settings. Tables renamed to reflect creation order in merge.py: _05_tmp4 (parts) → _05_tmp1 _05_tmp1 (extension lines) → _05_tmp2 _05_tmp2 (snapped lines) → _05_tmp3 _05_tmp3 (polygonized cells) → _05_tmp4 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Replace _check_missing_area with _check_input_preserved: ST_Difference on each input vs same-fid output catches the case where outward extension masks a feature losing area to a neighbour at an internal boundary.

Re-polygonize fids whose interior boundary was reassigned during merge, injecting their _02a exterior rings into the line union. Surgical pass first (flagged fids only), global if any remain. Tighten reassignment threshold from 0.1% to 0.01%; defer drops of _02a/_02b/_04/_05_tmp1/ _05_tmp3/_05_tmp4 to outputs.py so repair has the tables it needs. Also tee per-run logs to tmp/{name}.log via log_file() context manager. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

merge.py was getting long; the polygonize/repair half is a self- contained unit that lives better in its own file. While here, normalize the rest of the stage modules to public-first ordering for consistency with merge.py and polygonize.py.

attempt.py stays unnumbered since it's an orchestrator wrapping points + voronoi with retry, not a stage of its own. clean is _01a because it's an optional sub-step that only runs on coverage violations.

Use uv sync with pyproject.toml/uv.lock instead of pinned pip install, drop OCI labels, .dockerignore, and redundant env vars. Add py3-psutil and edge/community repo to the Alpine variant.

Drops the multi-stage GEOS-from-source Docker build and the ctypes binding in _01a_clean.py now that DuckDB spatial 1.5.3 exposes ST_CoverageClean natively. Updates docs and .gitignore accordingly.

…Clean Byte-exact preservation of original polygon vertices is no longer a requirement (inputs are cleaned upstream by the caller), which removes the reason the ST_Node/ST_Polygonize planar-partition rebuild and its surgical-to-global repair escalation existed in the first place — that machinery was a workaround for DuckDB spatial having no native ST_CoverageClean when it was built, not a deliberate correctness choice. - _05_merge.py: per-fid ST_Difference against a bbox-prefiltered neighbor union, then a single whole-table ST_CoverageClean pass; _05b_polygonize.py deleted. Mirrors the original PostGIS implementation's union+difference+ coverage-clean pattern (recovered via `git show f6a3b67:app_postgis/merge.py`), now viable natively since DuckDB spatial 1.5.3. - _01_inputs.py: folds in the former _01a_clean.py stage (one fewer module, one fewer pipeline step). Coverage-clean now runs unconditionally over the whole table when violations are found, with no sliver-vs-real-hole detection — narrow gaps flow through to the Voronoi-extension stage like any other hole, which is the pipeline's actual job. - _02_lines.py: drops _02b (interior/shared edges), only ever needed for ST_Node. - _06_outputs.py: drops the byte-exactness check. - utils.py: extracts the coverage_clean helper shared by inputs and merge. Verified end-to-end on Burundi and Chile (the canonical stress test); fixed an OOM along the way from an early draft that used a single global ST_Union_Agg(_01) as a per-row ST_Difference operand instead of a bbox-prefiltered neighbor union.

Looping over input_dir inside one long-lived process let RSS grow unbounded across files (observed ~20GB) even though every file peaks under 3GB alone — GEOS's native heap isn't returned to the OS between files despite each DuckDB connection being closed. main() now shells out to a fresh `uv run -m app --input-file=X` subprocess per file.

…emory budget _03_points.py now decomposes boundary lines into real vertex-to-vertex segments and caps interpolation on segments longer than DISTANCE*MAX_POINTS_PER_SEGMENT, bounding the largest exactly-collinear point cluster fed to ST_VoronoiDiagram independent of that segment's raw length. Fixes Chad/Algeria's multi-minute-to-hours pathological slowdowns (confirmed: Chad's full pipeline 1115s -> ~20s) without altering any geometry — no simplification, every generated point stays on the true digitized boundary. attempt.py now derives a starting DISTANCE per file instead of always using the flat config default, combining the file's own natural vertex resolution with a point-volume budget so files never generate more points than --memory-gb can safely hold. That budget itself is fitted from probes run inside a real --memory=4g --memory-swap=4g (no swap) Docker container rather than RSS inference on an unconstrained dev machine, since testing there showed the previous flat calibration (5.5M points, based on Chile's dev-machine behavior) actually OOMs for real under the stated 2-4GB deployment target. Confirmed via docker inspect .State.OOMKilled that this class of failure is a hard kernel SIGKILL, not a catchable DuckDBError, so the retry-doubling loop can't rescue it after the fact — the budget has to avoid the ceiling going in. A raw-vertex-count floor (segment decompose+remerge cost, measured independent of DISTANCE) is checked up front so files whose vertex count alone exceeds the budget fail fast with a clear message instead of wasting 10 retries on a problem DISTANCE can't fix. Also found along the way: an O(n^2) ST_PointN-based decomposition, a per-segment (vs per-fid) differencing blowup, and a per-real-segment point floor that broke Philippines' 13M-vertex boundary — all fixed as part of the same change. Two more OOM bottlenecks (inputs.py's coverage-clean pass, lines.py's neighbor-union self-join) were found during 4GB-container verification but are out of scope here; documented in docs/voronoi-memory.md for follow-up.

… ceilings attempt.py's remerge-floor check now warns and falls back to the plain default distance instead of raising, since a real deployment may have swap headroom beyond the stated --memory-gb. A stricter fail-fast gate was also tried for _01_inputs.py's ST_CoverageClean and _02_lines.py's neighbor-union self-join but reverted: with no DISTANCE-style parameter to fall back to there, blocking execution outright wasn't worth two more probe-fitted constants, especially given the multi-file batch path already contains an OOM's blast radius to one subprocess. Also corrects CLAUDE.md's inaccurate claim that _02_lines.py already explodes multipolygon fids into per-part bboxes like _05_merge.py — it doesn't, and doing so was tested and found to badly regress Chile (one fid with 3,796 tightly-clustered parts) while only helping Indonesia's opposite shape (many fids with a few widely-scattered parts). Philippines (~5.9GB) and Indonesia (~5.4GB) genuinely don't fit a 4GB deployment for these two operations — documented in docs/voronoi-memory.md as a known, real gap rather than papered over with a threshold.

…nvention The DISTANCE-budget constants' derivation (measured data points, fitted formulas) was duplicated almost verbatim between config.py and docs/voronoi-memory.md, and the config.py copy had drifted stale after the prior commit changed attempt.py's remerge-floor check from fail-fast to warn-and-fallback. Trimmed config.py to terse pointers and moved the unique MAX_POINTS_PER_SEGMENT timing table into docs/voronoi-memory.md, which is the actual authoritative source. Also drops the leading underscore from REMERGE_BYTES_PER_RAW_SEGMENT, BASELINE_OVERHEAD_MB, BYTES_PER_POINT, and SAFETY_MARGIN — they're imported into attempt.py, so the underscore (reserved in this file for genuinely config.py-internal names like _BOOL_VALS/_PARQUET_EXPORT) was misleading.

candidate = min(distance, natural_res) means a user-supplied --distance can only ever floor resolution for naturally-coarse boundaries (natural_res always wins when finer) — it can never coarsen an already-detailed file. README.md documented "use a larger value for the entire world" as a working use case, but that never actually applied once real coastline detail exists anywhere in the file. With all meaningful per-file adaptivity now coming from --memory-gb + natural_res automatically, the flag added configurability with no corresponding real effect. 0.0002 now lives only as DEFAULT_DISTANCE, a fixed internal constant in config.py with no CLI flag or env var. Updates README/CLAUDE.md/performance docs accordingly.

…ds/--debug Pre-existing staleness, unrelated to the CLI's actual argparse definitions in config.py. --num-threads was never a real flag (the actual name is --threads); --quiet doesn't exist at all and --debug is the closest real equivalent, with opposite semantics (adds verbosity, doesn't suppress it).

maxmalynowsky added 30 commits April 14, 2026 09:14

migration from postgis to DuckDB

f6a3b67

replace PostGIS backend with DuckDB spatial, remove app_postgis module

153deb0

optimize alpine build

cfbdc1b

migrate to using a duckdb db and fix OOM

6314ebe

fix all existing runtime issues

d1a9406

fully migrate pipeline to DuckDB

f59ddf9

successfully fix merge memory issues

df79843

optimize runtime performance

80f1c31

add performance documentation

6fd1806

update dockerfile

08b7451

harden merge approach

4b28e9b

optimize merge step

6ef08b8

refactor code for efficiency

d02514e

fix multipolygon issue

9e3716d

complete merge re-write for memory efficiency

486070d

add profiler

8043afd

add peak memory profiler

c06a516

add more cli args to test

cb048eb

add additional testing params

37626cb

optimize order of operations

c01e8dc

fix external voronoi snapping to internal lines

d1fbfa5

reduce spatial join memory

9c9c11d

psutil performance profiling

b9a0ad5

optimize table dropping and thread usage

22cb8e5

add rtree performance docs

05c49b5

fix output generation with correct point in polygon implementation

24ef6a1

add ST_MemUnion_agg profile test

99b1eaf

fix noding for polygonization

5ae0e52

improve input line generation

56b1f5e

improve output file prformance

746bf10

maxmalynowsky force-pushed the duckdb-migration branch from 5d24fc3 to 746bf10 Compare April 30, 2026 06:44

maxmalynowsky and others added 21 commits April 30, 2026 22:19

add topology cleaning on input

1ed6f2b

implement more robust cleaning detection

69e9d8c

fix segmentation fault errors

fbc153a

rename stage to step

cc4cc59

fix internal holes and better detect bad output polygons

a810ef2

optimize table drop

1d722f6

detect interior boundary reassignment in output check

93a54f3

Replace _check_missing_area with _check_input_preserved: ST_Difference on each input vs same-fid output catches the case where outward extension masks a feature losing area to a neighbour at an internal boundary.

remove redundant skip log in clean stage

1e85361

prefix pipeline files with stage numbers matching table names

4eb0de1

attempt.py stays unnumbered since it's an orchestrator wrapping points + voronoi with retry, not a stage of its own. clean is _01a because it's an optional sub-step that only runs on coverage violations.

switch Dockerfile to uv-based build, simplify both images

f9e5ce9

Use uv sync with pyproject.toml/uv.lock instead of pinned pip install, drop OCI labels, .dockerignore, and redundant env vars. Add py3-psutil and edge/community repo to the Alpine variant.

replace ctypes libgeos binding with native ST_CoverageClean

4aa327d

Drops the multi-stage GEOS-from-source Docker build and the ctypes binding in _01a_clean.py now that DuckDB spatial 1.5.3 exposes ST_CoverageClean natively. Updates docs and .gitignore accordingly.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Duckdb migration#2

Duckdb migration#2
maxmalynowsky wants to merge 51 commits into
mainfrom
duckdb-migration

maxmalynowsky commented Apr 15, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

maxmalynowsky commented Apr 15, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant