Skip to content

Duckdb migration#2

Open
maxmalynowsky wants to merge 51 commits into
mainfrom
duckdb-migration
Open

Duckdb migration#2
maxmalynowsky wants to merge 51 commits into
mainfrom
duckdb-migration

Conversation

@maxmalynowsky

Copy link
Copy Markdown
Member

No description provided.

maxmalynowsky and others added 21 commits April 30, 2026 22:19
The _05 cell-to-fid join now asks "where does this cell live?" against
both _01 (primary) and _04 (fallback) using the cell's own interior
point, instead of the prior "does any _01 interior point fall inside this
cell?" with a QUALIFY tiebreak. This fixes routing for cells that sit
inside _01 but don't happen to contain _01's interior point — concave
shapes, multipart polygons, sub-cells from polygonization. On COL ADM3
the missing-area check goes from 17 bad fids to 1 (residual case is a
polygonization fusion, symmetric in both join logics).

Precomputed bbox columns on _05_tmp1 (per-part _01) and _04 let
downstream joins use plain numeric comparisons instead of recomputing
ST_XMin/etc per candidate pair. Direct measurement on COL ADM3: _05
query 146.7s → 10.9s (13×) on the same machine and settings.

Tables renamed to reflect creation order in merge.py:
  _05_tmp4 (parts) → _05_tmp1
  _05_tmp1 (extension lines) → _05_tmp2
  _05_tmp2 (snapped lines) → _05_tmp3
  _05_tmp3 (polygonized cells) → _05_tmp4

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replace _check_missing_area with _check_input_preserved: ST_Difference
on each input vs same-fid output catches the case where outward extension
masks a feature losing area to a neighbour at an internal boundary.
Re-polygonize fids whose interior boundary was reassigned during merge,
injecting their _02a exterior rings into the line union. Surgical pass
first (flagged fids only), global if any remain. Tighten reassignment
threshold from 0.1% to 0.01%; defer drops of _02a/_02b/_04/_05_tmp1/
_05_tmp3/_05_tmp4 to outputs.py so repair has the tables it needs.

Also tee per-run logs to tmp/{name}.log via log_file() context manager.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
merge.py was getting long; the polygonize/repair half is a self-
contained unit that lives better in its own file. While here, normalize
the rest of the stage modules to public-first ordering for consistency
with merge.py and polygonize.py.
attempt.py stays unnumbered since it's an orchestrator wrapping
points + voronoi with retry, not a stage of its own. clean is _01a
because it's an optional sub-step that only runs on coverage violations.
Use uv sync with pyproject.toml/uv.lock instead of pinned pip install,
drop OCI labels, .dockerignore, and redundant env vars. Add py3-psutil
and edge/community repo to the Alpine variant.
Drops the multi-stage GEOS-from-source Docker build and the ctypes
binding in _01a_clean.py now that DuckDB spatial 1.5.3 exposes
ST_CoverageClean natively. Updates docs and .gitignore accordingly.
…Clean

Byte-exact preservation of original polygon vertices is no longer a
requirement (inputs are cleaned upstream by the caller), which removes the
reason the ST_Node/ST_Polygonize planar-partition rebuild and its
surgical-to-global repair escalation existed in the first place — that
machinery was a workaround for DuckDB spatial having no native
ST_CoverageClean when it was built, not a deliberate correctness choice.

- _05_merge.py: per-fid ST_Difference against a bbox-prefiltered neighbor
  union, then a single whole-table ST_CoverageClean pass; _05b_polygonize.py
  deleted. Mirrors the original PostGIS implementation's union+difference+
  coverage-clean pattern (recovered via `git show f6a3b67:app_postgis/merge.py`),
  now viable natively since DuckDB spatial 1.5.3.
- _01_inputs.py: folds in the former _01a_clean.py stage (one fewer module,
  one fewer pipeline step). Coverage-clean now runs unconditionally over the
  whole table when violations are found, with no sliver-vs-real-hole
  detection — narrow gaps flow through to the Voronoi-extension stage like
  any other hole, which is the pipeline's actual job.
- _02_lines.py: drops _02b (interior/shared edges), only ever needed for
  ST_Node.
- _06_outputs.py: drops the byte-exactness check.
- utils.py: extracts the coverage_clean helper shared by inputs and merge.

Verified end-to-end on Burundi and Chile (the canonical stress test); fixed
an OOM along the way from an early draft that used a single global
ST_Union_Agg(_01) as a per-row ST_Difference operand instead of a
bbox-prefiltered neighbor union.
Looping over input_dir inside one long-lived process let RSS grow
unbounded across files (observed ~20GB) even though every file peaks
under 3GB alone — GEOS's native heap isn't returned to the OS between
files despite each DuckDB connection being closed. main() now shells
out to a fresh `uv run -m app --input-file=X` subprocess per file.
…emory budget

_03_points.py now decomposes boundary lines into real vertex-to-vertex
segments and caps interpolation on segments longer than
DISTANCE*MAX_POINTS_PER_SEGMENT, bounding the largest exactly-collinear
point cluster fed to ST_VoronoiDiagram independent of that segment's raw
length. Fixes Chad/Algeria's multi-minute-to-hours pathological slowdowns
(confirmed: Chad's full pipeline 1115s -> ~20s) without altering any
geometry — no simplification, every generated point stays on the true
digitized boundary.

attempt.py now derives a starting DISTANCE per file instead of always using
the flat config default, combining the file's own natural vertex resolution
with a point-volume budget so files never generate more points than
--memory-gb can safely hold. That budget itself is fitted from probes run
inside a real --memory=4g --memory-swap=4g (no swap) Docker container rather
than RSS inference on an unconstrained dev machine, since testing there
showed the previous flat calibration (5.5M points, based on Chile's
dev-machine behavior) actually OOMs for real under the stated 2-4GB
deployment target. Confirmed via docker inspect .State.OOMKilled that this
class of failure is a hard kernel SIGKILL, not a catchable DuckDBError, so
the retry-doubling loop can't rescue it after the fact — the budget has to
avoid the ceiling going in. A raw-vertex-count floor (segment
decompose+remerge cost, measured independent of DISTANCE) is checked
up front so files whose vertex count alone exceeds the budget fail fast with
a clear message instead of wasting 10 retries on a problem DISTANCE can't
fix.

Also found along the way: an O(n^2) ST_PointN-based decomposition, a
per-segment (vs per-fid) differencing blowup, and a per-real-segment point
floor that broke Philippines' 13M-vertex boundary — all fixed as part of the
same change. Two more OOM bottlenecks (inputs.py's coverage-clean pass,
lines.py's neighbor-union self-join) were found during 4GB-container
verification but are out of scope here; documented in docs/voronoi-memory.md
for follow-up.
… ceilings

attempt.py's remerge-floor check now warns and falls back to the plain
default distance instead of raising, since a real deployment may have swap
headroom beyond the stated --memory-gb. A stricter fail-fast gate was also
tried for _01_inputs.py's ST_CoverageClean and _02_lines.py's neighbor-union
self-join but reverted: with no DISTANCE-style parameter to fall back to
there, blocking execution outright wasn't worth two more probe-fitted
constants, especially given the multi-file batch path already contains an
OOM's blast radius to one subprocess.

Also corrects CLAUDE.md's inaccurate claim that _02_lines.py already
explodes multipolygon fids into per-part bboxes like _05_merge.py — it
doesn't, and doing so was tested and found to badly regress Chile (one fid
with 3,796 tightly-clustered parts) while only helping Indonesia's opposite
shape (many fids with a few widely-scattered parts). Philippines (~5.9GB)
and Indonesia (~5.4GB) genuinely don't fit a 4GB deployment for these two
operations — documented in docs/voronoi-memory.md as a known, real gap
rather than papered over with a threshold.
…nvention

The DISTANCE-budget constants' derivation (measured data points, fitted
formulas) was duplicated almost verbatim between config.py and
docs/voronoi-memory.md, and the config.py copy had drifted stale after the
prior commit changed attempt.py's remerge-floor check from fail-fast to
warn-and-fallback. Trimmed config.py to terse pointers and moved the unique
MAX_POINTS_PER_SEGMENT timing table into docs/voronoi-memory.md, which is
the actual authoritative source.

Also drops the leading underscore from REMERGE_BYTES_PER_RAW_SEGMENT,
BASELINE_OVERHEAD_MB, BYTES_PER_POINT, and SAFETY_MARGIN — they're imported
into attempt.py, so the underscore (reserved in this file for genuinely
config.py-internal names like _BOOL_VALS/_PARQUET_EXPORT) was misleading.
candidate = min(distance, natural_res) means a user-supplied --distance can
only ever floor resolution for naturally-coarse boundaries (natural_res
always wins when finer) — it can never coarsen an already-detailed file.
README.md documented "use a larger value for the entire world" as a working
use case, but that never actually applied once real coastline detail exists
anywhere in the file. With all meaningful per-file adaptivity now coming
from --memory-gb + natural_res automatically, the flag added configurability
with no corresponding real effect.

0.0002 now lives only as DEFAULT_DISTANCE, a fixed internal constant in
config.py with no CLI flag or env var. Updates README/CLAUDE.md/performance
docs accordingly.
…ds/--debug

Pre-existing staleness, unrelated to the CLI's actual argparse definitions
in config.py. --num-threads was never a real flag (the actual name is
--threads); --quiet doesn't exist at all and --debug is the closest real
equivalent, with opposite semantics (adds verbosity, doesn't suppress it).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant