Add spatial/geo support to Arc (configurable DuckDB extensions + GeoParquet ingest) #440

@xe-nvdk

Background

A community user (David Sooter, Discord) asked whether Arc can read/write spatial/geo data, observing that DuckDB has the spatial extension but unsure whether Arrow/Parquet support carries through. He also proposed making DuckDB extensions configurable on a per-operator basis — that's the right architectural direction.

This issue tracks the investigation results and a staged implementation plan.

Investigation summary

I did a targeted code review of Arc's existing DuckDB extension loader (built originally for the proprietary arcx extension), Arc's Arrow/Parquet ingest writer, and the arrow-go upstream library's Parquet metadata API. Three architectural layers, three different difficulty curves:

Tier 1 — Configurable DuckDB extensions (small, ships first)

Today Arc loads the arcx extension via a per-connection LOAD callback wired through duckdb.NewConnector (internal/database/duckdb.go:185-243). The mechanism itself is general, but it is currently hardcoded to a single extension path. Generalizing it to an operator-configurable list of DuckDB extensions is the right next step and unblocks ~80% of geo use cases without any storage-format changes.

What this unlocks:

  • spatial: `ST_Distance`, `ST_Point`, `ST_Contains`, `ST_Within`, bounding-box predicates over plain `DOUBLE` lat/lon columns
  • Other DuckDB extensions (core or community) on the same plumbing: `vss` (vector similarity), `inet`, `fuzzy_string_matching`, `json`, `icu`, etc.

Config shape (proposed):

```yaml
database:
  extensions:
    - spatial
    - vss
    - inet

Each entry triggers `INSTALL <name>; LOAD <name>` on every pooled connection. The default is an empty list (no extensions loaded). Operators on air-gapped deployments can override via env: `ARC_DATABASE_EXTENSIONS=spatial,vss`.

Trust model: loading a DuckDB extension executes native code in Arc's process. Today the `allow_unsigned_extensions=true` DSN flag is set only when arcx is configured (internal/database/duckdb.go:170-178); the proposal is to turn it on whenever the extension list is non-empty. Extensions distributed through the DuckDB project's repositories are signed, so unsigned mode is strictly required only for custom/proprietary builds; happy to scope a stricter signing policy if needed.

Scope: ~half a day. Generalize `openDuckDB` to accept a slice of extensions; thread the config through. Single small PR. Tests: existing arcx test pattern carries over.
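
A rough sketch of the config-to-SQL step, assuming the extension list arrives as a comma-separated string (as in the `ARC_DATABASE_EXTENSIONS` example above). The names `parseExtensions`, `bootStatements`, and `validExtName` are hypothetical and not taken from Arc's code. The name validation is an assumption worth considering, since the values end up interpolated into SQL.

```go
package main

import (
	"fmt"
	"os"
	"regexp"
	"strings"
)

// validExtName restricts names to identifier-like tokens so a config value
// cannot smuggle extra SQL into the INSTALL/LOAD statements.
var validExtName = regexp.MustCompile(`^[A-Za-z_][A-Za-z0-9_]*$`)

// parseExtensions splits a comma-separated list such as "spatial,vss",
// trimming whitespace and dropping empty entries and duplicates.
func parseExtensions(raw string) ([]string, error) {
	var out []string
	seen := map[string]bool{}
	for _, part := range strings.Split(raw, ",") {
		name := strings.TrimSpace(part)
		if name == "" || seen[name] {
			continue
		}
		if !validExtName.MatchString(name) {
			return nil, fmt.Errorf("invalid extension name %q", name)
		}
		seen[name] = true
		out = append(out, name)
	}
	return out, nil
}

// bootStatements returns the SQL to run on each new pooled connection,
// in config order: INSTALL then LOAD for every extension.
func bootStatements(exts []string) []string {
	stmts := make([]string, 0, 2*len(exts))
	for _, e := range exts {
		stmts = append(stmts, "INSTALL "+e, "LOAD "+e)
	}
	return stmts
}

func main() {
	exts, err := parseExtensions(os.Getenv("ARC_DATABASE_EXTENSIONS"))
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	for _, s := range bootStatements(exts) {
		fmt.Println(s)
	}
}
```

In Arc these statements would presumably be executed inside the existing per-connection callback rather than printed. Note that the plain `INSTALL <name>` form targets DuckDB's default repository; extensions from the community repository are installed with `INSTALL <name> FROM community`, so the config shape may eventually need a way to express the source.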

Tier 2 — Native GEOMETRY persistence as GeoParquet (larger; separate PR)

GeoParquet is the spec for storing geometries in Parquet: geometries are encoded as WKB inside binary Parquet columns, and a `geo` entry in the file's key-value metadata describes the geometry columns, encoding, CRS, and bounding box.
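
For reference, this is roughly what a WKB payload looks like on the client side. A minimal sketch that encodes a 2D point as little-endian WKB using only the standard library; `wkbPoint` is a hypothetical helper, not an Arc or arrow-go function.

```go
package main

import (
	"encoding/binary"
	"encoding/hex"
	"fmt"
	"math"
)

// wkbPoint encodes a 2D point as little-endian WKB:
// 1 byte order flag (1 = little-endian), a uint32 geometry type (1 = Point),
// then X and Y as IEEE 754 float64. For lon/lat data, X is the longitude.
func wkbPoint(x, y float64) []byte {
	buf := make([]byte, 21)
	buf[0] = 1
	binary.LittleEndian.PutUint32(buf[1:5], 1)
	binary.LittleEndian.PutUint64(buf[5:13], math.Float64bits(x))
	binary.LittleEndian.PutUint64(buf[13:21], math.Float64bits(y))
	return buf
}

func main() {
	// POINT(1 2) → 0101000000000000000000f03f0000000000000040
	fmt.Println(hex.EncodeToString(wkbPoint(1, 2)))
}
```

A point is always 21 bytes in this encoding, which is the kind of value a client would send as a binary field once `[]byte` is accepted by ingest.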

What I confirmed in arrow-go:

`pqarrow.FileWriter` exposes `AppendKeyValueMetadata(key, value string)`. This is the GeoParquet metadata hook. We can write GeoParquet from Arc today; arrow-go is not a blocker.

What needs to change in Arc:

  1. Type inference recognizes `[]byte` → currently `internal/ingest/arrow_writer.go:inferArrowType` falls through to `unsupported type` for `[]byte` (lines 355-366). Add a case mapping to `arrow.BinaryTypes.Binary`. Unlocks msgpack ingest of WKB bytes today, with or without GeoParquet metadata.
  2. Geometry-column convention — how does a write request flag "this binary column is WKB"? Options:
    • Column-name suffix (e.g. `location_geom` auto-marked as geometry) — minimal contract change but feels brittle
    • Explicit schema hint in the request (`{schema_hints: {location: "geometry"}}`) — clean, requires client-side support
    • Per-database/per-measurement schema registry that operators declare upfront — most rigorous
  3. GeoParquet metadata emission — at the `pqarrow.NewFileWriter` site (arrow_writer.go:649-657), call `writer.AppendKeyValueMetadata("geo", geoJSON)` where `geoJSON` is the GeoParquet 1.1 schema describing each geometry column. Minimal impl: WKB encoding, EPSG:4326 CRS, geometry type "Point" or "Unknown".
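
A minimal sketch of building the `geo` value from item 3, using only `encoding/json`. The types and `buildGeoMetadata` are hypothetical names. Two assumptions to verify against the GeoParquet spec before relying on this: unknown geometry types are expressed as an empty `geometry_types` list rather than the string "Unknown", and `crs` is an optional PROJJSON object that defaults to OGC:CRS84 (WGS 84, longitude/latitude order) when omitted, which is why the sketch leaves it out.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// geoColumn holds the per-column fields this sketch emits.
type geoColumn struct {
	Encoding      string   `json:"encoding"`
	GeometryTypes []string `json:"geometry_types"`
}

// geoMetadata is the top-level object stored under the "geo" key.
type geoMetadata struct {
	Version       string               `json:"version"`
	PrimaryColumn string               `json:"primary_column"`
	Columns       map[string]geoColumn `json:"columns"`
}

// buildGeoMetadata returns the JSON string for the "geo" key. cols maps each
// geometry column name to its known geometry types (nil or empty = unknown).
func buildGeoMetadata(primary string, cols map[string][]string) (string, error) {
	if _, ok := cols[primary]; !ok {
		return "", fmt.Errorf("primary column %q is not a geometry column", primary)
	}
	meta := geoMetadata{
		Version:       "1.1.0",
		PrimaryColumn: primary,
		Columns:       make(map[string]geoColumn, len(cols)),
	}
	for name, types := range cols {
		if types == nil {
			types = []string{} // serialize as [] instead of null
		}
		meta.Columns[name] = geoColumn{Encoding: "WKB", GeometryTypes: types}
	}
	b, err := json.Marshal(meta)
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	geoJSON, err := buildGeoMetadata("location", map[string][]string{"location": {"Point"}})
	if err != nil {
		panic(err)
	}
	fmt.Println(geoJSON)
}
```

The resulting string is what would be handed to `writer.AppendKeyValueMetadata("geo", geoJSON)` at the writer site mentioned above.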

Scope: 1-2 days of implementation + tests. Per-measurement schema-hint shape is the most flexible; happy to write a design doc for that as the kickoff for the Tier 2 PR.

What stays blocked: PostGIS-style GEOMETRY indexing on the read side (DuckDB's spatial extension does its own R-tree at query time, but indexing during compaction is a separate question — see Tier 3).

Tier 3 — Spatial indexing during compaction (later, optional)

R-tree or H3 indexing on top of stored geometries so `WHERE ST_Contains(geom, ST_Point(...))` doesn't full-scan. Material perf win on large geo datasets but requires real design work in the compaction layer and operator-side bucket configuration. Not blocking anything; tracking only.

Recommended staging

  1. PR 1 — Tier 1: Configurable DuckDB extensions. ~half day. Ships spatial functions over numeric columns + every other community DuckDB extension on the same plumbing. (scoped above)
  2. PR 2 — Tier 2 (Phase A): Recognize `[]byte` in type inference; document WKB ingest via msgpack (without GeoParquet metadata yet). Tests for round-trip. Unblocks community users wanting binary geom storage today, even if not labeled "GeoParquet."
  3. PR 3 — Tier 2 (Phase B): GeoParquet `geo` key-value metadata emission. Per-measurement schema-hint config. Lossless round-trip with GeoPandas / QGIS / PostGIS-via-DuckDB.
  4. Tier 3: Separate scoping issue when there's a real customer workload to design against.

Critical files

Out of scope

  • Replacing DuckDB's spatial implementation with anything custom
  • Reading PostGIS dump files directly
  • WMS/WMTS tile-serving on top of stored geometries
  • Coordinate-system transformations on read (operators feed data in their chosen CRS; CRS is stored, not transformed)

References

  • Discord thread that triggered this — David Sooter, 2026-05-18
