Background
A community user (David Sooter, Discord) asked whether Arc can read/write spatial/geo data, observing that DuckDB has the spatial extension but unsure whether Arrow/Parquet support carries through. He also proposed making DuckDB extensions configurable on a per-operator basis — that's the right architectural direction.
This issue tracks the investigation results and a staged implementation plan.
Investigation summary
I did a targeted code review of Arc's existing DuckDB extension loader (built originally for the proprietary arcx extension), Arc's Arrow/Parquet ingest writer, and the arrow-go upstream library's Parquet metadata API. Three architectural layers, three different difficulty curves:
Tier 1 — Configurable DuckDB extensions (small, ships first)
Today Arc loads the arcx extension via a per-connection LOAD callback wired through duckdb.NewConnector (internal/database/duckdb.go:185-243). The mechanism is general; it's hardcoded to a single extension path. Generalizing it to an operator-configurable list of DuckDB community extensions is the right next step and unblocks ~80% of geo use cases without any storage-format changes.
What this unlocks:
spatial: `ST_Distance`, `ST_Point`, `ST_Contains`, `ST_Within`, bounding-box predicates over plain `DOUBLE` lat/lon columns
- Other community extensions on the same plumbing: `vss` (vector similarity), `inet`, `fuzzy_string_matching`, `json`, `icu`, etc.
Config shape (proposed):
```yaml
database:
extensions:
- spatial
- vss
- inet
```
Each entry triggers `INSTALL ; LOAD ` on every pooled connection. Default empty list (no extensions loaded). Operators on air-gapped deployments can override via env: `ARC_DATABASE_EXTENSIONS=spatial,vss`.
Trust model: loading DuckDB extensions executes native code in Arc's process. Today the `allow_unsigned_extensions=true` DSN flag is set only when arcx is configured (internal/database/duckdb.go:170-178); we'd flip it on when any extension list is non-empty. Community DuckDB extensions are signed by the DuckDB project, so unsigned-mode is only required for custom/proprietary builds; happy to scope a stricter signing policy if needed.
Scope: ~half a day. Generalize `openDuckDB` to accept a slice of extensions; thread the config through. Single small PR. Tests: existing arcx test pattern carries over.
Tier 2 — Native GEOMETRY persistence as GeoParquet (larger; separate PR)
GeoParquet is the spec for storing geometries in Parquet — geometries encoded as WKB inside binary Parquet columns + a sidecar `geo` key-value metadata blob describing geometry columns, encoding, CRS, and bounding box.
What I confirmed in arrow-go:
`pqarrow.FileWriter` exposes `AppendKeyValueMetadata(key, value string)` (source). This is the GeoParquet metadata hook. We can write GeoParquet from Arc today; arrow-go is not a blocker.
What needs to change in Arc:
- Type inference recognizes `[]byte` → currently `internal/ingest/arrow_writer.go:inferArrowType` falls through to `unsupported type` for `[]byte` (line 355-366). Add a case mapping to `arrow.BinaryTypes.Binary`. Unlocks msgpack ingest of WKB bytes today, with or without GeoParquet metadata.
- Geometry-column convention — how does a write request flag "this binary column is WKB"? Options:
- Column-name suffix (e.g. `location_geom` auto-marked as geometry) — minimal contract change but feels brittle
- Explicit schema hint in the request (`{schema_hints: {location: "geometry"}}`) — clean, requires client-side support
- Per-database/per-measurement schema registry that operators declare upfront — most rigorous
- GeoParquet metadata emission — at the `pqarrow.NewFileWriter` site (arrow_writer.go:649-657), call `writer.AppendKeyValueMetadata("geo", geoJSON)` where `geoJSON` is the GeoParquet 1.1 schema describing each geometry column. Minimal impl: WKB encoding, EPSG:4326 CRS, geometry type "Point" or "Unknown".
Scope: 1-2 days of implementation + tests. Per-measurement schema-hint shape is the most flexible; happy to write a design doc for that as the kickoff for the Tier 2 PR.
What stays blocked: PostGIS-style GEOMETRY indexing on the read side (DuckDB's spatial extension does its own R-tree at query time, but indexing during compaction is a separate question — see Tier 3).
Tier 3 — Spatial indexing during compaction (later, optional)
R-tree or H3 indexing on top of stored geometries so `WHERE ST_Contains(geom, ST_Point(...))` doesn't full-scan. Material perf win on large geo datasets but requires real design work in the compaction layer and operator-side bucket configuration. Not blocking anything; tracking only.
Recommended staging
- PR 1 — Tier 1: Configurable DuckDB extensions. ~half day. Ships spatial functions over numeric columns + every other community DuckDB extension on the same plumbing. ([scoped above])
- PR 2 — Tier 2 (Phase A): Recognize `[]byte` in type inference; document WKB ingest via msgpack (without GeoParquet metadata yet). Tests for round-trip. Unblocks community users wanting binary geom storage today, even if not labeled "GeoParquet."
- PR 3 — Tier 2 (Phase B): GeoParquet `geo` key-value metadata emission. Per-measurement schema-hint config. Lossless round-trip with GeoPandas / QGIS / PostGIS-via-DuckDB.
- Tier 3: Separate scoping issue when there's a real customer workload to design against.
Critical files
Out of scope
- Replacing DuckDB's spatial implementation with anything custom
- Reading PostGIS dump files directly
- WMS/WMTS tile-serving on top of stored geometries
- Coordinate-system transformations on read (operators feed data in their chosen CRS; CRS is stored, not transformed)
References
- Discord thread that triggered this — David Sooter, 2026-05-18
Background
A community user (David Sooter, Discord) asked whether Arc can read/write spatial/geo data, observing that DuckDB has the
spatialextension but unsure whether Arrow/Parquet support carries through. He also proposed making DuckDB extensions configurable on a per-operator basis — that's the right architectural direction.This issue tracks the investigation results and a staged implementation plan.
Investigation summary
I did a targeted code review of Arc's existing DuckDB extension loader (built originally for the proprietary
arcxextension), Arc's Arrow/Parquet ingest writer, and the arrow-go upstream library's Parquet metadata API. Three architectural layers, three different difficulty curves:Tier 1 — Configurable DuckDB extensions (small, ships first)
Today Arc loads the
arcxextension via a per-connectionLOADcallback wired throughduckdb.NewConnector(internal/database/duckdb.go:185-243). The mechanism is general; it's hardcoded to a single extension path. Generalizing it to an operator-configurable list of DuckDB community extensions is the right next step and unblocks ~80% of geo use cases without any storage-format changes.What this unlocks:
spatial: `ST_Distance`, `ST_Point`, `ST_Contains`, `ST_Within`, bounding-box predicates over plain `DOUBLE` lat/lon columnsConfig shape (proposed):
```yaml
database:
extensions:
- spatial
- vss
- inet
```
Each entry triggers `INSTALL ; LOAD ` on every pooled connection. Default empty list (no extensions loaded). Operators on air-gapped deployments can override via env: `ARC_DATABASE_EXTENSIONS=spatial,vss`.
Trust model: loading DuckDB extensions executes native code in Arc's process. Today the `allow_unsigned_extensions=true` DSN flag is set only when arcx is configured (internal/database/duckdb.go:170-178); we'd flip it on when any extension list is non-empty. Community DuckDB extensions are signed by the DuckDB project, so unsigned-mode is only required for custom/proprietary builds; happy to scope a stricter signing policy if needed.
Scope: ~half a day. Generalize `openDuckDB` to accept a slice of extensions; thread the config through. Single small PR. Tests: existing arcx test pattern carries over.
Tier 2 — Native GEOMETRY persistence as GeoParquet (larger; separate PR)
GeoParquet is the spec for storing geometries in Parquet — geometries encoded as WKB inside binary Parquet columns + a sidecar `geo` key-value metadata blob describing geometry columns, encoding, CRS, and bounding box.
What I confirmed in arrow-go:
`pqarrow.FileWriter` exposes `AppendKeyValueMetadata(key, value string)` (source). This is the GeoParquet metadata hook. We can write GeoParquet from Arc today; arrow-go is not a blocker.
What needs to change in Arc:
Scope: 1-2 days of implementation + tests. Per-measurement schema-hint shape is the most flexible; happy to write a design doc for that as the kickoff for the Tier 2 PR.
What stays blocked: PostGIS-style GEOMETRY indexing on the read side (DuckDB's spatial extension does its own R-tree at query time, but indexing during compaction is a separate question — see Tier 3).
Tier 3 — Spatial indexing during compaction (later, optional)
R-tree or H3 indexing on top of stored geometries so `WHERE ST_Contains(geom, ST_Point(...))` doesn't full-scan. Material perf win on large geo datasets but requires real design work in the compaction layer and operator-side bucket configuration. Not blocking anything; tracking only.
Recommended staging
Critical files
openDuckDB+connInitFn(the loader to generalize)allow_unsigned_extensionsDSN flagDatabaseConfigstruct; where to addExtensions []stringinferArrowType(the type-inference site for Tier 2)pqarrow.NewFileWritercall site (theAppendKeyValueMetadatainjection point)Out of scope
References