Cosmos: native driver distribution design + ADRs #4651
Add a design doc and a numbered ADR set describing how the native Cosmos driver (azurecosmosdriver cdylib/staticlib + C header) is distributed to language SDKs. Core model: separate provenance from distribution. One Rust build produces all signed platform binaries as an internal-only hand-off artifact (ADR 0001/0008); distribution is per-language native packages on each language's existing internal + external feed (ADR 0009) -- NuGet NativeAssets + meta for .NET (0002), cgo prebuilt header+lib via the Go feed (0003), JAR for Java (future). ABI handshake (0004), bytes-in/bytes-out copy-out (0005), opt-in transport (0006), and platform matrix (0007) round out the set.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Since these ADRs are unpublished (no PR, no accepted record), the immutability/supersession discipline does not yet apply, so the pre-publication editing history was scrubbed for a clean first cut:
- Drop ghost "supersedes the earlier canonical-bundle" scaffolding from 0001 and 0002 and the design doc; no reviewer ever saw that draft, so referencing it is noise.
- Reorder per-language-feed distribution from 0009 to 0002 so the keystone distribution decision sits next to the build-once provenance decision (0001); renumber the former 0002-0008 to 0003-0009. Provenance (0001) and distribution (0002) stay as two decoupled records so either can be superseded independently later.
- Re-point all in-text and link cross-references to the new numbers; reorder the index and design-doc decision tables to match.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The generic "distribution" name was ambiguous; "native-distribution" makes clear these docs cover distributing the native (FFI) driver artifact to the language SDKs. Pure directory rename: all internal links are relative and unaffected.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Pull request overview
This documentation-only PR adds a design document plus nine Architecture Decision Records (ADRs) describing how the future Rust-built native Cosmos driver (azure_data_cosmos_driver_native → azurecosmosdriver.{dll,so,dylib} + cbindgen C header) will be distributed to the .NET, Go, and (later) Java SDKs. The core idea is to separate provenance (one signed build → internal hand-off artifact) from distribution (per-language packages on each language's existing feed). No code, build, or pipeline behavior changes; everything is marked "proposed for review."
Changes:
- Adds `distribution-design.md`: the discussion doc (purpose, goals, model, per-consumer link model, platform matrix, rollout, open questions).
- Adds ADRs `0000–0009` capturing each decision (build-once hand-off, per-language feeds, .NET NuGet NativeAssets, Go cgo, ABI handshake, marshalling ownership, opt-in native, platform matrix, build/sign pipeline).
- References an existing/planned `NATIVE_WRAPPER_SPEC.md` as the owner of the C-ABI surface.
Reviewed changes
Copilot reviewed 11 out of 11 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| docs/native-distribution/distribution-design.md | Main design discussion; links to ADRs and to a non-existent NATIVE_WRAPPER_SPEC.md; introduces many domain terms not in the cSpell dictionaries. |
| docs/native-distribution/adr/0000-index.md | ADR index/template overview; links resolve correctly. |
| docs/native-distribution/adr/0001-build-once-internal-handoff.md | Decision: single build → internal-only hand-off artifact. |
| docs/native-distribution/adr/0002-per-language-feed-distribution.md | Decision: distribute via each language's existing feeds. |
| docs/native-distribution/adr/0003-dotnet-nuget-nativeassets.md | Decision: .NET per-RID NuGet NativeAssets + meta-package. |
| docs/native-distribution/adr/0004-go-cgo-prebuilt.md | Decision: Go consumes prebuilt header+lib via cgo. |
| docs/native-distribution/adr/0005-abi-version-handshake.md | Decision: lib exports cosmos_abi_version(); hosts check before use. |
| docs/native-distribution/adr/0006-binding-owns-marshalling.md | Decision: bindings own marshalling/copy-out; links to non-existent NATIVE_WRAPPER_SPEC.md. |
| docs/native-distribution/adr/0007-native-is-opt-in.md | Decision: native is opt-in until GA, then default-with-fallback. |
| docs/native-distribution/adr/0008-platform-matrix.md | Decision: bounded platform matrix; clear error on unsupported targets. |
| docs/native-distribution/adr/0009-build-and-signing-pipeline.md | Decision: one build, sign once, fan-out; jobs never rebuild. |
The two notable issues are CI-blocking: a broken relative link to NATIVE_WRAPPER_SPEC.md (referenced from the design doc and ADR 0006, but the file doesn't exist) which verify-links will fail, and numerous new domain terms missing from the cSpell dictionaries (with no per-file <!-- cspell:ignore --> comment that sibling specs use) which the ContinueOnError:false spell check will fail.
Resolve the 9 review findings on PR #4651:
- #1 CI: drop broken NATIVE_WRAPPER_SPEC.md links (lives in #4461); add cSpell terms (cbindgen, Authenticode, ESRP, manylinux, etc.)
- #2 ADR 0001/0009: require RID-keyed hand-off + both .so/.a + C-only header
- #3 ADR 0009: enforce build-once via checksum+signature verification
- #4 ADR 0005: u32 is a monotonic ABI revision; add MaxSupported (reject too-new)
- #5 promote Q5 to new ADR 0010 (single version fan-out)
- #6 ADR 0000 template: Status under the title
- #7 reciprocal Pairs-with note in ADR 0001
- #8 ADR 0008: name glibc floor 2.17 (manylinux2014)
- #9 ADR 0007: silent-fallback for benign causes vs fail-loud for integrity/ABI
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
|
|
||
| ## Alternatives considered | ||
| - Single "fat" NativeAssets package — kept only as an **interim** (Phase 1–2) for speed, not GA. | ||
| - Embed natives directly in `Microsoft.Azure.Cosmos` — rejected: bloats the flagship for pure-managed users. |
There won't be any pure-managed users. I'm not pushing back against the decision, but against this false argument: .NET vNext will require a native package.
|
|
||
| ## Decision | ||
| - The native library exports **`cosmos_abi_version() -> u32`** — a **monotonically increasing integer ABI revision, not a packed SemVer** — and ships the same value in the hand-off's `ABI_VERSION`. | ||
| - Every language host reads it **at load, before the first real call**, and accepts only the closed range **`[MinSupported, MaxSupported]`** it was built against: it fails fast with a versioned message when the native revision is **below `MinSupported` (too old)** *or* **above `MaxSupported` (too new / unknown)**, so a newer native is never silently accepted. |
I assume MinSupported == MaxSupported always; anything else results in an unreasonable test matrix. Why not just close the range and make the contract clear: exactly one version is targeted?
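A minimal sketch of the host-side check under discussion, written in Rust for illustration only. The names `AbiError`, `check_abi`, and `check_abi_exact` are hypothetical and not part of the ADR; in a real host the `native` value would come from calling the exported `cosmos_abi_version()` at load time. The second function shows the single-revision contract suggested above (Min == Max) as a special case of the closed range.

```rust
/// Hypothetical error type for the load-time ABI handshake.
#[derive(Debug, PartialEq, Eq)]
pub enum AbiError {
    /// Native revision is below the lowest revision the host was built against.
    TooOld { native: u32, min: u32 },
    /// Native revision is above the highest revision the host knows about.
    TooNew { native: u32, max: u32 },
}

/// Accept `native` only inside the closed range [min, max].
/// `native` is assumed to be the value returned by `cosmos_abi_version()`.
pub fn check_abi(native: u32, min: u32, max: u32) -> Result<(), AbiError> {
    if native < min {
        Err(AbiError::TooOld { native, min })
    } else if native > max {
        Err(AbiError::TooNew { native, max })
    } else {
        Ok(())
    }
}

/// Single-revision contract: the host targets exactly one ABI revision.
pub fn check_abi_exact(native: u32, expected: u32) -> Result<(), AbiError> {
    check_abi(native, expected, expected)
}
```

With Min == Max the host has exactly one native revision to test against, which is the point of the question above.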
| **Status:** Accepted (proposed for review) | ||
|
|
||
| ## Context | ||
| `Microsoft.Azure.Cosmos` (pure managed) and the Go SDK run today on any platform with no native dependency. Introducing a platform-specific native library must not silently break that portability or force a migration on a version bump. |
SDKs will ALWAYS have to use a new major version when switching to the Rust driver. At that point it does not make sense to keep native and managed transport in parallel: new major version --> Rust native transport only. So this is not opt-in, for sure.
Yes, I also disagree with this ADR. It's true that we might face pushback in Go, and lose some scenarios in JS by relying on the driver. At this point, I think those are worth it. If we need a pure go/pure js package at some point, I think it would be clearer/easier that they be completely different packages advertised to have limited functionality, instead of trying to offer fallback in the main SDK.
|
|
||
| ## Decision | ||
| - A given native release (one build + signing, ADR 0001) is **fanned out to all language feeds simultaneously** from that single hand-off — .NET, Go, and (later) Java publish the *same* native version together. | ||
| - Each language SDK **pins a compatible native range** and may cut its *own* managed/SDK releases independently between native releases; what it must not do is silently float onto a *different* native build. |
No range; this should be a single native version, otherwise testing becomes unreasonable.
|
|
||
| > **Status:** Draft for review · **Author:** Cosmos Rust SDK team · **Type:** Design discussion (decisions are recorded separately as ADRs — see [`adr/`](adr/0000-index.md)) | ||
| > | ||
| > This is the *discussion* doc: context, alternatives, and the "why". The **decisions** live as short, immutable records in [`adr/`](adr/0000-index.md). If the two disagree, the ADR wins. |
I think this doc is a duplicate? Either keep the ADRs (I like them) or this doc, not both.
I think the idea is that this could keep more of the thinking that would make the ADRs too verbose, but in this case I agree that this doesn't seem to add much value.
FabianMeiswinkel left a comment
Rust/Native as opt-in is a no-go in my mind.
NaluTripician left a comment
Design review — native driver distribution
Read the whole set (design doc + ADRs 0000–0010). Overall this is a strong, well-structured design and the ADR hygiene is clean. I spot-checked the external references and they hold up: the SkiaSharp / Microsoft.Data.SqlClient.SNI per-RID runtimes/<rid>/native/ + runtime.json meta-package pattern, manylinux2014 ⇔ glibc 2.17, the cgo CFLAGS -I / LDFLAGS -L -l mechanics, cbindgen C-only header, and crate-type = ["cdylib","staticlib"] are all described correctly.
A few things I'd want resolved before these move from "proposed" to "accepted". Inline comments cover the independent ones; the three below overlap with Fabian's open CHANGES_REQUESTED, so I'm reinforcing rather than restating.
Blocking
- Opt-in / "pure-managed users" premise (ADR 0007 + G4, ADR 0003). If the next .NET major requires the native package (as Fabian notes), then "never force-migrated" / "opt-in" mis-frames the rollout, and ADR 0003's rejected-alternative argument ("bloats the flagship for pure-managed users") rests on a user class that won't exist. Independently: a NuGet `<PackageReference>` is not "opt-in"; it always restores transitively. Please state the actual mechanism (flagship has no dep + consumer adds the meta-package, OR hard dep that's inert behind a client flag) and reconcile 0003/0007 with "mandatory at next major."
- Linux has no signature for the integrity gate (ADR 0009 / 0007). See inline.
- ABI range vs single-version lockstep (ADR 0005 `[Min,Max]` vs ADR 0010 "never forked"). These two read as contradictory; if fan-out is always one coordinated version, the acceptance range only grows the test matrix. Collapse to a single targeted revision (Min==Max) or explain how a range coexists with 0010.
Important — see inline on ADR 0004 (public-Go delivery + static link flags), ADR 0010 (atomic fan-out on immutable feeds), and distribution-design.md §8 (now stale vs ADR 0009).
Also: design-doc §10 ("fallback on load failure") contradicts ADR 0007's fail-loud-on-integrity rule — §10 should mirror the benign-vs-integrity split. And the heavy §5–§11 restatement of the ADRs is what's producing these drifts (the §8 one is a concrete example); consider trimming the design body to discussion/alternatives and linking out for decisions.
Nits: status convention split (0010 "Proposed" while dependents are "Accepted"); the "Pairs with" note is on 0001/0002 only — either add it to the template or to the other coupled pairs; dangling NATIVE_WRAPPER_SPEC.md refs (add "forthcoming, #4461") and a pointer to the cosmos_free-style dealloc export behind ADR 0006's "frees the native buffer"; "download only your RID" is unconditional-sounding but differs for framework-dependent builds.
Nothing here is a showstopper for the idea — it's a genuinely good design. These are about closing the seams before downstream teams build against it.
| The binary is built in `azure-sdk-for-rust`, but the language packages live in other repos and publish to other feeds. The bytes that define the ABI must be signed at their source, and per-language packaging must not be able to alter, re-sign, or independently rebuild them — otherwise the languages drift onto different driver builds (the failure ADR 0001 exists to prevent). | ||
|
|
||
| ## Decision | ||
| - A single pipeline next to the Rust build produces the per-platform binaries (ADR 0001), **signs each binary once** (Authenticode on Windows; codesign + notarization on macOS), checksums them, and publishes the internal hand-off artifact. |
The signing step names Authenticode (Windows) and codesign + notarization (macOS), but nothing for the three Linux RIDs (linux-x64, linux-musl-x64, linux-arm64). That's a gap, because both this ADR and ADR 0007 make "verify signatures / fail-loud on signature failure" load-bearing — and Linux .so/.a have no platform code-signing equivalent. A checksum proves integrity but not authenticity/provenance (it's computed over whatever bytes you have). Could we name an explicit Linux authenticity mechanism — detached signatures (cosign/Sigstore/GPG), or signing + attesting the Universal Package hand-off itself as the trust root — and soften "verifies signatures" to "verifies signatures (where applicable) + provenance attestation" so the gate is actually implementable on Linux?
| ## Decision | ||
| - Go consumes the prebuilt **`include/` header and `lib/` library via cgo** (`#cgo CFLAGS -I…` to parse the header into `C.*` symbols; `#cgo LDFLAGS -L… -lazurecosmosdriver` to link), **not NuGet**. | ||
| - Prefer the **static `.a`** for a self-contained Go binary; dynamic linking is supported as an option. | ||
| - The header + lib are delivered through the **azure-sdk-for-go feed** — an Azure Artifacts Universal Package fetched at build, or a vendored "binaries" Go module with per-OS build tags (delivery shape is open Q3). Either way it derives from the ADR 0001 hand-off artifact. |
Both delivery options here are problematic for public Go consumers. Azure Artifacts Universal Packages are internal-only, so an external go get user can't fetch them — that option only really covers the internal/dogfood phases. And a vendored "binaries" Go module means committing ~7 platforms of multi-MB native libs into a module the proxy pulls wholesale on every go get; that's exactly why the canonical cgo libraries (e.g. go-sqlite3) ship C source rather than prebuilt binaries. Related: Go has no package "feed" — modules resolve from VCS via the module proxy, so describing azure-sdk-for-go as a "feed" (also ADR 0002 / §6.2) is misleading. Can we add a concrete public-Go answer (GitHub Releases + a small fetch shim, or build-tag selection of a per-RID .a)? This feels bigger than Q3's internal-only "Universal Package vs vendored module" framing.
|
|
||
| ## Consequences | ||
| - Go reuses the exact same signed binaries as .NET — no Go-specific build of the driver. | ||
| - cgo + static lib means `CGO_ENABLED=1` and a C toolchain on the Go build host; cross-compilation needs a cross C toolchain. |
One -lazurecosmosdriver won't be enough to statically link a Rust staticlib through cgo. A Rust staticlib isn't self-contained at link time — it pulls in the Rust std/panic/unwind runtime plus system libs (typically -lpthread -ldl -lm, often -lgcc_s/-lunwind), and static-cgo on musl is notoriously fiddly. Since linux-musl-x64 is in the GA matrix and this ADR prefers the static .a by default, worth noting the per-OS LDFLAGS must include the transitive system libs and flagging musl-static-cgo as a known-hard case (it bears on whether "static by default" is realistic on every matrix row — Q4).
|
|
||
| ## Consequences | ||
| - No cross-language version skew: a fix in the native driver lands everywhere in one coordinated fan-out, not at N independent times. | ||
| - The fan-out pipeline (ADR 0009) publishes a release as an all-or-nothing set; a partial fan-out (some feeds updated, some not) is an explicit failure state to guard. |
"All-or-nothing" isn't physically achievable on nuget.org or Maven Central — neither supports unpublish/retraction. Once a NativeAssets package is live on nuget.org, a later Go-publish failure can't be rolled back to restore atomicity; the only remedy is roll-forward (publish the next patch everywhere). Suggest replacing the implied transactional rollback with the achievable model: publish to internal/staging feeds atomically, gate the public push on all-green, and on a partial public failure roll forward, never retract — and say so, so implementers don't design for an impossible rollback.
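The achievable model described in this comment can be sketched as a small decision function. This is an illustration in Rust under stated assumptions, not the pipeline's actual logic: `NextStep` and `next_step` are hypothetical names, each `bool` stands for one feed's publish result, and `public_ok` is `None` until the public push has been attempted.

```rust
/// Hypothetical outcome of one fan-out evaluation.
#[derive(Debug, PartialEq, Eq)]
pub enum NextStep {
    /// Staging is incomplete, so nothing public has been touched yet.
    Abort,
    /// Every staging feed is green, so the public push may start.
    PushPublic,
    /// Every public feed accepted the release.
    Done,
    /// Some public feeds accepted the release and some did not:
    /// publish the next patch everywhere instead of retracting.
    RollForward,
}

/// Decide the next step from per-feed results.
pub fn next_step(staging_ok: &[bool], public_ok: Option<&[bool]>) -> NextStep {
    // Gate: the public push is only allowed when all staging feeds succeeded.
    if !staging_ok.iter().all(|ok| *ok) {
        return NextStep::Abort;
    }
    match public_ok {
        None => NextStep::PushPublic,
        Some(results) if results.iter().all(|ok| *ok) => NextStep::Done,
        // Public feeds cannot be unpublished, so there is no rollback arm.
        Some(_) => NextStep::RollForward,
    }
}
```

There is deliberately no "retract" variant: once any public feed has the release, the only recovery modeled here is roll-forward.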
|
|
||
| ## 8. Build, signing, and CI (ADR 0009) | ||
|
|
||
| One pipeline next to the Rust build produces the per-platform binaries, **signs each binary once** (Authenticode / codesign + notarize), checksums them, and publishes the **internal hand-off artifact**. Per-language publish jobs consume that artifact and emit NuGet / Go-consumable / JAR packages, signing only their **package wrapper** in that language's existing ESRP flow — they **never rebuild or re-sign the native binary**. Build-once is enforced by discipline: all language jobs consume one hand-off from one Rust build. Supply chain: SBOM / component governance for the Rust crate graph is new surface for the consuming orgs — owner TBD (Q7). |
This says build-once is "enforced by discipline," but ADR 0009 (the authoritative record) upgraded this to "enforced by verification, not just discipline" with a checksum/signature gate that fails the publish on mismatch. By this doc's own "if the two disagree, the ADR wins" rule, §8 is now stale and understates the actual mechanism. Worth syncing — and a good example of why the §5–§11 restatements risk drifting from the ADRs.
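For reference, "enforced by verification" could look roughly like the gate below. It is a sketch in Rust under stated assumptions: `verify_handoff` is a hypothetical name, the manifest is modeled as a map from file path to hex digest, and computing the digests (and verifying any signature over the manifest itself) is assumed to happen elsewhere. Comparing digests only shows the bytes match the manifest; authenticity still depends on the manifest being trusted.

```rust
use std::collections::BTreeMap;

/// Compare the digests recorded in the hand-off manifest with the digests a
/// publish job observed for the files it is about to package.
/// Returns every problem found so the job can fail with a complete report.
pub fn verify_handoff(
    manifest: &BTreeMap<String, String>,
    observed: &BTreeMap<String, String>,
) -> Result<(), Vec<String>> {
    let mut problems = Vec::new();
    // Every file in the hand-off must be present and byte-identical.
    for (path, expected) in manifest {
        match observed.get(path) {
            Some(actual) if actual.eq_ignore_ascii_case(expected) => {}
            Some(_) => problems.push(format!("digest mismatch: {}", path)),
            None => problems.push(format!("missing from package: {}", path)),
        }
    }
    // Nothing may be packaged that did not come from the hand-off.
    for path in observed.keys() {
        if !manifest.contains_key(path) {
            problems.push(format!("not in hand-off: {}", path));
        }
    }
    if problems.is_empty() {
        Ok(())
    } else {
        Err(problems)
    }
}
```

A publish job would call this before packaging and fail the publish on any `Err`, which is the difference between verification and discipline.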
| The native driver must reach .NET, Go, and later Java, each with a different package format and its own feed. Two concerns are easily conflated: **provenance** (there must be exactly one build + binary-signing per release, or the languages drift onto different driver builds) and **distribution** (how a consumer pulls the bytes). A neutral consumer-facing bundle would force a Go user to download a package containing DLLs/JARs they cannot use, and would require standing up new consumer feed infrastructure. | ||
|
|
||
| ## Decision | ||
| - For each release, **one** Rust build produces all platform binaries (cdylib + staticlib), the cbindgen C header, an `ABI_VERSION`, and checksums, and **signs the binaries**. |
Maybe semantics, but this might actually be multiple "builds/pipelines" because (for example) I don't think we can cross-compile macOS binaries from other platforms. That said, I do think we should have a "join" task that does signing and collects all the artifacts for a given release. It just might be the case that it has to collect earlier-stage artifacts from other "builds".
| - Ship per-RID **`Microsoft.Azure.Cosmos.NativeAssets.<rid>`** packages, each carrying one platform's dynamic lib under `runtimes/<rid>/native/`. | ||
| - Front them with a thin **meta-package** whose `runtime.json` resolves the consumer's RID to the right per-RID package. |
We should run this by the .NET/NuGet teams and see if there are any unexpected consequences to this choice.
| ## Decision | ||
| - Ship per-RID **`Microsoft.Azure.Cosmos.NativeAssets.<rid>`** packages, each carrying one platform's dynamic lib under `runtimes/<rid>/native/`. | ||
| - Front them with a thin **meta-package** whose `runtime.json` resolves the consumer's RID to the right per-RID package. | ||
| - `Microsoft.Azure.Cosmos` takes an **opt-in** dependency on the meta-package (ADR 0007). |
As @FabianMeiswinkel says, we shouldn't have opt-in.
|
|
||
| ## Decision | ||
| - The ABI stays **bytes-in / bytes-out**; the wrapper does no JSON parsing. | ||
| - Each language binding **owns its own marshalling** (string encoding, structs) and **copies response buffers out of native memory** into host memory, then frees the native buffer. |
I think the open question here is if we decide to do string->binary encoding in the driver someday, it would require a documented string encoding, presumably UTF-8. Practically, I think that's what we want anyways. We don't really want .NET to be passing in 16-bit char values and having rust put those on the wire anyway, so while I agree with bytes in/bytes out, I think it should maybe be explicitly UTF-8 bytes in/out.
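A rough sketch of the copy-out contract with explicit UTF-8 bytes, in Rust for illustration. Everything named here (`ByteBuf`, `into_byte_buf`, `example_free`, `copy_out`) is hypothetical and not taken from the ADR or the wrapper spec; the sketch only shows the ownership hand-off: the native side gives out a pointer and length, the host copies the bytes into its own memory, then calls the dealloc export exactly once.

```rust
/// Hypothetical C-ABI view of a response buffer owned by the native driver.
#[repr(C)]
pub struct ByteBuf {
    pub ptr: *mut u8,
    pub len: usize,
}

/// Native side (illustrative): hand ownership of the bytes to the caller.
pub fn into_byte_buf(bytes: Vec<u8>) -> ByteBuf {
    let boxed: Box<[u8]> = bytes.into_boxed_slice();
    let len = boxed.len();
    let ptr = Box::into_raw(boxed) as *mut u8;
    ByteBuf { ptr, len }
}

/// Native side (illustrative): dealloc export the host calls exactly once.
/// Safety: `buf` must come from `into_byte_buf` and must not be used again.
pub unsafe extern "C" fn example_free(buf: ByteBuf) {
    if !buf.ptr.is_null() {
        // Rebuild the boxed slice so Rust frees it with the right layout.
        unsafe {
            drop(Box::from_raw(std::ptr::slice_from_raw_parts_mut(buf.ptr, buf.len)));
        }
    }
}

/// Host side: copy the bytes into host-owned memory, then free the native buffer.
pub fn copy_out(buf: ByteBuf) -> Vec<u8> {
    let copy = if buf.ptr.is_null() {
        Vec::new()
    } else {
        // The native buffer is only borrowed for the duration of the copy.
        unsafe { std::slice::from_raw_parts(buf.ptr, buf.len).to_vec() }
    };
    unsafe { example_free(buf) };
    copy
}
```

If the contract is "UTF-8 bytes in/out", a host such as .NET would decode the copied bytes as UTF-8 after `copy_out` returns and would encode to UTF-8 before passing strings in, so no 16-bit `char` data crosses the boundary.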
| A native library must be built per platform (OS + architecture + libc). The support surface must be bounded and explicit so build, signing, and testing are tractable, and so consumers get a clear answer on an unsupported platform. | ||
|
|
||
| ## Decision | ||
| - The GA matrix is: `win-x64`, `win-arm64`, `linux-x64` (glibc, **floor 2.17 — the manylinux2014 baseline**), `linux-musl-x64`, `linux-arm64`, `osx-x64`, `osx-arm64`. |
I believe we need win-x86, since Azure App Service defaults to x86 for Windows App Service instances.
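To make the gap concrete, here is a small Rust sketch of the matrix as quoted in the Decision line above. `Libc` and `rid_for` are hypothetical names, and the `os`/`arch` strings are assumed to follow the values of `std::env::consts::OS` and `std::env::consts::ARCH`. With the matrix as written, 32-bit Windows falls through to the error arm.

```rust
/// C library flavor on Linux; ignored on other operating systems.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Libc {
    Glibc,
    Musl,
}

/// Map an (os, arch, libc) triple to a RID in the proposed GA matrix,
/// or return a clear error for targets outside it.
pub fn rid_for(os: &str, arch: &str, libc: Libc) -> Result<&'static str, String> {
    match (os, arch, libc) {
        ("windows", "x86_64", _) => Ok("win-x64"),
        ("windows", "aarch64", _) => Ok("win-arm64"),
        ("linux", "x86_64", Libc::Glibc) => Ok("linux-x64"),
        ("linux", "x86_64", Libc::Musl) => Ok("linux-musl-x64"),
        ("linux", "aarch64", Libc::Glibc) => Ok("linux-arm64"),
        ("macos", "x86_64", _) => Ok("osx-x64"),
        ("macos", "aarch64", _) => Ok("osx-arm64"),
        // Anything else, including 32-bit Windows, is outside the matrix.
        _ => Err(format!(
            "no native Cosmos driver for {}/{} ({:?})",
            os, arch, libc
        )),
    }
}
```

Adding `win-x86` would be one more arm here plus one more build, signing, and test row in the pipeline.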
|
|
||
| ## Alternatives considered | ||
| - Build only the most common platforms and let others fail at link/load — rejected: poor experience, no clear message. | ||
| - Include `win-x86` / mobile now — deferred (open Q): no demand yet. |
See above - there is a very clear scenario where we will need this.
heaths left a comment
Keep changes to sdk/cosmos and you won't need my sign-off.
Put all these in sdk/cosmos/cspell.json and you won't need core sign-off. Granted, some of these are generic, but we can handle that separately.
a6ee645 to 7e72641
What this is
A design doc + ADRs (documentation only; no code or pipeline changes) proposing how the Rust-built native Cosmos driver (`azure_data_cosmos_driver_native` → `azurecosmosdriver.{dll,so,dylib}` plus the cbindgen C header) is distributed to the language SDKs (.NET first, Go near-term, Java later).
The design doc carries the discussion: context, alternatives, and the "why". The ADRs record each decision in short, numbered, immutable form (Context · Decision · Consequences · Alternatives · Status). If the two ever disagree, the ADR wins.
These are proposed for design review — nothing here changes build or release behavior yet.
TL;DR of the design
- `NativeAssets` packages + a thin meta-package (the SkiaSharp / SqlClient.SNI model); download only your RID.
- `cosmos_abi_version()`; every host checks it before first use and fails fast on mismatch instead of corrupting memory.
Contents
- `docs/native-distribution/distribution-design.md`: the discussion doc (purpose, goals/constraints, the model, per-consumer link model, platform matrix, rollout phases, and open questions).
- `docs/native-distribution/adr/0000-index.md` and `0001–0009`: one decision per record.
Status & open questions
Status is provisional ("proposed for review"). The main open questions the doc captures for reviewers: the internal hand-off shape (Universal Package vs pipeline artifact), packaging-pipeline ownership and its security boundary, the Go delivery model, version-mapping across the three feeds, and SBOM / component-governance ownership for the Rust crate graph.