
Commit 89bdc4d

nokome and rowanc1 authored
🦋 AT Protocol Lexicon (#41)
Co-authored-by: Rowan Cockett <rowanc1@gmail.com>
1 parent 24961db commit 89bdc4d

18 files changed

Lines changed: 2845 additions & 9 deletions

Lines changed: 5 additions & 0 deletions
@@ -0,0 +1,5 @@
---
"@oxa/core": patch
---

Add atproto conversion scripts

‎README.md‎

Lines changed: 18 additions & 0 deletions
@@ -196,6 +196,24 @@ The schema can be downloaded from oxa.dev.

https://oxa.dev/schemas/0.0.1.json

## ATProto conversion

The `@oxa/core` CLI can also convert OXA documents to the ATProto JSON shape used by
the OXA lexicon.

```bash
oxa convert --to atproto --created-at 2026-03-22T00:00:00.000Z examples/document.yaml
cat examples/document.yaml | oxa convert --to atproto --yaml --created-at 2026-03-22T00:00:00.000Z -
```

The conversion API is also exported from the package entrypoint:

```ts
import { flattenInlines, mapBlock, oxaToAtproto } from "@oxa/core";
```

The generated ATProto records use the `pub.oxa.document.*` lexicon namespace.

## Licensing and Attribution

Each document can include licensing and rights metadata:

‎docs/atproto-lexicon.md‎

Lines changed: 268 additions & 0 deletions
@@ -0,0 +1,268 @@
---
title: AT Protocol Lexicon
---

OXA defines an [AT Protocol (atproto)](https://atproto.com) [Lexicon](https://atproto.com/guides/lexicon) for publishing scientific documents to the [Atmosphere](https://atproto.com/guides/understanding-atproto).

The lexicon lives under the `pub.oxa.*` namespace and enables OXA documents to be stored as records in any AT Protocol [Personal Data Server (PDS)](https://atproto.com/guides/the-at-stack), making scientific content natively available alongside social interactions, feeds, and moderation infrastructure.

## Why an AT Protocol lexicon?

Scientific publishing today relies on centralized platforms. Researchers upload papers to a service, and that service controls access, discovery, and permanence. AT Protocol offers a different model:

- **User-owned data.** Documents live in the author's signed data repository and can be migrated between hosts.
- **Decentralized discovery.** Any indexer can crawl the network's [firehose](https://atproto.com/specs/sync) to discover and aggregate scientific content.
- **Interoperability by default.** The Lexicon type system gives every consumer the same schema, so tools can read, validate, and render documents without out-of-band agreements.
- **Built-in identity.** Authors are identified by [DIDs](https://atproto.com/specs/did) and [handles](https://atproto.com/specs/handle), providing a ready-made, portable identity layer.

By defining a lexicon, OXA documents become first-class objects on the AT Protocol network, subject to the same sync, auth, and moderation primitives as any other record type.

## Lexicon structure

The OXA lexicon is organized into three lexicon files across two namespaces (`pub.oxa.document` and `pub.oxa.richtext`):

| File                             | NSID                        | Purpose                                                                                 |
| -------------------------------- | --------------------------- | --------------------------------------------------------------------------------------- |
| `lexicon/document/document.json` | `pub.oxa.document.document` | The `Document` record type, the root object stored in a PDS                             |
| `lexicon/document/defs.json`     | `pub.oxa.document.defs`     | Block-level type definitions (`paragraph`, `heading`, `richText`) and the `block` union |
| `lexicon/richtext/facet.json`    | `pub.oxa.richtext.facet`    | Facet annotations for inline formatting (`emphasis`, `strong`, `byteSlice`)             |

A `Document` record contains an array of `children` (blocks). Each block carries a `text` string and an optional `facets` array that annotates ranges of that text with formatting features.
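
To make the record shape concrete, the sketch below models it as TypeScript types with a small structural check. The names `ByteSlice`, `OxaFacet`, `OxaBlock`, `OxaDocumentRecord`, and `isBlockLike` are illustrative assumptions, not exports of `@oxa/core`; the field names follow the description above and the JSON examples later in this document.

```typescript
// Hypothetical type sketch; these names are NOT exported by @oxa/core.
interface ByteSlice {
  byteStart: number; // offset into the UTF-8 bytes of `text` where the range starts
  byteEnd: number; // offset where the range ends (exclusive in this document's examples)
}

interface FacetFeature {
  $type: string; // e.g. "pub.oxa.richtext.facet#emphasis"
  [key: string]: unknown; // feature-specific fields
}

interface OxaFacet {
  index: ByteSlice;
  features: FacetFeature[];
}

interface OxaBlock {
  $type: string; // e.g. "pub.oxa.document.defs#paragraph"
  text: string;
  facets?: OxaFacet[];
  [key: string]: unknown; // block-specific fields such as `level` on a heading
}

interface OxaDocumentRecord {
  $type: "pub.oxa.document.document";
  children: OxaBlock[];
  createdAt: string; // datetime string
}

// Minimal structural check: a block needs a `$type`, a `text` string,
// and (if present) an array of facets. This is not full lexicon validation.
function isBlockLike(value: unknown): value is OxaBlock {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  return (
    typeof v.$type === "string" &&
    typeof v.text === "string" &&
    (v.facets === undefined || Array.isArray(v.facets))
  );
}
```

The check is deliberately loose; a real consumer would more likely validate records against the generated lexicon JSON.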

## Following the AT Protocol style guide

The lexicon follows the conventions described in the [Lexicon Style Guide](https://atproto.com/guides/lexicon-style-guide#design-patterns):

- **Rich text via facets.** Instead of embedding markup in strings, inline formatting is represented as byte-range annotations, the same pattern established by [`app.bsky.richtext.facet`](https://docs.bsky.app/docs/advanced-guides/post-richtext). This keeps text plain and makes it safe to render even if a consumer doesn't understand a particular facet type. See [why `pub.oxa.richtext.facet`?](#why-puboxa-not-appbsky) below for why OXA defines its own facet lexicon rather than reusing Bluesky's.
- **Open unions.** The `block` union and the facet `features` union are both declared with `"closed": false`, allowing future extension without breaking existing consumers.
- **Minimal required fields.** Only fields that are truly necessary for functionality are marked `required` (e.g. `children` and `createdAt` on a document, `level` on a heading). This keeps the schema flexible for evolution.
- **Singular nouns for record schemas.** The record type is named `document` (not `documents`), following the convention for record schemas.
- **Reusable definitions.** Shared types like `richText` and `byteSlice` are defined once and referenced across lexicons, following the pattern of `defs` files.
- **`createdAt` timestamp.** The document record includes a `createdAt` field (datetime string), which is standard practice for ATProto records.
- **`$type` discriminators.** Every block and facet feature carries a `$type` string so consumers can identify types in the open union without ambiguity.
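
As a rough illustration of how a consumer might use the `$type` discriminator with an open union, the sketch below dispatches on known block types and degrades to plain text for anything else. `renderBlock` and `escapeHtml` are hypothetical helpers, not part of `@oxa/core`, and the plain-text fallback for unknown blocks is one possible policy rather than documented behavior.

```typescript
// Hypothetical consumer; `renderBlock` is NOT part of @oxa/core.
type Block = { $type: string; text?: string; level?: number };

function escapeHtml(s: string): string {
  return s.replace(/&/g, "&amp;").replace(/</g, "&lt;").replace(/>/g, "&gt;");
}

function renderBlock(block: Block): string {
  const text = escapeHtml(block.text ?? "");
  switch (block.$type) {
    case "pub.oxa.document.defs#heading": {
      // Clamp to the heading levels HTML supports
      const level = Math.min(Math.max(block.level ?? 1, 1), 6);
      return `<h${level}>${text}</h${level}>`;
    }
    case "pub.oxa.document.defs#paragraph":
      return `<p>${text}</p>`;
    default:
      // Unknown member of the open union: show its plain text if it has any
      return text === "" ? "" : `<div>${text}</div>`;
  }
}
```

Because the union is open, the `default` branch is the path that keeps older consumers working when new block types appear.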

(why-puboxa-not-appbsky)=

## Why `pub.oxa.richtext.facet`?

OXA defines its own facet lexicon (`pub.oxa.richtext.facet`) rather than reusing Bluesky's `app.bsky.richtext.facet`. This is a deliberate choice driven by the different domains the two lexicons serve.

Bluesky's facet features are designed for social microblogging: its feature union contains `mention` (account references), `link` (URLs), and `tag` (hashtags). These are not the annotations scientific documents need. OXA documents require typographic and semantic formatting: `emphasis`, `strong`, and in the future inline types like `subscript`, `superscript`, `inlineMath`, `inlineCode`, `link`, `cite`, and others defined in the OXA schema.

There are also structural differences. The `app.bsky.richtext.facet` features union is closed, meaning third parties cannot extend it without modifying the original lexicon. The `pub.oxa.richtext.facet` features union is declared with `"closed": false`, following the style guide's recommendation for extensibility. This allows the OXA facet feature set to grow as the schema grows. Because the lexicon is generated from the OXA schema, new inline types become facet features automatically.

Finally, the `app.bsky` namespace is owned by Bluesky PBC. Extending it with document-formatting features would conflate social and scientific concerns in a namespace OXA does not control.

### Compatible features from other namespaces

Where an OXA facet feature is semantically equivalent to a feature in another AT Protocol namespace, the converter emits both features in the same facet's `features` array. This gives consumers that understand the other namespace free interoperability without OXA depending on that namespace for its core schema.

For example, when `Link` is added to the OXA schema, a link facet will carry both the OXA feature and Bluesky's `app.bsky.richtext.facet#link`:

```json
{
  "index": { "byteStart": 10, "byteEnd": 20 },
  "features": [
    { "$type": "pub.oxa.richtext.facet#link", "uri": "https://example.com" },
    { "$type": "app.bsky.richtext.facet#link", "uri": "https://example.com" }
  ]
}
```

This works because AT Protocol facets support multiple features per byte range, and consumers ignore feature types they don't recognize. A Bluesky client rendering an OXA document record will make links clickable even though it doesn't understand `pub.oxa.richtext.facet#emphasis`.

The mapping is maintained in the `compatibleFeatures` export from `@oxa/core`. It is a record keyed by OXA facet feature `$type`, where each value is an array of functions that produce a compatible feature object (or `null` to skip). This design is not Bluesky-specific; any AT Protocol namespace can be added to the map.
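
The following is a minimal sketch of how a map with that shape could be applied when building a facet's `features` array. It uses a local `compat` record rather than the real `compatibleFeatures` export, and the `Feature`, `InlineNode`, and `withCompatibleFeatures` names are assumptions made for illustration; the actual types and call sites in `@oxa/core` may differ.

```typescript
// Hypothetical shapes; the real types in @oxa/core may differ.
type Feature = { $type: string; [key: string]: unknown };
type InlineNode = { type: string; [key: string]: unknown };
type CompatFn = (node: InlineNode) => Feature | null;

// Local stand-in for a map keyed by OXA facet feature `$type`
const compat: Record<string, CompatFn[]> = {
  "pub.oxa.richtext.facet#link": [
    (node) =>
      typeof node.uri === "string"
        ? { $type: "app.bsky.richtext.facet#link", uri: node.uri }
        : null, // returning null skips the compatible feature
  ],
};

// Returns the OXA feature followed by any compatible features for the node
function withCompatibleFeatures(primary: Feature, node: InlineNode): Feature[] {
  const extras = (compat[primary.$type] ?? [])
    .map((fn) => fn(node))
    .filter((f): f is Feature => f !== null);
  return [primary, ...extras];
}
```

Keeping the OXA feature first means consumers that only read the first recognized feature still see the native type.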

## Flattening inlines into facets

The most significant transformation between an OXA document and its lexicon representation is how inline content is handled.

### OXA's tree model

In the OXA schema, inline content is a recursive tree. A `Paragraph` has `children` that can be `Text`, `Emphasis`, `Strong`, or other inline types, and formatting nodes themselves contain `children`:

```yaml
type: Paragraph
children:
  - type: Text
    value: "This is "
  - type: Strong
    children:
      - type: Text
        value: "bold and "
      - type: Emphasis
        children:
          - type: Text
            value: "italic"
  - type: Text
    value: " text."
```

### AT Protocol's flat model

AT Protocol [uses facets instead of a tree](https://www.pfrazee.com/blog/why-facets). The text is stored as a single plain string, and formatting is described by byte-range annotations:

```json
{
  "$type": "pub.oxa.document.defs#paragraph",
  "text": "This is bold and italic text.",
  "facets": [
    {
      "index": { "byteStart": 8, "byteEnd": 23 },
      "features": [{ "$type": "pub.oxa.richtext.facet#strong" }]
    },
    {
      "index": { "byteStart": 17, "byteEnd": 23 },
      "features": [{ "$type": "pub.oxa.richtext.facet#emphasis" }]
    }
  ]
}
```

The conversion walks the inline tree depth-first, concatenating all `Text` node values into a single string. Each formatting node records the byte offset before and after its children are processed, producing a `byteStart`/`byteEnd` pair. Indices count bytes of the UTF-8 encoded text (not characters), matching the AT Protocol convention.

This design has several advantages in a decentralized setting:

- **Safe rendering.** A consumer that doesn't recognize a facet type can still display the plain text.
- **Simple validation.** Facets are flat: there is no recursive nesting to validate.
- **Extensibility.** New facet feature types (links, mentions, math, etc.) can be added to the open union without changing the text representation.
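
The depth-first walk described above can be sketched in a few lines of TypeScript. This is an illustrative re-implementation under stated assumptions, not the `flattenInlines` function exported by `@oxa/core`, whose signature and behavior may differ. The `flatten` name, the `InlineNode` shape, and the rule that a node type such as `Strong` maps to the feature name `strong` by lower-casing its first letter are all assumptions.

```typescript
// Hypothetical sketch; NOT the actual `flattenInlines` from @oxa/core.
interface InlineNode {
  type: string;
  value?: string; // present on Text nodes
  children?: InlineNode[]; // present on formatting nodes
}

interface Facet {
  index: { byteStart: number; byteEnd: number };
  features: { $type: string }[];
}

function flatten(nodes: InlineNode[]): { text: string; facets: Facet[] } {
  const encoder = new TextEncoder();
  let text = "";
  let bytes = 0; // running length of `text` in UTF-8 bytes
  const facets: Facet[] = [];

  const visit = (node: InlineNode): void => {
    if (node.type === "Text") {
      const value = node.value ?? "";
      text += value;
      bytes += encoder.encode(value).length;
      return;
    }
    // Assumed naming rule: "Strong" -> "strong", "InlineMath" -> "inlineMath"
    const name = node.type.charAt(0).toLowerCase() + node.type.slice(1);
    const facet: Facet = {
      index: { byteStart: bytes, byteEnd: bytes },
      features: [{ $type: `pub.oxa.richtext.facet#${name}` }],
    };
    facets.push(facet); // pushed before the children so outer facets come first
    for (const child of node.children ?? []) visit(child);
    facet.index.byteEnd = bytes; // offset after all children were appended
  };

  nodes.forEach(visit);
  return { text, facets };
}
```

Applied to the `children` of the paragraph in the tree example, this sketch yields the text `This is bold and italic text.` with a `strong` facet over bytes 8 to 23 and an `emphasis` facet over bytes 17 to 23, matching the flat example. Counting with `TextEncoder` rather than `string.length` is what keeps offsets correct for non-ASCII text.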

## Converting documents with the CLI

The `oxa` CLI (provided by the `oxa` npm package) can convert an OXA document to its AT Protocol lexicon representation.

The `--to` option is required and specifies the target format. Currently the only supported format is `atproto`.

### From a JSON file

```bash
oxa convert --to atproto examples/document.json
```

### From a YAML file via stdin

```bash
cat examples/document.yaml | oxa convert --to atproto --yaml -
```

### Setting `createdAt`

By default `createdAt` is set to the current time. You can provide a fixed value:

```bash
oxa convert --to atproto --created-at 2026-03-22T00:00:00.000Z examples/document.json
```

### Example

Given this OXA document (`doc.yaml`):

```yaml
type: Document
children:
  - type: Heading
    level: 1
    children:
      - type: Text
        value: Hello
  - type: Paragraph
    children:
      - type: Text
        value: "Some "
      - type: Emphasis
        children:
          - type: Text
            value: emphasized
      - type: Text
        value: " text."
```

Running:

```bash
oxa convert --to atproto --yaml --created-at 2026-01-01T00:00:00.000Z doc.yaml
```

Produces:

```json
{
  "$type": "pub.oxa.document.document",
  "children": [
    {
      "$type": "pub.oxa.document.defs#heading",
      "level": 1,
      "text": "Hello",
      "facets": []
    },
    {
      "$type": "pub.oxa.document.defs#paragraph",
      "text": "Some emphasized text.",
      "facets": [
        {
          "index": { "byteStart": 5, "byteEnd": 15 },
          "features": [{ "$type": "pub.oxa.richtext.facet#emphasis" }]
        }
      ]
    }
  ],
  "createdAt": "2026-01-01T00:00:00.000Z"
}
```
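
On the consuming side, a facet's byte range has to be resolved against the UTF-8 encoding of `text`, not against JavaScript string indices. The helper below is a small sketch of that step; `sliceByBytes` is a hypothetical name and not part of `@oxa/core`.

```typescript
// Hypothetical helper; NOT part of @oxa/core.
// Returns the substring covered by a facet's byte range.
function sliceByBytes(text: string, byteStart: number, byteEnd: number): string {
  const bytes = new TextEncoder().encode(text); // UTF-8 bytes of the block text
  return new TextDecoder().decode(bytes.slice(byteStart, byteEnd));
}

// The emphasis facet from the example output above
const emphasized = sliceByBytes("Some emphasized text.", 5, 15);
```

For pure ASCII text the byte offsets coincide with string indices, so `"Some emphasized text.".slice(5, 15)` would give the same answer. The two diverge as soon as the text contains multi-byte characters, which is why the conversion goes through `TextEncoder`.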

### Programmatic API

The conversion functions are also exported from the `@oxa/core` package:

```typescript
import {
  flattenInlines,
  mapBlock,
  oxaToAtproto,
  compatibleFeatures,
} from "@oxa/core";

// `oxaDocument` is an OXA document object obtained elsewhere (e.g. parsed from JSON or YAML)
const atprotoRecord = oxaToAtproto(oxaDocument, {
  createdAt: "2026-01-01T00:00:00.000Z",
});
```

The `compatibleFeatures` export is a mutable record that controls which additional facet features from other AT Protocol namespaces are emitted alongside OXA features (see [compatible features](#compatible-features-from-other-namespaces) above). You can add or remove entries to customize interoperability:

```typescript
import { compatibleFeatures } from "@oxa/core";

// Emit a hypothetical shared-namespace feature alongside OXA links
compatibleFeatures["pub.oxa.richtext.facet#link"] = [
  (node) => ({
    $type: "org.example.richtext.facet#link",
    uri: node.uri as string,
  }),
];
```

## Generated from the OXA schema

The lexicon files are generated from the OXA YAML schema definitions by the codegen script (`scripts/lib/generate-lexicon.ts`). The generator:

1. Loads the merged OXA JSON Schema.
2. Classifies each type as inline or block based on the `Inline` and `Block` union definitions.
3. Maps inline types to facet features in `pub.oxa.richtext.facet` (excluding `Text`, which becomes the plain text string).
4. Maps block types to object definitions in `pub.oxa.document.defs`, replacing their inline `children` arrays with `text` + `facets` pairs.
5. Emits the `Document` record type in `pub.oxa.document.document`.
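
Steps 2 to 4 can be pictured with a toy model. The sketch below is not the code in `scripts/lib/generate-lexicon.ts`; it assumes a simplified schema in which the `Inline` and `Block` unions are `anyOf` lists of `$ref` entries, and it assumes lexicon names are the schema type names with a lower-cased first letter. `ToySchema`, `unionMembers`, and `planLexicon` are hypothetical names, and the real schema layout may differ.

```typescript
// Toy stand-in for the merged schema; the real OXA schema layout may differ.
type Ref = { $ref: string };
type ToySchema = { definitions: Record<string, { anyOf?: Ref[] }> };

// "#/definitions/Emphasis" -> "Emphasis"
function unionMembers(schema: ToySchema, union: string): string[] {
  return (schema.definitions[union]?.anyOf ?? []).map(
    (ref) => ref.$ref.split("/").pop() ?? "",
  );
}

// Assumed naming rule: "CodeBlock" -> "codeBlock"
const lowerFirst = (name: string): string =>
  name.charAt(0).toLowerCase() + name.slice(1);

function planLexicon(schema: ToySchema): { facetFeatures: string[]; blockDefs: string[] } {
  const inline = unionMembers(schema, "Inline");
  const block = unionMembers(schema, "Block");
  return {
    // `Text` is excluded: it becomes the plain text string, not a facet feature
    facetFeatures: inline
      .filter((name) => name !== "Text")
      .map((name) => `pub.oxa.richtext.facet#${lowerFirst(name)}`),
    blockDefs: block.map((name) => `pub.oxa.document.defs#${lowerFirst(name)}`),
  };
}
```

Under these assumptions, adding a new entry to the `Inline` union is enough for a new facet feature name to appear in the plan, which mirrors the behavior the generator is described as having.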

To regenerate the lexicon after changing the schema:

```bash
pnpm --filter scripts codegen lexicon
```

This means that when a new type is added to the OXA schema (for example, a new inline type like `Subscript` or a new block type like `CodeBlock`), it becomes available in the lexicon as soon as codegen is rerun. New inline types automatically appear as facet features in the open union, and new block types appear in the block union. No manual lexicon authoring is required.

## Further reading

- [AT Protocol documentation](https://atproto.com/docs)
- [Lexicon specification](https://atproto.com/specs/lexicon)
- [Lexicon Style Guide](https://atproto.com/guides/lexicon-style-guide)
- ["Why RichText facets in Bluesky"](https://www.pfrazee.com/blog/why-facets) (the design rationale behind facets)
- [OXA Schema overview](./schema-overview.md)

‎docs/myst.yml‎

Lines changed: 1 addition & 0 deletions
@@ -17,6 +17,7 @@ project:
- file: index.md
- file: install.md
- file: schema-overview.md
- file: atproto-lexicon.md
- title: Schema Reference
  children:
    - pattern: schema/*.md
