
Commit 8367220

docs: comprehensive documentation improvements and fixes (#498)
- Add missing extractors (csv, elastic, bigtable, bigquery) to feature matrix
- Write enrich processor and Kafka sink READMEs (were empty)
- Document MaxCompute support in tableau and optimus extractors
- Fix broken date-kafka example recipe referencing non-existent extractor
- Remove outdated "upcoming sinks" section, add Compass/Frontier/Kafka to sink docs
- Expand deployment guide with Docker, Kubernetes, cron, and systemd options
- Expand configuration reference with env vars, logging, and config patterns
- Add troubleshooting guide covering common issues
- Add example recipes for Postgres, BigQuery multi-sink, and Tableau
- Rewrite processor docs with chaining, execution order, and error handling
- Add context graph concept doc for AI use cases
- Update sidebar navigation with new pages
1 parent 0cb8c31 commit 8367220


21 files changed: +966 −106 lines changed

Lines changed: 110 additions & 0 deletions
@@ -0,0 +1,110 @@
# Context Graph for AI

AI systems — large language models, copilots, autonomous agents — are powerful reasoners but poor guessers. Without grounding in real metadata, they hallucinate table names, fabricate schemas, and miss relationships between systems. The gap between what AI can do and what it actually knows about your data landscape is the **context gap**.

Meteor closes this gap. By continuously extracting metadata — schemas, lineage, ownership, descriptions — from across your data infrastructure, Meteor builds the structured knowledge layer that AI needs to reason accurately. This layer is the **context graph**.

## What is a Context Graph?

A context graph is a connected, queryable representation of your data ecosystem. It captures not just what assets exist, but how they relate to each other:

- **Nodes** represent assets: tables, dashboards, jobs, topics, models, buckets, users, groups.
- **Edges** represent relationships: lineage (which table feeds which dashboard), ownership (who is responsible), and dependency (which job produces which dataset).

Unlike a flat catalog or a search index, a context graph preserves **structure**. It knows that a revenue dashboard depends on a sales table, which is produced by an ETL job, which reads from a Kafka topic. This structure is what makes AI useful over enterprise data.

## How Meteor Builds the Context Graph

Meteor operates as a metadata supply chain with three stages:

### Extract

Meteor's extractors connect to 30+ data sources — databases (BigQuery, Postgres, Snowflake), BI tools (Tableau, Metabase, Superset), streaming platforms (Kafka), cloud storage (GCS), orchestrators (Optimus), and more. Each extractor produces standardized **Asset** records containing:

- **Schema metadata** — column names, types, descriptions, constraints
- **Lineage** — upstream and downstream asset references
- **Ownership** — who owns and maintains the asset
- **Service context** — source system, URLs, timestamps

### Process

Processors enrich and transform assets in-flight before they reach a sink. Use them to:

- Append **labels** for classification (environment, domain, sensitivity, PII)
- **Enrich** assets with custom fields from external systems
- Run **scripts** (Tengo) for arbitrary transformation logic, including HTTP calls to external APIs

### Deliver

Sinks push the enriched metadata to wherever your AI systems can consume it — a metadata catalog (Compass), a search index, cloud storage (GCS), a streaming platform (Kafka), or a generic HTTP endpoint.
```
Sources (30+)          Processors             Sinks
┌──────────────┐      ┌──────────┐      ┌────────────────┐
│ BigQuery     │      │ Labels   │      │ Compass        │
│ Postgres     │─────▶│ Enrich   │─────▶│ Kafka          │
│ Tableau      │      │ Script   │      │ GCS / HTTP     │
│ Kafka ...    │      └──────────┘      └────────────────┘
└──────────────┘                               │
                                               ▼
                                        Context Graph
                                               │
                                               ▼
                                      AI Systems & Agents
```
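The three stages map directly onto a recipe file. A minimal sketch: the recipe and source names are illustrative, and the extractor config is left empty because each extractor defines its own fields (check its README):

```yaml
name: sample-recipe            # hypothetical recipe name
source:
  name: postgres               # any of the 30+ extractors
  config: {}                   # extractor-specific config goes here
processors:
  - name: labels               # built-in labels processor
    config:
      labels:
        environment: production
sinks:
  - name: console              # print extracted assets to stdout
```

Running this with `meteor run` extracts assets, labels them, and prints them, which is a quick way to inspect what the context graph will contain before wiring up a real sink.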
## Why AI Needs a Context Graph

### Grounding and Retrieval (RAG)

Retrieval-Augmented Generation depends on having a rich, accurate corpus to search against. Meteor-extracted metadata — table descriptions, column names, business labels, ownership — becomes the retrieval layer that grounds LLM responses in reality.

When a user asks *"find me all tables related to revenue"*, the AI searches over Meteor-extracted descriptions and labels instead of guessing. When it generates SQL, it uses real column names and types from the schema metadata.

### Lineage as Causal Reasoning

Most metadata systems tell AI **what exists**. Lineage tells it **how things connect**. This is the difference between a lookup tool and a reasoning partner.

With Meteor's lineage graph, AI can:

- **Trace impact** — "If I change this table's schema, which dashboards break?"
- **Root-cause analysis** — "This metric dropped. What changed upstream?"
- **Dependency awareness** — "Before deprecating this dataset, show me everything downstream."

### Asset Discovery for AI Agents

Function-calling AI agents need to know what tools and data are available. The context graph serves as the agent's **world model**:

- What tables, APIs, dashboards, and services exist
- What columns and types are available for SQL generation
- How services connect (topic → consumer → table → dashboard)
- Who to escalate to when something goes wrong

### Trust and Data Quality Signals

AI should not treat all data equally. Metadata enriched through Meteor's processors can carry trust signals — freshness, ownership, sensitivity labels, environment tags — that help AI systems prioritize reliable, appropriate data sources over stale or restricted ones.
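Such signals can be attached at extraction time with the built-in labels processor. A minimal sketch; the label keys and values here are illustrative assumptions, not a fixed schema:

```yaml
processors:
  - name: labels
    config:
      labels:
        environment: production   # trust/context signals an AI can filter on
        sensitivity: internal
        owner_team: data-platform
```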
## Extending Meteor for AI Workloads

Meteor's plugin architecture makes it straightforward to extend the context graph for AI-specific use cases:

| Capability | Approach |
|---|---|
| **Semantic search** | Use a script processor to generate vector embeddings from asset descriptions, enabling similarity-based retrieval |
| **Business glossary** | Extract metric definitions and business terms as first-class assets, linking them to underlying tables |
| **Usage signals** | Build extractors that capture query frequency and dashboard views, helping AI rank assets by relevance |
| **Data quality** | Enrich assets with freshness, completeness, and anomaly scores so AI can assess data trustworthiness |
| **LLM-optimized exports** | Create sinks that format metadata as structured context windows sized for LLM consumption |
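The last row need not wait for a custom sink. As a starting point, the built-in file sink can already write extracted assets as JSON for a downstream chunking and embedding pipeline; the output path below is an illustrative assumption:

```yaml
sinks:
  - name: file
    config:
      path: "./out/context_graph.json"  # hypothetical output location
      format: "json"                    # file sink supports json/yaml
```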
## The Flywheel

The context graph is not a one-time build. It is a continuously improving loop:

1. **Meteor extracts** metadata from across the data ecosystem
2. **The context graph** grows richer with each extraction cycle
3. **AI systems** use the graph for grounding, reasoning, and discovery
4. **AI interactions** reveal gaps — missing descriptions, unknown lineage, unlabeled assets
5. **Teams fill gaps**, improving metadata quality
6. **Meteor captures** the improvements, and the cycle continues

Each iteration makes the AI more capable and the metadata more complete. Meteor is the engine that keeps this flywheel turning.

docs/docs/concepts/overview.md

Lines changed: 1 addition & 0 deletions
@@ -6,3 +6,4 @@ A bit confused about various terms mentioned, and their usage?? Navigate through
 * [Source](source.md)
 * [Sink](sink.md)
 * [Processor](processor.md)
+* [Context Graph for AI](context_graph.md)

docs/docs/concepts/processor.md

Lines changed: 70 additions & 13 deletions
@@ -1,31 +1,88 @@
 # Processor
 
-A recipe can have none or many processors registered, depending upon the way the user wants metadata to be processed. A processor is basically a function that:
+A recipe can have none or many processors registered, depending upon how the user wants metadata to be processed. A processor is a function that takes each asset record, transforms it, and passes it to the next stage.
 
-* expects a list of data
-* processes the list
-* returns a list
+## How Processors Work
 
-The result from a processor will be passed on to the next processor until there is no more processor, hence the flow is sequential.
+Processors execute **sequentially** in the order they are defined in the recipe. The output of one processor becomes the input of the next, forming a transformation pipeline:
+
+```
+Extractor → Processor 1 → Processor 2 → Processor 3 → Sink
+```
+
+If no processors are defined, records flow directly from the extractor to the sink unchanged.
+
+## Error Handling
+
+If a processor encounters an error during execution, the entire recipe run fails. There is no skip-on-error behavior — you must fix the processor configuration to resolve the issue.
 
 ## Built-in Processors
 
-### metadata
+### Enrich
+
+Append custom key-value attributes to each asset's data. Useful for adding metadata that is not present in the source system.
+
+```yaml
+processors:
+  - name: enrich
+    config:
+      attributes:
+        team: data-platform
+        environment: production
+```
 
-This processor will set and overwrite metadata with given fields in the config.
+### Labels
+
+Append key-value labels to each asset. Labels are useful for categorization and filtering in downstream catalog services.
 
 ```yaml
 processors:
-  - name: metadata
+  - name: labels
     config:
-      fieldA: valueA
-      fieldB: valueB
+      labels:
+        source: meteor
+        classification: internal
 ```
 
+### Script
+
+Transform assets using a [Tengo](https://github.com/d5/tengo) script. The script processor gives you full control — including the ability to make HTTP calls to external services for enrichment.
+
+```yaml
+processors:
+  - name: script
+    config:
+      engine: tengo
+      script: |
+        asset.labels["processed"] = "true"
+```
+
+## Writing a Recipe with Processors
+
 | key | Description | requirement |
 | :--- | :--- | :--- |
-| `name` | contains the name of processor | required |
-| `config` | different processors will require different config | required |
+| `name` | Name of the processor to use | required |
+| `config` | Processor-specific configuration | required |
 
-More info about available processors can be found [here](../reference/processors.md).
+## Example: Chaining Multiple Processors
 
+```yaml
+processors:
+  - name: enrich
+    config:
+      attributes:
+        domain: payments
+  - name: labels
+    config:
+      labels:
+        source: meteor
+  - name: script
+    config:
+      engine: tengo
+      script: |
+        asset.name = asset.name + " [" + asset.service + "]"
+```
+
+In this example, each asset first gets enriched with a `domain` attribute, then gets labeled with `source: meteor`, and finally has its name modified by the script processor.
+
+More info about available processors can be found [here](../reference/processors.md).

docs/docs/concepts/sink.md

Lines changed: 66 additions & 42 deletions
@@ -27,85 +27,109 @@ sinks: # required - at least 1 sink defined
 * **Console**
 
 ```yaml
-name: sample-recipe
 sinks:
   - name: console
 ```
 
 Print metadata to stdout.
 
+* **Compass**
+
+```yaml
+sinks:
+  - name: compass
+    config:
+      host: https://compass.example.com
+      type: sample-compass-type
+      mapping:
+        new_fieldname: "json_field_name"
+        id: "resource.urn"
+        displayName: "resource.name"
+```
+
+Upload metadata to [Compass](https://github.com/raystack/compass), Raystack's metadata catalog service. Supports lineage and ownership.
+
 * **File**
 
 ```yaml
 sinks:
-  name: file
+  - name: file
    config:
-     path: "./dir/sample.yaml"
-     format: "yaml"
+      path: "./dir/sample.yaml"
+      format: "yaml"
 ```
 
 Sinks metadata to a file in `json/yaml` format as per the config defined.
 
+* **Frontier**
+
+```yaml
+sinks:
+  - name: frontier
+    config:
+      host: frontier.example.com
+      headers:
+        X-Frontier-Email: meteor@raystack.io
+```
+
+Upsert users to a Frontier service. Request will be sent via gRPC.
+
 * **Google Cloud Storage**
 
 ```yaml
 sinks:
   - name: gcs
     config:
-      project_id: google-project-id
-      url: gcs://bucket_name/target_folder
-      object_prefix : github-users
-      service_account_base64: <base64 encoded service account key>
-      service_account_json:
-        {
-          "type": "service_account",
-          "private_key_id": "xxxxxxx",
-          "private_key": "xxxxxxx",
-          "client_email": "xxxxxxx",
-          "client_id": "xxxxxxx",
-          "auth_uri": "https://accounts.google.com/o/oauth2/auth",
-          "token_uri": "https://oauth2.googleapis.com/token",
-          "auth_provider_x509_cert_url": "xxxxxxx",
-          "client_x509_cert_url": "xxxxxxx",
-        }
+      project_id: google-project-id
+      url: gcs://bucket_name/target_folder
+      object_prefix: github-users
+      service_account_base64: <base64 encoded service account key>
 ```
 
-Sinks json data to a file as ndjson format in Google Cloud Storage bucket
+Sinks JSON data as ndjson format in a Google Cloud Storage bucket.
 
-* **http**
+* **HTTP**
 
 ```yaml
 sinks:
-  name: http
-  config:
-    method: POST
-    success_code: 200
-    url: https://compass.com/v1beta1/asset
-    headers:
-      Header-1: value11,value12
+  - name: http
+    config:
+      method: POST
+      success_code: 200
+      url: https://example.com/v1/metadata
+      headers:
+        Header-1: value11,value12
 ```
 
-Sinks metadata to a http destination as per the config defined.
+Sinks metadata to an HTTP destination as per the config defined.
 
-* **Stencil**
+* **Kafka**
 
 ```yaml
 sinks:
-  name: stencil
-  config:
-    host: https://stencil.com
-    namespace_id: myNamespace
-    schema_id: mySchema
-    format: json
-    send_format_header: false
+  - name: kafka
+    config:
+      brokers: "localhost:9092"
+      topic: metadata-topic
+      key_path: ".resource.urn"
 ```
 
-Upload metadata of a given schema `format` in the existing `namespace_id` present in Stencil. Request will be sent via HTTP to a given host.
+Publish metadata as JSON messages to a Kafka topic. Supports message keying for partition control.
 
-## Upcoming sinks
+* **Stencil**
 
-* HTTP
-* Kafka
+```yaml
+sinks:
+  - name: stencil
+    config:
+      host: https://stencil.com
+      namespace_id: myNamespace
+      schema_id: mySchema
+      format: json
+      send_format_header: false
+```
+
+Upload metadata of a given schema `format` in the existing `namespace_id` present in Stencil. Request will be sent via HTTP to a given host.
 
 ## Serializer
