
Commit 8367220

docs: comprehensive documentation improvements and fixes (#498)
- Add missing extractors (csv, elastic, bigtable, bigquery) to feature matrix
- Write enrich processor and Kafka sink READMEs (were empty)
- Document MaxCompute support in tableau and optimus extractors
- Fix broken date-kafka example recipe referencing non-existent extractor
- Remove outdated "upcoming sinks" section, add Compass/Frontier/Kafka to sink docs
- Expand deployment guide with Docker, Kubernetes, cron, and systemd options
- Expand configuration reference with env vars, logging, and config patterns
- Add troubleshooting guide covering common issues
- Add example recipes for Postgres, BigQuery multi-sink, and Tableau
- Rewrite processor docs with chaining, execution order, and error handling
- Add context graph concept doc for AI use cases
- Update sidebar navigation with new pages
1 parent 0cb8c31 commit 8367220


21 files changed: +966 −106 lines changed

Lines changed: 110 additions & 0 deletions
@@ -0,0 +1,110 @@
# Context Graph for AI

AI systems — large language models, copilots, autonomous agents — are powerful reasoners but poor guessers. Without grounding in real metadata, they hallucinate table names, fabricate schemas, and miss relationships between systems. The gap between what AI can do and what it actually knows about your data landscape is the **context gap**.

Meteor closes this gap. By continuously extracting metadata — schemas, lineage, ownership, descriptions — from across your data infrastructure, Meteor builds the structured knowledge layer that AI needs to reason accurately. This layer is the **context graph**.

## What is a Context Graph?

A context graph is a connected, queryable representation of your data ecosystem. It captures not just what assets exist, but how they relate to each other:

- **Nodes** represent assets: tables, dashboards, jobs, topics, models, buckets, users, groups.
- **Edges** represent relationships: lineage (which table feeds which dashboard), ownership (who is responsible), and dependency (which job produces which dataset).

Unlike a flat catalog or a search index, a context graph preserves **structure**. It knows that a revenue dashboard depends on a sales table, which is produced by an ETL job, which reads from a Kafka topic. This structure is what makes AI useful over enterprise data.

## How Meteor Builds the Context Graph

Meteor operates as a metadata supply chain with three stages:

### Extract

Meteor's extractors connect to 30+ data sources — databases (BigQuery, Postgres, Snowflake), BI tools (Tableau, Metabase, Superset), streaming platforms (Kafka), cloud storage (GCS), orchestrators (Optimus), and more. Each extractor produces standardized **Asset** records containing:

- **Schema metadata** — column names, types, descriptions, constraints
- **Lineage** — upstream and downstream asset references
- **Ownership** — who owns and maintains the asset
- **Service context** — source system, URLs, timestamps

### Process

Processors enrich and transform assets in-flight before they reach a sink. Use them to:

- Append **labels** for classification (environment, domain, sensitivity, PII)
- **Enrich** assets with custom fields from external systems
- Run **scripts** (Tengo) for arbitrary transformation logic, including HTTP calls to external APIs

### Deliver

Sinks push the enriched metadata to wherever your AI systems can consume it — a metadata catalog (Compass), a search index, cloud storage (GCS), a streaming platform (Kafka), or a generic HTTP endpoint.
```
Sources (30+)          Processors             Sinks
┌──────────────┐      ┌──────────┐      ┌────────────────┐
│ BigQuery     │      │ Labels   │      │ Compass        │
│ Postgres     │─────▶│ Enrich   │─────▶│ Kafka          │
│ Tableau      │      │ Script   │      │ GCS / HTTP     │
│ Kafka ...    │      └──────────┘      └────────────────┘
└──────────────┘                               │
                                               ▼
                                        Context Graph
                                               │
                                               ▼
                                      AI Systems & Agents
```
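The three stages map directly onto a recipe file. A minimal sketch: the recipe and source names are illustrative, and the extractor config is left empty because each extractor defines its own fields (check its README):

```yaml
name: sample-recipe            # hypothetical recipe name
source:
  name: postgres               # any of the 30+ extractors
  config: {}                   # extractor-specific config goes here
processors:
  - name: labels               # built-in labels processor
    config:
      labels:
        environment: production
sinks:
  - name: console              # print extracted assets to stdout
```

Running this with `meteor run` extracts assets, labels them, and prints them, which is a quick way to inspect what the context graph will contain before wiring up a real sink.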
## Why AI Needs a Context Graph

### Grounding and Retrieval (RAG)

Retrieval-Augmented Generation depends on having a rich, accurate corpus to search against. Meteor-extracted metadata — table descriptions, column names, business labels, ownership — becomes the retrieval layer that grounds LLM responses in reality.

When a user asks *"find me all tables related to revenue"*, the AI searches over Meteor-extracted descriptions and labels instead of guessing. When it generates SQL, it uses real column names and types from the schema metadata.

### Lineage as Causal Reasoning

Most metadata systems tell AI **what exists**. Lineage tells it **how things connect**. This is the difference between a lookup tool and a reasoning partner.

With Meteor's lineage graph, AI can:

- **Trace impact** — "If I change this table's schema, which dashboards break?"
- **Root-cause analysis** — "This metric dropped. What changed upstream?"
- **Dependency awareness** — "Before deprecating this dataset, show me everything downstream."

### Asset Discovery for AI Agents

Function-calling AI agents need to know what tools and data are available. The context graph serves as the agent's **world model**:

- What tables, APIs, dashboards, and services exist
- What columns and types are available for SQL generation
- How services connect (topic → consumer → table → dashboard)
- Who to escalate to when something goes wrong

### Trust and Data Quality Signals

AI should not treat all data equally. Metadata enriched through Meteor's processors can carry trust signals — freshness, ownership, sensitivity labels, environment tags — that help AI systems prioritize reliable, appropriate data sources over stale or restricted ones.
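Such signals can be attached at extraction time with the built-in labels processor. A minimal sketch; the label keys and values here are illustrative assumptions, not a fixed schema:

```yaml
processors:
  - name: labels
    config:
      labels:
        environment: production   # trust/context signals an AI can filter on
        sensitivity: internal
        owner_team: data-platform
```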
## Extending Meteor for AI Workloads

Meteor's plugin architecture makes it straightforward to extend the context graph for AI-specific use cases:

| Capability | Approach |
|---|---|
| **Semantic search** | Use a script processor to generate vector embeddings from asset descriptions, enabling similarity-based retrieval |
| **Business glossary** | Extract metric definitions and business terms as first-class assets, linking them to underlying tables |
| **Usage signals** | Build extractors that capture query frequency and dashboard views, helping AI rank assets by relevance |
| **Data quality** | Enrich assets with freshness, completeness, and anomaly scores so AI can assess data trustworthiness |
| **LLM-optimized exports** | Create sinks that format metadata as structured context windows sized for LLM consumption |
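The last row need not wait for a custom sink. As a starting point, the built-in file sink can already write extracted assets as JSON for a downstream chunking and embedding pipeline; the output path below is an illustrative assumption:

```yaml
sinks:
  - name: file
    config:
      path: "./out/context_graph.json"  # hypothetical output location
      format: "json"                    # file sink supports json/yaml
```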
## The Flywheel

The context graph is not a one-time build. It is a continuously improving loop:

1. **Meteor extracts** metadata from across the data ecosystem
2. **The context graph** grows richer with each extraction cycle
3. **AI systems** use the graph for grounding, reasoning, and discovery
4. **AI interactions** reveal gaps — missing descriptions, unknown lineage, unlabeled assets
5. **Teams fill gaps**, improving metadata quality
6. **Meteor captures** the improvements, and the cycle continues

Each iteration makes the AI more capable and the metadata more complete. Meteor is the engine that keeps this flywheel turning.

docs/docs/concepts/overview.md

Lines changed: 1 addition & 0 deletions
@@ -6,3 +6,4 @@ A bit confused about various terms mentioned, and their usage?? Navigate through
 * [Source](source.md)
 * [Sink](sink.md)
 * [Processor](processor.md)
+* [Context Graph for AI](context_graph.md)

docs/docs/concepts/processor.md

Lines changed: 70 additions & 13 deletions
@@ -1,31 +1,88 @@
 # Processor
 
-A recipe can have none or many processors registered, depending upon the way the user wants metadata to be processed. A processor is basically a function that:
+A recipe can have none or many processors registered, depending upon how the user wants metadata to be processed. A processor is a function that takes each asset record, transforms it, and passes it to the next stage.
 
-* expects a list of data
-* processes the list
-* returns a list
+## How Processors Work
 
-The result from a processor will be passed on to the next processor until there is no more processor, hence the flow is sequential.
+Processors execute **sequentially** in the order they are defined in the recipe. The output of one processor becomes the input of the next, forming a transformation pipeline:
+
+```
+Extractor → Processor 1 → Processor 2 → Processor 3 → Sink
+```
+
+If no processors are defined, records flow directly from the extractor to the sink unchanged.
+
+## Error Handling
+
+If a processor encounters an error during execution, the entire recipe run fails. There is no skip-on-error behavior — you must fix the processor configuration to resolve the issue.
 
 ## Built-in Processors
 
-### metadata
+### Enrich
+
+Append custom key-value attributes to each asset's data. Useful for adding metadata that is not present in the source system.
+
+```yaml
+processors:
+  - name: enrich
+    config:
+      attributes:
+        team: data-platform
+        environment: production
+```
 
-This processor will set and overwrite metadata with given fields in the config.
+### Labels
+
+Append key-value labels to each asset. Labels are useful for categorization and filtering in downstream catalog services.
 
 ```yaml
 processors:
-  - name: metadata
+  - name: labels
     config:
-      fieldA: valueA
-      fieldB: valueB
+      labels:
+        source: meteor
+        classification: internal
 ```
 
+### Script
+
+Transform assets using a [Tengo](https://github.com/d5/tengo) script. The script processor gives you full control — including the ability to make HTTP calls to external services for enrichment.
+
+```yaml
+processors:
+  - name: script
+    config:
+      engine: tengo
+      script: |
+        asset.labels["processed"] = "true"
+```
+
+## Writing a Recipe with Processors
+
 | key | Description | requirement |
 | :--- | :--- | :--- |
-| `name` | contains the name of processor | required |
-| `config` | different processors will require different config | required |
+| `name` | Name of the processor to use | required |
+| `config` | Processor-specific configuration | required |
 
-More info about available processors can be found [here](../reference/processors.md).
+## Example: Chaining Multiple Processors
 
+```yaml
+processors:
+  - name: enrich
+    config:
+      attributes:
+        domain: payments
+  - name: labels
+    config:
+      labels:
+        source: meteor
+  - name: script
+    config:
+      engine: tengo
+      script: |
+        asset.name = asset.name + " [" + asset.service + "]"
+```
+
+In this example, each asset first gets enriched with a `domain` attribute, then gets labeled with `source: meteor`, and finally has its name modified by the script processor.
+
+More info about available processors can be found [here](../reference/processors.md).

docs/docs/concepts/sink.md

Lines changed: 66 additions & 42 deletions
@@ -27,85 +27,109 @@ sinks: # required - at least 1 sink defined
 * **Console**
 
 ```yaml
-name: sample-recipe
 sinks:
   - name: console
 ```
 
 Print metadata to stdout.
 
+* **Compass**
+
+```yaml
+sinks:
+  - name: compass
+    config:
+      host: https://compass.example.com
+      type: sample-compass-type
+      mapping:
+        new_fieldname: "json_field_name"
+        id: "resource.urn"
+        displayName: "resource.name"
+```
+
+Upload metadata to [Compass](https://github.com/raystack/compass), Raystack's metadata catalog service. Supports lineage and ownership.
+
 * **File**
 
 ```yaml
 sinks:
-  name: file
+  - name: file
    config:
-     path: "./dir/sample.yaml"
-     format: "yaml"
+      path: "./dir/sample.yaml"
+      format: "yaml"
 ```
 
 Sinks metadata to a file in `json/yaml` format as per the config defined.
 
+* **Frontier**
+
+```yaml
+sinks:
+  - name: frontier
+    config:
+      host: frontier.example.com
+      headers:
+        X-Frontier-Email: meteor@raystack.io
+```
+
+Upsert users to a Frontier service. Request will be sent via gRPC.
+
 * **Google Cloud Storage**
 
 ```yaml
 sinks:
   - name: gcs
     config:
-      project_id: google-project-id
-      url: gcs://bucket_name/target_folder
-      object_prefix : github-users
-      service_account_base64: <base64 encoded service account key>
-      service_account_json:
-        {
-          "type": "service_account",
-          "private_key_id": "xxxxxxx",
-          "private_key": "xxxxxxx",
-          "client_email": "xxxxxxx",
-          "client_id": "xxxxxxx",
-          "auth_uri": "https://accounts.google.com/o/oauth2/auth",
-          "token_uri": "https://oauth2.googleapis.com/token",
-          "auth_provider_x509_cert_url": "xxxxxxx",
-          "client_x509_cert_url": "xxxxxxx",
-        }
+      project_id: google-project-id
+      url: gcs://bucket_name/target_folder
+      object_prefix: github-users
+      service_account_base64: <base64 encoded service account key>
 ```
 
-Sinks json data to a file as ndjson format in Google Cloud Storage bucket
+Sinks JSON data as ndjson format in a Google Cloud Storage bucket.
 
-* **http**
+* **HTTP**
 
 ```yaml
 sinks:
-  name: http
-  config:
-    method: POST
-    success_code: 200
-    url: https://compass.com/v1beta1/asset
-    headers:
-      Header-1: value11,value12
+  - name: http
+    config:
+      method: POST
+      success_code: 200
+      url: https://example.com/v1/metadata
+      headers:
+        Header-1: value11,value12
 ```
 
-Sinks metadata to a http destination as per the config defined.
+Sinks metadata to an HTTP destination as per the config defined.
 
-* **Stencil**
+* **Kafka**
 
 ```yaml
 sinks:
-  name: stencil
-  config:
-    host: https://stencil.com
-    namespace_id: myNamespace
-    schema_id: mySchema
-    format: json
-    send_format_header: false
+  - name: kafka
+    config:
+      brokers: "localhost:9092"
+      topic: metadata-topic
+      key_path: ".resource.urn"
 ```
 
-Upload metadata of a given schema `format` in the existing `namespace_id` present in Stencil. Request will be sent via HTTP to a given host.
+Publish metadata as JSON messages to a Kafka topic. Supports message keying for partition control.
 
-## Upcoming sinks
+* **Stencil**
 
-* HTTP
-* Kafka
+```yaml
+sinks:
+  - name: stencil
+    config:
+      host: https://stencil.com
+      namespace_id: myNamespace
+      schema_id: mySchema
+      format: json
+      send_format_header: false
+```
+
+Upload metadata of a given schema `format` in the existing `namespace_id` present in Stencil. Request will be sent via HTTP to a given host.
 
 ## Serializer
