diff --git a/skills/knowledge-graph-mcp/SKILL.md b/skills/knowledge-graph-mcp/SKILL.md new file mode 100644 index 0000000..94c00e0 --- /dev/null +++ b/skills/knowledge-graph-mcp/SKILL.md @@ -0,0 +1,222 @@ +--- +name: knowledge-graph-mcp +description: Build an MCP (Model Context Protocol) server that exposes a Markdown-LD knowledge graph to AI agents. Provides tool definitions for SPARQL queries, natural language questions, entity listing, and article discovery. Use when creating an MCP server to make a knowledge bank queryable by LLMs and agent frameworks. +--- + +# Knowledge Graph MCP Server + +Build an [MCP server](https://modelcontextprotocol.io/) that wraps a Markdown-LD knowledge bank, letting any AI agent query the knowledge graph through structured tool calls. + +## Why MCP? + +The knowledge bank already has HTTP endpoints (`/api/sparql`, `/api/ask`). An MCP server adds: + +- **Agent discoverability** — Agents automatically discover available tools +- **Structured input/output** — Zod/Pydantic schemas for parameters and results +- **Tool annotations** — Read-only hints, descriptions that help agents choose tools +- **Local or remote** — Works via stdio (local) or streamable HTTP (remote) + +## Architecture + +``` +Agent (Claude, Copilot, etc.) + ↕ MCP Protocol (stdio or HTTP) +MCP Server + ↕ RDFLib (local) or HTTP (remote) +Knowledge Graph (Turtle files) +``` + +Two deployment modes: + +1. **Local mode** — MCP server loads `.ttl` files directly into RDFLib +2. **Remote mode** — MCP server proxies to the deployed Azure Functions API + +## Tool Definitions + +### 1. `query_sparql` + +Execute a SPARQL query against the knowledge graph. + +```python +@mcp.tool( + annotations={ + "readOnlyHint": True, + "openWorldHint": False, + } +) +def query_sparql(query: str) -> str: + """Execute a SPARQL 1.1 SELECT or ASK query against the knowledge graph. + + The graph uses schema.org vocabulary. 
Available prefixes:
+    - schema: <https://schema.org/>
+    - kb: (project-specific extraction-metadata namespace)
+    - prov: <http://www.w3.org/ns/prov#>
+
+    Only SELECT and ASK queries are allowed. Include a LIMIT clause.
+
+    Args:
+        query: A valid SPARQL 1.1 query string with PREFIX declarations.
+
+    Returns:
+        JSON string with SPARQL Results JSON format.
+    """
+```
+
+### 2. `ask_question`
+
+Natural language query — the server translates to SPARQL.
+
+```python
+@mcp.tool(
+    annotations={
+        "readOnlyHint": True,
+        "openWorldHint": False,
+    }
+)
+def ask_question(question: str) -> str:
+    """Ask a natural language question about the knowledge graph.
+
+    The question is translated to SPARQL and executed. The response
+    includes both the generated SPARQL and the results.
+
+    Example questions:
+    - "What entities are in the knowledge graph?"
+    - "Which articles mention SPARQL?"
+    - "Find all organizations"
+
+    Args:
+        question: A natural language question about the knowledge graph.
+
+    Returns:
+        JSON with 'question', 'sparql', and 'results' fields.
+    """
+```
+
+### 3. `list_entities`
+
+Browse entities with optional type filter.
+
+```python
+@mcp.tool(
+    annotations={
+        "readOnlyHint": True,
+        "openWorldHint": False,
+    }
+)
+def list_entities(entity_type: str = "schema:Thing", limit: int = 50) -> str:
+    """List entities in the knowledge graph, optionally filtered by type.
+
+    Available types: schema:Person, schema:Organization,
+    schema:SoftwareApplication, schema:CreativeWork, schema:Thing
+
+    Args:
+        entity_type: Schema.org type to filter by (default: all entities).
+        limit: Maximum number of results (default: 50).
+
+    Returns:
+        JSON array of entities with id, name, and type.
+    """
+```
+
+### 4. `list_articles`
+
+Discover articles in the knowledge bank.
+
+```python
+@mcp.tool(
+    annotations={
+        "readOnlyHint": True,
+        "openWorldHint": False,
+    }
+)
+def list_articles(limit: int = 50) -> str:
+    """List articles in the knowledge bank.
+
+    Args:
+        limit: Maximum number of results (default: 50).
+ + Returns: + JSON array of articles with id, title, datePublished, and keywords. + """ +``` + +### 5. `get_entity_details` + +Deep-dive into a specific entity's relationships. + +```python +@mcp.tool( + annotations={ + "readOnlyHint": True, + "openWorldHint": False, + } +) +def get_entity_details(entity_name: str) -> str: + """Get detailed information about a specific entity. + + Returns the entity's type, sameAs links, articles that mention it, + and related entities. + + Args: + entity_name: The name of the entity to look up (case-insensitive). + + Returns: + JSON with entity details and relationships. + """ +``` + +## Implementation (Python / FastMCP) + +See [scripts/server.py](scripts/server.py) for a complete reference implementation. + +### Quick Start + +```bash +# Install dependencies +pip install mcp rdflib + +# Run locally (stdio transport) +python scripts/server.py --graph-dir ./graph/articles + +# Run as HTTP server +python scripts/server.py --transport http --port 8080 +``` + +### Connecting to an Agent + +**Claude Code / Copilot:** +```json +{ + "mcpServers": { + "knowledge-graph": { + "command": "python", + "args": ["skills/knowledge-graph-mcp/scripts/server.py", "--graph-dir", "./graph/articles"] + } + } +} +``` + +## Safety + +All tools are read-only: +- SPARQL queries are validated: `INSERT`, `DELETE`, `LOAD`, `CLEAR`, `DROP`, `CREATE` are blocked +- A `LIMIT` clause is injected if missing (default 100) +- No authentication required for read access + +## Error Handling + +Return clear, actionable errors: + +```python +if not is_valid: + return json.dumps({ + "error": f"Invalid SPARQL syntax: {error_msg}", + "hint": "Check PREFIX declarations and property names. " + "Available properties: schema:name, schema:mentions, schema:about, ..." 
+    })
+```
+
+## Reference
+
+- [references/api-reference.md](references/api-reference.md) — Full API endpoint documentation
+- [scripts/server.py](scripts/server.py) — Reference MCP server implementation
diff --git a/skills/knowledge-graph-mcp/references/api-reference.md b/skills/knowledge-graph-mcp/references/api-reference.md
new file mode 100644
index 0000000..a5631ec
--- /dev/null
+++ b/skills/knowledge-graph-mcp/references/api-reference.md
@@ -0,0 +1,131 @@
+# API Reference
+
+The Markdown-LD knowledge bank exposes two HTTP endpoints via Azure Functions.
+
+## `GET/POST /api/sparql`
+
+Execute SPARQL 1.1 queries against the knowledge graph.
+
+### GET
+
+```
+GET /api/sparql?query={url-encoded-sparql}
+```
+
+### POST (preferred for complex queries)
+
+```
+POST /api/sparql
+Content-Type: application/sparql-query
+
+PREFIX schema: <https://schema.org/>
+SELECT ?name WHERE { ?e schema:name ?name }
+```
+
+### POST (form-encoded)
+
+```
+POST /api/sparql
+Content-Type: application/x-www-form-urlencoded
+
+query=PREFIX+schema...
+```
+
+### Response
+
+```
+Content-Type: application/sparql-results+json
+
+{
+  "head": { "vars": ["name"] },
+  "results": {
+    "bindings": [
+      { "name": { "type": "literal", "value": "Neo4j" } }
+    ]
+  }
+}
+```
+
+### Errors
+
+- `400` — Missing query parameter or SPARQL syntax error
+- `403` — Mutating query detected (INSERT, DELETE, etc.)
+
+### Headers
+
+```
+Access-Control-Allow-Origin: *
+Cache-Control: public, max-age=300
+```
+
+## `GET/POST /api/ask`
+
+Translate natural language to SPARQL and execute.
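Both endpoints accept URL-encoded GET parameters. A minimal client-side sketch of building those request URLs — the base URL, helper names, and example strings are illustrative, not part of the deployed API:

```python
from urllib.parse import urlencode

# Hypothetical base URL — substitute your deployed Static Web App host.
BASE_URL = "https://example.azurestaticapps.net"


def build_ask_url(question: str) -> str:
    """Build the GET form of an /api/ask request with proper URL encoding."""
    return f"{BASE_URL}/api/ask?" + urlencode({"question": question})


def build_sparql_url(query: str) -> str:
    """Build the GET form of an /api/sparql request."""
    return f"{BASE_URL}/api/sparql?" + urlencode({"query": query})


# Spaces become '+', '?' becomes '%3F', braces are percent-encoded.
url = build_ask_url("What entities are in the knowledge graph?")
```

`urlencode` handles the escaping that the `{url-encoded-sparql}` placeholder above implies; for long queries, prefer the POST forms instead.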
+
+### GET
+
+```
+GET /api/ask?question={url-encoded-question}
+```
+
+### POST (preferred)
+
+```
+POST /api/ask
+Content-Type: application/json
+
+{"question": "What entities are in the knowledge graph?"}
+```
+
+### Response
+
+```json
+{
+  "question": "What entities are in the knowledge graph?",
+  "sparql": "PREFIX schema: <https://schema.org/>\nSELECT DISTINCT ?entity ?name ?type WHERE {\n  ?entity a ?type ;\n          schema:name ?name .\n  FILTER(?type != schema:Article)\n}\nLIMIT 100",
+  "results": {
+    "head": { "vars": ["entity", "name", "type"] },
+    "results": { "bindings": [...] }
+  }
+}
+```
+
+### Errors
+
+- `400` — Missing question parameter or SPARQL execution error
+- `502` — LLM rate limit or API error
+
+### Requirements
+
+The `/api/ask` endpoint requires `GITHUB_TOKEN` as an environment variable for LLM access (GitHub Models). The `/api/sparql` endpoint works without any configuration.
+
+## Available Schema
+
+### Classes
+
+| Class | Description |
+|-------|-------------|
+| `schema:Article` | A Markdown article |
+| `schema:Person` | A person |
+| `schema:Organization` | A company, foundation, or team |
+| `schema:SoftwareApplication` | Software tools and libraries |
+| `schema:CreativeWork` | Books, papers, specs |
+| `schema:Thing` | Base type for any entity |
+
+### Properties
+
+| Property | Domain → Range |
+|----------|---------------|
+| `schema:name` | Any → string |
+| `schema:mentions` | Article → Thing |
+| `schema:about` | Article → Thing |
+| `schema:author` | Article → Person |
+| `schema:creator` | Thing → Person/Org |
+| `schema:datePublished` | Article → date |
+| `schema:dateModified` | Article → date |
+| `schema:sameAs` | Thing → URI |
+| `schema:keywords` | Article → string |
+| `schema:description` | Article → string |
+| `kb:relatedTo` | Thing → Thing |
+| `kb:confidence` | Assertion → decimal |
+| `prov:wasDerivedFrom` | Assertion → URI |
diff --git a/skills/knowledge-graph-mcp/scripts/server.py b/skills/knowledge-graph-mcp/scripts/server.py
new
file mode 100644 index 0000000..b630aa5 --- /dev/null +++ b/skills/knowledge-graph-mcp/scripts/server.py @@ -0,0 +1,227 @@ +"""MCP server for a Markdown-LD knowledge graph. + +Exposes the knowledge bank as MCP tools that AI agents can discover +and call. Supports both local mode (loads .ttl files directly) and +remote mode (proxies to a deployed API). + +Usage: + # Local mode (stdio transport) + python server.py --graph-dir ./graph/articles + + # Local mode (HTTP transport) + python server.py --graph-dir ./graph/articles --transport http --port 8080 + + # Remote mode (proxy to deployed API) + python server.py --api-url https://your-swa.azurestaticapps.net + +Requirements: + pip install mcp rdflib +""" + +import argparse +import json + +from mcp.server.fastmcp import FastMCP +from rdflib import Dataset, Graph + +mcp = FastMCP( + "knowledge-graph", + description="Query a Markdown-LD knowledge graph via SPARQL or natural language.", +) + +_dataset: Dataset | None = None +_args = None + +MUTATING_KEYWORDS = ["INSERT", "DELETE", "LOAD", "CLEAR", "DROP", "CREATE"] + + +def _load_dataset() -> Dataset: + """Load all Turtle files into an RDFLib Dataset.""" + global _dataset + if _dataset is not None: + return _dataset + + from pathlib import Path + + ds = Dataset() + graph_dir = Path(_args.graph_dir) + + if graph_dir.exists(): + for ttl_file in graph_dir.glob("*.ttl"): + g = Graph() + g.parse(str(ttl_file), format="turtle") + for triple in g: + ds.add(triple) + + _dataset = ds + return _dataset + + +def _enforce_safety(sparql: str) -> tuple[bool, str, str]: + """Validate safety constraints. 
Returns (is_safe, sanitized, error)."""
+    upper = sparql.strip().upper()
+    # Substring matching is deliberately conservative: it may reject a query
+    # whose literals contain words like "CREATED", but it never lets a
+    # mutating query through.
+    for kw in MUTATING_KEYWORDS:
+        if kw in upper:
+            return False, sparql, f"Mutating keyword '{kw}' is not allowed"
+    if "LIMIT" not in upper and "SELECT" in upper:
+        sparql = sparql.rstrip().rstrip(";") + "\nLIMIT 100"
+    return True, sparql, ""
+
+
+@mcp.tool()
+def query_sparql(query: str) -> str:
+    """Execute a SPARQL 1.1 SELECT or ASK query against the knowledge graph.
+
+    The graph uses schema.org vocabulary. Available prefixes:
+    - schema: <https://schema.org/>
+    - kb: (project-specific extraction-metadata namespace)
+    - prov: <http://www.w3.org/ns/prov#>
+
+    Only SELECT and ASK queries are allowed. Include a LIMIT clause.
+
+    Args:
+        query: A valid SPARQL 1.1 query string with PREFIX declarations.
+
+    Returns:
+        JSON string with SPARQL Results JSON format.
+    """
+    is_safe, query, error = _enforce_safety(query)
+    if not is_safe:
+        return json.dumps({"error": error})
+
+    try:
+        ds = _load_dataset()
+        result = ds.query(query)
+        serialized = result.serialize(format="json")
+        if isinstance(serialized, bytes):
+            serialized = serialized.decode("utf-8")
+        return serialized
+    except Exception as e:
+        return json.dumps({
+            "error": f"Query execution failed: {str(e)}",
+            "hint": "Check PREFIX declarations and property names. "
+                    "Available: schema:name, schema:mentions, schema:about, "
+                    "schema:author, schema:creator, schema:sameAs, kb:relatedTo",
+        })
+
+
+@mcp.tool()
+def list_entities(entity_type: str = "schema:Thing", limit: int = 50) -> str:
+    """List entities in the knowledge graph, optionally filtered by type.
+
+    Available types: schema:Person, schema:Organization,
+    schema:SoftwareApplication, schema:CreativeWork, schema:Thing
+
+    Args:
+        entity_type: Schema.org type to filter by (default: all non-Article entities).
+        limit: Maximum number of results (default: 50).
+
+    Returns:
+        JSON array of entities with id, name, and type.
+    """
+    if entity_type == "schema:Thing":
+        query = f"""
+        PREFIX schema: <https://schema.org/>
+        SELECT DISTINCT ?entity ?name ?type WHERE {{
+            ?entity a ?type ; schema:name ?name .
+            FILTER(?type != schema:Article)
+        }} LIMIT {limit}
+        """
+    else:
+        query = f"""
+        PREFIX schema: <https://schema.org/>
+        SELECT DISTINCT ?entity ?name WHERE {{
+            ?entity a {entity_type} ; schema:name ?name .
+        }} LIMIT {limit}
+        """
+    return query_sparql(query)
+
+
+@mcp.tool()
+def list_articles(limit: int = 50) -> str:
+    """List articles in the knowledge bank with their titles and dates.
+
+    Args:
+        limit: Maximum number of results (default: 50).
+
+    Returns:
+        JSON array of articles with id, title, datePublished, and keywords.
+    """
+    query = f"""
+    PREFIX schema: <https://schema.org/>
+    SELECT ?article ?title ?date ?keywords WHERE {{
+        ?article a schema:Article ;
+                 schema:name ?title .
+        OPTIONAL {{ ?article schema:datePublished ?date }}
+        OPTIONAL {{ ?article schema:keywords ?keywords }}
+    }} LIMIT {limit}
+    """
+    return query_sparql(query)
+
+
+@mcp.tool()
+def get_entity_details(entity_name: str) -> str:
+    """Get detailed information about a specific entity including relationships.
+
+    Returns the entity's type, sameAs links, articles that mention it,
+    and related entities.
+
+    Args:
+        entity_name: The name of the entity to look up (case-insensitive).
+
+    Returns:
+        JSON with entity details and relationships.
+    """
+    escaped = entity_name.replace('"', '\\"')
+    # The kb: namespace below is a placeholder; align it with the base URI
+    # your extraction pipeline uses for kb:relatedTo.
+    query = f"""
+    PREFIX schema: <https://schema.org/>
+    PREFIX kb: <https://example.com/kb#>
+    SELECT ?entity ?type ?sameAs ?article ?articleTitle ?related ?relatedName WHERE {{
+        ?entity schema:name ?name .
+        FILTER(LCASE(STR(?name)) = LCASE("{escaped}"))
+        ?entity a ?type .
+        OPTIONAL {{ ?entity schema:sameAs ?sameAs }}
+        OPTIONAL {{
+            ?article a schema:Article ;
+                     schema:name ?articleTitle ;
+                     schema:mentions ?entity .
+        }}
+        OPTIONAL {{
+            ?entity kb:relatedTo ?related .
+            ?related schema:name ?relatedName .
+        }}
+    }} LIMIT 100
+    """
+    return query_sparql(query)
+
+
+def main():
+    global _args
+    parser = argparse.ArgumentParser(description="Knowledge Graph MCP Server")
+    parser.add_argument(
+        "--graph-dir",
+        default="./graph/articles",
+        help="Directory containing .ttl files (local mode)",
+    )
+    parser.add_argument(
+        "--transport",
+        choices=["stdio", "http"],
+        default="stdio",
+        help="MCP transport (default: stdio)",
+    )
+    parser.add_argument(
+        "--port",
+        type=int,
+        default=8080,
+        help="Port for HTTP transport (default: 8080)",
+    )
+    _args = parser.parse_args()
+
+    if _args.transport == "http":
+        # FastMCP.run() accepts only the transport name; host and port are
+        # configured through the server settings.
+        mcp.settings.host = "0.0.0.0"
+        mcp.settings.port = _args.port
+        mcp.run(transport="streamable-http")
+    else:
+        mcp.run(transport="stdio")
+
+
+if __name__ == "__main__":
+    main()
diff --git a/skills/llm-rdf-extraction/SKILL.md b/skills/llm-rdf-extraction/SKILL.md
new file mode 100644
index 0000000..ef006fa
--- /dev/null
+++ b/skills/llm-rdf-extraction/SKILL.md
@@ -0,0 +1,206 @@
+---
+name: llm-rdf-extraction
+description: Design and iterate on LLM prompts that extract structured RDF (entities and assertions) from natural language text. Covers system prompt architecture, JSON output schema enforcement, few-shot example design, confidence calibration, chunking strategies, and common extraction failure modes. Use when building or tuning an LLM-powered knowledge graph extraction pipeline.
+---
+
+# LLM RDF Extraction
+
+Design prompts that reliably extract structured RDF knowledge from natural language text. The extraction pipeline chunks Markdown articles and sends each chunk to an LLM that returns entities and assertions in a strict JSON schema.
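The heading-based chunking step that drives the pipeline can be sketched in a few lines — split at H1/H2 boundaries and derive a stable content-hash id per chunk. This is an illustrative sketch, not the pipeline's actual code, and it omits token-based packing:

```python
import hashlib
import re


def chunk_markdown(body: str) -> list[dict]:
    """Split a Markdown body at H1/H2 headings; hash each chunk for caching."""
    # Split immediately before lines starting with '# ' or '## ' (H3+ stays
    # inside its parent section).
    parts = re.split(r"(?m)^(?=#{1,2} )", body)
    chunks = []
    for part in parts:
        text = part.strip()
        if not text:
            continue
        chunk_id = hashlib.sha256(text.encode("utf-8")).hexdigest()[:12]
        chunks.append({"chunk_id": chunk_id, "text": text})
    return chunks


doc = "# Intro\nSome text.\n## Details\nMore text.\n### Sub\nNested text."
chunks = chunk_markdown(doc)
# Two chunks: the H3 section stays inside the '## Details' chunk.
```

Because the id is a hash of the chunk's content, unchanged sections keep their cached extraction results across pipeline runs.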
+ +## Pipeline Overview + +``` +Markdown article + → Parse frontmatter (metadata, entity_hints) + → Split at H1/H2 headings into chunks (~750 tokens each) + → For each batch of 3-5 chunks: + → Send to LLM with system prompt + user message + → Parse JSON response + → Cache result per chunk + → Post-process: canonicalize entities, deduplicate assertions + → Validate with SHACL + → Write JSON-LD + Turtle outputs +``` + +## System Prompt Architecture + +The system prompt has five sections. Each is critical: + +### 1. Role Declaration + +``` +You are a deterministic RDF extraction engine. Output MUST be valid JSON +per the schema below. Do not invent facts. +``` + +**Why:** Anchors the LLM in a strict, non-creative role. "Deterministic" and "do not invent facts" reduce hallucination. + +### 2. Ontology Definition + +``` +ONTOLOGY (use schema.org first; kb: only for extraction metadata): +- Types: schema:Article, schema:Person, schema:Organization, + schema:SoftwareApplication, schema:CreativeWork, schema:Thing +- Preferred properties: schema:about, schema:mentions, schema:sameAs, + schema:author, schema:creator, schema:datePublished, schema:dateModified +- Fallback: kb:relatedTo +- Provenance: prov:wasDerivedFrom (use chunk URI) +- Confidence: kb:confidence (0..1) +``` + +**Why:** Constraining the vocabulary prevents the LLM from inventing predicates. The "use schema.org first" instruction creates a preference hierarchy. + +### 3. Extraction Rules + +Key rules and their rationale: + +| Rule | Rationale | +|------|-----------| +| Only extract explicitly stated or strongly implied facts | Prevents hallucination | +| Prefer schema.org properties over kb:relatedTo | Produces richer, more queryable graphs | +| Every entity must have id, type, label | Ensures well-formed output | +| Use stable slug-based IDs | Enables deduplication across chunks | +| Set confidence based on explicitness | Lets downstream consumers filter by quality | + +### 4. 
Output JSON Schema + +```json +{ + "entities": [ + {"id": "...", "type": "schema:Thing", "label": "...", "sameAs": ["..."]} + ], + "assertions": [ + {"s": "...", "p": "schema:mentions", "o": "...", "confidence": 0.85, "source": "urn:kb:chunk:..."} + ] +} +``` + +**Critical:** Use `response_format: {"type": "json_object"}` in the API call to force JSON output. + +### 5. Few-Shot Examples + +Include 2-3 examples showing: +- Simple entity extraction +- Relationship extraction with confidence scores +- The `ARTICLE_ID` placeholder pattern + +``` +TEXT: "Neo4j was created by Emil Eifrem." +OUTPUT: +{"entities":[ + {"id":"https://example.com/id/neo4j","type":"schema:SoftwareApplication","label":"Neo4j"}, + {"id":"https://example.com/id/emil-eifrem","type":"schema:Person","label":"Emil Eifrem"} +], +"assertions":[ + {"s":".../neo4j","p":"schema:creator","o":".../emil-eifrem","confidence":0.92,"source":"urn:kb:chunk:EXAMPLE"} +]} +``` + +## Confidence Calibration + +Teach the LLM a consistent confidence scale: + +| Score | Criterion | Example | +|-------|-----------|---------| +| 0.9+ | Directly stated | "X was created by Y" | +| 0.7–0.9 | Strongly implied | "Y, the creator of X, said..." | +| 0.5–0.7 | Weakly implied | "X and Y are in the same domain" | +| < 0.5 | **Do not emit** | Speculative connections | + +The 0.5 threshold acts as a quality gate — it's better to miss a weak assertion than pollute the graph with noise. + +## Chunking Strategy + +The chunker determines extraction quality as much as the prompt does. + +### How It Works + +1. Parse frontmatter (strip from extraction input) +2. Split body at H1/H2 heading boundaries +3. Pack consecutive sections until reaching ~750 tokens +4. Hash each chunk's content for a stable `chunk_id` + +### Why ~750 Tokens? 
+ +- **Too small** (< 300): Loses cross-sentence context, entities mentioned once get missed +- **Just right** (~750): Fits multiple paragraphs, each chunk is self-contained +- **Too large** (> 1500): Approaches input limits when batching, LLM attention degrades + +### Batching + +Send 3-5 chunks per API call to stay under the 8K input token limit (GitHub Models GPT-4o-mini). The user message includes: + +``` +ARTICLE_ID: {article_id} +--- CHUNK 1 (chunk_id: abc123, source: urn:kb:chunk:{doc_id}:abc123) --- +{chunk text} + +--- CHUNK 2 (chunk_id: def456, source: urn:kb:chunk:{doc_id}:def456) --- +{chunk text} +``` + +## Caching + +Cache each chunk's extraction result keyed by `{chunk_id}.{prompt_version}.{model}`. This means: + +- Re-running the pipeline on unchanged content is free +- Changing the prompt version invalidates all caches (intentional) +- Changing the model invalidates all caches (intentional) + +## Common Failure Modes + +### 1. JSON Parse Errors + +**Symptom:** LLM returns markdown-fenced JSON or invalid JSON. + +**Fix:** Use `response_format: {"type": "json_object"}`. Add a fallback regex to extract JSON from code blocks: +```python +json_match = re.search(r"```(?:json)?\s*(\{.*?\})\s*```", text, re.DOTALL) +``` + +### 2. Hallucinated Predicates + +**Symptom:** LLM invents predicates like `schema:uses`, `schema:partOf`. + +**Fix:** List available properties explicitly in the system prompt. Add: "Only use properties listed above." + +### 3. Missing Entity IDs + +**Symptom:** Entities have labels but no `id` field. + +**Fix:** Add the ID generation rule with examples to the prompt. Show the exact `{base_url}/id/{slug}` pattern. + +### 4. Cross-Chunk Reference Loss + +**Symptom:** Entity mentioned in chunk 1 and chunk 3, but chunk 2 creates a duplicate with a different ID. + +**Fix:** Post-processing canonicalization merges by slug. The system tolerates this at extraction time and fixes it downstream. + +### 5. 
Over-Extraction + +**Symptom:** Every noun becomes an entity. + +**Fix:** Add the rule: "Only extract named entities that are specific, identifiable things — not generic concepts like 'data' or 'systems'." + +## Prompt Versioning + +The prompt file is versioned (`extract_rdf_v1.txt`). When making changes: + +1. Create a new version file (`extract_rdf_v2.txt`) +2. Update `PROMPT_VERSION` in `llm_client.py` +3. This automatically invalidates the cache for all chunks +4. Run the pipeline and compare output quality + +## Rate Limit Management + +GitHub Models free tier: **150 requests/day, 8K input tokens**. + +Strategies: +- Batch 3-5 chunks per request (reduces total API calls) +- Cache aggressively (only call LLM for changed/new content) +- Use `git diff` to detect changed files (skip unchanged articles) +- Exponential backoff on `429 Too Many Requests` + +## Reference + +See [references/prompt-patterns.md](references/prompt-patterns.md) for the complete extraction prompt and iteration patterns. diff --git a/skills/llm-rdf-extraction/references/prompt-patterns.md b/skills/llm-rdf-extraction/references/prompt-patterns.md new file mode 100644 index 0000000..91e6fd4 --- /dev/null +++ b/skills/llm-rdf-extraction/references/prompt-patterns.md @@ -0,0 +1,105 @@ +# Extraction Prompt Patterns + +## Current Prompt (v1) + +The production prompt used for entity/relation extraction: + +``` +SYSTEM: +You are a deterministic RDF extraction engine. Output MUST be valid JSON per the schema below. Do not invent facts. 
+ +ONTOLOGY (use schema.org first; kb: only for extraction metadata): +- Types: schema:Article, schema:Person, schema:Organization, schema:SoftwareApplication, schema:CreativeWork, schema:Thing +- Preferred properties: schema:about, schema:mentions, schema:sameAs, schema:author, schema:creator, schema:datePublished, schema:dateModified +- Fallback: kb:relatedTo +- Provenance: prov:wasDerivedFrom (use chunk URI) +- Confidence: kb:confidence (0..1) + +RULES: +- Only add relations explicitly stated or strongly implied by the text. +- Prefer schema.org properties; use kb:relatedTo only if no schema.org property fits. +- Every entity MUST have: id, type, label. +- Use stable ids: + - articles: canonical_url + - entities: GRAPH_BASE_URL + "/id/" + slug(label) + - slug = lowercase, hyphens, no special chars +- If a wikilink or entity_hints provides sameAs, use it. +- Emit 0..N triples. Avoid duplicates. +- Set confidence based on how explicitly the text states the relation: + - 0.9+: directly stated ("X created Y") + - 0.7-0.9: strongly implied + - 0.5-0.7: weakly implied + - <0.5: do not emit + +OUTPUT JSON SCHEMA: +{ + "entities":[{"id":"...","type":"schema:Thing","label":"...","sameAs":["..."]}], + "assertions":[ + {"s":"...","p":"schema:mentions","o":"...","confidence":0.85,"source":"urn:kb:chunk:..."} + ] +} +``` + +## Few-Shot Example Pattern + +Always include 2-3 examples that demonstrate: + +### Example 1: Explicit relationship + +``` +TEXT: "Neo4j was created by Emil Eifrem." 
+OUTPUT:
+{"entities":[
+  {"id":"https://example.com/id/neo4j","type":"schema:SoftwareApplication","label":"Neo4j"},
+  {"id":"https://example.com/id/emil-eifrem","type":"schema:Person","label":"Emil Eifrem"}
+],
+"assertions":[
+  {"s":"https://example.com/id/neo4j","p":"schema:creator","o":"https://example.com/id/emil-eifrem","confidence":0.92,"source":"urn:kb:chunk:EXAMPLE"}
+]}
+```
+
+### Example 2: Article mentions
+
+```
+TEXT: "This article discusses SPARQL and JSON-LD as key standards."
+OUTPUT:
+{"entities":[
+  {"id":"https://example.com/id/sparql","type":"schema:Thing","label":"SPARQL"},
+  {"id":"https://example.com/id/json-ld","type":"schema:Thing","label":"JSON-LD"}
+],
+"assertions":[
+  {"s":"ARTICLE_ID","p":"schema:mentions","o":"https://example.com/id/sparql","confidence":0.85,"source":"urn:kb:chunk:EXAMPLE"},
+  {"s":"ARTICLE_ID","p":"schema:mentions","o":"https://example.com/id/json-ld","confidence":0.85,"source":"urn:kb:chunk:EXAMPLE"}
+]}
+```
+
+## Prompt Iteration Checklist
+
+When creating a new prompt version:
+
+1. **Define the change hypothesis** — What specific extraction quality issue are you fixing?
+2. **Create a golden test set** — 5-10 chunks with manually verified expected output
+3. **Create the new prompt file** — `extract_rdf_v2.txt`
+4. **Run against golden set** — Compare entity and assertion counts, precision, recall
+5. **Check for regressions** — Ensure previously correct extractions still work
+6. **Update PROMPT_VERSION** — Invalidates all caches
+7.
**Full pipeline run** — Rebuild the entire graph + +## API Call Configuration + +```python +response = client.chat.completions.create( + model="openai/gpt-4o-mini", + messages=[ + {"role": "system", "content": system_prompt}, + {"role": "user", "content": user_message}, + ], + temperature=0, # deterministic output + response_format={"type": "json_object"}, # force valid JSON +) +``` + +**Key settings:** +- `temperature=0` — Maximizes reproducibility +- `response_format=json_object` — Prevents markdown wrapping, ensures parseable output +- Model: `openai/gpt-4o-mini` via GitHub Models endpoint (`https://models.github.ai/inference`) diff --git a/skills/markdown-ld-authoring/SKILL.md b/skills/markdown-ld-authoring/SKILL.md new file mode 100644 index 0000000..6fab89c --- /dev/null +++ b/skills/markdown-ld-authoring/SKILL.md @@ -0,0 +1,161 @@ +--- +name: markdown-ld-authoring +description: Write Markdown articles that produce high-quality Linked Data when processed by an LLM extraction pipeline. Covers YAML frontmatter conventions, entity_hints for Wikidata alignment, wikilink syntax for entity marking, and content structuring patterns that maximize RDF extraction accuracy. Use when creating or editing content for a Markdown-LD knowledge bank. +--- + +# Markdown-LD Authoring + +Write Markdown articles in `content/` that an LLM pipeline will process into RDF/JSON-LD knowledge graphs. Your goal: write naturally readable prose that also produces rich, accurate structured data. + +## File Placement + +Articles go in `content/` organized by date: + +``` +content/ +└── YYYY/ + └── MM/ + └── slug-title.md +``` + +The path determines the canonical URL: `content/2026/04/my-article.md` → `https://{base}/2026/04/my-article/` + +## Frontmatter Schema + +Every article starts with YAML frontmatter between `---` fences: + +```yaml +--- +title: "What is a Knowledge Graph?" 
+date_published: "2026-04-04" +date_modified: "2026-04-04" +authors: + - name: "Author Name" +tags: ["knowledge-graph", "sparql", "linked-data"] +about: + - "https://schema.org/KnowledgeGraph" +summary: "An introduction to knowledge graphs — what they are and why they matter." +canonical_url: "https://example.com/2026/04/what-is-a-knowledge-graph/" +entity_hints: + - label: "Knowledge Graph" + sameAs: "https://www.wikidata.org/wiki/Q33002955" + - label: "SPARQL" + sameAs: "https://www.wikidata.org/wiki/Q54872" + - label: "RDF" + sameAs: "https://www.wikidata.org/wiki/Q54848" +--- +``` + +### Required Fields + +| Field | Type | Purpose | +|-------|------|---------| +| `title` | string | Article title → `schema:name` | +| `date_published` | ISO date | Publication date → `schema:datePublished` | +| `tags` | string[] | Topic tags → `schema:keywords` | + +### Recommended Fields + +| Field | Type | Purpose | +|-------|------|---------| +| `date_modified` | ISO date | Last edit date → `schema:dateModified` | +| `authors` | object[] | Author names → `schema:author` | +| `summary` | string | Description → `schema:description` | +| `canonical_url` | URL | Overrides the path-derived article ID | +| `about` | URL[] | Schema.org type URIs the article is about | +| `entity_hints` | object[] | Pre-identified entities with Wikidata links | + +### Entity Hints + +Entity hints are the highest-leverage frontmatter field. 
They tell the extraction pipeline exactly which entities exist and how to link them to external knowledge bases: + +```yaml +entity_hints: + - label: "Neo4j" + type: "schema:SoftwareApplication" + sameAs: "https://www.wikidata.org/entity/Q7071552" + - label: "Emil Eifrem" + type: "schema:Person" + sameAs: "https://www.wikidata.org/entity/Q5372614" +``` + +**Fields:** +- `label` (required): Exact text as it appears in the article +- `type` (optional): Schema.org type — defaults to `schema:Thing` +- `sameAs` (optional): Wikidata or other canonical URI + +**Finding Wikidata IDs:** Search [wikidata.org](https://www.wikidata.org) for the entity. The ID is in the URL: `https://www.wikidata.org/wiki/Q54872` → `Q54872` for SPARQL. + +## Wikilinks + +Mark important entities with double-bracket wikilinks: + +```markdown +The key standards are [[RDF]] and [[SPARQL]]. +[[Google]] uses the [[Google Knowledge Graph]] to enhance search. +``` + +**Rules:** +- Use the entity's canonical label inside brackets +- Match the label in `entity_hints` when possible +- Don't over-link — mark an entity on first mention in each section +- Common words (the, data, web) should NOT be wikilinked + +## Content Structure for Extraction Quality + +The extraction pipeline chunks articles at H1/H2 heading boundaries (~750 tokens per chunk). 
Structure your content accordingly: + +### DO + +- **Use H1/H2 headings** to organize into clear sections +- **State relationships explicitly**: "Neo4j was created by Emil Eifrem" → extracts `schema:creator` +- **Name entities clearly** on first use before abbreviating +- **One topic per section** — each chunk should be self-contained +- **Front-load key facts** — the first paragraph sets context for the whole article + +### DON'T + +- **Don't use H3+ exclusively** — the chunker splits on H1/H2 only +- **Don't rely on pronouns** across section boundaries — "it" won't extract well +- **Don't embed entities only in code blocks** — the LLM may skip them +- **Don't write walls of text** without headings — creates oversized chunks + +## Relationship Patterns + +The LLM maps natural language to schema.org properties. Write sentences that align with these patterns for high-confidence extraction: + +| Write this... | Extracts as... | +|---------------|---------------| +| "X was created by Y" | `schema:creator` | +| "X was authored by Y" | `schema:author` | +| "This article discusses X" | `schema:mentions` | +| "X is a type of software" | `schema:SoftwareApplication` | +| "X is related to Y" | `kb:relatedTo` (fallback) | + +## Available Entity Types + +Choose from these schema.org types in `entity_hints`: + +| Type | Use for | +|------|---------| +| `schema:Person` | People, authors, creators | +| `schema:Organization` | Companies, teams, foundations | +| `schema:SoftwareApplication` | Software tools, libraries, platforms | +| `schema:CreativeWork` | Books, papers, specifications | +| `schema:Article` | Reserved for the article itself | +| `schema:Thing` | Default — anything that doesn't fit above | + +## Complete Example + +See [references/example-article.md](references/example-article.md) for a fully annotated example article. 
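The authoring conventions above are easy to enforce mechanically before commit. A minimal validator sketch — the function name, rules, and example frontmatter are illustrative, and it assumes the frontmatter has already been parsed into a dict:

```python
import re


def validate_frontmatter(fm: dict) -> list[str]:
    """Return a list of problems found in a parsed frontmatter dict."""
    problems = []
    # Required fields per the frontmatter schema.
    for field in ("title", "date_published", "tags"):
        if not fm.get(field):
            problems.append(f"missing required field: {field}")
    # Tags should be lowercase and hyphenated, e.g. "knowledge-graph".
    for tag in fm.get("tags", []):
        if not re.fullmatch(r"[a-z0-9]+(-[a-z0-9]+)*", tag):
            problems.append(f"tag not lowercase-hyphenated: {tag!r}")
    # Every entity hint needs at least a label.
    for hint in fm.get("entity_hints", []):
        if "label" not in hint:
            problems.append("entity_hint missing required 'label'")
    return problems


fm = {
    "title": "What is a Knowledge Graph?",
    "date_published": "2026-04-04",
    "tags": ["knowledge-graph", "Sparql"],
    "entity_hints": [{"label": "SPARQL", "sameAs": "https://www.wikidata.org/wiki/Q54872"}],
}
print(validate_frontmatter(fm))  # flags the badly cased "Sparql" tag
```

Wiring a check like this into CI keeps low-quality frontmatter from silently degrading extraction quality.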
+ +## Validation Checklist + +Before committing an article: + +- [ ] `title` and `date_published` are set in frontmatter +- [ ] At least 2-3 `entity_hints` with Wikidata `sameAs` links +- [ ] Key entities are `[[wikilinked]]` on first mention +- [ ] Content uses H1/H2 headings for section breaks +- [ ] Relationships are stated explicitly, not implied through pronouns +- [ ] Tags are lowercase, hyphenated (`knowledge-graph`, not `Knowledge Graph`) diff --git a/skills/markdown-ld-authoring/references/example-article.md b/skills/markdown-ld-authoring/references/example-article.md new file mode 100644 index 0000000..c7a553e --- /dev/null +++ b/skills/markdown-ld-authoring/references/example-article.md @@ -0,0 +1,96 @@ +# Example Article + +A complete example showing all authoring conventions. + +```markdown +--- +title: "Graph Databases vs. Triple Stores" +date_published: "2026-04-10" +date_modified: "2026-04-10" +authors: + - name: "Jane Smith" +tags: ["graph-database", "triple-store", "rdf", "neo4j"] +about: + - "https://schema.org/TechArticle" +summary: "A comparison of property graph databases and RDF triple stores, their trade-offs, and when to use each." +canonical_url: "https://example.com/2026/04/graph-databases-vs-triple-stores/" +entity_hints: + - label: "Neo4j" + type: "schema:SoftwareApplication" + sameAs: "https://www.wikidata.org/entity/Q7071552" + - label: "Apache Jena" + type: "schema:SoftwareApplication" + sameAs: "https://www.wikidata.org/entity/Q16939122" + - label: "RDF" + type: "schema:Thing" + sameAs: "https://www.wikidata.org/wiki/Q54848" + - label: "W3C" + type: "schema:Organization" + sameAs: "https://www.wikidata.org/wiki/Q37033" + - label: "Emil Eifrem" + type: "schema:Person" + sameAs: "https://www.wikidata.org/entity/Q5372614" +--- + +# Graph Databases vs. Triple Stores + +There are two dominant approaches to storing graph-structured data: **property graph databases** and **[[RDF]] triple stores**. 
Both model data as graphs, but they differ in data model, query language, and ecosystem. + +## Property Graph Databases + +[[Neo4j]] is the most widely adopted property graph database. It was created by [[Emil Eifrem]] and uses the [[Cypher]] query language. Property graphs allow arbitrary key-value pairs on both nodes and edges, making them flexible for application-level modeling. + +Other property graph databases include [[Amazon Neptune]] (which also supports RDF) and [[TigerGraph]]. + +## RDF Triple Stores + +RDF triple stores like [[Apache Jena]] and [[Blazegraph]] implement the [[W3C]] standards for Linked Data. They use [[SPARQL]] as the query language and represent all data as subject-predicate-object triples. + +The key advantage of triple stores is interoperability: any RDF dataset can be merged with any other without schema mapping, because the data model is universal. + +## When to Use Each + +- Choose a **property graph** when you need flexible schemas and your data stays within one application. +- Choose a **triple store** when you need to integrate data across organizational boundaries or publish Linked Data on the web. 
+``` + +## What This Article Produces + +The extraction pipeline would generate entities like: + +```json +{ + "entities": [ + {"id": "https://example.com/id/neo4j", "type": "schema:SoftwareApplication", "label": "Neo4j", "sameAs": ["https://www.wikidata.org/entity/Q7071552"]}, + {"id": "https://example.com/id/emil-eifrem", "type": "schema:Person", "label": "Emil Eifrem", "sameAs": ["https://www.wikidata.org/entity/Q5372614"]}, + {"id": "https://example.com/id/apache-jena", "type": "schema:SoftwareApplication", "label": "Apache Jena", "sameAs": ["https://www.wikidata.org/entity/Q16939122"]} + ], + "assertions": [ + {"s": "https://example.com/id/neo4j", "p": "schema:creator", "o": "https://example.com/id/emil-eifrem", "confidence": 0.92}, + {"s": "https://example.com/2026/04/graph-databases-vs-triple-stores/", "p": "schema:mentions", "o": "https://example.com/id/neo4j", "confidence": 0.85} + ] +} +``` + +## Anti-Patterns + +### Bad: Vague references across sections + +```markdown +## Section One +Neo4j is a graph database. + +## Section Two +It was created by a Swedish entrepreneur. +``` + +The chunker may split these into separate chunks. The LLM won't know "it" refers to Neo4j. + +### Good: Self-contained sections + +```markdown +## Property Graph Databases +[[Neo4j]] is a graph database created by [[Emil Eifrem]]. +``` + +Explicit subject + relationship in the same chunk = high-confidence extraction. diff --git a/skills/rdf-jsonld-engineer/SKILL.md b/skills/rdf-jsonld-engineer/SKILL.md new file mode 100644 index 0000000..dd4f8a9 --- /dev/null +++ b/skills/rdf-jsonld-engineer/SKILL.md @@ -0,0 +1,214 @@ +--- +name: rdf-jsonld-engineer +description: Design and maintain RDF knowledge graph artifacts for Markdown-LD systems. Covers JSON-LD context authoring, Turtle serialization, entity ID minting, sameAs alignment with Wikidata, and the schema.org type system. Use when working with ontology files, graph outputs, or extending the data model. 
+--- + +# RDF / JSON-LD Engineering + +Maintain the Linked Data layer of a Markdown-LD knowledge bank. The system produces two serialization formats per article: JSON-LD (for web consumption) and Turtle (for SPARQL/RDFLib). + +## JSON-LD Context + +The shared context lives at `ontology/context.jsonld` and defines how JSON keys map to RDF predicates: + +```json +{ + "@context": { + "schema": "https://schema.org/", + "prov": "http://www.w3.org/ns/prov#", + "kb": "https://example.com/vocab/kb#", + "id": "@id", + "type": "@type", + "confidence": { + "@id": "kb:confidence", + "@type": "http://www.w3.org/2001/XMLSchema#decimal" + }, + "source": { + "@id": "prov:wasDerivedFrom", + "@type": "@id" + }, + "relatedTo": { + "@id": "kb:relatedTo", + "@type": "@id" + } + } +} +``` + +### Design Principles + +1. **Use `@id` aliasing** — `"id": "@id"` and `"type": "@type"` make JSON-LD look like normal JSON +2. **Type coercion** — Declare `@type` on properties to avoid ambiguity (e.g., `confidence` is always `xsd:decimal`) +3. **Namespace prefixes** — Declare `schema`, `prov`, `kb` as prefix shortcuts +4. 
**ID-typed references** — Properties pointing to other nodes use `"@type": "@id"`
+
+### Extending the Context
+
+When adding new properties:
+
+```json
+{
+  "@context": {
+    "newProperty": {
+      "@id": "schema:newProperty",
+      "@type": "http://www.w3.org/2001/XMLSchema#string"
+    }
+  }
+}
+```
+
+**Rules:**
+- Prefer schema.org properties when one exists
+- Only add `kb:` properties for concepts schema.org doesn't cover
+- Always declare `@type` for datatype properties
+- Use `"@type": "@id"` for object properties (references to other nodes)
+
+## JSON-LD Graph Structure
+
+Each article produces a JSON-LD document with `@graph`:
+
+```json
+{
+  "@context": "../../ontology/context.jsonld",
+  "@graph": [
+    {
+      "id": "https://example.com/2026/04/my-article/",
+      "type": "schema:Article",
+      "schema:name": "My Article",
+      "schema:datePublished": "2026-04-04",
+      "schema:keywords": "knowledge-graph, rdf"
+    },
+    {
+      "id": "https://example.com/id/neo4j",
+      "type": "schema:SoftwareApplication",
+      "schema:name": "Neo4j",
+      "schema:sameAs": "https://www.wikidata.org/entity/Q7071552"
+    },
+    {
+      "type": "kb:Assertion",
+      "schema:subjectOf": "https://example.com/2026/04/my-article/",
+      "schema:mentions": "https://example.com/id/neo4j",
+      "confidence": 0.85,
+      "source": "urn:kb:chunk:..."
+    }
+  ]
+}
+```
+
+### Graph Node Types
+
+1. **Article node** — One per document, typed `schema:Article`
+2. **Entity nodes** — Extracted entities with `schema:name` and optional `schema:sameAs`
+3. **Assertion nodes** — Reified statements with provenance and confidence
+
+## Turtle Serialization
+
+The same data in Turtle format for RDFLib/SPARQL:
+
+```turtle
+@prefix schema: <https://schema.org/> .
+@prefix prov: <http://www.w3.org/ns/prov#> .
+@prefix kb: <https://example.com/vocab/kb#> .
+@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
+
+<https://example.com/2026/04/my-article/> a schema:Article ;
+    schema:name "My Article" ;
+    schema:datePublished "2026-04-04"^^xsd:date .
+
+<https://example.com/id/neo4j> a schema:SoftwareApplication ;
+    schema:name "Neo4j" ;
+    schema:sameAs <https://www.wikidata.org/entity/Q7071552> .
+```
+
+### Turtle Escaping
+
+Special characters in string literals must be escaped:
+
+| Character | Escape |
+|-----------|--------|
+| `\` | `\\` |
+| `"` | `\"` |
+| newline | `\n` |
+
+## Entity ID Minting
+
+All entities get deterministic, stable URIs:
+
+```
+{base_url}/id/{slug(label)}
+```
+
+The `slugify` function:
+1. Lowercase the label
+2. Remove characters not in `[a-z0-9\s-]`
+3. Replace spaces/underscores with hyphens
+4. Collapse consecutive hyphens
+5. Strip leading/trailing hyphens
+
+**Examples:**
+- `"JSON-LD 1.1"` → `json-ld-1-1`
+- `"W3C"` → `w3c`
+- `"Apache Jena"` → `apache-jena`
+
+### Stability Guarantee
+
+The same label always produces the same slug and therefore the same URI. This is critical for deduplication across chunks and builds.
+
+## sameAs Alignment
+
+Use `schema:sameAs` to link entities to canonical external identifiers:
+
+```turtle
+<https://example.com/id/neo4j> schema:sameAs <https://www.wikidata.org/entity/Q7071552> .
+```
+
+**Priority sources:**
+1. Wikidata (`wikidata.org/wiki/Q...` or `wikidata.org/entity/Q...`)
+2. DBpedia (`dbpedia.org/resource/...`)
+3. Official website URLs
+
+**When merging entities**, the system takes the union of all `sameAs` URIs.
+
+## Entity Canonicalization
+
+When the same entity appears across multiple chunks (possibly with different labels or types), the post-processor merges them:
+
+1. **Slug match** — Two entities with the same slug are merged
+2. **sameAs match** — Entities sharing a `sameAs` URI are merged
+3. **Type priority** — The most specific type wins (Person > Thing)
+4. **Label preservation** — The first-seen label is kept
+5. 
**sameAs union** — All `sameAs` URIs are collected + +## Custom Vocabulary (`ontology/kb.ttl`) + +The `kb:` namespace defines properties not covered by schema.org: + +| Property | Description | +|----------|-------------| +| `kb:confidence` | Extraction confidence [0, 1] | +| `kb:relatedTo` | Fallback relation between entities | +| `kb:chunk` | Source chunk URI | +| `kb:docPath` | Source file path | +| `kb:charStart` | Chunk start offset | +| `kb:charEnd` | Chunk end offset | + +### Adding New Custom Properties + +1. Add the RDF definition to `ontology/kb.ttl`: + ```turtle + kb:newProp a rdf:Property ; + rdfs:label "new property" ; + rdfs:comment "Description of what this property represents." ; + rdfs:domain schema:Thing ; + rdfs:range xsd:string . + ``` + +2. Add it to `ontology/context.jsonld` for JSON-LD support + +3. Add a SHACL shape in `ontology/shapes.ttl` if the property has constraints + +4. Update the extraction prompt in `tools/prompts/` to teach the LLM about it + +## Reference + +See [references/context-design.md](references/context-design.md) for JSON-LD context design patterns. diff --git a/skills/rdf-jsonld-engineer/references/context-design.md b/skills/rdf-jsonld-engineer/references/context-design.md new file mode 100644 index 0000000..a87cdf5 --- /dev/null +++ b/skills/rdf-jsonld-engineer/references/context-design.md @@ -0,0 +1,113 @@ +# JSON-LD Context Design Patterns + +## Pattern 1: Aliasing @id and @type + +Make JSON-LD look like normal JSON by aliasing the keywords: + +```json +{ + "@context": { + "id": "@id", + "type": "@type" + } +} +``` + +This lets consumers write `"id": "https://..."` instead of `"@id": "https://..."`. + +## Pattern 2: Namespace Prefixes + +Define namespace prefixes to shorten URIs: + +```json +{ + "@context": { + "schema": "https://schema.org/", + "kb": "https://example.com/vocab/kb#" + } +} +``` + +Then use `"schema:name"` instead of `"https://schema.org/name"`. 
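What a JSON-LD processor does with these prefixes can be shown with a toy expansion function. This is a simplification for illustration only; real processing should use a library such as RDFLib or PyLD:

```python
# A minimal slice of the shared context: prefix → namespace URI.
context = {
    "schema": "https://schema.org/",
    "kb": "https://example.com/vocab/kb#",
}

def expand(term: str, ctx: dict) -> str:
    """Expand a compact IRI like 'schema:name' to a full IRI."""
    prefix, sep, local = term.partition(":")
    if sep and prefix in ctx:
        return ctx[prefix] + local
    return term  # unknown prefix or plain key: leave unchanged

print(expand("schema:name", context))    # https://schema.org/name
print(expand("kb:confidence", context))  # https://example.com/vocab/kb#confidence
```

Compaction is the inverse operation: a processor rewrites full IRIs back to the shortest matching prefixed form.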
+ +## Pattern 3: Typed Properties + +Declare the datatype of properties to avoid ambiguity: + +```json +{ + "@context": { + "confidence": { + "@id": "kb:confidence", + "@type": "http://www.w3.org/2001/XMLSchema#decimal" + }, + "datePublished": { + "@id": "schema:datePublished", + "@type": "http://www.w3.org/2001/XMLSchema#date" + } + } +} +``` + +## Pattern 4: Object Properties (References) + +Properties that point to other nodes should use `"@type": "@id"`: + +```json +{ + "@context": { + "source": { + "@id": "prov:wasDerivedFrom", + "@type": "@id" + }, + "relatedTo": { + "@id": "kb:relatedTo", + "@type": "@id" + } + } +} +``` + +This tells JSON-LD processors that the value is a URI reference, not a string literal. + +## Pattern 5: @graph for Multiple Nodes + +Use `@graph` when a document contains multiple entities: + +```json +{ + "@context": "context.jsonld", + "@graph": [ + {"id": "...", "type": "schema:Article", "schema:name": "..."}, + {"id": "...", "type": "schema:Person", "schema:name": "..."} + ] +} +``` + +## Pattern 6: External Context Reference + +Reference a shared context file instead of inlining: + +```json +{ + "@context": "../../ontology/context.jsonld", + "@graph": [...] +} +``` + +This keeps individual graph documents small and consistent. 
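When a consumer cannot resolve relative references, one option is to inline the shared context before handing the document to a processor. A minimal sketch, assuming the layout implied by the relative reference; `inline_context` and the file names are illustrative, not part of the toolchain:

```python
import json
from pathlib import Path
from tempfile import TemporaryDirectory

def inline_context(doc_path: Path) -> dict:
    """Replace a string @context reference with the referenced
    file's @context object, resolved relative to the document."""
    doc = json.loads(doc_path.read_text())
    ctx = doc.get("@context")
    if isinstance(ctx, str):
        shared = json.loads((doc_path.parent / ctx).read_text())
        doc["@context"] = shared["@context"]
    return doc

with TemporaryDirectory() as tmp:
    root = Path(tmp)
    # Shared context file, as in Pattern 6.
    (root / "context.jsonld").write_text(json.dumps(
        {"@context": {"schema": "https://schema.org/", "id": "@id"}}))
    # Graph document that references it by relative path.
    (root / "article.jsonld").write_text(json.dumps(
        {"@context": "context.jsonld",
         "@graph": [{"id": "https://example.com/2026/04/my-article/",
                     "schema:name": "My Article"}]}))
    merged = inline_context(root / "article.jsonld")
    print(sorted(merged["@context"]))  # ['id', 'schema']
```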
+ +## Common Mistakes + +| Mistake | Impact | Fix | +|---------|--------|-----| +| Missing `@type` on numeric properties | Values treated as strings | Add explicit `@type` declaration | +| Using string values for object properties | URIs become literals | Add `"@type": "@id"` | +| Inlining context in every file | Inconsistency risk | Use external `@context` reference | +| Forgetting prefix declarations | Properties become full URIs | Declare all used namespaces | + +## Validation + +Test your JSON-LD with: +- [JSON-LD Playground](https://json-ld.org/playground/) — expand, compact, flatten +- RDFLib: `Graph().parse(data=json_str, format="json-ld")` diff --git a/skills/shacl-shape-designer/SKILL.md b/skills/shacl-shape-designer/SKILL.md new file mode 100644 index 0000000..957da8e --- /dev/null +++ b/skills/shacl-shape-designer/SKILL.md @@ -0,0 +1,256 @@ +--- +name: shacl-shape-designer +description: Write and maintain SHACL (Shapes Constraint Language) shapes for validating RDF knowledge graphs. Covers NodeShape design, property constraints, target declarations, and pySHACL integration. Use when adding new entity types, extending the ontology, or debugging validation failures in a Markdown-LD knowledge bank. +--- + +# SHACL Shape Designer + +Write [W3C SHACL](https://www.w3.org/TR/shacl/) shapes that validate the RDF knowledge graph produced by the extraction pipeline. Shapes live in `ontology/shapes.ttl` and are checked by pySHACL during the build. + +## Existing Shapes + +The KB ships with three foundational shapes: + +### ArticleShape +Every `schema:Article` must have a name and publication date: + +```turtle + a sh:NodeShape ; + sh:targetClass schema:Article ; + sh:property [ + sh:path schema:name ; + sh:minCount 1 ; + sh:datatype xsd:string ; + sh:message "Every Article must have a schema:name." ; + ] ; + sh:property [ + sh:path schema:datePublished ; + sh:minCount 1 ; + sh:message "Every Article must have a schema:datePublished." ; + ] . 
+``` + +### EntityShape +Every `schema:Thing` must have a label: + +```turtle + a sh:NodeShape ; + sh:targetClass schema:Thing ; + sh:property [ + sh:path schema:name ; + sh:minCount 1 ; + sh:message "Every entity must have a schema:name (label)." ; + ] . +``` + +### ConfidenceShape +Confidence values must be in [0, 1]: + +```turtle + a sh:NodeShape ; + sh:targetSubjectsOf kb:confidence ; + sh:property [ + sh:path kb:confidence ; + sh:minInclusive 0.0 ; + sh:maxInclusive 1.0 ; + sh:datatype xsd:decimal ; + sh:message "kb:confidence must be a decimal in [0, 1]." ; + ] . +``` + +## Writing New Shapes + +### Step 1: Choose a Target + +SHACL shapes need a target — what nodes they apply to: + +```turtle +# Target by class +sh:targetClass schema:Person ; + +# Target by subjects of a property +sh:targetSubjectsOf schema:author ; + +# Target by objects of a property +sh:targetObjectsOf schema:mentions ; + +# Target specific node +sh:targetNode ; +``` + +### Step 2: Define Property Constraints + +Common constraint types: + +```turtle +sh:property [ + sh:path schema:name ; + + # Cardinality + sh:minCount 1 ; # required + sh:maxCount 1 ; # at most one + + # Datatype + sh:datatype xsd:string ; + sh:datatype xsd:date ; + sh:datatype xsd:decimal ; + + # Value range (for numerics) + sh:minInclusive 0.0 ; + sh:maxInclusive 1.0 ; + + # String constraints + sh:minLength 1 ; + sh:maxLength 500 ; + sh:pattern "^https?://" ; + + # Node type (for object properties) + sh:nodeKind sh:IRI ; # must be a URI + sh:nodeKind sh:Literal ; # must be a literal + + # Class constraint (value must be instance of) + sh:class schema:Person ; + + # Error message + sh:message "Human-readable error description." ; +] ; +``` + +### Step 3: Name the Shape + +Use the `urn:kb:shape:` namespace with a descriptive PascalCase name: + +```turtle + a sh:NodeShape ; + ... 
+```
+
+## Example: Adding a PersonShape
+
+When adding `schema:Person` to the KB, create a shape ensuring every person has a name:
+
+```turtle
+<urn:kb:shape:PersonShape> a sh:NodeShape ;
+    sh:targetClass schema:Person ;
+    sh:property [
+        sh:path schema:name ;
+        sh:minCount 1 ;
+        sh:datatype xsd:string ;
+        sh:message "Every Person must have a schema:name." ;
+    ] ;
+    sh:property [
+        sh:path schema:sameAs ;
+        sh:nodeKind sh:IRI ;
+        sh:message "schema:sameAs must be an IRI (URI), not a literal." ;
+    ] .
+```
+
+## Example: SoftwareApplicationShape
+
+```turtle
+<urn:kb:shape:SoftwareApplicationShape> a sh:NodeShape ;
+    sh:targetClass schema:SoftwareApplication ;
+    sh:property [
+        sh:path schema:name ;
+        sh:minCount 1 ;
+        sh:datatype xsd:string ;
+        sh:message "Every SoftwareApplication must have a schema:name." ;
+    ] .
+```
+
+## Example: AssertionShape
+
+Validate that assertions have required provenance:
+
+```turtle
+<urn:kb:shape:AssertionShape> a sh:NodeShape ;
+    sh:targetSubjectsOf kb:confidence ;
+    sh:property [
+        sh:path kb:confidence ;
+        sh:minCount 1 ;
+        sh:maxCount 1 ;
+        sh:datatype xsd:decimal ;
+        sh:minInclusive 0.0 ;
+        sh:maxInclusive 1.0 ;
+        sh:message "Assertions must have exactly one kb:confidence in [0, 1]." ;
+    ] ;
+    sh:property [
+        sh:path prov:wasDerivedFrom ;
+        sh:minCount 1 ;
+        sh:nodeKind sh:IRI ;
+        sh:message "Assertions must have prov:wasDerivedFrom pointing to a chunk URI." ;
+    ] .
+```
+
+## Prefixes Required
+
+Always include these prefixes in `shapes.ttl`:
+
+```turtle
+@prefix sh: <http://www.w3.org/ns/shacl#> .
+@prefix schema: <https://schema.org/> .
+@prefix kb: <https://example.com/vocab/kb#> .
+@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
+@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
+@prefix prov: <http://www.w3.org/ns/prov#> .
+``` + +## Running Validation + +### With pySHACL (Python) + +```python +from pyshacl import validate +from rdflib import Graph + +data_graph = Graph() +data_graph.parse("graph/articles/my-article.ttl", format="turtle") + +shapes_graph = Graph() +shapes_graph.parse("ontology/shapes.ttl", format="turtle") + +conforms, results_graph, results_text = validate( + data_graph, + shacl_graph=shapes_graph, + inference="rdfs", +) + +if not conforms: + print(results_text) +``` + +### In the CI Pipeline + +The build pipeline runs pySHACL automatically. Validation failures block the build and report which shapes were violated. + +### From Tests + +```bash +uv run pytest tests/test_shacl.py -v +``` + +## Debugging Validation Failures + +When a shape violation occurs, the error includes: + +1. **Focus node** — The entity that violated the constraint +2. **Path** — Which property was invalid +3. **Message** — The `sh:message` you defined +4. **Value** — The actual value that caused the failure + +Common causes: +- Missing `schema:name` on extracted entities → check LLM prompt +- Confidence out of [0, 1] → check post-processing +- `sameAs` as a string literal instead of IRI → check entity_hints format + +## Design Guidelines + +1. **One shape per class** — Keep shapes focused and independent +2. **Always add `sh:message`** — Makes debugging much faster +3. **Start minimal** — Require only essential properties (name, type) +4. **Use `sh:severity`** for warnings vs. errors: + ```turtle + sh:severity sh:Warning ; # non-blocking + sh:severity sh:Violation ; # blocks build (default) + ``` +5. 
**Test shapes before committing** — Run against existing graph data
diff --git a/skills/shacl-shape-designer/references/existing-shapes.md b/skills/shacl-shape-designer/references/existing-shapes.md
new file mode 100644
index 0000000..fc9445e
--- /dev/null
+++ b/skills/shacl-shape-designer/references/existing-shapes.md
@@ -0,0 +1,68 @@
+# Existing Shapes Reference
+
+The current shapes in `ontology/shapes.ttl`:
+
+```turtle
+@prefix sh: <http://www.w3.org/ns/shacl#> .
+@prefix schema: <https://schema.org/> .
+@prefix kb: <https://example.com/vocab/kb#> .
+@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
+
+# Every schema:Article must have a name and datePublished
+<urn:kb:shape:ArticleShape> a sh:NodeShape ;
+    sh:targetClass schema:Article ;
+    sh:property [
+        sh:path schema:name ;
+        sh:minCount 1 ;
+        sh:datatype xsd:string ;
+        sh:message "Every Article must have a schema:name." ;
+    ] ;
+    sh:property [
+        sh:path schema:datePublished ;
+        sh:minCount 1 ;
+        sh:message "Every Article must have a schema:datePublished." ;
+    ] .
+
+# Every entity must have a label
+<urn:kb:shape:EntityShape> a sh:NodeShape ;
+    sh:targetClass schema:Thing ;
+    sh:property [
+        sh:path schema:name ;
+        sh:minCount 1 ;
+        sh:message "Every entity must have a schema:name (label)." ;
+    ] .
+
+# Confidence must be in [0, 1]
+<urn:kb:shape:ConfidenceShape> a sh:NodeShape ;
+    sh:targetSubjectsOf kb:confidence ;
+    sh:property [
+        sh:path kb:confidence ;
+        sh:minInclusive 0.0 ;
+        sh:maxInclusive 1.0 ;
+        sh:datatype xsd:decimal ;
+        sh:message "kb:confidence must be a decimal in [0, 1]." ;
+    ] .
+```
+
+## Shape Coverage Matrix
+
+| Entity Type | Shape Exists? | Validates |
+|-------------|:------------:|-----------|
+| `schema:Article` | ✅ | `schema:name`, `schema:datePublished` |
+| `schema:Thing` | ✅ | `schema:name` |
+| `schema:Person` | ❌ | — |
+| `schema:Organization` | ❌ | — |
+| `schema:SoftwareApplication` | ❌ | — |
+| `schema:CreativeWork` | ❌ | — |
+| Assertions (`kb:confidence`) | ✅ | Value in [0, 1] |
+| Provenance (`prov:wasDerivedFrom`) | ❌ | — |
+
+## Recommended New Shapes
+
+Priority shapes to add:
+
+1. **PersonShape** — Validate `schema:Person` has `schema:name`
+2. 
**OrganizationShape** — Validate `schema:Organization` has `schema:name`
+3. **SoftwareApplicationShape** — Validate `schema:SoftwareApplication` has `schema:name`
+4. **AssertionShape** — Validate assertions have both `kb:confidence` and `prov:wasDerivedFrom`
+5. **SameAsShape** — Validate that `schema:sameAs` values are IRIs, not literals
diff --git a/skills/sparql-query-writer/SKILL.md b/skills/sparql-query-writer/SKILL.md
new file mode 100644
index 0000000..91cdbc7
--- /dev/null
+++ b/skills/sparql-query-writer/SKILL.md
@@ -0,0 +1,219 @@
+---
+name: sparql-query-writer
+description: Write SPARQL 1.1 queries against schema.org-based knowledge graphs built from Markdown-LD content. Covers the available vocabulary (classes, properties, prefixes), common query patterns, case-insensitive matching, safety constraints, and result interpretation. Use when querying a Markdown-LD knowledge bank's SPARQL endpoint or helping users formulate queries.
+---
+
+# SPARQL Query Writer
+
+Write SPARQL queries for knowledge graphs built from Markdown articles using schema.org vocabulary. The KB stores articles, entities, and their relationships as RDF triples.
+
+## Prefixes
+
+Always declare the prefixes you use:
+
+```sparql
+PREFIX schema: <https://schema.org/>
+PREFIX kb: <https://example.com/vocab/kb#>
+PREFIX prov: <http://www.w3.org/ns/prov#>
+PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
+PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
+```
+
+## Available Classes
+
+| Class | Description |
+|-------|-------------|
+| `schema:Article` | A Markdown article in the KB |
+| `schema:Person` | A person (author, creator, etc.)
|
+| `schema:Organization` | A company, foundation, or team |
+| `schema:SoftwareApplication` | Software tools, libraries, platforms |
+| `schema:CreativeWork` | Books, papers, specifications |
+| `schema:Thing` | Base type for any entity |
+
+## Available Properties
+
+| Property | Domain → Range | Description |
+|----------|---------------|-------------|
+| `schema:name` | Any → `xsd:string` | Label/title of any entity or article |
+| `schema:mentions` | Article → Thing | Article mentions an entity |
+| `schema:about` | Article → Thing | Article is about a topic |
+| `schema:author` | Article → Person | Author of an article |
+| `schema:creator` | Thing → Person/Org | Creator of a thing |
+| `schema:datePublished` | Article → `xsd:date` | Publication date |
+| `schema:dateModified` | Article → `xsd:date` | Last modified date |
+| `schema:sameAs` | Thing → URI | Link to external identity (Wikidata) |
+| `schema:keywords` | Article → `xsd:string` | Comma-separated tags |
+| `schema:description` | Article → `xsd:string` | Summary text |
+| `kb:relatedTo` | Thing → Thing | Fallback relation |
+| `kb:confidence` | Assertion → `xsd:decimal` | Extraction confidence 0–1 |
+| `prov:wasDerivedFrom` | Assertion → URI | Provenance: source chunk |
+
+## Core Query Patterns
+
+### List all entities (non-articles)
+
+```sparql
+PREFIX schema: <https://schema.org/>
+SELECT DISTINCT ?entity ?name ?type WHERE {
+  ?entity a ?type ;
+          schema:name ?name .
+  FILTER(?type != schema:Article)
+}
+LIMIT 100
+```
+
+### Find articles mentioning an entity
+
+```sparql
+PREFIX schema: <https://schema.org/>
+SELECT ?article ?title WHERE {
+  ?article a schema:Article ;
+           schema:name ?title ;
+           schema:mentions ?entity .
+  ?entity schema:name ?entityName .
+  FILTER(LCASE(STR(?entityName)) = "sparql")
+}
+LIMIT 100
+```
+
+### Find all entities of a specific type
+
+```sparql
+PREFIX schema: <https://schema.org/>
+SELECT ?entity ?name WHERE {
+  ?entity a schema:Organization ;
+          schema:name ?name .
+}
+LIMIT 100
+```
+
+### Find topics covered by an article
+
+```sparql
+PREFIX schema: <https://schema.org/>
+SELECT ?topic ?topicName WHERE {
+  ?article a schema:Article ;
+           schema:name ?title ;
+           schema:mentions ?topic .
+  ?topic schema:name ?topicName .
+  FILTER(CONTAINS(LCASE(STR(?title)), "knowledge graph"))
+}
+LIMIT 100
+```
+
+### Find connections between entities
+
+```sparql
+PREFIX schema: <https://schema.org/>
+PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
+SELECT ?subject ?predicate ?object WHERE {
+  ?subject ?predicate ?object .
+  FILTER(?predicate != rdf:type)
+}
+LIMIT 50
+```
+
+### Find entities with Wikidata links
+
+```sparql
+PREFIX schema: <https://schema.org/>
+SELECT ?entity ?name ?wikidata WHERE {
+  ?entity schema:name ?name ;
+          schema:sameAs ?wikidata .
+  FILTER(CONTAINS(STR(?wikidata), "wikidata.org"))
+}
+LIMIT 100
+```
+
+## Case-Insensitive Matching
+
+Entity names are stored as plain strings. Always match case-insensitively:
+
+```sparql
+# Exact match (case-insensitive)
+FILTER(LCASE(STR(?name)) = "neo4j")
+
+# Contains match (case-insensitive)
+FILTER(CONTAINS(LCASE(STR(?name)), "knowledge"))
+
+# Regex match
+FILTER(REGEX(?name, "graph", "i"))
+```
+
+**Prefer `LCASE` + exact match** over REGEX for performance.
+
+## Safety Constraints
+
+The endpoint enforces these rules:
+
+1. **Only `SELECT` and `ASK` queries** — `INSERT`, `DELETE`, `LOAD`, `CLEAR`, `DROP`, `CREATE` are blocked
+2. **Always include `LIMIT`** — default to `LIMIT 100` unless the user asks for all results
+3. 
**No mutating operations** — the graph is read-only at query time + +## Using the Endpoint + +### GET request + +``` +/api/sparql?query={url-encoded-sparql} +``` + +### POST request (preferred for complex queries) + +``` +POST /api/sparql +Content-Type: application/sparql-query + +{raw SPARQL query} +``` + +### Response format + +The endpoint returns [SPARQL Results JSON](https://www.w3.org/TR/sparql11-results-json/): + +```json +{ + "head": { "vars": ["entity", "name"] }, + "results": { + "bindings": [ + { + "entity": { "type": "uri", "value": "https://example.com/id/neo4j" }, + "name": { "type": "literal", "value": "Neo4j" } + } + ] + } +} +``` + +## Natural Language Alternative + +Don't know SPARQL? Use the `/api/ask` endpoint: + +``` +POST /api/ask +Content-Type: application/json +{"question": "What entities are in the knowledge graph?"} +``` + +The response includes the generated SPARQL so you can learn the query language: + +```json +{ + "question": "What entities are in the knowledge graph?", + "sparql": "PREFIX schema: ...", + "results": { ... } +} +``` + +## Common Mistakes + +| Mistake | Fix | +|---------|-----| +| Forgetting `PREFIX` declarations | Always include prefixes you reference | +| Case-sensitive name matching | Use `LCASE(STR(?name))` | +| Missing `LIMIT` clause | Default to `LIMIT 100` | +| Inventing predicates | Only use properties from the table above | +| Using `INSERT`/`DELETE` | The endpoint is read-only | + +## Reference + +See [references/vocabulary.md](references/vocabulary.md) for the complete vocabulary definition. diff --git a/skills/sparql-query-writer/references/vocabulary.md b/skills/sparql-query-writer/references/vocabulary.md new file mode 100644 index 0000000..c4ce5e8 --- /dev/null +++ b/skills/sparql-query-writer/references/vocabulary.md @@ -0,0 +1,95 @@ +# Vocabulary Reference + +Complete vocabulary used in the Markdown-LD knowledge bank. 
+ +## Namespace Prefixes + +| Prefix | URI | Usage | +|--------|-----|-------| +| `schema:` | `https://schema.org/` | Primary vocabulary for types and properties | +| `kb:` | `https://example.com/vocab/kb#` | Custom KB properties (confidence, relatedTo) | +| `prov:` | `http://www.w3.org/ns/prov#` | Provenance tracking | +| `rdf:` | `http://www.w3.org/1999/02/22-rdf-syntax-ns#` | RDF core | +| `rdfs:` | `http://www.w3.org/2000/01/rdf-schema#` | RDF Schema | +| `xsd:` | `http://www.w3.org/2001/XMLSchema#` | XML Schema datatypes | +| `sh:` | `http://www.w3.org/ns/shacl#` | SHACL validation shapes | + +## Custom Vocabulary (`kb:`) + +Defined in `ontology/kb.ttl`: + +### `kb:confidence` +- Type: `rdf:Property` +- Range: `xsd:decimal` +- Description: Extractor confidence in [0, 1] +- Usage: Attached to reified assertions to indicate how explicitly the text states the relation + +### `kb:relatedTo` +- Type: `rdf:Property` +- Domain: `schema:Thing` +- Range: `schema:Thing` +- Description: Fallback relation when no schema.org property fits + +### `kb:chunk` +- Type: `rdf:Property` +- Description: Chunk that produced an assertion + +### `kb:docPath` +- Type: `rdf:Property` +- Description: Relative file path of the source Markdown document + +### `kb:charStart` / `kb:charEnd` +- Type: `rdf:Property` +- Range: `xsd:integer` +- Description: Character offsets of the chunk in the source document + +## Entity ID Convention + +All entity IDs follow the pattern: + +``` +{base_url}/id/{slug} +``` + +Where `slug` is derived from the entity label: +- Lowercase +- Replace spaces and underscores with hyphens +- Remove special characters +- Collapse consecutive hyphens + +Examples: +- "Neo4j" → `https://example.com/id/neo4j` +- "JSON-LD 1.1" → `https://example.com/id/json-ld-1-1` +- "Emil Eifrem" → `https://example.com/id/emil-eifrem` + +## Article ID Convention + +Article IDs are derived from their file path or `canonical_url` frontmatter: + +``` +content/2026/04/my-article.md → 
https://example.com/2026/04/my-article/ +``` + +If `canonical_url` is set in frontmatter, it takes precedence over the path-derived ID. + +## Type Hierarchy + +``` +schema:Thing +├── schema:Person +├── schema:Organization +├── schema:SoftwareApplication +├── schema:CreativeWork +│ └── schema:Article +└── (other schema.org types as needed) +``` + +The extraction pipeline uses a type priority system when merging duplicate entities: +- `schema:Person` (priority 5) +- `schema:Organization` (priority 5) +- `schema:SoftwareApplication` (priority 5) +- `schema:CreativeWork` (priority 4) +- `schema:Article` (priority 4) +- `schema:Thing` (priority 1 — default) + +When the same entity appears with different types across chunks, the highest-priority type wins.
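The priority rule above can be sketched in a few lines of Python. The tie-break (first-seen wins) is an assumption consistent with the label-preservation rule, not something the pipeline documents explicitly:

```python
# Priorities as listed in the type hierarchy section.
TYPE_PRIORITY = {
    "schema:Person": 5,
    "schema:Organization": 5,
    "schema:SoftwareApplication": 5,
    "schema:CreativeWork": 4,
    "schema:Article": 4,
    "schema:Thing": 1,
}

def merge_types(types_seen: list[str]) -> str:
    """Pick the winning type for a duplicated entity: highest
    priority wins; unknown types default to priority 1 (Thing-level).
    max() keeps the first maximal element, so ties keep first-seen."""
    return max(types_seen, key=lambda t: TYPE_PRIORITY.get(t, 1))

print(merge_types(["schema:Thing", "schema:Person"]))          # schema:Person
print(merge_types(["schema:CreativeWork", "schema:Article"]))  # schema:CreativeWork
```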