Commit f187aa4

Add kreuzberg MCP shared workflow (#28392)
1 parent f75e47d commit f187aa4

1 file changed

Lines changed: 88 additions & 0 deletions

@@ -0,0 +1,88 @@
---
mcp-servers:
  kreuzberg:
    container: "ghcr.io/kreuzberg-dev/kreuzberg"
    version: "latest"
    entrypointArgs:
      - "mcp"
    mounts:
      - ${GITHUB_WORKSPACE}:${GITHUB_WORKSPACE}:ro
    allowed:
      # Document extraction tools (read-only)
      - "extract_file"
      - "extract_bytes"
      - "batch_extract_files"
      # Format discovery tools (read-only)
      - "detect_mime_type"
      - "list_formats"
      - "get_version"
      # Text processing tools (read-only)
      - "embed_text"
      - "chunk_text"
      # Cache inspection tools (read-only)
      - "cache_stats"
      - "cache_manifest"
      # Excluded write/mutating operations:
      # - "cache_clear" # Evicts all cached results
      # - "cache_warm" # Pre-downloads embedding models
      # Excluded feature-flag-gated operations:
      # - "extract_structured" # Requires liter-llm feature flag at build time
---
<!--
## Kreuzberg MCP Server

Kreuzberg is a polyglot document intelligence engine. The MCP server exposes its
full extraction engine as 13 discoverable tools, communicating over stdin/stdout
with JSON-RPC 2.0. It supports 97+ file formats including PDF, DOCX, PPTX,
images (with Tesseract OCR), and legacy Office formats (with LibreOffice in the
full image).

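As a rough illustration of the JSON-RPC 2.0 exchange mentioned above, the sketch
below builds the request an MCP client would write to the server's stdin to
invoke one tool. It assumes the standard MCP `tools/call` method and the
newline-delimited stdio framing. The helper name `make_tool_call` and the file
path are hypothetical, and a real session would first complete the MCP
`initialize` handshake, which is not shown.

```python
import json


def make_tool_call(request_id: int, tool: str, arguments: dict) -> str:
    """Serialize one JSON-RPC 2.0 request asking an MCP server to run a tool."""
    request = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",  # MCP method name for invoking a tool
        "params": {"name": tool, "arguments": arguments},
    }
    # The stdio transport carries one JSON message per line.
    return json.dumps(request) + "\n"


# Hypothetical path; `extract_file` and its `path` parameter come from the tool table.
line = make_tool_call(1, "extract_file", {"path": "/workspace/document.pdf"})
print(line, end="")
```

In agentic workflows the MCP client normally produces these messages itself, so
this is only meant to show what travels over the pipe.
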
Documentation: https://docs.kreuzberg.dev/guides/docker/
MCP integration guide: https://docs.kreuzberg.dev/guides/mcp-integration/
GitHub: https://github.com/kreuzberg-dev/kreuzberg

### Container images

Two images are available (both on `ghcr.io/kreuzberg-dev/kreuzberg`):
- **Core** (~1.0–1.3 GB): Modern formats, Tesseract OCR (12 languages)
- **Full** (~1.5–2.1 GB): Adds LibreOffice for legacy `.doc`/`.ppt` files
Use tag `full` or `latest-full` to select the full image.

### Required secrets

None — no API token is required.

### Available tools (read-only)

| Tool | Params | Description |
|---|---|---|
| `extract_file` | `path` | Extract text and metadata from a local file |
| `extract_bytes` | `data` (base64) | Extract from base64-encoded file content |
| `batch_extract_files` | `paths` | Extract multiple files in one call |
| `detect_mime_type` | `path` | Identify a file's MIME type |
| `list_formats` | — | List all supported file formats |
| `get_version` | — | Return the library version string |
| `embed_text` | `texts` | Generate embedding vectors for text chunks |
| `chunk_text` | `text` | Split text into overlapping chunks |
| `cache_stats` | — | Report how much content is cached |
| `cache_manifest` | — | Return model checksums |

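Since `extract_bytes` takes base64-encoded content, a caller has to encode the
raw bytes before passing them as `data`. The sketch below shows one way to build
that argument object. It assumes the standard base64 alphabet, and the helper
name `extract_bytes_arguments` is hypothetical. The table lists only `data`, so
any further parameters the real tool may accept or require (a MIME type, for
instance) are not covered here.

```python
import base64


def extract_bytes_arguments(content: bytes) -> dict:
    """Build the arguments object for `extract_bytes` from raw file bytes."""
    # Assumption: the server expects standard (RFC 4648) base64 text in `data`.
    return {"data": base64.b64encode(content).decode("ascii")}


# Placeholder bytes stand in for a real document read from disk.
args = extract_bytes_arguments(b"%PDF-1.7 example bytes")
print(args["data"])
```
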
### Excluded tools

- `cache_clear` — Evicts all cached results (write operation)
- `cache_warm` — Pre-downloads embedding models (write operation)
- `extract_structured` — Requires the `liter-llm` build-time feature flag

### Workspace access

The workspace is mounted read-only at the same path it exists on the host,
so `extract_file` and `batch_extract_files` can reference files using their
absolute workspace paths (e.g. `${{ github.workspace }}/document.pdf`).

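Because the mount uses the same path inside and outside the container, a
repository-relative path can be turned into the absolute path the tools expect
by joining it onto `GITHUB_WORKSPACE`. The following is a minimal sketch of that
step. The helper name `workspace_path` and the example paths are hypothetical,
and the rejection of `..` segments is an added safeguard, not something the
server is known to require.

```python
import os
from pathlib import PurePosixPath
from typing import Optional


def workspace_path(relative: str, workspace: Optional[str] = None) -> str:
    """Return the absolute path of a workspace file as the container sees it."""
    # Fall back to the environment variable set on GitHub Actions runners.
    root = PurePosixPath(workspace or os.environ["GITHUB_WORKSPACE"])
    candidate = root / relative
    # Reject absolute inputs and ".." segments, which would point outside the mount.
    if PurePosixPath(relative).is_absolute() or ".." in candidate.parts:
        raise ValueError(f"{relative!r} is not a path inside the workspace")
    return str(candidate)


# Hypothetical runner layout, passed explicitly so the example runs anywhere.
print(workspace_path("docs/report.pdf", "/home/runner/work/repo/repo"))
```
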
### Usage in workflows

```yaml
imports:
  - shared/mcp/kreuzberg.md
```
-->
