| 1 | +--- |
| 2 | +mcp-servers: |
| 3 | + kreuzberg: |
| 4 | + container: "ghcr.io/kreuzberg-dev/kreuzberg" |
| 5 | + version: "latest" |
| 6 | + entrypointArgs: |
| 7 | + - "mcp" |
| 8 | + mounts: |
| 9 | + - ${GITHUB_WORKSPACE}:${GITHUB_WORKSPACE}:ro |
| 10 | + allowed: |
| 11 | + # Document extraction tools (read-only) |
| 12 | + - "extract_file" |
| 13 | + - "extract_bytes" |
| 14 | + - "batch_extract_files" |
| 15 | + # Format discovery tools (read-only) |
| 16 | + - "detect_mime_type" |
| 17 | + - "list_formats" |
| 18 | + - "get_version" |
| 19 | + # Text processing tools (read-only) |
| 20 | + - "embed_text" |
| 21 | + - "chunk_text" |
| 22 | + # Cache inspection tools (read-only) |
| 23 | + - "cache_stats" |
| 24 | + - "cache_manifest" |
| 25 | + # Excluded write/mutating operations: |
| 26 | + # - "cache_clear" # Evicts all cached results |
| 27 | + # - "cache_warm" # Pre-downloads embedding models |
| 28 | + # Excluded feature-flag-gated operations: |
| 29 | + # - "extract_structured" # Requires liter-llm feature flag at build time |
| 30 | +--- |
| 31 | +<!-- |
| 32 | +## Kreuzberg MCP Server |
| 33 | + |
| 34 | +Kreuzberg is a polyglot document intelligence engine. Its MCP server exposes the |
| 35 | +extraction engine as 13 discoverable tools (10 are allowed here; 3 are excluded, see |
| 36 | +below) and communicates over stdin/stdout using JSON-RPC 2.0. It supports 97+ file |
| 37 | +formats, including PDF, DOCX, PPTX, images (with Tesseract OCR), and legacy Office |
| 38 | +formats (with LibreOffice, full image only). |
| 39 | + |
| 40 | +Docker guide: https://docs.kreuzberg.dev/guides/docker/ |
| 41 | +MCP integration guide: https://docs.kreuzberg.dev/guides/mcp-integration/ |
| 42 | +GitHub: https://github.com/kreuzberg-dev/kreuzberg |
| 43 | + |
| 44 | +### Container images |
| 45 | + |
| 46 | +Two images are available (both on `ghcr.io/kreuzberg-dev/kreuzberg`): |
| 47 | +- **Core** (~1.0–1.3 GB): Modern formats, Tesseract OCR (12 languages) |
| 48 | +- **Full** (~1.5–2.1 GB): Adds LibreOffice for legacy `.doc`/`.ppt` files |
| 49 | + Use tag `full` or `latest-full` to select the full image. |
| 50 | + |
| 51 | +### Required secrets |
| 52 | + |
| 53 | +None — no API token is required. |
| 54 | + |
| 55 | +### Available tools (read-only) |
| 56 | + |
| 57 | +| Tool | Params | Description | |
| 58 | +|---|---|---| |
| 59 | +| `extract_file` | `path` | Extract text and metadata from a local file | |
| 60 | +| `extract_bytes` | `data` (base64) | Extract from base64-encoded file content | |
| 61 | +| `batch_extract_files` | `paths` | Extract multiple files in one call | |
| 62 | +| `detect_mime_type` | `path` | Identify a file's MIME type | |
| 63 | +| `list_formats` | — | List all supported file formats | |
| 64 | +| `get_version` | — | Return the library version string | |
| 65 | +| `embed_text` | `texts` | Generate embedding vectors for text chunks | |
| 66 | +| `chunk_text` | `text` | Split text into overlapping chunks | |
| 67 | +| `cache_stats` | — | Report how much content is cached | |
| 68 | +| `cache_manifest` | — | Return model checksums | |
| 69 | + |
| 70 | +### Excluded tools |
| 71 | + |
| 72 | +- `cache_clear` — Evicts all cached results (write operation) |
| 73 | +- `cache_warm` — Pre-downloads embedding models (write operation) |
| 74 | +- `extract_structured` — Requires the `liter-llm` build-time feature flag |
| 75 | + |
| 76 | +### Workspace access |
| 77 | + |
| 78 | +The workspace is mounted read-only at the same path as on the host, so |
| 79 | +`extract_file` and `batch_extract_files` can reference files by their absolute |
| 80 | +workspace paths (e.g. `${{ github.workspace }}/document.pdf`). |
| 81 | + |
| 82 | +### Usage in workflows |
| 83 | + |
| 84 | +```yaml |
| 85 | +imports: |
| 86 | + - shared/mcp/kreuzberg.md |
| 87 | +``` |
| 88 | +--> |
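
For readers unfamiliar with what travels over stdin/stdout, here is a minimal sketch of a single tool invocation as a JSON-RPC 2.0 message. It assumes the standard MCP `tools/call` request shape and newline-delimited messages on the stdio transport. The helper `build_tool_call` and the example path are hypothetical, and the allowlist check only mirrors the `allowed` list from the frontmatter for illustration. In practice the workflow runtime does the filtering, and a real session performs the MCP `initialize` handshake before any tool call.

```python
import json

# Tool names copied from the `allowed` list in the frontmatter of this diff.
ALLOWED_TOOLS = {
    "extract_file", "extract_bytes", "batch_extract_files",
    "detect_mime_type", "list_formats", "get_version",
    "embed_text", "chunk_text", "cache_stats", "cache_manifest",
}


def build_tool_call(request_id, tool, arguments):
    """Return one JSON-RPC 2.0 `tools/call` message as a single line of JSON.

    Hypothetical helper: it is not part of Kreuzberg or of any MCP SDK.
    """
    if tool not in ALLOWED_TOOLS:
        raise ValueError(f"tool {tool!r} is not in the allowed list")
    message = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    }
    return json.dumps(message)


# Hypothetical workspace path, for illustration only.
line = build_tool_call(1, "extract_file", {"path": "/github/workspace/document.pdf"})
print(line)  # one message per line on the server's stdin
```

Requesting an excluded tool such as `cache_clear` raises `ValueError` in this sketch, which is the same outcome the allowlist is meant to guarantee at the workflow level.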