Commit 357d85f

Authored by Quentin Ambard (with Claude)
Add PDF generation tools and volume folder details (#365)
* Add PDF generation tools and volume folder details

  PDF generation:
  - Add `generate_and_upload_pdfs` MCP tool for batch PDF generation
  - Add `generate_and_upload_pdf` MCP tool for a single PDF with precise control
  - Uses an LLM to generate professional HTML content, converted to PDF via PyMuPDF
  - Creates companion JSON files with question/guideline pairs for RAG evaluation
  - Auto-discovers databricks-gpt-* endpoints (no configuration needed)
  - Uploads directly to Unity Catalog Volumes

  Volume tools:
  - Add `get_volume_folder_details` MCP tool for inspecting data files in Volumes

  Other changes:
  - Remove unused litellm dependency
  - Update list_serving_endpoints to support limit=None for full listing
  - Add unit tests for LLM endpoint discovery
  - Update skill documentation with examples and best practices

* Fix ruff linting errors
  - Remove unnecessary pass statement in LLMConfigurationError
  - Fix line length issues (auto-fixed by ruff)
  - Remove unused config variable in _get_html_system_prompt

* Add linting instructions to CONTRIBUTING.md
  - Document ruff check and format commands used by CI.

* Fix all ruff lint and format errors
  - Fix line length issues in _SIZE_CONFIG strings
  - Fix line length in system_prompt strings
  - Remove unnecessary pass statement (auto-fixed)
  - Apply ruff format to all files

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Quentin Ambard <quentin.ambard@databricks.com>
Co-authored-by: Claude <noreply@anthropic.com>
1 parent 4e37ead commit 357d85f
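The commit message says the PDF tools auto-discover `databricks-gpt-*` serving endpoints. As a hedged sketch of that idea only (the helper name `discover_llm_endpoint` and the pick-first-sorted-match rule are our assumptions, not the actual implementation in this commit):

```python
def discover_llm_endpoint(endpoint_names: list[str], prefix: str = "databricks-gpt-") -> str:
    """Hypothetical sketch: select a serving endpoint by name prefix.

    Sorts matches so the choice is deterministic; the real discovery
    logic in the commit may rank endpoints differently.
    """
    matches = sorted(n for n in endpoint_names if n.startswith(prefix))
    if not matches:
        raise LookupError(f"no endpoint matching {prefix}* found")
    return matches[0]


endpoints = ["my-custom-llm", "databricks-gpt-4", "databricks-gpt-nano"]
print(discover_llm_endpoint(endpoints))  # databricks-gpt-4
```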

File tree

13 files changed: +1313 −64 lines

CONTRIBUTING.md

Lines changed: 34 additions & 0 deletions
@@ -32,6 +32,40 @@ This repository is maintained by Databricks and intended for contributions from

- **Type hints**: Include type annotations for public functions
- **Naming**: Use lowercase with hyphens for directories (e.g., `databricks-tools-core`)

## Linting

This project uses [ruff](https://docs.astral.sh/ruff/) for linting and formatting. Run these before submitting a PR:

```bash
# Check for linting errors
uvx ruff@0.11.0 check \
    --select=E,F,B,PIE \
    --ignore=E401,E402,F401,F403,B017,B904,ANN,TCH \
    --line-length=120 \
    --target-version=py311 \
    databricks-tools-core/ databricks-mcp-server/

# Auto-fix linting errors where possible
uvx ruff@0.11.0 check --fix \
    --select=E,F,B,PIE \
    --ignore=E401,E402,F401,F403,B017,B904,ANN,TCH \
    --line-length=120 \
    --target-version=py311 \
    databricks-tools-core/ databricks-mcp-server/

# Check formatting
uvx ruff@0.11.0 format --check \
    --line-length=120 \
    --target-version=py311 \
    databricks-tools-core/ databricks-mcp-server/

# Auto-format code
uvx ruff@0.11.0 format \
    --line-length=120 \
    --target-version=py311 \
    databricks-tools-core/ databricks-mcp-server/
```

## Testing

Run integration tests before submitting changes:

databricks-mcp-server/databricks_mcp_server/server.py

Lines changed: 1 addition & 0 deletions
```diff
@@ -154,4 +154,5 @@ async def _noop_lifespan(*args, **kwargs):
     user,
     apps,
     workspace,
+    pdf,
 )
```
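The one-line change above works because tool modules import the shared `mcp` server object and register their functions with the `@mcp.tool` decorator; adding `pdf` to this tuple imports the module, which runs its decorators. A toy sketch of that decorator-registry pattern (`MiniMCP` is an illustration, not the real FastMCP API):

```python
class MiniMCP:
    """Toy stand-in for the shared server object (illustration only)."""

    def __init__(self):
        self.tools = {}

    def tool(self, fn):
        # Registration happens at import time, when the decorator runs.
        self.tools[fn.__name__] = fn
        return fn


mcp = MiniMCP()


@mcp.tool
def pdf_example(name: str) -> str:
    return f"registered {name}"


print("pdf_example" in mcp.tools)  # True
```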
New file — Lines changed: 200 additions & 0 deletions

```python
"""PDF tools - Generate synthetic PDF documents for RAG/unstructured data use cases."""

import tempfile
from typing import Any, Dict, Literal

from databricks_tools_core.pdf import DocSize
from databricks_tools_core.pdf import generate_pdf_documents as _generate_pdf_documents
from databricks_tools_core.pdf import generate_single_pdf as _generate_single_pdf
from databricks_tools_core.pdf.models import DocumentSpecification

from ..server import mcp


@mcp.tool
def generate_and_upload_pdfs(
    catalog: str,
    schema: str,
    description: str,
    count: int,
    volume: str = "raw_data",
    folder: str = "pdf_documents",
    doc_size: Literal["SMALL", "MEDIUM", "LARGE"] = "MEDIUM",
    overwrite_folder: bool = False,
) -> Dict[str, Any]:
    """
    Generate synthetic PDF documents and upload to a Unity Catalog volume.

    This tool generates realistic PDF documents using a 2-step process:
    1. Uses LLM to generate diverse document specifications
    2. Generates HTML content and converts to PDF in parallel

    Each PDF also gets a companion JSON file with a question/guideline pair
    for RAG evaluation purposes.

    Args:
        catalog: Unity Catalog name
        schema: Schema name
        description: Detailed description of what PDFs should contain.
            Be specific about the domain, document types, and content.
            Example: "Technical documentation for a cloud infrastructure platform
            including API guides, troubleshooting manuals, and security policies."
        count: Number of PDFs to generate (recommended: 5-20)
        volume: Volume name (must already exist). Default: "raw_data"
        folder: Folder within volume (e.g., "technical_docs"). Default: "pdf_documents"
        doc_size: Size of documents to generate. Default: "MEDIUM"
            - "SMALL": ~1 page, concise content
            - "MEDIUM": ~4-6 pages, comprehensive coverage (default)
            - "LARGE": ~10+ pages, exhaustive documentation
        overwrite_folder: If True, delete existing folder content first (default: False)

    Returns:
        Dictionary with:
        - success: True if all PDFs generated successfully
        - volume_path: Path to the volume folder containing PDFs
        - pdfs_generated: Number of PDFs successfully created
        - pdfs_failed: Number of PDFs that failed
        - errors: List of error messages if any

    Example:
        >>> generate_and_upload_pdfs(
        ...     catalog="my_catalog",
        ...     schema="my_schema",
        ...     description="HR policy documents including employee handbook, "
        ...     "leave policies, code of conduct, and benefits guide",
        ...     count=10,
        ...     doc_size="SMALL"
        ... )
        {
            "success": True,
            "volume_path": "/Volumes/my_catalog/my_schema/raw_data/pdf_documents",
            "pdfs_generated": 10,
            "pdfs_failed": 0,
            "errors": []
        }

    Environment Variables:
        - DATABRICKS_MODEL: Model serving endpoint name (auto-discovered if not set)
        - DATABRICKS_MODEL_NANO: Smaller model for faster generation (auto-discovered if not set)
    """
    # Convert string to DocSize enum
    size_enum = DocSize(doc_size)

    result = _generate_pdf_documents(
        catalog=catalog,
        schema=schema,
        description=description,
        count=count,
        volume=volume,
        folder=folder,
        doc_size=size_enum,
        overwrite_folder=overwrite_folder,
        max_workers=4,
    )

    return {
        "success": result.success,
        "volume_path": result.volume_path,
        "pdfs_generated": result.pdfs_generated,
        "pdfs_failed": result.pdfs_failed,
        "errors": result.errors,
    }


@mcp.tool
def generate_and_upload_pdf(
    title: str,
    description: str,
    question: str,
    guideline: str,
    catalog: str,
    schema: str,
    volume: str = "raw_data",
    folder: str = "pdf_documents",
    doc_size: Literal["SMALL", "MEDIUM", "LARGE"] = "MEDIUM",
) -> Dict[str, Any]:
    """
    Generate a single PDF document and upload to a Unity Catalog volume.

    Use this when you need to create one PDF with precise control over its
    content, title, and associated question/guideline for RAG evaluation.

    Args:
        title: Document title (e.g., "API Authentication Guide")
        description: What this document should contain. Be detailed about
            the content, sections, topics, and domain context to cover.
        question: A question that can be answered by reading this document.
            Used for RAG evaluation.
        guideline: How to evaluate if an answer to the question is correct.
            Should describe what a good answer includes without giving the exact answer.
        catalog: Unity Catalog name
        schema: Schema name
        volume: Volume name (must already exist). Default: "raw_data"
        folder: Folder within volume. Default: "pdf_documents"
        doc_size: Size of document to generate. Default: "MEDIUM"
            - "SMALL": ~1 page, concise content
            - "MEDIUM": ~4-6 pages, comprehensive coverage
            - "LARGE": ~10+ pages, exhaustive documentation

    Returns:
        Dictionary with:
        - success: True if PDF generated successfully
        - pdf_path: Volume path to the generated PDF
        - question_path: Volume path to the companion JSON file (question/guideline)
        - error: Error message if generation failed

    Example:
        >>> generate_and_upload_pdf(
        ...     title="REST API Authentication Guide",
        ...     description="Complete guide to API authentication for a cloud platform "
        ...     "including OAuth2 flows, API keys, and JWT tokens.",
        ...     question="What are the supported authentication methods?",
        ...     guideline="Answer should mention OAuth2, API keys, and JWT tokens",
        ...     catalog="my_catalog",
        ...     schema="my_schema",
        ...     doc_size="SMALL"
        ... )
        {
            "success": True,
            "pdf_path": "/Volumes/my_catalog/my_schema/raw_data/pdf_documents/rest_api_authentication_guide.pdf",
            "question_path": "/Volumes/my_catalog/my_schema/raw_data/pdf_documents/rest_api_authentication_guide.json",
            "error": None
        }
    """
    # Generate model_id from title (used for filename)
    import re

    model_id = re.sub(r"[^a-zA-Z0-9]+", "_", title).strip("_").upper()

    # Create document specification
    doc_spec = DocumentSpecification(
        title=title,
        category="Document",  # Simplified - category info can be in description
        model=model_id,
        description=description,
        question=question,
        guideline=guideline,
    )

    # Convert string to DocSize enum
    size_enum = DocSize(doc_size)

    # Use a temporary directory for local file creation
    with tempfile.TemporaryDirectory() as temp_dir:
        result = _generate_single_pdf(
            doc_spec=doc_spec,
            description=description,  # Same description used for context
            catalog=catalog,
            schema=schema,
            volume=volume,
            folder=folder,
            temp_dir=temp_dir,
            doc_size=size_enum,
        )

    return {
        "success": result.success,
        "pdf_path": result.pdf_path,
        "question_path": result.question_path,
        "error": result.error,
    }
```
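`generate_and_upload_pdf` derives a `model_id` slug from the title before building the `DocumentSpecification`. That transformation is pure string handling and can be isolated as a standalone helper (the helper name `title_to_model_id` is ours; the regex and chain match the tool code above):

```python
import re


def title_to_model_id(title: str) -> str:
    # Same transformation as in generate_and_upload_pdf:
    # collapse non-alphanumeric runs to "_", trim edge underscores, uppercase.
    return re.sub(r"[^a-zA-Z0-9]+", "_", title).strip("_").upper()


print(title_to_model_id("REST API Authentication Guide"))
# REST_API_AUTHENTICATION_GUIDE
```

Punctuation-heavy titles collapse cleanly too: `"--hello, world!--"` becomes `"HELLO_WORLD"`.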

databricks-mcp-server/databricks_mcp_server/tools/sql.py

Lines changed: 34 additions & 0 deletions
```diff
@@ -8,6 +8,7 @@
     list_warehouses as _list_warehouses,
     get_best_warehouse as _get_best_warehouse,
     get_table_details as _get_table_details,
+    get_volume_folder_details as _get_volume_folder_details,
     TableStatLevel,
 )

@@ -215,3 +216,36 @@ def get_table_details(
     )
     # Convert to dict for JSON serialization
     return result.model_dump(exclude_none=True) if hasattr(result, "model_dump") else result
+
+
+@mcp.tool
+def get_volume_folder_details(
+    volume_path: str,
+    format: str = "parquet",
+    table_stat_level: str = "SIMPLE",
+    warehouse_id: str = None,
+) -> Dict[str, Any]:
+    """
+    Get schema and statistics for data files in a Databricks Volume folder.
+
+    Similar to get_table_details but for raw files stored in Volumes.
+
+    Args:
+        volume_path: Path to the volume folder. Can be:
+            - "catalog/schema/volume/path" (e.g., "ai_dev_kit/demo/raw_data/customers")
+            - "/Volumes/catalog/schema/volume/path"
+        format: Data format - "parquet", "csv", "json", "delta", or "file" (just list files).
+        table_stat_level: Level of statistics - "NONE", "SIMPLE" (default), or "DETAILED".
+        warehouse_id: Optional warehouse ID. If not provided, auto-selects one.
+
+    Returns:
+        Dictionary with schema, row count, column stats, and sample data.
+    """
+    level = TableStatLevel[table_stat_level.upper()]
+    result = _get_volume_folder_details(
+        volume_path=volume_path,
+        format=format,
+        table_stat_level=level,
+        warehouse_id=warehouse_id,
+    )
+    return result.model_dump(exclude_none=True) if hasattr(result, "model_dump") else result
```
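The wrapper converts its string argument to an enum with `TableStatLevel[table_stat_level.upper()]`, which is an `Enum` lookup by member *name*. A minimal standalone illustration (the member set below is inferred from the docstring; the real enum lives in `databricks_tools_core`):

```python
from enum import Enum


class TableStatLevel(Enum):
    # Members inferred from the docstring: NONE, SIMPLE (default), DETAILED.
    NONE = "NONE"
    SIMPLE = "SIMPLE"
    DETAILED = "DETAILED"


# Square-bracket access looks up by member name; .upper() makes the
# tool's string argument effectively case-insensitive.
level = TableStatLevel["simple".upper()]
print(level is TableStatLevel.SIMPLE)  # True
```

An unknown name raises `KeyError`, so typos in `table_stat_level` fail fast rather than silently defaulting.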
