Commit 66c85bf

docs(azure-ai-integration.md): Add detailed index schema and field mapping guidelines for citations
1 parent 57c6de2 commit 66c85bf

2 files changed

Lines changed: 486 additions & 3 deletions

docs/azure-ai-citations.md

Lines changed: 192 additions & 0 deletions
@@ -71,6 +71,7 @@ Each citation is emitted as a separate event to ensure all sources appear in the
```

Key points:

- Each source document gets its own citation event
- The `source.name` includes the doc index (`[doc1]`, `[doc2]`, etc.) to prevent grouping
- The `distances` array contains relevance scores from Azure AI Search, which OpenWebUI displays as a percentage on the citation cards
@@ -159,23 +160,210 @@ This ensures every citation has a meaningful display name.
Citations are filtered to only show documents that are actually referenced in the response content. For example, if Azure returns 5 citations but the response only references `[doc1]` and `[doc3]`, only those 2 citations will appear in the UI.
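
The filtering described above can be sketched as follows. This is a minimal illustration, not the pipeline's actual code: the function name `filter_referenced_citations` is hypothetical, and it assumes `[docN]` is a 1-based index into the citation list returned by Azure.

```python
import re


def filter_referenced_citations(content: str, citations: list[dict]) -> list[tuple[int, dict]]:
    """Return (doc_number, citation) pairs for citations referenced in the content.

    Assumption: [docN] refers to the N-th citation in the list, counting from 1.
    """
    # Collect every document number that appears as [docN] in the response text.
    referenced = {int(n) for n in re.findall(r"\[doc(\d+)\]", content)}
    # Keep only citations whose 1-based position was referenced.
    return [(i, c) for i, c in enumerate(citations, start=1) if i in referenced]


if __name__ == "__main__":
    five = [{"title": f"Document {i}"} for i in range(1, 6)]
    for number, citation in filter_referenced_citations("See [doc1] and [doc3].", five):
        print(number, citation["title"])
```

Returning the document number alongside each citation keeps the original `[docN]` label available when the citation event is built later.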

## Index Schema Requirements for Citations

For citations to work correctly, your Azure AI Search index must contain the right fields with the right attributes. This section explains exactly which fields the pipeline reads and how they map to citation cards in OpenWebUI.

### Required and Recommended Index Fields

| Index Field | Type | Required? | Must Be Retrievable? | Citation Purpose |
|---|---|---|---|---|
| `content` | `Edm.String` | Yes | Yes | Provides the text snippet shown in the citation preview |
| `title` | `Edm.String` | Recommended | Yes | Displayed as the citation card title |
| `filepath` | `Edm.String` | Recommended | Yes | Used as the citation name in the response; fallback for title |
| `url` | `Edm.String` | Recommended | Yes | Makes `[docX]` references into clickable links |
| `chunk_id` | `Edm.String` | Optional | Yes | Helps match citations with relevance scores |
| `contentVector` | `Collection(Edm.Single)` | For vector search | N/A | Enables vector/hybrid search |
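
The "Must Be Retrievable?" column can be checked against an index definition before deploying. The sketch below is illustrative only: `non_retrievable_citation_fields` is a hypothetical helper that works on an index schema already loaded as a Python dict, and it assumes that an omitted `retrievable` attribute counts as retrievable.

```python
CITATION_FIELDS = ("content", "title", "filepath", "url")


def non_retrievable_citation_fields(index_schema: dict, names=CITATION_FIELDS) -> list[str]:
    """List citation fields that are missing from the schema or not retrievable."""
    fields = {f["name"]: f for f in index_schema.get("fields", [])}
    problems = []
    for name in names:
        field = fields.get(name)
        # Assumption: a field without an explicit "retrievable" attribute is retrievable.
        if field is None or field.get("retrievable", True) is False:
            problems.append(name)
    return problems


if __name__ == "__main__":
    schema = {
        "name": "my-docs-index",
        "fields": [
            {"name": "content", "type": "Edm.String", "retrievable": True},
            {"name": "title", "type": "Edm.String", "retrievable": False},
            {"name": "filepath", "type": "Edm.String"},
        ],
    }
    print(non_retrievable_citation_fields(schema))
```

If your index uses custom field names, pass them through the `names` argument instead of the defaults.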

> **Key point**: The `title`, `filepath`, and `url` fields must be marked as **retrievable** in your index schema. If they are not retrievable, Azure will not include them in the citation response, and the pipeline cannot display them.

### Title Fallback Chain

The pipeline determines each citation's display title using this fallback chain:

1. `title` field → if present and non-empty
2. `filepath` field → if title is empty
3. `url` field → if both title and filepath are empty
4. `"Unknown Document"` → if all are empty

To avoid seeing "Unknown Document", ensure at least one of `title`, `filepath`, or `url` is populated in your index documents.
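
The fallback chain above could be expressed roughly as follows. This is a sketch rather than the pipeline's actual implementation: `citation_title` is a hypothetical name, and treating whitespace-only values as empty is an assumption.

```python
def citation_title(citation: dict) -> str:
    """Pick a display title using the title -> filepath -> url fallback order."""
    for key in ("title", "filepath", "url"):
        value = citation.get(key)
        # Assumption: whitespace-only strings count as empty.
        if isinstance(value, str) and value.strip():
            return value.strip()
    return "Unknown Document"


if __name__ == "__main__":
    print(citation_title({"title": "", "filepath": "architecture.pdf"}))
    print(citation_title({}))
```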

### Custom Field Names and `fields_mapping`

If your index uses different field names (e.g., `body` instead of `content`, or `doc_title` instead of `title`), you must tell Azure OpenAI how to map them using the `fields_mapping` parameter in your `AZURE_AI_DATA_SOURCES` configuration.

**`fields_mapping` properties:**

| Property | Type | Maps To |
|---|---|---|
| `content_fields` | `string[]` | The index fields to use as document content |
| `title_field` | `string` | The index field to use as the document title |
| `filepath_field` | `string` | The index field to use as the file path/name |
| `url_field` | `string` | The index field to use as the document URL |
| `vector_fields` | `string[]` | The index fields containing vector embeddings |
| `content_fields_separator` | `string` | Separator pattern between content fields (default: `\n`) |

**Example with custom field names:**

```json
[
  {
    "type": "azure_search",
    "parameters": {
      "endpoint": "https://my-search.search.windows.net",
      "index_name": "my-custom-index",
      "authentication": {
        "type": "api_key",
        "key": "YOUR-SEARCH-API-KEY"
      },
      "fields_mapping": {
        "content_fields": ["body", "summary"],
        "title_field": "doc_title",
        "filepath_field": "source_file",
        "url_field": "source_url",
        "vector_fields": ["embedding"]
      }
    }
  }
]
```
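
Before deploying a configuration like the one above, it can help to check which citation-related mapping properties are set. The following is a small sketch under stated assumptions: `missing_citation_mappings` is a hypothetical helper, and it takes the `AZURE_AI_DATA_SOURCES` value as a JSON string shaped like the example.

```python
import json

CITATION_MAPPING_KEYS = ("title_field", "filepath_field", "url_field")


def missing_citation_mappings(data_sources_json: str) -> dict[str, list[str]]:
    """Map each azure_search index name to the citation mapping keys it does not set."""
    report = {}
    for source in json.loads(data_sources_json):
        if source.get("type") != "azure_search":
            continue  # Only azure_search sources are described in this section.
        params = source.get("parameters", {})
        mapping = params.get("fields_mapping", {})
        report[params.get("index_name", "<unnamed>")] = [
            key for key in CITATION_MAPPING_KEYS if not mapping.get(key)
        ]
    return report


if __name__ == "__main__":
    config = json.dumps([{
        "type": "azure_search",
        "parameters": {
            "index_name": "my-custom-index",
            "fields_mapping": {"title_field": "doc_title", "url_field": "source_url"},
        },
    }])
    print(missing_citation_mappings(config))
```

A missing key is only a problem when the index does not use the default field name for that property, so treat the report as a prompt for review rather than an error list.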

### Creating an Index with the Right Fields

If you are creating a new index manually, here is a minimal schema that supports all citation features:

```json
{
  "name": "my-docs-index",
  "fields": [
    { "name": "id", "type": "Edm.String", "key": true, "filterable": true },
    { "name": "content", "type": "Edm.String", "searchable": true, "retrievable": true },
    { "name": "title", "type": "Edm.String", "searchable": true, "retrievable": true, "filterable": true },
    { "name": "filepath", "type": "Edm.String", "retrievable": true, "filterable": true },
    { "name": "url", "type": "Edm.String", "retrievable": true },
    { "name": "chunk_id", "type": "Edm.String", "retrievable": true, "filterable": true }
  ]
}
```

For vector/hybrid search, add a vector field:

```json
{ "name": "contentVector", "type": "Collection(Edm.Single)", "searchable": true, "dimensions": 1536, "vectorSearchProfile": "my-vector-profile" }
```

### Indexer Field Mappings (Blob Storage)

If you index documents from Azure Blob Storage using an indexer, you need to map blob metadata to your index fields. Common blob metadata fields:

| Blob Metadata Field | Description | Typical Index Mapping |
|---|---|---|
| `metadata_storage_name` | Blob filename (e.g., `report.pdf`) | `title` |
| `metadata_storage_path` | Full blob URL (e.g., `https://account.blob.core.windows.net/container/file.pdf`) | `filepath` and `url` |
| `metadata_storage_last_modified` | Last modified timestamp | `last_modified` (optional, useful for sorting) |
| `metadata_storage_content_type` | MIME type | (optional, useful for filtering) |
| `content` | Extracted text from the document | `content` (auto-mapped if names match) |

**Example indexer with field mappings:**

```json
{
  "name": "my-blob-indexer",
  "dataSourceName": "my-blob-datasource",
  "targetIndexName": "my-docs-index",
  "fieldMappings": [
    {
      "sourceFieldName": "metadata_storage_name",
      "targetFieldName": "title"
    },
    {
      "sourceFieldName": "metadata_storage_path",
      "targetFieldName": "filepath"
    },
    {
      "sourceFieldName": "metadata_storage_path",
      "targetFieldName": "url"
    },
    {
      "sourceFieldName": "metadata_storage_last_modified",
      "targetFieldName": "last_modified"
    }
  ],
  "parameters": {
    "configuration": {
      "dataToExtract": "contentAndMetadata"
    }
  }
}
```

> **Note**: The `content` field is mapped automatically when the source and target field names match. The blob indexer also **automatically** maps `metadata_storage_path` (base64-encoded) to the `id` key field, so no explicit mapping is needed for `id`. Mapping `metadata_storage_name` → `title` gives citation cards a readable name from the blob filename. The `last_modified` mapping in the example requires a matching `last_modified` field (`Edm.DateTimeOffset`) in the index; omit that mapping if your index does not define one.

### How the Pipeline Reads Citation Fields

When Azure OpenAI returns a response with citations, each citation object looks like this:

```json
{
  "title": "Architecture Overview",
  "content": "The system uses a microservices architecture...",
  "url": "https://storageaccount.blob.core.windows.net/docs/architecture.pdf",
  "filepath": "architecture.pdf",
  "chunk_id": "0",
  "original_search_score": 12.5,
  "rerank_score": 3.2,
  "filter_reason": "rerank"
}
```

The pipeline maps these fields to the OpenWebUI citation event:

| Azure Citation Field | OpenWebUI Citation Property | Display |
|---|---|---|
| `title` | `source.name` | `[doc1] - Architecture Overview` |
| `content` | `document[0]` | Preview text in citation card |
| `url` / `filepath` | `source.url` | Clickable link |
| `rerank_score` / `original_search_score` | `distances[0]` | Relevance percentage |
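
The mapping in the table above can be sketched as a small function. This is an illustration, not the pipeline's actual code: `build_citation_event` is a hypothetical name, the `{"type": "citation", "data": {...}}` envelope and the `metadata` entry are assumptions about the OpenWebUI event shape, and preferring `rerank_score` over `original_search_score` is an assumed order. The score is passed through unchanged here; any normalization to a percentage is left out.

```python
def build_citation_event(citation: dict, doc_number: int) -> dict:
    """Build an OpenWebUI-style citation event from one Azure citation object."""
    # Title fallback: title, then filepath, then url, then a fixed placeholder.
    title = (
        citation.get("title")
        or citation.get("filepath")
        or citation.get("url")
        or "Unknown Document"
    )
    # Assumption: the rerank score wins when both scores are present.
    score = citation.get("rerank_score")
    if score is None:
        score = citation.get("original_search_score")

    data = {
        "document": [citation.get("content", "")],
        "metadata": [{"source": title}],
        "source": {
            "name": f"[doc{doc_number}] - {title}",
            "url": citation.get("url") or citation.get("filepath") or "",
        },
    }
    if score is not None:
        data["distances"] = [score]
    return {"type": "citation", "data": data}


if __name__ == "__main__":
    example = {
        "title": "Architecture Overview",
        "content": "The system uses a microservices architecture...",
        "url": "https://storageaccount.blob.core.windows.net/docs/architecture.pdf",
        "filepath": "architecture.pdf",
        "original_search_score": 12.5,
        "rerank_score": 3.2,
    }
    print(build_citation_event(example, 1)["data"]["source"]["name"])
```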

## Troubleshooting

### Citations Not Appearing

**Problem**: Citations don't appear in the OpenWebUI frontend

**Solutions**:

1. Check that Azure AI Search is properly configured (`AZURE_AI_DATA_SOURCES`)
2. Ensure you're using an Azure OpenAI endpoint (not a generic Azure AI endpoint)
3. Verify the response contains `[docX]` references
4. Check browser console and server logs for errors

### Citations Showing "Unknown Document"

**Problem**: Citation cards display "Unknown Document" instead of a meaningful title

**Solutions**:

1. Verify your index has `title`, `filepath`, or `url` fields and that they are marked as **retrievable**
2. If using custom field names, add `fields_mapping` with `title_field`, `filepath_field`, and `url_field` to your `AZURE_AI_DATA_SOURCES` JSON
3. Verify the fields are actually populated in your indexed documents (empty fields cause fallback to "Unknown Document")

### No Clickable Links on [docX] References

**Problem**: `[docX]` references appear as plain text, not clickable links

**Solutions**:

1. Your index needs a `url` field (or mapped `url_field`) that contains valid URLs
2. If your index stores URLs in a field with a different name, map it using `"url_field": "your_field_name"` in `fields_mapping`
3. Verify that the `url` field is marked as **retrievable** in your index schema

### Relevance Scores Showing 0%

**Problem**: All citation cards show 0% relevance

**Solutions**:

1. Verify `AZURE_AI_INCLUDE_SEARCH_SCORES=true` is set
2. Check that your Azure Search index supports scoring
3. Enable DEBUG logging to see the raw score values from Azure
@@ -185,6 +373,7 @@ Citations are filtered to only show documents that are actually referenced in th
**Problem**: `[docX]` references are not clickable

**Solutions**:

1. Ensure citations have valid `url` or `filepath` fields
2. Check that the document URL is accessible
3. Verify the markdown link format is being generated correctly
@@ -195,6 +384,9 @@ Citations are filtered to only show documents that are actually referenced in th
- [OpenWebUI Event Emitter Documentation](https://docs.openwebui.com/features/plugin/development/events)
- [Azure AI Search Documentation](https://learn.microsoft.com/en-us/azure/search/)
- [Azure On Your Data API Reference](https://learn.microsoft.com/en-us/azure/ai-foundry/openai/references/on-your-data)
- [Azure Search Fields Mapping Options](https://learn.microsoft.com/en-us/azure/ai-foundry/openai/references/azure-search#fields-mapping-options)
- [Azure AI Search Indexer Field Mappings](https://learn.microsoft.com/en-us/azure/search/search-indexer-field-mappings)
- [Azure OpenAI On Your Data - Index Field Mapping](https://learn.microsoft.com/en-us/azure/ai-foundry/openai/concepts/use-your-data#index-field-mapping)

## Version History