Commit bffcce1

dabrt and mikadamczyk authored
IBX-9846: Describe Embeddings search API (#3029)
* IBX-9846: Describe Embeddings search

---------

Co-authored-by: Mikolaj Adamczyk <mikadamczyk@gmail.com>
Co-authored-by: dabrt <dabrt@users.noreply.github.com>
1 parent e352ee4 commit bffcce1

9 files changed

Lines changed: 253 additions & 32 deletions

code_samples/api/public_php_api/src/Command/FindByTaxonomyEmbeddingCommand.php

Lines changed: 69 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,69 @@
1+
<?php
2+
3+
declare(strict_types=1);
4+
5+
namespace App\Command;
6+
7+
use Ibexa\Contracts\Core\Repository\SearchService;
8+
use Ibexa\Contracts\Core\Repository\Values\Content\EmbeddingQueryBuilder;
9+
use Ibexa\Contracts\Core\Repository\Values\Content\Query\Criterion\ContentTypeIdentifier;
10+
use Ibexa\Contracts\Core\Repository\Values\Content\Search\SearchHit;
11+
use Ibexa\Contracts\Core\Search\Embedding\EmbeddingProviderResolverInterface;
12+
use Ibexa\Contracts\Taxonomy\Search\Query\Value\TaxonomyEmbedding;
13+
use Symfony\Component\Console\Attribute\AsCommand;
14+
use Symfony\Component\Console\Command\Command;
15+
use Symfony\Component\Console\Input\InputInterface;
16+
use Symfony\Component\Console\Output\OutputInterface;
17+
use Symfony\Component\Console\Style\SymfonyStyle;
18+
19+
#[AsCommand(
20+
    name: 'ibexa:taxonomy:find-by-embedding',
21+
    description: 'Finds content using a taxonomy embedding query.'
22+
)]
23+
final class FindByTaxonomyEmbeddingCommand extends Command
24+
{
25+
    public function __construct(
26+
        private readonly SearchService $searchService,
27+
        private readonly EmbeddingProviderResolverInterface $embeddingProviderResolver,
28+
    ) {
29+
        parent::__construct();
30+
    }
31+
32+
    protected function execute(
33+
        InputInterface $input,
34+
        OutputInterface $output
35+
    ): int {
36+
        $io = new SymfonyStyle($input, $output);
37+
38+
        $embeddingProvider = $this->embeddingProviderResolver->resolve();
39+
        $embedding = $embeddingProvider->getEmbedding('example_content');
40+
41+
        $query = EmbeddingQueryBuilder::create()
42+
            ->withEmbedding(new TaxonomyEmbedding($embedding))
43+
            ->setFilter(new ContentTypeIdentifier('article'))
44+
            ->setLimit(10)
45+
            ->setOffset(0)
46+
            ->setPerformCount(true)
47+
            ->build();
48+
49+
        $result = $this->searchService->findContent($query);
50+
51+
        $io->success(sprintf('Found %d items.', $result->totalCount));
52+
53+
        foreach ($result->searchHits as $searchHit) {
54+
            assert($searchHit instanceof SearchHit);
55+
56+
            /** @var \Ibexa\Contracts\Core\Repository\Values\Content\Content $content */
57+
            $content = $searchHit->valueObject;
58+
            $contentInfo = $content->versionInfo->contentInfo;
59+
60+
            $io->writeln(sprintf(
61+
                '%d: %s',
62+
                $contentInfo->id,
63+
                $contentInfo->name
64+
            ));
65+
        }
66+
67+
        return self::SUCCESS;
68+
    }
69+
}

code_samples/api/public_php_api/src/embedding_fields.php

Lines changed: 11 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,11 @@
1+
<?php declare(strict_types=1);
2+
3+
// Create an embedding field using the default embedding provider (type derived from configuration's field suffix)
4+
5+
/** @var Ibexa\Contracts\Core\Search\FieldType\EmbeddingFieldFactory $factory */
6+
$embeddingField = $factory->create();
7+
echo $embeddingField->getType(); // for example, "ibexa_dense_vector_model_123"
8+
9+
// Create a custom embedding field with a specific type
10+
$customField = $factory->create('custom_embedding_type');
11+
echo $customField->getType(); // "custom_embedding_type"

docs/content_management/content_api/managing_content.md

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -122,7 +122,7 @@ $this->trashService->recover($trashItem, $newParent);
122122
```
123123

124124
You can also search through Trash items and sort the results using several public PHP API Search Criteria and Sort Clauses that have been exposed for `TrashService` queries.
125-
For more information, see [Searching in trash](search_api.md#searching-in-trash).
125+
For more information, see [Search in trash](search_api.md#search-in-trash).
126126

127127
## Content types
128128

docs/release_notes/ez_platform_v3.1.md

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -122,7 +122,7 @@ A customizable search controller has been extracted and placed in `ezplatform-se
122122

123123
You can now search through the contents of Trash and sort the search results based on a number of Search Criteria and Sort Clauses that can be used by the `\eZ\Publish\API\Repository\TrashService::findTrashItems` method only.
124124

125-
For more information, see [Searching in trash](https://doc.ibexa.co/en/latest/api/public_php_api_search/#searching-in-trash).
125+
For more information, see [Search in trash](https://doc.ibexa.co/en/latest/api/public_php_api_search/#search-in-trash).
126126

127127
### Repository filtering
128128

Lines changed: 101 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,101 @@
1+
---
2+
month_change: true
3+
description: Embedding queries, embedding configuration, providers, and embedding search fields
4+
---
5+
6+
# Embeddings search reference
7+
8+
Embeddings provide vector representations of content or text, enabling [semantic similarity search](search_api.md#search-with-embeddings).
9+
Foundational abstractions are provided for embedding-based search, while embedding providers generate vector representations.
10+
11+
Searching with embeddings is designed for use with the [Taxonomy suggestions](taxonomy.md#taxonomy-suggestions) feature.
12+
The [`Ibexa\Contracts\Taxonomy\Search\Query\Value\TaxonomyEmbedding`](/api/php_api/php_api_reference/classes/Ibexa-Contracts-Taxonomy-Search-Query-Value-TaxonomyEmbedding.html) class allows embedding queries to target taxonomy data.
13+
14+
!!! note "Feature support"
15+
16+
    Searching with embeddings requires a search engine that supports it, such as Elasticsearch or Solr 9.8.1+.
17+
18+
## Core query objects
19+
20+
### EmbeddingQuery
21+
22+
- [`Ibexa\Contracts\Core\Repository\Values\Content\EmbeddingQuery`](/api/php_api/php_api_reference/classes/Ibexa-Contracts-Core-Repository-Values-Content-EmbeddingQuery.html) represents a semantic similarity search request.
23+
It encapsulates an [Embedding](#embedding) instance and supports pagination, aggregations, and result counting through the same API as standard content queries.
24+
25+
!!! note "Embedding query properties"
26+
27+
    Embedding queries use criteria only for additional filtering applied through the query filter, not for similarity matching.
28+
    Also, embedding queries do not accept the standard Query properties that [search engines](search_engines.md) other than Legacy Search support, such as `query`, `sortClauses`, or `spellcheck`.
29+
30+
- [`Ibexa\Contracts\Core\Repository\Values\Content\EmbeddingQueryBuilder`](/api/php_api/php_api_reference/classes/Ibexa-Contracts-Core-Repository-Values-Content-EmbeddingQueryBuilder.html) is a builder for constructing `EmbeddingQuery` instances.
31+
It helps construct queries consistently and integrates embedding queries with the search query pipeline.
32+
You must provide the required embedding value by using the `withEmbedding` method.
33+
34+
### Embedding
35+
36+
- [`Ibexa\Contracts\Core\Repository\Values\Content\Query\Embedding`](/api/php_api/php_api_reference/classes/Ibexa-Contracts-Core-Repository-Values-Content-Query-Embedding.html) represents the vector input used
37+
for similarity search.
38+
It stores embedding values as float arrays, while providers generate those vectors from text input.
39+
40+
## Query execution
41+
42+
Embedding queries are executed by the search engine by using the configured embedding model and provider.
43+
44+
At runtime, the system resolves the appropriate embedding provider and ensures that the embedding vector is compatible with the configured model.
45+
Runtime validation includes validating vector dimensionality and selecting the correct indexed field for similarity search.
46+
Field selection is determined by the configured embedding model and backend-specific query mapping, while vector dimensionality is validated when the query reaches the search engine.
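
The dimensionality check described above can be pictured with a minimal, standalone sketch. It doesn't use the Ibexa API: `assertVectorDimensions` is a hypothetical helper, and the actual validation happens in the search engine integration, whose implementation isn't shown here.

``` php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper: checks that a vector matches the dimensionality
 * configured for the embedding model.
 *
 * @param float[] $vector
 */
function assertVectorDimensions(array $vector, int $expectedDimensions): void
{
    $actual = count($vector);
    if ($actual !== $expectedDimensions) {
        throw new InvalidArgumentException(sprintf(
            'Embedding has %d dimensions, but the configured model expects %d.',
            $actual,
            $expectedDimensions
        ));
    }
}

// 1536 mirrors the `dimensions` value used in the configuration example of this reference.
$vector = array_fill(0, 1536, 0.0);
assertVectorDimensions($vector, 1536); // passes silently

try {
    // A vector produced by a different model would have the wrong length.
    assertVectorDimensions([0.1, 0.2, 0.3], 1536);
} catch (InvalidArgumentException $e) {
    echo $e->getMessage(), PHP_EOL;
}
```

A mismatch like this typically means that the vector was generated by a model other than the one configured for the indexed field.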
47+
48+
## Embedding providers
49+
50+
Embedding providers implement the contract for generating vector representations of input data.
51+
Out of the box, embedding search integration is provided for `TaxonomyEmbedding`.
52+
If you use a custom embedding value type, implement matching embedding visitors for your [search engine](search_engines.md).
53+
Otherwise, query execution may fail because no visitor is available.
54+
55+
- [`Ibexa\Contracts\Core\Search\Embedding\EmbeddingProviderInterface`](/api/php_api/php_api_reference/classes/Ibexa-Contracts-Core-Search-Embedding-EmbeddingProviderInterface.html) generates embeddings for the provided text or other input
56+
57+
- [`Ibexa\Contracts\Core\Search\Embedding\EmbeddingProviderRegistryInterface`](/api/php_api/php_api_reference/classes/Ibexa-Contracts-Core-Search-Embedding-EmbeddingProviderRegistryInterface.html) lists available embedding providers or gets one by its identifier
58+
59+
- [`Ibexa\Contracts\Core\Search\Embedding\EmbeddingProviderResolverInterface`](/api/php_api/php_api_reference/classes/Ibexa-Contracts-Core-Search-Embedding-EmbeddingProviderResolverInterface.html) determines which embedding provider generates embeddings, based either on the system configuration or on a model identifier passed to the `resolveByModelIdentifier` method
60+
61+
## Configuration
62+
63+
Models used to resolve embedding queries must be configured per SiteAccess in [system configuration](configuration.md).
64+
Each entry defines the model's name, vector dimensionality, the field suffix, and the embedding provider that generates vectors.
65+
Field suffixes assigned to the models must be unique, as they become part of the indexed field name.
66+
You select the default model by setting a value in the `default_embedding_model` key.
67+
68+
``` yaml
69+
ibexa:
70+
    system:
71+
        default:
72+
            embedding_models:
73+
                text-embedding-3-small:
74+
                    name: 'text-embedding-3-small'
75+
                    dimensions: 1536
76+
                    field_suffix: '3small'
77+
                    embedding_provider: 'ibexa_openai'
78+
            default_embedding_model: text-embedding-ada-002
79+
```
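
Because field suffixes must be unique, it can help to check a set of model definitions before deploying them. The following standalone sketch is not part of the Ibexa API: `findDuplicateFieldSuffixes` and the `custom-model` entry are hypothetical, and the array only mimics the shape of the `embedding_models` configuration.

``` php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper: returns field suffixes that are used by more than one model.
 *
 * @param array<string, array<string, mixed>> $models
 *
 * @return string[]
 */
function findDuplicateFieldSuffixes(array $models): array
{
    // Count how many models use each suffix.
    $counts = array_count_values(array_column($models, 'field_suffix'));
    $duplicates = array_filter($counts, static fn (int $count): bool => $count > 1);

    // Numeric-looking suffixes become integer keys, so cast them back to strings.
    return array_map('strval', array_keys($duplicates));
}

$models = [
    'text-embedding-3-small' => ['field_suffix' => '3small', 'dimensions' => 1536],
    'custom-model' => ['field_suffix' => '3small', 'dimensions' => 768], // hypothetical clash
];

foreach (findDuplicateFieldSuffixes($models) as $suffix) {
    echo sprintf('Field suffix "%s" is assigned to more than one model.', $suffix), PHP_EOL;
}
```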
80+
81+
For a real-life example of embedding models configuration, see [Taxonomy suggestions](taxonomy.md#change-the-embedding-generation-model).
82+
83+
- [`Ibexa\Contracts\Core\Search\Embedding\EmbeddingConfigurationInterface`](/api/php_api/php_api_reference/classes/Ibexa-Contracts-Core-Search-Embedding-EmbeddingConfigurationInterface.html) gives access to the embedding model configuration in the system (for example, list of available models, default model name, default provider, field suffix, and so on)
84+
85+
## Embedding fields
86+
87+
Embedding vectors are stored in dedicated search fields.
88+
These fields can be used by the search engine to perform vector similarity comparisons when embedding queries are executed.
89+
90+
``` php
91+
[[= include_file('code_samples/api/public_php_api/src/embedding_fields.php') =]]
92+
```
93+
94+
Once you create a field, subscribe to the `ContentIndexCreateEvent` indexing event to [add the field to the index](index_custom_elasticsearch_data.md).
95+
96+
97+
- [`Ibexa\Contracts\Core\Search\FieldType\EmbeddingFieldFactory`](/api/php_api/php_api_reference/classes/Ibexa-Contracts-Core-Search-FieldType-EmbeddingFieldFactory.html) creates dedicated search fields that store embedding vectors
98+
99+
## Validation
100+
101+
- [`Ibexa\Contracts\Core\Repository\Values\Content\QueryValidatorInterface`](/api/php_api/php_api_reference/classes/Ibexa-Contracts-Core-Repository-Values-Content-QueryValidatorInterface.html) validates embedding query structure before execution

docs/search/search_api.md

Lines changed: 67 additions & 28 deletions
Original file line numberDiff line numberDiff line change
@@ -1,4 +1,5 @@
11
---
2+
month_change: true
23
description: You can search for content, locations and products by using the PHP API. Fine-tune the search with Search Criteria, Sort Clauses and Aggregations.
34
month_change: true
45
---
@@ -19,7 +20,7 @@ The service should be [injected into the constructor of your command or controll
1920

2021
`SearchService` is also used in the back office of [[= product_name =]], in components such as Universal Discovery Widget or Sub-items List.
2122

22-
### Performing a search
23+
### Perform search
2324

2425
To search through content you need to create a [`LocationQuery`](/api/php_api/php_api_reference/classes/Ibexa-Contracts-Core-Repository-Values-Content-LocationQuery.html) and provide your Search Criteria as a series of Criterion objects.
2526

@@ -71,7 +72,7 @@ As such, `query` is recommended when the search is based on user input.
7172
The difference between `query` and `filter` is only relevant when using Solr or Elasticsearch search engine.
7273
With the Legacy search engine both properties give identical results.
7374

74-
#### Processing large result sets
75+
#### Process large result sets
7576

7677
To process a large result set, use [`Ibexa\Contracts\Core\Repository\Iterator\BatchIterator`](/api/php_api/php_api_reference/classes/Ibexa-Contracts-Core-Repository-Iterator-BatchIterator.html).
7778
`BatchIterator` divides the results of search or filtering into smaller batches.
@@ -176,7 +177,7 @@ $filter
176177
It's recommended to use an IDE that can recognize type hints when working with Repository Filtering.
177178
If you try to use an unsupported Criterion or Sort Clause, the IDE indicates an issue.
178179

179-
## Searching in a controller
180+
## Search in controller
180181

181182
You can use the `SearchService` or repository filtering in a controller, as long as you provide the required parameters.
182183
For example, in the code below, `locationId` is provided to list all children of a location by using the `SearchService`.
@@ -197,7 +198,7 @@ When using Repository filtering, provide the results of `ContentService::find()`
197198
[[= include_file('code_samples/api/public_php_api/src/Controller/CustomFilterController.php', 16, 31) =]]
198199
```
199200

200-
### Paginating search results
201+
### Paginate search results
201202

202203
To paginate search or filtering results, it's recommended to use the [Pagerfanta library](https://github.com/BabDev/Pagerfanta) and [[[= product_name =]]'s adapters for it.](https://github.com/ibexa/core/blob/main/src/lib/Pagination/Pagerfanta/Pagerfanta.php)
203204

@@ -260,7 +261,7 @@ that doesn't belong to the provided Section:
260261
[[= include_file('code_samples/api/public_php_api/src/Command/FindComplexCommand.php', 46, 54) =]]
261262
```
262263

263-
### Combining independent Criteria
264+
### Combine independent Criteria
264265

265266
Criteria are independent of one another.
266267
This can lead to unexpected behavior, for instance because content can have multiple locations.
@@ -283,7 +284,7 @@ Even though the location B is hidden, the query finds the content because both c
283284
- the content item is visible (it has the visible location A)
284285

285286

286-
## Sorting results
287+
## Sort results
287288

288289
To sort the results of a query, use one or more [Sort Clauses](sort_clause_reference.md).
289290

@@ -297,27 +298,6 @@ For example, to order search results by their publication date, from oldest to n
297298

298299
For the full list and details of available Sort Clauses, see [Sort Clause reference](sort_clause_reference.md).
299300

300-
## Searching in trash
301-
302-
In the user interface, on the **Trash** screen, you can search for content items, and then sort the results based on different criteria.
303-
To search the trash with the API, use the `TrashService::findInTrash` method to submit a query for content items that are held in trash.
304-
Searching in trash supports a limited set of Criteria and Sort Clauses.
305-
For a list of supported Criteria and Sort Clauses, see [Search in trash reference](search_in_trash_reference.md).
306-
307-
!!! note
308-
309-
Searching through the trashed content items operates directly on the database, therefore you cannot use external search engines, such as Solr or Elasticsearch, and it's impossible to reindex the data.
310-
311-
``` php
312-
[[= include_file('code_samples/api/public_php_api/src/Command/FindInTrashCommand.php', 4, 6) =]]//...
313-
[[= include_file('code_samples/api/public_php_api/src/Command/FindInTrashCommand.php', 35, 42) =]]
314-
```
315-
316-
!!! caution
317-
318-
Make sure that you set the Criterion on the `filter` property.
319-
It's impossible to use the `query` property, because the search in trash operation filters the database instead of querying.
320-
321301
## Aggregation
322302

323303
!!! caution "Feature support"
@@ -380,4 +360,63 @@ $query->aggregations[] = new IntegerRangeAggregation('range', 'person', 'age',
380360
`null` means that a range doesn't have an end.
381361
In the example all values above (and including) 60 are included in the last range.
382362

383-
See [Agrregation reference](aggregation_reference.md) for details of all available aggregations.
363+
See [Aggregation reference](aggregation_reference.md) for details of all available aggregations.
364+
365+
## Search with embeddings
366+
367+
368+
!!! note "Feature support"
369+
370+
    Searching with embeddings requires a search engine that supports it, such as Elasticsearch or Solr 9.8.1+.
371+
372+
Embeddings are numerical representations that capture the meaning of text, images, or other content.
373+
AI providers generate embeddings by converting words or documents into lists of numbers, instead of treating them as plain text.
374+
Such lists, called vectors, can then be compared to find content with similar meaning.
375+
376+
Searching with embeddings enables matching content based on meaning rather than exact text matches.
377+
Instead of comparing keywords, the system compares vectors that represent the semantic meaning of content and the query input.
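
To make the idea of comparing vectors concrete, here is a minimal, standalone sketch that doesn't use the Ibexa API. It assumes cosine similarity as the comparison measure, which is a common choice, but the measure that your search engine applies depends on its configuration. The `cosineSimilarity` function and the sample vectors are made up for illustration.

``` php
<?php

declare(strict_types=1);

/**
 * Illustrative helper: cosine similarity of two equal-length vectors.
 * Values close to 1 mean that the vectors point in a similar direction.
 *
 * @param float[] $a
 * @param float[] $b
 */
function cosineSimilarity(array $a, array $b): float
{
    if ($a === [] || count($a) !== count($b)) {
        throw new InvalidArgumentException('Vectors must be non-empty and of equal length.');
    }

    $dot = 0.0;
    $normA = 0.0;
    $normB = 0.0;
    foreach ($a as $i => $value) {
        $dot += $value * $b[$i];
        $normA += $value ** 2;
        $normB += $b[$i] ** 2;
    }

    if ($normA === 0.0 || $normB === 0.0) {
        throw new InvalidArgumentException('Vectors must not consist of zeros only.');
    }

    return $dot / (sqrt($normA) * sqrt($normB));
}

// Made-up three-dimensional vectors; real embeddings have hundreds or thousands of dimensions.
$query = [0.9, 0.1, 0.0];
$similar = [0.8, 0.2, 0.1];
$unrelated = [0.0, 0.1, 0.9];

// The vector that is closer in direction to the query scores higher.
var_dump(cosineSimilarity($query, $similar) > cosineSimilarity($query, $unrelated)); // bool(true)
```

In an actual embedding query, the search engine performs this comparison against the indexed vectors, so you don't compute similarity in PHP yourself.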
378+
379+
!!! note "Taxonomy suggestions"
380+
381+
    Embedding queries have been introduced primarily to support the [Taxonomy suggestions](taxonomy.md#taxonomy-suggestions) feature, therefore embedding search integration is provided for `TaxonomyEmbedding`.
382+
383+
You can narrow down the search results, for example, by content type or location.
384+
To do this, combine searching with embeddings with filters.
385+
Repository search also respects the permissions of the current user.
386+
387+
An embedding query is represented by the [`Ibexa\Contracts\Core\Repository\Values\Content\EmbeddingQuery`](/api/php_api/php_api_reference/classes/Ibexa-Contracts-Core-Repository-Values-Content-EmbeddingQuery.html) value object.
388+
The object encapsulates the embedding used for similarity search and optional search parameters such as filtering, pagination, aggregations, and result counting.
389+
390+
### Use embedding queries in search
391+
392+
Embedding queries are executed through the search API in the same way as other search requests.
393+
You build an `EmbeddingQuery` instance by using a builder and pass it to the search service.
394+
395+
This example shows a minimal embedding query executed directly through the search service:
396+
397+
``` php hl_lines="38-39 41-47 49"
398+
[[= include_file('code_samples/api/public_php_api/src/Command/FindByTaxonomyEmbeddingCommand.php') =]]
399+
```
400+
401+
For more information, see [Embeddings reference](embeddings_reference.md).
402+
403+
## Search in trash
404+
405+
In the user interface, on the **Trash** screen, you can search for content items, and then sort the results based on different criteria.
406+
To search the trash with the API, use the `TrashService::findTrashItems` method to submit a query for content items that are held in trash.
407+
Searching in trash supports a limited set of Criteria and Sort Clauses.
408+
For a list of supported Criteria and Sort Clauses, see [Search in trash reference](search_in_trash_reference.md).
409+
410+
!!! note
411+
412+
    Searching through the trashed content items operates directly on the database, therefore you cannot use external search engines, such as Solr or Elasticsearch, and it's impossible to reindex the data.
413+
414+
``` php
415+
[[= include_file('code_samples/api/public_php_api/src/Command/FindInTrashCommand.php', 4, 6) =]]//...
416+
[[= include_file('code_samples/api/public_php_api/src/Command/FindInTrashCommand.php', 35, 42) =]]
417+
```
418+
419+
!!! caution
420+
421+
    Make sure that you set the Criterion on the `filter` property.
422+
    It's impossible to use the `query` property, because the search in trash operation filters the database instead of querying.

docs/search/search_criteria_and_sort_clauses.md

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -79,7 +79,7 @@ Available tags for Sort Clause handlers in Legacy Storage Engine are:
7979
- for Criterion handlers: `ibexa.core.trash.search.legacy.gateway.criterion_handler`
8080
- for Sort Clause handlers: `ibexa.core.trash.search.legacy.gateway.sort_clause_handler`
8181

82-
For more information about searching for content items in Trash, see [Searching in trash](search_api.md#searching-in-trash).
82+
For more information about searching for content items in Trash, see [Search in trash](search_api.md#search-in-trash).
8383

8484
For more information about the Criteria and Sort Clauses that are supported when searching for trashed content items, see [Search in trash reference](search_in_trash_reference.md).
8585

docs/search/search_in_trash_reference.md

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -6,7 +6,7 @@ month_change: false
66

77
# Search in trash reference
88

9-
When you [search for content items that are held in trash](search_api.md#searching-in-trash), you can apply only a limited subset of Search Criteria and Sort Clauses
9+
When you [search for content items that are held in trash](search_api.md#search-in-trash), you can apply only a limited subset of Search Criteria and Sort Clauses
1010
which can be used by [`Ibexa\Contracts\Core\Repository\TrashService::findTrashItems`](/api/php_api/php_api_reference/classes/Ibexa-Contracts-Core-Repository-TrashService.html#method_findTrashItems).
1111
Some sort clauses are exclusive to trash search.
1212
