diff --git a/.github/vale/styles/OpenSearch/SubstitutionsBritish.yml b/.github/vale/styles/OpenSearch/SubstitutionsBritish.yml index fbe0a178de..b3e7e7b37b 100644 --- a/.github/vale/styles/OpenSearch/SubstitutionsBritish.yml +++ b/.github/vale/styles/OpenSearch/SubstitutionsBritish.yml @@ -5,6 +5,7 @@ level: error action: name: replace swap: + 'acknowledgement': acknowledgment 'analyse': analyze 'authorise': authorize 'behaviour': behavior diff --git a/_search-plugins/search-relevance/judgments.md b/_search-plugins/search-relevance/judgments.md index 7729dd0ae3..4dae8030a4 100644 --- a/_search-plugins/search-relevance/judgments.md +++ b/_search-plugins/search-relevance/judgments.md @@ -10,28 +10,201 @@ has_children: false # Judgments A judgment is a relevance rating assigned to a specific document in the context of a particular query. Multiple judgments are grouped together into judgment lists. -Typically, judgments are categorized into two types---implicit and explicit: +Typically, judgments are categorized as two types---implicit and explicit: -* Implicit judgments are ratings that were derived from user behavior (for example, what did the user see and select after searching?) -* Explicit judgments were traditionally made by humans, but large language models (LLMs) are increasingly being used to perform this task. +- Implicit judgments are ratings derived from user behavior (for example, what did the user see and select after searching?). +- Humans have traditionally produced explicit judgments, but large language models (LLMs) are increasingly used for this task. -Search Relevance Workbench supports all types of judgments: +Search Relevance Workbench (SRW) supports all types of judgments: -* Generating implicit judgments based on data that adheres to the User Behavior Insights (UBI) schema specification. -* Using LLMs to generate judgments by connecting OpenSearch to an API or an internally or externally hosted model. -* Importing externally created judgments. 
+- Using LLMs as automated judges (an approach known as LLM-as-a-Judge) to generate judgments by evaluating search results using a prompt. +- Generating implicit judgments based on data that adheres to the User Behavior Insights (UBI) schema specification. +- Importing judgments that were collected using a process outside of SRW. -## Explicit judgments +## Using LLM-as-a-Judge -Search Relevance Workbench offers two ways to integrate explicit judgments: -* Importing judgments that were collected using a process outside of OpenSearch -* AI-assisted judgments that use LLMs +Generate explicit judgments with an LLM in SRW when you don't have human annotators available, or you need to scale up the number of judgments beyond what humans can provide. -### Importing judgments +For step-by-step instructions, see [Using LLM-as-a-Judge for search relevance]({{site.url}}{{site.baseurl}}/tutorials/llm-as-a-judge-tutorial/). -You may already have external processes for generating judgments. Regardless of the judgment type or the way it was generated, you can import it into Search Relevance Workbench. +### Prerequisites -#### Example request +To use LLM-as-a-Judge, configure the following components: + +- A connector to an LLM to use for generating the judgments. For more information, see [Connectors]({{site.url}}{{site.baseurl}}/ml-commons-plugin/remote-models/connectors/). +- A query set: Together with the `size` parameter, the query set defines the scope for generating judgments. For each query, the top k documents are retrieved from the specified index, in which k is defined by the `size` parameter. +- A search configuration: A search configuration defines how documents are retrieved for use in query-document pairs. + +The AI-assisted judgment process consists of the following steps: + +- For each query, the top k documents are retrieved using the defined search configuration, which includes the index information. 
The query and each document from the result list create a query-document pair. +- The LLM is then called with a predefined prompt to generate a judgment for each query-document pair. +- All generated judgments are stored in the judgments cache index for reuse in future experiments. + +To create a judgment list, provide the model ID of the LLM, an available query set, and a created search configuration. + +The following example uses a generic prompt template with a scale of 0.0 to 1.0. To reduce the volume of data sent to the LLM (and therefore the cost), use the `contextFields` parameter to specify which fields from each result to include: + +```json +PUT _plugins/_search_relevance/judgments +{ + "name":"AI-assisted judgment list", + "description": "Uses GPT-3.5-turbo to evaluate product search results", + "type":"LLM_JUDGMENT", + "modelId":"N8AE1osB0jLkkocYjz7D", + "querySetId":"5f0115ad-94b9-403a-912f-3e762870ccf6", + "searchConfigurationList":["2f90d4fd-bd5e-450f-95bb-eabe4a740bd1"], + "size":5, + "contextFields": ["title", "description", "category"], + "llmJudgmentRatingType": "SCORE0_1", + "promptTemplate": "Rate the relevance of these search results {% raw %}{{hits}}{% endraw %} for the query '{% raw %}{{queryText}}{% endraw %}' on a scale of 0-1, where 0 is completely irrelevant and 1 is perfectly relevant. Consider the product title, description, and category." +} +``` +{% include copy-curl.html %} + +### Request body fields + +The following table lists the parameters for creating LLM-based judgments. + +| Parameter | Data type | Description | +| :--- | :--- | :--- | +| `name` | String | The name of the judgment list. | +| `description` | String | Optional. A description of the judgment list. | +| `type` | String | Set to `LLM_JUDGMENT`. | +| `modelId` | String | The ID of the deployed machine learning (ML) model to use for generating judgments. Must be a remote model connected to an external LLM service. 
| +| `querySetId` | String | The ID of the query set containing the queries to evaluate. | +| `searchConfigurationList` | Array of strings | The list of search configuration IDs to use for retrieving documents to evaluate. | +| `size` | Integer | The number of top documents to retrieve and evaluate for each query. Default is `10`. | +| `tokenLimit` | Integer | The maximum number of tokens to send to the LLM in a single request. Used to batch documents when the total content exceeds this limit. Default is `4000`. | +| `contextFields` | Array of strings | Optional. Specifies which document fields to include when sending content to the LLM. If not specified, the entire document source is sent. Use this parameter to reduce costs and focus the LLM on relevant fields. | +| `ignoreFailure` | Boolean | Whether to continue processing other documents if the LLM fails to generate a judgment for some documents. Default is `false`. | +| `llmJudgmentRatingType` | String | The type of rating scale to use. Valid values are `SCORE0_1` (numeric scale 0--1) and `RELEVANT_IRRELEVANT` (binary relevant/irrelevant). Use `SCORE0_1` for graded relevance metrics such as NDCG. Use `RELEVANT_IRRELEVANT` for binary metrics such as precision and recall. | +| `promptTemplate` | String | Optional. A custom prompt template for the LLM. Supports {% raw %}`{{queryText}}`{% endraw %} and {% raw %}`{{hits}}`{% endraw %} placeholders. If not provided, the default template is used. | +| `overwriteCache` | Boolean | Whether to overwrite existing cached judgments for the same query-document pairs. Default is `false` (reuse cached judgments). 
| + +### Custom prompt templates + +You can customize the prompt template to focus on specific aspects of relevance: + +```json +PUT /_plugins/_search_relevance/judgments +{ + "name": "Custom Prompt Judgment", + "type": "LLM_JUDGMENT", + "modelId": "MODEL_ID_HERE", + "querySetId": "QUERY_SET_ID_HERE", + "searchConfigurationList": ["SEARCH_CONFIGURATION_ID_HERE"], + "promptTemplate": "As an e-commerce search expert, evaluate how well these products {% raw %}{{hits}}{% endraw %} match the user's search for '{% raw %}{{queryText}}{% endraw %}'. Consider product relevance, brand reputation, and price competitiveness. Rate each result from 0-1.", + "llmJudgmentRatingType": "SCORE0_1" +} +``` +{% include copy-curl.html %} + +### Binary relevance judgments + +For simpler relevance assessment, you can use binary (relevant/irrelevant) judgments: + +```json +PUT /_plugins/_search_relevance/judgments +{ + "name": "Binary LLM Judgment", + "type": "LLM_JUDGMENT", + "modelId": "MODEL_ID_HERE", + "querySetId": "QUERY_SET_ID_HERE", + "searchConfigurationList": ["SEARCH_CONFIGURATION_ID_HERE"], + "llmJudgmentRatingType": "RELEVANT_IRRELEVANT", + "promptTemplate": "Determine if these search results {% raw %}{{hits}}{% endraw %} are relevant or irrelevant for the query '{% raw %}{{queryText}}{% endraw %}'. Consider exact matches and semantic relevance." +} +``` +{% include copy-curl.html %} + +### Using different LLM providers + +You can adapt the connector configuration for other providers. 
+ +#### Amazon Bedrock example + +The following example creates a connector for Amazon Bedrock: + +```json +POST /_plugins/_ml/connectors/_create +{ + "name": "Amazon Bedrock Connector", + "description": "Connector to Amazon Bedrock", + "version": "1", + "protocol": "aws_sigv4", + "parameters": { + "region": "us-east-1", + "service_name": "bedrock", + "model": "anthropic.claude-v2" + }, + "credential": { + "access_key": "YOUR_ACCESS_KEY", + "secret_key": "YOUR_SECRET_KEY" + }, + "actions": [ + { + "action_type": "predict", + "method": "POST", + "url": "https://bedrock-runtime.${parameters.region}.amazonaws.com/model/${parameters.model}/invoke", + "request_body": "{ \"prompt\": \"${parameters.messages}\", \"max_tokens_to_sample\": 300 }" + } + ] +} +``` +{% include copy-curl.html %} + +## Implicit judgments + +Implicit judgments are derived from past user interactions. SRW supports the Clicks Over Expected Clicks (COEC) click model, which uses *impression* and *click* signals to calculate judgments. + +Input data must follow the [UBI index schemas]({{site.url}}{{site.baseurl}}/search-plugins/ubi/schemas/). COEC uses every event in the `ubi_events` index with an `action_name` of `impression` or `click`. + +COEC calculates an expected click-through rate (CTR) for each rank by dividing the total number of clicks by the total number of impressions observed at that rank, based on all events in `ubi_events`. This ratio represents the expected CTR for that position. + +For each document displayed in a hit list after a query, the average CTR at that rank serves as the expected value for the query-document pair. COEC calculates the actual CTR for the query-document pair and divides it by this expected rank-based CTR. Consequently, query-document pairs with a higher CTR than the average for that rank have a judgment value greater than 1. Conversely, if the CTR is lower than average, the judgment value is lower than 1. 
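As a rough illustration of the calculation described above, the following Python sketch computes COEC values from a few made-up events. The tuple layout, the function name `coec_judgments`, and the sample data are assumptions for illustration only. This is not SRW's implementation, and it assumes that each query-document pair is shown at a single rank.

```python
from collections import defaultdict


def coec_judgments(events):
    """Compute COEC values from (query, doc_id, rank, action_name) tuples.

    Simplified sketch: each query-document pair is assumed to appear at one rank.
    """
    rank_impressions = defaultdict(int)
    rank_clicks = defaultdict(int)
    pair_impressions = defaultdict(int)
    pair_clicks = defaultdict(int)
    pair_rank = {}

    for query, doc_id, rank, action_name in events:
        pair = (query, doc_id)
        if action_name == "impression":
            rank_impressions[rank] += 1
            pair_impressions[pair] += 1
            pair_rank[pair] = rank  # rank at which the pair was shown
        elif action_name == "click":
            rank_clicks[rank] += 1
            pair_clicks[pair] += 1

    judgments = {}
    for pair, impressions in pair_impressions.items():
        rank = pair_rank[pair]
        # Expected CTR for the rank: all clicks / all impressions at that rank.
        expected_ctr = rank_clicks[rank] / rank_impressions[rank]
        if expected_ctr == 0:
            continue  # no clicks at this rank, so the ratio is undefined
        # Actual CTR for the pair, divided by the rank-based expectation.
        actual_ctr = pair_clicks[pair] / impressions
        judgments[pair] = actual_ctr / expected_ctr
    return judgments


# Two documents, both shown four times at rank 1 (hypothetical data).
events = (
    [("tv", "A", 1, "impression")] * 4 + [("tv", "A", 1, "click")] * 3
    + [("laptop", "B", 1, "impression")] * 4 + [("laptop", "B", 1, "click")] * 1
)
result = coec_judgments(events)
print(result)  # {('tv', 'A'): 1.5, ('laptop', 'B'): 0.5}
```

In this data, the expected CTR at rank 1 is 4 clicks / 8 impressions = 0.5. Document `A` has an actual CTR of 0.75, giving a judgment of 1.5, and document `B` has an actual CTR of 0.25, giving a judgment of 0.5.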
+ +Depending on the tracking implementation, multiple clicks for a single query can be recorded in the `ubi_events` index. Consequently, the average CTR can sometimes exceed 1 (or 100%). +{: .note} + +For query-document observations that occur at different positions, all impressions and clicks are assumed to have occurred at the lowest (best) position. This aggregation approach biases the final judgment toward lower values, reflecting the common trend that higher-ranked results typically receive higher CTRs. +{: .note} + +### Example request + +The following example creates an implicit judgment list using the COEC click model: + +```json +PUT _plugins/_search_relevance/judgments +{ + "name": "Implicit Judgments", + "clickModel": "coec", + "type": "UBI_JUDGMENT", + "maxRank": 20 +} +``` +{% include copy-curl.html %} + +### Request body fields + +The following table lists the parameters for creating implicit judgments. + +| Parameter | Data type | Description | +| :--- | :--- | :--- | +| `name` | String | The name of the judgment list. | +| `clickModel` | String | The model used to calculate implicit judgments. Only `coec` (Clicks Over Expected Clicks) is supported. | +| `type` | String | Set to `UBI_JUDGMENT`. | +| `maxRank` | Integer | The maximum rank to consider when including events in the judgment calculation. | +| `startDate` | Date | An optional starting date from which behavioral data events are considered for implicit judgment generation. The format is `yyyy-MM-dd`. | +| `endDate` | Date | An optional end date until which behavioral data events are considered for implicit judgment generation. The format is `yyyy-MM-dd`. | + +## Importing judgments + +You may already have external processes for generating judgments. Regardless of the judgment type or the way they were generated, you can import them into SRW. 
+ +### Example request + +The following example imports a set of judgments for two queries: ```json PUT _plugins/_search_relevance/judgments @@ -95,94 +268,24 @@ PUT _plugins/_search_relevance/judgments ``` {% include copy-curl.html %} -#### Request body fields - -The process of importing judgments supports the following parameters. - -Parameter | Data type | Description -`name` | String | The name of the judgment list. -`description` | String | An optional description of the judgment list. -`type` | String | Set to `IMPORT_JUDGMENT`. -`judgmentRatings` | Array | A list of JSON objects containing the judgments. Judgments are grouped by query, each containing a nested map in which document IDs (`docId`) serve as keys and their floating-point ratings serve as values. - -### Creating AI-assisted judgments - -If you want to use judgments in your experimentation process but do not have a team of humans or the user behavior data to calculate judgments based on interactions, you can use an LLM in Search Relevance Workbench to generate judgments. -#### Prerequisites - -To use AI-assisted judgment generation, ensure that you have configured the following components: - -* A connector to an LLM to use for generating the judgments. For more information, see [Creating connectors for third-party ML platforms]({{site.url}}{{site.baseurl}}/ml-commons-plugin/remote-models/connectors/). -* A query set: Together with the `size` parameter, the query set defines the scope for generating judgments. For each query, the top k documents are retrieved from the specified index, where k is defined in the `size` parameter. -* A search configuration: A search configuration defines how documents are retrieved for use in query/document pairs. +### Request body fields -The AI-assisted judgment process works as follows: -- For each query, the top k documents are retrieved using the defined search configuration, which includes the index information. 
The query and each document from the result list create a query/document pair. -- Each query and document pair forms a query/document pair. -- The LLM is then called with a predefined prompt (stored as a static variable in the backend) to generate a judgment for each query/document pair. -- All generated judgments are stored in the judgments index for reuse in future experiments. +The following table lists the parameters for importing judgments. -To create a judgment list, provide the model ID of the LLM, an available query set, and a created search configuration: - - -```json -PUT _plugins/_search_relevance/judgments -{ - "name":"AI-assisted judgment list", - "type":"LLM_JUDGMENT", - "querySetId":"5f0115ad-94b9-403a-912f-3e762870ccf6", - "searchConfigurationList":["2f90d4fd-bd5e-450f-95bb-eabe4a740bd1"], - "size":5, - "modelId":"N8AE1osB0jLkkocYjz7D", - "contextFields":[] -} -``` -{% include copy-curl.html %} - -## Implicit judgments - -Implicit judgments are derived from user interactions. Several models use signals from user behavior to calculate these judgments. One such model is Clicks Over Expected Clicks (COEC), a click model implemented in Search Relevance Workbench. -The data used to derive relevance labels is based on past user behavior. The data follows the [User Behavior Insights schema specification]({{site.url}}{{site.baseurl}}/search-plugins/ubi/schemas/). The two key interaction types for implicit judgments are *impressions* and *clicks* that occur after a user query. In practice, this means that all events in the `ubi_events` index with an `impression` or `click` recorded in the `action_name` field are used to model implicit judgments. -COEC calculates an expected click-through rate (CTR) for each rank. It does this by dividing the total number of clicks by the total number of impressions observed at that rank, based on all events in `ubi_events`. This ratio represents the expected CTR for that position. 
- -For each document displayed in a hit list after a query, the average CTR at that rank serves as the expected value for the query/document pair. COEC calculates the actual CTR for the query/document pair and divides it by this expected rank-based CTR. This means that query/document pairs with a higher CTR than the average for that rank will have a judgment value greater than 1. Conversely, if the CTR is lower than average, the judgment value will be lower than 1. - -Note that depending on the tracking implementation, multiple clicks for a single query can be recorded in the `ubi_events` index. As a result, the average CTR can sometimes exceed 1 (or 100%). -For query-document observations that occur at different positions, all impressions and clicks are assumed to have occurred at the lowest (best) position. This approach biases the final judgment toward lower values, reflecting the common trend that higher-ranked results typically receive higher CTRs. -{: .note} - -#### Example request - -```json -PUT _plugins/_search_relevance/judgments -{ - "name": "Implicit Judgements", - "clickModel": "coec", - "type": "UBI_JUDGMENT", - "maxRank": 20 -} -``` -{% include copy-curl.html %} - -#### Request body fields - -The process of creating implicit judgments supports the following parameters. - -Parameter | Data type | Description -`name` | String | The name of the judgment list. -`clickModel` | String | The model used to calculate implicit judgments. Only `coec` (Clicks Over Expected Clicks) is supported. -`type` | String | Set to `UBI_JUDGMENT`. -`maxRank` | Integer | The maximum rank to consider when including events in the judgment calculation. -`startDate` | Date | The optional starting date from which behavioral data events are considered for implicit judgment generation. The format is`yyyy-MM-dd`. -`endDate` | Date | The optional end date until which behavioral data events are considered for implicit judgment generation. The format is`yyyy-MM-dd`. 
+| Parameter | Data type | Description | +| :--- | :--- | :--- | +| `name` | String | The name of the judgment list. | +| `description` | String | An optional description of the judgment list. | +| `type` | String | Set to `IMPORT_JUDGMENT`. | +| `judgmentRatings` | Array | A list of JSON objects containing the judgments. Judgments are grouped by query, each containing a nested map in which document IDs (`docId`) serve as keys and their floating-point ratings serve as values. | ## Managing judgment lists You can retrieve or delete judgment lists using the following APIs. -### View a judgment list +### Viewing a judgment list -You can retrieve a judgment list using the judgment list ID. +Retrieve a judgment list by its ID. #### Endpoint @@ -190,7 +293,7 @@ You can retrieve a judgment list using the judgment list ID. GET _plugins/_search_relevance/judgments/{judgment_list_id} ``` -### Path parameters +#### Path parameters The following table lists the available path parameters. @@ -301,9 +404,9 @@ GET _plugins/_search_relevance/judgments/b54f791a-3b02-49cb-a06c-46ab650b2ade -### Delete a judgment list +### Deleting a judgment list -You can delete a judgment list using the judgment list ID. +Delete a judgment list by its ID. #### Endpoint @@ -337,9 +440,9 @@ DELETE _plugins/_search_relevance/judgments/b54f791a-3b02-49cb-a06c-46ab650b2ade } ``` -### Search for a judgment list +### Searching for a judgment list -You can search for available judgment lists using query DSL. By default, the `judgmentRatings.ratings` data is not returned. To include the `judgmentRatings.ratings` data, specify the `_source` field in the query. +Search for judgment lists using query domain-specific language (DSL). The response excludes `judgmentRatings.ratings` by default; to include it, specify the `_source` field in the query. 
#### Endpoints @@ -348,9 +451,9 @@ GET _plugins/_search_relevance/judgments/_search POST _plugins/_search_relevance/judgments/_search ``` -#### Example request: +#### Example request -Search for judgment lists that include the exact query `red dress`: +The following example searches for judgment lists that include the exact query `red dress`: ```json GET _plugins/_search_relevance/judgments/_search @@ -413,3 +516,7 @@ GET _plugins/_search_relevance/judgments/_search } } ``` + +## Related documentation + +- [Automate search relevance evaluation using LLMs]({{site.url}}{{site.baseurl}}/tutorials/llm-as-a-judge-tutorial/) \ No newline at end of file diff --git a/_tutorials/index.md b/_tutorials/index.md index 632b0066a7..f031f7af8b 100644 --- a/_tutorials/index.md +++ b/_tutorials/index.md @@ -9,13 +9,14 @@ permalink: /tutorials/ redirect_from: - /ml-commons-plugin/tutorials/ - /ml-commons-plugin/tutorials/index/ -cards: +getting_started_cards: - heading: "Searching data 101" description: "Learn the fundamentals of search and explore OpenSearch query languages and types" link: "/getting-started/search-data/" - heading: "OpenSearch Dashboards" description: "Start visualizing your data with interactive dashboards and powerful analytics tools" link: "/dashboards/quickstart/" +tutorial_cards: - heading: "Vector search" description: "Implement similarity search using vectors and enhance results with AI capabilities" link: "/tutorials/vector-search/" @@ -28,11 +29,23 @@ cards: - heading: "Faceted search" description: "Build filterable search experiences for applications like e-commerce or location search" link: "/tutorials/faceted-search/" + - heading: "LLM-as-a-Judge" + description: "Automate search relevance evaluation using LLMs" + link: "/tutorials/llm-as-a-judge-tutorial/" --- # Tutorials -Follow our step-by-step tutorials to learn how to use OpenSearch features. +Follow step-by-step tutorials to learn how to use OpenSearch features. 
-{% include cards.html cards=page.cards %} +## Getting started +Learn the basics of searching and visualizing data in OpenSearch. + +{% include cards.html cards=page.getting_started_cards %} + +## Building search features using OpenSearch + +Implement specific search features end to end. + +{% include cards.html cards=page.tutorial_cards %} diff --git a/_tutorials/llm-as-a-judge-tutorial.md b/_tutorials/llm-as-a-judge-tutorial.md new file mode 100644 index 0000000000..84640c942d --- /dev/null +++ b/_tutorials/llm-as-a-judge-tutorial.md @@ -0,0 +1,198 @@ +--- +layout: default +title: LLM-as-a-Judge +has_children: false +nav_order: 70 +--- + +# Using LLM-as-a-Judge for search relevance + +LLM-as-a-Judge is a technique that uses large language models (LLMs) to automatically evaluate search result relevance. Manually annotating search results is time-consuming, and ratings can be inconsistent across annotators. LLM-as-a-Judge automates this process, enabling frequent and repeatable evaluation of search quality. + +After completing this tutorial, you can [run an experiment to evaluate search quality]({{site.url}}{{site.baseurl}}/search-plugins/search-relevance/evaluate-search-quality/#creating-a-pointwise-experiment) using the LLM-generated judgments. + +## Prerequisites + +For this tutorial, you need an API key for an external LLM provider (for example, OpenAI or Amazon Bedrock). + +Using an external LLM incurs API costs based on the number of queries and results evaluated. +{: .note} + +Enable Search Relevance Workbench and configure the following settings: + +```json +PUT /_cluster/settings +{ + "persistent": { + "plugins.search_relevance.workbench_enabled": true, + "plugins.ml_commons.only_run_on_ml_node": "false", + "plugins.ml_commons.model_access_control_enabled": "true", + "plugins.ml_commons.allow_registering_model_via_url": "true" + } +} +``` +{% include copy-curl.html %} + +### Step 1: Configure a model + +First, create a connector to an externally hosted LLM. 
This tutorial uses OpenAI, but you can adapt it for other providers such as Amazon Bedrock. Replace `YOUR_API_KEY` with your OpenAI API key: + +```json +POST /_plugins/_ml/connectors/_create +{ + "name": "OpenAI Chat Connector", + "description": "Connector to OpenAI Chat API for LLM judgments", + "version": "1", + "protocol": "http", + "parameters": { + "endpoint": "api.openai.com", + "model": "gpt-3.5-turbo" + }, + "credential": { + "openAI_key": "YOUR_API_KEY" + }, + "actions": [ + { + "action_type": "predict", + "method": "POST", + "url": "https://api.openai.com/v1/chat/completions", + "headers": { + "Authorization": "Bearer ${credential.openAI_key}", + "Content-Type": "application/json" + }, + "request_body": "{ \"model\": \"${parameters.model}\", \"messages\": ${parameters.messages}, \"temperature\": 0 }" + } + ] +} +``` +{% include copy-curl.html %} + +Then register and deploy the model. Replace `{connector_id}` with the ID returned in the previous response: + +```json +POST /_plugins/_ml/models/_register?deploy=true +{ + "name": "openai_gpt-3.5-turbo", + "function_name": "remote", + "description": "External LLM model via OpenAI", + "connector_id": "{connector_id}" +} +``` +{% include copy-curl.html %} + +This is an asynchronous operation. To verify the task status, use the [Get ML task]({{site.url}}{{site.baseurl}}/ml-commons-plugin/api/tasks-apis/get-task/) API. Once the state is `COMPLETED`, OpenSearch returns the `model_id` you'll use in the following steps. 
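If you script this step, the status check can be wrapped in a small polling loop. The following Python sketch is one possible approach, not part of the tutorial itself. The `get_task` callable is hypothetical: it stands for any function you supply that calls the Get ML task API (`GET /_plugins/_ml/tasks/<task_id>`) and returns the parsed JSON response. The `FAILED` state and the `error` field are assumptions about that response.

```python
import time


def wait_for_model_id(get_task, task_id, interval_s=2.0, max_attempts=30):
    """Poll an ML task until it completes, then return its model_id.

    get_task: caller-supplied (hypothetical) function that takes a task ID
    and returns the Get ML task response as a dict.
    """
    for _ in range(max_attempts):
        task = get_task(task_id)
        state = task.get("state")
        if state == "COMPLETED":
            return task["model_id"]
        if state == "FAILED":  # assumed failure state
            raise RuntimeError(f"Task {task_id} failed: {task.get('error')}")
        time.sleep(interval_s)  # wait before checking again
    raise TimeoutError(f"Task {task_id} did not complete after {max_attempts} checks")
```

Passing the HTTP call in as a function keeps the loop independent of any particular client library and makes it straightforward to test with canned responses.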
+ +### Step 2: Create a search index + +Create a `products` index: + +```json +PUT /products +{ + "mappings": { + "properties": { + "title": { "type": "text" }, + "description": { "type": "text" }, + "category": { "type": "keyword" }, + "brand": { "type": "keyword" }, + "price": { "type": "float" } + } + } +} +``` +{% include copy-curl.html %} + +Index example documents into the index: + +```json +POST /products/_bulk +{"index":{"_id":"1"}} +{"title":"Samsung 55-inch 4K Smart TV","description":"Ultra HD Smart TV with HDR and built-in streaming apps","category":"Electronics","brand":"Samsung","price":599.99} +{"index":{"_id":"2"}} +{"title":"LG 65-inch OLED TV","description":"Premium OLED display with perfect blacks and vibrant colors","category":"Electronics","brand":"LG","price":1299.99} +{"index":{"_id":"3"}} +{"title":"Sony Wireless Headphones","description":"Noise-canceling over-ear headphones with 30-hour battery","category":"Electronics","brand":"Sony","price":199.99} +{"index":{"_id":"4"}} +{"title":"Apple MacBook Pro 14-inch","description":"Professional laptop with M2 chip and Retina display","category":"Computers","brand":"Apple","price":1999.99} +{"index":{"_id":"5"}} +{"title":"Dell Gaming Monitor 27-inch","description":"High refresh rate gaming monitor with G-Sync support","category":"Computers","brand":"Dell","price":399.99} +``` +{% include copy-curl.html %} + +### Step 3: Create a search configuration + +A _search configuration_ defines a search strategy to evaluate. 
The `%SearchText%` placeholder is replaced with each query from the query set during evaluation: + +```json +PUT /_plugins/_search_relevance/search_configurations +{ + "name": "baseline", + "query": "{\"query\":{\"multi_match\":{\"query\":\"%SearchText%\",\"fields\":[\"title\",\"description\",\"category\",\"brand\"]}}}", + "index": "products" +} +``` +{% include copy-curl.html %} + +### Step 4: Create a query set + +Create a query set containing test queries for evaluation: + +```json +PUT /_plugins/_search_relevance/query_sets +{ + "name": "Electronics Queries", + "description": "Test queries for electronics products", + "sampling": "manual", + "querySetQueries": [ + {"queryText": "smart tv"}, + {"queryText": "laptop computer"}, + {"queryText": "wireless headphones"} + ] +} +``` +{% include copy-curl.html %} + +### Step 5: Generate LLM judgments + +Create an LLM judgment that uses your deployed model to evaluate search results. Replace `{model_id}`, `{query_set_id}`, and `{search_configuration_id}` with the IDs returned in previous steps: + +```json +PUT /_plugins/_search_relevance/judgments +{ + "name": "LLM Judgment via OpenAI", + "description": "Uses GPT-3.5-turbo to evaluate product search results", + "type": "LLM_JUDGMENT", + "modelId": "{model_id}", + "querySetId": "{query_set_id}", + "searchConfigurationList": ["{search_configuration_id}"], + "size": 10, + "tokenLimit": 4000, + "contextFields": ["title", "description", "category"], + "ignoreFailure": false, + "llmJudgmentRatingType": "SCORE0_1", + "promptTemplate": "Rate the relevance of these search results {% raw %}{{hits}}{% endraw %} for the query '{% raw %}{{queryText}}{% endraw %}' on a scale of 0-1, where 0 is completely irrelevant and 1 is perfectly relevant. 
Consider the product title, description, and category.", + "overwriteCache": false +} +``` +{% include copy-curl.html %} + +For a description of all request body parameters, see [Judgments]({{site.url}}{{site.baseurl}}/search-plugins/search-relevance/judgments/#request-body-fields). + +The judgment process runs asynchronously. To verify the status, retrieve the judgment by its ID: + +```json +GET /search-relevance-judgment/_doc/{judgment_id} +``` +{% include copy-curl.html %} + +When the `status` field is `COMPLETED`, the `judgmentRatings` array contains the generated relevance scores for each query-document pair. + +## Next steps + +You are now ready to [run an experiment to evaluate search quality]({{site.url}}{{site.baseurl}}/search-plugins/search-relevance/evaluate-search-quality/#creating-a-pointwise-experiment) with the LLM-generated judgments. The search configuration and query set that you created during this tutorial can serve as inputs for your first evaluation. + +## Related documentation + +- [Search Relevance Workbench]({{site.url}}{{site.baseurl}}/search-plugins/search-relevance/using-search-relevance-workbench/) +- [Using LLM-as-a-Judge]({{site.url}}{{site.baseurl}}/search-plugins/search-relevance/judgments/#using-llm-as-a-judge) +- [Connecting to externally hosted models]({{site.url}}{{site.baseurl}}/ml-commons-plugin/remote-models/index/)