RHIDP-12086: Lightspeed Evaluation Framework documentation (redhat-developer#2161)

pabel-rh · web-flow · commit ff9e623ce937 · 2026-05-14T11:58:29.000+02:00
* Lightspeed Evaluation Framework documentation

* Incorporate CQA comments

* CQA changes 2

* CQA check

* Minor fix

* Incorporating Gerry's comments

* JTBD updates

* Incorporated Heena's comment
diff --git a/assemblies/shared/assembly-ai-model-evaluation-data-to-select-the-right-ai-model.adoc b/assemblies/shared/assembly-ai-model-evaluation-data-to-select-the-right-ai-model.adoc
@@ -0,0 +1,36 @@
+:_mod-docs-content-type: ASSEMBLY
+ifdef::context[:parent-context: {context}]
+
+[id="ai-model-evaluation-data-to-select-the-right-ai-model_{context}"]
+= AI model evaluation data to select the right AI model
+
+:context: assembly-evaluate-developer-lightspeed-performance
+
+[role="_abstract"]
+Use the {ls-brand-name} evaluation framework to validate the performance, accuracy, and reliability of {ls-short}.
+
+With this automated toolset, you can measure how effectively various large language models (LLMs) answer questions based on {product} documentation.
+
+.Components of the evaluation framework
+[cols="1,3",options="header"]
+|===
+| Component | Description
+| Evaluation framework | Contains the core logic and scripts used to run evaluations.
+| Datasets | Includes the input files used to test the model.
+| Evaluation metrics integration | Provides scoring through Ragas metrics, DeepEval metrics, and custom metrics. Ragas metrics are the primary metrics used to validate {ls-short} performance.
+|===
+
+include::../modules/shared/proc-configure-the-evaluation-environment-to-validate-model-accuracy.adoc[leveloffset=+1]
+
+include::../modules/shared/proc-prepare-evaluation-datasets-to-verify-ai-generated-responses.adoc[leveloffset=+1]
+
+include::../modules/shared/proc-run-performance-tests-to-ensure-ai-response-reliability.adoc[leveloffset=+1]
+
+include::../modules/shared/proc-analyze-evaluation-results-to-identify-performance-gaps.adoc[leveloffset=+1]
+
+include::../modules/shared/ref-evaluation-metrics-and-historical-data-reference.adoc[leveloffset=+1]
+
+include::../modules/shared/ref-release-report-and-historical-data.adoc[leveloffset=+1]
+
+ifdef::parent-context[:context: {parent-context}]
+ifndef::parent-context[:!context:]
diff --git a/modules/shared/proc-analyze-evaluation-results-to-identify-performance-gaps.adoc b/modules/shared/proc-analyze-evaluation-results-to-identify-performance-gaps.adoc
@@ -0,0 +1,22 @@
+:_mod-docs-content-type: PROCEDURE
+
+[id="analyze-evaluation-results-to-identify-performance-gaps_{context}"]
+= Analyze evaluation results to identify performance gaps
+
+[role="_abstract"]
+Determine the performance of {ls-short} and identify documentation areas that require model improvement by analyzing evaluation results in the repository. You can use these reports to compare performance across different large language models (LLMs) and topics.
+
+.Prerequisites
+* You must have access to the link:https://github.com/redhat-ai-dev/developer-lightspeed-evaluation/tree/main[`developer-lightspeed-evaluation` repository].
+
+.Procedure
+
+. In the root of the repository, navigate to the version-specific folder within the link:https://github.com/redhat-ai-dev/developer-lightspeed-evaluation/tree/main/evaluation-result[`/evaluation-result`] directory.
+. Open the following files to evaluate performance:
+
+** Model Pass Rate: Compare the overall performance between different LLMs.
+** Topic Pass Rate: Identify performance trends and gaps within specific documentation areas.
+
+.Verification
+
+* Verify that the reports display data visualizations or metrics consistent with your recent evaluation run.
diff --git a/modules/shared/proc-configure-the-evaluation-environment-to-validate-model-accuracy.adoc b/modules/shared/proc-configure-the-evaluation-environment-to-validate-model-accuracy.adoc
@@ -0,0 +1,62 @@
+:_mod-docs-content-type: PROCEDURE
+
+[id="configure-the-evaluation-environment-to-validate-model-accuracy_{context}"]
+= Configure the evaluation environment to validate model accuracy
+
+[role="_abstract"]
+Set up the evaluation environment to validate the performance and accuracy of {ls-short}. Use this environment to check that the model correctly interprets documentation and provides dependable answers.
+
+By performing these evaluations, you reduce the risk of the model delivering incorrect or hallucinated information to users in production.
+
+.Prerequisites
+
+* You have installed *uv* for Python package management. The evaluation environment requires Python 3.11 or later.
+
+.Procedure
+
+. Clone the evaluation repository and navigate to the directory:
++
+[source,bash]
+----
+git clone https://github.com/lightspeed-core/lightspeed-evaluation
+cd lightspeed-evaluation
+----
+
+. Synchronize the environment and install dependencies:
++
+[source,bash]
+----
+uv sync
+----
+
+. Configure the environment variables for the judge LLM. You can create a `.env` file in the root directory of the repository or export the keys in your terminal session.
+** If you use Gemini, set the Gemini API key:
++
+[source,bash]
+----
+export GEMINI_API_KEY="your-google-api-key"
+----
+** If you use OpenAI, set the OpenAI API key:
++
+[source,bash]
+----
+export OPENAI_API_KEY="your-key"
+----
+
+. Optional: If you test with a live service, set your {ls-short} service API key:
++
+[source,bash]
+----
+export API_KEY="your-lightspeed-service-key"
+----
+
+.Verification
+
+* Verify that the environment is synchronized and that `uv` runs a supported Python version:
++
+[source,bash]
+----
+uv run python --version
+----
++
+The output shows Python 3.11 or later.
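Note on step 3 of the procedure above: a minimal sketch of the `.env` alternative, assuming the evaluation tool reads `KEY=value` lines from a `.env` file in the repository root, as the step suggests. `ENV_FILE` is a hypothetical override added here for illustration and is not part of the tool. The key value is the same placeholder used in the export example.

```shell
# Sketch only: store the judge LLM key in a .env file instead of exporting it.
# ENV_FILE is a hypothetical variable for illustration; it defaults to .env.
env_file="${ENV_FILE:-.env}"

# Write the file inside a subshell with a restrictive umask so that a newly
# created file is readable by the owner only.
(
  umask 077
  printf 'GEMINI_API_KEY=%s\n' "your-google-api-key" > "$env_file"
)

# Tighten permissions as well, in case the file already existed.
chmod 600 "$env_file"
```

Keep the `.env` file out of version control, because it contains a secret.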
diff --git a/modules/shared/proc-prepare-evaluation-datasets-to-verify-ai-generated-responses.adoc b/modules/shared/proc-prepare-evaluation-datasets-to-verify-ai-generated-responses.adoc
@@ -0,0 +1,39 @@
+:_mod-docs-content-type: PROCEDURE
+
+[id="prepare-evaluation-datasets-to-verify-ai-generated-responses_{context}"]
+= Prepare evaluation datasets to verify AI-generated responses
+
+[role="_abstract"]
+Prepare evaluation datasets to test the performance of {ls-short}. You can use pre-generated AI datasets for specific {product} releases or generate custom AI datasets from your own documentation.
+
+.Prerequisites
+
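+* You have cloned the link:https://github.com/redhat-ai-dev/developer-lightspeed-evaluation/tree/main[`developer-lightspeed-evaluation` repository] to your local machine.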
+
+.Procedure
+
+. Download pre-generated datasets: Use this method to test the performance of specific {product-very-short} releases. These datasets are generated using link:https://docs.ragas.io/en/stable/concepts/test_data_generation/rag/[Ragas testset generation for RAG].
+
+.. In your terminal, navigate to the link:https://github.com/redhat-ai-dev/developer-lightspeed-evaluation/tree/main/dataset[`/dataset`] folder in the evaluation repository.
+.. Locate the `.evaluation_dataset_yaml` files. These files are pre-configured for the evaluation tool.
+.. To test a historical release, switch to the corresponding branch.
++
+--
+For example, to access the {product} 1.8 dataset, switch to the `1.8` branch.
+
+[IMPORTANT]
+====
+The `main` branch contains work-in-progress (WIP) datasets. Avoid using this branch for stable evaluations.
+====
+--
+
+. Generate custom datasets: Use this method to create a new test set from your own technical documentation.
+
+.. Generate a diverse set of question-and-answer (Q&A) pairs by following the link:https://docs.ragas.io/en/stable/concepts/test_data_generation/rag/[Ragas test data generation documentation].
+
+.. Ensure your Q&A pairs match the required format by link:https://github.com/lightspeed-core/lightspeed-evaluation?tab=readme-ov-file[reviewing the evaluation data structure configuration].
+
+.Verification
+
+* Verify that your custom dataset matches the required schema before you start the evaluation run.
+
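The branch switch in step 1 of the procedure above can be sketched as a small shell helper. `use_release_dataset` is a hypothetical name. The sketch assumes that the repository is already cloned at the path you pass and that release branches are named after the version, as in the `1.8` example.

```shell
# Hypothetical helper: check out a release branch in a local clone and
# list the contents of its dataset folder.
use_release_dataset() {
  repo_dir=$1   # path to the local clone (assumed to exist)
  branch=$2     # release branch name, for example 1.8

  # Fail early if the branch cannot be checked out.
  git -C "$repo_dir" checkout --quiet "$branch" || return 1

  # The dataset folder name follows the procedure above.
  ls "$repo_dir/dataset"
}
```

For example, `use_release_dataset ./developer-lightspeed-evaluation 1.8` lists the dataset files for the 1.8 release if that branch exists in your clone.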
diff --git a/modules/shared/proc-run-performance-tests-to-ensure-ai-response-reliability.adoc b/modules/shared/proc-run-performance-tests-to-ensure-ai-response-reliability.adoc
@@ -0,0 +1,41 @@
+:_mod-docs-content-type: PROCEDURE
+
+[id="run-performance-tests-to-ensure-ai-response-reliability_{context}"]
+= Run performance tests to ensure AI response reliability
+
+[role="_abstract"]
+Use the evaluation framework to run performance tests in one of two modes: static mode, which evaluates pre-recorded responses, or dynamic mode, which calls a live service.
+
+These evaluations identify performance gaps, allow you to compare different large language models (LLMs), and help ensure that {ls-short} provides reliable information to users.
+
+.Prerequisites
+
+* You must link:https://github.com/lightspeed-core/lightspeed-evaluation#installation[install and configure the evaluation environment].
+* You must prepare an evaluation dataset.
+
+.Procedure
+. Download the link:https://github.com/lightspeed-core/lightspeed-evaluation/blob/main/config/system.yaml[`system.yaml` configuration template] from the repository.
+. Configure the parameters in the `system.yaml` file based on your evaluation mode:
++
+[cols="1,3",options="header"]
+|===
+| Field | Description
+| `llm` | Defines the judge LLM that scores the responses, such as `gemini-2.5-pro`.
+| `api.enabled` | Set to `false` for static mode to use pre-filled data. Set to `true` for dynamic mode to call a live service.
+| `api.api_base` | (Required for dynamic mode only) Provide the URL of your {ls-short} service.
+| `api.endpoint_type` | Specify the type of service endpoint to call: `streaming` or `query`.
+|===
+
+. Execute the evaluation by using the `lightspeed-eval` command:
++
+[source,bash]
+----
+lightspeed-eval \
+  --system-config config/system.yaml \
+  --eval-data config/evaluation_data.yaml \
+  --output-dir ./my_evaluation_results
+----
+
+.Verification
+
+* Navigate to the specified output directory and verify that the generated reports contain the model performance scores.
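A pre-flight check for the command in step 3 of the procedure above, sketched as a shell function. `check_inputs` is a hypothetical helper and is not part of `lightspeed-eval`. It only confirms that the files passed to `--system-config` and `--eval-data` exist before the run starts.

```shell
# Hypothetical helper: report every missing input file and return
# non-zero if at least one is missing.
check_inputs() {
  missing=0
  for f in "$@"; do
    if [ ! -f "$f" ]; then
      echo "missing input file: $f" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

For example, run `check_inputs config/system.yaml config/evaluation_data.yaml` and start the evaluation only if it returns zero.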
diff --git a/modules/shared/ref-evaluation-metrics-and-historical-data-reference.adoc b/modules/shared/ref-evaluation-metrics-and-historical-data-reference.adoc
@@ -0,0 +1,19 @@
+:_mod-docs-content-type: REFERENCE
+
+[id="evaluation-metrics-and-historical-data-reference_{context}"]
+= Evaluation metrics and historical data reference
+
+[role="_abstract"]
+Use the link:https://docs.ragas.io/en/stable/concepts/metrics/available_metrics/[available metrics] to evaluate the performance of {ls-short} at the conversation turn level.
+
+These metrics provide a standardized way to measure the accuracy and reliability of the generated responses and the retrieved content.
+
+[cols="1,3",options="header"]
+|===
+| Metric | Description
+| `Faithfulness` | Measures how well the answer is derived solely from the retrieved context.
+| `Context recall` | Measures whether the retrieved context contains all information required to answer the question.
+| `Context relevance` | Measures whether the retrieved documentation chunks are relevant to the user query.
+| `Context precision without reference` | Measures the ratio of useful information within the retrieved documentation chunks.
+| `Answer correctness` | Compares the generated response against the expected ground-truth response. This custom metric is implemented in the evaluation tool.
+|===
diff --git a/modules/shared/ref-release-report-and-historical-data.adoc b/modules/shared/ref-release-report-and-historical-data.adoc
@@ -0,0 +1,22 @@
+:_mod-docs-content-type: REFERENCE
+
+[id="release-report-and-historical-data_{context}"]
+= Release report and historical data
+
+[role="_abstract"]
+Use the link:https://github.com/redhat-ai-dev/developer-lightspeed-evaluation[latest Q&A dataset and evaluation results] to monitor the current performance of {ls-short}.
+
+Access version-specific branches that contain the datasets and evaluation results required to track improvements or regressions across product releases.
+
+[IMPORTANT]
+====
+The `main` branch contains work-in-progress data for versions currently under development. For stable evaluations or historical tracking, you must switch to the branch associated with a specific release.
+====
+
+[cols="1,1,2",options="header"]
+|===
+| Release version | Branch name | Data included
+| Latest stable | Most recent version branch | The current question-and-answer (Q&A) dataset and evaluation results.
+| Historical | Previous version branches | Datasets and evaluation results for previous releases to track regressions.
+|===
+
diff --git a/titles/integrate_interacting-with-developer-lightspeed-for-rhdh/master.adoc b/titles/integrate_interacting-with-developer-lightspeed-for-rhdh/master.adoc
@@ -28,6 +28,8 @@ include::assemblies/shared/assembly-customize.adoc[leveloffset=+1]
 
 include::assemblies/shared/assembly-get-ai-assisted-help-for-your-development-tasks.adoc[leveloffset=+1]
 
+include::assemblies/shared/assembly-ai-model-evaluation-data-to-select-the-right-ai-model.adoc[leveloffset=+1]
+
 include::assemblies/shared/assembly-appendix-llm-requirements.adoc[leveloffset=+1]
 
 include::assemblies/shared/assembly-appendix-about-user-data-security.adoc[leveloffset=+1]