|
# Instruction Following
### _Eval Recipe for model migration_

This Eval Recipe demonstrates how to compare the performance of an instruction-following prompt on Gemini 1.5 Flash and Gemini 2.0 Flash using an unlabeled dataset and the open-source evaluation tool [Promptfoo](https://www.promptfoo.dev/).

[Instruction-Following Eval (IFEval)](https://arxiv.org/abs/2311.07911) is a straightforward and easy-to-reproduce evaluation benchmark. It focuses on a set of "verifiable instructions" such as "write in more than 400 words", "write in bullet points", etc.

- Use case: Instruction Following

- Evaluation Dataset is based on the [Instruction Following Evaluation Dataset](https://github.com/google-research/google-research/blob/master/instruction_following_eval/data/input_data.jsonl). It includes 10 randomly sampled prompts stored in the JSONL file `dataset.jsonl`. Each record in this file contains a single attribute wrapped in the `vars` object, which is how Promptfoo supplies the variables needed to populate the prompt template (an example record is shown after this list):
  - `prompt`: the task to perform, including the specific instructions to follow

- Prompt Template is a zero-shot prompt located in [`prompt_template.txt`](./prompt_template.txt) with one prompt variable (`prompt`) that is automatically populated from our dataset.

- [`promptfooconfig.yaml`](./promptfooconfig.yaml) contains all of the Promptfoo configuration (see the illustrative sketch after this list):
  - `providers`: list of models that will be evaluated
  - `prompts`: location of the prompt template file
  - `tests`: location of the dataset file
  - `defaultTest`: defines the scoring logic:
    - `type: answer-relevance`: evaluates whether the model's output is relevant to the original query, using a combination of embedding similarity and LLM evaluation.
    - `value: "Check if the response adheres to the instructions in the prompt"`: instructs Promptfoo to score how well the generated response follows the instructions in the original prompt.
    - `threshold: 0.5`: marks any response with a score below 0.5 as a failure.

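For illustration, a `dataset.jsonl` record has the following shape. The prompt text below is an invented example, not an actual record from the sampled dataset:

``` json
{"vars": {"prompt": "Write a product description for a reusable water bottle in exactly 5 bullet points, using fewer than 100 words in total."}}
```
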
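Putting those pieces together, a minimal sketch of the configuration is shown below. The provider IDs and model versions are assumptions for illustration only; refer to the actual [`promptfooconfig.yaml`](./promptfooconfig.yaml) in this folder for the exact settings.

``` yaml
# Illustrative sketch only - provider IDs and model versions are assumptions.
prompts:
  - file://prompt_template.txt   # zero-shot template with a single prompt variable
providers:
  - vertex:gemini-1.5-flash-002
  - vertex:gemini-2.0-flash-001
tests:
  - file://dataset.jsonl         # 10 sampled IFEval prompts
defaultTest:
  assert:
    - type: answer-relevance
      value: "Check if the response adheres to the instructions in the prompt"
      threshold: 0.5             # scores below 0.5 are marked as failures
```
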
## How to run this Eval Recipe

- Google Cloud Shell is the easiest option, as it automatically clones our GitHub repo:

  <a href="https://console.cloud.google.com/cloudshell/open?git_repo=https://github.com/GoogleCloudPlatform/applied-ai-engineering-samples&cloudshell_git_branch=main&cloudshell_workspace=genai-on-vertex-ai/gemini/model_upgrades">
    <img alt="Open in Cloud Shell" src="http://gstatic.com/cloudssh/images/open-btn.png">
  </a>

- Alternatively, you can use the following command to clone this repo to any Linux environment with a configured [Google Cloud Environment](https://cloud.google.com/vertex-ai/docs/start/cloud-environment):

  ``` bash
  git clone --filter=blob:none --sparse https://github.com/GoogleCloudPlatform/applied-ai-engineering-samples.git && \
  cd applied-ai-engineering-samples && \
  git sparse-checkout init && \
  git sparse-checkout set genai-on-vertex-ai/gemini/model_upgrades && \
  git pull origin main
  cd genai-on-vertex-ai/gemini/model_upgrades
  ```

1. Install Promptfoo using [these instructions](https://www.promptfoo.dev/docs/installation/).
1. Navigate to the Eval Recipe directory in the terminal and run the command `promptfoo eval`:

``` bash
# run from the root of the cloned applied-ai-engineering-samples repository
cd genai-on-vertex-ai/gemini/model_upgrades/instruction_following/promptfoo
promptfoo eval
```
1. Run `promptfoo view` to analyze the eval results. You can switch the Display option to `Show failures only` in order to investigate any underperforming prompts.

## How to customize this Eval Recipe
1. Copy the configuration file `promptfooconfig.yaml` to a new folder.
1. Add your dataset file with a JSONL schema similar to `dataset.jsonl`.
1. Save your prompt template to `prompt_template.txt` and make sure that the template variables map to the variables defined in your dataset.
1. That's it! You are ready to run `promptfoo eval`. If needed, add alternative prompt templates or additional metrics to `promptfooconfig.yaml` as explained [here](https://www.promptfoo.dev/docs/configuration/parameters/) and sketched below.
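
For example, an alternative prompt template and an additional built-in metric could be added roughly like this. The second template file name and the `latency` assertion are illustrative assumptions, not part of this recipe:

``` yaml
prompts:
  - file://prompt_template.txt
  - file://prompt_template_v2.txt    # hypothetical alternative template to compare
defaultTest:
  assert:
    - type: answer-relevance
      value: "Check if the response adheres to the instructions in the prompt"
      threshold: 0.5
    - type: latency                  # example additional metric
      threshold: 10000               # fail responses slower than 10 seconds
```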