Commit 7640481

Feat: instruction following eval recipe (#179)
* instruction following eval recipe
* instruction following eval recipe changes
* instruction following eval recipe changes for bucket name
* instruction following eval recipe changes for plotly import
* erasing outputs from cells
* erasing outputs from cells and removing blank cells
* updating datasets and promptfoo prompt
* updating readme and renaming the folder to include underscore
* Update README.md

---------

Co-authored-by: Rajesh Thallam <rthallam@google.com>
1 parent cf729b0 commit 7640481

14 files changed

Lines changed: 490 additions & 0 deletions

genai-on-vertex-ai/gemini/model_upgrades/README.md

Lines changed: 2 additions & 0 deletions
@@ -17,4 +17,6 @@ The goal is to accelerate the process of upgrading to the latest version of Gemi
 | Document QnA | [view](./document_qna/vertex_colab/document_qna_eval.ipynb) | [view](./document_qna/vertex_script/README.md) | [view](./document_qna/promptfoo/README.md) |
 | Summarization | [view](./summarization/vertex_colab/summarization_eval.ipynb) | [view](./summarization/vertex_script/README.md) | [view](./summarization/promptfoo/README.md) |
 | Text Classification | [view](./text_classification/vertex_colab/text_classification_eval.ipynb) | [view](./text_classification/vertex_script/README.md) | [view](./text_classification/promptfoo/README.md) |
+| Multi-turn Chat | [view](./multiturn_chat/vertex_colab/multiturn_chat_eval.ipynb) | [view](./multiturn_chat/vertex_script/README.md) | [view](./multiturn_chat/promptfoo/README.md) |
+| Instruction Following | [view](./instruction_following/vertex_colab/instruction_following_eval.ipynb) | [view](./instruction_following/vertex_script/README.md) | [view](./instruction_following/promptfoo/README.md) |

Lines changed: 58 additions & 0 deletions
@@ -0,0 +1,58 @@
# Instruction Following
### _Eval Recipe for model migration_

This Eval Recipe demonstrates how to compare the performance of an instruction-following prompt on Gemini 1.5 Flash and Gemini 2.0 Flash, using an unlabeled dataset and the open-source evaluation tool [Promptfoo](https://www.promptfoo.dev/).

[Instruction-Following Eval (IFEval)](https://arxiv.org/abs/2311.07911) is a straightforward, easy-to-reproduce evaluation benchmark. It focuses on a set of "verifiable instructions" such as "write in more than 400 words" or "write in bullet points".
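
Because such instructions are verifiable, many of them can also be checked deterministically, without a model-based grader. The sketch below is illustrative only: it is not part of this recipe or of the IFEval reference implementation, and the helper names (`count_bullets`, `is_all_lowercase`, `word_count`) are hypothetical.

```python
import re


def count_bullets(text: str) -> int:
    """Count markdown bullet lines that start with '* ' or '- '."""
    return sum(1 for line in text.splitlines() if re.match(r"^\s*[*-]\s+\S", line))


def is_all_lowercase(text: str) -> bool:
    """True when the text contains no uppercase letters."""
    return not any(ch.isupper() for ch in text)


def word_count(text: str) -> int:
    """Count whitespace-separated tokens."""
    return len(text.split())


# Hypothetical model response to "exactly 3 bullet points, all lowercase, under 300 words".
response = "* pack light\n* buy a rail pass\n* visit lucerne\n\np.s. bring a raincoat"
print(count_bullets(response) == 3)   # True
print(is_all_lowercase(response))     # True
print(word_count(response) < 300)     # True
```

Checks like these only cover constraints with a mechanical definition; open-ended requirements such as tone or audience still need a model-based or human judgment.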

- Use case: Instruction Following

- Evaluation Dataset is based on the [Instruction Following Evaluation Dataset](https://github.com/google-research/google-research/blob/master/instruction_following_eval/data/input_data.jsonl). It includes 10 randomly sampled prompts in a JSONL file `dataset.jsonl`. Each record in this file has a single attribute wrapped in the `vars` object, which is how Promptfoo receives the variables needed to populate the prompt template. The dataset is unlabeled, so no ground truth attribute is included:
    - `prompt`: The task with specific instructions provided

- Prompt Template is a zero-shot prompt located in [`prompt_template.txt`](./prompt_template.txt) with one prompt variable (`prompt`) that is automatically populated from our dataset.

- [`promptfooconfig.yaml`](./promptfooconfig.yaml) contains all Promptfoo configuration:
    - `providers`: list of models that will be evaluated
    - `prompts`: location of the prompt template file
    - `tests`: location of the dataset file
    - `defaultTest`: defines the scoring logic:
        - `type: answer-relevance` evaluates whether an LLM's output is relevant to the original query. It uses a combination of embedding similarity and LLM evaluation to determine relevance.
        - `value: "Evaluate the generated response to ensure it fully adheres to all instructions and constraints specified in the original prompt."` instructs Promptfoo to verify and score how well the generated response is aligned with the original prompt.
        - `threshold: 0.5` marks any response with a score below 0.5 as a failure.
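
The threshold rule can be sketched as follows. This is a minimal illustration, not Promptfoo's implementation: the scores and the model names `model_a` and `model_b` are made-up placeholders, and `passes` and `pass_rate` are hypothetical helpers. The only detail taken from this recipe is that a score below 0.5 counts as a failure.

```python
THRESHOLD = 0.5  # same value as `threshold` in promptfooconfig.yaml


def passes(score: float, threshold: float = THRESHOLD) -> bool:
    """A response fails when its score is below the threshold."""
    return score >= threshold


def pass_rate(scores: list[float], threshold: float = THRESHOLD) -> float:
    """Fraction of responses that meet the threshold (0.0 for an empty list)."""
    if not scores:
        return 0.0
    return sum(passes(s, threshold) for s in scores) / len(scores)


# Placeholder scores for illustration only; real scores come from `promptfoo eval`.
example_scores = {
    "model_a": [0.9, 0.4, 0.7, 0.5],
    "model_b": [0.8, 0.6, 0.3, 0.2],
}
for model, scores in example_scores.items():
    print(f"{model}: {pass_rate(scores):.0%} passed")
```

Comparing pass rates per model in this way mirrors what the Promptfoo results view reports for each provider.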

## How to run this Eval Recipe

- Google Cloud Shell is the easiest option as it automatically clones our GitHub repo:

<a href="https://console.cloud.google.com/cloudshell/open?git_repo=https://github.com/GoogleCloudPlatform/applied-ai-engineering-samples&cloudshell_git_branch=main&cloudshell_workspace=genai-on-vertex-ai/gemini/model_upgrades">
<img alt="Open in Cloud Shell" src="http://gstatic.com/cloudssh/images/open-btn.png">
</a>

- Alternatively, you can use the following command to clone this repo to any Linux environment with a configured [Google Cloud Environment](https://cloud.google.com/vertex-ai/docs/start/cloud-environment):

``` bash
37+
git clone --filter=blob:none --sparse https://github.com/GoogleCloudPlatform/applied-ai-engineering-samples.git && \
38+
cd applied-ai-engineering-samples && \
39+
git sparse-checkout init && \
40+
git sparse-checkout set genai-on-vertex-ai/gemini/model_upgrades && \
41+
git pull origin main
42+
cd genai-on-vertex-ai/gemini/model_upgrades
43+
```

1. Install Promptfoo using [these instructions](https://www.promptfoo.dev/docs/installation/).
1. Navigate to the Eval Recipe directory in terminal and run the command `promptfoo eval`.

``` bash
# Run from the model_upgrades directory, where both setup options above leave you
cd instruction_following/promptfoo
promptfoo eval
```
1. Run `promptfoo view` to analyze the eval results. You can switch the Display option to `Show failures only` in order to investigate any underperforming prompts.

## How to customize this Eval Recipe:
1. Copy the configuration file `promptfooconfig.yaml` to a new folder.
1. Add your dataset file with a JSONL schema similar to `dataset.jsonl`.
1. Save your prompt template to `prompt_template.txt` and make sure that the template variables map to the variables defined in your dataset.
1. That's it! You are ready to run `promptfoo eval`. If needed, add alternative prompt templates or additional metrics to `promptfooconfig.yaml` as explained [here](https://www.promptfoo.dev/docs/configuration/parameters/).
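
The mapping between template variables and dataset variables in step 3 can be checked mechanically before running the eval. The following is a minimal sketch, assuming the template uses only plain `{{name}}` placeholders (Promptfoo templates may contain richer templating syntax, which this does not parse); `template_variables` and `missing_variables` are hypothetical helpers, not Promptfoo functions.

```python
import json
import re

# Matches simple placeholders such as {{prompt}} or {{ prompt }}.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def template_variables(template: str) -> set[str]:
    """Return the names of simple {{name}} placeholders in the template."""
    return set(PLACEHOLDER.findall(template))


def missing_variables(template: str, jsonl_text: str) -> dict[int, set[str]]:
    """Map 1-based record numbers to placeholders that the record's `vars` lack."""
    needed = template_variables(template)
    problems = {}
    for number, line in enumerate(jsonl_text.splitlines(), start=1):
        if not line.strip():
            continue  # ignore blank lines
        record_vars = json.loads(line).get("vars", {})
        missing = needed - set(record_vars)
        if missing:
            problems[number] = missing
    return problems


template = "# Instruction\n{{prompt}}"
dataset = '{"vars": {"prompt": "Write a haiku."}}\n{"vars": {"question": "Why?"}}'
print(missing_variables(template, dataset))  # {2: {'prompt'}}
```

An empty result means every record supplies all variables that the template references.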
Lines changed: 10 additions & 0 deletions
@@ -0,0 +1,10 @@
{"vars":{"prompt": "I work in the marketing department and I need your help. I need a template for an advertisement for a new product which is a portable camera. In the template, capitalize a few words to stress main points. Please limit the number of words with all capital letters to less than four. Your response should contain at least ten sentences."}}
{"vars":{"prompt": "Write a rubric, in the form of a list of bullet points, for evaluating the performance of a customer service representative. Your answer must not include keywords ['bad', 'underperform'] and must contain exactly 6 bullet points in the following form:\n* Bullet point 1\n* Bullet point 2\n* Bullet point 3\n* Bullet point 4\n* Bullet point 5\n* Bullet point 6"}}
{"vars":{"prompt": "Can you create an itinerary for a 5 day trip to Switzerland that includes exactly 3 bullet points in markdown format, in all lowercase letters, and a postscript at the end starting with P.S.?"}}
{"vars":{"prompt": "Who won the defamation case between Amber Heard and Johnny Depp? Write your answer as if you are writing to a group of elderly people. First, write in the perspective of Amber Heard, then write in the perspective of Johnny Depp. Separate those two version by 6 asterisk symbols ******. The entire response should have less than 300 words."}}
{"vars":{"prompt": "Write a cover letter for a job application as a tour guide in Beijing in all lowercase letters, with no capitalizations. Make it short -- the entire output should have less than 5 sentences."}}
{"vars":{"prompt": "A colt is 5 feet tall. It will grow 6 inches every month. How tall will it be in 3 years? Think step-by-step, then give your answer. Separate your thinking and the final answer by a line with just three \"*\" symbols: ***\nAt the end of your response, please explicitly add a postscript starting with P.P.S"}}
{"vars":{"prompt": "I asked a friend about how to remove rust from my bike chain. He told me to pour coke on it and then scrub it with a steel wool. Is this a good way to remove rust? Respond with at least 20 sentences and have more than 4 words be in all capital letters."}}
{"vars":{"prompt": "Compose a startup pitch on a new app called Tipperary that helps people to find the average tip size for each restaurant. Please make the response strongly structured. Wrap your entire output in JSON format."}}
{"vars":{"prompt": "What is the next number in this series: 1, 4, 7, 11, 17? Please answer with only mathematical notation without any commas."}}
{"vars":{"prompt": "Translate the following sentence into German and then criticize it: Werner was a good friend of mine, but not very smart.\nAvoid the word \"schlau\" throughout your response."}}
Lines changed: 7 additions & 0 deletions
@@ -0,0 +1,7 @@
You are an expert at following instructions precisely and accurately.
Your goal is to carefully read the instruction provided below and execute it exactly as specified.

Pay close attention to all constraints and requirements mentioned in the instruction.

# Instruction
{{prompt}}
Lines changed: 17 additions & 0 deletions
@@ -0,0 +1,17 @@
# yaml-language-server: $schema=https://promptfoo.dev/config-schema.json

providers: # one or more models, with optional temperature and system instructions
  - id: "vertex:gemini-1.5-flash"
  - id: "vertex:gemini-2.0-flash"

prompts:
  - file://prompt_template.txt

tests: # Promptfoo will generate a separate prompt for each record in this dataset based on the prompt template above
  - file://dataset.jsonl

defaultTest: # The rules for scoring each model response
  assert:
    - type: answer-relevance
      value: "Evaluate the generated response to ensure it fully adheres to all instructions and constraints specified in the original prompt."
      threshold: 0.5
Lines changed: 10 additions & 0 deletions
@@ -0,0 +1,10 @@
{"prompt": "I work in the marketing department and I need your help. I need a template for an advertisement for a new product which is a portable camera. In the template, capitalize a few words to stress main points. Please limit the number of words with all capital letters to less than four. Your response should contain at least ten sentences."}
{"prompt": "Write a rubric, in the form of a list of bullet points, for evaluating the performance of a customer service representative. Your answer must not include keywords ['bad', 'underperform'] and must contain exactly 6 bullet points in the following form:\n* Bullet point 1\n* Bullet point 2\n* Bullet point 3\n* Bullet point 4\n* Bullet point 5\n* Bullet point 6"}
{"prompt": "Can you create an itinerary for a 5 day trip to Switzerland that includes exactly 3 bullet points in markdown format, in all lowercase letters, and a postscript at the end starting with P.S.?"}
{"prompt": "Who won the defamation case between Amber Heard and Johnny Depp? Write your answer as if you are writing to a group of elderly people. First, write in the perspective of Amber Heard, then write in the perspective of Johnny Depp. Separate those two version by 6 asterisk symbols ******. The entire response should have less than 300 words."}
{"prompt": "Write a cover letter for a job application as a tour guide in Beijing in all lowercase letters, with no capitalizations. Make it short -- the entire output should have less than 5 sentences."}
{"prompt": "A colt is 5 feet tall. It will grow 6 inches every month. How tall will it be in 3 years? Think step-by-step, then give your answer. Separate your thinking and the final answer by a line with just three \"*\" symbols: ***\nAt the end of your response, please explicitly add a postscript starting with P.P.S"}
{"prompt": "I asked a friend about how to remove rust from my bike chain. He told me to pour coke on it and then scrub it with a steel wool. Is this a good way to remove rust? Respond with at least 20 sentences and have more than 4 words be in all capital letters."}
{"prompt": "Compose a startup pitch on a new app called Tipperary that helps people to find the average tip size for each restaurant. Please make the response strongly structured. Wrap your entire output in JSON format."}
{"prompt": "What is the next number in this series: 1, 4, 7, 11, 17? Please answer with only mathematical notation without any commas."}
{"prompt": "Translate the following sentence into German and then criticize it: Werner was a good friend of mine, but not very smart.\nAvoid the word \"schlau\" throughout your response."}
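
The two dataset files in this commit contain the same ten prompts and differ only in the `vars` wrapper that Promptfoo expects. The commit does not say how either file was produced, so the following is only a sketch of one way to derive the wrapped form from the flat form; `wrap_for_promptfoo` is a hypothetical helper.

```python
import json


def wrap_for_promptfoo(jsonl_text: str) -> str:
    """Wrap each flat JSONL record in a `vars` object, one record per line."""
    wrapped = []
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)
        wrapped.append(json.dumps({"vars": record}, ensure_ascii=False))
    return "".join(item + "\n" for item in wrapped)


flat = '{"prompt": "Write a cover letter in all lowercase letters."}\n'
print(wrap_for_promptfoo(flat), end="")
# {"vars": {"prompt": "Write a cover letter in all lowercase letters."}}
```

Parsing and re-serializing each record, rather than editing the text directly, keeps escaped characters such as `\n` and `\"` valid in the output.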
