Update instruction-following eval datasets (15 to 10 prompts) and promptfoo assertion prompt

jaivaldesai · jaivaldesai · commit dd6e75e93426 · 2025-04-24T15:00:55.000Z
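
The dataset files touched by this commit use two record shapes: `{"vars": {"prompt": ...}}` for promptfoo and `{"prompt": ...}` for the vertex_* recipes. A minimal sketch of a sanity check for either shape might look like the following. The helper name `load_prompts` and the sample records are hypothetical and are not part of this commit or the repository.

```python
import json


def load_prompts(lines):
    """Return the prompt string from each non-empty JSONL line.

    Accepts both record shapes seen in this commit:
    {"vars": {"prompt": ...}} (promptfoo) and {"prompt": ...} (vertex_*).
    """
    prompts = []
    for number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines, e.g. a trailing newline
        record = json.loads(line)
        # Unwrap the promptfoo `vars` object when present.
        fields = record.get("vars", record)
        prompt = fields.get("prompt")
        if not isinstance(prompt, str) or not prompt:
            raise ValueError(f"line {number}: missing or empty 'prompt'")
        prompts.append(prompt)
    return prompts


# Hypothetical sample records, one per shape.
sample = [
    '{"vars": {"prompt": "What is the next number in this series?"}}',
    '{"prompt": "Write a cover letter in all lowercase letters."}',
]
print(len(load_prompts(sample)))
```

To check a real file, pass an open file handle, for example `load_prompts(open("dataset.jsonl", encoding="utf-8"))`, and compare the length of the result with the 10 prompts stated in the README.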
diff --git a/genai-on-vertex-ai/gemini/model_upgrades/instruction-following/promptfoo/README.md b/genai-on-vertex-ai/gemini/model_upgrades/instruction-following/promptfoo/README.md
@@ -7,7 +7,7 @@ This Eval Recipe demonstrates how to compare performance of an Instruction Follo
 
 - Use case: Instruction Following 
 
-- Evaluation Dataset is based on [Instruction Following Evaluation Dataset](https://github.com/google-research/google-research/blob/master/instruction_following_eval/data/input_data.jsonl). It includes 15 randomly sampled prompts in a JSONL file `dataset.jsonl`. Each record in this file includes 1 attribute wrapped in the `vars` object. This structure allows Promptfoo to specify the variables needed to populate prompt templates (document and question), as well as the ground truth label required to score the accuracy of model responses:
+- Evaluation Dataset is based on [Instruction Following Evaluation Dataset](https://github.com/google-research/google-research/blob/master/instruction_following_eval/data/input_data.jsonl). It includes 10 randomly sampled prompts in a JSONL file `dataset.jsonl`. Each record in this file includes 1 attribute wrapped in the `vars` object. This structure allows Promptfoo to specify the variables needed to populate prompt templates (document and question), as well as the ground truth label required to score the accuracy of model responses:
     - `prompt`: The task with specific instructions provided
 
 - Prompt Template is a zero-shot prompt located in [`prompt_template.txt`](./prompt_template.txt) with one prompt variable (`prompt`) that is automatically populated from our dataset.
diff --git a/genai-on-vertex-ai/gemini/model_upgrades/instruction-following/promptfoo/dataset.jsonl b/genai-on-vertex-ai/gemini/model_upgrades/instruction-following/promptfoo/dataset.jsonl
@@ -1,15 +1,10 @@
-{"vars":{"prompt": "I am planning a trip to Japan, and I would like thee to write an itinerary for my journey in a Shakespearean style. You are not allowed to use any commas in your response."}}
-{"vars":{"prompt": "Which of the following is not a fish: salmon or avocado? Answer this easy question first, then expand into an interesting article about salmon and avocado. Use some words in all caps in your response, but those all-caps words should appear at most 10 times."}}
-{"vars":{"prompt": "Write a funny article about why the dinosaurs went extinct and put double quotations marks around your whole response. Include exactly 8 bullet points in your response. The bullet points should be in the form of:\n* This is bullet 1\n* This is bullet 2\n..."}}
-{"vars":{"prompt": "A colt is 5 feet tall. It will grow 6 inches every month. How tall will it be in 3 years? Think step-by-step, then give your answer. Separate your thinking and the final answer by a line with just three \"*\" symbols: ***\nAt the end of your response, please explicitly add a postscript starting with P.P.S"}}
+{"vars":{"prompt": "I work in the marketing department and I need your help. I need a template for an advertisement for a new product which is a portable camera. In the template, capitalize a few words to stress main points. Please limit the number of words with all capital letters to less than four. Your response should contain at least ten sentences."}}
+{"vars":{"prompt": "Write a rubric, in the form of a list of bullet points, for evaluating the performance of a customer service representative. Your answer must not include keywords ['bad', 'underperform'] and must contain exactly 6 bullet points in the following form:\n* Bullet point 1\n* Bullet point 2\n* Bullet point 3\n* Bullet point 4\n* Bullet point 5\n* Bullet point 6"}}
+{"vars":{"prompt": "Can you create an itinerary for a 5 day trip to Switzerland that includes exactly 3 bullet points in markdown format, in all lowercase letters, and a postscript at the end starting with P.S.?"}}
 {"vars":{"prompt": "Who won the defamation case between Amber Heard and Johnny Depp? Write your answer as if you are writing to a group of elderly people. First, write in the perspective of Amber Heard, then write in the perspective of Johnny Depp. Separate those two version by 6 asterisk symbols ******. The entire response should have less than 300 words."}}
+{"vars":{"prompt": "Write a cover letter for a job application as a tour guide in Beijing in all lowercase letters, with no capitalizations. Make it short -- the entire output should have less than 5 sentences."}}
+{"vars":{"prompt": "A colt is 5 feet tall. It will grow 6 inches every month. How tall will it be in 3 years? Think step-by-step, then give your answer. Separate your thinking and the final answer by a line with just three \"*\" symbols: ***\nAt the end of your response, please explicitly add a postscript starting with P.P.S"}}
 {"vars":{"prompt": "I asked a friend about how to remove rust from my bike chain. He told me to pour coke on it and then scrub it with a steel wool. Is this a good way to remove rust? Respond with at least 20 sentences and have more than 4 words be in all capital letters."}}
-{"vars":{"prompt": "Write a very angry letter to someone who's been trying to convince you that 1+1=3. There should be exactly 4 paragraphs. Separate the paragraphs with ***."}}
-{"vars":{"prompt": "Write a 100 word riddle that leaves the reader satisfied and enlightened. Include a few words in all capital letters. But the number of words in all capital letters should be less than 5."}}
-{"vars":{"prompt": "Students are travelling to UCI for 3 days. Create a hilarious itinerary for them. Do not use the word 'university'. Your entire response should have exactly 4 paragraphs.  Separate paragraphs with the markdown divider: ***"}}
-{"vars":{"prompt": "Write a poem about the top 20 tallest buildings in the world and their heights. End your response with the exact question: Is there anything else I can help with?"}}
-{"vars":{"prompt": "Can you create an itinerary for a 5 day trip to switzerland that includes exactly 3 bullet points in markdown format, in all lowercase letters, and a postscript at the end starting with P.S.?"}}
-{"vars":{"prompt": "Explain to me how to ride a bike like I am a kid. Also, do not include the keywords \"slow\", \"like\" and \"kid\"."}}
-{"vars":{"prompt": "Can you tell me why there are oval race tracks in the desert? Please rewrite the answer to make it more concise and include the word \"desert\" in the answer. Make sure the answer contains exactly 3 bullet points in markdown format."}}
-{"vars":{"prompt": "Write me a letter in the style of Shakespeare about the mandates and instructions of the King. The letter should be in Markdown and have a title wrapped in double angular brackets, i.e. <<title>>."}}
-{"vars":{"prompt": "Darth Vader is angry with you, his server, because you assassinated a fake Jedi. Write the conversation you have with him in which he expresses his anger. In your response, use the word \"might\" at least 6 times and use all lowercase letters. Make sure to break the conversation down to 3 parts, separated by ***, such as:\n[conversation part 1]\n***\n[conversation part 2]\n***\n[conversation part 3]"}}
+{"vars":{"prompt": "Compose a startup pitch on a new app called Tipperary that helps people to find the average tip size for each restaurant. Please make the response strongly structured. Wrap your entire output in JSON format."}}
+{"vars":{"prompt": "What is the next number in this series: 1, 4, 7, 11, 17? Please answer with only mathematical notation without any commas."}}
+{"vars":{"prompt": "Translate the following sentence into German and then criticize it: Werner was a good friend of mine, but not very smart.\nAvoid the word \"schlau\" throughout your response."}}
diff --git a/genai-on-vertex-ai/gemini/model_upgrades/instruction-following/promptfoo/promptfooconfig.yaml b/genai-on-vertex-ai/gemini/model_upgrades/instruction-following/promptfoo/promptfooconfig.yaml
@@ -13,5 +13,5 @@ tests:  # Promptfoo will generate a separate prompt for each record in this data
 defaultTest:  # The rules for scoring each model response
   assert:
     - type: answer-relevance
-      value: "Check if the response adheres to the instructions in the prompt"
+      value: "Evaluate the generated response to ensure it fully adheres to all instructions and constraints specified in the original prompt."
       threshold: 0.5
diff --git a/genai-on-vertex-ai/gemini/model_upgrades/instruction-following/vertex_colab/dataset.jsonl b/genai-on-vertex-ai/gemini/model_upgrades/instruction-following/vertex_colab/dataset.jsonl
@@ -1,15 +1,10 @@
-{"prompt": "I am planning a trip to Japan, and I would like thee to write an itinerary for my journey in a Shakespearean style. You are not allowed to use any commas in your response."}
-{"prompt": "Which of the following is not a fish: salmon or avocado? Answer this easy question first, then expand into an interesting article about salmon and avocado. Use some words in all caps in your response, but those all-caps words should appear at most 10 times."}
-{"prompt": "Write a funny article about why the dinosaurs went extinct and put double quotations marks around your whole response. Include exactly 8 bullet points in your response. The bullet points should be in the form of:\n* This is bullet 1\n* This is bullet 2\n..."}
-{"prompt": "A colt is 5 feet tall. It will grow 6 inches every month. How tall will it be in 3 years? Think step-by-step, then give your answer. Separate your thinking and the final answer by a line with just three \"*\" symbols: ***\nAt the end of your response, please explicitly add a postscript starting with P.P.S"}
+{"prompt": "I work in the marketing department and I need your help. I need a template for an advertisement for a new product which is a portable camera. In the template, capitalize a few words to stress main points. Please limit the number of words with all capital letters to less than four. Your response should contain at least ten sentences."}
+{"prompt": "Write a rubric, in the form of a list of bullet points, for evaluating the performance of a customer service representative. Your answer must not include keywords ['bad', 'underperform'] and must contain exactly 6 bullet points in the following form:\n* Bullet point 1\n* Bullet point 2\n* Bullet point 3\n* Bullet point 4\n* Bullet point 5\n* Bullet point 6"}
+{"prompt": "Can you create an itinerary for a 5 day trip to Switzerland that includes exactly 3 bullet points in markdown format, in all lowercase letters, and a postscript at the end starting with P.S.?"}
 {"prompt": "Who won the defamation case between Amber Heard and Johnny Depp? Write your answer as if you are writing to a group of elderly people. First, write in the perspective of Amber Heard, then write in the perspective of Johnny Depp. Separate those two version by 6 asterisk symbols ******. The entire response should have less than 300 words."}
+{"prompt": "Write a cover letter for a job application as a tour guide in Beijing in all lowercase letters, with no capitalizations. Make it short -- the entire output should have less than 5 sentences."}
+{"prompt": "A colt is 5 feet tall. It will grow 6 inches every month. How tall will it be in 3 years? Think step-by-step, then give your answer. Separate your thinking and the final answer by a line with just three \"*\" symbols: ***\nAt the end of your response, please explicitly add a postscript starting with P.P.S"}
 {"prompt": "I asked a friend about how to remove rust from my bike chain. He told me to pour coke on it and then scrub it with a steel wool. Is this a good way to remove rust? Respond with at least 20 sentences and have more than 4 words be in all capital letters."}
-{"prompt": "Write a very angry letter to someone who's been trying to convince you that 1+1=3. There should be exactly 4 paragraphs. Separate the paragraphs with ***."}
-{"prompt": "Write a 100 word riddle that leaves the reader satisfied and enlightened. Include a few words in all capital letters. But the number of words in all capital letters should be less than 5."}
-{"prompt": "Students are travelling to UCI for 3 days. Create a hilarious itinerary for them. Do not use the word 'university'. Your entire response should have exactly 4 paragraphs.  Separate paragraphs with the markdown divider: ***"}
-{"prompt": "Write a poem about the top 20 tallest buildings in the world and their heights. End your response with the exact question: Is there anything else I can help with?"}
-{"prompt": "Can you create an itinerary for a 5 day trip to switzerland that includes exactly 3 bullet points in markdown format, in all lowercase letters, and a postscript at the end starting with P.S.?"}
-{"prompt": "Explain to me how to ride a bike like I am a kid. Also, do not include the keywords \"slow\", \"like\" and \"kid\"."}
-{"prompt": "Can you tell me why there are oval race tracks in the desert? Please rewrite the answer to make it more concise and include the word \"desert\" in the answer. Make sure the answer contains exactly 3 bullet points in markdown format."}
-{"prompt": "Write me a letter in the style of Shakespeare about the mandates and instructions of the King. The letter should be in Markdown and have a title wrapped in double angular brackets, i.e. <<title>>."}
-{"prompt": "Darth Vader is angry with you, his server, because you assassinated a fake Jedi. Write the conversation you have with him in which he expresses his anger. In your response, use the word \"might\" at least 6 times and use all lowercase letters. Make sure to break the conversation down to 3 parts, separated by ***, such as:\n[conversation part 1]\n***\n[conversation part 2]\n***\n[conversation part 3]"}
+{"prompt": "Compose a startup pitch on a new app called Tipperary that helps people to find the average tip size for each restaurant. Please make the response strongly structured. Wrap your entire output in JSON format."}
+{"prompt": "What is the next number in this series: 1, 4, 7, 11, 17? Please answer with only mathematical notation without any commas."}
+{"prompt": "Translate the following sentence into German and then criticize it: Werner was a good friend of mine, but not very smart.\nAvoid the word \"schlau\" throughout your response."}
diff --git a/genai-on-vertex-ai/gemini/model_upgrades/instruction-following/vertex_colab/instruction_following_eval.ipynb b/genai-on-vertex-ai/gemini/model_upgrades/instruction-following/vertex_colab/instruction_following_eval.ipynb
@@ -67,7 +67,7 @@
     "\n",
     "- Metric: This eval uses a Pairwise Instruction Following template to evaluate the responses and pick a model as the winner.\n",
     "\n",
-    "- Evaluation Dataset is based on [Instruction Following Evaluation Dataset](https://github.com/google-research/google-research/blob/master/instruction_following_eval/data/input_data.jsonl). It includes 15 randomly sampled prompts in a JSONL file `dataset.jsonl` with the following structure:\n",
+    "- Evaluation Dataset is based on [Instruction Following Evaluation Dataset](https://github.com/google-research/google-research/blob/master/instruction_following_eval/data/input_data.jsonl). It includes 10 randomly sampled prompts in a JSONL file `dataset.jsonl` with the following structure:\n",
     "    - `prompt`: The task with specific instructions provided\n",
     "\n",
     "- Prompt Template is a zero-shot prompt located in [`prompt_template.txt`](./prompt_template.txt) with one prompt variable (`prompt`) that is automatically populated from our dataset.\n"
@@ -87,7 +87,7 @@
    "outputs": [],
    "source": [
     "%%writefile .env\n",
-    "PROJECT_ID=jaival-genai-sa          # Google Cloud Project ID\n",
+    "PROJECT_ID=your-project-id          # Google Cloud Project ID\n",
     "LOCATION=us-central1                  # Region for all required Google Cloud services\n",
     "EXPERIMENT_NAME=instructionfollowing-eval-recipe-demo      # Creates Vertex AI Experiment to track the eval runs\n",
     "MODEL_BASELINE=gemini-1.5-flash  # Name of your current model\n",
diff --git a/genai-on-vertex-ai/gemini/model_upgrades/instruction-following/vertex_script/README.md b/genai-on-vertex-ai/gemini/model_upgrades/instruction-following/vertex_script/README.md
@@ -7,7 +7,7 @@ This Eval Recipe demonstrates how to compare performance of an Instruction Follo
 
 - Use case: Instruction Following 
 
-- Evaluation Dataset is based on [Instruction Following Evaluation Dataset](https://github.com/google-research/google-research/blob/master/instruction_following_eval/data/input_data.jsonl). It includes 15 randomly sampled prompts in a JSONL file `dataset.jsonl` with the follwing structure:
+- Evaluation Dataset is based on [Instruction Following Evaluation Dataset](https://github.com/google-research/google-research/blob/master/instruction_following_eval/data/input_data.jsonl). It includes 10 randomly sampled prompts in a JSONL file `dataset.jsonl` with the follwing structure:
     - `prompt`: The task with specific instructions provided
 
 - Prompt Template is a zero-shot prompt located in [`prompt_template.txt`](./prompt_template.txt) with one prompt variable (`prompt`) that is automatically populated from our dataset.
diff --git a/genai-on-vertex-ai/gemini/model_upgrades/instruction-following/vertex_script/dataset.jsonl b/genai-on-vertex-ai/gemini/model_upgrades/instruction-following/vertex_script/dataset.jsonl