Commit fb3e88a

Add eval recipes for Summarization and Document QnA (#172)
* text_classification V1
* Copy improvements
* remove redundant copyright
* Add link to eval service docs
* update the link bar on top of Colab
* Minor fixes
* Update README and eval script
* Update README and eval script
* Add to docs
* Add eval recipe for Document QnA
* add eval recipes for summarization and document QnA
* Add dataset citations and CD commands
* fix links

---------

Co-authored-by: RajeshThallam <rthallam@google.com>
1 parent 21d4fac commit fb3e88a

63 files changed

Lines changed: 1113 additions & 0 deletions

Lines changed: 50 additions & 0 deletions
@@ -0,0 +1,50 @@
# Document QnA
### _Eval Recipe for model migration_

This Eval Recipe demonstrates how to compare the performance of a Document Question Answering prompt on Gemini 1.0 and Gemini 2.0 using a labeled dataset and the open-source evaluation tool [Promptfoo](https://www.promptfoo.dev/).

![](diagram.png "Promptfoo")

- Use case: answer questions based on information from the given document.

- The Evaluation Dataset is based on [SQuAD2.0](https://rajpurkar.github.io/SQuAD-explorer/). It includes 6 documents stored as plain text files and a JSONL file that provides ground truth labels: [`dataset.jsonl`](./dataset.jsonl). Each record in this file includes 3 attributes wrapped in the `vars` object. This structure lets Promptfoo read the variables needed to populate the prompt template (`document` and `question`) as well as the ground truth label required to score the accuracy of model responses:
    - `document`: relative path to the plain text document file
    - `question`: the question to ask about this particular document
    - `answer`: the expected correct answer, or the special code `ANSWER_NOT_FOUND`, which is used to verify that the model does not hallucinate an answer when the document does not provide enough information to answer the question.

- Prompt Template is a zero-shot prompt located in [`prompt_template.txt`](./prompt_template.txt) with two prompt variables (`document` and `question`) that are automatically populated from our dataset.

- [`promptfooconfig.yaml`](./promptfooconfig.yaml) contains all Promptfoo configuration:
    - `providers`: list of models that will be evaluated
    - `prompts`: location of the prompt template file
    - `tests`: location of the labeled dataset file
    - `defaultTest`: defines the scoring logic:
        1. `type: factuality` uses an Autorater (also known as an LLM Judge) to compare the model answer with our ground truth label and rate its correctness
        1. `value: "{{answer}}"` instructs Promptfoo to use the dataset attribute `answer` as the ground truth label

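The configuration keys described above can be pictured with a minimal sketch. The contents of `promptfooconfig.yaml` are not shown in this commit view, so everything below is an assumption: the provider IDs are hypothetical placeholders, and the `assert` list under `defaultTest` follows Promptfoo's general configuration layout rather than this recipe's actual file. The sketch builds the structure as a Python dictionary and prints it as JSON so the nesting of the keys is easy to see.

```python
import json

# Hypothetical mirror of promptfooconfig.yaml; NOT the recipe's actual file.
config = {
    # Models to compare. These provider IDs are placeholders (assumption).
    "providers": ["vertex:gemini-1.0-pro", "vertex:gemini-2.0-flash"],
    # Location of the prompt template file.
    "prompts": ["file://prompt_template.txt"],
    # Location of the labeled dataset file.
    "tests": "file://dataset.jsonl",
    # Scoring logic applied to every test case.
    "defaultTest": {
        "assert": [
            # LLM-judged comparison against the dataset's "answer" attribute.
            {"type": "factuality", "value": "{{answer}}"}
        ]
    },
}

print(json.dumps(config, indent=2))
```

The four top-level keys correspond to the four bullets above; compare against the real configuration file before relying on any specific value.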
## How to run this Eval Recipe

1. Configure your [Google Cloud Environment](https://cloud.google.com/vertex-ai/docs/start/cloud-environment) and clone this GitHub repo to your environment. We recommend [Cloud Shell](https://shell.cloud.google.com/) or [Vertex AI Workbench](https://cloud.google.com/vertex-ai/docs/workbench/instances/introduction).

``` bash
git clone --filter=blob:none --sparse https://github.com/GoogleCloudPlatform/applied-ai-engineering-samples.git && \
cd applied-ai-engineering-samples && \
git sparse-checkout init && \
git sparse-checkout set genai-on-vertex-ai/gemini/model_upgrades && \
git pull origin main
```

1. Install Promptfoo using [these instructions](https://www.promptfoo.dev/docs/installation/).
1. Navigate to the Eval Recipe directory in your terminal and run the command `promptfoo eval`.

``` bash
cd genai-on-vertex-ai/gemini/model_upgrades/document_qna/promptfoo
promptfoo eval
```
1. Run `promptfoo view` to analyze the eval results. You can switch the Display option to `Show failures only` to investigate any underperforming prompts.

## How to customize this Eval Recipe
1. Copy the configuration file `promptfooconfig.yaml` to a new folder.
1. Add your labeled dataset file with a JSONL schema similar to `dataset.jsonl`.
1. Save your prompt template to `prompt_template.txt` and make sure that the template variables map to the variables defined in your dataset.
1. That's it! You are ready to run `promptfoo eval`. If needed, add alternative prompt templates or additional metrics to `promptfooconfig.yaml` as explained [here](https://www.promptfoo.dev/docs/configuration/parameters/).
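
When customizing the recipe, a mismatch between template variables and dataset attributes is an easy mistake to make. The following is a minimal sketch of a pre-flight check, assuming the template uses `{{name}}` placeholders and each JSONL record wraps its attributes in `vars`. The helper names `template_variables` and `missing_variables` are hypothetical and are not part of Promptfoo.

```python
import json
import re

# Matches {{name}} placeholders, allowing optional inner whitespace.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def template_variables(template: str) -> set[str]:
    """Return the set of placeholder names used in the template."""
    return set(PLACEHOLDER.findall(template))


def missing_variables(template: str, jsonl_line: str) -> set[str]:
    """Return template variables that the dataset record does not define."""
    record = json.loads(jsonl_line)
    provided = set(record.get("vars", {}))
    return template_variables(template) - provided


template = "<DOCUMENT>\n{{document}}\n</DOCUMENT>\n\nQuestion: {{question}}\nAnswer:"
good = '{"vars": {"document": "file://documents/document_05.txt", "question": "Who founded Telnet?", "answer": "Larry Roberts"}}'
bad = '{"vars": {"document": "file://documents/document_05.txt", "answer": "Larry Roberts"}}'

print(missing_variables(template, good))  # set()
print(missing_variables(template, bad))   # {'question'}
```

Running a check like this over every line of a new dataset file before `promptfoo eval` can surface schema problems early.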
Lines changed: 6 additions & 0 deletions
@@ -0,0 +1,6 @@
{"vars": {"document": "file://documents/document_01.txt", "question": "How many households used BSkyB service in 1994?", "answer": "3.5 million"}}
{"vars": {"document": "file://documents/document_02.txt", "question": "When did Dodge end production of their full size sedans?", "answer": "ANSWER_NOT_FOUND"}}
{"vars": {"document": "file://documents/document_03.txt", "question": "How many square miles was the region impacted by the 2010 drought?", "answer": "1160000"}}
{"vars": {"document": "file://documents/document_04.txt", "question": "What is the largest Mansion in the west coast?", "answer": "ANSWER_NOT_FOUND"}}
{"vars": {"document": "file://documents/document_05.txt", "question": "Who founded Telnet?", "answer": "Larry Roberts"}}
{"vars": {"document": "file://documents/document_06.txt", "question": "When did the Chinese famine begin?", "answer": "1331"}}
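
Two of the six records above expect `ANSWER_NOT_FOUND`. As a rough complement to the LLM-judged `factuality` score, those abstention cases can also be checked deterministically. The helper below is a hypothetical sketch, not part of Promptfoo or of this recipe, and it only checks whether the model abstained when it should have, not whether a non-abstaining answer is correct.

```python
NOT_FOUND = "ANSWER_NOT_FOUND"


def abstention_correct(expected: str, response: str) -> bool:
    """True when the model abstains exactly when the label says it should."""
    # Tolerate surrounding whitespace and quotes in the model response.
    model_abstained = response.strip().strip('"') == NOT_FOUND
    return model_abstained == (expected == NOT_FOUND)


print(abstention_correct("ANSWER_NOT_FOUND", "ANSWER_NOT_FOUND"))  # True
print(abstention_correct("ANSWER_NOT_FOUND", "1996"))              # False
print(abstention_correct("1331", "1331"))                          # True
print(abstention_correct("1331", " ANSWER_NOT_FOUND\n"))           # False
```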
Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
The service started on 1 September 1993 based on the idea from the then chief executive officer, Sam Chisholm and Rupert Murdoch, of converting the company business strategy to an entirely fee-based concept. The new package included four channels formerly available free-to-air, broadcasting on Astra's satellites, as well as introducing new channels. The service continued until the closure of BSkyB's analogue service on 27 September 2001, due to the launch and expansion of the Sky Digital platform. Some of the channels did broadcast either in the clear or soft encrypted (whereby a Videocrypt decoder was required to decode, without a subscription card) prior to their addition to the Sky Multichannels package. Within two months of the launch, BSkyB gained 400,000 new subscribers, with the majority taking at least one premium channel as well, which helped BSkyB reach 3.5 million households by mid-1994. Michael Grade criticized the operations in front of the Select Committee on National Heritage, mainly for the lack of original programming on many of the new channels.
Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
An increase in imported cars into North America forced General Motors, Ford and Chrysler to introduce smaller and fuel-efficient models for domestic sales. The Dodge Omni / Plymouth Horizon from Chrysler, the Ford Fiesta and the Chevrolet Chevette all had four-cylinder engines and room for at least four passengers by the late 1970s. By 1985, the average American vehicle moved 17.4 miles per gallon, compared to 13.5 in 1970. The improvements stayed even though the price of a barrel of oil remained constant at $12 from 1974 to 1979. Sales of large sedans for most makes (except Chrysler products) recovered within two model years of the 1973 crisis. The Cadillac DeVille and Fleetwood, Buick Electra, Oldsmobile 98, Lincoln Continental, Mercury Marquis, and various other luxury oriented sedans became popular again in the mid-1970s. The only full-size models that did not recover were lower price models such as the Chevrolet Bel Air, and Ford Galaxie 500. Slightly smaller, mid-size models such as the Oldsmobile Cutlass, Chevrolet Monte Carlo, Ford Thunderbird and various other models sold well.
Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
In 2010 the Amazon rainforest experienced another severe drought, in some ways more extreme than the 2005 drought. The affected region was approximately 1,160,000 square miles (3,000,000 km2) of rainforest, compared to 734,000 square miles (1,900,000 km2) in 2005. The 2010 drought had three epicenters where vegetation died off, whereas in 2005 the drought was focused on the southwestern part. The findings were published in the journal Science. In a typical year the Amazon absorbs 1.5 gigatons of carbon dioxide; during 2005 instead 5 gigatons were released and in 2010 8 gigatons were released.
Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
Fresno has three large public parks, two in the city limits and one in county land to the southwest. Woodward Park, which features the Shinzen Japanese Gardens, numerous picnic areas and several miles of trails, is in North Fresno and is adjacent to the San Joaquin River Parkway. Roeding Park, near Downtown Fresno, is home to the Fresno Chaffee Zoo, and Rotary Storyland and Playland. Kearney Park is the largest of the Fresno region's park system and is home to historic Kearney Mansion and plays host to the annual Civil War Revisited, the largest reenactment of the Civil War in the west coast of the U.S.
Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
Telenet was the first FCC-licensed public data network in the United States. It was founded by former ARPA IPTO director Larry Roberts as a means of making ARPANET technology public. He had tried to interest AT&T in buying the technology, but the monopoly's reaction was that this was incompatible with their future. Bolt, Beranack and Newman (BBN) provided the financing. It initially used ARPANET technology but changed the host interface to X.25 and the terminal interface to X.29. Telenet designed these protocols and helped standardize them in the CCITT. Telenet was incorporated in 1973 and started operations in 1975. It went public in 1979 and was then sold to GTE.
Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
The plague disease, caused by Yersinia pestis, is enzootic (commonly present) in populations of fleas carried by ground rodents, including marmots, in various areas including Central Asia, Kurdistan, Western Asia, Northern India and Uganda. Nestorian graves dating to 1338 near Lake Issyk Kul in Kyrgyzstan have inscriptions referring to plague and are thought by many epidemiologists to mark the outbreak of the epidemic, from which it could easily have spread to China and India. In October 2010, medical geneticists suggested that all three of the great waves of the plague originated in China. In China, the 13th century Mongol conquest caused a decline in farming and trading. However, economic recovery had been observed at the beginning of the 14th century. In the 1330s a large number of natural disasters and plagues led to widespread famine, starting in 1331, with a deadly plague arriving soon after. Epidemics that may have included plague killed an estimated 25 million Chinese and other Asians during the 15 years before it reached Constantinople in 1347.
Lines changed: 11 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,11 @@
1+
<DOCUMENT>
2+
{{document}}
3+
</DOCUMENT>
4+
5+
# Instructions
6+
- Answer the question below using strictly information from the document above.
7+
- Provide a brief answer, without any unnecessary comments.
8+
- If the document does not provide enough information to answer this question, respond with "ANSWER_NOT_FOUND".
9+
10+
Question: {{question}}
11+
Answer:
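
To make the variable substitution concrete, here is a simplified stand-in for what happens when the template above is populated from one dataset record. Promptfoo performs its own template rendering and loads `file://` values from disk; the hypothetical `render` function below skips both and takes the document text directly, so treat it as an illustration of the idea rather than Promptfoo's implementation. The sample document and question are shortened examples, not dataset records.

```python
import re

TEMPLATE = """<DOCUMENT>
{{document}}
</DOCUMENT>

# Instructions
- Answer the question below using strictly information from the document above.
- Provide a brief answer, without any unnecessary comments.
- If the document does not provide enough information to answer this question, respond with "ANSWER_NOT_FOUND".

Question: {{question}}
Answer:"""


def render(template: str, variables: dict[str, str]) -> str:
    """Replace each {{name}} with its value; unknown names raise KeyError."""
    return re.sub(
        r"\{\{\s*(\w+)\s*\}\}",
        lambda match: variables[match.group(1)],
        template,
    )


prompt = render(TEMPLATE, {
    "document": "Telenet was incorporated in 1973 and started operations in 1975.",
    "question": "When was Telenet incorporated?",
})
print(prompt)
```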