Define workflow for tests with ai and create initial tests for Dart SDK. #1251

Merged: polina-c merged 67 commits into main from eval-issue on Apr 24, 2026

polina-c (Member) commented Apr 21, 2026

Contributes to:

The eval verifies that these entities work well together:

  1. Python agent, plus whatever libraries it uses
  2. Dart SDK
  3. AI model

This PR:

  1. Adds an eval workflow that:
     • reads the API key from a secret
     • runs eval/run.sh
  2. Creates the project eval/dart_and_flutter with tests that verify that the app can:
     • ask a Gemini model for a joke directly
     • start the restaurant finder agent
     • ask the agent to send a request to the model
  3. Adds configurations to start processes via the VS Code UI

gemini-code-assist[bot]

This comment was marked as outdated.

polina-c changed the title from "Eval issue" to "Eval for rest finder." on Apr 21, 2026
polina-c changed the title from "Eval for rest finder." to "Eval for restaurant finder." on Apr 21, 2026
polina-c (Member, Author) commented Apr 23, 2026

Thank you. It seems we are almost on the same page here.

Some comments:

> this is an end-to-end test for Gen UI SDK + Restaurant finder, right?

Almost. It covers three things: Gen UI SDK + Restaurant finder agent + AI model.

The AI model is an important component here and is not stubbed. We want to see that the Gen UI SDK works reliably with real responses produced by the Restaurant finder agent together with the real AI model.
If we want to test the SDK against specific responses, we will create tests with a stubbed finder agent.

> I think this is closer to an end-to-end test than an eval system

I observe that the engineering community does not define "eval" consistently. So far I have seen these definitions:

  1. A test that tests a prompt
  2. A test that tests anything that involves AI
  3. A test that is allowed to fail X% of the time

It seems you use definition #1. I updated README.md in the folder 'eval' to contain this definition.
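
For contrast, definition 3 above can be sketched in shell. This is only an illustration, not something this PR implements: the helper name `passes_with_tolerance` and its arguments are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper illustrating definition 3: run a command several times
# and pass when the failure percentage stays within a tolerated limit.
# Usage: passes_with_tolerance <runs> <max_failure_percent> <command...>
passes_with_tolerance() {
  local runs="$1"
  local max_percent="$2"
  shift 2
  local failures=0
  local i=0
  while [ "$i" -lt "$runs" ]; do
    # Count every non-zero exit status as one failed run.
    "$@" >/dev/null 2>&1 || failures=$((failures + 1))
    i=$((i + 1))
  done
  # Integer form of: failures / runs <= max_percent / 100
  [ $((failures * 100)) -le $((max_percent * runs)) ]
}
```

For example, `passes_with_tolerance 10 20 some_test_command` would succeed only if at most 2 of the 10 runs fail, where `some_test_command` stands in for whatever command runs the test.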

> How about putting it at samples/agent/adk/restaurant_finder/end_to_end_tests/dart?

Yes, it makes sense to keep this test closer to the code it tests, which is the code of the finder's Dart client.
Initially I wanted to put it in the GenUI SDK repo, but now I see it will be easier to have it in the a2ui repo, so it can be tested together with the agent.

I am thinking of placing the client in:

samples/client/flutter/restaurant_finder/app

So it makes sense to put the test in:

samples/client/flutter/restaurant_finder/e2e_test

This PR removes the client from the GenUI SDK: flutter/genui#885

How does all this sound to you?

Changes I made:

  1. Renamed .github/workflows/eval.yaml to .github/workflows/test_with_ai.yaml
  2. Moved eval/run.sh to scripts/test_with_ai.sh
  3. Updated .github/PULL_REQUEST_TEMPLATE.md to include an instruction to run the tests manually on forked branches
  4. Moved the test to samples/client/flutter/restaurant_finder/e2e_test

Does this structure look right to you?

polina-c changed the title from "Enable Dart eval and create initial tests for Dart SDK." to "Define workflow for tests with ai and create initial tests for Dart SDK." on Apr 23, 2026

jacobsimionato (Collaborator) left a comment

Thank you for iterating! This looks great to me. I think it will be nice for developers to have a Flutter example in this repo, alongside the other client samples.

Comment thread scripts/e2e_test.sh
# To run script locally, you need to set API key as an environment variable.
# Example: export GEMINI_API_KEY=your_api_key

cd "$(dirname "$0")/../samples/client/flutter/restaurant_finder/e2e_test"
jacobsimionato (Collaborator):

This script is specific to the Flutter e2e tests for the restaurant finder agent. How about moving this script into samples/client/flutter/restaurant_finder/e2e_test and referencing it from there?

BTW part of my motivation here is that there is a plan to break out the A2UI repository into many repositories - https://docs.google.com/document/d/1914FKcF5LOXrj8y7-xf-45bJAwaPtcix_nBhOCuMECc/edit?resourcekey=0-g9pTHlodm5jGpkbknk-TJw&tab=t.0#heading=h.sd9tbgmewn0l

When we do this, it will be easier if the different parts (e.g. samples, renderers etc) are clearly separated to begin with.

Comment thread .github/PULL_REQUEST_TEMPLATE.md Outdated
- [ ] I have added updates to the [CHANGELOG].
- [ ] I updated/added relevant documentation.
- [ ] My code changes (if any) have tests.
- [ ] I have verified that [scripts/test_with_ai.sh](../scripts/test_with_ai.sh) passes.
jacobsimionato (Collaborator):

This is covered by CI, so I don't think we need it specifically here - we'll see if this is necessary / has worked based on the results of the CI.

If we did add an item to this checklist, I think it should be a way to run all the CI workflows locally, rather than just this one. In most cases, other tests will be more relevant, because this only covers the Python SDK, restaurant finder demo, and Flutter SDK. Maybe we can experiment with something like https://github.com/nektos/act ?

polina-c (Member, Author):

No, CI will run only for branches in the main repository, not for forks. Forked branches do not have access to the key.

I updated the requirement.

People will need to configure the key to run this test, so I do not see how act would help.
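
A minimal sketch of how a script could fail early with a clear message when the key is not configured, for example on a fork. It assumes the key is exposed as `GEMINI_API_KEY`, the variable named in the comment at the top of scripts/e2e_test.sh. The function name `require_api_key` and the message text are hypothetical and not taken from this PR.

```shell
#!/usr/bin/env bash
# Hypothetical guard: stop with a readable error when the API key is missing.
# GEMINI_API_KEY comes from the script's own comment; the rest is illustrative.
require_api_key() {
  if [ -z "${GEMINI_API_KEY:-}" ]; then
    echo "GEMINI_API_KEY is not set; export it before running the e2e tests." >&2
    return 1
  fi
}
```

A script could call `require_api_key || exit 1` before changing into the test directory, so that contributors running it from a fork see why the run cannot start.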

Comment thread .github/workflows/test_with_ai.yaml Outdated
# Use of this source code is governed by a BSD-style license that can be
# found in the LICENSE file.

name: test_with_ai_workflow
jacobsimionato (Collaborator):

How about calling this something like Restaurant Finder E2E Test and the file restaurant_finder_e2e_test, to be consistent with the other workflows, e.g. https://github.com/google/A2UI/blob/main/.github/workflows/lit_build_and_test.yml? I think that it should be "x test" rather than "test x", and the name here should be in regular English rather than snake case.

polina-c (Member, Author):

I intended this workflow, as well as scripts/test_with_ai.sh, to run all tests that require the key.

But yes, it may become the test for all e2e. I want to keep them all together, to make it easier to run them locally with one command.

I made the renames you requested.

Comment thread .github/workflows/e2e_test.yaml Outdated
polina-c merged commit e37809a into main on Apr 24, 2026
15 checks passed
polina-c deleted the eval-issue branch on April 24, 2026 01:09
github-project-automation (bot) moved this from Todo to Done in A2UI on Apr 24, 2026

2 participants