Add CLI support for task prompt synchronization #346

Draft
formgit wants to merge 1 commit into main from add_task_sync_script_cli
@formgit
Collaborator

@formgit formgit commented Mar 18, 2026

  • Added gd gen-task-suite command to batch update task files with the latest prompts
  • Added --sync-task option to gd dev to force update a task prompt
  • Added warnings in gd dev when an existing task prompt drifts from prompts.md
  • Updated 6 out-of-sync task descriptions (3 before the rebase)

@formgit formgit force-pushed the add_task_sync_script_cli branch from 8fc7f89 to e1a3a05 March 18, 2026 17:12
@formgit formgit requested review from micahjo7 and paulirish March 18, 2026 17:14
@formgit formgit marked this pull request as draft March 18, 2026 17:31
Comment thread bin/gd.ts
${"Piece-wise options for `dev`:"}
${cDim('--grade')} Run/calibrate grader
${cDim('--test-grader')} Check grader calibration (demo + negative-demo)
${cDim('--sync-task')} Force update task prompt from prompts.md
Collaborator

maybe we can just make the default for gd dev to update existing tasks (just the prompt)? and remove this sync-task option?

Collaborator Author

If we make it the default without a flag, would there be a concern that task descriptions get unintentionally overwritten? Or maybe gd dev itself (without any flag) is already intentional enough.

Collaborator Author

By default, if a dev runs gd dev without the --sync-task option, it still checks whether any tasks are outdated. Instead of syncing the prompts, it warns the dev about any outdated tasks.
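
To make the warn-versus-sync behavior concrete, here is a minimal sketch. It is not the code in bin/gd.ts: the names `reconcilePrompt`, `normalize`, and `SyncResult` are hypothetical, and treating whitespace-only differences as "not drifted" is an assumption.

```typescript
// Hypothetical sketch of the behavior described above; names are illustrative only.
type SyncResult = { prompt: string; warning?: string };

// Normalize line endings and surrounding whitespace so that purely
// cosmetic differences are not reported as drift (an assumption).
function normalize(text: string): string {
  return text.replace(/\r\n/g, "\n").trim();
}

// Decide what the task prompt should be after a run.
// - In sync: keep the task prompt, no warning.
// - Drifted and syncTask set: take the prompt from prompts.md.
// - Drifted and syncTask unset: keep the task prompt and return a warning.
function reconcilePrompt(
  taskName: string,
  taskPrompt: string,
  sourcePrompt: string,
  syncTask: boolean,
): SyncResult {
  if (normalize(taskPrompt) === normalize(sourcePrompt)) {
    return { prompt: taskPrompt };
  }
  if (syncTask) {
    return { prompt: sourcePrompt };
  }
  return {
    prompt: taskPrompt,
    warning: `Task "${taskName}" prompt differs from prompts.md; rerun with --sync-task to update it.`,
  };
}
```

Keeping the decision in a pure function like this would let the same check back both the warning path and the forced update, with file I/O handled by the caller.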

Collaborator

I think it is fine to have it overwrite. for now we can keep the prompts consistent between the two locations.

@paulirish
Member

i'd be interested in solving this problem a slightly different way.. essentially move task.md into the usecase folder. delete prompt. use task for everything we used prompt for.

this also implies synced prompts, naturally.

But I don't currently recall a reason why we need these two things to be separate. especially now with the choice of empty vs daily-grind.

@micahjo7 do you know a reason why we'd want a prompt and task to have diff content? IIRC we chatted about this briefly last week. I think on a call.

@micahjo7
Collaborator

micahjo7 commented Mar 18, 2026

> @micahjo7 do you know a reason why we'd want a prompt and task to have diff content? IIRC we chatted about this briefly last week. I think on a call.

that would also work and simplify things. I don't think we need them to have different content. a couple things we would need to address though:

  1. some of the prompts.md files have multiple prompts in them. if we move them to the task files, we will need to update the existing logic to take only the first prompt when running the tasks (currently, we only have logic to put the first prompt from prompts.md INTO -task.md)

  2. in our dashboard, we read the task files to surface things like base_app, prompt, and diff, so we would have to update this to find the task files in the guides/ folder rather than harness/tasks (http://github.com/GoogleChrome/guidance/blob/main/eval-view/dashboard.js#L502). https://github.com/GoogleChrome/guidance/blob/main/eval-view/package.json#L8 would need to be updated too (it copies the resources to GH pages when deploying, to support the dashboard remotely)

  3. it would be good to handle negative tasks as well (tasks/negative/ and negative-suite-gen). Ideally, we also remove tasks/negative, and update negative-suite-gen to only generate the negative base apps. Then, in our suite runner, if the config isNegative = true, we just swap out the base_app from the new task file to negative/<guide-name>. In the dashboard we would also need to pass isNegative through the config (in evals.json: https://github.com/GoogleChrome/guidance/blob/main/harness/evaluate.ts#L31) and when true in dashboard.js, we read the negative base app manually https://github.com/GoogleChrome/guidance/blob/main/eval-view/dashboard.js#L512

  4. also we could probably remove the grader field from the task since it is already in the guide's folder
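
The base-app swap in item 3 could look roughly like the sketch below. The `Task` and `RunConfig` shapes and the `resolveBaseApp` name are assumptions made for illustration; only `base_app`, `isNegative`, and the `negative/<guide-name>` convention come from the discussion above.

```typescript
// Hypothetical shapes; the real task and config types in the harness may differ.
interface Task {
  base_app: string;
  prompt: string;
}

interface RunConfig {
  guideName: string;
  isNegative?: boolean;
}

// Pick the base app for a run: the negative variant when isNegative is set,
// otherwise whatever the task file declares.
function resolveBaseApp(task: Task, config: RunConfig): string {
  return config.isNegative ? `negative/${config.guideName}` : task.base_app;
}
```

With a helper along these lines, the suite runner and the dashboard could share one rule for locating the negative base app instead of relying on separate generated task files.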

maybe that's a lot.. @paulirish let me know what you think, if we want to go ahead with this, I can talk thru these things in more detail with you @formgit

@paulirish
Member

paulirish commented Mar 24, 2026

I like these ideas and think we should do it. 👍

@formgit i think there's still a bit of nuance here that you'll have to figure out.. or at least raise as questions we need to answer... like... how do we want to handle multiple prompts? and some of the other things micah raised.

but i'm excited to do this to clarify our eval pipeline and workflow. this prompt/task overlap has been a code smell that's primarily my fault.

@formgit
Collaborator Author

formgit commented Mar 25, 2026

> I like these ideas and think we should do it. 👍
>
> @formgit i think there's still a bit of nuance here that you'll have to figure out.. or at least raise as questions we need to answer... like... how do we want to handle multiple prompts? and some of the other things micah raised.
>
> but i'm excited to do this to clarify our eval pipeline and workflow. this prompt/task overlap has been a code smell that's primarily my fault.

  • For handling the multiple prompts, I think we can keep them as a standard markdown bulleted list inside the new task.md body. This way, it's clearly listed and easy to parse if we were to add logic to use different prompts from the list in the future.

  • After moving the task.md files under guides/, I can update all logic (including the dashboard and deployment scripts) that previously pointed to the original task.md files.

  • For negative tasks (originally under harness/tasks/negative), I think it makes sense to run dynamically rather than generate separate task files. So in negative-suite-gen, I can keep the negative base app generation but ditch the negative tasks generation. Then, at run time, the suite runner will swap the base app to its negative equivalent if isNegative is set. I'll make sure the dashboard is updated to read an isNegative flag from evals.json to support this.

  • Since task.md will be under guides/, I can remove the grader fields and hook the runner directly to grader.ts.
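
As a rough illustration of the first bullet, the sketch below reads prompts from a markdown bulleted list and picks the first one. The function names are hypothetical, and it assumes each prompt fits on a single bullet line; multi-line bullets would need a real markdown parser.

```typescript
// Hypothetical sketch: collect single-line bullet items ("-", "*", or "+")
// from a task.md body. Non-bullet lines are ignored.
function parsePrompts(body: string): string[] {
  const prompts: string[] = [];
  for (const line of body.split(/\r?\n/)) {
    const match = /^\s*[-*+]\s+(.*\S)\s*$/.exec(line);
    if (match && match[1] !== undefined) {
      prompts.push(match[1]);
    }
  }
  return prompts;
}

// The runner would use only the first prompt for now, as discussed above.
function firstPrompt(body: string): string | undefined {
  return parsePrompts(body)[0];
}
```

Returning the full list from `parsePrompts` keeps the door open for selecting other prompts later without changing the file format.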

This is my high level understanding and it's clear to me atm. Will raise more questions as they surface during implementation.
