feat: support for app model#3

Open
vsoch wants to merge 18 commits into main from dispatch

Conversation

@vsoch
Member

@vsoch vsoch commented Apr 13, 2026

We want to be able to do dispatch experiments, and we need a hardened way to associate very specific granularity of an actual command with a prompt. For example, here are four ways I can ask for a resource manager:

  • exact: provide the full, exact command (akin to the baseline or ground truth)
  • explicit: use "flux run" to do...
  • verbose: "You should use the flux resource manager..."
  • discovery: "Use the resource manager and max nodes/tasks you discover"

Those examples are prompt styles. The high level research questions might be:

  • What are the ways a researcher can best ask to run a computational workload?
  • What styles shouldn't they use? What styles lead to unexpected outcomes?
  • What styles are comparable to ground truth (running the exact command we intended)?

The other variable we have to model is the actual complexity of the command. For example:

  • simple: "Just run the lmp binary with no flags"
  • hard: "Run lammps with all these customizations, affinity, etc."

To be explicit, asking to just run lammps (no flags) is a much simpler request than asking for CPU affinity, so we have to model that dimension too.

If you put the prompt style (first) together with the command complexity (second), you get a matrix of possible configurations. If you model each piece of the command, it can get large very quickly, but that's OK! As long as we can capture the exact choice and granularity for each dimension (to compare to the baseline), I think we can assess the relative contribution of a style/complexity to an outcome. The outcome can be a figure of merit, wall time, or the content of a log (e.g., telling the agent to use -nocite or to "remove the citation"). This is what I implemented today, and I've run the base cases for it.
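To make the matrix idea concrete, here is a minimal sketch (the names `PROMPT_STYLES` and `COMPLEXITY` are hypothetical, not the identifiers in this PR) of how the configurations multiply out before any per-flag dimensions are added:

```python
from itertools import product

# Hypothetical names for illustration; the actual enums in this PR may differ.
PROMPT_STYLES = ["exact", "explicit", "verbose", "discovery"]
COMPLEXITY = ["simple", "hard"]

# The full experiment matrix: every (style, complexity) pair is one
# configuration that can be compared against the exact/baseline run.
matrix = list(product(PROMPT_STYLES, COMPLEXITY))
print(len(matrix))  # 8 configurations before modeling individual command pieces
```

Each additional dimension (affinity, flags, problem size) multiplies the matrix again, which is why capturing the exact choice per dimension matters for attributing outcomes.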

This is the basis for dispatch experiments. We have baseline runs across the actual lammps commands, and we compare them to different (programmatically controlled) prompts. We will (I hope) see when things start to fall off, which strategies / combinations are good, and maybe which specific parts of the request are harder for the agent to handle. Add that to what we modeled for negotiation (e.g., expected vs. actual tool calls, success rate to actually run it, reporting the correct job id, etc.) and I think it's a good assessment for answering the question "How well can an agent reliably submit jobs for us?" I've finished the runs for the base cases, and I am doing two problem sizes (a larger, longer-running one and a smaller one) because it occurred to me that one result could be biased (e.g., if a smaller running time has smaller variation, it might falsely look better).
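The per-run metrics named above (expected vs. actual tool calls, run success, correct job id) could be captured in a record like the following. This is a hypothetical sketch, not the structure used in the repository:

```python
from dataclasses import dataclass


# Hypothetical outcome record mirroring the metrics named above
# (expected vs. actual tool calls, run success, correct job id).
@dataclass
class Outcome:
    expected_tool_calls: int
    actual_tool_calls: int
    ran_successfully: bool
    reported_correct_job_id: bool

    def matches_baseline(self) -> bool:
        # A run "matches" only if it succeeded, reported the right job id,
        # and took exactly the tool calls we expected.
        return (
            self.ran_successfully
            and self.reported_correct_job_id
            and self.actual_tool_calls == self.expected_tool_calls
        )


baseline = Outcome(3, 3, True, True)
drifted = Outcome(3, 7, True, False)
print(baseline.matches_baseline(), drifted.matches_baseline())  # True False
```

Aggregating `matches_baseline` rates per (style, complexity) cell is one way to see where strategies start to fall off.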

The design of this is really cool in that we model an application akin to a provider, of which we have 51 across real and simulated (this was the SC26 paper). It's cool because the app inherits most of its discovery logic and base class from the BaseProvider. The prompt generation that requires the workload manager (Flux, a provider) comes from the provider class - they are working together! We do a much better job here modeling the prompts, and I would like to make a more hardened prompt generator class under simulation that can be shared between the two, and then design the prompt generation for providers akin to apps. I might want to redo earlier experiments with this new strategy (the generation can be generalized to work with the providers). I'm really liking (and enjoying working on) this library.
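The app-as-provider idea might be sketched as follows. Everything here is hypothetical (the real `BaseProvider` interface and any `App` class in this repository will differ); it only illustrates the shape of an app inheriting discovery while composing with a workload-manager provider for prompt generation:

```python
# Hypothetical sketch of the design described above; not the actual classes.
class BaseProvider:
    """Shared discovery plumbing inherited by providers (e.g., Flux) and apps."""

    name = "base"

    def discover(self):
        # Stand-in discovery result; a real provider would query the cluster.
        return {"nodes": 1, "tasks_per_node": 4}


class App(BaseProvider):
    """An application (e.g., lammps) modeled akin to a provider."""

    name = "lammps"

    def __init__(self, provider):
        # The workload manager provider supplies the submission context
        # (e.g., "flux run") used when generating prompts.
        self.provider = provider

    def generate_prompt(self, style, complexity):
        resources = self.provider.discover()
        command = "lmp" if complexity == "simple" else "lmp -custom -flags ..."
        if style == "discovery":
            return f"Use the resource manager and max nodes/tasks you discover to run {command}"
        return f'Use "{self.provider.name} run" on {resources["nodes"]} node(s) to run: {command}'


flux = BaseProvider()
flux.name = "flux"
app = App(flux)
print(app.generate_prompt("explicit", "simple"))
```

The point of the composition is that the app never hardcodes the workload manager: swapping the provider swaps the submission context in every generated prompt.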


I usually like to test the experiment setup and tweak details before full runs, so I should be able to run the experiments this week. The orchestration is complete, and the cluster setup with the mcp-server is fully working. I added a new dual mode as a lazy man's way of saying "come up as a hub and a worker," just for testing.

Signed-off-by: vsoch <vsoch@users.noreply.github.com>
Signed-off-by: vsoch <vsoch@users.noreply.github.com>
@vsoch vsoch force-pushed the dispatch branch 2 times, most recently from 2a81be3 to 62bc9bf on April 13, 2026 18:33
Signed-off-by: vsoch <vsoch@users.noreply.github.com>
Signed-off-by: vsoch <vsoch@users.noreply.github.com>
@vsoch vsoch force-pushed the dispatch branch 4 times, most recently from 4ced393 to 01700e9 on April 14, 2026 04:53
Signed-off-by: vsoch <vsoch@users.noreply.github.com>
@vsoch vsoch force-pushed the dispatch branch 2 times, most recently from fd07c70 to c41618d on April 14, 2026 17:00
This is a different use case than flux-mcp, and arguably we should not
be exposing extra information about the cluster.

Signed-off-by: vsoch <vsoch@users.noreply.github.com>
@vsoch vsoch force-pushed the dispatch branch 2 times, most recently from 7e5b2aa to 462d998 on April 14, 2026 17:40
Signed-off-by: vsoch <vsoch@users.noreply.github.com>
vsoch added 5 commits April 14, 2026 11:08
Signed-off-by: vsoch <vsoch@users.noreply.github.com>
Signed-off-by: vsoch <vsoch@users.noreply.github.com>
Signed-off-by: vsoch <vsoch@users.noreply.github.com>
Signed-off-by: vsoch <vsoch@users.noreply.github.com>
And we can do this with other params in the future

Signed-off-by: vsoch <vsoch@users.noreply.github.com>
vsoch added 3 commits April 15, 2026 20:32
Signed-off-by: vsoch <vsoch@users.noreply.github.com>
to be clear, this does not influence the experiment execution,
it is just that we are missing the cpu affinity in the ground
truth (but it is always there).

Signed-off-by: vsoch <vsoch@users.noreply.github.com>
Signed-off-by: vsoch <vsoch@users.noreply.github.com>
@vsoch vsoch force-pushed the dispatch branch 5 times, most recently from bd0011d to 8250aa0 on April 16, 2026 21:41
Signed-off-by: vsoch <vsoch@users.noreply.github.com>
vsoch added 2 commits April 17, 2026 20:38
Signed-off-by: vsoch <vsoch@users.noreply.github.com>
Signed-off-by: vsoch <vsoch@users.noreply.github.com>
