Conversation
This is a different use case than flux-mcp, and arguably we should not be exposing extra information about the cluster.
And we can do this with other params in the future.
To be clear, this does not influence the experiment execution; it is just that we are missing the CPU affinity in the ground truth (but it is always there).
We want to be able to run dispatch experiments, and we need a hardened way to associate a very specific granularity of an actual command with a prompt. For example, here are four ways I can ask for a resource manager:
Those examples are prompt styles. The high-level research questions might be:
The other variable we have to model is the actual complexity of the command. For example:
To be explicit: asking for just running lammps (no flags) is a much simpler request than asking for CPU affinity, so we have to model that too.
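A minimal sketch of the two dimensions, with purely illustrative names (these are not the library's actual variables):

```python
# Purely illustrative sketch: the two dimensions we vary per experiment.
# Names and values here are hypothetical, not the library's actual API.

# Prompt styles: different ways of phrasing the same request.
PROMPT_STYLES = [
    "direct",          # "Submit lammps with flux."
    "conversational",  # "Hey, could you run a lammps job for me?"
    "technical",       # "Use the resource manager tool to submit lammps."
    "indirect",        # "I need lammps results; handle the submission."
]

# Command complexity: how much of the command the agent must get right.
# Asking for no flags is simpler than asking for citation suppression,
# which is simpler than also asking for CPU affinity.
COMPLEXITY_LEVELS = ["bare", "nocite", "affinity"]
```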
If you put the prompt style (first) together with the command complexity (second), you get a matrix of possible configurations. If you model each piece of the command, it can get large very quickly, but that's OK! As long as we can capture the exact choice and granularity for each dimension (to compare to the baseline), I think we can assess the relative contribution of a style/complexity to an outcome. The outcome can be a figure of merit, wall time, or the content of a log (e.g., telling the agent to use `-nocite` or "remove the citation"). This is what I implemented today, and I've run the base cases for it.

This is the basis for the dispatch experiments. We have baseline runs across the actual lammps commands, and we compare them to different (programmatically generated and controlled) prompts. We will (I hope) see when things start to fall off, which strategies and combinations are good, and maybe which specific parts of the request are harder for the agent to handle. Add that to what we modeled for negotiation (e.g., expected vs. actual tool calls, success rate to actually run it, reporting the correct job id, etc.), and I think it's a good assessment for answering the question "How well can an agent reliably submit jobs for us?" I've finished the runs for the base cases, and I am doing two problem sizes (a larger, longer-running one and a smaller one) because it occurred to me that one result could be biased (e.g., if a smaller running time has smaller variation, it might falsely look better).
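As a sketch of capturing the matrix and the outcomes we record per cell (again, hypothetical names, assuming the dimensions from the sketch above):

```python
import itertools
from dataclasses import dataclass, field

# Hypothetical dimensions, continuing the sketch above.
PROMPT_STYLES = ["direct", "conversational", "technical", "indirect"]
COMPLEXITY_LEVELS = ["bare", "nocite", "affinity"]


@dataclass
class DispatchRun:
    """One cell of the style x complexity matrix, plus measured outcomes."""

    prompt_style: str
    complexity: str
    wall_time: float | None = None         # compared against the baseline run
    figure_of_merit: float | None = None
    log_checks: dict = field(default_factory=dict)  # e.g. {"citation_removed": True}


# The full matrix: every prompt style crossed with every complexity level.
matrix = [
    DispatchRun(prompt_style=style, complexity=level)
    for style, level in itertools.product(PROMPT_STYLES, COMPLEXITY_LEVELS)
]
print(f"{len(matrix)} configurations")  # 4 styles x 3 levels = 12
```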
The design of this is really cool in that we model an application akin to a provider, of which we have 51 across real and simulated systems (this was the SC26 paper). It's cool because most of the discovery and base class behavior I inherit for the app from the BaseProvider. The prompt generation that requires the workload manager (Flux, a provider) comes from the provider class - they are working together! We do a much better job here modeling the prompts, and I would like to make a more hardened prompt generator class under simulation that can be shared between the two, and then design the prompt generation for providers akin to apps. I might want to redo earlier experiments with this new strategy (the generation can be generalized to work with the providers). I'm really liking (and enjoying working on) this library.
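Roughly, the shape of that design is something like this (class and method names below are illustrative guesses, not the actual code):

```python
class BaseProvider:
    """Shared discovery and base-class behavior (illustrative sketch)."""

    def discover(self) -> dict:
        """Discover what is available on the cluster."""
        raise NotImplementedError


class FluxProvider(BaseProvider):
    """The workload manager provider: owns submission phrasing."""

    def submission_prompt(self, command: str, style: str) -> str:
        # Wrap the command in a style-specific request to the agent.
        templates = {
            "direct": f"Submit this job with flux: {command}",
            "indirect": f"I need the results of `{command}`; please handle the submission.",
        }
        return templates[style]


class LammpsApp(BaseProvider):
    """An application modeled akin to a provider: discovery is inherited,
    while prompt generation comes from the workload manager provider."""

    def __init__(self, provider: FluxProvider):
        self.provider = provider  # the two classes work together

    def command(self, complexity: str) -> str:
        # Map a complexity level to an actual command (-nocite is a real lammps flag).
        flags = {"bare": "", "nocite": " -nocite"}[complexity]
        return f"lmp{flags} -in in.lj"

    def prompt(self, style: str, complexity: str) -> str:
        return self.provider.submission_prompt(self.command(complexity), style)
```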
I usually like to test the experiment setup and tweak details before full runs, so I should be able to run the experiments this week. The orchestration is complete, and the cluster setup with the mcp-server is fully working. I added a new dual mode as a lazy man's way of saying "come up as both a hub and a worker," just for testing.