Skip to content

1.5.x cloud benchmarks#3615

Open
kmatasfp wants to merge 141 commits into
1.5.xfrom
1.5.x-cloud-benchmarks
Open

1.5.x cloud benchmarks#3615
kmatasfp wants to merge 141 commits into
1.5.xfrom
1.5.x-cloud-benchmarks

Conversation

@kmatasfp

@kmatasfp kmatasfp commented Jun 5, 2026

Copy link
Copy Markdown
Contributor

Also contains replacement for memory semaphore plus fixes eviction deadlock.

for memory semaphore replacement see:

  • golem-worker-executor/src/services/active_workers/admission/mod.rs
  • golem-worker-executor/src/services/active_workers/component_charge/mod.rs
  • golem-worker-executor/src/services/active_workers/memory_probe.rs

@kmatasfp kmatasfp requested a review from a team June 5, 2026 00:18
kill_all() is called after cloud_preflight_warmup completes. ProvidedShardManager
wraps an already-running process we don't own, so neither kill nor restart should
crash the binary. Both are now silent no-ops, matching UnavailableShardManager.
@kmatasfp kmatasfp force-pushed the 1.5.x-cloud-benchmarks branch 7 times, most recently from c5df097 to 2556fa3 Compare June 27, 2026 08:03

assert_eq!(result1.len(), 2, "G1002"); // TODO: this is temporarily not working because of using the dynamic invoke API and not having structured information in the oplog
assert_eq!(result2.len(), 2, "imported-function");
assert_eq!(result2.len(), 1, "imported-function");

@kmatasfp kmatasfp Jun 27, 2026

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this and other similar test changes are because we do not call host function to get agent metadata anymore

if matches!(get_asyncness(&constructor_method.sig), Asyncness::Future) {
return syn::Error::new_spanned(
&constructor_method.sig.ident,
"Agent constructors must be synchronous. Async constructors can call host APIs during snapshot restore and make recovery fall back to full replay.",

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

prob need a better message here, regardless Agent constructor should not be async imho, adds more complexity than value, unless I am missing something

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I disagree, I think that would be too limiting (no awaitable rpc, no http calls etc) - and whether it does side effects or not (it has to be able to!) does not actually depend on the "asyncness", in p2 all host functions are sync and in p3 we get async ones beside them.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But I see how this interferes with load-snapshot initializing the agent. I need to think more about it :)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One real-world example: it has been a useful pattern to spawn child agents and trigger some background operation in the constructor. During recovery, of course this does not get triggered again because it's persisted in the oplog. But if load-snapshot calls the same constructor, it's going to be re-trigger the same background operation with a fresh idempotency key (or crash if we detect it as an unallowed side-effect). Both are wrong. The assumption was so far that "it's the user's responsibility" but it's probably too easy to do it wrong.

pub mod websocket;

#[cfg(any(test, feature = "test-support"))]
pub mod snapshot_recovery_test_support {

@kmatasfp kmatasfp Jun 27, 2026

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could not think of a better way how to do this, I need some hook inside the snapshot recovery code that actually records that we recovered from snapshot or not. Rust makes it easier than other languages to use test Spy with conditional compilation

@kmatasfp kmatasfp force-pushed the 1.5.x-cloud-benchmarks branch 2 times, most recently from 54ce165 to 63aff3d Compare June 30, 2026 04:00
@kmatasfp kmatasfp force-pushed the 1.5.x-cloud-benchmarks branch from 63aff3d to 13df001 Compare June 30, 2026 04:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants