Skip to content

feat(tests): stateful test filling for benchmark tests#2637

Draft
fselmo wants to merge 8 commits intoethereum:forks/amsterdamfrom
fselmo:feat/stateful-engine-test-filling
Draft

feat(tests): stateful test filling for benchmark tests#2637
fselmo wants to merge 8 commits intoethereum:forks/amsterdamfrom
fselmo:feat/stateful-engine-test-filling

Conversation

@fselmo
Copy link
Copy Markdown
Contributor

@fselmo fselmo commented Apr 8, 2026

🗒️ Description

Adds a new fill-stateful CLI command that generates BlockchainEngineStatefulFixture files by running benchmark tests against a live EL client via testing_buildBlockV1.

Key points:

  • Connects to a running client loaded with a network snapshot (e.g. perfnet / bloatnet)
  • Records engine API payloads per test, separated into setup and execution phases
  • Resets chain state between tests via debug_setHead
  • Funds test accounts via CL withdrawals — no pre-funded seed key required
  • Outputs fixture JSON + a pre_run/global_setup.json for session-level setup (factory deploy, etc.)

The generated fixtures are consumed by benchmarkoor to replay benchmarks against any EL client from the same snapshot. Replaces the gas-benchmarks MITMProxy-based approach.


Run against a running client that has testing, eth, debug available and has testing_buildBlockV1 support...

uv run fill-stateful \
    --clean \
    --engine-jwt-secret-file /path/to/jwt \
    --fork Osaka \
    tests/benchmark/stateful/ \
    -k ...

Optional:

  • --rpc-endpoint — defaults to http://localhost:8545
  • --engine-endpoint — derived from RPC endpoint with port 8551
  • --chain-id — auto-detected from the client
  • --rpc-seed-key — random key generated and funded via CL withdrawal
  • --output — defaults to ./fixtures

🔗 Related Issues or PRs

#2560

✅ Checklist

  • All: Ran fast static checks to avoid unnecessary CI fails, see also Code Standards and Enabling Pre-commit Checks:
    just static
  • All: PR title adheres to the repo standard - it will be used as the squash commit message and should start type(scope):.
  • All: Considered updating the online docs in the ./docs/ directory.
  • All: Set appropriate labels for the changes (only maintainers can apply labels).
  • Cute animal picture 🙀

Cute Animal Picture

Screenshot 2026-04-14 at 13 27 48

@codecov
Copy link
Copy Markdown

codecov Bot commented Apr 8, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 86.24%. Comparing base (8ae9dcc) to head (1aa5f99).
⚠️ Report is 11 commits behind head on forks/amsterdam.

Additional details and impacted files
@@               Coverage Diff                @@
##           forks/amsterdam    #2637   +/-   ##
================================================
  Coverage            86.24%   86.24%           
================================================
  Files                  599      599           
  Lines                36984    36984           
  Branches              3795     3795           
================================================
  Hits                 31895    31895           
  Misses                4525     4525           
  Partials               564      564           
Flag Coverage Δ
unittests 86.24% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@fselmo fselmo force-pushed the feat/stateful-engine-test-filling branch from 2aa1826 to f01b02b Compare April 8, 2026 23:02
@fselmo fselmo added C-feat Category: an improvement or new feature A-test-benchmark Area: execution_testing.benchmark and tests/benchmark A-test-cli Area: execution_testing.cli A-test-fixtures Area: execution_testing.fixtures labels Apr 8, 2026
@fselmo fselmo force-pushed the feat/stateful-engine-test-filling branch from f01b02b to 1aa5f99 Compare April 8, 2026 23:13
@fselmo fselmo changed the title feat(tests): stateful engine test filling for benchmark tests feat(tests): stateful test filling for benchmark tests Apr 9, 2026
@fselmo fselmo marked this pull request as ready for review April 14, 2026 19:28
Copy link
Copy Markdown
Member

@jochem-brouwer jochem-brouwer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Super exciting this is already close to what we would need! This is starting to come close to indeed what the MITM is doing, except now EELS is driving it 😄 👍

Lemme know if there are any questions after this review 😄 👍

next_block_number = int(HexNumber(head_block["number"]) + 1)
next_timestamp = int(HexNumber(head_block["timestamp"]) + 1)
payload_attributes = self._payload_attributes(
next_block_number=next_block_number,
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Block number is not part of payload attributes? See also the amsterdam version of PayloadAttributesV4

next_timestamp=next_timestamp,
withdrawals=withdrawals,
)
new_payload = self.testing_rpc.build_block(
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

new_payload = self.testing_rpc.build_block(
parent_block_hash=Hash(head_block["hash"]),
payload_attributes=payload_attributes,
transactions=None,
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the transactions parameter is JSON null, the client MAY build a block from its local transaction pool (mempool). (from https://github.com/ethereum/execution-apis/blob/main/src/testing/testing_buildBlockV1.yaml)
I don't think we want this (?), I think

Suggested change
transactions=None,
transactions=[],

@pytest.fixture(scope="session")
def sender_fund_refund_gas_limit() -> int:
"""Gas limit for fund/refund transactions (replaces sender)."""
return 21_000
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(should be read from fork config as this changes in eip2780)

Resolved before ``execute_required_contracts`` via fixture
dependency, so this records the pre-setup chain head.
"""
block = eth_rpc.get_block_by_number("latest")
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should not be queried by number, this should be a command line argument where you provide the blockhash. A number is ambiguous, if you have a client and you stop the client at chain head, if it then reorgs we are now building on a block which no one has 😅 \

If no argument for snapshot hash provided, it is ok (with warn/log) to read chain head and use that.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, using the latest block can lead to unexpected behavior, like we saw in test_ec_pairing, where the block did not reset correctly to the original snapshot start point.

start_hex = start_block["number"]
logger.info(f"Resetting chain to start block {start_hex}")
try:
debug_rpc.set_head(start_hex)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure what this does, would it also set forkchoice in the client correctly? I think this is what should be done instead using standardized APIs:

Let's say we have done the setup and we have reached the start block. Build two testing_buildBlock on top of this start block. Everything should be empty, no withdrawals, no transactions) and all other parameters should be equal except extraData. One of the blocks will be the block hash you use as parentHash in testing_buildBlockV1 and the other one is the reorg block. We never build on top of this blockhash, we only use this to reorg to.

Why? If you try to fcu back to a block which is already in the current canonical chain, then most/all clients will throw. If we use this approach by creating two different blocks which are basically equal except their extraData, we can use it as fork block because everything is the same (RPC calls would also return the correct data on state, except of course last block hash).

This would use existing API and not debug_setHead as I think this differs for all clients and there might be unforeseen side effects.

BTW, this double-block approach is not my idea, it's from @kamilchodola 😄 👍

If something unclear here let us know 😄 👍

)


class StatefulPreRunFixture(CamelModel):
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this a special class or a class with a special name? I think these prerun fixtures are essentially the same layout as normal fixtures right?

For benchmark/repricing effort we would need to control this presetup also. For repricing and also stateful benchmarks we cannot do test setup right before the benchmark block (this loads all state into memory cache and makes the ops super fast). We would need to create this state already in the snapshot / start block (the block where all tests start from).
To give an example, we would take a recent snapshot of mainnet and now in the prerun fill it will the desired data for repricings (basically setting up the state which we also have on perf-devnet-3, except that this entire state is much bigger and is thus worse than mainnet).

Once this is done, we don't run any tests (yet) and now use the chain state right after this prerun as the snapshot state. This is what we distribute to benchmarkoor. We also fill the benchmark fixtures on top of this.

We do this because now the state which we target in the benchs is already in the snapshot (we do not create it when actually running the benchs). This forces the client to actually read the DB and not use some memory cache.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Not super urgent because can also write one big test and once test finishes just use that to get the desired snapshot)

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a special setup phase that already exists in the gas-benchmarks mechanism (which is globals setting, or something else) It's not for individual benchmark setup; rather, it ensures common contracts (like the factory contract using Nick's method) are pre-deployed on the network. Without this initial setup, all subsequent create2 address calculations and contract deployments would fail.

def set_head(self, block_number: str) -> None:
"""`debug_setHead`: Reset chain head to the given block."""
self.post_request(
request=RPCCall(method="setHead", params=[block_number])
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The fact that this is a number and not a hash indicates that we cannot for instance perform a reorg via this RPC call (we can only roll back into the canonical chain). This would be limiting. Also don´t know if this updates the forkchoice status of client.

@jochem-brouwer
Copy link
Copy Markdown
Member

Okay I realized I might be too skeptical about debug_setHead 😅 Ignore that if it's not right.

Copy link
Copy Markdown
Collaborator

@LouisTsai-Csie LouisTsai-Csie left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fill_stateful.py is mixing three things, types, pytest hooks, and fixtures, now without clear separation. It's a bit hard to know where to start.

The entry point (pytest_configure) is at L234 behind classes and helpers, and pytest_sessionfinish is way down at L587. Plus, fixtures are declared out of order (e.g., execute_required_contracts at L291 depends on stuff at L406 and L428), so we need to jump around to understand the flow.

I would suggest adding section headers and reorganizing (maybe types → helpers → pytest hooks → fixtures by lifecycle → per-test fixtures) might make it way easier to follow. wdyt?

@click.command(
name="fill-stateful",
context_settings={
"help_option_names": ["-h", "--help"],
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I run uv run fill-stateful --help command and this shows up, is it expected behavior?

(ethereum-execution) caijiacheng@caijiachengs-MacBook-Air wt-forks-op-gas-cost % uv run fill-stateful --help
Retrying execution_testing.rpc.rpc.BaseRPC._make_request in 0.5 seconds as it raised ConnectionError: HTTPConnectionPool(host='localhost', port=8545): Max retries exceeded with url: / (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x10be56e40>: Failed to establish a new connection: [Errno 61] Connection refused')).
Retrying execution_testing.rpc.rpc.BaseRPC._make_request in 1.0 seconds as it raised ConnectionError: HTTPConnectionPool(host='localhost', port=8545): Max retries exceeded with url: / (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x10bedc910>: Failed to establish a new connection: [Errno 61] Connection refused')).
Retrying execution_testing.rpc.rpc.BaseRPC._make_request in 2.0 seconds as it raised ConnectionError: HTTPConnectionPool(host='localhost', port=8545): Max retries exceeded with url: / (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x10bedccd0>: Failed to establish a new connection: [Errno 61] Connection refused')).

Or i should connect to the node first?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if this the root cause: the kwargs below (see fill_stateful function) is deleted and thus not being passed to the is_help_or_collectonly_mode function

Resolved before ``execute_required_contracts`` via fixture
dependency, so this records the pre-setup chain head.
"""
block = eth_rpc.get_block_by_number("latest")
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, using the latest block can lead to unexpected behavior, like we saw in test_ec_pairing, where the block did not reset correctly to the original snapshot start point.

Comment on lines +251 to +252
if is_help_or_collectonly_mode(config):
return
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure - should this check done before the rpc_endpoint and engine_endpoint check

Comment on lines +180 to +181
if version >= 4 and response.execution_requests is not None:
params.append(response.execution_requests)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if version >= 4 and response.execution_requests is not None:
params.append(response.execution_requests)
if version >= 4:
assert response.execution_requests is not None
params.append(response.execution_requests)

I checked Engine API v4+ (link), and found the request field is required. Should we assert its presence instead of ignoring it when it’s None?

Comment on lines +145 to +148
# Mixed phases: SETUP takes precedence (a block containing
# any setup transaction is considered part of the setup).
if TestPhase.SETUP in phases:
return TestPhase.SETUP
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I recall, some benchmarks include a single setup transaction for 7702 authorizations, followed by the benchmark transactions within the same block. In these cases, we will be labeling the block as a setup block.

This isn’t an issue for gas-benchmarks, since they keep collecting transactions within the same phase until either the buffer reaches the block gas limit or the phase changes. At that point, it flushes the transactions into a block and generate the payload.

@LouisTsai-Csie
Copy link
Copy Markdown
Collaborator

Quick question: who's responsible for mounting clients to the snapshot? In gas-benchmarks, Nethermind clients are mounted via overlay, a layer that captures state changes without modifying the snapshot itself. Does this PR have a similar mechanism? If not, where would that be handled?

@fselmo fselmo marked this pull request as draft April 20, 2026 22:20
@fselmo
Copy link
Copy Markdown
Contributor Author

fselmo commented Apr 20, 2026

@LouisTsai-Csie I appreciate the review. This is a working PoC, not something I plan to merge as is. I should've clarified it's still a work in progress and I'm actually making this a lot more fill based, rather than execute based. I will try to update this soon. Your feedback is valuable for future additions, thanks 🙏🏼.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

A-test-benchmark Area: execution_testing.benchmark and tests/benchmark A-test-cli Area: execution_testing.cli A-test-fixtures Area: execution_testing.fixtures C-feat Category: an improvement or new feature

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants