Skip to content

DELTA benchmark#1541

Draft
mnoukhov wants to merge 19 commits into
mainfrom
ballsim
Draft

DELTA benchmark#1541
mnoukhov wants to merge 19 commits into
mainfrom
ballsim

Conversation

@mnoukhov
Copy link
Copy Markdown
Contributor

No description provided.

@gemini-code-assist
Copy link
Copy Markdown
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request establishes a robust infrastructure for integrating BounceSim code execution and verification into the existing system. It provides a scalable API for evaluating Python code against BounceSim test cases, supported by Nginx for efficient load balancing. The changes also introduce a dedicated verifier and configuration options, enabling seamless integration into training workflows for models that generate code for physics simulations.

Highlights

  • BounceSim API: Introduced a new FastAPI application for executing BounceSim code, providing a /health endpoint and a /test_program endpoint for code verification.
  • Nginx Load Balancing: Implemented a setup script to deploy multiple BounceSim API servers and configure Nginx as a load balancer to distribute requests across them.
  • BounceSim Verifier: Added a new BallsimVerifier class that integrates with the BounceSim API to evaluate Python code against test cases, supporting different scoring modes.
  • Configuration Options: Extended StreamingDataLoaderConfig with new fields for BounceSim API URL, maximum execution time, and scoring mode, allowing flexible configuration of the verification process.
  • Training Script Integration: Included a new training script that launches the BounceSim API and runs the grpo_fast.py training process with BounceSim-specific parameters and a dedicated system prompt.

🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console.

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces support for a new 'ballsim' task, including a new API, a verifier, and associated configuration and scripts. The implementation is generally solid, particularly the use of asyncio.to_thread in the API for performance.

My main feedback points are:

  • The Beaker setup script has some robustness issues, such as a hardcoded CPU count and a fixed sleep time for server startup, which could lead to flakiness.
  • There is significant code duplication between the new BallsimVerifier and the existing CodeVerifier. Refactoring this into a shared base class would improve maintainability.
  • A minor typo was found in the ballsim_system_prompt.txt file.

I've left specific comments with suggestions for these points.

Comment on lines +86 to +87
echo "Waiting for all servers to start up..."
sleep 5
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

Using a fixed sleep 5 is not a robust way to wait for 128 uvicorn servers to start. This can lead to race conditions and flaky behavior if servers take longer to initialize. A better approach is to poll the health endpoint of the last server until it becomes responsive.

Suggested change
echo "Waiting for all servers to start up..."
sleep 5
echo "Waiting for all servers to start up..."
local last_port=$((API_BASE_PORT + BALLSIM_SERVER_CPUS - 1))
# Wait up to 15 seconds for the last server to become healthy.
for _ in {1..30}; do
if curl -s --fail "http://localhost:$last_port/health" > /dev/null; then
echo "All servers are assumed to be up."
return
fi
sleep 0.5
done
echo "Warning: Last server on port $last_port is not responding after 15 seconds."
tail -5 "api_logs/ballsim_server_$last_port.log"


# Configuration
TOTAL_CPUS=$(nproc)
BALLSIM_SERVER_CPUS=128
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The number of BALLSIM_SERVER_CPUS is hardcoded to 128. This is not robust and will cause issues on machines with fewer CPUs. It's better to cap this value by the total number of available CPUs to prevent errors.

Suggested change
BALLSIM_SERVER_CPUS=128
BALLSIM_SERVER_CPUS=${BALLSIM_SERVER_CPUS:-128}
if [[ "$TOTAL_CPUS" -lt "$BALLSIM_SERVER_CPUS" ]]; then
echo "Warning: Requested BALLSIM_SERVER_CPUS ($BALLSIM_SERVER_CPUS) is greater than total available CPUs ($TOTAL_CPUS). Using $TOTAL_CPUS CPUs instead."
BALLSIM_SERVER_CPUS=$TOTAL_CPUS
fi

Comment on lines +1038 to +1157
class BallsimVerifier(VerifierFunction):
"""
Verifier that executes Python code against BounceSim test cases using an external API.
"""

_session_pool = None

def __init__(self, verifier_config: BallsimVerifierConfig) -> None:
super().__init__("ballsim", verifier_config=verifier_config, weight=1.0)

@classmethod
def _get_session(cls):
if cls._session_pool is None:
cls._session_pool = requests.Session()
retry_config = requests.adapters.Retry(
total=3,
connect=3,
read=3,
status=3,
backoff_factor=0.3,
status_forcelist=[500, 502, 503, 504],
allowed_methods=None,
)
adapter = requests.adapters.HTTPAdapter(pool_connections=100, pool_maxsize=100, max_retries=retry_config)
cls._session_pool.mount("http://", adapter)
cls._session_pool.mount("https://", adapter)
return cls._session_pool

def extract_python_code(self, model_output: str) -> str:
pattern = r"```(?:python)?(.*?)```"
matches = re.findall(pattern, model_output, re.DOTALL)
if not matches:
return model_output
return matches[-1].strip()

async def async_call(
self,
tokenized_prediction: list[int],
prediction: str,
label: Any,
query: str | None = None,
rollout_state: dict | None = None,
) -> VerificationResult:
python_code = self.extract_python_code(prediction)

if isinstance(label, str):
try:
tests = json.loads(label)
except json.JSONDecodeError:
logger.warning("Failed to parse BounceSim tests as JSON; got string label")
return VerificationResult(score=0.0)
else:
tests = label

payload = {
"program": python_code,
"tests": tests,
"max_execution_time": self.verifier_config.ballsim_max_execution_time,
}

try:
session = self._get_session()
http_timeout = max(30, min(300, self.verifier_config.ballsim_max_execution_time * 10))

def make_request():
response = session.post(
self.verifier_config.ballsim_api_url,
json=payload,
headers={"Content-Type": "application/json"},
timeout=http_timeout,
)
response.raise_for_status()
return response.json()

result = await asyncio.to_thread(make_request)
passes = result["results"]
pass_rate = sum(passes) / len(passes) if passes else 0.0

if self.verifier_config.ballsim_scoring_mode == "pass_rate":
score = pass_rate
else:
score = 1.0 if pass_rate == 1.0 else 0.0
return VerificationResult(score=score)
except Exception as e:
logger.warning(f"Error verifying ballsim code sample: {e}")
return VerificationResult(score=0.0)

def __call__(
self,
tokenized_prediction: list[int],
prediction: str,
label: Any,
query: str | None = None,
rollout_state: dict | None = None,
) -> VerificationResult:
try:
loop = asyncio.get_event_loop()
if loop.is_running():
raise RuntimeError(
"Cannot call synchronous __call__ method from within an async context. Use async_call instead."
)
else:
return asyncio.run(self.async_call(tokenized_prediction, prediction, label, query))
except RuntimeError as e:
if "cannot schedule new futures after interpreter shutdown" in str(e):
logger.warning("Skipping ballsim verification due to interpreter shutdown")
return VerificationResult(score=0.0, reasoning="Verification skipped due to shutdown")
try:
return asyncio.run(self.async_call(tokenized_prediction, prediction, label, query))
except Exception as nested_e:
logger.warning(f"Error verifying ballsim sample during shutdown: {nested_e}")
return VerificationResult(score=0.0, reasoning=f"Verification failed: {nested_e}")
except Exception as e:
logger.warning(f"Error verifying ballsim sample: {e}")
return VerificationResult(score=0.0, reasoning=f"Verification failed: {e}")

@classmethod
def get_config_class(cls) -> type:
return BallsimVerifierConfig

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The new BallsimVerifier class duplicates a significant amount of code from the existing CodeVerifier class, specifically the _get_session, extract_python_code, and the synchronous __call__ wrapper logic. To improve maintainability and reduce redundancy, consider creating a common base class (e.g., APICodeVerifier) that contains this shared logic. CodeVerifier and BallsimVerifier can then inherit from this base class, overriding only the parts that are specific to them, such as the async_call implementation and configuration.

#### Objects
- Ball 1: regular polygon (3 sides), radius 40.0m, initial position (750, 750), initial velocity (-220.61, 6.21) m/s

22
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

There appears to be a stray number 22 on this line. This might be a typo and could potentially confuse the model processing the prompt. It should probably be removed.

@mnoukhov mnoukhov changed the title ballsim intermediate DELTA benchmark Apr 5, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant