Conversation
Summary of ChangesHello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request establishes a robust infrastructure for integrating BounceSim code execution and verification into the existing system. It provides a scalable API for evaluating Python code against BounceSim test cases, supported by Nginx for efficient load balancing. The changes also introduce a dedicated verifier and configuration options, enabling seamless integration into training workflows for models that generate code for physics simulations. Highlights
🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console. Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here. Footnotes
|
There was a problem hiding this comment.
Code Review
This pull request introduces support for a new 'ballsim' task, including a new API, a verifier, and associated configuration and scripts. The implementation is generally solid, particularly the use of asyncio.to_thread in the API for performance.
My main feedback points are:
- The Beaker setup script has some robustness issues, such as a hardcoded CPU count and a fixed sleep time for server startup, which could lead to flakiness.
- There is significant code duplication between the new
BallsimVerifierand the existingCodeVerifier. Refactoring this into a shared base class would improve maintainability. - A minor typo was found in the
ballsim_system_prompt.txtfile.
I've left specific comments with suggestions for these points.
| echo "Waiting for all servers to start up..." | ||
| sleep 5 |
There was a problem hiding this comment.
Using a fixed sleep 5 is not a robust way to wait for 128 uvicorn servers to start. This can lead to race conditions and flaky behavior if servers take longer to initialize. A better approach is to poll the health endpoint of the last server until it becomes responsive.
| echo "Waiting for all servers to start up..." | |
| sleep 5 | |
| echo "Waiting for all servers to start up..." | |
| local last_port=$((API_BASE_PORT + BALLSIM_SERVER_CPUS - 1)) | |
| # Wait up to 15 seconds for the last server to become healthy. | |
| for _ in {1..30}; do | |
| if curl -s --fail "http://localhost:$last_port/health" > /dev/null; then | |
| echo "All servers are assumed to be up." | |
| return | |
| fi | |
| sleep 0.5 | |
| done | |
| echo "Warning: Last server on port $last_port is not responding after 15 seconds." | |
| tail -5 "api_logs/ballsim_server_$last_port.log" |
|
|
||
| # Configuration | ||
| TOTAL_CPUS=$(nproc) | ||
| BALLSIM_SERVER_CPUS=128 |
There was a problem hiding this comment.
The number of BALLSIM_SERVER_CPUS is hardcoded to 128. This is not robust and will cause issues on machines with fewer CPUs. It's better to cap this value by the total number of available CPUs to prevent errors.
| BALLSIM_SERVER_CPUS=128 | |
| BALLSIM_SERVER_CPUS=${BALLSIM_SERVER_CPUS:-128} | |
| if [[ "$TOTAL_CPUS" -lt "$BALLSIM_SERVER_CPUS" ]]; then | |
| echo "Warning: Requested BALLSIM_SERVER_CPUS ($BALLSIM_SERVER_CPUS) is greater than total available CPUs ($TOTAL_CPUS). Using $TOTAL_CPUS CPUs instead." | |
| BALLSIM_SERVER_CPUS=$TOTAL_CPUS | |
| fi |
| class BallsimVerifier(VerifierFunction): | ||
| """ | ||
| Verifier that executes Python code against BounceSim test cases using an external API. | ||
| """ | ||
|
|
||
| _session_pool = None | ||
|
|
||
| def __init__(self, verifier_config: BallsimVerifierConfig) -> None: | ||
| super().__init__("ballsim", verifier_config=verifier_config, weight=1.0) | ||
|
|
||
| @classmethod | ||
| def _get_session(cls): | ||
| if cls._session_pool is None: | ||
| cls._session_pool = requests.Session() | ||
| retry_config = requests.adapters.Retry( | ||
| total=3, | ||
| connect=3, | ||
| read=3, | ||
| status=3, | ||
| backoff_factor=0.3, | ||
| status_forcelist=[500, 502, 503, 504], | ||
| allowed_methods=None, | ||
| ) | ||
| adapter = requests.adapters.HTTPAdapter(pool_connections=100, pool_maxsize=100, max_retries=retry_config) | ||
| cls._session_pool.mount("http://", adapter) | ||
| cls._session_pool.mount("https://", adapter) | ||
| return cls._session_pool | ||
|
|
||
| def extract_python_code(self, model_output: str) -> str: | ||
| pattern = r"```(?:python)?(.*?)```" | ||
| matches = re.findall(pattern, model_output, re.DOTALL) | ||
| if not matches: | ||
| return model_output | ||
| return matches[-1].strip() | ||
|
|
||
| async def async_call( | ||
| self, | ||
| tokenized_prediction: list[int], | ||
| prediction: str, | ||
| label: Any, | ||
| query: str | None = None, | ||
| rollout_state: dict | None = None, | ||
| ) -> VerificationResult: | ||
| python_code = self.extract_python_code(prediction) | ||
|
|
||
| if isinstance(label, str): | ||
| try: | ||
| tests = json.loads(label) | ||
| except json.JSONDecodeError: | ||
| logger.warning("Failed to parse BounceSim tests as JSON; got string label") | ||
| return VerificationResult(score=0.0) | ||
| else: | ||
| tests = label | ||
|
|
||
| payload = { | ||
| "program": python_code, | ||
| "tests": tests, | ||
| "max_execution_time": self.verifier_config.ballsim_max_execution_time, | ||
| } | ||
|
|
||
| try: | ||
| session = self._get_session() | ||
| http_timeout = max(30, min(300, self.verifier_config.ballsim_max_execution_time * 10)) | ||
|
|
||
| def make_request(): | ||
| response = session.post( | ||
| self.verifier_config.ballsim_api_url, | ||
| json=payload, | ||
| headers={"Content-Type": "application/json"}, | ||
| timeout=http_timeout, | ||
| ) | ||
| response.raise_for_status() | ||
| return response.json() | ||
|
|
||
| result = await asyncio.to_thread(make_request) | ||
| passes = result["results"] | ||
| pass_rate = sum(passes) / len(passes) if passes else 0.0 | ||
|
|
||
| if self.verifier_config.ballsim_scoring_mode == "pass_rate": | ||
| score = pass_rate | ||
| else: | ||
| score = 1.0 if pass_rate == 1.0 else 0.0 | ||
| return VerificationResult(score=score) | ||
| except Exception as e: | ||
| logger.warning(f"Error verifying ballsim code sample: {e}") | ||
| return VerificationResult(score=0.0) | ||
|
|
||
| def __call__( | ||
| self, | ||
| tokenized_prediction: list[int], | ||
| prediction: str, | ||
| label: Any, | ||
| query: str | None = None, | ||
| rollout_state: dict | None = None, | ||
| ) -> VerificationResult: | ||
| try: | ||
| loop = asyncio.get_event_loop() | ||
| if loop.is_running(): | ||
| raise RuntimeError( | ||
| "Cannot call synchronous __call__ method from within an async context. Use async_call instead." | ||
| ) | ||
| else: | ||
| return asyncio.run(self.async_call(tokenized_prediction, prediction, label, query)) | ||
| except RuntimeError as e: | ||
| if "cannot schedule new futures after interpreter shutdown" in str(e): | ||
| logger.warning("Skipping ballsim verification due to interpreter shutdown") | ||
| return VerificationResult(score=0.0, reasoning="Verification skipped due to shutdown") | ||
| try: | ||
| return asyncio.run(self.async_call(tokenized_prediction, prediction, label, query)) | ||
| except Exception as nested_e: | ||
| logger.warning(f"Error verifying ballsim sample during shutdown: {nested_e}") | ||
| return VerificationResult(score=0.0, reasoning=f"Verification failed: {nested_e}") | ||
| except Exception as e: | ||
| logger.warning(f"Error verifying ballsim sample: {e}") | ||
| return VerificationResult(score=0.0, reasoning=f"Verification failed: {e}") | ||
|
|
||
| @classmethod | ||
| def get_config_class(cls) -> type: | ||
| return BallsimVerifierConfig | ||
|
|
There was a problem hiding this comment.
The new BallsimVerifier class duplicates a significant amount of code from the existing CodeVerifier class, specifically the _get_session, extract_python_code, and the synchronous __call__ wrapper logic. To improve maintainability and reduce redundancy, consider creating a common base class (e.g., APICodeVerifier) that contains this shared logic. CodeVerifier and BallsimVerifier can then inherit from this base class, overriding only the parts that are specific to them, such as the async_call implementation and configuration.
| #### Objects | ||
| - Ball 1: regular polygon (3 sides), radius 40.0m, initial position (750, 750), initial velocity (-220.61, 6.21) m/s | ||
|
|
||
| 22 |
No description provided.