[AutoTuner] Restore resume check #3070
Conversation
99ea816 to
8e18ff7
Compare
| folders = glob.glob(os.path.join(experiment_dir, f"variant-*-or-{iteration}")) | ||
| return max((os.path.getmtime(folder) for folder in folders), default=9e99) |
There was a problem hiding this comment.
This is a little obfuscated:
- The function's name does not match the expected return value type/format. A check should return True/False.
- Returning 9e99 is not clear on what is going on.
- Creating a folder does not confirm the run status. For this test to be true to its purpose, we need to guarantee that it has not finished, not just started.
We should consider using a get_experiment_status function to check if all iterations have finished running; this function would return at least "RUNNING" and "FINISHED", other states might be helpful but are not required now.
There was a problem hiding this comment.
- The function's name does not match the expected return value type/format. A check should return True/False.
Can possibly change it to
get_trial_times
- Returning 9e99 is not clear on what is going on.
It is just a dummy value, to compare for latest modified time in while loop line 121-128
# Check if first config is complete
while True:
cur_modified_time = self.check_trial_times()
print(f"Current modified time: {cur_modified_time}")
print(f"Latest modified time: {latest_modified_time}")
if abs(cur_modified_time - latest_modified_time) < 1e-3:
break
latest_modified_time = cur_modified_time
time.sleep(10)- Creating a folder does not confirm the run status. For this test to be true to its purpose, we need to guarantee that it has not finished, not just started.
This function returns the latest modified time of a given iteration (folder names are matched using
iterationglob) - so if a run is completed the folder should no longer be modified.
4, We should consider using a get_experiment_status function to check if all iterations have finished running; this function would return at least "RUNNING" and "FINISHED", other states might be helpful but are not required now.
get_experiment_statusis helpful in general, but might not be too useful in resume_check because we need to check iteration completion, as opposed to experiment completion (or all_iterations completion)
Signed-off-by: Jack Luar <jluar@precisioninno.com>
…ror handling Signed-off-by: Jack Luar <jluar@precisioninno.com>
Signed-off-by: Jack Luar <jluar@precisioninno.com>
Signed-off-by: Jack Luar <jluar@precisioninno.com>
8e18ff7 to
cf5ed3d
Compare
Signed-off-by: Jack Luar <jluar@precisioninno.com>
Signed-off-by: Jack Luar <jluar@precisioninno.com>
Replace the fixed time.sleep(120) with ExperimentAnalysis-based polling to reliably detect when trials complete before stopping the initial run. This addresses the flakiness reported in issue The-OpenROAD-Project#3005 and the review feedback from draft PR The-OpenROAD-Project#3070. Key changes: - Use Ray Tune ExperimentAnalysis to poll experiment status instead of fixed sleep - Add managed_process context manager for safe subprocess cleanup - Add stop_ray_cluster helper that retries until Ray shuts down cleanly - Re-enable the resume check test in test_autotuner.sh Signed-off-by: Harsh <harshkumar3446@gmail.com>
Replace the fixed time.sleep(120) with ExperimentAnalysis-based polling to reliably detect when trials complete before stopping the initial run. This addresses the flakiness reported in issue The-OpenROAD-Project#3005 and the review feedback from draft PR The-OpenROAD-Project#3070. Key changes: - Use Ray Tune ExperimentAnalysis to poll experiment status instead of fixed sleep - Add managed_process context manager for safe subprocess cleanup - Add stop_ray_cluster helper that retries until Ray shuts down cleanly - Re-enable the resume check test in test_autotuner.sh Signed-off-by: Harsh <harshkumar3446@gmail.com> Signed-off-by: Harsh Kumar <harshkumar3446@gmail.com>
|
This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 21 days if no further activity occurs. Remove the |
This pull request includes several changes to the
AutoTunertests to re-enable previously disabled tests, improve the test setup, and add new utility methods.Fixes #3005
Re-enabling tests and improving setup:
flow/test/test_autotuner.sh: Re-enabled thetest_tune_resumetest in theAutoTunerscript by uncommenting the test execution line.Codebase simplification and improvements:
tools/AutoTuner/test/resume_check.py: Removed unnecessary directory changes and imported theglobmodule to facilitate file pattern matching. [1] [2]tools/AutoTuner/test/resume_check.py: Introduced thecheck_trial_timesmethod to check the modification times of trial iterations, improving the robustness of thetest_tune_resumetest.