Port `test-gpu.yml` to `Open-Athena/ec2` runner by ryan-williams · Pull Request #308 · m2lines/Samudra

ryan-williams · 2025-07-18T04:22:47Z

Use Open-Athena/ec2-gha#3 (self-terminating GHA EC2 runner)

- 2x speedup:
  - 14 mins → 7 mins
  - Previously, the job waited an extra 7 mins for the instance to shut down before reporting ✅
- Less boilerplate (smaller "start ec2" block, no "stop ec2" block)
- Reusable across repos (e.g. mamba#771)
- Pattern should scale to other clouds, Lambda, etc.

Before/After gif:

jder

Thanks! Just one question/suggestion from me.

jder · 2025-09-04T13:54:54Z

Hey @ryan-williams just wanted to check on the state of this PR. What's left before we can merge this?

ryan-williams · 2025-09-15T05:07:11Z

@jder merging this now is reasonable. The only reason to wait would be that it uses ec2-gha@v2, which I've only just sent for review in Open-Athena/ec2-gha#3.

It empirically works with this repo's GPU tests, so if you're comfortable with that, we can merge this without waiting on that PR.

jder · 2025-09-15T20:30:59Z

+    uses: Open-Athena/ec2-gha/.github/workflows/runner.yml@v2
+    with:
+      ec2_instance_type: g4dn.xlarge
+      ec2_image_id: ami-00096836009b16a22  # Deep Learning OSS Nvidia Driver AMI GPU PyTorch


FYI when you update this there will be a change here to a new AMI.

ryan-williams · Sep 16, 2025

I'd actually pushed a previous merge here that missed this! I've corrected it now. (I force-pushed over the previous bad merge, which also brings in the latest main, so this thread is now orphaned 🫠 ty!)

jder

Thanks!

- In [#308] I neglected to update `benchmarks.yml` to use ec2-gha, which resulted in an invalid workflow file.
- However, [`benchmarks.yml`] has been broken on `main` since [#384], which added a FOMO model benchmark that uses more memory (and GPU memory).
  - I've not yet found instances that can handle either; e.g. [benchmarks#175] uses an [m7i.8xlarge] for CPU benchmarks and a [g6.xlarge] for GPU, and both fail.

This PR fixes the former by updating `benchmarks.yml` to use ec2-gha, and works around the latter by restoring the benchmarks configs to the pre-[#384] state. [benchmarks#172] is a passing run from [`f313865`].

[Open-Athena/Ocean_Emulator#399]: https://github.com/Open-Athena/Ocean_Emulator/pull/399
[benchmarks#172]: https://github.com/Open-Athena/Ocean_Emulator/actions/runs/17985502557/job/51162739635
[benchmarks#175]: https://github.com/Open-Athena/Ocean_Emulator/actions/runs/17986640677/job/51166555978
[m7i.8xlarge]: https://instances.vantage.sh/aws/ec2/m7i.8xlarge
[g6.xlarge]: https://instances.vantage.sh/aws/ec2/g6.xlarge
[`benchmarks.yml`]: https://github.com/Open-Athena/Ocean_Emulator/actions/workflows/benchmarks.yml?query=branch%3Amain
[#308]: https://github.com/Open-Athena/Ocean_Emulator/pull/308
[#384]: https://github.com/Open-Athena/Ocean_Emulator/pull/384
[`f313865`]: https://github.com/Open-Athena/Ocean_Emulator/pull/399/commits/f313865a76db1b401243a4189ae876954b94c4c9

---------

Co-authored-by: Claude <noreply@anthropic.com>
Co-authored-by: Jesse Rusak <jesse@openathena.ai>

Port test-gpu.yml to Open-Athena/ec2 runner

792e5cf

ryan-williams force-pushed the rw/ec2 branch from 631af41 to 792e5cf · July 18, 2025 04:34

ryan-williams marked this pull request as ready for review July 18, 2025 04:43

ryan-williams requested review from alxmrs, jder and mihasya July 18, 2025 04:43

jder approved these changes Jul 18, 2025


Comment thread on `.github/workflows/test-gpu.yml` (outdated)

alxmrs approved these changes Jul 22, 2025


ryan-williams marked this pull request as draft July 24, 2025 04:19

ryan-williams mentioned this pull request Jul 24, 2025

Initial implementation: auto-shutdown EC2 GHA runner Open-Athena/ec2#1

Closed

ryan-williams force-pushed the rw/ec2 branch 3 times, most recently from 1310f2d to cde1ff9 · August 8, 2025 19:15

ryan-williams mentioned this pull request Aug 8, 2025

Multi-job support (using job start/end hooks), optional CloudWatch logging, demo workflows, module/repo rename Open-Athena/ec2-gha#2

Closed

Merge main

b492c47

ryan-williams force-pushed the rw/ec2 branch 3 times, most recently from ba13991 to 352f33b · August 8, 2025 20:06

Port to Open-Athena/ec2-gha@v2

d2a1031

ryan-williams force-pushed the rw/ec2 branch from 352f33b to d2a1031 · August 12, 2025 22:42

ryan-williams added 2 commits August 16, 2025 18:59

test-gpu: use latest ec2-gha@v2

8d8657d

Merge main

f69a426

ryan-williams mentioned this pull request Sep 15, 2025

Multi-runner support (on one instance), multi-{OS,arch} demos Open-Athena/ec2-gha#3

Open

jder approved these changes Sep 16, 2025


Merge main

fbec5a4

ryan-williams force-pushed the rw/ec2 branch 2 times, most recently from f09e1d6 to 823b4cb · September 19, 2025 04:14

ec2_root_device_size: +1

0c7b02f

ryan-williams force-pushed the rw/ec2 branch from 823b4cb to 0c7b02f · September 19, 2025 04:47

jder approved these changes Sep 23, 2025


ryan-williams marked this pull request as ready for review September 23, 2025 16:16

Merge main

5f5402e

ryan-williams merged commit aa68796 into main Sep 23, 2025
4 checks passed

ryan-williams deleted the rw/ec2 branch September 23, 2025 16:33

ryan-williams mentioned this pull request Sep 24, 2025

Fix benchmarks.yml: use ec2-gha, remove FOMO benchmark #399

Merged

ryan-williams merged 8 commits into main from rw/ec2