Port test-gpu.yml to Open-Athena/ec2 runner #308

Merged: ryan-williams merged 8 commits into main from rw/ec2 on Sep 23, 2025

ryan-williams (Collaborator) commented Jul 18, 2025

Use Open-Athena/ec2-gha#3 (self-terminating GHA EC2 runner)

  • 2x speedup:
    • 14 mins → 7 mins
    • Previously, the workflow waited an extra 7 mins for the instance to shut down before reporting ✅
  • Less boilerplate (smaller "start ec2" block, no "stop ec2" block)
  • Reusable across repos (e.g. mamba#771).
  • Pattern should scale to other clouds, Lambda, etc.
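
For readers unfamiliar with the pattern, here is a rough sketch of what a workflow using the reusable runner might look like. Only the `uses:` reference and the two `ec2_*` inputs are taken from this PR's diff; the workflow name, the trigger, the job names, `secrets: inherit`, the `id` output, and the test steps are assumptions for illustration and should be checked against the ec2-gha README.

```yaml
# Hypothetical sketch, not the actual test-gpu.yml from this PR.
name: GPU tests
on:
  workflow_dispatch:

jobs:
  ec2:
    # Reusable workflow reference and both inputs are quoted from this PR's diff.
    uses: Open-Athena/ec2-gha/.github/workflows/runner.yml@v2
    secrets: inherit  # assumption: AWS/GitHub credentials are passed through
    with:
      ec2_instance_type: g4dn.xlarge
      ec2_image_id: ami-00096836009b16a22  # Deep Learning OSS Nvidia Driver AMI GPU PyTorch

  test:
    needs: ec2
    # Assumption: the runner workflow exposes the runner label as an output named `id`.
    runs-on: ${{ needs.ec2.outputs.id }}
    steps:
      - uses: actions/checkout@v4
      - run: nvidia-smi  # placeholder for the real GPU test steps
```

Because the runner is described as self-terminating, a sketch like this needs no separate "stop ec2" job, which is where the boilerplate saving comes from.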

Before/After gif: [image: gha]

ryan-williams marked this pull request as ready for review July 18, 2025 04:43
ryan-williams requested review from alxmrs, jder and mihasya July 18, 2025 04:43

jder (Member) left a comment

Thanks! Just one question/suggestion from me.

Comment thread .github/workflows/test-gpu.yml Outdated
ryan-williams force-pushed the rw/ec2 branch 3 times, most recently from ba13991 to 352f33b, August 8, 2025 20:06
jder (Member) commented Sep 4, 2025

Hey @ryan-williams, just wanted to check on the state of this PR. What's left before we can merge it?

ryan-williams (Collaborator, Author) commented

@jder merging this now is reasonable. The only reason to wait would be because it uses ec2-gha@v2, which I've only just sent for review at Open-Athena/ec2-gha#3.

It empirically works with this repo's GPU tests, so if you're comfortable with that, we can merge this without waiting on that PR.

Comment thread .github/workflows/test-gpu.yml Outdated
uses: Open-Athena/ec2-gha/.github/workflows/runner.yml@v2
with:
  ec2_instance_type: g4dn.xlarge
  ec2_image_id: ami-00096836009b16a22 # Deep Learning OSS Nvidia Driver AMI GPU PyTorch

Member

FYI when you update this there will be a change here to a new AMI.

ryan-williams (Collaborator, Author) commented Sep 16, 2025

I'd actually pushed a previous merge here that missed this! I've corrected it now (I force-pushed over the previous bad merge, which also brings in the latest main, so this thread is now orphaned 🫠 ty!)

ryan-williams force-pushed the rw/ec2 branch 2 times, most recently from f09e1d6 to 823b4cb, September 19, 2025 04:14

jder (Member) left a comment

Thanks!

ryan-williams marked this pull request as ready for review September 23, 2025 16:16
ryan-williams merged commit aa68796 into main Sep 23, 2025
4 checks passed
ryan-williams deleted the rw/ec2 branch September 23, 2025 16:33
ryan-williams added a commit that referenced this pull request Sep 24, 2025
- In [#308] I neglected to update `benchmarks.yml` to use ec2-gha, which
  resulted in an invalid workflow file.
- However, [`benchmarks.yml`] has been broken on `main` since [#384],
  which added a FOMO model benchmark that uses more memory (and GPU
  memory).
  - I've not yet found instances that can handle either.
    - e.g. [benchmarks#175] uses an [m7i.8xlarge] for CPU benchmarks and a
      [g6.xlarge] for GPU, and both fail.

This PR fixes the former by updating `benchmarks.yml` to use ec2-gha,
and works around the latter by restoring the benchmarks configs to the
pre-[#384] state.

[benchmarks#172] is a passing run from [`f313865`]

[Open-Athena/Ocean_Emulator#399]:
https://github.com/Open-Athena/Ocean_Emulator/pull/399

[benchmarks#172]:
https://github.com/Open-Athena/Ocean_Emulator/actions/runs/17985502557/job/51162739635
[benchmarks#175]:
https://github.com/Open-Athena/Ocean_Emulator/actions/runs/17986640677/job/51166555978
[m7i.8xlarge]: https://instances.vantage.sh/aws/ec2/m7i.8xlarge
[g6.xlarge]: https://instances.vantage.sh/aws/ec2/g6.xlarge

[`benchmarks.yml`]:
https://github.com/Open-Athena/Ocean_Emulator/actions/workflows/benchmarks.yml?query=branch%3Amain

[#308]: https://github.com/Open-Athena/Ocean_Emulator/pull/308
[#384]: https://github.com/Open-Athena/Ocean_Emulator/pull/384
[`f313865`]:
https://github.com/Open-Athena/Ocean_Emulator/pull/399/commits/f313865a76db1b401243a4189ae876954b94c4c9

<!-- Synced with
https://gist.github.com/81ed211c8bf19f9b97ab1d4c3cdb51bd/1135cb77d4fb2c35bdd774f1330751e4b233f35d
via
[github-pr.py](https://github.com/ryan-williams/git-helpers/blob/main/github/github-pr.py)
-->

---------

Co-authored-by: Claude <noreply@anthropic.com>
Co-authored-by: Jesse Rusak <jesse@openathena.ai>