inference quickstart #19

Merged
mahf708 merged 5 commits into main from mahf708/inference-workflow
Jan 30, 2026
mahf708 (Collaborator) commented Jan 8, 2026

This is a quickstart for inference, as discussed yesterday. For now it is only a draft to get feedback; we will then add more information (especially about formalizing a process for "simulation pages" on Confluence). To review the resulting doc, see this page: https://docs.e3sm.org/aigroup/pr-preview/pr-19/ace2-inference/

@mahf708 mahf708 requested review from rebassoo, wagmanbe and wlin7 January 8, 2026 15:46
github-actions (bot) commented Jan 8, 2026

PR Preview Action v1.8.1

View preview at https://E3SM-Project.github.io/aigroup/pr-preview/pr-19/

Built to branch gh-pages at 2026-01-30 20:54 UTC.
Preview will be ready when the GitHub Pages deployment is complete.

@mahf708 mahf708 requested a review from ndkeen January 8, 2026 19:52
wlin7 commented Jan 9, 2026

Thanks @mahf708 for putting up this inference quickstart. I faithfully followed the steps and obtained a successful test1.

After going through the process, I think some additional guidance would be useful:

  1. How to create an environment that contains PyTorch and uv, and how to ensure both run with the same Python interpreter. Which other packages should also be included? I likely started with an unclean environment, which caused uv to run a different Python interpreter. I had to force UV_PYTHON to get it to work.

  2. A GPU node must be requested to run the uv command, at least on Perlmutter. Although the login node also has a GPU, running there produces a CUDA out-of-memory error. Should n_initial_conditions always equal num_data_workers? I requested 2 GPUs on 1 node to run the example. Is that a suitable match (the best use of the requested resources)?

  3. A restart.nc file is written. How do I perform a restart run?

  4. Some information about the output data: what are the pressure values for levels 0 to 7? How are the monthly_mean data computed? Is every month the same length? (The example test1 ran 1000 steps, i.e. 250 days, and has 11 monthly values.)
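
The step arithmetic in item 4 can be checked with a short sketch. It assumes a 6-hour time step (implied by 1000 steps covering 250 days) and a standard Gregorian calendar. The start date and the helper name `calendar_months_covered` are hypothetical, and how ACE actually bins its monthly_mean output is not established in this thread.

```python
from datetime import datetime, timedelta


def calendar_months_covered(start, n_steps, step_hours=6.0):
    """Return the sorted (year, month) pairs touched by a run.

    Counts the initial time plus every forward step (an assumption).
    """
    step = timedelta(hours=step_hours)
    times = (start + i * step for i in range(n_steps + 1))
    return sorted({(t.year, t.month) for t in times})


# 1000 steps at 6 hours each, expressed in days.
run_days = 1000 * 6 / 24

# Hypothetical start date; the real initial-condition time is not given here.
start = datetime(2001, 1, 1)
months = calendar_months_covered(start, n_steps=1000)

print(f"run length: {run_days} days")
print(f"calendar months touched: {len(months)}")
```

Changing `start` shows how the number of calendar months touched depends on where the run begins.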

Comment thread docs/ace2-inference.md Outdated

## Prerequisites

- [uv](https://github.com/astral-sh/uv) installed

PyTorch as well.

@mahf708 mahf708 self-assigned this Jan 17, 2026
@mahf708 mahf708 marked this pull request as ready for review January 29, 2026 15:49
mahf708 (Collaborator, Author) commented Jan 29, 2026

@wlin7 I added some info about Python environments separately; the expectation is that people will figure this out themselves, with uv offered as a shortcut. In the future, we could host an aigroup environment, similar to e3sm-unified, that people can simply activate.
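
One way to confirm that uv and the environment holding PyTorch resolve to the same interpreter is to run a small script both ways and compare the output. This is a minimal sketch, not part of the quickstart: the script name `check_env.py` and the function `describe_environment` are hypothetical, and `UV_PYTHON` is the uv setting mentioned earlier in this thread.

```python
import importlib.util
import os
import sys


def describe_environment():
    """Collect the facts needed to compare two Python environments."""
    # find_spec locates the package without importing it, so this is cheap
    # and works even when PyTorch is missing.
    torch_spec = importlib.util.find_spec("torch")
    return {
        "interpreter": sys.executable,
        "python_version": ".".join(str(part) for part in sys.version_info[:3]),
        "uv_python": os.environ.get("UV_PYTHON"),  # None when not set
        "torch_found": torch_spec is not None,
    }


if __name__ == "__main__":
    for key, value in describe_environment().items():
        print(f"{key}: {value}")
```

Running it once with `python check_env.py` and once with `uv run python check_env.py` should print the same interpreter path if the two agree. A different path, or `torch_found: False` in only one of them, points to the mismatch described above.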

@mahf708 mahf708 enabled auto-merge (rebase) January 29, 2026 16:42
@mahf708 mahf708 requested review from elynnwu and wlin7 January 29, 2026 16:42
mahf708 (Collaborator, Author) commented Jan 29, 2026

Hi @elynnwu, do you know what restart.nc does in the context of ACE? I haven't looked into it yet. Also, I'm adding this quickstart so people can try out inference using ACE; any comments are appreciated!

elynnwu commented Jan 29, 2026

> do you know what the restart.nc does in the context of ACE?

It's the last prognostic state in inference, so it can be used as a new initial condition file if you want to run another inference from this step.
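
Following that description, a restart run amounts to pointing a new inference config at restart.nc as its initial condition. The sketch below shows the idea on a plain dictionary. The `experiment_dir` and `initial_condition` / `path` keys, the file paths, and the helper name `config_for_restart` are assumptions for illustration and should be checked against the actual ACE inference config.

```python
import copy


def config_for_restart(config, restart_path):
    """Return a copy of an inference config that starts from restart_path."""
    new_config = copy.deepcopy(config)  # leave the original run's config intact
    # Assumed layout: the initial condition file is stored under
    # config["initial_condition"]["path"].
    new_config.setdefault("initial_condition", {})["path"] = restart_path
    return new_config


# Hypothetical config from a first run.
first_run = {
    "experiment_dir": "test1",
    "initial_condition": {"path": "initial_condition.nc"},
}

second_run = config_for_restart(first_run, "test1/restart.nc")
print(second_run["initial_condition"]["path"])
```

Copying the config rather than editing it in place keeps a record of what the first run used.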

@mahf708 mahf708 force-pushed the mahf708/inference-workflow branch from 2fe0591 to e668d51 Compare January 30, 2026 20:50
@mahf708 mahf708 disabled auto-merge January 30, 2026 20:54
@mahf708 mahf708 merged commit 194c543 into main Jan 30, 2026
1 check passed
@mahf708 mahf708 deleted the mahf708/inference-workflow branch January 30, 2026 20:54
3 participants