
Add private OCI OKE cookbook for Llama Nemotron Nano 8B#117

Open
fede-kamel wants to merge 13 commits into NVIDIA-NeMo:main from fede-kamel:fk/oci-phoenix-private-nemotron

Conversation

@fede-kamel

@fede-kamel fede-kamel commented Mar 16, 2026

Summary

  • add a Nemotron-specific OCI cookbook for nvidia/Llama-3.1-Nemotron-Nano-8B-v1
  • document a validated private-only OKE deployment in us-phoenix-1 with no public control-plane endpoint, no public worker IPs, and no public inference endpoint
  • add a checked-in Terraform wrapper for the private Phoenix OKE infrastructure using Oracle's official OKE module plus OCI Bastion service
  • include a known-good vLLM values file for a single VM.GPU.A10.1 node
  • surface the new OCI cookbook in the root README and cookbook index

Validation

Validated against a live Phoenix OKE deployment of nvidia/Llama-3.1-Nemotron-Nano-8B-v1 using a private cluster plus OCI Bastion/tunnel access:

  • terraform plan
  • terraform apply
  • private OKE cluster active
  • OCI Bastion service active
  • CPU node pool active
  • GPU node pool active in PHX-AD-2
  • /health
  • /v1/models
  • chat completion
  • tool calling
  • streaming
  • async concurrent requests

Notes

  • this contribution is intentionally Nemotron-specific and OCI-specific
  • the deployment guidance is private-only and does not use public IPs for the Kubernetes API or inference endpoint
  • the OCI path is documented as a reproducible option comparable to common AWS GPU/Kubernetes deployment patterns, without claiming that AWS Terraform already exists in this repo
  • the Bastion resource is the OCI Bastion service, not a public bastion VM

@fede-kamel
Author

@chrisalexiuk-nvidia @anushapant @shashank3959 — Friendly follow-up! This PR adds an OCI OKE deployment cookbook for Llama Nemotron Nano 8B. Would love to get a review when you get a chance. Thanks!

@fede-kamel
Author

fede-kamel commented Mar 27, 2026

Hey team 👋 — just checking in on this one. Happy to address any feedback or make adjustments to scope if that helps move things along. Let me know if there's anything I can do on my end! @chrisalexiuk-nvidia @anushapant @shashank3959

@fede-kamel
Author

✅ Cross-validated with NeMo Agent Toolkit OCI integration

This OKE deployment is now serving as the live inference backend for NVIDIA/NeMo-Agent-Toolkit#1804, which adds first-class OCI Generative AI support to the Agent Toolkit.

The full Agent Toolkit OCI test suite — 11/11 tests passing — was validated against the nvidia/Llama-3.1-Nemotron-Nano-8B-v1 endpoint running on this exact private OKE infrastructure in us-phoenix-1. Both PRs together deliver a complete story: a reproducible OCI deployment (this PR) powering a production-ready LLM provider and LangChain integration (the Toolkit PR).

@fede-kamel
Author

@chrisalexiuk-nvidia @anushapant @shashank3959 — Quick update — we just used this exact OKE deployment to validate the full OCI integration for NeMo Agent Toolkit! 🚀

The nvidia/Llama-3.1-Nemotron-Nano-8B-v1 model running on the private Phoenix cluster documented in this cookbook passed 11/11 tests as the live inference backend for NVIDIA/NeMo-Agent-Toolkit#1804 — covering the OCI provider, LangChain wrapper, and an end-to-end agent workflow.

Really exciting to see both pieces come together: this PR provides the reproducible OCI deployment, and the Toolkit PR builds on top of it with a first-class integration. Two repos, one Nemotron story on OCI. Looking forward to your review!

@fede-kamel
Author

Hey @chrisalexiuk-nvidia @anushapant @shashank3959 @chtruong814 — wanted to check in and see if any of you have had a chance to take a look at this one.

This PR is part of the broader NVIDIA × OCI partnership effort — it provides a reproducible OKE deployment for Nemotron Nano 8B on Oracle Cloud, and has already been validated as the live backend for the OCI integration in NeMo Agent Toolkit. Would love to get it across the finish line when bandwidth allows.

No pressure — just want to make sure it's on your radar. Happy to address any feedback!

@chrisalexiuk-nvidia
Contributor

Hey - I wasn't getting pings on these; I've fixed that up.

This LGTM - would love to merge it in!

@fede-kamel
Author

Nice! Are you able to help move this one forward? Thank you.

@chrisalexiuk-nvidia
Contributor

Yessir! Will get the review going and let's go from there!

@chrisalexiuk-nvidia
Contributor

/claude review.

modelSpec:
  - name: "llama31-nemotron-nano-8b"
    repository: "vllm/vllm-openai"
    tag: "latest"

This cookbook is documented as a "validated" / "known-good" configuration, but using tag: "latest" means anyone deploying later will get a different vLLM version than what was actually validated. This could silently break the deployment (e.g., tool-calling behavior, memory profile, chat template paths).

Consider pinning to the specific vLLM tag that was validated.

Author


Good catch — fixed in d863095. Pinned to v0.14.0, which is the vLLM release that was current when this cookbook was validated on the live OKE cluster (Jan 22 2026).

Author


Correction on the version — confirmed by exec'ing into the running Nemotron pod on the live private OKE cluster: vllm.__version__ reports 0.17.1. Updated the pin to v0.17.1 in 0d8eeeb.

Author


Final update — upgraded the live OKE cluster to vLLM v0.19.0 (latest) and re-validated. Pinned to v0.19.0 in c984916. All endpoints (health, models, chat completion) confirmed working through both engine and router.

@fede-kamel fede-kamel force-pushed the fk/oci-phoenix-private-nemotron branch 2 times, most recently from d876121 to 7c59d19 on April 15, 2026 at 12:44
@fede-kamel
Author

fede-kamel commented Apr 15, 2026

@chrisalexiuk-nvidia — all commits are signed off and the PR is ready for review. Thanks!

@fede-kamel
Author

Updated in a3363ae:

  • Appendix A — jump-host VM alternative to OCI Bastion. OCI Bastion port-forwarding sessions close immediately after publickey auth when the client is OpenSSH 10.x (shipped with macOS 15+ and recent Linux distros). Appendix A replaces Step 4 and the bastion-session block inside Step 6 with a public-IP jump host and a plain ssh -L tunnel (see the first sketch after this list).
  • Step 7b — soft-reset each node after the filesystem expansion in Step 7. The in-place systemctl restart kubelet in Step 7 does not refresh Node.Capacity.ephemeral-storage (kubelet caches it at startup, and the restart is also killed by SIGKILL when containerd bounces mid-exec). Without Step 7b the engine pod is repeatedly evicted mid image-pull for disk pressure, despite df showing the expanded filesystem. Step 7b drains, soft-resets, waits for Ready, uncordons, and gates on a capacity-verification snippet before Step 8.
  • Step 5 AD picker — replaces the hardcoded data[1].name with a loop that queries gpu-a10-count availability per AD and selects the first AD with non-zero capacity (see the second sketch after this list).
  • Appendix A.1 — nc -z wait on port 22 before the first SSH, to avoid the cloud-init race between the VM reaching RUNNING and authorized_keys being installed.
  • Two new troubleshooting entries covering the symptoms above.
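For reference, a minimal sketch of the Appendix A flow; the jump-host IP, the private API endpoint address, and the opc login user are hypothetical placeholders, not values from the cookbook:

  # Hypothetical placeholders: substitute the jump host's public IP and the
  # cluster's private Kubernetes API endpoint from the OKE cluster details.
  JUMP_IP=203.0.113.10
  K8S_PRIVATE_ENDPOINT=10.0.0.10

  # A.1 guard: wait for sshd before the first connection, to avoid racing
  # cloud-init while it is still installing authorized_keys on the fresh VM.
  until nc -z "$JUMP_IP" 22; do sleep 5; done

  # Plain local port-forward through the jump host to the private API
  # endpoint, standing in for the OCI Bastion port-forwarding session;
  # the kubeconfig is then pointed at 127.0.0.1:6443 as in Step 6.
  ssh -N -L 6443:"$K8S_PRIVATE_ENDPOINT":6443 opc@"$JUMP_IP"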
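And a minimal sketch of the capacity-aware AD picker, assuming COMPARTMENT_ID is exported and jq is installed; the gpu-a10-count limit name comes from the bullet above, everything else is illustrative:

  # Iterate over the ADs and pick the first one with available
  # VM.GPU.A10 capacity, instead of hardcoding data[1].name.
  for AD in $(oci iam availability-domain list \
      --compartment-id "$COMPARTMENT_ID" | jq -r '.data[].name'); do
    AVAILABLE=$(oci limits resource-availability get \
        --compartment-id "$COMPARTMENT_ID" \
        --service-name compute \
        --limit-name gpu-a10-count \
        --availability-domain "$AD" | jq -r '.data.available')
    if [ "$AVAILABLE" != "null" ] && [ "${AVAILABLE:-0}" -gt 0 ]; then
      GPU_AD="$AD"
      echo "Selected $GPU_AD (gpu-a10-count available: $AVAILABLE)"
      break
    fi
  done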

Validated end-to-end against nvidia/Llama-3.1-Nemotron-Nano-8B-v1 on a fresh private OKE cluster: /health returns 200, /v1/models lists the served model, chat completion returns the expected sentinel, tool calling yields finish_reason: tool_calls with the correct function selected.

Ready for review.

@fede-kamel fede-kamel force-pushed the fk/oci-phoenix-private-nemotron branch from 88ee6ab to 8a22372 on April 23, 2026 at 18:46
@fede-kamel
Author

@chrisalexiuk-nvidia DCO is green and I just ran the updated cookbook end-to-end in one uninterrupted pass on a fresh private OKE cluster in us-phoenix-1. Step 12 results:

  • /health → {"status":"healthy"}
  • /v1/models → nvidia/Llama-3.1-Nemotron-Nano-8B-v1
  • chat completion → {"content":"NEMOTRON_OK","finish_reason":"stop"}
  • tool call → {"finish_reason":"tool_calls","tool_calls":[{"name":"get_utc_time"}]}
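For anyone reproducing the Step 12 checks above, a hedged sketch of the smoke tests, assuming the serving endpoint has already been port-forwarded to localhost:8000 (the local port and forwarding path are assumptions, not cookbook values):

  # Health and model listing against the vLLM OpenAI-compatible server.
  curl -s http://localhost:8000/health
  curl -s http://localhost:8000/v1/models | jq -r '.data[].id'

  # Chat completion; the prompt mirrors the NEMOTRON_OK sentinel check.
  curl -s http://localhost:8000/v1/chat/completions \
    -H 'Content-Type: application/json' \
    -d '{"model": "nvidia/Llama-3.1-Nemotron-Nano-8B-v1",
         "messages": [{"role": "user", "content": "Reply with NEMOTRON_OK"}]}' \
    | jq '.choices[0].message.content, .choices[0].finish_reason'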

Kubelet capacity after Step 7b: 198 GiB on the GPU node (VM.GPU.A10.1, 200 GB boot volume) and 89 GiB on the CPU node (100 GB boot volume). The expansion took effect; no pods were evicted for disk pressure during the Helm deploy.

Two signed-off commits on this branch:

  • a7a1fab — OpenSSH 10.x workaround (Appendix A: jump-host VM alternative to OCI Bastion) + Step 7b (VM soft-reset so kubelet re-reads Node.Capacity.ephemeral-storage) + AD capacity-aware picker in Step 5 + nc -z cloud-init race guard in A.1.
  • 8a22372 — Step 7b instance-OCID lookup via .spec.providerID. The earlier oci ce node-pool list query returned empty because the list API does not populate the nested nodes array; providerID gives a reliable one-liner (see the sketch below).
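A sketch of the per-node Step 7b sequence, combining the drain/soft-reset flow with the providerID lookup from 8a22372; the node name is a hypothetical placeholder:

  # NODE is the node's private-IP name as reported by kubectl get nodes.
  NODE=10.0.10.5

  kubectl drain "$NODE" --ignore-daemonsets --delete-emptydir-data

  # OKE sets .spec.providerID to oci://<instance-ocid>; strip the scheme.
  INSTANCE_ID=$(kubectl get node "$NODE" \
      -o jsonpath='{.spec.providerID}' | sed 's|^oci://||')

  # Soft-reset so kubelet re-reads ephemeral-storage capacity at startup.
  oci compute instance action --instance-id "$INSTANCE_ID" --action SOFTRESET

  # Wait for the node to rejoin Ready, then return it to service.
  kubectl wait node "$NODE" --for=condition=Ready --timeout=10m
  kubectl uncordon "$NODE"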

Ready for review.

@fede-kamel
Author

Hey @chrisalexiuk-nvidia @anushapant @shashank3959 @chtruong814 — quick update: the companion PR NVIDIA/NeMo-Agent-Toolkit#1804 ("Add OCI LangChain support for hosted Nemotron workflows") was merged on April 14 ✅

That PR's OCI integration was validated end-to-end against the exact private OKE deployment documented in this cookbook (11/11 tests pass on nvidia/Llama-3.1-Nemotron-Nano-8B-v1 running in us-phoenix-1), so the two pieces are tightly coupled: the Toolkit ships the provider, this PR ships the reproducible deployment that backs it.

On this PR's side everything is green — DCO passing, signed-off, end-to-end re-validated on a fresh cluster after the OpenSSH 10.x / Step 7b updates (8a22372). Would love to get this across the line so the OCI + Nemotron story can be announced as a single coordinated drop alongside the Toolkit release. What's needed from my end to land it?

Thanks!

fede-kamel added 12 commits May 6, 2026 11:56
Signed-off-by: Federico Kamelhar <federico.kamelhar@oracle.com>
Signed-off-by: Federico Kamelhar <federico.kamelhar@oracle.com>
Signed-off-by: Federico Kamelhar <federico.kamelhar@oracle.com>
Replace tag: "latest" with the specific vLLM release that was
running when the cookbook was validated on the private OKE cluster
(Jan 22 2026). This ensures reproducible deployments.

Signed-off-by: Federico Kamelhar <federico.kamelhar@oracle.com>
Confirmed by exec'ing into the running Nemotron pod on the
validated private OKE cluster — vllm.__version__ reports 0.17.1.

Signed-off-by: Federico Kamelhar <federico.kamelhar@oracle.com>
Upgraded the nemotron-vllm-phx cluster from v0.17.1 to v0.19.0
and validated chat completion, health, and model listing through
both the engine service and the router.

Signed-off-by: Federico Kamelhar <federico.kamelhar@oracle.com>
Document that the cookbook was validated against vLLM v0.19.0.

Signed-off-by: Federico Kamelhar <federico.kamelhar@oracle.com>
Rewrite the README as a complete step-by-step guide based on a
clean end-to-end deployment validated on 2026-04-15 against OKE
v1.31.10 in us-phoenix-1 with vllm-stack chart 0.1.10.

Key fixes from live deployment validation:

- Move chatTemplate from chart field to vllmConfig.extraArgs
  (chart 0.1.10 prepends /templates/ to the path, breaking
  absolute container paths)
- Document vllm-templates-pvc prerequisite (new in chart 0.1.10,
  engine pod stays Pending without it)
- Add boot volume filesystem expansion steps with iSCSI rescan
  for online-resized volumes
- Document CPU node boot volume must be >= 100 GB (router image
  is ~10.5 GB, default 47 GB causes eviction)
- Add CoreDNS and kube-dns-autoscaler GPU toleration patches
- Add StorageClass creation step
- Add Bastion tunnel setup with kubeconfig CA cert workaround
- Add troubleshooting for all issues encountered during
  deployment

Signed-off-by: Federico Kamelhar <federico.kamelhar@oracle.com>
Switch from the OKE Terraform module to manual OCI CLI commands
based on the Oracle vLLM Production Stack guide. The Terraform
module's NSG configuration blocks OCI Bastion port-forwarding
to the private control plane endpoint.

The CLI approach creates a dedicated bastion subnet with proper
security list rules, which provides reliable tunnel access to
the private OKE cluster.

Signed-off-by: Federico Kamelhar <federico.kamelhar@oracle.com>
- Add explicit wait gates between steps (cluster ACTIVE before
  node pools, nodes ACTIVE before connecting)
- Use unique pod names per node for boot volume expansion
- Add NVIDIA device plugin note (pre-installed on enhanced OKE)
- Add reference to checked-in values file
- Add cleanup section with teardown commands
- Add note about Terraform directory as alternative reference
- Clarify AD selection for GPU capacity

Signed-off-by: Federico Kamelhar <federico.kamelhar@oracle.com>
- Add ASCII architecture diagram showing VCN layout
- Change default profile to DEFAULT (not user-specific)
- Add image ID verification step with fallback query
- Explain why bastion subnet is public
- Improve cleanup section with pool listing command
  and subnet deletion loop
- Add commented fallback for manual image selection

Signed-off-by: Federico Kamelhar <federico.kamelhar@oracle.com>
- Add Appendix A: jump-host VM alternative to OCI Bastion. OpenSSH 10.x
  clients (macOS 15+, recent Linux) cannot sustain OCI Bastion
  port-forwarding sessions; the server closes the connection immediately
  after publickey auth. Appendix A replaces Step 4 and the
  bastion-session block inside Step 6 with a public-IP jump-host in
  the bastion subnet and a plain SSH -L tunnel.
- Add Step 7b: soft-reset each node after filesystem expansion so
  kubelet re-reads ephemeral-storage capacity. The in-place
  systemctl restart kubelet in Step 7 never worked (SIGKILL from
  containerd bounce; kubelet also caches capacity at startup), causing
  the engine pod to be evicted mid image pull with disk-pressure despite
  df showing the expanded filesystem. Step 7b includes a capacity
  verification loop that gates on the expected Ki values.
- Replace Step 5's hardcoded data[1] AD picker with a capacity-aware
  loop that queries gpu-a10-count availability per AD and selects the
  first AD with non-zero capacity.
- Appendix A.1 waits for port 22 with nc -z before the first SSH, to
  avoid the cloud-init race between VM RUNNING and authorized_keys
  being installed.
- New troubleshooting entries for stale kubelet capacity and SSH tunnel
  closure on OpenSSH 10.x.

Signed-off-by: Federico Kamelhar <federico.kamelhar@oracle.com>
The previous snippet used `oci ce node-pool list` with a query over the
nested `nodes[]` array, but the `list` API does not populate `nodes` —
it always returns `null` — so the lookup produced an empty
`INSTANCE_ID` and the subsequent `compute instance action` looped with
`Parameter --instance-id cannot be whitespace or empty string`.

OKE sets each node's `.spec.providerID` to `oci://<instance-ocid>`, so
`kubectl get node <ip> -o jsonpath='{.spec.providerID}'` gives a
reliable one-line lookup.

Signed-off-by: Federico Kamelhar <federico.kamelhar@oracle.com>
@fede-kamel fede-kamel force-pushed the fk/oci-phoenix-private-nemotron branch from 4798f74 to 69bbedd on May 6, 2026 at 16:11
@fede-kamel
Author

Hey @chrisalexiuk-nvidia — friendly nudge on this one. Last word from you was an LGTM on Apr 9; since then the cookbook's been re-validated end-to-end on a fresh OKE cluster, the branch is rebased clean on main (13 ahead, 0 behind), DCO is green, and the companion NeMo-Agent-Toolkit PR (#1804) merged on Apr 14.

Anything blocking on your side, or is it just an Approve click away? Happy to address anything else if needed.
