Skip to content

Add PreallocatedMode for predetermined GUID lookup#20

Open
ttalati175 wants to merge 5 commits into
masterfrom
ttalati/tcl-6978-modify-ufm-to-support-using-a-hash-to-lookup-the-pre
Open

Add PreallocatedMode for predetermined GUID lookup#20
ttalati175 wants to merge 5 commits into
masterfrom
ttalati/tcl-6978-modify-ufm-to-support-using-a-hash-to-lookup-the-pre

Conversation

@ttalati175

@ttalati175 ttalati175 commented Jun 18, 2026

Copy link
Copy Markdown

What

Adds pre-allocated GUID mode for substrates where we are not the UFM fabric admin and GUIDs/PKeys are pre-provisioned by the provider (IREN B300s). When PREALLOCATED_MODE is on, the daemon assigns GUIDs locally with no UFM calls.

How

  • Each tenant pkey owns a fixed band of the GUID range: GUID = RANGE_START + (pkey << PREALLOCATED_BAND_BITS) + slot. The daemon hands out the next free slot in that band from the existing pool.
  • PREALLOCATED_BAND_BITS defaults to 16 (65536 GUIDs/band) to leave headroom for large clusters. Must match the provider's per-pkey block layout.
  • No-op UFM client. The pool's in-use set is seeded from pod annotations at startup (initPool) and maintained via allocate/release — so a restart can't re-hand-out a live GUID without UFM.
  • On band exhaustion: alert + resync the pool from pod annotations and retry.
  • Off by default, so existing substrates are unchanged.

Out of scope

  • Provider must provision each pkey block at RANGE_START + pkey << bits.
  • Charts/tinfra wiring tracked separately.

Refs TCL-6978

@coveralls

coveralls commented Jun 18, 2026

Copy link
Copy Markdown

Coverage Report for CI Build 28261857916

Coverage decreased (-1.1%) to 54.987%

Details

  • Coverage decreased (-1.1%) from the base build.
  • Patch coverage: 56 uncovered changes across 2 files (17 of 73 lines covered, 23.29%).
  • No coverage regressions found.

Uncovered Changes

File Changed Covered %
pkg/daemon/daemon.go 59 6 10.17%
pkg/guid/guid_pool.go 14 11 78.57%

Coverage Regressions

No coverage regressions found.


Coverage Stats

Coverage Status
Relevant Lines: 1875
Covered Lines: 1031
Line Coverage: 54.99%
Coverage Strength: 22.8 hits per line

💛 - Coveralls

Each pkey owns a fixed band of the GUID range; the daemon hands out the
next free slot from the pool with no UFM calls. Off by default
(PREALLOCATED_MODE). Refs TCL-6978

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@ttalati175 ttalati175 force-pushed the ttalati/tcl-6978-modify-ufm-to-support-using-a-hash-to-lookup-the-pre branch from 5fcd462 to a697737 Compare June 23, 2026 17:30
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@ttalati175 ttalati175 marked this pull request as ready for review June 23, 2026 18:32
@ttalati175 ttalati175 changed the title Add PreallocatedMode skeleton for predetermined GUID lookup Add PreallocatedMode for predetermined GUID lookup Jun 23, 2026

@Yarosh Yarosh left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! few nits

Comment thread pkg/daemon/daemon.go Outdated
return nil
}

// LookupPredeterminedGUID returns a GUID from the pre-allocated band that the pod's tenant

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NIT: those AI-generated comments are noise IMHO - way too verbose and unnecessary.

Comment thread pkg/daemon/daemon.go
// Pre-allocated mode: pull a free GUID from the band the tenant pkey owns,
// instead of from the whole pool. No UFM call is made.
guidAddr, err = d.LookupPredeterminedGUID(spec)
if err != nil {

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't you handle the ErrGUIDPoolExhausted error explicitly here? same as non-pre-allocated mode handles it?
I would at least raise an explicit error log we can alert on

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Notice the syncGUIDPool called below when GUIDs are exhausted. This actually calls UFM and resets the local map GUID pool from the real UFM results.
We can implement something similar - get all virt-launcher pods with IB GUID annotations, and reset the pool in case of exhaustion. It's a good mechanism to recover from drift

Comment thread pkg/config/config.go Outdated
// Number of low bits reserved per pkey band in pre-allocated mode. Each pkey owns a
// band of 2^bits guids (default 13 = 8192, covers 1000 nodes x 8 HCAs). Must match the
// per-pkey blocks the provider provisions: base = GUID_POOL_RANGE_START + pkey*2^bits.
PreallocatedBandBits int `env:"PREALLOCATED_BAND_BITS" envDefault:"13"`

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Giving this some though - we can future proof ourselves more. Let's do 16 bits - in case a large tenant cluster spins up in 6 months

Comment thread pkg/daemon/daemon.go
// Pre-allocated mode: pull a free GUID from the band the tenant pkey owns,
// instead of from the whole pool. No UFM call is made.
guidAddr, err = d.LookupPredeterminedGUID(spec)
if err != nil {

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Notice the syncGUIDPool called below when GUIDs are exhausted. This actually calls UFM and resets the local map GUID pool from the real UFM results.
We can implement something similar - get all virt-launcher pods with IB GUID annotations, and reset the pool in case of exhaustion. It's a good mechanism to recover from drift

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@ttalati175 ttalati175 requested a review from Yarosh June 26, 2026 19:26
ttalati175 and others added 2 commits June 26, 2026 12:34
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants