Add PreallocatedMode for predetermined GUID lookup #20
Coverage Report for CI Build 28261857916: coverage decreased (-1.1%) to 54.987%. No coverage regressions found. (Coveralls)
Each pkey owns a fixed band of the GUID range; the daemon hands out the next free slot from the pool with no UFM calls. Off by default (PREALLOCATED_MODE). Refs TCL-6978 Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
5fcd462 to a697737
	return nil
}

// LookupPredeterminedGUID returns a GUID from the pre-allocated band that the pod's tenant
NIT: those AI-generated comments are noise IMHO - way too verbose and unnecessary.
// Pre-allocated mode: pull a free GUID from the band the tenant pkey owns,
// instead of from the whole pool. No UFM call is made.
guidAddr, err = d.LookupPredeterminedGUID(spec)
if err != nil {
Shouldn't you handle the ErrGUIDPoolExhausted error explicitly here, the same way the non-pre-allocated mode handles it?
I would at least raise an explicit error log that we can alert on.
Note the syncGUIDPool call below, which runs when GUIDs are exhausted. It calls UFM and resets the local GUID pool map from the actual UFM results.
We can implement something similar here: list all virt-launcher pods with IB GUID annotations and reset the pool from them when it is exhausted. It's a good mechanism for recovering from drift.
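The recovery idea could be sketched roughly as follows. The pod type is a plain stand-in for a Kubernetes pod object, and the annotation key is hypothetical; neither is taken from the PR.

```go
package main

import "fmt"

// guidAnnotation is a hypothetical annotation key, not taken from the PR.
const guidAnnotation = "example.com/ib-guid"

// pod is a minimal stand-in for a Kubernetes pod object.
type pod struct {
	Name        string
	Annotations map[string]string
}

// rebuildPool returns the GUIDs currently claimed by pods, keyed by GUID,
// with the owning pod name as the value.
func rebuildPool(pods []pod) map[string]string {
	inUse := make(map[string]string)
	for _, p := range pods {
		if guid, ok := p.Annotations[guidAnnotation]; ok && guid != "" {
			inUse[guid] = p.Name
		}
	}
	return inUse
}

func main() {
	pods := []pod{
		{Name: "virt-launcher-a", Annotations: map[string]string{guidAnnotation: "02:00:00:00:00:02:00:00"}},
		{Name: "virt-launcher-b", Annotations: map[string]string{}},
	}
	fmt.Println(len(rebuildPool(pods)))
}
```

Replacing the in-memory map with this result on exhaustion would drop any entries whose pods no longer exist, which is the drift case the comment describes.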
// Number of low bits reserved per pkey band in pre-allocated mode. Each pkey owns a
// band of 2^bits guids (default 13 = 8192, covers 1000 nodes x 8 HCAs). Must match the
// per-pkey blocks the provider provisions: base = GUID_POOL_RANGE_START + pkey*2^bits.
PreallocatedBandBits int `env:"PREALLOCATED_BAND_BITS" envDefault:"13"`
Giving this some thought: we can future-proof ourselves more. Let's do 16 bits, in case a large tenant cluster spins up in 6 months.
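For reference, the capacity difference between the two widths can be checked with a few lines of arithmetic. The 1000 nodes x 8 HCAs figure is from the code comment; the 2000-node case is a hypothetical larger cluster added only for comparison.

```go
package main

import "fmt"

// bandSize returns the number of GUIDs in a band of the given bit width.
func bandSize(bits uint) uint64 { return uint64(1) << bits }

// fits reports whether nodes*hcasPerNode GUIDs fit in one band.
func fits(bits uint, nodes, hcasPerNode uint64) bool {
	return nodes*hcasPerNode <= bandSize(bits)
}

func main() {
	for _, bits := range []uint{13, 16} {
		fmt.Printf("bits=%d size=%d fits(1000x8)=%t fits(2000x8)=%t\n",
			bits, bandSize(bits), fits(bits, 1000, 8), fits(bits, 2000, 8))
	}
}
```

With 13 bits a band holds 8192 GUIDs, which covers 8000 but not 16000; with 16 bits it holds 65536 and covers both.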
What
Adds pre-allocated GUID mode for substrates where we are not the UFM fabric admin and GUIDs/PKeys are pre-provisioned by the provider (IREN B300s). When PREALLOCATED_MODE is on, the daemon assigns GUIDs locally with no UFM calls.
How
Each pkey owns a fixed band of the GUID range: GUID = RANGE_START + (pkey << PREALLOCATED_BAND_BITS) + slot. The daemon hands out the next free slot in that band from the existing pool.
PREALLOCATED_BAND_BITS defaults to 16 (65536 GUIDs/band) to leave headroom for large clusters. Must match the provider's per-pkey block layout.
initPool) and maintained via allocate/release, so a restart can't re-hand-out a live GUID without UFM.
Out of scope
RANGE_START + pkey << bits.
Refs TCL-6978