Skip to content

Encapsulate Checkpoint internal state#148

Open
nojnhuh wants to merge 13 commits intokubernetes-sigs:mainfrom
nojnhuh:encapsulate_checkpoint
Open

Encapsulate Checkpoint internal state#148
nojnhuh wants to merge 13 commits intokubernetes-sigs:mainfrom
nojnhuh:encapsulate_checkpoint

Conversation

@nojnhuh
Copy link
Copy Markdown
Contributor

@nojnhuh nojnhuh commented Feb 5, 2026

This PR builds off of #90:

This PR encapsulates the internal state of Checkpoint and prevents direct access to CheckpointV1 from DeviceState.

The second commit drops our dependency on k/k by refactoring away from the kubelet's CheckpointManager to a pair of functions to read and write checkpoints from the local filesystem directly. Checksums have been dropped from the checkpoint format in favor of safely updating the file by writing a temporary copy and performing an atomic rename.

In addition, the checkpoint format has been changed to include an apiVersion and kind like other Kubernetes APIs. The new format also includes less data, relying on most container config for a ResourceClaim to be recomputed rather than restored directly from the checkpoint.

@k8s-ci-robot
Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: nojnhuh

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Feb 5, 2026
@pohly
Copy link
Copy Markdown
Contributor

pohly commented Feb 5, 2026

/cc

@k8s-ci-robot k8s-ci-robot requested a review from pohly February 5, 2026 10:42
@pohly pohly moved this from 🆕 New to 👀 In review in Dynamic Resource Allocation Feb 6, 2026
"sigs.k8s.io/dra-example-driver/internal/profiles"
)

type PreparedClaims map[string]profiles.PreparedDevices
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

type PreparedDevice struct {
	drapbv1.Device
	ContainerEdits *cdiapi.ContainerEdits
	AdminAccess    bool
}

This does not look like a struct that is intended to be used in an on-disk file format. All it takes is for someone to rename one of these fields and encoding/decoding will break. The current unit test can catch that only for fields which are covered by a test case.

Should we use this opportunity (i.e. this PR) to introduced proper serialization of the internal state with a versioned on-disk format?

See kubernetes/kubernetes#137086 for the machinery, used there for the on-disk metadata information file format. The problem is the same: one process writing a file, another reading it. That's an API, in this case between one version of the example driver and another.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We might as well do that here.

I'm trying to make the internal version contain a literal k8s.io/kubelet/pkg/apis/dra/v1beta1.Device and tags.cncf.io/container-device-interface/pkg/cdi.ContainerEdits since that's what the driver ultimately uses, but codegen is not liking those at all. The protobuf type has unexported fields which controller-gen tries to access in generated DeepCopy methods. The CDI types are structured kind of weirdly that's either weird to mirror in our API or needs a bunch of manual conversion to handle. To generate code that compiles it seems like we'll need to entirely duplicate all the types for every version and then manually convert from the internal API version to the types that the CDI and kubelet Go interfaces use. At that point I'd almost rather generate no code at all. I lost steam following this all the way through but pushed what I have so far in the newest commit here in case that's informative.

Am I way off base or do we need to accept that this just isn't going to be smooth?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does all of that information really need to be persisted?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I pared down the checkpoints all the way down to just the claim UID and rely on recomputing everything else from the ResourceClaim. I added some comments and refactored a bit to describe how real drivers can handle more complex scenarios.

@nojnhuh nojnhuh force-pushed the encapsulate_checkpoint branch from 04676a0 to 240f94e Compare March 5, 2026 06:55
@k8s-ci-robot k8s-ci-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Mar 5, 2026
@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 27, 2026
@nojnhuh nojnhuh force-pushed the encapsulate_checkpoint branch from 240f94e to 848c141 Compare April 17, 2026 18:51
@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Apr 17, 2026
@nojnhuh
Copy link
Copy Markdown
Contributor Author

nojnhuh commented Apr 17, 2026

I plan to squash before merging.

/hold

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Apr 17, 2026
…into encapsulate_checkpoint

# Conflicts:
#	go.mod
#	go.sum
Copy link
Copy Markdown
Contributor

@pohly pohly left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this goes in the right direction, but needs documentation of the concept.

Comment thread cmd/dra-example-kubeletplugin/checkpoint.go Outdated
Comment thread cmd/dra-example-kubeletplugin/checkpoint.go
Comment thread cmd/dra-example-kubeletplugin/state.go Outdated
Comment thread cmd/dra-example-kubeletplugin/state.go Outdated
Comment thread cmd/dra-example-kubeletplugin/state.go Outdated
checkpointScheme,
json.SerializerOptions{
Pretty: true,
Strict: true,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"Strict" can bite us. This causes unknown fields during decoding to result in an error, right?

Suppose we need to extend the on-disk format with a field that older releases of the DRA driver don't know about. If we treat API versions like in the rest of Kubernetes, we'd add that field to all API versions and serialize it. Older clients then need to ignore unknown fields, which we can rely upon when adding fields that are okay to be ignored.

The alternative is to add new fields only in a new API version and then only serialize using that version. This immediately prevents downgrading, because older clients won't be able to read the file.

Which brings me to my next point: how do new API versions get rolled out? We need to document it. Probably something like this:

  • In driver version x.y, support v1alpha1.
  • In driver version x.y+1, support v1beta1, but still serialize as v1alpha1 (thus supporting downgrades).
  • In driver version x.y+2, write as v1beta1 and then only support downgrades back to x.y+1.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The docs suggest it means something different: https://pkg.go.dev/k8s.io/apimachinery@v0.35.4/pkg/runtime/serializer/json#SerializerOptions.Strict

Strict: configures the Serializer to return strictDecodingError's when duplicate fields are present decoding JSON or YAML. Note that enabling this option is not as performant as the non-strict variant, and should not be used in fast paths.

I tried manually editing the checkpoint on disk to add an unknown field and the next NodePrepareResources succeeds, removing the unknown field when it serializes the checkpoint again. I think that's the expected behavior?

  • In driver version x.y+1, support v1beta1, but still serialize as v1alpha1 (thus supporting downgrades).

Does it make sense to even define a v1beta1 if the driver only writes v1alpha1? If v1beta1 is added to support some feature, then the driver should start writing v1beta1 right away. Maybe the driver can be more intelligent and continue writing v1alpha1 if it can tell it doesn't need any of the features in v1beta1. I think we can worry about that though once we need a new API version. Added some go docs to describe that process.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting. In https://pkg.go.dev/sigs.k8s.io/yaml#UnmarshalStrict, "strict" includes "presence of fields in the serialized data unknown to the struct will yield an error".

I tried manually editing the checkpoint on disk to add an unknown field and the next NodePrepareResources succeeds, removing the unknown field when it serializes the checkpoint again. I think that's the expected behavior?

Yes, thanks for checking.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it make sense to even define a v1beta1 if the driver only writes v1alpha1?

I think we can do the opposite and go straight to v1. Basically the driver then commits to only use that version and maybe extend it with new fields, with the caveat that older releases will ignore and drop those. Versioning as "v1" is just future-proofing the storage format because if structural changes are ever going to be needed, they can be done.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note this code was pulled from a (very) old version of the DRA driver for GPUs, and we have since relaxed the use of strict for (most) code paths. The example driver should probably do the same: kubernetes-sigs/dra-driver-nvidia-gpu@fb4822f

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, updated to match. Strict for config, non-strict for checkpoints.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can do the opposite and go straight to v1. Basically the driver then commits to only use that version and maybe extend it with new fields, with the caveat that older releases will ignore and drop those. Versioning as "v1" is just future-proofing the storage format because if structural changes are ever going to be needed, they can be done.

Makes sense, updated to v1.

Comment thread internal/api/checkpoint/v1/register.go
Comment thread Makefile
Comment thread Makefile
@nojnhuh nojnhuh force-pushed the encapsulate_checkpoint branch from 69b1bf0 to ff58c14 Compare April 23, 2026 04:03
@pohly
Copy link
Copy Markdown
Contributor

pohly commented Apr 23, 2026

/assign

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants