[Initiative]: Cloud Native AI Scheduling Challenges Whitepaper #1641
Open
Labels
kind/initiative: An initiative or an item related to initiative processes
needs-group: Indicates an issue or PR that has not been assigned a group (toc or tag/foo label applied)
needs-triage: Indicates an issue or PR that has not been triaged yet (has a 'triage/foo' label applied)
Name
Cloud Native AI Scheduling Challenges Whitepaper
Short description
Whitepaper about the scheduling challenges for AI/ML workloads in Cloud Native environments
Responsible group
TOC
Does the initiative belong to a subproject?
Yes
Subproject name
Cloud Native AI Working Group
Primary contact
@raravena80
Additional contacts
@zanetworker
@ronaldpetty
Initiative description
https://docs.google.com/document/d/1KNmTKwI_cRXZ0KVBqdBhkO1EuS4PhLIUvT16Y2a5erU/edit?tab=t.0#heading=h.l5opvu2gvmzq
This paper aims to enumerate the challenges and opportunities in optimizing resource allocation (that is, scheduling) for Cloud Native Artificial Intelligence (CNAI) workloads, and to educate readers about them. Cloud Native environments make it easy to scale resources, which makes them well suited to the two types of AI workload: training and inference. However, a standard Cloud Native scheduler, such as the one provided with Kubernetes, is by default better suited to microservice-type workloads and not yet to AI-related workloads.
Deliverable(s) or exit criteria
Final draft version to be handed off to the CNCF publishing staff.