[SPARK-56693][K8S] Support built-in K8s ExecutorPVCResizePlugin #55642

dongjoon-hyun wants to merge 4 commits into apache:master from

Conversation
Could you review this PR when you have some time, @viirya?

Could you review this PR, @peter-toth?
```scala
    .toSet
}

private def tryResize(
```
viirya commented:
tryResize grows the PVC from the current spec.resources.requests.storage every time the reported ratio is above the threshold. After a successful PVC spec patch, the executor filesystem capacity may not reflect the new size immediately, so the next reporting interval can still report the old high usage ratio. With a 1-minute interval and the default growth factor, a 50Gi PVC could become 100Gi, then 200Gi, then 400Gi before the first resize is actually completed.
This is especially risky because the PR description already notes that some cloud providers only allow volume modification once every several hours. The current code only blacklists PVCs when the synchronous patch call throws, but async resizer failures or pending resize states are not detected.
I think this should track per-PVC in-flight resize state or cooldown, and check PVC status/conditions before issuing another patch. At minimum, it should avoid issuing another expansion while spec.requests.storage is already larger than the observed/status capacity.
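For illustration, a per-PVC cooldown could be as small as this hypothetical tracker; the class name, the `cooldownMs` parameter, and keying by PVC name are illustrative assumptions, not the PR's design:

```scala
import java.util.concurrent.ConcurrentHashMap

// Hypothetical per-PVC cooldown tracker (illustrative, not the PR's code).
class ResizeCooldown(cooldownMs: Long) {
  private val lastResizeMs = new ConcurrentHashMap[String, Long]()

  // Returns true (and records the attempt) only if the cooldown has elapsed.
  def tryAcquire(pvcName: String, nowMs: Long): Boolean = {
    val last = lastResizeMs.getOrDefault(pvcName, 0L)
    if (nowMs - last >= cooldownMs) {
      lastResizeMs.put(pvcName, nowMs)
      true
    } else {
      false
    }
  }
}
```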
dongjoon-hyun replied:
Technically, the case you mention doesn't happen, @viirya. The second and subsequent invocations do no harm from the Spark side, according to the K8s design policy. We can request the desired state multiple times, but the second and subsequent requests are not accepted by the K8s control plane for the next 6 hours.
In other words, the expansion is processed at most once every 6 hours. The only exception is that we can expand once at any time after creation.
viirya replied:
Thanks. Is the 6-hour behavior guaranteed by the Kubernetes control plane? It sounds like an AWS EBS backend limitation rather than a generic Kubernetes PVC expansion policy. I only found it mentioned in an AWS EBS article like this.
From the Kubernetes side, users request expansion by updating the PVC spec to a larger requested size, and the controller reconciles that desired state asynchronously. The Kubernetes docs also say failed expansion requests are continuously retried and should be monitored via PVC status/events. I don't see a generic Kubernetes rule that rejects subsequent larger PVC spec updates while a previous resize is pending.
So even if EBS rejects actual backend ModifyVolume calls within 6 hours, the Spark driver may still keep increasing the PVC desired size in Kubernetes, for example from 50Gi to 100Gi to 200Gi, before the first resize has converged. That changes the target state and can result in over-expansion once the backend is eventually able to process it, or leave the PVC stuck retrying a much larger requested size.
I think the plugin should be storage-provider-neutral and should not rely on AWS EBS cooldown behavior for safety. A simple guard such as skipping when spec.resources.requests.storage is already greater than status.capacity["storage"], or otherwise tracking an in-flight resize per PVC, would prevent repeated target inflation while still following Kubernetes' desired-state model.
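For illustration, a minimal sketch of that guard against the fabric8 client model used by Spark's K8s backend might look like the following; the method name and null-handling are assumptions, not the PR's code:

```scala
import io.fabric8.kubernetes.api.model.{PersistentVolumeClaim, Quantity}

// Sketch of the suggested guard: treat a PVC as "resize pending" while the
// desired size in the spec already exceeds the capacity last observed in status.
def isResizePending(pvc: PersistentVolumeClaim): Boolean = {
  val requested = Option(pvc.getSpec)
    .flatMap(s => Option(s.getResources))
    .flatMap(r => Option(r.getRequests))
    .flatMap(m => Option(m.get("storage")))
  val observed = Option(pvc.getStatus)
    .flatMap(s => Option(s.getCapacity))
    .flatMap(m => Option(m.get("storage")))
  (requested, observed) match {
    case (Some(req), Some(cap)) =>
      Quantity.getAmountInBytes(req).compareTo(Quantity.getAmountInBytes(cap)) > 0
    case _ => false // without both values we cannot tell, so do not block
  }
}
```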
```scala
  }.maxOption
  maxRatio.foreach { ratio =>
    logInfo(s"Reporting max PVC disk usage ratio for executor ${env.executorId}: $ratio")
    pluginContext.send(PVCDiskUsageReport(env.executorId, ratio))
```
viirya commented:
The executor reports only the maximum usage ratio across all SPARK_LOCAL_DIRS, and the driver applies that same ratio to every PVC mounted by the executor container. This means that with multiple Spark local-dir PVCs, an empty PVC can be expanded just because another local dir is full. It can also resize unrelated PVCs mounted into the executor container, since pvcsOf returns all mounted PVCs rather than only the PVC backing the reported local dir.
The class comment says the driver maps each reported mount path back to the executor pod’s PVC, but the message does not contain any mount path or per-directory usage, so the implementation cannot actually do that.
I think the executor should report per-local-dir usage, including the path, and the driver should map each reported path to the corresponding volumeMount/PVC. It should also limit resizing to Spark local-dir PVCs rather than every PVC mounted in the executor container.
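To make the suggestion concrete, here is a hypothetical per-directory report shape and executor-side computation; the message name echoes the PR's `PVCDiskUsageReport`, but this per-dir variant and the helper are illustrations, not the PR's actual code:

```scala
import java.io.File

// Hypothetical per-directory report: include each local-dir path so the
// driver can map it back to the PVC that backs it.
case class PVCDiskUsagePerDirReport(executorId: String, usageByDir: Map[String, Double])

// Sketch of computing per-directory usage ratios; `localDirs` would come from
// DiskBlockManager.localDirs (i.e. SPARK_LOCAL_DIRS).
def usageRatios(localDirs: Seq[File]): Map[String, Double] =
  localDirs.collect {
    case dir if dir.getTotalSpace > 0L =>
      dir.getPath -> (1.0 - dir.getUsableSpace.toDouble / dir.getTotalSpace)
  }.toMap
```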
viirya left a comment:
The current tests cover the basic above/below-threshold behavior, patch failure blacklisting, and pods without PVCs, but they do not cover the main correctness risks of this feature.
I think we probably should add tests for:
- Multiple local-dir PVCs where only one directory is above the threshold. Only the PVC backing that directory should be resized.
- Executor pods with additional non-local-dir PVC mounts. Those PVCs should not be resized.
- A PVC whose spec.resources.requests.storage is already larger than status.capacity, meaning a resize is already pending. The plugin should not issue another expansion.
- Repeated check intervals with the same high usage report before the filesystem capacity changes. This should not repeatedly double the PVC size.
- The actual patched storage value, not just that patch() was called.
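For the last bullet, a hedged test sketch could capture the patched PVC and assert the requested size; `pvcResource` is a hypothetical mock of the fabric8 `Resource` the plugin would patch, and the resize path is assumed to have already run against it:

```scala
import io.fabric8.kubernetes.api.model.{PersistentVolumeClaim, Quantity}
import io.fabric8.kubernetes.client.dsl.Resource
import org.mockito.ArgumentCaptor
import org.mockito.Mockito.{mock, verify}

// Hypothetical mocked Resource wired into the plugin under test.
val pvcResource = mock(classOf[Resource[PersistentVolumeClaim]])
// ... run the plugin's resize path against the mock here ...

// Capture the patched PVC instead of only verifying that patch() was invoked.
val captor: ArgumentCaptor[PersistentVolumeClaim] =
  ArgumentCaptor.forClass(classOf[PersistentVolumeClaim])
verify(pvcResource).patch(captor.capture())
val patched = captor.getValue.getSpec.getResources.getRequests.get("storage")
// With a 50Gi PVC and the default factor 1.0, the new request should be 100Gi.
assert(Quantity.getAmountInBytes(patched)
  .compareTo(Quantity.getAmountInBytes(new Quantity("100Gi"))) == 0)
```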
It's one of the theoretically proper alternatives, but it's not a must-have. I chose this simple design for heavily loaded executors because all PVCs of that executor will be released (affected) if the executor dies from an out-of-disk error on any one PVC.
As I mentioned, it sounds good, but that's not the K8s design policy, where requesters set the desired state and the controller meets the target asynchronously. There is no need to be synchronous here.
Technically, an hourly check is assumed for long-running jobs. The main reason the interval unit is MINUTE is only to support the first expansion after creation for small jobs, @viirya.
I'd like to explain the background a little more, @viirya .
Thanks, I understand the intended trade-off now: if any local-dir PVC can kill the executor when it runs out of space, resizing all local-dir PVCs for that executor is a simpler conservative policy. My remaining concern is that the current implementation does not limit this policy to Spark local-dir PVCs. It resizes every PVC mounted by the executor container, including PVCs that are unrelated to Spark local storage. If the intended contract is "all Spark local-dir PVCs are resized together for a homogeneous local-dir setup," then I think the code and docs should make that explicit and avoid touching non-local-dir PVCs.
Thanks, that helps clarify the intended scope. If the intended contract is a homogeneous PVC shuffle/local-dir storage setup, then I think we should make that contract explicit. Right now the implementation resizes every PVC mounted by the executor container, but the docs/configs only say "executor PVC" and do not warn that heterogeneous PVC mounts are unsupported. Users may reasonably combine Spark local-dir PVCs with other executor PVC mounts via pod templates, and those unrelated PVCs would also be resized. Could we either document this limitation very explicitly, or preferably restrict the implementation to Spark local-dir PVC mounts only? That would preserve the simple homogeneous design while avoiding accidental resize of unrelated PVCs.
I agree with the Kubernetes desired-state model, but my concern is not about reasserting the same desired state. The current code computes a new desired state from the already-increased spec on each interval. If the PVC has a resize still pending (its `spec.resources.requests.storage` already above `status.capacity`), each interval derives a new, larger target from the inflated spec. So the guard is not against Kubernetes async reconciliation itself. It is to avoid deriving another larger target while the previous target has not converged yet.
For the documentation, of course, I can explain this specifically.
For the code, we may introduce a string pattern (`spark-local-dir-`) to restrict resizing to Spark local-dir PVC mounts.
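A minimal sketch of that filter, assuming the `spark-local-dir-` volume-name convention used by Spark's K8s local-dir setup (the helper name is hypothetical):

```scala
import io.fabric8.kubernetes.api.model.Pod
import scala.jdk.CollectionConverters._

// Collect only the PVC claim names backing spark-local-dir-* volumes,
// skipping any unrelated PVC mounts on the executor pod.
def localDirPvcNames(pod: Pod): Set[String] =
  pod.getSpec.getVolumes.asScala
    .filter(v => v.getName.startsWith("spark-local-dir-") && v.getPersistentVolumeClaim != null)
    .map(_.getPersistentVolumeClaim.getClaimName)
    .toSet
```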
Thank you so much, @viirya. I addressed your concerns via 14de748. This PR has become much better than before. As I mentioned, I'll also enrich the documentation to cover non-shuffle storage PVCs. Currently, I've been working on SPARK-55553 (Document heterogeneous K8s executors) because this is a fairly big area.
Thanks, this update addresses my main repeated-expansion concern. I think the remaining concern is the scope of PVC selection. One smaller test comment: the "grown size" test still only verifies that `patch()` was called, not the actual patched storage value.
Let me narrow the PVC selection down specifically to the Spark local-dir PVCs.
viirya left a comment:
Thanks, the `spark-local-dir-` filtering addresses my concern about accidentally resizing unrelated PVC mounts. Together with the pending-resize guard, this looks much safer now.
One small follow-up: the class comment still says the driver maps each reported mount path back to the executor pod's PVC, but the report only contains the max ratio across local dirs and the driver applies it to all spark-local-dir-* PVCs for that executor. Could we update the comment to match the actual behavior?
Optionally, the “grown size” test could capture the patched PVC and assert the requested storage value, but that is more of a test-strengthening nit than a blocker from my side.
Thank you for the iterative reviews. 😄

Merged to master for Apache Spark 4.2.0.
What changes were proposed in this pull request?
This PR aims to support a new `ExecutorPVCResizePlugin` that monitors executor PVC disk usage and grows each PVC's `spec.resources.requests.storage` when usage exceeds a threshold.

The executor side reports the max filesystem usage ratio across `DiskBlockManager.localDirs`. The driver side patches the executor pod's PVCs to `currentSize * (1 + factor)` when the reported ratio exceeds the threshold.
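For illustration, the growth rule could be computed roughly as in this sketch; rounding and quantity formatting are simplified, and this is not the PR's actual code:

```scala
import io.fabric8.kubernetes.api.model.Quantity

// Grow the current requested storage by (1 + factor).
def grownSize(current: Quantity, factor: Double): Quantity = {
  val grownBytes = Quantity.getAmountInBytes(current)
    .multiply(java.math.BigDecimal.valueOf(1.0 + factor))
  new Quantity(grownBytes.toBigInteger.toString) // e.g. 50Gi with factor 1.0 -> 100Gi in bytes
}
```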
New configurations:

| Configuration | Default | Note |
| --- | --- | --- |
| `spark.kubernetes.executor.pvc.resizeInterval` | `0min` | `0` disables. |
| `spark.kubernetes.executor.pvc.resizeThreshold` | `0.5` | |
| `spark.kubernetes.executor.pvc.resizeFactor` | `1.0` | |
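A hedged usage sketch with these configurations; the fully qualified plugin class name below is an assumption, since the PR description only shows the short name:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("pvc-resize-demo")
  // Hypothetical FQCN; the PR only shows the short name ExecutorPVCResizePlugin.
  .config("spark.plugins", "org.apache.spark.deploy.k8s.ExecutorPVCResizePlugin")
  .config("spark.kubernetes.executor.pvc.resizeInterval", "1min") // 0 disables (default)
  .config("spark.kubernetes.executor.pvc.resizeThreshold", "0.5") // default
  .config("spark.kubernetes.executor.pvc.resizeFactor", "1.0")    // default; doubles the size
  .getOrCreate()
```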
Note that public cloud PVCs have clear limitations. For example, you can only increase the size, and only once every six hours. So, for a short Spark job (whose lifetime is less than 6 hours), `ExecutorPVCResizePlugin` is able to increase the size only once. For a recently increased PVC, K8s will show a warning like the following.

Why are the changes needed?
PVC-backed `SPARK_LOCAL_DIRS` must be sized conservatively up front to avoid mid-job disk-full failures, which wastes storage cost. `ExecutorResizePlugin` already established the observe-and-patch pattern for memory; this extends it to PVC storage.
Does this PR introduce any user-facing change?
No. The user needs to add this plugin to `spark.plugins` explicitly.

SUBMIT
DRIVER LOGS
EXECUTOR SIZE REPORTING (60% -> 30%)
RESIZED PVC
How was this patch tested?
Pass the CIs with a new `ExecutorPVCResizePluginSuite`.

Was this patch authored or co-authored using generative AI tooling?
Generated-by: Claude Opus 4.7 (1M context)