[FLINK-39370][flink-autoscaler] In-place scaling check only inspects K8s resource spec for adaptive scheduler, ignoring JM running configuration#1080

Closed
Dennis-Mircea wants to merge 1 commit into apache:main from Dennis-Mircea:FLINK-39370

@Dennis-Mircea Dennis-Mircea commented Mar 31, 2026

What is the purpose of the change

NativeFlinkService.supportsInPlaceScaling() currently only inspects the Kubernetes FlinkDeployment resource spec to determine whether the adaptive scheduler is configured. If the adaptive scheduler is set through mechanisms outside the K8s spec (such as Flink's native flink-conf.yaml, environment variables, or command-line dynamic properties), the operator incorrectly concludes that in-place scaling is not supported and falls back to a full restart/redeploy cycle.

This PR adds a fallback path: when the K8s resource spec does not indicate jobmanager.scheduler: Adaptive, the operator queries the JobManager's actual running configuration via the JobManagerJobConfigurationHeaders REST endpoint. If the JM confirms the adaptive scheduler is active, in-place scaling proceeds normally.
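The decision flow described above can be sketched roughly as follows. This is a dependency-free model, not the operator's actual code: the class `InPlaceScalingCheck`, the `Callable` standing in for the JM REST request, the `Map` standing in for the REST response, and the boolean `jobTerminalOrReconciling` flag are all hypothetical names introduced for illustration.

```java
import java.util.Map;
import java.util.concurrent.Callable;

/** Hypothetical model of the fallback decision; not the real NativeFlinkService. */
final class InPlaceScalingCheck {

    static final String SCHEDULER_KEY = "jobmanager.scheduler";
    static final String ADAPTIVE = "Adaptive";

    /**
     * @param specConfig configuration derived from the K8s resource spec
     * @param jmConfigFetcher stand-in for the REST call to the running JobManager
     * @param jobTerminalOrReconciling true if the job is in a terminal or reconciling state
     */
    static boolean supportsInPlaceScaling(
            Map<String, String> specConfig,
            Callable<Map<String, String>> jmConfigFetcher,
            boolean jobTerminalOrReconciling) {

        // Primary path: the K8s resource spec already declares the adaptive scheduler.
        boolean adaptive = ADAPTIVE.equals(specConfig.get(SCHEDULER_KEY));

        // Fallback path: ask the running JobManager for its effective configuration.
        if (!adaptive) {
            adaptive = adaptiveAccordingToJobManager(jmConfigFetcher);
        }
        if (!adaptive) {
            return false;
        }

        // No early return on success: both paths still go through the state check.
        return !jobTerminalOrReconciling;
    }

    private static boolean adaptiveAccordingToJobManager(
            Callable<Map<String, String>> jmConfigFetcher) {
        try {
            return ADAPTIVE.equals(jmConfigFetcher.call().get(SCHEDULER_KEY));
        } catch (Exception e) {
            // Graceful degradation: an unreachable JM means in-place scaling is denied.
            return false;
        }
    }
}
```

In this sketch a failed lookup and a non-adaptive answer are treated the same way, so the fallback can only ever enable in-place scaling, never disable it for a deployment whose spec already declares the adaptive scheduler.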

Brief change log

  • NativeFlinkService.java

    • Changed supportsInPlaceScaling from static to instance method (needed for getClusterClient and operatorConfig access)
    • Added JM REST configuration fallback when observeConfig.get(SCHEDULER) is not Adaptive:
      • Sends a JobManagerJobConfigurationHeaders request to the running JM
      • Searches the response for the jobmanager.scheduler key
      • Returns true only if the value equals Adaptive
    • Graceful degradation: if the JM REST call fails (e.g., JM unreachable), in-place scaling is denied
    • The terminal/reconciling state check is correctly preserved - both the K8s spec path and JM fallback path flow through to it (no early return on success)
    • Extracted SCHEDULER to a static import for readability
    • Added imports: ConfigurationInfoEntry, JobManagerJobConfigurationHeaders, Optional, static SCHEDULER
  • NativeFlinkServiceTest.java

    • Updated existing testScaling service setup to also override getClusterClient (returns a TestingClusterClient configured to return Default scheduler from JM config, so existing "do not scale without adaptive scheduler" assertions remain valid)
    • Added new testScalingWithJmConfigFallback() test with 5 cases:
      1. JM returns Adaptive - scaling succeeds ✅
      2. JM returns Default - scaling denied ❌
      3. JM REST throws exception - scaling denied (graceful degradation) ❌
      4. JM returns Adaptive but job is FAILED - scaling denied (terminal state check is not skipped) ❌
      5. JM returns empty config (no scheduler key) - scaling denied ❌
    • Added imports: ConfigurationInfo, ConfigurationInfoEntry, EmptyResponseBody, JobManagerJobConfigurationHeaders
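For reviewers, the "searches the response for the jobmanager.scheduler key" step from the NativeFlinkService.java changes could look roughly like the sketch below. It assumes the JM response is a flat list of key/value entries; the `Entry` class here is a hypothetical stand-in for that entry type and `SchedulerLookup` is an illustrative name, neither taken from the PR diff.

```java
import java.util.List;
import java.util.Optional;

/** Hypothetical helper illustrating the key lookup; not the code in this PR. */
final class SchedulerLookup {

    /** Stand-in for one key/value entry of the JM configuration response. */
    static final class Entry {
        final String key;
        final String value;

        Entry(String key, String value) {
            this.key = key;
            this.value = value;
        }
    }

    /** Returns the configured scheduler, or empty if the key is absent. */
    static Optional<String> findScheduler(List<Entry> entries) {
        return entries.stream()
                .filter(e -> "jobmanager.scheduler".equals(e.key))
                .filter(e -> e.value != null) // findFirst() rejects null elements
                .map(e -> e.value)
                .findFirst();
    }

    /** True only if the key is present and its value equals "Adaptive". */
    static boolean isAdaptive(List<Entry> entries) {
        return findScheduler(entries).map("Adaptive"::equals).orElse(false);
    }
}
```

Returning an `Optional` keeps the "key missing" case (test case 5) distinct from the "key present with another value" case (test case 2), even though both end in scaling being denied.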

Verifying this change

This change is verified by the new testScalingWithJmConfigFallback test which covers the complete matrix of fallback scenarios. Case 4 is the critical regression test - it verifies that the terminal/reconciling state check is never bypassed when the adaptive scheduler confirmation comes from the JM REST fallback path rather than the K8s spec. All 10 tests in NativeFlinkServiceTest pass.

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): no
  • The public API, i.e., are there any changes to the CustomResourceDescriptors: no
  • Core observer or reconciler logic that is regularly executed: no

Documentation

  • Does this pull request introduce a new feature? no
  • If yes, how is the feature documented? not applicable

…K8s resource spec for adaptive scheduler, ignoring JM running configuration
@Dennis-Mircea Dennis-Mircea changed the title [FLINK-39370][flink-autoscaler] In-place scaling check only inspects … [FLINK-39370][flink-autoscaler] In-place scaling check only inspects K8s resource spec for adaptive scheduler, ignoring JM running configuration Mar 31, 2026

@gyfora gyfora left a comment

While I agree with the intent, I think the implementation is too localized, whereas the problem exists throughout the operator logic. We should instead pursue a more generic approach, as started here: #1000

gyfora commented Apr 14, 2026

closing in favor of a different approach

@gyfora gyfora closed this Apr 14, 2026