Use resource.yml in field classification in config-remote-sync#4677
Use resource.yml in field classification in config-remote-sync#4677ilyakuz-db wants to merge 3 commits into
resource.yml in field classification in config-remote-sync#4677Conversation
Integration test reportCommit: f7b3de5
226 interesting tests: 203 FAIL, 13 SKIP, 8 RECOVERED, 2 KNOWN
Top 4 slowest tests (at least 2 minutes):
|
| email_notifications.on_failure[0]: replace | ||
| max_concurrent_runs: replace | ||
| tags['env']: remove | ||
| timeout_seconds: remove |
There was a problem hiding this comment.
This is expected, logic was incorrect previously - this case was skipped completely
| // If a CLI-defaulted field is changed on remote and should be disabled | ||
| // (e.g. queueing disabled → remote field is nil), we can't define it | ||
| // in the config as "null" because the CLI default will be applied again. | ||
| var resetValues = map[string][]resetRule{ |
There was a problem hiding this comment.
Should we include this logic in resources.yml?
| reason: managed | ||
|
|
||
| backend_defaults: | ||
| # Backend returns PERFORMANCE_OPTIMIZED for serverless jobs when not set |
There was a problem hiding this comment.
This is set only for serverless jobs, probably, this should live in changes desc overrides
There was a problem hiding this comment.
Is this field ever set for non-serverless jobs?
There was a problem hiding this comment.
It is, actually! I thought that it only applies to Serverless jobs because in UI we show this checkbox as disabled for non-Serverless, but looking at the settings object in the resource, we do have this field set to PERFORMANCE_OPTIMIZED
|
|
||
| # Same as clusters.node_type_id — see clusters/resource_cluster.go#L333 | ||
| # s.SchemaPath("node_type_id").SetComputed() | ||
| - field: tasks[*].new_cluster.node_type_id |
There was a problem hiding this comment.
When I try to create new_cluster without node_type_id it fails with 400 error:
Endpoint: POST https://<host>/api/2.2/jobs/reset
HTTP Status: 400 Bad Request
API error_code: INVALID_PARAMETER_VALUE
API message: Cluster validation error: Unknown node type id
Considering that backend default is applied only when config value is nil, and for nil value we have 400 error - it doesn't make sense to keep it here
There was a problem hiding this comment.
Question, if I don't specify new_cluster, can it be added by the backend when resource is fetched? I guess not, or some test would fail, but good to say it explicitly.
There was a problem hiding this comment.
In which case? You can either use existing_cluster_id fields or new_cluster object. node_type_id exists in new_cluster and it cannot be set independently. When you specify compute for a task, you have 3 options:
- select existing cluster (existing_cluster_id field)
- add new (new_cluster field)
- use serverless (doesn't have node_type_id)
resource.yml in field classification in config-remote-sync
| ignore_remote_changes: | ||
| # edit_mode is set by CLI, not user-configurable | ||
| - field: edit_mode | ||
| reason: managed |
There was a problem hiding this comment.
I'm using "managed" in other places to mean managed by backend, we should have another reason here.
|
|
||
| jobs: | ||
| ignore_remote_changes: | ||
| # edit_mode is set by CLI, not user-configurable |
There was a problem hiding this comment.
You mean DABs override user setting? Do you know if that is intentional?
There was a problem hiding this comment.
No, I meant that the user can't specify this field in databricks.yml, it's not in the json schema
It's only set by CLI
There was a problem hiding this comment.
Understood, although still not clear why ignore_remote_changes for it? What if CLI sets A and user updates to B, should we not act on that drift?
There was a problem hiding this comment.
In config-remote-sync, we don't need to act, as this field only controls "Job is locked for edits" state. And when yaml sync feature is enabled, we always expect EDITABLE. We also set it explicitly during deployment:
But for the bundle plan, perhaps it could make sense. Example use case - user clicks "Disconnect this job from DAB", which changes edit_mode: UI_LOCKED -> EDITABLE in the resource. When user deploys again - the change is edit_mode: EDITABLE -> UI_LOCKED
Maybe it makes sense to keep a separate option in resources.yml for such cases, for example: ignore_for_sync / ignore_in_sync
There was a problem hiding this comment.
Moved to configsync package from here
| reason: managed | ||
|
|
||
| backend_defaults: | ||
| # Backend returns PERFORMANCE_OPTIMIZED for serverless jobs when not set |
There was a problem hiding this comment.
Is this field ever set for non-serverless jobs?
simonfaltum
left a comment
There was a problem hiding this comment.
Review (automated, 2 agents)
Verdict: Not ready yet | 0 Critical | 1 Major | 1 Gap(Major) | 2 Nit | 1 Suggestion
[Major] Migration drops suppression behavior for ~16 fields
bundle/direct/dresources/resources.yml
The old serverSideDefaults handled timeout_seconds, email_notifications, webhook_notifications, task disabled, pipeline_task.full_refresh, single_user_name, data_security_mode, pipeline continuous, and others. These have no equivalent rules in resources.yml yet. After this PR, config-remote-sync will start surfacing these as drift.
Additionally, removing node_type_id from backend_defaults changes deploy plan behavior (not just config-sync), potentially causing spurious updates for clusters where the backend auto-fills node type. Note the asymmetry: driver_node_type_id is still in backend_defaults.
Suggestion: Add equivalent backend_defaults/ignore_remote_changes rules for the missing fields, or explain why removing them is safe.
[Gap (Major)] No regression tests for migrated configsync classification rules
bundle/configsync/diff_test.go (deleted), bundle/configsync/diff.go
The only unit test file was deleted (90 lines of matchPattern tests). No replacement tests for shouldClassifySkip, filterEntityDefaults, or convertChangeDesc. The new glue logic (synthetic ChangeDesc{Old: nil, New: nil, Remote: value}) has subtle interactions with the classification pipeline.
Suggestion: Add table-driven tests for shouldClassifySkip and convertChangeDesc covering backend defaults, zero/nil values, nested maps, and non-default values.
[Suggestion] Document the hasConfigValue behavioral change
bundle/configsync/diff.go:107
The change from hasConfigValue := cd.Old != nil || cd.New != nil to hasConfigValue := cd.New != nil is a correctness fix (confirmed by the acceptance test update), but there's no comment explaining why only cd.New is checked, making it easy for a future contributor to "fix" it back.
[Nit] Silent error swallowing in shouldClassifySkip
bundle/configsync/diff.go:60-63 - Classification errors silently return false. A debug log would help diagnose issues.
[Nit] Array handling in filterEntityDefaults preserves nil entries
bundle/configsync/diff.go:72-77 - Filtered arrays produce [nil, nil, nil] instead of nil, inconsistent with the map case that returns nil for empty maps. Low priority.
## Changes Use the same logic for server-side defaults as in resources.yml logic. I have a PR that updates `config-remote-sync` to use `resources.yml`, but it's not yet ready to merge #4677 ## Why <!-- Why are these changes needed? Provide the context that the reviewer might be missing. For example, were there any decisions behind the change that are not reflected in the code itself? --> Changes in performance mode were incorrectly ignored previously ## Tests <!-- How have you tested the changes? --> <!-- If your PR needs to be included in the release notes for next release, add a separate entry in NEXT_CHANGELOG.md as part of your PR. -->
|
This PR has not received an update in a while. If you want to keep this PR open, please leave a comment below or push a new commit and auto-close will be canceled. |
## Changes Use the same logic for server-side defaults as in resources.yml logic. I have a PR that updates `config-remote-sync` to use `resources.yml`, but it's not yet ready to merge #4677 ## Why <!-- Why are these changes needed? Provide the context that the reviewer might be missing. For example, were there any decisions behind the change that are not reflected in the code itself? --> Changes in performance mode were incorrectly ignored previously ## Tests <!-- How have you tested the changes? --> <!-- If your PR needs to be included in the release notes for next release, add a separate entry in NEXT_CHANGELOG.md as part of your PR. -->
4045284 to
2d76234
Compare
…adata
config-remote-sync previously filtered server-side noise out of detected
remote changes via a hand-maintained serverSideDefaults table in
bundle/configsync/defaults.go. The table constantly lagged behind
resources.yml. This change derives the filtering from the same lifecycle
metadata that bundle plan uses (bundle/direct/dresources/resources.yml
and resources.generated.yml) and deletes the table.
configsync matches the metadata directly instead of relying on the
plan's per-field actions: the plan answers "what will deploy do", which
is not the same as "does this field belong in configuration". For
example, ResourceDashboard.OverrideChangeDesc re-promotes etag drift to
Update after the metadata chain (deploy needs that for modified-remotely
detection), but etag must never be written to YAML. Concretely:
- Fields matching ignore_remote_changes (in either the hand-written or
the generated config) are never written to configuration, regardless
of values or plan action.
- Fields matching backend_defaults are dropped when absent from the
config (cd.New == nil) and the remote value matches the rule,
including the optional values constraint.
- The same matching drives the nested filtering of entities added
remotely as a whole (e.g. a new task); map fields that become empty
after filtering are pruned, but empty elements inside arrays are kept
as {} to preserve array arity.
Two intentional semantic changes compared to the old table:
- When a field set in config has a remote value that filters entirely
to backend defaults, the operation is now Remove (the field is
dropped from the YAML) instead of Replace with an empty map. This
produces cleaner config and is pinned by a unit test.
- Pre-existing backend_defaults entries without values constraints
(e.g. notebook_task.source, data_security_mode) now make sync drop
any remote value when the field is absent from config, where the old
table only dropped one specific default value. This intentionally
aligns sync with the deploy plan's backend-default semantics; adding
values constraints to those entries would change plan behavior and is
out of scope here.
The matching helpers move from bundle/direct/bundle_plan.go into the
dresources package (FindMatchingRule, MatchesAllowedValue, and the
backend-defaults loop as ResourceLifecycleConfig.MatchesBackendDefault)
so plan and sync share one implementation.
Sync-specific rules stay in the configsync package, ported to structpath
patterns keyed by resource type: jobs edit_mode (set by the CLI, not
user-settable in databricks.yml), the jobs queue reset value, and
name-prefix stripping. These have sync-only drift semantics and don't
belong in resources.yml.
One backend_defaults rule needs a sync-side exception rather than
removal: the jobs new_cluster node_type_id entries (tasks[*] and
job_clusters[*]) are computed fields for the plan. Config legitimately
omits node_type_id when a cluster policy fixes it, and jobs/get then
returns the policy-resolved value in the stored settings (verified on
AWS prod, both with and without apply_policy_default_values), so
removing the entries would cause permadrift on every plan and a
spurious update on every deploy. For sync the opposite holds: a
remotely added cluster's explicit node_type_id is required
configuration, and stripping it via nested filtering would make the
synced YAML undeployable (cluster creation fails with 400 "Unknown
node type id" unless a pool or policy provides the node type). These
fields are therefore listed in fieldsKeptForSync, which nested entity
filtering consults before dropping a field. Keeping them
unconditionally cannot introduce a node_type_id/instance_pool_id
conflict: the Jobs API rejects requests carrying both, and jobs/get
does not materialize node_type_id for pool-backed clusters (also
verified empirically), so a remote cluster never has both fields.
resources.yml changes:
- jobs: add performance_target backend default with values
["PERFORMANCE_OPTIMIZED"]; the backend returns it for all jobs when
unset, not only for serverless jobs.
- jobs: add tasks[*].timeout_seconds / tasks[*].disabled backend
defaults (and for_each_task variants) with values [0] / [false].
These are inert for the plan, which already skips them as all-empty;
they are needed so config-remote-sync does not write these
backend-returned defaults into YAML for tasks added remotely.
- jobs: add email_notifications.no_alert_for_skipped_runs backend
defaults with values [false] at the job level, tasks[*], and
tasks[*].for_each_task.task. The backend populates this deprecated
field when notifications are not configured. Inert for the plan: the
all-empty check in addPerFieldActions runs before backend defaults
and already skips zero values. Needed so sync drops the leaf and
then prunes the empty email_notifications map for entities added
remotely.
- jobs: add tasks/job_clusters new_cluster.single_user_name backend
defaults, mirroring the existing standalone clusters entry (#4418).
- clusters: add driver_node_type_flexibility and
worker_node_type_flexibility backend defaults; some backends compute
alternate node types for standalone clusters when the config does not
specify flexibility (observed on AWS staging workspaces).
- dashboards: add etag to ignore_remote_changes (output_only). This does
not change deploy behavior because OverrideChangeDesc runs after the
metadata chain; see the comment in resources.yml.
- pipelines: remove the clusters[*].node_type_id backend_defaults
entry. Unlike jobs and standalone clusters, pipelines/get returns
the stored spec verbatim: clusters relying on an instance pool or a
policy-fixed node type never gain node_type_id in the spec (verified
on AWS prod for both cases), and ResourcePipeline.DoRead reads the
spec, so the entry can never match a real remote diff. Removing it
also keeps sync's nested filtering from stripping node_type_id off
remotely added pipeline clusters.
- The jobs and standalone clusters node_type_id entries stay: jobs/get
materializes policy-fixed node types (see above), and clusters/get
reports the pool's node type at the top level for pool-backed
clusters while config must omit it (the API rejects node_type_id
together with instance_pool_id).
Table entries that needed no resources.yml counterpart: zero/empty
values (job and task timeout_seconds: 0 at the top level, empty
email_notifications/webhook_notifications, pipelines continuous: false,
tasks[*].pipeline_task.full_refresh: false) are already skipped by the
plan's all-empty check before they reach sync, and empty maps inside
remotely added entities are pruned by the nested filtering; the
non-empty email_notifications form the backend returns
({no_alert_for_skipped_runs: false}) is covered by the new explicit
rules above. The remaining entries (cloud attributes,
data_security_mode, experiments artifact_location, registered_models
owner/full_name/metastore_id/storage_location, volumes
storage_location, sql_warehouses
creator_name/min_num_clusters/warehouse_type, jobs run_as, pipelines
storage) were already covered by existing resources.yml rules.
191100d to
dc8ac14
Compare
Changes
bundle config-remote-syncderives its field filtering from the resource lifecycle metadata (resources.yml+resources.generated.yml) instead of the hand-maintainedserverSideDefaultstable:ignore_remote_changesfields are excluded from sync unconditionally;backend_defaultsfields only when absent from config, honoringvalues.OverrideChangeDeschooks — those answer "what will deploy do", not "what belongs in config" (e.g. dashboardetagis re-promoted to Update for modified-remotely detection, yet must never be synced).edit_mode,queuereset values, name-prefix stripping, and anode_type_idkeep-list so remotely added job clusters stay deployable.dresources/match.goso plan and sync share one implementation.resources.ymlchanges:custom_tags/cluster_log_confbecomebackend_defaultson job and standalone clusters (moved from the configsync-only rule in configsync: don't sync policy-injected custom_tags and cluster_log_conf #5610, per its TODO). This also stops perpetual direct-engine plan updates for policy-injected values.performance_target ["PERFORMANCE_OPTIMIZED"], tasktimeout_seconds [0]/disabled [false]/pipeline_task.full_refresh [false],email_notifications.no_alert_for_skipped_runs [false], task/job-clustersingle_user_name, clusterdriver/worker_node_type_flexibility; dashboardsetag→ignore_remote_changes(inert for deploy). The zero-value rules are inert for the plan (the all-empty check runs before backend defaults, and any other remote value fails thevaluesconstraint); they exist for config-remote-sync's nested entity filtering.node_type_identries are kept — policy- and pool-backed clusters materialize it remotely while config legitimately omits it. Only the never-matching pipelinesclusters[*].node_type_identry is removed (pipelines/getreturns the stored spec verbatim).Two intentional sync semantic changes, pinned by unit tests: (a) a remote value that filters entirely to defaults emits
removeinstead ofreplace {}; (b)backend_defaultswithoutvaluesstrip any remote value when the field is absent from config (the old table matched one specific value). One narrow divergence: a remote-onlydata_security_modeon a standalone cluster now syncs verbatim instead of being stripped when equal toSINGLE_USER— its resources.yml rule is deliberately absent because clusters use the alias handler in cluster.go.Why
The table constantly lagged behind
resources.yml: every new field had to be added in two places, and gaps leaked server-side values into users' YAML. Lifecycle metadata added for the deploy plan now automatically protects config-remote-sync.Tests
custom_tags/cluster_log_confalone produce no update; a real remote edit still does.node_type_id/custom_tags/cluster_log_conf/spark_conf): spurious plan updates drop tospark_confonly (known gap, unchanged), no new drift fields, legit remote edits detected and synced identically, deploy end state identical; changing the policy's fixed node type after deploy stays suppressed on both.