Skip to content

Commit 193be7f

Browse files
denikandrewnesterdependabot[bot]Varun Deep Sainivarundeepsaini
authored
Persist state to .wal immediately after resource update (#5149)
## Changes Every time we update a state, we now write resource entry to resources.json.wal. At the end of the deployment, we read those entries, merge them into state and write state file. Internally, the state file can be opened in two modes: read and write. In write mode we only track IDs in memory but not full state. This ensures that when we merge .wal file into state in-memory representation, the .wal file is source of truth, so normal state update is not different from recovery state-update. ## Why If "bundle deploy" is killed, we don't lose state changes. Next time any bundle command runs, we'll recover the updates. ## Tests Modified testserver kill API to be more flexible. Instead of being configured via test.toml, it's a separate endpoint that can add kill middleware on any other endpoint. So you can dynamically insert kill action during script execution. --------- Signed-off-by: dependabot[bot] <support@github.com> Signed-off-by: Varun Deep Saini <varun.23bcs10048@ms.sst.scaler.com> Signed-off-by: Varun Deep Saini <deepsainivarun@gmail.com> Co-authored-by: Andrew Nester <andrew.nester@databricks.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Varun Deep Saini <varun.23bcs10048@ms.sst.scaler.com> Co-authored-by: Varun Deep Saini <deepsainivarun@gmail.com>
1 parent c62b641 commit 193be7f

105 files changed

Lines changed: 1503 additions & 273 deletions

File tree

Some content is hidden

Large Commits have some content hidden by default. Use the searchbox below for content that may be hidden.

NEXT_CHANGELOG.md

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -13,5 +13,6 @@
1313
* Make sure warnings asking for approval are understood by agents ([#5239](https://github.com/databricks/cli/pull/5239))
1414
* Support `replace_existing: true` on `postgres_branches` and `postgres_endpoints` so bundles can manage the implicitly-created production branch and primary read-write endpoint of a Lakebase project.
1515
* Add `postgres_catalogs` resource to bind a Unity Catalog catalog to a Postgres database on a Lakebase Autoscaling branch ([#5265](https://github.com/databricks/cli/pull/5265)).
16+
* engine/direct: Changes to state file now persisted to .wal file right away instead of being saved in the end ([#5149](https://github.com/databricks/cli/pull/5149))
1617

1718
### Dependency updates

acceptance/bin/assert_exists.py

Lines changed: 12 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,12 @@
1+
#!/usr/bin/env python3
2+
import os, sys
3+
4+
errors = 0
5+
6+
for filename in sys.argv[1:]:
7+
if not os.path.exists(filename):
8+
sys.stderr.write(f"Unexpected: {filename} does not exist.\n")
9+
errors += 1
10+
11+
if errors:
12+
sys.exit(1)
Lines changed: 12 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,12 @@
1+
#!/usr/bin/env python3
2+
import os, sys
3+
4+
errors = 0
5+
6+
for filename in sys.argv[1:]:
7+
if os.path.exists(filename):
8+
sys.stderr.write(f"Unexpected: {filename} exists.\n")
9+
errors += 1
10+
11+
if errors:
12+
sys.exit(1)

acceptance/bin/kill_after.py

Lines changed: 39 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,39 @@
1+
#!/usr/bin/env python3
2+
"""Set up a kill rule on the testserver for the current test token.
3+
4+
Usage: kill_after.py PATTERN OFFSET TIMES
5+
6+
PATTERN HTTP method and path, e.g. "POST /api/2.2/jobs/create"
7+
OFFSET number of requests to let through before killing starts
8+
TIMES number of times to kill the caller
9+
10+
The rule is scoped to the current DATABRICKS_TOKEN so it only affects
11+
the test that registers it, even when tests share a server.
12+
"""
13+
14+
import json
15+
import os
16+
import sys
17+
import urllib.request
18+
19+
host = os.environ.get("DATABRICKS_HOST", "")
20+
token = os.environ.get("DATABRICKS_TOKEN", "")
21+
22+
if not host:
23+
print("DATABRICKS_HOST not set", file=sys.stderr)
24+
sys.exit(1)
25+
26+
if len(sys.argv) != 4:
27+
print(f"usage: {sys.argv[0]} PATTERN OFFSET TIMES", file=sys.stderr)
28+
sys.exit(1)
29+
30+
pattern, offset, times = sys.argv[1], int(sys.argv[2]), int(sys.argv[3])
31+
32+
data = json.dumps({"pattern": pattern, "offset": offset, "times": times}).encode()
33+
req = urllib.request.Request(
34+
f"{host}/__testserver/kill",
35+
data=data,
36+
headers={"Content-Type": "application/json", "Authorization": f"Bearer {token}"},
37+
method="POST",
38+
)
39+
urllib.request.urlopen(req)
Lines changed: 37 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,37 @@
1+
bundle:
2+
name: wal-chain-test
3+
4+
resources:
5+
jobs:
6+
# Linear chain: job_01 -> job_02 -> job_03
7+
# Execution order: job_01 first, job_03 last
8+
job_01:
9+
name: "job-01"
10+
description: "first in chain"
11+
tasks:
12+
- task_key: "task"
13+
spark_python_task:
14+
python_file: ./test.py
15+
new_cluster:
16+
spark_version: 15.4.x-scala2.12
17+
node_type_id: i3.xlarge
18+
job_02:
19+
name: "job-02"
20+
description: "depends on ${resources.jobs.job_01.id}"
21+
tasks:
22+
- task_key: "task"
23+
spark_python_task:
24+
python_file: ./test.py
25+
new_cluster:
26+
spark_version: 15.4.x-scala2.12
27+
node_type_id: i3.xlarge
28+
job_03:
29+
name: "job-03"
30+
description: "depends on ${resources.jobs.job_02.id}"
31+
tasks:
32+
- task_key: "task"
33+
spark_python_task:
34+
python_file: ./test.py
35+
new_cluster:
36+
spark_version: 15.4.x-scala2.12
37+
node_type_id: i3.xlarge

acceptance/bundle/deploy/wal/chain-3-jobs/out.test.toml

Lines changed: 3 additions & 0 deletions
Some generated files are not rendered by default. Learn more about customizing how changed files appear on GitHub.
Lines changed: 110 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,110 @@
1+
=== First deploy (crashes on job_03) ===
2+
3+
>>> errcode [CLI] bundle deploy
4+
Uploading bundle files to /Workspace/Users/[USERNAME]/.bundle/wal-chain-test/default/files...
5+
Deploying resources...
6+
[PROCESS_KILLED]
7+
8+
Exit code: [KILLED]
9+
10+
=== WAL content after crash ===
11+
{
12+
"cli_version": "[DEV_VERSION]",
13+
"lineage": "[UUID]",
14+
"serial": 1,
15+
"state_version": 2
16+
}
17+
{
18+
"k": "resources.jobs.job_01",
19+
"v": {
20+
"__id__": "[JOB_01_ID]",
21+
"state": {
22+
"deployment": {
23+
"kind": "BUNDLE",
24+
"metadata_file_path": "/Workspace/Users/[USERNAME]/.bundle/wal-chain-test/default/state/metadata.json"
25+
},
26+
"description": "first in chain",
27+
"edit_mode": "UI_LOCKED",
28+
"format": "MULTI_TASK",
29+
"max_concurrent_runs": 1,
30+
"name": "job-01",
31+
"queue": {
32+
"enabled": true
33+
},
34+
"tasks": [
35+
{
36+
"new_cluster": {
37+
"node_type_id": "[NODE_TYPE_ID]",
38+
"spark_version": "15.4.x-scala2.12"
39+
},
40+
"spark_python_task": {
41+
"python_file": "/Workspace/Users/[USERNAME]/.bundle/wal-chain-test/default/files/test.py"
42+
},
43+
"task_key": "task"
44+
}
45+
]
46+
}
47+
}
48+
}
49+
{
50+
"k": "resources.jobs.job_02",
51+
"v": {
52+
"__id__": "[JOB_02_ID]",
53+
"depends_on": [
54+
{
55+
"label": "${resources.jobs.job_01.id}",
56+
"node": "resources.jobs.job_01"
57+
}
58+
],
59+
"state": {
60+
"deployment": {
61+
"kind": "BUNDLE",
62+
"metadata_file_path": "/Workspace/Users/[USERNAME]/.bundle/wal-chain-test/default/state/metadata.json"
63+
},
64+
"description": "depends on [JOB_01_ID]",
65+
"edit_mode": "UI_LOCKED",
66+
"format": "MULTI_TASK",
67+
"max_concurrent_runs": 1,
68+
"name": "job-02",
69+
"queue": {
70+
"enabled": true
71+
},
72+
"tasks": [
73+
{
74+
"new_cluster": {
75+
"node_type_id": "[NODE_TYPE_ID]",
76+
"spark_version": "15.4.x-scala2.12"
77+
},
78+
"spark_python_task": {
79+
"python_file": "/Workspace/Users/[USERNAME]/.bundle/wal-chain-test/default/files/test.py"
80+
},
81+
"task_key": "task"
82+
}
83+
]
84+
}
85+
}
86+
}
87+
88+
=== Number of jobs saved in WAL ===
89+
2
90+
91+
=== Bundle summary (reads from WAL) ===
92+
Name: wal-chain-test
93+
Target: default
94+
Workspace:
95+
User: [USERNAME]
96+
Path: /Workspace/Users/[USERNAME]/.bundle/wal-chain-test/default
97+
Resources:
98+
Jobs:
99+
job_01:
100+
Name: job-01
101+
URL: [DATABRICKS_URL]/jobs/[JOB_01_ID]?o=[NUMID]
102+
job_02:
103+
Name: job-02
104+
URL: [DATABRICKS_URL]/jobs/[JOB_02_ID]?o=[NUMID]
105+
job_03:
106+
Name: job-03
107+
URL: (not deployed)
108+
109+
=== WAL after successful deploy ===
110+
WAL deleted (expected)
Lines changed: 24 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,24 @@
1+
# Linear chain: job_01 -> job_02 -> job_03
2+
# Let first 2 jobs/create succeed, then kill on the 3rd
3+
kill_after.py "POST /api/2.2/jobs/create" 2 1
4+
5+
echo "=== First deploy (crashes on job_03) ==="
6+
trace errcode $CLI bundle deploy
7+
8+
echo ""
9+
echo "=== WAL content after crash ==="
10+
jq -S . .databricks/bundle/default/resources.json.wal 2>/dev/null || echo "No WAL file"
11+
12+
echo ""
13+
echo "=== Number of jobs saved in WAL ==="
14+
grep -c '"k":"resources.jobs' .databricks/bundle/default/resources.json.wal 2>/dev/null || echo "0"
15+
16+
echo ""
17+
echo "=== Bundle summary (reads from WAL) ==="
18+
$CLI bundle summary
19+
20+
echo ""
21+
echo "=== WAL after successful deploy ==="
22+
cat .databricks/bundle/default/resources.json.wal 2>/dev/null || echo "WAL deleted (expected)"
23+
24+
replace_ids.py
Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1 @@
1+
print("test")
Lines changed: 23 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,23 @@
1+
bundle:
2+
name: wal-corrupted-test
3+
4+
resources:
5+
jobs:
6+
valid_job:
7+
name: "valid-job"
8+
tasks:
9+
- task_key: "task-a"
10+
spark_python_task:
11+
python_file: ./test.py
12+
new_cluster:
13+
spark_version: 15.4.x-scala2.12
14+
node_type_id: i3.xlarge
15+
another_valid:
16+
name: "another-valid"
17+
tasks:
18+
- task_key: "task-b"
19+
spark_python_task:
20+
python_file: ./test.py
21+
new_cluster:
22+
spark_version: 15.4.x-scala2.12
23+
node_type_id: i3.xlarge

0 commit comments

Comments
 (0)