
Commit 542c43f

Fix(ci): support Ubuntu Noble stemcell in create-bosh-lite (#3790)
The cf-deployment default stemcell moved from ubuntu-jammy to ubuntu-noble (cloudfoundry/cf-deployment#1224), which broke the create-bosh-lite workflow. Three Noble-specific problems plus supporting fixes:

- Warden agents wouldn't connect: Noble warden containers boot under systemd, which requires bbl >= 9.0.41 (warden_cpi start_containers_with_systemd:true). Provided via the BBL_CLI_VERSION repository variable (>= 9.0.41; set to 9.0.45).
- External DNS broken inside containers: the Noble bosh-dns config lives under the "bosh-dns-systemd" addon with disable_recursors:true, so diego-cells couldn't resolve buildpacks.cloudfoundry.org and app staging failed ("server misbehaving"). bosh-dns-noble-bosh-lite.yml enables recursion with the 169.254.169.254 recursor, applied to the dns runtime-config.
- App Envoy sidecars crashed on start ("inotify_fd_ >= 0", exit 134): with systemd in every warden container the director host's fs.inotify.max_user_instances (128) was exhausted. director-inotify.yml adds an os-conf sysctl job (1024 / 524288) to the director via create-director-override.sh. inotify limits bind at the host root user namespace, so this is set on the director VM, not the diego-cell.
- Increase the bosh-lite director VM to n2-standard-16 (64 GB): the whole deployment runs as warden containers on one VM and 32 GB overcommitted memory.
- Fix the failure-cleanup step: `bbl down` was passed --gcp-service-account-key=key.json (no such file is created), so it parsed the literal string as JSON and failed, leaving orphaned infrastructure on any failed run. Authenticate via BBL_GCP_SERVICE_ACCOUNT_KEY, like `bbl up`.

Requires the BBL_CLI_VERSION repository variable to be >= 9.0.41.

Signed-off-by: Prem Kumar Kalle <prem.kalle@broadcom.com>
1 parent 7ece4fa commit 542c43f
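
The bbl version requirement in the commit message (>= 9.0.41) could be enforced early in a workflow step. The following is a minimal sketch, not part of this commit: `version_ge` and `require_bbl_version` are hypothetical helper names, and the comparison assumes a `sort` that supports `-V` (version sort), as GNU coreutils does.

```shell
#!/bin/sh
# version_ge A B: succeed when dotted version A is greater than or equal to B.
# Assumption: `sort -V` is available on the runner.
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# require_bbl_version VERSION: hypothetical guard for the minimum bbl version
# named in the commit message (9.0.41).
require_bbl_version() {
  min="9.0.41"
  if version_ge "$1" "$min"; then
    echo "bbl $1 satisfies >= $min"
  else
    echo "bbl $1 is older than $min" >&2
    return 1
  fi
}
```

A step might call `require_bbl_version "$BBL_CLI_VERSION"` before `bbl up`, so that an outdated repository variable fails fast instead of surfacing later as agents that never connect.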

5 files changed

Lines changed: 79 additions & 4 deletions

.github/bosh-lite-files/create-director-override.sh

Lines changed: 23 additions & 0 deletions
@@ -0,0 +1,23 @@
+#!/bin/sh
+# Overrides bbl's generated create-director.sh so we can apply extra ops files
+# to the BOSH Lite director. Mirrors the stock bosh-lite-gcp plan-patch override
+# and adds director-inotify.yml (raises fs.inotify limits on the director host
+# so Noble app Envoy sidecars don't crash with "inotify_fd_ >= 0").
+bosh create-env \
+  ${BBL_STATE_DIR}/bosh-deployment/bosh.yml \
+  --state ${BBL_STATE_DIR}/vars/bosh-state.json \
+  --vars-store ${BBL_STATE_DIR}/vars/director-vars-store.yml \
+  --vars-file ${BBL_STATE_DIR}/vars/director-vars-file.yml \
+  --var-file gcp_credentials_json="${BBL_GCP_SERVICE_ACCOUNT_KEY_PATH}" \
+  -v project_id="${BBL_GCP_PROJECT_ID}" \
+  -v zone="${BBL_GCP_ZONE}" \
+  -o ${BBL_STATE_DIR}/bosh-deployment/gcp/cpi.yml \
+  -o ${BBL_STATE_DIR}/bosh-deployment/jumpbox-user.yml \
+  -o ${BBL_STATE_DIR}/bosh-deployment/uaa.yml \
+  -o ${BBL_STATE_DIR}/bosh-deployment/credhub.yml \
+  -o ${BBL_STATE_DIR}/bosh-deployment/bosh-lite.yml \
+  -o ${BBL_STATE_DIR}/bosh-deployment/bosh-lite-runc.yml \
+  -o ${BBL_STATE_DIR}/bosh-deployment/gcp/bosh-lite-vm-type.yml \
+  -o ${BBL_STATE_DIR}/bosh-deployment/gcp/director-inotify.yml \
+  -o ${BBL_STATE_DIR}/external-ip-gcp.yml \
+  -o ${BBL_STATE_DIR}/ip-forwarding.yml

.github/ops-files/bosh-dns-noble-bosh-lite.yml

Lines changed: 15 additions & 0 deletions
@@ -0,0 +1,15 @@
+---
+# BOSH DNS recursor fix for Ubuntu Noble on BOSH Lite (GCP).
+# bosh-deployment's dns.yml places the Noble bosh-dns config under the
+# "bosh-dns-systemd" addon (NOT the "bosh-dns" addon, which only covers
+# trusty/xenial/bionic/jammy) with disable_recursors: true. That blocks
+# external DNS resolution (e.g. buildpacks.cloudfoundry.org) inside diego-cell
+# containers, so app staging fails with "lookup ... server misbehaving".
+# Enable recursion and forward to GCP's metadata resolver.
+- type: replace
+  path: /addons/name=bosh-dns-systemd/jobs/name=bosh-dns/properties/disable_recursors
+  value: false
+- type: replace
+  path: /addons/name=bosh-dns-systemd/jobs/name=bosh-dns/properties/recursors?
+  value:
+  - 169.254.169.254
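
Before uploading the runtime config, the effect of this ops file can be previewed locally with `bosh interpolate`, which the workflow already uses elsewhere. This is a sketch only: `show_noble_recursors` is a hypothetical helper name, and both file paths are passed in by the caller rather than assumed.

```shell
#!/bin/sh
# show_noble_recursors DNS_YML OPS_FILE
# Applies the ops file to the dns runtime config and prints only the
# recursors property of the bosh-dns job inside the bosh-dns-systemd addon.
# Hypothetical helper; it does not talk to a director.
show_noble_recursors() {
  dns_yml="$1"
  ops_file="$2"
  bosh interpolate "$dns_yml" -o "$ops_file" \
    --path /addons/name=bosh-dns-systemd/jobs/name=bosh-dns/properties/recursors
}
```

If the ops file applies cleanly, the printed list should contain 169.254.169.254. A path error here would suggest that the addon name in dns.yml differs from "bosh-dns-systemd" in the bosh-deployment version being used.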

.github/ops-files/bosh-lite-vm-type.yml

Lines changed: 6 additions & 2 deletions
@@ -1,8 +1,12 @@
 ---
-# Configure sizes for bosh-lite on gcp
+# Configure sizes for bosh-lite on gcp.
+# n2-standard-16 (16 vCPU / 64 GB): the whole cf-deployment runs as warden
+# containers on this single director VM; on Ubuntu Noble each container runs a
+# full systemd PID 1, so 32 GB (n2-standard-8) overcommits memory and a random
+# instance-group agent fails to boot ("Timed out pinging VM"). 64 GB gives headroom.
 - type: replace
   path: /resource_pools/name=vms/cloud_properties/machine_type
-  value: n2-standard-8
+  value: n2-standard-16
 - type: replace
   path: /disk_pools/name=disks/disk_size
   value: 250000

.github/ops-files/director-inotify.yml

Lines changed: 25 additions & 0 deletions
@@ -0,0 +1,25 @@
+---
+# Raise inotify limits on the BOSH Lite director VM (the host running every
+# warden container). On Ubuntu Noble each warden container runs systemd as PID 1
+# (start_containers_with_systemd), a heavy inotify consumer, so the host's
+# default fs.inotify.max_user_instances (128) is exhausted. App Envoy sidecars
+# then abort with "assert failure: inotify_fd_ >= 0" (Exit status 134), which
+# marks every app instance CRASHED even though staging succeeds.
+#
+# inotify limits are enforced at the host root user namespace (a new userns
+# defaults to unlimited and inc_ucount checks every ancestor up to root), so
+# this MUST be set on the director VM, NOT on the diego-cell.
+#
+# The os-conf release is already declared by bosh-deployment's bosh-lite.yml
+# (which create-director-override.sh applies before this file, for its
+# disable_agent job), so we only add the sysctl job here. Re-declaring the
+# release fails with "releases[N].name 'os-conf' must be unique".
+- type: replace
+  path: /instance_groups/name=bosh/jobs/-
+  value:
+    name: sysctl
+    release: os-conf
+    properties:
+      sysctl:
+      - fs.inotify.max_user_instances=1024
+      - fs.inotify.max_user_watches=524288
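
To confirm that the sysctl job took effect, the live values can be read from /proc/sys on the director VM host. The sketch below is illustrative and not part of the commit: `sysctl_path` and `check_sysctl_min` are hypothetical helpers, and they assume a Linux host where sysctl keys map to files under /proc/sys.

```shell
#!/bin/sh
# sysctl_path KEY: convert a dotted sysctl key to its /proc/sys path,
# e.g. fs.inotify.max_user_instances -> /proc/sys/fs/inotify/max_user_instances
sysctl_path() {
  printf '/proc/sys/%s\n' "$(printf '%s' "$1" | tr '.' '/')"
}

# check_sysctl_min KEY MIN: report whether the live value is at least MIN.
# Returns 0 when the value is high enough, 1 when it is too low,
# and 2 when the key cannot be read on this host.
check_sysctl_min() {
  path="$(sysctl_path "$1")"
  if [ ! -r "$path" ]; then
    echo "$1: not readable at $path"
    return 2
  fi
  current="$(cat "$path")"
  if [ "$current" -ge "$2" ]; then
    echo "$1=$current (ok, >= $2)"
  else
    echo "$1=$current (below $2)"
    return 1
  fi
}
```

Run on the director VM itself, `check_sysctl_min fs.inotify.max_user_instances 1024` and `check_sysctl_min fs.inotify.max_user_watches 524288` would check the two values this ops file sets. Running the same checks inside a diego-cell container is less informative, since the file comments above explain that the limit is enforced at the host's root user namespace.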

.github/workflows/create-bosh-lite.yml

Lines changed: 10 additions & 2 deletions
@@ -81,6 +81,12 @@ jobs:
 cp ${GITHUB_WORKSPACE}/cli/.github/bosh-lite-files/bosh-lite-dns.tf terraform/
 cp ${GITHUB_WORKSPACE}/cli/.github/bosh-lite-files/bosh-lite.tfvars vars/
 cp ${GITHUB_WORKSPACE}/cli/.github/ops-files/bosh-lite-vm-type.yml bosh-deployment/gcp/
+cp ${GITHUB_WORKSPACE}/cli/.github/ops-files/director-inotify.yml bosh-deployment/gcp/
+# Overwrite the plan-patch's stock create-director-override.sh with ours
+# (bbl runs *-override.sh in preference to the generated create-director.sh)
+# so the director gets director-inotify.yml during bosh create-env.
+cp ${GITHUB_WORKSPACE}/cli/.github/bosh-lite-files/create-director-override.sh create-director-override.sh
+chmod +x create-director-override.sh
 bbl up

 - name: Authenticate to Google Cloud
@@ -131,7 +137,9 @@ jobs:
 cd $env_name/bbl-state
 eval "$(bbl print-env --shell-type posix)"

-bosh update-runtime-config ${GITHUB_WORKSPACE}/bosh-deployment/runtime-configs/dns.yml --name dns
+bosh update-runtime-config ${GITHUB_WORKSPACE}/bosh-deployment/runtime-configs/dns.yml \
+  -o ${GITHUB_WORKSPACE}/cli/.github/ops-files/bosh-dns-noble-bosh-lite.yml \
+  --name dns
 STEMCELL_VERSION=$(bosh interpolate ${GITHUB_WORKSPACE}/cf-deployment/cf-deployment.yml --path /stemcells/alias=default/version)
 bosh upload-stemcell "https://bosh.io/d/stemcells/bosh-warden-boshlite-ubuntu-noble?v=${STEMCELL_VERSION}"
 bosh update-cloud-config ${GITHUB_WORKSPACE}/cf-deployment/iaas-support/bosh-lite/cloud-config.yml
@@ -167,7 +175,7 @@ jobs:
 eval "$(bbl print-env --shell-type posix)"

 echo "Deleting env ${env_name}"
-bbl down --no-confirm --gcp-service-account-key=key.json
+bbl down --no-confirm

 echo "Deleting bbl state directory"
 if gsutil ls gs://cf-cli-bosh-lites | grep -q /${env_name}/; then
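
The cleanup failure fixed in the last hunk came from a key value that was neither an existing file nor JSON. A guard along the following lines could surface that kind of mistake before `bbl down` runs. It is a sketch under stated assumptions: `key_source` is a hypothetical helper, and the leading-brace test is only a rough heuristic for "looks like JSON", not a reproduction of bbl's own validation.

```shell
#!/bin/sh
# key_source VALUE: classify a service-account key value.
# Prints "file" when VALUE names an existing file, "json" when it starts
# with "{" (heuristic only), and "invalid" otherwise.
key_source() {
  if [ -f "$1" ]; then
    echo file
  else
    case "$1" in
      \{*) echo json ;;
      *) echo invalid ;;
    esac
  fi
}
```

In the cleanup step this might be used as `[ "$(key_source "$BBL_GCP_SERVICE_ACCOUNT_KEY")" != invalid ] || exit 1` ahead of `bbl down --no-confirm`, so that a bad credential is reported explicitly rather than as a JSON parse error.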
