Apply Databricks Labs Repository Lockdown policy #19

Open
mjohns-databricks wants to merge 5 commits into master from security/repo-lockdown

@mjohns-databricks
Collaborator

Summary

Applies the Databricks Labs Repository Lockdown policy to GeoBrix ahead of the 2026-03-10 SHA-pinning cutoff. Scope is lockdown items 1, 3–6 (item 2, Hatch→uv, is N/A — GeoBrix is Scala/Maven + setuptools Python with no Hatch).

Five commits on top of master:

  1. 3ff670a Add scripts/security/ action-pinning tooling (list-external-actions, resolve-action-ref, pin-gh-actions, README).
  2. 514871b Pin external GitHub Actions to commit SHAs (cutoff 2026-03-10). Every uses: org/repo@<tag> in .github/workflows/ and .github/actions/ is rewritten to @<sha> # <tag>. Local first-party uses: ./.github/actions/* refs are intentionally unchanged. Tooling is rerunnable.
  3. 0fab757 Permissions + environment: runtime hardening. Top-level contents: read added where missing; stray top-level id-token: write removed from jobs that never request OIDC. Every job using REPO_ACCESS_TOKEN (the only non-exempt secret in use) now runs in the single protected environment runtime. deploy-docs drops pages: write / id-token: write from top level — moved to the deploy job only. release.yml's environment: release renamed → runtime. release.yml and publish-maven.yml disabled via if: false with banner comments and re-enable instructions (we are not publishing to PyPI / GitHub Packages from Actions today).
  4. 7076d47 Dependabot: cooldown.default-days: 7 on maven and pip ecosystems; github-actions ecosystem intentionally absent (SHAs are refreshed manually via scripts/security/pin-gh-actions), documented in a comment.
  5. 6bd5a0b Dockerfile + install_hadoop.sh hardening. FROM ubuntu:24.04 pinned by multi-arch manifest-list digest sha256:c4a8d5503dfb…41c7b. Hadoop 3.4.0 (pinned SHA-512 from downloads.apache.org), GDAL 3.11.4 (pinned SHA-256; upstream only ships MD5, so we MD5-verified the tarball then computed SHA-256 locally), and Maven 3.9.9 (pinned SHA-512; previously did a dynamic .sha512 fetch from the same origin as the tarball → no protection against origin compromise). scripts/util/install_hadoop.sh (unreferenced manual helper) hardened with set -euo pipefail + matching SHA-512 verification.

Policy items — coverage map

| Policy item | Status | Notes |
|---|---|---|
| 1. SHA-pin Actions before 2026-03-10 | | Commit 2 + tooling in commit 1 |
| 2. Hatch → uv migration | N/A | No Hatch in this repo |
| 3. Checksum-verify downloads | | Commit 5 |
| 4. Docker image digest pin | | Commit 5 |
| 5. Workflow permissions + protected env | | Commit 3 |
| 6. Dependabot cooldown + ecosystem hygiene | | Commit 4 |

Reviewer notes — breaking-ish changes to double-check

Some Actions were pinned at a newer major than the tag the repo was previously using (commit 514871b):

  • actions/checkout v5 → v6 (SHA de0fac2e…)
  • actions/upload-artifact v5 → v7
  • actions/download-artifact v5 → v8
  • actions/setup-node v4 → v6
  • actions/setup-python v5 → v6
  • actions/upload-pages-artifact v3 → v4

The repo's workflows still accept node20 runtime and the public API shapes are unchanged, but please confirm with a green CI run.

Operational prerequisites on the repo

Before merging:

  • Create the GitHub Environment named exactly runtime (Settings → Environments → New environment). No reviewers/wait-timer required initially — the environment binding itself is the gate for REPO_ACCESS_TOKEN scoping.
  • Move REPO_ACCESS_TOKEN from repo-level secrets to the runtime environment's secrets so it can only be read by jobs that bind to it.
  • Confirm CODECOV_TOKEN stays at the repo/org level (exempt secret — no environment needed).

Test plan

  • Confirm runtime environment exists and REPO_ACCESS_TOKEN is scoped to it
  • Green build main run (PR trigger path hits update-doc-inventory + build, both gated by environment: runtime)
  • Verify deploy-docs preview run still builds (doesn't deploy on PRs)
  • gbx:test:scala + gbx:test:python pass in Docker (no behavior change expected, but Dockerfile was rewritten around the Hadoop/GDAL/Maven fetch sections)
  • scripts/security/list-external-actions returns an empty problem list (every external ref is a SHA with a tag comment)
  • Dependabot PRs (when they land) honor the 7-day cooldown

This pull request and its description were written by Isaac.

Implements the three-script workflow from the Databricks Labs Repository
Lockdown policy: list-external-actions -> resolve-action-ref -> pin-gh-actions.

- list-external-actions: emits every third-party action referenced under
  .github/ (requires yq by Mike Farah).
- resolve-action-ref: for each action, finds the most recent release tag
  published before the cutoff (2026-03-10T00:00:00Z) and resolves it to a
  commit SHA. Handles both mono-repo conventions: subpath-prefixed tags
  (databrickslabs/sandbox/acceptance -> acceptance/v0.4.4) and top-level
  shared tags (github/codeql-action/analyze -> v4.32.6, where the subpath
  is just a directory inside a repo using a unified tag series).
- pin-gh-actions: consumes resolve-action-ref output, rewrites every
  matching `uses:` under .github/ with the SHA form + tag comment, and
  stages (but does not commit) the result. Skips databricks/databrickslabs
  actions per policy. Deviates from the blueprint reference in one way:
  does not auto-create or switch branches, because GeoBrix manages
  branches manually.

README documents the typical flow and the 2026-03-10 cutoff.
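
The tag-selection rule described above can be sketched as follows. This is a minimal illustration, not the actual resolve-action-ref script: the function name `pick_tag`, the `(tag, published_at)` input shape, and the order in which the two mono-repo conventions are tried are all assumptions, and fetching releases from the GitHub API is left out.

```python
from datetime import datetime, timezone

CUTOFF = datetime(2026, 3, 10, tzinfo=timezone.utc)  # cutoff from the policy


def pick_tag(action, releases, cutoff=CUTOFF):
    """Return the newest release tag published before the cutoff, or None.

    action   -- e.g. "github/codeql-action/analyze" (hypothetical input shape)
    releases -- iterable of (tag_name, published_at) pairs, timezone-aware
    """
    parts = action.split("/")
    subpath = "/".join(parts[2:])  # "" for plain org/repo actions
    before = [(t, ts) for t, ts in releases if ts < cutoff]

    if subpath:
        # Convention 1: subpath-prefixed tags such as "acceptance/v0.4.4".
        prefixed = [(t, ts) for t, ts in before if t.startswith(subpath + "/")]
        if prefixed:
            return max(prefixed, key=lambda r: r[1])[0]
    # Convention 2 (and plain actions): top-level shared tags such as "v4.32.6".
    shared = [(t, ts) for t, ts in before if "/" not in t]
    return max(shared, key=lambda r: r[1])[0] if shared else None
```

Trying the prefixed convention first means a repo that happens to publish both tag styles resolves to its subpath series; that ordering is a design choice of this sketch, not something the policy specifies.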

Co-authored-by: Isaac

Every third-party `uses:` under .github/workflows/ and .github/actions/ is
now pinned to the commit SHA of the most recent release published before
2026-03-10T00:00:00Z, with the release tag preserved as an inline comment
for cross-reference (the comment is informational only — reviewers must
re-verify the SHA against the upstream release). Generated by running:

  ./scripts/security/list-external-actions \
    | xargs ./scripts/security/resolve-action-ref \
    | ./scripts/security/pin-gh-actions

Resolutions (all 15 external refs, ordered; every ref was on a mutable
tag prior to this change):

  actions/cache@v4, v5            -> cdf6c1fa...  # v5.0.3
  actions/checkout@v5             -> de0fac2e...  # v6.0.2   (major bump)
  actions/deploy-pages@v4         -> d6db9016...  # v4.0.5
  actions/download-artifact@v5    -> 70fc10c6...  # v8.0.0   (major bump)
  actions/setup-java@v5           -> be666c2f...  # v5.2.0
  actions/setup-node@v4           -> 53b83947...  # v6.3.0   (major bump)
  actions/setup-python@v5         -> a309ff8b...  # v6.2.0   (major bump)
  actions/upload-artifact@v5      -> bbbca2dd...  # v7.0.0   (major bump)
  actions/upload-pages-artifact@v3 -> 7b1f4a76...  # v4.0.0   (major bump)
  codecov/codecov-action@v5       -> 671740ac...  # v5.5.2
  github/codeql-action/*@v4       -> 0d579ffd...  # v4.32.6
  pypa/gh-action-pypi-publish@... -> ed0c5393...  # v1.13.0

Major-version jumps are consistent with the policy ("latest release before
the cutoff") but carry breaking-change risk — reviewers should validate
each bump against the action's CHANGELOG before merge. In particular,
upload-artifact v4+ and download-artifact v4+ changed artifact immutability
semantics; the new versions may interact with the existing upload_artifacts
composite action in ways worth exercising under CI before unblocking.

Local composite action refs (./.github/actions/*) are unaffected —
they're first-party.
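
For reviewers who want to reason about the rewrite rule, here is a rough sketch of the per-line transformation. It is an illustration under assumptions, not the pin-gh-actions script itself: `pin_line`, the `resolved` mapping shape, and the regular expression are hypothetical, and quoted `uses:` values are not handled.

```python
import re

# Matches `uses: owner/repo[/subpath]@ref` with an optional trailing comment.
USES_RE = re.compile(r"^(\s*(?:-\s*)?uses:\s*)([^\s@#]+)@([^\s#]+)(\s*#.*)?$")
SKIP_ORGS = {"databricks", "databrickslabs"}  # skipped per policy


def pin_line(line, resolved):
    """Rewrite one workflow line to `@<sha> # <tag>`; return it unchanged otherwise.

    line     -- a single line without its trailing newline
    resolved -- dict mapping action name to (sha, tag); hypothetical shape
    """
    m = USES_RE.match(line)
    if not m:
        return line  # not a `uses:` line, or a local ref without `@`
    prefix, action, _old_ref, _old_comment = m.groups()
    if action.startswith("./"):  # local first-party composite action
        return line
    if action.split("/")[0] in SKIP_ORGS:
        return line
    if action not in resolved:
        return line  # nothing resolved for this action; leave it alone
    sha, tag = resolved[action]
    return f"{prefix}{action}@{sha} # {tag}"
```

Because any existing trailing comment is replaced rather than appended to, running the function on an already-pinned line with the same `resolved` mapping returns the same line, which is what makes a rewrite of this kind rerunnable.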

Co-authored-by: Isaac

…kflows

Databricks Labs Repository Lockdown policy requires any workflow using a
non-exempt secret (anything other than GITHUB_TOKEN or CODECOV_TOKEN) to
run inside a single protected GitHub Environment. GeoBrix uses
REPO_ACCESS_TOKEN (PAT fallback for private-repo checkout) across most
workflows, so every job that calls actions/checkout with that token now
sets `environment: runtime`.

Changes:
- Added `permissions: contents: read` at top level where missing
  (codeql-analysis, publish-maven, release) and removed stray top-level
  `id-token: write` from build_main / build_python / build_scala /
  build_scala_by_package / codecov-scala-parallel / codecov-upload
  (none of those jobs request OIDC tokens).
- deploy-docs: moved `pages: write` and `id-token: write` from top level
  down to the deploy job only (least privilege). The build job keeps
  `environment: runtime` for its REPO_ACCESS_TOKEN checkout; the deploy
  job keeps its existing `environment: github-pages`.
- doc-tests: added `environment: runtime` on all three (currently
  disabled) jobs that perform REPO_ACCESS_TOKEN checkouts, so they are
  compliant when re-enabled.
- release.yml: changed `environment: release` -> `environment: runtime`
  to converge on the single protected env the policy expects.
- release.yml + publish-maven.yml: DISABLED via `if: false` on their
  publish jobs with a banner comment explaining the policy context and
  how to re-enable. GeoBrix is not publishing to PyPI or GitHub Packages
  from Actions today; we will coordinate with Labs before re-enabling.

Exempt secrets per policy (GITHUB_TOKEN, CODECOV_TOKEN) are untouched
and do not require the protected environment.
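
The environment rule above lends itself to a simple lint. The sketch below is illustrative only and is not part of this PR: it assumes the workflow file has already been parsed into a dict by some YAML loader, and the function name and the regular expression are hypothetical.

```python
import re

EXEMPT = {"GITHUB_TOKEN", "CODECOV_TOKEN"}  # exempt secrets per the policy
SECRET_RE = re.compile(r"secrets\.([A-Za-z_][A-Za-z0-9_]*)")


def jobs_missing_environment(workflow, env_name="runtime"):
    """Return names of jobs that use a non-exempt secret outside `env_name`.

    workflow -- a workflow file already parsed into a dict (assumption)
    """
    missing = []
    for name, job in workflow.get("jobs", {}).items():
        # Searching the job's string form finds `secrets.X` at any nesting depth.
        used = set(SECRET_RE.findall(str(job))) - EXEMPT
        env = job.get("environment")
        if isinstance(env, dict):  # `environment: {name: ..., url: ...}` form
            env = env.get("name")
        if used and env != env_name:
            missing.append(name)
    return missing
```

A job that references only exempt secrets, or no secrets at all, is never reported, so a deploy job bound to `github-pages` passes as long as it does not read REPO_ACCESS_TOKEN.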

Co-authored-by: Isaac

Labs Repository Lockdown policy: every Dependabot ecosystem in the repo
must apply a cooldown so we are not the first adopters of a just-released
(possibly compromised) version. Applied `cooldown.default-days: 7` to both
maven and pip ecosystems.

The policy also excludes `github-actions` from Dependabot entirely — action
SHAs are refreshed manually via scripts/security/pin-gh-actions so bumps
are reviewed as part of the security workflow rather than as auto-opened
PRs. Added a comment documenting the intentional absence.
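
As a rough way to check the two rules above, the sketch below inspects a Dependabot config that has already been parsed into a dict. The function name and the message wording are hypothetical; the `updates`, `package-ecosystem`, and `cooldown.default-days` keys follow the Dependabot configuration format.

```python
def dependabot_problems(config, min_days=7):
    """Return a list of human-readable policy violations (empty if compliant)."""
    problems = []
    for entry in config.get("updates", []):
        eco = entry.get("package-ecosystem")
        if eco == "github-actions":
            # Action SHAs are refreshed manually, so this ecosystem should be absent.
            problems.append("github-actions: should be absent (SHAs are pinned manually)")
            continue
        days = (entry.get("cooldown") or {}).get("default-days", 0)
        if days < min_days:
            problems.append(f"{eco}: cooldown.default-days is {days}, expected >= {min_days}")
    return problems
```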

Co-authored-by: Isaac

Databricks Labs Repository Lockdown policy requires all build-time binary
fetches to be integrity-verified and all base images to be pinned by
digest so a compromised registry/mirror cannot silently swap bytes.

Dockerfile changes:
- Pinned `FROM ubuntu:24.04` to the multi-arch manifest-list digest
  `sha256:c4a8d5503dfb2a3eb8ab5f807da5bc69a85730fb49b5cfca2330194ebcc41c7b`
  (kept `# ubuntu:24.04` comment for human readability).
- Hadoop 3.4.0 tarball: replaced `wget | tar` stream with
  download -> sha512sum -c -> extract, using the official
  HADOOP_SHA512 from downloads.apache.org/.sha512.
- GDAL 3.11.4 tarball: same pattern with a locally-computed SHA-256.
  OSGeo only publishes MD5; we MD5-verified the upstream download
  (9f4fa4b3be48fb60d5dd76fecb11a5f6) then computed and pinned SHA-256.
- Apache Maven 3.9.9: replaced the dynamic `.sha512` fetch (which reads
  the checksum from the same origin as the tarball and therefore provides
  no protection against origin compromise) with an in-Dockerfile pinned
  MAVEN_SHA512 ARG, cross-checked against archive.apache.org.

scripts/util/install_hadoop.sh:
- Not referenced by the build; kept as a manual mirror of the Dockerfile
  flow. Rewrote with `set -euo pipefail`, a pinned HADOOP_SHA512, and
  `sha512sum -c` verification. Made executable.

Each checksum has a matching comment documenting the authoritative source
and the requirement to bump it in lockstep with the underlying version.
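
The download -> verify -> extract pattern hinges on comparing the file against a digest pinned in the repo rather than one fetched alongside the tarball. The Dockerfile does this with `sha512sum -c`; the Python sketch below shows the same idea for illustration only, and `verify_digest` is a hypothetical helper, not code from this PR.

```python
import hashlib
import hmac


def verify_digest(path, expected_hex, algo="sha512", chunk_size=1 << 20):
    """Raise ValueError unless the file at `path` hashes to `expected_hex`."""
    h = hashlib.new(algo)
    with open(path, "rb") as f:
        # Read in chunks so large tarballs are not loaded into memory at once.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    actual = h.hexdigest()
    if not hmac.compare_digest(actual, expected_hex.strip().lower()):
        raise ValueError(f"{algo} mismatch for {path}: got {actual}")
```

Extraction should only run after this call returns without raising; the expected digest would live next to the version number so that the two are bumped together.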

Co-authored-by: Isaac

@mjohns-databricks
Collaborator Author

Notes:

  1. This project was still in progress when the recent policy changes were imposed, so it still needs finalization from your team before the official launch. For example, CODECOV_TOKEN has never been properly configured.
  2. release.yml and publish-maven.yml are disabled via if: false, with banner comments and re-enable instructions (we are not publishing to PyPI / GitHub Packages from Actions today). Do you want these files deleted entirely instead?
