feat: migrate vault to Go binary runevault with multi-CSP installer #68
Open
jh-lee-cryptolab wants to merge 26 commits into main
Wraps tail (macOS) or journalctl (Linux) with optional -f flag. Log path is derived from config source on macOS; Linux delegates to journald.
- Add --target <local|aws|gcp|oci> flag and interactive target menu
- Add --install-dir flag for CSP install directory override
- Add CSP dispatch functions: resolve_target, csp_preflight,
csp_prompt_config, csp_generate_ssh_key, csp_copy_terraform_files,
csp_render_tfvars, csp_run_terraform, csp_post_deploy, csp_summary
- Add RUNEVAULT_TLS_HOSTNAME support in generate_tls_certs() as DNS SAN
- Remove team_secret from operator→VM flow; VM auto-generates it
- Remove RUNEVAULT_TEAM_SECRET env var from user-facing interface
- Rewrite deployment/{aws,gcp,oci} cloud-init/startup-script files to
use Go-native install.sh instead of Docker compose
- Add runevault_version variable to deployment/{aws,gcp,oci}/main.tf
- Remove team_secret variable and output from all three main.tf files
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Remove separate vault_index_name and tls_hostname variables — team_name now serves as both the cloud resource name and the vault index passed to the VM via RUNEVAULT_TEAM_NAME. Fixes runtime crash after region prompt caused by VAULT_INDEX_NAME validation on an unset variable. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
csp_prompt_config ended with [[ "$csp" = oci ]] && { ... }, so on AWS
the function's last command returned non-zero. With set -e, the calling
csp_prompt_config invocation in csp_dispatch then killed the script
silently right after the AWS region prompt. Rewrite the GCP/OCI checks
as if-statements; apply the same fix to setup_system.
Also slim csp_preflight to terraform-only with a y/N auto-install prompt
mirroring local preflight, and add terraform install support to
_install_tool (brew on macOS, HashiCorp zip on Linux).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
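The `set -e` failure described in the csp_prompt_config fix above can be reproduced in isolation. The sketch below is a minimal illustration, not the installer's code: `prompt_config_buggy` and `prompt_config_fixed` are hypothetical stand-ins for the two shapes of the function, and POSIX `[ ... ]` is used in place of `[[ ... ]]` because both yield the same exit status here.

```shell
# Hypothetical stand-ins for the two shapes of csp_prompt_config.
prompt_config_buggy() {
  # For any CSP other than oci the test fails, the && list short-circuits,
  # and that status (1) becomes the function's return value.
  [ "$1" = oci ] && { echo "oci-only prompt"; }
}

prompt_config_fixed() {
  # An if-statement whose condition is false (and has no else) returns 0.
  if [ "$1" = oci ]; then
    echo "oci-only prompt"
  fi
}

# Capture the return codes inside `if`, where set -e is never triggered,
# so this demo cannot kill the shell that runs it.
if prompt_config_buggy aws; then buggy_status=0; else buggy_status=$?; fi
if prompt_config_fixed aws; then fixed_status=0; else fixed_status=$?; fi

echo "buggy=$buggy_status fixed=$fixed_status"   # prints: buggy=1 fixed=0
```

Under `set -e`, a bare call such as `prompt_config_buggy aws` returns 1 and ends the script at that line with no message, which matches the silent exit after the AWS region prompt.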
Previously csp_post_deploy waited for port 50051 (10 min) plus a fixed 30s sleep, then made 6 short SCP attempts. If the VM-side install was slow or the cert hadn't been generated yet, SCP failed and the script fell back to a useless "retry the same SCP" warning. Replace the port wait + fixed sleep with a single SCP polling loop that retries every 15s for up to 30 min. SCP succeeds the moment the VM has generated /opt/runevault/certs/ca.pem, which is the precise signal we care about. On timeout, die with a pointer to the VM-side install log. Also pick ssh_user from $csp instead of trying ubuntu/opc in turn. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
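The single polling loop described above could look roughly like the following. This is a sketch under assumptions: `poll_until` is a hypothetical helper name, and the real csp_post_deploy may be structured differently. The demo uses a counter instead of SCP so it runs without a VM.

```shell
# poll_until TIMEOUT_SECS INTERVAL_SECS CMD [ARGS...]   (hypothetical helper)
# Re-runs CMD until it succeeds or roughly TIMEOUT_SECS have been slept.
# Time spent inside CMD itself is not counted, so the bound is approximate.
poll_until() {
  local timeout_secs="$1" interval_secs="$2" elapsed=0
  shift 2
  while ! "$@"; do
    if [ "$elapsed" -ge "$timeout_secs" ]; then
      return 1
    fi
    sleep "$interval_secs"
    elapsed=$((elapsed + interval_secs))
  done
  return 0
}

# Demo: a probe that only succeeds on its third attempt.
demo_attempts=0
ready_on_third_try() {
  demo_attempts=$((demo_attempts + 1))
  [ "$demo_attempts" -ge 3 ]
}

if poll_until 10 0 ready_on_third_try; then
  echo "ready after $demo_attempts attempts"   # prints: ready after 3 attempts
fi

# In the installer's terms (ssh_key and vm_ip are placeholder variables):
#   poll_until 1800 15 scp -i "$ssh_key" \
#     "ubuntu@$vm_ip:/opt/runevault/certs/ca.pem" ./ca.pem
```

Polling on the SCP itself means success is tied to the file that is actually needed, rather than to an open port that says nothing about whether the certificate exists yet.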
The VM-side install.sh runs as root via cloud-init, so SUDO_USER is empty and _add_invoking_user_to_group is a no-op — the canonical SSH user (ubuntu on all three CSPs) ends up outside the runevault group and can't reach /opt/runevault/admin.sock or /opt/runevault/certs. Add an explicit "usermod -aG runevault ubuntu" right after install.sh in each cloud-init / startup-script. Drop the auto-detect fallback in install.sh now that cloud-init owns this responsibility for cloud deploys; local installs still pick up SUDO_USER as before. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… summary team_secret no longer surfaces in operator-facing output (auto-generated on the VM, not relevant to share). Replace that section with a Next steps block that SSHes into the VM and runs runevault commands there, mirroring the local install's Next steps. Same block for all three CSPs since AWS/GCP/OCI all use Ubuntu 22.04 + systemd. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Terraform's default credential chain is satisfied by the cloud CLI's auth artifacts in practice (~/.aws/credentials, gcloud ADC file, ~/.oci/config), so verifying the CLI is installed and authenticated catches the most common "terraform apply silently fails" cause early. Run "<cli> <non-destructive-auth-call>" as $SUDO_USER (the user that csp_run_terraform will run terraform under) so the check matches the actual credential resolution path:
- aws sts get-caller-identity
- gcloud auth application-default print-access-token
- oci iam region list
Also point both local install and CSP summary to "runevault logs" for the View logs hint, replacing the per-OS journalctl/tail snippets now that the CLI provides a unified entrypoint.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
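A per-CSP auth probe along the lines of this commit might be dispatched as follows. The three CLI commands come from the commit message; the function names `csp_auth_check_cmd` and `csp_auth_check` are hypothetical, and the actual preflight code may differ.

```shell
# csp_auth_check_cmd CSP   (hypothetical)
# Prints the non-destructive auth probe for CSP; returns 1 if CSP is unknown.
csp_auth_check_cmd() {
  case "$1" in
    aws) echo "aws sts get-caller-identity" ;;
    gcp) echo "gcloud auth application-default print-access-token" ;;
    oci) echo "oci iam region list" ;;
    *)   return 1 ;;
  esac
}

# csp_auth_check CSP   (hypothetical)
# Runs the probe as the invoking user when under sudo, so the same
# credential files are consulted that terraform will later read.
csp_auth_check() {
  local cmd
  cmd=$(csp_auth_check_cmd "$1") || return 2
  if [ -n "${SUDO_USER:-}" ]; then
    sudo -u "$SUDO_USER" -H sh -c "$cmd" >/dev/null 2>&1
  else
    sh -c "$cmd" >/dev/null 2>&1
  fi
}

# Show which probe each target maps to (does not call any cloud CLI).
for target in aws gcp oci; do
  echo "$target: $(csp_auth_check_cmd "$target")"
done
```

Keeping the command lookup separate from its execution makes the mapping easy to test without any cloud CLI installed.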
Update AMI/image filters across all three CSPs:
- AWS: ubuntu-noble-24.04 on hvm-ssd-gp3
- GCP: ubuntu-2404-lts-amd64
- OCI: Canonical-Ubuntu-24.04-* filtered by VM.Standard.E5.Flex compatibility
The OCI bump also fixes the launch failure on ap-seoul-1 where the 22.04 image wasn't compatible with the E5.Flex shape.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
OCI launches the Canonical Ubuntu cloud image, whose default SSH user is 'ubuntu' (the 'opc' default belongs to Oracle Linux images, which we don't deploy). The csp_post_deploy SCP polling loop was hard-coded to opc for OCI, so it would loop forever connecting as a non-existent user. Use 'ubuntu' for all three CSPs. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
run_uninstall now dispatches by --target. Local keeps the existing service + files + data flow; CSP targets call the new csp_uninstall, which runs terraform destroy against the install dir's terraform.tfstate and optionally removes the directory afterwards. Operators no longer have to manually cd into the install dir to tear down cloud infrastructure — --uninstall --target aws|gcp|oci is enough. Also switches the interactive target prompt label between "installation" and "uninstall" to match the active flow, and reorders main so target is resolved before the uninstall dispatch. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- install-dev.sh prompts for enVector endpoint and API key on
interactive local installs (previously silently used placeholder
defaults, producing a non-functional vault on every dev local run).
- install-dev.sh forwards --uninstall to install.sh for both local and
CSP targets, skipping the dev preflight/build path. The CSP variant
reuses install.sh's new csp_uninstall (terraform destroy) wrapper.
- Add dev cloud-init / startup-script variants that only install
prereqs (cosign + apt packages); install.sh + the locally built
binary are SCP'd in by install-dev.sh after cloud-init finishes.
AWS variant escapes \${carch} as \$\${carch} so terraform's
templatefile() leaves the shell expansion intact.
- Switch resolve_target label between "install" and "uninstall" to
match the active flow.
- Default team_name changed from "dev-team" to "devteam" because
vault index names cannot contain hyphens.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Realign README, CONTRIBUTING, ARCHITECTURE, AGENTS, and tests/FIXTURES with the actual state of epic/go-migration: single-binary `runevault` daemon, native systemd/launchd service, admin Unix domain socket, YAML-only config (`runevault.conf`), install.sh with --target local|aws|gcp|oci, scripts/install-dev.sh sibling, Sigstore-signed release pipeline, and Ubuntu 24.04 cloud images. Drop stale references to Python modules, Docker Compose / GHCR, port-8081 HTTP admin, env-var fallback, and aspirational HA / restore-from-backup terraform variables that don't exist. Add an [Unreleased] CHANGELOG section capturing the BREAKING Go rewrite plus the installer, CSP, and release pipeline work landed on this branch.
Per prior decision that cosign-based hashsum verification was operationally
heavy, remove every cosign/Sigstore dependency:
- install.sh: drop cosign from preflight tools, _install_tool case, and
Phase 2/3 verify-blob block. Stop downloading SHA256SUMS.{sig,pem}.
RUNEVAULT_SKIP_VERIFY now toggles the SHA256SUMS check.
- release.yaml: drop sigstore/cosign-installer step and cosign sign-blob
step. Release artifacts are now <archive> + SHA256SUMS only.
- deployment/{aws,gcp,oci}/cloud-init.yaml + startup-script.sh: drop the
cosign download (no longer needed since install.sh doesn't use it).
- deployment/{aws,gcp,oci}/cloud-init-dev.yaml + startup-script-dev.sh:
replace the cosign-as-sentinel pattern with a plain
/var/run/runevault-dev-ready file. install-dev.sh polls the new sentinel.
- install-dev.sh: drop RUNEVAULT_SKIP_VERIFY=1 (LOCAL_BINARY already
short-circuits download_and_verify so verification never runs in dev).
- README, CONTRIBUTING, CHANGELOG, ARCHITECTURE: replace
Sigstore/signature-verification language with checksum verification.
SHA256SUMS integrity now relies on GitHub HTTPS for the release-page download.
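With cosign removed, verification reduces to a checksum comparison against the manifest. The sketch below shows one way that step could look; `verify_release_dir` is a hypothetical helper, GNU `sha256sum` is assumed, and treating `RUNEVAULT_SKIP_VERIFY=1` as the skip value is inferred from the install-dev.sh bullet above rather than confirmed.

```shell
# verify_release_dir DIR   (hypothetical helper; assumes GNU sha256sum)
verify_release_dir() {
  local dir="$1"
  if [ "${RUNEVAULT_SKIP_VERIFY:-0}" = "1" ]; then
    echo "warning: SHA256SUMS check skipped" >&2
    return 0
  fi
  # --ignore-missing tolerates manifest entries for archives that were not
  # downloaded; --status reports the result through the exit code only.
  ( cd "$dir" && sha256sum --check --ignore-missing --status SHA256SUMS )
}

# Demo with a throwaway archive and a manifest generated on the spot.
demo_dir=$(mktemp -d)
printf 'binary-bytes' > "$demo_dir/demo.tar.gz"
( cd "$demo_dir" && sha256sum demo.tar.gz > SHA256SUMS )

if verify_release_dir "$demo_dir"; then intact="ok"; else intact="mismatch"; fi

# Corrupt the archive and verify again.
printf 'tampered' > "$demo_dir/demo.tar.gz"
if verify_release_dir "$demo_dir"; then tampered="ok"; else tampered="mismatch"; fi

rm -rf "$demo_dir"
echo "intact=$intact tampered=$tampered"   # prints: intact=ok tampered=mismatch
```

A checksum manifest fetched from the same origin as the archive detects corruption in transit but not a compromised release page, which is the trade-off the last line of the commit message acknowledges.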
Author comment: you can test the installation from the pre-release version at https://github.com/CryptoLabInc/rune-admin/releases/tag/v0.4.0-beta.1
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
checkSecretMode was warning-only, which let runevault.conf, api_key_file, and team_secret_file slip through with world-readable bits set. Convert the helper to return an error and propagate it through LoadConfig and readSecretFile so the daemon refuses to start when any secret file is looser than 0640. Tests cover both the main config and the team_secret_file indirection paths. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
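The rule itself can be illustrated outside the daemon. The following is a shell analogue, not the Go helper: `secret_mode_ok` is a hypothetical name, and "looser than 0640" is read here as "any permission bit outside 0640 is set", which is an assumption about what checkSecretMode enforces.

```shell
# secret_mode_ok FILE   (hypothetical; mirrors the rule, not the Go code)
# Succeeds when FILE has no permission bits outside 0640.
secret_mode_ok() {
  local mode
  # GNU stat first, BSD/macOS stat as a fallback.
  mode=$(stat -c '%a' "$1" 2>/dev/null || stat -f '%Lp' "$1" 2>/dev/null) || return 2
  # 0137 is the complement of 0640 within the nine permission bits:
  # owner execute, group write/execute, and every "other" bit.
  [ $(( 0$mode & 0137 )) -eq 0 ]
}

demo_secret=$(mktemp)

chmod 600 "$demo_secret"
if secret_mode_ok "$demo_secret"; then strict="accepted"; else strict="refused"; fi

chmod 644 "$demo_secret"
if secret_mode_ok "$demo_secret"; then loose="accepted"; else loose="refused"; fi

rm -f "$demo_secret"
echo "0600=$strict 0644=$loose"   # prints: 0600=accepted 0644=refused
```

The same check can be handy as an operator-side preflight before starting the service, since the daemon now refuses to start instead of warning.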
DecryptScores and DecryptMetadata previously returned the in-band .Error field but a nil gRPC status on several paths (base64 decode, FHE decrypt, JSON envelope decode, DEK derivation, metadata decrypt, and the missing-team-secret guard). Clients that key on standard gRPC codes silently missed those failures. Map each path to InvalidArgument or Internal so the wire-level status is consistent with the rest of the handler set.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Daemon lifecycle belongs to the OS service manager (systemd / launchd), not to the admin socket. The endpoints duplicated systemctl and launchctl, and the runevault-group permission model meant any group member could trigger a process kill — too broad for a control plane. Remove POST /shutdown and POST /restart from buildAdminMux, drop the matching onShutdown plumbing in AdminFromConfig, delete Vault.RequestRestart / RestartRequested, ErrRestartRequested, and the restart-aware exit branch in main. Operators stop and restart via systemctl / launchctl, which the install scripts already document. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes #61 (Phase 1 — Go runtime migration), #63 (Phase 2 — multi-platform release artifacts), #64 (Phase 3 — Docker-free installer).
Context
Vault was a Python+Docker stack distributed via GHCR; ops was fragile and required runtime deps. This branch ships a Go binary with a one-command installer for local and AWS/GCP/OCI deploys.
TL;DR
Rewrite vault as a single-binary Go daemon `runevault` with a one-command installer for local and AWS/GCP/OCI deploys.
Summary
- `runevault` (Epic #61) replacing the Python vault: single binary, no runtime deps beyond TLS
- `SHA256SUMS` checksum manifest
- `install.sh` (Epic #64) with `--target local|aws|gcp|oci`, SHA256SUMS verification, and `--uninstall`
- `scripts/install-dev.sh`: local Go build for `local`, Docker cross-compile + SCP for CSP targets
- systemd (`runevault.service`) and launchd (`com.cryptolabinc.runevault`) service registration
- `runevault` group lets members run the CLI without sudo
- YAML-only config in `runevault.conf`; env-var fallback removed (BREAKING per Epic #61)
- Ubuntu 24.04 LTS cloud images
Alternatives
- `installer-cli`: rejected; shell + terraform is closer to what cloud admins already read and audit.
Test plan
- `mise run check` passes (gofmt + go vet + unit tests with race)
- `mise run go:build` produces `vault/bin/runevault`
- `mise run go:test:e2e` passes against the built binary
- `sudo bash install.sh --target local` succeeds; `runevault status` returns SERVING
- `sudo bash install.sh --uninstall --target local` cleanly removes service and files
- CSP uninstall (`install.sh --uninstall --target <csp> --install-dir ...`) runs `terraform destroy`
- `runevault token issue|rotate|revoke|list` work through the admin socket
- `/opt/runevault/logs/audit.log`
- `sha256sum --check --ignore-missing SHA256SUMS` passes against the release artifacts