diff --git a/content/posts/vault-migration/index.md b/content/posts/vault-migration/index.md
index 9d77489..12d369b 100644
--- a/content/posts/vault-migration/index.md
+++ b/content/posts/vault-migration/index.md
@@ -1,7 +1,7 @@
 ---
 title: "Migrating HashiCorp Vault Between AKS Clusters"
-date: 2025-03-18
-author: "Max Leske (Xovis) and Mathis Kretz (bespinian)"
+date: 2026-01-01
+author: "Max Leske (Xovis, @theseion) and Mathis Kretz (bespinian)"
 tags: ["Kubernetes", "Vault", "Azure", "Cloud Engineering"]
 categories: ["Engineering", "Cloud Native"]
 ---
@@ -11,11 +11,11 @@ categories: ["Engineering", "Cloud Native"]
 ## Introduction

 At [Xovis](https://xovis.com) and bespinian, we recently faced the challenge of
-migrating a Kubernetes-based HashiCorp Vault instance from one Azure Kubernetes
+migrating a HashiCorp Vault cluster from one Azure Kubernetes
 Service (AKS) cluster to another. The goal was to consolidate infrastructure and
 reduce the number of AKS clusters that required maintenance. The outcome of a
 somewhat arduous journey is a Bash script that has enabled us to test the
-migration repeatedly and then perform the migration with a outage of less than 2
+migration repeatedly and then perform the migration with an outage of less than 2
 minutes.

 This blog post walks through our approach and presents the migration steps in 14
@@ -38,152 +38,173 @@ we needed to manage. Our primary goals were:

 ## Migration Strategy

-Our Vault instances are running in a high-availability configuration using the
-integrated Raft storage backend and auto-unseal with Azure Key Vault. This setup
-is needed in order to ensure resilience and security across client environments.
-To migrate Vault between AKS clusters, we implemented the process with a Bash
-script that automated all steps including snapshot creation, data restoration,
-and unseal key migration.
+We implemented the entire process of migrating the original Vault instance to the
+new AKS cluster with a Bash script that automated all steps including snapshot
+creation, auto-unseal key migration, and data restoration to the new Vault instance.\
+We wrote an additional Bash script to handle restoration of the original Vault
+instance, in case of unexpected issues.
+
+Before starting the actual migration, we tested the migration several times between
+two non-production clusters.

 ### Pre-Migration Setup

-Before initiating the migration:
+Before the migration script can be used, the target cluster must be prepared with
+an empty Vault instance with the same configuration as the original Vault instance.
+Setting up this empty Vault instance isn't covered in this post. However, as part
+of this work we also automated setting up new Vault instances, so that we could
+iterate on testing as fast as possible, with minimal manual intervention. In general,
+before the start of a migration run, the following work needs to be completed:

 1. **Provision the new Vault cluster**: Set up a fresh Vault instance on the
    target AKS cluster with the same configuration as the existing one.
 2. **Configure networking and authentication**: Ensure that the new instance has
    the necessary access permissions and that external clients will be able to
    connect post-migration.
-3. **Test the migration in a staging environment**: Migrate a test Vault
-   instance within between two non-production clusters to verify the process.

 ## Migration Process

-The migration script consists of the steps outlined below.
+In the following sections, we describe each individual step in the script.
+
+Uppercase variables like `OLD_K8S_NAME` refer to data that was configured
+as part of the script invocation, e.g., with `--old-k8s-name`, or data that
+was created in one step and needs to be available in another step.
+Lowercase variables like `required_tools` are local variables and are only read in the
+current step.
+
+The script snippets use some functions that are not explained further. Refer
+to the actual script to see what they do exactly.
+
+Note that the script snippets shown have been adapted to fit the blog post
+and can differ slightly from the actual script.

 ### 1. Prepare the Migration

-Before performing the actual migration, it's crucial to validate your
+Before performing the actual migration, it's crucial to validate the
 environment and ensure all prerequisites are met. This step includes setting up
 local backups, verifying Kubernetes contexts, ensuring proper access
 permissions, and preparing the user for the maintenance window.

-- Create a local backup directory:
+1. Create a local backup directory:

-  ```bash
-  mkdir -p vault-backups
-  ```
+   ```bash
+   mkdir -p vault-backups
+   ```

-- Check that the required tools are available and that `kubectl wait` is
-  supported
+2. Check that the required tools are available and that `kubectl wait` is
+   supported:

-  ```bash
-  REQUIRED_TOOLS=(az kubectl jq ed dig)
-
-  for tool in "${REQUIRED_TOOLS[@]}"; do
-    if ! command -v "$tool" >/dev/null 2>&1; then
-      print_message "Required tool '$tool' is not installed or not in PATH."
-      MISSING=1
-    fi
-  done
-
-  if ! kubectl wait --help | grep -q 'Wait for a specific condition'; then
-    print_message "'kubectl wait' is not supported."
-    MISSING=1
-  fi
+   ```bash
+   required_tools=(az kubectl jq ed dig)

-  if [ -n "$MISSING" ]; then
-    print_message "Please install the missing tools."
-    exit 1
-  fi
-  ```
+   for tool in "${required_tools[@]}"; do
+     if ! command -v "${tool}" >/dev/null 2>&1; then
+       print_message "Required tool '${tool}' is not installed or not on PATH."
+       missing=1
+     fi
+   done

-- Check that the `kubectl` contexts for both clusters are available
+   if ! kubectl wait --help | grep -q 'Wait for a specific condition'; then
+     print_message "'kubectl wait' is not supported."
+     missing=1
+   fi

-  ```bash
-  if ! (kubectl config get-contexts "${OLD_K8S_NAME}" > /dev/null 2>&1 && \
-    kubectl config get-contexts "${NEW_K8S_NAME}" > /dev/null 2>&1); then
-    print_message "No contexts for ${OLD_K8S_NAME} and ${NEW_K8S_NAME}."
-    print_message "Make sure you have credentials for both clusters"
-    exit 1
-  fi
-  ```
+   if [ -n "${missing}" ]; then
+     print_message "Please install the missing tools."
+     exit 1
+   fi
+   ```

-- Check that both `kubectl` contexts are accessible
+3. Check that the `kubectl` contexts for both clusters are available. You may have
+   to run `az aks get-credentials` to set up the required contexts for `kubectl`.

-  ```bash
-  switch_context "${OLD_K8S_NAME}" "${OLD_VAULT_NAME}"
-  if ! kubectl get cm > /dev/null 2>&1; then
-    print_message "Can't access resources on ${OLD_K8S_NAME}. Aborting"
-    exit 1
-  fi
+   ```bash
+   if ! (kubectl config get-contexts "${OLD_K8S_NAME}" > /dev/null 2>&1 && \
+     kubectl config get-contexts "${NEW_K8S_NAME}" > /dev/null 2>&1); then
+     print_message "No contexts for ${OLD_K8S_NAME} and ${NEW_K8S_NAME}."
+     print_message "Make sure you have credentials for both clusters"
+     exit 1
+   fi
+   ```

-  switch_context "${NEW_K8S_NAME}" "${NEW_VAULT_NAME}"
-  if ! kubectl get cm > /dev/null 2>&1; then
-    print_message "Can't access resources on ${NEW_K8S_NAME}. Aborting"
-    exit 1
-  fi
-  ```
+4. Check that both `kubectl` contexts are accessible:

-- Check that the new Vault cluster has three replicas
+   ```bash
+   switch_context "${OLD_K8S_NAME}" "${OLD_VAULT_NAME}"
+   if ! kubectl get cm > /dev/null 2>&1; then
+     print_message "Can't access resources on ${OLD_K8S_NAME}. Aborting"
+     exit 1
+   fi

-  ```bash
-  availableReplicas=$(kubectl get statefulsets.apps "${NEW_VAULT_NAME}" \
-    -o template --template="{{.status.availableReplicas}}")
-  if [ "${availableReplicas}" -ge 3 ]; then
-    print_message "Expected number of replicas found on ${NEW_K8S_NAME}"
-  else
-    print_message "Unexpected replica count on ${NEW_K8S_NAME}. Exiting."
-    exit 1
-  fi
-  ```
-
-- Our auto-unseal setup using Azure Key Vault is based on the AKS cluster's
-  Managed Service Identity. We thus need to check that the Managed Service
-  Identity of the new Vault cluster has access to the unseal keys of both Vault
-  instances, since both sets of keys will be rquired during the migration.
-
-  ```bash
-  az account set -s "${NEW_VAULT_SUBSCRIPTION_NAME}"
-
-  msi_principal_id=$(az identity list \
-    -g "${NEW_VAULT_RESOURCE_GROUP}" \
-    --query "[?clientId == '${NEW_VAULT_MSI_CLIENT_ID}'].principalId" \
-    | jq -r '.[]')
-
-  key_vault_id=$(az keyvault show --name "${NEW_KEY_VAULT_NAME}" \
-    --query "id" | jq -r '.')
-  if
-    ! az role assignment list \
-      --scope "${key_vault_id}" \
-      --role "Key Vault Crypto Service Encryption User" \
-      --query "[?principalId == '${msi_principal_id}']" > /dev/null 2>&1; then
-    print_message "Role assigment missing for new vault MSI"
-    print_message "Assign role 'Key Vault Crypto Service Encryption User'"
-    exit 1
-  fi
+   switch_context "${NEW_K8S_NAME}" "${NEW_VAULT_NAME}"
+   if ! kubectl get cm > /dev/null 2>&1; then
+     print_message "Can't access resources on ${NEW_K8S_NAME}. Aborting"
+     exit 1
+   fi
+   ```
+
+5. Check that the new Vault cluster has the number of desired replicas,
+   as specified when the script was called:
+
+   ```bash
+   availableReplicas=$(kubectl get statefulsets.apps "${NEW_VAULT_NAME}" \
+     -o template \
+     --template="{{.status.availableReplicas}}")
+   if [ "${availableReplicas}" -ge "${EXPECTED_REPLICAS}" ]; then
+     print_message "Expected number of replicas found on ${NEW_K8S_NAME}"
+   else
+     print_message "Unexpected replica count on ${NEW_K8S_NAME}: ${availableReplicas}, expected ${EXPECTED_REPLICAS}. Exiting."
+     exit 1
+   fi
+   ```
+
+6. Our auto-unseal setup using Azure Key Vault is based on the AKS cluster's
+   Managed Service Identity.
+   We thus need to check that the Managed Service
+   Identity of the new Vault cluster has access to the recovery keys of both Vault
+   instances, since both sets of keys will be required during the migration.
+
+   ```bash
+   az account set -s "${NEW_VAULT_SUBSCRIPTION_NAME}"
+
+   msi_principal_id=$(az identity list \
+     -g "${NEW_VAULT_RESOURCE_GROUP}" \
+     --query "[?clientId == '${NEW_VAULT_MSI_CLIENT_ID}'].principalId" \
+     | jq -r '.[]')
+
+   key_vault_id=$(az keyvault show \
+     --name "${NEW_KEY_VAULT_NAME}" \
+     --query "id" | jq -r '.')
+   if
+     ! az role assignment list \
+       --scope "${key_vault_id}" \
+       --role "Key Vault Crypto Service Encryption User" \
+       --query "[?principalId == '${msi_principal_id}']" > /dev/null 2>&1; then
+     print_message "Role assignment missing for new vault MSI"
+     print_message "Assign role 'Key Vault Crypto Service Encryption User'"
+     exit 1
+   fi

-  az account set -s "${OLD_VAULT_SUBSCRIPTION_NAME}"
-
-  key_vault_id=$(az keyvault show --name "${OLD_KEY_VAULT_NAME}" \
-    --query "id" | jq -r '.')
-  if
-    ! az role assignment list \
-      --scope "${key_vault_id}" \
-      --role "Key Vault Crypto Service Encryption User" \
-      --query "[?principalId == '${msi_principal_id}']" > /dev/null 2>&1; then
-    print_message "Role assigment missing for new vault MSI"
-    print_message "Assign role 'Key Vault Crypto Service Encryption User'"
-    exit 1
-  fi
-  ```
+   az account set -s "${OLD_VAULT_SUBSCRIPTION_NAME}"
+
+   key_vault_id=$(az keyvault show --name "${OLD_KEY_VAULT_NAME}" \
+     --query "id" | jq -r '.')
+   if
+     ! az role assignment list \
+       --scope "${key_vault_id}" \
+       --role "Key Vault Crypto Service Encryption User" \
+       --query "[?principalId == '${msi_principal_id}']" > /dev/null 2>&1; then
+     print_message "Role assignment missing for new vault MSI"
+     print_message "Assign role 'Key Vault Crypto Service Encryption User'"
+     exit 1
+   fi
+   ```

-- Remind the user to now announce the start of the maintenance window
+7.
+   Remind the user to now announce the start of the maintenance window:

-  ```bash
-  print_message "Please announce the start of the maintenance window."
-  wait_for_any_key
-  ```
+   ```bash
+   print_message "Please announce the start of the maintenance window."
+   wait_for_any_key
+   ```

 ### 2. Create temporary migration tokens

@@ -192,241 +213,247 @@ Vault tokens with limited permissions. These tokens allow access to specific
 endpoints like snapshot creation, restoration, sealing, and leadership
 operations, which aren't granted to the existing policies.

-- Write policies for authorizing the migration operations
-
-  ```hcl
-  # Policy allowing to step down Vault leader
-  path "sys/step-down" {
-    capabilities = ["update", "sudo"]
-  }
-  # Policy allowing to save snapshots
-  path "sys/storage/raft/snapshot" {
-    capabilities = [ "create", "read", "update", "list" ]
-  }
-  # Policy allowing to restore vault's snapshots
-  path "sys/storage/raft/snapshot-force" {
-    capabilities = [ "create", "read", "update", "list" ]
-  }
-  ```
+1.
+   The following policies are required by the migration tokens and are stored in
+   a dedicated file:

-- Detect the index of the leader pod in each Vault instance
-
-  ```bash
-  kubectl get pod -l "vault-active=true" -o jsonpath \
-    --template "{.items[0].metadata.labels.apps\.kubernetes\.io/pod-index}"
-  ```
-
-- Create tokens for each vault instances
-
-  ```bash
-  switch_context "${cluster_name}" "${vault_name}" > /dev/null
-  leader_index="$(detect_vault_leader_index)" > /dev/null
-
-  kubectl cp migration-policy.hcl "${vault_name}-${leader_index}":/tmp > /dev/null
-
-  kubectl exec -it "${vault_name}-${leader_index}" -- \
-    vault login "${admin_token}" > /dev/null
-
-  kubectl exec -it "${vault_name}-${leader_index}" -- \
-    sh -c "cat /tmp/migration-policy.hcl | vault policy write migration -" \
-    > /dev/null
+   ```hcl
+   # Policy allowing to step down Vault leader
+   path "sys/step-down" {
+     capabilities = ["update", "sudo"]
+   }
+   # Policy allowing to save snapshots
+   path "sys/storage/raft/snapshot" {
+     capabilities = [ "create", "read", "update", "list" ]
+   }
+   # Policy allowing to restore vault's snapshots
+   path "sys/storage/raft/snapshot-force" {
+     capabilities = [ "create", "read", "update", "list" ]
+   }
+   ```

-  kubectl exec -it "${vault_name}-${leader_index}" -- \
-    vault token create -policy=migration -period=30m -format json \
-    | jq -r '.auth.client_token'
+2. Create tokens for each vault instance, using the policies shown above:

-  kubectl exec -it "${vault_name}-${leader_index}" -- \
-    rm /tmp/migration-policy.hcl > /dev/null
-  ```
+   ```bash
+   NEW_VAULT_MIGRATION_TOKEN="$(create_migration_token "${NEW_K8S_NAME}" "${NEW_VAULT_NAME}" "${NEW_VAULT_ADMIN_TOKEN}")"
+   OLD_VAULT_MIGRATION_TOKEN="$(create_migration_token "${OLD_K8S_NAME}" "${OLD_VAULT_NAME}" "${OLD_VAULT_ADMIN_TOKEN}")"
+   ```

 ### 3. Block access to the old Vault instance

-Before creating the snapshot, it’s important to prevent any writes to the old
-Vault instance.
-This step ensures no data is changed after the snapshot and
-avoids issues with lease revocation.
-
-- Backup and delete the ingress for the old Vault instance
+We are now ready to actually perform the migration. Before creating the snapshot
+of the data in the old Vault instance, it’s important to prevent any further
+writes to the old Vault instance. This step ensures no data is changed after the
+snapshot has been created.

-  ```bash
-  kubectl get ing vault -o yaml > "${BACKUPS_DIR}/${OLD_VAULT_INGRESS_FILENAME}"
+Back up and delete the ingress resource for the old Vault instance:

-  kubectl delete ing vault --wait=true
-  ```
+```bash
+kubectl delete ing vault --wait=true
+```

 ### 4. Create snapshot of the old Vault instance

 We now take a snapshot of the Vault data using the Raft backend's built-in
 snapshot mechanism. After the snapshot is saved and downloaded, we immediately
-disable the old Vault to avoid unintended behavior.
-
-- Log in with the migration token
-
-  ```bash
-  kubectl exec -it "${OLD_VAULT_NAME}-${leader_index}" -- \
-    vault login "${OLD_VAULT_ADMIN_TOKEN}"
-  ```
-
-- Create the snapshot in the leader pod
-
-  ```bash
-  kubectl exec -it "${OLD_VAULT_NAME}-${leader_index}" -- \
-    vault operator raft snapshot save "${snapshot_filepath}"
-  ```
-
-- Dowload the snapshot
-
-  ```bash
-  kubectl cp "${OLD_VAULT_NAME}-${leader_index}":"${snapshot_filepath}" \
-    "${BACKUPS_DIR}/${OLD_VAULT_SNAPSHOT_FILENAME}"
-
-  kubectl exec -it "${OLD_VAULT_NAME}-${leader_index}" -- \
-    rm "${snapshot_temp_filepath}"
-  ```
-
-- Disable the old Vault instance by changing the command of the Vault
-  StatefulSet to `sleep`
-
-  ```bash
-  kubectl get statefulset vault -o yaml > \
-    "${BACKUPS_DIR}/${OLD_VAULT_STATEFULSET_FILENAME}"
-  cp "${BACKUPS_DIR}/${OLD_VAULT_STATEFULSET_FILENAME}" .
-
-  ed "${OLD_VAULT_STATEFULSET_FILENAME}" < \
+     "${BACKUPS_DIR}/${OLD_VAULT_STATEFULSET_FILENAME}"
+   cp "${BACKUPS_DIR}/${OLD_VAULT_STATEFULSET_FILENAME}" .
+
+   ed "${OLD_VAULT_STATEFULSET_FILENAME}" < "${vault_config_filename}"
-
-  ed '+/seal "/' "${vault_config_filename}" < "${vault_config_filename}"
+
+   ed '+/seal "/' "${vault_config_filename}" < vault-config.yaml
 ed '+/seal "/' vault-config.yaml < vault-config.yaml
+```bash
+kubectl get cm vault-config -o yaml > vault-config.yaml
-  ed '+/disabled \+= true/' vault-config.yaml <"
+spec:
+  rules:
+    - http:
+        paths:
+          - backend:
+              service:
+                name: vault-migration-proxy
+                port:
+                  number: 443
+```
+
+### 14. Test the reachability of the old and new Vault domains

 Finally, we test that both the old and new Vault domains point to the same
-instance and that authentication works as expected.\
+instance (the old instance will not respond since it is not running) and that
+authentication works as expected.\
 Note again the use of the _old_ Vault token for both tests.

 ```bash
@@ -645,6 +698,17 @@ Note again the use of the _old_ Vault token for both tests.
 fi
 ```

+## Restoration in case of failure
+
+The migration script creates backups of all modified resources in the working directory.
+When something goes wrong, the restoration script can simply restore the backed-up
+resources and restart the nodes of the old Vault instance.
+
+Despite careful planning and multiple test runs on non-production clusters, we did use the
+restoration script twice during migration attempts on the production clusters. In both
+instances, the reasons were related to network access, which varied slightly between
+the production and non-production clusters.
+
 ## Conclusion

 Migrating Vault between AKS clusters may seem daunting, especially when
@@ -658,6 +722,10 @@ Thanks to careful planning, extensive testing, and robust scripting, we were
 able to execute the migration with minimal downtime and no impact on dependent
 systems.

+We provide all the scripts and files described in this blog post for your use at
+. We provide no guarantees or support; use them at your own
+risk.
+
 We hope this walkthrough — and the accompanying script excerpts — can serve as
 a blueprint for others facing similar migrations. If you have questions, feedback,
-or suggestions for improvements, feel free to reach out!
+or suggestions for improvements, feel free to reach out!
\ No newline at end of file
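
Review note on the role assignment check in step 1 (item 6): `az role assignment list` normally exits with status 0 even when the filtered result is an empty list, so testing only its exit status will probably not catch a missing assignment. The following is a minimal sketch of a stricter check, not the script's actual implementation. The helper name `has_role_assignment` is hypothetical, and the sketch assumes that counting matches with a JMESPath `length(...)` query and `-o tsv` output fits the authors' setup.

```bash
#!/usr/bin/env bash

# Hypothetical helper, not part of the original migration script.
# Succeeds only if the principal has at least one assignment of the given
# role on the given scope; fails if the list is empty or the az call fails.
has_role_assignment() {
  local scope="$1" role="$2" principal_id="$3" count

  # Ask az for the number of matching assignments instead of the list itself.
  count="$(az role assignment list \
    --scope "${scope}" \
    --role "${role}" \
    --query "length([?principalId == '${principal_id}'])" \
    -o tsv 2>/dev/null)" || return 1

  # An empty or non-numeric count is treated as "no assignment found".
  [ "${count:-0}" -gt 0 ] 2>/dev/null
}
```

In the post's snippet, the condition could then read `if ! has_role_assignment "${key_vault_id}" "Key Vault Crypto Service Encryption User" "${msi_principal_id}"; then`, with the existing messages and `exit 1` left unchanged.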