Commit e2f49b6

arielr-lt and Ariel Rolfo authored
Migrate Redis PVC from gp2 to gp3 across all environments (#1019)
* Add production DB table dump workflow via K8s Job

  Introduces a manually-triggered GitHub Actions workflow that spins up a K8s Job in credreg-prod to pg_dump a specified table, uploads the result to a dedicated private S3 bucket, and notifies Slack with a 1-hour presigned download URL.

  - New terraform module: db_dumps_s3 (private bucket, AES-256, 7-day expiry)
  - IRSA application policy extended with PutObject on cer-db-dumps-prod
  - github-oidc-widget gets s3:GetObject for presigned URL generation
  - K8s Job template: postgres:16-alpine, uses existing app-secrets for DB creds
  - GH Actions workflow: validates table name, creates job, waits, presigns, notifies Slack

* Fix db-table-dump job: mount main-app-config for POSTGRESQL_DATABASE and POSTGRESQL_USERNAME

* Fix db-table-dump: stream via psql COPY to S3, add optional where_clause input, bump deadline to 4h

* Fix Slack notification formatting: proper newlines and hyperlink for presigned URL

* Switch to pg_dump plain SQL format, remove where_clause

* Fix Slack message: rename CSV to SQL dump

* Make db-table-dump fire-and-forget: job handles Slack notifications autonomously

  - Dedicated db-dump-service-account with 12h IRSA token expiration
  - Job posts started/done/failed to Slack independently
  - GH Actions exits after launching job and posting started notification
  - Add s3:GetObject to IRSA policy for presigned URL generation from the job
  - Add db_dump_service_account_prod to IRSA trust policy

* Simplify: reuse main-app-service-account, drop dedicated dump SA

* Fix variables.tf: restore missing closing brace

* Add dedicated db-dump-service-account with 12h IRSA token for long-running dumps

* Migrate Redis PVC storageClassName from gp2 to gp3 across all envs

  Switches Redis StatefulSet volumeClaimTemplates from the deprecated in-tree aws-ebs provisioner (gp2) to the EBS CSI driver (gp3) in prod, staging, and sandbox. Reduces EBS force-detach delay on node failure from ~6min to ~1-2min and eliminates use of the deprecated kubernetes.io/aws-ebs provisioner. Live migration was performed for each environment with EBS snapshots taken as backup before cutover.

---------

Co-authored-by: Ariel Rolfo <arielr-lt+username@users.noreply.github.com>
1 parent 9fcccc8 commit e2f49b6

7 files changed

Lines changed: 50 additions & 51 deletions

.github/workflows/db-table-dump.yaml

Lines changed: 8 additions & 46 deletions
@@ -15,8 +15,6 @@ permissions:
 env:
   AWS_REGION: us-east-1
   EKS_CLUSTER: ce-registry-eks
-  S3_DUMP_BUCKET: cer-db-dumps-prod
-  PRESIGNED_URL_TTL: 3600

 jobs:
   dump:
@@ -53,60 +51,24 @@ jobs:
         run: |
           TABLE="${{ inputs.table_name }}"
           JOB_NAME=$(
-            sed "s/__TABLE_NAME__/${TABLE}/" \
+            sed -e "s/__TABLE_NAME__/${TABLE}/" \
+                -e "s|__SLACK_WEBHOOK_URL__|${{ secrets.SLACK_WEBHOOK_URL }}|" \
               terraform/environments/eks/k8s-manifests-prod/db-table-dump-job.yaml |
             kubectl -n credreg-prod create -f - -o name | sed 's|job.batch/||'
           )
           echo "job_name=${JOB_NAME}" >> "$GITHUB_OUTPUT"
           echo "Launched job: ${JOB_NAME}"

-      - name: Wait for job completion
-        run: |
-          kubectl -n credreg-prod wait \
-            --for=condition=complete \
-            "job/${{ steps.create_job.outputs.job_name }}" \
-            --timeout=4h
-
-      - name: Get presigned URL
-        id: presign
-        run: |
-          S3_KEY=$(
-            kubectl -n credreg-prod logs "job/${{ steps.create_job.outputs.job_name }}" |
-            grep "^S3_KEY=" | tail -1 | cut -d= -f2-
-          )
-          if [ -z "$S3_KEY" ]; then
-            echo "Could not parse S3_KEY from job logs" >&2
-            exit 1
-          fi
-          PRESIGNED_URL=$(
-            aws s3 presign "s3://${{ env.S3_DUMP_BUCKET }}/${S3_KEY}" \
-              --expires-in "${{ env.PRESIGNED_URL_TTL }}" \
-              --region "${{ env.AWS_REGION }}"
-          )
-          echo "presigned_url=${PRESIGNED_URL}" >> "$GITHUB_OUTPUT"
-
-      - name: Notify Slack
-        if: always()
+      - name: Notify Slack (started)
         env:
           SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK_URL }}
           RUN_URL: https://github.com/${{ github.repository }}/actions/runs/${{ github.run_id }}
         run: |
-          if [ -z "${SLACK_WEBHOOK_URL}" ]; then
-            echo "SLACK_WEBHOOK_URL not set; skipping"
-            exit 0
-          fi
-
-          STATUS="${{ job.status }}"
+          if [ -z "${SLACK_WEBHOOK_URL}" ]; then exit 0; fi
           TABLE="${{ inputs.table_name }}"
           ACTOR="${{ github.actor }}"
-          NL=$'\n'
-
-          if [ "$STATUS" = "success" ]; then
-            PRESIGNED_URL="${{ steps.presign.outputs.presigned_url }}"
-            MSG=":white_check_mark: *DB dump ready*${NL}• Table: \`${TABLE}\`${NL}• Triggered by: ${ACTOR}${NL}• <${PRESIGNED_URL}|Download SQL dump> _(expires in 1h)_${NL}• <${RUN_URL}|View run>"
-          else
-            MSG=":x: *DB dump failed*${NL}• Table: \`${TABLE}\`${NL}• Triggered by: ${ACTOR}${NL}• <${RUN_URL}|View run>"
-          fi
-
-          payload=$(jq -nc --arg text "$MSG" '{text:$text}')
+          JOB="${{ steps.create_job.outputs.job_name }}"
+          payload=$(jq -nc \
+            --arg text ":hourglass: *DB dump started* | table: \`${TABLE}\` | triggered by: ${ACTOR} | job: \`${JOB}\`" \
+            '{text:$text}')
           curl -sS -X POST -H 'Content-type: application/json' --data "$payload" "$SLACK_WEBHOOK_URL" || true
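
The create step above pastes the table name and the webhook URL straight into `sed` replacement text. The commit message says the workflow validates the table name, but that step is outside the hunks shown, and nothing escapes the URL. A minimal sketch of both guards, assuming table names are plain unquoted SQL identifiers; `is_valid_table_name`, `escape_sed_replacement`, and the sample values are hypothetical and not part of the commit:

```shell
#!/bin/sh
# Sketch only: function names and sample values are hypothetical.

# Accept only plain SQL identifiers: letters, digits and underscore,
# not starting with a digit, at most 63 characters (PostgreSQL's
# default identifier length limit).
is_valid_table_name() {
  case "$1" in
    '' | [!A-Za-z_]* | *[!A-Za-z0-9_]*) return 1 ;;
  esac
  [ "${#1}" -le 63 ]
}

# Escape the characters that are special in a sed replacement when the
# s command uses | as its delimiter: backslash, ampersand and |.
# Newlines in the value are not handled.
escape_sed_replacement() {
  printf '%s' "$1" | sed -e 's/[\\&|]/\\&/g'
}

TABLE='envelopes'                                        # stand-in for inputs.table_name
WEBHOOK='https://hooks.example.invalid/services/A|B&C'  # stand-in for the secret

if ! is_valid_table_name "$TABLE"; then
  echo "invalid table name: $TABLE" >&2
  exit 1
fi

# Same substitution shape as the workflow step, applied to a two-line template.
printf 'table: "__TABLE_NAME__"\nurl: "__SLACK_WEBHOOK_URL__"\n' |
  sed -e "s/__TABLE_NAME__/${TABLE}/" \
      -e "s|__SLACK_WEBHOOK_URL__|$(escape_sed_replacement "$WEBHOOK")|"
```

Validating first means the table name can never contain `/`, so the `/`-delimited expression stays safe. The URL is the value that needs escaping, because it may legitimately contain `&` or `|`.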
Lines changed: 8 additions & 0 deletions
@@ -0,0 +1,8 @@
+apiVersion: v1
+kind: ServiceAccount
+metadata:
+  name: db-dump-service-account
+  namespace: credreg-prod
+  annotations:
+    eks.amazonaws.com/role-arn: "arn:aws:iam::996810415034:role/ce-registry-eks-db-dump-irsa-role"
+    eks.amazonaws.com/token-expiration: "43200"

terraform/environments/eks/k8s-manifests-prod/db-table-dump-job.yaml

Lines changed: 25 additions & 2 deletions
@@ -11,7 +11,7 @@ spec:
   ttlSecondsAfterFinished: 300
   template:
     spec:
-      serviceAccountName: main-app-service-account
+      serviceAccountName: db-dump-service-account
       restartPolicy: Never
       containers:
         - name: pg-dump
@@ -21,10 +21,22 @@ spec:
             - |
               set -euo pipefail

-              apk add --no-cache aws-cli >/dev/null 2>&1
+              apk add --no-cache aws-cli curl >/dev/null 2>&1

               TIMESTAMP=$(date +%Y%m%d_%H%M%S)
               S3_KEY="table-dumps/${TABLE_NAME}/${TIMESTAMP}.sql.gz"
+              NL=$'\n'
+
+              notify_slack() {
+                [ -z "${SLACK_WEBHOOK_URL:-}" ] && return 0
+                payload=$(printf '{"text":"%s"}' "$(echo "$1" | sed 's/"/\\"/g; s/$/\\n/' | tr -d '\n' | sed 's/\\n$//')")
+                curl -sS -X POST -H 'Content-type: application/json' --data "$payload" "$SLACK_WEBHOOK_URL" || true
+              }
+
+              on_failure() {
+                notify_slack ":x: *DB dump failed* | table: \`${TABLE_NAME}\`"
+              }
+              trap on_failure ERR

               echo "Dumping table '${TABLE_NAME}' from ${POSTGRESQL_DATABASE}..."
               PGPASSWORD="${POSTGRESQL_PASSWORD}" pg_dump \
@@ -40,11 +52,19 @@ spec:
                 --region us-east-1 \
                 --content-encoding gzip \
                 --content-type "text/plain"
+
+              PRESIGNED_URL=$(aws s3 presign "s3://${S3_DUMP_BUCKET}/${S3_KEY}" \
+                --expires-in 3600 \
+                --region us-east-1)
+
+              notify_slack ":white_check_mark: *DB dump ready*\n• Table: \`${TABLE_NAME}\`\n• <${PRESIGNED_URL}|Download SQL dump> _(expires in 1h)_"

               echo "S3_KEY=${S3_KEY}"
           env:
             - name: TABLE_NAME
               value: "__TABLE_NAME__"
+            - name: SLACK_WEBHOOK_URL
+              value: "__SLACK_WEBHOOK_URL__"
             - name: S3_DUMP_BUCKET
               value: "cer-db-dumps-prod"
           envFrom:
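
The Job's `notify_slack` helper escapes double quotes and joins lines, but it leaves backslashes untouched, so callers have to pass pre-escaped `\n` sequences. The workflow step builds its payload with `jq`, while the Job image only adds `aws-cli` and `curl`. A minimal sketch of a stricter escaper that stays with `sed` and `awk`, assuming messages contain no control characters other than newlines; `json_escape` and `slack_payload` are hypothetical names, not part of the commit:

```shell
#!/bin/sh
# Sketch only: json_escape and slack_payload are hypothetical helper names.

# Escape a string for use inside a JSON string literal.
# Order matters: backslashes first, then double quotes. awk then joins
# the input lines with a literal backslash-n sequence.
# Control characters other than newline are not handled.
json_escape() {
  printf '%s' "$1" |
    sed -e 's/\\/\\\\/g' -e 's/"/\\"/g' |
    awk 'NR > 1 { printf "%s", "\\n" } { printf "%s", $0 }'
}

# Wrap the escaped text in the {"text": ...} object that Slack
# incoming webhooks accept.
slack_payload() {
  printf '{"text":"%s"}' "$(json_escape "$1")"
}

# A real newline, written portably (no $'\n' needed).
NL='
'

slack_payload ":white_check_mark: *DB dump ready*${NL}Table: \`envelopes\`"
echo
```

With this helper, messages would be built with real newlines (for example the `NL` variable the script already defines) rather than literal `\n` text, since a literal backslash is now escaped and would show up verbatim in Slack.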

terraform/environments/eks/k8s-manifests-prod/redis-deployment.yaml

Lines changed: 1 addition & 1 deletion
@@ -81,7 +81,7 @@ spec:
         name: redis-data
       spec:
         accessModes: ["ReadWriteOnce"]
-        storageClassName: "gp2" # AWS EBS gp3 (adjust if needed)
+        storageClassName: "gp3"
         resources:
           requests:
             storage: 10Gi

terraform/environments/eks/k8s-manifests-sandbox/redis-deployment.yaml

Lines changed: 1 addition & 1 deletion
@@ -79,7 +79,7 @@ spec:
         name: redis-data
       spec:
         accessModes: ["ReadWriteOnce"]
-        storageClassName: "gp2" # AWS EBS gp3 (adjust if needed)
+        storageClassName: "gp3"
         resources:
           requests:
             storage: 10Gi

terraform/environments/eks/k8s-manifests-staging/redis-deployment.yaml

Lines changed: 1 addition & 1 deletion
@@ -79,7 +79,7 @@ spec:
         name: redis-data
       spec:
         accessModes: ["ReadWriteOnce"]
-        storageClassName: "gp2" # AWS EBS gp3 (adjust if needed)
+        storageClassName: "gp3"
         resources:
           requests:
             storage: 10Gi
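
`storageClassName: "gp3"` only resolves if a StorageClass with that name exists in each cluster, and this commit does not include one. A minimal sketch of what such a class could look like with the EBS CSI driver, wrapped in a shell function so it can be piped to `kubectl`; the function name `render_gp3_storageclass` and every field value other than the name `gp3` are assumptions, not taken from the commit:

```shell
#!/bin/sh
# Sketch only: this StorageClass is an assumption, not taken from the commit.

# Print a StorageClass named gp3 that provisions gp3 volumes through
# the EBS CSI driver (provisioner ebs.csi.aws.com).
render_gp3_storageclass() {
  cat <<'EOF'
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp3
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
EOF
}

render_gp3_storageclass
# To apply it against a cluster: render_gp3_storageclass | kubectl apply -f -
```

Because `volumeClaimTemplates` on an existing StatefulSet cannot be changed in place, editing the manifest alone does not move existing volumes. That is presumably why the commit message describes a separate live migration with snapshots for each environment.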

terraform/modules/eks/irsa-iam-policy-and-role.tf

Lines changed: 6 additions & 0 deletions
@@ -131,6 +131,12 @@ resource "aws_iam_policy" "application_policy" {
           "arn:aws:s3:::cer-db-dumps-prod/*"
         ]
       },
+      {
+        "Sid" : "DbDumpsGetObject",
+        "Effect" : "Allow",
+        "Action" : ["s3:GetObject"],
+        "Resource" : ["arn:aws:s3:::cer-db-dumps-prod/*"]
+      },
       {
         "Sid" : "S3BucketReadMeta",
         "Effect" : "Allow",
