Skip to content

feat: Add Valkey Sentinel & HAProxy support in High Availability setup.#137

Open
khtee wants to merge 16 commits into
valkey-io:mainfrom
khtee:main
Open

feat: Add Valkey Sentinel & HAProxy support in High Availability setup.#137
khtee wants to merge 16 commits into
valkey-io:mainfrom
khtee:main

Conversation

@khtee
Copy link
Copy Markdown

@khtee khtee commented Feb 6, 2026

Feat #22

@dmaes
Copy link
Copy Markdown

dmaes commented Feb 11, 2026

Nice work!

I have some notes/questions:

@khtee
Copy link
Copy Markdown
Author

khtee commented Feb 11, 2026

Nice work!

I have some notes/questions:

Good points! Will work on both improvements.

@yoannrt
Copy link
Copy Markdown
Contributor

yoannrt commented Feb 11, 2026

Good job,

I have some questions too,
I assume to discover who is master, you have to ask sentinel SENTINEL GET-MASTER-ADDR-BY-NAME mymaster ?
(Currently, the replica service is sticked to pod-0 in the STS)

Also, would that make sense to run sentinel as a side container in the replica STS pods ?

@amontalban
Copy link
Copy Markdown
Contributor

Thank you @khtee can't wait to get this merged in 🙏

@dmaes
Copy link
Copy Markdown

dmaes commented Feb 14, 2026

Also, would that make sense to run sentinel as a side container in the replica STS pods ?

You'll want to disable the default statefulset when sentinel is enabled

I got confused, thinking valkey-sentinel also runs the server process, it's been a while since I used redis/valkey. But yes, @khtee you'll want to run sentinel and the server side-to-side in the same pod, don't split them out in different statefulsets.

@yoannrt
Copy link
Copy Markdown
Contributor

yoannrt commented Feb 17, 2026

Do you plan on adding a HAProxy deployment, so clients don't have to be aware of sentinel?

@dmaes I guess when you suggest to implement HAproxy, it's for sentinel incompatible clients workloads ?
So this would be optional and the default implementation would be a valkey/sentinel HA setup ?

@khtee khtee marked this pull request as draft February 18, 2026 23:57
@khtee khtee changed the title feat: Add Valkey Sentinel support in High Availability setup. feat: Add Valkey Sentinel & HAProxy support in High Availability setup. Feb 19, 2026
@khtee khtee marked this pull request as ready for review February 19, 2026 01:19
@khtee khtee marked this pull request as draft February 19, 2026 01:30
@khtee khtee marked this pull request as ready for review February 19, 2026 06:11
@khtee
Copy link
Copy Markdown
Author

khtee commented Feb 19, 2026

Also, would that make sense to run sentinel as a side container in the replica STS pods ?

You'll want to disable the default statefulset when sentinel is enabled

I got confused, thinking valkey-sentinel also runs the server process, it's been a while since I used redis/valkey. But yes, @khtee you'll want to run sentinel and the server side-to-side in the same pod, don't split them out in different statefulsets.

Added following enhancements.

  • Sentinel now runs as a sidecar in valkey pod.
  • HAProxy with sidecar watcher to track failover.

@dmaes
Copy link
Copy Markdown

dmaes commented Feb 19, 2026

Do you plan on adding a HAProxy deployment, so clients don't have to be aware of sentinel?

@dmaes I guess when you suggest to implement HAproxy, it's for sentinel incompatible clients workloads ? So this would be optional and the default implementation would be a valkey/sentinel HA setup ?

That's correct.

The truly kubernetes-native way would probably be to have a sentinel-master Service, using a sentinel.valkey.io/master: "true" label selector, and then some watcher that updates that label on the correct Pod,
but a HAproxy deployment is the easier option to implement, and is how most other Redis charts do it.

Comment thread valkey/values.yaml Outdated
Comment thread valkey/templates/haproxy-deployment.yaml Outdated
Comment thread valkey/templates/haproxy-deployment.yaml
Comment thread valkey/values.yaml Outdated
Comment thread valkey/values.yaml
khtee and others added 5 commits February 20, 2026 09:53
Signed-off-by: KHTee <teekahhui@hotmail.com>
Co-authored-by: Dieter Maes <dieter.maes@dmaes.be>
Signed-off-by: khtee <75174583+khtee@users.noreply.github.com>
Signed-off-by: KHTee <teekahhui@hotmail.com>
Allow HAProxy to retry DNS resolution during startup when pending for Valkey node to start.

Essentially it does the following

- Try to use the last known IP.
- If none, query the libc resolver (DNS).
- If that fails, resolve to none (meaning the server has no IP address yet, but HAProxy won't crash) and wait for the runtime resolver health-checks to pick up the DNS correctly.

Signed-off-by: KHTee <teekahhui@hotmail.com>
@lazariv
Copy link
Copy Markdown

lazariv commented Mar 2, 2026

Waiting for this to be merged. Thanks for your work!

Comment thread valkey/templates/haproxy-configmap.yaml Outdated
Comment thread valkey/templates/haproxy-deployment.yaml
Comment thread valkey/values.yaml Outdated
Comment thread valkey/values.schema.json
Comment thread valkey/templates/haproxy-deployment.yaml
Comment thread valkey/templates/haproxy-deployment.yaml
Comment thread valkey/templates/haproxy-configmap.yaml Outdated
Comment thread valkey/templates/haproxy-configmap.yaml Outdated
Comment thread valkey/templates/statefulset.yaml
Comment thread valkey/templates/haproxy-configmap.yaml Outdated
Comment on lines +57 to +62
command:
- /bin/sh
- -c
- |
apk add --no-cache socat
exec sh /scripts/sentinel-watcher.sh
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This syntax triggers a "Potential reverse shell detected" incident with Microsoft Defender for Cloud.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can this be mitigated by removing the pipe ?

Copy link
Copy Markdown

@angeloxx angeloxx Mar 30, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please note that in an air-gapped environment, the apk command cannot function. Generally, installing software at runtime is not a best practice.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not an expert with Valkey and/or Redis and especially not Sentinel.
But maybe the approach from DandyDeveloper/redis-ha can help here, which seems to do this entirely within the haproxy config:
https://github.com/DandyDeveloper/charts/blob/f4aadb7523966fd3852589627c18db937415a439/charts/redis-ha/templates/_configs.tpl#L542-L671

Copy link
Copy Markdown

@dmaes dmaes Apr 22, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please note that in an air-gapped environment, the apk command cannot function. Generally, installing software at runtime is not a best practice.

Neither will it work with readOnlyRootFilesystem deployments

@khtee in the values.yaml you mention not needing any additional tools, because valkey alpine image comes with nc -U, but then you seem to not be using nc -U, and instead you install socat as an additional tool.

Copy link
Copy Markdown

@jose-10000 jose-10000 May 18, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I’ve been testing a different HAProxy approach
Instead of Sentinel watcher + socat + runtime socket updates, I think HAProxy can detect the writable master directly using native TCP health checks against the Valkey pods (INFO replication -> role:master).

So this ConfigMap is almost completely different, it replaces most of the current HAProxy integration.

This removes runtime package installs and socat

I’m pasting the full replacement ConfigMap below so the complete approach can be reviewed

HAProxy ConfigMap proposal
{{- if .Values.haproxy.enabled }}
apiVersion: v1
kind: ConfigMap
metadata:
  name: {{ include "valkey.fullname" . }}-haproxy
  labels:
    {{- include "valkey.labels" . | nindent 4 }}
data:
  haproxy.cfg.tpl: |
    global
      log stdout format raw local0
      maxconn 1024
      ssl-server-verify none

    # Resolve Kubernetes DNS dynamically
    resolvers k8s
      parse-resolv-conf
      hold valid 1s

    defaults
      log global
      timeout connect {{ .Values.haproxy.config.timeout.connect }}
      timeout client  {{ .Values.haproxy.config.timeout.client }}
      timeout server  {{ .Values.haproxy.config.timeout.server }}
      timeout tunnel  {{ .Values.haproxy.config.timeout.tunnel | default "0s" }}
      retries 3

    frontend valkey_frontend_write
      bind *:{{ .Values.haproxy.service.port | default 6379 }}
      mode tcp
      option tcplog
      default_backend valkey_backend_master

    frontend valkey_frontend_read
      bind *:{{ .Values.haproxy.service.readPort | default 6380 }}
      mode tcp
      option tcplog
      default_backend valkey_backend_read

    backend valkey_backend_master
      mode tcp
      option tcp-check
      timeout check 5s

      default-server inter {{ .Values.haproxy.config.checkInterval }} fall 1 rise 2

      {{- if .Values.tls.enabled }}
      tcp-check connect port {{ .Values.service.port }} ssl
      {{- else }}
      tcp-check connect port {{ .Values.service.port }}
      {{- end }}

      {{- if .Values.auth.enabled }}
      tcp-check send "AUTH HAPROXY_WATCHER_USER HAPROXY_WATCHER_PASS\r\n"
      tcp-check expect string +OK
      {{- end }}
      tcp-check send "INFO replication\r\n"
      tcp-check expect rstring role:master
      tcp-check send "QUIT\r\n"
      tcp-check expect string +OK

      {{- range $i := until (add (int .Values.replica.replicas) 1 | int) }}
      server valkey-{{ $i }} {{ include "valkey.fullname" $ }}-{{ $i }}.{{ include "valkey.headlessServiceName" $ }}.{{ $.Release.Namespace }}.svc.{{ $.Values.clusterDomain }}:{{ $.Values.service.port }} check init-addr last,libc,none resolvers k8s on-marked-down shutdown-sessions
      {{- end }}

    backend valkey_backend_read
      mode tcp
      option tcp-check
      timeout check 5s

      {{- if .Values.tls.enabled }}
      tcp-check connect port {{ .Values.service.port }} ssl
      {{- else }}
      tcp-check connect port {{ .Values.service.port }}
      {{- end }}

      {{- if .Values.auth.enabled }}
      tcp-check send "AUTH HAPROXY_WATCHER_USER HAPROXY_WATCHER_PASS\r\n"
      tcp-check expect string +OK
      {{- end }}

      tcp-check send "PING\r\n"
      tcp-check expect string +PONG
      tcp-check send "QUIT\r\n"
      tcp-check expect string +OK

      {{- range $i := until (add (int .Values.replica.replicas) 1 | int) }}
      server valkey-{{ $i }} {{ include "valkey.fullname" $ }}-{{ $i }}.{{ include "valkey.headlessServiceName" $ }}.{{ $.Release.Namespace }}.svc.{{ $.Values.clusterDomain }}:{{ $.Values.service.port }} check inter 15s fall 3 rise 2 init-addr last,libc,none resolvers k8s
      {{- end }}

  entrypoint.sh: |
    #!/bin/sh
    set -eu

    CFG_TPL="/config/haproxy.cfg.tpl"
    CFG_OUT="/tmp/haproxy.cfg"

    cp "${CFG_TPL}" "${CFG_OUT}"

    {{- if .Values.auth.enabled }}
    {{- $watcherUser := .Values.haproxy.watcherUser | default "default" }}
    {{- $watcherUserObj := index .Values.auth.aclUsers $watcherUser | default (dict "passwordKey" "") }}
    {{- $passKey := $watcherUserObj.passwordKey | default $watcherUser }}

    WATCHER_USER="{{ $watcherUser }}"

    WATCHER_PASS=""
    if [ -f "/valkey-users-secret/{{ $passKey }}" ]; then
      WATCHER_PASS=$(cat "/valkey-users-secret/{{ $passKey }}")
    elif [ -f "/valkey-auth-secret/{{ $passKey }}-password" ]; then
      WATCHER_PASS=$(cat "/valkey-auth-secret/{{ $passKey }}-password")
    elif [ -n "${HAPROXY_WATCHER_PASS:-}" ]; then
      WATCHER_PASS="${HAPROXY_WATCHER_PASS}"
    else
      echo "ERROR: Could not resolve password for watcher user '${WATCHER_USER}'." >&2
      echo "       Mount the secret as a volume or set the HAPROXY_WATCHER_PASS env var." >&2
      exit 1
    fi

    sed -i "s|HAPROXY_WATCHER_USER|${WATCHER_USER}|g" "${CFG_OUT}"
    sed -i "s|HAPROXY_WATCHER_PASS|${WATCHER_PASS}|g" "${CFG_OUT}"
    {{- end }}

    haproxy -c -V -f "${CFG_OUT}" || { echo "ERROR: haproxy config validation failed" >&2; exit 1; }
    exec haproxy -W -f "${CFG_OUT}"
{{- end }}

khtee and others added 3 commits March 28, 2026 13:59
Co-authored-by: Tim Karger <49390121+tkarger@users.noreply.github.com>
Signed-off-by: khtee <75174583+khtee@users.noreply.github.com>
Co-authored-by: Tim Karger <49390121+tkarger@users.noreply.github.com>
Signed-off-by: khtee <75174583+khtee@users.noreply.github.com>
Co-authored-by: Tim Karger <49390121+tkarger@users.noreply.github.com>
Signed-off-by: khtee <75174583+khtee@users.noreply.github.com>
Copy link
Copy Markdown
Author

@khtee khtee left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fix code review comments

khtee and others added 2 commits March 29, 2026 22:20
Co-authored-by: lazariv <lazariv.taras@gmail.com>
Signed-off-by: khtee <75174583+khtee@users.noreply.github.com>
Signed-off-by: khtee <75174583+khtee@users.noreply.github.com>
Comment thread valkey/values.yaml
# labels:
# severity: error

haproxy:
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
haproxy:
# Enable haproxy in front of sentinel
# This is allows running redis in sentinel mode even when the client is not compatible
# https://arystech.com/blog/setting-up-haproxy-with-redis-sentinel-for-high-availability-on-microk8s-kubernetes
haproxy:

@jose-10000
Copy link
Copy Markdown

Hi @khtee

I’ve been testing this with ACLs enabled + TLS active and found some discovery gaps when the default user is restricted. I've sent a PR to @khtee's fork with the following hardening fixes:

  • Inter-Sentinel Auth: Added sentinel-user/pass to fix quorum discovery ("empty array" issue).
  • Role Segregation: Support for a dedicated monitorUser (separate from replication).
  • Security: Fixed credential leaks in logs (redirected to stderr) and added native SHA256 hashing.
  • Schema: Fixed values.schema.json and HAProxy tag types to pass helm lint.

I've verified that ACL + TLS now work seamlessly. I'm currently finishing validation for HAProxy with credentials and will update soon.

Thanks for your contribution!

jose and others added 3 commits April 7, 2026 14:29
- Updating HAProxy watcher for near-instant IP-based failover.
- Refactoring init scripts to support dynamic topology and universal auth/TLS injection.
- Adding smart L7 health checks in HAProxy to handle ACL-protected nodes.
- Fully parameterizing Service and ConfigMap ports for end-to-end flexibility.
…schema

Valkey Sentinel: Fixing Auth/ACL gaps in TLS environments and security hardening (Enhancements for PR valkey-io#137)
@@ -188,3 +188,34 @@ Validate replica authentication configuration
{{- end }}
{{- end -}}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
{{/*
Validate haproxy is used in replica mode
*/}}
{{- define "valkey.validateHaproxyRequirements" -}}
{{- if and .Values.haproxy.enabled (not .Values.replica.enabled) }}
{{- fail "Haproxy is only relevant in replica mode with clients incompatible with Sentinel." }}
{{- end }}
{{- end -}}
  • update of deploy_valkey.yaml to fail if haproxy is enabled in standalone mode.
Suggested change
{{- include "valkey.validateHaproxyRequirements" . }}

Copy link
Copy Markdown

@lazariv lazariv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Redis/Valkey pub/sub (SUBSCRIBE) connections are long-lived and idle by design — they block waiting for messages. HAProxy's timeout client/timeout server treats them as stale and drops them, causing clients to see ConnectionError: Connection closed by server and forcing reconnect loops.

Adding timeout tunnel to the HAProxy defaults section solves this cleanly. In HAProxy, timeout tunnel governs bidirectional connections after the initial handshake — exactly the pattern pub/sub uses. Setting it to 0 keeps SUBSCRIBE connections alive indefinitely while preserving normal client/server timeouts as a safety net for regular command connections.

Without this, users running any pub/sub workload (Socket.IO, Celery, Sidekiq, etc.) through HAProxy must either set client/server timeouts to 0 (triggering HAProxy warnings) or accept periodic disconnects.

log global
timeout connect {{ .Values.haproxy.config.timeout.connect }}
timeout client {{ .Values.haproxy.config.timeout.client }}
timeout server {{ .Values.haproxy.config.timeout.server }}
Copy link
Copy Markdown

@lazariv lazariv Apr 13, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
timeout server {{ .Values.haproxy.config.timeout.server }}
timeout server {{ .Values.haproxy.config.timeout.server }}
timeout tunnel {{ .Values.haproxy.config.timeout.tunnel }}
option clitcpka
option srvtcpka

Comment thread valkey/values.schema.json
},
"server": {
"type": "string"
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
}
},
"tunnel": {
"type": "string"
}

Comment thread valkey/values.yaml
timeout:
connect: 5s
client: 1m
server: 1m
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
server: 1m
server: 1m
# Timeout for long-lived bidirectional connections (e.g. pub/sub).
# Set to 0 to keep pub/sub SUBSCRIBE connections alive indefinitely.
tunnel: 0s

@jose-10000
Copy link
Copy Markdown

I've implemented a preStop hook. I found that improper shutdowns during updates were tripping up Sentinel and risked data loss, so this change ensures a more graceful exit. I made a new pull request @khtee

@jose-10000
Copy link
Copy Markdown

jose-10000 commented Apr 30, 2026

I've added the workloadAnnotations block to haproxy-deployment.yaml. I noticed that the upstream main branch recently introduced workloadAnnotations for other workloads, so adding it here ensures HAProxy is aligned with main once this PR is eventually merged.

Do you have a rough ETA for this merge? Please let me know if there is anything else I can help review or test to get this PR over the finish line. I'm planning to adopt these changes soon and would like to have an idea of when it might be merged. Thanks!

@jose-10000
Copy link
Copy Markdown

I have some reservations about the approach of running Sentinel as a sidecar within the same pod as the Valkey instance.

The main concern is related to Helm upgrade workflows: when a pod is taken down during a rolling update or a Helm upgrade, the entire pod is terminated — including both the Valkey container and the Sentinel sidecar. This means we lose a Sentinel node during the upgrade process.

Losing a Sentinel during an upgrade can temporarily break the quorum. For example, in a typical 3-Sentinel setup, losing one during an upgrade leaves only 2 Sentinels, which may still meet quorum — but in smaller or less redundant setups, this can prevent failover from working correctly during the exact moment it may be needed most.

A more resilient architecture would deploy Sentinels as independent pods (e.g., via a separate StatefulSet), decoupled from the Valkey data pods. This ensures that Sentinel availability is not tied to the lifecycle of the data nodes, preserving quorum integrity throughout rolling updates and upgrades.

@dmaes
Copy link
Copy Markdown

dmaes commented May 18, 2026

@jose-10000 across all other best known redis sentinel helm charts, running in the same pod is the most common setup, and has been working just fine for most people. So I (and I think that goes for others too), never really considered using different statefulsets?

@jose-10000
Copy link
Copy Markdown

@jose-10000 across all other best known redis sentinel helm charts, running in the same pod is the most common setup, and has been working just fine for most people. So I (and I think that goes for others too), never really considered using different statefulsets?

Thanks for the context, @dmaes! Totally makes sense, I know most charts (like Bitnami) do it this way to keep things simplier. I just wanted to point out that coupling their lifecycles can be tricky for quorum during upgrades, but I'm fine keeping the current approach. Just wanted to share my two cents

@trabelsieyal
Copy link
Copy Markdown

Any idea when will this be released? can be amazing

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.