[deploy/infra] Publish container image and k8s deploy manifests#2100

Draft
Gregory-Pereira wants to merge 2 commits into
kvcache-ai:mainfrom
Gregory-Pereira:publish-container-image-and-k8s-deploy-manifests

Conversation

@Gregory-Pereira

Description

  • Publish container images for the CPU master store for K8s deployment
  • Build deployment manifests for a mooncake master store deployment

Module

  • Transfer Engine (mooncake-transfer-engine)
  • Mooncake Store (mooncake-store)
  • Mooncake EP (mooncake-ep)
  • Integration (mooncake-integration)
  • P2P Store (mooncake-p2p-store)
  • Python Wheel (mooncake-wheel)
  • PyTorch Backend (mooncake-pg)
  • Mooncake RL (mooncake-rl)
  • CI/CD
  • Docs
  • Other - Deployment

Type of Change

  • Bug fix
  • New feature
  • Refactor
  • Breaking change
  • Documentation update
  • Other

How Has This Been Tested?

Deployed on my k8s cluster:

k get all -n mooncake
NAME                                  READY   STATUS    RESTARTS   AGE
pod/mooncake-master-895794746-8wwds   1/1     Running   0          3h15m

NAME                      TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)                       AGE
service/mooncake-master   ClusterIP   10.16.1.84   <none>        50051/TCP,8080/TCP,9003/TCP   3h18m

NAME                              READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/mooncake-master   1/1     1            1           3h18m

NAME                                         DESIRED   CURRENT   READY   AGE
replicaset.apps/mooncake-master-895794746    1         1         1       3h15m

Checklist

  • I have performed a self-review of my own code.
  • I have formatted my own code using ./scripts/code_format.sh before submitting.
  • I have updated the documentation.
  • I have added tests to prove my changes are effective.

Signed-off-by: Gregory Pereira <grpereir@redhat.com>
@Gregory-Pereira changed the title from "Publish container image and k8s deploy manifests" to "[deploy/infra] Publish container image and k8s deploy manifests" on May 15, 2026
Signed-off-by: Gregory Pereira <grpereir@redhat.com>
@Gregory-Pereira force-pushed the publish-container-image-and-k8s-deploy-manifests branch from 50f6658 to fc4378d on May 15, 2026, 01:43
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces Kubernetes deployment manifests and a multi-stage Dockerfile for the mooncake-master component. The deployment includes Kustomize configurations, Prometheus monitoring support, and persistent storage for snapshots. Feedback focuses on improving security by running the container as a non-root user, removing redundant generated manifests and commented-out code, and optimizing the Docker build process by refining file copying and updating image registry references.

Comment thread docker/Dockerfile.master

EXPOSE 50051 9003 8080

ENTRYPOINT ["mooncake_master"]

security-high high

The container currently runs as the root user. It is a security best practice to run applications as a non-privileged user to minimize the potential impact of a container breakout. Consider creating a dedicated user in the runtime stage and using the USER instruction.
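The same constraint can also be enforced from the Kubernetes side, in addition to a USER instruction in the image. A minimal sketch, assuming the Deployment and container are both named mooncake-master as in the manifests in this PR; the UID/GID 10001 is a hypothetical value, not something the image currently defines, and whether fsGroup makes /data/snapshots writable depends on the volume type behind the PVC:

```yaml
# Hypothetical strategic-merge patch for the mooncake-master Deployment.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: mooncake-master
  namespace: mooncake
spec:
  template:
    spec:
      securityContext:
        runAsNonRoot: true   # kubelet refuses to start the container as UID 0
        runAsUser: 10001     # assumed UID; must be able to run mooncake_master
        runAsGroup: 10001    # assumed GID
        fsGroup: 10001       # group ownership applied to supported volumes
      containers:
      - name: mooncake-master
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop:
            - ALL            # ports 50051, 9003, 8080 are unprivileged
```

With runAsNonRoot set, a pod built from an image that still defaults to root fails at container start instead of silently running as root.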

Comment on lines +1 to +168
apiVersion: v1
kind: Namespace
metadata:
  labels:
    app.kubernetes.io/component: master
    app.kubernetes.io/name: mooncake-master
    app.kubernetes.io/part-of: mooncake
  name: mooncake
---
apiVersion: v1
data:
  master.yaml: |
    rpc_port: 50051
    rpc_thread_num: 4
    rpc_address: "0.0.0.0"
    rpc_conn_timeout_seconds: 0
    rpc_enable_tcp_no_delay: true

    enable_metric_reporting: true
    metrics_port: 9003

    enable_http_metadata_server: true
    http_metadata_server_host: "0.0.0.0"
    http_metadata_server_port: 8080

    default_kv_lease_ttl: 5000
    default_kv_soft_pin_ttl: 1800000
    allow_evict_soft_pinned_objects: true
    eviction_ratio: 0.05
    eviction_high_watermark_ratio: 0.95

    memory_allocator: "offset"
    allocation_strategy: "random"

    enable_snapshot: true
    enable_snapshot_restore: true
    snapshot_interval_seconds: 60
    snapshot_retention_count: 3
    snapshot_object_store_type: "local"
kind: ConfigMap
metadata:
  labels:
    app.kubernetes.io/component: master
    app.kubernetes.io/name: mooncake-master
    app.kubernetes.io/part-of: mooncake
  name: mooncake-master-config
  namespace: mooncake
---
apiVersion: v1
kind: Service
metadata:
  labels:
    app.kubernetes.io/component: master
    app.kubernetes.io/name: mooncake-master
    app.kubernetes.io/part-of: mooncake
  name: mooncake-master
  namespace: mooncake
spec:
  ports:
  - name: rpc
    port: 50051
    protocol: TCP
    targetPort: rpc
  - name: metadata
    port: 8080
    protocol: TCP
    targetPort: metadata
  - name: metrics
    port: 9003
    protocol: TCP
    targetPort: metrics
  selector:
    app.kubernetes.io/component: master
    app.kubernetes.io/name: mooncake-master
    app.kubernetes.io/part-of: mooncake
  type: ClusterIP
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  labels:
    app.kubernetes.io/component: master
    app.kubernetes.io/name: mooncake-master
    app.kubernetes.io/part-of: mooncake
  name: mooncake-master-snapshots
  namespace: mooncake
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app.kubernetes.io/component: master
    app.kubernetes.io/name: mooncake-master
    app.kubernetes.io/part-of: mooncake
  name: mooncake-master
  namespace: mooncake
spec:
  replicas: 1
  selector:
    matchLabels:
      app.kubernetes.io/component: master
      app.kubernetes.io/name: mooncake-master
      app.kubernetes.io/part-of: mooncake
  strategy:
    type: Recreate
  template:
    metadata:
      labels:
        app.kubernetes.io/component: master
        app.kubernetes.io/name: mooncake-master
        app.kubernetes.io/part-of: mooncake
    spec:
      containers:
      - args:
        - --config_path=/etc/mooncake/master.yaml
        command:
        - mooncake_master
        env:
        - name: MOONCAKE_SNAPSHOT_LOCAL_PATH
          value: /data/snapshots
        image: quay.io/grpereir/mooncake-master:test
        livenessProbe:
          initialDelaySeconds: 10
          periodSeconds: 15
          tcpSocket:
            port: rpc
        name: mooncake-master
        ports:
        - containerPort: 50051
          name: rpc
          protocol: TCP
        - containerPort: 9003
          name: metrics
          protocol: TCP
        - containerPort: 8080
          name: metadata
          protocol: TCP
        readinessProbe:
          initialDelaySeconds: 5
          periodSeconds: 10
          tcpSocket:
            port: rpc
        resources:
          limits:
            cpu: "4"
            memory: 4Gi
          requests:
            cpu: "1"
            memory: 1Gi
        volumeMounts:
        - mountPath: /etc/mooncake
          name: config
          readOnly: true
        - mountPath: /data/snapshots
          name: snapshots
      volumes:
      - configMap:
          name: mooncake-master-config
        name: config
      - name: snapshots
        persistentVolumeClaim:
          claimName: mooncake-master-snapshots

medium

This file appears to be a generated manifest (e.g., from kustomize build). Committing generated files to the repository is generally discouraged as it creates redundancy and can lead to configuration drift between the source YAMLs and this combined output. This file should be removed.

Comment on lines +10 to +18
metadata:
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "9003"
    prometheus.io/path: "/metrics"
spec:
  containers:
    - name: mooncake-master
      image: mooncake-master

medium

These commented-out lines are redundant because the Kustomization configuration is already set to automatically inject these labels and annotations into the pod template. Removing them will improve the clarity and maintainability of the manifest.
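If the Prometheus annotations are meant to be injected by Kustomize rather than written in the base Deployment, one way to express that is an inline patch in the kustomization. This is a sketch only: the file layout and the resource file names are assumptions and are not taken from this PR, while the Deployment name and annotation values are the ones quoted above.

```yaml
# Hypothetical kustomization.yaml; resource file names are assumed.
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: mooncake
resources:
- deployment.yaml
- service.yaml
patches:
- target:
    kind: Deployment
    name: mooncake-master
  patch: |-
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: mooncake-master
    spec:
      template:
        metadata:
          annotations:
            prometheus.io/scrape: "true"
            prometheus.io/port: "9003"
            prometheus.io/path: "/metrics"
```

Scoping the patch to the pod template keeps the scrape annotations off unrelated resources such as the Namespace and ConfigMap.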

Comment on lines +21 to +22
newName: quay.io/grpereir/mooncake-master
newTag: test

medium

The image configuration uses a personal registry (quay.io/grpereir/...) and a test tag. For a shared project repository, it is recommended to use an organization-level registry or a placeholder, and a more stable tag like latest or a specific version variable.
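A placeholder-based version of the image override might look like the following. The registry path and tag here are hypothetical stand-ins, not a registry the project is known to publish to; only the source image name mooncake-master comes from the Deployment in this PR.

```yaml
# Hypothetical kustomization.yaml fragment.
images:
- name: mooncake-master                                    # image name used in the base Deployment
  newName: registry.example.com/mooncake/mooncake-master   # placeholder registry path
  newTag: v0.0.0                                           # placeholder; pin to a released version
```

Consumers can then override the placeholder per environment with `kustomize edit set image` or an overlay, without editing the base.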

Comment thread docker/Dockerfile.master
rm -rf /var/lib/apt/lists/*

WORKDIR /workspace
COPY . /workspace

medium

Using COPY . /workspace without a .dockerignore file can include unnecessary files (such as local build/ directories, .git history, or temporary files) in the build context. This increases image size and build time. Ensure a .dockerignore is present or be more specific with the files being copied.

@codecov-commenter

⚠️ Please install the Codecov app to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

✅ All modified and coverable lines are covered by tests.

📢 Thoughts on this report? Let us know!
