modelservice servicemonitor #22
@tumido ptal again, ty
The CI should be adjusted here: https://github.com/neuralmagic/llm-d-deployer/blob/main/charts/llm-d/ci/default-values.yaml
@nerdalert please review the quickstart changes
  name: {{ include "modelservice.fullname" . }}-monitor
  namespace: {{ .Release.Namespace }}
  labels: {{ include "common.labels.standard" . | nindent 4 }}
    control-plane: controller-manager

  labels: {{ include "common.labels.standard" . | nindent 4 }}
    control-plane: controller-manager
    app.kubernetes.io/component: modelservice
    release: prometheus
Same as above, is this label necessary?
This is for Kubernetes: by default, Prometheus looks for this label. I think I can drop it by adding an arg to the prom-stack install in the quickstart, so I will remove it.
Removed; added an arg in the installer so the label is no longer needed.
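The installer argument mentioned in this thread is not shown in the conversation. One setting known to have this effect in the kube-prometheus-stack chart is `prometheus.prometheusSpec.serviceMonitorSelectorNilUsesHelmValues=false`, which lets Prometheus select ServiceMonitors that do not carry the chart's `release` label. A minimal sketch, assuming that chart is the one in use; the function name and the default release and namespace (`prometheus`, `monitoring`) are hypothetical. It only prints the command so it can be reviewed before running:

```shell
#!/bin/sh
# Sketch only: prints a helm command for kube-prometheus-stack.
# The function name, release name, and namespace are illustrative,
# not the installer's actual values.
prometheus_stack_cmd() {
  release="${1:-prometheus}"
  namespace="${2:-monitoring}"
  printf 'helm upgrade --install %s prometheus-community/kube-prometheus-stack \\\n' "$release"
  printf '  --namespace %s --create-namespace \\\n' "$namespace"
  # With this set to false, Prometheus does not filter ServiceMonitors
  # by the chart's release label, so "release: prometheus" is not required.
  printf '  --set prometheus.prometheusSpec.serviceMonitorSelectorNilUsesHelmValues=false\n'
}
```

Calling `prometheus_stack_cmd` with no arguments prints the command for the default release name and namespace; the first and second arguments override them.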
kind: ServiceMonitor
metadata:
  name: {{ include "modelservice.fullname" . }}-monitor
  namespace: {{ .Release.Namespace }}
Please don't encode the namespace, leave it up to the chart to decide.
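A sketch of what the resource could look like with both review requests applied (no hard-coded `namespace`, no `release: prometheus` label). It is written as a plain manifest emitted from a shell function rather than as the chart's Helm template, and the function name, default resource name, selector, and `metrics` port name are placeholders, not values taken from the chart:

```shell
#!/bin/sh
# Sketch only: emits a ServiceMonitor manifest with no metadata.namespace,
# so the namespace is decided at install time (for example `helm install -n <ns>`).
# The default name, the selector label, and the port name are hypothetical.
render_servicemonitor() {
  name="${1:-modelservice}"
  cat <<EOF
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: ${name}-monitor
  labels:
    app.kubernetes.io/component: modelservice
spec:
  selector:
    matchLabels:
      app.kubernetes.io/component: modelservice
  endpoints:
    - port: metrics
EOF
}
```

Leaving `metadata.namespace` out means the same manifest works in whichever namespace the release is installed into.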
tumido left a comment
Sorry, I've merged another PR, can you bump to 0.3.0 🙂 ?
tumido left a comment
/lgtm
/approve
@nerdalert @cfchase PTAL for the quickstart part
nerdalert left a comment
Tested in minikube and /LGTM tyvm!
Signed-off-by: sallyom <somalley@redhat.com>
modelservice servicemonitor (#22)
* add modelservice servicemonitor
* disable modelservice metrics in CI
* update installers for metrics collection
* update quickstart READMEs for metrics
* bump chart version
This PR adds ServiceMonitor values and template for modelservice.
The installer scripts need to be updated to handle the case where the ServiceMonitor CRD does not exist.
Updates llm-d installer with the following:
- New --disable-metrics-collection flag to optionally disable metrics collection (metrics enabled by default)
- Automatic detection of existing ServiceMonitor CRD (e.g., in OpenShift) to avoid duplicate installations
- Smart metrics configuration:
In the simplest case, the user does not provide the flag and:
The Prometheus stack installation is intentionally minimal, with only essential configurations.
This change allows llm-d to work seamlessly in various environments.
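The installer behavior described in this PR (the `--disable-metrics-collection` flag and the ServiceMonitor CRD check) can be sketched as follows. This is not the installer's actual code: the function names and the `KUBECTL` override are hypothetical. The two `--set` keys are the ones named in the commit history for this repository (`modelservice.metrics.enabled` and `modelservice.serviceMonitor.enabled`).

```shell
#!/bin/sh
# Sketch only: function names are illustrative, not the installer's actual code.

# Returns 0 if the ServiceMonitor CRD is already installed in the cluster.
# KUBECTL can be overridden, which also makes the check easy to stub.
servicemonitor_crd_exists() {
  ${KUBECTL:-kubectl} get crd servicemonitors.monitoring.coreos.com >/dev/null 2>&1
}

# Prints extra Helm arguments for the llm-d chart.
# $1 is "true" when --disable-metrics-collection was passed.
metrics_helm_args() {
  if [ "${1:-false}" = "true" ]; then
    printf '%s\n' "--set modelservice.metrics.enabled=false --set modelservice.serviceMonitor.enabled=false"
  fi
}

# Prints the action to take for the Prometheus stack.
plan_metrics_setup() {
  if [ "${1:-false}" = "true" ]; then
    echo "skip: metrics collection disabled"
  elif servicemonitor_crd_exists; then
    echo "reuse: ServiceMonitor CRD already present"
  else
    echo "install: minimal Prometheus stack"
  fi
}
```

Keeping the decision in `plan_metrics_setup` separate from the `kubectl` call means the branching can be exercised without a cluster by pointing `KUBECTL` at a stub command.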