11# ExaMon v0.5.1 -- Technical Plan
22
33This document captures the planned improvements for v0.5.1, building on the
4- v0.5.0 Kubernetes migration. It covers two main areas:
4+ v0.5.0 Kubernetes migration. It covers three main areas:
55
66 1. **Data persistence, backup, and migration strategy** -- filling the gaps
77 identified in the v0.5.0 deployment.
88 2. **KairosDB HOCON ConfigMap overlay** -- replacing the current `sed`-patching
99 entrypoint with a cleaner, Kubernetes-native configuration model.
10+ 3. **Documentation gap analysis (README parity)** -- ensuring every operation
11+ documented in the Docker Compose `README.md` has a complete K8s equivalent.
1012
11- Both items are documented here as a technical plan; implementation follows in a
13+ All items are documented here as a technical plan; implementation follows in a
1214subsequent phase.
1315
1416---
@@ -657,17 +659,10 @@ ingress → grafana (3000), examon-server (5000)
657659
658660---
659661
660- ## 4. Implementation Phases
662+ ## 4. Implementation Phases (Original)
661663
662- | Phase | Items | Depends On | Effort |
663- |-------|-------|-----------|--------|
664- | **Phase 1** | Cassandra Medusa backups, PV reclaim policy docs, Mosquitto persistence fix | -- | Medium |
665- | **Phase 2** | KairosDB HOCON ConfigMap overlay, updated Dockerfile | -- | Medium |
666- | **Phase 3** | Grafana dashboard-as-code + API backup script | -- | Medium |
667- | **Phase 4** | Security contexts on all workloads | Phase 2 (KairosDB readOnlyRootFilesystem) | Medium |
668- | **Phase 5** | PDBs, image versioning, expanded migration docs | -- | Low-Medium |
669- | **Phase 6** | Observability (ServiceMonitors, alert rules, dashboards) | Phase 3 (Grafana provisioning) | Medium |
670- | **Phase 7** | HPAs, NetworkPolicies | Phase 6 (metrics for HPA decisions) | Medium |
664+ See [section 7](#7-implementation-phases-updated) for the updated phase plan
665+ that includes documentation gap items from section 6.
671666
672667---
673668
@@ -684,3 +679,326 @@ ingress → grafana (3000), examon-server (5000)
684679 backup/restore operator for Cassandra
685680- [Grafana sidecar provisioning](https://github.com/grafana/helm-charts/tree/main/charts/grafana#sidecar-for-dashboards) --
686681 dashboard-as-code via ConfigMap sidecar
682+
683+ ---
684+
685+ ## 6. Documentation Gap Analysis (README vs. K8s Docs)
686+
687+ A systematic comparison of the Docker Compose `README.md` against the full
688+ Kubernetes documentation identified six areas where the K8s docs do not cover
689+ functionality that users expect based on the README. These must be addressed
690+ so that every operation documented for Docker Compose has a clear, complete
691+ K8s equivalent.
692+
693+ ### 6.1 Gap Overview
694+
695+ | # | README Feature | K8s Coverage | Severity | Environments |
696+ |---|----------------|-------------|----------|--------------|
697+ | D1 | Grafana dashboard import/provisioning | Not covered | **HIGH** | All |
698+ | D2 | Plugin management (start/stop/restart/status/logs) | Partially covered | **MEDIUM** | All |
699+ | D3 | Grafana first login & datasource setup walkthrough | Auto-provisioned but undocumented | **MEDIUM** | Mostly local |
700+ | D4 | Plugin `.conf` files replaced by Helm values | Not explained | **LOW** | All |
701+ | D5 | Custom volume paths / data persistence model | Partially covered | **LOW** | Production |
702+ | D6 | Log rotation and retention configuration | Not covered | **LOW** | Production |
703+
704+ ### 6.2 D1 -- Grafana Dashboard Import and Provisioning
705+
706+ **README reference:**
707+
708+ > "To import the dashboards stored in the `dashboards/` folder:
709+ > [Import dashboard](https://grafana.com/docs/grafana/latest/dashboards/export-import/#import-dashboard)
710+ > To test the installation, you can import the `Examon Test - Random Sensor.json` dashboard."
711+
712+ **Current K8s state:**
713+
714+ - The `values.yaml` auto-provisions the KairosDB **datasource** via
715+ `grafana.datasources.datasources.yaml`, which is correctly handled.
716+ - The `dashboards/Examon Test - Random Sensor.json` file exists in the repo
717+ but is never referenced by any K8s template or documentation.
718+ - The upgrading guide (`upgrading.md` Step 4) only says "Re-import Grafana
719+ dashboards through the Grafana UI or API" with no further instructions.
720+ - The v0.5.1 plan (section 1.2 GAP 2) already describes the Grafana sidecar
721+ dashboard-as-code strategy but only as a future plan, not documentation.
722+
723+ **TODO — Documentation additions:**
724+
725+ - [ ] **D1a:** Add a "Grafana Dashboards" section to `kubernetes.md` covering:
726+ - Manual dashboard import via Grafana UI (same procedure as README, with
727+ K8s-specific URL: `http://localhost:3000` for local, or the Ingress URL
728+ for production).
729+ - API-based import (useful for scripting):
730+ ```bash
731+ # Port-forward to Grafana (if not using NodePort/Ingress)
732+ kubectl port-forward svc/examon-grafana 3000:80 -n examon &
733+
734+ # Import dashboard JSON. The API expects a {"dashboard": ...} envelope, so jq (assumed installed) wraps the plain export.
735+ jq '{dashboard: (. + {id: null}), overwrite: true}' "dashboards/Examon Test - Random Sensor.json" |
736+   curl -X POST -H "Content-Type: application/json" -u "admin:<password>" \
737+     -d @- \
738+     http://localhost:3000/api/dashboards/db
739+ ```
740+ - A note that the bundled `dashboards/Examon Test - Random Sensor.json`
741+ can be used to verify the full data pipeline after installation.
742+
743+ - [ ] **D1b:** (Implementation item, already tracked in section 1.2 GAP 2)
744+ Enable Grafana sidecar dashboard provisioning so that bundled dashboards
745+ are automatically loaded from ConfigMaps on deploy. This eliminates the
746+ manual import step for core dashboards. Required changes:
747+ - Set `grafana.sidecar.dashboards.enabled: true` and
748+ `grafana.sidecar.dashboards.label: grafana_dashboard` in `values.yaml`.
749+ - Create a ConfigMap template in the umbrella chart's `templates/` that
750+ loads `dashboards/*.json` files with the `grafana_dashboard` label.
751+ - Document: "Core dashboards are auto-provisioned on deploy. User-created
752+ dashboards must be imported manually or backed up via the API."
753+
754+ - [ ] **D1c:** Add environment-specific guidance:
755+ - **Local/Staging:** Manual UI import is acceptable for testing. The
756+ auto-provisioned test dashboard (after D1b) provides an out-of-the-box
757+ verification experience.
758+ - **Production:** Dashboard-as-code is recommended. User-created dashboards
759+ should be exported via the Grafana API and stored in version control.
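
The sidecar approach from D1b can be tried by hand before the Helm template exists. The following is a minimal sketch, not the planned chart template: it assumes the sidecar watches the `grafana_dashboard` label and accepts any label value (`1` is used here). The function names, the `dashboard-` name prefix, and the `examon` namespace default are illustrative.

```bash
#!/usr/bin/env bash
# Sketch: load dashboards/*.json into labelled ConfigMaps for the Grafana sidecar.
# Function names, the namespace default and the label value "1" are assumptions.

# Derive a DNS-safe ConfigMap name from a dashboard file name.
dashboard_cm_name() {
  local base name
  base=$(basename "$1" .json)
  # Lowercase, then squeeze every run of non-alphanumerics into a single dash.
  name=$(printf '%s' "$base" | tr '[:upper:]' '[:lower:]' | tr -cs 'a-z0-9' '-')
  name=${name#-}
  name=${name%-}
  printf 'dashboard-%s\n' "$name"
}

provision_dashboards() {
  local dir=${1:-dashboards} ns=${2:-examon} file name
  for file in "$dir"/*.json; do
    [ -e "$file" ] || continue
    name=$(dashboard_cm_name "$file")
    # key=path form: file names containing spaces are not valid ConfigMap keys.
    kubectl create configmap "$name" --from-file="$name.json=$file" \
      -n "$ns" --dry-run=client -o yaml | kubectl apply -n "$ns" -f -
    kubectl label configmap "$name" grafana_dashboard=1 -n "$ns" --overwrite
  done
}
```

Calling `provision_dashboards dashboards examon` from the repository root would create one ConfigMap per dashboard file. The explicit `key=path` form matters for `Examon Test - Random Sensor.json`, whose spaces would otherwise be rejected as a ConfigMap key.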
760+
761+ ### 6.3 D2 -- Plugin Management (Start/Stop/Restart/Status/Logs)
762+
763+ **README reference:**
764+
765+ The README has a large section on `supervisorctl` commands:
766+ - `docker exec -it <container> supervisorctl start/stop/restart/status/tail <plugin>`
767+ - Opening the supervisor shell for interactive management.
768+ - Editing `supervisor.conf` for `autostart=True` to persist enable/disable.
769+
770+ **Current K8s state:**
771+
772+ - `kubernetes.md` has a brief "Managing Plugins" section showing
773+ `--set random-pub.enabled=false` and `kubectl scale`.
774+ - The `kubernetes.md` "Logs" section shows `kubectl logs` commands.
775+ - No mapping table exists to help Docker Compose users translate their
776+ existing workflows.
777+
778+ **TODO — Documentation additions:**
779+
780+ - [ ] **D2a:** Expand the "Managing Plugins" section in `kubernetes.md` with a
781+ complete mapping table between Docker Compose `supervisorctl` operations and
782+ their Kubernetes equivalents:
783+
784+ | Operation | Docker Compose | Kubernetes |
785+ |-----------|---------------|------------|
786+ | **Start** a plugin | `docker exec -it examon supervisorctl start plugins:random_pub` | `kubectl scale deployment examon-random-pub --replicas=1 -n examon` |
787+ | **Stop** a plugin | `docker exec -it examon supervisorctl stop plugins:random_pub` | `kubectl scale deployment examon-random-pub --replicas=0 -n examon` |
788+ | **Restart** a plugin | `docker exec -it examon supervisorctl restart plugins:random_pub` | `kubectl rollout restart deployment/examon-random-pub -n examon` |
789+ | **Status** of all plugins | `docker exec -it examon supervisorctl status` | `kubectl get pods -n examon` |
790+ | **Tail logs** of a plugin | `docker exec -it examon supervisorctl tail -f random_pub` | `kubectl logs -f deployment/examon-random-pub -n examon` |
791+ | **Permanent enable** | Edit `supervisor.conf`, set `autostart=True` | Set `random-pub.enabled: true` in `values-<env>.yaml` + `helm upgrade` |
792+ | **Permanent disable** | Edit `supervisor.conf`, set `autostart=False` | Set `random-pub.enabled: false` in `values-<env>.yaml` + `helm upgrade` |
793+ | **Scale** a plugin | Not supported (single container) | `kubectl scale deployment examon-mqtt2kairosdb --replicas=3 -n examon` |
794+
795+ - [ ] **D2b:** Add a note explaining the architectural difference: in Docker
796+ Compose, all plugins run inside a single container managed by `supervisord`;
797+ in Kubernetes, each plugin is an independent Deployment with its own pod(s),
798+ enabling independent scaling, restarts, and resource limits.
799+
800+ - [ ] **D2c:** Document how to get a shell inside a plugin pod for debugging:
801+ ```bash
802+ kubectl exec -it deployment/examon-examon-server -n examon -- /bin/bash
803+ ```
804+ This is the K8s equivalent of `docker exec -it <container> bash`.
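
For users who want to keep their `supervisorctl` habits, the mapping table from D2a could be wrapped in a small helper. This is a sketch under assumptions: the function name `plugin_ctl` and the `EXAMON_NS` variable are hypothetical, and the `examon-<plugin>` Deployment naming is taken from the examples in the table.

```bash
#!/usr/bin/env bash
# Sketch: supervisorctl-style verbs mapped onto the kubectl commands from the table.
# plugin_ctl and EXAMON_NS are illustrative names, not part of ExaMon.
plugin_ctl() {
  local action=${1:-} plugin=${2:-} ns=${EXAMON_NS:-examon}
  if [ "$action" != status ] && [ -z "$plugin" ]; then
    echo "usage: plugin_ctl {start|stop|restart|tail} <plugin> | plugin_ctl status" >&2
    return 2
  fi
  case "$action" in
    start)   kubectl scale deployment "examon-$plugin" --replicas=1 -n "$ns" ;;
    stop)    kubectl scale deployment "examon-$plugin" --replicas=0 -n "$ns" ;;
    restart) kubectl rollout restart "deployment/examon-$plugin" -n "$ns" ;;
    status)  kubectl get pods -n "$ns" ;;
    tail)    kubectl logs -f "deployment/examon-$plugin" -n "$ns" ;;
    *)       echo "unknown action: $action" >&2; return 2 ;;
  esac
}
```

For example, `plugin_ctl stop random-pub` scales the `examon-random-pub` Deployment to zero replicas. Permanent enable/disable still belongs in the `values-<env>.yaml` files, as the table describes.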
805+
806+ ### 6.4 D3 -- Grafana First Login and Datasource Setup Walkthrough
807+
808+ **README reference:**
809+
810+ > "Log in to the Grafana server using your browser and the default credentials...
811+ > http://localhost:3000 ... add a new data source and select `KairosDB`.
812+ > Fill out the form with: Name: kairosdb, Url: http://kairosdb:8083, Access: Server"
813+
814+ **Current K8s state:**
815+
816+ - The KairosDB datasource is **auto-provisioned** via
817+ `grafana.datasources.datasources.yaml` in `values.yaml`. Users do NOT need
818+ to add it manually, but this is never stated.
819+ - Grafana plugins (KairosDB, plotly, piechart, etc.) are auto-installed via
820+ `grafana.plugins` in `values.yaml`, which is also not documented as a feature.
821+ - The `GF_PANELS_DISABLE_SANITIZE_HTML` setting is auto-configured via
822+ `grafana.env`, which is not documented either.
823+ - No "first login" walkthrough exists for K8s users.
824+ - The production guide briefly mentions auto-provisioning and has a manual
825+ fallback, but local/staging docs skip this entirely.
826+
827+ **TODO — Documentation additions:**
828+
829+ - [ ] **D3a:** Add a "Grafana First Login" subsection to `kubernetes-local.md`
830+ (after the data pipeline verification step), covering:
831+ 1. Open `http://localhost:3000` (for K3d) or the Ingress URL (production).
832+ 2. Log in as `admin` with the password set via `--set grafana.adminPassword`
833+ (default: `Password` if not overridden).
834+ 3. Note: the KairosDB datasource is **already configured**, so no manual setup
835+ is needed (unlike Docker Compose).
836+ 4. Note: the KairosDB Grafana plugin and other required plugins are
837+ **auto-installed** during pod startup via the `grafana.plugins` list.
838+ 5. Link to the dashboard import section (D1a) for importing test dashboards.
839+
840+ - [ ] **D3b:** Add a "What is auto-configured" callout in `kubernetes.md` listing
841+ all the things that are automatically set up (datasources, plugins, HTML
842+ sanitization) so users understand they don't need to replicate the README's
843+ manual configuration steps.
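
The first-login walkthrough could include a quick check that the datasource really was provisioned. The sketch below queries Grafana's datasource listing endpoint and looks for a KairosDB entry. The function name and the `GRAFANA_URL`, `GRAFANA_USER` and `GRAFANA_PASSWORD` variables are assumptions, and the plain-text match on `kairosdb` is deliberately loose.

```bash
#!/usr/bin/env bash
# Sketch: confirm that a KairosDB datasource was auto-provisioned.
# The function name and the GRAFANA_* variables are illustrative assumptions.
check_kairosdb_datasource() {
  local url=${GRAFANA_URL:-http://localhost:3000}
  local user=${GRAFANA_USER:-admin}
  local password=${GRAFANA_PASSWORD:?set GRAFANA_PASSWORD first}
  # GET /api/datasources returns a JSON array of the configured datasources.
  if curl -fsS -u "$user:$password" "$url/api/datasources" | grep -qi 'kairosdb'; then
    echo "KairosDB datasource is provisioned"
  else
    echo "KairosDB datasource not found" >&2
    return 1
  fi
}
```

With a port-forward to Grafana in place, `GRAFANA_PASSWORD=<password> check_kairosdb_datasource` prints a one-line result and returns non-zero when nothing matches.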
844+
845+ ### 6.5 D4 -- Plugin Configuration Files Replaced by Helm Values
846+
847+ **README reference:**
848+
849+ > "It is necessary to define all the properties of the `.conf` configuration file
850+ > of the plugins with the appropriate values related to the server hosting the
851+ > framework. In particular, it is necessary to define the IP addresses and ports
852+ > of the server where the KairosDB and/or MQTT broker services run..."
853+
854+ Points users to `publishers/random_pub/random_pub.conf` and per-plugin READMEs.
855+
856+ **Current K8s state:**
857+
858+ - The `change-propagation.md` explains that `.conf` files are generated from
859+ Helm values via ConfigMap templates, and `configuration.md` lists all
860+ parameters. However, nowhere does the documentation explicitly say: "the
861+ old `.conf` files in `publishers/` and `web/examon-server/` are **not used**
862+ in K8s. All configuration is driven by Helm `values.yaml`."
863+ - A user coming from Docker Compose would naturally look for `.conf` files
864+ to edit and would be confused.
865+
866+ **TODO — Documentation additions:**
867+
868+ - [ ] **D4a:** Add a paragraph to `kubernetes.md` (in the "Configuration" or
869+ a new "Configuration Model" section) explicitly stating:
870+
871+ > In the Kubernetes deployment, plugin configuration files (`.conf`) are
872+ > **generated automatically** from Helm values and mounted as ConfigMaps.
873+ > Do not edit the `.conf` files in `publishers/` or `web/examon-server/`
874+ > directly — those files are only used by the Docker Compose deployment.
875+ > All configuration is managed via `values.yaml` and environment-specific
876+ > override files.
877+
878+ - [ ] **D4b:** Add a mapping reference of key old `.conf` fields to new Helm
879+ values, either in `configuration.md` or `upgrading.md`:
880+
881+ | Old field (`random_pub.conf`) | New Helm value |
882+ |-------------------------------|---------------|
883+ | `MQTT_BROKER` | `random-pub.config.mqttBroker` |
884+ | `MQTT_PORT` | `random-pub.config.mqttPort` |
885+ | `MQTT_TOPIC` | `random-pub.config.mqttTopic` |
886+ | `MQTT_USER` | `random-pub.config.mqttUser` |
887+ | `MQTT_PASSWORD` | `random-pub.config.mqttPassword` |
888+ | `NUM_SENSORS` | `random-pub.config.numSensors` |
889+ | `TS` (sample interval) | `random-pub.config.sampleInterval` |
890+
891+ Similar tables for `mqtt2kairosdb.conf` and `server.conf`.
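
The field mapping above also lends itself to a small migration helper that turns an existing `random_pub.conf` into `--set` flags. This is a sketch only: it assumes the `.conf` file uses `KEY = value` lines (section headers and `#` comments are skipped), the Helm value names are the ones proposed in the table, and both function names are hypothetical.

```bash
#!/usr/bin/env bash
# Sketch: translate an old random_pub.conf into `helm upgrade --set` flags.
# Assumes "KEY = value" lines; Helm value names come from the mapping table.

# Map an old .conf field to its Helm value path; fails for unmapped fields.
helm_key_for() {
  case "$1" in
    MQTT_BROKER)   echo random-pub.config.mqttBroker ;;
    MQTT_PORT)     echo random-pub.config.mqttPort ;;
    MQTT_TOPIC)    echo random-pub.config.mqttTopic ;;
    MQTT_USER)     echo random-pub.config.mqttUser ;;
    MQTT_PASSWORD) echo random-pub.config.mqttPassword ;;
    NUM_SENSORS)   echo random-pub.config.numSensors ;;
    TS)            echo random-pub.config.sampleInterval ;;
    *)             return 1 ;;
  esac
}

conf_to_set_flags() {
  local conf_file=$1 line key value helm_key
  local re='^[[:space:]]*([A-Za-z_][A-Za-z0-9_]*)[[:space:]]*=[[:space:]]*(.*)$'
  while IFS= read -r line || [[ -n $line ]]; do
    [[ $line =~ $re ]] || continue          # skips comments and [sections]
    key=${BASH_REMATCH[1]}
    value=${BASH_REMATCH[2]}
    value=${value%"${value##*[![:space:]]}"} # trim trailing whitespace
    helm_key=$(helm_key_for "$key") || continue
    printf -- '--set %s=%s\n' "$helm_key" "$value"
  done < "$conf_file"
}
```

Running `conf_to_set_flags publishers/random_pub/random_pub.conf` prints one flag per mapped field, which can be reviewed before being copied into a `values-<env>.yaml` override. Fields without a mapping are silently ignored, so the output is a starting point and not a complete migration.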
892+
893+ ### 6.6 D5 -- Data Persistence Model and Custom Volume Paths
894+
895+ **README reference:**
896+
897+ > "Two Docker volumes are created... `examon_cassandra_volume`, `examon_grafana_volume`.
898+ > To set a custom volume path, use `driver_opts` with `type: none`, `device: /path/...`"
899+
900+ **Current K8s state:**
901+
902+ - `configuration.md` documents `cassandra.datacenters.dc1.storageClass`,
903+ `grafana.persistence.enabled/size`, and `mosquitto.persistence.enabled/size`.
904+ - The production guide has a "Storage" section for Cassandra StorageClass.
905+ - Section 1 of this plan document covers backup/snapshot gaps.
906+ - However, there is no unified "Data Persistence" section explaining how K8s
907+ PVCs replace Docker volumes, what happens to data on pod restart vs.
908+ `helm uninstall` vs. cluster deletion, or how to specify custom storage
909+ paths.
910+
911+ **TODO — Documentation additions:**
912+
913+ - [ ] **D5a:** Add a "Data Persistence" section to `kubernetes.md` covering:
914+ - How PVCs replace Docker volumes (the K8s equivalent).
915+ - What data survives what operations:
916+
917+ | Event | Cassandra data | Grafana data |
918+ |-------|:---:|:---:|
919+ | Pod restart | Survives | Survives |
920+ | `helm upgrade` | Survives | Survives |
921+ | `helm uninstall` | Depends on reclaim policy | Depends on reclaim policy |
922+ | K3d cluster delete | **Lost** (local only) | **Lost** (local only) |
923+ | Node failure (production) | Survives (replicas) | Survives (if PVC on shared storage) |
924+
925+ - Environment-specific behavior:
926+ - **Local (K3d):** Uses `local-path` provisioner. Data persists across
927+ pod restarts but is lost when the K3d cluster is deleted.
928+ - **Staging (K3d):** Same as local; data is ephemeral to the cluster.
929+ - **Production:** Set `storageClass` to match infrastructure
930+ (e.g., `cinder-ssd`, `gp3`, `longhorn`). Use `reclaimPolicy: Retain`
931+ for critical data (see section 1.2 GAP 4).
932+
933+ - [ ] **D5b:** Add the K8s equivalent of "custom volume path": explain how to
934+ use a `hostPath` PersistentVolume or a `local` StorageClass for bare-metal
935+ deployments where data must reside on a specific disk/partition.
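
Because the `helm uninstall` row depends on the reclaim policy, the persistence section could show how to protect volumes that already exist. A `reclaimPolicy` on a StorageClass only affects newly provisioned volumes, whereas existing PersistentVolumes can be patched individually. A minimal sketch, with a hypothetical function name and `examon` as the assumed namespace:

```bash
#!/usr/bin/env bash
# Sketch: switch the PVs behind a namespace's PVCs to the Retain reclaim policy.
# The function name and the "examon" namespace default are assumptions.
retain_namespace_pvs() {
  local ns=${1:-examon} pv
  # spec.volumeName on each bound PVC is the name of its PersistentVolume.
  for pv in $(kubectl get pvc -n "$ns" -o jsonpath='{.items[*].spec.volumeName}'); do
    kubectl patch pv "$pv" \
      -p '{"spec":{"persistentVolumeReclaimPolicy":"Retain"}}'
  done
}
```

After `retain_namespace_pvs examon`, running `kubectl get pv` should list `Retain` in the reclaim policy column for the affected volumes. Retained volumes have to be cleaned up manually once they are no longer needed.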
936+
937+ ### 6.7 D6 -- Log Rotation and Retention
938+
939+ **README reference (via docker-compose.yml):**
940+
941+ Each service in `docker-compose.yml` specifies logging configuration:
942+ ```yaml
943+ logging:
944+   driver: json-file
945+   options:
946+     max-size: "10m"
947+     max-file: "1"
948+ ```
949+
950+ **Current K8s state:**
951+
952+ - The `kubernetes.md` "Logs" section shows `kubectl logs` commands but says
953+ nothing about log rotation, retention, or aggregation.
954+
955+ **TODO — Documentation additions:**
956+
957+ - [ ] **D6a:** Add a note to the "Logs" section of `kubernetes.md` explaining:
958+ - In Kubernetes, container log rotation is handled at the **node level** by
959+ the kubelet together with the container runtime (containerd/CRI-O), not by
960+ individual services. The closest equivalents of the Docker Compose `json-file`
961+ options are the kubelet settings `containerLogMaxSize` and `containerLogMaxFiles`.
962+ - K3d default: logs rotate at ~10MB per container (the kubelet default of `10Mi`).
963+ - Production recommendation: the default node-level rotation is sufficient
964+ for cluster-level operations, but for persistent, searchable logs,
965+ deploy a **log aggregation stack**:
966+ - **Loki + Promtail + Grafana** (lightweight, integrates with existing
967+ Grafana — recommended for ExaMon)
968+ - **EFK** (Elasticsearch + Fluentd + Kibana) — heavier but more mature
969+ - **Cloud-native** (CloudWatch, GCP Logging, Azure Monitor) — if on
970+ managed K8s
971+
972+ - [ ] **D6b:** For production, add a brief note in `kubernetes-production.md`
973+ recommending a log retention strategy and linking to the Logs section.
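
To make the relationship to the Compose settings concrete, the Logs section could show how `max-size` and `max-file` translate into kubelet log-rotation flags. This is a sketch under assumptions: Docker's `k`/`m`/`g` suffixes are treated as `Ki`/`Mi`/`Gi`, and the file count is raised to 2 because the kubelet is expected to reject smaller values. The function names are hypothetical.

```bash
#!/usr/bin/env bash
# Sketch: translate Docker Compose json-file options into kubelet log-rotation flags.
# Treating Docker's k/m/g suffixes as Ki/Mi/Gi is an assumption of this sketch.
docker_size_to_quantity() {
  local size=$1
  case "$size" in
    *[kK]) echo "${size%?}Ki" ;;
    *[mM]) echo "${size%?}Mi" ;;
    *[gG]) echo "${size%?}Gi" ;;
    *)     echo "$size" ;;   # plain byte count, passed through unchanged
  esac
}

kubelet_log_flags() {
  local max_size=$1 max_file=$2
  # Assumption: the kubelet needs at least two log files per container,
  # so Compose's max-file: "1" is raised to 2.
  if [ "$max_file" -lt 2 ]; then
    max_file=2
  fi
  echo "--container-log-max-size=$(docker_size_to_quantity "$max_size")"
  echo "--container-log-max-files=$max_file"
}
```

`kubelet_log_flags 10m 1` prints the two flags for the Compose values quoted above. On K3s-based clusters such as K3d, kubelet flags are normally passed through the K3s `--kubelet-arg` option; on managed clusters the equivalent fields live in the kubelet configuration file.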
974+
975+ ### 6.8 Implementation Priority
976+
977+ | Priority | TODO | Effort | Depends On |
978+ |----------|------|--------|-----------|
979+ | **P0** | D1a (dashboard import docs) | Low | — |
980+ | **P0** | D2a (plugin management mapping table) | Low | — |
981+ | **P0** | D3a (Grafana first login) | Low | — |
982+ | **P1** | D1b (dashboard-as-code provisioning) | Medium | Grafana sidecar config |
983+ | **P1** | D4a, D4b (conf file replacement docs) | Low | — |
984+ | **P1** | D5a (persistence model docs) | Low | — |
985+ | **P2** | D1c (per-environment dashboard guidance) | Low | D1a |
986+ | **P2** | D2b, D2c (architecture note, exec shell) | Low | — |
987+ | **P2** | D3b (auto-configured callout) | Low | — |
988+ | **P2** | D5b (custom volume paths for bare metal) | Low | — |
989+ | **P2** | D6a, D6b (logging docs) | Low | — |
990+
991+ ---
992+
993+ ## 7. Implementation Phases (Updated)
994+
995+ | Phase | Items | Depends On | Effort |
996+ |-------|-------|-----------|--------|
997+ | **Phase 1** | Cassandra Medusa backups, PV reclaim policy docs, Mosquitto persistence fix | — | Medium |
998+ | **Phase 2** | KairosDB HOCON ConfigMap overlay, updated Dockerfile | — | Medium |
999+ | **Phase 3** | Grafana dashboard-as-code + API backup script | — | Medium |
1000+ | **Phase 4** | Documentation gaps D1–D6 (README parity) | D1b depends on Phase 3 | Low-Medium |
1001+ | **Phase 5** | Security contexts on all workloads | Phase 2 (readOnlyRootFilesystem) | Medium |
1002+ | **Phase 6** | PDBs, image versioning, expanded migration docs | — | Low-Medium |
1003+ | **Phase 7** | Observability (ServiceMonitors, alert rules, dashboards) | Phase 3 (Grafana provisioning) | Medium |
1004+ | **Phase 8** | HPAs, NetworkPolicies | Phase 7 (metrics for HPA decisions) | Medium |