Thanks for opening an issue for the M3DB Operator! We'd love to help you, but we need the following information included
with any issue:
- What version of the operator are you running? Please include the docker tag. If using master, please include the git SHA logged when the operator first starts.
v0.10.0
- What version of Kubernetes are you running? Please include the output of `kubectl version`.
```
❯ kubectl version
Client Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.5", GitCommit:"e6503f8d8f769ace2f338794c914a96fc335df0f", GitTreeState:"clean", BuildDate:"2020-06-27T00:38:11Z", GoVersion:"go1.14.4", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"16", GitVersion:"v1.16.12", GitCommit:"17c50ce2d686f4346924935063e3a431360e0db7", GitTreeState:"clean", BuildDate:"2020-06-26T03:33:27Z", GoVersion:"go1.13.9", Compiler:"gc", Platform:"linux/amd64"}
```
- What are you trying to do?
Increase the instances per isolation group of our m3db cluster by 1, i.e., add 3 nodes to the cluster, one per replica.
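To make the change concrete, the edit to our M3DBCluster spec looked roughly like this (a minimal sketch; the isolation group names and instance counts are illustrative, not our exact values):

```yaml
apiVersion: operator.m3db.io/v1alpha1
kind: M3DBCluster
metadata:
  name: m3db
  namespace: m3
spec:
  replicationFactor: 3
  isolationGroups:
    - name: group1
      numInstances: 4   # bumped from 3; one extra instance per group (counts illustrative)
    - name: group2
      numInstances: 4
    - name: group3
      numInstances: 4
```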
- What did you expect to happen?
We expected the operator to detect that it needs to begin adding the new nodes.
- What happened?
The operator doesn't scale up the cluster. We see logs that look like this:
{"level":"info","ts":"2021-01-29T19:05:39.751Z","msg":"statefulset already exists","controller":"m3db-cluster-controller","name":"m3db-rep0"}
{"level":"info","ts":"2021-01-29T19:05:39.751Z","msg":"successfully synced item","controller":"m3db-cluster-controller","key":"m3/m3db"}
{"level":"info","ts":"2021-01-29T19:05:40.254Z","msg":"processing pod","controller":"m3db-cluster-controller","pod.namespace":"m3","pod.name":"m3db-rep2-1"}
{"level":"info","ts":"2021-01-29T19:05:40.254Z","msg":"processing pod","controller":"m3db-cluster-controller","pod.namespace":"m3","pod.name":"m3db-rep0-2"}
{"level":"info","ts":"2021-01-29T19:05:40.254Z","msg":"processing pod","controller":"m3db-cluster-controller","pod.namespace":"m3","pod.name":"m3db-rep2-7"}
{"level":"info","ts":"2021-01-29T19:05:40.254Z","msg":"processing pod","controller":"m3db-cluster-controller","pod.namespace":"m3","pod.name":"m3db-rep0-16"}
{"level":"info","ts":"2021-01-29T19:05:40.254Z","msg":"processing pod","controller":"m3db-cluster-controller","pod.namespace":"m3","pod.name":"m3db-rep2-4"}
We previously saw the same issue when using v0.7.0 of the operator, with these logs:

```
{"level":"error","ts":"2021-01-28T21:18:58.717Z","msg":"statefulsets.apps \"m3db-rep0\" already exists","controller":"m3db-cluster-controller"}
E0128 21:18:58.717342 1 controller.go:319] error syncing cluster 'm3/m3db': statefulsets.apps "m3db-rep0" already exists
```
At that time, we chatted with @robskillington, who suggested upgrading to 0.8.0 or newer, which has better state syncing in large k8s clusters and should reduce issues caused by stale views of objects, such as a statefulset not being seen as existing.
We hoped upgrading to v0.10.0 would resolve it, but the same issue seems to persist, although the "statefulset already exists" log is now at info level rather than error.
We're trying to understand how "statefulset already exists" relates to the operator not beginning to scale up the cluster. We're still unsure whether this is an issue on our k8s cluster's side or a bug in the operator.
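In case it helps, this is roughly how we've been comparing desired vs. actual state (a sketch; it assumes our three isolation groups map to StatefulSets m3db-rep0 through m3db-rep2, as the logs suggest):

```
# Replica counts the StatefulSets currently have
kubectl -n m3 get sts m3db-rep0 m3db-rep1 m3db-rep2 \
  -o custom-columns=NAME:.metadata.name,REPLICAS:.spec.replicas

# Instance counts the M3DBCluster spec asks for
kubectl -n m3 get m3dbcluster m3db \
  -o jsonpath='{range .spec.isolationGroups[*]}{.name}{"\t"}{.numInstances}{"\n"}{end}'
```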
Other things we've tried:
- [didn't work] edit the m3dbcluster back to the original number of instances, then restart the operator, then edit the m3dbcluster back up to the desired number of instances
- [worked] delete the m3db-rep0 statefulset (the operator doesn't recreate the sts right away), then restart the operator; after that, the operator created the new statefulset with the desired number of instances and started scaling up the cluster (see the sketch after this list)
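For completeness, the workaround that worked was roughly the following (a sketch; orphaning the pods so the data nodes stay up, and the operator's namespace and StatefulSet name, are our assumptions rather than anything prescribed by the operator docs):

```
# Delete only the StatefulSet object; --cascade=false (spelled --cascade=orphan
# on kubectl >= 1.20) leaves its pods running
kubectl -n m3 delete sts m3db-rep0 --cascade=false

# Restart the operator; assumes it runs as a StatefulSet named m3db-operator
# in the default namespace, per the standard bundle
kubectl -n default rollout restart sts/m3db-operator
```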