CAPI can incorrectly report Diego Actual LRP state #4220

@Samze

Description

In some circumstances, CAPI reports that an app instance is running when it is actually down.

Reproduction

Steps to reproduce:

  1. Have a cf app with 4 running instances.
  2. On the Diego brain, confirm with cfdot that there are 4 actual_lrps.
  3. Kill the Diego cell VM with bosh delete-vm.
  4. On the Diego brain, cfdot now shows 8 actual_lrps (4 running, 4 unclaimed). Each app instance has two entries, one running and one down.
  5. The duplicate entries persist until Diego is restored.

CAPI iterates over all actual_lrps returned from Diego (8 in this case) and uses the app index as the key. When duplicates exist, CAPI therefore overwrites each app instance's information once, and the state shown is determined by the order in which the actual_lrps are returned. See https://github.com/cloudfoundry/cloud_controller_ng/blob/main/lib/cloud_controller/diego/reporters/instances_stats_reporter.rb#L48-L56
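
To illustrate the overwrite, here is a minimal Python sketch of the pattern described above. It is not the actual CAPI Ruby code. The function name `states_by_index` and the trimmed-down records are assumptions made for illustration; only the `index`, `state`, and `since` values come from the example in this issue.

```python
# Simplified stand-in for a loop that keys results by instance index.
# NOT the CAPI implementation; names here are hypothetical.
actual_lrps = [
    {"index": 3, "state": "RUNNING", "since": 1739222044495241579},
    {"index": 3, "state": "UNCLAIMED", "since": 1739568280529021112},
]


def states_by_index(lrps):
    """Key each LRP by index; a later entry silently replaces an earlier one."""
    result = {}
    for lrp in lrps:
        result[lrp["index"]] = lrp["state"]
    return result


# The reported state depends only on the order of the input.
print(states_by_index(actual_lrps))        # {3: 'UNCLAIMED'}
print(states_by_index(actual_lrps[::-1]))  # {3: 'RUNNING'}
```

With the same two records, the reported state flips between UNCLAIMED and RUNNING depending solely on which entry comes last.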

Example of a duplicate entry from cfdot actual-lrps. Note the process_guid and index are the same.

{
  "process_guid": "57a8e43b-81f9-46e9-9f78-81e15bbfd231-de7f7844-156e-4fc7-9f21-db5d072fb0b7",
  "index": 3,
  "domain": "cf-apps",
  "instance_guid": "",
  "cell_id": "",
  "address": "",
  "ports": null,
  "preferred_address": "UNKNOWN",
  "crash_count": 0,
  "state": "UNCLAIMED",
  "placement_error": "unable to communicate to compatible cells",
  "since": 1739568280529021112,
  "modification_tag": {
    "epoch": "780635af-9208-4d5e-5a08-ea49ebcb3f95",
    "index": 5758
  },
  "presence": "ORDINARY",
  "OptionalRoutable": {
    "routable": false
  },
  "availability_zone": ""
}
{
  "process_guid": "57a8e43b-81f9-46e9-9f78-81e15bbfd231-de7f7844-156e-4fc7-9f21-db5d072fb0b7",
  "index": 3,
  "domain": "cf-apps",
  "instance_guid": "1f3ffac3-be77-45e0-5075-7357",
  "cell_id": "23b06662-20e7-42dd-9377-6d8f10190ec4",
  "address": "10.0.4.17",
  "ports": [
    {
      "container_port": 8080,
      "host_port": 61012,
      "container_tls_proxy_port": 61001,
      "host_tls_proxy_port": 61014
    },
    {
      "container_port": 8080,
      "host_port": 61012,
      "container_tls_proxy_port": 61443,
      "host_tls_proxy_port": 0
    },
    {
      "container_port": 2222,
      "host_port": 61013,
      "container_tls_proxy_port": 61002,
      "host_tls_proxy_port": 61015
    }
  ],
  "instance_address": "10.255.233.24",
  "preferred_address": "HOST",
  "crash_count": 0,
  "state": "RUNNING",
  "since": 1739222044495241579,
  "modification_tag": {
    "epoch": "4a424a13-b5ba-47b7-771a-1a61d99c2524",
    "index": 2
  },
  "presence": "SUSPECT",
  "metric_tags": {
    "app_id": "57a8e43b-81f9-46e9-9f78-81e15bbfd231",
    "app_name": "static",
    "instance_id": "3",
    "organization_id": "c877a084-d65b-4758-9908-90201c6df339",
    "organization_name": "org-1",
    "process_id": "57a8e43b-81f9-46e9-9f78-81e15bbfd231",
    "process_instance_id": "1f3ffac3-be77-45e0-5075-7357",
    "process_type": "web",
    "source_id": "57a8e43b-81f9-46e9-9f78-81e15bbfd231",
    "space_id": "b248d5ab-2948-468b-ad0f-7b1b90e923d1",
    "space_name": "space-1"
  },
  "OptionalRoutable": {
    "routable": true
  },
  "availability_zone": "us-central1-f"
}
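
One way to spot such duplicates is to group the entries by process_guid and index. The sketch below assumes the cfdot output is available as text containing back-to-back JSON objects, as in the example above. The helper names `parse_json_stream` and `duplicate_indexes` and the shortened sample records are hypothetical.

```python
import json
from collections import defaultdict


def parse_json_stream(text):
    """Yield each JSON object from a string of back-to-back objects."""
    decoder = json.JSONDecoder()
    pos = 0
    while pos < len(text):
        # raw_decode does not skip leading whitespace, so skip it here.
        while pos < len(text) and text[pos].isspace():
            pos += 1
        if pos >= len(text):
            break
        obj, pos = decoder.raw_decode(text, pos)
        yield obj


def duplicate_indexes(lrps):
    """Return the states seen for every (process_guid, index) that appears more than once."""
    groups = defaultdict(list)
    for lrp in lrps:
        groups[(lrp["process_guid"], lrp["index"])].append(lrp["state"])
    return {key: states for key, states in groups.items() if len(states) > 1}


# Shortened, made-up sample; real records carry many more fields.
sample = """
{"process_guid": "guid-a", "index": 3, "state": "UNCLAIMED"}
{"process_guid": "guid-a", "index": 3, "state": "RUNNING"}
{"process_guid": "guid-a", "index": 2, "state": "RUNNING"}
"""

print(duplicate_indexes(parse_json_stream(sample)))
# {('guid-a', 3): ['UNCLAIMED', 'RUNNING']}
```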

Fix

When duplicates exist, CAPI should compare the since field of each actual_lrp and use the most recent entry.
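
A minimal Python sketch of that selection rule follows. It is an illustration of the idea, not a patch for CAPI (which is written in Ruby). The function name `latest_lrp_by_index` is hypothetical, and the records are reduced to the fields that matter; the since values are the ones from the example above.

```python
def latest_lrp_by_index(lrps):
    """For each index, keep the LRP with the largest 'since' timestamp."""
    latest = {}
    for lrp in lrps:
        current = latest.get(lrp["index"])
        if current is None or lrp["since"] > current["since"]:
            latest[lrp["index"]] = lrp
    return latest


running = {"index": 3, "state": "RUNNING", "since": 1739222044495241579}
unclaimed = {"index": 3, "state": "UNCLAIMED", "since": 1739568280529021112}

# The result no longer depends on the order of the input.
print(latest_lrp_by_index([running, unclaimed])[3]["state"])  # UNCLAIMED
print(latest_lrp_by_index([unclaimed, running])[3]["state"])  # UNCLAIMED
```

For the duplicate pair in this issue, the UNCLAIMED entry has the later since value, so it is selected regardless of ordering.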
