K8S API error resilience #7122

@BioWilko

New feature

Could Nextflow be a little bit more resilient to transient K8S API issues, e.g.

Caused by:
  Request POST /apis/batch/v1/namespaces/ns-<NAMESPACE>/jobs returned an error code=500
  
    {
        "kind": "Status",
        "apiVersion": "v1",
        "metadata": {
            
        },
        "status": "Failure",
        "message": "Internal error occurred: resource quota evaluation timed out",
        "reason": "InternalError",
        "details": {
            "causes": [
                {
                    "message": "resource quota evaluation timed out"
                }
            ]
        },
        "code": 500
    }

 -- Check '.nextflow.log' file for details

Perhaps a configurable number of retries for 4xx/5xx error codes?

The K8S API can be a little flaky under high load, especially when a lot of jobs are sitting around spamming the API with requests to create pods!
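
To make the request concrete, here is a minimal sketch of the kind of retry behaviour I have in mind, written in Python purely for illustration. It is not Nextflow code: `ApiError`, `with_retries`, `RETRYABLE_CODES` and the parameter names are all hypothetical, and the choice of which status codes count as transient is an assumption.

```python
import random
import time

# Assumption: status codes treated as transient. Most 4xx codes (400, 403, 404)
# will not succeed on retry, so only 429 is included from that range.
RETRYABLE_CODES = frozenset({429, 500, 502, 503, 504})


class ApiError(Exception):
    """Hypothetical error carrying the HTTP status code of a failed request."""

    def __init__(self, code, message=""):
        super().__init__(f"code={code} {message}")
        self.code = code


def with_retries(request, max_retries=4, base_delay=0.5, max_delay=30.0,
                 sleep=time.sleep):
    """Call request() and retry transient failures with exponential backoff.

    max_retries is the configurable retry count; the error is re-raised once
    it is exhausted or when the status code is not considered transient.
    """
    attempt = 0
    while True:
        try:
            return request()
        except ApiError as err:
            if err.code not in RETRYABLE_CODES or attempt >= max_retries:
                raise
            # Exponential backoff with full jitter, so that many queued jobs
            # do not all retry against the API at the same moment.
            delay = random.uniform(0, min(max_delay, base_delay * 2 ** attempt))
            sleep(delay)
            attempt += 1
```

With something like this wrapped around the job submission call, a one-off `code=500` such as the quota evaluation timeout above would be retried a few times before the run is failed. The jitter matters in the high-load case, since evenly spaced retries from many jobs would just hit the API in waves.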
