Case study: Source code management problems after adding many more cloud workers to a testing platform #4

@tmbgreaves

To reduce long build queue delays, we added Jenkins capacity that could burst up to fifty additional cloud workers. We then saw very unreliable connections to our source code management (SCM) platform, in this case GitHub, which looked like failing access tokens.

We initially assumed the problem was at the Jenkins end, but later realised that our access tokens were being disabled at the GitHub end for on the order of 48 hours, presumably because the large number of concurrent SCM requests looked like a distributed denial-of-service (DDoS) attack.

This was probably more severe for our research software because each job could run in five or six configurations, and each configuration could perform multiple clone actions. For a branch with an open pull request (PR) the total would double, since both the branch and the merge are built. Before we added cloud capacity, the load would have been spread over a longer time as our smaller worker pool processed the backlog.
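
To make the multiplication above concrete, here is a rough back-of-the-envelope sketch. The function name and the example numbers (ten jobs starting together, three clones per configuration) are illustrative assumptions, not measurements from our setup; only the "five or six configurations" and the doubling for PR builds come from the description above.

```python
def estimated_clone_requests(jobs, configs_per_job, clones_per_config, is_pull_request):
    """Rough upper bound on clone requests when `jobs` start at about the same time."""
    # A PR is built twice: once for the branch and once for the merge.
    builds_per_job = 2 if is_pull_request else 1
    return jobs * builds_per_job * configs_per_job * clones_per_config


# Hypothetical burst: 10 PR jobs, 6 configurations each, 3 clones per configuration.
print(estimated_clone_requests(10, 6, 3, is_pull_request=True))  # 360
```

With a small worker pool the same requests still happen, but they are spread out as the backlog drains rather than arriving in one burst.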

Flagging this as a potential issue could help future research groups in three ways: preventing the problem (don't allow so many on-demand instances to start at once), stopping it from affecting multiple projects (give each project its own access token), and raising awareness of the failure mode to save a long debugging process.
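
As one possible illustration of the "don't start everything at once" idea, the sketch below caps how many SCM operations run concurrently and adds a random start delay. It is not what we deployed: `CloneThrottle` and its parameters are hypothetical names, and it only coordinates threads inside a single process. Limiting separate cloud workers would need a shared mechanism that is not shown here.

```python
import random
import threading
import time


class CloneThrottle:
    """Cap concurrent SCM operations and stagger their start times."""

    def __init__(self, max_concurrent, max_jitter_seconds=0.0):
        self._slots = threading.BoundedSemaphore(max_concurrent)
        self._max_jitter = max_jitter_seconds

    def run(self, operation):
        # Random delay so callers launched together do not all hit the SCM host at once.
        time.sleep(random.uniform(0.0, self._max_jitter))
        # Blocks while max_concurrent operations are already in flight.
        with self._slots:
            return operation()
```

A caller would wrap whatever performs the clone, for example `throttle.run(lambda: subprocess.run(["git", "clone", url], check=True))`, where `url` is a placeholder for the repository address.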
