[Feature request]: Start faster!! #139

@MirandaStreeter

Description

Use Case

On a VM with 8 GB of RAM and 4 cores, I'm measuring container startup times of >45s. That sometimes drops to almost 35s if I've got:

  • a warm cache,
  • an existing CA,
  • and I'm not doing anything interesting with my mounts.

In either case that is wayyyyy tooooo slowwww.

Why is that too slow? Well, ideally, in an orchestrated environment (Kubernetes/Docker Swarm/Nomad/ECS/etc.) you want your stateless services to scale up and down quickly and responsively as load demands it.

Yes, I know openvox server isn't stateless... yet. That's a very related goal I plan on discussing with you all.

(It's related because separating out the stateful bits (CA/code/config) eliminates a lot of preprocessing at container startup, which makes it go faster. I'll be opening a separate FR for that in the future.)

In either case, the grand vision is that an HPA reacts to workload and tells a deployment to scale its replicaset. But if a new replica takes too long to become ready, the autoscaler keeps reacting to load that the pending replicas haven't absorbed yet, which leads to what's called the Bullwhip Effect: over-scaling followed by over-correction. A user shouldn't have to worry about whether there are enough replicas at a given time, only that the service isn't overloading their cluster.

Describe the solution you would like

My first goal is getting startup under 10s (relative to my previously mentioned benchmarks). Faster would be ideal, but that may take some Java expertise (which I don't have in abundance).

What's making it slow? When checking podman logs -t and running through the timestamps, I can see that the entrypoint shell phase takes up about 2/3 of that time. Once the entrypoint scripts are done, the JVM takes over, which in my experience takes up the remaining third.

Thankfully logback.xml has some pretty sensible defaults and we can see what happens at which times. Here's a quick and dirty breakdown from my tests. No warm cache, no mounts, fresh image.

  | Phase                                               | Time  |
  |-----------------------------------------------------|-------|
  | Shell (entrypoint, incl. CA setup on cold start)    | ~36s  |
  | JVM + Clojure + Trapperkeeper class loading         | ~4s   |
  | JRuby instance creation (loading Puppet into JRuby) | ~10s  |
  | Jetty bind (ready)                                  | ~1s   |
  | TOTAL                                               | ~51s  |

The shell phase is what I'd like to tackle first. When I start adding set -x options to the scripts and prepending commands with time, I can see various culprits.
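For anyone who wants to reproduce that kind of per-command timing, here is a minimal bash sketch. The `timed` wrapper is a hypothetical name used for illustration, not something that exists in the entrypoint scripts. It relies on bash's `time` keyword and `TIMEFORMAT`, so it assumes the scripts run under bash rather than plain sh.

```shell
#!/usr/bin/env bash
# Sketch only: `timed` is a hypothetical helper, not part of the image.

# Run a command, report its wall-clock time on stderr, and pass its
# exit status through unchanged.
timed() {
  # Escape any % in the command text so TIMEFORMAT does not treat it
  # as a format sequence; %R is the elapsed real time in seconds.
  local TIMEFORMAT="timed: ${*//%/%%} took %Rs"
  time "$@"
}
```

Prefixing a suspect line with it (for example `timed puppet config print ssldir`) emits one `timed: ... took ...s` line per call, which is easier to grep out of the container logs than raw `set -x` output.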

  • puppet config commands take ~1s each, because each one boots a separate ruby interpreter. A few dozen of those are littered about. This is far and away the most time-consuming part.
    • I'm planning on replacing them with a single grab-all puppet config get <XYZ> call and then supplying those settings through a batching helper script.
  • The chown -R from 87-ca-permissions.sh is a temporary workaround I'm not happy with.
    • Until a future date when we can get rid of it, I can probably ensure it only runs on files with a different uid (as opposed to its current shotgun approach).
  • 90-ca.sh has a few hocon set commands, each needing to spend half a second booting the gem.
    • We can probably load the library in its own ruby script and consolidate those.
  • The puppetserver ca generate command takes a long while, but since that's only on first boot it shouldn't matter. I can ignore that.
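The first two ideas in the list above could look roughly like this. It's a sketch under assumptions, not the planned implementation: the function names and the cache path are made up, it assumes `puppet config print` with no setting names emits `name = value` lines, and it assumes the settings of interest live in the `server` section.

```shell
#!/bin/sh
# Sketch only: helper names and the cache path are hypothetical.

SETTINGS_CACHE="${SETTINGS_CACHE:-/tmp/puppet-settings.cache}"

# Boot Ruby once and cache every setting of one section as
# "name = value" lines (assumed output format of `puppet config print`).
cache_settings() {
  puppet config print --section "${1:-server}" > "$SETTINGS_CACHE"
}

# Look up a single setting in the cache; no interpreter start needed.
# Exits non-zero when the setting is not in the cache.
get_setting() {
  awk -v key="$1" -F ' = ' '
    $1 == key { print substr($0, length(key) + 4); found = 1; exit }
    END { exit !found }
  ' "$SETTINGS_CACHE"
}

# chown only the entries not already owned by the target user,
# instead of rewriting ownership on the whole tree.
chown_mismatched() {
  owner="$1"; shift
  find "$@" ! -user "$owner" -exec chown "$owner" {} +
}
```

The idea is to call `cache_settings` once near the top of the entrypoint and then swap each individual lookup for `get_setting <name>`, so the ~1s interpreter boot is paid once instead of a few dozen times. `chown_mismatched` still walks the tree, but it only invokes `chown` when `find` actually reports a mismatch.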

As for the JVM, I've tinkered with some of the startup parameters to no avail. Messing with TieredStopAtLevel was disastrous, since JRuby relies heavily on the C2 JIT.

Buuuut I think we can pre-seed classes by creating a CDS (Class Data Sharing) archive. With CDS, the JVM memory-maps already-parsed classes from an archive file instead of loading them class-by-class. The JDK already ships such an archive for its own default classes, but not for the application JARs (clojure, jruby, puppetserver). I'm still investigating.
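A rough sketch of how an application CDS archive could be wired in, using the dynamic archiving flags available since JDK 13. The function names and any archive path passed to them are hypothetical, and whether JRuby's classes actually benefit is untested here.

```shell
#!/bin/sh
# Sketch only: function names and archive paths are hypothetical.

# Flag for a one-off training run: the JVM writes the classes it loaded
# into the archive when it exits (dynamic AppCDS, JDK 13+).
cds_dump_flags() {
  printf '%s' "-XX:ArchiveClassesAtExit=$1"
}

# Flag for normal runs: map the archive if it is readable, otherwise
# print nothing so startup falls back to ordinary class loading.
cds_use_flags() {
  if [ -r "$1" ]; then
    printf '%s' "-XX:SharedArchiveFile=$1"
  fi
}
```

The dump flag would be appended to the server's JVM arguments for one start-and-clean-stop cycle (most likely at image build time), and the use flag on every later start. The archive has to be produced by the same JVM build and classpath that reads it, otherwise the JVM rejects it and loads classes the usual way.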

I think that's all I've got so far? If anyone else has any ideas or knows of any additional bottlenecks we might be able to clear, please let me know!

Describe alternatives you've considered

No response

Additional context

No response

Labels: enhancement (New feature or request)