
Death to Cold Starts

Cold starts are a real pain. KraftCloud provides cold starts on the order of a few milliseconds with hardware-level isolation; in this post we explain how.

· 4 min read

First things first: long cold start times are a real pain. On hyperscalers, starting a new service can take seconds or even minutes, and even newer platforms can take hundreds of milliseconds or more to bring a new service up. Such long start times can be the difference between a client purchasing a product or leaving the store. Even FaaS offerings, for all their purported nimbleness, suffer from noticeable cold starts.

Worse, autoscale, supposedly a seamless mechanism for coping with traffic peaks and varying demand, can only be effective if it reacts on the same time scales as those peaks. Because autoscale on major cloud providers can take seconds or even minutes to bring new instances up and have them ready to process requests, engineers resort to less-than-optimal workarounds: keeping hot instances ready to run for scaling purposes (costly), or devising complex algorithms to predict demand peaks (hard, or near impossible).

Hardware Isolation and Millisecond Cold Starts?

The world would be a better place if cold starts and autoscale were completely transparent to users, on the order of a few milliseconds. KraftCloud provides exactly such millisecond semantics, all the while providing full hardware isolation. How fast is it? Because I like numbers, let’s start with a graph:

KraftCloud cold start times when running 5,000 hardware-isolated NGINX instances.

To generate this graph, we create an instance (read: unikernel) of NGINX and measure how long it takes for it to be ready to serve requests. We then start a second instance (all the while leaving the first one running), do the same measurement, and so on all the way up to 5,000 such instances — all on a single, relatively standard server: an AMD EPYC 7402P CPU (24 cores @ 2.8 GHz) with 64 GiB of memory.
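As a rough sketch of this kind of "time until ready to serve" measurement (using a local Python HTTP server as a stand-in for a freshly deployed NGINX instance — the port, timeout, and polling interval here are illustrative, not KraftCloud's actual benchmark harness):

```python
import http.server
import socketserver
import threading
import time
import urllib.request

PORT = 8123  # illustrative port

def start_server():
    # Stand-in for launching an instance: serve HTTP on a background thread.
    handler = http.server.SimpleHTTPRequestHandler
    srv = socketserver.TCPServer(("127.0.0.1", PORT), handler)
    threading.Thread(target=srv.serve_forever, daemon=True).start()
    return srv

def time_to_ready(url, timeout=5.0):
    # Poll until the first successful HTTP response; return elapsed seconds.
    start = time.monotonic()
    while time.monotonic() - start < timeout:
        try:
            urllib.request.urlopen(url, timeout=0.5)
            return time.monotonic() - start
        except OSError:
            time.sleep(0.001)
    raise TimeoutError("service never became ready")

srv = start_server()
elapsed = time_to_ready(f"http://127.0.0.1:{PORT}/")
print(f"ready after {elapsed * 1000:.1f} ms")
srv.shutdown()
```

The real measurement does the same thing at the platform level: start the clock at instance creation and stop it at the first successful response.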

On the lower end, for the first instances, we see cold start times of about 4 milliseconds; at the upper end, for the 5,000th VM, that number rises to a still quite low 14 ms. Note that the slight sub-linear increase is due to system/host overheads (e.g., scheduling) which can be optimized.
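In back-of-the-envelope terms, the amortized overhead implied by these numbers is tiny (the 4 ms and 14 ms endpoints are read off the graph above):

```python
first_ms = 4.0      # cold start of the first instance (ms)
last_ms = 14.0      # cold start of the 5,000th instance (ms)
instances = 5_000

# Average extra cold-start cost contributed by each additional running instance.
per_instance_us = (last_ms - first_ms) / instances * 1000  # microseconds
print(f"~{per_instance_us:.0f} µs of added cold-start time per extra instance")
```

In other words, each additional co-resident instance adds roughly 2 µs to the cold start time of the next one.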

To make it all a bit more tangible, here’s the output of a kraft cloud deploy command which creates an instance on the platform:

$ kraft cloud deploy -p 443:8080 nginx:latest
Deployed successfully!
────────── name: nginx-6cfc4
────────── uuid: 62d1d6e9-0d45-4ced-ad2a-619718ba0344
───────── state: running
─────────── url: https://long-violet-92ka3gk7.fra0.kraft.host
───────── image: nginx@sha256:fb3e5fb1609ab4fd40d38ae12605d56fc0dc48aaa0ad4890ed7ba0b637af69f6
───── boot time: 16.65 ms
──────── memory: 128 MiB
─ service group: long-violet-92ka3gk7
── private fqdn: nginx-6cfc4.internal
──── private ip: 172.16.6.4
────────── args: /usr/bin/nginx -c /etc/nginx/nginx.conf

It is worth noting that even though we are running a Unikraft unikernel, the application is standard, unmodified NGINX. What about other applications or programming languages? Here’s a table with a sample of them, including a comparison with those same apps running on Linux:

Application      Unikraft   Linux
NGINX            4.2 ms     715 ms
Redis            7.8 ms     761 ms
SQLite           5.2 ms     698 ms
Node.js          47 ms      820 ms
Go HTTP server   8.8 ms     688 ms
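For a quick sense of the gap, the ratios from the table work out to between one and two orders of magnitude (numbers copied from the table above; rounding is mine):

```python
# (Unikraft ms, Linux ms) cold start times from the table above.
times = {
    "NGINX": (4.2, 715),
    "Redis": (7.8, 761),
    "SQLite": (5.2, 698),
    "Node.js": (47, 820),
    "Go HTTP server": (8.8, 688),
}

for app, (unikraft, linux) in times.items():
    print(f"{app}: ~{linux / unikraft:.0f}x faster cold start on Unikraft")
```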

How Does it Work?

In one word, specialization (along with many performance optimizations): if you know the application you want to deploy, and for cloud deployments you presumably always do, then you can, at build time, fully customize the image, all the way down to the OS, so that it contains only the functionality the app needs, and nothing more.

The concept is illustrated in the diagram above. If a line of code is in a Unikraft image, it’s because the application needs it to run — otherwise it’s out. To further help with specialization, Unikraft is a modular OS, making it easy to add/remove functionality from builds. Finally, we optimize the start process and the code itself, and we leverage a fast VMM (Virtual Machine Monitor, in our case Firecracker) to ensure the quickest possible start times.

As a final note, while these start times are small, we have a number of ideas as to how to reduce them further, so watch this space!

Get early access to the KraftCloud Beta

If you want to find out more about the tech behind KraftCloud, read our other blog posts, join our Discord server, and check out Unikraft’s Linux Foundation OSS website. We would be extremely grateful for any feedback!

Sign up now