Why We Replaced Firecracker with QEMU

Firecracker vs QEMU

Firecracker, the microVM hypervisor, is renowned for being lightweight, fast, and secure. It's excellent for running short-lived workloads, which is why it's the backbone of AWS Lambda. Our initial prototype for Hocus, a self-hosted alternative to Gitpod and GitHub Codespaces, utilized Firecracker. However, after weeks of testing, we decided to entirely replace it with QEMU. A little-known fact about Firecracker is its lack of support for many modern hypervisor features, such as dynamic RAM management, which is vital for long-lived workloads. In this post, I will explain why Firecracker might not be the best hypervisor choice and when you should avoid it.

Firecracker Optimizes for Short-Lived Workloads

The creators of Firecracker state that:

"Firecracker has a minimalist design. It excludes unnecessary devices and guest-facing functionality to reduce the memory footprint and attack surface area of each microVM."

The term "unnecessary" is intriguing - if this functionality is unnecessary, why was it incorporated into other hypervisors? The definition of "unnecessary" must be understood in the context of what Firecracker was built for. These excluded features are unnecessary for AWS Lambda, which spins up VMs to run short function calls and then shuts them down. If you're running a different kind of workload, like a VM that contains your development environment or a self-hosted GitHub Actions agent, these features cease to be unnecessary. Your VM will run for hours, days, or even months without stopping, unlike the typical Firecracker VM, which runs for seconds or minutes.

Firecracker, Not So Lightweight After All

Here are the two most significant features Firecracker lacks:

Dynamic memory management - Firecracker's RAM footprint starts low, but once a workload inside allocates RAM, Firecracker will never return it to the host system. After running several workloads inside, you end up with an idling VM that consumes 32 GB of RAM on the host, even though it doesn't need any of it.¹
Discard operations on storage - if you create a 10 GB file inside a VM and then delete it, the backing space won't be reclaimed on the host. The VM will occupy that disk space until you delete the entire VM drive.²

These deficiencies make Firecracker a memory and disk space hog. The plot below shows the memory usage of the same memory-intensive workload running in QEMU and Firecracker virtual machines.

QEMU vs Firecracker VM Memory Usage

The workload in Firecracker finishes running around the 200-second mark, and in QEMU around the 250-second mark. It's not a performance difference; it's just when I manually stopped them.

Other Features Firecracker Is Missing

GPU support - if you need a GPU inside the VM, you have to pick a different hypervisor.
High-performance disk IO - when you connect multiple drives to the VM and run intensive IO operations, you will likely run into a bottleneck. Firecracker uses a virtio-blk implementation that isn’t as memory-hungry as alternatives, but has a smaller throughput.³

QEMU is Not Perfect Though

The main issue we've had with QEMU is that it has too many options you need to configure. For instance, enabling your VM to return unused RAM to the host requires at least three challenging tasks:

Knowing that the feature even exists (it's called free page reporting and you have to specifically enable it in QEMU)
Understanding that an obscure feature of Linux called DAMON exists, knowing what it's for⁴, knowing how to configure it, and compiling a guest Linux kernel that supports it
Knowing that you need to disable transparent huge pages on the guest, otherwise the VM will never return large amounts of memory

It took us two months of experimentation, reading through the source code of Firecracker, QEMU, and other hypervisors to develop a reliable QEMU proof of concept. To comprehend DAMON configuration, my co-founder spent days running benchmarks and conversing with one of its authors. It's great that we could talk and we are grateful that the author spent the time to help us, but it shows that the technology is not easily accessible yet.

Conclusion

QEMU has the features you need to run general-purpose workloads, but configuring it requires a lot of time and patience. If you want to run short-lived, untrusted workloads, Firecracker is a great choice. However, if you just want to run your development environment in a VM, you can use Hocus. We've done all the hard work for you already. It's still in alpha, but you can already check it out on GitHub.

If you squint hard enough, you'll find that Firecracker does support dynamic memory management with a technique called ballooning. However, in practice, it's not usable. To reclaim memory, you need to make sure that the guest OS isn't using it, which, for a general-purpose workload, is nearly impossible.↩
There is one other option for reclaiming disk space. After you shut down the VM, you may use a tool called virt-sparsify to make the disk sparse again. However, this is a manual operation that you need to run on the host system while the VM is offline.↩
Firecracker uses a MMIO transport for its virtio devices. QEMU also supports a PCI transport.↩
Linux stores memory in various caches, and DAMON is a mechanism for analyzing memory usage. You can use it to inform the kernel to release unused memory pages from caches.↩

Firecracker Optimizes for Short-Lived Workloads​

Firecracker, Not So Lightweight After All​

Other Features Firecracker Is Missing​

QEMU is Not Perfect Though​

Conclusion​

Firecracker Optimizes for Short-Lived Workloads

Firecracker, Not So Lightweight After All

Other Features Firecracker Is Missing

QEMU is Not Perfect Though

Conclusion