Hugo Dutka · 8 min read

Virtualization Technologies

Hocus uses virtualization to let you define your development environment, build it in CI, and then run it on a devbox in a virtual machine. When we were designing the product, we followed a principles-first approach. We didn't choose a ready-to-go virtualization tool like Docker, but evaluated the pros and cons of different runtimes, and then created a custom-made system that precisely solved our problems.

In this post, I'll tell you about the limits of two modern virtualization methods: containers and virtual machines. You'll gain an overview of how they isolate processes, what tradeoffs they make to balance security and efficiency, and when you should use one over the other. Also, you'll learn how you can use them to boost your development environments.

The Limits of Docker

When we first considered how to virtualize a development environment, we wanted to use Docker. But once we looked closer, we started seeing problems.

We wanted to build a solution that would allow multiple software engineers to work on a single machine, each one isolated in their own development environment. They should be able to run any software they want inside their own workspaces without impacting each other.

However, Docker was not designed with highly-privileged isolation in mind. For example, most developers would like to use Docker itself in their dev environment, and, by default, the only way to run Docker in Docker is with a privileged container. This gives the developer root access to the machine where their dev environment is hosted, completely compromising any isolation.
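
To make that concrete, this is roughly what the standard setup looks like (the stock docker:dind image, nothing Hocus-specific):

    # The only out-of-the-box way to run Docker inside Docker: a privileged
    # container. --privileged grants all capabilities and access to the host's
    # devices, so root inside it is effectively root on the host.
    docker run -d --name dind --privileged docker:dind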

But, in fact, you can use containers to achieve a higher degree of isolation. Container engines1 like Docker and containerd include a subsystem called a runtime, which spawns containerized processes. By default they use runc, but they let you replace it with, for example, Sysbox. By using Linux kernel features such as user namespaces and by emulating certain syscalls, Sysbox lets containerized processes run as a root user that has no privileges on the host but can do almost anything within the container. If you're not confined to the OCI ecosystem, there is also LXC, which is more mature.
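
As a sketch of what that looks like in practice (assuming Sysbox is installed and registered with Docker under its default runtime name, sysbox-runc):

    # The same Docker-in-Docker image, but run under Sysbox instead of runc.
    # No --privileged needed: root inside the container is mapped to an
    # unprivileged user on the host via user namespaces.
    docker run -d --name dind --runtime=sysbox-runc docker:dind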

But these solutions come with limits. Sysbox does not support GPUs and cannot run nested Sysbox containers. If you're developing a typical web app, you won't run into any of this. But some Linux subsystems, such as block devices, currently can't be isolated within a container at all, and if your development environment depends on them, you're out of luck.

Virtual Machines Galore

To get around this, you can use a virtual machine. A VM runs its own full Linux kernel, so any Linux feature can be virtualized. The main caveat is memory efficiency.

The Linux kernel loves RAM. Whenever it sees free memory, it stuffs it into various caches, so software can run faster. That's a great design when the kernel is running on bare metal, since unused RAM is wasted RAM. But when it's running in a VM, the VM gobbles up as much memory as it can and, by itself, shows no inclination to give it back.

There are techniques you can use to reclaim it, such as memory ballooning or free page reporting in combination with DAMON, but it's not straightforward2 to make them work. And, even if you implement them, you will not gain the efficiency that containers boast.
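
For reference, this is roughly what memory ballooning looks like with QEMU (a sketch; the guest needs a virtio-balloon device, and how much memory you can actually reclaim depends on what the guest is doing):

    # Start the VM with a balloon device and an HMP monitor on a Unix socket
    # (all other machine options omitted).
    qemu-system-x86_64 -m 8G -enable-kvm \
      -device virtio-balloon \
      -monitor unix:/tmp/qemu-monitor.sock,server,nowait

    # From the host, ask the guest to shrink down to a 4 GB target.
    echo 'balloon 4096' | socat - UNIX-CONNECT:/tmp/qemu-monitor.sock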

If you've got multiple VMs running on a host, you can also use Kernel Samepage Merging, which deduplicates identical memory pages between processes. Last we tested it, we had multiple VMs with similar workloads running that together consumed 15 GB of RAM, and after enabling KSM, this dropped down to only 6 GB. However, even though the memory savings are colossal, you should be cautious about using it. Enabling KSM introduces a side channel vulnerability that potentially lets a process running in one VM read files from a different VM.3
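
If you want to reproduce this, KSM is controlled through sysfs on the host, and QEMU marks guest memory as mergeable by default, so enabling the scanner is usually all it takes:

    # Start the KSM scanner and let it scan more pages per wake-up.
    echo 1    | sudo tee /sys/kernel/mm/ksm/run
    echo 1000 | sudo tee /sys/kernel/mm/ksm/pages_to_scan

    # After a while, check how many pages are being shared between VMs.
    cat /sys/kernel/mm/ksm/pages_sharing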

VMs are also not as portable as containers. Many cloud providers won't let you run a VM at all: AWS, for example, doesn't support nested virtualization on its VM-based EC2 instances.

Our Solution

There was no silver bullet we could use to virtualize development environments. VMs are not very memory-efficient, and containers can't isolate all workloads. Many of our users would not need the virtualization capabilities of VMs, and could save costs by putting more containers on a single machine. However, Hocus itself depends on low-level kernel features, and can't be fully developed inside a container. We wanted to use Hocus to develop Hocus as soon as we could, so the first version of Hocus uses VMs. However, we designed the system in a way that allows us to add container4 support later.

Hocus is a work in progress, a proof of concept, and we want to finish it in collaboration with people who need it. We are looking for individuals who can't stand their huge, slow dev environments at work and want to do something about it. We recently figured out how to start 100 GB+ dev environments in seconds even when you haven't downloaded them onto your host yet,5 and we'd love to find someone to implement it for. If that's you, you can sign up for the closed beta of Hocus Enterprise. We will work with you to introduce Hocus at your company, and adapt it to your needs. But, if you're just interested in what we've built so far, you can check out the alpha version on GitHub.


  1. Container terminology is baffling. Containerd calls itself a container runtime instead of an engine, but then it uses runc, Sysbox, or others under the hood, which are also called container runtimes, even though they are much lower level tools. In this blog post a Docker developer laments: "I think the container ecosystem can be confusing at times. Especially with the terminology that we use. What's this? A runtime. And this? A runtime..." I decided to use the term "container engine" because Podman describes itself as "an open source container, pod, and container image management engine" and it doesn't clash with the other thing.
  2. Last time I wrote that memory ballooning in Firecracker is nearly impossible to set up, Hacker News called me defensive and disingenuous and someone tweeted that it's in fact pretty simple - you just have to write a custom memory driver for Firecracker in Rust. So this time I'm just going to say it's not straightforward.
  3. Actually, the vulnerability is a bit more nuanced and I can't quite wrap my head around what its full real-world implications are. It allows a process to detect that a different process is using the same memory page. An attacker would do it by first loading a memory page, and then dropping it from CPU cache. They would then wait a bit, and load it again. If loading was instant, then a different process must have loaded it in the meantime, and therefore it was using it. So to actually read a file from a different VM, you have 2 options: already know the exact contents of the file and then you can verify that the other VM is reading the same file, or figure out how to use the side channel to indirectly infer the contents of unknown files. I guess you could maybe do it by dropping memory pages that contained functions from standard libraries used to process IO and then figure out in what order they are loaded again to infer the data that's being processed? I don't know, and if you have any resources that show how the vulnerability is exploited in the wild, I would love a link. I also found this paper that shows an attack which can extract encryption keys from a different VM.
  4. In fact, we would like to use LXC rather than Sysbox. Earlier I mentioned that LXC is more mature - it supports GPUs and can run nested LXC. As a side note, I don't quite understand why Sysbox exists. Why not create a containerd runtime that's a wrapper over LXC? LXC already existed when Sysbox started development. And why doesn't this wrapper exist now? It seems like a relatively low-effort way to introduce mature system containers into the OCI ecosystem. If anyone knows, I would love to understand this.
  5. If your dev environment's image size exceeds 100 GB, and you have 10 Gbps of network bandwidth, it's going to take at least 80 seconds to download. We integrated with a storage system called Overlaybd that supports a technique called lazy pulling. It allows you to download parts of the image on demand and start with only the data that you need to boot. You then pull the rest in the background while the developer is already inside the dev environment. However, there are still some issues we need to iron out before we fully integrate it into Hocus. That's going to be another blog post.

Hugo Dutka · 5 min read

Firecracker vs QEMU

Firecracker, the microVM hypervisor, is renowned for being lightweight, fast, and secure. It's excellent for running short-lived workloads, which is why it's the backbone of AWS Lambda. Our initial prototype for Hocus, a self-hosted alternative to Gitpod and GitHub Codespaces, utilized Firecracker. However, after weeks of testing, we decided to entirely replace it with QEMU. A little-known fact about Firecracker is its lack of support for many modern hypervisor features, such as dynamic RAM management, which is vital for long-lived workloads. In this post, I will explain why Firecracker might not be the best hypervisor choice and when you should avoid it.

Firecracker Optimizes for Short-Lived Workloads

The creators of Firecracker state that:

"Firecracker has a minimalist design. It excludes unnecessary devices and guest-facing functionality to reduce the memory footprint and attack surface area of each microVM."

The term "unnecessary" is intriguing - if this functionality is unnecessary, why was it incorporated into other hypervisors? The definition of "unnecessary" must be understood in the context of what Firecracker was built for. These excluded features are unnecessary for AWS Lambda, which spins up VMs to run short function calls and then shuts them down. If you're running a different kind of workload, like a VM that contains your development environment or a self-hosted GitHub Actions agent, these features cease to be unnecessary. Your VM will run for hours, days, or even months without stopping, unlike the typical Firecracker VM, which runs for seconds or minutes.

Firecracker, Not So Lightweight After All

Here are the two most significant features Firecracker lacks:

  • Dynamic memory management - Firecracker's RAM footprint starts low, but once a workload inside allocates RAM, Firecracker will never return it to the host system. After running several workloads inside, you end up with an idling VM that consumes 32 GB of RAM on the host, even though it doesn't need any of it.1
  • Discard operations on storage - if you create a 10 GB file inside a VM and then delete it, the backing space won't be reclaimed on the host. The VM will occupy that disk space until you delete the entire VM drive.2
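
Discard, at least, is available elsewhere. For contrast, this is roughly how you would enable it under QEMU (a sketch; devbox.qcow2 is a placeholder image and most machine options are omitted):

    # Host: attach the qcow2 disk over virtio with discard passthrough enabled.
    qemu-system-x86_64 -m 4G -enable-kvm \
      -drive file=devbox.qcow2,if=virtio,discard=unmap,detect-zeroes=unmap

    # Guest: report freed blocks back to the host so the qcow2 file can shrink.
    sudo fstrim -av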

These deficiencies make Firecracker a memory and disk space hog. The plot below shows the memory usage of the same memory-intensive workload running in QEMU and Firecracker virtual machines.

QEMU vs Firecracker VM Memory Usage

The workload in Firecracker finishes running around the 200-second mark, and in QEMU around the 250-second mark. It's not a performance difference; it's just when I manually stopped them.

Other Features Firecracker Is Missing

  • GPU support - if you need a GPU inside the VM, you have to pick a different hypervisor.
  • High-performance disk IO - when you connect multiple drives to the VM and run intensive IO operations, you will likely hit a bottleneck. Firecracker uses a virtio-blk implementation that isn't as memory-hungry as the alternatives, but has lower throughput.3

QEMU is Not Perfect Though

The main issue we've had with QEMU is that it has too many options you need to configure. For instance, getting your VM to return unused RAM to the host requires at least three challenging tasks (sketched in the snippet after this list):

  • Knowing that the feature even exists (it's called free page reporting and you have to specifically enable it in QEMU)
  • Understanding that an obscure feature of Linux called DAMON exists, knowing what it's for4, knowing how to configure it, and compiling a guest Linux kernel that supports it
  • Knowing that you need to disable transparent huge pages on the guest, otherwise the VM will never return large amounts of memory
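
Put together, this is roughly what that configuration boils down to (a sketch; devbox.qcow2 is a placeholder image, the guest kernel needs to be built with free page reporting and DAMON support, and the DAMON_RECLAIM tuning depends on your workload):

    # Host: attach a virtio-balloon device with free page reporting enabled
    # (most other QEMU options omitted).
    qemu-system-x86_64 -m 16G -enable-kvm \
      -device virtio-balloon,free-page-reporting=on \
      -drive file=devbox.qcow2,if=virtio

    # Guest: disable transparent huge pages, otherwise large amounts of memory
    # are rarely handed back in full.
    echo never | sudo tee /sys/kernel/mm/transparent_hugepage/enabled

    # Guest: turn on DAMON-based proactive reclaim (CONFIG_DAMON_RECLAIM=y).
    echo Y | sudo tee /sys/module/damon_reclaim/parameters/enabled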

It took us two months of experimentation and reading through the source code of Firecracker, QEMU, and other hypervisors to develop a reliable QEMU proof of concept. To understand how to configure DAMON, my co-founder spent days running benchmarks and talking with one of its authors. We're grateful that the author took the time to help us, but it shows that the technology is not easily accessible yet.

Conclusion

QEMU has the features you need to run general-purpose workloads, but configuring it requires a lot of time and patience. If you want to run short-lived, untrusted workloads, Firecracker is a great choice. However, if you just want to run your development environment in a VM, you can use Hocus. We've done all the hard work for you already. It's still in alpha, but you can already check it out on GitHub.


  1. If you squint hard enough, you'll find that Firecracker does support dynamic memory management with a technique called ballooning. However, in practice, it's not usable. To reclaim memory, you need to make sure that the guest OS isn't using it, which, for a general-purpose workload, is nearly impossible.
  2. There is one other option for reclaiming disk space. After you shut down the VM, you may use a tool called virt-sparsify to make the disk sparse again. However, this is a manual operation that you need to run on the host system while the VM is offline.
  3. Firecracker uses an MMIO transport for its virtio devices. QEMU also supports a PCI transport.
  4. Linux stores memory in various caches, and DAMON is a mechanism for analyzing memory usage. You can use it to inform the kernel to release unused memory pages from caches.