Hugo Dutka

Virtualization Technologies

Hocus uses virtualization to let you define your development environment, build it in CI, and then run it on a devbox in a virtual machine. When we were designing the product, we followed a principles-first approach. We didn't choose a ready-to-go virtualization tool like Docker, but evaluated the pros and cons of different runtimes, and then created a custom-made system that precisely solved our problems.

In this post, I'll tell you about the limits of two modern virtualization methods: containers and virtual machines. You'll gain an overview of how they isolate processes, what tradeoffs they make to balance security and efficiency, and when you should use one over the other. Also, you'll learn how you can use them to boost your development environments.

The Limits of Docker

When we first considered how to virtualize a development environment, we wanted to use Docker. But once we looked closer, we started seeing problems.

We wanted to build a solution that would allow multiple software engineers to work on a single machine, each one isolated in their own development environment. They should be able to run any software they want inside their own workspaces without impacting each other.

However, Docker was not designed to isolate highly privileged workloads. For example, most developers would like to use Docker itself in their dev environment, and, by default, the only way to run Docker in Docker is with a privileged container. This effectively gives the developer root access to the machine where their dev environment is hosted, completely compromising any isolation.

But, in fact, you can use containers to achieve a higher degree of isolation. Container engines[1] like Docker and containerd include a subsystem called a runtime, which spawns containerized processes. By default, they use runc, but they let you replace it with an alternative such as Sysbox. By using Linux kernel features such as user namespaces and by emulating certain syscalls, Sysbox lets containerized processes run as a root user that has no privileges on the host but can do almost anything within the container. If you're not confined to the OCI ecosystem, there is also LXC, which is more mature.
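To make the user namespace idea concrete, here is a minimal sketch of how an ID mapping turns "root inside the container" into an unprivileged user on the host. It parses text in the format of `/proc/<pid>/uid_map` (first ID inside the namespace, first ID outside, range length). The mapping values below are hypothetical, and this illustrates the kernel mechanism only, not how Sysbox itself is implemented.

```python
from typing import List, Optional, Tuple

# Each entry mirrors one line of /proc/<pid>/uid_map:
# (first ID inside the namespace, first ID on the host, length of the range)
UidMap = List[Tuple[int, int, int]]


def parse_uid_map(text: str) -> UidMap:
    """Parse the contents of a uid_map file into a list of ranges."""
    entries = []
    for line in text.splitlines():
        if not line.strip():
            continue
        inside, outside, length = (int(field) for field in line.split())
        entries.append((inside, outside, length))
    return entries


def to_host_uid(uid_map: UidMap, container_uid: int) -> Optional[int]:
    """Translate a UID seen inside the container to the UID the host sees."""
    for inside, outside, length in uid_map:
        if inside <= container_uid < inside + length:
            return outside + (container_uid - inside)
    return None  # unmapped IDs have no host equivalent


# Hypothetical mapping: 65536 container IDs backed by host IDs from 100000 up.
example = parse_uid_map("0 100000 65536\n")
print(to_host_uid(example, 0))     # container root maps to host UID 100000
print(to_host_uid(example, 1000))  # maps to host UID 101000
```

With a mapping like this, a process that sees itself as UID 0 is an ordinary, unprivileged UID from the host's point of view, which is why it can be root in the container without being root on the machine.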

But these solutions come with limits. Sysbox does not support GPUs and cannot run nested Sysbox containers. If you are working on a typical web app, you will probably not run into any issues. But some Linux subsystems, such as block devices, currently can't be isolated within a container. If your development environment depends on them, you're out of luck.

Virtual Machines Galore

To solve this, you can use a virtual machine. Each VM runs its own full, separate Linux kernel, so you can use any Linux feature inside it. The main caveat is memory efficiency.

The Linux kernel loves RAM. Whenever it sees free memory, it stuffs it into various caches, so software can run faster. That's a great design when the kernel is running on bare metal, since unused RAM is wasted RAM. But when it's running in a VM, the VM gobbles up as much memory as it can and, by itself, shows no inclination to give it back.

There are techniques you can use to reclaim it, such as memory ballooning or free page reporting in combination with DAMON, but it's not straightforward[2] to make them work. And, even if you implement them, you will not match the memory efficiency of containers.
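To see how much of a guest's memory is held by kernel caches rather than by applications, you can inspect `/proc/meminfo` inside the VM. The sketch below parses text in that format and produces a rough split. The sample values are made up, and the "cache" figure is an approximation: `Cached` also counts shared memory, which the kernel cannot simply drop.

```python
def parse_meminfo(text: str) -> dict:
    """Parse /proc/meminfo-style text into a dict of integer values (kiB)."""
    values = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        fields = rest.split()
        if fields:
            values[key.strip()] = int(fields[0])
    return values


def cache_summary(meminfo: dict) -> dict:
    """Split total RAM into free, cache-like, and everything else."""
    cached = meminfo["Buffers"] + meminfo["Cached"] + meminfo.get("SReclaimable", 0)
    return {
        "free_kib": meminfo["MemFree"],
        "cache_kib": cached,
        "other_kib": meminfo["MemTotal"] - meminfo["MemFree"] - cached,
    }


# Made-up sample resembling a 16 GB guest that has been running for a while.
SAMPLE = """\
MemTotal:       16384000 kB
MemFree:          512000 kB
MemAvailable:   12000000 kB
Buffers:          256000 kB
Cached:         11000000 kB
SReclaimable:     400000 kB
"""

summary = cache_summary(parse_meminfo(SAMPLE))
print(summary)
```

On a real guest you would pass the contents of `/proc/meminfo` instead of `SAMPLE`. From the host's perspective, all of the cache is memory the VM has claimed, even though the guest could give most of it up.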

If you've got multiple VMs running on a host, you can also use Kernel Samepage Merging, which deduplicates identical memory pages between processes. When we last tested it, we had multiple VMs with similar workloads that together consumed 15 GB of RAM, and after enabling KSM, this dropped to only 6 GB. However, even though the memory savings are colossal, you should be cautious about using it. Enabling KSM introduces a side-channel vulnerability that potentially lets a process running in one VM read files from a different VM.[3]
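If you want to measure what KSM saves on your own host, the kernel exposes counters under `/sys/kernel/mm/ksm`. Here is a minimal sketch that estimates the savings from `pages_sharing`, which the kernel documentation describes as the number of additional sites sharing merged pages. It assumes 4 KiB pages by default, and the demo uses a fake directory with a made-up counter value so that it runs anywhere.

```python
import os
import tempfile

KSM_DIR = "/sys/kernel/mm/ksm"  # standard sysfs location on Linux hosts


def read_counter(name: str, ksm_dir: str = KSM_DIR) -> int:
    """Read one integer counter exposed by KSM."""
    with open(os.path.join(ksm_dir, name)) as f:
        return int(f.read().strip())


def ksm_saved_bytes(ksm_dir: str = KSM_DIR, page_size: int = 4096) -> int:
    """Estimate saved memory: each 'sharing' page is a duplicate merged away."""
    return read_counter("pages_sharing", ksm_dir) * page_size


# Demo against a fake directory so the sketch does not need KSM enabled.
with tempfile.TemporaryDirectory() as fake_dir:
    with open(os.path.join(fake_dir, "pages_sharing"), "w") as f:
        f.write("2359296\n")  # made-up value: 9 GiB worth of 4 KiB pages
    saved = ksm_saved_bytes(fake_dir)

print(f"{saved / 2**30:.1f} GiB saved")
```

On a real host, call `ksm_saved_bytes()` with no arguments, and pass `os.sysconf("SC_PAGE_SIZE")` as `page_size` if your page size may differ from 4 KiB.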

VMs are also not as portable as containers. Many cloud providers don't support nested virtualization, so you can't run a VM inside their VM-based instances; AWS EC2 is one example.
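A quick way to check whether a given machine can host VMs is to look for hardware virtualization CPU flags and a usable KVM device. The sketch below is a rough heuristic for x86 Linux only: it assumes the `vmx` (Intel) or `svm` (AMD) flag appears in `/proc/cpuinfo` and that KVM is exposed at `/dev/kvm`. The function names are made up for illustration.

```python
import os


def cpu_supports_hw_virt(cpuinfo_text: str) -> bool:
    """Return True if any 'flags' line lists vmx (Intel) or svm (AMD)."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            flags = set(value.split())
            if "vmx" in flags or "svm" in flags:
                return True
    return False


def kvm_device_usable(dev_path: str = "/dev/kvm") -> bool:
    """Return True if the KVM device exists and we may open it read/write."""
    return os.path.exists(dev_path) and os.access(dev_path, os.R_OK | os.W_OK)


def can_run_vms() -> bool:
    """Heuristic check meant to be run on the machine in question."""
    try:
        with open("/proc/cpuinfo") as f:
            cpuinfo = f.read()
    except OSError:
        return False  # not Linux, or /proc is unavailable
    return cpu_supports_hw_virt(cpuinfo) and kvm_device_usable()


print(can_run_vms())
```

Inside a cloud instance without nested virtualization, you would typically expect this to print `False`, because the guest CPU does not advertise the virtualization flags.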

Our Solution

There was no silver bullet we could use to virtualize development environments. VMs are not very memory-efficient, and containers can't isolate all workloads. Many of our users would not need the virtualization capabilities of VMs, and could save costs by putting more containers on a single machine. However, Hocus itself depends on low-level kernel features, and can't be fully developed inside a container. We wanted to use Hocus to develop Hocus as soon as we could, so the first version of Hocus uses VMs. However, we designed the system in a way that allows us to add container[4] support later.

Hocus is a work in progress, a proof of concept, and we want to finish it in collaboration with people who need it. We are looking for individuals who can't stand their huge, slow dev environments at work and want to do something about it. We recently figured out how to start 100 GB+ dev environments in seconds even when you haven't downloaded them onto your host yet,[5] and we'd love to find someone to implement it for. If that's you, you can sign up for the closed beta of Hocus Enterprise. We will work with you to introduce Hocus at your company, and adapt it to your needs. But, if you're just interested in what we've built so far, you can check out the alpha version on GitHub.

  1. Container terminology is baffling. Containerd calls itself a container runtime instead of an engine, but then it uses runc, Sysbox, or others under the hood, which are also called container runtimes, even though they are much lower level tools. In this blog post a Docker developer laments: "I think the container ecosystem can be confusing at times. Especially with the terminology that we use. What's this? A runtime. And this? A runtime..." I decided to use the term "container engine" because Podman describes itself as "an open source container, pod, and container image management engine" and it doesn't clash with the other thing.
  2. Last time I wrote that memory ballooning in Firecracker is nearly impossible to set up, Hacker News called me defensive and disingenuous and someone tweeted that it's in fact pretty simple - you just have to write a custom memory driver for Firecracker in Rust. So this time I'm just going to say it's not straightforward.
  3. Actually, the vulnerability is a bit more nuanced and I can't quite wrap my head around what its full real-world implications are. It allows a process to detect that a different process is using the same memory page. An attacker would do it by first loading a memory page, and then dropping it from CPU cache. They would then wait a bit, and load it again. If loading was instant, then a different process must have loaded it in the meantime, and therefore it was using it. So to actually read a file from a different VM, you have two options: already know the exact contents of the file, in which case you can verify that the other VM is reading the same file, or figure out how to use the side channel to indirectly infer the contents of unknown files. I guess you could maybe do it by dropping memory pages that contain functions from standard libraries used to process IO, and then observing the order in which they are loaded again to infer the data that's being processed? I don't know, and if you have any resources that show how the vulnerability is exploited in the wild, I would love a link. I also found this paper that shows an attack which can extract encryption keys from a different VM.
  4. In fact, we would like to use LXC rather than Sysbox. Earlier I mentioned that LXC is more mature - it supports GPUs and can run nested LXC. As a side note, I don't quite understand why Sysbox exists. Why not create a containerd runtime that's a wrapper over LXC? LXC already existed when Sysbox started development. And why doesn't this wrapper exist now? It seems like a relatively low-effort way to introduce mature system containers into the OCI ecosystem. If anyone knows, I would love to understand this.
  5. If your dev environment's image size exceeds 100 GB, and you have 10 Gbps of network bandwidth, it's going to take at least 80 seconds to download. We integrated with a storage system called Overlaybd that supports a technique called lazy pulling. It allows you to download parts of the image on demand and start with only the data that you need to boot. You then pull the rest in the background while the developer is already inside the dev environment. However, there are still some issues we need to iron out before we fully integrate it into Hocus. That's going to be another blog post.
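The 80-second figure in footnote 5 follows from simple arithmetic, sketched below. The 2 GB "needed to boot" value in the second call is a hypothetical number chosen only to illustrate why lazy pulling helps. It is not a measurement from Hocus or Overlaybd.

```python
def min_download_seconds(size_gb: float, link_gbps: float) -> float:
    """Lower bound on transfer time: gigabytes -> gigabits, divided by link speed."""
    return size_gb * 8 / link_gbps


# Full image up front: 100 GB over a 10 Gbps link.
print(min_download_seconds(100, 10))  # 80.0

# Lazy pulling: fetch only what is needed to boot (hypothetical 2 GB).
print(min_download_seconds(2, 10))    # 1.6
```

These are best-case numbers that ignore protocol overhead, decompression, and disk speed, so real downloads take longer.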