Why cgroups v2 Replaced cgroups v1

v1 was assembled, not designed. Here's what broke, and why the kernel team had to start over.

Software and infrastructure engineer. Open source contributor. The person behind root.cause.dev. I build things, break them on purpose, and write about what I find underneath. Two-time GSoC alum. Speaker at OpenSearch Ahmedabad. BazelCon & OpenSearchCon Korea & Japan attendee.

TL;DR: cgroups v1 was built one subsystem at a time, by different teams, with no unified design. Each resource controller (CPU, memory, I/O) lived in its own independent hierarchy. A process could sit in several different cgroup trees simultaneously, with no consistent view across them. This caused real problems: memory limits you could silently escape via swap, I/O costs charged to the wrong cgroup, and resource delegation that required root. cgroups v2 threw out the fragmented model and replaced it with a single unified hierarchy. Current major distributions default to it, and modern container runtimes support it. This post is about what broke and why the fix required a full rewrite.

In 2008, the Linux kernel shipped a feature called cgroups - control groups. The idea was straightforward: let the kernel limit and account for the resources a group of processes can use. CPU time. Memory. Disk I/O. Give a group a budget, enforce it at the scheduler level, no cooperation required from userspace.

It worked. It shipped in 2.6.24. Google was already using an internal version of it. And for the next several years, the Linux ecosystem built on top of it: systemd adopted it, LXC used it, and eventually Docker was built with it as one of the two pillars of container isolation.

There was just one problem. The cgroups that shipped in 2008 were not designed. They were assembled.

Each resource controller - cpu, memory, blkio, devices, pids - was written by a different team, merged independently, and mounted as its own separate hierarchy. They didn't share a data model. They didn't coordinate. They just coexisted under /sys/fs/cgroup/ like tenants in the same building who have never met each other.

For a while, this was fine. And then people started running serious workloads in containers. And the cracks showed.

What cgroups Actually Do

Before the history, the mechanism, because you need it to understand what broke.

A cgroup is a group of processes with shared resource accounting and hard limits. The kernel enforces those limits at the scheduler, memory manager, and I/O stack level. There's no userspace enforcement, no polling, no cooperation. The kernel simply will not give a cgroup more than its allocation.

The interface to all of this is a pseudo-filesystem mounted at /sys/fs/cgroup/. You create a cgroup by making a directory. You put processes into it by writing their PIDs to a file. You set limits by writing numbers to other files. The kernel watches all of it.

mount | grep cgroup

On a system running both v1 and v2, common during the transition period, the output looks like this:

tmpfs on /sys/fs/cgroup type tmpfs (ro,nosuid,nodev,noexec,mode=755)
cgroup on /sys/fs/cgroup/unified    type cgroup2 (rw,...)
cgroup on /sys/fs/cgroup/memory     type cgroup  (rw,...,memory)
cgroup on /sys/fs/cgroup/cpu,cpuacct type cgroup (rw,...,cpu,cpuacct)
cgroup on /sys/fs/cgroup/blkio      type cgroup  (rw,...,blkio)
cgroup on /sys/fs/cgroup/pids       type cgroup  (rw,...,pids)

Look at what this is showing you. Not one hierarchy. Five separate ones. Each mounted independently at its own path under /sys/fs/cgroup/. A process can be a member of all five simultaneously, at different paths, under different trees, organized differently in each.

This is where the problem lived.

The Problem: A Process With "N" Different Addresses

In v1, every process had a separate position in every controller hierarchy. Not one address in the cgroup world but one per hierarchy, with no guaranteed relationship between them.

Check your own shell:

cat /proc/$$/cgroup
11:cpuset:/
10:rdma:/
9:blkio:/system.slice/session-3.scope
8:devices:/system.slice/session-3.scope
7:pids:/user.slice/user-1000.slice/session-3.scope
6:memory:/user.slice/user-1000.slice/session-3.scope
5:cpu,cpuacct:/user.slice/user-1000.slice/session-3.scope
4:net_cls,net_prio:/
3:freezer:/
2:perf_event:/
1:name=systemd:/user.slice/user-1000.slice/session-3.scope

Eleven lines on this cgroup v1 system. One process. Eleven separate hierarchies, each with its own path, its own organizational structure, its own rules. The process is in /user.slice/.../session-3.scope for memory limits, but sitting at the root / of the cpuset hierarchy. There is no unified answer to "where is this process in the cgroup tree?" - because there isn't one tree.
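To make that fragmentation concrete, here is a minimal shell sketch that counts how many hierarchies a process belongs to and how many distinct paths it occupies across them. The function name cgroup_summary is made up for illustration. It only parses the hierarchy-ID:controllers:path lines that /proc/PID/cgroup prints, so it makes no assumptions about which controllers exist.

```shell
# Summarise a /proc/PID/cgroup listing read from stdin.
# Each line has the form  hierarchy-ID:controller-list:path
# (cgroup_summary is a hypothetical helper name, not a standard tool.)
cgroup_summary() {
    awk -F: '
        NF >= 3 {
            hierarchies++
            # The path is everything after the second colon
            path = $3
            for (i = 4; i <= NF; i++) path = path ":" $i
            if (!(path in seen)) { seen[path] = 1; distinct++ }
        }
        END { printf "%d hierarchies, %d distinct paths\n", hierarchies, distinct }
    '
}

# Usage on a live system:
#   cgroup_summary < /proc/$$/cgroup
```

Fed the eleven lines above, it prints "11 hierarchies, 3 distinct paths": one process, three unrelated positions.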

This fragmentation wasn't just aesthetically ugly. It caused actual bugs that affected real workloads.

Bug 1: The Memory Limit You Could Silently Escape

Here's the one that hit people hardest in production.

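In v1, the memory limit and the swap limit were separate knobs. memory.limit_in_bytes capped RAM, and swap was only bounded if you also set memory.memsw.limit_in_bytes, which often required swap accounting to be enabled at boot. You could set a memory limit on a container without setting a swap limit. If the container hit its memory limit, the kernel's memory reclaim would kick in, and if there was swap available, processes would silently spill to swap and keep running.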

From the outside, this looked like the memory limit was being respected. The container's memory RSS stayed under the limit. But its actual memory footprint, anonymous memory + swap, was quietly growing beyond it. You'd set a 512MB limit, the container would hit it, start swapping, and churn along at effectively unlimited memory with just a performance penalty.

The limit wasn't broken in the sense that the kernel ignored it. The limit was broken in the sense that it didn't mean what you thought it meant.

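In v2, memory.max is a hard ceiling on the cgroup's memory usage, and swap has its own explicit limit in the same directory, memory.swap.max. Set memory.swap.max to 0 and there is nowhere to spill: when reclaim cannot bring usage back under memory.max, the OOM killer fires inside the cgroup. Not silently escaped, actually enforced.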
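A minimal sketch of how you might inspect that budget on a v2 system. It assumes the cgroup directory exposes the standard v2 files memory.max, memory.current, and memory.swap.max (the swap file controls swap separately and can be absent when swap accounting is unavailable). The function names and the example path are hypothetical.

```shell
# Print how many bytes a v2 cgroup can still grow before hitting memory.max.
# $1 is the cgroup directory. (mem_headroom is a hypothetical helper name.)
mem_headroom() {
    dir=$1
    max=$(cat "$dir/memory.max") || return 1
    cur=$(cat "$dir/memory.current") || return 1
    if [ "$max" = "max" ]; then
        # The literal string "max" means no limit is set
        echo "unlimited"
    else
        echo $((max - cur))
    fi
}

# Describe the swap policy recorded in memory.swap.max.
swap_policy() {
    swap=$(cat "$1/memory.swap.max") || return 1
    case $swap in
        0)   echo "swap disabled" ;;
        max) echo "swap unlimited" ;;
        *)   echo "swap capped at $swap bytes" ;;
    esac
}

# Usage (path is an example only):
#   mem_headroom /sys/fs/cgroup/system.slice/example.service
#   swap_policy  /sys/fs/cgroup/system.slice/example.service
```

If swap_policy reports anything other than "swap disabled", the cgroup can still push anonymous memory to swap up to that separate cap.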

Bug 2: The Writeback Problem

This one is more subtle but arguably more fundamental.

When a process writes data, it doesn't go directly to disk. It goes into the page cache, memory-resident dirty pages that get flushed to disk asynchronously in the background. That flushing is called writeback.

In v1, I/O limits lived in the blkio controller and dirty pages were tracked by the memory controller, each in its own hierarchy. That split mattered because writeback is not performed by the process that dirtied the pages. Kernel flusher threads do it later, often when memory pressure forces reclaim and some of the pages being reclaimed are dirty.

The result: the kernel had no reliable way to map a dirty page back to a blkio cgroup, so buffered writeback was charged to the wrong place, typically the root cgroup. The I/O accounting was structurally wrong. You could have a container generating enormous disk I/O, not from direct writes but from dirty pages being flushed on its behalf, and its blkio limits would never see it.

In v2, because all controllers share one hierarchy, the kernel can tie a dirty page's memory cgroup to the same cgroup's I/O accounting. With the memory and io controllers both enabled, the cgroup that owns the dirty pages owns the writeback cost.
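One way to see what a cgroup has been charged for is its io.stat file, which lists per-device counters such as rbytes= and wbytes=. The sketch below sums the write bytes across devices. The function name is hypothetical, and whether buffered writeback shows up here depends on the memory and io controllers both being enabled and on the filesystem supporting cgroup writeback.

```shell
# Sum the wbytes= counters of a cgroup v2 io.stat file read from stdin.
# Lines look like:  8:0 rbytes=... wbytes=... rios=... wios=... dbytes=... dios=...
# (io_written_bytes is a hypothetical helper name.)
io_written_bytes() {
    awk '
        {
            for (i = 2; i <= NF; i++) {
                # Split each key=value pair and keep only wbytes
                if (split($i, kv, "=") == 2 && kv[1] == "wbytes")
                    total += kv[2]
            }
        }
        # %.0f avoids integer overflow in awk builds with a 32-bit %d
        END { printf "%.0f\n", total }
    '
}

# Usage (path is an example only):
#   io_written_bytes < /sys/fs/cgroup/system.slice/example.service/io.stat
```

Sampling this before and after a burst of buffered writes gives a rough picture of how much write I/O was attributed to the cgroup.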

Bug 3: You Needed Root for Everything

This one mattered specifically for containers. In v1, any manipulation of cgroup hierarchies required root. There was no safe way to delegate cgroup management to an unprivileged user.

What this meant in practice: rootless containers couldn't enforce resource limits on themselves. You could run a container as a non-root user. But if that container needed to set its own memory limit, it couldn't, because that required writing to cgroup files owned by root.

This wasn't a policy decision. It was an architectural limitation. The v1 model had no concept of "this subtree is yours to manage."

The Fix: One Tree

cgroups v2 starts from a different premise. There is one hierarchy. All controllers are domains within that one tree. A process has exactly one position in that tree, and all resource accounting for that process - CPU, memory, I/O, all of it - flows from that single position.

Check whether your system is running v2:

stat -fc %T /sys/fs/cgroup/

cgroup2fs

If you get cgroup2fs, you're on a v2-only system. If you get tmpfs, you're on v1 or a hybrid setup. Modern Ubuntu (22.04+), Fedora (31+), and Debian (11+) are all v2 by default now. Look at the difference in how a process sees itself:

cat /proc/$$/cgroup

One line, of the form 0::/path. One hierarchy, index 0, which always means cgroup2. One path. Every controller sees this same position. That's the entire structural change. But the implications cascade through everything.
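The same file can tell you which mode a host is in. This is a heuristic sketch, not an official detection method: it treats a line starting with 0:: as the v2 hierarchy and any other hierarchy line as v1, and reports hybrid when both appear. The function name cgroup_mode is made up for illustration.

```shell
# Classify a /proc/PID/cgroup listing read from stdin as v1, v2, or hybrid.
# (cgroup_mode is a hypothetical helper name; the rule is a heuristic.)
cgroup_mode() {
    awk -F: '
        # Hierarchy 0 with an empty controller list is the v2 hierarchy
        $1 == "0" && $2 == "" { v2 = 1; next }
        # Any other well-formed line is a v1 hierarchy
        NF >= 3               { v1 = 1 }
        END {
            if (v2 && v1)  print "hybrid"
            else if (v2)   print "v2"
            else if (v1)   print "v1"
            else           print "unknown"
        }
    '
}

# Usage on a live system:
#   cgroup_mode < /proc/$$/cgroup
```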

ls /sys/fs/cgroup/

No subsystem subdirectories. No /memory/, no /cpu/, no /blkio/. Just a single tree with controller files living alongside cgroup metadata. The controllers available at any node are listed in cgroup.controllers. The ones you've enabled for children are in cgroup.subtree_control. One tree, one interface.
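As a rough sketch of how those two files work together, the function below checks that the memory controller is available in a parent cgroup, enables it for children, creates a child, and sets its limit. The function name, the child name, and the limit are all hypothetical. On a real system this needs write access to the parent, and the kernel will refuse to enable controllers on a non-root cgroup that still has processes of its own.

```shell
# Create a child cgroup with a memory limit under an existing v2 cgroup.
# $1 = parent cgroup directory, $2 = child name, $3 = limit in bytes
# (make_limited_child is a hypothetical helper name.)
make_limited_child() {
    parent=$1
    child=$2
    limit=$3
    # A controller can only be enabled if the parent lists it as available
    grep -qw memory "$parent/cgroup.controllers" || {
        echo "memory controller not available in $parent" >&2
        return 1
    }
    # Enable the controller for children of this cgroup
    echo "+memory" > "$parent/cgroup.subtree_control" || return 1
    # Creating the directory creates the cgroup
    mkdir -p "$parent/$child" || return 1
    # Writing a number to memory.max sets the limit
    echo "$limit" > "$parent/$child/memory.max"
}

# Usage (names are examples only, run with sufficient privileges):
#   make_limited_child /sys/fs/cgroup/demo-parent demo-child 268435456
#   echo $$ > /sys/fs/cgroup/demo-parent/demo-child/cgroup.procs
```

The last usage line moves the current shell into the new cgroup by writing its PID to cgroup.procs.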

The move from v1 to v2 wasn't just a cleanup. It was the kernel team admitting that building each subsystem independently was a mistake, and that a correct resource model requires a unified view. You cannot account for memory-induced I/O if memory and I/O live in separate hierarchies.

The Delegation Fix

In v2, the kernel introduced a formal delegation model. A cgroup directory can be owned by a non-root user. That user can create subdirectories, write to controller files, and manage resources within their subtree, without any root privileges. The kernel enforces the boundary: you can only affect what's inside your subtree.

ls -la /sys/fs/cgroup/user.slice/user-1000.slice/user@1000.service/

On a systemd host, that directory is owned by UID 1000. No sudo required to read it. No sudo required to create child cgroups inside it. No sudo required to set memory.max on those children, as long as the memory controller has been delegated. This is why rootless Podman and rootless Docker can enforce resource limits, not because they escalated privileges, but because v2 made resource management safe to delegate.

Rootless container resource limits are a cgroups v2 feature. In v1, there was no delegation model. The path to enforcement always went through root. v2 fixed the ownership model at the kernel level, and every rootless runtime benefited immediately.
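Delegation in v2 comes down to file ownership: a subtree is delegated by giving a user write access to the cgroup directory and to its cgroup.procs, cgroup.threads, and cgroup.subtree_control files. The sketch below checks exactly that for the current user. The function name is hypothetical, and passing this check does not guarantee that any particular controller was delegated.

```shell
# Return success if the current user has the write access that v2
# delegation grants on the cgroup directory passed as $1.
# (can_manage_cgroup is a hypothetical helper name.)
can_manage_cgroup() {
    dir=$1
    [ -d "$dir" ] && [ -w "$dir" ] &&
    [ -w "$dir/cgroup.procs" ] &&
    [ -w "$dir/cgroup.threads" ] &&
    [ -w "$dir/cgroup.subtree_control" ]
}

# Usage (path is an example only):
#   if can_manage_cgroup "/sys/fs/cgroup/user.slice/user-$(id -u).slice/user@$(id -u).service"; then
#       echo "subtree is delegated to me"
#   fi
```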

What This Looks Like for a Running Container

Run a container with memory and CPU limits and look at exactly what the kernel set:

docker run -d --name v2test --memory 64m --cpus 0.5 ubuntu sleep 300

Find its cgroup path. On a v2 system where Docker uses the systemd cgroup driver (its default on v2), containers land under system.slice:

PID=$(docker inspect v2test --format '{{.State.Pid}}')
cat /proc/$PID/cgroup

One line. Now read the actual limits from the cgroup files:

CID=$(docker inspect v2test --format '{{.Id}}')
cat /sys/fs/cgroup/system.slice/docker-${CID}.scope/memory.max

67108864 bytes, exactly 64 MiB. That's the hard ceiling. When the cgroup's usage cannot be reclaimed back under it, the OOM killer fires. Swap is capped separately, in memory.swap.max in the same directory. And the CPU limit:

cat /sys/fs/cgroup/system.slice/docker-${CID}.scope/cpu.max

The file reads 50000 100000: 50ms of CPU time out of every 100ms period, 0.5 of a CPU. The scheduler's bandwidth control enforces this for the cgroup as a whole. That container will not get more CPU time than this, regardless of host load.

Two numbers. Two files. The entire resource contract for that container, readable directly from the filesystem.
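The raw values are easy to turn into human units. A small sketch, with hypothetical function names: memory.max holds either a byte count or the literal string max, and cpu.max holds a quota and a period in microseconds (or max in place of the quota).

```shell
# Convert a memory.max value to MiB; the literal "max" means no limit.
# (Both function names are hypothetical helpers.)
mem_max_to_mib() {
    case $1 in
        max) echo "unlimited" ;;
        *)   echo "$(( $1 / 1048576 )) MiB" ;;
    esac
}

# Convert a cpu.max value ("QUOTA PERIOD", both in microseconds) to CPUs.
cpu_max_to_cpus() {
    echo "$1" | awk '{
        if ($1 == "max") print "unlimited"
        else printf "%.2f CPUs\n", $1 / $2
    }'
}

# Usage, with CG set to the container cgroup directory:
#   mem_max_to_mib  "$(cat "$CG/memory.max")"
#   cpu_max_to_cpus "$(cat "$CG/cpu.max")"
```

For the container above, 67108864 becomes 64 MiB and 50000 100000 becomes 0.50 CPUs.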

docker rm -f v2test

🟩 Try It Yourself

# Confirm your system is on v2
stat -fc %T /sys/fs/cgroup/
# Expected: cgroup2fs

# Run a container and read its live limits
docker run -d --name cgtest --memory 128m --cpus 1.0 ubuntu sleep 300
CID=$(docker inspect cgtest --format '{{.Id}}')

cat /sys/fs/cgroup/system.slice/docker-${CID}.scope/memory.max
# Expected: 134217728 (128MB in bytes)

cat /sys/fs/cgroup/system.slice/docker-${CID}.scope/cpu.max
# Expected: 100000 100000 (1 full CPU)

docker rm -f cgtest
What You Can Now Explain
  • Why v1 used multiple separate hierarchies - each controller was written by a different team and merged independently, with no unified design

  • The memory limit escape bug in v1 - the memory limit did not cover swap unless a second limit was also set, so containers could silently exceed it

  • The writeback attribution problem - dirty pages were flushed by the kernel on a cgroup's behalf, but the I/O was charged to the wrong cgroup because memory and blkio were separate hierarchies

  • Why rootless container resource limits weren't possible in v1 - no delegation model; cgroup manipulation always required root

  • What v2's single unified hierarchy actually means - one path per process, all controllers sharing one view

  • How to check whether your system is on v1, v2, or hybrid - stat -fc %T /sys/fs/cgroup/ and /proc/PID/cgroup line count

  • How to read a container's exact memory and CPU limits directly from the v2 cgroup filesystem

Next Post

→ The OCI spec: what actually defines a container? Namespaces handle isolation. cgroups handle resource limits. But a container is also an image, a filesystem layout, a config, a runtime contract. The OCI spec is the document that formally defines all of that, and understanding it is why you'll never be confused about why Podman images run on Kubernetes, or what runc actually is.

root/cause - Not tutorials. Just the real picture.