
Supporting both Linux CGroup APIs

Having spent considerable time trying to create a “CGroup” compatibility layer, and figuring out why a counter in the memory CGroup was underflowing, I now know for sure that I do not know what Linux Control Groups are. Before, I thought maybe I did, but after a much deeper investigation it’s clear I do not.

With a little bit of luck this article will help tip you over the edge. Perhaps it is time for a career change? Have you ever considered making things out of wood?

Joking aside, Linux Control Groups are trees of groups. A group can have processes or other groups inside it. Normally it can’t have both unless it is the root group. Internally the kernel provides an interface to the CGroup hierarchy (or tree(s)).

So we have a generic hierarchy of groups and processes. This is taken advantage of by Controllers. Usually these encapsulate control of various resources. In particular the memory and CPU controllers allow one to restrict the amount of memory and CPU time groups can use.

Internally the kernel provides a standard interface for developing controllers. So for each controller we get a roughly similar interface, both internally and in user land. In user land we get a file-based interface, usually located at /sys/fs/cgroup.

Each controller has wildly different knobs, represented by files. Each file can produce and consume arbitrary data, although all the files I have seen are text based.
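
Outside of any library, poking a single control file is just ordinary file I/O. Below is a minimal sketch assuming a V2 memory controller mounted at /sys/fs/cgroup; the “example” group and the limit value are made up for illustration.

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    /* Hypothetical group; it would be created beforehand with mkdir(2) */
    const char *path = "/sys/fs/cgroup/example/memory.max";
    int fd = open(path, O_WRONLY);

    if (fd < 0) {
        perror("open");
        return 1;
    }

    /* Control files consume plain text; here a memory limit in bytes */
    dprintf(fd, "%lu\n", 256UL * 1024 * 1024);
    close(fd);

    return 0;
}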

Something to keep in mind: there are many controller-specific details. Some resources require mutual exclusion while others can be “over committed”. Details such as these can break abstractions. Linux does not hold back features because they have “leaky” abstractions. This is a reason for its success and a source of confusion.

Linux also has the following maxim: do not break user land. This creates an interesting scenario when a regrettable interface is introduced, which appears to be what happened with CGroups V1.

The first interface allowed each controller to have its own hierarchy. In V2 this was simplified to a single hierarchy. Many other things were changed as well, including many details of the controller interfaces.

Because the kernel can’t break user land, it now must support both. It also supports hybrid configurations, where both V1 and V2 are active at once. This is presumably because controllers are being migrated to V2 piecemeal: some controllers are missing altogether from V2, while others are missing features in V2.

Each controller must be exclusively mounted as V1 or V2. However, we may have a mixture of different V1 and V2 controllers. It’s not clear to me whether anyone needs a hybrid setup at this point, but it is being used by various distributions.
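
To make the hybrid case concrete, below is a rough sketch of mounting the two hierarchy types side by side. The mount points and the choice of cpuset as the V1 controller are assumptions, and on most systems the init system has already done something like this, which is why the LTP library scans the existing setup instead of mounting its own.

#include <stdio.h>
#include <sys/mount.h>

int main(void)
{
    /* All V2 controllers share one hierarchy of filesystem type "cgroup2";
     * the target directories are assumed to exist already */
    if (mount("none", "/mnt/cgroup2", "cgroup2", 0, NULL))
        perror("mount cgroup2");

    /* A V1 controller gets its own hierarchy, selected by a mount option.
     * This only works if cpuset is not already attached to the V2 tree. */
    if (mount("none", "/mnt/cpuset", "cgroup", 0, "cpuset"))
        perror("mount cgroup v1 cpuset");

    return 0;
}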

The Linux Test Project has many tests which rely on CGroups. A lot of these were, and many still are, limited to CGroups V1. Perhaps foolishly, we decided to create a compatibility layer. Below is what the author wrote in tst_cgroup.h.

The LTP CGroups API tries to present a consistent interface to the many possible CGroup configurations a system could have.

You may ask: “Why don’t you just mount a simple CGroup hierarchy instead of scanning the current setup?”. The short answer is that it is not possible unless no CGroups are currently active, and almost all of our users will have CGroups active. Even if unmounting the current CGroup hierarchy were acceptable to the system manager, it is highly unlikely the CGroup hierarchy would actually be destroyed. So users would be forced to remove their CGroup configuration and reboot the system.

This perhaps deserves some emphasis. We need to test with specific CGroup controls, but we also have to play nice with init. There is unshare(CLONE_NEWCGROUP) which may help, but it requires root.
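
For completeness, a tiny sketch of the namespace route; it doesn’t remove the need for privileges, because unshare(2) with CLONE_NEWCGROUP requires CAP_SYS_ADMIN.

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
    /* After this call the process's current CGroup appears as the root
     * of the hierarchy, but only privileged processes may do it */
    if (unshare(CLONE_NEWCGROUP)) {
        perror("unshare(CLONE_NEWCGROUP)");
        return 1;
    }

    printf("now inside a new CGroup namespace\n");
    return 0;
}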

The core library tries to ensure an LTP CGroup exists on each hierarchy root. Inside the LTP group it ensures a ‘drain’ group exists and creates a test group for the current test. In the worst case we end up with a set of hierarchies like the following, where existing system-manager-created CGroups have been omitted.

      (V2 Root)       (V1 Root 1)     ...     (V1 Root N)
          |                |                      |
        (ltp)            (ltp)        ...        (ltp)
       /     \          /     \                  /    \
  (drain) (test-n) (drain)  (test-n)  ...     (drain)  (test-n)

V2 CGroup controllers use a single unified hierarchy on a single root. Two or more V1 controllers may share a root or have their own root. However, only one instance of each controller may exist, so you cannot have the same V1 controller on multiple roots.

It is possible to have both a V2 hierarchy and V1 hierarchies active at the same time, which is what is shown above. Any controllers attached to V1 hierarchies will not be available in the V2 hierarchy. The reverse is also true.
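
One way to see which controllers ended up on the V2 hierarchy is to read cgroup.controllers at its root; anything not listed there is either attached to a V1 hierarchy or disabled. A sketch, assuming the common /sys/fs/cgroup mount point:

#include <stdio.h>

int main(void)
{
    char buf[BUFSIZ];
    /* The root cgroup.controllers file lists the V2-attached controllers */
    FILE *f = fopen("/sys/fs/cgroup/cgroup.controllers", "r");

    if (!f) {
        perror("fopen");
        return 1;
    }

    if (fgets(buf, sizeof(buf), f))
        printf("V2 controllers: %s", buf);

    fclose(f);
    return 0;
}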

Note that a single hierarchy may be mounted multiple times, allowing it to be accessed at different locations. However, subsequent mount operations will fail if the mount options differ from the first.

The user may pre-create the CGroup hierarchies and the ltp CGroup; otherwise the library will try to create them. If the ltp group already exists and has appropriate permissions, then admin privileges will not be required to run the tests.

Because the test may not have access to the CGroup root(s), the drain CGroup is created. This can be used to store processes which would otherwise block the destruction of the individual test CGroup or one of its descendants.
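
Roughly, and using plain file operations instead of the library, draining looks like the sketch below. The helper name, the paths and the stray PID are hypothetical; the real library does the equivalent on every active hierarchy root.

#include <fcntl.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

static void drain_and_remove(pid_t stray, const char *drain_procs,
                 const char *test_dir)
{
    int fd = open(drain_procs, O_WRONLY);

    if (fd < 0) {
        perror("open drain cgroup.procs");
        return;
    }

    /* Move the stray process so that the test group becomes empty... */
    dprintf(fd, "%d\n", stray);
    close(fd);

    /* ...otherwise removing the test group would fail with EBUSY */
    if (rmdir(test_dir))
        perror("rmdir test group");
}

int main(void)
{
    /* Hypothetical paths; normally they are derived from the hierarchy roots */
    drain_and_remove(12345,
             "/sys/fs/cgroup/ltp/drain/cgroup.procs",
             "/sys/fs/cgroup/ltp/test-12345");
    return 0;
}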

The test author may create child CGroups within the test CGroup using the CGroup Item API. The library will create the new CGroup in all the relevant hierarchies.

There are many differences between the V1 and V2 CGroup APIs. If a controller is on both V1 and V2, it may have different parameters and control files. Some of these control files have a different name, but similar functionality. In this case the Item API uses the V2 names and aliases them to the V1 name when appropriate.
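
A much simplified sketch of what such aliasing could look like is below; the table only contains pairs mentioned in this article and is not the library’s actual list.

#include <stdio.h>
#include <string.h>

struct cgroup_alias {
    const char *v2_name;
    const char *v1_name;
};

static const struct cgroup_alias aliases[] = {
    { "memory.max",      "memory.limit_in_bytes" },
    { "memory.swap.max", "memory.memsw.limit_in_bytes" },
};

/* Tests always use the V2 name; on a V1 hierarchy it is swapped for the
 * V1 file name, falling back to the same name when no alias exists */
static const char *v1_file_name(const char *v2_name)
{
    size_t i;

    for (i = 0; i < sizeof(aliases) / sizeof(aliases[0]); i++) {
        if (!strcmp(aliases[i].v2_name, v2_name))
            return aliases[i].v1_name;
    }

    return v2_name;
}

int main(void)
{
    printf("memory.max -> %s\n", v1_file_name("memory.max"));
    return 0;
}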

Some control files only exist in one of the versions, or they can be missing for other reasons. The Item API allows the user to check whether a file exists before trying to use it.

Often a control file has almost the same functionality in V1 and V2, which means it can be used in the same way most of the time, but not always. For now this is handled by exposing the API version a controller is using, allowing the test author to handle edge cases (e.g. V2 memory.swap.max accepts “max”, but V1 memory.memsw.limit_in_bytes does not).

So what does this API look like? Below is an example taken from the docs.

#include "tst_test.h"
#include "tst_cgroup.h"

static const struct tst_cgroup_group *cg;

static void run(void)
{
    ...
    // do test under cgroup
    ...
}

static void setup(void)
{
    tst_cgroup_require("memory", NULL);
    cg = tst_cgroup_get_test_group();
    /* Add the test's own process to the test CGroup */
    SAFE_CGROUP_PRINTF(cg, "cgroup.procs", "%d", getpid());
    SAFE_CGROUP_PRINTF(cg, "memory.max", "%lu", MEMSIZE);
    /* memory.swap.max may be missing, e.g. if swap accounting is disabled */
    if (SAFE_CGROUP_HAS(cg, "memory.swap.max"))
        SAFE_CGROUP_PRINTF(cg, "memory.swap.max", "%zu", memsw);
}

static void cleanup(void)
{
    tst_cgroup_cleanup();
}

struct tst_test test = {
    .setup = setup,
    .test_all = run,
    .cleanup = cleanup,
    ...
};

This works quite nicely for the memory CGroup. Most of the time we can just translate V2 names to V1 names. However, there are things V2 accepts which V1 does not. For example V2 allows “max” to be written to memory.max, but V1 does not allow it to be written to memory.limit_in_bytes.
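
So a test which wants “no limit” still has to branch on the version; something along the lines of the sketch below, where set_memory_unlimited is a hypothetical helper and -1 is what the V1 file accepts for “unlimited”.

static void set_memory_unlimited(const struct tst_cgroup_group *cg)
{
    /* memory.max is translated to memory.limit_in_bytes on V1, which
     * rejects "max" but treats -1 as "no limit" */
    if (TST_CGROUP_VER(cg, "memory") == TST_CGROUP_V1)
        SAFE_CGROUP_PRINTF(cg, "memory.max", "%d", -1);
    else
        SAFE_CGROUP_PRINTF(cg, "memory.max", "%s", "max");
}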

For other CGroups we are looking at some bigger issues. The following function sets the “bandwidth” of a CPU CGroup in cfs_bandwidth01. The bandwidth is the amount of CPU time that may be used in a given period, so it is a two-dimensional value.

static void set_cpu_quota(const struct tst_cgroup_group *const cg,
              const float quota_percent)
{
    const unsigned int period_us = 10000;
    const unsigned int quota_us = (quota_percent / 100) * (float)period_us;

    if (TST_CGROUP_VER(cg, "cpu") != TST_CGROUP_V1) {
        SAFE_CGROUP_PRINTF(cg, "cpu.max",
                   "%u %u", quota_us, period_us);
    } else {
        SAFE_CGROUP_PRINTF(cg, "cpu.cfs_period_us",
                  "%u", period_us);
        /* Actually cpu.cfs_quota_us, but we translate it */
        SAFE_CGROUP_PRINTF(cg, "cpu.max",
                   "%u", quota_us);
    }

    tst_res(TINFO, "Set '%s/cpu.max' = '%d %d'",
        tst_cgroup_group_name(cg), quota_us, period_us);
}

Note that we must branch on the CGroup controller version. In V1, two files were used for the two values that make up the bandwidth; in V2 these were combined into a single file. Presently our translation layer can’t handle something like this, and it’s not entirely clear that it needs to. Often the extra complication to the library code is not worth saving a few lines in the tests.

Likely there are many other corner cases. On the plus side, we are now able to run some tests on far more setups. Practically speaking, the change from V1 to V2 did break user land. At least it broke LTP, although to be fair LTP is not a real user. Still, some tests stopped working because of the introduction of V2. Well, technically it was the adoption of V2 configurations by init systems like systemd that broke LTP…