Preamble
I have already tested an operating system resource scheduler with HP Process Resource Manager (PRM). When we migrated from HP-UX to Linux, Oracle was not yet certified on RedHat 6, so we had to use RHEL 5, and nothing existed on this release to control resource usage…
You can read here and there that the ancestor of cgroups is supposed to be /etc/security/limits.conf, but in my opinion they have little in common: limits.conf (with its associated pam_limits module) puts soft and hard limits on processes, while cgroups aim to prioritize server resources (CPU, memory, I/O, network, …).
Control groups (cgroups) are a kernel feature introduced with kernel 2.6.24, and so are available on any Linux distribution using this kernel or above…
I have tested this functionality on Oracle Linux Server release 6.4.
Installation
Start by installing the libcgroup package:
```
[root@server1 ~]# yum install libcgroup
Loaded plugins: security
Setting up Install Process
Resolving Dependencies
--> Running transaction check
---> Package libcgroup.x86_64 0:0.37-7.2.el6_4 will be installed
--> Finished Dependency Resolution

Dependencies Resolved

================================================================================
 Package          Arch       Version              Repository             Size
================================================================================
Installing:
 libcgroup        x86_64     0.37-7.2.el6_4       public_ol6_latest     110 k

Transaction Summary
================================================================================
Install       1 Package(s)

Total download size: 110 k
Installed size: 272 k
Is this ok [y/N]: y
Downloading Packages:
libcgroup-0.37-7.2.el6_4.x86_64.rpm                      | 110 kB     00:00
Running rpm_check_debug
Running Transaction Test
Transaction Test Succeeded
Running Transaction
  Installing : libcgroup-0.37-7.2.el6_4.x86_64                             1/1
  Verifying  : libcgroup-0.37-7.2.el6_4.x86_64                             1/1

Installed:
  libcgroup.x86_64 0:0.37-7.2.el6_4

Complete!
```
It creates default configuration files:
```
[root@server1 ~]# ll /etc/cg*
-rw-r--r-- 1 root root  812 Jun 25 17:04 /etc/cgconfig.conf
-rw-r--r-- 1 root root 1705 Jun 25 17:04 /etc/cgrules.conf
-rw-r--r-- 1 root root  161 Jun 25 17:04 /etc/cgsnapshot_blacklist.conf
```
But no service is started by default (cgconfig and cgred):
```
[root@server1 ~]# chkconfig --list | grep cg
cgconfig        0:off   1:off   2:off   3:off   4:off   5:off   6:off
cgred           0:off   1:off   2:off   3:off   4:off   5:off   6:off
```
From official documentation:
The cgconfig service installed with the libcgroup package provides a convenient way to create hierarchies, attach subsystems to hierarchies, and manage cgroups within those hierarchies.
Cgred is a service (which starts the cgrulesengd daemon) that moves tasks into cgroups according to parameters set in the /etc/cgrules.conf file.
Configuration
To list the available subsystems (it differs from one kernel to another):
```
[root@server1 ~]# uname -a
Linux server1.domain.com 2.6.39-400.210.2.el6uek.x86_64 #1 SMP Thu Oct 17 16:28:13 PDT 2013 x86_64 x86_64 x86_64 GNU/Linux
[root@server1 ~]# lssubsys -am
cpuset
cpu
cpuacct
memory
devices
freezer
net_cls
blkio
```
The RedHat 6 official documentation lists a few more subsystems than my Oracle Unbreakable Enterprise Kernel provides:
| Subsystem | Description |
|---|---|
| blkio | The Block I/O (blkio) subsystem controls and monitors access to I/O on block devices by tasks in cgroups. Writing values to some of these pseudofiles limits access or bandwidth, and reading values from some of these pseudofiles provides information on I/O operations |
| cpu | The cpu subsystem schedules CPU access to cgroups. Access to CPU resources can be scheduled using two schedulers: the Completely Fair Scheduler (CFS), a proportional share scheduler, and the Real-Time scheduler (RT), a scheduler for real-time tasks |
| cpuacct | The CPU Accounting (cpuacct) subsystem generates automatic reports on CPU resources used by the tasks in a cgroup, including tasks in child groups |
| cpuset | The cpuset subsystem assigns individual CPUs and memory nodes to cgroups |
| devices | The devices subsystem allows or denies access to devices by tasks in a cgroup |
| freezer | The freezer subsystem suspends or resumes tasks in a cgroup |
| memory | The memory subsystem generates automatic reports on memory resources used by the tasks in a cgroup, and sets limits on memory use by those tasks |
| net_cls | The net_cls subsystem tags network packets with a class identifier (classid) that allows the Linux traffic controller (tc) to identify packets originating from a particular cgroup. The traffic controller can be configured to assign different priorities to packets from different cgroups |
| net_prio | The Network Priority (net_prio) subsystem provides a way to dynamically set the priority of network traffic per network interface for applications within various cgroups |
| ns | The ns subsystem provides a way to group processes into separate namespaces. Within a particular namespace, processes can interact with each other but are isolated from processes running in other namespaces. These separate namespaces are sometimes referred to as containers when used for operating-system-level virtualization |
| perf_event | When the perf_event subsystem is attached to a hierarchy, all cgroups in that hierarchy can be used to group processes and threads which can then be monitored with the perf tool, as opposed to monitoring each process or thread separately or per-CPU |
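Each subsystem exposes its tunables as pseudofiles under its mount point, so you can browse and read them directly once the hierarchy is mounted. A minimal sketch, assuming the /cgroup/cpu mount point configured below:

```
# List the tunables exposed by the cpu subsystem
ls /cgroup/cpu
# Read the share value of the root cgroup
cat /cgroup/cpu/cpu.shares
```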
In the default configuration, only the subsystems are mounted in a virtual filesystem and, obviously, no rules have been created:
```
[root@server1 etc]# grep -v ^# cgconfig.conf
mount {
        cpuset  = /cgroup/cpuset;
        cpu     = /cgroup/cpu;
        cpuacct = /cgroup/cpuacct;
        memory  = /cgroup/memory;
        devices = /cgroup/devices;
        freezer = /cgroup/freezer;
        net_cls = /cgroup/net_cls;
        blkio   = /cgroup/blkio;
}
[root@server1 etc]# grep -v ^# cgrules.conf
```
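For reference, each line of the mount section is equivalent to mounting the cgroup virtual filesystem by hand with the corresponding subsystem option. A sketch of what `cpu = /cgroup/cpu;` does under the hood:

```
# Manual equivalent of the cgconfig mount entry for the cpu subsystem
mkdir -p /cgroup/cpu
mount -t cgroup -o cpu cpu /cgroup/cpu
```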
I create two groups: a low_profile with 25% of the CPU and a high_profile with 75% of the CPU (I NUMA-emulated my configuration to be able to mix the cpu and cpuset subsystems):
```
group low_profile {
        perm {
                task {
                        uid = yjaquier;
                        gid = users;
                }
                admin {
                        uid = root;
                        gid = root;
                }
        }
        cpuset {
                cpuset.cpus = "0";
                cpuset.mems = "0";
        }
        cpu {
                cpu.shares = "25";
        }
}
group high_profile {
        perm {
                task {
                        uid = oracle;
                        gid = dba;
                }
                admin {
                        uid = root;
                        gid = root;
                }
        }
        cpuset {
                cpuset.cpus = "0";
                cpuset.mems = "0";
        }
        cpu {
                cpu.shares = "75";
        }
}
```
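Note that cpu.shares values are relative weights, not absolute percentages; the 25/75 pair only works out to 25% and 75% of the CPU because the two weights happen to sum to 100:

```
# cpu.shares are relative weights, compared between cgroups competing for CPU:
#   low_profile  : 25 / (25 + 75) = 25% of the CPU under full contention
#   high_profile : 75 / (25 + 75) = 75% of the CPU under full contention
# Using 250/750, or 1/3, would give exactly the same split.
```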
On my small virtual machine I have only one CPU and memory is interleaved (a single memory node):
```
[root@server1 ~]# numactl --show
policy: default
preferred node: current
physcpubind: 0
cpubind: 0
nodebind: 0
membind: 0
```
Remark:
If you want to give NUMA affinity to your applications then cpuset is the subsystem to use but, as already said, it is complex, and by tuning it incorrectly you lose its added value.
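For completeness, a hypothetical cgconfig.conf stanza pinning a group to NUMA node 0 could look like the sketch below (the 0-3 CPU range is an assumption for illustration, not my single-CPU VM):

```
# Hypothetical stanza: pin tasks to cpus 0-3 and the memory of NUMA node 0
cpuset {
        cpuset.cpus = "0-3";
        cpuset.mems = "0";
}
```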
I associate my personal account (yjaquier) with low_profile and the oracle account with high_profile:
```
[root@server1 ~]# grep -v ^# /etc/cgrules.conf
yjaquier        *       low_profile
oracle          *       high_profile
```
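The cgrules.conf format also accepts a Unix group (prefixed with @) or a user:command pair for finer matching. Two hypothetical examples, not part of my setup:

```
# <user>[:<command>]         <subsystems>   <destination cgroup>
# @dba                       cpu,cpuset     high_profile   # every member of group dba
# yjaquier:/usr/bin/rsync    cpu,cpuset     low_profile    # only rsync run by yjaquier
```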
Finally I start the two cgroups services:
```
[root@server1 etc]# service cgconfig start
Starting cgconfig service:                                 [  OK  ]
[root@server1 etc]# service cgred start
Starting CGroup Rules Engine Daemon:                       [  OK  ]
```
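To have the configuration survive a reboot you will most likely also want to enable both services at boot time:

```
chkconfig cgconfig on
chkconfig cgred on
```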
Testing
To test it I’m using the C program (eat_cpu.c) I developed while testing HP Process Resource Manager (PRM). I have improved it to remove all compilation warnings (again, the idea is to create a CPU-intensive executable):
```c
#include <unistd.h>
#include <stdlib.h>
#include <stdio.h>
#include <math.h>

#define FALSE 0
#define TRUE 1

/* To compile: gcc eat_cpu.c -o eat_cpu -lm */

int main(int argc, char *argv[]) {
  pid_t pid;
  int i;
  double x, y;

  if (argc <= 1) {
    printf("Please provide number of CPU\n");
    exit(1);
  }
  else
    printf("%s running for %d CPU(s)\n", argv[0], atoi(argv[1]));

  /* Fork one CPU-burning child per requested CPU */
  for (i = 1; i <= atoi(argv[1]); i++) {
    pid = fork();
    if (pid == 0) {
      printf("Creating child %d\n", i);
      x = 0.000001;
      /* Infinite floating-point loop to keep the CPU busy */
      while (TRUE) {
        y = tan(x);
        x = x + 0.000001;
      }
      exit(0);
    }
  }
  return 0;
}
```
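Compile and launch it as follows (the argument is the number of CPU-burning children to fork):

```
gcc eat_cpu.c -o eat_cpu -lm   # -lm links the math library for tan()
./eat_cpu 1                    # fork one child that spins on the CPU
```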
Before giving a few results: if I execute the eat_cpu program once per account (yjaquier and oracle) without cgroups in action, I obviously get the following (each process is using half of the only available CPU):
```
  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 5143 oracle    20   0  6544   80    0 R 49.9  0.0   1:23.51 eat_cpu
 5145 yjaquier  20   0  6544   76    0 R 49.6  0.0   0:28.28 eat_cpu
```
Before going further, like many people, I asked myself why there is a difference in CPU usage between the ps and top commands. From the ps manual:
CPU usage is currently expressed as the percentage of time spent running during the entire lifetime of a process. This is not ideal, and it does not conform to the standards that ps otherwise conforms to. CPU usage is unlikely to add up to exactly 100%.
In other words, the CPU percentage of the ps command gives the average CPU usage of the process over its complete lifetime, while top gives the current (real-time) CPU usage:
```
[root@server1 ~]# top -b -n 1 | head -n 9
top - 14:32:00 up  4:11,  4 users,  load average: 2.00, 1.55, 0.79
Tasks: 117 total,   3 running, 114 sleeping,   0 stopped,   0 zombie
Cpu(s):  4.6%us,  0.3%sy,  0.0%ni, 94.5%id,  0.6%wa,  0.0%hi,  0.1%si,  0.0%st
Mem:   1020820k total,   638148k used,   382672k free,    46312k buffers
Swap:  4194300k total,     3740k used,  4190560k free,   460788k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 6882 oracle    20   0  6544   76    0 R 50.0  0.0   0:09.25 eat_cpu
 6884 yjaquier  20   0  6544   80    0 R 50.0  0.0   0:06.13 eat_cpu
[root@server1 ~]# ps -o %cpu,cgroup,euser,cmd 6882 6884
%CPU CGROUP                      EUSER    CMD
50.4 -                           oracle   /tmp/eat_cpu 1
49.5 -                           yjaquier /tmp/eat_cpu 1
```
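A small worked calculation of the difference (the figures are illustrative, not taken from the transcript above):

```
# ps %CPU = accumulated CPU time / elapsed wall-clock time since process start
# Illustrative: a process started 200 s ago that has consumed 100 s of CPU
#   ps  shows 100 / 200 = 50%, even if it is spinning at 100% right now
#   top shows ~100%, because it samples usage over its refresh interval
```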
In the above example cgroups are not in action and the ps/top commands give similar results. This is no longer the case once cgroups are in action with a non-balanced allocation…
When the two cgroups services are running I get the expected distribution. Either you stop and restart the two processes to have them associated with their respective cgroups, or you use the cgclassify command:
```
[root@server1 ~]# cgclassify -g cpu,cpuset:low_profile 6884
[root@server1 ~]# cgclassify -g cpu,cpuset:high_profile 6882
[root@server1 ~]# top -b -n 1 | head -n 9
top - 14:36:57 up  4:16,  4 users,  load average: 2.07, 1.88, 1.14
Tasks: 118 total,   3 running, 115 sleeping,   0 stopped,   0 zombie
Cpu(s):  6.2%us,  0.3%sy,  0.0%ni, 92.9%id,  0.5%wa,  0.0%hi,  0.1%si,  0.0%st
Mem:   1020820k total,   638724k used,   382096k free,    46456k buffers
Swap:  4194300k total,     3740k used,  4190560k free,   460840k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 6882 oracle    20   0  6544   76    0 R 75.2  0.0   6:28.53 eat_cpu
 6884 yjaquier  20   0  6544   80    0 R 25.7  0.0   5:42.64 eat_cpu
[root@server1 ~]# ps -o %cpu,cgroup,euser,cmd 6882 6884
%CPU CGROUP                                                                                          EUSER    CMD
53.7 blkio:/;net_cls:/;freezer:/;devices:/;memory:/;cpuacct:/;cpu:/high_profile;cpuset:/high_profile oracle   /tmp/eat_cpu 1
46.0 blkio:/;net_cls:/;freezer:/;devices:/;memory:/;cpuacct:/;cpu:/low_profile;cpuset:/low_profile   yjaquier /tmp/eat_cpu 1
```
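Alternatively, instead of reclassifying already-running processes, you can start a process directly inside a cgroup with the cgexec command (a sketch, using the paths of this article):

```
# Start eat_cpu directly in high_profile instead of cgclassify-ing it afterwards
cgexec -g cpu,cpuset:high_profile /tmp/eat_cpu 1
```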
Remark:
We see here that monitoring CPU usage with the ps command is no longer meaningful, as it keeps reporting the lifetime average…
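You can, however, still check the cgroup membership of a process directly in /proc (using PID 6882 from the example above):

```
# Show which cgroup(s) PID 6882 belongs to, one line per mounted hierarchy
cat /proc/6882/cgroup
```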
If I want to dynamically change the CPU allocation to 80%/20% I have two alternatives:
```
[root@server1 ~]# cgset -r cpu.shares=80 high_profile
[root@server1 ~]# cgset -r cpu.shares=20 low_profile
```
Or:
```
[root@server1 ~]# echo 80 > /cgroup/cpu/high_profile/cpu.shares
[root@server1 ~]# echo 20 > /cgroup/cpu/low_profile/cpu.shares
```
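Either way, the new values can be read back with cgget:

```
cgget -r cpu.shares high_profile
cgget -r cpu.shares low_profile
```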
And it is dynamically reflected:
```
[root@server1 ~]# top -b -n 1 | head -n 9
top - 16:55:04 up  6:34,  4 users,  load average: 2.00, 1.53, 0.77
Tasks: 119 total,   3 running, 116 sleeping,   0 stopped,   0 zombie
Cpu(s): 13.7%us,  0.3%sy,  0.0%ni, 85.5%id,  0.4%wa,  0.0%hi,  0.2%si,  0.0%st
Mem:   1020820k total,   640996k used,   379824k free,    48196k buffers
Swap:  4194300k total,     3732k used,  4190568k free,   460952k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 9974 oracle    20   0  6544   76    0 R 79.8  0.0   5:30.02 eat_cpu
 9972 yjaquier  20   0  6544   80    0 R 20.0  0.0   1:31.46 eat_cpu
```
One concrete example I can see in the Oracle database world is when I have, as a simple case, two databases running on my single-CPU box. Without cgroups and with three CPU-consuming processes you get the below balanced distribution:
```
[root@server1 ~]# top -b -n 1 | head -n 10
top - 17:02:46 up  6:42,  4 users,  load average: 2.15, 1.93, 1.26
Tasks: 120 total,   4 running, 116 sleeping,   0 stopped,   0 zombie
Cpu(s): 15.1%us,  0.3%sy,  0.0%ni, 84.0%id,  0.4%wa,  0.0%hi,  0.2%si,  0.0%st
Mem:   1020820k total,   641444k used,   379376k free,    48352k buffers
Swap:  4194300k total,     3732k used,  4190568k free,   461088k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 9974 oracle    20   0  6544   76    0 R 32.9  0.0  11:34.30 eat_cpu
10382 yjaquier  20   0  6544   80    0 R 32.9  0.0   0:04.88 eat_cpu
 9972 yjaquier  20   0  6544   80    0 R 31.0  0.0   3:08.90 eat_cpu
```
If I configure cgroups to allocate to each database 100/(number of databases running on the server)% of the CPU resource (so 50% in my simple example), I get the below distribution, and no single database can kill the whole server's performance:
```
[root@server1 ~]# top -b -n 1 | head -n 10
top - 17:04:44 up  6:44,  4 users,  load average: 2.89, 2.29, 1.47
Tasks: 120 total,   4 running, 116 sleeping,   0 stopped,   0 zombie
Cpu(s): 15.5%us,  0.3%sy,  0.0%ni, 83.7%id,  0.4%wa,  0.0%hi,  0.2%si,  0.0%st
Mem:   1020820k total,   641764k used,   379056k free,    48408k buffers
Swap:  4194300k total,     3732k used,  4190568k free,   461088k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 9974 oracle    20   0  6544   76    0 R 49.1  0.0  12:21.79 eat_cpu
10382 yjaquier  20   0  6544   80    0 R 27.5  0.0   0:40.05 eat_cpu
 9972 yjaquier  20   0  6544   80    0 R 25.5  0.0   3:44.07 eat_cpu
```
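A hypothetical cgconfig.conf fragment for that two-database layout could look like the sketch below (the group and account names db1/db2 and oracle1/oracle2 are assumptions for illustration only):

```
# Hypothetical: equal shares for two databases, each under its own OS account
group db1 {
        perm { task { uid = oracle1; gid = dba; } admin { uid = root; gid = root; } }
        cpu { cpu.shares = "50"; }
}
group db2 {
        perm { task { uid = oracle2; gid = dba; } admin { uid = root; gid = root; } }
        cpu { cpu.shares = "50"; }
}
```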
Fabiano says:
Very interesting indeed! Thank you,
Is there any way to create a per-user cgroup, in a way that each user in a set can use at most N GB or M CPUs? I have understood that when adding users to the same cgroup they share the same overall limit, but I cannot easily define a per-user behaviour. Perhaps I am missing something? Regards
Yannick Jaquier says:
Welcome !
I think nothing forbids creating one cgroup per user if you want fine tuning of resources, but I believe that, on top of being cumbersome, this is not the philosophy of the product…