Linux Containers » JasonLe's TechBlog

Archive for the 'Linux Containers' category

Linux Namespace Analysis (1)

March 9th, 2015

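Kernel-level virtualization on Linux rests on two things: limiting resources and isolating them. Namespaces provide the isolation of system resources.

The basic namespace framework was largely in place around Linux 2.6.23, and most later patches were incremental. One important exception landed in Linux 3.8: a non-root user can create a user namespace, and inside that user namespace the user holds all privileges, including the right to create other types of namespaces.

Linux namespaces mainly include: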

UTS namespace
Mount namespace
IPC namespace
PID namespace
Network namespace
User namespace

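A UTS namespace lets one host carry more than one set of nodename and domainname. The most visible effect: if you start an image with lxc or docker, uname inside the container reports a different hostname from the host's.

A mount namespace gives processes in different namespaces different views of the filesystem hierarchy, with better security than chroot. Using mount propagation, the mount namespaces of a system can also be arranged in a master/slave structure.

An IPC namespace isolates inter-process communication objects: processes can communicate through them only with other processes in the same IPC namespace.

The PID namespace is the most remarkable one. It lets processes in different PID namespaces have the same PID. Suppose the system has a root namespace plus namespace A and namespace B, where A and B are children of the root. The same PID can then exist once inside A and once inside B; each can have its own init, for example. Seen from the root namespace, of course, those two init processes have different PIDs, which is where the notion of a PID mapping comes in.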
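The PID mapping can be observed from user space. As a rough sketch: on newer kernels (Linux 4.1 and later, which postdates this post), /proc/<pid>/status carries an NSpid line listing the process's PID in each nested PID namespace, outermost first. The helper name parse_nspid and the sample text below are made up for illustration.

```python
def parse_nspid(status_text):
    """Return the PIDs from the NSpid line of a /proc/<pid>/status dump.

    The list runs from the outermost visible PID namespace to the
    innermost one. Returns [] when the kernel does not expose NSpid.
    """
    for line in status_text.splitlines():
        if line.startswith("NSpid:"):
            return [int(field) for field in line.split()[1:]]
    return []


# Made-up sample: a container's init as seen from the root namespace.
sample = "Name:\tinit\nPid:\t4321\nNSpid:\t4321\t1\n"
pids = parse_nspid(sample)
print("PID in the root namespace:", pids[0])
print("PID in its own namespace:", pids[-1])
```

To try it on a live process, pass the contents of /proc/<pid>/status read from the host side.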

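A network namespace gives each namespace its own network stack. One physical host can thus present, say, two virtual network interfaces, and a process behind each of them can bind the same port without the two interfering with each other.

The user namespace was only completed in Linux 3.8. It lets a process's user ID and group ID differ inside and outside the namespace. For example, an unprivileged process in the root namespace can have UID 0 inside namespace A, that is, it can be a privileged process there!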
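The inside/outside difference is recorded in /proc/<pid>/uid_map (and gid_map), where each line has three fields: first ID inside the namespace, first ID outside, and range length. Here is a minimal sketch of that translation; the function names and the 0-to-1000 mapping are assumptions chosen for illustration.

```python
def parse_id_map(text):
    """Parse uid_map/gid_map content into (inside, outside, length) tuples."""
    entries = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 3:
            entries.append(tuple(int(f) for f in fields))
    return entries


def to_outside(inside_id, id_map):
    """Translate an ID seen inside the namespace to the ID seen outside.

    Returns None if no mapping range covers the ID.
    """
    for inside, outside, length in id_map:
        if inside <= inside_id < inside + length:
            return outside + (inside_id - inside)
    return None


# Hypothetical mapping: UID 0 inside the namespace is UID 1000 outside.
id_map = parse_id_map("         0       1000          1\n")
print(to_outside(0, id_map))  # root inside, an ordinary user outside
print(to_outside(1, id_map))  # not covered by the mapping
```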

Together, these namespaces let Linux build a lightweight virtual-machine-like system, such as lxc or docker.

References:
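To see which namespaces a process belongs to, one option is to read the symbolic links under /proc/<pid>/ns, whose targets look like uts:[4026531838]. Two processes share a namespace when the numbers match. The sketch below assumes a kernel that exposes these links and returns an empty result otherwise; the function names are my own.

```python
import os


def parse_ns_link(target):
    """Split a link target such as 'uts:[4026531838]' into
    the namespace type and its inode number."""
    ns_type, _, rest = target.partition(":[")
    return ns_type, int(rest.rstrip("]"))


def namespaces_of(pid="self"):
    """Map each entry under /proc/<pid>/ns to its namespace inode number.

    Returns {} where the directory is missing or unreadable.
    """
    ns_dir = "/proc/%s/ns" % pid
    result = {}
    try:
        names = os.listdir(ns_dir)
    except OSError:
        return result
    for name in names:
        try:
            _, inode = parse_ns_link(os.readlink(os.path.join(ns_dir, name)))
        except (OSError, ValueError):
            continue  # unreadable or unexpectedly formatted entry
        result[name] = inode
    return result


for name, inode in sorted(namespaces_of().items()):
    print(name, inode)
```

Run inside and outside a container, the printed numbers for uts, pid, net and so on should differ wherever the container has its own namespace.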

http://lwn.net/Articles/531114/

Posted in Kernel Analysis, Linux, Linux Containers

Tags: namespace

Introduction to cgroup (2)

March 2nd, 2015

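In my first article on cgroup, I made a first attempt at using cgroup to limit and isolate resources: http://www.lizhaozhong.info/archives/1211

But hierarchy-based cgroup has a drawback: it is inflexible. The tree can be arbitrarily deep, which makes it very tedious to manage in practice.

For this reason, kernel 3.16 officially added the unified hierarchy feature. It is still under development, so it has to be enabled explicitly: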
mount -t cgroup -o __DEVEL__sane_behavior cgroup $MOUNT_POINT

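The name __DEVEL__sane_behavior itself tells us that the feature is still in development.

In the earlier cgroup hierarchy, as we know, a hierarchy can be bound to a single subsystem or to all 12 subsystems at once.

For example, say hierarchy A is bound to cpuset and hierarchy B to memory. If a task needs both subsystems, it has to live in both hierarchies, and its positions in the two are often orthogonal, which is very inconvenient.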

hierarchy may be collapsed from leaf towards root when viewed from specific
controllers.  For example, a given configuration might not care about
how memory is distributed beyond a certain level while still wanting
to control how CPU cycles are distributed.

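With __DEVEL__sane_behavior enabled, cgroup.controllers shows the available subsystems. In the unified hierarchy the system attaches all subsystems to the root, only leaf nodes may contain tasks, and non-leaf nodes only perform resource control.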

# mount -t cgroup -o __DEVEL__sane_behavior cgroup /sys/fs/cgroup
# cat /sys/fs/cgroup/cgroup.controllers
cpuset cpu cpuacct memory devices freezer net_cls blkio perf_event net_prio hugetlb

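Now we create parent and child under the root cgroup. The root's cgroup.subtree_control determines the contents of parent's cgroup.controllers.

The same applies at every level: a cgroup's cgroup.subtree_control determines its children's cgroup.controllers. In other words, subsystems are not passed down transitively!

As in the example below: if I enable memory and cpu in the root's cgroup.subtree_control, then parent can control those two subsystems. child, however, controls neither memory nor cpu unless they are enabled for it in turn.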

# mkdir /sys/fs/cgroup/parent
# mkdir /sys/fs/cgroup/parent/child
# echo "+memory +cpu" > /sys/fs/cgroup/cgroup.subtree_control
# cat /sys/fs/cgroup/parent/cgroup.controllers
cpu memory
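The non-transitive rule is easy to model without touching a real kernel. The following is a toy in-memory sketch, not a kernel API: the Cgroup class and its method names are hypothetical, and a ValueError stands in for the kernel rejecting a write to cgroup.subtree_control.

```python
class Cgroup:
    """Toy in-memory model of a unified-hierarchy cgroup."""

    def __init__(self, name, parent=None, available=()):
        self.name = name
        self.parent = parent
        self.root_controllers = set(available)  # only meaningful for the root
        self.subtree_control = set()

    @property
    def controllers(self):
        """What cgroup.controllers would list for this cgroup."""
        if self.parent is None:
            return set(self.root_controllers)
        return set(self.parent.subtree_control)

    def enable(self, *names):
        """Model `echo "+name ..." > cgroup.subtree_control`."""
        missing = set(names) - self.controllers
        if missing:
            raise ValueError(
                "not available in %s: %s" % (self.name, sorted(missing)))
        self.subtree_control |= set(names)


root = Cgroup("/", available=["cpuset", "cpu", "memory", "blkio"])
parent = Cgroup("parent", root)
child = Cgroup("child", parent)

root.enable("memory", "cpu")
print(sorted(parent.controllers))  # mirrors the cat of parent's file above
print(sorted(child.controllers))   # stays empty until parent enables something
```

Calling child.enable("cpu") at this point raises, because parent has not yet put cpu into its own subtree_control; after parent.enable("cpu") the controller shows up in child.controllers.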

An example:

A(b,m) - B(b,m) - C (b)
              \ - D (b) - E

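Here b stands for blkio and m for memory, and A is the root. In this structure A, C, and D contain processes. C, for instance, is limited on blkio but not on memory, where it shares B's allocation. E is a special case: with no subsystem specified for it, its blkio is controlled by D and its memory by B. The procedure is the one already shown above for parent and child.

cgroup.subtree_control can be used to change a cgroup's controller settings only while that cgroup itself contains no processes.

Intermediate levels must have a subsystem enabled before it can be used further down: a controller can be enabled for E only if D has it enabled. Otherwise the system rejects the operation!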

Unified hierarchy implements an interface file "cgroup.populated" which can be used to monitor whether the cgroup's subhierarchy has tasks in it or not. Its value is 0 if there is no task in the cgroup and its descendants; otherwise, 1. poll and [id]notify events are triggered when the value changes.
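The value described in that passage is a simple recursive property of the tree. A small sketch, using a made-up dict layout for a cgroup rather than any real interface:

```python
def populated(cgroup):
    """Return 1 if the cgroup or any descendant holds a task, else 0.

    `cgroup` is a toy dict: {"tasks": [pids], "children": [cgroups]}.
    """
    if cgroup["tasks"]:
        return 1
    return 1 if any(populated(c) for c in cgroup["children"]) else 0


leaf = {"tasks": [], "children": []}
tree = {"tasks": [], "children": [{"tasks": [], "children": [leaf]}]}

print(populated(tree))      # no task anywhere in the subtree
leaf["tasks"].append(1234)  # a hypothetical PID joins the deepest cgroup
print(populated(tree))      # the change is visible from the top
```

In the real interface you would not recompute this yourself; you would wait for the poll or notify event on cgroup.populated.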

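The other unified hierarchy changes are spelled out clearly in the documentation, so I won't repeat them here.

They include the removal of tasks and cgroup.clone_children, with cgroup.procs as the process-level interface, among others. Once the design of this hierarchy is settled, the old cgroup mechanism will be replaced by the unified hierarchy.

References: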

http://lwn.net/Articles/601840/

http://events.linuxfoundation.org/sites/events/files/slides/2014-KLF.pdf

https://www.kernel.org/doc/Documentation/cgroups/unified-hierarchy.txt

http://d.hatena.ne.jp/defiant/mobile?date=20140826

Posted in Linux, Linux Containers

Tags: Linux Containers

cgroups (abbreviated from control groups) is a Linux kernel feature that limits, accounts for and isolates the resource usage (CPU, memory, disk I/O, network, etc.) of a collection of processes.

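cgroup is a mechanism that dates from 2007 and was introduced in kernel 2.6.24. Over the years it has grown into the backbone of docker and LXC.

Plenty of Chinese-language material already explains the definition of cgroup in great detail, so I won't repeat it here.

It is worth noting that cgroup originally exposed its user-facing control interface as a sysfs-style filesystem built directly on the VFS, because many hierarchies can exist.

cgroup abused the VFS in some respects and accumulated many problems as a result; deadlocks on rename could occur in the VFS, for example.
After version 3.14, that is, after 2013, the kernel reworked cgroup heavily. Many data structures changed substantially, and a Unified hierarchy was proposed.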

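Drawbacks [1][2]:

1) The tree structure brings some flexibility, but it causes problems in practice. For example, each subsystem can have only one instance, attached to one hierarchy. freezer therefore also has only one instance, and it has to be moved around to control other processes.
2) All subsystems bound to a hierarchy share that hierarchy. Once the hierarchy is tied to specific PIDs, the control granularity is hard to get right.
3) Because it is a tree, its depth is unbounded. With finite resources, a sub-hierarchy may therefore hand out more than the finite resources available. It also makes management more complex.
4) Support for multiple hierarchies limits how cgroup can be used: we can have 12 hierarchies bound to the 12 subsystems, or a single hierarchy bound to all 12 subsystems.

The Unified hierarchy was developed in kernel 3.16. Its goal is to resolve the drawbacks above with a different structure while keeping enough flexibility for most use cases.

This feature is not discussed further in this post.

One point in brief: the new unified hierarchy does not allow remount/rename.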

-------

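With the big cgroup rework after 3.14, cgroup dropped its sysfs-style VFS implementation and moved to kernfs (http://en.wikipedia.org/wiki/Kernfs_%28Linux%29).

Put simply, this approach extracts the core of sysfs into a new filesystem that can be reused. For now, though, cgroup is the only user of kernfs besides sysfs itself.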

https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=2bd59d48ebfb3df41ee56938946ca0dd30887312

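That commit describes the post-3.14 changes fairly clearly. With kernfs, the direct use of the VFS goes away: super_block, dentry, inode, and bdi are removed, cgroup_mount() mounts the root directly, and the registration structures formerly managed by the VFS move to kernfs_ops and kernfs_syscall_ops.

As for the individual subsystems, here is a brief rundown: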

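blkio: sets input/output limits on block devices, such as physical devices (disks, SSDs, USB, and so on).
cpu: uses the scheduler to give the tasks in a cgroup access to the CPU. Put simply, it comes down to the size of the time slice!
cpuacct: automatically generates reports on the CPU used by the tasks in a cgroup.
cpuset: assigns individual CPUs (on multicore systems) and memory nodes to the tasks in a cgroup.
devices: allows or denies access to devices by the tasks in a cgroup.
freezer: suspends or resumes the tasks in a cgroup.
memory: sets limits on the memory used by the tasks in a cgroup and automatically generates reports on the memory resources those tasks use.
net_cls: tags network packets with a class identifier (classid) so that the Linux traffic controller (tc) can identify packets originating from a particular cgroup.
ns: the namespace subsystem.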
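Which of these subsystems a given kernel provides, and which hierarchy each one is attached to, can be read from /proc/cgroups. Below is a minimal parsing sketch; the function name is my own, and the sample text is made up in the file's four-column layout, so real values will differ per machine.

```python
def parse_proc_cgroups(text):
    """Parse /proc/cgroups content into {subsystem: info} form."""
    subsystems = {}
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blanks and the header line
        fields = line.split()
        if len(fields) != 4:
            continue
        name, hierarchy, num_cgroups, enabled = fields
        subsystems[name] = {
            "hierarchy": int(hierarchy),
            "num_cgroups": int(num_cgroups),
            "enabled": enabled == "1",
        }
    return subsystems


# Made-up sample: cpuset and memory share hierarchy 2, freezer is unattached.
sample = (
    "#subsys_name\thierarchy\tnum_cgroups\tenabled\n"
    "cpuset\t2\t1\t1\n"
    "memory\t2\t1\t1\n"
    "freezer\t0\t1\t0\n"
)
for name, info in sorted(parse_proc_cgroups(sample).items()):
    print(name, info)
```

Subsystems that report the same hierarchy number are mounted together, which is what happens when several of them are passed to one mount -o option list.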

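Suppose we want to create a cgroup that is limited in CPU cores and memory size:

1) Create a directory with mkdir.

2) Mount the new filesystem. If we mount under /sys/fs/cgroup, we first have to mount a tmpfs there, because sysfs does not allow creating directories. On other filesystems that allow creating directories, no tmpfs is needed.

3) $ mount -t cgroup -o cpuset,memory cgroup_cpu_mem /mnt/cgroup_cpu_mem

4) The filesystem is now mounted; this is the root cgroup.

5) List the mount point: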

[lzz@localhost cgroup_cpu_mem]$ ls
cgroup.clone_children   cpuset.memory_pressure_enabled   memory.kmem.slabinfo                memory.pressure_level
cgroup.event_control    cpuset.memory_spread_page        memory.kmem.tcp.failcnt             memory.soft_limit_in_bytes
cgroup.procs            cpuset.memory_spread_slab        memory.kmem.tcp.limit_in_bytes      memory.stat
cgroup.sane_behavior    cpuset.mems                      memory.kmem.tcp.max_usage_in_bytes  memory.swappiness
cgroup_test             cpuset.sched_load_balance        memory.kmem.tcp.usage_in_bytes      memory.usage_in_bytes
cpuset.cpu_exclusive    cpuset.sched_relax_domain_level  memory.kmem.usage_in_bytes          memory.use_hierarchy
cpuset.cpus             memory.failcnt                   memory.limit_in_bytes               notify_on_release
cpuset.mem_exclusive    memory.force_empty               memory.max_usage_in_bytes           release_agent
cpuset.mem_hardwall     memory.kmem.failcnt              memory.move_charge_at_immigrate     tasks
cpuset.memory_migrate   memory.kmem.limit_in_bytes       memory.numa_stat
cpuset.memory_pressure  memory.kmem.max_usage_in_bytes   memory.oom_control

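We can see that this root cgroup contains the settings of both subsystems, prefixed with cpuset and memory respectively. What we have just built is a hierarchy: a collection of cpuset and memory resources.

6) Creating a directory under this hierarchy creates a new cgroup. Enter cgroup_test.

7) The settings in cgroup_test all start out at 0.

8) For the meaning of each field, see the documentation: https://www.kernel.org/doc/Documentation/cgroups/

9) Let's set some simple limits: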

$ echo 16M > memory.limit_in_bytes
$ echo 1 > cpuset.cpus
$ echo 0 > cpuset.mems

10) Processes can now be added to this cgroup. We can start by adding a shell with echo $$ > tasks; everything launched from that shell is then added to the cgroup as well!
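Steps 9 and 10 can also be scripted. The sketch below only assumes the file names visible in the directory listing above (memory.limit_in_bytes, cpuset.cpus, cpuset.mems, tasks); the function names are my own, and the demo deliberately writes into a scratch directory so that it runs without root or a mounted cgroup filesystem.

```python
import os
import tempfile


def write_value(cgroup_dir, filename, value):
    """Write one setting, the equivalent of `echo value > filename`."""
    with open(os.path.join(cgroup_dir, filename), "w") as f:
        f.write("%s\n" % value)


def limit_cgroup(cgroup_dir, memory_bytes, cpus, mems):
    """Apply the three limits from step 9 to a cgroup directory."""
    write_value(cgroup_dir, "memory.limit_in_bytes", memory_bytes)
    write_value(cgroup_dir, "cpuset.cpus", cpus)
    write_value(cgroup_dir, "cpuset.mems", mems)


def add_task(cgroup_dir, pid):
    """Move a process into the cgroup, like `echo $$ > tasks`."""
    write_value(cgroup_dir, "tasks", pid)


# Dry run against a scratch directory: these are plain files, not a cgroup.
# On a real system, pass /mnt/cgroup_cpu_mem/cgroup_test instead (as root).
demo_dir = tempfile.mkdtemp()
limit_cgroup(demo_dir, 16 * 1024 * 1024, "1", "0")
add_task(demo_dir, os.getpid())
print(sorted(os.listdir(demo_dir)))
```

The order matters on a real cpuset cgroup: cpuset.cpus and cpuset.mems have to be set before any task can be attached, which is why add_task comes last.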

[1] http://lwn.net/Articles/606699/
[2] https://www.kernel.org/doc/Documentation/cgroups/unified-hierarchy.txt