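Linux Containers » JasonLe's TechBlog

Posts Tagged 'Linux Containers'

Thoughts prompted by lxc-checkpoint …

May 12th, 2015

Since version 1.1, LXC has integrated CRIU, which makes it possible to checkpoint a running container. Sometimes, though, the attempt fails with the message "lxc.tty must be 0", which means specific options have to be added to the container's lxc config.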

sudo tee -a /var/lib/lxc/u1/config << EOF
# hax for criu
lxc.console = none
lxc.tty = 0
lxc.cgroup.devices.deny = c 5:1 rwm
EOF

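This left me with a question, though: what exactly is a tty, and what manages it?

As we all know, Unix offers many shells and login paths (bash, zsh, sh, ssh and so on); the shell is the program into which we type commands, and each user's login shell is configured in /etc/passwd.

But which program invokes /bin/bash? Tracing /bin/login with strace (strace -f -o /tmp/strace.log /bin/login) shows an execve system call that executes /bin/bash. And who invokes login? It turns out to be getty: getty executes /bin/login directly in its own main process, so /bin/login replaces getty's process image.

getty in turn is started by init. A SysV init reads /etc/inittab, together with the scripts under /etc/rc.d/, to do this; systemd no longer uses inittab and starts getty from its own getty@.service units.
From the way the system boots, we can work out the call chain (a repeated name marks a fork; the other arrows are execve calls):

init -> init -> /sbin/getty -> /bin/login -> /bin/login -> /bin/bash

After an execve call, the new program directly replaces the old one. One thing to keep in mind: these processes form a parent-child chain, so when a child exits its parent has to reap it, otherwise the child stays around as a zombie. When we type exit to leave /bin/bash, it is therefore as if /sbin/getty itself had ended. The init at the head of the chain notices that /sbin/getty has exited and forks a new child to start /sbin/getty again, which in turn starts /bin/login, and we see the "XXX login:" prompt once more.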
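The fork, execve, reap and restart cycle described above can be sketched in a few lines of C. This is only a minimal illustration, not how any real init is written: spawn_and_reap() and respawn() are hypothetical helper names, and /bin/sh stands in for getty.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fork a child, replace its image with /bin/sh running `cmd`, then reap it.
 * Returns the child's exit status, or -1 on error. */
int spawn_and_reap(const char *cmd)
{
    int status;
    pid_t pid = fork();

    if (pid < 0)
        return -1;
    if (pid == 0) {
        /* Child: execl only returns on failure; the PID does not change. */
        execl("/bin/sh", "sh", "-c", cmd, (char *)NULL);
        _exit(127);
    }
    /* Parent: without this wait the dead child would remain a zombie. */
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}

/* init-style supervision: start the command again each time it exits.
 * Returns how many times the command was started and reaped. */
int respawn(const char *cmd, int times)
{
    int started = 0;

    for (int i = 0; i < times; i++) {
        if (spawn_and_reap(cmd) < 0)
            break;
        started++;
    }
    return started;
}
```

Calling respawn("exit 0", 3) starts and reaps the command three times in a row, which is the behaviour we observe when a new login prompt appears after every exit.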

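In general, what is built into the shell takes precedence over programs we write ourselves. When resolving a command name the shell first tries aliases, then shell functions, then builtins, and only last external programs.

In short: alias -> shell function -> builtin -> program

References:

[1] man boot-scripts (the Linux boot process)
[2] man bootparam (Linux kernel boot parameters)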
[3] man 5 passwd
[4] man shadow


Posted in Linux

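Tags: CRIU Linux Containers

Physical memory management: implementation of the core PFN request functions (1)

March 24th, 2015

In the article "Physical memory management: analysis of the PFN request function hierarchy" I analyzed the structure of the page-frame allocation functions. Of these, alloc_pages_node() is the core of upper-level page-frame allocation. Compared with alloc_pages() it takes one extra parameter, nid; if the nid passed in is < 0, the physical frame is allocated on the current memory node.

It is worth stating that Linux memory management is designed around the NUMA model; if the system has only one node, numa_node_id() simply returns that single node.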

static inline struct page *alloc_pages_node(int nid, gfp_t gfp_mask,
                                                unsigned int order)
{
        /* Unknown node is current node */
        if (nid < 0)
                nid = numa_node_id();

        return __alloc_pages(gfp_mask, order, node_zonelist(nid, gfp_mask));
}

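In the call to __alloc_pages(), nid and gfp_mask yield a suitable zonelist. Each memory node has up to three zones by default, ZONE_DMA/ZONE_NORMAL/ZONE_HIGHMEM, and node_zonelist(nid, gfp_mask) selects the zonelist, an ordered list of zones, to allocate from.

Recall the member struct zonelist node_zonelists[MAX_ZONELISTS] of struct pglist_data. On NUMA, MAX_ZONELISTS is 2, so every node carries two zonelists: one that may fall back to other nodes, and one restricted to the node itself, used with __GFP_THISNODE.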

/*
 * The NUMA zonelists are doubled because we need zonelists that restrict the
 * allocations to a single node for __GFP_THISNODE.
 *
 * [0]  : Zonelist with fallback
 * [1]  : No fallback (__GFP_THISNODE)
 */
#define MAX_ZONELISTS 2
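To make the two-entry array concrete, here is a small user-space model of the selection step. It is a sketch under stated assumptions, not kernel code: all toy_ names and TOY_GFP_THISNODE are invented stand-ins, and the nid < 0 check that the kernel performs in alloc_pages_node() is folded into the lookup for brevity.

```c
#include <assert.h>

#define TOY_GFP_THISNODE 0x1u   /* stand-in for __GFP_THISNODE */
#define MAX_ZONELISTS    2
#define TOY_MAX_NODES    2

struct toy_zonelist { const char *desc; };
struct toy_pgdat    { struct toy_zonelist node_zonelists[MAX_ZONELISTS]; };

/* Two nodes, each with a fallback list [0] and a node-local list [1]. */
struct toy_pgdat toy_nodes[TOY_MAX_NODES] = {
    { { { "node 0, then fall back to node 1" }, { "node 0 only" } } },
    { { { "node 1, then fall back to node 0" }, { "node 1 only" } } },
};

int toy_current_node = 0;       /* stand-in for numa_node_id() */

/* Index into node_zonelists: 1 only when the caller forbids fallback. */
int toy_gfp_zonelist(unsigned int flags)
{
    return (flags & TOY_GFP_THISNODE) ? 1 : 0;
}

const struct toy_zonelist *toy_node_zonelist(int nid, unsigned int flags)
{
    if (nid < 0)
        nid = toy_current_node; /* "unknown node is current node" */
    return &toy_nodes[nid].node_zonelists[toy_gfp_zonelist(flags)];
}
```

With these definitions, toy_node_zonelist(-1, 0) returns the fallback list of the current node, while toy_node_zonelist(1, TOY_GFP_THISNODE) returns the list confined to node 1.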

__alloc_pages() in turn wraps __alloc_pages_nodemask(), which is the main body of page-frame allocation [2]:

struct page *
__alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
                        struct zonelist *zonelist, nodemask_t *nodemask)
{
        enum zone_type high_zoneidx = gfp_zone(gfp_mask);
        struct zone *preferred_zone;
        struct zoneref *preferred_zoneref;
        struct page *page = NULL;
        int migratetype = gfpflags_to_migratetype(gfp_mask);
        unsigned int cpuset_mems_cookie;
        int alloc_flags = ALLOC_WMARK_LOW|ALLOC_CPUSET|ALLOC_FAIR;
        int classzone_idx;

        gfp_mask &= gfp_allowed_mask;

        lockdep_trace_alloc(gfp_mask);

        might_sleep_if(gfp_mask & __GFP_WAIT);

        if (should_fail_alloc_page(gfp_mask, order))
                return NULL;
...
        if (unlikely(!zonelist->_zonerefs->zone))
                return NULL;
....
retry_cpuset:
        cpuset_mems_cookie = read_mems_allowed_begin();

        /* The preferred zone is used for statistics later */
        preferred_zoneref = first_zones_zonelist(zonelist, high_zoneidx,
                                nodemask ? : &cpuset_current_mems_allowed,
                                &preferred_zone);
        if (!preferred_zone)
                goto out;
        classzone_idx = zonelist_zone_idx(preferred_zoneref);

        /* First allocation attempt */
        page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, nodemask, order,
                        zonelist, high_zoneidx, alloc_flags,
                        preferred_zone, classzone_idx, migratetype);
        if (unlikely(!page)) {
....
                gfp_mask = memalloc_noio_flags(gfp_mask);
                page = __alloc_pages_slowpath(gfp_mask, order,
                                zonelist, high_zoneidx, nodemask,
                                preferred_zone, classzone_idx, migratetype);
        }

        trace_mm_page_alloc(page, order, gfp_mask, migratetype);

out:
....
        if (unlikely(!page && read_mems_allowed_retry(cpuset_mems_cookie)))
                goto retry_cpuset;

        return page;
}

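Reading the code, gfp_zone() derives the appropriate zone index from gfp_mask. After a few checks, zonelist->_zonerefs->zone is used to test whether the zonelist is empty, since at least one usable zone must exist; first_zones_zonelist() then uses that zone index to pick the preferred zone.

If a preferred zone was found, control passes to get_page_from_freelist(), which can be seen as the front end of the buddy allocator: if the buddy system has free space, the memory is allocated from it. If that attempt fails, __alloc_pages_slowpath() takes over. Here the kernel relaxes the allocation constraints and reclaims memory, which in most cases eventually yields a page, although the slow path can still fail and return NULL.

A word on likely() and unlikely(): these two macros are only branch-prediction hints. They tell gcc which outcome is expected, so that at compile time it can place the likely code first and move the unlikely code out of the way, which improves efficiency; they do not change the value of the condition. For example, if (unlikely(!zonelist->_zonerefs->zone)) still executes return NULL whenever zonelist->_zonerefs->zone is NULL [1], even though that return is unlikely to happen.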
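For readers who want to try the hint outside the kernel, the macros can be reproduced with the GCC/Clang builtin __builtin_expect. The macro definitions below follow the well-known kernel form; first_or_minus_one() is a hypothetical helper added only to show that the hint leaves the result of the test untouched.

```c
#include <assert.h>
#include <stddef.h>

/* !!(x) normalizes any non-zero value to 1 before the hint is applied. */
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

/* Returns the first element, or -1 for an empty or missing array.
 * The error branch is marked unlikely, but it still runs when it must. */
int first_or_minus_one(const int *items, size_t count)
{
    if (unlikely(items == NULL || count == 0))
        return -1;
    return items[0];
}
```

The only observable difference between this and a plain if is in the generated code layout, never in the value returned.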

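The code also contains the statement cpuset_mems_cookie = read_mems_allowed_begin();. The name alone tells us it is related to cgroup, more precisely to the cpuset subsystem, which is responsible for assigning CPUs and memory nodes; if no nodemask is given, the nodes permitted by cpuset_current_mems_allowed are used. Under the out label there is an if (unlikely(!page && read_mems_allowed_retry(cpuset_mems_cookie))): when a page allocation fails, the kernel checks whether the set of allowed memory nodes was modified after cpuset_mems_cookie was taken, and if so it retries.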
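The cookie-and-retry idea can be modelled in user space with a plain sequence counter. This is a single-threaded sketch with invented toy_ names, not the kernel's seqcount implementation; the concurrent_update callback is an assumption used to simulate another task rewriting the cpuset in the middle of an allocation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_cpuset {
    unsigned int seq;            /* bumped on every change of mems_allowed */
    unsigned long mems_allowed;  /* bitmask of permitted memory nodes */
};

typedef void (*toy_hook)(struct toy_cpuset *cs);

unsigned int toy_read_begin(const struct toy_cpuset *cs) { return cs->seq; }

bool toy_read_retry(const struct toy_cpuset *cs, unsigned int cookie)
{
    return cs->seq != cookie;    /* true when the snapshot went stale */
}

void toy_update_mems(struct toy_cpuset *cs, unsigned long mask)
{
    cs->mems_allowed = mask;
    cs->seq++;
}

/* Example hook: a "concurrent" writer that now allows node 1 only. */
void toy_allow_node1(struct toy_cpuset *cs) { toy_update_mems(cs, 1UL << 1); }

/* Returns the node a page was taken from, or -1. *attempts counts passes. */
int toy_alloc_page(struct toy_cpuset *cs, int free_pages[], int nr_nodes,
                   toy_hook concurrent_update, int *attempts)
{
    unsigned int cookie;

    *attempts = 0;
    do {
        (*attempts)++;
        cookie = toy_read_begin(cs);
        unsigned long allowed = cs->mems_allowed;   /* snapshot */

        if (concurrent_update) {                    /* simulated race, once */
            concurrent_update(cs);
            concurrent_update = NULL;
        }
        for (int node = 0; node < nr_nodes; node++) {
            if ((allowed & (1UL << node)) && free_pages[node] > 0) {
                free_pages[node]--;
                return node;
            }
        }
        /* Failure: retry only if the allowed mask changed meanwhile. */
    } while (toy_read_retry(cs, cookie));
    return -1;
}
```

If node 0 is allowed but empty and the hook switches the mask to node 1 during the first pass, the first attempt fails against the stale snapshot and the second succeeds on node 1; without a concurrent change the function gives up after a single pass.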

[1] http://blog.csdn.net/npy_lp/article/details/7175517

[2] http://lxr.free-electrons.com/source/mm/page_alloc.c#L2857

Posted in Kernel Analysis, Linux, Memory Management

Tags: Linux Containers Management Memory

Checkpoint/Restore in user space:CRIU

March 20th, 2015

Update 2015-3-23

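CRIU is a currently popular application-level checkpoint/restore tool. It grew out of the OpenVZ project, whose biggest drawback is that it requires modifying the stock kernel. CRIU instead keeps as much of the program as possible in user space, leaving only the necessary system calls in the kernel.

OpenVZ development has stalled at kernel 2.6.32, and its main developers have shifted their focus to CRIU.

We will not go into the user-space CRIU program here; the focus is on the kernel side of CRIU, which has two parts: 1) a mechanism to dump specific information that the kernel holds about a process; 2) a way to hand that state back to the kernel for restore.

CRIU's goal is to allow the entire running state of an application to be dumped. That means dumping a great deal of information associated with the application, mainly [1]:

virtual memory map
open files
credentials
timers
PID
parent PID
shared resources

The ways to dump a particular application are:

Parasite code [2]: code that can be injected into a given process and, transparently to that process, inspects it, collects its file descriptors and dumps its memory contents. In practice the parasite code runs inside the target before the target's normal execution resumes; real examples are calls to getitimer() and sigaction().
Ptrace: can quickly freeze processes and inject the parasite code.
Netlink: obtains socket and netns information.
procfs: reads the entries of a given PID, /proc/PID/maps, /proc/PID/map_files/, /proc/PID/status and /proc/PID/mountinfo. Of these, /proc/PID/map_files exposes the files behind the process's memory mappings.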
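As an illustration of the procfs route in the list above, the sketch below pulls one numeric field out of /proc/PID/status. It is not taken from CRIU; proc_status_value() is a hypothetical helper, and it assumes /proc is mounted.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Return the numeric value of `key` (for example "PPid") from
 * /proc/<pid>/status, or -1 if the file or the key cannot be read. */
long proc_status_value(pid_t pid, const char *key)
{
    char path[64], line[256];
    size_t klen = strlen(key);
    long value = -1;
    FILE *fp;

    snprintf(path, sizeof path, "/proc/%ld/status", (long)pid);
    fp = fopen(path, "r");
    if (!fp)
        return -1;
    while (fgets(line, sizeof line, fp)) {
        /* Lines look like "PPid:<tab>1234". */
        if (strncmp(line, key, klen) == 0 && line[klen] == ':') {
            value = strtol(line + klen + 1, NULL, 10);
            break;
        }
    }
    fclose(fp);
    return value;
}
```

proc_status_value(getpid(), "PPid") returns the same number as getppid(), which is the kind of PID and parent PID information listed among the items to dump.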

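Parasite code was not designed specifically for CRIU; it relies on facilities added to the kernel. CRIU uses parasite code to issue system calls that only the application itself can make, such as getitimer().

Besides these special system calls, others can be issued by any program on the target's behalf, for example sched_getscheduler() to obtain the scheduling policy and sched_getparam() to obtain the process's scheduling parameters.

ptrace is a system call that lets one process control a target process, including access to the target's internal state; it is commonly used by debuggers and other code-analysis tools. Before kernel 3.4, ptrace relied heavily on signals to interact with the target, which meant interrupting its execution, much as gdb and similar tools do. With PTRACE_SEIZE, added in 3.4, attaching no longer stops the process.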
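The difference can be seen with a short experiment: seize a child without stopping it, then freeze it explicitly with PTRACE_INTERRUPT. This is a minimal sketch rather than CRIU's code; seize_and_freeze_child() is a hypothetical name, and the function reports -1 instead of failing hard because sandboxes and hardened systems may refuse ptrace.

```c
#include <assert.h>
#include <signal.h>
#include <stddef.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Returns 1 if the seized child was observed in a ptrace stop, 0 if it was
 * not, and -1 if fork or ptrace failed (for example, not permitted). */
int seize_and_freeze_child(void)
{
    int status = 0, frozen = -1;
    pid_t pid = fork();

    if (pid < 0)
        return -1;
    if (pid == 0) {
        for (;;)
            pause();                    /* the child just idles */
    }
    /* Attach without sending SIGSTOP: the child keeps running. */
    if (ptrace(PTRACE_SEIZE, pid, NULL, NULL) == 0) {
        /* Request a stop explicitly; no signal is delivered to the child. */
        if (ptrace(PTRACE_INTERRUPT, pid, NULL, NULL) == 0 &&
            waitpid(pid, &status, 0) == pid)
            frozen = WIFSTOPPED(status) ? 1 : 0;
        ptrace(PTRACE_DETACH, pid, NULL, NULL);
    }
    /* Clean up the helper process in every case. */
    kill(pid, SIGKILL);
    waitpid(pid, &status, 0);
    return frozen;
}
```

On a system that permits tracing one's own children the function is expected to return 1: the child runs freely after the seize and stops only when asked.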

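The introduction of these new ptrace features is what allows CRIU to checkpoint a given application.

Restoring an application:

Collect shared objects
Restore namespaces
Create the process tree, including SID and PGID, and restore inheritance
Files, sockets, pipes
Restore per-task properties
Restore memory
Call sigreturn

Kernel features it relies on:

Parasite code [2]
File descriptors and PIDs: if a program has opened a set of files of various kinds, the kernel keeps a file descriptor table recording which files the application has open, and on restore CRIU must reopen those files with the same fd numbers. When restoring an application with specific PIDs, a PID may already be taken; so that a restored process can keep its PID value, the kernel gained an interface to control the PID that the next fork will allocate, namely /proc/sys/kernel/ns_last_pid. For details see: http://lwn.net/Articles/525723/
The kernel also added the kcmp() system call, which compares two processes to determine whether they share a given kernel resource. This matters when a parent opens a set of shared resources and then calls fork(): the child inherits the parent's resources, and kcmp() is how that sharing is detected.
/proc/PID/map_files
prctl extensions to set anonymous, private objects, e.g. task/mm objects
Dumping socket information through netlink. For socket restore this provides more socket information than the /proc files; with it, CRIU uses getsockopt() and setsockopt() to restore socket connections.
TCP repair mode
Virtual net device indexes, to restore network devices inside a namespace
Socket peeking offset
Task memory tracking, used for incremental snapshots and live migration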
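The kcmp() item in the list above can be tried directly through syscall(). The sketch below is an illustration only: same_open_file() and kcmp_demo() are hypothetical helpers, dup() is used as a stand-in for the sharing that fork() inheritance creates, and -1 is returned where the kernel lacks kcmp() or a sandbox forbids it.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <linux/kcmp.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* 1 if fd_a and fd_b of this process refer to the same open file
 * description, 0 if they differ, -1 if kcmp() is unavailable or refused. */
int same_open_file(int fd_a, int fd_b)
{
    long self = (long)getpid();
    long r = syscall(SYS_kcmp, self, self, (long)KCMP_FILE,
                     (unsigned long)fd_a, (unsigned long)fd_b);

    if (r < 0)
        return -1;
    return r == 0;      /* kcmp() returns 0 when the two objects are equal */
}

/* dup() shares the description; a second open() of the same path does not.
 * Returns 1 when both observations hold, 0 if not, -1 if kcmp() failed. */
int kcmp_demo(void)
{
    int a = open("/dev/null", O_RDONLY);
    int b = dup(a);
    int c = open("/dev/null", O_RDONLY);
    int shared = same_open_file(a, b);
    int separate = same_open_file(a, c);

    close(a);
    close(b);
    close(c);
    if (shared < 0 || separate < 0)
        return -1;
    return shared == 1 && separate == 0;
}
```

A checkpoint tool needs exactly this distinction: two descriptors that share one description must be restored as shared, while two independent opens of the same path must not be.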

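Overall, CRIU and OpenVZ are somewhat alike. The biggest difference is that OpenVZ requires a modified kernel, which is very inconvenient, whereas CRIU works through system calls and interfaces added to the mainline kernel, so it needs no custom kernel patches and is very lightweight.

BLCR, too, is developed against specific kernel versions; it consists of two kernel modules plus user-space library tools. For a process to be restored with BLCR it must depend on the libcr library, or have libcr linked in at build time, which is clearly awkward for legacy code. The latest BLCR release dates from January 2013.

CRIU's latest release at the time of writing came out on 2015-03-02, which shows that CRIU development is very active.

CRIU.pdf

References: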

[1] http://lwn.net/Articles/525675/

[2] http://lwn.net/Articles/454304/

BLCR:

http://blog.csdn.net/myxmu/article/details/8948258

http://blog.csdn.net/myxmu/article/details/8948265

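Posted in Linux, Linux Containers

Tags: BLCR CRIU Linux Containers

Introduction to cgroup (2)

March 2nd, 2015

In my first article introducing cgroup, I made a first attempt at using cgroup to limit and isolate resources: http://www.lizhaozhong.info/archives/1211

But hierarchy-based cgroup has a drawback: it is inflexible. The tree may be arbitrarily deep, which makes management very cumbersome in practice.

For this reason, kernel 3.16 officially added the unified hierarchy feature. It is still under development, so enabling it explicitly requires: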
mount -t cgroup -o __DEVEL__sane_behavior cgroup $MOUNT_POINT

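As the name __DEVEL__sane_behavior itself suggests, the feature is still in development.

In the earlier cgroup hierarchy model, a hierarchy can be bound to a single subsystem, or to all 12 subsystems at once.

For example, hierarchy A is bound to cpuset and hierarchy B to memory. If a task needs both subsystems, it often ends up at orthogonal positions in the two hierarchies, which is very inconvenient.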

hierarchy may be collapsed from leaf towards root when viewed from specific
controllers.  For example, a given configuration might not care about
how memory is distributed beyond a certain level while still wanting
to control how CPU cycles are distributed.

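If we enable __DEVEL__sane_behavior, cgroup.controllers lists the available subsystems. In the unified hierarchy the system attaches all subsystems to the root; only leaf nodes may contain tasks, and non-leaf nodes only perform resource control.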

# mount -t cgroup -o __DEVEL__sane_behavior cgroup /sys/fs/cgroup
# cat /sys/fs/cgroup/cgroup.controllers
cpuset cpu cpuacct memory devices freezer net_cls blkio perf_event net_prio hugetlb

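Now we create parent and child under the root cgroup. The root's cgroup.subtree_control determines parent's cgroup.controllers.

And so on down the tree: each level's cgroup.subtree_control determines the cgroup.controllers of the level below, which means a subsystem is not passed on transitively!

In the example below, the root's cgroup.subtree_control enables the memory and cpu subsystems, so memory and cpu can be controlled in parent. child, for which no subsystem has been specified, does not control memory or cpu.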

# mkdir /sys/fs/cgroup/parent
# mkdir /sys/fs/cgroup/parent/child
# echo "+memory +cpu" > /sys/fs/cgroup/cgroup.subtree_control
# cat /sys/fs/cgroup/parent/cgroup.controllers
cpu memory

An example:

A(b,m) - B(b,m) - C (b)
              \ - D (b) - E

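Here b stands for blkio and m for memory, and A is the root. In this structure A, C and D all hold processes. C, for example, is limited for blkio but has no memory limit of its own; for memory it is accounted together with B. E is a special case: with no subsystem specified, its blkio is governed by D and its memory by B. The actual commands are the ones shown above for parent and child.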
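The rule "the nearest ancestor with the controller enabled governs the cgroup" can be written down as a tiny model of the tree above. This is a user-space sketch with invented names (toy_cgroup, governing()), not kernel code.

```c
#include <assert.h>
#include <stddef.h>

enum { CTRL_BLKIO = 1 << 0, CTRL_MEMORY = 1 << 1 };

struct toy_cgroup {
    const char *name;
    const struct toy_cgroup *parent;
    unsigned int enabled;       /* controllers applied at this level */
};

/* The example tree: A(b,m) - B(b,m) - C(b), and B - D(b) - E. */
const struct toy_cgroup cg_a = { "A", NULL,  CTRL_BLKIO | CTRL_MEMORY };
const struct toy_cgroup cg_b = { "B", &cg_a, CTRL_BLKIO | CTRL_MEMORY };
const struct toy_cgroup cg_c = { "C", &cg_b, CTRL_BLKIO };
const struct toy_cgroup cg_d = { "D", &cg_b, CTRL_BLKIO };
const struct toy_cgroup cg_e = { "E", &cg_d, 0 };

/* Walk towards the root until a level with `ctrl` enabled is found. */
const struct toy_cgroup *governing(const struct toy_cgroup *cg,
                                   unsigned int ctrl)
{
    while (cg && !(cg->enabled & ctrl))
        cg = cg->parent;
    return cg;
}
```

governing(&cg_e, CTRL_BLKIO) yields D and governing(&cg_e, CTRL_MEMORY) yields B, matching the description of E; for C the memory lookup also ends at B.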

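A cgroup's cgroup.subtree_control file can be used to change controller settings only while that cgroup itself contains no processes.

Intermediate levels must have a subsystem enabled before their children can use it. In this layout, an attempt to restrict E with blkio directly is rejected by the system!

Unified hierarchy implements an interface file "cgroup.populated" which can be used to monitor whether the cgroup's subhierarchy has tasks in it or not. Its value is 0 if there is no task in the cgroup and its descendants; otherwise, 1. poll and [id]notify events are triggered when the value changes.

The other changes in the unified hierarchy are described clearly in the documentation, so I will not repeat them here.

They include the removal of tasks and cgroup.clone_children and changes to cgroup.procs. Once the design of this hierarchy settles, the old cgroup mechanism will be replaced by the unified hierarchy.

References: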

http://lwn.net/Articles/601840/

http://events.linuxfoundation.org/sites/events/files/slides/2014-KLF.pdf

https://www.kernel.org/doc/Documentation/cgroups/unified-hierarchy.txt

http://d.hatena.ne.jp/defiant/mobile?date=20140826

Posted in Linux, Linux Containers

Tags: Linux Containers

Introduction to cgroup (1)

January 14th, 2015

cgroups (abbreviated from control groups) is a Linux kernel feature that limits, accounts for and isolates the resource usage (CPU, memory, disk I/O, network, etc.) of a collection of processes.

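cgroup is a mechanism that entered the kernel with 2.6.24 in 2007. Over the years it has grown into a backbone of Docker and LXC.

Many Chinese-language resources already explain the definition of cgroup in great detail, so I will not repeat it here.

It is worth noting that cgroup's original control interface to user space was a filesystem derived from the early sysfs implementation and built directly on the VFS, since many hierarchies can exist.

cgroup abused the VFS to some degree and accumulated many problems as a result; for instance, rename-related deadlocks could occur in the VFS.
After version 3.14, that is after 2013, the kernel made sweeping changes to cgroup. Many data structures changed substantially, and a unified hierarchy was proposed.

Drawbacks [1][2]:

1) Although the tree structure brings some flexibility, it causes problems in practice. For example, each subsystem can be attached to only one hierarchy, so a subsystem such as freezer exists only once and has to be moved around to control other processes.
2) All subsystems are bound to hierarchies, and once a hierarchy is tied to specific PIDs the control granularity is hard to get right.
3) Because the structure is a tree, its depth is unbounded. With finite resources, a child hierarchy may hand out more than the finite amount available, and management becomes more complex.
4) Support for multiple hierarchies limits how cgroup can be used: we can have 12 hierarchies each bound to one of the 12 subsystems, or a single hierarchy bound to all 12.

The unified hierarchy developed in kernel 3.16 aims to address the drawbacks above by using a different structure, while keeping enough flexibility for most use cases.

That feature is not discussed in this post.

One point in brief: the new unified hierarchy does not allow remount/rename.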

——————-

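With the overhaul after 3.14, cgroup stopped using its sysfs-derived code and moved to the newly created kernfs (http://en.wikipedia.org/wiki/Kernfs_%28Linux%29).

Put simply, kernfs extracts the core of sysfs into a new filesystem layer that can be reused. So far, apart from sysfs itself, cgroup is its main user.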

https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=2bd59d48ebfb3df41ee56938946ca0dd30887312

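That commit describes the post-3.14 changes fairly clearly. Because kernfs is used, the direct VFS handling is removed, meaning the super_block, dentry, inode and bdi code goes away; cgroup_mount() mounts the root directly, and the operation structures formerly registered with the VFS are converted to kernfs_ops and kernfs_syscall_ops.

A brief word on the individual subsystems:

blkio: sets input/output limits on block devices, such as physical devices (disks, SSDs, USB and so on).
cpu: uses the scheduler to provide cgroup tasks with access to the CPU. Put simply, it is about time-slice size!
cpuacct: automatically generates reports on the CPU used by the tasks in a cgroup.
cpuset: assigns individual CPUs (on multi-core systems) and memory nodes to the tasks in a cgroup.
devices: allows or denies access to devices by the tasks in a cgroup.
freezer: suspends or resumes the tasks in a cgroup.
memory: sets limits on the memory used by the tasks in a cgroup and automatically generates reports on the memory resources those tasks use.
net_cls: tags network packets with a class identifier (classid), which lets the Linux traffic controller (tc) identify packets originating from a particular cgroup.
ns: the namespace subsystem.

For example, to create a cgroup with a limited number of CPU cores and a limited amount of memory:

1) Create a directory with mkdir.

2) Mount the new filesystem. If we mount under /sys/fs/cgroup we first need to mount a tmpfs there, because sysfs does not allow creating directories; on any other filesystem that allows creating directories, no tmpfs is needed.

3) $ mount -t cgroup -o cpuset,memory cgroup_cpu_mem /mnt/cgroup_cpu_mem

4) The filesystem is now mounted as the root cgroup.

5) List its contents: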

[lzz@localhost cgroup_cpu_mem]$ ls
cgroup.clone_children   cpuset.memory_pressure_enabled   memory.kmem.slabinfo                memory.pressure_level
cgroup.event_control    cpuset.memory_spread_page        memory.kmem.tcp.failcnt             memory.soft_limit_in_bytes
cgroup.procs            cpuset.memory_spread_slab        memory.kmem.tcp.limit_in_bytes      memory.stat
cgroup.sane_behavior    cpuset.mems                      memory.kmem.tcp.max_usage_in_bytes  memory.swappiness
cgroup_test             cpuset.sched_load_balance        memory.kmem.tcp.usage_in_bytes      memory.usage_in_bytes
cpuset.cpu_exclusive    cpuset.sched_relax_domain_level  memory.kmem.usage_in_bytes          memory.use_hierarchy
cpuset.cpus             memory.failcnt                   memory.limit_in_bytes               notify_on_release
cpuset.mem_exclusive    memory.force_empty               memory.max_usage_in_bytes           release_agent
cpuset.mem_hardwall     memory.kmem.failcnt              memory.move_charge_at_immigrate     tasks
cpuset.memory_migrate   memory.kmem.limit_in_bytes       memory.numa_stat
cpuset.memory_pressure  memory.kmem.max_usage_in_bytes   memory.oom_control

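In this root cgroup we find the settings of both subsystems, prefixed cpuset and memory respectively. What we built above is a hierarchy: a collection of cpuset and memory resources.

6) Under this hierarchy, creating a directory creates a new cgroup; enter cgroup_test.

7) At this point the settings in cgroup_test are all unset (empty or 0).

8) For the meaning of each field, see the documentation at https://www.kernel.org/doc/Documentation/cgroups/

9) Let us set some simple limits: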

$ echo 16M > memory.limit_in_bytes
$ echo 1 > cpuset.cpus
$ echo 0 > cpuset.mems

10) Processes can now be added to this cgroup. We can start by adding a shell with echo $$ > tasks, after which everything launched from that shell is added to the cgroup as well!
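Steps 9 and 10 can also be driven from a program, since the cgroup interface is nothing more than files. The following is a minimal sketch, not a library API: cgroup_write(), cgroup_read() and cgroup_confine() are hypothetical helpers, and the directory argument is assumed to be a cgroup such as /mnt/cgroup_cpu_mem/cgroup_test, which needs root to write.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Write `value` to <dir>/<file>, as `echo value > file` would. 0 on success. */
int cgroup_write(const char *dir, const char *file, const char *value)
{
    char path[512];
    int n = snprintf(path, sizeof path, "%s/%s", dir, file);
    FILE *fp;
    int ok;

    if (n < 0 || n >= (int)sizeof path)
        return -1;
    fp = fopen(path, "w");
    if (!fp)
        return -1;
    ok = fprintf(fp, "%s\n", value) > 0;
    /* A value rejected by the kernel shows up when the write is flushed. */
    if (fclose(fp) != 0)
        ok = 0;
    return ok ? 0 : -1;
}

/* Read the first line of <dir>/<file> into buf, without the newline. */
int cgroup_read(const char *dir, const char *file, char *buf, size_t len)
{
    char path[512];
    int n = snprintf(path, sizeof path, "%s/%s", dir, file);
    FILE *fp;
    char *line;

    if (n < 0 || n >= (int)sizeof path)
        return -1;
    fp = fopen(path, "r");
    if (!fp)
        return -1;
    line = fgets(buf, (int)len, fp);
    fclose(fp);
    if (!line)
        return -1;
    buf[strcspn(buf, "\n")] = '\0';
    return 0;
}

/* Limit memory, pin the CPU and memory node, then add a PID to the cgroup. */
int cgroup_confine(const char *dir, const char *mem_limit, const char *cpus,
                   const char *mems, const char *pid)
{
    if (cgroup_write(dir, "memory.limit_in_bytes", mem_limit) ||
        cgroup_write(dir, "cpuset.cpus", cpus) ||
        cgroup_write(dir, "cpuset.mems", mems))
        return -1;
    return cgroup_write(dir, "tasks", pid);
}
```

cpuset.cpus and cpuset.mems are written before tasks because a cpuset cgroup with no CPUs or memory nodes assigned will not accept processes. On a real cgroup mount the kernel may report the memory limit back in bytes rather than as the string that was written.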

[1] http://lwn.net/Articles/606699/
[2] https://www.kernel.org/doc/Documentation/cgroups/unified-hierarchy.txt

Posted in Linux, Linux Containers

Tags: Linux Containers
