Linux Containers » JasonLe's TechBlog

Archive for the ‘Linux Containers’ category

First Experience with Docker

February 21st, 2017

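Docker has been around for several years now, and the industry has moved from heated discussion to gradual adoption in production. I spent a year working on LXC during my master's degree. Docker is largely old wine in a new bottle, but its hub mechanism makes images easy to distribute, which gives it an edge over LXC in ease of use.

I took some time today to try Docker, with the goal of containerizing my VPS so that it is easier to migrate between hosts later.

Docker consists of three parts: a client that runs docker commands, a host that stores images and runs them as containers, and a registry of images. The client talks to the docker daemon on the docker host. The client and the host can of course run on the same machine (as they did in my experiments), and the default registry is Docker Hub.

The typical workflow is that the client uses pull to fetch an image from the hub onto the docker host, then uses run to have the host create a container that runs the image.

Alternatively, the client uses build to create its own image on the host and then push to upload the image to the registry, where other people can use it.

Images are easy to get confused about at first. As I understand it, an image is a filesystem image that the user cannot modify directly, somewhat like a binary file (it is actually a special kind of filesystem). Besides the programs, libraries, resources, and configuration files needed at runtime, it also contains runtime configuration parameters (such as anonymous volumes, environment variables, and users).

docker run starts the image, and a running image is a container. A running container can be modified freely, and each modification that is committed gets its own commit ID. Each build depends on the previous one, so the model is a lot like git version control!

The relationship between an image and a container is like the relationship between a program and a process!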

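Docker command examples

1. The format of docker pull is:

docker pull [OPTIONS] [Docker Registry address/]<repository>:<tag>

2. docker images lists the images that have been downloaded:

$ docker images
REPOSITORY TAG IMAGE ID CREATED SIZE
ubuntu ml 14.04 b969ab9f929b 1 weeks ago 210 MB

3. The most important command is docker run:

$ docker run -p 8080:80 -p 111:22 -i -t --name [docker-name] -d IMAGE COMMAND

Here -p maps a host port to a container port, and -d runs the container in daemon (detached) mode. A concrete example: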

$ sudo docker run -i -t ubuntu:14.04 /bin/bash

docker run – run a container
-t – allocate a (pseudo) tty
-i – interactive mode (so we can interact with it)
ubuntu:14.04 – use the ubuntu 14.04 base image
/bin/bash – run the bash shell

4. docker start restarts a container that already exists. Once a container has been created with docker run, there is no need to type the long command again; docker start brings the same container back up.

$ docker start [docker-name]

5. docker ps lists all running containers:

$ docker ps
CONTAINER ID       IMAGE               COMMAND             CREATED             STATUS              PORTS               NAMES
e3a913872698       ubuntu:14.04        "bash"              11 seconds ago      Up 10 seconds                           wizardly_elion
db1c25753e97       ubuntu:14.04        "bash"              21 seconds ago      Up 21 seconds                           adoring_shannon

These are the commands used most often.

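A Dockerfile is the file an image is built from; it works much like a script.

# ubuntu 14.04 with vim and gcc
FROM ubuntu:14.04
MAINTAINER lzz<[email protected]>
RUN apt-get update && apt-get install -y vim gcc

FROM specifies the base image that the build starts from. If the image is not available locally, it is pulled automatically from Docker's public registry.
FROM must be the first non-comment instruction in a Dockerfile, that is, a Dockerfile starts with a FROM statement. FROM can appear more than once in a Dockerfile if several images need to be created from it. If FROM does not specify a tag, the latest tag is used by default.
MAINTAINER <name> specifies the author of the image.

RUN has two forms:
RUN <command> (shell form)
RUN ["executable", "param1", "param2"] (exec form)
Each RUN instruction executes the given command on top of the current image and commits the result as a new image; subsequent RUN instructions build on the image committed by the previous one. Images are layered, and a new image can be created from any point in an image's commit history, much like source version control.

The exec form is parsed as a JSON array, so it must use double quotes rather than single quotes. The exec form does not invoke a command shell, so shell variable expansion does not happen. For example:

RUN [ "echo", "$HOME" ]
does not print the value of HOME. The correct way is:

RUN [ "sh", "-c", "echo $HOME" ]
The cache produced by RUN is not invalidated on the next build and will be reused. To build without the cache, pass the --no-cache option, i.e. docker build --no-cache.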

http://www.cnblogs.com/hustcat/p/3980244.html

http://blog.csdn.net/u011537073/article/details/52719363

Posted in Linux, Linux Containers

Tags: Docker

Building a Docker-like Environment by Hand

March 10th, 2016

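Docker has been very popular lately, but if we look closely at how it works, it is really just cgroups plus namespaces providing resource limits and isolation. Here we build a namespace-based container by hand to get resource isolation.

We have already discussed what namespaces consist of; now we create an environment for each kind of namespace by hand. Namespaces are created mainly by calling clone() with a specific flag. clone() returns the pid of the new process, so a container is really just a special kind of process!

Namespace	Flag	Kernel version
Mount namespaces	CLONE_NEWNS	Linux 2.4.19
UTS namespaces	CLONE_NEWUTS	Linux 2.6.19
IPC namespaces	CLONE_NEWIPC	Linux 2.6.19
PID namespaces	CLONE_NEWPID	Linux 2.6.24
Network namespaces	CLONE_NEWNET	started in Linux 2.6.24, completed in Linux 2.6.29
User namespaces	CLONE_NEWUSER	started in Linux 2.6.23, completed in Linux 3.8

The table shows that the namespaces were completed at different times, but since the current kernel version is already 4.0, we can treat the namespace work as essentially complete. First we define a template: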

#define _GNU_SOURCE
#include <sys/types.h>
#include <sys/wait.h>
#include <stdio.h>
#include <sched.h>
#include <signal.h>
#include <unistd.h>

#define STACK_SIZE (1024 * 1024)

static char child_stack[STACK_SIZE];
char* const child_args[] = {
  "/bin/bash",
  NULL
};

int child_main(void* arg)
{
  printf(" - World !\n");
  execv(child_args[0], child_args);
  printf("Ooops\n");
  return 1;
}

int main()
{
  printf(" - Hello ?\n");
  int child_pid = clone(child_main, child_stack+STACK_SIZE, SIGCHLD, NULL);
  waitpid(child_pid, NULL, 0);
  return 0;
}

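Note that the second argument to clone() is child_stack+STACK_SIZE. The stack grows from high addresses toward low addresses, so we pass the address just past the end of the buffer, which is where the child's stack starts.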
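The pointer arithmetic is small enough to isolate in a helper. This is only a sketch: the name stack_top is made up for illustration and does not appear in the program above, and it assumes an architecture where the stack grows downward (x86, ARM and most others).

```c
#include <stddef.h>

/* Hypothetical helper: given the lowest address of a stack buffer and its
 * size, return the address to hand to clone() on an architecture where the
 * stack grows downward, i.e. the address one byte past the end of the buffer. */
char *stack_top(char *base, size_t size)
{
    return base + size;
}
```

With this helper the call in the template would read clone(child_main, stack_top(child_stack, STACK_SIZE), SIGCHLD, NULL).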

1.UTS namespace[1]

A UTS namespace lets the container have its own hostname. The program mainly combines clone() with sethostname(), which is very similar to forking a process.

// (needs root privileges (or appropriate capabilities))
//[...]
int child_main(void* arg)
{
  printf(" - World !\n");
  sethostname("In Namespace", 12);
  execv(child_args[0], child_args);
  printf("Ooops\n");
  return 1;
}

int main()
{
  printf(" - Hello ?\n");
  int child_pid = clone(child_main, child_stack+STACK_SIZE,
      CLONE_NEWUTS | SIGCHLD, NULL);
  waitpid(child_pid, NULL, 0);
  return 0;
}

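This is similar to fork(). When fork() creates a child process, the child's contents are identical to the parent's; to run something else, the child calls a function from the exec family to overwrite its own segments, including the code and data segments. Here, note the CLONE_NEWUTS flag passed to clone(), and the call to execv(child_args[0], child_args) in the child function.

Note: child_args is hard-coded; taking extra arguments from the command line is omitted.

Output: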

lzz@localhost:~/container$ gcc -Wall main.c -o ns && sudo ./ns
 - Hello ?
 - World !
root@In Namespace:~/container$ # inside the container
root@In Namespace:~/container$ exit
lzz@localhost:~/container$ # outside the container

This example only isolates the hostname. None of the other subsystems are isolated yet, so global information is still visible under /proc.
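For comparison, the plain fork() plus execv() pattern that the clone() call was likened to can be sketched without any namespace flags, so it needs no root privileges. The function name run_and_wait is my own, chosen for this illustration.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Run the program argv[0] in a child process and return its exit status.
 * Returns 127 if execv() failed in the child, and -1 if fork() or
 * waitpid() failed or the child was killed by a signal. */
int run_and_wait(char *const argv[])
{
    pid_t pid = fork();          /* the child starts as a copy of the parent */
    if (pid < 0)
        return -1;
    if (pid == 0) {
        execv(argv[0], argv);    /* replaces the child's code and data segments */
        _exit(127);              /* only reached if execv() failed */
    }

    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

The container examples follow the same shape; the only differences are that clone() takes the child function and its stack explicitly and accepts the CLONE_NEW* flags.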

2.IPC namespace[2]

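Here we use a pipe for synchronization. After the child is created, checkpoint[0] is the read end of the pipe and checkpoint[1] is the write end. While the pipe holds no data, read() blocks by default until something is written or every write end is closed, and that gives us the synchronization.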

...
int child_main(void* arg)
{
  char c;

  // init sync primitive
  close(checkpoint[1]);
  // wait...
  read(checkpoint[0], &c, 1);

  printf(" - World !\n");
  sethostname("In Namespace", 12);
  execv(child_args[0], child_args);
  printf("Ooops\n");
  return 1;
}

int main()
{
  // init sync primitive
  pipe(checkpoint);

  printf(" - Hello ?\n");

  int child_pid = clone(child_main, child_stack+STACK_SIZE,
      CLONE_NEWUTS | CLONE_NEWIPC | SIGCHLD, NULL);

  // some damn long init job
  sleep(4);
  // signal "done"
  close(checkpoint[1]);

  waitpid(child_pid, NULL, 0);
  return 0;
}

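Calling close(checkpoint[1]) in the parent signals that the parent's setup is finished: once the last write end is closed, the child's read() returns end-of-file and the child can continue.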
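The same close-to-signal trick can be tried in isolation with an ordinary fork(), which needs no privileges. This is a minimal sketch; the function name sync_on_pipe_close is hypothetical and the child does nothing except report what read() returned.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fork a child that blocks on the read end of a pipe until the parent
 * closes the write end. Returns 0 if the child's read() saw end-of-file,
 * 1 if it saw anything else, and -1 on error. */
int sync_on_pipe_close(void)
{
    int checkpoint[2];
    if (pipe(checkpoint) < 0)
        return -1;

    pid_t pid = fork();
    if (pid < 0) {
        close(checkpoint[0]);
        close(checkpoint[1]);
        return -1;
    }
    if (pid == 0) {
        char c;
        close(checkpoint[1]);                    /* keep only the read end */
        ssize_t n = read(checkpoint[0], &c, 1);  /* blocks until EOF */
        _exit(n == 0 ? 0 : 1);
    }

    close(checkpoint[0]);
    /* parent-side initialisation would go here */
    close(checkpoint[1]);                        /* last write end closed: child wakes up */

    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

The child must close its own copy of the write end before reading; otherwise the pipe would never reach end-of-file and the read() would block forever.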

3. PID namespace[3]

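A PID namespace isolates the PIDs inside the container from those outside it, so both sides can have a process with PID 1. The container's PID 1 is of course mapped to some other PID when seen from outside the container.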

...
int child_main(void* arg)
{
  char c;

  // init sync primitive
  close(checkpoint[1]);
  // wait...
  read(checkpoint[0], &c, 1);

  printf(" - [%5d] World !\n", getpid());
  sethostname("In Namespace", 12);
  execv(child_args[0], child_args);
  printf("Ooops\n");
  return 1;
}

int main()
{
  // init sync primitive
  pipe(checkpoint);

  printf(" - [%5d] Hello ?\n", getpid());

  int child_pid = clone(child_main, child_stack+STACK_SIZE,
      CLONE_NEWUTS | CLONE_NEWIPC | CLONE_NEWPID | SIGCHLD, NULL);

  // further init here (nothing yet)

  // signal "done"
  close(checkpoint[1]);

  waitpid(child_pid, NULL, 0);
  return 0;
}

Here CLONE_NEWPID has been added to the clone flags, and child_main now calls getpid(), so we can see the PID inside the container. Output:

lzz@localhost:~/container$ gcc -Wall main-3-pid.c -o ns && sudo ./ns
 - [ 7823] Hello ?
 - [    1] World !
root@In Namespace:~/blog# echo "=> My PID: $$"
=> My PID: 1
root@In Namespace:~/blog# exit

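Note that we have not mounted the proc filesystem inside the container. There is a catch here: if we do mount proc inside the container, top and ps -aux outside the container start reporting errors, and proc has to be remounted on the host's /proc, because the filesystem is not isolated yet.

4.CLONE_NEWNS[4]

This clone flag ensures that mount operations inside the container do not affect the parent, which solves the problem above of a proc mount breaking the parent's environment.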

...
// sync primitive
int checkpoint[2];
....
int child_main(void* arg)
{
  char c;

  // init sync primitive
  close(checkpoint[1]);

  // setup hostname
  printf(" - [%5d] World !\n", getpid());
  sethostname("In Namespace", 12);

  // remount "/proc" to get accurate "top" && "ps" output
  mount("proc", "/proc", "proc", 0, NULL);

  // wait...
  read(checkpoint[0], &c, 1);

  execv(child_args[0], child_args);
  printf("Ooops\n");
  return 1;
}

int main()
{
  // init sync primitive
  pipe(checkpoint);

  printf(" - [%5d] Hello ?\n", getpid());

  int child_pid = clone(child_main, child_stack+STACK_SIZE,
      CLONE_NEWUTS | CLONE_NEWIPC | CLONE_NEWPID | CLONE_NEWNS | SIGCHLD, NULL);

  // further init here (nothing yet)

  // signal "done"
  close(checkpoint[1]);

  waitpid(child_pid, NULL, 0);
  return 0;
}

Now we run the program:

lzz@localhost:~/container$ gcc -Wall ns.c -o ns && sudo ./ns
 - [14472] Hello ?
 - [    1] World !
root@In Namespace:~/blog# mount -t proc proc /proc
root@In Namespace:~/blog# ps aux
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root         1  1.0  0.0  23620  4680 pts/4    S    00:07   0:00 /bin/bash
root        79  0.0  0.0  18492  1328 pts/4    R+   00:07   0:00 ps aux
root@In Namespace:~/blog# exit

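We can see that ps inside the container now reads the container's own proc filesystem!

This brings us to Docker itself. When we pull an image from Docker Hub, the image necessarily contains a root filesystem (rootfs) with the configuration files and basic shared libraries, plus mount points for special filesystems such as /dev and /sys. For our hand-built container we have to trim one down ourselves.
The main directories are:

bin etc lib lib64 mnt proc run sbin sys tmp usr/bin

We need to populate these directories to match our architecture. Finally, we use mount to attach the container's filesystems to these directories. Once everything is mounted, chdir and chroot isolate the directory tree: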

    if ( chdir("./rootfs") != 0 || chroot("./") != 0 ){
        perror("chdir/chroot");
    }

5.User namespace[5]

This namespace manages user UIDs and GIDs. It works by reading and writing two files, /proc/<pid>/uid_map and /proc/<pid>/gid_map. Each line in these files has the format: ID-inside-ns ID-outside-ns length.
Core functions:

void set_map(char* file, int inside_id, int outside_id, int len) {
    FILE* mapfd = fopen(file, "w");
    if (NULL == mapfd) {
        perror("open file error");
        return;
    }
    fprintf(mapfd, "%d %d %d", inside_id, outside_id, len);
    fclose(mapfd);
}

void set_uid_map(pid_t pid, int inside_id, int outside_id, int len) {
    char file[256];
    sprintf(file, "/proc/%d/uid_map", pid);
    set_map(file, inside_id, outside_id, len);
}

void set_gid_map(pid_t pid, int inside_id, int outside_id, int len) {
    char file[256];
    sprintf(file, "/proc/%d/gid_map", pid);
    set_map(file, inside_id, outside_id, len);
}
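To make the line format concrete, here is a small sketch that only builds the mapping string and does not touch /proc. The helper name format_id_map is hypothetical, and the values in the usage note (UID 0 inside mapped to UID 1000 outside, range length 1) are assumed purely for illustration.

```c
#include <stdio.h>

/* Format one mapping line "ID-inside-ns ID-outside-ns length" into buf.
 * Returns the number of characters written (excluding the terminating
 * NUL), or -1 if buf is too small. */
int format_id_map(char *buf, size_t size, int inside_id, int outside_id, int len)
{
    int n = snprintf(buf, size, "%d %d %d", inside_id, outside_id, len);
    return (n < 0 || (size_t)n >= size) ? -1 : n;
}
```

Calling format_id_map(line, sizeof line, 0, 1000, 1) produces the string "0 1000 1", which says that UID 0 inside the namespace corresponds to UID 1000 outside, for a range of one ID.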

6.Network namespace[6]

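This namespace gives the container its own network stack, so that one physical NIC can be presented as several virtual NICs. The main commands are: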

# Create a "demo" namespace
ip netns add demo

# create a "veth" pair
ip link add veth0 type veth peer name veth1

# and move one to the namespace
ip link set veth1 netns demo

# configure the interfaces (up + IP)
ip netns exec demo ip link set lo up
ip netns exec demo ip link set veth1 up
ip netns exec demo ip addr add xxx.xxx.xxx.xxx/30 dev veth1
ip link set veth0 up
ip addr add xxx.xxx.xxx.xxx/30 dev veth0

Applied in code, this becomes:

...
int child_main(void* arg)
{
  char c;

  // init sync primitive
  close(checkpoint[1]);

  // setup hostname
  printf(" - [%5d] World !\n", getpid());
  sethostname("In Namespace", 12);

  // remount "/proc" to get accurate "top" && "ps" output
  mount("proc", "/proc", "proc", 0, NULL);

  // wait for network setup in parent
  read(checkpoint[0], &c, 1);

  // setup network
  system("ip link set lo up");
  system("ip link set veth1 up");
  system("ip addr add 169.254.1.2/30 dev veth1");

  execv(child_args[0], child_args);
  printf("Ooops\n");
  return 1;
}

int main()
{
  // init sync primitive
  pipe(checkpoint);

  printf(" - [%5d] Hello ?\n", getpid());

  int child_pid = clone(child_main, child_stack+STACK_SIZE,
      CLONE_NEWUTS | CLONE_NEWIPC | CLONE_NEWPID | CLONE_NEWNS | CLONE_NEWNET | SIGCHLD, NULL);

  // further init: create a veth pair
  char* cmd;
  asprintf(&cmd, "ip link set veth1 netns %d", child_pid);
  system("ip link add veth0 type veth peer name veth1");
  system(cmd);
  system("ip link set veth0 up");
  system("ip addr add 169.254.1.1/30 dev veth0");
  free(cmd);

  // signal "done"
  close(checkpoint[1]);

  waitpid(child_pid, NULL, 0);
  return 0;
}

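As for networking, if processes inside the container need to talk to processes outside it, a bridge has to be set up; see the relevant documentation for details.

Final

Let's look at the namespaces of the parent and the child:
Parent: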

lzz@localhost:~$ sudo ls -l /proc/4599/ns
total 0
lrwxrwxrwx 1 root root 0  4月  7 22:01 ipc -> ipc:[4026531839]
lrwxrwxrwx 1 root root 0  4月  7 22:01 mnt -> mnt:[4026531840]
lrwxrwxrwx 1 root root 0  4月  7 22:01 net -> net:[4026531956]
lrwxrwxrwx 1 root root 0  4月  7 22:01 pid -> pid:[4026531836]
lrwxrwxrwx 1 root root 0  4月  7 22:01 user -> user:[4026531837]
lrwxrwxrwx 1 root root 0  4月  7 22:01 uts -> uts:[4026531838]

Child:

lzz@localhost:~$ sudo ls -l /proc/4600/ns
total 0
lrwxrwxrwx 1 root root 0  4月  7 22:01 ipc -> ipc:[4026531839]
lrwxrwxrwx 1 root root 0  4月  7 22:01 mnt -> mnt:[4026532520]
lrwxrwxrwx 1 root root 0  4月  7 22:01 net -> net:[4026531956]
lrwxrwxrwx 1 root root 0  4月  7 22:01 pid -> pid:[4026532522]
lrwxrwxrwx 1 root root 0  4月  7 22:01 user -> user:[4026531837]
lrwxrwxrwx 1 root root 0  4月  7 22:01 uts -> uts:[4026532521]

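We can see that ipc, net, and user have the same IDs, while mnt, pid, and uts differ. If two processes point to the same namespace number, they are in the same namespace; otherwise they are in different namespaces.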
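The same comparison can be done from a program instead of reading ls output. The number in brackets is the inode number of the namespace object, so it is enough to stat() both /proc entries and compare device and inode. This is a sketch only; same_namespace is a name I made up, and inspecting another user's process may require root.

```c
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Return 1 if processes a and b are in the same namespace of the given
 * type ("uts", "pid", "mnt", "net", "ipc", "user"), 0 if they are not,
 * and -1 if either /proc entry cannot be examined. */
int same_namespace(pid_t a, pid_t b, const char *type)
{
    char path_a[64], path_b[64];
    struct stat sa, sb;

    snprintf(path_a, sizeof path_a, "/proc/%d/ns/%s", (int)a, type);
    snprintf(path_b, sizeof path_b, "/proc/%d/ns/%s", (int)b, type);

    /* stat() follows the symlink to the namespace object itself */
    if (stat(path_a, &sa) != 0 || stat(path_b, &sb) != 0)
        return -1;
    return sa.st_dev == sb.st_dev && sa.st_ino == sb.st_ino;
}
```

For the two processes listed above, same_namespace(4599, 4600, "uts") would be expected to return 0 and same_namespace(4599, 4600, "net") to return 1.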

References:

[1] http://lwn.net/Articles/531245/

[2] http://coolshell.cn/articles/17010.html

[3] http://lwn.net/Articles/532741/

[4] http://coolshell.cn/articles/17029.html

[5] http://lwn.net/Articles/539941/

[6] http://lwn.net/Articles/580893/

[7]

https://blog.jtlebi.fr/2014/01/12/introduction-to-linux-namespaces-part-4-ns-fs/
http://www.cnblogs.com/nufangrensheng/p/3579378.html

Posted in Linux, Linux Containers

Tags: namespace

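Configuring Networking for LXC

October 16th, 2015

LXC is a lightweight virtual machine built on the cgroup and namespace mechanisms. On Ubuntu there is a dedicated package source and it can be installed directly with apt-get, but on Debian the lxc version in the repository is too old and lacks many new features, so I recommend building from source. At the time of writing, lxc has reached version 1.1.4.

First we build and install the latest LXC. Following the INSTALL instructions, we run autogen.sh and ./configure to generate the Makefile. All of LXC's security features must be enabled here, otherwise containers cannot be started with lxc-start.

There are two ways to configure networking for a container: 1) use a bridge 2) use a physical NIC directly

1) Using a bridge

Suppose the host has only the physical NIC eth0. In the host's /etc/network/interfaces, add the following stanza: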

auto br0
iface br0 inet dhcp
        bridge_ports eth0
        bridge_fd 0
        bridge_maxwait 0

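Then restart networking with /etc/init.d/networking restart; a bridge named br0 now appears among the host's network interfaces.

If no paths were configured when LXC was built, a container's config file defaults to /usr/local/var/lib/lxc/xxx/config. We need to add the network options to this file: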

lxc.network.type = veth
lxc.network.flags = up

# that's the interface defined above in host's interfaces file
lxc.network.link = br0

# name of network device inside the container,
# defaults to eth0, you could choose a name freely
# lxc.network.name = lxcnet0 

lxc.network.hwaddr = 00:FF:AA:00:00:01

Then, in the container's /etc/network/interfaces, add

auto eth0
iface eth0 inet dhcp

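If the dhclient service is not running in the container, it is best to start it from /etc/rc.local.

2) Using a physical NIC directly

Suppose the physical host has two NICs, eth0 and eth1. I keep eth0 for the host and give eth1 to LXC. In the config file we add

lxc.network.type=phys
lxc.network.link=eth1
lxc.network.flags=up
#lxc.network.hwaddr = 00:16:3e:f9:ad:be # commented out

lxc.network.flags sets the state of the network; up means the interface is brought up.
lxc.network.link names the real interface used to communicate with the container's interface, for example a bridge such as br0, or eth0.

In the host's /etc/network/interfaces add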

auto eth1
iface eth1 inet dhcp

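Then restart the network service: # /etc/init.d/networking restart
Restart the LXC container: # lxc-start -n xxx

Once the LXC virtual machine has started, run "ifconfig -a" on the host to look at its network interfaces. You will find that eth1 has disappeared and only eth0 is left, because eth1 is now in use by the LXC virtual machine. Log in to it with "lxc-attach -n xxx" and you will see that its network interface is eth1. We can then use ping to check whether the LXC virtual machine can reach the Internet.

3) Static IP for the container

With a static IP in the container, the host can use either a static IP or DHCP. We assume the host uses DHCP and the container a static IP. Note the last two settings: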

lxc.network.type = veth
lxc.network.flags = up

# that's the interface defined above in host's interfaces file
lxc.network.link = br0

# name of network device inside the container,
# defaults to eth0, you could choose a name freely
# lxc.network.name = lxcnet0 

lxc.network.hwaddr = 00:FF:AA:00:00:01
# the next two values must be in the host's subnet
lxc.network.ipv4 = 192.168.1.110/24
lxc.network.ipv4.gateway = 192.168.1.1

In the container's /etc/network/interfaces add the following. Remember not to add auto eth0!

iface eth0 inet static
       address <container IP here, e.g. 192.168.1.110>
       netmask 255.255.255.0
       network <network IP here, e.g. 192.168.1.0>
       broadcast <broadcast IP here, e.g. 192.168.1.255>
       gateway <gateway IP address here, e.g. 192.168.1.1>
       # dns-* options are implemented by the resolvconf package, if installed
       dns-nameservers <name server IP address here, e.g. 192.168.1.1>
       dns-search your.search.domain.here

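Conclusion:

From my discussions with the CRIU team, CRIU does not currently support checkpoint/restore (c/r) of an LXC container that owns a physical NIC exclusively, and applications that use SOCK_PACKET sockets are not supported either. This has been filed as a feature request for CRIU at https://github.com/xemul/criu/issues/73 and is expected to be supported in a later release.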

https://www.ibm.com/developerworks/cn/linux/1312_caojh_linuxlxc/

https://wiki.debian.org/LXC/SimpleBridge

Posted in Linux, Linux Containers

Tags: CRIU LXC Network

Using Docker

May 19th, 2015

Update 2015-6-3

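A recent project requires live migration with docker. I had worked with lxc before, and the two are container virtualization products built on the same principles. I am using docker on Fedora 20. The official Fedora 20 repository contains an unrelated package that is also named docker, so the package to install is docker-io.

Once it is installed, the docker daemon can be started through systemctl:

$ sudo systemctl start docker

To start it at boot, use the enable command. With docker installed, docker ps shows the containers; the version I am using is docker 1.5. First I pull an image: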

[root@localhost contrib]# docker pull ubuntu:latest
[root@localhost contrib]# docker images
REPOSITORY          TAG                 IMAGE ID            CREATED             VIRTUAL SIZE
ubuntu              14.04.2             07f8e8c5e660        2 weeks ago         188.3 MB
ubuntu              latest              07f8e8c5e660        2 weeks ago         188.3 MB
ubuntu              trusty              07f8e8c5e660        2 weeks ago         188.3 MB
ubuntu              trusty-20150427     07f8e8c5e660        2 weeks ago         188.3 MB
ubuntu              14.04               07f8e8c5e660        2 weeks ago         188.3 MB
busybox             buildroot-2014.02   8c2e06607696        4 weeks ago         2.43 MB
busybox             latest              8c2e06607696        4 weeks ago         2.43 MB

After the pull finishes, docker images shows the images available on this machine. We can then start a container from an image with docker run -d -t -i ubuntu /bin/sh.

we’ve also passed in two flags: -t and -i. The -t flag assigns a pseudo-tty or terminal inside our new container and the -i flag allows us to make an interactive connection by grabbing the standard in (STDIN) of the container.

Now we can use docker ps to see the containers running on this machine.

[root@localhost contrib]# docker ps
CONTAINER ID        IMAGE               COMMAND             CREATED             STATUS              PORTS               NAMES
f05df7249fea        ubuntu:14.04        "/bin/bash"         9 seconds ago       Up 6 seconds                            silly_hypatia

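Unlike lxc, docker identifies each container by a unique 64-character hexadecimal ID, shown in abbreviated form as the CONTAINER ID. To stop a container, use docker stop [CONTAINER ID] or docker stop [NAMES]. To delete an image, use docker rmi REPOSITORY:TAG. A name can also be assigned with --name and used in place of the CONTAINER ID.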

[root@localhost contrib]# docker run -d -i -t --name busybox busybox:latest
3772693b1a82f996b296addf2f2ec00636535c81aa61730f66717d1311ae1b20
[root@localhost contrib]# docker stop busybox
busybox
[root@localhost contrib]# docker run -d -i -t --name busybox busybox:latest
FATA[0000] Error response from daemon: Conflict. The name "busybox" is already in use by container 3772693b1a82. You have to delete (or rename) that container to be able to reuse that name.
[root@localhost contrib]# docker run -d -i -t busybox:latest
5c331e6cddb00e134f39831ce1a6c5a9763c5ba60e1a41232503fce49fa17b09
[root@localhost contrib]# docker ps
CONTAINER ID        IMAGE                       COMMAND             CREATED             STATUS              PORTS               NAMES
5c331e6cddb0        busybox:buildroot-2014.02   "/bin/sh"           8 seconds ago       Up 6 seconds                            elated_meitner
f05df7249fea        ubuntu:14.04                "/bin/bash"         3 hours ago         Up 3 hours                              silly_hypatia

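For live migration of a container, checkpoint/restore (C/R) operates on the container's pid, so we need to find it. docker inspect busybox shows the container's details, and here we can see that its pid is 768.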

# docker inspect busybox
.....
"Path": "/bin/sh",
    "ProcessLabel": "",
    "ResolvConfPath": "/var/lib/docker/containers/f89a11d1cfe0037993dc4862ee218d59ccdcc4c7377fa4ed8946ed744226d8f6/resolv.conf",
    "RestartCount": 0,
    "State": {
        "Error": "",
        "ExitCode": 0,
        "FinishedAt": "0001-01-01T00:00:00Z",
        "OOMKilled": false,
        "Paused": false,
        "Pid": 768,
        "Restarting": false,
        "Running": true,
        "StartedAt": "2015-06-02T22:48:09.975364683Z"
    },
    "Volumes": {},
    "VolumesRW": {}
....

References:

http://criu.org/Docker

http://docs.docker.com/userguide/dockerizing/

Posted in Linux, Linux Containers

Tags: cgroups Docker namespace

Checkpoint/Restore in user space:CRIU

March 20th, 2015

Update 2015-3-23

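CRIU is a popular application-level checkpoint/restore tool. It grew out of the OpenVZ project, whose biggest drawback is that it requires a modified kernel. CRIU instead keeps as much of the program as possible in user space and leaves only the necessary system calls in the kernel.

OpenVZ development is currently stuck on kernel 2.6.32, and its main developers have shifted their focus to CRIU.

We will not go into the user-space CRIU program in detail here; the focus is on the kernel side of CRIU, which has two parts: 1) a mechanism for dumping specific kernel state about a process, and 2) a way to hand that state back to the kernel for restore.

CRIU's goal is to allow the entire running state of an application to be dumped. That means dumping a great deal of information related to the application, mainly [1]: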

virtual memory map
open files
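credentials
timers
PID
parent PID
shared resources

The ways to dump a particular application are:

Parasite code [2]: code that can be injected into a given process to inspect it transparently, for example to fetch file descriptors and dump memory contents. In effect, the parasite code runs inside the process before its normal execution continues; getitimer() and sigaction() are practical examples of calls made this way.
Ptrace: quickly freezes processes and injects the parasite code.
Netlink: retrieves socket and netns information.
procfs: reads the entries for a specific PID, such as /proc/PID/maps, /proc/PID/map_files/, /proc/PID/status, and /proc/PID/mountinfo. Of these, map_files covers files, network, and so on.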
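The procfs route is the easiest of these to try by hand. The sketch below is not how CRIU itself is implemented; it only illustrates that the information is plain text readable from /proc. It counts the memory mappings of a process by counting lines in /proc/PID/maps, and the function name count_mappings is hypothetical.

```c
#include <stdio.h>
#include <sys/types.h>

/* Count the memory mappings of a process by reading /proc/<pid>/maps,
 * which contains one line per mapping. Returns -1 if the file cannot
 * be opened (no such process, or no permission). */
int count_mappings(pid_t pid)
{
    char path[64];
    snprintf(path, sizeof path, "/proc/%d/maps", (int)pid);

    FILE *f = fopen(path, "r");
    if (f == NULL)
        return -1;

    int count = 0, ch;
    while ((ch = fgetc(f)) != EOF)
        if (ch == '\n')
            count++;
    fclose(f);
    return count;
}
```

A real dumper would parse each line (address range, permissions, offset, backing file) rather than just counting them.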

Parasite code was not designed specifically for CRIU; it is a feature added to the kernel. CRIU uses parasite code to make system calls that only the application itself can make, such as getitimer().

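Apart from these special system calls, others can be called by any program on behalf of another process, such as sched_getscheduler() to get the scheduling policy and sched_getparam() to get a process's scheduling parameters.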
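Because both calls take a pid argument, no code injection is needed to use them. A minimal sketch, with get_sched_info being a name invented for this example:

```c
#include <sched.h>
#include <sys/types.h>

/* Fetch the scheduling policy and static priority of a process from the
 * outside. Returns 0 on success and -1 on error (errno is set by the
 * failing call). */
int get_sched_info(pid_t pid, int *policy, int *priority)
{
    struct sched_param param;

    int p = sched_getscheduler(pid);
    if (p < 0)
        return -1;
    if (sched_getparam(pid, &param) != 0)
        return -1;

    *policy = p;
    *priority = param.sched_priority;
    return 0;
}
```

On restore, the matching sched_setscheduler() call can put the saved values back, subject to the usual privilege checks.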

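Ptrace is a system call that makes it possible to control a target process and access its internal state; it is commonly used by debuggers and other code analysis tools. Before kernel 3.4, ptrace relied heavily on signals to interact with the target process, which meant interrupting its execution, much as gdb and similar tools do. With PTRACE_SEIZE, attaching no longer stops the process.

This new ptrace feature is what allows CRIU to checkpoint a particular application.

Restoring an application: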

Collect shared object
Restore namespace
Create the process tree, including SID and PGID, and restore inheritance
Restore files, sockets, pipes
Restore per-task properties
Restore memory
Call sigreturn

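Specific kernel features:

Parasite code [2]
If a program has a set of files of various kinds open, the kernel keeps a file descriptor table recording which files the application has opened. On restore, CRIU has to reopen those files with the same fd numbers. Similarly, when restoring an application with a particular pid, that pid may already be taken. So that the process can be restored with its original pid, an API was added to the kernel to control the pid that the next fork will be assigned, namely /proc/sys/kernel/ns_last_pid. See http://lwn.net/Articles/525723/ for details.
The kernel also gained the kcmp() system call, which checks whether two processes share a kernel resource. It is useful when a parent process opens a set of shared resources and then calls fork(): the child inherits the parent's resources, and kcmp() can detect that.
/proc/PID/map_files
prctl extensions for setting anonymous, private objects, e.g. task/mm objects
Dumping socket information through netlink. Compared with the /proc files, this yields more information about a socket for restore; with it, CRIU restores socket connections using getsockopt() and setsockopt().
TCP repair mode
Virtual net device indexes, for restoring network devices inside a namespace
Socket peeking offset
Task memory tracking, used for incremental snapshots and live migration.

Overall, CRIU resembles OpenVZ in some respects. The biggest difference is that OpenVZ requires a modified kernel, which is very inconvenient, whereas CRIU relies on system calls that have been added to the kernel, needs no kernel patching, and is much lighter.

BLCR is likewise developed against specific kernel versions. It consists of two kernel modules and a user-space library and tools. To restore a process with BLCR, the process must depend on the libcr library, or have libcr linked in at build time, which is clearly inconvenient for legacy code. The latest BLCR release dates from January 2013.

CRIU's latest release, by contrast, came out on 2015-03-02 as of this writing, which shows how actively CRIU is developed.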

References:

[1] http://lwn.net/Articles/525675/

[2] http://lwn.net/Articles/454304/

BLCR:

http://blog.csdn.net/myxmu/article/details/8948258

http://blog.csdn.net/myxmu/article/details/8948265

Posted in Linux, Linux Containers

Tags: BLCR CRIU Linux Containers
