Archive for the ‘Kernel内核分析’ category

tools/vm/page-types.c 解析使用

December 9th, 2014

page-types.c这个代码位于kernel源码下的tool/vm/page-type.c

我们可以通过使用makefile编译这个程序，通过这个程序我们可以查找每个特定进程的page frame number 记住这个pfn只是一个index而已，我们如果想取得真实的物理地址需要使用pfn*page_size，当然这个是page frame 的物理起始地址。

这个程序是位于用户态，通过查找/proc/pid/pagemap 和 /proc/pid/maps来找到pfn的。

这个程序的功能比较强大，我们使用的仅仅是指定一个$./page-type -p pid

首先mian（）中先要解析参数，然后通过parse_number()来获取用户输入的pid，然后我们知道当指定了一个pid，也就意味着我们打开一个正在运行的pid的maps

[root@localhost 13352]# ls
attr       clear_refs       cpuset   fd       limits    mountinfo   ns         oom_score_adj  root       stack   syscall
autogroup  cmdline          cwd      fdinfo   loginuid  mounts      numa_maps  pagemap        sched      stat    task
auxv       comm             environ  gid_map  maps      mountstats  oom_adj    personality    sessionid  statm   uid_map
cgroup     coredump_filter  exe      io       mem       net         oom_score  projid_map     smaps      status  wchan

00400000-00401000 r-xp 00000000 fd:02 4330673                            /home/lzz/mca-ras/src/tools/simple_process/simple_process
00600000-00601000 r--p 00000000 fd:02 4330673                            /home/lzz/mca-ras/src/tools/simple_process/simple_process
00601000-00602000 rw-p 00001000 fd:02 4330673                            /home/lzz/mca-ras/src/tools/simple_process/simple_process
019b8000-019d9000 rw-p 00000000 00:00 0                                  [heap]
33c1a00000-33c1a20000 r-xp 00000000 fd:01 2622088                        /usr/lib64/ld-2.18.so
33c1c1f000-33c1c20000 r--p 0001f000 fd:01 2622088                        /usr/lib64/ld-2.18.so
33c1c20000-33c1c21000 rw-p 00020000 fd:01 2622088                        /usr/lib64/ld-2.18.so
33c1c21000-33c1c22000 rw-p 00000000 00:00 0
33c1e00000-33c1fb4000 r-xp 00000000 fd:01 2622743                        /usr/lib64/libc-2.18.so
33c1fb4000-33c21b3000 ---p 001b4000 fd:01 2622743                        /usr/lib64/libc-2.18.so
33c21b3000-33c21b7000 r--p 001b3000 fd:01 2622743                        /usr/lib64/libc-2.18.so
33c21b7000-33c21b9000 rw-p 001b7000 fd:01 2622743                        /usr/lib64/libc-2.18.so
33c21b9000-33c21be000 rw-p 00000000 00:00 0
7f3fe081a000-7f3fe081d000 rw-p 00000000 00:00 0
7f3fe083c000-7f3fe083e000 rw-p 00000000 00:00 0
7fff87571000-7fff87592000 rw-p 00000000 00:00 0                          [stack]
7fff875fe000-7fff87600000 r-xp 00000000 00:00 0                          [vdso]
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0                  [vsyscall]

我们可以看到这个进程每个段，包括BSS CODE HEAP STACK等。然后哦page-types.c中的parse_pid读取这个表里面的信息。
关于这个表每个列的含义看这个http://stackoverflow.com/questions/1401359/understanding-linux-proc-id-maps

我们主要关注头两行，vm_start vm_end。这时我们要获取虚拟地址的page number:

pg_start[nr_vmas] = vm_start / page_size;
pg_end[nr_vmas] = vm_end / page_size;

至于这个page_size可以通过page_size = getpagesize();获取。这里我们要注意unsigned long的取值范围是0-2^64-1，-1也就是2^64-1

然后我们进入walk_task()函数,因为process可能会存在匿名映射，所以我们需要使用filebacked_path[i]判断文件地址。

进入walk_vma（）后。
这时我们需要读取/proc/pid/pagemap，这个我们用常规打不开，所以我们只好通过vm index读取的方式获得pages和pfn的相关信息。

详细我们查看

http://stackoverflow.com/questions/17021214/decode-proc-pid-pagemap-entry
https://www.kernel.org/doc/Documentation/vm/pagemap.txt

关注

* Bits 0-54 page frame number (PFN) if present
* Bit 63 page present
下面就是通过编程的方式获取这个的0-54位。总的来所就是使用位操作：

#define PFN_MASK (((1LL)<<55)-1)  //意味着0-54位都为1
*phy=(buf&PFN_MASK)*page_size+vir%page_size

核心函数：

        for (i = 0; i < pages; i++) {
//          printf("%lx\n",buf[i]);
            pfn = pagemap_pfn(buf[i]);
            if (pfn) {
                printf("%lx", pfn);
                printf("\t0x%lx",pfn*page_size);
                walk_pfn(index + i, pfn, 1, buf[i]);
            }
        }

我们可以看到pfn只是一个index而已，真正的物理地址需要pfn*page_size的方式，后面的各种属性是通过读取/proc/kpageflags获得。结果如下：

/home/lzz/mca-ras/src/tools/simple_process/simple_process
93a12	0x93a12000	referenced,uptodate,lru,active,mmap,private

/home/lzz/mca-ras/src/tools/simple_process/simple_process
54d07	0x54d07000	uptodate,lru,active,mmap,anonymous,swapbacked

/home/lzz/mca-ras/src/tools/simple_process/simple_process
9aa43	0x9aa43000	uptodate,lru,active,mmap,anonymous,swapbacked

[heap]
91f74	0x91f74000	uptodate,lru,active,mmap,anonymous,swapbacked
60fb3	0x60fb3000	uptodate,lru,active,mmap,anonymous,swapbacked
...............

我们看到物理地址只是在后面添加了3个0，其实也就是12位，正好是一个4k page的页内offset。

No comments »

Posted in C/C++, Kernel内核分析, Kernel内核编程, Linux, 进程管理

Tags: Memory

Linux中的虚拟地址转换物理地址实现

November 26th, 2014

Linux使用一种4级页表机制，即页全局目录(Page Global Directory)、页上级目录(Page Upper Directory)、页中间目录(Page Middle Directory)和页表(Page Table)。在这种分页机制下.

一个完整的线性地址被分为五部分：页全局目录、页上级目录、页中间目录、页表和偏移量，但是对于每个部分所占的位数则是不定的，这跟系统所在的体系架构有关。

Linux分别采用pgd_t、pud_t 、pmd_t和pte_t四种数据结构来表示页全局目录项、页上级目录项、页中间目录项和页表项。这四种数据结构本质上都是无符号长整型，Linux为了更严格数据类型检查，将无符号长整型分别封装成四种不同的页表项。

static void get_pgtable_macro(void)
{
	printk("PAGE_OFFSET = 0x%lx\n", PAGE_OFFSET);
	printk("PGDIR_SHIFT = %d\n", PGDIR_SHIFT);
	printk("PUD_SHIFT = %d\n", PUD_SHIFT);
	printk("PMD_SHIFT = %d\n", PMD_SHIFT);
	printk("PAGE_SHIFT = %d\n", PAGE_SHIFT);

	printk("PTRS_PER_PGD = %d\n", PTRS_PER_PGD);
	printk("PTRS_PER_PUD = %d\n", PTRS_PER_PUD);
	printk("PTRS_PER_PMD = %d\n", PTRS_PER_PMD);
	printk("PTRS_PER_PTE = %d\n", PTRS_PER_PTE);

	printk("PAGE_MASK = 0x%lx\n", PAGE_MASK);
}

static unsigned long vaddr2paddr(unsigned long vaddr)
{
	pgd_t *pgd;
	pud_t *pud;
	pmd_t *pmd;
	pte_t *pte;
	unsigned long paddr = 0;
        unsigned long page_addr = 0;
	unsigned long page_offset = 0;

	pgd = pgd_offset(current->mm, vaddr);
	printk("pgd_val = 0x%lx\n", pgd_val(*pgd));
	printk("pgd_index = %lu\n", pgd_index(vaddr));
	if (pgd_none(*pgd)) {
		printk("not mapped in pgd\n");
		return -1;
	}

	pud = pud_offset(pgd, vaddr);
	printk("pud_val = 0x%lx\n", pud_val(*pud));
	if (pud_none(*pud)) {
		printk("not mapped in pud\n");
		return -1;
	}

	pmd = pmd_offset(pud, vaddr);
	printk("pmd_val = 0x%lx\n", pmd_val(*pmd));
	printk("pmd_index = %lu\n", pmd_index(vaddr));
	if (pmd_none(*pmd)) {
		printk("not mapped in pmd\n");
		return -1;
	}

	pte = pte_offset_kernel(pmd, vaddr);
	printk("pte_val = 0x%lx\n", pte_val(*pte));
	printk("pte_index = %lu\n", pte_index(vaddr));
	if (pte_none(*pte)) {
		printk("not mapped in pte\n");
		return -1;
	}

	//页框物理地址机制 | 偏移量
	page_addr = pte_val(*pte) & PAGE_MASK;
	page_offset = vaddr & ~PAGE_MASK;
	paddr = page_addr | page_offset;
	printk("page_addr = %lx, page_offset = %lx\n", page_addr, page_offset);
        printk("vaddr = %lx, paddr = %lx\n", vaddr, paddr);

	return paddr;
}

通过代码我们发现线性地址寻值过程：pgd_t -> pud_t ->pmd_t -> pte_t

具体Dmesg打印信息:

dmesg.log

目前在X86_64环境下，虽然支持64位，但是系统实际上目前远远使用不了如此大的内存，真正使用的只有48位，即

0xffff000000000000

然后使用上面的宏函数进行移位操作，拼接出下一级的页表物理基地址。然后再和后面的pud，pmd，pte部分地址相加，得到下一级的物理基地址。直到最后拿到页框的物理地址与pte或，找到真实的物理地址。

参考：

arch/x86/include/asm/pgtable_64_types.h

No comments »

Posted in C/C++, Kernel内核编程, Linux, 内存管理

Tags: Management Memory

VFS Data Structure关系(3) – 跨文件系统的文件操作分析

October 28th, 2014

通过对VFS 以及文件系统的分析，我们现在来分析一下跨文件系统的文件传输

http://www.lizhaozhong.info/archives/1080

http://www.lizhaozhong.info/archives/1110

比如从一个floppy传输到harddisk（ext2——>ext4）

首先我们明确一个概念：一切皆文件！

在linux中，无论是普通的文件，还是特殊的目录、设备等，VFS都将它们同等看待成文件，通过统一的文件操作界面来对它们进行操作。操作文件时需先打开；打开文件时，VFS会知道该文件对应的文件系统格式；当VFS把控制权传给实际的文件系统时，实际的文件系统再做出具体区分，对不同的文件类型执行不同的操作。

原理：将ext2格式的磁盘上的一个文件a.txt拷贝到ext4格式的磁盘上，命名为b.txt。这包含两个过程，对a.txt进行读操作，对b.txt进行写操作。读写操作前，需要先打开文件。由前面的分析可知，打开文件时，VFS会知道该文件对应的文件系统格式，以后操作该文件时，VFS会调用其对应的实际文件系统的操作方法。所以，VFS调用vfat的读文件方法将a.txt的数据读入内存；在将a.txt在内存中的数据映射mmap()到b.txt对应的内存空间后，VFS调用ext4的写文件方法将b.txt写入磁盘；从而实现了最终的跨文件系统的复制操作。

关于mmap()使用

fs/open.c

long do_sys_open(int dfd, const char __user *filename, int flags, umode_t mode)
{
    struct open_flags op;
    int fd = build_open_flags(flags, mode, &op);
    struct filename *tmp;

    if (fd)
        return fd;

    tmp = getname(filename);
    if (IS_ERR(tmp))
        return PTR_ERR(tmp);

    fd = get_unused_fd_flags(flags);
    if (fd >= 0) {
        struct file *f = do_filp_open(dfd, tmp, &op);
        if (IS_ERR(f)) {
            put_unused_fd(fd);
            fd = PTR_ERR(f);
        } else {
            fsnotify_open(f);
            fd_install(fd, f);
        }
    }
    putname(tmp);
    return fd;
}

下面我们来分析下linux-3.14.8的内核(Not Finished!)

sys_open()
    |
    |------do_sys_open()
    |           |
    |           |------get_unused_fd_flags()
                |
                |------do_filp_open()
                            |
                            |----path_openat()
                            |       |
                            |       |----link_path_walk()

No comments »

Posted in Kernel内核分析, Linux, VFS

Tags: filesystem Structure Systemcall VFS

VFS Data Structure关系(2)

October 27th, 2014

之前分析了文件系统主要的数据结构inode，dentry，super_block，file.

为了加快文件的一些操作，还引入了中间的数据结构。

struct file_system_type（include/linux/fs.h）

file_system_type结构用来描述具体的文件系统的类型信息。被Linux支持的文件系统，都有且仅有一个file_system_type结构而不管它有零个或多个实例被安装到系统中。

struct file_system_type {
    const char *name;
    int fs_flags;
#define FS_REQUIRES_DEV     1
#define FS_BINARY_MOUNTDATA 2
#define FS_HAS_SUBTYPE      4
#define FS_USERNS_MOUNT     8   /* Can be mounted by userns root */
#define FS_USERNS_DEV_MOUNT 16 /* A userns mount does not imply MNT_NODEV */
#define FS_RENAME_DOES_D_MOVE   32768   /* FS will handle d_move() during rename() internally. */
    struct dentry *(*mount) (struct file_system_type *, int,
               const char *, void *);
    void (*kill_sb) (struct super_block *);
    struct module *owner;
    struct file_system_type * next;
    struct hlist_head fs_supers;

    struct lock_class_key s_lock_key;
    struct lock_class_key s_umount_key;
    struct lock_class_key s_vfs_rename_key;
    struct lock_class_key s_writers_key[SB_FREEZE_LEVELS];

    struct lock_class_key i_lock_key;
    struct lock_class_key i_mutex_key;
    struct lock_class_key i_mutex_dir_key;
};

和路径查找相关:

struct nameidata(include/linux/namei.h)

struct nameidata {
    struct path path;
    struct qstr last;
    struct path root;
    struct inode    *inode; /* path.dentry.d_inode */
    unsigned int    flags;
    unsigned    seq, m_seq;
    int     last_type;
    unsigned    depth;
    char *saved_names[MAX_NESTED_LINKS + 1];
};

被Linux支持的文件系统，都有且仅有一个file_system_type结构而不管它有零个或多个实例被安装到系统中。每安装一个文件系统，就对应有一个超级块和安装点。也就是说如果有多个ext4 文件系统，也就有多个ext4的super_block 与 vfsmount 。

超级块通过它的一个域s_type指向其对应的具体的文件系统类型。具体的文件系统通过file_system_type中的一个域fs_supers链接具有同一种文件类型的超级块。同一种文件系统类型的超级块通过域s_instances链接。

通过下图，我们可以看到第一个与第三个super_block都指向同一个file_system_type

从下图可知，进程通过task_struct中的一个域files_struct files来找到它当前所打开的文件对象；而我们通常所说的文件描述符其实是进程打开的文件对象数组的索引值。文件对象通过域f_dentry找到它对应的dentry对象，再由dentry对象的域d_inode找到它对应的索引结点，这样就建立了文件对象与实际的物理文件的关联。

这里有一个非常重要的数据结构

struct files_struct(include/linux/fdtable.h)

/*
 * Open file table structure
 */
struct files_struct {
  /*
   * read mostly part
   */
    atomic_t count;
    struct fdtable __rcu *fdt;
    struct fdtable fdtab;
  /*
   * written part on a separate cache line in SMP
   */
    spinlock_t file_lock ____cacheline_aligned_in_smp;
    int next_fd;
    unsigned long close_on_exec_init[1];
    unsigned long open_fds_init[1];
    struct file __rcu * fd_array[NR_OPEN_DEFAULT];
};

我们可以看到fd_array 这个二维fd连接一个struct file类型的数据结构，这个struct file就是连接进程与文件的媒介，比dentry，inode更加高级！

通过学习VFS文件结构，我们就可以理解跨文件系统传输的基本原理。

下面的图我是按照kernel 3.14.8来画的，可以清晰的看出task_struct文件结构：

1 comment »

Posted in Kernel内核分析, Linux, VFS

Tags: filesystem Structure VFS

Kdump 与 grub2 操作

October 24th, 2014

我们在调试kernel的时候，经常会出现kernel panic。这时我们system全部按键失灵，无法查看dmesg信息。

这时我们就要使用kdump工具。他会在system 重启前，将dmesg和vmcore信息保存到默认路径/var/crash/下

[root@localhost lzz]# vim /var/crash/127.0.0.1-2014.10.24-20\:22\:03/vmcore
vmcore            vmcore-dmesg.txt
[root@localhost lzz]# vim /var/crash/127.0.0.1-2014.10.24-20\:22\:03/vmcore
vmcore            vmcore-dmesg.txt

kdump工具主要是面向redhat产品，由于我使用的是fedora，比较好的兼容了redhat5.

最主要是安装 yum install –enablerepo=fedora-debuginfo –enablerepo=updates-debuginfo kexec-tools crash kernel-debuginfo kdump

然后在使用systemctl start kdump 启动服务

[root@localhost 127.0.0.1-2014.10.24-20:22:03]# systemctl status kdump
kdump.service - Crash recovery kernel arming
   Loaded: loaded (/usr/lib/systemd/system/kdump.service; enabled)
   Active: active (exited) since 五 2014-10-24 20:23:01 CST; 2min 5s ago
  Process: 641 ExecStart=/usr/bin/kdumpctl start (code=exited, status=0/SUCCESS)
 Main PID: 641 (code=exited, status=0/SUCCESS)
   CGroup: /system.slice/kdump.service

10月 24 20:23:01 localhost.localdomain kdumpctl[641]: kexec: loaded kdump kernel
10月 24 20:23:01 localhost.localdomain kdumpctl[641]: Starting kdump: [OK]
10月 24 20:23:01 localhost.localdomain systemd[1]: Started Crash recovery kernel arming.

PS：由于在linux发行版中，各大发行版逐渐开始使用systemctl ，所以chkconfig/service 命令都会在不久以后移除。

学习system systemd-vs-sysVinit-cheatsheet

在配置linux kernel for mce-inject中

我们要在grub2中进行一些配置：

在grub2中，/boot/grub2/grub.cfg是脚本自动生成的（使用grub2-mkconfig命令）

我们需要在[root@localhost 127.0.0.1-2014.10.24-20:22:03]# vim /etc/default/grub

GRUB_TIMEOUT=5
GRUB_DISTRIBUTOR="$(sed 's, release .*$,,g' /etc/system-release)"
GRUB_DEFAULT=saved
RUB_CMDLINE_LINUX="rd.lvm.lv=fedora/swap rd.md=0 rd.dm=0 $([ -x /usr/sbin/rhcrashkernel-param ] && /usr/sbin/rhcrashkernel-p    aram || :) rd.luks=0 vconsole.keymap=us rd.lvm.lv=fedora/root rhgb quiet crashkernel=240M"
GRUB_DISABLE_RECOVERY="true"
GRUB_THEME="/boot/grub2/themes/system/theme.txt"

在RUB_CMDLINE_LINUX中，我们在最后家一个crashkernel=240M ,这个240M是根据1GB-80M来计算的，太小的内存会造成kdump service 启动失败！

然后使用

# grub2-mkconfig -o /boot/grub2/grub.cfg

会自动更新grub2启动脚本，我们在/boot/grub2/grub.cfg看到生成的grub信息。

然后启动之后我们通过free -m /top命令看memory都会变小！也就意味着内存有一部分被分出去作crashkernel内存区域！

其实这个kdump实质就是在kernel crash掉后，system启动了crashkernel 来收集信息。

参考：

http://blog.csdn.net/sabalol/article/details/7043313

http://www.ibm.com/developerworks/cn/linux/l-kexec/

https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/7/html/Kernel_Crash_Dump_Guide/sect-kdump-config-cli.html

https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/7/pdf/Kernel_Crash_Dump_Guide/Red_Hat_Enterprise_Linux-7-Kernel_Crash_Dump_Guide-en-US.pdf

No comments »

Posted in Code杂谈, Kernel内核分析, Linux

Tags: grub2 kdump

Archive for the ‘Kernel内核分析’ category

tools/vm/page-types.c 解析使用

我们主要关注头两行，vm_start vm_end。这时我们要获取虚拟地址的page number:

至于这个page_size可以通过page_size = getpagesize();获取。这里我们要注意unsigned long的取值范围是0-2^64-1，-1也就是2^64-1

我们看到物理地址只是在后面添加了3个0，其实也就是12位，正好是一个4k page的页内offset。

Linux中的虚拟地址转换物理地址实现

dmesg.log

VFS Data Structure关系(3) – 跨文件系统的文件操作分析

关于mmap()使用

fs/open.c

VFS Data Structure关系(2)

struct file_system_type（include/linux/fs.h）

file_system_type结构用来描述具体的文件系统的类型信息。被Linux支持的文件系统，都有且仅有一个file_system_type结构而不管它有零个或多个实例被安装到系统中。

和路径查找相关:

struct nameidata(include/linux/namei.h)

struct files_struct(include/linux/fdtable.h)

我们可以看到fd_array 这个二维fd连接一个struct file类型的数据结构，这个struct file就是连接进程与文件的媒介，比dentry，inode更加高级！

Kdump 与 grub2 操作

PS：由于在linux发行版中，各大发行版逐渐开始使用systemctl ，所以chkconfig/service 命令都会在不久以后移除。

学习system systemd-vs-sysVinit-cheatsheet

Recent Posts

热门文章

Archive for the ‘Kernel内核分析’ category

tools/vm/page-types.c 解析使用

我们主要关注头两行，vm_start vm_end。这时我们要获取虚拟地址的page number:

至于这个page_size可以通过page_size = getpagesize();获取。这里我们要注意unsigned long的取值范围是0-2^64-1，-1也就是2^64-1

我们看到物理地址只是在后面添加了3个0，其实也就是12位，正好是一个4k page的页内offset。

Linux中的虚拟地址转换物理地址实现

dmesg.log

VFS Data Structure关系(3) – 跨文件系统的文件操作分析

关于mmap()使用

fs/open.c

VFS Data Structure关系(2)

struct file_system_type（include/linux/fs.h）

file_system_type结构用来描述具体的文件系统的类型信息。被Linux支持的文件系统，都有且仅有一 个file_system_type结构而不管它有零个或多个实例被安装到系统中。

和路径查找相关:

struct nameidata(include/linux/namei.h)

struct files_struct(include/linux/fdtable.h)

我们可以看到fd_array 这个二维fd连接一个struct file类型的数据结构，这个struct file就是连接进程与文件的媒介，比dentry，inode更加高级！

Kdump 与 grub2 操作

PS：由于在linux发行版中，各大发行版逐渐开始使用systemctl ，所以chkconfig/service 命令都会在不久以后移除。

学习system systemd-vs-sysVinit-cheatsheet

Tags

Recent Posts

热门文章

file_system_type结构用来描述具体的文件系统的类型信息。被Linux支持的文件系统，都有且仅有一个file_system_type结构而不管它有零个或多个实例被安装到系统中。