schedual » JasonLe's TechBlog

Posts Tagged ‘schedual’

The CFS Scheduling Algorithm

February 2nd, 2015

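As mentioned before, CFS is one of the scheduling policies in the kernel. The core idea of the algorithm is that every task should get a fair share of the processor. To reach that goal, CFS uses vruntime to decide whether a given process deserves to be scheduled.

The previous post gave a first description of how CFS is implemented, but it did not explain how vruntime is computed.

Previous post: http://www.lizhaozhong.info/archives/1206

vruntime is a virtual quantity maintained by the CFS algorithm. It plays down the direct role of priority in scheduling: instead, every struct sched_entity is organized into a red-black tree keyed by its vruntime.

By the ordering property of the red-black tree, smaller values sit on the left and larger values on the right. As a process runs, every timer interrupt calls the policy's task_tick() method, which updates vruntime so that CFS can use it when scheduling.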

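The leftmost node of the red-black tree is always the one with the smallest vruntime. In addition, each cfs_rq keeps a min_vruntime value that must increase monotonically, and update_min_vruntime(cfs_rq) is what maintains it. The schedstat_set(curr->statistics.exec_max, max(delta_exec, curr->statistics.exec_max)); call that precedes it in update_curr() only records the longest single run of the entity for statistics: it takes the maximum of delta_exec and curr->statistics.exec_max, but it does not influence vruntime.

min_vruntime therefore only moves forward: it is updated when the smaller of the current entity's vruntime and the leftmost node's vruntime exceeds it.

One case is worth noting: while a process sleeps its vruntime does not change, but min_vruntime keeps growing, so the process ends up further to the left!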
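
The monotonic update described above can be sketched in a few lines. This is a simplified model and not the kernel source: struct toy_cfs_rq and the toy_* helpers are hypothetical names, and the tree is reduced to a flag plus the vruntime of its leftmost entity.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for the per-CPU CFS run queue. */
struct toy_cfs_rq {
    uint64_t min_vruntime;       /* only ever moves forward */
    bool     has_curr;           /* is an entity currently running? */
    uint64_t curr_vruntime;
    bool     has_leftmost;       /* is the tree non-empty? */
    uint64_t leftmost_vruntime;
};

/* Wraparound-safe comparisons: look at the sign of the difference. */
static uint64_t toy_max_vruntime(uint64_t a, uint64_t b)
{
    return ((int64_t)(b - a) > 0) ? b : a;
}

static uint64_t toy_min_vruntime(uint64_t a, uint64_t b)
{
    return ((int64_t)(b - a) < 0) ? b : a;
}

/* Move min_vruntime up to the smallest vruntime still in play, never back. */
static void toy_update_min_vruntime(struct toy_cfs_rq *rq)
{
    uint64_t candidate;

    assert(rq);
    candidate = rq->min_vruntime;
    if (rq->has_curr)
        candidate = rq->curr_vruntime;
    if (rq->has_leftmost) {
        if (rq->has_curr)
            candidate = toy_min_vruntime(candidate, rq->leftmost_vruntime);
        else
            candidate = rq->leftmost_vruntime;
    }
    rq->min_vruntime = toy_max_vruntime(rq->min_vruntime, candidate);
}
```

Comparing through the signed difference rather than with a plain < keeps the result correct even if the 64-bit counters eventually wrap around.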

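The call path is:

void scheduler_tick(void)
{... curr->sched_class->task_tick(rq, curr, 0); ...}
-> Through the function pointer, the policy-specific function is called. For CFS this is task_tick_fair, which calls entity_tick() to update
the vruntime in the sched_entity of the task currently running on the cfs_rq that the scheduling entity belongs to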
->
static void
entity_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr, int queued)
{
        /*
         * Update run-time statistics of the 'current'.
         */
        update_curr(cfs_rq);

......
}
->static void update_curr(struct cfs_rq *cfs_rq)

update_curr(), called from entity_tick(), is the function in CFS that actually updates vruntime:

static void update_curr(struct cfs_rq *cfs_rq)
{
....
        u64 now = rq_clock_task(rq_of(cfs_rq));
....
        delta_exec = now - curr->exec_start;
        if (unlikely((s64)delta_exec <= 0))
                return;

        curr->exec_start = now;

        schedstat_set(curr->statistics.exec_max,
                      max(delta_exec, curr->statistics.exec_max));

        curr->sum_exec_runtime += delta_exec;
        schedstat_add(cfs_rq, exec_clock, delta_exec);

        curr->vruntime += calc_delta_fair(delta_exec, curr);
        update_min_vruntime(cfs_rq);

        if (entity_is_task(curr)) {
                struct task_struct *curtask = task_of(curr);

                trace_sched_stat_runtime(curtask, delta_exec, curr->vruntime);
                cpuacct_charge(curtask, delta_exec);
                account_group_exec_runtime(curtask, delta_exec);
        }

        account_cfs_rq_runtime(cfs_rq, delta_exec);
}

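The function first reads the current time of the rq, then computes delta_exec, the real time the current process has run since exec_start, and finally sets exec_start to now so it is ready for the next update.

delta_exec is added to sum_exec_runtime as is. For vruntime, it first has to be processed by calc_delta_fair(delta_exec, curr).

The table below shows that a nice value of 0 corresponds to a weight of 1024. Note also that the nice range [-20, +19] maps onto priorities 100~139 system-wide, while 0~99 are reserved for real-time processes. Each increase of the nice value by one costs a process roughly 10% of CPU time, and each decrease gains roughly 10%. The higher the nice value, the smaller the weight!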

static const int prio_to_weight[40] = {
 /* -20 */     88761,     71755,     56483,     46273,     36291,
 /* -15 */     29154,     23254,     18705,     14949,     11916,
 /* -10 */      9548,      7620,      6100,      4904,      3906,
 /*  -5 */      3121,      2501,      1991,      1586,      1277,
 /*   0 */      1024,       820,       655,       526,       423,
 /*   5 */       335,       272,       215,       172,       137,
 /*  10 */       110,        87,        70,        56,        45,
 /*  15 */        36,        29,        23,        18,        15,
};
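
To make the mapping and the "about 10%" rule concrete, here is a small sketch. The helper names toy_nice_to_prio and toy_share_permille are my own and do not exist in the kernel; the only inputs taken from the table are the weights 1024 (nice 0) and 820 (nice 1).

```c
#include <assert.h>

/* Static priority of a normal task: nice -20..19 maps onto 100..139. */
static int toy_nice_to_prio(int nice)
{
    assert(nice >= -20 && nice <= 19);
    return 120 + nice;
}

/*
 * CPU share of one task in tenths of a percent, given its weight and
 * the sum of the weights of all runnable tasks (rounded down).
 */
static int toy_share_permille(int weight, int total_weight)
{
    assert(weight > 0 && weight <= total_weight);
    return (int)((long long)weight * 1000 / total_weight);
}
```

With one nice 0 task and one nice 1 task runnable, the total weight is 1844, so toy_share_permille(1024, 1844) gives 555 and toy_share_permille(820, 1844) gives 444: roughly 55% against 45% of the CPU.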

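calc_delta_fair() compares the weight of the entity with the weight of nice 0 (NICE_0_LOAD). If they are equal, delta is returned unchanged, because the scaling factor is 1. If they differ, delta has to be scaled by the weight.
struct sched_entity *se holds the weight of the current process, which is one of the numbers in the array above!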

static inline u64 calc_delta_fair(u64 delta, struct sched_entity *se)
{
        if (unlikely(se->load.weight != NICE_0_LOAD))
                delta = __calc_delta(delta, NICE_0_LOAD, &se->load);

        return delta;
}

If the nice value of the current process is not 0, we enter the following function:

static u64 __calc_delta(u64 delta_exec, unsigned long weight, struct load_weight *lw)
{
        u64 fact = scale_load_down(weight);
        int shift = WMULT_SHIFT;

        __update_inv_weight(lw);

        if (unlikely(fact >> 32)) {
                while (fact >> 32) {
                        fact >>= 1;
                        shift--;
                }
        }

        /* hint to use a 32x32->64 mul */
        fact = (u64)(u32)fact * lw->inv_weight;

        while (fact >> 32) {
                fact >>= 1;
                shift--;
        }

        return mul_u64_u32_shr(delta_exec, fact, shift);
}

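This function is a little involved. My current understanding is that the weighting formula is **delta_exec = delta_exec * (weight / lw.weight)**

We could plot the weighted vruntime increment against the real delta_exec for different nice values. Comparing with the array above, the higher the nice value, the smaller the weight. What matters here is the quotient 1024/lw.weight: the smaller the weight, the larger the quotient and the faster vruntime grows! In CFS, the smaller the vruntime, the sooner a task gets scheduled!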
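
The formula can be tried out directly with plain integer arithmetic. This is a minimal sketch of the formula only, not of the kernel's fixed-point implementation; toy_calc_delta_fair and TOY_NICE_0_LOAD are hypothetical names, and the multiplication is assumed not to overflow (delta_exec_ns below 2^54).

```c
#include <assert.h>
#include <stdint.h>

#define TOY_NICE_0_LOAD 1024u

/* vruntime increment for delta_exec_ns of real run time at a given weight. */
static uint64_t toy_calc_delta_fair(uint64_t delta_exec_ns, uint32_t weight)
{
    assert(weight > 0);
    if (weight == TOY_NICE_0_LOAD)
        return delta_exec_ns;       /* nice 0: virtual time == real time */
    /* Assumes delta_exec_ns * 1024 fits in 64 bits. */
    return delta_exec_ns * TOY_NICE_0_LOAD / weight;
}
```

For 1 ms of real run time (1000000 ns), a nice 0 task is charged 1000000, a nice 5 task (weight 335) is charged 3056716, and a nice -5 task (weight 3121) only 328099.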

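mul_u64_u32_shr() multiplies a 64-bit value by a 32-bit value and shifts the product right by shift bits, that is, it computes (delta_exec * fact) >> shift without overflowing the intermediate result. I have not studied its implementation in detail yet and will come back to it another day.
The kernel's own comment explains the calculation: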

/*
 * delta_exec * weight / lw.weight
 *   OR
 * (delta_exec * (weight * lw->inv_weight)) >> WMULT_SHIFT
 *
 * Either weight := NICE_0_LOAD and lw \e prio_to_wmult[], in which case
 * we're guaranteed shift stays positive because inv_weight is guaranteed to
 * fit 32 bits, and NICE_0_LOAD gives another 10 bits; therefore shift >= 22.
 *
 * Or, weight =< lw.weight (because lw.weight is the runqueue weight), thus
 * weight/lw.weight <= 1, and therefore our shift will also be positive.
 */
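
The second form in the comment replaces the division by a multiplication with a precomputed inverse. The sketch below follows the same idea in user space so it can be experimented with. It is an approximation of the approach under my own assumptions: all toy_* names are hypothetical, the inverse is recomputed on every call instead of being cached in lw->inv_weight, and the 64x32 multiply is written without a 128-bit type.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_WMULT_CONST 0xffffffffu
#define TOY_WMULT_SHIFT 32

/* (a * mul) >> shift without a 128-bit type; valid for shift <= 32. */
static uint64_t toy_mul_u64_u32_shr(uint64_t a, uint32_t mul, unsigned shift)
{
    uint32_t ah = (uint32_t)(a >> 32);
    uint32_t al = (uint32_t)a;
    uint64_t ret;

    assert(shift <= 32);
    ret = ((uint64_t)al * mul) >> shift;
    if (ah)
        ret += ((uint64_t)ah * mul) << (32 - shift);
    return ret;
}

/* Approximates delta_exec * weight / lw_weight with one multiply and one shift. */
static uint64_t toy_calc_delta(uint64_t delta_exec, uint32_t weight,
                               uint32_t lw_weight)
{
    uint64_t fact = weight;
    int shift = TOY_WMULT_SHIFT;
    uint32_t inv_weight;

    assert(lw_weight > 0);
    inv_weight = TOY_WMULT_CONST / lw_weight;   /* roughly 2^32 / lw_weight */

    fact *= inv_weight;                 /* 32x32 -> 64 bit product */
    while (fact >> 32) {                /* squeeze fact back into 32 bits */
        fact >>= 1;
        shift--;                        /* and compensate in the final shift */
    }
    return toy_mul_u64_u32_shr(delta_exec, (uint32_t)fact, (unsigned)shift);
}
```

Because the inverse is truncated, the result can differ slightly from the exact quotient; for example, toy_calc_delta(1000000, 1024, 1024) returns 999999 rather than 1000000.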

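CFS summary:

1) It no longer distinguishes between process types and does not use the nice value directly to judge priority; instead it uses vruntime to measure how much a process deserves to run.

2) I/O-bound processes still get a fair timeslice, even as their sleep time grows.

3) Higher-priority processes get more CPU time.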
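
Point 3 can be checked with a toy simulation. This is only a sketch under simplifying assumptions: struct toy_task and toy_simulate are hypothetical, a linear scan replaces the red-black tree, and every tick charges exactly tick_ns of run time to the task with the smallest vruntime.

```c
#include <assert.h>
#include <stdint.h>

struct toy_task {
    uint32_t weight;     /* taken from prio_to_weight[] */
    uint64_t vruntime;   /* weighted run time so far */
    unsigned runs;       /* number of ticks this task received */
};

/* Give each of nticks ticks to the task with the smallest vruntime. */
static void toy_simulate(struct toy_task *tasks, int ntasks,
                         unsigned nticks, uint64_t tick_ns)
{
    assert(tasks && ntasks > 0);
    for (unsigned t = 0; t < nticks; t++) {
        int pick = 0;

        for (int i = 1; i < ntasks; i++)
            if (tasks[i].vruntime < tasks[pick].vruntime)
                pick = i;

        assert(tasks[pick].weight > 0);
        /* Same weighting as above: delta * NICE_0_LOAD / weight. */
        tasks[pick].vruntime += tick_ns * 1024 / tasks[pick].weight;
        tasks[pick].runs++;
    }
}
```

Running it with a nice 0 task (weight 1024) against a nice 5 task (weight 335) gives the nice 0 task about three quarters of the ticks, which matches 1024 / (1024 + 335).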

References:
http://lxr.free-electrons.com/source/kernel/sched/fair.c#L214
http://lxr.free-electrons.com/source/kernel/sched/sched.h#L1046


Posted in Kernel Analysis, Linux

Tags: Algorithm CFS schedual

An Analysis of __schedule()

January 22nd, 2015

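Main implementation: http://lxr.free-electrons.com/source/kernel/sched/core.c#L2765

Because there are many scheduling policies, the kernel separates mechanism from policy: the task switch itself is implemented in __schedule(), while the concrete policy is applied in pick_next_task().

The kernel schedules processes in two ways. The first is the periodic scheduler (generic scheduler), which runs at a fixed frequency and schedules processes periodically. The second is the main scheduler, which is called directly when a process goes to sleep or gives up the CPU voluntarily for some other reason.

The main scheduler is __schedule(), and the periodic scheduler is void scheduler_tick(void). The latter is responsible for balancing the rqs so that every CPU has tasks to run, and it is driven by the timer. http://lxr.free-electrons.com/source/kernel/sched/core.c#L2524

__schedule() is the core scheduling function. Its main job is to select a process from the rq. Besides switching the context, it calls pick_next_task() to choose the next process; which scheduling policy is used is recorded in struct sched_class.

On SMP systems the kernel currently schedules normal tasks with the CFS algorithm. Let's look at pick_next_task() first.
The concrete policy is defined in fair_sched_class. The kernel uses a C struct of function pointers to emulate a C++ class, and fair.c defines the many functions behind it, so these act as hook functions. I will not go into the CFS policy here; a later post analyzes the CFS scheduling algorithm on its own.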

static inline struct task_struct *
pick_next_task(struct rq *rq, struct task_struct *prev)
{
        const struct sched_class *class = &fair_sched_class;
        struct task_struct *p;

        /*
         * Optimization: we know that if all tasks are in
         * the fair class we can call that function directly:
         */
        if (likely(prev->sched_class == class &&
                   rq->nr_running == rq->cfs.h_nr_running)) {
                p = fair_sched_class.pick_next_task(rq, prev);
                if (unlikely(p == RETRY_TASK))
                        goto again;

                /* assumes fair_sched_class->next == idle_sched_class */
                if (unlikely(!p))
                        p = idle_sched_class.pick_next_task(rq, prev);

                return p;
        }

again:
        for_each_class(class) {
                p = class->pick_next_task(rq, prev);
                if (p) {
                        if (unlikely(p == RETRY_TASK))
                                goto again;
                        return p;
                }
        }

        BUG(); /* the idle class will always have a runnable task */
}
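
The "struct as class" pattern and the walk over the classes can be reproduced in a few lines of user-space C. This is a toy model under my own assumptions: struct toy_rq, struct toy_sched_class and the task ids (0 for idle, 1 for real-time, 2 for fair) are all hypothetical and much simpler than the kernel structures.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_NO_TASK (-1)

/* Hypothetical run queue: just counts runnable tasks per class. */
struct toy_rq {
    int nr_rt;
    int nr_fair;
};

/* A "class" in C: function pointers plus a link to the next class. */
struct toy_sched_class {
    const struct toy_sched_class *next;
    int (*pick_next_task)(const struct toy_rq *rq); /* id or TOY_NO_TASK */
};

static int toy_pick_rt(const struct toy_rq *rq)
{
    return rq->nr_rt > 0 ? 1 : TOY_NO_TASK;
}

static int toy_pick_fair(const struct toy_rq *rq)
{
    return rq->nr_fair > 0 ? 2 : TOY_NO_TASK;
}

static int toy_pick_idle(const struct toy_rq *rq)
{
    (void)rq;
    return 0;                   /* the idle task is always available */
}

static const struct toy_sched_class toy_idle_class = { NULL, toy_pick_idle };
static const struct toy_sched_class toy_fair_class = { &toy_idle_class, toy_pick_fair };
static const struct toy_sched_class toy_rt_class   = { &toy_fair_class, toy_pick_rt };

static int toy_pick_next_task(const struct toy_rq *rq)
{
    const struct toy_sched_class *class;

    assert(rq);
    /* Fast path: only fair tasks are runnable, so skip the walk. */
    if (rq->nr_rt == 0 && rq->nr_fair > 0)
        return toy_fair_class.pick_next_task(rq);

    /* Slow path: walk the classes from highest to lowest priority. */
    for (class = &toy_rt_class; class; class = class->next) {
        int id = class->pick_next_task(rq);

        if (id != TOY_NO_TASK)
            return id;
    }
    return TOY_NO_TASK;         /* not reached: idle always returns a task */
}
```

The caller never needs to know which policy produced the task; adding a policy means adding one more struct to the chain.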

const struct sched_class fair_sched_class (kernel/sched/fair.c)

In the CFS class, two of the members below are of particular interest:

7944 #ifdef CONFIG_SMP
7945 .select_task_rq = select_task_rq_fair,
7946 .migrate_task_rq = migrate_task_rq_fair,

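With multiple CPUs, processes inevitably run in parallel. Line 7945 is the CFS hook that chooses which CPU's rq a task should be placed on, and line 7946 is the hook called when a task migrates between rqs. Each CPU has its own rq. Despite the name, for CFS this is not a plain queue but a red-black tree. The choice of rq is made in

select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_flags)

This function is where CFS keeps the placement of tasks across CPUs fair!

__schedule() is the main scheduler for processes. Let's analyze its implementation: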

2765 static void __sched __schedule(void)
2766 {
2767         struct task_struct *prev, *next;
2768         unsigned long *switch_count;
2769         struct rq *rq;
2770         int cpu;
2771 
2772 need_resched:
2773         preempt_disable();
2774         cpu = smp_processor_id();
2775         rq = cpu_rq(cpu);
2776         rcu_note_context_switch(cpu);
2777         prev = rq->curr;
2778 
2779         schedule_debug(prev);
2780 
2781         if (sched_feat(HRTICK))
2782                 hrtick_clear(rq);
2783 
2784         /*
2785          * Make sure that signal_pending_state()->signal_pending() below
2786          * can't be reordered with __set_current_state(TASK_INTERRUPTIBLE)
2787          * done by the caller to avoid the race with signal_wake_up().
2788          */
2789         smp_mb__before_spinlock();
2790         raw_spin_lock_irq(&rq->lock);
2791 
2792         switch_count = &prev->nivcsw;
2793         if (prev->state && !(preempt_count() & PREEMPT_ACTIVE)) {
2794                 if (unlikely(signal_pending_state(prev->state, prev))) {
2795                         prev->state = TASK_RUNNING;
2796                 } else {
2797                         deactivate_task(rq, prev, DEQUEUE_SLEEP);
2798                         prev->on_rq = 0;
2799 
2800                         /*
2801                          * If a worker went to sleep, notify and ask workqueue
2802                          * whether it wants to wake up a task to maintain
2803                          * concurrency.
2804                          */
2805                         if (prev->flags & PF_WQ_WORKER) {
2806                                 struct task_struct *to_wakeup;
2807 
2808                                 to_wakeup = wq_worker_sleeping(prev, cpu);
2809                                 if (to_wakeup)
2810                                         try_to_wake_up_local(to_wakeup);
2811                         }
2812                 }
2813                 switch_count = &prev->nvcsw;
2814         }
2815 
2816         if (task_on_rq_queued(prev) || rq->skip_clock_update < 0)
2817                 update_rq_clock(rq);
2818 
2819         next = pick_next_task(rq, prev);
2820         clear_tsk_need_resched(prev);
2821         clear_preempt_need_resched();
2822         rq->skip_clock_update = 0;
2823 
2824         if (likely(prev != next)) {
2825                 rq->nr_switches++;
2826                 rq->curr = next;
2827                 ++*switch_count;
2828 
2829                 context_switch(rq, prev, next); /* unlocks the rq */
2830                 /*
2831                  * The context switch have flipped the stack from under us
2832                  * and restored the local variables which were saved when
2833                  * this task called schedule() in the past. prev == current
2834                  * is still correct, but it can be moved to another cpu/rq.
2835                  */
2836                 cpu = smp_processor_id();
2837                 rq = cpu_rq(cpu);
2838         } else
2839                 raw_spin_unlock_irq(&rq->lock);
2840 
2841         post_schedule(rq);
2842 
2843         sched_preempt_enable_no_resched();
2844         if (need_resched())
2845                 goto need_resched;
2846 }

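Line 2773 disables preemption inside the scheduler. Lines 2774~2777 get the id of the current CPU and its rq, notify RCU of the context switch, and fetch the task currently running on the rq, which is stored in prev.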

#define TASK_RUNNING            0
#define TASK_INTERRUPTIBLE      1
#define TASK_UNINTERRUPTIBLE    2

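TASK_RUNNING has the value 0, so the condition on line 2793 is false when the current process is still runnable: it is not removed from the run queue, and the code moves straight on to updating the rq clock and picking the next task.
Otherwise, if the task holding the CPU is in the TASK_INTERRUPTIBLE state but has received a signal that should wake it, its state is set back to TASK_RUNNING and it waits to be scheduled again. If no such signal is pending, deactivate_task() removes the current process prev from the run queue.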
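
The three possible outcomes for prev can be written down as a small decision function. This is a simplified sketch, not the kernel logic in full: the names toy_classify_prev and enum toy_prev_action are hypothetical, the PREEMPT_ACTIVE test is reduced to a boolean, and the TASK_WAKEKILL case handled by signal_pending_state() is ignored.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_TASK_RUNNING          0
#define TOY_TASK_INTERRUPTIBLE    1
#define TOY_TASK_UNINTERRUPTIBLE  2

enum toy_prev_action {
    TOY_KEEP_ON_RQ,     /* prev stays queued and competes in pick_next_task() */
    TOY_SET_RUNNING,    /* pending signal: state goes back to TASK_RUNNING */
    TOY_DEQUEUE         /* prev really sleeps: deactivate_task() */
};

static enum toy_prev_action toy_classify_prev(long state, bool preempt_active,
                                              bool signal_pending)
{
    assert(state >= 0);
    /* Mirrors: if (prev->state && !(preempt_count() & PREEMPT_ACTIVE)) */
    if (state == TOY_TASK_RUNNING || preempt_active)
        return TOY_KEEP_ON_RQ;
    /* Only an interruptible sleep is cancelled by a pending signal. */
    if ((state & TOY_TASK_INTERRUPTIBLE) && signal_pending)
        return TOY_SET_RUNNING;
    return TOY_DEQUEUE;
}
```

Note that a task in TASK_UNINTERRUPTIBLE is dequeued even when a signal is pending, and that a task preempted on its way to sleep is kept on the run queue.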

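After that, line 2819 calls pick_next_task() to obtain the next process from the current rq, and the need-resched flags of the previous process prev are cleared.
With the new process chosen, the actual switch follows. Lines 2824~2839 check whether the chosen process is the same as the previous one; if it is, no context switch is needed.

Once scheduling is complete, line 2843 re-enables preemption and the system can preempt again.
References: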
http://www.makelinux.net/books/lkd2/ch09lev1sec9


Posted in Kernel Analysis, Linux, Process Management

Tags: Process schedual

Implementing a Priority-Based RR_sched on JOS

April 8th, 2014

JOS initially implements a simple RR algorithm with no priority scheduling.

Below I implement an RR scheduling algorithm with priorities. First we need to add a sys_env_set_priority() system call.


Posted in JOS

Tags: JOS schedual

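Linux Process Scheduling

December 29th, 2013

Processes are either I/O-bound or CPU-bound. To keep interactive applications responsive, Linux optimizes for short response times and tends to schedule I/O-bound processes first.
Linux implements a scheduling method based on dynamic priorities. A base priority is set at the start, and the scheduler is then allowed to raise or lower it as needed. If a process spends more time waiting for I/O than it spends running, it is an I/O-bound process and its dynamic priority is raised. If a process uses up its entire timeslice in one go, it is a CPU-bound process and its dynamic priority is lowered.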


Posted in Kernel Analysis, Linux, Process Management

Tags: schedual
