Scheduling is all about selecting a proper task to run from the list of available tasks. But before the scheduler can do its job, we need to fill this list somehow. How new tasks are created is the main topic of this chapter.
For now, we want to focus only on kernel threads and postpone the discussion of user-mode functionality until the next lesson. However, this will not be possible everywhere, so be prepared to learn a little bit about executing tasks in user mode as well.
When the kernel starts, there is a single task running: the init task. The corresponding task_struct is initialized by the INIT_TASK macro. This task is critical for the system because all other tasks in the system are derived from it.
In Linux it is not possible to create a new task from scratch. Instead, all tasks are forked from a currently running task. Now that we have seen where the initial task comes from, we can explore how new tasks are created from it.
There are 4 ways in which a new task can be created.
- The fork system call creates a full copy of the current process, including its virtual memory, and is used to create new processes (not threads).
- The vfork system call is similar to fork, but the child reuses the parent's virtual memory as well as its stack, and the parent is blocked until the child exits or calls exec.
- The clone system call is the most flexible one. It also copies the current task, but it allows customizing the process via the flags parameter and configuring the entry point for the child task. In the next lesson, we will see how the glibc clone wrapper function is implemented; this wrapper allows using the clone syscall to create new threads.
- Finally, the kernel_thread function can be used to create new kernel threads.
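To make the first bullet concrete, here is a minimal user-space sketch (not kernel code) showing that after fork the child works on its own copy of memory. The helper name fork_copies_memory and the values 1 and 42 are made up for illustration; fork, waitpid and _exit are the standard POSIX calls.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Forks a child that overwrites its private copy of `value` and reports the
 * new value through its exit status. On success returns 0 and fills in what
 * the parent and the child each saw; returns -1 on any failure. */
int fork_copies_memory(int *parent_value, int *child_value)
{
    assert(parent_value != NULL && child_value != NULL);

    int value = 1;
    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        value = 42;      /* touches only the child's copy of the variable */
        _exit(value);    /* leave without running the parent's cleanup code */
    }

    int status = 0;
    if (waitpid(pid, &status, 0) != pid || !WIFEXITED(status))
        return -1;

    *parent_value = value;               /* still the original value */
    *child_value = WEXITSTATUS(status);  /* what the child wrote */
    return 0;
}
```

Calling fork_copies_memory(&p, &c) should leave p at 1 while c holds the value written by the child, which is the "full copy" behavior described above.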
All of the above functions call _do_fork, which accepts the following arguments.
- clone_flags: Flags used to configure fork behavior.
- stack_start: In the case of the clone syscall, this parameter indicates the location of the user stack for the new task. If kernel_thread calls _do_fork, this parameter points to the function that needs to be executed in the kernel thread.
- stack_size: On the arm64 architecture this parameter is only used when _do_fork is called by kernel_thread. It is a pointer to the argument that needs to be passed to the kernel thread function. (And yes, I also find the naming of the last two parameters misleading.)
- parent_tidptr, child_tidptr: These 2 parameters are used only in the clone syscall. Fork will store the child thread ID at the location parent_tidptr in the parent's memory, or it can store the child thread ID at the child_tidptr location in the child's memory.
- tls: Thread Local Storage.
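The double duty of stack_start and stack_size is easier to see in code. Below is a small user-space model, not the kernel implementation: struct model_fork_args, model_kernel_thread, model_run_kernel_thread and double_it are hypothetical names, and casting pointers to unsigned long assumes an LP64-style ABI such as the one used on arm64 Linux.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the values _do_fork receives; not a kernel type. */
struct model_fork_args {
    unsigned long clone_flags;
    unsigned long stack_start;  /* user stack pointer, or the thread function */
    unsigned long stack_size;   /* unused, or the thread function's argument */
};

typedef int (*thread_fn)(void *);

/* Models how kernel_thread squeezes fn and arg into the two stack parameters. */
struct model_fork_args model_kernel_thread(thread_fn fn, void *arg,
                                           unsigned long flags)
{
    assert(fn != NULL);
    struct model_fork_args args = {
        .clone_flags = flags,
        .stack_start = (unsigned long)(uintptr_t)fn,
        .stack_size  = (unsigned long)(uintptr_t)arg,
    };
    return args;
}

/* Recovers fn and arg from the packed parameters and runs the function. */
int model_run_kernel_thread(const struct model_fork_args *args)
{
    assert(args != NULL);
    thread_fn fn = (thread_fn)(uintptr_t)args->stack_start;
    void *arg = (void *)(uintptr_t)args->stack_size;
    return fn(arg);
}

/* Example thread function for the sketch: doubles the integer it is given. */
int double_it(void *arg)
{
    return 2 * *(int *)arg;
}
```

Packing double_it and a pointer to an int with model_kernel_thread, then handing the result to model_run_kernel_thread, calls the function with the original argument even though both values travelled through parameters named after a stack.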
Next, I want to highlight the most important events that take place during _do_fork execution, preserving their order.
- _do_fork calls copy_process. copy_process is responsible for configuring the new task_struct.
- copy_process calls dup_task_struct, which allocates a new task_struct and copies all fields from the original one. The actual copying takes place in the architecture-specific arch_dup_task_struct.
- A new kernel stack is allocated. If CONFIG_VMAP_STACK is enabled, the kernel uses virtually mapped stacks to protect against kernel stack overflow.
- The task's credentials are copied.
- The scheduler is notified that a new task is forked.
- The task_fork_fair method of the CFS scheduler class is called. This method updates the vruntime value for the currently running task (this is done inside the update_curr function) and updates the min_vruntime value for the current runqueue (inside update_min_vruntime). Then the min_vruntime value is assigned to the forked task, which ensures that this task will be picked up next. Note that at this point the new task still hasn't been added to the tasks_timeline.
- A lot of different properties, such as information about filesystems, open files, virtual memory, signals and namespaces, are either reused or copied from the current task. The decision whether to copy something or reuse the current property is usually made based on the clone_flags parameter.
- copy_thread_tls is called, which in turn calls the architecture-specific copy_thread function. This function deserves special attention because it works as a prototype for the copy_process function in the RPi OS, and I want to investigate it more deeply.
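The "reuse or copy" decision driven by clone_flags follows one pattern for most resources. The sketch below is a simplified user-space model of that pattern for a table of open files. struct model_files, model_copy_files and MODEL_CLONE_FILES are hypothetical names, and the flag value is only illustrative.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Illustrative flag value; the real CLONE_FILES lives in the kernel headers. */
#define MODEL_CLONE_FILES 0x400UL

/* Hypothetical, heavily simplified stand-in for a table of open files. */
struct model_files {
    int refcount;
    int fds[4];
};

/* Returns the table the child should use: the parent's table (shared, with
 * the reference count bumped) or a fresh private copy. Returns NULL if the
 * allocation for the copy fails. */
struct model_files *model_copy_files(unsigned long clone_flags,
                                     struct model_files *parent)
{
    assert(parent != NULL);

    if (clone_flags & MODEL_CLONE_FILES) {
        parent->refcount++;      /* reuse: both tasks see one table */
        return parent;
    }

    struct model_files *copy = malloc(sizeof(*copy));
    if (copy == NULL)
        return NULL;
    memcpy(copy, parent, sizeof(*copy));
    copy->refcount = 1;          /* copy: the child owns its own table */
    return copy;
}
```

With the flag set, the child gets the very same pointer as the parent, which is how threads end up sharing state. Without it, the child gets an independent copy, which is the behavior expected from fork.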
The whole copy_thread function is listed below.
    int copy_thread(unsigned long clone_flags, unsigned long stack_start,
                    unsigned long stk_sz, struct task_struct *p)
    {
        struct pt_regs *childregs = task_pt_regs(p);
        memset(&p->thread.cpu_context, 0, sizeof(struct cpu_context));
        if (likely(!(p->flags & PF_KTHREAD))) {
            *childregs = *current_pt_regs();
            childregs->regs[0] = 0;
            /*
             * Read the current TLS pointer from tpidr_el0 as it may be
             * out-of-sync with the saved value.
             */
            *task_user_tls(p) = read_sysreg(tpidr_el0);
            if (stack_start) {
                if (is_compat_thread(task_thread_info(p)))
                    childregs->compat_sp = stack_start;
                else
                    childregs->sp = stack_start;
            }
            /*
             * If a TLS pointer was passed to clone (4th argument), use it
             * for the new thread.
             */
            if (clone_flags & CLONE_SETTLS)
                p->thread.tp_value = childregs->regs[3];
        } else {
            memset(childregs, 0, sizeof(struct pt_regs));
            childregs->pstate = PSR_MODE_EL1h;
            if (IS_ENABLED(CONFIG_ARM64_UAO) &&
                cpus_have_const_cap(ARM64_HAS_UAO))
                childregs->pstate |= PSR_UAO_BIT;
            p->thread.cpu_context.x19 = stack_start;
            p->thread.cpu_context.x20 = stk_sz;
        }
        p->thread.cpu_context.pc = (unsigned long)ret_from_fork;
        p->thread.cpu_context.sp = (unsigned long)childregs;
        ptrace_hw_copy_thread(p);
        return 0;
    }
Some of this code may already look a little familiar to you. Let's dig deeper into it.
    struct pt_regs *childregs = task_pt_regs(p);
The function starts by obtaining a pointer to the pt_regs struct of the new task. This struct provides access to the registers saved during kernel_entry. The childregs variable can then be used to prepare whatever state we need for the newly created task. If the task later decides to move to user mode, the state will be restored by the kernel_exit macro. An important thing to understand here is that the task_pt_regs macro doesn't allocate anything. It just calculates the position on the kernel stack where kernel_entry stores registers, and for a newly created task this position will always be at the top of the kernel stack.
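The "calculates, doesn't allocate" point can be shown with plain pointer arithmetic. This is a user-space sketch under stated assumptions: MODEL_THREAD_SIZE (16 KB here) and the trimmed-down struct model_pt_regs are stand-ins I chose for illustration, and model_task_pt_regs is a hypothetical helper, not the kernel macro itself.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_THREAD_SIZE 16384  /* assumed kernel stack size for this sketch */

/* Simplified register frame; the real arm64 pt_regs has more fields. */
struct model_pt_regs {
    uint64_t regs[31];
    uint64_t sp;
    uint64_t pc;
    uint64_t pstate;
};

/* Pure pointer arithmetic: the frame occupies the highest addresses of the
 * stack area, so nothing is allocated here. The stack area is assumed to be
 * suitably aligned, as memory returned by malloc is. */
struct model_pt_regs *model_task_pt_regs(void *stack_page)
{
    assert(stack_page != NULL);
    char *stack_end = (char *)stack_page + MODEL_THREAD_SIZE;
    return (struct model_pt_regs *)stack_end - 1;
}
```

For any stack area, the returned pointer plus one frame lands exactly at the end of that area, which is what "the top of the kernel stack" means for a stack that grows downwards.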
    memset(&p->thread.cpu_context, 0, sizeof(struct cpu_context));
Next, the cpu_context of the forked task is cleared.
    if (likely(!(p->flags & PF_KTHREAD))) {
Then a check is made to determine whether we are creating a kernel thread or a user thread. For now, we are interested only in the kernel thread case; we will discuss the second option in the next lesson.
    memset(childregs, 0, sizeof(struct pt_regs));
    childregs->pstate = PSR_MODE_EL1h;
    if (IS_ENABLED(CONFIG_ARM64_UAO) &&
        cpus_have_const_cap(ARM64_HAS_UAO))
        childregs->pstate |= PSR_UAO_BIT;
    p->thread.cpu_context.x19 = stack_start;
    p->thread.cpu_context.x20 = stk_sz;
If we are creating a kernel thread, the x19 and x20 registers of the cpu_context are set to point to the function that needs to be executed (stack_start) and its argument (stk_sz). After the CPU switches to the forked task, ret_from_fork will use those registers to jump to the needed function. (I don't quite understand why we also need to set childregs->pstate here. ret_from_fork will not call kernel_exit before jumping to the function stored in x19, and even if the kernel thread decides to move to user mode, childregs will be overwritten anyway. Any ideas?)
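Here is how I picture the decision ret_from_fork makes based on x19 and x20, rendered as a user-space C model rather than the real assembly. All model_ names and add_one are hypothetical. The sketch assumes that a zero x19 (left over from the memset of cpu_context) means "this is a user task".

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Only the two callee-saved registers this sketch cares about. */
struct model_cpu_context {
    uint64_t x19;  /* kernel thread function, or 0 for a user task */
    uint64_t x20;  /* argument for that function */
};

typedef int (*thread_fn)(void *);

enum model_path { MODEL_RAN_KERNEL_FN, MODEL_RETURN_TO_USER };

/* Models the choice made on the first switch to a forked task: a non-zero
 * x19 means "call x19(x20)"; zero means go straight to the user-mode return
 * path. The function's result is stored in *fn_result when it runs. */
enum model_path model_ret_from_fork(const struct model_cpu_context *ctx,
                                    int *fn_result)
{
    assert(ctx != NULL && fn_result != NULL);

    if (ctx->x19 == 0)
        return MODEL_RETURN_TO_USER;

    thread_fn fn = (thread_fn)(uintptr_t)ctx->x19;
    *fn_result = fn((void *)(uintptr_t)ctx->x20);
    return MODEL_RAN_KERNEL_FN;
}

/* Example kernel thread body for the sketch. */
int add_one(void *arg)
{
    return *(int *)arg + 1;
}
```

A context with x19 pointing at add_one takes the kernel-function path, while an all-zero context falls through to the user-mode path, so the memset of cpu_context earlier in copy_thread doubles as the "not a kernel thread" marker.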
    p->thread.cpu_context.pc = (unsigned long)ret_from_fork;
    p->thread.cpu_context.sp = (unsigned long)childregs;
Next, cpu_context.pc is set to the ret_from_fork pointer. This ensures that we return to ret_from_fork after the first context switch. cpu_context.sp is set to the location just below childregs. We still need childregs at the top of the stack because, after the kernel thread finishes its execution, the task will be moved to user mode and the childregs structure will be used. In the next lesson, we will discuss in detail how this happens.
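Why pointing sp at childregs keeps the frame safe is easy to check with a toy stack. The model below is a sketch with made-up sizes (MODEL_STACK_WORDS, MODEL_FRAME_WORDS) and hypothetical helpers; it only assumes a downward-growing stack where a push decrements the pointer before storing.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MODEL_STACK_WORDS 64
#define MODEL_FRAME_WORDS 8   /* stand-in for sizeof(struct pt_regs) */

struct model_task {
    uint64_t stack[MODEL_STACK_WORDS];  /* kernel stack, grows downwards */
    uint64_t *sp;                       /* models cpu_context.sp */
};

/* Places the saved-register frame in the top words of the stack, fills it
 * with a marker pattern, and points sp at its first word so that later
 * pushes land below the frame. */
uint64_t *model_setup_stack(struct model_task *t)
{
    assert(t != NULL);
    uint64_t *childregs = t->stack + MODEL_STACK_WORDS - MODEL_FRAME_WORDS;
    memset(childregs, 0xAB, MODEL_FRAME_WORDS * sizeof(uint64_t));
    t->sp = childregs;
    return childregs;
}

/* Descending-stack push: decrement first, then store. */
void model_push(struct model_task *t, uint64_t value)
{
    assert(t != NULL && t->sp > t->stack);
    *--t->sp = value;
}
```

After a few calls to model_push, sp has moved below the frame and the marker pattern inside the frame is untouched, so the saved registers are still there when the task eventually heads for user mode.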
That's it for the copy_thread function. Now let's return to the place in the fork procedure where we left off.
- After copy_process successfully prepares the task_struct for the forked task, _do_fork can run it by calling wake_up_new_task. There, the task state is changed to TASK_RUNNING and the enqueue_task_fair CFS method is called, which triggers execution of __enqueue_entity, which actually adds the task to the tasks_timeline red-black tree.
- Next, check_preempt_curr is called, which in turn calls the check_preempt_wakeup CFS method. This method is responsible for checking whether the current task should be preempted by some other task. That is exactly what is going to happen because we have just put a new task on the timeline that has the minimal possible vruntime. So the resched_curr function is triggered, which sets the TIF_NEED_RESCHED flag for the current task.
- TIF_NEED_RESCHED is checked just before the current task exits from an exception handler (fork, vfork and clone are all system calls, and each system call is a special type of exception). Note that _TIF_WORK_MASK includes _TIF_NEED_RESCHED. It is also important to understand that, in the case of kernel thread creation, the new thread will not be started until the next timer tick or until the parent task voluntarily calls schedule().
- If the current task needs to be rescheduled, do_notify_resume is triggered, which in turn calls schedule. We have finally reached the point where task scheduling is triggered, and we are going to stop here.
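The last three steps boil down to "set a flag now, act on it when leaving the exception handler". The following user-space model sketches that handshake. The MODEL_TIF_ bit positions are illustrative, struct model_task and both functions are hypothetical, and clearing the flag inside the check is a simplification of what the real scheduler does.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustrative bit positions; the real values are architecture specific. */
#define MODEL_TIF_SIGPENDING   (1UL << 0)
#define MODEL_TIF_NEED_RESCHED (1UL << 1)
#define MODEL_TIF_WORK_MASK    (MODEL_TIF_SIGPENDING | MODEL_TIF_NEED_RESCHED)

struct model_task {
    unsigned long flags;
    int schedule_calls;  /* counts how often the model "called" schedule */
};

/* What resched_curr boils down to in this sketch: mark the running task. */
void model_resched_curr(struct model_task *curr)
{
    assert(curr != NULL);
    curr->flags |= MODEL_TIF_NEED_RESCHED;
}

/* Models the check on the way out of an exception handler. Returns true if
 * any pending work was found, and "schedules" when a reschedule was asked
 * for. */
bool model_exception_return(struct model_task *curr)
{
    assert(curr != NULL);

    if (!(curr->flags & MODEL_TIF_WORK_MASK))
        return false;                    /* fast path: nothing to do */

    if (curr->flags & MODEL_TIF_NEED_RESCHED) {
        curr->flags &= ~MODEL_TIF_NEED_RESCHED;
        curr->schedule_calls++;          /* stands in for schedule() */
    }
    return true;
}
```

Note that model_resched_curr never switches tasks by itself. The switch only happens once model_exception_return runs, which mirrors why a freshly created kernel thread has to wait for the next exception return or an explicit schedule() call.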
Now that you understand how new tasks are created and added to the scheduler, it is time to take a look at how the scheduler itself works and how the context switch is implemented. That is something we are going to explore in the next chapter.