Solaris Process Scheduling
In Solaris, the threads with the highest priorities are scheduled first. Kernel thread
scheduling information can be revealed with ps -elcL.
A process can exist in one of the following states: running, sleeping or
ready.
Kernel Threads Model
The Solaris 10 kernel threads model consists of the following major objects:
- kernel threads: This is what is scheduled/executed on a processor.
- user threads: The user-level thread state within a process.
- process: The object that tracks the execution environment of a program.
- lightweight process (lwp): Execution context for a user thread.
Associates a user thread with a kernel thread.
In the Solaris 10 kernel, kernel services and tasks are executed as kernel threads.
When a user thread is created, the associated lwp and kernel threads are also created
and linked to the user thread.
(This single-level model was first introduced in Solaris 8's alternative threads library,
which was made the default in Solaris 9. Prior to that, user threads had to bind to
an available lwp before becoming eligible to run on the processor.)
Priority Model
The Solaris kernel is fully preemptible. This means that all threads, including the
threads that support the kernel's own activities, can be deferred to allow a higher-
priority thread to run.
Solaris recognizes 170 different priorities, 0-169. Within these priorities fall a
number of different scheduling classes:
- TS (timeshare): This is the default class for processes and their associated
kernel threads. Priorities within this class range 0-59, and are dynamically adjusted
in an attempt to allocate processor resources evenly.
- IA (interactive): This is an enhanced version of the TS class that applies
to the in-focus window in the GUI. Its intent is to give extra resources to processes
associated with that specific window. Like TS, IA's range is 0-59.
- FSS (fair-share scheduler): This class is share-based rather than priority-
based. Threads managed by FSS are scheduled based on their associated shares and the
processor's utilization. FSS also has a range 0-59.
- FX (fixed-priority): The priorities for threads associated with this class
are fixed. (In other words, they do not vary dynamically over the lifetime of the
thread.) FX also has a range 0-59.
- SYS (system): The SYS class is used to schedule kernel threads. Threads
in this class are "bound" threads, which means that they run until they block or complete.
Priorities for SYS threads are in the 60-99 range.
- RT (real-time): Threads in the RT class are fixed-priority, with a fixed
time quantum. Their priorities range 100-159, so an RT thread will preempt a system
thread.
Of these, FSS and FX were implemented in Solaris 9. (An extra-cost option for Solaris 8
included the SHR (share-based) class, but this has been subsumed into FSS.)
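The priority bands above can be summarized in a short Python sketch. It is only an illustration of the ranges listed in this section, not kernel code. The names BANDS, band_for and preempts are hypothetical, and the 160-169 interrupt band is an assumption, since the text above only gives the overall 0-169 range.

```python
# Global priority bands as listed in this section.
# The 160-169 interrupt band is an assumption; the text only states 0-169 overall.
BANDS = [
    (0, 59, "TS/IA/FSS/FX"),
    (60, 99, "SYS"),
    (100, 159, "RT"),
    (160, 169, "interrupt (assumed)"),
]


def band_for(priority: int) -> str:
    """Return the name of the band that contains a global priority."""
    for low, high, name in BANDS:
        if low <= priority <= high:
            return name
    raise ValueError(f"priority {priority} is outside 0-169")


def preempts(a: int, b: int) -> bool:
    """A thread at global priority a preempts one at b only if a is higher."""
    return a > b


print(band_for(45))       # TS/IA/FSS/FX
print(band_for(75))       # SYS
print(preempts(100, 99))  # True: the lowest RT priority beats the highest SYS priority
```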
Fair Share Scheduler
The default Timesharing (TS) scheduling class in Solaris attempts to
allow each process on the system to have relatively equal CPU access.
The nice command allows some management of process
priority, but the new Fair Share Scheduler (FSS) allows more flexible
process priority management that integrates with the
project framework.
Each project is allocated a certain number of CPU shares via the
project.cpu-shares
resource control. Each project is allocated CPU time based on
its cpu-shares value divided by the sum of the
cpu-shares values for all active projects.
Anything with a zero cpu-shares value will not be granted CPU
time until all projects with non-zero
cpu-shares are done with the CPU.
The maximum number of shares that can be assigned to any one project is 65535.
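The share arithmetic described above can be sketched in Python. This is a minimal illustration of the ratio only (a project's shares divided by the sum over active projects), not of how FSS adjusts thread priorities internally. The function name cpu_fractions and the project names are hypothetical.

```python
MAX_SHARES = 65535  # per-project limit stated above


def cpu_fractions(shares: dict[str, int]) -> dict[str, float]:
    """Map each active project to its fraction of CPU time under contention."""
    for name, value in shares.items():
        if not 0 <= value <= MAX_SHARES:
            raise ValueError(f"{name}: shares must be in 0..{MAX_SHARES}")
    total = sum(shares.values())
    if total == 0:
        # This sketch does not model the case where no project holds any shares.
        raise ValueError("no project has a non-zero share")
    return {name: value / total for name, value in shares.items()}


# Hypothetical projects, all assumed to be active and competing for the CPU.
example = cpu_fractions({"web": 30, "db": 60, "batch": 10, "idle": 0})
for project, fraction in example.items():
    print(f"{project}: {fraction:.0%}")
```

Because the result is a ratio, only the relative sizes of the share values matter: 3 and 6 shares give the same split as 30 and 60.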
FSS can be assigned to processor sets, resulting in more
sensitive control of priorities on a server than raw processor sets.
The dispadmin command controls the
assignment of schedulers to processor sets, using a form like:
dispadmin -d FSS
To enable this change now, rather than after the next
reboot, run a command like the following:
priocntl -s -c FSS
prctl can control cpu-shares for a project:
prctl -r -n project.cpu-shares -v number-shares
-i project project-name
The Fair Share Scheduler should not be combined with the TS, FX (fixed-priority)
or IA (interactive) scheduling classes on the same CPU or processor set. All of
these scheduling classes use priorities in the same range, so unexpected behavior
can result from combining FSS with any of these. (There is no problem, however,
with running TS and IA on the same processor set.)
To move a specific project's processes into FSS, run something like:
priocntl -s -c FSS -i projid project-ID
All processes can be moved into FSS by first converting init, then the
rest of the processes:
priocntl -s -c FSS -i pid 1
priocntl -s -c FSS -i all
Implementation Details
Time Slicing for TS and IA
TS and IA scheduling classes implement an adaptive time slicing scheme
that increases the priority of I/O-bound processes at the expense of
compute-bound processes. The exact values that are used to implement
this can be found in the dispatch table. To examine the TS dispatch
table, run the command dispadmin -c TS -g. (If units are
not specified, dispadmin reports time values in ms.)
The following values are reported in the dispatch table:
- ts_quantum: This is the default length of time assigned to a
process with the specified priority.
- ts_tqexp: This is the new priority that is assigned to a
process that uses its entire time quantum.
- ts_slpret: The new priority assigned to a process that
blocks before using its entire time quantum.
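- ts_maxwait: If a thread does not receive CPU time during
a time interval of ts_maxwait, its priority is raised
to ts_lwait.
- ts_lwait: The new priority assigned to a thread that has waited
longer than ts_maxwait without receiving CPU time.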
The man page for ts_dptbl contains additional information
about these parameters.
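The way these columns drive priority changes can be illustrated with a toy table in Python. This is a sketch under stated assumptions: the Row class, toy_table and next_priority are hypothetical names, and the numbers are invented for illustration. They are not the Solaris default dispatch table values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Row:
    quantum: int  # ts_quantum: time slice for this priority
    tqexp: int    # ts_tqexp: new priority after using the whole quantum
    slpret: int   # ts_slpret: new priority after blocking before the quantum ends
    maxwait: int  # ts_maxwait: wait threshold before the lwait boost applies
    lwait: int    # ts_lwait: new priority after waiting longer than maxwait


def toy_table() -> list[Row]:
    """Build a hypothetical 60-row table; these are NOT the Solaris defaults."""
    return [
        Row(
            quantum=200 - 3 * pri,       # shorter slices at higher priority
            tqexp=max(pri - 10, 0),      # compute-bound threads sink
            slpret=min(pri + 10, 59),    # I/O-bound threads rise
            maxwait=0,
            lwait=min(pri + 5, 59),      # starved threads get a smaller boost
        )
        for pri in range(60)
    ]


def next_priority(table: list[Row], pri: int, event: str) -> int:
    """Look up the new priority for a thread at priority pri after an event."""
    row = table[pri]
    if event == "quantum_expired":
        return row.tqexp
    if event == "sleep_return":
        return row.slpret
    if event == "waited_too_long":
        return row.lwait
    raise ValueError(f"unknown event: {event}")


table = toy_table()
pri = 29
for _ in range(3):
    pri = next_priority(table, pri, "quantum_expired")
print(pri)  # 0: three expired quanta take the thread from 29 to 19 to 9 to 0
```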
dispadmin can be used to edit the dispatch table to affect
the decay of priority for compute-bound processes or the growth in
priority for I/O-bound processes. Obviously, the importance of the
different types of processing on different systems will make a
difference in how these parameters are tweaked. In particular,
ts_maxwait and ts_lwait can prevent
CPU starvation, and raising ts_tqexp slightly can
slow the decline in priority of CPU-bound processes.
In any case, the dispatch tables should only be altered slightly
at each step in the tuning process, and should only be altered
at all if you have a specific goal in mind.
The following are some of the sorts of changes that can be made:
- Decreasing ts_quantum favors IA class objects.
- Increasing ts_quantum favors compute-bound objects.
- ts_maxwait and ts_lwait control CPU starvation.
- ts_tqexp can cause compute-bound objects' priorities to
decay more or less rapidly.
- ts_slpret can cause I/O-bound objects' priorities to
rise more or less rapidly.
IA objects add 10 to the regular TS priority of the process in the
active window. This priority shifts with the focus on the active
window.
RT objects time slice differently in that ts_tqexp and
ts_slpret do not increase or decrease the priority of the
object. Each RT thread will execute until its time slice is up or it is
blocked while waiting for a resource.
Time Slicing for FSS
In FSS, the time quantum is the length of time that a thread is allowed
to run before it has to release the processor. This can be checked using:
dispadmin -c FSS -g
The QUANTUM is reported in ms. (The output of the above command
displays the resolution in the RES parameter. The default is
1000 slices per second.) It can be adjusted using
dispadmin as well. First, run the above command and capture the
output to a text file (filename.txt). Edit the QUANTUM value in that
file, then run the command:
dispadmin -c FSS -s filename.txt
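The relationship between QUANTUM and RES is simple arithmetic, sketched below in Python. The function name quantum_ms is hypothetical, and the value 110 is used purely as an illustration, not as a claim about the default on any particular system.

```python
def quantum_ms(quantum: int, res: int = 1000) -> float:
    """Convert a QUANTUM value to milliseconds, given RES units per second."""
    if res <= 0:
        raise ValueError("RES must be positive")
    return quantum * 1000.0 / res


print(quantum_ms(110))       # 110.0: with RES=1000 the quantum is already in ms
print(quantum_ms(110, 100))  # 1100.0: the same number means more time at a coarser RES
```

When editing the captured file, it is worth checking the RES line first so that the new QUANTUM is interpreted in the intended units.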
Callouts
Solaris handles callouts with a callout thread that runs at maximum
system priority, which is still lower than any RT thread. RT callouts
are handled separately and are invoked at the lowest interrupt level,
which ensures prompt processing.
Priority Inheritance
Each thread has two priorities: global priority and inherited
priority. The inherited priority is normally zero unless the thread
is sitting on a resource that is required by a higher priority thread.
When a thread blocks on a resource, it attempts to "will" or pass on
its priority to all threads that are directly or indirectly blocking
it. The pi_willto() function checks each thread that is
blocking the resource or that is blocking a thread in the synchronization
chain. When it sees threads that are a lower priority, those threads
inherit the priority of the blocked thread. It stops traversing the
synchronization chain when it hits an object that is not blocked or
is higher priority than the willing thread.
This mechanism is of limited use for condition variables,
semaphores and read/write locks. For read/write locks, an
owner-of-record is defined, and the inheritance works as above. If there are
several threads sharing a read lock, however, the inheritance only
works on one thread at a time.
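The chain traversal described for pi_willto() can be modelled with a small Python sketch. This is a simplified illustration, not the kernel implementation: it assumes each lock has a single owner, and the Thread, Lock and will_priority names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Thread:
    name: str
    priority: int                        # global priority
    inherited: int = 0                   # inherited priority, normally zero
    blocked_on: Optional["Lock"] = None  # lock this thread is waiting for, if any

    @property
    def effective(self) -> int:
        return max(self.priority, self.inherited)


@dataclass
class Lock:
    owner: Optional[Thread] = None       # single owner assumed in this sketch


def will_priority(blocked: Thread) -> list[str]:
    """Pass the blocked thread's priority down the chain of lock owners."""
    boosted = []
    willed = blocked.effective
    lock = blocked.blocked_on
    while lock is not None and lock.owner is not None:
        owner = lock.owner
        if owner.effective >= willed:
            break                        # owner already runs at least this high
        owner.inherited = willed
        boosted.append(owner.name)
        lock = owner.blocked_on          # None if the owner is not blocked
    return boosted


# high waits on lock_b (held by mid), and mid waits on lock_a (held by low).
lock_a, lock_b = Lock(), Lock()
low, mid, high = Thread("low", 10), Thread("mid", 20), Thread("high", 50)
lock_a.owner, lock_b.owner = low, mid
mid.blocked_on, high.blocked_on = lock_a, lock_b

print(will_priority(high))  # ['mid', 'low']
print(low.effective)        # 50
```

The single-owner assumption is what makes the walk straightforward. A read lock shared by several threads has no single owner to boost, which is why inheritance is less effective there.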
Thundering Herd
When a resource is freed, all threads awaiting that resource are
woken. This results in a footrace to obtain access to that object;
one succeeds and the others return to sleep. This can lead to
wasted overhead for context switches, as well as a problem with
lower priority threads obtaining access to an object before a
higher-priority thread. This is called a "thundering herd"
problem.
Priority inheritance is an attempt to
deal with this problem, but some types of synchronization do not
use inheritance.
Turnstiles
Each synchronization object (lock) contains a pointer to a structure
known as a turnstile. These contain the data needed to
manipulate the synchronization object, such as a queue of blocked
threads and a pointer to the thread that is currently using the
resource. Turnstiles are dynamically allocated based on the number
of allocated threads on the system. A turnstile is allocated by
the first thread that blocks on a resource and is freed when no
more threads are blocked on the resource.
Turnstiles queue the blocked threads according to their priority.
Turnstiles may issue a signal to wake up the highest-priority
thread, or they may issue a broadcast to wake up all sleeping
threads.
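The queueing behavior described above can be sketched in Python with a priority queue. This is a toy model, not the kernel data structure. The Turnstile class and its method names are hypothetical, and breaking ties in first-come order is an assumption.

```python
import heapq
import itertools


class Turnstile:
    """Toy model: a priority-ordered queue of threads blocked on one lock."""

    def __init__(self):
        self._heap = []
        self._order = itertools.count()  # tie-breaker: first blocked, first woken

    def block(self, name: str, priority: int) -> None:
        # heapq is a min-heap, so the priority is negated to pop the highest first.
        heapq.heappush(self._heap, (-priority, next(self._order), name))

    def signal(self):
        """Wake only the highest-priority sleeper, or return None if empty."""
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]

    def broadcast(self) -> list[str]:
        """Wake every sleeper, highest priority first."""
        woken = []
        while self._heap:
            woken.append(heapq.heappop(self._heap)[2])
        return woken


ts = Turnstile()
ts.block("a", 10)
ts.block("b", 50)
ts.block("c", 30)
print(ts.signal())     # b
print(ts.broadcast())  # ['c', 'a']
```

A signal wakes one thread and avoids the footrace described under Thundering Herd, while a broadcast wakes every sleeper and leaves them to compete for the resource.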
Adjusting Priorities
The priority of a process can be adjusted with priocntl
or nice, and the priority of an LWP can be controlled
with priocntl().
Real Time Issues
STREAMS processing is moved into its own kernel threads, which run
at a lower priority than RT threads. If an RT thread places a
STREAMS request, it may be serviced at a lower priority level than
is merited.
Real time processes also lock all their pages in memory. This
can cause problems on a system that is underconfigured for the
amount of memory that is required.
Since real time processes run at such a high priority, system daemons
may suffer if the real time process does not permit them to run.
When a real time process forks, the new process also inherits
real time privileges. The programmer must take care to prevent
unintended consequences. Loops can also be hard to stop, so
the programmer also needs to make sure that the program does
not get caught in an infinite loop.
Interrupts
Interrupt levels run between 0 and 15. Some typical interrupts include:
- soft interrupts
- SCSI/FC disks (3)
- Tape, Ethernet
- Video/graphics
- clock() (10)
- serial communications
- real-time CPU clock
- Nonmaskable interrupts (15)