DTrace Introduction
DTrace is Solaris 10's new Dynamic Tracing facility. It allows
us to peer into the innards of running processes and customize
our view to exclude extraneous information and
close in on the source of a problem.
DTrace also has capabilities that allow us to examine a crash
dump or trace the boot process.
A number of freely available scripts have been made available
as the
DTrace Toolkit. The toolkit provides both programming
examples and also extremely useful tools for different types
of system monitoring.
The DTrace facility provides data to a number of consumers,
including commands such as dtrace and
lockstat, as well as programs calling libraries
that access DTrace through the dtrace kernel driver.
Probes
DTrace is built on a foundation of objects called probes.
Probes are event handlers that fire
when their particular event occurs. DTrace can bind a particular
action to the probe to make use of the information.
Probes report on a variety of information about
their event. For example, a probe for a kernel function may
report on arguments, global variables, timestamps, stack traces,
currently running processes or the thread that called the function.
Kernel modules that enable probes are packaged into sets
known as providers.
In a DTrace context, a module is a kernel module (for
kernel probes) or a library name (for applications). A
function in DTrace refers to the function associated with
a probe, if it belongs to a program location.
Probes may be uniquely addressed by a combination of the provider,
module, function and name. These are frequently organized into a
4-tuple when invoked by the
dtrace command.
Alternatively, each probe has a unique integer identifier,
which can vary depending on Solaris patch level.
These numbers, as well as the provider, module, function and
name, can be listed out through the dtrace -l
command. The list will vary from system to system, depending
on what is installed. Probes can be listed by function, module
or name by specifying it with the -f, -m
or -n options, respectively.
Running a dtrace without a -l, but
with a -f, -m or -n option,
enables all matching probes. All the probes in a provider can be
enabled by using the -P option. An individual probe
can be enabled by using its 4-tuple with the -n
option.
(Note: Do not enable more probes than necessary.
If too many probes are enabled, it may adversely impact performance.
This is particularly true of sched probes.)
Some probes do not list a module or function. These are called
"unanchored" probes. Their 4-tuple just omits the nonexistent
information.
Providers
Providers are kernel modules that create related groups of probes.
The most commonly referenced providers are:
fbt: (Function Boundary Tracing)
Implements probes at the entry
and return points of almost all kernel functions.
io: Implements probes for I/O-related events.
pid: Implements probes for user-level processes
at entry, return and instruction.
proc: Implements probes for process creation
and life-cycle events.
profile: Implements timer-driven probes.
sched: Implements probes for scheduling-related
events.
sdt: (Statistically Defined Tracing) Implements
programmer-defined probes at arbitrary locations and names within
code. Obviously, the programmer should define names whose meaning
is intuitively clear.
syscall: Implements entry and return probes
for all system calls.
sysinfo: Probes for updates to the sys
kstat.
vminfo: Probes for updates to the vm
kstat.
Command Components
The dtrace command has several components:
- A 4-tuple identifier:
provider:module
:function:name
Leaving any of these blank is equivalent to using a wildcard
match. (If left blank, the left-most members of the 4-tuple
are optional.)
- A predicate determines whether the action should be taken.
They are enclosed in slashes: /predicate/. The predicate
is a C-style relational expression which must evaluate to an
integer or pointer. If omitted, the action is executed when the
probe fires. Some predicate examples are:
- executable name matches csh:
/execname == "csh"/
- process ID does not match 1234:
/pid != 1234/
arg0 is 1 and arg1 is not 0:
/arg0 == 1 && arg1 !=0/
- An action (in the D scripting language) to be taken when the
probe fires and the predicate is satisfied. Typically, this is
listed in curly brackets: {}
Several command
examples are provided at the bottom of the page.
D Scripting Language
In order to deal with operations that can become confusing on a single
command line, a D script can be saved to a file and run as desired.
A D script will have one or more
probe clauses, which consist of one or more
probe-descriptions,
along with the associated predicates
and actions:
#!/usr/sbin/dtrace -s
probe-description[, probe-description...]
/predicate/
{
action; [action; ...]
}
The probe-description
section consists of one or more
4-tuple identifiers.
If the predicate line is not present, it is the same
as a predicate that is always true.
The action(s) specified are to be run if the probe
fires and the predicate is true.
Each recording action dumps data to a trace buffer. By
default, this is the principal buffer.
Several programming
examples are provided at the bottom of the page.
D Variables
D specifies both associative arrays and scalar variables.
Storage for these variables is not pre-allocated. It is allocated
when a non-zero value is assigned and deallocated when a zero value
is assigned.
D defines several built-in variables, which are frequently
used in creating predicates and actions. The most commonly
used built-in variables for D are the following:
args[]: The args[] array
contains the arguments, specified from 0 to the number
of arguments less one. These can also be specified by
argn, where this is the n+1th
argument.
curpsinfo: psinfo structure
of current process.
curthread: pointer to the current thread's
kthread_t
execname: Current executable name
pid: Current process ID
ppid: Parent process ID
probefunc: function name of the current probe
probemod: module name of the current probe
probename: name of the current probe
timestamp: Time since boot in ns
Variable scope can be global, thread-local or clause-local.
Thread-local variables allow separate storage for each thread's
copy of that variable. They are referenced with names of the form:
self->variable-name
Associative arrays can be indexed by an arbitrary name.
There is no pre-defined limit on the number of elements.
Scalar variables hold a single data value.
We can also access Solaris kernel symbols by specifying them
in backquotes.
Providers define arguments based on their own requirements.
Some of the more useful such arguments are listed below:
| Provider-Specific Variables |
| io |
arg[0] |
Pointer to a bufinfo structure. |
arg[0]->b_bcount |
Byte count. |
arg[0]->b_resid |
Bytes not transferred. |
arg[0]->b_iodone |
I/O completion routine. |
arg[0]->b_edev |
Extended device. |
arg[1] |
Pointer to a devinfo structure. |
arg[1]->dev_major |
Major number. |
arg[1]->dev_minor |
Minor number. |
arg[1]->dev_instance |
Instance number. |
arg[1]->dev_name |
Device name. |
arg[1]->dev_pathname |
Device pathname. |
arg[2] |
Pointer to a fileinfo structure. |
arg[2]->fi_name |
File name. |
arg[2]->fi_dirname |
File directory location. |
arg[2]->fi_pathname |
Full path to file. |
arg[2]->fi_offset |
Offset within a file. |
arg[2]->fi_fs |
Filesystem. |
arg[2]->fi_mount |
Filesystem mount point. |
| pid |
arg0-argn (entry) |
For entry probes, arg0-argn represent the
arguments. |
arg0-arg1 (return) |
For return probles, arg0-arg1 represent
the return codes. |
| syscall |
arg0-argn (entry) |
For entry probes, arg0-argn represent the
arguments. |
arg0-arg1 (return) |
For return probles, arg0-arg1 represent
the return codes. |
| sysinfo |
arg0 |
Value of statistic increment |
arg1 |
Pointer to the current value of the statistic before increment. |
arg2 |
Pointer to the cpu_t structure incrementing
the statistic. Defined in sys/cpuvar.h |
| vminfo |
arg0 |
Value of statistic increment |
arg1 |
Pointer to the current value of the statistic before increment. |
The full list for each provider can be found in
the provider's chapter of the
Solaris Dynamic Tracing Guide.
Actions
The most commonly used built-in actions are:
breakpoint():
System stops and transfers control to kernel debugger.
chill(number-nanoseconds):
DTrace spins for the specified number of nanoseconds.
copyinstr(pointer):
Returns null terminated string from address space referenced
by pointer.
copyout(buffer, address, number-bytes):
Copies number-bytes from the buffer to a memory address.
copyoutstr(string, address, max-length):
Copies a string to a memory address.
normalize(aggregation, normalization-factor):
Divides aggregation values by the normalization-factor.
panic():
Panics the kernel; may be used to generate a core dump.
printf(format, arguments):
Dumps the arguments to the buffer in the specified format.
printa(aggregation) does the same thing
for aggregation data.
raise(signal):
Sends the signal to the current process.
stack(number-frames):
Copies the specified number of frames of the kernel thread's
stack to the buffer.
stop():
Stops the process that fired the probe.
stringof(): Converts values to
DTrace string values.
system(command):
Runs a program as if from the shell.
trace(D-expression):
Dumps the output of D-expression to the trace buffer.
tracemem(address, size_t
number-bytes): Dumps the contents from
the memory address to the buffer.
trunc(aggregation):
Truncates or removes the contents of the specified
aggregation.
ustack(number-frames):
Copies the specified number of frames of the user stack to the buffer.
Multiple actions in a probe clause
can be combined using a semicolon between them inside the curly brackets.
Aggregations
Aggregating functions allow multiple data points to be combined
and reported. Aggregations take the form:
@name[ keys ] = aggregating-function(
arguments );
Here, the name is a name assigned to the aggregation, the keys
are a comma-separated list of D expressions which index the output,
the arguments are a comma-separated list and the aggregating
functions may be one of the following:
avg: Average of the expressions in the arguments.
count: Number of times that the function is called.
lquantize: The arguments are a scalar expression, a
lower bound, an upper bound and a step value. This function increments
the value in the highest linearly-sized bucket that is less than the
expression.
max: The largest value among the arguments.
min: The smallest value among the arguments.
quantize: Increments the value in the highest power
of two bucket less than the expression in the argument.
sum: Total value of the expressions in the arguments.
Directives
Directives provide options to a D script that can increase readability of
the output. Each option is enabled by including a line like the
following at the beginning of a D script:
#pragma D option option-name
The following are the most commonly-used directives.
flowindent: Indentation increased on function entry;
decreased on function return.
quiet: Don't print anything not explicitly specified.
Examples
To report on each active PID for the readch function:
dtrace -n 'readch {trace(pid)}'
To report on each system call's entry time:
dtrace -n 'syscall:::entry {trace(timestamp)}'
Tracing executable names (stored on a UFS filesystem):
dtrace -m 'ufs {trace(execname)}'
To print all functions from libc for process 123:
dtrace -n 'pid123:libc::entry'
To count the occurrences of each libc function for PID 123:
#!/usr/sbin/dtrace -s
pid123:libc::entry
{
@function_count[probefunc]=count();
}
(Once this script is stopped with control-c, the results will be printed
out.)
To identify how much time is spent in each function, we would
add a probe-clause as follows. In the below, the
timestmp[] array stores the timestamps
for each function as it is entered; the aggregation
function_duration[] stores the sum total
of durations for invocations of each function. Also
note that the scope of the timestmp[] array
is limited to a single thread to deal with the possibility
of multiple concurrent copies of the same function:
#!/usr/sbin/dtrace -s
pid123:libc::entry
{
self->timestmp[probefunc] = timestmp;
}
pid123:libc::return
/self->timestmp[probefunc] != 0/
{
@function_duration[probefunc] = sum(timestmp - self->timestmp[probefunc]);
timestmp[probefunc] = 0;
}
The following script prints out the name of every program
that calls the exec or exece
system calls, as well as the name of the program
being executed:
#!/usr/sbin/dtrace -s
syscall::exec*:entry
{
trace(execname);
trace(copyinstr(arg0));
}
While this script is straightforward, it may fail if
arg0 has been written to disk as part of
paging activity. (See Bennett, Part 2.) In that event,
it may make more sense to use a variable to hold the
pointer until return pulls any pointers
back into memory. The below also uses printf to format
the output and turn off default output with "quiet:"
#!/usr/sbin/dtrace -s
#pragma D option quiet
syscall::exec*:entry
{
self->prog = copyinstr(arg0);
self->exn = execname;
}
syscall::exec*:return
/ self->prog != NULL /
{
printf ("%-20s %s\n", self->exn, self->prog);
self->prog = 0;
self->exn = 0;
}
DTrace Toolkit
A number of freely available scripts have been made available
as the
DTrace Toolkit. The toolkit provides both programming
examples and also extremely useful tools for different types
of system monitoring.
| DTrace Toolkit |
| Script Name |
Description |
anonpgpid.d |
Attempts to identify which processes are suffering the
most from a system that is hard swapping |
bitesize.d |
Provides graphs of distributions of different
I/O sizes |
connections |
Displays server process that accepts each
inbound TCP connection. |
cputypes.d |
Reports on types of CPUs on the system. |
cpuwalk.d |
Reports on which CPUs a process runs on. |
cswstat.d |
Reports on context switches and time consumed. |
dapptrace |
Traces user and library function usage. Similar to
apptrace, but also gets elapsed and CPU
times |
dispqlen.d |
Measures dispatcher queue length (CPU saturation). |
dnlcps.d |
Measures DNLC hits and misses by process. |
dnlcsnoop.d |
Real time record of target, process and result
of DNLC lookups. |
dtruss |
truss replacement without the performance
hit. |
filebyproc.d |
Snoops files opened by process name. |
fsrw.d |
Traces I/O events at a system call level. |
hotspot.d |
Identifies disk "hot spots." |
inttimes.d |
Reports on time spent servicing interrupts
for each device. |
iofile.d |
I/O wait times for each file by process. |
iofileb.d |
I/O size for each file by process. |
iopattern |
System-wide disk I/O usage patterns. |
iosnoop |
Tracks system I/O activity. |
iotop |
Displays processes with highest I/O traffic. |
lockbydist.d |
Lock distribution by process. |
lockbyproc.d |
Lock times by process. |
nfswizard.d |
Identifies top NFS filename requests; reports on access
and performance statistics. |
opensnoop |
Snoops open files. |
pfilestat |
I/O statistics for each file descriptor in a
process. |
priclass.d |
Reports distribution of thread priorites by
class. |
pridist.d |
Reports distribution of thread priorities by class. |
procsystime |
Process system call details; elapsed time, CPU time,
counts, etc. |
rfileio.d |
Read size statistics from file systems and
physical disks, along with a total miss rate
for the file system cache. |
rfsio |
Read size statistics from file systems and
physical disks, along with a total miss rate for the
file system cache. |
rwsnoop |
Captures read/write activity, including
identifying the source processes. |
rwtop |
Displays processes with top read/write activity. |
sampleproc |
Reports which process is on which CPU how much
of the time. |
seeksize.d |
Directly measure seek lengths of I/Os. |
swapinfo.d |
Reports a summary of virtual memory use. |
tcpstat.d |
Reports TCP error and traffic statistics. |
tcpsnoop.d |
Snoops TCP packets and associates them with a port
and a process. |
tcptop |
Displays to TCP packet-generating processes. |
threaded.d |
Measures effectiveness of thread utilization. |
tcpsyscall |
Reports on busiest system calls. |
udpstat.d |
Reports UDP error and traffic statistics. |
vopstat |
Function-level timings of I/Os. |
xcallsbypid.d |
Provides by-process cross-call information. |
zvmstat |
Zone-specific vmstat. |
Additional Resources
Page © 2007 by the Trustees of Princeton University.
Content © 2007 by Scott Cromar. Used with permission.
Comments on the contents of this page may be logged to
The Solaris Troubleshooting Blog
|