Solaris Fault Management
The Solaris Fault Management Facility is designed to be
integrated into the Service Management
Facility to provide a self-healing capability to
Solaris 10 systems.
The fmd daemon is responsible for monitoring
several aspects of system health.
The fmadm config command shows the current
configuration for fmd.
The Fault Manager logs can be viewed with
fmdump -v and fmdump -e -v.
fmadm faulty will list any devices flagged
as faulty.
fmstat shows statistics gathered by
fmd.
Fault Management
With Solaris 10, Sun has implemented a daemon,
fmd, to track and react to faults.
In addition to sending traditional syslog messages,
the system sends binary telemetry events to fmd for
correlation and analysis. Solaris 10 implements default
fault management operations for several pieces of hardware
in SPARC systems, including CPU, memory, and I/O bus events.
Similar capabilities are being implemented for x64 systems.
Once the problem is diagnosed, failing components may be
offlined automatically without a system crash, or other
corrective action may be taken by fmd.
If a service dies as a result of the fault, the
Service Management Facility (SMF)
will attempt to restart it and any dependent processes.
The Fault Management Facility reports error messages in
a well-defined and explicit format. Each error code is
uniquely specified by a Universally Unique Identifier (UUID)
related to a document on the Sun web site at
http://www.sun.com/msg/ .
Resources are uniquely identified by a Fault Managed
Resource Identifier (FMRI). Each Field Replaceable Unit
(FRU) has its own FMRI. FMRIs are associated with one
of the following conditions:
ok: Present and available for use.
unknown: Not present or not usable, perhaps because
it has been offlined or unconfigured.
degraded: Present and usable, but one or more problems
have been identified.
faulted: Present but not usable; unrecoverable
problems have been diagnosed and the resource has been
disabled to prevent damage to the system.
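The four conditions above can be captured in a small lookup, which is handy when a script has to decide whether a resource needs attention. The following is a minimal sketch only: the class name, the property names, and the parse_state helper are hypothetical, and the assumption that an "unknown" resource has no diagnosed problems is an inference from the descriptions above, not something fmd reports in this form.

```python
from enum import Enum


class ResourceState(Enum):
    """The four conditions an FMRI can be associated with (per the list above)."""
    OK = "ok"
    UNKNOWN = "unknown"
    DEGRADED = "degraded"
    FAULTED = "faulted"

    @property
    def usable(self):
        # ok and degraded resources are described as usable;
        # unknown and faulted resources are not.
        return self in (ResourceState.OK, ResourceState.DEGRADED)

    @property
    def problems_diagnosed(self):
        # Assumption: "unknown" means offlined or unconfigured,
        # with no diagnosed problem.
        return self in (ResourceState.DEGRADED, ResourceState.FAULTED)


def parse_state(text):
    """Map a state word such as 'faulted' to a ResourceState (hypothetical helper)."""
    return ResourceState(text.strip().lower())
```

A monitoring script might, for example, alert on any resource whose state has problems_diagnosed set and page someone only when usable is also false.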
The fmdump -V -u eventid command can be used to pull
information on the type and location of the event.
(The eventid is included in the text of the error message
provided to syslog.) The -e option can be used to pull
error log information rather than fault log information.
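When these lookups are scripted, it helps to build the command line in one place. The sketch below only assembles the argument list for the fmdump invocations described above; the function name and the placeholder event ID are hypothetical, and nothing is executed until the list is handed to something like subprocess.run on a Solaris host.

```python
def fmdump_argv(event_id, error_log=False):
    """Build the argv for 'fmdump -V -u eventid', optionally with -e.

    event_id is the identifier copied from the syslog message.
    error_log=True adds -e to read the error log instead of the fault log.
    """
    if not event_id:
        raise ValueError("event_id must be a non-empty string")
    argv = ["fmdump", "-V"]
    if error_log:
        argv.append("-e")
    argv += ["-u", event_id]
    return argv


# Placeholder only; substitute the event ID from the actual syslog message.
print(" ".join(fmdump_argv("EVENT-ID-FROM-SYSLOG")))
print(" ".join(fmdump_argv("EVENT-ID-FROM-SYSLOG", error_log=True)))
```

Returning a list rather than a single string avoids shell quoting problems if the result is later passed to subprocess.run.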
Statistical information on the performance of fmd can be
viewed via the fmstat command. In particular,
fmstat -m modulename provides information
for a given module.
The fmadm command provides administrative
support for the Fault Management Facility. It allows us
to load and unload modules and view and update the resource cache.
The most useful capabilities of fmadm are provided
through the following subcommands:
config: Display the configuration of component modules.
faulty: Display faulted resources. With the -a option,
list cached resource information. With the -i option, list
persistent cache identifier information, instead of most recent
state and UUID.
load /path/module: Load the module.
unload module: Unload module; the module name is the
same as reported by fmadm config.
rotate logfile: Schedule rotation for the specified
log file. Used with the logadm configuration file.
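The subcommands above follow a regular pattern, so a wrapper script can validate a request before running it. This is a rough sketch under stated assumptions: the helper name and keyword arguments are hypothetical, the operand counts are taken from the list above, and the code only builds the argument list without running fmadm.

```python
# Number of operands each subcommand takes, as described in the list above.
_OPERANDS = {"config": 0, "faulty": 0, "load": 1, "unload": 1, "rotate": 1}


def fmadm_argv(subcommand, *operands, all_cached=False, identifiers=False):
    """Return the argv list for one fmadm invocation (hypothetical helper).

    all_cached adds -a and identifiers adds -i; both apply only to 'faulty'.
    """
    if subcommand not in _OPERANDS:
        raise ValueError(f"unsupported subcommand: {subcommand}")
    if len(operands) != _OPERANDS[subcommand]:
        raise ValueError(
            f"{subcommand} takes {_OPERANDS[subcommand]} operand(s)"
        )
    if (all_cached or identifiers) and subcommand != "faulty":
        raise ValueError("-a and -i apply only to 'faulty'")
    argv = ["fmadm", subcommand]
    if all_cached:
        argv.append("-a")
    if identifiers:
        argv.append("-i")
    argv.extend(operands)
    return argv
```

For instance, fmadm_argv("faulty", all_cached=True) yields the list for "fmadm faulty -a", while fmadm_argv("load") raises ValueError because the module path is missing.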
Additional Resources
Amy Rich's Predictive Self-Healing Article
Mike Shapiro's magazine article and
presentation contain a good discussion of the
architectural underpinnings of the Fault Manager.
Gavin Maltby's blog entry reports on AMD fault management.
matty's blog has a short introduction to Fault Management on
Solaris.