Watchdog Reset Diagnostics
A watchdog reset occurs when the system detects a fault condition that it
deems potentially dangerous. When such a fault occurs, the system immediately
drops to the PROM monitor without taking a core dump. If the
watchdog-reboot? parameter is set to true, the system will reboot. In that
case, no further diagnostics will be possible unless an error message
appears either in the system logs (from immediately before the watchdog
reset was executed) or on the console (from the hardware diagnostics run
during the reboot).
If the watchdog-reboot? parameter is set to false, some limited diagnostics
are available that may point to a culprit.
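As a minimal sketch, the setting can be inspected or changed from a running system with the eeprom command, which prints parameters as name=value. The function names and the EEPROM override variable are hypothetical and exist only for illustration; the override lets the helper be tried on a machine that has no eeprom command.

```shell
# Sketch: report and change the watchdog-reboot? setting from a running system.
# EEPROM is a hypothetical override so the helpers can be exercised off-host.
watchdog_reboot_state() {
    # eeprom prints the parameter as name=value; keep only the value.
    "${EEPROM:-eeprom}" 'watchdog-reboot?' | sed 's/^[^=]*=//'
}

disable_watchdog_reboot() {
    # Run as root. After this, a watchdog reset leaves the system at the
    # PROM monitor instead of rebooting.
    "${EEPROM:-eeprom}" 'watchdog-reboot?=false'
}
```

Calling watchdog_reboot_state prints the current value. Calling disable_watchdog_reboot changes it, so it should only be run on a system where stopping at the PROM monitor is acceptable.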
Further complicating the issue, watchdog resets may be caused by hardware
or software problems. A software-triggered watchdog reset occurs when
two trap errors take place so close together that the first one does not
have time to complete before the second one is received by the system.
This type of watchdog reset is sometimes called a "CPU" watchdog reset,
since it occurs when the CPU receives a trap while the register bit to
receive traps is not set.
Since hardware faults may cause traps, a CPU watchdog reset may be caused
by either hardware or software failures.
A second type of watchdog reset is a "system" watchdog reset. These
are almost always caused by a hardware fault.
If the system is still at the PROM monitor prompt following the watchdog
reset, it is possible to execute the following commands to attempt to
gather some information about the system state prior to the reset.
If at all possible, the system should be observed through some sort
of console or tip session that can be used to preserve the output of
the PROM monitor session.
.registers: Displays kernel internal registers.
.locals: Displays the registers in the current register window.
.psr: Displays the Processor Status Register.
f8002010 wector p: (Note: That word is not vector.)
This displays messages similar to those in dmesg. They represent any final
messages that may have occurred before the reset. See the Sun web site for
more information on Watchdog Reset. Note that we have not had much success
with this command, but it is recommended by Sun, and hope does spring
eternal...
ctrace: Displays the trace of the current thread.
Additional debugging information can be made available to the ctrace
command via a module called obpsym. This can be loaded in one of two ways:
modload /platform/sun4x/kernel/misc/obpsym (where
x is m, u or d, depending on the system architecture) from the root
command line. This method loads the module for this boot only.
forceload: misc/obpsym in the
file. This method loads the module during future reboots.
Sun recommends using both methods so that the obpsym module is reloaded
on each reboot until the problem is diagnosed and resolved.
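The modload step can be sketched as a small helper. It assumes that uname -m reports the architecture name used in the /platform path (sun4m, sun4u or sun4d); the function names are hypothetical, and modinfo is used only to confirm that the module now appears in the loaded list.

```shell
# Sketch: build the obpsym module path and load it for the current boot.
obpsym_path() {
    # $1: optional architecture name; defaults to the output of uname -m.
    printf '/platform/%s/kernel/misc/obpsym\n' "${1:-$(uname -m)}"
}

load_obpsym() {
    # Run as root. Loads the module, then lists it if the load succeeded.
    modload "$(obpsym_path "$@")" && modinfo | grep obpsym
}
```

For example, load_obpsym sun4u loads /platform/sun4u/kernel/misc/obpsym, while load_obpsym with no argument relies on uname -m.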
Once the PROM monitor diagnostics have been run, use
ok> prompt to generate a core dump. This can be
analyzed using the suggestions from the Crash
Dump Analysis page. If a core is not saved, check the Savecore Troubleshooting page.
Watchdog resets are often caused by a hardware failure, usually requiring a
system board or CPU replacement. Less frequently, memory replacements have
cleared up the problem. Shortening the SCSI bus will sometimes eliminate the
watchdog resets. Any hardware that can send a trap is potentially
responsible for a watchdog reset.
Hardware faults may leave traces in log or console error messages.
In particular, check for the following:
Asynchronous memory error: Indicates a memory problem.
Asynchronous memory fault: May be a bus problem between
memory and CPU. Try replacing the system board first, then the CPU, then
Ecache parity error: Indicates a problem with the CPU's
onboard cache. Replace the CPU.
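The checks above can be sketched as a scan over a saved log or console capture. This is an illustration only: the function name is hypothetical, the match is case-insensitive on the three phrases listed above, and the exact wording of real messages may differ from system to system.

```shell
# Sketch: print each matching line followed by the hint the text gives for it.
watchdog_hints() {
    # $1: path to a saved log or console capture (hypothetical argument).
    while IFS= read -r line; do
        lower=$(printf '%s' "$line" | tr '[:upper:]' '[:lower:]')
        case "$lower" in
            *"asynchronous memory error"*)
                hint="memory problem" ;;
            *"asynchronous memory fault"*)
                hint="possible bus problem between memory and CPU" ;;
            *"ecache parity error"*)
                hint="CPU onboard cache problem; replace the CPU" ;;
            *)
                continue ;;   # not one of the listed messages
        esac
        printf '%s\n  -> %s\n' "$line" "$hint"
    done < "$1"
}
```

A typical call would be watchdog_hints /var/adm/messages, assuming that is where the system log is kept. Lines that match none of the three phrases are skipped.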