This document provides a high-level introduction to basic crash dump handling on Sun servers. It includes a sample procedure that can be adapted to any organization for uniform handling of Sun server crashes. The term 'Crash Dump Analysis' may be a bit misleading in the context of this document: actual analysis of a system crash dump with a debugger is not covered here. Sun has an excellent instructor-led training class on that topic. Most System Administrators at most organizations will never have to use a debugger on a crash dump, since this is typically a service provided by Sun under a service contract. In light of this, this document covers introductory material on server crashes and on preparing the information Sun needs when a service call is opened.
When a panic occurs on a Solaris system, a message describing the error is usually echoed to the system console. The system will then attempt to write out the contents of the physical memory to a predetermined dump device, which is usually a dedicated disk partition, or the system swap partition. Once this is completed, the system is then rebooted.
Once the system begins rebooting, a startup script will call the savecore utility, if enabled. This command performs a few tasks on the memory dump. First, it checks that the crash dump corresponds to the running operating system. If the dump passes this test, savecore copies the crash dump from the dedicated dump device to the directory /var/crash/`uname -n`, or to some other predetermined directory. The dump is written out to two files, unix.n and vmcore.n, where n is a sequential integer identifying this particular crash. Finally, savecore logs a reboot message using the LOG_AUTH syslog facility.
A sample memory dump of a system named testbox appears as follows:
# ls -l /var/crash/testbox
total 1544786
-rw-r--r--   1 root     root           2 Jun 15 16:02 bounds
-rw-r--r--   1 root     root      670367 Jun 15 16:00 unix.0
-rw-r--r--   1 root     root   790110208 Jun 15 16:02 vmcore.0
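The unix.n and vmcore.n naming scheme makes it easy to check which crashes were saved completely. The following is a minimal sketch, not part of Solaris: the function name list_dumps is hypothetical, and it assumes only the file naming described above.

```shell
#!/bin/sh
# list_dumps DIR
# Hypothetical helper: print the index n of every crash in DIR for which
# both unix.n and vmcore.n exist, one index per line.
list_dumps() {
    dir=$1
    for kernel in "$dir"/unix.*; do
        # If nothing matches, the pattern is left unexpanded; skip it.
        [ -f "$kernel" ] || continue
        # Strip everything up to the last dot to get the index n.
        n=${kernel##*.}
        if [ -f "$dir/vmcore.$n" ]; then
            echo "$n"
        fi
    done
}
```

For the listing above, running list_dumps /var/crash/testbox would print 0, since unix.0 and vmcore.0 are both present.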
Various options related to performing the actual crash dump and to the savecore functions can be set using the dumpadm command. This utility allows the administrator to choose the dedicated dump device, the directory savecore will write to, and whether or not savecore runs at all. The /etc/init.d/savecore initialization script is the script run at bootup that executes savecore.
Typical output from dumpadm for the system testbox appears as follows:
# dumpadm
      Dump content: kernel pages
       Dump device: /dev/dsk/c0t0d0s3 (swap)
Savecore directory: /var/crash/testbox
  Savecore enabled: yes
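Scripts that need one of these settings can pull a single value out of the dumpadm output. The following is a minimal sketch, assuming the output keeps the "label: value" layout shown above. The function name dumpadm_field is made up for this illustration and is not a Solaris command.

```shell
#!/bin/sh
# dumpadm_field LABEL
# Hypothetical helper: read dumpadm-style output on stdin and print the
# value that follows "LABEL:", with surrounding blanks removed.
# Assumes each setting is on its own line in the form "label: value".
dumpadm_field() {
    label=$1
    sed -n "s/^[[:space:]]*${label}:[[:space:]]*//p"
}
```

On a live system this would be used as dumpadm | dumpadm_field 'Savecore enabled', which for testbox would print yes. Reading from standard input keeps the function usable on saved output as well.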
Fatal operating system errors can be caused by bugs in the operating system, its associated device drivers and loadable modules, or by faulty hardware. Whatever the cause, the crash dump itself provides invaluable information to a Sun Support Engineer (if you are lucky enough to have a support contract) to aid in diagnosing the problem.
Any action taken when a Sun server crashes is obviously going to depend on the local policies and procedures in place at your organization. The presence of a Sun Service Agreement and its level will also affect your response to a crash.
What follows is an example of a typical procedure for dealing with a crash. This procedure is based on real-world experience but does not reflect any particular organization. For the purposes of illustration, assume that the organization in this example has a Platinum level contract with Sun.
The first step in analysing a crash is to determine whether the evidence needed to find a root cause is present. To begin, scan /var/adm/messages for any warnings or errors. Many crashes leave evidence in the logs, such as which CPU caught the panic or which memory DIMM had errors. Often Sun engineers can diagnose the cause of a crash based on this information alone.
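The log scan can be done with a single grep. The following is a rough sketch: the function name scan_messages and the three keywords are assumptions chosen for illustration, not an official list of what Sun looks for. It also assumes a grep that accepts -E; on older Solaris releases that may mean using /usr/xpg4/bin/grep instead of /usr/bin/grep.

```shell
#!/bin/sh
# scan_messages FILE
# Hypothetical helper: print every line of FILE that mentions a panic,
# a warning, or an error (ignoring case), prefixed with its line number.
# The keyword list is an assumption; extend it to suit local needs.
scan_messages() {
    grep -n -i -E 'panic|warning|error' "$1"
}
```

A typical call would be scan_messages /var/adm/messages. The line numbers make it easier to quote the relevant entries when the service call is opened.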
Next, check /var/crash/`uname -n` for a crash dump. If one is not present, confirm that savecore is enabled. If it was not previously enabled, try running savecore -v. It is also a good idea to run prtdiag at this time to check for any egregious hardware faults.
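The decision described in this step can be written down as a small function. This is only a sketch under stated assumptions: next_step is a hypothetical name, the caller passes in the savecore directory and the "Savecore enabled" value reported by dumpadm, and the messages it prints are illustrative wording rather than output from any Sun tool.

```shell
#!/bin/sh
# next_step DIR ENABLED
# Hypothetical helper. DIR is the savecore directory; ENABLED is "yes" or
# "no", as reported on the "Savecore enabled" line of dumpadm.
# Prints a one-line suggestion and returns 0 only if a dump was found.
next_step() {
    dir=$1
    enabled=$2
    for core in "$dir"/vmcore.*; do
        if [ -f "$core" ]; then
            echo "dump present: $core"
            return 0
        fi
    done
    if [ "$enabled" = "yes" ]; then
        echo "no dump found although savecore is enabled"
    else
        echo "no dump found: savecore was disabled, try savecore -v"
    fi
    return 1
}
```

The function does not run savecore or prtdiag itself. It only reports which branch of the step applies, so the administrator stays in control of what is executed on the crashed system.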
Armed with this information, open a call with Sun. Take note of the case ID number. For purposes of this example the case ID will be 123456. The Sun engineer may be able to diagnose the fault based on the panic strings or error messages from /var/adm/messages, or may require the actual crash dump for analysis. Luckily there are two tools, CTEact (ACT) and explorer, which cull useful information from the crash dump and the system, making it unnecessary to upload the actual crash dump (which could be gigabytes in size).
Use the following steps to generate the ACT analysis of the core file and send it to Sun:
Create a temporary upload directory. This directory will hold the output of these programs and will eventually be uploaded to Sun. Then run ACT against the crash dump and save its output there:
# mkdir /tmp/upload
# cd /var/crash/`uname -n`
# /opt/CTEact/bin/act -n unix.0 -d vmcore.0 > /tmp/upload/act_out
Install (if necessary) and run the explorer script as follows:
# ./explorer
The explorer script will prompt you for some information. Do not select email output. The script will create both a subdirectory and a uuencoded file containing the system audit. Copy the uuencoded system audit output to the /tmp/upload directory. For example:
# cp explorer.80b0c1cc.uu /tmp/upload
Tar and compress the output for upload to Sun:
# cd /tmp
# mv upload 123456
# tar -cvf 123456.tar 123456
# gzip 123456.tar
Finally, FTP the output to Sun:
# ftp sunsolve.sun.com
ftp> username: ftp
ftp> password:
ftp> bin
ftp> put 123456.tar.gz
ftp> quit
At this point you can remove the temporary upload directory:
# /bin/rm -rf /tmp/123456
Retain the original core files in /var/crash/`uname -n` until the case is closed. Once the case is closed by Sun, remove these files to free up disk space.
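The rename, tar, and gzip steps from the procedure can be wrapped in one function so that every case is packaged the same way. This is only a sketch: package_case is a hypothetical name, and it uses tar -cf rather than tar -cvf simply to keep the output quiet.

```shell
#!/bin/sh
# package_case UPLOAD_DIR CASE_ID
# Hypothetical helper: rename UPLOAD_DIR to CASE_ID inside its parent
# directory, then create CASE_ID.tar.gz next to it.
package_case() {
    upload_dir=$1
    case_id=$2
    parent=$(dirname "$upload_dir") || return 1
    mv "$upload_dir" "$parent/$case_id" || return 1
    # Run in a subshell so the caller's working directory is unchanged.
    ( cd "$parent" && tar -cf "$case_id.tar" "$case_id" && gzip "$case_id.tar" )
}
```

With the example case ID, package_case /tmp/upload 123456 would leave /tmp/123456.tar.gz ready for the FTP step. The function stops before the upload and the cleanup, so those remain manual steps as in the procedure.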
Those who wish to do more than simply upload information to Sun and let them analyse the crash dump should strongly consider taking Sun's "Core Dump Analysis" course.
For more information, particularly on self-analysis of crash dumps, see Princeton University's "Solaris 2.x Core Dump Analysis".