NMI Watchdogs and NMI Panics


nmi_watchdog

The NMI watchdog monitors system interrupts and initiates a reboot if the system
appears to have hung.  On a normal system, hundreds of device and timer interrupts
are received per second.  If no interrupts are received in a 5-second interval, the
NMI watchdog assumes that the system has hung and initiates a system reboot.

To understand how the NMI watchdog works, it is first necessary to understand the
APIC.  The APIC, or Advanced Programmable Interrupt Controller, has been built into
all x86 CPUs since the Pentium Pro.  This built-in APIC is known as the Local APIC.
The APIC is used primarily to issue interrupts to other CPUs in a multi-processor
system, but it is also useful in single-processor systems, for example for the NMI
watchdog function.

The IO-APIC is a second APIC, included on certain motherboards.  The IO-APIC
collects interrupts from various I/O devices and sends them to the Local APIC built
into the processor.  The IO-APIC is a replacement for the legacy 8259 Programmable
Interrupt Controllers (PICs), which have been in use since the original PC-AT
architecture.  The IO-APIC is a major improvement in PC architecture, but it is
usually only included on higher-end motherboards.

In order to use the NMI watchdog, APIC support must be enabled in the kernel.  For
SMP kernels, this is enabled automatically.  For uniprocessor (UP) kernels,
CONFIG_X86_UP_APIC or CONFIG_X86_UP_IOAPIC must be enabled.  (The IO-APIC is
preferable to the local APIC.)  [Note: certain kernel debugging options, such as
Kernel Stack Meter or Kernel Tracer, may implicitly disable the NMI watchdog.]

The NMI watchdog is enabled by adding nmi_watchdog=n to the command line used to
boot the kernel, where n is either 1 or 2.  For all SMP systems, and for UP systems
with an IO-APIC, use nmi_watchdog=1.  For UP systems without an IO-APIC, use
nmi_watchdog=2.  Neither setting is guaranteed to work on a given system, however.
If there is doubt, test each setting as shown below.
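
The rule of thumb above can be written down as a small helper.  This is only a
sketch in Python: the function name suggested_nmi_watchdog and its two boolean
arguments are hypothetical and not part of any kernel interface, and the value it
returns is a starting point for testing, not a guarantee.

```python
def suggested_nmi_watchdog(is_smp: bool, has_ioapic: bool) -> int:
    """Return the nmi_watchdog value to try first (hypothetical helper)."""
    # All SMP systems, and UP systems with an IO-APIC, start with 1.
    if is_smp or has_ioapic:
        return 1
    # UP systems without an IO-APIC start with 2.
    return 2


if __name__ == "__main__":
    for smp, ioapic in [(True, True), (False, True), (False, False)]:
        value = suggested_nmi_watchdog(smp, ioapic)
        print(f"smp={smp} ioapic={ioapic} -> nmi_watchdog={value}")
```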

Here is an example from /etc/grub.conf for systems that use the GRUB boot loader:

title Test Kernel (2.4.9-10smp)
        root (hd0,0)
        # This is the kernel's command line.
        kernel /vmlinuz-2.4.9-10smp ro root=/dev/hda2 nmi_watchdog=1

Here is an example from /etc/lilo.conf for systems that use the LILO boot loader:

image=/boot/vmlinuz-2.4.9-10smp
        label=linux
        read-only
        root=/dev/hda2
        append="nmi_watchdog=1"
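
After rebooting, it can be worth confirming that the option actually reached the
kernel.  The following Python sketch extracts the nmi_watchdog value from a kernel
command line string.  The function name nmi_watchdog_setting is hypothetical, and
the sample string is simply the command line from the GRUB example above.

```python
from typing import Optional


def nmi_watchdog_setting(cmdline: str) -> Optional[str]:
    """Return the value of nmi_watchdog= in a kernel command line, or None."""
    value = None
    for token in cmdline.split():
        if token.startswith("nmi_watchdog="):
            # If the option is repeated, this sketch keeps the last occurrence.
            value = token.split("=", 1)[1]
    return value


if __name__ == "__main__":
    sample = "ro root=/dev/hda2 nmi_watchdog=1"
    setting = nmi_watchdog_setting(sample)
    print("nmi_watchdog =", setting if setting is not None else "(not set)")
```

On a running Linux system, the string to check can be read from /proc/cmdline
instead of the hard-coded sample.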

To determine whether the NMI watchdog was activated, check /proc/interrupts.  The
NMI line should display a non-zero value.  If it displays zero, try nmi_watchdog=2.
If it still displays zero, then the processor is not supported by the NMI watchdog
code.  When the watchdog is functioning correctly, the output should look similar
to the following:

           CPU0       
  0:    5623100          XT-PIC  timer
  1:         13          XT-PIC  keyboard
  2:          0          XT-PIC  cascade
  7:          0          XT-PIC  usb-ohci
  8:          1          XT-PIC  rtc
  9:     794332          XT-PIC  aic7xxx, aic7xxx
 10:     569498          XT-PIC  eth0
 12:         24          XT-PIC  PS/2 Mouse
 14:          0          XT-PIC  ide0
NMI:    5620998       
LOC:    5623358 
ERR:          0
MIS:          0
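
The check described above can also be scripted.  The following Python sketch reads
text in the format of /proc/interrupts and reports whether any CPU shows a non-zero
NMI count.  The function names nmi_counts and watchdog_looks_active are
hypothetical, and SAMPLE is an abbreviated copy of the output shown above.

```python
from typing import List

# Abbreviated copy of the example output above.
SAMPLE = """\
           CPU0
  0:    5623100          XT-PIC  timer
NMI:    5620998
LOC:    5623358
ERR:          0
"""


def nmi_counts(interrupts_text: str) -> List[int]:
    """Return the per-CPU NMI counts from /proc/interrupts-style text."""
    for line in interrupts_text.splitlines():
        label, _, rest = line.partition(":")
        if label.strip() == "NMI":
            counts = []
            # Leading numeric fields are per-CPU counts; stop at any trailing text.
            for field in rest.split():
                if not field.isdigit():
                    break
                counts.append(int(field))
            return counts
    return []  # no NMI line found


def watchdog_looks_active(interrupts_text: str) -> bool:
    """True if any CPU reports a non-zero NMI count."""
    return any(count > 0 for count in nmi_counts(interrupts_text))


if __name__ == "__main__":
    print("NMI counts:", nmi_counts(SAMPLE))
    print("watchdog looks active:", watchdog_looks_active(SAMPLE))
```

On a live system, the text would come from reading /proc/interrupts rather than
from SAMPLE.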

unknown_nmi_panic

A feature introduced in kernel 2.6.9 makes it easier to diagnose system hangs on
certain hardware.  This feature, called unknown_nmi_panic, uses the computer's NMI
(Non-Maskable Interrupt) switch, if it is equipped with one, to force a kernel
panic on a hung system.  unknown_nmi_panic was also backported to Red Hat
Enterprise Linux 3 Update 3.  Because the NMI switch generates an undefined NMI,
this feature cannot be used on systems that also use the NMI watchdog or oprofile,
as both of these make use of the undefined NMI.  If unknown_nmi_panic is activated
while either of these features is in use, it will not work.

Note that this NMI is user-initiated, which makes the feature most useful for
diagnosing a system that hangs for unknown reasons.

To enable this feature, set the following system control parameter:

kernel.unknown_nmi_panic = 1

This can be done either from the command line with the "sysctl -w" command, or by
adding the above line to the /etc/sysctl.conf file.
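
To confirm that the line was added correctly, the configuration text can be
checked programmatically.  The following Python sketch looks up a key in
sysctl.conf-style text.  The function name sysctl_value is hypothetical, and the
sample text is made up for illustration.

```python
from typing import Optional


def sysctl_value(conf_text: str, key: str) -> Optional[str]:
    """Return the last value assigned to key in sysctl.conf-style text, or None."""
    value = None
    for raw in conf_text.splitlines():
        line = raw.strip()
        # Skip blank lines and comment lines (starting with '#' or ';').
        if not line or line[0] in "#;":
            continue
        name, sep, val = line.partition("=")
        if sep and name.strip() == key:
            value = val.strip()
    return value


if __name__ == "__main__":
    # Illustrative sample only; a real check would read /etc/sysctl.conf.
    sample = "# panic on an unknown NMI\nkernel.unknown_nmi_panic = 1\n"
    print(sysctl_value(sample, "kernel.unknown_nmi_panic"))
```

On a kernel that supports the feature, the value currently in effect should also
be readable from /proc/sys/kernel/unknown_nmi_panic.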

Once this is done (and the system rebooted, if the setting was made in
/etc/sysctl.conf rather than from the command line), a panic can be forced by
pushing the system's NMI switch.  Systems that do not have an NMI switch can still
use the NMI watchdog feature, which will automatically generate an NMI if the
system hangs.