sysdumpdev -L
Note the primary dump device name, and use it where you see /dev/hd# in the following steps.
In AIX Version 3.2, the primary dump device will probably be /dev/hd7.
In AIX Version 4.x, the default dump device is set to /dev/hd6, your paging device. DO NOT use this device name in the following steps.
In some cases the dump will be copied to the /var/adm/ras directory. In this case use /var/adm/ras/vmcore.x instead of /dev/hd#. If, when your system started, you were prompted to put the dump on tape, extract the dump from tape to some directory, such as /tmp/dump, and then use /tmp/dump/dump_name instead of /dev/hd#.
If you have set up a dedicated dump device like /dev/dumplv, use /dev/dumplv in the following steps.
crash /dev/hd#
To verify the usability of the dump information, look for the following output:
Using /unix as the default namelist file
Reading in symbols.........
TRAP | An assert statement in the code caused the system to crash because it was not true. |
INVALID OPERATION | There is probably a wild branch, and the instruction to be executed is not valid. |
DSI (Data Storage Interrupt) | This means that there was an addressing exception on a data fetch. |
ISI (Instruction Storage Interrupt) | This means that there was an addressing exception on an instruction fetch. |
HANG | The system is hung and the user must force a system dump. |
The types of dumps can be differentiated by the following symptoms:
888 102 700 0c0 | This LED sequence indicates a trap or invalid operation (differentiated by errlog entry). |
888 102 300 0c0 | This LED sequence indicates a DSI. |
888 102 400 0c0 | This LED sequence indicates an ISI. |
No process can proceed and no external interrupts are accepted. | This indicates a hang. |
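The LED table above can be kept as a small lookup when triaging several machines. This is only a sketch: the dictionary holds exactly the three sequences listed above, and the names LED_DUMP_TYPES and classify_led are hypothetical helpers, not part of any AIX tool.

```python
# LED sequences copied from the table above; nothing else is assumed.
LED_DUMP_TYPES = {
    "888 102 700 0c0": "trap or invalid operation (check the errlog entry)",
    "888 102 300 0c0": "DSI",
    "888 102 400 0c0": "ISI",
}


def classify_led(sequence):
    """Normalize case and spacing, then look the sequence up."""
    key = " ".join(sequence.lower().split())
    return LED_DUMP_TYPES.get(key, "unknown (not in the table above)")


print(classify_led("888 102 300 0C0"))
```

A hang has no LED sequence of its own in the table, so it is deliberately left out of the lookup.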
An example of the errlog entry for a trap:
ERROR LABEL:    PROGRAM_INT
...
Detail Data
Segment Register, SEGREG
0000 0000
Machine Status Save/Restore Register 0
0009 AC50
Machine Status Save/Restore Register 1
0002 0000
Machine State Register, MSR
0002 90B0
Please note the Machine Status Save/Restore Register 0 (9AC50 in the above example). This is the address at which the trap occurred.
If you have a version of the crash command which accepts the -m option on the trace subcommand (trace -m), you can easily identify where the trap occurred.
Here's an example (input is shown with the ">" crash command prompt):
> trace -m
MST STACK TRACE:
0x1f8154 (excpt=0:0:0:0:0) (intpri=0)
        IAR:       00007148  .i_enable + ac:    bcr    14,0
        *LR:       013d87a4  [tokdd:tokoflv] + 294c
        00000000:  00000000  <invalid>
...
0x2ff98000 (excpt=0:40000000:618f:2ff95508:106) (intpri=b)
        IAR:       0009ac50  .freeiblk + 54:    ti     4,r7,0x0
        *LR:       0009ac38  .freeiblk + 3c
        *2ff97288: 00000000  <invalid>
        2ff97368:  0009b084  .ifreeind + 3c
...
Find the IAR line that contains the Machine Status Save/Restore Register 0 that was found in the errlog. The ti on the IAR: line indicates that there was a trap instruction in .freeiblk + 54.
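The matching step above is mechanical: join the two halves of Machine Status Save/Restore Register 0 into one address and look for the IAR line that carries it. The following sketch shows this, assuming you have saved the trace -m output as text. The helper names errlog_word and find_iar are hypothetical, and the sample text is abbreviated from the example above.

```python
import re


def errlog_word(text):
    """Join the two 16-bit halves shown in the errlog, e.g. '0009 AC50'."""
    return int(text.replace(" ", ""), 16)


def find_iar(trace_text, address):
    """Return the first 'IAR:' line whose address equals `address`, else None."""
    for line in trace_text.splitlines():
        match = re.search(r"IAR:\s*([0-9a-fA-F]{8})", line)
        if match and int(match.group(1), 16) == address:
            return line.strip()
    return None


# Abbreviated copy of the trace -m example above.
TRACE = """\
0x1f8154 (excpt=0:0:0:0:0) (intpri=0)
    IAR: 00007148 .i_enable + ac: bcr 14,0
0x2ff98000 (excpt=0:40000000:618f:2ff95508:106) (intpri=b)
    IAR: 0009ac50 .freeiblk + 54: ti 4,r7,0x0
"""

srr0 = errlog_word("0009 AC50")
print(hex(srr0))
print(find_iar(TRACE, srr0))
```

The same comparison applies to an invalid operation; only the address taken from the errlog differs.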
An example of the errlog entry for an invalid operation:
ERROR LABEL:    PROGRAM_INT
...
Detail Data
Segment Register, SEGREG
0000 0000
Machine Status Save/Restore Register 0
000B E4F8
Machine Status Save/Restore Register 1
0008 0000
Machine State Register, MSR
0002 90B0
Please note the Machine Status Save/Restore Register 0 (BE4F8 in the above example). This is the address at which the invalid operation occurred.
If you have a version of crash which accepts the -m option on the trace subcommand (trace -m), you can easily identify an invalid operation.
Here's an example (input is shown with the ">" crash command prompt):
> trace -m
MST STACK TRACE:
0x1f8154 (excpt=0:0:0:0:0) (intpri=0)
        IAR:       00007148  .i_enable + ac:    bcr    14,0
        *LR:       013d87a4  [tokdd:tokoflv] + 294c
        00000000:  00000000  <invalid>
...
0x2ff98000 (excpt=0:42000000:40001164:2ff7fffc:106) (intpri=b)
        IAR:       000be4f8  <invalid>:    ???    (0x9872c)
        *LR:       000356ec  .soo_ioctl + 634
        *2ff979f8: 00000000  <invalid>
        2ff97a58:  000a4d88  .fp_ioctl + 68
        2ff97ab8:  01d00b90  sna_sysx:luxgosp + 9b90
        2ff97d88:  01cfdde4  sna_sysx:luxioctl + 6de4
        2ff97de8:  00082ba8  .rdevioctl + b0
        2ff97e28:  00084054  .spec_ioctl + 20
        2ff97ed8:  0009891c  .vno_ioctl + 110
        2ff97fa8:  000a4c84  .kioctl + e8
Find the IAR line that contains the Machine Status Save/Restore Register 0 that was found in the errlog.
The IAR is listed as "invalid" in the trace output; however, the rest of the stack is valid information.
An example of the errlog entry for a DSI:
ERROR LABEL:    DSI_PROC
...
Detail Data
Data Storage Interrupt Status Register
4000 0000
Data Storage Interrupt Address Register
007F FFFF
Segment Register, SEGREG
632E 6108
EXVAL
0000 000E
If you have a version of the crash command which accepts the -m option on the trace subcommand (trace -m), you can easily identify where the DSI occurred.
Here's an example (input is shown with the ">" crash command prompt):
> trace -m
MST STACK TRACE:
0x211db0 (excpt=0:0:0:0:0) (intpri=0)
        IAR:       00009e00  .v_copypage_pwr + 58:    dclz   r7,r6
        *LR:       0005be64  .getvmpage + 128
        *00211bb8: 00000000  <invalid>
IAR not in kernel segment.
0x2ff98000 (excpt=632e6108:40000000:7fffff:632e6108:106) (intpri=b)
        IAR:       0147ded4  [smt_load:smconnect] + cd4:    l      r0,0x8(r5)
        *LR:       0147de10  [smt_load:smconnect] + c10
        *2ff97fa8: 00000000  <invalid>
        00000000:  000036dc  <invalid>
You can identify the correct IAR stanza by the first three values following excpt= on the line above the first and third appearances of "IAR". In the third appearance, those three numbers are 632e6108:40000000:7fffff. Each of these three numbers can be found in the DSI_PROC error log entry: the first number is the Segment Register, SEGREG, the second number is the Data Storage Interrupt Status Register, and the third number is the Data Storage Interrupt Address Register.
In the case of an ISI_PROC error log entry (see the section on Instruction Storage Interrupts), the first number is the Segment Register, SEGREG, the second number is the ISISR, and the third number is the ISIR0.
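The comparison described above can be sketched in a few lines: parse the excpt= values from a stanza header and compare the first three with the SEGREG, status register and address register values from the error log entry. The helper names are hypothetical, and the two header lines are copied from the trace -m example above.

```python
import re


def errlog_word(text):
    """Join the two 16-bit halves shown in the errlog, e.g. '632E 6108'."""
    return int(text.replace(" ", ""), 16)


def excpt_fields(stanza_header):
    """Parse '(excpt=a:b:c:d:e)' into a list of integers (empty if absent)."""
    match = re.search(r"excpt=([0-9a-fA-F:]+)", stanza_header)
    if not match:
        return []
    return [int(field, 16) for field in match.group(1).split(":")]


def matches_errlog(stanza_header, segreg, status, address):
    """True when the first three excpt values equal the errlog values."""
    wanted = [errlog_word(segreg), errlog_word(status), errlog_word(address)]
    return excpt_fields(stanza_header)[:3] == wanted


headers = [
    "0x211db0 (excpt=0:0:0:0:0) (intpri=0)",
    "0x2ff98000 (excpt=632e6108:40000000:7fffff:632e6108:106) (intpri=b)",
]
# Values from the DSI_PROC entry: SEGREG, DSISR, address register.
for header in headers:
    print(header, matches_errlog(header, "632E 6108", "4000 0000", "007F FFFF"))
```

Comparing integers rather than strings avoids mismatches caused by leading zeros (007F FFFF in the errlog, 7fffff in the trace).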
A DSI (or ISI) also shows up in the vmmerrlog. The vmmerrlog information can be seen in the detail data section of the DSI_PROC (or ISI_PROC) error report entry:
EXVAL 0000 000E
You can also view the information with crash subcommands. Here's an example (input is shown with the ">" crash command prompt):
> od vmmerrlog 9 a
00056a20: 20faed7f 53595356 4d4d2000 00000000  | .."SYSVMM .....|
00056a30: 00000000 40000000 007fffff 632e6108  |....@...."..c.a.|
00056a40: 0000000e                             |....|
In this case the return code from vmm is 0000000e. The DSISR in the above example is 40000000.
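To read the od output without counting columns by hand, you can split it into 32-bit words. The sketch below does that for the example above. The word positions (5 for the DSISR, 8 for the return code) are read off this one example and are an assumption, not a documented layout of vmmerrlog. Words 6 and 7 are shown only because they match the address register and SEGREG values in the DSI_PROC entry above.

```python
import struct

# Hex words from the od example above (ASCII column omitted).
OD_OUTPUT = """\
00056a20: 20faed7f 53595356 4d4d2000 00000000
00056a30: 00000000 40000000 007fffff 632e6108
00056a40: 0000000e
"""


def od_words(text):
    """Return the 32-bit words of an od dump, ignoring addresses and ASCII."""
    words = []
    for line in text.splitlines():
        _, _, rest = line.partition(":")   # drop the address
        rest = rest.split("|")[0]          # drop the ASCII column if present
        words.extend(int(word, 16) for word in rest.split())
    return words


words = od_words(OD_OUTPUT)
# Positions below are taken from this example only (assumption).
dsisr, address, segreg, exval = words[5], words[6], words[7], words[8]
# Words 1 and 2 hold the ASCII tag visible in the od output.
tag = struct.pack(">2I", words[1], words[2]).rstrip(b"\x00 ").decode("ascii")

print(tag)
print(f"DSISR={dsisr:08x} address={address:08x} SEGREG={segreg:08x} EXVAL={exval:08x}")
```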
Check the error log for disk or SCSI errors. If there are disk or SCSI errors for disks that are not part of rootvg and do not contain paging space, contact AIX defect support and send in the crash information.
Otherwise, contact your CE to run diagnostics.
An example of the errlog entry for an ISI:
ERROR LABEL:    ISI_PROC
...
Detail Data
ISISR
4000 0000
ISIR0
007F FFFF
Segment Register, SEGREG
632E 6108
EXVAL
0000 000E
Refer to the section Data Storage Interrupts in this document. You can find the correct IAR stanza for an ISI in the same way as for a DSI.
Refer to the section Forcing a System Dump for details on how to force a dump.
NOTE: If you work with AIX support on your problem, it is useful to describe in detail the conditions and events that led to the hang. For example, "After running xyz application for two hours, there is no response to the keyboard input. I cannot get a response at the system console or at any terminals directly connected to the system unit, and I cannot rlogin or ping the system."
In a dump that was forced because of a hang, locks are one thing to look at. A few locks are:
proc_lock
kernel_lock
net_lock
You can view lock information with crash subcommands. The following example uses proc_lock; the same format can be used for kernel_lock or net_lock. (Input is shown with the ">" crash command prompt.)
For Version 3.2:
> od proc_lock
0002a728: ffffffff
For Version 4.x:
> lock
If what is returned consists of a series of f's, no process is holding that lock. If there is a process holding the lock, the process ID will be in the field occupied by ffffffff in the preceding example.
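The rule above (all f's means free, anything else is the holder's process ID) can be expressed as a small check. This is a sketch for the Version 3.2 od output format only. The name lock_holder is hypothetical, and the second value used below is a made-up holder for illustration, not taken from a real dump.

```python
FREE = 0xFFFFFFFF  # a lock word of all f's: no process holds the lock


def lock_holder(word_hex):
    """Return None if the lock is free, else the holder's process ID."""
    value = int(word_hex, 16)
    return None if value == FREE else value


print(lock_holder("ffffffff"))   # value from the example above
held = lock_holder("00000cf8")   # hypothetical lock word for illustration
print(hex(held))
```

The process ID returned can then be matched against the PID column of the process table to find the slot to trace.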
You can look at the locks to see if there is a deadlock situation. If there is NOT a deadlock situation, you can look at the kernel stack trace of any process or of the running process to attempt to determine what caused the system to hang.
NOTE: At AIX Version 4.x you can use the dlock subcommand to detect deadlock situations.
List the process table with the p subcommand and look at the NAME column to find a process in which you are interested. Use trace -k process_slot_number (the value in the SLT column) to display its kernel stack trace, if any exists.
For example:
> p
SLT ST    PID   PPID   PGRP   UID  EUID  PRI  CPU     EVENT  NAME
  0 s       0      0      0     0     0   16  120            swapper
        FLAGS: swapped_in no_swap fixed_pri kproc wake/sig
  1 s       1      0      0     0     0   60    0            init
        FLAGS: swapped_in no_swap wake/sig locks
  2 r     202      0      0     0     0  127  120            wait
        FLAGS: swapped_in no_swap wake/sig locks
...
 12 s     cf8      1    662     0     0   60    0  014d6280  cdpg
        FLAGS: swapped_in kproc orphanpgrp
...
> trace -k 12
STACK TRACE:
3320 (excpt=04fa5654:40000000:00000000:04fa5654:00000106) (intpri=0)
        IAR:      .e_wait+15c (0003f384):    cror   15,cr15,cr15
        LR:       .e_wait+15c (0003f384)
        2ff7fec0: .e_sleep+120 (0003f674)
        2ff7ff20: .[cfs.ext:cdr_pager]+ac (014d5720)
        2ff7ff70: .procentry+1c (00032374)
        2ff7ffc0: INVALID (00000000)
NOTE: In AIX Version 4.x the pcb subcommand is replaced with the tcb subcommand.
> pcb
USER AREA FOR X (ProcTable Address 0xe3003200)
SAVED MACHINE STATE
    curid:0x00003236  m/q:0x00040000  iar:0x014973dc  cr:0x48844084
    msr:0x000090b0  lr:0x0149712c  xer:0x00000004  ctr:0x0008bf8c
    *bus:0x04fc13c0  *prevmst:0x00000000  *stackfix:0x00000000  intpri:0x0000000b
    backtrace:0x00  tid:0x00000000  fpeu:0x01  ecr:0x00000087
.... rest of pcb output deleted ...
In the preceding example, note that the curid value (on the third line of output) is 0x00003236. Obtain the process slot number from the curid by shifting the curid eight bits to the right. In this case, the process slot number is 0x32. Convert that value from hexadecimal to decimal; in this case, 0x32 (hexadecimal) = 50 (decimal). The process slot number is 50 (decimal).
> trace -k 50
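The shift and the hexadecimal-to-decimal conversion described above can be checked with a short calculation. This is a sketch only: slot_from_curid is a hypothetical helper name, and the curid value is the one from the pcb example.

```python
def slot_from_curid(curid):
    """Shift the curid eight bits to the right to get the process slot."""
    return curid >> 8


slot = slot_from_curid(0x00003236)
print(hex(slot), slot)          # hexadecimal and decimal forms of the slot
print(f"trace -k {slot}")       # the subcommand to enter at the crash prompt
```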
If you cannot telnet, rlogin, or ping the system, the system is hung. Another indication is that the system answers a ping but is otherwise unavailable.
Chances are the system will hang again. The steps below will prepare the system for a forced dump when and if this event recurs.
Run the following command:
smit dump
Change the Always Allow System Dump attribute to TRUE.
When the system hangs again, proceed according to the type of system:
System with LED or              Non LED display with NO KEY SWITCH
key and LED machine             and RESET button
-------------------             with AIX 4.1.4 and beyond
         |                      ----------------------------
Turn the key to service.                     |
         |                      Hit the following key
Hit reset.                      sequence if there is no
         |                      disk activity:
         |                                   |
The LED sequence will be        Ctrl-Alt-1 (on the numpad).
0c9-0c4 or 0c9-0c0.             Wait for disk activity
         |                      to stop.
         |                                   |
If a hang occurs, power off the system and proceed as shown below.

Connect a tape drive to the system and Power the SYSTEM on.
Unless the default dump configuration has been modified with the sysdumpdev command, the dump will be copied to /var/adm/ras/vmcore.x when the system is powered on.
NOTE: If /var is too small to hold the dump, the system
will prompt the user to copy the dump to external media
such as tape or preformatted diskettes. If a tape
drive is not connected, the system will prompt for
diskettes. Using diskettes is NOT a recommended method
of collecting a dump. If you are unable to save the
dump, we will not be able to determine what caused your
system to crash.