11/01/96, 4FAX# 1770

System Crashes in AIX

  Special Notices
  About This Document
  Collecting The Dump
    Locate The Dump
    Packaging The Dump
    Verify The Dump
    Copy The Dump
  Gathering Preliminary Crash Information
  Reader's Comments


SPECIAL NOTICES

Information in this document is correct to the best of our knowledge
at the time of this writing. Please send feedback by fax to "AIXServ
Information" at (512) 823-4009.

Please use this information with care. IBM will not be responsible
for damages of any kind resulting from its use. The use of this
information is the sole responsibility of the customer and depends on
the customer's ability to evaluate and integrate this information
into the customer's operational environment.


ABOUT THIS DOCUMENT

This document discusses how to take a user- or system-initiated dump
from your machine and package it for distribution to your support
organization for problem determination.

This document assumes that an IBM software support problem record has
already been opened and that the customer knows the reference number
for that problem. A problem record can be opened in two ways:

1.  Open a software problem with IBM AIX SupportLine by calling
    1-800-237-5511, option 3.

2.  Submit the problem through IBM Program Services. Retrieve FAX
    document #1760 from 1-800-IBM-4FAX (1-800-426-4329) for details
    on how to use this service.

For more in-depth coverage of this subject, the following IBM
documents are recommended:

o   AIX Version 4.1 Software Problem Debugging and Reporting for the
    RISC System/6000
    (GG24-2513)

o   Common Diagnostics and Service Guide (SA23-2687)

o   Diagnostic Information For Micro Channel Bus Systems (SA23-2765)

o   Diagnostic Information for Multiple Bus Systems (SA38-0509)

o   Problem Solving Guide and Reference (SA23-2204), (SA23-2606)

o   Managing System Dump Devices (FAX# 6221 from 1-800-IBM-4FAX)

This document applies to AIX versions 3.2, 4.1, and 4.2.


COLLECTING THE DUMP

Our objective is to remove the dump from the dump device. It can then
be moved to another machine or packaged and sent to IBM for analysis.
The steps to do this are as follows:

1.  Locate the dump

2.  Package the dump

3.  Verify the dump

4.  Copy the dump


LOCATE THE DUMP

To locate the dump, issue the command:

    sysdumpdev -L

The following output shows a good dump:

    Device name:         /dev/hd6
    Major device number: 10
    Minor device number: 1
    Size:                16197120 bytes
    Date/Time:           Fri Nov 18 02:58:14 CST 1994
    Dump status:         0
    Dump copy filename:  /var/adm/ras/vmcore.0

In this case we have located a valid dump that was safely saved by
the system into the /var/adm/ras directory.

The "Dump status:" line indicates whether the dump was successful.
The following table lists the possible return codes and their
meanings.

    SUCCESSFUL DUMP  =   0
    DUMP DISABLED    =  -1
    PARTIAL DUMP     =  -2
    DUMP FAILED      =  -3

If the dump status indicates a "PARTIAL DUMP", the usual cause is
that the dump device is not large enough to hold a complete dump for
this machine. The size of the system dump device should be increased
to a more appropriate value. Please follow the steps in FAX document
#6221 ("Managing System Dump Devices") for a complete description of
how to do this.

If the dump status indicates a "DUMP FAILED", a dump error occurred
during the system dump process. The data in the system dump device
may still hold information to help determine the cause of the system
problem.
Continue with the steps in this document to determine if usable
information is present.

The following case shows the command output when the dump copy
failed. Presumably the dump is available on the external media
device, for example, tape.

    Device name:         /dev/hd6
    Major device number: 10
    Minor device number: 1
    Size:                15914496 bytes
    Date/Time:           Thu Nov 17 19:59:01 CST 1994
    Dump status:         0
    0481-195 Failed to copy the dump from /dev/hd6 to /var/adm/ras.
    0481-198 Allowed the customer to copy the dump to external media.


PACKAGING THE DUMP

The "snap" command will automatically collect the required files
pertaining to the system dump. To gather the dump together with
kernel and general information, run this command:

    snap -Dfkg

This will make a directory in /tmp called "ibmsupt" where the kernel,
crash, and general system information about the machine will be
stored.

If you are running AIX 4.1.x or 4.2.x and your dump device is the
primary paging space (hd6), then upon reboot from a crash, the system
will attempt to copy the dump from the dump device to a filesystem
(usually "/var/adm/ras"). If there is not enough space in this
filesystem, it will prompt you to copy the dump to tape, unless you
have set the machine to discard the dump if no space is available.

==========================================================
IMPORTANT: If you have the system copy the dump to external media,
the information on this media is generally NOT sufficient to
determine the cause of a system crash. The dump needs to be
transferred back to the system to be packaged with other system files
to be complete.
==========================================================

If the dump was copied to tape on reboot, then the "snap" command
above will return an error about there being no good dump. The dump
is no longer accessible to the system, since it is on external media.
In this case, ignore the warning messages, and proceed to copying the
dump from the external media back to the system.
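
The "Dump status:" and "Size:" fields of the "sysdumpdev -L" output
shown above can also be read by a script. The following is only a
sketch, not an IBM-supplied tool. The function name dump_summary is
hypothetical, and the sketch assumes that the output is piped in
exactly the layout shown in this document and that awk accepts POSIX
syntax. It reports the meaning of the status code from the table under
LOCATE THE DUMP and the dump size rounded up to whole megabytes
(1 MB taken as 1048576 bytes).

```shell
# dump_summary: hypothetical helper, not part of AIX.
# Reads "sysdumpdev -L" output on standard input and prints the
# meaning of "Dump status:" plus the dump size in whole megabytes.
dump_summary() {
    awk '
        BEGIN {
            status = "?"; bytes = 0
            # Status codes as listed in this document.
            text["0"]  = "successful dump"
            text["-1"] = "dump disabled"
            text["-2"] = "partial dump"
            text["-3"] = "dump failed"
        }
        /^Dump status:/ { status = $3 }
        /^Size:/        { bytes = $2 }
        END {
            meaning = (status in text) ? text[status] : "unknown status"
            # Round up so a partly used megabyte still counts.
            mb = int((bytes + 1048575) / 1048576)
            printf "status: %s\nspace needed: %d MB\n", meaning, mb
        }'
}
```

For example, "sysdumpdev -L | dump_summary" run against the second
sample output above would report a successful dump that needs 16 MB.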
Space must be available in the /tmp filesystem to copy this dump.
Refer to the output of "sysdumpdev -L" to determine how much space
will be needed for the dump. In the last "sysdumpdev -L" output shown
above, the size of the dump was "15914496 bytes", or roughly 16 MB.
In this case, 16 MB of space needs to be available in /tmp. If it is
not, either clear 16 MB of space from /tmp or increase the size of
/tmp until that much space is available.

If this information is not available, proceed with the following
steps and take appropriate action if there is not enough space.

For example, a dump saved to the /dev/rmt0 device is restored with
these commands:

    cd /tmp/ibmsupt/dump
    tctl -b0 -Bnf /dev/rmt0 read | tar -xvf-
    mv dump_file dump

If the dump was copied to diskette instead of tape, run these
commands:

    cd /tmp/ibmsupt/dump
    tar -xvf /dev/fd0
    mv dump_file dump


VERIFY THE DUMP

Check that the dump is valid and readable before submitting it to IBM
for analysis.

First, make sure that the following files are in the
/tmp/ibmsupt/dump directory with this command:

    # ls -og /tmp/ibmsupt/dump
    total 11352
    -rw-r--r--   1  4530888 Jul 02 16:12 dump.Z
    -rw-r--r--   1      180 Jul 02 16:11 dump.snap
    -rw-r--r--   1  1270849 Jul 02 16:11 unix.Z

NOTE: The file "dump.Z" may be just "dump", and the file "unix.Z" may
be just "unix".

These files should also be greater than 0 in size. In this case,
"dump.Z" = 4530888 bytes and "unix.Z" = 1270849 bytes.

Use the steps in GATHERING PRELIMINARY CRASH INFORMATION to verify
that there is a good dump.


COPY THE DUMP

Use one of the following methods to transfer the dump information
from this machine to another site.

1.  Sending a Testcase via Tape

    a.  With the following command, determine the block size used to
        back up the tape (replace "#" with the number of your tape
        drive):

            lsattr -E -l rmt# | grep block_size

        Sample output:

            block_size 512 BLOCK size (0=variable length) True

        NOTE: The second field ("512" in this example) is the block
        size. Block size 0 is not recommended.

    b.  Place a blank tape in the tape drive.

    c.  Copy the appropriate information to the tape with:

            /usr/sbin/snap -o /dev/rmt#

    d.  Label the tape.

            tape block_size = xxx

        VERY IMPORTANT: If the person sending in this testcase is not
        the person who reported the problem, be sure to include the
        name of the person who reported it. If the proper information
        is not on the package, it takes valuable time to process and
        delays solving your problem.

    e.  Use the following address to send the tape via standard mail
        or overnight delivery.

        NOTE: This address can change, so please verify it either by
        having the latest version of this document or by checking
        with your IBM software support organization.

            IBM Corp. / Zip 2900 / Bldg 42
            Attn: AIX Testcase Dept. J66S
            11400 Burnet Road
            Austin, TX 78758-3493

        NOTE: Only Federal Express makes morning deliveries (Monday -
        Friday) directly to our building (the address must show
        Bldg. 42). NO overnight delivery service delivers directly to
        our building on Saturday. If you specify Saturday delivery
        without making special arrangements with AIX Support, there
        will be a delay of several days. Using other carriers may
        also cause delays.

2.  Sending Testcases Electronically via Internet or IBM VM

    To create a compressed tar image of the testcase, run:

        snap -c

    This will create a file called "snap.tar.Z" in the /tmp/ibmsupt
    directory. The size limit for a single compressed testcase to be
    transferred by FTP is 50 MB.
    FTP it to our testcase repository.

    Sample file names for the tarred and compressed file:

        ad1000.tar.Z         (item number.tar.Z)
        1x234.001.tar.Z      (problem_report_#.branch_office_#.tar.Z)
        1x234.1234567.tar.Z  (problem_report_#.customer_#.tar.Z)

    Sample FTP session:

        ftp 198.17.57.67     (or "ftp testcase.boulder.ibm.com")
        login: anonymous
        password:            (your e-mail address, e.g.,
                             "customer@wallyworld.com")
        bin                  (change to binary transfer mode)
        cd aix
        put 1x234.001.tar.Z  (for example)
        ls -l
        quit

    For IBM VM, the testcase should be uploaded in binary format.
    Please name the file PPPPPBBB TARZBIN, where PPPPP is the PMR
    number and BBB is the branch office number (without the "b").

        node   = "AUSVMR"
        userid = "V3DEFECT"

    NOTE: Because of the size of a dump, it may be faster to send it
    by overnight mail than to send it over VM.

Reclaiming Space in /tmp

Once the testcase has been sent, you can reclaim the space in /tmp
with the "snap -r" command.


GATHERING PRELIMINARY CRASH INFORMATION

First, use the "script" command to generate a file that will contain
all of the information returned by the following commands.

Example:

    script /tmp/crash.info

Now, run the following command:

    sysdumpdev -L

After that, gather information from the DUMP itself. If you have
already moved the dump information into the /tmp/ibmsupt directory
using the steps in COLLECTING THE DUMP, then run
"cd /tmp/ibmsupt/dump" and run "crash dump". The dump file may be
compressed (i.e., named "dump.Z"). If it is and space is available in
/tmp, uncompress this file first (i.e., "uncompress dump.Z").

Otherwise, use the information returned by the "sysdumpdev -L"
command above to determine the location of the dump. If the dump is
on the machine, then run the "crash" command on that location.
Example:

    # sysdumpdev -L
    Device name:         /dev/hd6
    Major device number: 10
    Minor device number: 1
    Size:                16197120 bytes
    Date/Time:           Fri Nov 18 02:58:14 CST 1994
    Dump status:         0
    Dump copy filename:  /var/adm/ras/vmcore.0
    # crash /var/adm/ras/vmcore.0

If this command comes back with an error like:

    WARNING: dumpfile does not appear to match namelist.

then you likely have a bad dump. This usually occurs if the unix
image in your boot logical volume does not match the unix image in
/unix. If this is the case, use the "bosboot" command to rebuild all
boot images on your machine, reboot, and wait for the condition to
occur again.

This situation can also occur if you are mirroring the "rootvg"
volume group and have more than one boot image (e.g., hd5, hd51,
hd52, hd5x, etc). Remember to rebuild the boot image on all boot
logical volumes whenever any upgrades or fixes are installed or any
system configuration changes are made.

If you are mirroring your "rootvg" volume group, you should not
mirror your dump devices. Mirrored dump devices can also cause system
dumps to be incomplete or corrupt. To avoid this, create separate
non-mirrored dump devices for both the primary and secondary dump
devices.

"crash" should eventually come back with a ">" prompt. After this
prompt appears, run the following commands:

    stat
    trace -m
    od *vmmerrlog 9 a

At this point, you can exit out of "crash" with "quit", and then exit
out of the "script" command with "exit". The file you created with
the "script" command will have the contents of what was displayed
above.

You should examine this file to verify that the date of this dump
reported by both the "sysdumpdev -L" command AND the "stat" command
within "crash" correspond to the same time frame. If they do not,
then you may actually be looking at a system crash from an earlier
time. Contact your local software support organization if there are
problems regarding this.
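
The date comparison described above can be made from the file written
by "script". The following is a sketch only. The function name
dump_times is hypothetical, and treating "same calendar day" (month,
day, and year equal) as "the same time frame" is an assumption of this
example, not an IBM rule; a crash just before midnight would be
flagged even though the dump is good. The sketch also assumes that the
"Date/Time:" line from "sysdumpdev -L" and the "time of crash:" line
from the "stat" subcommand both appear in the file in the date layout
shown in this document.

```shell
# dump_times: hypothetical helper, not part of AIX.
# $1 is the file written by "script". Prints the dump time reported
# by "sysdumpdev -L", the crash time reported by "stat", and whether
# the two fall on the same calendar day.
dump_times() {
    awk '
        { gsub(/\r/, "") }    # "script" output may contain CRs
        /Date\/Time:/ {
            dumped = $0
            sub(/^.*Date\/Time:[ \t]*/, "", dumped)
        }
        /time of crash:/ {
            crashed = $0
            sub(/^.*time of crash:[ \t]*/, "", crashed)
        }
        END {
            print "sysdumpdev: " dumped
            print "crash stat: " crashed
            # Date fields: weekday month day time zone year
            split(dumped, d, " "); split(crashed, c, " ")
            same = (dumped != "" && crashed != "" &&
                    d[2] == c[2] && d[3] == c[3] && d[6] == c[6])
            print (same ? "same calendar day" : "no match, check the dump")
        }' "$1"
}
```

For example, after leaving "script", run "dump_times /tmp/crash.info"
and read the two times side by side before relying on the last line.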
You can arrange to fax, mail, etc. this information to your support
organization for this problem, if applicable. If this was a
user-initiated dump or you are just verifying the dump, then the
above information is only used to verify that the dump is readable.
The full dump would need to be sent to correctly diagnose any
problems.

Example:

    # script /tmp/crash.out
    Script started, file is /tmp/crash.out
    # crash /dev/hd7
    Using /unix as the default namelist file.
    Reading in Symbols .........................
    > stat
            sysname: AIX
            nodename: unix
            release: 1
            version: 4
            machine: 000011653000
            time of crash: Wed Mar 20 08:46:01 CST 1996
            age of system: 8 hr., 53 min.
    > trace -m
    MST STACK TRACE:
    0x2ff3b400 (excpt=1000b844:4000f0b0:40002f36:1000b844:107) (intpri=b)
            IAR:      00009064 .disable_lock + 60: mtsr 0xf,r8
            *LR:      00044e9c .selnotify + 3c
            *2ff3a818: 00044f6c .selnotify + 10c
            2ff3a878: 018c3ccc [pse:select_wakeup_on_events] + 11f24
            2ff3a8c8: 018bb284 [pse:osrq_wakeup] + 94dc
            2ff3a918: 018c3bdc [pse:sth_rput] + 11e34
            2ff3a978: 018ba3d8 [pse:csq_lateral] + 8630
            2ff3a9b8: 018b5ebc [pse:puthere] + 4114
    > od *vmmerrlog 9 a
    00000000: 00000000 00000000 00000000 00000000  |................|
    00000010: 00000000 00000000 00000000 00000000  |................|
    00000020: 00000000                             |....|
    > quit
    # exit
    Script done, file is /tmp/crash.out


READER'S COMMENTS

Please fax this form to (512) 823-4009, attention "AIXServ
Information". You may also e-mail comments to:
elizabet@austin.ibm.com. These comments should include the same
customer information requested below.

Use this form to tell us what you think about this document. If you
have found errors in it, or if you want to express your opinion about
it (such as organization, subject matter, appearance) or make
suggestions for improvement, this is the form to use.

If you need technical assistance, contact your local branch office,
point of sale, or 1-800-CALL-AIX (for information about support
offerings). These services may be billable. Faxes on a variety of
subjects may be ordered free of charge from 1-800-IBM-4FAX. Outside
the U.S. call 415-855-4329 using a fax machine phone.

When you send comments to IBM, you grant IBM a nonexclusive right to
use or distribute your comments in any way it believes appropriate
without incurring any obligation to you.

NOTE: If you have a problem report or item number, supplying that
number may help us determine why a procedure did or did not work in
your specific situation.

Problem Report or Item #:

Branch Office or Customer #:

Be sure to print your name and fax number below if you would like a
reply:

Name:

Fax Number:

______________________________________________________________________
______________________________________________________________________
______________________________________________________________________
______________________________________________________________________
______________________________________________________________________
______________________________________________________________________
______________________________________________________________________
______________________________________________________________________
______________________________________________________________________
______________________________________________________________________
______________________________________________________________________
______________________________________________________________________

END OF DOCUMENT (how.to.dump.krn, 4FAX# 1770)