Date: July 24, 1999
I generally recommend changing the default setup for AIX's system dump facility. The default setting stops the system from rebooting after an unexpected halt.
A system dump copies selected areas of the kernel to disk (or tape) if the system halts unexpectedly. The default dump location is the page space. When a system attempts to reboot after an unexpected halt, it stops to warn the operator that the page space contains a dump. The reboot does not continue until the operator tells the system what to do with the dump.
The dump facility can be reconfigured to reboot automatically by changing the dump device from the page space to a raw partition on the disk. The procedure for making this change is in the attached HTML file. See your AIX documentation for more information.
Managing System Dump Devices [manage.dump.32-42.cmd]
Managing System Dump Devices
-------------------------------------------------------------------------------
Contents

About this document
Related documentation
Managing system dump devices
Determining proper size for dump device
Setting a tape drive as a dump device
Extended options in AIX 4.x
Dumping to a mirrored logical volume
Remote dumps over a network
-------------------------------------------------------------------------------
About this document

This document discusses how to manage storage devices used by AIX to store a system dump in the event of a catastrophic operating system software failure. Its intent is to help the system administrator ensure that a system dump will be complete and usable for troubleshooting purposes. This document applies to AIX versions 3.2 and 4.x.
-------------------------------------------------------------------------------
Related documentation

For more in-depth coverage of this subject, the following IBM documents are recommended:

o AIX Version 4.1 Software Problem Debugging and Reporting for the RISC System/6000 (GG24-2513)
o Common Diagnostics and Service Guide (SA23-2687)
o Diagnostic Information For Micro Channel Bus Systems (SA23-2765)
o Diagnostic Information for Multiple Bus Systems (SA38-0509)
o Problem Solving Guide and Reference (SA23-2204) (SA23-2606)
o System Management Guide, V3.2 (SC23-2457)
o System Management Guide, V4 (SC23-2525)
-------------------------------------------------------------------------------
Managing system dump devices

When an unexpected system halt occurs, the system dump facility automatically copies selected areas of kernel data to the primary dump device. These areas include kernel segment 0 as well as other areas registered in the Master Dump Table by kernel modules or kernel extensions. There are two dump devices: a primary and a secondary.

To view information about the current dump devices, enter:

    sysdumpdev -l

Example:

    # sysdumpdev -l
    primary              /dev/hd7
    secondary            /dev/sysdumpnull

In this example, the primary dump device is the logical volume hd7.

When the operating system is installed, the primary dump device is automatically configured. In AIX 3.2, the default primary dump device is /dev/hd7. This is a logical volume dedicated to system dumps. In AIX 4.x, the default primary dump device is /dev/hd6. This is the primary paging space logical volume. In both AIX 3.2 and 4.x, the default secondary dump device is /dev/sysdumpnull. This is a null device, and any dump written to it is lost.
-------------------------------------------------------------------------------
Determining proper size for dump device

The default dump device created for system use may NOT be large enough for a complete dump. To determine how large the dump device is, first identify the primary dump device with sysdumpdev -l, as described above. If the dump device is not currently set to a tape drive, then it should be a logical volume. To retrieve information about this logical volume, enter:

    lslv <LOGICAL VOLUME NAME>

Example:

    lslv hd7

This command returns a screen of information. Obtain the values for LPs and PP SIZE. Multiply these two values to get the size of the dump device in megabytes.

Next, determine how large the dump device for your machine should be. To view an estimate, enter:

    sysdumpdev -e

Example:

    # sysdumpdev -e
    Estimated dump size in bytes: 4526080

NOTE: This value is what the machine would require in its CURRENT running state. The value can change with the activity of the machine, so it is best to run this command when the machine is under its heaviest work load.

The command returns a value in bytes. The primary dump device should be at least as large as the value returned. In this case, the dump space needs to be about 4.5 megabytes.
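The size check above can be sketched in shell. This is only an illustration and not part of the original procedure: the function names lv_size_mb and required_mb are hypothetical, and the sketch works on captured text instead of calling lslv or sysdumpdev itself, assuming the output is shaped like the examples in this document. It treats a megabyte as 1048576 bytes and rounds the estimate up to a whole number of physical partitions.

```shell
# Sketch only: helper names are hypothetical, and the parsing assumes
# lslv output shaped like the example in this document.

# Print the size in MB of a logical volume, given lslv output on stdin.
# Size = LPs x PP SIZE.
lv_size_mb() {
    awk '
        {
            for (i = 1; i < NF; i++) {
                # Take "LPs:" but skip the "MAX LPs:" field
                if ($i == "LPs:" && $(i - 1) != "MAX") lps = $(i + 1)
                if ($i == "PP" && $(i + 1) == "SIZE:") pp = $(i + 2)
            }
        }
        END { print lps * pp }
    '
}

# Print the smallest size in MB, in whole physical partitions, that
# holds a dump of the given size.
#   $1 = estimated dump size in bytes (from sysdumpdev -e)
#   $2 = physical partition size in MB (PP SIZE from lslv)
required_mb() {
    bytes=$1
    pp_mb=$2
    pp_bytes=$((pp_mb * 1048576))
    partitions=$(( (bytes + pp_bytes - 1) / pp_bytes ))
    echo $((partitions * pp_mb))
}

# Using the estimate from the sysdumpdev -e example and a 4 MB partition:
echo "required: $(required_mb 4526080 4) MB"
```

With the example values this prints "required: 8 MB". On a live system the output of lslv could be piped into lv_size_mb and the result compared against that figure.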
A typical system has a physical partition size of 4 megabytes for rootvg, and the dump device can only be increased in multiples of this size. A dump space of 4 megabytes would not be large enough to hold the 4.5-megabyte dump in the example above, so the next size up would be 8 megabytes.

At AIX levels prior to 3.2.4, the sysdumpdev -e option may not be available. If this is the case, a general rule of thumb is to make the dump device 1/4 of the size of your total RAM. To obtain the size of your total RAM, enter:

    bootinfo -r

If the dump device is a standard dump logical volume, such as hd7, use the extendlv command to increase its size. If it is the primary paging space hd6, use the chps command.
-------------------------------------------------------------------------------
Setting a tape drive as a dump device

If you do not have sufficient space on the system to store a dump, use a tape drive as the dump device. To accomplish this, put a blank tape in the desired tape drive and enter:

    sysdumpdev -Pp /dev/rmt#

In this case, rmt# refers to the specific tape drive you want to use (for example, rmt0, rmt1, rmt2, etc.). Be aware that the tape drive will not be usable by any other application until you re-assign the dump device to another location.
-------------------------------------------------------------------------------
Extended options in AIX 4.x

At AIX 4.x, there are three extra attributes that are not available in AIX 3.2. sysdumpdev -l shows these extra options.

Example:

    # sysdumpdev -l
    primary              /dev/hd6
    secondary            /dev/sysdumpnull
    copy directory       /var/adm/ras
    forced copy flag     TRUE
    always allow dump    TRUE

The copy directory entry specifies a filesystem in the rootvg volume group where the dump will be copied upon reboot after a system dump. This only applies if the primary dump device is the primary paging space (hd6).

The forced copy flag entry specifies whether the system will prompt you to copy the dump to external media if there is not enough space in the specified filesystem.
If this is set to FALSE and the system cannot copy the dump to the filesystem, then it will discard the contents of the dump.

The always allow dump flag is a security measure. If this is set to FALSE, then the only way to force a system dump is to turn the service key to Service and then press Reset. It also prevents forcing a dump of any kind on machines with no service key, such as all PCI-based machines.

If the primary dump device is the primary paging device, the system can copy the dump to the filesystem save area only if there is enough free space in that filesystem. The free space in the filesystem can be determined with the df command. If the free space is not at least as large as the space required for the dump (sysdumpdev -e), then do one of the following: increase the size of that filesystem, remove files in that filesystem until enough free space is available, or move the save area to another filesystem with the required space. The last option can be accomplished with the sysdumpdev command. The filesystem must be in the rootvg volume group.
-------------------------------------------------------------------------------
Dumping to a mirrored logical volume

AIX does not support dumping to a mirrored logical volume. This is because the dump is written to only one copy of the logical volume. In other words, only one of the mirrors will contain the dump. Since the logical volume is not being handled like a mirrored logical volume, the newly written data (that is, the dump) will not be synched with the other mirrors. Thus, when crash tries to read the dump, it can obtain data from both mirrors, only one of which actually contains the dump. That is, crash sees good dump data mixed with garbage data, and will not read the dump.

If the logical volume is split up into one logical volume per copy of the original, one of the resulting logical volumes will contain a good dump. This can be accomplished with the splitlvcopy command.
The procedure for splitting the logical volume is:

Run lslv to get the LV IDENTIFIER:

    # lslv hd7
    LOGICAL VOLUME:     hd7                    VOLUME GROUP:   rootvg
    LV IDENTIFIER:      0000335216021417.12    PERMISSION:     read/write
    VG STATE:           active/complete        LV STATE:       opened/syncd
    TYPE:               dump                   WRITE VERIFY:   off
    MAX LPs:            128                    PP SIZE:        4 megabyte(s)
    COPIES:             2                      SCHED POLICY:   parallel
    LPs:                4                      PPs:            8
    STALE PPs:          0                      BB POLICY:      relocatable
    INTER-POLICY:       minimum                RELOCATABLE:    yes
    INTRA-POLICY:       middle                 UPPER BOUND:    32
    MOUNT POINT:        N/A                    LABEL:          None
    MIRROR WRITE CONSISTENCY: on
    EACH LP COPY ON A SEPARATE PV ?: no

Notice that there are two copies. This means that there is one mirror.

Use the splitlvcopy command to split the logical volume, hd7 in this case, into two logical volumes:

    # splitlvcopy 0000335216021417.12 1

A message similar to the following may appear:

    splitlvcopy: WARNING! The logical volume being split, hd7, is open.
    Splitting an open logical volume may cause data loss or corruption and
    is not supported by IBM. IBM will not be held responsible for data loss
    or corruption caused by splitting an open logical volume.
    Do you wish to continue? y(es) n(o)?
    lv02

Enter y at the prompt. The command will complete and show the name of the new logical volume it created (lv02 in this example).

At this point, hd7 contains one copy of the original hd7, and lv02 contains the other. This is exactly what we need. If lslv had shown three copies, lv02 would contain two copies of the original hd7.

Run crash on /dev/hd7 first to see if that is the right copy. If crash does not give error messages, the correct copy has been found.

If the dump on hd7 is unusable and the original hd7 had two copies (so lv02 holds a single copy), run crash on /dev/lv02.

If lv02 now has two copies because the original hd7 had three, run lslv lv02 to get its LV IDENTIFIER. Then run splitlvcopy <LVID> 1 to split lv02 and obtain one logical volume for each of its mirrors.
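The first step of the procedure above depends on two fields from lslv: COPIES and LV IDENTIFIER. As a rough sketch, and not part of the original document, the decision could be expressed as follows. The function name split_plan is hypothetical, and it only prints the splitlvcopy command the procedure calls for. It never runs that command, and it assumes lslv output shaped like the example above.

```shell
# Sketch only: split_plan is a hypothetical helper. It reads lslv output
# on stdin and prints the command suggested by the procedure; it does
# not execute splitlvcopy.
split_plan() {
    awk '
        {
            for (i = 1; i < NF; i++) {
                if ($i == "COPIES:") copies = $(i + 1)
                if ($i == "LV" && $(i + 1) == "IDENTIFIER:") id = $(i + 2)
            }
        }
        END {
            if (copies < 2)
                print "not mirrored: nothing to split"
            else
                print "splitlvcopy " id " 1"
        }
    '
}

# Sample input taken from the lslv hd7 example in this document.
printf '%s\n' \
    'LV IDENTIFIER: 0000335216021417.12 PERMISSION: read/write' \
    'COPIES: 2 SCHED POLICY: parallel' | split_plan
```

For the sample input this prints the same command shown in the procedure, splitlvcopy 0000335216021417.12 1. Printing the command instead of running it leaves the administrator to confirm the warning prompt by hand.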
The splitting procedure may not work for dumps taken to mirrored paging space, because the pager may have already overwritten the dump.
-------------------------------------------------------------------------------
Remote dumps over a network

Currently, the system dump does not handle ARP requests received from the server, or from the gateway used, during the dump. If an ARP request is received while a dump is being taken, the dump hangs. If your system takes a system dump and hangs on 0c7, this is likely the problem. At this point, power the system off and reboot.

To avoid this problem, create a permanent ARP entry for the client (the dumping machine) on the server or gateway. The machine that needs the permanent ARP entry is the machine on the same local network or ring as the client. This can be thought of as the logical server, since, if it is not the real server, the dump data must pass through it to get to the real server.

NOTE: "Real server" refers to the machine designated in the remote dump specification on the client.

Run the following steps on the real server to establish a permanent ARP entry on the server or gateway machine.

1. Ensure an ARP entry exists by pinging the client.

   Example:

       ping myclient.xyz.com

2. Use arp -a to see the ARP table.

   Example:

       # arp -a
       myclient.xyz.com (128.3.56.9) at 10:0:5a:9:e:7d [token ring]
       myserver.xyz.com (128.3.56.20) at 10:0:5a:8f:12:bf [token ring]

3. Now use the arp command to make the dumping client's entry permanent.

   Example:

       # arp -s 802.5 myclient.xyz.com 10:0:5a:9:e:7d

   The 802.5 refers to a token-ring network. Valid network types are listed in the arp command section of the product documentation, and are currently ether (802.3), fddi, and 802.5.

NOTE: If the dump hangs and the client must be rebooted, the partial dump on the server may still be useful.

Techdocs Ref: 90605210214768
4FAX Ref: 6221