These Diagnostic Procedures test the installation, configuration, and operation of all servers.
Use these tests to check that the server has been installed properly.
This test verifies that the ssp.basic file set has been installed correctly. All server device drivers are included in the ssp.basic file set. Issue this lslpp command on the control workstation:
lslpp -l ssp.basic
The output is similar to the following:
Path: /usr/lib/objrepos
  ssp.basic    3.1.0.8    COMMITTED    SP System Support Package

Path: /etc/objrepos
  ssp.basic    3.1.0.8    COMMITTED    SP System Support Package
Good results are indicated if entries for ssp.basic exist. Proceed to Installation test 2 - Check external hardware daemon.
Error results are indicated in all other cases. Try to determine why the file set was not installed, and either install it, or contact the IBM Support Center.
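The pass criterion above (entries for ssp.basic exist) can be checked mechanically. The following is a minimal sketch, not part of PSSP: the helper name count_fileset_entries is hypothetical, and it assumes that each installed entry appears on a line whose first field is the file set name, as in the sample output.

```shell
#!/bin/sh
# Hypothetical helper (not a PSSP command): count the lines of `lslpp -l`
# output, read on stdin, whose first field equals the file set name in $1.
count_fileset_entries() {
    awk -v name="$1" '$1 == name { n++ } END { print n + 0 }'
}

# Intended use on the control workstation (not executed here):
#   n=$(lslpp -l ssp.basic 2>/dev/null | count_fileset_entries ssp.basic)
#   if [ "$n" -gt 0 ]; then echo "good results"; else echo "error results"; fi
```

A count of zero corresponds to the error case; any positive count corresponds to good results.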
This test verifies that the appropriate external hardware daemon and its directory have been created on the control workstation. Issue these commands on the control workstation:
ls -l /usr/lpp/ssp/install/bin/hmcd
The output is similar to the following:
-r-x------ 1 bin bin 47166 Sep 15 13:00 hmcd
ls -l /var/adm/SPlogs/spmon | grep hmcd
The output is similar to the following:
drwxr-xr-x 2 bin bin 512 Sep 15 13:01 hmcd
ls -l /usr/lpp/ssp/install/bin/HMCD.class
The output is similar to the following:
-rw-r--r-- 1 bin bin 26927 Sep 11 01:03 HMCD.class
ls -l /usr/lpp/ssp/lib/libHMCD.so
The output is similar to the following:
-rwxr-x--x 1 bin bin 17680 Sep 15 13:00 libHMCD.so
Good results are indicated if output for all the commands is similar to the examples provided. Proceed to Installation test 3 - Check external hardware daemon components.
Error results are indicated in all other cases. Record all relevant information, see Information to collect before contacting the IBM Support Center and contact the IBM Support Center.
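As a quick first pass before comparing the ls output by eye, the existence of the four items checked above can be tested in one step. This is a sketch only: report_missing is a hypothetical helper, and it checks existence, not ownership, size, or permissions, so it does not replace the comparison with the sample output.

```shell
#!/bin/sh
# Hypothetical helper: print each path argument that does not exist.
# Returns 1 if at least one path is missing, 0 otherwise.
report_missing() {
    missing_rc=0
    for p in "$@"; do
        if [ ! -e "$p" ]; then
            echo "missing: $p"
            missing_rc=1
        fi
    done
    return "$missing_rc"
}

# Intended use on the control workstation (not executed here):
#   report_missing /usr/lpp/ssp/install/bin/hmcd \
#                  /var/adm/SPlogs/spmon/hmcd \
#                  /usr/lpp/ssp/install/bin/HMCD.class \
#                  /usr/lpp/ssp/lib/libHMCD.so
```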
This test verifies that the appropriate external hardware daemon components and their directories have been created on the control workstation. Issue these commands on the control workstation:
ls -l /usr/java130/jre/lib/ext/xerces.jar
The output is similar to the following:
-r--r--r-- 1 bin bin 1521373 Feb 12 13:20 xerces.jar
ls -l /opt/freeware/cimom/org/snia/wbem/client/CIMClient.class
The output is similar to the following:
-rwxr-xr-x 1 bin bin 8507 Jul 23 15:20 CIMClient.class
Good results are indicated if output for all the commands is similar to the examples provided. Proceed to Configuration test 1 - Check SDR Frame object.
Error results are indicated in all other cases. Record all relevant information, see Information to collect before contacting the IBM Support Center, and contact the IBM Support Center.
Use these tests to check that all servers have been configured properly.
This test verifies that the SDR Frame object was created. The configuration data for all servers must reside in the SDR Frame object. On the control workstation, issue the command:
splstdata -f
For each server you should see a line of output similar to the following. The numbers may be different.
List Frame Database Information

frame#  tty        s1_tty  frame_type  hardware_protocol  control_ipaddrs  domain_name
------  ---------  ------  ----------  -----------------  ---------------  -----------
1       /dev/tty0  ""      switch      SP                 ""               ""
2       ""         ""      ""          HMC                9.114.62.123     huntley
Good results are indicated if all of the following are true:
If all of these conditions are true, proceed to Configuration test 2 - Check HMC password files.
Error results are indicated if one or more of these conditions are not true. Attempt to fix the SDR data by issuing the spframe command with the appropriate parameters, or contact the IBM Support Center. For a description of the spframe command, refer to PSSP: Command and Technical Reference.
Repeat this test after issuing the spframe command. If the test still fails, record all relevant information, see Information to collect before contacting the IBM Support Center and contact the IBM Support Center.
This test verifies that for each unique HMC IP address in your system, a corresponding password file has been created on the control workstation. On the control workstation, issue the command:
sphmcid
For each HMC in your system, you should see a line of output similar to the following. The numbers may be different:
9.114.58.22   hmcadmin
9.114.62.123  hmcadmin
Good results are indicated if, for each HMC IP address displayed under the control_ipaddrs heading in Configuration test 1 - Check SDR Frame object, there is a corresponding HMC IP address displayed as a result of issuing the sphmcid command. Proceed to Configuration test 3 - Check SDR Syspar_map object.
Error results are indicated if no entry exists for one or more HMC IP addresses. Attempt to create a password file on the control workstation for each missing HMC IP address. For a description of the sphmcid command, refer to PSSP: Command and Technical Reference.
Repeat the test after issuing the sphmcid command. If the test still fails, record all relevant information, see Information to collect before contacting the IBM Support Center, and contact the IBM Support Center.
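The comparison between the two command outputs can be sketched as a small script. The three helper names below are hypothetical. The sketch assumes the column layout of the sample splstdata -f output (hardware_protocol in the fifth field, control_ipaddrs in the sixth, empty values printed as "") and assumes that the first field of each sphmcid line is the HMC IP address.

```shell
#!/bin/sh
# Hypothetical helpers, assuming the sample output layouts shown in this section.

# Read `splstdata -f` output on stdin; print the control_ipaddrs value of
# every frame whose hardware_protocol field is HMC, sorted and de-duplicated.
hmc_ips_from_frames() {
    awk '$5 == "HMC" { print $6 }' | sort -u
}

# Read `sphmcid` output on stdin; print the first field (the HMC IP address)
# of every line that has at least two fields, sorted and de-duplicated.
hmc_ips_from_sphmcid() {
    awk 'NF >= 2 { print $1 }' | sort -u
}

# Print addresses listed in file $1 (frames) that are absent from file $2
# (sphmcid). Both files must be sorted, which the helpers above guarantee.
missing_password_files() {
    comm -23 "$1" "$2"
}

# Intended use on the control workstation (not executed here):
#   splstdata -f | hmc_ips_from_frames  > /tmp/frame_ips.$$
#   sphmcid      | hmc_ips_from_sphmcid > /tmp/hmcid_ips.$$
#   missing_password_files /tmp/frame_ips.$$ /tmp/hmcid_ips.$$
```

Empty output from missing_password_files corresponds to good results; each address it prints is an HMC that still needs a password file.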
This test verifies that the SDR Syspar_map object for all servers was created correctly. The switch port number for the server is stored in the Syspar_map object of the SDR. If the server is attached to the SP Switch or an SP system without a switch, the switch port number for the server is defined for the SP-attached server using the spframe command. If the server is attached to the SP Switch2 or a clustered enterprise server system without a switch, the switch port number can be optionally automatically assigned by the SDR configuration command. On the control workstation, issue the command once for each server:
SDRGetObjects Syspar_map node_number==N switch_node_number
where N is the node number of the server.
If you do not know the node number, issue the command: spmon -G -d to determine node numbers. For example, the command:
SDRGetObjects Syspar_map node_number==17 switch_node_number
produces output similar to the following:
switch_node_number
5
Good results are indicated if the switch port number that is returned matches the value requested when the server was originally defined. For systems with an SP Switch, this should be the switch port number associated with the port to which the server is cabled on the SP Switch. For systems without a switch, this should be any unused valid switch port number on the system. For systems with an SP Switch2 or for clustered systems, this can be any unused value in the range 0 to 511. Proceed to Operational test 1 - Check hardmon status.
Error results are indicated if no entry exists for the node, or the returned value is incorrect. Attempt to fix the SDR data by issuing the spframe command with the appropriate parameters. If a Frame object already exists for this server, you must first delete that Frame object by issuing the spdelfram command. For a description of these commands, see PSSP: Command and Technical Reference. For information on assigning valid switch port numbers for all servers, see IBM RS/6000 SP: Planning, Volume 2, Control Workstation and Software Environment.
Repeat this test after issuing the spframe command. If the test still fails, record all relevant information, see Information to collect before contacting the IBM Support Center and contact the IBM Support Center.
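For SP Switch2 or clustered systems, the range part of the criterion above (0 to 511) can be checked with a short script. This is a sketch under assumptions: both helper names are hypothetical, and the output is assumed to consist of a header line followed by the value on the second line. It cannot tell whether the value is unused or matches what was requested when the server was defined, so those checks remain manual.

```shell
#!/bin/sh
# Hypothetical helpers, assuming SDRGetObjects prints a header line and then
# the value on the second line.

# Read SDRGetObjects output on stdin; print the first field of the second line.
switch_port_from_sdr() {
    awk 'NR == 2 { print $1 }'
}

# Succeed only if $1 is a non-negative integer no greater than 511.
in_switch2_range() {
    case "$1" in
        ''|*[!0-9]*) return 1 ;;
    esac
    [ "$1" -le 511 ]
}

# Intended use on the control workstation (not executed here):
#   port=$(SDRGetObjects Syspar_map node_number==17 switch_node_number | switch_port_from_sdr)
#   in_switch2_range "$port" || echo "node 17: no entry or value out of range"
```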
Use these tests to check that all servers are operating properly.
This test verifies that the System Monitor (hardmon) is active and running correctly. External hardware daemons cannot run if hardmon is not running. Issue the commands:
lssrc -s hardmon
The output is similar to the following:
Subsystem         Group            PID     Status
 hardmon                           42532   active
ps -ef | grep hardmon
The output is similar to the following:
root 42532 5966 0 Sep 15 0 9:42 /usr/lpp/ssp/bin/hardmon -r 5
Good results are indicated if the following are both true:
If these conditions are met, proceed to Operational test 2 - Check external hardware daemons.
Error results are indicated if these conditions are not met. To determine why hardmon is not running, or why the argument to the -r flag is not 5, refer to Diagnosing System Monitor problems.
Repeat this test after taking any action suggested in Diagnosing System Monitor problems. If the test still fails, record all relevant information, see Information to collect before contacting the IBM Support Center, and contact the IBM Support Center.
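The error description above implies that hardmon must be running with 5 as the argument to the -r flag. A minimal sketch of extracting that argument from the ps output follows. The helper name hardmon_poll_rate is hypothetical, and the sketch assumes the hardmon command line has the form shown in the sample output.

```shell
#!/bin/sh
# Hypothetical helper: read `ps -ef` output on stdin, find the first line
# whose command path ends in /hardmon, and print the argument after -r.
# Prints nothing if hardmon is not listed or has no -r flag. The line
# produced by `grep hardmon` itself is ignored because it has no "/hardmon".
hardmon_poll_rate() {
    awk '/\/hardmon( |$)/ {
        for (i = 1; i < NF; i++)
            if ($i == "-r") { print $(i + 1); exit }
    }'
}

# Intended use on the control workstation (not executed here):
#   rate=$(ps -ef | hardmon_poll_rate)
#   [ "$rate" = "5" ] || echo "hardmon is not running with -r 5"
```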
This test verifies that the hmc daemons are running. If you have one or more IBM pSeries 690 servers, issue the command:
ps -ef | grep hmcd
For IBM pSeries 690 servers, a line of output for each server is similar to the following:
root 23384 42532 1 Sep 15 2 79:08 /usr/lpp/ssp/install/bin/hmcd -d 0 9.114.58.22 1 minnow 1 5
Good results are indicated if all of the following are true:
These debug bit flags are for IBM service use. They are enabled with the hmadm command.
If you receive good results, proceed to Operational test 3 - Check frame responsiveness.
Error results are indicated if any of these conditions are not met. Record all relevant information, see Information to collect before contacting the IBM Support Center and contact the IBM Support Center.
This test verifies that the frames are responding. External hardware daemons cannot run properly if their frames are not responding. To verify that a particular frame is responding, issue the following command on the control workstation:
hmmon -GQv controllerResponds F:0
where F is the frame number of the server that you are checking. Repeat this test for each server in the system.
The output is similar to the following:
frame F, slot 00: TRUE frame responding to polls
Good results are indicated if the value is TRUE for each server. Proceed to Operational test 4 - Check HMC communications.
If the value is FALSE, or you do not get any output, the test may have encountered an error. During normal operation, this value may occasionally switch to FALSE, which may simply mean that the daemon is busy and cannot respond to an individual System Monitor request in a timely manner. Therefore, if you get a FALSE value, repeat the hmmon command several more times, waiting at least five seconds between invocations. If the value is consistently FALSE after several attempts, treat this as an error result.
In the case of error results, perform these steps:
Repeat this test after performing these steps. If the test still fails, record all relevant information, see Information to collect before contacting the IBM Support Center, and contact the IBM Support Center.
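The retry guidance for an occasional FALSE value can be sketched as a polling loop. The helper name responds_within is hypothetical, and the choice of five attempts in the usage example is an assumption; the text says only "several".

```shell
#!/bin/sh
# Hypothetical helper: run a command up to $1 times, sleeping $2 seconds
# between attempts. Succeeds as soon as one attempt prints a line containing
# TRUE; fails if every attempt prints FALSE or prints nothing.
responds_within() {
    attempts=$1
    delay=$2
    shift 2
    i=1
    while [ "$i" -le "$attempts" ]; do
        if "$@" 2>/dev/null | grep -q 'TRUE'; then
            return 0
        fi
        if [ "$i" -lt "$attempts" ]; then
            sleep "$delay"
        fi
        i=$((i + 1))
    done
    return 1
}

# Intended use on the control workstation for frame 3 (not executed here):
#   responds_within 5 5 hmmon -GQv controllerResponds 3:0 \
#       || echo "frame 3: consistently FALSE or no output"
```

A failure from this loop corresponds to the "consistently FALSE" case and should be handled as an error result.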
This test verifies that the HMC communication is available. For each server, issue the following command twice on the control workstation, waiting at least five seconds between the two commands:
hmmon -GQs F:0
where F is the frame number of the server that you are checking. Repeat this test for each server in the system. This is an example output for an IBM pSeries 690 in frame 3:
3 0 nodefail1           FALSE    0x8802  node 01 I2C not responding
3 0 nodefail2           TRUE     0x8803  node 02 I2C not responding
3 0 nodefail3           TRUE     0x8804  node 03 I2C not responding
3 0 nodefail4           TRUE     0x8805  node 04 I2C not responding
3 0 nodefail5           TRUE     0x8806  node 05 I2C not responding
3 0 nodefail6           TRUE     0x8807  node 06 I2C not responding
3 0 nodefail7           TRUE     0x8808  node 07 I2C not responding
3 0 nodefail8           TRUE     0x8809  node 08 I2C not responding
3 0 nodefail9           TRUE     0x880a  node 09 I2C not responding
3 0 nodefail10          TRUE     0x880b  node 10 I2C not responding
3 0 nodefail11          TRUE     0x880c  node 11 I2C not responding
3 0 nodefail12          TRUE     0x880d  node 12 I2C not responding
3 0 nodefail13          TRUE     0x880e  node 13 I2C not responding
3 0 nodefail14          TRUE     0x880f  node 14 I2C not responding
3 0 nodefail15          TRUE     0x8810  node 15 I2C not responding
3 0 nodefail16          TRUE     0x8811  node 16 I2C not responding
3 0 nodeLinkOpen1       FALSE    0x8813  node 01 serial link open
3 0 nodeLinkOpen2       FALSE    0x8814  node 02 serial link open
3 0 nodeLinkOpen3       FALSE    0x8815  node 03 serial link open
3 0 nodeLinkOpen4       FALSE    0x8816  node 04 serial link open
3 0 nodeLinkOpen5       FALSE    0x8817  node 05 serial link open
3 0 nodeLinkOpen6       FALSE    0x8818  node 06 serial link open
3 0 nodeLinkOpen7       FALSE    0x8819  node 07 serial link open
3 0 nodeLinkOpen8       FALSE    0x881a  node 08 serial link open
3 0 nodeLinkOpen9       FALSE    0x881b  node 09 serial link open
3 0 nodeLinkOpen10      FALSE    0x881c  node 10 serial link open
3 0 nodeLinkOpen11      FALSE    0x881d  node 11 serial link open
3 0 nodeLinkOpen12      FALSE    0x881e  node 12 serial link open
3 0 nodeLinkOpen13      FALSE    0x881f  node 13 serial link open
3 0 nodeLinkOpen14      FALSE    0x8820  node 14 serial link open
3 0 nodeLinkOpen15      FALSE    0x8821  node 15 serial link open
3 0 nodeLinkOpen16      FALSE    0x8822  node 16 serial link open
3 0 CECUserName         huntley  0x8993  user defined CEC name
3 0 CECMode             0        0x8984  0=smp 1=partition
3 0 PowerOffPolicy      TRUE     0x8985  power off with last Lpar
3 0 CECCapability       1        0x8986  0=smp 1=lpar 2=numa
3 0 CECState            1        0x8987  0=off on init err inc con rec
3 0 diagByte            0        0x8823  diagnosis return code
3 0 timeTicks           38701    0x8830  supervisor timer ticks
3 0 type                5        0x883a  supervisor type
3 0 codeVersion         772      0x883b  supervisor code version
3 0 daemonPollRate      5        0x8867  hardware monitor poll rate
3 0 controllerResponds  TRUE     0x88a8  frame responding to polls
Good results are indicated if all of the following are true:
In this case, proceed to Operational test 5 - Check S1 communications.
Error results are indicated in all other cases. Record all relevant information, see Information to collect before contacting the IBM Support Center, and contact the IBM Support Center.
This test verifies that the S1 communication is available. For each server, issue the following command on the control workstation for one node chosen arbitrarily:
s1term -G F N
where F is the frame number and N is the node number for a logical partition in the server that you are checking. Repeat this test for each server in the system.
Good results are indicated if the AIX login prompt is displayed. If you do not see the login prompt after issuing the s1term command, press the Enter key a second time. Note that the login prompt may take up to 90 seconds to appear. If you still do not see the login prompt, treat this as an error result.
To correct the problem, perform these steps:
Repeat this test after performing these steps. If the test still fails, record all relevant information, see Information to collect before contacting the IBM Support Center and contact the IBM Support Center.