This test verifies the functionality of a specific switch chip or switch port. The switch_stress command starts this test. For detailed information about this command, its flags, arguments, and usage examples, see PSSP: Command and Technical Reference. The -g flag can be used to run the test with the SPD GUI, but you must specify the command operands and flags on the command line.
The -n flag specifies the switch plane number: -n 0 specifies plane 0 and -n 1 specifies plane 1. You cannot specify both planes in one invocation of this command.
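Because each invocation accepts only one plane, testing both planes means running the command twice. The following is a minimal sketch, in Python, of building one argument list per plane. Only the switch_stress command name and the -n flag come from the text above. The helper name build_plane_commands and its other_args parameter are hypothetical; the remaining operands and flags are passed through unchanged and must be taken from PSSP: Command and Technical Reference.

```python
def build_plane_commands(planes, other_args=()):
    """Return one switch_stress argument list per requested plane.

    planes     -- iterable of plane numbers; only 0 and 1 are valid
    other_args -- remaining operands and flags, passed through as-is
                  (hypothetical placeholder; see the command reference)
    """
    commands = []
    for plane in sorted(set(planes)):
        if plane not in (0, 1):
            raise ValueError(f"invalid switch plane: {plane}")
        # One invocation per plane: both planes cannot share an invocation.
        commands.append(["switch_stress", "-n", str(plane), *other_args])
    return commands


if __name__ == "__main__":
    for cmd in build_plane_commands([0, 1]):
        print(" ".join(cmd))
```

Each returned list could be handed to a process launcher such as subprocess.run once the real operands have been filled in.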
Run this test when you suspect that a switch chip is faulty because the system error log on the primary node contains error status reports from that switch chip. The test checks the functionality of the switch chip and its attached links, to help you decide whether the switch chip should be replaced.
First, decide which switch chip you want to test. Usually this is a switch chip that is reporting errors. Then, decide which nodes to use for the test. These nodes cannot run parallel applications during the test. By default, all nodes are allowed, so be careful to avoid disturbing applications that are running. Nodes that are not allowed are not affected directly. However, because the test generates stress traffic in the switch network, the performance of applications running on any node may be affected.
Invoke the test, specifying the desired switch chip ID. Also specify the nodes that can participate in the test, or alternatively the nodes that are forbidden (because they are running critical applications).
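The node selection described above can be sketched as a small set computation. This Python sketch is illustrative only: the function name participating_nodes and its parameters are hypothetical, and treating the allowed and forbidden lists as mutually exclusive is an assumption based on the word "alternatively", not a documented rule of the command.

```python
def participating_nodes(all_nodes, allowed=None, forbidden=None):
    """Return the sorted list of nodes that may take part in the test.

    all_nodes -- every node in the system
    allowed   -- nodes that can participate (hypothetical parameter)
    forbidden -- nodes that must not participate (hypothetical parameter)
    """
    if allowed is not None and forbidden is not None:
        # Assumption: give one list or the other, not both.
        raise ValueError("specify allowed or forbidden nodes, not both")
    nodes = set(all_nodes)
    if allowed is not None:
        return sorted(nodes & set(allowed))
    if forbidden is not None:
        return sorted(nodes - set(forbidden))
    # Default described in the text: all nodes are allowed.
    return sorted(nodes)
```

Listing the result before invoking the test makes it easier to confirm that no node running a critical application is included.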
The test does not require user intervention. It runs several iterations, each one using a different combination of switch chip ports. In each iteration, the test sends data through the ports under test. At the beginning of each iteration, the test reports which nodes are participating. At the end of the iteration, it displays these statistics: the number of packets that were sent and lost, and the number of switch errors (reported from all switch chips that the iteration used for sending data).
Each error reported by a switch chip is displayed. In a stable system, there should be few or no such reports. If the test succeeds in causing a critical fault on one of the switch chips, it concludes that its goal has been achieved: the faulty component has been isolated. In this case, the test displays an appropriate message and terminates. Contact IBM hardware support to replace the faulty component.
Otherwise, the test displays the statistics and continues to the next iteration. If the test caused no critical faults but did cause some failures that were recovered, this does not necessarily mean that a hardware component should be replaced. However, the result indicates a possible cause of problems, so contact IBM hardware support in this case as well.
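The way an iteration's outcome maps to a follow-up action can be summarized in a short sketch. This Python example is not the test's actual logic or output format: the IterationResult record, its field names, and the next_action function are hypothetical, and treating any lost packet or switch error as a recovered failure is a simplifying assumption.

```python
from dataclasses import dataclass


@dataclass
class IterationResult:
    """Hypothetical record of the statistics shown after one iteration."""
    packets_sent: int
    packets_lost: int
    switch_errors: int      # errors reported by all chips used in the iteration
    critical_fault: bool    # True if the iteration caused a critical fault


def next_action(result):
    """Map one iteration's outcome to the follow-up described in the text."""
    if result.critical_fault:
        # The test terminates: the faulty component has been isolated.
        return "stop: faulty component isolated; contact IBM hardware support"
    if result.packets_lost > 0 or result.switch_errors > 0:
        # Assumption: any loss or error counts as a recovered failure.
        return "continue: recovered failures seen; contact IBM hardware support"
    return "continue: no errors in this iteration"
```

A stricter or looser threshold for recovered failures may be appropriate, since a stable system can still produce a few error reports.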