IBM Books

Diagnosis Guide


Error symptoms, responses, and recoveries

Use the following table to diagnose problems with the Job Switch Resource Table (JSRT) Services component of PSSP. Locate the symptom, then perform the recovery actions listed for it.

Table 65. Job Switch Resource Table (JSRT) Services symptoms

Symptom
    Recovery

Cannot load or unload a Job Switch Resource Table (JSRT) on a node.
    "Action 1 - Verify JSRT Services installation"
    "Action 2 - Check the JSRT services log file"
    "Action 3 - Request more detailed log information"
    "Action 5 - Check the switch_node_number file"
    "Action 6 - Check the current status of JSRT Services for a node"

Cannot obtain the status of a JSRT window.
    "Action 1 - Verify JSRT Services installation"
    "Action 2 - Check the JSRT services log file"
    "Action 4 - Check the JSRT Services data files"

Cannot run the switchtbld daemon.
    "Action 1 - Verify JSRT Services installation"
    "Action 2 - Check the JSRT services log file"

st_status command hangs. No messages are displayed.
    "Action 7 - Check communication paths for affected nodes"

Cannot load program st_status.
    "Action 8 - Check the LIBPATH environment variable"

st_status returns ST_SYSTEM_ERROR or ST_LOADED_BYOTHER for a window.
    "Action 9 - Issue the st_clean_table command"

st_status returns ST_RESERVED status for a window.
    "Action 10 - Issue the chgcss command"
    "Action 3 - Request more detailed log information"

The st_status command runs in DCE and compatibility mode and issues the message: swtbl_status: 2511-508 Time out while waiting for a response from host hostname.
    "Action 13 - Ensure that the client and master DCE servers are running."

API returned ST_NOT_AUTHEN, ST_NOT_AUTHOR or ST_SECURITY_ERROR.
    "Action 2 - Check the JSRT services log file"
    "Action 3 - Request more detailed log information"
    "Action 12 - Ensure that the caller is a member of the appropriate DCE security group"

Actions

Action 1 - Verify JSRT Services installation

When issued on the control workstation, the st_verify script checks the installation of the JSRT Services on every node that is defined in the current system partition. When issued on a single node, it verifies the installation of the JSRT Services on only that node.

The user must be logged in as root in order to perform this action. Run installation verification tests using SMIT, or from the command line, to ensure that installation is complete.

Using SMIT:

TYPE    smit SP_verify
        (the Installation/Configuration Menu appears)
SELECT  Job Switch Resource Table Services Installation
PRESS   Enter

Using the command line, enter: /usr/lpp/ssp/bin/st_verify

The st_verify script checks that the correct files and directories were installed and that the necessary entries exist in the files. The files and directories are /etc/services, /etc/inittab, and /etc/inetd.conf.

Installation verification test output

The st_verify script produces an output log, located in /var/adm/SPlogs/st/st_verify.log (by default) or in a location that you specify. After completion, a message is written to stdout stating whether the verification passed or failed. If a failure occurred, examine the log for a list of the errors that were found.

Good results are indicated when a message similar to the following is written to stdout:

Verifying installation of the Job Switch Resource Table Services on node 0.
JSRT Services installation verification SUCCESSFUL on node 0.
Check /var/adm/SPlogs/st/st_verify.log file for further details.
 

Error results are indicated when a message similar to the following is written to stdout:

Verifying installation of the Job Switch Resource Table Services on node 6.
JSRT Services installation verification FAILED on node 6.
	Total of 1 errors found on node 6.

If the test fails, check the /var/adm/SPlogs/st/st_verify.log file (or specified log) for further details about which files or directories are missing. Reinstall the ssp.st file set if necessary.
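The pass or fail check can be scripted. The following is a minimal sketch that relies only on the SUCCESSFUL and FAILED wording of the sample messages; the function name st_verify_result is hypothetical and is not part of PSSP.

```shell
# Classify st_verify stdout as PASS, FAIL, or UNKNOWN.
# st_verify_result is a hypothetical helper name, not a PSSP command.
# It matches only the wording shown in the sample messages.
st_verify_result() {
    awk '
        /installation verification FAILED/     { failed = 1 }
        /installation verification SUCCESSFUL/ { passed = 1 }
        END {
            # Any failing node makes the whole run a failure.
            if (failed)      print "FAIL"
            else if (passed) print "PASS"
            else             print "UNKNOWN"
        }'
}

# Usage (as root):
#   /usr/lpp/ssp/bin/st_verify | st_verify_result
```

A FAIL or UNKNOWN result is a cue to read the st_verify.log file, which remains the authoritative record.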

Action 2 - Check the JSRT services log file

The JSRT Services maintain a single log file, st_log, in the /var/adm/SPlogs/st directory. The log exists on every node where the services are used. For example, if the swtbl_load_job API is used, entries are found on the local node where the API was invoked, and entries are also found on the nodes that were being loaded by the swtbl_load_job API.

Examine the logs and correct any obvious problems that have been identified.

The following table indicates return codes that may appear in the log. They are defined in /usr/lpp/ssp/include/st_client.h.

If a return code of 2 (ST_NOT_AUTHOR), 3 (ST_NOT_AUTHEN), or 20 (ST_SECURITY_ERROR) appears in the log, see Diagnosing SP Security Services problems and Diagnosing Per Node Key Management (PNKM) problems.

Table 66. JSRT Services return codes

Return code  Name                  Explanation
0            ST_SUCCESS            The service request was successful.
1            ST_INVALID_TASK_ID    An invalid task ID is specified as input.
2            ST_NOT_AUTHOR         The caller is not authorized to perform the service.
3            ST_NOT_AUTHEN         The caller is not authenticated to perform the service.
4            ST_SWITCH_IN_USE      The JSRT is already loaded or in use.
5            ST_SYSTEM_ERROR       A system error occurred.
6            ST_SDR_ERROR          An SDR error occurred.
7            ST_CANT_CONNECT       The connect system call failed.
8            ST_NO_SWITCH          No css device is installed.
9            ST_INVALID_PARAM      An invalid parameter was specified as input.
10           ST_INVALID_ADDR       The inet_ntoa command failed on the st_addr input value.
11           ST_SWITCH_NOT_LOADED  No JSRT is currently loaded.
12           ST_UNLOADED           A previously successful load was unloaded because of an error.
13           ST_NOT_UNLOADED       No unload request was issued.
14           ST_NO_STATUS          No status request was issued.
15           ST_DOWNON_SWITCH      The node is down on the switch.
16           ST_ALREADY_CONNECTED  The node has already been connected to, and a load request made by a previous st_node_info structure.
17           ST_LOADED_BYOTHER     The JSRT was loaded outside of the API.
18           ST_SWNODENUM_ERROR    An error occurred when processing the switch node number.
19           ST_SWITCH_DUMMY       For testing purposes.
20           ST_SECURITY_ERROR     DCE security error.
21           ST_TCP_ERROR          Error using TCP/IP.
22           ST_CANT_ALLOC         Cannot allocate storage.
23           ST_OLD_SECURITY       Old security method used.
24           ST_NO_SECURITY        No security methods used.
25           ST_RESERVED           Window reserved outside of API.
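
When reading st_log, it can help to translate a numeric return code into its symbolic name. The following is a minimal sketch of such a lookup; the function name st_rc_name is hypothetical, and the mapping is copied from Table 66 rather than read from st_client.h.

```shell
# Translate a JSRT Services return code into its symbolic name.
# st_rc_name is a hypothetical helper; the values are copied from Table 66.
st_rc_name() {
    case "$1" in
        0)  echo ST_SUCCESS ;;
        1)  echo ST_INVALID_TASK_ID ;;
        2)  echo ST_NOT_AUTHOR ;;
        3)  echo ST_NOT_AUTHEN ;;
        4)  echo ST_SWITCH_IN_USE ;;
        5)  echo ST_SYSTEM_ERROR ;;
        6)  echo ST_SDR_ERROR ;;
        7)  echo ST_CANT_CONNECT ;;
        8)  echo ST_NO_SWITCH ;;
        9)  echo ST_INVALID_PARAM ;;
        10) echo ST_INVALID_ADDR ;;
        11) echo ST_SWITCH_NOT_LOADED ;;
        12) echo ST_UNLOADED ;;
        13) echo ST_NOT_UNLOADED ;;
        14) echo ST_NO_STATUS ;;
        15) echo ST_DOWNON_SWITCH ;;
        16) echo ST_ALREADY_CONNECTED ;;
        17) echo ST_LOADED_BYOTHER ;;
        18) echo ST_SWNODENUM_ERROR ;;
        19) echo ST_SWITCH_DUMMY ;;
        20) echo ST_SECURITY_ERROR ;;
        21) echo ST_TCP_ERROR ;;
        22) echo ST_CANT_ALLOC ;;
        23) echo ST_OLD_SECURITY ;;
        24) echo ST_NO_SECURITY ;;
        25) echo ST_RESERVED ;;
        *)  echo "unknown return code: $1" >&2; return 1 ;;
    esac
}

# Usage:
#   st_rc_name 17
```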

Action 3 - Request more detailed log information

To facilitate debugging, you can set several environment variables before invoking a JSRT Service. The variables provide more detailed information in the st_log file.

The first environment variable is SWTBLAPIERRORMSGS, and it must be set to yes within the caller's environment.

For example, as a ksh user, enter:

export SWTBLAPIERRORMSGS=yes

This is an example of the more detailed log information for a call to swtbl_load_table:

Thu Jun 18 10:12:22 1998: swtbl_load_table: INPUT PARAMETERS: uid - 0
pid - 19118 job_key - 1 requestor_node - k10n11.ppd.pok.ibm.com
num_tasks - 1 job_desb - load client test
Thu Jun 18 10:12:22 1998: swtbl_load_table: INPUT PARAMETERS virtual
 task id=0 switch_node_number=10 window_id=1
 

The second environment variable is SWTBLSECMSGLEVEL. It can be set to an integer value that represents a level of detail. See Table 67. If a value greater than 4 is used, the highest level of tracing (4) is assumed.

Table 67. SWTBLSECMSGLEVEL environment variable values

Value  Level of detail
0      No tracing is performed.
1      Trace error conditions.
2      Trace significant flow or data.
3      Trace calls and returns from function.
4      Trace flows and data - maximum detail.

For example, as a ksh user, enter:

export SWTBLSECMSGLEVEL=4

This is an example of the more detailed log information for a call to swtbl_load_job:

Wed Aug 25 09:43:25 1999: Ssobj: spsec_authenticate_client failed - 
cl=c187n12.ppd.pok.ibm.com, gr=switchtbld-status, inf=2502-611
 An argument is missing or not valid.
Wed Aug 25 09:43:25 1999: Ssobj::addError(eC=7, s=3, r=0, e=0, Error 0,
 ex=0, er=0, sH = c187n12.ppd.pok.ibm.com
Wed Aug 25 09:43:25 1999: Ssobj::addError(g=switchtbld-status, sE=1)
Wed Aug 25 09:43:25 1999: Ssobj::addError(sA=2502-611 An argument is missing
 or not valid.)
Wed Aug 25 09:43:25 1999: Ssobj::addError() - Return 7, p=201322e8
Wed Aug 25 09:43:25 1999: Ssobj::returnErrors(s=3, 
 host=c187n12.ppd.pok.ibm.com)
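
Both variables can be set for a single command so that detailed logging does not stay enabled in the login shell. The following is a minimal sketch; the wrapper name jsrt_debug_run is hypothetical, and it assumes that the JSRT Service being invoked reads the two variables from its own environment, as described above.

```shell
# Run one command with both JSRT debug variables set.
# jsrt_debug_run is a hypothetical wrapper name, not a PSSP command.
# The subshell keeps the exports from leaking into the calling shell.
jsrt_debug_run() {
    (
        export SWTBLAPIERRORMSGS=yes   # detailed API messages in st_log
        export SWTBLSECMSGLEVEL=4      # maximum security trace detail
        "$@"
    )
}

# Usage: wrap whatever program invokes the JSRT Service, for example
#   jsrt_debug_run st_status k10n15
```

After the command completes, examine st_log on the affected nodes for the additional entries.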
 

Action 4 - Check the JSRT Services data files

The JSRT Services maintain a set of data files in the /spdata/sys1/st directory on every node. Verify that the directory exists and that the files have root access. Note that the files are not created until the load or unload services have been invoked, and that they are removed when a node is rebooted.
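
The directory check can be scripted. The following is a minimal sketch; the function name check_st_data_dir is hypothetical, and it assumes that "root access" means the files are owned by root, which this guide does not state explicitly.

```shell
# Report problems with a JSRT data directory.
# check_st_data_dir is a hypothetical helper name, not a PSSP command.
# Returns 0 when the directory exists and every regular file is owned by root.
check_st_data_dir() {
    dir=${1:-/spdata/sys1/st}
    if [ ! -d "$dir" ]; then
        echo "missing directory: $dir"
        return 1
    fi
    # Assumption: "root access" is interpreted as ownership by root.
    not_root=$(find "$dir" -type f ! -user root -print)
    if [ -n "$not_root" ]; then
        echo "files not owned by root:"
        echo "$not_root"
        return 1
    fi
    return 0
}

# Usage:
#   check_st_data_dir            # checks /spdata/sys1/st
#   check_st_data_dir /some/dir  # checks another directory
```

An empty directory passes the check, because the data files do not exist until a load or unload service has been invoked.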

Action 5 - Check the switch_node_number file

The /spdata/sys1/st/switch_node_number file contains a single integer that represents the switch node number of the node. This file can be read using an AIX editor, or by issuing the cat command. The integer in the file should match the switch_node_number attribute in the SDR Node class for that node. Run the /usr/lpp/ssp/bin/st_set_switch_number installation script on every node to create the switch_node_number file and set the correct value.

The following messages in the st_log indicate that there is a problem with the switch node number:

  1. Error reading the /spdata/sys1/st/switch_node_number file. Failed to get the switch node number for this node. Issue the command:
    /usr/lpp/ssp/bin/st_set_switch_number 
    
  2. The switch node number read from the /spdata/sys1/st/switch_node_number file is invalid, switch_number=number from file.
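
The comparison between the file and the SDR can be sketched as follows. The function name check_switch_node_number is hypothetical. The expected value is passed in as an argument and is assumed to be an integer taken from the switch_node_number attribute of the SDR Node class for the node.

```shell
# Compare the integer in a switch_node_number file with an expected value.
# check_switch_node_number is a hypothetical helper name, not a PSSP command.
# Returns 0 on a match, 1 on a mismatch, 2 if the file is unreadable or invalid.
check_switch_node_number() {
    file=$1
    expected=$2    # assumed to be an integer obtained from the SDR
    if [ ! -r "$file" ]; then
        echo "cannot read $file"
        return 2
    fi
    # Strip blanks and newlines so that only the number remains.
    actual=$(tr -d ' \t\n' < "$file")
    case "$actual" in
        ''|*[!0-9]*)
            echo "not an integer: '$actual'"
            return 2 ;;
    esac
    if [ "$actual" -ne "$expected" ]; then
        echo "mismatch: file has $actual, expected $expected"
        return 1
    fi
    echo "ok: $actual"
}

# Usage:
#   check_switch_node_number /spdata/sys1/st/switch_node_number 10
```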

Action 6 - Check the current status of JSRT Services for a node

The st_status command shows you the current status of the JSRT windows on all nodes or on the node specified. This tells you whether the JSRT windows are loaded, unloaded, reserved by another subsystem, or in error.

To show the status of JSRT windows on all nodes within the current system partition, issue st_status.

To show the status of all JSRT windows on node k10n15, issue: st_status k10n15. Output similar to the following appears:

       *************************************************************
       Status from node: k10n15  User: root
       Load request from: k10n15 Pid: 12494 Uid: 0
       Job Description: No_job_description_given
       Time of request:Wed_Jan_24_13:38:21_EDT_1998
       Adapter: /dev/css0 Memory Allocated: 10000 
       Window id: 0
       *************************************************************
       Node k10n15 adapter /dev/css0 window 1 returned ST_RESERVED.
       Window 1 is RESERVED by VSD.
       *************************************************************
       Node k10n15 adapter /dev/css0 window 2 returned ST_SWITCH_NOT_LOADED
       *************************************************************
       Node k10n15 adapter /dev/css0 window 3 returned ST_SWITCH_NOT_LOADED
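
On a system with many nodes, the output can be reduced to one line per window. The following is a minimal sketch that relies only on the "Node ... window ... returned ..." line layout shown in the sample output; the function name st_status_summary is hypothetical. Loaded windows, which appear as a multiline block instead of a "returned" line, are not summarized.

```shell
# Print "node window status" for each line that reports a window status code.
# st_status_summary is a hypothetical helper name, not a PSSP command.
st_status_summary() {
    awk '$1 == "Node" && $5 == "window" && $7 == "returned" {
        status = $8
        sub(/\.$/, "", status)   # drop the trailing period that some lines carry
        print $2, $6, status
    }'
}

# Usage:
#   st_status k10n15 | st_status_summary
```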
 

Action 7 - Check communication paths for affected nodes

If the st_status command hangs, it is trying to contact a node that has lost its TCP/IP communication path. If the command is issued without any arguments, it tries to contact every node in the current SP system partition. If one of the nodes is down on the Ethernet, the command does not return until the TCP/IP timeout value has been reached.

Issue the SDRGetObjects command to determine the nodes in the current SP system partition. For example:

SDRGetObjects Node reliable_hostname

Issue the AIX ping command using the names returned under reliable_hostname from the SDRGetObjects command to determine connectivity.

Now, reissue the st_status command with a specific list of node names.
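
The steps above can be combined into a filter that keeps only the nodes that respond. The following is a minimal sketch; the function name reachable_hosts is hypothetical. It assumes that the SDRGetObjects output begins with a reliable_hostname column header, which the filter skips, and that a single ping (ping -c 1) is an acceptable connectivity probe.

```shell
# Read host names on stdin and print those that answer a probe.
# reachable_hosts is a hypothetical helper name, not a PSSP command.
# The probe defaults to one ping; pass another command as arguments to override it.
reachable_hosts() {
    if [ $# -eq 0 ]; then
        set -- ping -c 1
    fi
    while read -r host; do
        # Skip blank lines and the column header assumed in SDRGetObjects output.
        case "$host" in
            ''|reliable_hostname) continue ;;
        esac
        # Keep the probe away from the loop's stdin and discard its output.
        if "$@" "$host" </dev/null >/dev/null 2>&1; then
            echo "$host"
        fi
    done
}

# Usage:
#   hosts=$(SDRGetObjects Node reliable_hostname | reachable_hosts)
#   [ -n "$hosts" ] && st_status $hosts
```

The usage example skips st_status when no node responds, because st_status without arguments would again try to contact every node.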

Action 8 - Check the LIBPATH environment variable

The st_status command is compiled with a LIBPATH of /usr/lpp/ssp/lib:/usr/lib:/lib. If your LIBPATH conflicts with this, then issue the command as follows:

LIBPATH=/usr/lpp/ssp/lib st_status

Action 9 - Issue the st_clean_table command

The st_clean_table command forces the unload of the Job Switch Resource Table for a specified window on the specified node. See the st_clean_table command man page for more details.

Action 10 - Issue the chgcss command

The chgcss command can be used to view and manipulate reserved user space windows. See the chgcss command man page for more details.

Action 11 - Call the swtbl_adapter_resources API to obtain valid values

The swtbl_adapter_resources API returns the configured resources of the specified adapter on the node from which it is invoked. See the swtbl_adapter_resources API man page for more details.

Action 12 - Ensure that the caller is a member of the appropriate DCE security group

These are the DCE security groups defined for the JSRT Services:

Table 68. DCE security groups for JSRT Services

Function              Affected services       DCE group
clean window          swtbl_clean_table API   ssp/switchtbld/clean
                      st_clean_table command
load or unload table  swtbl_load_job          ssp/switchtbld/load
                      swtbl_unload_job
status                swtbl_status            ssp/switchtbld/status
                      st_status

Action 13 - Ensure that the client and master DCE servers are running.

If the client DCE servers are running and the master DCE server is not, a timeout can occur even if DCE and compatibility mode is specified. Either stop the client DCE servers by issuing the stopsrc command, or start the master DCE server by issuing the startsrc command.

