IBM Books

Diagnosis Guide


Contents

  • Figures

  • Tables

  • About this book
  • Who should use this book
  • Typographic conventions

  • Detecting and investigating PSSP problems

  • Diagnosing SP problems overview
  • How to use this book
  • SP systems and PSSP software supported by this book
  • When to use this book
  • Special troubleshooting considerations
  • Essential documentation - Other manuals to accompany this book
  • Preparing for your first problem before it happens
  • Knowing your SP structure and setup
  • Making effective use of the IBM Support Center
  • When to contact the IBM Support Center
  • Information to collect before contacting the IBM Support Center
  • How to contact the IBM Support Center
  • Detecting SP problems and keeping informed
  • Runtime notification methods
  • Graphical tools - SP Perspectives
  • Command line tools
  • Asynchronous (batch) notification methods
  • Graphical tools - SP Event Perspective
  • Command line tools - Problem Management
  • Automating your response to problems
  • Important - WHEN actions are performed
  • Important - WHERE actions are performed
  • Graphical tools - SP Event Perspective
  • Command line tools - Problem Management
  • Conditions to monitor on the SP system
  • Conditions to monitor using Perspectives or Problem Management
  • Monitor these hardware conditions
  • Monitor these software conditions
  • Descriptions of each condition
  • Preparing to examine and monitor this information
  • SP Event Perspective - conditions that you can monitor using the default event definition
  • SP Event Perspective - conditions that you can monitor that you must define to the SP Event Perspective
  • SP Event Perspective - creating the event definitions
  • SP Hardware Perspective
  • Problem Management
  • Error logging overview
  • Classifying Error Log events
  • Effect of not having a battery on error logging
  • Managing and monitoring the error log
  • Viewing error log information in parallel
  • Summary log for SP Switch, SP Switch2, and switch adapter errors
  • Viewing SP Switch error log reports
  • Using the AIX Error Notification Facility
  • Using the SP logs
  • Producing a system dump
  • Actions
  • Action 1. Produce a system dump
  • Action 2. Verify the system dump
  • Diagnosing hardware and software problems
  • High-Level SP symptoms

  • Diagnosing PSSP subsystems

  • Diagnosing NIM problems
  • Related documentation
  • Requisite function
  • Error information
  • Trace information
  • NIM debug SPOT
  • NIM SPOT logs
  • Information to collect before contacting the IBM Support Center
  • Error symptoms, responses, and recoveries
  • Actions
  • Network installation progress

  • Diagnosing node installation problems
  • Related documentation
  • Requisite function
  • Error information
  • Trace information
  • Post-installation customization trace
  • Nodecond log
  • Information to collect before contacting the IBM Support Center
  • Error symptoms, responses, and recoveries
  • Actions
  • Diagnosing boot problems
  • Related documentation
  • Error information
  • Node LED/LCD indicators
  • Trace information
  • Console log
  • Information to collect before contacting the IBM Support Center
  • Error symptoms, responses, and recoveries
  • Actions
  • Diagnosing Root Volume Group problems
  • Actions
  • Action 1 - Check disks
  • Action 2 - Check disk allocation
  • Action 3 - Force the Root Volume Group extension
  • Action 4 - Unlock the Root Volume Group
  • Action 5 - Add space to physical volumes
  • Action 6 - Add physical volumes to the Root Volume Group
  • Action 7 - Verify the number of copies of AIX on the node for mirroring
  • Action 8 - Verify the number of copies of AIX on the node for unmirroring
  • Action 9 - Check for user logical volumes on the physical volume
  • Action 10 - Verify mirroring or unmirroring
  • Root Volume Group terminology
  • Diagnosing system connectivity problems
  • Actions
  • Action 1 - Diagnose multiple nodes
  • Action 2 - Diagnose individual nodes
  • Action 3 - Diagnose a network problem
  • Action 4 - Diagnose a Topology-related problem
  • Diagnosing IP routing problems
  • IP source routing
  • Diagnosing SDR problems
  • Related documentation
  • Requisite function
  • Error information
  • Trace information
  • SDR daemon trace
  • Information to collect before contacting the IBM Support Center
  • Diagnostic procedures
  • Check current system partition
  • Query the state of the sdrd
  • Stop and start the sdrd
  • Check the sdrd processes using the ps command
  • Check for sdrd memory leaks and CPU utilization
  • Check for an sdrd hang
  • Check for an sdrd core dump
  • SDR verification test
  • Look for MBCS data in the SDR
  • Error symptoms, responses, and recoveries
  • Actions
  • Diagnosing SP Switch problems
  • Related documentation
  • Requisite function
  • Internal SP Switch subsystem components
  • Error information
  • AIX Error Log information
  • Adapter configuration error information
  • SP Switch device and link error information
  • Dump information
  • errpt.out
  • css_dump.out
  • fs_dump.out
  • cssadm.debug
  • cssadm.stderr
  • cssadm.stdout
  • regs.out
  • router.log
  • CSS_test.log
  • vdidl.out
  • spdata.out
  • netstat.out
  • scan_out.log and scan_save.log
  • Trace information
  • flt
  • rc.switch.<PID>.<date>.<time>
  • rc.switch.log
  • out.top
  • dist_topology.log
  • worm.trace
  • fs_daemon_print.file
  • daemon.stdout
  • logevnt.out
  • summlog.out
  • daemon.stderr
  • dtbx.trace
  • dtbx.failed.trace
  • cable_miswire
  • css.snap.log
  • Ecommands.log
  • spd.trace
  • Missing error data warning
  • Information to collect before contacting the IBM Support Center
  • css.snap package
  • Log files within css.snap package
  • Disk space handling policy
  • Diagnostic procedures
  • SP Switch diagnostics
  • Clock diagnostics
  • SP Switch adapter diagnostics
  • Cable diagnostics
  • SP Switch node diagnostics
  • SP Switch advanced diagnostics
  • Error symptoms, responses, and recoveries
  • SP Switch symptoms and recovery actions
  • Recover an SP Switch node
  • Worm error recovery
  • Ecommands error recovery
  • Estart error recovery
  • Unfence an SP Switch node
  • Eunfence error recovery
  • switch_responds is still on after node panic
  • Diagnosing SP Switch2 problems
  • Related documentation
  • Requisite function
  • Internal SP Switch2 subsystem components
  • Error information
  • SP Switch2 log and temporary file hierarchy
  • plane.info file
  • AIX Error Log information
  • Adapter configuration error information
  • SP Switch2 device and link error information
  • Dump information
  • errpt.out
  • cadd_dump.out
  • ifcl_dump.out
  • logevnt.out
  • summlog.out
  • col_dump.out
  • cssadm2.debug
  • cssadm2.stderr
  • cssadm2.stdout
  • regs.out
  • router.log
  • CSS_test.log
  • odm.out
  • spdata.out
  • netstat.out
  • scan_out.log and scan_save.log
  • DeviceDB.dump
  • Trace information
  • rc.switch.log
  • daemon.stdout
  • daemon.log
  • adapter.log
  • flt
  • fs_daemon_print.file
  • out.top
  • topology.data
  • cable_miswire
  • css.snap.log
  • colad.trace
  • spd.trace
  • Ecommands.log
  • emasterd.log
  • emasterd.stdout
  • chgcss.log
  • la_error.log
  • la_event_d.trace
  • Missing error data warning
  • Information to collect before contacting the IBM Support Center
  • css.snap package
  • Log files within css.snap package
  • Disk space handling policy
  • Diagnostic procedures
  • SP Switch2 diagnostics
  • SP Switch2 adapter diagnostics
  • Cable diagnostics
  • SP Switch2 node diagnostics
  • SP Switch2 Time Of Day (TOD) diagnostics
  • SP Switch2 advanced diagnostics
  • Error symptoms, responses, and recoveries
  • SP Switch2 symptoms and recovery actions
  • Recover an SP Switch2 node
  • Worm error recovery
  • Ecommands error recovery
  • Estart error recovery
  • Unfence an SP Switch2 node
  • Eunfence error recovery
  • switch_responds is still on after node panic
  • SP Switch and SP Switch2 advanced diagnostic tools
  • Adapter Error Log Analyzer (ELA)
  • When to run the adapter ELA
  • How to run the adapter ELA
  • Interpreting the results of the adapter ELA
  • SP Switch or SP Switch2 stress test
  • When to run the SP Switch or SP Switch2 stress test
  • How to run the SP Switch or SP Switch2 stress test
  • Interpreting the results of the SP Switch or SP Switch2 stress test
  • Multiple senders/single receiver test
  • When to run the multiple senders/single receiver test
  • How to run the multiple senders/single receiver test
  • Interpreting the results of the multiple senders/single receiver test
  • SP Switch or SP Switch2 wrap test
  • When to run the SP Switch or SP Switch2 wrap test
  • How to run the SP Switch or SP Switch2 wrap test
  • Interpreting the results of the SP Switch or SP Switch2 wrap test
  • Diagnosing SP Security Services problems
  • Related documentation
  • Requisite function
  • Distributed Computing Environment (DCE) Version 3.1 for AIX
  • System Data Repository (SDR)
  • Error information
  • Log files
  • Dump information
  • Information to collect before contacting the IBM Support Center
  • SP Security Services configuration errors
  • SP trusted services authentication errors
  • SP trusted services authorization errors
  • Diagnostic procedures
  • Find out about your configuration
  • Verify DCE Security Services configuration
  • Verify configuration of Kerberos V4
  • Check enabled security
  • Check authentication
  • Check authorization
  • Error symptoms, responses, and recoveries
  • Actions
  • Diagnosing Per Node Key Management (PNKM) problems
  • Related documentation
  • Requisite function
  • Error information
  • AIX Error Log
  • Debug information
  • Information to collect before contacting the IBM Support Center
  • Diagnostic procedures
  • Has the spnkeyman daemon been started? Is it sleeping?
  • Are key files for the SP System Services on the host where spnkeyman is running?
  • Have the password expiration times been changed in the SP DCE organization, spsec-services, to indicate when keys should expire?
  • Have the SP server keys expired?
  • Was the password lifetime expiration time set to less than 24 hours?
  • Has the SP DCE organization, spsec-services, been created, and have the SP service principals been added to that organization?
  • Is the DCE registry populated with the principals and accounts for SP Services?
  • Is DCE installed on the host where spnkeyman is running?
  • Are the DCE daemons running on the host where spnkeyman is running?
  • Is there a route for the DCE client daemons to access the DCE server daemons?
  • Is the DCE secd daemon running on the DCE server host?
  • Error symptoms, responses, and recoveries
  • Diagnosing remote command problems on the SP System
  • Enhanced Security Option
  • Things to be aware of when using Restricted Root Access (RRA)
  • Using secure remote commands instead of AIX rsh and rcp commands
  • Related documentation
  • Requisite function
  • Using AIX rsh and rcp commands
  • Error information
  • Information to collect before contacting the IBM Support Center
  • AIX remote command diagnostics
  • Diagnostics specific to Kerberos V4
  • Diagnostics specific to Kerberos V5
  • Error symptoms, responses, and recoveries
  • Remote command (rsh/rcp) symptoms
  • Remote commands (rsh/rcp) recovery actions
  • Using secure remote commands - symptoms
  • Using secure remote commands - recovery actions
  • Diagnosing System Monitor problems
  • Related documentation
  • Requisite function
  • Error information
  • AIX Error Logs and templates
  • System Monitor daemon log file
  • Dump information
  • Trace information
  • Information to collect before contacting the IBM Support Center
  • Diagnostic procedures
  • Installation verification tests
  • Configuration verification tests
  • Operational verification tests
  • Error symptoms, responses, and recoveries
  • Actions
  • Diagnosing SP Logging daemon problems
  • Related documentation
  • Requisite function
  • Error information
  • AIX Error Log
  • hwevents file
  • splogd.debug file
  • Dump information
  • Information to collect before contacting the IBM Support Center
  • Diagnostic procedures
  • Installation verification tests
  • Configuration verification tests
  • Operational verification tests
  • Error symptoms, responses, and recoveries
  • Actions
  • Diagnosing Topology Services problems
  • Related documentation
  • Requisite function
  • Error information
  • AIX Error Logs and templates
  • Dump information
  • Core dump
  • phoenix.snap dump
  • Trace information
  • Topology Services service log
  • Topology Services user log
  • hats or topsvcs script log
  • Network Interface Module (NIM) log
  • Information to collect before contacting the IBM Support Center
  • Diagnostic procedures
  • Installation verification test
  • Configuration verification tests
  • Operational verification tests
  • Error symptoms, responses, and recoveries
  • Actions
  • Diagnosing Group Services problems
  • Related documentation
  • Requisite function
  • Error information
  • AIX Error Logs and templates
  • Dump information
  • Core dump
  • phoenix.snap dump
  • Trace information
  • GS service log trace
  • GS service log trace - summary log
  • Group Services startup script log
  • Information to collect before contacting the IBM Support Center
  • How to find the GS nameserver (NS) node
  • How to find the Group Leader (GL) node for a specific group
  • Diagnostic procedures
  • Installation verification test
  • Configuration verification test
  • Operational verification tests
  • Error symptoms, responses, and recoveries
  • Actions
  • Diagnosing Event Management problems
  • Related documentation
  • Requisite function
  • Internal components of Event Management
  • Resource monitors
  • Service information
  • Automatic method - phoenix.snap
  • Manual method of data collection
  • Error information
  • AIX Error Log for Event Management
  • Error log files
  • Event Management daemon errors
  • EMCDB problems
  • Resource Monitor problems
  • Dump information
  • Dump information from Event Management Resource Monitor daemons
  • Trace information
  • Trace facility built into Event Management
  • Information to collect before contacting the IBM Support Center
  • Diagnostic instructions
  • Verify SP software installation
  • Identify the failing node
  • Verify event registration
  • Verify Resource Monitors
  • Error symptoms, responses, and recoveries
  • Recover crashed node (Event Management Resource Monitor daemons)
  • Recover EMCDB
  • Security errors
  • Diagnosing IBM Virtual Shared Disk problems
  • Related documentation
  • Requisite function
  • Before you begin
  • Error information
  • Errors logged by the IBM Virtual Shared Disk device driver and IBM Recoverable Virtual Shared Disk subsystem
  • Dump information
  • Trace information
  • Internal virtual shared disk device driver circular trace buffer
  • IBM Recoverable Virtual Shared Disk subsystem tracing to the console log
  • IBM Recoverable Virtual Shared Disk subsystem logging of recovery actions
  • Information to collect before contacting the IBM Support Center
  • Diagnostic procedures
  • Installation verification test
  • Configuration verification tests
  • Operational verification tests
  • Error symptoms, responses, and recoveries
  • Recognizing recovery
  • Planning for recovery
  • Virtual Shared Disk node failure
  • Switch failure scenarios
  • Topology Services or recovery service daemon failure
  • Disk EIO errors
  • Diagnosing Job Switch Resource Table Services problems
  • Related documentation
  • Requisite function
  • Error information
  • Job Switch Resource Table Services error and information log
  • AIX Error Logs and templates for JSRT Services
  • Information to collect before contacting the IBM Support Center
  • Diagnostic procedures - Installation verification
  • Error symptoms, responses, and recoveries
  • Actions
  • Diagnosing User Access problems
  • Actions
  • Action 1 - Check the /etc/security/passwd file
  • Action 2 - Check Login Control
  • Action 3 - Verify that the automount daemon is running
  • Stopping and restarting automount
  • Verifying System Management installation
  • Verification test output
  • What system management verification checks
  • Objects tested by SYSMAN_test on the control workstation only
  • Objects tested by SYSMAN_test on the control workstation and boot/install servers
  • Objects tested by SYSMAN_test on all SP nodes (not the control workstation)
  • Objects relating to optional system management services
  • Additional tests
  • Interpreting the test results
  • Diagnosing Perspectives problems on the SP System
  • Related documentation
  • Requisite function
  • Error information
  • Information to collect before contacting the IBM Support Center
  • Diagnostics
  • Error symptoms, responses, and recoveries
  • Actions
  • Diagnosing file collections problems on the SP System
  • Related documentation
  • Requisite function
  • SP configuration
  • Error information
  • Information to collect before contacting the IBM Support Center
  • Diagnostics
  • File collections errors on any SP host
  • File collections configuration errors
  • File collections server (control workstation or boot/install server) errors
  • File collections client errors
  • Error symptoms, responses, and recoveries

  • Diagnosing SP node and network attached hardware

  • SP-specific LED/LCD values
  • Other LED/LCD codes
  • Diagnosing 604 and 604e High Node problems
  • 604 and 604e High Node characteristics
  • Error conditions and performance considerations
  • Using SystemGuard and BUMP programs
  • Diagnosing POWER3 SMP Thin and Wide Node problems
  • POWER3 SMP Thin and Wide Node characteristics
  • Boot sequence for the POWER3 SMP Thin and Wide Node
  • Error conditions and performance considerations
  • Service Processor surveillance
  • Diagnosing POWER3 SMP High Node problems
  • POWER3 SMP High Node characteristics
  • Boot sequence for the POWER3 SMP High Node
  • Error conditions and performance considerations
  • Service Processor surveillance
  • SP Expansion I/O Unit
  • Diagnosing 332 MHz SMP Thin and Wide Node problems
  • 332 MHz SMP Node characteristics
  • Boot sequence for the 332 MHz SMP Node
  • Error conditions and performance considerations
  • Service Processor surveillance
  • Diagnosing dependent node configuration problems
  • SP configuration diagnosis
  • SP Switch Router configuration diagnosis
  • SNMP configuration diagnosis
  • Diagnosing SP-attached Server and Clustered Enterprise Server problems
  • Related documentation
  • Requisite function
  • Error information
  • Logs
  • Dump information
  • Trace information
  • Information to collect before contacting the IBM Support Center
  • Diagnostic procedures
  • Installation verification tests
  • Configuration verification tests
  • Operational verification tests
  • Error symptoms, responses, and recoveries
  • Actions
  • Diagnosing IBM eServer pSeries 690 problems
  • Related documentation
  • Requisite function
  • Error information
  • Logs
  • Dump information
  • Trace information
  • Information to collect before contacting the IBM Support Center
  • Diagnostic procedures
  • Installation verification tests
  • Configuration verification tests
  • Operational verification tests
  • Error symptoms, responses, and recoveries
  • Actions
  • Diagnosing PSSP T/EC Event Adapter problems

  • Notices
  • Trademarks
  • Publicly available software
  • Glossary of Terms and Abbreviations

  • Bibliography
  • Information formats
  • Finding documentation on the World Wide Web
  • Accessing PSSP documentation online
  • Manual pages for public code
  • RS/6000 SP planning publications
  • RS/6000 SP hardware publications
  • RS/6000 SP Switch Router publications
  • Related hardware publications
  • RS/6000 SP software publications
  • AIX publications
  • DCE publications
  • Redbooks
  • Non-IBM publications
  • Index
