ITEM: BC9105L

System crashed with 888; it is back up now


Env:
  AIX 3.2.5
  RISC 990

Desc:
  Customer's system crashed with 888-102-300-0c4.
  The dump was 28 MB, but the dump space is only 20 MB.
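
  The 888 sequence is read field by field: 888 flashes on the LEDs, 102 is the sequence type, 300 is the crash code and 0c4 is the dump status. A minimal C sketch that splits the sequence and looks up only the three codes seen in this item follows. The code meanings come from general AIX LED documentation, not from this record, and the function names are hypothetical.

```c
#include <stdio.h>

/* Lookup tables cover only the codes in this item (assumption: meanings
 * taken from general AIX LED documentation; extend from the official list). */
const char *sequence_type_desc(unsigned code) {
    switch (code) {
    case 0x102: return "unexpected system halt";
    default:    return "unknown sequence type";
    }
}

const char *crash_code_desc(unsigned code) {
    switch (code) {
    case 0x300: return "data storage interrupt from the processor";
    default:    return "unknown crash code";
    }
}

const char *dump_status_desc(unsigned code) {
    switch (code) {
    case 0x0c4: return "dump completed unsuccessfully: not enough space "
                       "on dump device, partial dump available";
    default:    return "unknown dump status";
    }
}

/* Parse a sequence such as "888-102-300-0c4" (each field read as hex).
 * Returns 0 on success, -1 if the string is not a 4-field 888 sequence. */
int decode_888(const char *seq, unsigned *type, unsigned *crash, unsigned *dump) {
    unsigned led;
    if (sscanf(seq, "%x-%x-%x-%x", &led, type, crash, dump) != 4 || led != 0x888)
        return -1;
    return 0;
}
```

  For this crash, 300 matches the DSI_PROC errpt entry, and 0c4 (partial dump) is consistent with a 28 MB dump written to a 20 MB dump space.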

Action:
  Had the customer run the following, entering stat and t -m at the crash prompt:
  crash /dev/hd7
  stat
  t -m

  The customer is sending in the output from these commands.

Next Action:
  Waiting on email from the customer 

Response:

Script started on Wed Dec 13 10:50:26 1995
netdb:/tmp/ibmsupt # crash /dev/hd7
Using /unix as the default namelist file.
Reading in Symbols .........................
Symbol panicstr has null value.
Symbol t_maint has null value.
> stat
Not a valid dump data area:0x41495800
0452-338: Cannot read uname structure from address 0x41495800
        time of crash: Tue Dec 12 18:46:38 1995
Not a valid dump data area:0x383affe3
        age of system: 
> t -m
MST STACK TRACE:
    00219db0 (excpt=00000000:00000000:00000000:00000000:00000000) (intpri=0)
        IAR:        0000559c  .i_enable + 9c:       bcr   cr20,0
        *LR:        00021b14  .i_offlevel + 7c
        *00058e68:  00000000  \

    2ff98000 (excpt=60a60018:40000000:007fffff:60a60018:00000106) (intpri=b)
        IAR:        0149bfc4  [TRL.ext:trace_cmd] + 7854:       lsi   r5,r3,0x4
        *LR:        01499484  [TRL.ext:llcheck] + 4d14
        *2ff7fd90:  00000000  \
        2ff7fde0:   014a6314  [TRL.ext:test_grpaddr] + 11ba4
        2ff7fe40:   014a616c  [TRL.ext:rcv_completion] + 119fc
        2ff7fef0:   014a5aac  [TRL.ext:link_manager] + 1133c
        2ff7ff80:   014a4df4  [TRL.ext:init_proc] + 10684
        2ff7ffc0:   000330f0  .procentry + 14
        00000000:   00000000  \

Output from the errpt:

ERROR LABEL:    DSI_PROC
ERROR ID:       20FAED7F

Date/Time:       Tue Dec 12 18:46:38 
Sequence Number: 245053
Machine Id:      000003328000
Node Id:         netdb
Class:           S
Type:            PERM
Resource Name:   SYSVMM 

Error Description
Data Storage Interrupt, Processor

Probable Causes
SOFTWARE PROGRAM

Failure Causes
SOFTWARE PROGRAM

Recommended Actions
IF PROBLEM PERSISTS THEN DO THE FOLLOWING
CONTACT APPROPRIATE SERVICE REPRESENTATIVE

Detail Data
Data Storage Interrupt Status Register
4000 0000 
Data Storage Interrupt Address Register
007F FFFF 
Segment Register, SEGREG
60A6 0018 
EXVAL
0000 000E 
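
The DSISR (40000000), DAR (007FFFFF) and segment register (60A60018) values above also appear in the excpt= field of the second MST in the stack trace, which ties this errpt entry to the crash. The sketch below, with hypothetical helper names, checks for those values and splits the DAR into a segment register index and an offset. It assumes the 32-bit POWER rule that the top 4 bits of an effective address select one of 16 segment registers.

```c
#include <stdint.h>
#include <stdlib.h>

/* Top 4 bits of a 32-bit effective address select the segment register. */
unsigned segreg_index(uint32_t ea) { return ea >> 28; }

/* Low 28 bits are the offset within that segment. */
uint32_t segment_offset(uint32_t ea) { return ea & 0x0FFFFFFFu; }

/* Returns 1 if value appears among the colon-separated hex words of an
 * excpt= field, e.g. "60a60018:40000000:007fffff:60a60018:00000106". */
int excpt_contains(const char *excpt, uint32_t value) {
    const char *p = excpt;
    while (*p) {
        char *end;
        unsigned long w = strtoul(p, &end, 16);
        if (end == p)
            return 0;                 /* not a hex word: stop scanning */
        if ((uint32_t)w == value)
            return 1;
        p = (*end == ':') ? end + 1 : end;
    }
    return 0;
}
```

For DAR 0x007FFFFF this gives segment register 0 and offset 0x007FFFFF.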

Response:

Found an exact hit on this. It is a known bug; the APAR that fixes it is IX41199.

System crashes when a packet is received on SAP 0xFC and the user has
not opened that SAP.

PROBLEM CONCLUSION:
test_grpaddr() is no longer called when a discovery packet is
received, because the discovery packet is a special case and does not
require an explicit enable of the SAP. The group address is already
handled within the discovery SAP logic. Also, a check was added in
test_grpaddr() to verify that the SAP exists before handling the packet.
This closes the exposure where the adapter places a packet on the DLC
ring queue and the SAP is disabled before the packet is processed.
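
The two parts of the fix can be pictured as follows. This is a hypothetical sketch, not the TRL.ext source: only the name test_grpaddr() and SAP 0xFC come from the APAR text, while sap_table, struct sap_cb, rcv_packet() and the rx_result values are invented for illustration.

```c
#include <stdbool.h>
#include <stddef.h>

#define DISCOVERY_SAP 0xFC   /* SAP named in the APAR */
#define NUM_SAPS      256

/* Hypothetical per-SAP state. */
struct sap_cb {
    bool open;
};

/* Hypothetical SAP table: NULL when no user has opened the SAP. */
struct sap_cb *sap_table[NUM_SAPS];

/* Hypothetical outcomes so the routing can be observed. */
enum rx_result { RX_DISCOVERY, RX_GROUP_ADDR, RX_DROPPED };

/* Added check: verify the SAP exists (and is still open) before using it. */
enum rx_result test_grpaddr(unsigned char dsap) {
    struct sap_cb *sap = sap_table[dsap];
    if (sap == NULL || !sap->open)
        return RX_DROPPED;   /* SAP never opened, or disabled while queued */
    /* ... group-address handling for an open SAP would go here ... */
    return RX_GROUP_ADDR;
}

/* Receive path after the fix: discovery packets bypass test_grpaddr(). */
enum rx_result rcv_packet(unsigned char dsap) {
    if (dsap == DISCOVERY_SAP)
        return RX_DISCOVERY; /* handled inside the discovery SAP logic */
    return test_grpaddr(dsap);
}
```

Before the fix, the equivalent of sap_table[0xFC] would have been dereferenced even though no user had opened SAP 0xFC.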

I explained the above to the client. He would like a more detailed
explanation of why his system crashed, why it is happening now, and
what in the source is being fixed.

DLCs support two kinds of resolution: name discovery and address
resolution. The application defines which of these is used to find
the remote station. Name discovery uses a special protocol over the
0xFC SAP: a FIND packet and a FOUND packet are exchanged to determine
the address of the remote system.

Address resolution uses the TEST command, sent to the null SAP (0x00)
as a broadcast message.
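
A rough sketch of how a receiver could tell the two resolution methods apart from the LLC header. Assumptions not stated in this item: the standard IEEE 802.2 layout (DSAP, SSAP, control byte), the TEST command control value 0xE3 (0xF3 with the poll bit set), and the low bit of the SSAP as the command/response bit. The FIND/FOUND payload format is not documented here, so it is not parsed, and the broadcast destination address is not checked.

```c
#include <stdint.h>

#define NULL_SAP      0x00
#define DISCOVERY_SAP 0xFC

enum resolution { RES_NAME_DISCOVERY, RES_ADDRESS_RESOLUTION, RES_OTHER };

/* LLC TEST command: U-format control byte 111P0011 (0xE3, or 0xF3 with P set).
 * Low bit of SSAP is the C/R bit: 0 = command, 1 = response. */
static int is_test_cmd(uint8_t ssap, uint8_t control) {
    return (ssap & 0x01) == 0 && (control & 0xEF) == 0xE3;
}

/* Classify a received frame by which resolution method it belongs to. */
enum resolution classify(uint8_t dsap, uint8_t ssap, uint8_t control) {
    if (dsap == DISCOVERY_SAP)
        return RES_NAME_DISCOVERY;       /* FIND/FOUND traffic on SAP 0xFC */
    if (dsap == NULL_SAP && is_test_cmd(ssap, control))
        return RES_ADDRESS_RESOLUTION;   /* TEST command to the null SAP */
    return RES_OTHER;
}
```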

APAR IX41199 fixes a problem that occurs when a discovery packet is
sent to the 0xFC SAP but is not expected (the system is not using name
discovery). The probable cause is some other system erroneously using
name discovery instead of address resolution.

There apparently is no longer any external documentation describing
the use of the 0xFC SAP for name discovery. This was a limited
function for SNA that was later removed. Because the RS/6000 had to
interoperate with the AIX RT, the function was retained, although it
is not the default and will be removed in the future.

New functionality was added to support OS/2 RIPL, which also uses the
0xFC SAP. This is also not documented externally (non-AIX publications).
It assumes that the functional address is the LS/X RIPL functional
address, 0xC00040000000.
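
For reference, a sketch of how a token-ring adapter filter might match the RIPL functional address. Assumption, not stated in this item: functional addresses start with 0xC000, have the functional address indicator bit (high bit of the third byte) clear, and carry one bit per function in the low 31 bits, which the adapter compares with the functional bits it has enabled. The function names are hypothetical.

```c
#include <stdint.h>

#define RIPL_FUNCTIONAL_ADDR 0xC00040000000ULL  /* value quoted in this item */

/* Assumed layout: 0xC000 prefix, indicator bit (bit 31 of the 48-bit value)
 * clear for a functional address. */
static int is_functional(uint64_t addr) {
    return (addr >> 32) == 0xC000 && (addr & 0x80000000ULL) == 0;
}

/* Accept a frame if its functional bits overlap the adapter's enabled mask. */
int accepts(uint64_t dest, uint32_t enabled_mask) {
    if (!is_functional(dest))
        return 0;
    return ((uint32_t)(dest & 0x7FFFFFFFULL) & enabled_mask) != 0;
}
```

Under these assumptions, a station only sees RIPL frames sent to this address if the 0x40000000 function bit is enabled on its adapter.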
Without an attachment or link trace (or a network sniffer trace), I
cannot tell what packet was received.

Response:

Provided this information to the customer

Closing with customer approval



Support Line: system crashed with 888, it is back up now ITEM: BC9105L
Dated: January 1996 Category: N/A