To contact the AIX Support Center, call 9-1-800-225-5249 (CALL-AIX),
 - then option 1 (we have an AIX Support Line Contract and need Hardware or
   Software Technical Assistance),
 - then option 1 (for Technical Assistance on AIX).
 - Your "Customer Number" (aka "Entitlement Number") is 5336519, or if you want
   to use CS's Entitlement Number, it's IN00018.  Bruce says this IN00018 number
   is still good for Delphion's use.
 - They assign your call a "Problem Number" (aka an "Item Number").
(I used to call 1-800-237-5511, but their option 3 (AIX) seems to get you to the
same place.)
=========================================================================================
5/20/96    Finally got my hands on an RS/6000 model 43P in order to track down
why it doesn't want to install AIX 4 from NIM.  Tony Rall did some tracing and
discovered what this guy in Raleigh discovered and reported in the POWERPC FORUM
last February.  I called Chris Dukes and got names and CMVC defect # 2849.  The
people working on it are Shailendra Bhatnagar, SHAIL@AUSVMB, a contractor,
Steaven Sombar, STEVEN@AUSTIN, and Scott Walton, SCOTTW@AUSNOTES.  Chris said
they've been working on this defect for 113 days.

----- POWERPC FORUM appended at 19:19:54 on 96/02/02 GMT (by PAKRAT at RALVM17) -----
Subject: Bootp on RSPCs on TR... this is why it doesn't work.
Ref: Append at 20:38:57 on 96/02/01 GMT (by VRBASS at ATLVMIC1)

For those of you with RSPCs in a TR environment that just can't seem to boot off
of your NIM server, here are the entertaining results of a network trace.  The
RSPC sends out a broadcast frame containing an ARP packet so that it can locate
the router.  The router sends a source-routed frame containing an ARP packet
(so that it can go through the right bridges) back to the RSPC.  The RSPC is
supposed to take the RIF field of that frame, reverse the information, and store
it to handle the source routing of all subsequent frames.  However, the frame
the RSPC sends out to hold the directed BOOTP packet has a BLANK RIF field.
(The ROM IPL emulation disk, as well as the firmware in MCA-based RS/6000s, does
correctly fill out the RIF field.)  As the RSPC's frame goes around the ring
without being taken by anyone, the directed BOOTP request fails.  When the RSPC
sees this failure, it proceeds to send a broadcast BOOTP packet, which is
promptly ignored.
Chris Dukes
------------------------------------------------------------------------

Called the Support Center to get on their callback list.  The Problem/Item
number is AZ4813.  They opened a PMR (# 0X08349R) and I should get a call back
when the fix is available.  I called Shailendra at 1-512-838-8573 to try to find
out the status of CMVC defect # 2849 and left a message.  He called back and
said the defect has been fixed, but is not yet available.  I asked if there was
a way I could get a copy of the fixed firmware early.  After notes to Scott
Walton (scottw@ausnotes) and Steaven Sombar, I got a pre-V1.10 version at
/afs/austin/depts/19yb/public/images/TestBuilds/carol.net-fixes/car96115.img.
Follow the directions in the README to dosformat/doswrite it to diskette along
with the flash.6xe executable, type eatabug at the feed-me-the-SMS-diskette
graphic to get into the "resident monitor", then type flash -a to update the
43P firmware.  The only trick was that I had to write the .img file to diskette
with a leading 'P' in the filename (PAR96115.IMG) 'cause the flash program looks
for any file on diskette with the pattern P*.IMG.
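
For next time, here's roughly what the diskette prep boils down to (a sketch
only, assuming the stock AIX DOS utilities writing to the default diskette
drive; the README that comes with the image is the authority):

    dosformat                            # DOS-format the diskette in /dev/fd0
    doswrite flash.6xe flash.6xe         # copy the flash utility onto it
    doswrite car96115.img PAR96115.IMG   # rename so it matches the P*.IMG pattern
    # Then boot the 43P with the diskette in, type "eatabug" at the
    # feed-me-the-SMS-diskette graphic to reach the resident monitor, and run:
    #     flash -a
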
If you ever want to get the real stuff, see the directory
/afs/austin/depts/19yb/public/images/PowerSeries830_850, where you'll find
V1.02, V1.04, V1.06, V1.07, V1.08, and V1.09.  Presumably, we'll want the V1.10
level when it eventually comes out.

This firmware update got the 43P to boot across our token ring as well as it
does with ethernet, but both ways now stop somewhere in the tftp'd nucleus,
probably (according to the forum) in rc.boot where it does the bootinfo -c.
But fixing rc.boot according to the forum didn't fix the problem.  I also tried
turning debugging on in the nucleus and connecting up the 3164, but I didn't get
anything to come out at that console.  Time to call the Support Center back
again.  They gave me a new Item # and PMR #.  They are not the same thing, but
they're treated almost the same: the Item # is used locally at the Support
Center to track calls, while a PMR # is used in RETAIN.  The new numbers are
Item # BJ2181, PMR # 0X322, branch office 49R.  They're going to have their NIM
expert, Kyle Cline, give me a call.  Kyle is KCLINE@AUSVMR, 8-793-9294.

   ********************************************************
   * By the way, this bug in the 43P firmware also exists *
   * in the PowerPC 400 firmware, but Scott Walton says   *
   * the 400 is out of service and they don't intend to   *
   * fix that firmware.  The bottom line is you can't     *
   * network-install a 400 machine on our token ring      *
   * network.                                             *
   ********************************************************
=========================================================================================
AZ9944     Once every couple of weeks, Flick's machine gets into a state where
6/02/96    almost nothing works on it.  He was able to get a root window by
rexec-ing a ksh, but only a subset of commands were working.  SRC, for example,
was broken -- it would eventually time out.  Flick thought it had something to
do with sockets.  The Support Center found that there were two srcmstr processes
running, one a parent of the other.  The child srcmstr could not be killed.
They thought this was fixed (as well as various named problems, by the way) in
the latest service, available through fixdist via IX58114.  (Note on 9-18-96:
When searching for the latest fixes in fixdist, look for the text string
"latest aix 4.1".  You'll find a few of them.  Just select the latest one.  The
APAR/PTF number will be different; for example, the August APAR is IX61071.)
I downloaded 330 MB into /afs/alm/common/inst.images/4.1.4.  Now I gotta figure
out what to do with all that stuff, i.e. how to apply it onto Flick's machine,
nemesis, and the spot, spot1.
=========================================================================================
BJ6838     Mike Carey's console on bucky quit working after a normal reboot.
6/03/96    Diskette-based diagnostics work fine on it, but hard disk-based
diagnostics and AIX 4.1.4 both act like they don't "see" the console.  The error
log (errpt -a) shows software errors:

    LABEL:           GRAPHICS
    IDENTIFIER:      E85C5C4C
    Resource Name:   CFGLFT
    Description:     SOFTWARE PROGRAM ERROR
    Detail Data:
    DETECTED  FAILED     RC  ERROR  LOCATION
    cfglft    build_dds  45  1114   31

and in the next error,

    cfglft    open       -1  1119   18

Dan in the Kernel Group said that IX53983 would fix my problem and that it
wasn't available through fixdist yet.  He sent me an 8mm tape, which I loaded
from maple and put into /afs/alm/ais/322temp/carey_fix.  It turned out to be
just a few files, the two relevant ones being the .toc and
devices.graphics.com.usr.4.1.4.
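
Roughly what applying that fix amounts to (a sketch, not a transcript -- I use
"all" here rather than guess the exact fileset name, and the preview line is
optional):

    cd /afs/alm/ais/322temp/carey_fix
    installp -p -agX -d . all    # preview what would be applied from this directory
    installp -agX -d . all       # apply it (and any requisites), expanding
                                 # file systems if needed
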
I applied it to bucky and rebooted, but the symptoms still exist.  I called Dan
back and left a message on Friday, 6/7/96 at 11:45.  Dan called back and said he
was going to queue my call to the graphics group.  A neat command to see if a
particular fix is applied to a machine is instfix -ik IX53983.  Also checked out
RETAIN, but didn't learn much.  There are two hits on the error label, E85C5C4C.
One was the above fix, and the other was for a GXT1000 graphics card, which Mike
didn't have.  He has the standard 1-1 card, a "Color Graphics Display Adapter".

Called the Support Center back on Monday (6-10) at 2:25.  Talked to Mike Olivier
(OLIVIER at AUSTIN, 8-793-6932) in the Graphics group, who said he'd investigate
and call me back today.  Mike called again and wanted the output of a
"configuration manager" command (cfgmgr -v), which showed no error configuring
gda0 (the graphics adapter), but did show an error configuring lft0:

    Method error (/etc/methods/cfglft):
        0514-045 Error building a DDS structure.

and the output of the errpt -a command showing the two cfglft errors described
above.  I sent him both in a note, but I sent them to his VM userid.  Mike
called on Tuesday and I pointed him to them.  Called again Thursday (6-13) at
9:35 and left a message to get a status check.  Mike returned the call with the
suggestion to change the permissions of the special file /dev/lft0 from 662
(I think it was) to 666.  I did, and left a message for Mike Carey to reboot and
see if that fixed things.  Called Mike Carey back Monday morning (I took Friday
off) and no, the terminal still wasn't working.  Left a msg for Mike Olivier
Monday morning at 10:10.

Mike Olivier called again asking if we had a good system backup (I didn't
clarify exactly what kind of backup he was talking about).  He said he suspects
a corrupted ODM data base.  It sounded to me like he wasn't certain and was just
guessing, so I pushed back and suggested we verify that it's really a corrupted
ODM and possibly repair it if we can.  He said he'd investigate and get back to
us.  Mike Olivier called again at 2:00 and this time he recommended running fsck
in service mode as root.  He also noticed 5 instances of power failures (see
EPOW_SUS in the errpt) and was asking why there were so many.  Was the user
pulling the power plug on the box?  One in particular happened 5-29 at 3:15,
maybe about the time Mike Carey had the original problem.

Mike Carey couldn't run diagnostics (remember?) from the hard disk and said he
was just going to reinstall.  I tried calling him back to say, "Wait, we can run
diags from diskette.", but he wound up reinstalling everything anyway.  I called
Mike Olivier back to cancel this call, but I'm really disappointed that we
couldn't identify what exactly was wrong and fix it.  The cause, though, is
likely the fact that Mike Carey often powers his machine off when it's hung up
instead of hitting the yellow reset button twice to reboot.
=========================================================================================
CA8212     Called in our named-hang problem with AIX 4.1.4.  The named daemon
06/21/96   often (a few times a week on nim) simply refuses to resolve names.
Things look ok, but a "host jasper" (say) returns

    host: 0827-801 Host name jasper does not exist.

Recycling named fixes it 'till the next time.  nim (and now jasper, dale, and
ech) are at the latest level, UM443183.
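
(A couple of stock commands I lean on for checking levels like this -- the
fileset names are the TCP/IP client/server ones involved here, and IX58114 is
just the APAR mentioned above used as an example:)

    lslpp -l bos.net.tcp.client bos.net.tcp.server   # installed fileset levels
    instfix -ik IX58114                              # is a given APAR's fix applied?
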
Meghan O'Brien (meghan@austin.ibm.com at 8-793-7115) told us to turn on tracing
with the named option -d 1 (the 1 says debug level 1, which is the least amount
of data; 9 is the most).  The log is kept at /var/tmp/named.run.  When named
fails, send Meghan the last few k-bytes of the log.  By the way,
kill -USR1 `cat /etc/named.pid` will increment the debug level, while
kill -USR2 `cat /etc/named.pid` will turn off debugging.  I have debugging
turned on on jasper, nim, ech, and cabernet.

Also by the way, when you talk to others about this, tell them we're running
named on all our client machines "as cache-servers".  Normally only your name
servers run named.  We run it here this way to get two advantages: 1) we have a
backup name server, and 2) I.P. addresses are cached locally instead of at the
name server.  Once Meghan understood this, she was more comfortable with our
setup.

Came in on Monday morning and tried a host jasper on nim.  Worked ok.  Did some
work and a few minutes later, named was hung.  Sent Meghan the last 3,000 lines
or so of the named.run file as well as a software error reported in the errpt
(turns out the errpt entry was due to my sending named a kill -9 signal, so it
wasn't really relevant).  Meghan called back Tuesday (6-25) morning and said
there wasn't anything in the named.run.  She wanted our /etc/named.boot & .ca
files and also to turn on syslog.  The steps required for turning on syslog are
 - Edit /etc/syslog.conf & add the line
       daemon.err    /tmp/named.syslog
   or whatever file you want to send this stuff to.
 - touch /tmp/named.syslog
 - refresh -s syslogd
 - stopsrc -s named
 - startsrc -s named
A curious thing: the first line that gets logged in that file is one that
complains of the domain line in /etc/named.boot, which seems perfectly ok to me.
    Jun 25 09:28:15 jasper named[6208]: \
        /etc/named.boot: line 1: unknown field 'domain'
I've changed nim, jasper, cabernet, & ech.

named failed on ech on Wednesday morning (6-26), but there's nothing else in the
syslog file except for that initial line complaining about the domain record.
I called Meghan at 11:10 and left a message.  Meghan called back on Thursday
(6-27) and said we shouldn't have a domain line in our named.boot; they aren't
needed for cache-only name servers.  But Meghan says that isn't what's hanging
our named daemon.  Meghan had me send a kill -2 to named, which produces a dump
in /var/tmp/named_dump.db, which I mailed to her along with our
/etc/named.local.  But Meghan is giving up trying to resolve our problem and is
requeuing our call to the "Back-end Communications Group".  Meghan did say we
were a bit down-level, but that wasn't our problem either.  The latest levels
are bos.net.tcp.client 4.1.4.13 (we were at .9), U443181, and
bos.net.tcp.server 4.1.4.11 (we were at .8), U443826.  The problem is, those
fixes aren't available on fixdist yet.  I've called the Support Center to ask
how I'd get these fixes, and also to rattle their cage a bit to get them to look
at our problem.  Meghan said I may have to do this.

Got a call back at 11:20 am Friday (6-28) and Robert Justice talked me through
ftp-ing the fixes from Boulder:
    rftp aix.boulder.ibm.com      (login as anonymous; there's a ls -Ral file in
                                   the root directory to help find things)
    cd /aix/fixes/v4/os           (or devices or X11 or xlc or ??;
                                   the DCE fixes are in /aix/v4/fixes/other)
    binary
The fixes were in the files bos.net.tcp.server.4.1.4.11.bff and
bos.net.tcp.client.4.1.4.13.bff.  I put them in /afs/alm/common/inst.images/4.1.4
and ran inutoc on that directory as root.
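
A quick sanity check that the directory is usable before pointing smit at it
(a sketch using standard installp behavior; the path is ours):

    # list what the directory offers and confirm the new TCP filesets show up
    installp -l -d /afs/alm/common/inst.images/4.1.4 | grep bos.net.tcp
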
From nemesis, smitty install_fileset, directory=/afs/alm/common/inst.images/4.1.4, and for the "FIXES to install" field, you need the IX* numbers, which you can get from the .toc file. In this case, IX54156 IX53663 is good. I put the fixes on both nim & jasper, and recycled named. Dale decided to apply all the fixes in common/inst.images/4.1.4, so cabernet has the updated stuff, too. Rick put the fixes on ech also. This problem's been transferred to Scott Tanquary. I called 7/8/96 at 4pm requesting a callback and again 7/9/96 at 11:30. Scott's tie line is 8-793-6725. Scott called back at 11:50. We talked through some things but I couldn't answer some of his questions on how our nameservers were configured. I got the root password from Tony Rall (l0vebars). Called Scott back Wednesday (7/10) evening and left a message. I called him again, this time to his personal phone (evidently, these support guys have 2 phone lines and they sometimes choose to not answer their support line). This time Scott told me that for cache-only name servers, the root nameservers should be what's defined in /etc/named.ca, that is the name servers for the whole internet. We have our Almaden nameservers defined. I talked to Tony Rall about this and he kinda agreed, but said it didn't have to be the internet root name server, having just the root name server for ibm.com is good enough. Tony came up with the following /etc/named.ca which I've installed on nim & jasper late Friday (7-12). We'll see how this fairs. ; Root IBM nameservers . 86400 IN NS pollux.cbe.ibm.com. 86400 IN NS leda2.cwp.ibm.com. 86400 IN NS castor.cdf.ibm.com. leda2.cwp.ibm.com. 86400 IN A 9.14.1.3 castor.cdf.ibm.com. 86400 IN A 9.78.1.2 pollux.cbe.ibm.com. 86400 IN A 9.14.1.2 Scott Tanquary wants to know our status with this problem. I haven't seen the named failure on jasper, nim, or cabernet, so I'm willing to say it's a named configuration problem. Tony Rall wants to try Tom Engelsiepen's suggestion of putting in specifying our nameservers as serving the almaden domain. Tony'll implement that in our nameservers and we'll test that. Meanwhile, I called Scott and had him close this problem. At least, I tried to. Scott's evidently in the process of changing his number to 8-793-4177 and neither number is accepting messages. I sent Scott a note instead. 10/21/96 We are seeing the same named hang on AIX 4 systems as we saw last June. We may need to start tracing named again. ========================================================================================= CA8539 Called in a problem with nim after putting on the latest 06/21/96 (April) fixes (IX57483, the PTF #). Kinda bizarre, but I Barkat saw after a nimadd xwing, 4 things get put in /etc/exports, then propogated to /etc/xtab with an exportfs command, but there was one file missing in /etc/exports but in /etc/xtab. Specifically, These directories below, were they in | exports? | xtab? ----------------------------------------------+----------+-------- /inst.images | Yes | Yes /export/spot1/usr | Yes | Yes /home.native/root/Install_Scripts/Do-setupnet | Yes | Yes /export/nim/scripts/xwing.script | No !!! | Yes Thus, any further nim activity like another nimadd, would put the new stuff in /etc/exports, but the missing xwing.script wouldn't be in /etc/xtab, and the install would fail when it tried NFS-exporting the xwing.script. When the install fails, the alog -t bosinst -o shows these errors, Network installation manager customization. 
rc=175 0042-175 c_script: An unexpected result was returned by the "/usr/sbin/mount" command: mount: 1831-011 access denied for NEMESIS.ALMADEN.IBM.COM:/export/nim/scripts/xwing.script mount: 1831-008 giving up on: NEMESIS.ALMADEN.IBM.COM:/export/nim/scripts/xwing.script The file access permissions do not allow the specified action. I backed up and did each step one at a time, that is, I did the nim -o allocate for the lpp_source, my Do-setupnet script, and spot, and checked /etc/exports & /etc/xtab. All was ok. I then did the nim -o bos_inst -a source=spot -a no_client_boot=yes xwing and things were also ok. All four lines were in both /etc/exports and /etc/xtab as expected. And sure enough, the install went fine. Shit! I can't seem to get it to fail again. We agreed to close this problem and open up another one should it fail again. Meanwhile, I've written a little script and stuck it at the end of my nimadd & nimadd-isa scripts to check for this situation. Also, Kyle of the Support Center suggested lsnim -Fl machine_name to check for the correct number of "exported = ..." lines. ========================================================================================= CB0404 Darrell Long (7-2376 in H2-805) wanted to install AIX 4.1.4 6/26/96 on his 7012-340, which has a working ethernet adapter, but Sunita the IPL ROM Emulation diskette doesn't recognize it. It's got a 2-8 on the back, which isn't in my list of adapters. A lscfg on Darrel's machine (running 3.2.5) shows ent0 00-00-0E Standard Ethernet Adapter The adapter is labeled "Thick/Thin" & has one of those 2-bank jumpers on it to switch it from one to the other. It's correctly on thick now. This card is only about 4 inches square with a diagonal notch taken out of it. Adapter number 2-8 isn't in my list of adapters. Curiously, *my* 340 also has an ethernet adapter in the back, but it's labeled 2-9, not 2-8. Also, a lscfg on my machine shows this to be a ent0 00-00-0E Integrated Ethernet Adapter I'm not using the card. My 340 is connected to token ring. I've tried my 2-9 card in kipling and the IPL ROM Emulation code didn't recognize that either. I pulled my ethernet card out of aix-test to let Darrell Long borrow it just for the install. Afterwards, we switched to his 2-8 adapter and AIX 4.1.4 is happilly running with it. I told Sunita all this Thursday (6-27), but told her I wanted to pursue it. Either say we don't support the 2-8 and 2-9 adapters, else fix the IPL ROM Emulation diskette to recognize them. Ok, pass the crow. The 340 *does* have BOOTP-enabled IPL ROM, so you *can* boot without the IPL ROM Emulation diskette. Why did I think otherwise? I called the Support Center and had them cancel this PMR. ========================================================================================= 3147X,49R Called to get on the Interested Parties list for ADSM APAR IC13320. 9/05/96 In the past, this has been a simple thing that the Support Center handled on a routine basis. But this time, the first time I called, the woman said to call 1-800-879-2755, which turned out to be the "Software Manufacturing Solutions" line, option 1 got to the "National Publication Ordering Center", and option 2 the "Software Support Center", who had never even heard of an Interested Parties list and "we use RETAIN all the time." 
I called back to the AIX Support Center and talked to a different woman, who also didn't know how to add me to the IP list, but at least she created a PMR (#3147X, branch office 49R) and called the ADSM Support people, who said *they* would add me. Whew! What a hassle. Anyway, here's the append from the ADSM FORUM on IBMUNIX. ----- ADSM FORUM appended at 03:52:02 on 96/08/22 GMT (by MARNATTI at LEXVM2) - Subject: Passwordaccess = GENERATE Ref: Append at 18:27:18 on 96/08/14 GMT (by MARNATTI at LEXVM2) This is a follow up to the note I appended to this forum asking about conditions that would cause a generated password to need to be reset on the client side, since I was seeing where this needed to be done at strange times. It turns out that there is an open APAR, IC13320, for this problem with no PTF ready yet. This APAR describes what I was seeing; during repeated ADSM archiving operations the client's password file will up and disappear. In case anyone else is experiencing this problem, a temporary work around, according to this problem record, is to place a 2 second delay between ADSM client calls for archive operations. John Marnatti ISSC, Lexington, KY ========================================================================================= CE9917 Ron Moore's 43P won't install via NIM. He's got a token ring machine (dinorm) 9/18/96 that is configured correctly. It contacts nim, tftp's the initial boot image Choon over, grays out the lower portion of the screen, but doesn't do anything after that. I allocated the debug-spot and traced it using the console concentrator and ate so that I can trap all the console messages. What's happening is between led's 610 and 612, the following mount is done mount -r NEMESIS.ALMADEN.IBM.COM:/export/debug-spot/usr /SPOT/usr Since there's precious little in this initial nucleus to debug stuff with, I changed the /export/debug-spot/usr/lib/methods/showled module to the normal ls command from /usr/bin, added a couple of ${SHOWLED} -ld /SPOT/usr commands to /export/debug-spot/usr/lpp/bos/inst_root/sbin/rc.boot, and rebuilt the initial tftp'd nucleus with a nim -o check -F -a debug=yes debug-spot command to see the state of affairs both before and after the mount (around line 760). Here are the relevant lines from that trace + /usr/lib/methods/showled -lR /SPOT <---- Remember, this is the ls command total 16 -rw-r--r-- 1 0 0 1060 Dec 1 14:29 niminfo drwxr-xr-x 2 0 0 512 Sep 18 1996 usr <---- All's ok before. /SPOT/usr: total 0 + /usr/lib/methods/showled 0x610 0x610 not found + mount -r NEMESIS.ALMADEN.IBM.COM:/export/debug-spot/usr /SPOT/usr + [ 0 -ne 0 ] + /usr/lib/methods/showled 0x612 0x612 not found + [ -d /SPOT ] + /usr/lib/methods/showled -lR /SPOT total 8 -rw-r--r-- 1 0 0 1060 Dec 1 14:29 niminfo ---------- 0 0 0 0 Jan 1 1970 usr <---- Weird !!! + [ -d /SPOT/usr ] + cp /SPOT/usr/lib/boot/network/rc.bos_inst /etc <---- Fails (of course). cp: /SPOT/usr/lib/boot/network/rc.bos_inst: Not a directory ... The rc.boot script continues on for a little bit more, and actually fails with a "Illegal Trap Instruction Interrupt in Kernel", and if you do a "g" to get it going, you quickly get one more line, LED{0A8}, which I have no idea where that's coming from. I've rebooted nim. That didn't do any good. I've installed another token ring connected 43P (Jim Hafner's triumph-t). That's going fine. I could try to install with the built-in ethernet, but haven't done that yet. Time to get help. 
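
(For the record, the debug-spot instrumentation above boils down to something
like this -- a sketch; the cp source path inside the SPOT is my assumption, and
the nim check command is what rebuilds the debug boot image:)

    # replace the showled method with a real ls so rc.boot can show directory state
    cp /export/debug-spot/usr/bin/ls /export/debug-spot/usr/lib/methods/showled
    # (then add "${SHOWLED} -ld /SPOT/usr" lines around the mount in
    #  /export/debug-spot/usr/lpp/bos/inst_root/sbin/rc.boot, near line 760)
    # rebuild the tftp'd boot image with debugging enabled
    nim -o check -F -a debug=yes debug-spot
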
Background info: I've installed the latest microcode from the Austin afs tree,
from /afs/austin/depts/19yb/public/images/PowerSeries830_850/v1.10, which is
what I put on Jim's triumph-t machine.  The token ring network adapter
configuration is also correct.  It shows

    Adapter MAC Address    0004ac356d32
    I/O Address            a20
    Interrupt              9
    RAM Address            d00000
    RAM Size               64K
    ROM Address            cc000
    Remote IPL             Disabled
    Token Ring Data Rate   16
    Auto Sense             Disabled

Choon wanted me to apply the latest service, so I got a half gig of updates and
applied them to both nemesis and to the debug-spot.  The install still fails the
same way, a zero-length /SPOT/usr.  I've also tried the install via ethernet.
Same failure.

Robert          I sent Robert my trace (kapture9) showing the ls -lR /SPOT both
9/23/96         before and after the apparently-successful mount command.  He's
9:30 am         going to forward that to somebody else and they should call back
PMR 5183X,      sometime later today.  He's also upped the severity to 2 (I
BO 49R          thought 2 was the default; evidently this was a 3).
                BUCKNER@AUSTIN.IBM.COM or BUCKNER at AUSVMR.

Baltazar        I talked with this guy, too, and sent him the trace to
9/23/96         BALT@AUSTIN.IBM.COM.  He asked a few questions, and went away to
10:15 am        think about it a bit.  Baltazar called back and asked for some
11:50 am        ls -ld commands to check on each directory in
                /export/debug-spot/usr/lib/boot/network/rc.bos_inst.  He also
                wanted me to check the system date & time on the 43P after
                booting the SMS diskette.  It was 01/11/39 and 21:29:00 (this
                being 09/23/96 and 11:54:00).

The problem is fixed now.  It was due to that date being so far off.
Evidently, NFS can't handle date differences that great (or maybe it was dates
past the "epoch"; we weren't quite sure what the details were).  I booted the
machine in maintenance mode from a CD, was able to import the root volume group
from the AIX 4.1.3 system that was on there, and set the date.  FYI, if you ever
have a machine that *doesn't* have an OS on it, you can import the date command
via floppy by following these steps (Baltazar said he was going to document
these steps in the PMR):
    1) From a good machine, dd of=/dev/fd0 if=/usr/bin/date.
    2) Boot from a CD and get into a Limited Function shell.
    3) In order for the new date command to have the execute permission set,
       first cp /bin/t /date.
    4) dd of=/date if=/dev/fd0.
    5) Then you can date mmddHHMMyy.
==============================================================================
CG0223     Flick's complaining about some include file in his C++ program not
10/09/96   working correctly.  He's already got it narrowed down to the socket.h
file, which is part of the bos.adt.include fileset, which was updated in the
latest fixes I got from fixdist.  We have bos.adt.include version 4.1.4.14,
which is bad.  Flick says that 4.1.4.8 was ok.  This is Flick's note to AIX
service:

    socket.h for bos.adt.include 4.1.4.14 does not honor _XOPEN_EXTENDED_SOURCE
    for C++ source.  This works for socket.h at bos.adt.include 4.1.4.8.  In
    particular the third argument to accept is a size_t * (4.1.4.14) when using
    the C++ compiler even though _XOPEN_EXTENDED_SOURCE is undefined.  Can
    someone tell me if this is fixed in a later release of bos.adt.include?

See also the three appends in the AIX4 FORUM with the subject
    Subject: Why did the accept() prototype change from 4.1 to 4.2 ?

12/05/96   Called again 'cause Da Li is having the same problem with some other
4:00 pm    program which includes , which sucks in /usr/include/sys/socket.h.
Da is getting the same kind of errors Flick was getting about incompatible function definitions. I left a message with Dennis somebody asking for him to call me back. Fixdist and anonymous rftp to aix.boulder.ibm.com still showed 4.1.4.16 being the latest bos.adt.include. ============================================================================== CI8707 Rosa says qprt on AIX 4.1.4 doesn't honor the -Y0 (that's a zero) 12/06/96 option. It's suppose to force simplex printing, but it's coming Wes out duplex. I tried it on my machine, jasper, running AIX 4.1.4 using the command qprt -c -P3116c1a -Y0 public_html/test.ps and everything came out as expected. I also tried on maui (AIX 3.2.5) and without the -Y0 flag. Rosa is investigating why 3116c1a works, but 3116g1a and 3116b1a doesn't. ============================================================================== bu0419 Tom Griffin (7-1444) with ocrx1 (root pw=foobar) and K. Mohiuddin 02/11/97 with moidin6k (no root pw), reported by Sandeep Gopisetty (7-2680), 12:50 both report the same problem with their identical machines. Anthony After installing the latest AIX 4.1.4 image, the one I serviced on 1-24-97 with the January fixes from fixdist, they have two problems. 1) Have a corrupted /var/adm/ras/errlog. If you do an errpt, you get the msg The supplied error log file is not valid: /var/adm/ras/errlog. A /usr/lib/errdemon -l (to get the error log attributes) shows, Error Log Attributes --------------------------------------------- Log File /var/adm/ras/errlog Log Size 507 bytes Memory Buffer Size 8192 bytes What's different from a normal system is the Log Size. A /usr/lib/errdemon -s 1048576 to reset the size fixes this problem. -->> Turns out what caused this problem was the /var file system in -->> the spot was full (1 4-MB partition), mostly due to the file -->> /var/adm/ras/installp.log. I changed /export/spot1/usr/lpp/bosinst/image.template -->> (as I have documented in the inside front cover of my nim manual) -->> to make /var initially 8MB. This fixed the corrupted error log. But, what's worse is 2) X doesn't start up. They both have a RS/6000 model 250 (7011-26-38806 in the case of moidin6k), with a built-in GXT150 Graphics Adapter. Luckily, Sandeep showed me a recently-installed (on 1-11-97, before my January service upgrade to the spot), also identically-configured RS/6000-250 named lalita, that works. Before that, according to the /var/adm/ras/nim.installp log, the last time I installed service was on October 25th, 1996, which was probably the September service level. 02/17/97 Haven't heard from anybody for almost a week, so I called in to 12:00 see what's going on. I was queued to Barbara, who helped me a Barbara little bit. She wanted to run diagnostics, but the AIX diag Patterson command as well as the hard disk-based diagnostics (what you get when you boot in service mode) both fail to test the graphics adapter 'cause they think the device is busy. I wasn't able to run my diskette-based diagnostics (nor Dale's) 'cause I got flashing 888's after reading the first diskette. Turns out I need a newer level of diagnostics, orderable through the FE, which I've done. Once I get them, I'll make a bunch of copies for everybody else. 02/19/97 After calling FE in and getting Bob to run his diagnostics, we found out the hardware was fine all along. I called Barbara back and told her this. Meanwhile, I got Bob, the FE, to get me a more current set of diagnostic diskettes so I could run on the 250's. He gave me two sets. 
There's a third problem with these machines -- well, with moidin6k & lalita at
least.  Those two both have token ring adapters in them as well as the ethernet
adapters built into the motherboard.  Those two machines are actually using
their token-ring adapters, with nothing connected to the ethernet.  The problem
is, both machines are getting numerous ethernet adapter errors reported in the
error log, about one every two minutes.  This causes the error log to fill up
and recycle in about 50 hours.  This isn't related to the graphics adapter
problem, 'cause lalita's graphics adapter works fine.
==============================================================================
BU2275     I had another run-in with a 43P.  This one (ananda) is ethernet-
02/17/97   attached, and it wouldn't install.  After tracing it with the debug
10:30      spot, I saw the same problem that the NIM FORUM gave a workaround to
Choon      back in Jan, 1996, that is, add the following 2 lines to
/export/spot1/usr/lpp/bos/inst_root/sbin/rc.boot:

    BOOT_SERV_IP=129.33.24.63
    E802=0

By the way, to connect the portable console to a 43P,
    1) From the 9-pin connector on the back of the 43P, plug in Dale's
       universal, white, sex change cable, the end without the red tape, then
    2) Plug *the same end* - not the other end - (evidently the cable crosses
       some lines over in the middle) into Dale's 22-foot, 25-pin male/female
       to 25-pin male/female, flat grey cable he keeps wound up in the bottom
       drawer in his office,
    3) Into a null modem, then
    4) Into the portable console.
To connect the 43P to Pine, instead of steps 3) and 4) above, plug Dale's flat
grey cable from step 2) into a connection on the back of pine's null-modem
block, top row where the grey connector converts to the flat cable.

I put this fix in last year and haven't had any problems with it since, but this
time, it wasn't enough to fix the garbage the bootinfo -c command was returning.
Here's the relevant stuff from using ate's ctrl-b to capture output:

    + bootinfo -c
    + set -- 255.255.255.255 129.27.20.51 129.27.20.253 0 0 1
      /tftpboot/ananda.almaden.ibm.com
      1.4.255.255.248.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.
      0
    + CLIENT_IPADDR=255.255.255.255
    + BOOT_SERV_IP=129.27.20.51
    + BOOT_GATE_IP=129.27.20.253

The fix I've got in fixes BOOT_SERV_IP, but as you can see, the CLIENT_IPADDR
and BOOT_GATE_IP addresses are both wrong as well.  This causes the network
interface to be set up incorrectly, and the tftp to nim later on fails with
    tftp: sendto: No route to host
and looping on LED 608.  I put in a fix in rc.boot, but this is a very specific
fix for this one machine.  Right after the other two-line fix, I added

    first3="${CLIENT_IPADDR%.*}"        # Get the first 3 octets.
    case "$first3" in
        129\.33\.[89] | 129\.33\.1[0-5])   BOOT_GATE_IP=129.33.8.253  ;;
        129\.33\.2[4-9] | 129\.33\.3[01])  BOOT_GATE_IP=129.33.24.253 ;;
        129\.33\.160)                      BOOT_GATE_IP=129.33.72.253 ;;
        129\.33\.7[2-9])                   BOOT_GATE_IP=129.33.72.253 ;;
        255\.255\.255)  CLIENT_IPADDR=129.33.29.64; BOOT_GATE_IP=129.33.24.253 ;;
        *) ;;
    esac

As you can guess, I initially noticed the BOOT_GATE_IP was wrong and tried
making a generic fix based on the CLIENT_IPADDR, but since that was screwed up
too, I added the last case.  I just updated the firmware to their latest level,
1.11, 'cause a 97/01/09 append in the NIM FORUM suggested this would fix the bad
bootinfo data.  I even called Dick Chimenti, who started the thread, and he said
that yes, it did fix it.  Well, it didn't for me.
I mailed the debug script output to choon@austin.ibm.com. I pointed him to the pertinent lines where the data is all wrong and he's gonna look into it, confer with others, and call me back. 12:00 Choon called back and asked if we put in the I.P. addresses in the 43P network setup screen, with leading zeros or not. We should not have leading zeros. I checked and yep, we did have leading zeros. Changed it and rebooted from nim and things worked fine. Damn! With Power PC's, you don't put in leading zeros, with RS/6000's, you do. How's one suppose to know? Especially if pings & bootp's to nim & tftp's from nim all worked ok. ============================================================================== 85553,49R Having a problem with the Firewall code installed on the patgate 12/30/97 machine (9.1.8.252 & 192.168.56.252). The symptoms of the problem 11:30 are, from root@patgate, I could nfs-mount something from the CWS, mount 192.168.56.65:/spdata/sys1/install /mnt then wc /mnt/default/lppsource/raj, which is a tiny, 9-byte file, but when I try wc /mnt/default/lppsource/.toc, which is a bigger, 1 MB file, the command hangs - that is, it never completes and ctl-c-ing out of it leaves the system in a funny state. That is, subsequent NFS reads, unmounts, and mounts also hang. I traced a subsequent, identical mount command and it was getting data from the .toc file, in other words, some process (nfsd?) on the CWS was remembering that we wanted to read the .toc file, and was shoving us data even though we had killed the wc socket and was trying to do something else now. I iptrace'd a good wc .toc command from another machine (as0073e0) and the failing wc .toc from patgate and saw that a UDP packet was being denied by a seemingly-unrelated firewall rule. The details are - The iptrace (/tmp/bad.wc.toc.iptrace10) from patgate shows the first part of the .toc file data being transmitted in the following three packets Note the fragmentation of the UDP packet, resulting in the same ip_id number and different ip_off(set) values. These 3 packets were all coming sequentially from 192.168.56.65, port 2049, to 192.168.56.252, port 1226. Timestamp Size ip_len ip_id ip_off Packet Fate ------------------ ---- ------ ----- ------ ---------------- 22:04:26.758164736 1514 1500 199 0+ Passed 22:04:26.759396864 1514 1500 199 1480+ Denied by Rule 2 22:04:26.760448768 1306 1292 199 2960 Denied by Rule 2 - The firewall log (/var/adm/sng/logs/971229_220x.log) show the first packet getting passed on through, due to my PermitAll rule #4, but the second & third packets being denied due to rule #2. Dec 29 22:04:27 patgate : 1997;4014: 2073;ICA1036i;#:;4;R:p; i:;192.168.56.252;s:;192.168.56.65;d:;192.168.56.252;p:;udp; sp:;2049;dp:;1226;r:;l;a:;n;f:;y;T:;0;e:;n;l:;1500; Dec 29 22:04:28 patgate : 1997;4014: 2073;ICA1036i;#:;2;R:d; i:;192.168.56.252;s:;192.168.56.65;d:;192.168.56.252;p:;udp; sp:;0;dp:;0;r:;l;a:;n;f:;y;T:;0;e:;n;l:;1500; Dec 29 22:04:28 patgate : 1997;4014: 2073;ICA1036i;#:;2;R:d; i:;192.168.56.252;s:;192.168.56.65;d:;192.168.56.252;p:;udp; sp:;0;dp:;0;r:;l;a:;n;f:;y;T:;0;e:;n;l:;1292; - According to the /etc/security/fwfilters.cfg file, Rule #2 is # Between anySource and anyDestination # Service : Syslog # Description : deny Syslog deny 0.0.0.0 0.0.0.0 0.0.0.0 0.0.0.0 udp any 0 eq 514 non-secure both inbound l=y f=y This rule doesn't look like it should be denying these packets. - I've got FW.base, cfgcli, & libraries at the latest 3.1.1.2 level. 
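
(For the record, the traces above were gathered with the stock iptrace/ipreport
pair -- this is just a sketch of the shape of it; the interface name en0 is a
stand-in for whichever interface carries the 192.168.56 traffic:)

    # capture traffic to/from the CWS while reproducing the hang
    iptrace -a -i en0 -d 192.168.56.65 -b /tmp/bad.wc.toc.iptrace10
    wc /mnt/default/lppsource/.toc        # this hangs; interrupt it after a bit
    kill $(ps -ef | awk '/[i]ptrace/ {print $2}')    # stop the trace daemon
    ipreport -rns /tmp/bad.wc.toc.iptrace10 > /tmp/bad.wc.toc.report
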
Until I get a call back from the Support Center, I tried changing the nfsd
daemon to limit the read packet size to 1480 bytes with the command

    chssys -s nfsd -a "-r 1480 8"

You can do an odmget -q subsysname=nfsd SRCsubsys | grep cmdargs to see what
it's set to.  It was just 8 before I changed it.  ... A little while later ...
That didn't seem to change anything, but it was probably due to the fact that
patgate was screwed up (see the first paragraph) -- I couldn't remount the
directory.

I read a bit about "Fragmentation Control" on Firewall rules and changed the
"nonsecure Syslog" rule from yes to headers.  The settings refer to what type of
packets the rule applies to:

                                |         S E T T I N G
    Applies to Packets          | Yes | No | Only | Headers
    ----------------------------+-----+----+------+---------
    Non-Fragments               |  x  |  x |      |    x
    Fragment Headers            |  x  |    |  x   |    x
    Fragments Without Headers   |  x  |    |  x   |

So now, rule #2 doesn't apply to Fragments Without Headers any more and voila!
Things now work fine.  How 'bout that?  A little bit of RTFM works (I just wish
I had the manuals).  This still leaves one question, tho'.  Why did this rule
get applied to UDP "Fragments Without Header" packets?  The rule specifically
says destination port 514 only.  When the Support Center calls back, I'll ask
them.  The answer is: since UDP fragment packets don't have port numbers in
them, the port clause of a rule is ignored, so that rule was denying *all* UDP
fragments from and to anywhere.  One could argue that's the wrong thing to do.
The reasoning could have been "since UDP fragment packets don't have port
numbers in them and this rule has a specific port number clause, don't apply
this rule".  But that's not the rationale.
==============================================================================
FL0563           x  On my AIX 4.3.0 system, with DCE & DFS installed, when I'm
03/19/98         x  authenticated to DCE (via integrated login) and then my
14:45            x  authentication expires, I cannot klist or kinit any longer.
Donovan          x  I get the message "IOT/Abort trap".  I have the official
                 x  PTF set 22 installed, from a CD the Support Center sent
                 x  Dale.
xxxxxxxxxxxxx

To debug an expired DCE identity, I went to the security registry as cell_admin
under dcecp and saw what the default and minimum ticket lifetimes were by

    dcecp> registry show -att
    {deftktlife +1-06:00:00.000I-----}
    ... 4 more lines ...
    {mintktlife +0-00:05:00.000I-----}
    ... 2 more lines ...

Then to change it, I

    dcecp> registry mod -mintktlife +0-00:00:01.000

to first change the minimum to 1 second (Donovan said the default can't be less
than the minimum), then I set up to log on again, and quickly (so as to not
affect anybody else),

    dcecp> registry mod -deftktlife +0-00:00:10.000

to change the default ticket lifetime to 10 seconds, then logged on, getting my
10-second ticket lifetime, then

    dcecp> registry mod -deftktlife +1-06:00:00.000

to set the default ticket lifetime back to what it was.  I can do these two
steps as many times as I'd like.  When I was done, I

    dcecp> registry mod -mintktlife +0-00:05:00.000

to set the minimum back to its original 5 minutes.

Donovan called back and left three fixes for me to get: IX74434, IX74068, and
IX72275.  Rick downloaded them from the web fixdist, I installed and tested
them, and things now work fine.
==============================================================================
21769            x  I had an AIX 4.3.0 system with DCE/DFS 2.1 working fine.
05/28/98 x I updated to AIX 4.3.1 and that appeared to go ok, but a 13:00 x previously-fixed bug (see problem above) appeared again. Julian Owens x I decided to upgrade to DCE/DFS 2.2. I smitty update_all'd t/l 421-7141 x from the CD, but now DCE doesn't want to start. xxxxxxxxxxxxxxx Looking at the /opt/dcelocal/etc/cfgdce.log, I saw that it complained about not finding some configuration files The file, /opt/dcelocal/var/dfs/BosConfig, was not found. The file, /opt/dcelocal/var/dced/cdscache.inf, was not found. The file, /opt/dcelocal/var/dced/clksynch.inf, was not found. and making some bad assumptions (Information only) Unable to determine the security server type. Replica will be assumed. I was not configured as a security server before doing this. Before, (taken from ech) a lsdce command returned Current state of DCE configuration: cds_cl COMPLETE CDS Clerk dts_cl COMPLETE DTS Clerk rpc COMPLETE RPC Endpoint Mapper sec_cl COMPLETE Security Client Now a lsdce command shows Gathering component state information... Component Summary for Host: jasper.almaden.ibm.com Component Configuration State Running State Security client Configured Running RPC Configured Running Integrated Login (dceunixd) Configured Not Running Initial Directory server Configured Not Running Directory client Configured Running Security Replica server Configured Not Running DTS client Configured Not Running Global Directory Agent Configured Not Running I attempted to rmdce -F -o local all_srv, and got some errors. I even tried rmdce -F -g all_srv and got other errors. Now lsdce shows Gathering component state information... Component Summary for Host: jasper.almaden.ibm.com Component Configuration State Running State Security client Configured Running RPC Configured Running Initial Directory server Configured Not Running Directory client Configured Running Security Replica server Configured Not Running DTS client Configured Not Running Global Directory Agent Partial Not Running so it looks like it's getting worse. I decided to call it in to get real help. Julian Owens (good, chatty, level 1 guy) looked around and didn't find thing obvious. He sent me to level 2. 6/2/98 Claudia called. I gave her root access (root's password Claudia Barrett is now claudia). She's going to poke around and pass it off 8-678-0910 to somebody else. ============================================================================== FZ0339 x After doing a fresh AIX 4.2.1/PSSP 2.4 install on the new 07/01/98 x SP/2 nodes in the Patent Server complex, I get an extra, x unwanted route in both the ODM & (of course) the routing Russell x table. This route comes back after a reboot, despite me McDonald x having deleted it from both the route table itself and the xxxxxxxxxxxxxxx ODM. As background, there are two ethernet adapters on each node, connecting to two different subnets. A netstat -rn shows ... Route Tree for Protocol Family 2: default 192.168.56.251 UG 0 1 en1 - - 9.1.10.17 192.168.56.252 UGHD 0 20 en1 - - 127 127.0.0.1 U 6 177 lo0 - - 192.168.55 192.168.55.17 U 4 7235 en0 - - 192.168.56 192.168.55.65 UG 1 180 en0 - - => 192.168.56 192.168.56.17 U 1 0 en1 - - Note the second-to-last line, where it says to use the CWS as a router to get to subnet 192.168.56, which makes the native connection to 192.168.56 unused. 
Useful commands to look around & fix things are

    netstat -rn
    odmget -q 'name=inet0 AND attribute=route' CuAt
    odmdelete -o CuAt -q "name=inet0 AND value='net,192.168.56.0,-netmask,255.255.255.0,192.168.55.65'"
    route -f; route add -net 0 192.168.56.251

Russell and I tried deleting everything we could out of /etc/inittab, leaving in
just init, brc, and cons, and still the bogus route came back.  I tried putting
debug code in /sbin/rc.boot, and couldn't figure out where/when the bogus route
was being put into the ODM.  Turns out the answer was that the extra route is in
the ODM that's stored in the boot image!  To resave that version of the ODM, run
the savebase command.  We did this and verified that the route didn't come back.
Now why that bad route got put in that ODM is left for the Support Center to
figure out.  I now know how to overcome the situation.
==============================================================================
HB3410              x  Since installing the second frame to the SP/2, all the
07/16/98            x  other nodes have been logging a lot of extra crap in
03:00               x  their /var/ha/log/hats.15.103329.as0000e0 file.  Lines
Jon Meyer           x  like this show up every 5 seconds, filling /var:
meyerjw@us.ibm.com  x    07/17 14:56:18 hatsd[0]: Received a Group Proclaim
8-421-7157          x    from (192.168.55.65:0x4591719e) in group
                    x    (192.168.55.65:0x45af716a).
xxxxxxxxxxxxxxxxxxxx

Jon investigated it a bit and claims it's normal at this level of PSSP (2.2) to
see all this crap in the error log, and it's not an indication of anything
wrong.  In later versions of PSSP, they cleaned things up to not log this.  He
closed this PMR on 7/27/98, but I called and left a message asking what I should
do, since my /var file system is getting filled with this junk.  See PMR
34340,49R.
==============================================================================
Item # HC9258           x  When trying to configure DCE on the J50 (ar0081e0)
PMR # 34353, 49R        x  the system crashes.  The last thing Ed's mkdcecl.pl
07/27/98                x  script does is the command
09:00                   x      mkdce -o local -h ar0081e0
Sandy Comsudi           x            -c ar0073e0.patent.ibm.com
sancom@austin.ibm.com   x            -s ar0073e0.patent.ibm.com
8-523-4130              x            -n patent.ibm.com sec_cl cds_cl
xxxxxxxxxxxxxxxxxxxxxxx

I'm also seeing sporadic problems using vi.  Sometimes vi works fine.
Sometimes, it hangs up for 30 seconds or so, then works normally.  Other times,
it hangs completely and never comes back (ok, I don't know about "never" -- I
waited 14 minutes 'till the system crashed when I ran dced).  I used smitty to
copy the dump over to /dfscache, and Sandy had me use the script command to
capture the console output from these commands:

    script /tmp/crash.info
    crash /dfscache/vi_and_mkdce.dump
        cpu
        status
        stat
        trace -m
        errpt
        symptom
        od vmmerrlog 9 a
        quit            # To exit out of the crash command, and
    exit                # To finish the script command.

I then sent her that file, ala

    mail -s 'Item # HC9258' sancom@austin.ibm.com < /tmp/crash.info

After taking a few minutes to look at that dump info, Sandy called back and had
me ship her the complete dump itself via anonymous ftp.  To collect the
information, I first ran this snap command,

    snap -a -N -d /dfscache/snap.command

It turns out I maybe shouldn't have specified "/snap.command".  I was under the
impression this command would generate one file.  Instead it created a
/dfscache/snap.command directory with 16 subdirectories underneath it.  Anyway,
I then ran

    snap -c -d /dfscache/snap.command

which tar'd up and compressed that directory, and created the file
/dfscache/snap.command/snap.tar.Z.
I renamed that file to my PMR number.brach office number.tar.Z, cd /dfscache/snap.command mv snap.tar.Z 34353.49R.tar.Z then anonymously rftp'd it to Sandy. Sandy said I could either put it in the /incoming directory of the IBM-internal cia.austin.ibm.com server, or the /aix directory of the external testcase.boulder.ibm.com server. I chose the latter. 07/29/98 This item got moved around a lot and now is with Dwayne McConnell, (DWAYNE at IBMUSM26, tie-line 8-678-2720). He suggested I needed to get APAR IX79277, which brings bos.mp up to 4.2.1.12. I had 4.2.1.10. I used fixdist to get the fix, which actually drug down 4.2.1.14, installed it, but still got a system crash, this time with just vi running/hanging. Also, with this boot, root didn't have a password. Strange. I know it had a password. And to prove it, when the system rebooted after this dump, root again had the correct password. More strangeness. I left a message for Dwayne. 08/04/98 Called Dwayne McConnell again to follow up on last week's call, and he decided he wanted the dump, so I erased the old stuff from /dfscache, and copied the dump to /dfscache/vi.dump, and ran these commands snap -a -N -d /dfscache/snap snap -c -d /dfscache/snap cd /dfscache/snap mv snap.tar.Z PMR34353.49R.000.tar.Z (000 is USA's country code) /local/bin/rftp testcase.boulder.ibm.com anonymous jasper@almaden.ibm.com cd /aix bin put PMR34353.49R.000.tar.Z I also offered reinstalling AIX to Dwayne. He said he didn't think it would fix things, but it's worth a try. 08/06/98 I've still got the same problems after the reinstall. Called Dwayne back to inform him and also to see what's up. Had to leave a msg. 08/07/98 Called Dwayne back directly to inquire when I should expect to hear back from him, especially with a solution. 08/12/98 Called the Support Center again to get a status. The problem's been transferred to Manoj Kumar (tie-line 678-3708). Manoj called back and left this message, "Regarding the PMR with system crashes. I can't really see any code problem or any memory corruption. I have a very strong feeling in this case you're running into some problem with hardware, somewhere in the planar board where things are going haywire in the sense of what's in the memory is not what the CPU is seeing, so when it does address calculation, it's messing up. ... Tell service that you're having strange and random crashes ... I can't see anyway that we can run into these kind of problems. Three random crashes in completely different parts of the code and unexplained by the memory, at least at the time of the dump. Everything in memory looks ok, (but) by the time the processor saw it, some of these things had (changed). I called it into service at 4 pm Wednesday, 8/12/98. J50-2605593. Reference # 31WB5KH. Ernie Garcia came Thursday morning, 8/13. He ran diagnostics, which didn't show anything wrong (as expected), and he ordered a new system planar, which got installed Thursday afternoon. By Friday noon, I had re-installed AIX and had an up & running, fully-configured system. I called the Support Center and closed the problem. On Thursday, 8/20/98, I called in this J50 yet again. There's still something wrong with it, the same thing. Yes, I was able to last week get by the mkdce and fully configure the system, but then the system went south on me. I reinstalled and tried to configure it again, but again, I die on the mkdce. Reference # 32TJHB2. Anthony DeMott's gonna look at it on Friday morning, 8/21/98, but I'll be on vacation. 
============================================================================== HE4429 x I've got two machines (as0202e0 & as0203e0) that 08/05/98 x have new SSA arrays defined on them Both are at 4:45 x hdisk2 and are 27.3GB big. When trying to put them Aaron & Brenda x into a volume group, I get two different error --mail address-- x messages. Repeated "cfgmgr" & "diag -a" commands --phone-- x don't change anything. Even deleting the SSA RAID x array and redefining it didn't change anything. xxxxxxxxxxxxxxxxxxxxxxx The SP/2 is a 9076, serial # 0277261. as0202 gives 0516-796 mkvg: Making hdisk2 a physical volume. Please wait. mkvg: An invalid physical volume ID been detected on hdisk2. 0516-862 mkvg: Unable to create volume group. as0203 gives 0516-796 mkvg: Making hdisk2 a physical volume. Please wait. 0516-304 mkvg: Unable to find device id 55aa75c78bf5ea00 in the Device Configuration Database. 0516-324 mkvg: Unable to find PV identifier for physical volume. The Device Configuration Database is inconsistent for physical volumes. They keyed on the fact that a lspv command returned hdisk0 000018505d4d6920 rootvg ( 9.1 GB rootvg) hdisk1 none None ( 4.5 GB unused) hdisk2 0000000000000000 None (27.3 GB SSA ) Note the 16 zeros for the PVID. They focused on getting the real PVID in this command's output. They first tried dd if=/dev/zero of=/dev/hdisk2 bs=512 count=1 chdev -l hdisk2 -a pv=yes but when that didn't work, they had me do rmdev -dl hdisk2 after which, the lspv command didn't return anything for hdisk2, as expected. cfgmgr after which, the lspv command showed "none None" hdisk2. That's good. chdev -l hdisk2 -a pv=yes and lo and behold, the lspv command returned hdisk2 0000185018f665db None (for as0202e0) hdisk2 00001102190c0f13 None (for as0203e0) as expected. I was then able to do the mkvg and the rest of what I needed to do to define the file system, fine. ============================================================================== ------ x In the patent server domain, I've lost my quorum --/--/98 x among the DCE servers. ar0073e0 is supposed to be my --:-- x master, with 71 & 72 replicas, but I can't define a --who-- x new volume for example. I get the messages --mail address-- x Could not lock FLDB entry (id=0,,16, type=0, op=32) --phone-- x Error: no quorum elected (dfs / ubk) x Error in release: no quorum elected (dfs / ubk) xxxxxxxxxxxxxxxxxxxxxxx I get the same thing when I try a fts release users. That was at 4pm on 8/12. The next morning, I started investigating. The web help pages say to check time on each server (it's correct) and to type udebug -rpcgroup /.:/fs -long "to analyze the FLDB and check if the different machines have the same version of the database." They don't. ar0071e0 shows Host 192.168.56.71, his time is 0 Vote: Last yes vote for 255.255.255.255 at -3 (not sync site); Last vote started at -903033334 Local db version is 902640226.1 I am not sync site Lowest host 255.255.255.255 at -903033334 ar0072e0 shows Host 192.168.56.72, his time is 0 Vote: Last yes vote for 255.255.255.255 at -5 (not sync site); Last vote started at -903033378 Local db version is 902617959.1 I am not sync site Lowest host 255.255.255.255 at -903033378 ar0073e0 shows Host 192.168.56.73, his time is 0 Vote: Last yes vote for 192.168.56.71 at -28 (not sync site); Last vote started at -38 Local db version is 902640226.1 I am not sync site Lowest host 192.168.56.71 at -28 But then when I was poking around some more, everything came into synch. 
The above udebug -rpcgroup /.:/fs -long command showed Host 192.168.56.73, his time is 0 Vote: Last yes vote for 192.168.56.71 at -8 (sync site); Last vote started at -9 Local db version is 903033621.1 I am not sync site Lowest host 192.168.56.71 at -8 on all 3 servers. I don't know what fixed things. The only clue maybe is these lines found in /var/dce/dfs/adm/BosLog.old, Sun Aug 2 04:00:05 1998: /opt/dcelocal/bin/bosserver: beginning logging Sun Aug 2 04:00:05 1998: Server directory access is okay Sun Aug 9 04:00:00 1998: reBossvrWatchThread: no error; simple restart exit compat_UnregisterServer: unexpected error from rpc_ep_unregister: Not registered in endpoint map (dce / rpc) Sun Aug 9 04:00:00 1998: /opt/dcelocal/bin/bosserver: error unregistering self: Not registered in endpoint map (dce / rpc) Sun Aug 9 04:00:02 1998: /opt/dcelocal/bin/bosserver: error destroying bnode timeout condition variable; errno = 16 Sun Aug 9 04:00:02 1998: childWatchThread: exception or cancellation in (cma_)sigwait (bossvr_thread_childWatch.c: 245) Sun Aug 9 04:00:03 1998: /opt/dcelocal/bin/bosserver: application exit I wound up never calling this in since things seemed to fix themselves, but I did want to document what I saw in case it happened next time. ============================================================================== 42790 x I've tried four times now on two different systems 08/20/98 x to install Net.Commerce v3.1. On as0204e0, I learned 10:00 x how to smoothly install Net.Commerce (have 260Mb free Tillman Baldwin x in /usr and have ipfx 2.2 pre-installed), but when I TJBALDWI at IBMUSM20 x get to the step of configuring Net.Commerce from the 8-444-7687 x NT web browser, I get an error message, "Cannot x modify web server configuration file." I got the xxxxxxxxxxxxxxxxxxxxxxx same thing on as0206e0. I'm thinking now, that this was my own fault. I thought I was selecting the "DB2 UDB Workgroup", but I was doing just the "DB2 Client Application Enabler" instead. I'm redoing it now. I tried reinstalling on as0204, but got another system crash (see below). Tried again on as0206e0, but I get the same error - "Cannot modify web server configuration file." ============================================================================== HI7185 x Sometimes after rebooting a silver SP/2 node, I 08/28/98 x wind up with a zero-length /etc/inetd.conf. Just 12:00 x prior to the install, I installed a bunch of Donovan x additional software, mostly X things, and the latest x AIX service that I had, so I don't really know what x of those 3 things caused the zero-length inetd.conf, x installing software, applying service, or rebooting. xxxxxxxxxxxxxxxxxxxxxxx The Support Center hadn't heard of others complaining of the same thing, and it didn't happen at the next reboot, so we just closed the incident. At least if somebody else gets the same symptom, they may get a hit in RETAIN. ============================================================================== HR2493 x Kin says ar0143e1 took a dump on Saturday, 9-26-98 09/29/98 x at 6:17 pm, that he wanted me to call in. ar0143e1 15:15 x is a 43P-240 (not a 530 like I told the Support Leslie Devlin x Center). 
I copied the dump over to ldevlin@austin.ibm.com x /var/adm/ras/system_dump_on_Sep26 and did the normal 8-523-4253 x script /tmp/crash.out x crash /var/adm/ras/system_dump_on_Sep26 x cpu xxxxxxxxxxxxxxxxxxxxxxxx status stat trace -m errpt symptom od vmmerrlog 9 a quit (to exit out of crash) exit (to exit out of script) I then mail -s'PMR HR2493,bo49R' ldevlin@austin.ibm.com < /tmp/crash.out. Later, Leslie called back to say the system crashed in the middle of PHXENTBD (whatever that is), which is in the fileset devices.pci.23100020.rte, which is the PCI 10/100 Ethernet Device Driver. She suggested installing the latest level, which is 4.2.1.4. We had 4.2.1.2 on, so I went ahead and got and installed that version, but I wonder if it's really going to fix the problem. It turns out that the system crashed overnight, so we're now running with the 4.2.1.4 version of devices.pci.23100020.rte. Looking at it a bit more closely, I see we've had 11 crashes on ar0135e0 this month. Here's a synopsis from the error log, IDENTIFIER TIMESTAMP T C RESOURCE_NAME DESCRIPTION --- Failing Module --- 3573A829 0930003398 U S CMDCRASH SYSTEM DUMP dfscore 3573A829 0926182298 U S CMDCRASH SYSTEM DUMP phxentdd 3573A829 0925130898 U S CMDCRASH SYSTEM DUMP if_en/netinet/dfscorep 3573A829 0924195798 U S CMDCRASH SYSTEM DUMP dfscore 3573A829 0923204198 U S CMDCRASH SYSTEM DUMP dfscore 3573A829 0922125298 U S CMDCRASH SYSTEM DUMP nfs.ext 3573A829 0903172598 U S CMDCRASH SYSTEM DUMP dfscore 3573A829 0903152098 U S CMDCRASH SYSTEM DUMP ??? 3573A829 0903125098 U S CMDCRASH SYSTEM DUMP dfscore 3573A829 0902213098 U S CMDCRASH SYSTEM DUMP dfscore 3573A829 0902103898 U S CMDCRASH SYSTEM DUMP csa/nfs.ext/vnop_rdwr I did the following to package both dumps together, document things, ftp the tar file to Boulder, then I called Leslie back. snap -a -N -d /ips/junk (This compressed last night's dump and wrote it to /ips/junk/dump/dump.Z, so to include the prior dump there, I did ... ) compress /var/adm/ras/system_dump_on_Sep26 cp -p /var/adm/ras/system_dump_on_Sep26.Z /ips/junk/dump vi /ips/junk/other/README.PROBLEM snap -c -d /ips (This tar'd up /ips/junk & wrote /ips/junk/snap.tar.Z) cd /ips/junk mv snap.tar.Z PMR-HR2493.49R.000.tar.Z To ftp it, rftp testcase.boulder.ibm.com anonymous jasper@almaden.ibm.com cd /aix bin put PMR-HR2493.49R.000.tar.Z I learned later that my "HR2493" number is not a PMR number, but what the Support Center calls an "X-menu" number. I was suppose to call Leslie and have assign me a PMR number, then use that in the tar file's name. Oh, well. Also, since we have at least 2 different problems, Leslie says she's opening up a second PMR. The phxentdd dump is PMR 48613 and the dfscore dump is PMR 48642. We eventually replaced the 43P with another idle 43P, moving over all the adapter cards & disk drives, and the new 43P is running fine. The old 43P that we now suspect there's something wrong with, is being used by Kin and/or Bruce for something or rather. ============================================================================== JF2646 x When migrating from PSSP 2.2 to PSSP 2.4 on one of 11/23/98 x our older nodes (as0101 in this case), "Step 10" in 13:30 x the PSSP book has you run /tmp/pssp_script, which is Cliff x really /spdata/sys1/install/pssp/pssp_script. --mail address-- x There were a few problems running that script. --phone-- x x xxxxxxxxxxxxxxxxxxxxxxx 1) It failed in handling of fbcheck in /etc/inittab when it didn't exist. 
"Fixed" by adding "fbcheck:2:wait:/usr/sbin/fbcheck" in inittab before running pssp_script. 2) Also had to insure /spdata/sys1/install was NFS exported. 3) Also had to manually install devices.chrp.base.rte on the node. 4) Had to insure xlC.rte was at 3.1.4.8 and a bunch of other stuff was updated, so I just applied latest service. After all the above, pssp_script ran ok. ============================================================================== PMR 59782,49R x I forgot to document this PMR and frankly, forgot 11/23/98 x all about it 'till I got a note on 2/25/99 saying x that PTF U463038 for PMR 59782 is closed. Huh??? xxxxxxxxxxxxxxxxxxxxxxx This was a problem that I spent 12/3/98 debugging. Turns out that /etc/rc.sp does a stopsrc and a startsrc -s inetd, but the startsrc wasn't working 'cause inetd wasn't stopped yet, it was in the "inoperative" state. I fixed /etc/rc.sp by adding these lines after the stopsrc while [[ -z `lssrc -s inetd | grep inoperative` ]] do sleep 1 done and these after the startsrc while [[ -z `/local/bin/lsof -i -n | grep 'kshell (LISTEN)'` ]] do sleep 1 done to wait for inetd to come up. The APAR is IX85054. What they did is update /usr/lpp/ssp/install/bin/rc.sp in level 2.4.0.7 to essentially put in my wait loop after the stopsrc. They didn't address waiting after the startsrc. I wonder if that'll be a problem for me. ============================================================================== Item # JK3543 x My machine, jasper, a 7043-140 (43P) running AIX PMR # 67742,49R x 4.3.1 with DCE 2.2.0.2 and all the latest fixes, 12/29/98 x still occasionally hangs while trying to contact a 9:20 x DFS server. I get these messages, Dot x dfs: lost contact with server 9.1.24.165 support@transarc.com x in cell: almaden.ibm.com (412) 667-4500 x7391 x I called this problem into the support center a xxxxxxxxxxxxxxxxxxxxxxx couple of weeks ago, but at that time, they just told me to reboot & install the latest fixes, which I did. I'm now at the latest, DCE 2.2 with Fix Pack 2, that I can be at. Also last time, the dfsbind process was huge, so I set up a little cron job to hourly go see what the size of the dfsbind process was. Here was the job, #!/bin/ksh file=/tmp/monitor_dfsbind # Put in the header line if the file doesn't exist. if [[ ! -f $file ]] then echo "------------------------------->" $(ps aux | head -1) > $file fi # Find the dfsbind PID with the inner ps -ef command (the ps aux command # doesn't give the full name of the running program, so we've got to use # the ps -ef command to find the PID, then look for that PID in the # ps aux command). # Keep that line in the designated file. echo $(date) "-->" $(ps aux | grep -v grep \ | grep "^root *$(ps -ef | grep -v grep | grep -v monitor_dfsbind | \ grep dfsbind | awk '{print $2}')") >> /tmp/monitor_dfsbind with this crontab entry, # Run a frequent job that monitors the size of the dfsbind daemon. 0 * * * * /monitor_dfsbind.sh 1>/dev/null 2>& I had this running since 12-10-98 and had 2 reboots since then. Without boring you with details, the size of the dfsbind process during the first boot, stayed pretty constant at 1172-1500Kb. I had to reboot 12-14 to replace the processor heat sink & fan (it was making noise). The second boot saw the dfsbind process slowly grow from 1172Kb to 1812Kb over 8.5 days. I had to reboot yesterday, 12-28, 'cause the system was apparently hung over the Christmas holidays. 
I had taken off 12/22 - 12/28 and the last entry in the /tmp/monitor_dfsbind log was dated Wednesday morning at 1 am. This last time, the system has only been up for on day and the hourly watch on dfsbind has seen it mushroom. Here are the lines from the last reboot (slightly edited), ----------------------> USER PID %CPU %MEM SZ RSS STIME TIME COMMAND Mon Dec 28 09:00:00 --> root 8792 0.0 0.0 1136 472 08:16:37 0:00 /opt/dcelocal/bin Mon Dec 28 10:00:00 --> root 8792 0.0 1.0 1180 664 08:16:37 0:00 /opt/dcelocal/bin Mon Dec 28 11:00:00 --> root 8792 0.2 6.0 8004 7788 08:16:37 0:17 /opt/dcelocal/bin Mon Dec 28 12:00:01 --> root 8792 0.8 13.0 16936 16728 08:16:37 1:51 /opt/dcelocal/bi Mon Dec 28 13:00:00 --> root 8792 1.1 16.0 21352 21244 08:16:37 3:05 /opt/dcelocal/bi Mon Dec 28 14:00:00 --> root 8792 1.6 22.0 28300 28192 08:16:37 5:31 /opt/dcelocal/bi Mon Dec 28 15:00:00 --> root 8792 1.8 24.0 31648 31540 08:16:37 7:05 /opt/dcelocal/bi Mon Dec 28 16:00:00 --> root 8792 2.1 28.0 37064 36944 08:16:37 9:48 /opt/dcelocal/bi Mon Dec 28 17:00:00 --> root 8792 2.7 33.0 43756 43560 08:16:37 13:57 /opt/dcelocal/bi Mon Dec 28 18:00:00 --> root 8792 2.9 37.0 48040 47796 08:16:37 17:04 /opt/dcelocal/bi Mon Dec 28 19:00:00 --> root 8792 3.4 41.0 53816 53372 08:16:37 21:44 /opt/dcelocal/bi Mon Dec 28 20:00:00 --> root 8792 3.8 45.0 59152 58468 08:16:37 26:40 /opt/dcelocal/bi Mon Dec 28 21:00:00 --> root 8792 4.4 49.0 65412 63756 08:16:37 33:21 /opt/dcelocal/bi Mon Dec 28 22:00:00 --> root 8792 4.9 53.0 71392 69608 08:16:37 40:33 /opt/dcelocal/bi Mon Dec 28 23:00:00 --> root 8792 5.4 57.0 76920 74428 08:16:37 47:47 /opt/dcelocal/bi Tue Dec 29 00:00:00 --> root 8792 6.0 60.0 82812 78396 08:16:37 56:16 /opt/dcelocal/bi Tue Dec 29 01:00:00 --> root 8792 6.6 60.0 89008 78784 08:16:37 66:10 /opt/dcelocal/bi Tue Dec 29 02:00:00 --> root 8792 6.8 60.0 92644 79144 08:16:37 72:01 /opt/dcelocal/bi Tue Dec 29 03:00:00 --> root 8792 7.1 61.0 97176 80200 08:16:37 79:59 /opt/dcelocal/bi Tue Dec 29 04:00:00 --> root 8792 7.3 63.0 100264 81632 08:16:37 85:50 /opt/dcelocal/b Tue Dec 29 05:00:00 --> root 8792 7.4 63.0 103408 82740 08:16:37 91:31 /opt/dcelocal/b Tue Dec 29 06:00:02 --> root 8792 7.5 65.0 106280 84932 08:16:37 97:14 /opt/dcelocal/b Tue Dec 29 07:00:01 --> root 8792 7.6 66.0 109640 86588 08:16:37 104:00 /opt/dcelocal/ Tue Dec 29 08:00:08 --> root 8792 7.7 68.0 112364 89308 08:16:37 109:39 /opt/dcelocal/ Tue Dec 29 09:00:03 --> root 8792 7.7 68.0 114820 88780 Dec 28 114:52 /opt/dcelocal/ As you can see, the dfsbind process is taking over all my resident memory, causing huge paging delays. This was the same symptom I saw 3 weeks ago. I restarted the dfsbind process via the smit menus to get me by, 'till I can get on the horn with the Support folks about this. 01/14/99 After playing phone tag with the Support Center for a Reggie Clinton few weeks (I haven't pushed the issue), Reggie finally 8-444-5257 got hold of me. I told him about my monitor_dfsbind log l2dce@us.ibm.com file and he told me about 2 commands he wanted me to type in to dump some internal tracing. The 2 commands are dfstrace dump -file /tmp/dfsdump and send a kill -30 signal to the dfsbind process, which will write /var/dce/dfs/adm/icl.bind. Both are painless to run, so I ran both and mailed him all 3 files (/tmp/monitor_dfsbind, /tmp/dfsdump, and /var/dce/dfs/adm/icl.bind). When dfsbind starts to get bloated again, I'm to run the above 2 commands again and mail him the 3 files again. 
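Since I'll have to repeat this whenever dfsbind bloats again, here's a minimal ksh sketch of the collection-and-mail step (the PMR number, file names, and Reggie's l2dce@us.ibm.com address are the ones above; the PID grep is the same trick my monitor script uses):

   #!/bin/ksh
   # Collect the dfsbind diagnostics and mail them off (a sketch, not a supported tool).
   dfstrace dump -file /tmp/dfsdump
   # kill -30 makes dfsbind write its in-core trace to /var/dce/dfs/adm/icl.bind.
   bindpid=$(ps -ef | grep -v grep | grep dfsbind | awk '{print $2}')
   kill -30 $bindpid
   sleep 3    # give dfsbind a moment to write the file
   for f in /tmp/monitor_dfsbind /tmp/dfsdump /var/dce/dfs/adm/icl.bind
   do
       mail -s "PMR 67742,49R $f" l2dce@us.ibm.com < $f
   done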
01/15/99 Dale talked to Liz Hughes (who I met in San Antonio), Liz Hughes who's looking at the problem. 8-678-3483 I talked with Liz, who was more interested in my complaint of loosing contact with a server, than the dfsbind process growing. I told her that the loosing contact was more of a transient thing, that the dfsbind problem occurred more often. Liz wanted to close this item due to the confusion of symptoms, and if/when it happened again, to open up a new PMR. See Item # JP9021 below for a continuation of the apparent dfsbind memory leak. See also Item # 84648,49R below for the "Lost contact with " msg. ============================================================================== Item # x x I've got problems installing a VeriSign SSL PMR # 67918,49R x certificate on two machines, baboon & penguin. 12/31/98 x Baboon has the Lotus Domino Go Webserver 4.6.1.0 --:-- x installed. Penguin has Internet Connection Server Paul Kelsey x 4.2.1.7, just like the patent site does. Both appear --mail address-- x to receive the certificates fine, but give the 8-444-5399 x following error in a popup window whenever you try to xxxxxxxxxxxxxxxxxxxxxxx get some page via https/SSL: The security library has encountered an improperly formatted DER-encoded message. and the bottom line is, it don't work. I tried calling the VeriSign Support number (1-650-429-3400, options 1,2,2) and they sent me a URL (http://www.StructuredArts.com/edu/ssleay-pkcs10.htm) to look at, but that wasn't too useful 'cause it was too technical. It was purported to "give a brief description of how you can generate a PKCS-10 certificate request using SSLeay -.6.6 or later for use with the Netscape Certificate Server 1.01 product." The solutions it offered was too technical to follow and I'm not sure it really applied to my problem. The other solution the VeriSign folks offered was to specify a different web server when you request the certificate. On one of their web pages, they ask you to specify which web server you've got. Two of the choices are Lotus and IBM. So which should I select for the "Lotus Domino Go Webserver", which is put out by IBM? Initially, I chose IBM, so I went through VeriSign's web pages to reissue the certificate (free within 30 days), specifying Lotus. After re-receiving the new certificate, I got the same thing. 1-5-99 Paul Kelsey called and said yes, this is a known problem and he 10:25 knows how to take care of it. I gave him root's password on baboon and he's gonna do his magic to fix it. 2:00 Paul fixed it and sent me replacement keyfile.kyr & keyfile.sth files. Both seem to work. I left a phone msg with Paul to ask 1) What did he do exactly, so I'm able to do it myself. Answer: Use the /usr/sbin/mkkf utility distributed with the new webserver code, and delete the obviously bad key. It's the one that when you cycle through the keys in the keyfile, has a name like Current Key Name: CN = penguin.almaden.ibm.com, OU = Almaden Research Center, O = International Business Machines Corporation, L = San Jose, ST = California, C = US instead of the more normal looking, Current Key Name: Patent Server You can then select the "Patent Server" key. When you "Show" the information about that key, it should *NOT* have the "Key has a certificate." line at the bottom. When you back out a menu or two, you can select "R - Receive a Certificate into a Key Ring File" to properly receive the certificate. Save the keyfile and use it. 2) How does one upgrade to 4.6.2.5? 
Answer: You've got to get the code through the Support Center. I've now got it and installed it on baboon. Be careful to save the following files: /etc/httpd.conf /etc/ics_pics.conf /etc/lgw_fcgi.conf <--- Called /etc/ics_fcgi.conf for ICS 4.2.1.7. /etc/servlet.conf <--- I didn't save this one & it got overwritten. /usr/lpp/internet/server_root/protect/webadmin.passwd <--- Surprisingly, this file apparently got overwritten. 3) How can I get penguin, running ICS 4.2.1.7 to work correctly? Answer: Once I installed the 4.6.2.5 Lotus Domino Go Webserver on baboon, I could then get penguin's keyfile.kyr file over to baboon and manipulate it with the mkkf utility mentioned above. It was easy enough to figure out and it fixed it beautifully. 4) Can he help with www4.patents.ibm.com. I had generated the request on as0110e1 on 9-15-98 and received it and (I thought) had tested it to insure it worked (I could be mistaken about the test), and sure enough, the keyfile.sth & keyfile.kyr files are all dated 9-15-98 at 15:50, but I don't know how to tell if the certificate has been received yet or not. The problem I have with that is the keyfile password doesn't even unlock it. Answer: I had mistyped the password and got lucky and found my error and was able to unlock it & change the keyring password to what it should have been. It turns out I never received the certificate, so I did that as well. ============================================================================== Item # JL1215 x I've got a problem with dsh-ing commands from the PMR # xxxxx,49R x CWS. Even though we're authenticated as root.admin, --/--/99 x the dsh command doesn't work to most machines. --:-- x It only works to as0201-5, ar0079e0, and ar0081-4e0. --who-- x For all the other machines, you get the error --mail address-- x message, --phone-- x 3004-609 Your password has expired. xxxxxxxxxxxxxxxxxxxxxxx /usr/lpp/ssp/rcmd/bin/rsh: 0041-004 Kerberos rcmd failed: rcmd protocol failure. It's almost as if there's a password associated with each machine that has expired, but I've never heard about that. I spent a good hour and a half on the phone with the Support Center guy who was worse than useless. After poking around and not getting anywhere - he was convinced it had something to do with the fact these machines were DCE client machines and the PSSP Kerberos code was getting confused with the DCE Kerberos code - it wasn't, he finally said he was going to fax me instructions on rebuilding the DCE Kerberos database and that should fix things. "We've had a high success rate after doing this," he said, so he expected this to fix my problem. Turns out he never sent me those directions. Meanwhile Tom discovered that if you "su" to root, you get the "Your password has expired." message, which you don't get if you login to root directly. The problem was that root's password was 26 weeks old. It's time to change root's password on all machines. ============================================================================== Item # x A Shell Patent Server customer called to complain PMR # 70943,49R x about problems getting to the site. I'm not sure --/--/99 x what his problem really is, but I do see these error --:-- x messages in the /arc/ipnfb/logs/httpd-errors.* files Bill Polomchak x for him, --mail address-- x [PUT NOT ALLOWED] [host: gate2.shellus.com] 8-526-1619 x SSL Handshake failed. xxxxxxxxxxxxxxxxxxxxxxx so I called the Support Center to ask them what it meant. They said that was a generic message and it could be a lot of things. 
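Just to quantify it for myself, a quick tally against the error logs quoted above shows how often this customer is actually hitting it (a sketch; the host string and log path are the ones from the problem description):

   # Count the SSL handshake failures involving the Shell gateway, per log file.
   for log in /arc/ipnfb/logs/httpd-errors.*
   do
       echo "$log: $(grep 'gate2.shellus.com' $log | grep -c 'SSL Handshake failed')"
   done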
They suggested I go to 4.2.1.9, which fixed a lot of things in the code and will probably eliminate those messages. They also said there was another level coming out in a week or so, 4.2.1.10, which may help. I told them I'd take both and Bill said he was gonna put 4.2.1.9 on an FTP site and send me a note with the info, but I never heard from him. I left a message with him Monday morning, 2/1/99. Bill made the 4.2.1.9 code ftp-able from wp5.raleigh.ibm.com in /usr/rjasper. I got a copy & installed it on as0103e0. The server appears fine, but I still get hundreds of "SSL Handshake failed." messages in the httpd-errors log. I'm also getting a lot of "Request parsing failed." messages that I'd like to understand. I sent Bill a note on 2/12/99 to ask if the 4.2.1.10 code was available yet, and he's out of town 'till 2/22. I called the Support Center back and asked for Diane Swan (8-526-1933) to call back. 3/03/99 We got the 4.2.1.10 code & installed it on the gold nodes, 4:00 as0103 & as0104, and we're still getting the same Diane Swan [PUT NOT ALLOWED] [host: ...] SSL Handshake failed. 8-526-1933 as well as [OK] [host: ...] Request parsing failed. messages. I called Diane back & left a phone message saying this. 4/02/99 Got a note from Kevin asking for our config file & what 9:00 level of Go we have. I sent him the config file & told him we Kevin Vaughan didn't have Go, we had ICS 4.2.1.10. Raleigh The bottom line with this "SSL Handshake failed." message is that it's probably caused by the user interrupting the page download, and it's not worth exploring and tracking it down any further, especially since, as the Support Center guys said, we're on such a down-level web server. ICS has been replaced by the Lotus Domino Go web server, and that's now being replaced by the IBM HTTP Web Server Powered by Apache.
==============================================================================
Item # JP9021 x This is a continuation of the apparent dfsbind PMR # 71121,49R x memory leak problem I opened on 12/29/98 (see 02/04/99 x Item # JK3543, PMR # 67742,49R above). Level 1 14:15 x wanted me to package up what I've got & drop it off Claudia Barrett x as JP9021.tar.Z in their testcase.boulder.ibm.com zeclaw@us.ibm.com x anonymous ftp site. This, I did, along with a little 8-441-3455 x README file to tell them what it was. An interesting xxxxxxxxxxxxxxxxxxxxxxx tidbit with the way they have their drop-off ftp site set up is that you can't overwrite stuff that's already there. I put the file with the README there as JP9021-with-README2.tar.Z. I got a call back from the Support Center asking me to repackage the files and this time use relative path names, instead of absolute. I did. Claudia asked me to pick up /aix/fromibm/71121.b49r.tar.Z from testcase.boulder.ibm.com. I've untar'd it in ~jasper/71121 for now. I've got to read the README carefully, as it seems to be very dangerous. Sure enough, it is dangerous. What the package does is to modify /usr/ccs/lib/libc.a in a /tmp directory, and mount the file (not a directory) /tmp/libc.a.debug/libc.a over /usr/ccs/lib/libc.a. When this happens, login or anything that tries to authenticate (su, telnet, ftp, etc) quits working. This in itself I could live with for a while, but when I tried restarting dfsbind, my system froze. One advantage of mounting this single file is that when you reboot, the file "mount" goes away and things revert to normal.
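I haven't tried it by hand, but the file-over-file trick the package uses presumably boils down to something like this (AIX lets you mount one file over another; an umount, or a reboot, undoes it):

   # Hide the real libc.a behind the debug copy; the real file is untouched underneath.
   mount /tmp/libc.a.debug/libc.a /usr/ccs/lib/libc.a
   # ... run the debug scenario ...
   # Back out without rebooting.
   umount /usr/ccs/lib/libc.a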
I did have to reboot, but even now after a reboot, my system is giving me "/bin/ksh: A remote host did not respond within the timeout period." messages whenever I execute a command (probably while trying to update the shell history file). I called the Support Center again, when Claudia seemed to have gone home. Also, X seems very slow. X events like highlighting a line of text don't work as quickly as they should. Oftentimes, the highlighting doesn't happen. 2/08/99 Monday morning. Went back to trying to use this debug version of /usr/ccs/lib/libc.a, which evidently is used for all kinds of commands, like ls, cp, mv, etc, so when trying to move/copy this file in place, you have to mv it to get the existing one out of the way, then since there is no libc.a, you have to use this new feature/construct of ksh variable assignment, variable=value command where variable=value holds for this one command only. The original libc.a is at /usr/ccs/lib/libc.a.orig, and /tmp/libc.a.SAVE/libc.a. The debug libc.a is at /usr/ccs/lib/libc.a.debug and /tmp/libc.a.debug/libc.a. To move the debug version of libc.a in place, cd /usr/ccs/lib mv libc.a libc.a.orig LIBPATH=/tmp/libc.a.SAVE cp -p /tmp/libc.a.debug/libc.a . To start up the dfsbind process, kill -9 the currently running dfsbind process, if there is one. export DEBUGMALLOC=3 /opt/dcelocal/bin/dfsbind To move the original libc.a back in place, cd /usr/ccs/lib mv libc.a libc.a.debug LIBPATH=/tmp/libc.a.SAVE cp -p /tmp/libc.a.SAVE/libc.a . The problem is the dfsbind process crashed whenever I tried running it with the debug version of libc.a. Claudia sent me the senddata.pl Perl script (available from their web/ftp site - See the "DCE for AIX" link at http://www.software.ibm.com/enetwork/dce/downloads) which I had to fix for my machine (Perl in /local/bin/perl, not /usr/local, and I had to escape 2 occurrences of "@austin"). I put the fixed version in /usr/bin on jasper, as well as in /usr/local/bin (which is a link to /dfs/apps/userlocal, so it's at /dfs/apps/userlocal/bin/senddata.pl) on the Patent Server site. The core file was at /var/dce/dfs/adm/dfsbind/core. I ran senddata.pl & Claudia, who happened to be logged on, got a copy of the packaged stuff, which I put in /tmp/71121. 2/11/99 Was informed that they found a bug in an AIX library that causes dfsbind to core dump when run with the debug library. They've opened up APAR IX87547 to fix that. The description in RETAIN says In libs_threads.c, the code for __libs_child_post_fork frees a global variable _sec_rmutex by calling _rec_mutex_free(). _sec_rmutex is declared as: struct rec_mutex _sec_rmutex; While _sec_rmutex never get malloc'd, therefore it should never be freed. As I understand this, this won't fix my original problem, it will just allow me to run dfsbind with the debug library. We'll see. 4/05/99 Closed 3/18/99 as a duplicate of IX85363. Called the Support Center to get a status on IX85363, specifically whether or not I could get the PTF. 4/27/99 Got a phone call from Bill Smith, DCE Level 2 in Raleigh, Bill Smith saying that PTF U464373 has been closed for IX85363, and he's 8-441-4096 sending me the CD. 4/28/99 Got the CD, installed the fix, and rebooted jasper (good thing too, 'cause it was hosed up in the can't talk to DFS state - see PMR # 84648,49R below). The phone messages I've been getting from Bill Smith indicate he wants to close this PMR, but there's a note in the PMR in RETAIN that says "Hi library folks, The info at the end of the pmr is what pertains to aix.
When this has been fixed, please requeue the secondary back to l2dce,109. Thanks." So my PMR shouldn't be closed. I tried calling Bill and left a msg, so I called the Support Center, who put this note on the last page, "Update: Rick is requesting this pmr NOT be closed, also requesting that it be queued to the correct queue l2dce,109 (ref. pg17) And, would like a callback asap." I think what we need to do is try running dfsbind again with the debug library. ============================================================================== PMR # 72384,49R x Got a problem with multiple Patent Server nodes. 02/08/99 x The dceunixd process either dies completely, or it 1:00 x still shows up in the process table, but isn't doing Julian x its thing of translating userid number to userid name. --mail address-- x For example, "id jasper" returns --phone-- x 3004-820 User not found in /etc/passwd file xxxxxxxxxxxxxxxxxxxxxxx which it doesn't, jasper is a DCE id. When this happens to one of our web servers, the httpd process just stops returning data (not sure why, exactly, but I know if I kill the dceunixd process & restart it, things will start working again). This has happened to 4 nodes last week, and another two today, all different. Each node is running AIX 4.2.1 with DCE 2.1.0.26, which is really current. Julian said one thing we could do is run dceunixd in debug mode, which runs dceunixd in the foreground and spits a bunch of lines at the console. Here are the lines I saw when I did that & logged in from another window as root, which seems to indicate an error, but Julian wasn't interested in it (see below to see what he *was* interested in). dceunixd -d9 Main: Initialization complete serve_client (4): expecting msg of 12 bytes process_req (4): got req_type 1002 do_gnam (4): groupname = tty do_gnam (4): cellname = /.../patent.ibm.com serve_client (8): expecting msg of 12 bytes process_req (8): got req_type 1002 do_gnam (8): groupname = tty do_gnam (8): cellname = /.../patent.ibm.com serve_client (9): expecting msg of 12 bytes process_req (9): got req_type 1002 do_gnam (9): groupname = tty do_gnam (9): cellname = /.../patent.ibm.com do_gnam (4): sec_rgy_site_open returns 387063931 do_gnam (4): sec_rgy_pgo_get_members returns 387063931 do_gnam (4): override_get_group_info returns 387063931 do_gnam (4): reply buf is 47 bytes do_gnam (4): reply buf -- 47:0011:Registry server unavailable (dce / sec) process_req (4): reply buf of size 47 serve_client (4): sending reply, 47 bytes do_gnam (8): sec_rgy_site_open returns 387063931 do_gnam (8): sec_rgy_pgo_get_members returns 387063931 do_gnam (8): override_get_group_info returns 387063931 do_gnam (8): reply buf is 47 bytes do_gnam (8): reply buf -- 47:0011:Registry server unavailable (dce / sec) process_req (8): reply buf of size 47 serve_client (8): sending reply, 47 bytes do_gnam (9): sec_rgy_site_open returns 387063931 do_gnam (9): sec_rgy_pgo_get_members returns 387063931 do_gnam (9): override_get_group_info returns 387063931 do_gnam (9): reply buf is 47 bytes do_gnam (9): reply buf -- 47:0011:Registry server unavailable (dce / sec) process_req (9): reply buf of size 47 serve_client (9): sending reply, 47 bytes Julian called somebody in level 3 & they suggested sending the hung dceunixd process a kill -6 signal, which causes it to create a core file in /var/dce/security/adm/dceunixd. I then tar'd & compressed it up, and dropped it in their testcase.boulder.ibm.com anonymous ftp server at /aix/JR3477.tar.Z. 
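For the record, the whole capture-and-drop-off sequence was roughly this (a sketch of what I did; as it turns out below, the bare core wasn't enough):

   # Force the hung dceunixd to dump core, then package and ship it.
   unixdpid=$(ps -ef | grep -v grep | grep dceunixd | awk '{print $2}')
   kill -6 $unixdpid            # core lands in /var/dce/security/adm/dceunixd
   cd /var/dce/security/adm/dceunixd
   tar cvf /tmp/JR3477.tar core
   compress /tmp/JR3477.tar
   cd /tmp
   rftp testcase.boulder.ibm.com    # anonymous login
       cd /aix
       bin
       put JR3477.tar.Z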
Reggie Clinton Reggie tells me that the core file I packaged up and 8-444-5257 sent to Julian is worthless without the corresponding libraries, which is what their senddata.pl program does (see PMR # 71121 above), that is, it packages up the core file along with the libraries that exist on the system (e.g. libc.a). I found 2 dceunixd core files, on as0111 & as0112, and tried sending them with senddata.pl, but senddata.pl complained that the core files were incomplete. Evidently, I've got to enlarge the /var/dce filesystem. 2/24/99 I finally had time to get back to this. I found a 130MB 11:45 am core file on as0107 at /var/dce/security/adm/dceunixd/core, so I ran /usr/local/bin/senddata.pl to package it up and dropped off the /arc/senddata/datapkg.tar.Z.uu file at testcase.boulder.ibm.com at /aix/JR3477.49r-L3DCE-datapkg.tar.Z.uu. 3/10/99 Got a call from Bob saying that yes, they did want the 11:00 am other dceunixd core file packaged up to look at. Bob Breeze (I had forgotten I had told them I had another one and 8-523-6189 whether or not they wanted it.) I packaged up the one on as0110 dated 2/26/99 using their /usr/local/bin/senddata.pl and sent it to them at testcase.boulder.ibm.com, creating the file /aix/72384.b49R-L3DCE-datapkg.tar.Z.uu.
==============================================================================
Item # JR5749 x Kin has given me a simple-enough C program that PMR # xxxxx,49R x does a gethostbyname library call which, if run as 2/09/99 x root, takes virtually no time at all to run, but if 4:00 x run as a non-root id, takes 2 minutes 20 seconds to Clayton Briggs x run. I tried doing a kernel trace, but couldn't see --mail address-- x anything useful. 8-421-7172 x xxxxxxxxxxxxxxxxxxxxxxx Here is his gethostbyname program #include <stdio.h> #include <stdlib.h> #include <sys/types.h> #include <sys/socket.h> #include <netinet/in.h> #include <netdb.h> #include <time.h> main(int argc, char**argv){ struct hostent h, *p; long now; if (argc!=2) { printf("Usage: %s hostname\n", argv[0]); exit(-1); } time(&now); printf("starting : gethostbyname(%s) at %s", argv[0], ctime(&now)); p=gethostbyname(argv[1]); if (p!=NULL) { printf("HOST=>%s\n",p->h_name); } else { printf("UNKNOWN HOST\n"); } time(&now); printf("ended : gethostbyname(%s) at %s", argv[0], ctime(&now)); } Which produces output like this when run from my non-root userid, jasper@ar0141e1> gethostbyname ar0141e11.patent.ibm.com starting : gethostbyname(/ips/bin/gethostbyname) at Tue Feb 9 12:30:38 1999 HOST=>ar0141e1.patent.ibm.com ended : gethostbyname(/ips/bin/gethostbyname) at Tue Feb 9 12:33:08 1999 Clayton compared the version of xlC.rte on ar0141e1 (3.1.4.8) versus what it was on the machine it was compiled on, which was Kin's machine spartan, which had 3.1.4.4. Clayton pointed me to http://ftp.software.ibm.com/cgi-bin/support/rs6000.support/downloads where I was able to search on PTF U453695 & download 5 installp images, xlC.C++.heapview.3.1.4.6 xlC.C++.iclui.lib.3.1.4.2 xlC.C++.iclui.samples.3.1.4.1 xlC.C++.lib.3.1.4.6 xlC.rte.3.1.4.8 I got Kin's permission and installed them on his spartan machine, but it didn't do any good. Thanks to a hint from Glenn Deen, namely that if something takes 2.5 minutes to run, then you should be thinking TCP/IP timeout, I was able to figure it out. According to Glenn, TCP takes 2 minutes to time out, after which, for DNS at least, it tries UDP, which times out after 30 seconds (or so). While it's hung, said Glenn, do a netstat -A (or better yet, an lsof -i -n, then you see the running process), and you might see attempted DNS activity.
What I saw were lines like this raj2 54092 jasper 3uc inet 0x01e65100 0t0 UDP localhost:4916->localhost:domain (I had named the program raj2 and was running it as jasper) This said that yes, the program was hung up trying to do domain(=53) I.P. traffic to itself. Turned out that the permissions on /etc/resolv.conf were 600, so non-root users couldn't see it, which resulted (evidently) in programs trying to talk to/connect with a named daemon on the local machine, which wasn't there. After 2 minutes, the TCP attempt timed out, then after another 20-30 seconds, the UDP attempt failed. chmod-ing /etc/resolv.conf to 644, fixed everything. Now the only question remaining is how did the permissions get changed? It evidently happened when I applied service, but that's kinda bizarre. ============================================================================== PMR # 72971,49R x For the last 2 days, when ar0073e0 runs the nightly 2/11/99 x ADSM incremental backups, there are more than 22,000 2:15 x lines in the log file saying it's "Expiring" a bunch --who-- x of files in different --mail address-- x /.../patent.ibm.com/fs/.rw/patent/cache directories. --phone-- x First of all, we don't want ADSM to backup anything xxxxxxxxxxxxxxxxxxxxxxx in DSF, and it doesn't appear it really is. If you look at when ADSM thinks any of these files got backed up, it says it got backed up on the previous ADSM incremental backup. But if you look at the console log from the previous incremental backup, you don't see any lines saying it backed up the file. Very puzzling. We were running ADSM version 3.1.0.1 (aka fileset level 3.1.20.1), and at Darrel's suggestion, I quickly updated to the latest version, 3.1.0.6 (fileset level 3.1.20.6), ftp'd from shasta.sanjose.ibm.com, file /adsm/fixes/v3r1/aix/4.2/U461721.adsm.client.aix42. It didn't change anything. Each incremental backup run says it's expiring thousands of DFS files, purportedly backed up by the previous run, yet the previous run doesn't say it backed it up. 2/15/99 Greg called to ask a few questions. I explained things to Greg Keys him and sent him my dsm.opt, dsm.sys, inclexcl.dsm, 6-0895 dsmc.incr.backup.990209 & .990210, and the output of a Level 2 Rep dsmc q b -inactive command for one of those files. I tar'd and compressed the whole thing & mailed it to him. 3/25/99 (Thursday) Since I hadn't heard from these guys in 5 weeks or so, Greg Keys I called the Support Center (big mistake!) to get a call 6-0895 back. After a half hour on hold, they said they would leave a msg for Greg. 3/29/99 (Monday) Greg called back finally and asked for the output from Greg Keys df & ls -l / commands, as well as a "service trace", which 6-0895 is accomplished by putting these two lines to dsm.opt, gkeys@us.ibm.com TRACEFLAGS SERVICE TRACEFILE /tmp/dsmc.incremental.service.trace 5/12/99 (Wednesday) I got a call from a Carolyn, asking me to call Greg Greg Keys back. He hasn't done anything since we last talked on 6-0895 3/29. Although my notes don't say so, I *did* generate gkeys@us.ibm.com the "service trace" as he asked and evidently, ftp'd it to index.storsys.ibm.com, and put a tar file into the adsm/incoming directory, with the PMR # as part of the file name. Anyway, Greg had it. Greg now is saying he wants another service trace, just to compare the two. I created a PMR72971,49R.service.trace.tar.Z containing our dsmc.incremental.service.trace dsm.opt dsmerror.log dsm.sys inclexcl.dsm and put it out there for him. 
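For next time, the trace-and-package step boils down to this (a sketch; it assumes the dsm option files live in /usr/lpp/adsm/bin as they do on ar0073e0, and that TRACEFILE in dsm.opt pointed the trace at /tmp as shown above):

   # Bundle the ADSM service trace plus the client config files for Greg.
   cd /usr/lpp/adsm/bin
   cp /tmp/dsmc.incremental.service.trace .
   tar cvf PMR72971,49R.service.trace.tar \
       dsmc.incremental.service.trace dsm.opt dsmerror.log dsm.sys inclexcl.dsm
   compress PMR72971,49R.service.trace.tar
   rftp index.storsys.ibm.com       # anonymous login
       cd adsm/incoming
       bin
       put PMR72971,49R.service.trace.tar.Z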
Just for the record, I don't see the nightly ADSM backups displaying the offending behaviour now, nor (I believe) have we seen it since February. But the two dsmc.incr.backup.990209 & .990210 runs I sent in on 2/15 do definitely document what I described. 5/19/99 I got another phone call saying she didn't see the Carolyn PMR72971,49R.service.trace.tar.Z file I sent Greg last 6-0958 week, so I posted it to index.storsys.ibm.com in the adsm/incoming directory again for her. I had kept the file in ar0073e0's /usr/lpp/adsm/bin directory, so it wasn't too much trouble. 5/27/99 Carolyn called again. Since this phenomenon isn't Carolyn affecting us any more and they really don't have enough 6-0958 information to guess what went wrong, I agreed they can close this problem. ============================================================================== PMR # 74709,49R x We have a "hostname-leak" using the URL 2/25/99 x http://www.patent.ibm.com/promo/vendors/smartpat, 4:45 x in that the URL the browser displays after the page Diane Swan x is loaded, shows the SP/2 node name of the server --mail address-- x that serviced your request. It doesn't do this all 8-526-1933 x the time and it's not consistent. Sometimes you see xxxxxxxxxxxxxxxxxxxxxxx http://www.patent.ibm.com/... as you should, other times you see http://as0112e0.patent.ibm.com. We're running ICSS 4.2.1.7 on the SP/2 nodes, but I just upgraded to ICSS 4.2.1.10 on as0116 to test that, and it does the same thing. The page source is at /dfs/prod/ipn/htdocs/promo/vendors/smartpat/index.html, which is a link to smartpatent-main.html.en in the same directory. I got a note from Diane saying to run the httpd daemon with the -vv option, which will generate a trace at the console. When I tried this, the web server refused to "leak" the node name, in other words, it worked fine. Frustrating! I left a message with Diane saying I can no longer recreate this behaviour, so she's leaving the PMR open for a couple of weeks before calling back, and she'll probably close it then. ============================================================================== PMR # 81417,49R x I'm calling on behalf of Matt Morris in Raleigh, 3/10/99 x who works at a customer site, Rational, and is 9:27 x having troubles with a 320 booting between AIX 4.1 Sandra x and AIX 4.3. Matt's office phone is (919) 845-3236 --mail address-- x and his pager is 1-888-857-2288 --phone-- x xxxxxxxxxxxxxxxxxxxxxxx Here is the sequence of events, - AIX 4.1.3 is installed ok on hdisk2/3. - AIX 4.3.1 is installed on hdisk0/1. - Bootlist is switched to hdisk0 & AIX 4.3 booted ok. - Bootlist is switched back to hdisk2 & AIX 4.1 is booted ok. - Bootlist is switched back to hdisk0 & the machine won't boot. It appears as if hdisk0 isn't bootable 'cause it sits at LED 223-229, I got Matt & Sandra on a conference call and Sandra talked Matt through booting in service mode & rewriting the AIX 4.3 boot image on hdisk0. After that, Matt was able to boot AIX 4.3 ok. We left it at that, but if Matt has further problems, he now knows how to call back to the Support Center, reference this call & talk to Sandra again. ============================================================================== PMR # 81783,49R x Henry tells me the web server on 206 dies every so 3/12/99 x often. 
The only symptom is in the error log, where 3:00 x it says LABEL=SRC, IDENTIFIER=E18E984F, Class=S, McCloskey, Sharon x Type=PERM, Resource Name=SRC, --mail address-- x Description=SOFTWARE PROGRAM ERROR, Symptom Code=0, 8-444-4866 x Software Error Code=-9017, Error Code=1536, xxxxxxxxxxxxxxxxxxxxxxx Detecting Module='srchevn.c'@line:'288', and Failing Module=httpd. This is the 4.6.2.3 version of the Lotus Domino Go Webserver. 4/06/99 Nothing was ever done and this isn't as big a problem as we at first perceived, so I called Sharon and had her cancel the PMR.
==============================================================================
PMR # 81789,49R x I also called in to ask about the other problems 3/12/99 x we're having with our Net Commerce server, as0206. 3:00 x There are 2, both with the Net Commerce "server" Mike Karamanolis x binary (/usr/lpp/NetCommerce3/bin/server). coachk@us.ibm.com x 8-441-4595 x As background, /etc/rc.local starts the xxxxxxxxxxxxxxxxxxxxxxx /usr/lpp/NetCommerce3/bin/srvrctrl process, which monitors & babysits the above "server" process and a back_server process, which we don't care about right now. All three of these guys have configuration files in the /usr/lpp/internet/server_root/pub/ directory and log files in /arc/NetCommerce3/instance/patents/logs. I forget where I found the core file, but I saved the core file for when this server process died, along with its log file. Its PID was 6954. See the 2 *6954* files in /tmp, server.core.6954 & ncommerce19990304145305_6954.log. Mike said he wanted the dump & core file, so I sent it to him.
==============================================================================
PMR # 84648,49R x Called in again (see PMR # 67742,49R above) about the 4/05/99 x problem that keeps plaguing me on my 43P-140, jasper --:-- x machine. Besides the apparent memory leak in dfsbind --who-- x (see Item # JP9021 above), possibly related is another --mail address-- x problem, namely --phone-- x dfs: lost contact with server 9.1.24.168 xxxxxxxxxxxxxxxxxxxxxxx in cell: almaden.ibm.com 9.1.24.168 is almdfs4, which is where my home directory is. I've also seen it with another server, 9.1.24.165, when my home directory was on that server, namely almdfs1. Another symptom when things go to hell on jasper is that X starts temporarily hanging. That is, mouse movements/window focus gets delayed for a minute or so, but always eventually comes back. After spending an hour and a half with the Level 1 guy, we finally got things back to normal by logging off & logging back in with the "Command Line" login. I told him logging off was essentially the same amount of pain, as far as I was concerned, as rebooting. The Level 1 guy forwarded me to Chien Yu, who I am to call back when things go to hell again. 4/07/99 As predicted, this situation happened again. X was very 1:00:00 slow in changing focus when the mouse changed windows, and I Reggie Clinton couldn't cd (at times, this symptom was transient) to my DFS home directory. I called Chien Yu back and let him log on as root to poke around, and he didn't find anything unusual as far as performance things go (paging, file systems, CPU load, errors, etc.). He gave up and called the DCE guys back in. Reggie poked around some more and still didn't find anything out of the ordinary (but then again, Reggie didn't seem that good - see my dealings with him for past PMR's).
What we did finally learn was that things got "fixed" after doing a kdestroy & re-dce_login-ing back in, at least for that one window. Doing a kinit wasn't enough. Makes me wonder if the default 30-hour token lifetime had anything to do with it. It did take just under 2 days for this problem to manifest itself. And I do remember Jim Hafner & Rick Haeckel saying one needs to kinit in the "original" window, whatever that means when one logs on through the CDE desktop. The dce_login command created another credential file (see your KRB5CCNAME environment variable). The kdestroy command erased the old credential file, which is what you get when you create another aixterm window, say, so all old windows as well as new ones are screwed. And just copying the new credential file to the old didn't fix things, either. 4/09/99 Finally gave up fighting this and upgraded my machine to Robin Redden AIX 4.3.2 and DCE 2.2.0.4, which just became available this 8-678-1542 week (it wasn't there on Monday). Also got a call from Robin and she suggested changing root's ulimits, making root's stanza in /etc/security/limits, root: fsize = -1 core = -1 cpu = -1 data = -1 rss = -1 stack = -1 nofiles = -1 (On AIX 4.3 only, not on AIX 4.2.1) I used this command to make the changes on my machine, for i in fsize core cpu data rss stack nofiles do chsec -f /etc/security/limits -s root -a $i=-1 done omitting the nofiles for the Patent Server site. 4/15/99 Am experiencing the same problems, so I called the Support Reggie Clinton Center yet again (sigh!). Reggie is going to poke around again. Reggie talked to Robin, who said she won't be able to learn anything by poking around my system (sounds like she doesn't have the time), so Reggie's gonna poke around. The symptoms include .... - Certain X events are really slow, like focus changing in a timely manner (I have focus set to mouse). System upgraded to DCE 2.2.0.5, which Dale had on CD. 4/26/99 This seemed to help for a bit, but I had to shutdown on Reggie Clinton Friday 4/16 & reboot 4/19 for a power down, and my system is acting strange again on Monday, 4/26, so it only lasted a week. I'm getting very slow X response & "A remote host did not respond within the timeout period" messages. 4/27/99 Reggie called back and I gave him access again to root on my jasper machine to poke around. I don't know if he really ever did, much less learned, anything. 4/28/99 Meanwhile, I got an AIX update CD that updated primarily libc.a, per the PMR related to this one, 71121 above. I installed that and rebooted, cleaning up the "A remote host did not respond within the timeout period" error messages I've been getting the last few days. 5/17/99 It's been a few weeks and everything was running normally with AIX 4.3.2 and DCE 2.2.0.5, but the infamous "dfs: lost contact with server 9.1.24.168 in cell: almaden.ibm.com" messages are back (9.1.24.168 = almdfs4, which is where my home directory is). I called the Support Center back, who left a callback with Reggie. (sigh) 5/17/99 Reggie called back and pointed me to a DCE/DFS Level 3 web 14:10 page at http://guero.austin.ibm.com. Following the directions on a DFS tracing page a few links down, http://guero.austin.ibm.com/dfs-support/dfs-traces/dfstracing.html#Tracing DFS Client Failure (that's a bizarre URL, with blanks in it).
It boils down to doing - dfstrace setset -set cm -active - cm checkf - cd - dfstrace dump -file /tmp/dfstrace.out - dfstrace setset -set cm -dormant Then I mailed him that trace file via a mail l3dce@us.ibm.com < /tmp/dfstrace.out 5/27/99 Reggie says he's got some instructions to follow the next 15:20 time this situation happens. He's also got somebody on deck Reggie that is willing to login when it next happens, and poke around herself. It's about time somebody did that! 6/30/99 The problem came back again. I noticed this morning that Liz Hughes things were not as snappy, so I started to plow through the 8-678-3483 things they suggested I do in the 5/27/99 note from Chris Dodson. When I got to the part of the note that asks the question, "If he runs these commands, does the problem clear up?", those commands being dfs.clean dfsbind to stop dfsbind & rc.dfs dfsbind to start it up again, X started to *really* misbehave, not accepting keystrokes and eventually not even following the mouse. I could telnet in and saw the dfs.clean process with a spawned "dcecp -s /usr/bin/show.cfg dfs" that wouldn't go away, even after I killed -9 it. I called the Support Center and Reggie Clinton is no longer there, but when I didn't get a call back from level 2 for a half hour or so, I called Liz directly. She spent a couple of hours on my system as root, poking around, and was able to get dfsbind started up again. From a phone msg (so these details are unreliable), she started up another instance of dfsbind from the command line, let it start, which freed up the hung dcecp. She then shut it down & restarted it normally (which I guess means, rc.dfs dfsbind??). Anyway, after she did her thing, I was able to use my userid without logging off or back on, or rebooting. Everything appeared normal. I'm not sure what exactly Liz learned from all this. 7/19/99 Just for the record, my machine started to hang up again, 9:00 getting the "A remote host did not respond within the timeout period." error messages. I restarted dfsbind & it cleared up. ============================================================================== PMR # 86271,49R x The find command has a -fstype clause, which is 4/16/99 x suppose to use the file systems defined in /etc/vfs. 11:10 x This works for -fstype afs, but doesn't work for dfs. Vani Ramagiri x Seems like it should. vani@austin.ibm.com x 8-523-4168 x After 25 minutes on the phone, the guy said he'd xxxxxxxxxxxxxxxxxxxxxxx investigate it. What a waste of my time for such a trivial problem. Further investigation yieled that this did work fine under AIX 4.1 & 4.2, but fails only under AIX 4.3. Vani called and she also needed some guidance/convincing that this was a problem, but she finally came around. 4/30/99 In the PMR, Vani has written, "While I was looking at the defects that were fixed after 4.2., I came across defects 246300 for bos43D and 25289 for bos42G for - find`s option `-fstype nfs` does not work w/ NFS v3. I was checking to see if the fix for this defect might have caused this behavior change in 4.3.x. * Quoting from the material from this defect 246300 and 25289: "Right now the only valid options to -fstype argument are jfs and nfs Sometime ago a fix was dropped into find, so that -fstype will consider only 'jfs' and 'nfs' as the valid options. This fix was made so that the backup command will skip the non-jfs filesystems when doing a mksysb." * I talked to the developer who fixed this defect to confirm that it is not a defect that AIX4.3.x is returning zero for #find . 
-fstype dfs -print | wc He confirmed that this is working as designed." I don't know if this is a reasonable argument or not, so I put in an append in the DFSADMIN FORUM on IBMUNIX to get their opinion, before I argue the point with the Support Center. ============================================================================== PMR # 87091,49R x ar0071e0 is dying every few minutes, 7 in the last 04/23/99 x hour and a half, starting at 1:30 am on 4/23/99. 3:00 a.m. (!!) x This system has been rock solid ever since the sight Jessie Ball x came up, except for the last month, we've had 1363 --mail address-- x SCSI RAID adapter errors on scraid0 over the last 20 --phone-- x days. So at 9pm on Wednesday, 4/21/99, Anthony xxxxxxxxxxxxxxxxxxxxxxx DeMott & I replaced the scraid0 adapter. Except for (an unrelated?) problem with not being able to install the latest microcode, which we called in & somebody is suppose to be looking at, everything seemed fine. We stopped getting the 60 errors/day. Reference the old Hardware Reference # 31W3HHV, called in on 4/6/99. Then, 25 hours after running fine on it, ar0071e0 started to crash every 10-15 minutes. Here's the output from an errpt -N CMDCRASH command. IDENTIFIER TIMESTAMP T C RESOURCE_NAME DESCRIPTION 3573A829 0423042699 U S CMDCRASH SYSTEM DUMP 3573A829 0423041399 U S CMDCRASH SYSTEM DUMP 3573A829 0423030399 U S CMDCRASH SYSTEM DUMP 3573A829 0423025099 U S CMDCRASH SYSTEM DUMP 3573A829 0423023699 U S CMDCRASH SYSTEM DUMP 3573A829 0423022399 U S CMDCRASH SYSTEM DUMP 3573A829 0423021099 U S CMDCRASH SYSTEM DUMP 3573A829 0423015799 U S CMDCRASH SYSTEM DUMP 3573A829 0423012999 U S CMDCRASH SYSTEM DUMP I called this into the Software Support to get them to agree that it's probably hardware related. The details on each dump were the same, csa:2ff3b400 [dcelfs.ext:MarkFrags] 200 [dcelfs.ext:MarkFrags] 58 [dcelfs.ext:MarkBlock] 50 [dcelfs.ext:epib_Free] 274 [dcelfs.ext:epia_Truncate] 734 [dcelfs.ext:epif_ChangeLink] 264 [dcelfs.ext:vnm_Inactive] 228 SYMPTOM CODE PIDS/576565500 LVLS/420 PCSS/SPI1 MS/700 FLDS/[dcelfs.e VALU/7c8e7008 FLDS/[dcelfs.e VALU/50 They agreed. It seems that DFS is writing into/onto some disk, most likely that SCSI RAID drive on scraid0. I called this into Hardware Support. I don't know what they're gonna replace (I suggested backplane), but that's their problem. Hardware Reference # 30YPP62, Meanwhile, I commented out the starting up of DCE from /etc/inittab, so I can get a stable system, then carefully commented out /dev/scsi0lv from /var/dce/dfs/dfstab and then brought DCE back up. I had to run salvage on each of the other aggregates, most notably, scsi1, which had some critical filesets that were unreplicated (oops!). 4/23/99 We spent the day trying to get the new SCSI adapter working, without success. We finally reinstalled the old adapter, and returned to the state of getting dozens of error messages per day. At least the machine stayed up. Anthony DeMott is pursuing with his support people, the problem of getting the new adapter to work. 4/26/99 I called the Support Center to ask them to look at the CSA dump we had generated, 'cause hardware support wanted us to. Meanwhile, the Software support person noticed that we were one level down for our device support. We had devices.pci.14102e00.rte 4.2.1.5 and devices.pci.14102e00.diag 4.2.1.4, and there's a .6 & .5 respectively. I installed those fixes, and all the newer fixes I had from Fixdist, and set the system to be rebooted at 1:01 am tomorrow. 
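A quick way to keep an eye on the error rate while we fiddle with adapters: errpt -N filters by resource name (same as the CMDCRASH query above) and the timestamp's first four digits are the month and day, so a per-day tally is a one-liner (a sketch):

   # Count scraid0 errors per day (MMDD) from the error log.
   errpt -N scraid0 | awk 'NR > 1 { print substr($2, 1, 4) }' | sort | uniq -c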
4/27/99 Roxanne called and we walked through the Roxanne Merizalde script crash.out roxannem@austin.ibm.com crash csa-dump-1 8-523-4141 stat status cpu symptom p -m sequence to get a quick snapshot of the CSA dump. One problem we had was since I had installed the latest service last night, the crash command didn't like the dump, when run on ar0071e0, the machine that caused the dump. But since ar0072e0 is identical to what ar0071e0 was yesterday, I moved the dump file to ar0072e0 & ran the crash command there. Seemed to work fine. I mailed both the crash.out script file and a 1.9 MB errpt -a output to her. She called back a few minutes later & said that yes, she wanted the dump, but to run "snap" on ar0072e0, since the nucleus had changed on ar0071e0. The snap command is /usr/sbin/snap, part of the bos.rte.serv_aid fileset. She said to snap -gfkD < -d path if /tmp isn't big enough> -c Then mv /dfscache/snap.tar.Z PMR87091.49R.tar.Z and rftp it to testcase.boulder.ibm.com. What I had to do was to modify /usr/sbin/snap, cause it did a sysdumpdev -L to see where & what the previous dump was, and since it happened on ar0071e0, I just cut & pasted the output of 71's sysdump -L command to 72's snap cmd. Anyway, I got it to work. Chris Kime Chris & Liz called to report what they learned from the dump CKIME@AUSTIN I had sent in, and the bottom line was, they can't help point 8-678-2268 to any specific hardware. DFS was trying to clean out some and entry in their playback log, and was looking at inconsistent Liz Hughes data. Whether that was due to a file being erased (in my mind, 8-678-3483 most likely) or file being closed, they can't say. They also can't tell for sure when that data was written to disk. Was it written just a few minutes ago or 2 days ago? They can't tell. The tidbits they did tell me were 1) When running salvage, use the -verbose option & pipe the output to a file (or tee it), and also use the -salvage option, which would get the salvage command to do its most extensive checking possible. E.G. salvage -aggregate scsi0 -salvage -verbose | tee /tmp/salvage.output 2) To detach an aggregate from DFS, dfsexport -aggregate scsi0 -detach -force They also suggested that I detach scsi0 from DFS & run salvage -salvage as shown above, before replacing the adapter again, just to insure we're starting at a fresh, known, good state. It's possible that there are "land mines" still waiting to be stepped on, and this will insure there's not. 6/01/99 Anthony DeMott & Bob Olmstead came in at 7pm this Tuesday 30YPP62 night to attempt again to resolve this hardware issue with ar0071e0. We couldn't get the new adapter (08L1319) with the latest microcode (98348) to accept the array configuration information on the drives. It always produced an error in the error log. We also could not load the new microcode on the old adapter. Sigh. 6/03/99 A new wrinkle in this problem. I noticed that ar0072e0 is 31VZN8Y getting the same SCSI_ARRAY_ERR1 & SCSI_ARRAY_ERR9 errors that we're getting on ar0071e0. I called it in as a separate hardware problem, but referenced the other 30YPP62 problem. I can't tell when the errors started 'cause the error log has wrapped, but since 5/14/99, ar0072e0 has gotten these number of errors. | ERR1 | ERR9 | Total --------+------+------+------- scraid0 | 27 | 195 | 222 scraid1 | 251 | 987 | 1238 Just for the record, ar0071e0 = 7025 (F40) #05506. ar0072e0 = 7025 (F40) #05507. 
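Putting Chris & Liz's two tips together, a clean pre-swap pass over an aggregate looks roughly like this (a sketch using their commands, with scsi0 as the example aggregate):

   # Take the aggregate away from DFS, then run the most thorough salvage it has.
   dfsexport -aggregate scsi0 -detach -force
   salvage -aggregate scsi0 -salvage -verbose | tee /tmp/salvage.scsi0.output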
==============================================================================
PMR # 88666,49R x I have a couple of problems with the upgrade to the 5/05/99 x latest version of eNetwork Firewall on the Patent 11:45 x Server's socks/ssh/mail gateway machine, ar0135e0/1. Pat Krohn x I was at version 3.1.1.2 of the Firewall code, and I PKROHN@us.ibm.com x upgraded to eNetwork Firewall 3.2.3.0, which is the 8-444-5847 (Raleigh) x latest I could find. xxxxxxxxxxxxxxxxxxxxxxx The first anomaly I saw was when I went under smitty, - IBM e-Network Firewall for AIX - System Administration - Secure Interface - List (which just runs the fwlistadptr command), and got the following error messages, 192.168.56.135 Unrecognized Interface Pager message cannot be greater than carrier maximum message length. If you go under the GUI (by typing fwconfig), under - System Administration - Interfaces, it shows the 2 interfaces correctly, IP Address Type Name =============== ==================== ==== 204.146.135.135 Non-Secure Interface en0 192.168.56.135 Secure Interface en1 I'm not sure this in itself is affecting me, but ... The second anomaly I saw was more severe. Incoming ssh packets from the Internet are being filtered. First of all, there was a bug in this build of eNetwork Firewall where the wrong message catalog was used, so error messages and such aren't correct. Pat had me download from testcase.boulder.ibm.com the following two files to fix this problem, both from the /aix/fromibm directory, cat_323.tar.Z & cat_323.readme.txt. Essentially, it replaced a bunch of binaries to use the correct msg numbers. This fixed the fwlistadptr problem described above and fixed the wrong messages in the /var/adm/sng/logs/fwreg_l4.log file, but didn't fix the underlying problem, that some packets are being denied where they weren't before. As proof of this, an nsa scan I did from ar0176e0/1 to the 204.146.135.135 side on Saturday did indeed show that the ports that should be open were open. That is, ssh (22), SMTP (25), sshfail (2222), http proxy (8080). But when I ran the same command now, nothing shows open. I rebooted ar0135e0/1 and all is back to normal again, i.e. everything's working as it should, which is the same thing I saw on Saturday. It appears that things only go wrong after a day or so??? I've implemented a check on ar0176e0/1 to hourly go out and see if port 22 is still alive. See root's crontab & the /tmp/ssh.135.log file.
==============================================================================
PMR # 90599,49R x On only one machine, ar0081e0, one of the J50 5/19/99 x Verity servers, for some reason, every week or so, 15:15 x we lose the ability to "see" the /dfs/admin Julian Owens x directory, or anything (of course) underneath it. --mail address-- x I'm not sure whether it's only the root userid or not, but 8-421-7141 x I know a reboot clears things up, that is, we can xxxxxxxxxxxxxxxxxxxxxxx see /dfs/admin again. This affects the periodic recycles we do to the vserverprod server, as that thing needs /dfs/admin/bin/subsysfuncs.sh to start up correctly. The permissions to /dfs/admin & /dfs/admin/bin are the same, namely {mask_obj rwxcid} {user_obj rwxcid} {user cell_admin rwxcid} {group_obj r-x---} {group admin rwxcid} {other_obj r-x---} where root falls under the other_obj permissions, and thus has rx to the directory. The permissions to /dfs/admin/bin/subsysfuncs.sh are also correct - root has r, due to other_obj.
When I saw this situation on 5/19, it cleared itself up in a couple of hours or so (I wasn't watching it that closely), and I didn't have to reboot. ar0081e0 is running AIX 4.2.1.0 & DCE 2.1.0.28,which is the latest. Julian says that when this happens again, do a kill -30 to the dfsbind process to create that /var/dce/dfs/adm/icl.bind file, and send that to him. I put in the following crontab entry to monitor it, 1,6,11,16,21,26,31,36,41,46,51,56 * * * * /monitor_dfsbind where /monitor_dfsbind is #!/bin/ksh file=/tmp/monitor_dfs if [ ! -x /dfs/admin/bin/subsysfuncs.sh ] then if [ ! -w /var/dce/dfs/adm/icl.bind.PMR90599,49R ] then # Find the dfsbind PID and send it a kill -30 signal to create icl.bind. dfsbind_PID=$(ps -ef | grep -v grep | grep -v monitor_dfsbind | grep dfsbind | awk '{print $2} ') kill -30 $dfsbind_PID # The dfsbind process takes a second or so to create icl.bind, so # wait around for it. sleep 3 mv /var/dce/dfs/adm/icl.bind /var/dce/dfs/adm/icl.bind.PMR90599,49R fi echo "$(date) Can't see /dfs/admin/bin/subsysfuncs.sh" >> $file fi So we'll see how often it happens that we don't see it, and how long the situation lasts. 5/21/99 Julian called to see if this trap I set up had tripped yet. It hadn't. 5/26/99 Julian called again. Nope, still not yet. 6/07/99 My /monitor_dfsbind script finally triggered at 12:21 on Saturday, 9:00 June 5, 1999. I called the Support Center back and found out that Julian had closed this PMR on 6/1/99, the day I was out on vacation at Yosemite. I reopened it, left a phone message with Julian, and e-mailed Julian the 1800 line /var/dce/dfs/adm/icl.bind.PMR90599,49R. 10:45 Julian wanted me to package up the icl.bind file, tarring and compressing it up into a 90599.49R.tar.Z file, and drop it off at testcase.boulder.ibm.com. I did this with a cd /var/dce/dfs/adm tar cvf 90599.49R.tar icl.bind.PMR90599,49R compress 90599.49R.tar command & rftp'd 90599.49R.tar.Z over via rftp testcase.boulder.ibm.com cd /aix bin put 90599.49R.tar.Z Doing some debugging of my own, the only non-zero "exit code" I see in the generated icl.bind file is time 897.064064, pid 7: do_auth_request: exit code:382312679 time 897.064103, pid 7: ProcessRequest: took 0 seconds, exit code:382312679 and a "dce_err 382312679" command says that error number means dce_err: 382312679: Authentication ticket expired (dce / rpc) Since this is root we're running from, the dced daemon is the one that's suppose to keep root authenticated as "self". 2:45 Coincidently, this same situation happened on one of our other Verity servers, ar0078e0. I did another kill -30 signal to the dfsbind process and renamed the icl.bind file to /var/dce/dfs/adm/icl.bind.PMR90599,49R.2 and like earlier, rftp'd it to testcase.boulder.ibm.com. 6/08/99 Reggie called and said he wanted me to collect more 11:00 debugging/dumping info the next time this happens (sigh). Reggie Clinton He pointed me to this guy's DCE/DFS debugging web page at 8-444-5257 http://guero.austin.ibm.com/dfs-support/dfs-traces/icltraces.html. I enhanced my /monitor_dfsbind script & copied it to 78 to also run there. When looking at it, I saw the following errors pid 21026: ERR: dfs: fileset (0,,28) error, code 691089410, on server : 192.168.56.71, in cell: patent.ibm.com, . According to dce_err, 691089410 is fileset not present and exported on server: already deleted/moved (dfs / xvl) I don't know what PID 21026 is/was. It's not there now and it wasn't in the ps -ef command I did when I took this snapshot. Hmmmmm ... 
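Since I keep doing this tar / compress / rftp dance for testcase drop-offs,
here's a throwaway helper sketch.  It only strings together the same commands
used above, and it leaves the actual rftp step to do by hand.

#!/bin/ksh
# Package files for a testcase drop-off.
# Usage:  shipit 90599.49R file1 [file2 ...]
pmr=$1; shift
tar cvf /tmp/$pmr.tar "$@"
compress /tmp/$pmr.tar
echo "Now:  rftp testcase.boulder.ibm.com, cd /aix, bin, put /tmp/$pmr.tar.Z"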
Also, there was a bunch of pid 21026: INF: CM cm_write error 19 on fid 0.28.40071.2570365 Oh, I see. This log file entries are from a week ago when I had different DCE/DFS servers down. Pay attention to the time, which is the number of seconds since the last time printed, e.g. Current time: Tue Jun 1 12:09:52 1999 and is a counter that wraps at 1024 seconds. 6/09/99 Hey, alright! We didn't have to wait too long before 2:20 this happened again. It tripped on ar0081e0 on 6/9/99 Julian at 12:21 pm (what a coincidence this is the exact same time it tripped last Saturday. Hmmmmm. Anyway, I got a lot of good (I hope) debugging info for the Support Center. I called in and warned them I was dropping it off at testcase.boulder.ibm.com. I had to call it 90599.49R.3.tar.Z 'cause the original stuff I dropped off there 2 days ago, was still around. Here's what was in there ... -rw-r--r-- 0 0 25457 Jun 09 12:21:09 1999 PMR90599,49R.dfstrace -rw-r--r-- 0 0 27374 Jun 08 12:39:19 1999 PMR90599,49R.dfstrace.normal -rw-r--r-- 0 0 5169 Jun 09 12:21:06 1999 PMR90599,49R.general.info -rw-r--r-- 0 0 5169 Jun 08 12:39:17 1999 PMR90599,49R.general.info.normal -rw-r--r-- 0 0 112123 Jun 09 12:21:01 1999 PMR90599,49R.icl.bind -rw-r--r-- 0 0 111207 Jun 08 12:39:11 1999 PMR90599,49R.icl.bind.normal -rw-r--r-- 0 0 828 Jun 09 13:16:00 1999 PMR90599,49R.log The *.normal files are a "snapshot" of when things were running normally yesterday. In particular, the PMR90599,49R.general.info.normal file shows a normal klist command where the "Identity Info Expires" at 1999/06/09:12:08:07. The PMR90599,49R.log file shows that root was running authenticated at 12:16 (there's no log entry) and unauthenticated at 12:21. It stayed that way until sometime between 13:16:00 and 13:21. The dfstrace does have these lines in it time 608.410054, pid 0: Current time: Wed Jun 9 12:09:52 1999 time 608.410054, pid 19682: RPC: krpc_ReadHelper returns code 382312679 time 608.410179, pid 19682: RPC: sec_auth, done opcode 3, st 382312679 time 608.410232, pid 19682: dfs: ticket has expired; running unauthenticated. Which makes it appear that the "unauthenticaton" of root was before 12:11. Hmmmm. 6/15/99 Chris called and left a message asking to get access to the 3:00 machine when this situation happened again. I left a message Chris Kime back telling him sure, and that I would call him when it CKIME@AUSTIN happened next. 8-678-2268 ... 6/17/99 It struck ar0081e0 again, but Chris's phonemail message says 12:21 he's out of town 'till Tuesday. I tried getting ahold of Liz Hughes at T/L 8-678-3483, but she didn't answer either. Since this situation appears to "fix" itself within an hour and I didn't notice this 'till 30 minutes into the hour, there was no time to really get somebody else involved. We'll have to wait 'till next time. 6/17/99 Surprisingly, this situation has persisted on ar0081e0 for 15:11 the last 3 hours. I went ahead and called it into the Support Center to document it & to see if anybody else is available. I guess nobody was. They requeued a secondary call, but nobody called me back in time. It finally did clear itself up between 15:17:48 & 15:18:48 after almost 3 hours. Liz did call me eventually & I gave her access to root on ar0081e0 so she can poke around. I predict that ar0078e0 will be the next to go, and on Monday afternoon. I added additional debugging code to my /monitor_dfsbind script, so we'll see if it catches anything new. Liz said she would put a reminder to herself for Monday. 
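Note to self for next time:  the quick way to spot the expired-ticket
signature in whatever gets collected is to grep for the magic number and let
dce_err translate it.  Sketch only; the filenames are the PMR-prefixed ones
my script and I have been leaving in /var/dce/dfs/adm.

    # Which collected trace/icl files mention the "ticket expired" code?
    grep -l 382312679 /var/dce/dfs/adm/PMR90599,49R.* 2>/dev/null
    # And what that code means, straight from dce_err.
    dce_err 382312679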
7/27/99 Chris called back to see what was happening here. I haven't 12:00 gotten any notes from my /monitor_dfsbind script that I hacked Chris Kime together, but checking the /var/dce/dfs/adm/PMR90599,49R.log CKIME@AUSTIN file, I see that it has indeed happened many times. Hmmmmm, 8-678-2268 my script has a bug in it? Hmmm, there were 2 pieces of unsent mail in the mailq on 78. Oh, I see. My script keys off the /var/dce/dfs/adm/PMR90599,49R.icl.bind file. Once that file exists, I don't collect any more debugging code, nor send anymore mail. I erased /var/dce/dfs/adm/PMR90599,49R.icl.bind on both 78 & 81 to start fresh. ============================================================================== PMR # 90825,49R x On the CWS, when I do any iptrace command, I get 5/21/99 x this error message, 10:30 x iptrace: 0827-877 setsockopt -: There is not --who-- x enough buffer space for the requested socket --mail address-- x operation. --phone-- x I have no idea what "buffer space" this is referring xxxxxxxxxxxxxxxxxxxxxxx to, so I asked the Support Center. It doesn't appear to be mbufs. I was looking into a problem we were getting trying to install AIX on a J50 (ar0085e0). It appeared to not correctly install a piece of AIX networking code used to do NFS mounts. Also, when we tried reinstalling it, the CWS wasn't answering bootp requests, although I could see the bootpd daemon being launched on the CWS. It wasn't mbufs. It was sb_max. At the bottom of /etc/rc.net, was a bunch of no -o commands to tune the network options, including no -o sb_max=163840. The Support Center had me raise this to 1048576. We also saw, via an entstat en1 command, ETHERNET STATISTICS (en1) : Device Type: Ethernet High Performance LAN Adapter Hardware Address: 02:60:8c:f5:35:92 Elapsed Time: 20 days 18 hours 8 minutes 46 seconds Transmit Statistics: Receive Statistics: -------------------- ------------------- Packets: 11245370 Packets: 7854577 Bytes: 8017791451 Bytes: 1995424467 Interrupts: 11179718 Interrupts: 7817828 Transmit Errors: 0 Receive Errors: 238 Packets Dropped: 0 Packets Dropped: 0 Max Packets on S/W Transmit Queue: 512 Bad Packets: 0 S/W Transmit Queue Overflow: 138125 Current S/W+H/W Transmit Queue Length: 0 ... Note the Overflow count, 138,125 in 20 days. Also, the "Current .. Queue Length" line might have been interesting to look at yesterday. I went through the process of - ssh-ing into root on the en0 (SP/2) side of the CWS - ifconfig en1 down - ifconfig en1 detach - rmdev -l en1 - smitty ethernet - Adapter - Change / Show Characteristics of an Ethernet Adapter (or smitty chgenet) - ent1 And changed the - "TRANSMIT queue size" from 512 to 150. Beware of a discrepancy with what the help text (PF1) says for this field, namely that the "Valid values range from 20 through 150", and what the prompt (PF4) says, that the valid range is 20-2048. 150 is the limit. The guy didn't know, but guessed that perhaps the fact this was set to 512, which is over the 150 limit, was causing us grief. - "RECEIVE buffer pool size" from the default 37 to the max of 64. - mkdev -l en1 - route add default 192.168.56.251 - route add 9.0.0.0 192.168.56.252 We'll see if we get any more network flakiness. I'll check the overflow count again in a few days (if I remember). Three days later, on 5/24/99, the relevant lines of the entstat en1 cmd were Max Packets on S/W Transmit Queue: 100 Bad Packets: 0 S/W Transmit Queue Overflow: 0 Current S/W+H/W Transmit Queue Length: 0 so things look good. 
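So I don't have to remember to go look at the overflow counter by hand, here's
a crontab-able one-liner for the CWS.  Sketch; the history file name is just
an example.

    # Append a timestamped snapshot of the en1 transmit-queue counters.
    (date; entstat en1 | grep "Transmit Queue") >> /tmp/entstat.en1.history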
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - Changing /etc/rc.net to do a no -o sb_max=1048576 may also help a web server stalling problem we were having with the free side nodes on 6/25/99. We'll see ... I also changed /tftpboot/tuning.cust on the CWS from /usr/sbin/no -o sb_max=163840 to /usr/sbin/no -o sb_max=1048576 so that if/when the node ever gets rebuilt or re-cust'd, it'll keep the change. Hmmmm. After a reboot, my changes, which were still there in /etc/rc.net, didn't seem to take effect. Turns out I (also?) need them in the /tftpboot/tuning.cust file on each node. ============================================================================== PMR # 92405,49R x Apparently, since the power up after the Memorial 6/03/99 x Day power outage, the ssa0 array on ar0072e0 has been 10:30 x down. ssa0 = hdisk4 = pdisks 0-7 = 8 * 9.1 GB Wahid x A varyonvg ssa0vg command yields this error message, --mail address-- x PV Status: hdisk4 009001466be3b171 PVNOTFND --phone-- x 0516-013 varyonvg: The volume group cannot be xxxxxxxxxxxxxxxxxxxxxxx varied on because there are no good copies of the descriptor area. There is nothing in the error log for these disks. The Support Center had me do a lqueryvg -Atp hdisk4 command, but that gave 0516-024 lqueryvg: Unable to open physical volume. Either PV was not configured or could not be opened. Run diagnostics. A "Link Verification" (under diags) didn't show anything connected to the A1/A2 Adapter Ports. A physical inspection showed everything was powered up and connected correctly (solid green lights). Finally did rmdev -dl hdisk4 cfgmgr lspv hdisk4 showed up with a good PVID. importvg -y ssa0vg hdisk4 To get vg back online. dfsexport -all To get it DFS-exported again. And it all came back ok, including the pdisks showing up in the Link Verification test. It's interesting to note that just a software rmdev/cfgmgr was enough to get the drives back in the Link Verification test. Bruce and I didn't expect that would work. - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 6/07/99 That's funny. This exact same thing happened to ar0073e0, ssa0. 1:00 ar0073e0 ssa0 = hdisk3 = pdisks 0-7 = 8 * 9.1 GB, just like ssa0 was on ar0072e0, except. I had to go through the same scenario, namely rmdev -dl hdisk3 cfgmgr varyonvg ssa0vg dfsexport -all To get it back. For the record, as of 2:15 on 6/7/99, all 3 DFS servers are dfsexport-ing scsi0-1 and ssa0-2. But on 9/24/99, ar0071e0 got ssa3 added. ============================================================================== PMR # 92627,49R x On the Patent site, Becky remove her rsosa userid 6/04/99 x from the patent group, thereby removing her DCE 1:00 x account (remember, an account consists of 3 things, Mike x account = principal + primary group + organization --mail address-- x and removing any of the 3, removes the account). --phone-- x I recreated her account by getting into dcecp and xxxxxxxxxxxxxxxxxxxxxxx typing account create rsosa -group patent -organization none -mypwd Br1g1t -password new4now Note that account create is one of those commands that have to be done from within dcecp, or from a TCL script. You can't do it from the command line via dcecp -c. Anyway, both before and after I recreated her account, a principal show rsosa dcecp command only works half the time. The other half of the time, I get the message "Error: Registry object not found". Dale suspects the "CDS cache" on one of my DCE/CDS replicas is stale. 
He says there's a procedure to flush it and force it to resynch. What I got from the Support Center was a process to shut down DFS & DCE on each replica, erase files from a bunch of directories, then restart. Not exactly what I was hoping for. In poking around myself, I found the following commands under dcecp ... registry cat returns a list of replicas, e.g. /.../patent.ibm.com/subsys/dce/sec/ar0071e0 /.../patent.ibm.com/subsys/dce/sec/ar0072e0 /.../patent.ibm.com/subsys/dce/sec/master registry show -replica /.:/subsys/dce/sec/ar0071e returns a lot of information about each replica, a key field in this case being the lastupdtime. For ar0071e0, that line was {lastupdtime 1999-06-04-12:40:28.000-07:00I-----} For ar0072e0, that line was {lastupdtime 1999-05-31-16:46:03.000-07:00I-----} so I know it's ar0072e0 that's bad. Another easier way to determine this is to simply do a registry verify which "Returns a list of replicas not up-to-date with the master." Sure enough, it returns /.../patent.ibm.com/subsys/dce/sec/ar0072e Poking around some more, I found these incantations/options of registry show, help registry show returns -attributes Returns the attributes of a replica or master registry. -master Returns all propagation info kept by the master replica. -policies Returns the policies of a replica or master registry. -replica Returns propagation info kept by the specified replica and registry show -master shows {name /.../patent.ibm.com/subsys/dce/sec/ar0071e0} {type slave} {propstatus update} {lastupdtime 1999-06-04-12:40:28.000-07:00I-----} {lastupdseqsent 0.1747} {numupdtogo 0} {lastcommstatus 0} {name /.../patent.ibm.com/subsys/dce/sec/ar0072e0} {type slave} {propstatus update} {lastupdtime 1999-05-31-16:46:03.000-07:00I-----} {lastupdseqsent 0.1740} {numupdtogo 7} {lastcommstatus {Data integrity error (invalid password is specified)}} {name /.../patent.ibm.com/subsys/dce/sec/master} {type master} Note the "lastcommstatus" for ar0072e0. What does that mean? On ar0072e0, there are 164 lines in /dceconfig/vardce/svc/warning.log. # Lines That Say ------- --------------------------------------------------------- 85 Protocol version mismatch 58 in header > fragbuf data size 10 Caught signal 1. Exiting. 11 Cannot NSI unexport Object UUID for this dced server. On ar0071e0, there are 197 lines in /dceconfig/vardce/svc/warning.log. # Lines That Say ------- --------------------------------------------------------- 85 Protocol version mismatch 58 in header > fragbuf data size 21 Caught signal 1. Exiting. 22 Cannot NSI unexport Object UUID for this dced server. 2 Thread routine error Which are pretty similar, so it appears that nothing there explains why 72 is sick. 6/04/99 I called back ('cause it was getting late and I wanted this fixed, 4:30 that's why) and Julian talked me through removing the sec_srv DCE Julian service, via smitty dce on ar0072e0, Unconfigure DCE/DFS 2 local only unconfiguration for this machine and in the "COMPONENTS to Remove" field, select just sec_srv, which is the "Security Server (Replica)". Then on the master, ar0073e0, smitty dce Unconfigure DCE/DFS 3 admin only unconfiguration for another machine in the "COMPONENTS to Remove" field, select just sec_srv again, in the "Client Machine DCE HOSTNAME" field, select ar0072e0. Julian then had me check smitty's work by 3 series of commands dcecp -c rpcgroup list /.:/sec /.../patent.ibm.com/subsys/dce/sec/master /.../patent.ibm.com/subsys/dce/sec/ar0071e0 "Good," he said, "just the two. ar0072e0 isn't there." 
Then it was dcecp -c cdsli -cworld | grep /.:/subsys/dce/sec d /.:/subsys/dce/sec o /.:/subsys/dce/sec/ar0071e0 o /.:/subsys/dce/sec/master "Good," again. dcecp -c registry cat /.../patent.ibm.com/subsys/dce/sec/ar0071e0 /.../patent.ibm.com/subsys/dce/sec/master "Good," yet again. Then, to add the replica back, on the server, ar0072e0, smitty dce Configure DCE/DFS Configure DCE/DFS Servers SECURITY Server 2 secondary Accept all the defaults and go for it. Checking it back on ar0073e0 via dcecp -c registry verify and dcecp -c registry show -master showed that everything was back in sequence. Great. ============================================================================== PMR # 05370,49R x ERS (Emergency Response Service) evidently has 6/23/99 x changed the NSA code they use for their monthly 4:00 x scans, 'cause now eagle is highlighted in their --who-- x 6/22/99 ERS scan. Note also some other servers are --mail address-- x highlighted as well, a couple of alphaworks servers, --phone-- x our patent socks/mail/ssh gateway machine xxxxxxxxxxxxxxxxxxxxxxx ar0135e0/1, and www2.patents.ibm.com. I gotta go check those out separately. Here is ERS's claim ... The following new problems were reported. o 198.4.83.38 (eagle.almaden.ibm.com): [low] HTTP-Proxy is active on TCP port 80. o 198.4.83.81 (jcentral.alphaworks.ibm.com): [low] HTTP-Proxy is active on TCP port 80. o 198.4.83.82 (jcentral2.alphaworks.ibm.com): [low] HTTP-Proxy is active on TCP port 80. o 204.146.135.135: [low] HTTP-Proxy is active on TCP port 8080. o 204.146.135.162: [low] HTTP-Proxy is active on TCP port 80. What is happening is, if you set your browser to use www as your proxy server (who would do that, first of all?), and then try to get to some web site, say www.apple.com, eagle would answer with whatever it is you asked for, from www, not from Apple. I.E. http://www.apple.com gets you the same thing as http://www, that is, Almaden's home page. Tony did some tracing and determined that the only difference between a normal GET request and a proxified GET request, is the GET statement. Normal would be GET / HTTP/1.0 proxified would be GET /www.apple.com HTTP/1.0 What's really wrong is the web server isn't paying attention to the server piece of the GET request. Sounds like a bug in the Apache web server to me, else a configuration problem on my end, so I thought I'd call it in to the Support Center to ask them. Meanwhile, Tony pushed back on ERS to ask them what we can do about this. Our servers aren't really acting as proxy servers and there's no documentation on what we can do to prevent it from failing their test. They consider a server to "fail" unless it returns an error code of 200 or 407. Alan Rich Alan called me too early this morning, and we've missed each 6/23/99 other twice each, so I sent him a note explaining the situation 5:10 am to him. 8-526-0362 ============================================================================== PMR # 06458,49R x The performance of all our free side web servers, 6/30/99 x as0107 & as0109-15, as well as our two gold servers, 15:30 x as0103 & as0104. The symptoms of the problem are Daniel / John x various. The first thing we see when this problem --mail address-- x hits, is a large number of I.P. connections that --phone-- x don't go away. They appear to be in various stages xxxxxxxxxxxxxxxxxxxxxxx of termination. If you do a netstat -An command, you see the normal 'ESTABLISHED', but you also see a bunch of CLOSE_WAIT, LAST_ACK, FIN_WAIT_1, FIN_WAIT_2 & TIME_WAIT states. 
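A quick way to tally how many connections are sitting in each state when this
hits (sketch; it assumes the state is the last column of the tcp lines in the
netstat -An output, which is how it looks here):

    # Count TCP connections per state, busiest states first.
    netstat -An | awk '/tcp/ {print $NF}' | sort | uniq -c | sort -rn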
John gave me a little education on TCP socket closures. He said there were 2 types of closing, active and passive. Active is when the application (the web server in this case) does the close. Passive is when the client does the close. In both, here is what happens and the states the connection is in; Who Does What Socket State Afterwards ---------------------------- ----------------------- Active: Server Decides to Terminate Connection & Sends a FIN FIN_WAIT_1 Client Sends ACK FIN_WAIT_2 Client Sends a FIN (???) TIME_WAIT Server Sends ACK (???) Closed Passive: Client Decides to Terminate Connection & Sends a FIN. Server sees FIN, sends FIN to application, application (or is it AIX??) sends ACK. CLOSE_WAIT Application does terminating cleanup, then sends a FIN. LAST_ACK Client sends ACK. Closed. The points to note are FIN_WAIT_1 or 2 and TIME_WAIT are Active close states. CLOSE_WAIT & LAST_ACK are passive close states. Also, if sockets are "stuck" in CLOSE_WAIT, it's due to the server application (httpd) not finishing its socket cleanup and sending the last FIN. In other words, it's an application problem. The other thing John focused on was trying to clean up the many errors we're seeing with an "entstat en1" command, specifically under the "Receive Statistics:" column, the "No Resource Errors:" line. Most of the web servers have many thousands of "No Resource Errors". Just like PMR #90825,49R (see 5/21/99 10:30 entry above), when I had troubles on the CWS, I changed the transmit & receive buffer size on the card itself to the max. The process, repeated from above, is to - Take node out of rotation (duh!), - ssh-ing into root on the en0 (SP/2) side, - ifconfig en1 down detach - rmdev -l en1 (I don't think this is necessary, is it?) - smitty chgenet - ent1 And changed the - "TRANSMIT queue size" from 512 to 150. Beware of a discrepancy with what the help text (PF1) says for this field, namely that the "Valid values range from 20 through 150", and what the prompt (PF4) says, that the valid range is 20-2048. 150 is the limit. The guy didn't know, but guessed that perhaps the fact this was set to 512, which is over the 150 limit, was causing us grief. - "RECEIVE buffer pool size" from the default 37 to the max of 64. - mkdev -l en1 - route add default 192.168.56.251 - route add -net 9 -netmask 255.0.0.0 192.168.56.252 I made these changes on nodes 3 & 4, our gold web servers, since they were the two that have been giving us the most troubles lately. ============================================================================== PMR # 07981,49R x While converting the CWS from DCE 2.1 to DCE 2.2, --/--/99 x (I had just successfully converted the AIX level from --:-- x 4.2.1 to 4.3.2), I commented out the starting up of --who-- x DCE/DFS from /etc/inittab and rebooted so DCE wasn't --mail address-- x running at all, then installed the new DCE 2.2 code, --phone-- x including PTF set 5, then ran migrate.dce, then xxxxxxxxxxxxxxxxxxxxxxx migrate.dfs. Both appear to have worked fine. I then defined dceunixd as well. Now when I start DCE/DFS by running "/etc/rc.dce all", everything but the dfsd daemon starts. The last few lines in /etc/dce/cfgdce.log are ... Starting the DFSBIND daemon... The DFS kernel extension dfscore.ext has successfully loaded. Waiting up to 120 seconds for the daemon to start. DFSBIND daemon successfully started. Starting the DFSD daemon... The DFS kernel extension dfscore.ext has successfully loaded. readRPC_SUPPORTED_NETADDRS Waiting up to 120 seconds for the daemon to start. 
Waited 5 seconds. Waited 10 seconds. ... Waited 120 seconds. 0x113155ed: The following component is not running, and is not registered in DCED as running: DFS client. unknown math function "DCF_MESSAGE" 0x1138da69: Unable to start the DFS client. 0x1138da5d: The components on DFS host, as0000e0 did not start successfully. 0x113159fb: Start did not complete successfully. We tried unconfiguring DFS with a rmdfs -l -F all and reconfiguring it, but that didn't seem to work either. Liz called back, poked around and said dfsd *was* running, it was just cleaning out the old dfscache, which takes a long time the first time. She thinks that there was just a "disconnect" between the start-up script waiting "only" 2 minutes, and the dfsd daemon taking longer than that the first time cleaning out the DFS cache. ============================================================================== PMR # 14498,49R x This is a continuation of the DCE 2.1 to DCE 2.2 8/31/99 x conversion problems I was having before. This time 9:00 x it was as0206e0/1 that I converted AIX 4.2.1 to 4.3.2, Mike Patton x DCE 2.1 to 2.2, and PSSP 2.4 to 3.1, on Saturday, vdpatton@us.ibm.com x August 28. See PMR 07981,49R above. 8-442-7072 x xxxxxxxxxxxxxxxxxxxxxxx Back then, Liz Hughes focused on the messages you see in /etc/dce/cfgdce.log, and indeed, there are error messages in there. As near as I can piece together, here's what happened Saturday, 7:37 System booted (error logging turned on). 7:46 System shutdown by user. 7:51 System booted (error logging turned on). 7:54 /usr/lpp/dce/bin/migcheck runs, according to cfgdce.log. Gets 0x11315066: The system call (chdir) failed with a return code of -1 ... 0x11315069: An error occurred creating the file /opt/dcelocal/tmp/cfgdce.sem. Because the /opt/dcelocal/tmp (=/var/dce/tmp) directory didn't exist. 7:54 System shutdown (error logging turned off). 8:30 System Upgraded from AIX 4.2.1 to AIX 4.3.2, DCE & PSSP included. 9:10 System shutdown by user. 9:15 Finished upgrading. System booted (error logging turned on). 9:16 start.dce runs. Gets 0x1131504a: A failure occurred during the copy of /opt/dcelocal/var/dced/Acl.db to /opt/dcelocal/var/dced/backup/Acl.tdb. 0x11315073: The directory, /opt/dcelocal/var/dced/backup, was not found. Because /opt/dcelocal/var/dced/backup (=/var/dce/dced/backup) didn't exist. 9:29 show.cfg (lsdce) ran, saying "0x11315b69: A new release of DCE has been installed. The DCE configuration data needs to be migrated. Please run migrate.dce. 9:39 start.dce ran again. Got .../backup directory doesn't exist error again. 9:39 /var/dce/tmp & /opt/dcelocal created. 9:40 start.dce ran again. This time, DCE got converted & set up ok, with 9:41 show.cfg running to show the DCE components configured ok, and then 9:41 show.cfg again showing no DFS components configured. I don't know what was happening between 9:41 & 9:48, but at 9:48 show.cfg ran again, showing the DCE components configured ok, and 9:48 show.cfg again, showing no DFS components configured. Then 9:48 I probably invoked /etc/rc.dce by hand, causing start.dce to run, which after about a hundred, normal-looking messages, gets these error message. ... Starting the DFSD daemon... The DFS kernel extension dfscore.ext has successfully loaded. readRPC_SUPPORTED_NETADDRS Waiting up to 120 seconds for the daemon to start. Waited 5 seconds. Waited 10 seconds. ... Waited 120 seconds. 0x113155ed: The following component is not running, and is not registered in DCED as running: DFS client. 
unknown math function "DCF_MESSAGE" 0x1138da69: Unable to start the DFS client. 0x1138da5d: The components on DFS host, as0206e0 did not start successfully. 0x113159fb: Start did not complete successfully. 0x11315066: The system call (chdir) failed with a return code of -1 and error number of 2. 10:17 System shutdown by user. 10:21 System booted (error logging turned on). 10:22 /etc/rc.dce runs again from /etc/inittab. Gets same errors as above. 12:09 start.dce ran again, I guess by hand? Gets same errors as above. 12:14 start.dce ran again, I guess by hand? Gets same errors as above. 12:25 System shutdown by user. 12:30 System booted (error logging turned on). 12:30 /etc/rc.dce runs again from /etc/inittab. Gets same errors as above. Hmmmm. I thought I had to run /etc/rc.dce by hand a few minutes after the 12:30 boot. Was my problem all along, that I had :once: rather than :wait: in /etc/inittab? Hmmmmmm. Since I wasn't sure, I had Julian close the PMR. ============================================================================== PMR # 14610,49R x I chose to reinstall AIX 4.3.2 on our new S70, and 8/31/99 x afterwards I noticed that the new 32 GB SSA drives 5:15 x did not configure correctly. So, I Ishmall x mount as0000e0:/spdata/sys1/install/aix432/lppsource /mnt --mail address-- x & cfgmgr -i /mnt --phone-- x but I got an err message saying there are the xxxxxxxxxxxxxxxxxxxxxxx following missing filesets. devices.pci.0c000c02 devices.pci.14105800 devices.pci.14109100 = Something to do with the SSA ?? devices.pci.ssa I hunted around and found devices.pci.14109100 on some CD that came with the S70, but didn't find the other 3 anywhere. Kimberly from the install group came on the line, but I figured things out by the time she was of any help. I applied the AIX service I had & reran cfgmgr, and it installed the right filesets. Evidently, the base 4.3.2 cfgmgr code had the wrong list of filesets for these devices. ============================================================================== PMR # 23385,L11 x Note the different Branch Office number, due to my 9/08/99 x using Customer # IN00018, rather than my normal 11:40 x 5336519, 'cause this stupid first level AIX Support Diane Swan x guy claimed 5336519 didn't have support for the 8-526-1933 x Lotus Domino Go webserver. But then, he initially Also Alan Rich x claimed that IN00018 didn't either, but when I at 8-526-0362 x finally told him that LDG was under the Websphere Both in Raleigh x umbrella, he said, "Oh, yeah. You do have Websphere xxxxxxxxxxxxxxxxxxxxxxx Support, so you're ok." Sounds like I may have troubles with this in the future. I was told that Wayne Tippery (8-262-3804) was the person to call to get IBM Internal customer numbers updated. Anyway, I didn't call Wayne this time, 'cause I finally got through. All I wanted to do is get the latest Lotus Domino Go Webserver, which is at least 4.6.2.6, to update our Net.Commerce servers, especially as0206e0/1, which is today running LDG 4.6.2.51. This is so I can get a modern SSL certificate from Equifax. Diane told me I could get it by rftp-ing to service2.boulder.ibm.com as userid=webuser, password=code4you, cd-ing to /internet/DGW/aix, and picking up both the *tar file & the service.txt. 
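Boiled down, and assuming rftp takes the same ftp-style subcommands I use for
the testcase sessions elsewhere in this file, the pickup session looks
something like this (log in as webuser / code4you however rftp prompts for
it):

    rftp service2.boulder.ibm.com
    cd /internet/DGW/aix
    bin
    get service.txt
    mget *tar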
============================================================================== PMR # 24566,49R x When updating as0204e1 from AIX 4.2.1 to AIX 4.3.2, 10/06/99 x something I did on 9/9/99 (cute) on as0201e1 and --:-- x everything worked fine, I had lots of troubles with Donovan & Frank x the NIM/PSSP scripts not setting up the NFS stuff --mail address-- x correctly. The directories that should be --phone-- x NFS-exported include xxxxxxxxxxxxxxxxxxxxxxx /spdata/sys1/install/aix432/lppsource /spdata/sys1/install/aix432/spot/spot_aix432/usr /spdata/sys1/install/pssp/bosinst_data_migrate /export/nim/scripts/as0204e0.script All with -ro,root=as0204e0.patent.ibm.com,access=as0204e0.patent.ibm.com at the end. I was getting different things, e.g. the lppsource line not having the extra stuff on it, or missing lines altogether. Frank called on 10/13 and we ran a bunch of commands, picking on as0201=2 2 1=18 spchvgobj -i bos.obj.ssp.432 -r rootvg -v aix432 -p PSSP-3.1 2 2 1 and either a full spbootins -r migrate 2 2 1, which would run setup_server, or more simply, just allnimres -l 18 which does the same thing. All we were looking for/at is whether or not /etc/exports or exportfs shows the -ro,root=as0202e0.patent.ibm.com:,access=as0202e0.patent.ibm.com: for the spot, bosinst_data, & script lines. While Frank was on the phone, they all did. Everything worked correctly. Use spchvgobj -i bos.obj.ssp.421 -r rootvg -v aix421 -p PSSP-2.4 2 2 1 spbootins -s no -r disk 2 2 1 and unallnimres -l 18 to reset things back to normal. Since we couldn't get it to fail, we decided to close the PMR until it happens again. ============================================================================== PMR # 27516,49R x For the last two nights, ar0143e1 has crashed at 10/26/99 x around 2:00 - 2:30 in the morning. The Support 9:40 x Center had me run sysdumpdev -L to get the dump Chet Holt x device (/dev/dumplv), then script crash.out to chetholt@austin.ibm.com x start recording these next commands. 8-523-4138 x xxxxxxxxxxxxxxxxxxxxxxxxx crash /dev/dumplv, then once inside crash, stat, status, symptom, cpu, trace -m, od prog_log 8, errpt -a, then quit. I then sent it all to Chet via mail -s'PMR 27516,49R' chetholt@austin.ibm.com < /tmp/crash.out Chet called back a few minutes later & said we needed to run a fsck on one (some?) of our file systems. I know that on Wednesday of last week, Kin was trying to run fsck on /usr. I sent him a note telling him he needed to get into a limited function shell, but I don't know what he did. Turns out it was the /ips file system that was bad. I rebooted with nothing getting started up, ran fsck to fix things, and all was well again. ============================================================================== PMR # 29192,49R x Got a crash on ar0144 this morning, so I called 11/05/99 x it in. Got Chet again (see previous PMR). We went 9:10 x through the same sequence of commands to collect Chet Holt x data, then mailed him the stuff. chetholt@austin.ibm.com x sysdumpdev -L To determine the dump device, 8-523-4138 x which was /dev/dumplv, the dump size (46MB), and xxxxxxxxxxxxxxxxxxxxxxxxx the dump status (0), which said the dump completed successfully. The commands he wanted mailed to him were, script crash.out crash /dev/dumplv stat status symptom cpu t -m errpt -a quit To leave crash quit To finish script command mail -s'PMR 29192,49R' chetholt@austin.ibm.com < crash.out Chet called back at 1:00 to say this has been fixed. 
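So I don't have to retype Chet's sequence the next time one of these machines
dumps, here's a wrapper sketch.  It assumes crash will read its subcommands
from stdin the same way it takes them at its own prompt; if that turns out not
to hold, just drive crash by hand inside script like before.

#!/bin/ksh
# Collect the crash-dump info Chet asks for and mail it to him.
# Usage:  crashinfo /dev/dumplv 29192,49R
dumpdev=$1
pmr=$2
out=/tmp/crash.out
{
  sysdumpdev -L
  crash $dumpdev <<EOF
stat
status
symptom
cpu
t -m
errpt -a
quit
EOF
} > $out 2>&1
mail -s "PMR $pmr" chetholt@austin.ibm.com < $out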
I've intentionally kept these image servers a bit downlevel 'cause Kin said
applying some libpthreads fix on them would break them.  Kin didn't tell me
'till a week ago when I asked, that this problem has been fixed when they went
to a later version of the MQM(?) code.  Anyway, Chet's gonna send me a CD with
the AIX 4.2.1 fixes on them, 'cause I don't have room on the CWS to store both
the 4.2.1 & 4.3.2 fixes, so I've blown away the 4.2.1 fixes.
==============================================================================
PMR # 40421,49R           x ar0072e0 crashes soon after DFS comes up.  I have
11/13/99                  x just upgraded its AIX from 4.2.1 to 4.3.2, keeping
2:30                      x it at DCE 2.1.  Julian says it's a known problem
Julian                    x with AIX 4.3.2 & DCE 2.1 and to just apply the
--mail address--          x latest fixes.  I did, converting to DCE 2.2 along
--phone--                 x the way.
xxxxxxxxxxxxxxxxxxxxxxxxx
Now, DCE won't come up at all.  The problem is with the DCE Migration step.
The /opt/dcelocal/etc/cfgdce.log says
    0x11315b5a: DCE migration cannot be performed because the following files
                were not found:
                /opt/dcelocal/etc/mkdce.data
                /lpp/save.config/etc/dce/rc.dce
After much digging, I discovered that when I installed the DCE 2.2 code, the
installation replaced the link Ed had set up at /etc/dce, pointing to
/dceconfig/etcdce, with a directory.  This directory is where all the config
files & other stuff is.  After much fooling around, trying to combine the old
/dceconfig/etc/dce directory with the new /etc/dce, I'm waiting on level 3.
Meanwhile, I noticed that the /var/dce link is also now a directory.  The next
time I upgrade DCE on 71 or 73, undo these links first,
    lrwxrwxrwx  1 root system 17 Oct 11 1997 /etc/dce -> /dceconfig/etcdce
    lrwxrwxrwx  1 root system 17 Oct 11 1997 /var/dce -> /dceconfig/vardce
    lrwxrwxrwx  1 root system 15 Oct 11 1997 /krb5    -> /dceconfig/krb5
Bill first had me backup what we have, namely
    /etc/dce  /krb5  /usr/lib  /usr/ccs  /var/dce  /etc/dce.2.2.orig
    /dceconfig  /etc/objrepos  /usr/lpp/dce*
which was quite a bit of data.  I had to
    tar cvf - $(cat /tmp/root) | compress > /usr/bill.tar.Z
    mv /usr/bill.tar.Z /dceconfig
and even then it was 120MB.
Then he wanted to disable any replica attempts to ar0072e0, so as root on
ar0073, I tried doing these dcecp cmds,
    clearinghouse cat
    clearinghouse disable /.../patent.ibm.com/ar0072e0_ch
but that didn't work, I guess 'cause 73 didn't think that 72 was alive.
Bill then had me do this command,
    cdscp set dir /.:/ to new epoch master /.:/ar0073e0_ch readonly /.:/ar0071e0_ch exclude /.:/ar0072e0_ch
Then running his /tmp/whannon2 script, which was
    dcecp -c dir synch /.:/
    for i in `cdsli -Rd`            (cdsli -Rd returns 59 lines:  /.:/hosts,
    do                               50 /.:/hosts/ lines, and 8 others,
      echo Synching $i               e.g. /.:/users)
      dcecp -c dir synch $i
    done
This kept failing, apparently timing out, with these messages
    Synching /.:/hosts
    Error: Unable to communicate with any CDS server
    Synching /.:/hosts/ar0071e0
    Error: Unable to communicate with any CDS server
    Synching /.:/hosts/ar0072e0
    Error: Unable to communicate with any CDS server
    Synching /.:/hosts/ar0073e0
Bill then came up with
    cat > /tmp/finddirs << EOF
    for i in \`cdsli -Rd \`
    do
    cdscp show dir \$i CDS_Replicas
    done
    EOF
which created /tmp/finddirs; running it produced 914 lines of output.
Anyway, Bill lost me in what he was trying to do.  The point of this last
script is that you can run it and see whether ar0072 still shows up as a
replica anywhere or not.
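For what it's worth, here's how I'd actually use that, sketched under the
assumption that the CDS_Replicas output mentions replicas by their
clearinghouse names, the same names the cdscp commands above use:

    # Run the generated script and see how many lines still mention
    # ar0072e0's clearinghouse as a replica.
    sh /tmp/finddirs > /tmp/finddirs.out
    grep -c ar0072e0_ch /tmp/finddirs.out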
For the 27 CDS directories that 72 was replicating (stored in /tmp/whannon4), I ran for i in $(cat /tmp/whannon4);do export dit=$i;/tmp/whannon1;done where whannon1 was cdscp set dir $dit to new epoch master /.:/ar0073e0_ch readonly /.:/ar0071e0_ch exclude /.:/ar0072e0_ch Then Bill got under dcecp to do dcecp dcecp> sec_admin which changes the prompt. Interesting. sec-admin> si /.: -u To "bind" to the master sec_admin> lrep -all That showed that 73 was the master & 71 and 72 were replicas and by the way, the last update was on 11/8/99. Bill had me do a sec_admin> delrep subsys/dce/sec/ar0072e0 to delete the 72 security replica. Now a lrep -all shows that 72 is "marked for deletion". Bill suggests following these steps 1) Change /etc/dce, /var/dce, & /krb5 to directories. Copy command was cp -pRh /dceconfig/etcdce/.orig* /etc/dce to preserve the links. cp -pRh /dceconfig/vardce/* /var/dce cp -pRh /dceconfig/krb5/* /krb5 2) Unmount /dceconfig. 3) Force install DCE 2.1 (will lslpp now say DCE 2.2 is uninstalled?? - Yes.) Had to get the DCE 2.1 base code from Ed's execute.adtech.internet.ibm.com:/export/nim/lpp_source/aix432/PROD/DCEDFS-2.1 Got an error trying to NFS-mount something from the CWS. What now?? mount cws:/spdata/sys1/install/aix421/lppsource /mnt exec(): 0509-036 Cannot load program /usr/lib/drivers/nfs.ext because of the following errors: 0509-025 The /usr/lib/drivers/nfs.ext file is not executable or not in correct XCOFF format. nfsmnthelp: Cannot run a file that does not have a valid format. Shit! /usr/lib/drivers/nfs.ext is in the bos.net.nfs.client fileset, which did get more updated (4.3.2.10) on ar0072e0, than it did on 71 (4.3.2.6). I scp -p root@ar0071e0:/usr/lib/drivers/nfs.ext /usr/lib/drivers/nfs.ext and was then able to mount. I'll have to call this problem in later. 71 can mount ok. This is fixed by bos.net.nfs.client.4.3.2.11, closed in the last week or two. 4) Apply fixes, too (if not done earlier). Had to download them from http://www-4.ibm.com/software/network/dce/support/fixes/dceaix.html (DCE PTF Set 27 = IY87874) 5) Copy the 3 directories from /dceconfig to mount /dceconfig cp -pRh /dceconfig/etcdce.orig/* /etc/dce cp -pRh /dceconfig/vardce/* /var/dce cp -pRh /dceconfig/krb5/* /krb5 6) Try the rmdce -o local sec_srv to deconfigure the security server. but don't be surprised if it doesn't work. (It worked fine.) 7) Try to start DCE DCE & DFS seemed to start up ok (Yay!), but not all my aggregates are there. I'm missing my SSA devices. Even a lscfg doesn't show them. fts lsaggr ar0072e0 There are 2 aggregates on the server ar0072e0 (ar0072e0.patent.ibm.com): scsi0 (/dev/scsi0lv): id=2 (LFS) scsi1 (/dev/scsi1lv): id=3 (LFS) Missing are ssa0-2. lspv hdisk0 009000733866c1cb rootvg hdisk1 009001466cdd2e2d scsi1vg hdisk2 009001466cdc8cde scsi0vg hdisk3 009001466c667a49 rootvg Missing are hdisk4-6. Noted that the filesets for the SSA device drivers (devices.mca.8f97.*) weren't the same on 72 as they were on 71. 72 had funky 98.2.1.1008 levels, evidently from when we picked up pre-release drivers back who-knows-when. On 71, they were the normal 4.3.2.0 levels. I went back to the base 4.3.2 code on the CWS and force-installed the right filesets, and rebooted. 73 is ok, too (4.2.1.x). That didn't fix things, cfgmgr still won't configure the drives. Called the Support Center back. PMR 40457,49R. Greg talked me through picking up the latest SSA code from http://www.hursley.ibm.com/~ssa/rs6k. 
I got the tar file, expanded it on 72, which put stuff in /usr/sys/inst.images, then completely removed all SSA filesets (installp) and devices (rmdev), rebooted, installed most of the stuff from /usr/sys/inst.images, then ran cfgmgr, which now saw all the devices. While the drives were not in use, we took the opportunity to insure the drives' microcode was up to date (the 16 GB drives needed to be updated). Tried to start DCE/DFS again, and secd is dying. Shit! I thought we got past this. Oh, yeah. I had to rmdce -o local sec_srv again, then /etc/rc.dce all came up with all aggregates exported ok. fts lsaggr ar0072e0 There are 5 aggregates on the server ar0072e0 (ar0072e0.patent.ibm.com): ssa0 (/dev/ssa0lv): id=1 (LFS) scsi0 (/dev/scsi0lv): id=2 (LFS) scsi1 (/dev/scsi1lv): id=3 (LFS) ssa1 (/dev/ssa1lv): id=4 (LFS) ssa2 (/dev/ssa2lv): id=5 (LFS) 8) If not done earlier, (it was done earlier) rmdce -o local sec_srv To deconfigure the security server. then mkdce -o full sec_srv To recreate the security replica & synch it. This gives me the message Enter password to be assigned to initial DCE accounts: What's this asking for? I gave it the current cell_admin password. It then went on to say Cannot configure sec_srv until sec_cl is unconfigured Current state of DCE configuration: cds_cl COMPLETE CDS Clerk cds_second COMPLETE Additional CDS Server rpc COMPLETE RPC Endpoint Mapper sec_cl COMPLETE Security Client so I don't know what to make of this. Bob from the Support Center talked me through doing this from smitty. It's a lot easier that way. I think the command should have been mkdce -R -s ar0073e0 sec_srv 9) If all's ok, then fix CDS replicas by cdscp set dir /.:/ to new epoch master /.:/ar0073e0_ch readonly /.:/ar0071e0_ch readonly /.:/ar0072e0_ch This wasn't necessary. The step above essentially did this. 10) And on 73, modify /tmp/whannon1 (exclude -> readonly) and re-run for i in $(cat /tmp/whannon4);do export dit=$i;/tmp/whannon1;done ============================================================================== PMR # 53242,49R x Some servlet pages on the Net.Commerce server, 12/29/99 x as0206e0/1, quit working. A test URL to use is 9:00 x http://as0206e1/servlet/com.ibm.ipnfb/servlets.IPNAdminServlet John Mahoney x The only other clue were these lines in the httpd jmahone@us.ibm.com x log located in /arc/httpds/logs/httpd-errors.Dec291999. 8-444-4635 x [was error] ose_init : Failed in timebomb validation. xxxxxxxxxxxxxxxxxxxxxxxxx ose_init : Your timebomb is corrupted or has expired. ose_init : Please obtain another copy of the product. John had evidently seen this before and quickly sent me a file to fix it. Just replace the trx.properties file in the /properties directory. This being an IBM Websphere Application Server problem, "install_root" is /usr/WebSphere/AppServer, so the file that needed to be replaced is /usr/WebSphere/AppServer/properties/trx.properties. The old one contained 2IOJFOVZGubqx The new one contains 4669803 Whatever that file is, replacing it & recycling the web server with a startsrc -s httpd -e DB2INSTANCE=inst1 command fixed the problem. I went to the other websphere machines, ar0079e0, baboon, and ncc-312, to replace this file, but the file didn't exist. The file belongs to the IBMWebAS.base.core fileset, which on as0206, is at 2.0.3.1, whereas on the other 3 machines, it's only at 2.0.0.0. Perhaps it's just something with the newer release? 
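Boiled down, the fix on as0206 was just this (sketch; the copy source is
wherever John's replacement file got saved, and "recycle" here means stopsrc
followed by the startsrc John gave me):

    # Drop in the replacement trx.properties and recycle the web server.
    cp /tmp/trx.properties /usr/WebSphere/AppServer/properties/trx.properties
    stopsrc -s httpd
    startsrc -s httpd -e DB2INSTANCE=inst1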
============================================================================== PMR # 57093,49R x Since converting to DCE 2.2, my pretty_fts_lsfldb.tcl 01/25/00 x script that I run weekly on ar0073e0, doesn't run. 10:45 x After running for 36 hours, it core dumps. I've Jeff Pickering x tried looking at it, but there's nothing wrong on --mail address-- x my end and this used to work with DCE 2.1. 8-442-7243 x xxxxxxxxxxxxxxxxxxxxxxxxx I've got DCE 2.2.0.7 installed and 73 is AIX 4.3. ============================================================================== PMR # 59486,49R x There are multiple things wrong with the Patent --/--/00 x server DFS cell. First of all, I spent 10 days --:-- x trying to move the 60 GB patent.verity fileset from Paul Brennfleck x 73:ssa1 to 73:ssa3. The move should take about --mail address-- x 15 hours to finish, but after 3-5 days, I either 8-989-6897 x killed the move myself, or the system would get xxxxxxxxxxxxxxxxxxxxxxxxx rebooted due to other problems (who knows? Maybe it's due to the same causes). After fin x ============================================================================== PMR # 24630,L11 x The cdsclerk died on ar0071 on 4/4/00 at 3:53 AM. 4/05/00 x I had to stop (even tho' it wasn't running) the 10:10 x cdsclerk, restart it, and also restart the repserver Julian Owens x process. Things looked ok after all that. jowens1@us.ibm.com x --phone-- x I searched rshelp.austin.ibm.com (what a site!) xxxxxxxxxxxxxxxxxxxxxxxxx but didn't find anything, so I packaged up everything using their senddata.pl utility, which I had tucked away at /dfs/apps/userlocal/bin/senddata.pl. Here's the README I created to send with the package. I pointed senddata to /dumpfs/data and it created a 74 MB /dumpfs/data/datapkg.tar.Z.uu file, which I rftp'd to testcase.software.ibm.com and put at /aix/toibm/24630.L11-L3DCE-datapkg.tar.Z.uu. Notes on this core file, by Rick Jasper, System Admin Almaden Research Center IBM in San Jose, California (408) 927-2731 or tieline 457-2731 This core file was found at /var/dce/adm/directory/cds/cdsclerk/core and occurred on Tuesday, April 4th, 2000 at 3:53 AM when nobody was in, so there was nothing too interesting going on. The core file is 269,429,847 bytes big and is complete. Things returned to normal on the server once I restarted the cdsclerk & repserver, and are running fine now. The machine this occurred on (ar0071e0) is one of three DCE/DFS servers in my cell, the other two being ar0072e0 & ar0073e0, which is the master. All machines are running AIX 4.3.2 and DCE 2.2 at the latest levels, specifically DCE is at PTF set 7. This is the associated line from the /opt/dcelocal/var/svc/fatal.log file 2000-04-04-02:35:57.632-08:00I----- cdsclerk(30362) FATAL cds general clerk_client.c 1557 0x000011ae msgID=0x10D0AB83 Routine pthread_mutex_lock failed : status = -1. 
Finally, here is the associated error log entry --------------------------------------------------------------------------- LABEL: CORE_DUMP IDENTIFIER: C60BB505 Date/Time: Tue Apr 4 03:53:24 Sequence Number: 17317 Machine Id: 009000734C00 Node Id: ar0071e0 Class: S Type: PERM Resource Name: SYSPROC Description SOFTWARE PROGRAM ABNORMALLY TERMINATED Probable Causes SOFTWARE PROGRAM User Causes USER GENERATED SIGNAL Recommended Actions CORRECT THEN RETRY Failure Causes SOFTWARE PROGRAM Recommended Actions RERUN THE APPLICATION PROGRAM IF PROBLEM PERSISTS THEN DO THE FOLLOWING CONTACT APPROPRIATE SERVICE REPRESENTATIVE Detail Data SIGNAL NUMBER 6 USER'S PROCESS ID: 30362 FILE SYSTEM SERIAL NUMBER 13 INODE NUMBER 8194 PROGRAM NAME cdsclerk ADDITIONAL INFORMATION pthread_k 88 ?? _p_raise 64 raise 34 abort B8 dce_svc_p 3F0 link_free 28C _pthread_ 40 clerk_cli 440 _pthread_ C4 ?? Symptom Data REPORTABLE 1 INTERNAL ERROR 1 SYMPTOM CODE PIDS/576539300 LVLS/410 PCSS/SPI2 FLDS/cdsclerk SIG/6 FLDS/link_free VALU/28c --------------------------------------------------------------------------- ============================================================================== PMR # 24740,L11 x After installing AIX service to the S80, I didn't 04/27/00 x notice that the bosboot command didn't run. 4:00 x Normally, you see these msgs after an update_all, Alex x 0503-409 installp: bosboot verification starting... --mail address-- x installp: bosboot verification completed. --phone-- x 0503-408 installp: bosboot process starting... xxxxxxxxxxxxxxxxxxxxxxxxx bosboot: Boot image is 5331 512 byte blocks. Well, when we rebooted the machine next, the network and other things didn't come up. Logging on through the serial port, we were able to poke around and looking at the install msgs in /smit.log, we didn't see those bosboot msgs. Alex had me run bosboot -ad /dev/ipldevice and I rebooted and everything was fine. ============================================================================== PMR # xxxxx,L11 x x --/--/00 x x --:-- x x --who-- x x --mail address-- x x --phone-- x x xxxxxxxxxxxxxxxxxxxxxxxxx x x ============================================================================== PMR # xxxxx,L11 x x --/--/00 x x --:-- x x --who-- x x --mail address-- x x --phone-- x x xxxxxxxxxxxxxxxxxxxxxxxxx x x