How To Upgrade AIX & PSSP on a Patent Server Machine.

  This presumes the machine is at AIX 4.2.1 & DCE 2.1, and you're going to
AIX 4.3.2 & DCE 2.2.  These notes are based on many iterations, resolving
things ahead of time before they become problems.

  Some steps are different based on whether you're acting on a standalone,
RS/6000 box, or an SP/2 node.  Unless otherwise stated, the step works as
described for either type.  I'll point out the differences.


0) A good thing to do before you get going is to verify that NIM has the correct
   en0 ethernet adapter's MAC address by comparing a
       netstat -in | grep en0 | grep -i link
   on the machine to be upgraded, with a
       lsnim -l as0nnne0 | grep if1
   on the CWS.
   For SP/2 nodes, ensure the SDR has the same MAC address with a
       splstdata -b frame# node# 1 | grep as0
   command on the CWS.  There have been ethernet adapter changes, so this
   just eliminates questions later should you have troubles.
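
   The commands above may print the address in different shapes, so a small
   helper that normalizes both sides before comparing can save some squinting.
   This is only a sketch: norm_mac and same_mac are made-up names, and the
   formats handled (octets separated by '.' or ':' with leading zeros possibly
   dropped, or 12 unbroken hex digits) are assumptions to check against what
   your own netstat, lsnim and splstdata print.

```shell
# norm_mac ADDR: print ADDR as uppercase hex digits with no separators.
# Assumption: ADDR is octets separated by '.' or ':' (leading zeros possibly
# dropped), or an unbroken run of hex digits.
norm_mac() {
    printf '%s\n' "$1" | awk -F'[.:]' '{
        out = ""
        for (i = 1; i <= NF; i++) {
            if (length($i) == 1) out = out "0" $i   # restore a dropped zero
            else out = out $i
        }
        print toupper(out)
    }'
}

# same_mac A B: succeed if the two addresses are equal once normalized.
same_mac() {
    [ "$(norm_mac "$1")" = "$(norm_mac "$2")" ]
}
```

   Paste in the two addresses you see, e.g. (made-up address)
      same_mac 2.60.8c.2e.a4.b1 02608C2EA4B1 && echo match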

   Another good thing to do is to check the error log on the machine, fixing and
   cleaning out any old entries.
      errpt -d H                          For Hardware errors, eg
      errclear -N <resource_name> 0       To clear them out

   Yet another good thing to do is to clear out root's mail, both already-inc'd
   mail (using scan, show, rmm) and yet-to-be-inc'd mail (inc it).  My ksh hint
   to remove a lot of mail is,
      rmm $(scan | grep expression | awk '{print $1}' | tr -d '+')
      rm /Mail/inbox/,*

   You might also ensure the ODM accurately reflects the current hardware
   configuration with
      diag -a

1) On the CWS, ensure /etc/exports does not have the normal /spdata/sys1/install
   exported.  There's a copy of the normal, everyday /etc/exports saved at
   /etc/exports.save, which has /cd and /spdata/sys1/install exported.  This is
   what the CWS normally has exported, but when you run setup_server, it tries
   to export the /spdata/sys1/install/aix432/lppsource directory which, since
   it's a subdirectory of /spdata/sys1/install, gets an NFS error. Use
      exportfs 
   to see what's exported now, and
      vi /etc/exports
      exportfs -u /spdata/sys1/install
   to unexport whatever's now exported.  You can keep /cd if you want.
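
   To spot the parent/child clash before setup_server trips over it, one could
   scan /etc/exports for any entry that sits at or above the lppsource
   directory.  A minimal sketch; exports_above is a made-up helper name, and it
   assumes the exported directory is the first field on each line.

```shell
# exports_above FILE DIR: print each directory exported in FILE that is DIR
# itself or one of its parents.  Comment and blank lines are skipped.
exports_above() {
    awk -v dir="$2" '
        /^[ \t]*#/ || NF == 0 { next }
        {
            p = $1
            if (p != "/") sub(/\/+$/, "", p)      # ignore trailing slashes
            if (p == "/" || dir == p || index(dir, p "/") == 1) print p
        }' "$1"
}
```

   For example,
      exports_above /etc/exports /spdata/sys1/install/aix432/lppsource
   should print nothing once /spdata/sys1/install has been taken out.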

2) Prepare the machine for the DCE conversion.
      mkdir /var/dce/tmp /var/dce/config /var/dce/dced/backup

      vi /etc/inittab
   and replace the
      rccleancred:2:wait:/etc/dce/rc.cleancred -v > /dev/console 2>&1
      rcdce:2:wait:/etc/dce/rc.dcestart all > /dev/console 2>&1
   lines with (commented out for now)
      : cleanupdce:2:wait:/usr/bin/clean_up.dce > /dev/console 2>&1
      : rcdce:2:wait:/etc/rc.dce all > /dev/console 2>&1 # Start DCE/DFS Daemons

   Comment out the /etc/rc.local line as well, especially for something like
   the DB/2 nodes, so nothing gets started up.

   While you're at it, check for and delete if they exist, the following
   lines in /etc/inittab
      - db, i4ls, imnss, imqss, httpdlite
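
   If you'd rather not eyeball /etc/inittab, the two chores above (finding
   leftover entries, commenting an entry out) can be sketched as filters.
   itab_present and itab_comment are made-up names; they only read the file
   or stdin, so try them on a copy first.

```shell
# itab_present FILE ID...: print each ID that has an active entry in FILE.
itab_present() {
    f=$1
    shift
    for id in "$@"; do
        grep -q "^$id:" "$f" && echo "$id"
    done
    return 0
}

# itab_comment ID: filter stdin to stdout, prefixing the entry for ID with
# ": " the same way the commented-out lines above are written.
itab_comment() {
    sed "s/^$1:/: &/"
}
```

   For example,
      itab_present /etc/inittab db i4ls imnss imqss httpdlite
      itab_comment rclocal < /etc/inittab > /tmp/inittab.new
   where rclocal stands in for whatever identifier your /etc/rc.local line has.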

      vi /etc/dce/rc.dce
   and remove unneeded parms on the dfsd startup command, at the end of the
   file.  Unnecessary parms are
      - callback $RPC_SUPPORTED_NETADDRS
      - blocks 20000
      - cachedir /dfscache
      - chunksize 16

   Reboot the machine/node to get an IPL without DCE/DFS up.  When it's back up,
   you may have troubles getting to the IBM Internal network 'cause /etc/rc.local
   is commented out.  You may need to
      dsh -was0nnne0 route add -net 9 -netmask 255.0.0.0 192.168.56.252
   from the CWS to get that routing back.

   Clear out the DFS cache directory.
      for i in $(jot 10 0);do rm /dfscache/V1$i*;done
      for i in $(jot 10 0);do rm /dfscache/V$i*;done
      find /dfscache -type f -exec rm {} \;

      vi /etc/inittab
   and un-comment-out, i.e. restore the rcdce line with
      /rcdce
      0xx:x

3) For SP/2 nodes only, configure NIM on the CWS.  For standalone machines, this
   step is consolidated into step 6) below, so skip to step 6).

   This is the command one uses with PSSP 3.1, instead of the old spbootins
   command, to set the AIX & PSSP level.
      spchvgobj -i bos.obj.ssp.432 -r rootvg -v aix432 -p PSSP-3.1 frame# node# 1
   See the results of the command if you want to, with
      splstdata -b frame# node# 1
   You'll see that next_install_image = bos.obj.ssp.432
                       lppsource_name = aix432
                     and the pssp_ver = PSSP-3.1

4) Again, this step is for SP/2 nodes only.  Set the boot "response" field of
   the SDR to "migrate" and run setup_server, with a
      spbootins -r migrate frame# node# 1
   command.  You can see the boot response of the SDR with a
      splstdata -b frame# node# 1
   command.  That runs setup_server, which takes a few minutes to complete.

   After it completes, check to see what's NFS exported with an
      exportfs
   command.  For the as0204 install on 10/6/99, I had a lot of trouble here.
   It appeared setup_server wasn't doing the right thing with /etc/exports.
   What should be there is 
       /spdata/sys1/install/aix432/spot/spot_aix432/usr
       /spdata/sys1/install/pssp/bosinst_data_migrate
       /spdata/sys1/install/pssp/pssp_script
       /export/nim/scripts/as0204e0.script
   All with -ro,root=as0204e0.patent.ibm.com,access=as0204e0.patent.ibm.com
   at the end, as well as
       /spdata/sys1/install/aix432/lppsource -ro
   I was getting different things, e.g. the lppsource line not having the
   extra stuff on it or missing lines altogether, which causes LED 611.

   I opened up PMR 24566,49R to investigate, but a week later, everything was
   working ok.  If this situation occurs again, reopen the PMR or open a new one.
   The quickest way to debug is to use a series of allnimres & unallnimres
   commands.
        allnimres -l 2 4 1, puts the lines in /etc/exports.
      unallnimres -l 2 4 1, takes them out
   presuming you've already done the spchvgobj & spbootins commands.
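
   Between allnimres/unallnimres rounds, a quick check for missing lines saves
   reading /etc/exports by hand.  A sketch only: missing_exports is a made-up
   name, the path list is just the one given above for this aix432 setup, and
   it checks that each path is present, not that the -ro,root=,access= options
   are right.

```shell
# missing_exports FILE NODE: print each expected path that has no line in
# FILE.  NODE is the node's en0 host name, e.g. as0204e0.
missing_exports() {
    for p in \
        /spdata/sys1/install/aix432/spot/spot_aix432/usr \
        /spdata/sys1/install/pssp/bosinst_data_migrate \
        /spdata/sys1/install/pssp/pssp_script \
        "/export/nim/scripts/$2.script" \
        /spdata/sys1/install/aix432/lppsource
    do
        # awk exits 0 only if some line's first field is exactly the path
        awk -v p="$p" '$1 == p { found = 1 } END { exit !found }' "$1" ||
            echo "$p"
    done
}
```

   For example,
      missing_exports /etc/exports as0204e0
   prints nothing when all five lines are there.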

5) 4/3/2000 Note:  I don't see any /etc/rc.sp anywhere that has this local mod
                   I did long ago.  I wonder what happened to it?  Anyway, this
                   bullet no longer applies since I lost my fix.  If you want
                   details, see my PMR 59782,49R dated 11/23/98, in my
                   ~jasper/aixnotes/aix.support.center.history file.

   Beware of your local mod to /etc/rc.sp where you call /local/bin/lsof to
   ensure inetd is indeed listening to kshell.  The lsof in /local/bin
   doesn't work on AIX 4.3.2.  I modified /etc/rc.sp to check the oslevel,
   and run /local/bin/lsof.for.AIX.4.3 instead.  Ensure the target node
   has either an unmodified /etc/rc.sp, or this fixed one.

6) For non-SP/2 nodes, ensure the machine is up and ready to be installed,
   because these next commands will configure NIM on the CWS, and start
   the install!!  On the machine,
      /local/bin/rearm-nim.sh
   On the CWS,
      cd $s1
      ./domigratetoaix432now.sh ar0078e0
   Within 20 seconds, you should see two of the following messages on the
   target machine, 4 seconds apart
      *******************************************************************
      *******************************************************************
      *******************************************************************

         NIM has initiated a bos installation operation on this machine.
         Automatic reboot and reinstallation will follow shortly...

      *******************************************************************
      *******************************************************************
      *******************************************************************
   Then you're logged off a few seconds later, so unless you're watching
   for them, you may miss these.


   For SP/2 nodes, these too should be powered up, else the setup_server
   process will hang with a dsh to the down nodes to run
   /usr/lpp/ssp/install/bin/services_config (if you get into that state,
   you could kill the individual dsh commands to the down nodes).

   Start the install with
      nodecond frame# node#
   which network boots the node.  This will upgrade AIX & DCE on the node, but
   not SSP.

   Don't have a R/W spmon window open during this step; it'll cause it to fail.

   If you want to monitor things, from the CWS, 
      tail -f /var/adm/SPlogs/spmon/nc/nc.2.5
   (or whatever the latest file is in the /var/adm/SPlogs/spmon/nc directory).
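
   Picking "whatever the latest file is" can be left to ls.  A small sketch;
   latest_nc is a made-up name, and it simply takes the most recently modified
   entry in the directory.

```shell
# latest_nc [DIR]: print the most recently modified file in DIR
# (default: the nodecond log directory named above).
latest_nc() {
    dir=${1:-/var/adm/SPlogs/spmon/nc}
    f=$(ls -t "$dir" | head -1)
    [ -n "$f" ] && echo "$dir/$f"
}
```

   Then
      tail -f "$(latest_nc)"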

   You can also monitor things with a R/O window via a
      s1term frame# node#
   command.

   Another handy way to monitor what's going on is to have a
      spled &
   window open on the CWS.

   On the 4/3/2000, as0205 conversion, this step hung after these lines in the
   /var/adm/SPlogs/spmon/nc/nc.2.5 file,
      Nodecond Status: command is boot /pci@fef00000/ethernet@10
      Nodecond Status: network boot successfull
      Nodecond Status: bootp sent over ethernet
      Nodecond Status: waiting for "Welcome to AIX" message
      Nodecond Status: network boot proceeding, nodecond is exiting
      Nodecond Status: finished
   I opened a R/O s1term window to watch it more closely, and saw the bootp
   packets getting sent just fine, but when that image was booted, as0205
   hung on LED E105.

   The http://rshelp.austin.ibm.com web site pointed me to PMR 48501,7TD,
   an 8-week, Sev 1 problem where they guessed at all kinds of things, then
   finally found an SP node, not involved with the problem at all, that had
   full duplex turned on for its private SP/2, BNC, 10Mbs, en0 interface,
   which is wrong.  I checked my nodes with an appropriate
     dsh 'lsattr -E -l ent0 -a full_duplex 2>/dev/null'
   command, looking for 
      full_duplex yes Full duplex True
   instead of
      full_duplex no  Full duplex True
   I found that as0211 had full duplex turned on.  I fixed as0211 by a
     chdev -l ent0 -a full_duplex=no
   command and as0205 installed just fine.
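
   With a frame or two of nodes, picking the one "yes" out of the dsh output
   is easy to get wrong by eye.  A rough sketch of a filter for it follows;
   full_duplex_hosts is a made-up name, and it assumes dsh prefixes each
   output line with "hostname:" ahead of the lsattr line shown above.

```shell
# full_duplex_hosts: read "host: full_duplex <value> ..." lines on stdin
# and print the hosts whose value is yes.
full_duplex_hosts() {
    awk '$2 == "full_duplex" && $3 == "yes" { sub(/:$/, "", $1); print $1 }'
}
```

   For example,
      dsh 'lsattr -E -l ent0 -a full_duplex 2>/dev/null' | full_duplex_hosts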

   Unfortunately, this was after I also wasted time updating the SDR 'cause a
      splstdata -a
   command showed a lot of nodes' enet_rate & duplex settings to be null.
   The PMR said to ensure they were set to 10 & half respectively.  Using
   smitty, I changed this with a spethernt command, discovering 2 options
   not documented in my book, -f 10 & -d half, which are the defaults.
   I also fixed all the en1 interfaces, which at the time (4/4/00), were
   all 100/full duplex except for nodes 3-5 & 7-12 on the first frame.
   Now the splstdata -a command shows the right thing.

   I also tried applying service to the spot, but that didn't fix things either.

   First, restore the route to the IBM Internal network.  Since /etc/rc.local
   is commented out, you'll need to
     dsh -was0nnne0 route add -net 9 -netmask 255.0.0.0 192.168.56.252
   to get that routing back.

   On the as0101e1 install, I quickly hung on LED c48.  Opening up an R/W window
   to the node, I saw
      Disks specified in bosinst.data do not define a root volume group.
   On the CWS,
      lsnim -l as0101e1  --->  bosinst_data = migrate
      lsnim -l migrate   --->  /spdata/sys1/install/pssp/bosinst_data_migrate
   which had in it
        target_disk_data: HDISKNAME = hdisk0
   I wondered if as0101e1's rootvg was on hdisk1, not hdisk0.  Rebooting into
   AIX 4.2.1 with
      spbootins -r disk 1 1 1
      nodecond 1 1
   I saw
      <root@as0101e0:/> lsvg -p rootvg
      rootvg:
      PV_NAME           PV STATE    TOTAL PPs   FREE PPs    FREE DISTRIBUTION
      hdisk0            active      537         290         104..00..00..78..108
      hdisk1            active      537         314         108..57..00..41..108
   It looks right, so what's wrong?  Hmmmm.  The SDR, with a
      splstdata -v 1 1 1 ---> pv_list = hdisk0
   Ahhh, as0101 is the only node with 2 PV's in rootvg and the SDR needs to show that.
   I used smitty to change this to hdisk0,hdisk1.  The command it used was
      spchvgobj -r rootvg -h 'hdisk0,hdisk1' -c 1 -p 'PSSP-3.1' 1 1 1
   Then tried installing it again with
      spbootins -r migrate 1 1 1
      nodecond 1 1
   Seems to be working this time.  Hmmmm.
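
   To catch this before the install hangs, the PV list from the node can be
   squeezed into the same comma-separated shape the SDR's pv_list uses.  A
   minimal sketch; rootvg_pvs is a made-up name, and it assumes the lsvg -p
   layout shown above (a VG name line, a header line, then one line per PV).

```shell
# rootvg_pvs: read "lsvg -p rootvg" output on stdin and print the PV names
# joined by commas, e.g. hdisk0,hdisk1.
rootvg_pvs() {
    awk 'NR > 2 { printf "%s%s", sep, $1; sep = "," } END { print "" }'
}
```

   On the node,
      lsvg -p rootvg | rootvg_pvs
   and compare the result with the pv_list that splstdata -v shows on the CWS.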

7) After the install, check to ensure DCE/DFS got upgraded ok.  I used to have a lot
   of troubles here, but now I think things go pretty smoothly.  If DFS isn't up
   already after the reboot, you'll probably see on the console if you've been
   watching it, or in the /etc/dce/cfgdce.log file,
      DCE Migration cannot be performed unless no DCE daemons are running.
      0x11315b66: You can attempt to stop the daemons by running the command stop.dce,
                  or you can stop them manually.
   Simply rerun 
      /etc/rc.dce all
   and the conversion will finish and DCE & DFS will come up.

   When DCE & DFS are both there ok, if a
      lsdce -r
   command doesn't show the
      Integrated Login (dceunixd)              Configured           Running
   line, then run the
      config.dce dce_unixd
   command to configure dceunixd.

   Restore the cleanupdce line that was commented out from /etc/inittab.
      
8) Check to see what you've got.  For example, run 
      oslevel -l 4.3.2.0
   to see which filesets still need updating to get to AIX 4.3.2.
   For the 201, 204 & 206 updates, I've had to run 
      installp -u bos.msg.en_US.diag.rte

   For the 201 & 204 updates, I also saw 3 devices.ssa.* filesets were still
   at the 4.2.1 level.  I'm not sure why those filesets didn't get updated,
   probably because we used to have SSA drives on this system, but they got
   moved when we moved the DB/2 database to as0209e0/1?  I dunno.

   From the node, you can
       mount cws:/spdata/sys1/install/aix432/lppsource /mnt
       smitty update_all
   pointing it to the /mnt directory.
   Interestingly enough, this reinstalled ssp.jm & ssp.css.

   But these 3 filesets,
       Fileset                       Actual Level    Maintenance Level
       ---------------------------------------------------------------
       devices.ssa.disk.rte          4.2.1.10        4.3.2.0
       devices.ssa.IBM_raid.rte      4.2.1.7         4.3.2.0
       devices.ssa.tm.rte            4.2.1.5         4.3.2.0
   were a problem to install.  Had to go to
       smitty installp
   and select
       "Install and Update from LATEST Available Software"
   at the bottom of the screen, pointing it to the /mnt directory, and
   explicitly select the base 4.3.2.0 filesets for these 3 guys, viz
       devices.ssa.IBM_raid.rte 4.3.2.0
       devices.ssa.disk.rte 4.3.2.0
   and devices.ssa.tm.rte 4.3.2.0

   When you're done, umount /mnt.
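
   If the downlevel list is long, the fileset names can be pulled out of the
   report for pasting into smitty or an installp command.  A sketch only;
   downlevel_filesets is a made-up name, and it assumes the three-column
   layout shown above (name, actual level, maintenance level).

```shell
# downlevel_filesets: read the fileset report on stdin and print just the
# fileset names, skipping the header and the rule of dashes.
downlevel_filesets() {
    awk '$2 ~ /^[0-9]+(\.[0-9]+)+$/ && $3 ~ /^[0-9]+(\.[0-9]+)+$/ { print $1 }'
}
```

   For example,
      oslevel -l 4.3.2.0 | downlevel_filesets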

9) For non-SP/2 machines, especially those I did years ago, I used to install
   more than was minimally required to get dsh (part of ssp.clients) working.
   So, remove the unnecessary filesets
      installp -u ssp.ha ssp.ha_clients ssp.pman ssp.sysctl ssp.sysman
      installp -u ssp.perlpkg ssp.basic
   All you need, is ssp.clients.

   For an SP/2 node, to upgrade the PSSP code, set the node's bootp response
   to customize, copy the PSSP 3.1 pssp_script to the node and run it.

   First, from the node itself,
      route add -net 9 -netmask 255.0.0.0 192.168.56.252
   to restore the routing back to IBM's internal network.  Then
      mkitab -i rc 'fbcheck:2:wait:/usr/sbin/fbcheck 2>&1 | alog -tboot > /dev/console # run /etc/firstboot'
      installp -u ssp.jm ssp.css

   Then from the CWS,
      spbootins -r customize frame# node# 1
   which will run setup_server, which takes a few minutes.

   From the CWS, type
      pcp -was0xxxe0 /spdata/sys1/install/pssp/pssp_script /tmp/pssp_script

   Now before you do this last step, check /tftpboot/script.cust on the CWS.
   Comment out the updating of everything except /etc/hosts, that is, at the
   end of the script, ensure
       /etc/resolv.conf,
       /etc/passwd,
       /etc/group,
       /etc/security/passwd,
       /etc/security/group,
   and /etc/security/user
   are all commented out.  I had to use ADSM to restore these files on 206.

   Then from the node, type
      /tmp/pssp_script
   or if you prefer, from the CWS, you can type
      dsh -was0xxxe0 /tmp/pssp_script

   If you want to watch the progress of this step, from another CWS window,
      tail -f /var/adm/SPlogs/sysman/NODE.config.log.21428
   where NODE is e.g. as0202e0, and 21428 is the PID of the ksh process
   that got forked off by the /tmp/pssp_script program.
   If nothing else, check this file when you're done to ensure things went ok.
   For some reason, a lot of times lately, it complains that the fbcheck line
   you just added to /etc/inittab, isn't there, and sure enough, it isn't.
   I've had to re-add that line, then rerun /tmp/pssp_script.  Dunno why.

   When it's completed successfully,
      rm /tmp/pssp_script

10)    vi /etc/motd
    and put the IBM message lines back in so that AIXCOPS will run clean.

11) Run lockdown.sh to ensure things are still ok.  It's not uncommon for lockdown
    to find ports that were previously either not defined or not running, now enabled.

12) Check for known errors,
       ls -l /etc      | grep ' 0 '          # 8 zero-length files are ok.
       ls -l /usr/bin  | grep ' 0 '          # There should be no zero-length files.
       ls -l /usr/sbin | grep ' 0 '          # There should be no zero-length files.
       ls -l /etc/services /etc/inetd.conf   # Ensure these are over 4,000 bytes big.
       grep -v '^#' /etc/inetd.conf    # Should only be sshdfail, ssalld, kshell, klogin,
                                         and maybe clutild.
       grep '^start' /etc/rc.tcpip     # Should only be syslogd, portmap, inetd, xntpd,
                                         and sshd.
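
    The grep ' 0 ' trick relies on how the ls -l columns happen to line up.  A
    sketch of a more direct test for empty files follows; zero_length is a
    made-up name, it looks only at the top level of the directory, and it
    skips dot files.

```shell
# zero_length DIR: print each regular file directly under DIR whose size
# is zero.
zero_length() {
    for f in "$1"/*; do
        [ -f "$f" ] && [ ! -s "$f" ] && echo "$f"
    done
    return 0
}
```

    For example,
       for d in /etc /usr/bin /usr/sbin; do zero_length "$d"; done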

13) Ensure any 100Mbs ethernet interface (ent1 on SP/2 nodes, else probably ent0)
    is configured correctly for 100 Mb, full duplex.  For unknown reasons, the
    initial boot has the en1 interface working ok, but subsequent reboots try
    configuring the interface to be 10 Mbs & half duplex.  To see what the
    setting is in the ODM, type
       odmget -q 'name=ent0 AND attribute=media_speed' CuAt | grep value
    or odmget -q 'name=ent1 AND attribute=media_speed' CuAt | grep value
    It should be
        value = "100_Full_Duplex"
    not value = "10_Half_Duplex"
    If it's wrong, you can set/fix it with
       chdev -l ent1 -a media_speed=100_Full_Duplex -P
    The -P option says to just change the ODM database.  Since en1 is working
    for this boot, you don't have to change it.  It's just the next boot you
    need to fix it for, and to do that, all you need do is change the ODM.

    (Note these commands are different, due to it being a different ethernet
     adapter, than the E105 problems I describe in step 6) above.)
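
    The ODM check above can be wrapped up so it only speaks when something is
    wrong.  A minimal sketch; media_speed_value is a made-up name, and it
    assumes odmget prints the attribute as a   value = "..."   line like the
    one shown above.

```shell
# media_speed_value: read odmget CuAt output on stdin and print the value
# with its quotes stripped.
media_speed_value() {
    sed -n 's/^[[:space:]]*value = "\(.*\)"[[:space:]]*$/\1/p'
}
```

    For example,
       v=$(odmget -q 'name=ent1 AND attribute=media_speed' CuAt | media_speed_value)
       [ "$v" = 100_Full_Duplex ] || echo "ent1 media_speed is '$v'"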

14) If this system has SSA on it, upgrade the SSA code with these commands,
       mount cws:/spdata/sys1/install/aix432/lppsource/PROD/SSA_Upgrades /mnt
       smitty installp
    Ensure you select
       "Install and Update from ALL Available Software"
    at the bottom of the screen, else you won't get the SSA microcode filesets
    (ssamcode.* and ssadiskmcode.*).
    Point it to /mnt and go ahead and install everything, i.e. type "all" in the
       "SOFTWARE to install"
    field.  Again, when you're done,
       umount /mnt

    To see what levels of microcode you have now, before they get updated,
    for each D40 enclosure, each SSA adapter, and each pdisk, type
       lscfg -vl enclosure0 | grep ROS
       lscfg -vl ssa0 | grep ROS
    or lscfg -vl pdisk0 | grep ROS
    For each, look at ROS Level and ID.  See the http://www.hursley.ibm.com/ssa/rs6k
    web page for the latest levels for each type of device.

    The adapter microcode will automatically get updated the next time cfgmgr
    runs and the adapter & drives aren't being used.  I.E. since the system is now
    quiesced, you can either run
       cfgmgr
    or lscfg -vl pdisk0
    now (easiest), or shutdown -Fr.

    The enclosure and pdisk microcode require a manual step.
    To update either one, ensure none of the drives are in use,
    including, for an enclosure, any drives being used by other systems.  Then
    - For each enclosure, type
         /etc/microcode/ssa_sesdld -d enclosure0 -f coral014.hex
      Even though the enclosure microcode is really updated, a
         lscfg -vl enclosure0
      command will still show the downlevel code.  To fix this,
         mkdev -l enclosure0
    - To update all the drives at once,
         ssadload -u
      This takes about 30 seconds per drive.
    
    See my notes in ~jasper/aixnotes/ssa for further information.

15)    vi /etc/inittab
    and restore both the cleanupdce & the /etc/rc.local lines that were commented
    out earlier.

16)    shutdown -Fr
    You're done.  You can now proudly move the line for this system from the file
    /dishes/aix421 on the CWS, into the /dishes/aix432 file.

==================================================================================