Some SP/2 tidbits ...
---------------------------------------------------------------------------
See http://www.rs6000.ibm.com/support/sp for info on different fix levels
of PSSP code.
---------------------------------------------------------------------------
Try
    spmon -d -G
to see a snapshot of the whole SP/2.  Returns stuff like

----------------------------- Frame 3 (and 1 & 2) -------------------------
Frame Node   Node        Host/Switch  Key    Env   Front Panel     LCD/LED is
Slot  Number Type  Power Responds     Switch Fail  LCD/LED         Flashing
--------------------------------------------------------------------------
1     33     extrn on    no  no       N/A    N/A   LCDs are blank  N/A
---------------------------------------------------------------------------
To get a node to network boot, use
    nodecond 3 1            (for frame 3, node 1, e.g.)
You do this when you want to install AIX on a node, presuming you've
already done the
    spchvgobj -i bos.obj.ssp.432 -p PSSP-3.1 -v aix432 3 1 1
and
    spbootins -r install 3 1 1

To get a node to power off or on, try
    hmcmds off 33           (for node 33)
or
    hmcmds on 3:1           (for frame 3, node 1)
---------------------------------------------------------------------------
To Completely Reinstall AIX on a node (what I did for as0207 on 3/10/00)

1) spchvgobj -i bos.obj.ssp.432 -r rootvg -v aix432 -p PSSP-3.1 2 7 1

2) vi /etc/exports          To remove /spdata/sys1/install
   exportfs -u /spdata/sys1/install

3) spbootins -r install 2 7 1

4) Start the install with
       nodecond 2 7
   which network boots the node.

5) Follow the notes in my
   ~jasper/public_html/patent.new-machine.build.process.html
---------------------------------------------------------------------------
To Upgrade AIX & PSSP on a Node, e.g.

What I did for as0206 on 8/28/99, and as0201 on 9/09/99, and as0204 on
10/06/99, and as0205 on 12/01/99, but I had troubles that I never
investigated, so I tried again on 4/03/00.  As I recall, the CWS was
ignoring 205's bootp packets.  Don't know why.  Things looked good in
/etc/bootptab.
Anyway, I'll dig into it this time.

0) A good thing to do before you get going is to verify the SDR has the
   correct en0 ethernet adapter's MAC address by comparing a
       netstat -in | grep en0 | grep -i link
   on the node, with a
       splstdata -b frame# node# 1 | grep as0
   on the CWS.  If you want, you can even compare this with a
       lsnim -l as0nnne0 | grep if1
   on the CWS.  This just eliminates questions later should you have
   troubles.

   Another good thing to do is to check the error log on the node,
   fixing and cleaning out any old entries.

   Yet another good thing to do is to clear out root's mail, both
   already-inc'd and yet-to-be-inc'd mail.

1) spchvgobj -i bos.obj.ssp.432 -r rootvg -v aix432 -p PSSP-3.1 frame# node# 1

   This is the command one uses with PSSP 3.1 instead of the spbootins
   command to set the AIX & PSSP level.  See the results of the command
   with
       splstdata -b frame# node# 1
   Ensure that next_install_image = bos.obj.ssp.432,
   lppsource_name = aix432, and pssp_ver = PSSP-3.1.

2) On the CWS, ensure /etc/exports does not have the normal
   /spdata/sys1/install exported.  There's a copy of the normal, everyday
   /etc/exports saved at /etc/exports.save, which has /cd and
   /spdata/sys1/install exported.  This is what the CWS normally has
   exported, but when you run setup_server, it tries to export the
   /spdata/sys1/install/aix432/lppsource directory and, since that's a
   subdirectory of /spdata/sys1/install, gets an NFS error.  Use
       exportfs
   to see what's exported now, and
       vi /etc/exports
       exportfs -u /spdata/sys1/install
   to unexport whatever's now exported.  You can keep /cd if you want.

3) spbootins -r migrate frame# node# 1

   This sets the "response" field of the
       splstdata -b frame# node# 1
   command to "migrate" and runs the setup_server command.  At this
   point, check to see what's NFS exported with an exportfs command.

   For the as0204 install on 10/6/99, I had a lot of trouble here.  It
   appeared setup_server wasn't doing the right thing with /etc/exports.
   What should be there is
       /spdata/sys1/install/aix432/spot/spot_aix432/usr
       /spdata/sys1/install/pssp/bosinst_data_migrate
       /spdata/sys1/install/pssp/pssp_script
       /export/nim/scripts/as0204e0.script
   All with
       -ro,root=as0204e0.patent.ibm.com,access=as0204e0.patent.ibm.com
   at the end, as well as
       /spdata/sys1/install/aix432/lppsource -ro
   I was getting different things, e.g. the lppsource line not having the
   extra stuff on it or missing lines altogether, which causes LED 611.
   I opened up PMR 24566,49R to investigate, but a week later, everything
   was working ok.  If this situation occurs again, reopen the PMR or open
   a new one.

   The quickest way to debug is to use a series of allnimres &
   unallnimres commands.
       allnimres -l 2 4 1      puts the lines in /etc/exports.
       unallnimres -l 2 4 1    takes them out,
   presuming you've already done the spchvgobj & spbootins commands.

4) 4/3/2000 Note:  I don't see any /etc/rc.sp anywhere that has this
   local mod I did long ago.  I wonder what happened to it?  Anyway, this
   bullet no longer applies since I lost my fix.  If you want details,
   see my PMR 59782,49R dated 11/23/98, in my
   ~jasper/aixnotes/aix.support.center.history file.

   Beware of your local mod to /etc/rc.sp where you call /local/bin/lsof
   to ensure inetd is indeed listening to kshell.  The lsof in /local/bin
   doesn't work on AIX 4.3.2.  I modified /etc/rc.sp to check the
   oslevel, and run /local/bin/lsof.for.AIX.4.3 instead.  Ensure the
   target node has either an unmodified /etc/rc.sp, or this fixed one.

5) Prepare the node for the DCE conversion.
       mkdir /var/dce/tmp
       mkdir /var/dce/config
       mkdir /var/dce/dced/backup
   Comment out rccleancred from /etc/inittab.
   Replace the rcdce line in /etc/inittab with
       rcdce:2:wait:/etc/rc.dce all > /dev/console 2>&1 # Start DCE/DFS Daemons
   and comment it out as well.
   Comment out the /etc/rc.local line as well, especially for something
   like the DB/2 nodes, so nothing gets started up.
   While you're at it, check for and uncomment out if they exist, the
   following lines in /etc/inittab - db, i4ls, imnss, imqss, httpdlite

   Reboot the machine/node to get an IPL without DCE/DFS up.
   When it's back up, clear out the DFS cache directory.
       for i in $(jot 10 0);do rm /dfscache/V1$i*;done
       for i in $(jot 10 0);do rm /dfscache/V$i*;done
       find /dfscache -type f -exec rm {} \;
   Remove unneeded parms on the dfsd startup command, in /etc/dce/rc.dce
   Unnecessary parms are
       - callback $RPC_SUPPORTED_NETADDRS
       - blocks 20000
       - cachedir /dfscache
       - chunksize 16
   Un-comment-out the rcdce line in /etc/inittab, i.e. restore it.

6) Start the install with
       nodecond frame# node#
   which network boots the node.  This will upgrade AIX & DCE on the
   node, but not SSP.  Don't have a R/W spmon window open during this
   step; it'll cause it to fail.  If you want to monitor things, from the
   CWS,
       tail -f /var/adm/SPlogs/spmon/nc/nc.2.5
   (or whatever the latest file is in /var/adm/SPlogs/spmon/nc).  You can
   also monitor things with a R/O window via a
       s1term frame# node#
   command.  Another way to monitor the install is to watch the LEDs with
   a
       spled &
   command, or yet another way is with multiple
       lsnim -l as0nnne0
   commands.

   On the 4/3/2000 as0205 conversion, this step hung with LED E105 and
   these lines in the /var/adm/SPlogs/spmon/nc/nc.2.5 file,
       Nodecond Status: command is boot /pci@fef00000/ethernet@10
       Nodecond Status: network boot successfull
       Nodecond Status: bootp sent over ethernet
       Nodecond Status: waiting for "Welcome to AIX" message
       Nodecond Status: network boot proceeding, nodecond is exiting
       Nodecond Status: finished
   I opened a R/O s1term window to watch it more closely, and saw the
   bootp packets getting sent just fine, but when that image was booted,
   as0205 hung on LED E105.  Hmmmmm.

   The http://rshelp.austin.ibm.com web site pointed me to PMR 48501,7TD,
   which took 8 weeks (November & December, 1999) to resolve this Sev 1
   problem.
   After a lot of guessing, they finally discovered a node in their SP/2
   complex that had full_duplex set on for its private, BNC, 10 Mbps
   ethernet interface.  Checking my nodes with a
       dsh lsattr -E -l ent0 -a full_duplex
   command showed that node 211 did indeed have full_duplex yes.
   Changing that to no (had to bring down en0 & remove it first) fixed
   it, and now 205 installs fine.  Whew!!

   Of course, this was after I had tried other things that didn't fix
   the problem.  E.G.
   - A
         splstdata -a 2 1 11
     command showed nodes 201-211 having enet_rate="" & duplex="".  The
     PMR says to ensure those are correctly set to 10 & half
     respectively.  Using smitty, I changed this with a spethernt
     command, discovering 2 options not documented in my book, -f 10 &
     -d half, which are the defaults.  Now the splstdata -a 2 1 11
     command shows the right thing, at least for the en0 interfaces.
     Afterwards, you had to do a setup_server again in order to rebuild
     the /tftpboot/as0205e0.patent.ibm.com.config_info file.  But this
     didn't solve the problem.
   - They suggested applying service to the spot.  I did, but this also
     didn't help.

7) Since /etc/rc.local was commented out of /etc/inittab, you don't have
   a route to the IBM Internal systems (net 9).  To restore that, on the
   node, type
       route add -net 9 -netmask 255.0.0.0 192.168.56.252

8) Check to ensure DCE/DFS got upgraded ok.  I had a lot of troubles
   here.  On the as0204 upgrade, DCE got converted ok, but DFS didn't.
   Checking /etc/dce/cfgdce.log, I saw
       DCE Migration cannot be performed unless no DCE daemons are running.
       0x11315b66: You can attempt to stop the daemons by running the
       command stop.dce, or you can stop them manually.
   so I typed the
       /etc/rc.dce all
   in by hand, and DFS came right up.

9) Check /etc/inittab.
   - Replace the
         rccleancred:2:wait:/etc/dce/rc.cleancred -v > /dev/console 2>&1
     line with
         cleanupdce:2:wait:/usr/bin/clean_up.dce > /dev/console 2>&1
   - Restore the /etc/rc.local line that was commented out when you
     started.
   - Check for any other anomalies.  On the 206 update, I had to
     uncomment these 3 lines from /etc/inittab
         : imnss:2:once:/usr/IMNSearch/bin/imnss -start imnhelp >/dev/console 2>&1
         : imqss:2:once:/usr/IMNSearch/bin/imq_start >/dev/console 2>&1
         : httpdlite:2:once:/usr/IMNSearch/httpdlite/httpdlite -r /etc/IMNSearch/httpdlite/httpdlite.conf >/dev/console 2>&1
     although I can't swear they weren't in there before the upgrade.
     Note on the 205 install:  I saw these lines
         : i4ls:2:wait:sh /etc/i4ls.rc >/dev/console 2>&1 # start svlum
         : imnss:2:once:/usr/IMNSearch/bin/imnss -start imnhelp >/dev/console 2>&1
         : httpdlite:2:once:/usr/IMNSearch/httpdlite/httpdlite -r /etc/IMNSearch/httpdlite/httpdlite.conf & >/dev/console 2>&1
         : imqss:2:once:/usr/IMNSearch/bin/imq_start >/dev/console 2>&1
     in /etc/inittab before the migration, so they must have been in
     there before the 206 update, too.

10) To upgrade the PSSP code, set the node's bootp response to customize,
    copy the PSSP 3.1 pssp_script to the node and run it.
    From the node itself,
        mkitab -i rc 'fbcheck:2:wait:/usr/sbin/fbcheck 2>&1 | alog -tboot > /dev/console # run /etc/firstboot'
        installp -u ssp.jm ssp.css
    From the CWS, type
        spbootins -r customize frame# node# 1
        pcp -was0xxxe0 /spdata/sys1/install/pssp/pssp_script /tmp/pssp_script
    Now before you do this last step, check /tftpboot/script.cust on the
    CWS.  Comment out the updating of everything 'cept /etc/hosts, that
    is, comment out /etc/resolv.conf, /etc/passwd, /etc/group,
    /etc/security/passwd, /etc/security/group, and /etc/security/user.
    I had to use ADSM to restore these files on 206.
    Then from the CWS, you can type
        dsh -was0xxxe0 /tmp/pssp_script
    or more simply, from the node itself,
        /tmp/pssp_script
    You can monitor the progress of this by
        tail -f /var/adm/SPlogs/sysman/as0nnne0.config.log
    on the node.

11) After the install, check to see what you've got.  For example, run
        oslevel -l 4.3.2.0
    to see which filesets still need updating to get to AIX 4.3.2.
    For the 201, 204 & 206 updates, I've had to run
        installp -u bos.msg.en_US.diag.rte
    For the 201 & 204 updates, I also saw 3 devices.ssa.* filesets were
    still at the 4.2.1 level.  I'm not sure why those filesets didn't get
    updated, probably because we used to have SSA drives on this system,
    but they got moved when we moved the DB/2 database to as0209e0/1?  I
    dunno.  From the node, you can
        mount cws:/spdata/sys1/install/aix432/lppsource /mnt
    and
        smitty update_all
    Interestingly enough, this reinstalled ssp.jm & ssp.css.  But these 3
    filesets,
        Fileset                     Actual Level    Maintenance Level
        ---------------------------------------------------------------
        devices.ssa.disk.rte        4.2.1.10        4.3.2.0
        devices.ssa.IBM_raid.rte    4.2.1.7         4.3.2.0
        devices.ssa.tm.rte          4.2.1.5         4.3.2.0
    were a problem to install.  Had to go to "Install and Update from
    LATEST Available Software", and explicitly select the base 4.3.2.0
    filesets for these 3 guys, viz
        devices.ssa.IBM_raid.rte 4.3.2.0
        devices.ssa.disk.rte 4.3.2.0 and
        devices.ssa.tm.rte 4.3.2.0
    When you're done, umount /mnt.

12) Put the IBM message lines back in /etc/motd so that AIXCOPS will run
    clean.

13) Check for known errors,
        ls -l /etc | grep ' 0 '              # 8 zero-length files are ok.
        ls -l /usr/bin | grep ' 0 '          # There should be no zero-length files.
        ls -l /usr/sbin | grep ' 0 '         # There should be no zero-length files.
        ls -l /etc/services /etc/inetd.conf  # Ensure these are over 4,000 bytes big.
        grep -v '^#' /etc/inetd.conf         # Should only be sshdfail, kshell, klogin,
                                             # and maybe clutild and ssalld.
    If you see a bunch of crap, run the lockdown.sh script.  E.G.
    on the as0203 upgrade on 4/4/00, I had rstatd-udp, rusersd-udp,
    rwalld-udp, sprayd-udp, pcnfsd-udp, xmquery-udp, login-tcp, fbcheck,
    shell-tcp, rcnfs, and nimclient.
        grep '^start' /etc/rc.tcpip          # Should only be syslogd, inetd, xntpd,
                                             # sshd and maybe portmap.

14) Ensure en1 is defined correctly, that is, 100 Full_Duplex, not
    10 Half_Duplex.  On the 4-4-2000 as0205 install, en1 (192.168.56.21)
    didn't come up, because the SDR information regarding en1 had been
    omitted.  I went through and ensured all the splstdata -a info for
    all the nodes was correct (most was simply omitted).

    Note as of 4-4-2000, we still had our original, old frame 1 nodes and
    only nodes 1,6,13-16 had 10/100 ethernet cards in en1.  Nodes 3-5 &
    7-12 were only 10 Mbps and half duplex.  We have 4 drawers/8 nodes
    ready to replace nodes 3-10, but that's waiting on our power supply
    upgrade.  When those nodes get replaced, we'll swap node 6's 10/100
    card into node 11, then we only need 1 more 10/100 card for node 12.

    On the 4-4-2000 as0203 install, even the above wasn't enough.  The
    en1 interface still came back as 10_Half_Duplex after the reboot.
    Next time, before the reboot below, check the setting with a
        odmget -q 'name=ent1 AND attribute=media_speed' CuAt | grep value
    (not lsattr -E -l ent0 -a full_duplex like it was above??  Hmmmmm.)
    If need be, change the database only with a
        chdev -l ent1 -a media_speed=100_Full_Duplex -P
    command.  Then reboot & see if it comes up ok.

15) shutdown -Fr to ensure a clean boot.
---------------------------------------------------------------------------
To open a R/O window to a node's console, type
    s1term 2 13
To open a R/W window to a node's console, type
    spmon -open 2 13
---------------------------------------------------------------------------
To do a "Manual Node Conditioning", which I had to do when I got in the 4
new silver nodes and again when the S70 came in,

Open a R/W window via
    spmon -open node23
for example, then power off the node, then power it on again.
At the "memory keyboard network scsi ..." prompt, hit 1, then enter to get
into the "System Management Services" screens.  If you have multiple nodes
to do, it's better to do them one at a time 'cause it seems to miss your
"1" keystrokes if you do even two at a time.
Select "3" = "Utilities", then
       "4 Remote Initial Program Load Setup"
then select the I.P. adapter and set the I.P. parameters as appropriate.
Get back out to the main "System Management Services" menu, and select
       "2 Multiboot" then
       "3 Install From" then
       "3 Ethernet ( Integrated )"
This will reboot the node from the Integrated ethernet adapter, so if you
have the CWS already set up to install, this will start that process.
After AIX gets rewritten, the node will do the normal node-startup thing
of sending the CWS a bootp packet to ask what it should do.
---------------------------------------------------------------------------
The command spmon -gui calls under the covers to bring up the LED display
is spled.
---------------------------------------------------------------------------
To refresh the hats/hags/haem, etc daemons in an SP complex, from the CWS
type
    syspar_ctrl -r                      (refresh)
Or for individual nodes, from the node itself, type
    /usr/lpp/ssp/bin/syspar_ctrl -k     (kill)
then
    /usr/lpp/ssp/bin/syspar_ctrl -s     (start)
---------------------------------------------------------------------------
There are four different microcodes associated with an SP/2; all are
distinct and are updateable by the customer in various ways.  They are

1) System Firmware:  For each node.  You can see the level you're at by a
       lscfg -vp
   command.  Look for the first occurrence of the
   "ROM Level (alterable) ..." line under the "System Firmware:" section.
   For example,
       ...
       Name:  rom
       Model:  AM29F040-PLCC
       Node:  rom@fff00000
       Physical Location: P2

       System Firmware:
         ROM Level (alterable).......L98112
         Version.....................RS6K
         System Info Specific.(YL)...P2
       ...
   On 7-20-98, I updated this firmware for each of the 6 nodes in the
   second frame, by getting the WIL98112 IMGBIN file from AIXTOOLS,
   pcp-ing it to each node as /var/wil98112.img, then dsh-ing this
   shutdown command,
       shutdown -Fu /var/wil98112.img

2) Service Processor Firmware:  You can see the level of this firmware by
   the same lscfg -vp command.  Look for the "ROM Level (alterable)..."
   line in the "IBM,sp" section.  For example,
       ...
       Name:  IBM,sp
       Node:  IBM,sp@ie8
       Physical Location: P2-X1

       Service Processor:
         Part Number................. 93H4228
         EC Level.................... E76324A
         FRU Number.................. 93H4214
         Manufacture ID..............3966-1944843
         ROM Level.(non-alterable)...19980414
         Serial Number...............00007472
         ROM Level (alterable).......19980414
       ...
   On 7-20-98, I attempted to update this firmware on the new SP/2 nodes,
   but the version on AIXTOOLS in the 7025F50S PACKAGE file was SP980413
   IMGBIN, and each node had level SP980414, so nothing got changed when
   I tried updating it.  The procedure to update it, by the way, is to
   get that SP980413 IMGBIN file copied to each node to
   /var/spflash.img, then invoke diag,
   - "Task Selection(Diagnostics, Advanced Diagnostics, Service Aids, etc.)"
   - "Update System or Service Processor Flash" near the bottom.
   - You can then point it to the file, /var/spflash.img to get it to
     update.

3) Node Supervisor Firmware:

4) Frame Supervisor Firmware:  This is the firmware the PSSP code knows
   about and is now complaining that I should update.
   An "spsvrmgr -G -q status all" command returns

   spsvrmgr: Frame Slot Supervisor Media        Installed    Required
                        State      Versions     Version      Action
             _____ ____ __________ ____________ ____________ ____________
               1    0   Active     u_10.3c.0709 u_10.3c.060c Upgrade
                                   u_10.3c.070c
             _____ ____ __________ ____________ ____________ ____________
               2    0   Active     u_10.3c.0709 u_10.3c.070c None
                                   u_10.3c.070c
                   ____ __________ ____________ ____________ ____________
                    1   Active     u_10.3e.0703 u_10.3e.0704 None
                                   u_10.3e.0704
                   ____ __________ ____________ ____________ ____________
                    2   Active     u_10.3e.0703 u_10.3e.0704 None
                                   u_10.3e.0704
                   ____ __________ ____________ ____________ ____________
                    3   Active     u_10.3e.0703 u_10.3e.0704 None
                                   u_10.3e.0704
                   ____ __________ ____________ ____________ ____________
                    4   Active     u_10.3e.0703 u_10.3e.0704 None
                                   u_10.3e.0704
                   ____ __________ ____________ ____________ ____________
                    5   Active     u_10.3e.0703 u_10.3e.0704 None
                                   u_10.3e.0704
                   ____ __________ ____________ ____________ ____________
                    6   Active     u_10.3e.0703 u_10.3e.0704 None
                                   u_10.3e.0704

   I'm guessing that slot 0 refers to the Frame Supervisor, and the other
   slots 1-6 refer to the Node Supervisor.  One can use smit to update.
   The above command was done after I had updated the CWS & Frame 2 to
   PSSP 2.4, but hadn't updated the Frame 1 nodes yet.
---------------------------------------------------------------------------
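A quick way to pick the rows that want updating out of that spsvrmgr
listing, as a sketch only.  It assumes each data row reads
"frame slot state media installed action" (or the same without the frame
number, for rows that inherit the frame from the row above), and that the
state is a single word like "Active".  needs_upgrade is a made-up helper
name, not a PSSP command.

```shell
# needs_upgrade: read an spsvrmgr status listing on stdin and print the
# frame/slot of every supervisor whose Required Action is "Upgrade".
# (Hypothetical helper; the column layout is assumed from the sample.)
needs_upgrade() {
    awk '
        # Data rows start with a number; header, separator and
        # media-version continuation rows do not, so skip those.
        $1 !~ /^[0-9]+$/ { next }
        NF == 6 { frame = $1; slot = $2 }   # row that names its frame
        NF == 5 { slot = $1 }               # row that inherits the frame
        $NF == "Upgrade" { print "frame " frame " slot " slot }
    '
}

# Try it on a few rows copied from the listing above.
needs_upgrade <<'EOF'
  1    0   Active     u_10.3c.0709 u_10.3c.060c Upgrade
                      u_10.3c.070c
  2    0   Active     u_10.3c.0709 u_10.3c.070c None
       1   Active     u_10.3e.0703 u_10.3e.0704 None
EOF
```

On the CWS one would presumably feed it the real thing, e.g.
    spsvrmgr -G -q status all | needs_upgrade
---------------------------------------------------------------------------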