To contact the AIX Support Center, call 9-1-800-225-5249 (CALL-AIX),
 - then option 1 (we have an AIX Support Line Contract and need Hardware or
   Software Technical Assistance),
 - then option 1 (for Technical Assistance on AIX).
 - Your "Customer Number" (aka "Entitlement Number") is 5336519, or if you want
   to use CS's Entitlement Number, it's IN00018.  Bruce says this IN00018 number
   is still good for Delphion's use.
 - They assign your call a "Problem Number" (aka an "Item Number").
(I used to call 1-800-237-5511, but their option 3 (AIX) seems to get you to the
same place.)
=========================================================================================
5/20/96    Finally got my hands on an RS/6000 model 43P in order to track down
why it doesn't want to install AIX 4 from NIM.  Tony Rall did some tracing and
discovered what this guy in Raleigh discovered and reported in the POWERPC FORUM
last February.  I called Chris Dukes and got names and CMVC defect # 2849.  The
people working on it are Shailendra Bhatnagar, SHAIL@AUSVMB, a contractor,
Steaven Sombar, STEVEN@AUSTIN, and Scott Walton, SCOTTW@AUSNOTES.  Chris said
they've been working on this defect for 113 days.

----- POWERPC FORUM appended at 19:19:54 on 96/02/02 GMT (by PAKRAT at RALVM17) -----
Subject: Bootp on RSPCs on TR... this is why it doesn't work.
Ref: Append at 20:38:57 on 96/02/01 GMT (by VRBASS at ATLVMIC1)

For those of you with RSPCs in a TR environment that just can't seem to boot off
of your NIM server, here are the entertaining results of a network trace.  The
RSPC sends out a broadcast frame containing an ARP packet so that it can locate
the router.  The router sends a source-routed frame containing an ARP packet
(so that it can go through the right bridges) back to the RSPC.  The RSPC is
supposed to take the RIF field of that frame, reverse the information, and store
it to handle the source routing of all subsequent frames.  However, the frame
the RSPC sends out to hold the directed BOOTP packet has a BLANK RIF field.
(The ROM IPL emulation disk, as well as the firmware in MCA-based RS/6000s, does
correctly fill out the RIF field.)  As the RSPC's frame goes around the ring
without being taken by anyone, the directed BOOTP request fails.  When the RSPC
sees this failure, it proceeds to send a broadcast BOOTP packet, which is
promptly ignored.
Chris Dukes
------------------------------------------------------------------------

Called the Support Center to get on their callback list.  The Problem/Item
number is AZ4813.  They opened a PMR (# 0X08349R) and I should get a call back
when the fix is available.  I called Shailendra at 1-512-838-8573 to try to find
out the status of CMVC defect # 2849 and left a message.  He called back and
said the defect has been fixed, but is not yet available.  I asked if there was
a way I could get a copy of the fixed firmware early.  After notes to Scott
Walton (scottw@ausnotes) and Steaven Sombar, I got a pre-V1.10 version at
/afs/austin/depts/19yb/public/images/TestBuilds/carol.net-fixes/car96115.img.
Follow the directions in the README to dosformat/doswrite it to diskette along
with the flash.6xe executable, type eatabug at the feed-me-the-SMS-diskette
graphic to get into the "resident monitor", then type flash -a to update the
43P firmware.  The only trick was that I had to write the .img file to diskette
with a leading 'P' in the filename (PAR96115.IMG) 'cause the flash program looks
for any file on diskette with the pattern P*.IMG.
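
For next time, here's roughly what the diskette prep boils down to (a sketch
only, assuming the stock AIX DOS utilities writing to the default diskette
drive; the README that comes with the image is the authority):

    dosformat                            # DOS-format the diskette in /dev/fd0
    doswrite flash.6xe flash.6xe         # copy the flash utility onto it
    doswrite car96115.img PAR96115.IMG   # rename so it matches the P*.IMG pattern
    # Then boot the 43P with the diskette in, type "eatabug" at the
    # feed-me-the-SMS-diskette graphic to reach the resident monitor, and run:
    #     flash -a
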
If you ever want to get the real stuff, see the directory
/afs/austin/depts/19yb/public/images/PowerSeries830_850, where you'll find
V1.02, V1.04, V1.06, V1.07, V1.08, and V1.09.  Presumably, we'll want the V1.10
level when it eventually comes out.

This firmware update got the 43P to boot across our token ring as well as it
does with ethernet, but both ways now stop somewhere in the tftp'd nucleus,
probably (according to the forum) in rc.boot where it does the bootinfo -c.
But fixing rc.boot according to the forum didn't fix the problem.  I also tried
turning debugging on in the nucleus and connecting up the 3164, but I didn't get
anything to come out at that console.  Time to call the Support Center back
again.  They gave me a new Item # and PMR #.  They are not the same thing, but
they're treated almost the same: the Item # is used locally at the Support
Center to track calls, while a PMR # is used in RETAIN.  The new numbers are
Item # BJ2181, PMR # 0X322, branch office 49R.  They're going to have their NIM
expert, Kyle Cline, give me a call.  Kyle is KCLINE@AUSVMR, 8-793-9294.

   ********************************************************
   * By the way, this bug in the 43P firmware also exists *
   * in the PowerPC 400 firmware, but Scott Walton says   *
   * the 400 is out of service and they don't intend to   *
   * fix that firmware.  The bottom line is you can't     *
   * network-install a 400 machine on our token ring      *
   * network.                                             *
   ********************************************************
=========================================================================================
AZ9944     Once every couple of weeks, Flick's machine gets into a state where
6/02/96    almost nothing works on it.  He was able to get a root window by
rexec-ing a ksh, but only a subset of commands were working.  SRC, for example,
was broken -- it would eventually time out.  Flick thought it had something to
do with sockets.  The Support Center found that there were two srcmstr processes
running, one a parent of the other.  The child srcmstr could not be killed.
They thought this was fixed (as well as various named problems, by the way) in
the latest service, available through fixdist via IX58114.  (Note on 9-18-96:
When searching for the latest fixes in fixdist, look for the text string
"latest aix 4.1".  You'll find a few of them.  Just select the latest one.  The
APAR/PTF number will be different; for example, the August APAR is IX61071.)
I downloaded 330 MB into /afs/alm/common/inst.images/4.1.4.  Now I gotta figure
out what to do with all that stuff, i.e. how to apply it onto Flick's machine,
nemesis, and the spot, spot1.
=========================================================================================
BJ6838     Mike Carey's console on bucky quit working after a normal reboot.
6/03/96    Diskette-based diagnostics work fine on it, but hard disk-based
diagnostics and AIX 4.1.4 both act like they don't "see" the console.  The error
log (errpt -a) shows software errors:

    LABEL:           GRAPHICS
    IDENTIFIER:      E85C5C4C
    Resource Name:   CFGLFT
    Description:     SOFTWARE PROGRAM ERROR
    Detail Data:
    DETECTED  FAILED     RC  ERROR  LOCATION
    cfglft    build_dds  45  1114   31

and in the next error,

    cfglft    open       -1  1119   18

Dan in the Kernel Group said that IX53983 would fix my problem and that it
wasn't available through fixdist yet.  He sent me an 8mm tape, which I loaded
from maple and put into /afs/alm/ais/322temp/carey_fix.  It turned out to be
just a few files, the two relevant ones being the .toc and
devices.graphics.com.usr.4.1.4.
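
Roughly what applying that fix amounts to (a sketch, not a transcript -- I use
"all" here rather than guess the exact fileset name, and the preview line is
optional):

    cd /afs/alm/ais/322temp/carey_fix
    installp -p -agX -d . all    # preview what would be applied from this directory
    installp -agX -d . all       # apply it (and any requisites), expanding
                                 # file systems if needed
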
I applied it to bucky and rebooted, but the symptoms still exist.  I called Dan
back and left a message on Friday, 6/7/96 at 11:45.  Dan called back and said he
was going to queue my call to the graphics group.  A neat command to see if a
particular fix is applied to a machine is instfix -ik IX53983.  Also checked out
RETAIN, but didn't learn much.  There are two hits on the error label, E85C5C4C.
One was the above fix, and the other was for a GXT1000 graphics card, which Mike
didn't have.  He has the standard 1-1 card, a "Color Graphics Display Adapter".

Called the Support Center back on Monday (6-10) at 2:25.  Talked to Mike Olivier
(OLIVIER at AUSTIN, 8-793-6932) in the Graphics group, who said he'd investigate
and call me back today.  Mike called again and wanted the output of a
"configuration manager" command (cfgmgr -v), which showed no error configuring
gda0 (the graphics adapter), but did show an error configuring lft0:

    Method error (/etc/methods/cfglft):
        0514-045 Error building a DDS structure.

and the output of the errpt -a command showing the two cfglft errors described
above.  I sent him both in a note, but I sent them to his VM userid.  Mike
called on Tuesday and I pointed him to them.  Called again Thursday (6-13) at
9:35 and left a message to get a status check.  Mike returned the call with the
suggestion to change the permissions of the special file /dev/lft0 from 662
(I think it was) to 666.  I did, and left a message for Mike Carey to reboot and
see if that fixed things.  Called Mike Carey back Monday morning (I took Friday
off) and no, the terminal still wasn't working.  Left a msg for Mike Olivier
Monday morning at 10:10.

Mike Olivier called again asking if we had a good system backup (I didn't
clarify exactly what kind of backup he was talking about).  He said he suspects
a corrupted ODM data base.  It sounded to me like he wasn't certain and was just
guessing, so I pushed back and suggested we verify that it's really a corrupted
ODM and possibly repair it if we can.  He said he'd investigate and get back to
us.  Mike Olivier called again at 2:00 and this time he recommended running fsck
in service mode as root.  He also noticed 5 instances of power failures (see
EPOW_SUS in the errpt) and was asking why there were so many.  Was the user
pulling the power plug on the box?  One in particular happened 5-29 at 3:15,
maybe about the time Mike Carey had the original problem.

Mike Carey couldn't run diagnostics (remember?) from the hard disk and said he
was just going to reinstall.  I tried calling him back to say, "Wait, we can run
diags from diskette.", but he wound up reinstalling everything anyway.  I called
Mike Olivier back to cancel this call, but I'm really disappointed that we
couldn't identify what exactly was wrong and fix it.  The cause, though, is
likely the fact that Mike Carey often powers his machine off when it's hung up
instead of hitting the yellow reset button twice to reboot.
=========================================================================================
CA8212     Called in our named-hang problem with AIX 4.1.4.  The named daemon
06/21/96   often (a few times a week on nim) simply refuses to resolve names.
Things look ok, but a "host jasper" (say) returns

    host: 0827-801 Host name jasper does not exist.

Recycling named fixes it 'till the next time.  nim (and now jasper, dale, and
ech) are at the latest level, UM443183.
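
(A couple of stock commands I lean on for checking levels like this -- the
fileset names are the TCP/IP client/server ones involved here, and IX58114 is
just the APAR mentioned above used as an example:)

    lslpp -l bos.net.tcp.client bos.net.tcp.server   # installed fileset levels
    instfix -ik IX58114                              # is a given APAR's fix applied?
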
Meghan O'Brien (meghan@austin.ibm.com at 8-793-7115) told us to turn on tracing
with the named option -d 1 (the 1 says debug level 1, which is the least amount
of data; 9 is the most).  The log is kept at /var/tmp/named.run.  When named
fails, send Meghan the last few k-bytes of the log.  By the way,
kill -USR1 `cat /etc/named.pid` will increment the debug level, while
kill -USR2 `cat /etc/named.pid` will turn off debugging.  I have debugging
turned on on jasper, nim, ech, and cabernet.

Also by the way, when you talk to others about this, tell them we're running
named on all our client machines "as cache-servers".  Normally only your name
servers run named.  We run it here this way to get two advantages: 1) we have a
backup name server, and 2) I.P. addresses are cached locally instead of at the
name server.  Once Meghan understood this, she was more comfortable with our
setup.

Came in on Monday morning and tried a host jasper on nim.  Worked ok.  Did some
work and a few minutes later, named was hung.  Sent Meghan the last 3,000 lines
or so of the named.run file as well as a software error reported in the errpt
(turns out the errpt entry was due to my sending named a kill -9 signal, so it
wasn't really relevant).  Meghan called back Tuesday (6-25) morning and said
there wasn't anything in the named.run.  She wanted our /etc/named.boot & .ca
files and also to turn on syslog.  The steps required for turning on syslog are
 - Edit /etc/syslog.conf & add the line
       daemon.err    /tmp/named.syslog
   or whatever file you want to send this stuff to.
 - touch /tmp/named.syslog
 - refresh -s syslogd
 - stopsrc -s named
 - startsrc -s named
A curious thing: the first line that gets logged in that file is one that
complains of the domain line in /etc/named.boot, which seems perfectly ok to me.
    Jun 25 09:28:15 jasper named[6208]: \
        /etc/named.boot: line 1: unknown field 'domain'
I've changed nim, jasper, cabernet, & ech.

named failed on ech on Wednesday morning (6-26), but there's nothing else in the
syslog file except for that initial line complaining about the domain record.
I called Meghan at 11:10 and left a message.  Meghan called back on Thursday
(6-27) and said we shouldn't have a domain line in our named.boot; they aren't
needed for cache-only name servers.  But Meghan says that isn't what's hanging
our named daemon.  Meghan had me send a kill -2 to named, which produces a dump
in /var/tmp/named_dump.db, which I mailed to her along with our
/etc/named.local.  But Meghan is giving up trying to resolve our problem and is
requeuing our call to the "Back-end Communications Group".  Meghan did say we
were a bit down-level, but that wasn't our problem either.  The latest levels
are bos.net.tcp.client 4.1.4.13 (we were at .9), U443181, and
bos.net.tcp.server 4.1.4.11 (we were at .8), U443826.  The problem is, those
fixes aren't available on fixdist yet.  I've called the Support Center to ask
how I'd get these fixes, and also to rattle their cage a bit to get them to look
at our problem.  Meghan said I may have to do this.

Got a call back at 11:20 am Friday (6-28) and Robert Justice talked me through
ftp-ing the fixes from Boulder:
    rftp aix.boulder.ibm.com      (login as anonymous; there's a ls -Ral file in
                                   the root directory to help find things)
    cd /aix/fixes/v4/os           (or devices or X11 or xlc or ??;
                                   the DCE fixes are in /aix/v4/fixes/other)
    binary
The fixes were in the files bos.net.tcp.server.4.1.4.11.bff and
bos.net.tcp.client.4.1.4.13.bff.  I put them in /afs/alm/common/inst.images/4.1.4
and ran inutoc on that directory as root.
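
A quick sanity check that the directory is usable before pointing smit at it
(a sketch using standard installp behavior; the path is ours):

    # list what the directory offers and confirm the new TCP filesets show up
    installp -l -d /afs/alm/common/inst.images/4.1.4 | grep bos.net.tcp
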
From nemesis, smitty install_fileset, directory=/afs/alm/common/inst.images/4.1.4, and for the "FIXES to install" field, you need the IX* numbers, which you can get from the .toc file. In this case, IX54156 IX53663 is good. I put the fixes on both nim & jasper, and recycled named. Dale decided to apply all the fixes in common/inst.images/4.1.4, so cabernet has the updated stuff, too. Rick put the fixes on ech also. This problem's been transferred to Scott Tanquary. I called 7/8/96 at 4pm requesting a callback and again 7/9/96 at 11:30. Scott's tie line is 8-793-6725. Scott called back at 11:50. We talked through some things but I couldn't answer some of his questions on how our nameservers were configured. I got the root password from Tony Rall (l0vebars). Called Scott back Wednesday (7/10) evening and left a message. I called him again, this time to his personal phone (evidently, these support guys have 2 phone lines and they sometimes choose to not answer their support line). This time Scott told me that for cache-only name servers, the root nameservers should be what's defined in /etc/named.ca, that is the name servers for the whole internet. We have our Almaden nameservers defined. I talked to Tony Rall about this and he kinda agreed, but said it didn't have to be the internet root name server, having just the root name server for ibm.com is good enough. Tony came up with the following /etc/named.ca which I've installed on nim & jasper late Friday (7-12). We'll see how this fairs. ; Root IBM nameservers . 86400 IN NS pollux.cbe.ibm.com. 86400 IN NS leda2.cwp.ibm.com. 86400 IN NS castor.cdf.ibm.com. leda2.cwp.ibm.com. 86400 IN A 9.14.1.3 castor.cdf.ibm.com. 86400 IN A 9.78.1.2 pollux.cbe.ibm.com. 86400 IN A 9.14.1.2 Scott Tanquary wants to know our status with this problem. I haven't seen the named failure on jasper, nim, or cabernet, so I'm willing to say it's a named configuration problem. Tony Rall wants to try Tom Engelsiepen's suggestion of putting in specifying our nameservers as serving the almaden domain. Tony'll implement that in our nameservers and we'll test that. Meanwhile, I called Scott and had him close this problem. At least, I tried to. Scott's evidently in the process of changing his number to 8-793-4177 and neither number is accepting messages. I sent Scott a note instead. 10/21/96 We are seeing the same named hang on AIX 4 systems as we saw last June. We may need to start tracing named again. ========================================================================================= CA8539 Called in a problem with nim after putting on the latest 06/21/96 (April) fixes (IX57483, the PTF #). Kinda bizarre, but I Barkat saw after a nimadd xwing, 4 things get put in /etc/exports, then propogated to /etc/xtab with an exportfs command, but there was one file missing in /etc/exports but in /etc/xtab. Specifically, These directories below, were they in | exports? | xtab? ----------------------------------------------+----------+-------- /inst.images | Yes | Yes /export/spot1/usr | Yes | Yes /home.native/root/Install_Scripts/Do-setupnet | Yes | Yes /export/nim/scripts/xwing.script | No !!! | Yes Thus, any further nim activity like another nimadd, would put the new stuff in /etc/exports, but the missing xwing.script wouldn't be in /etc/xtab, and the install would fail when it tried NFS-exporting the xwing.script. When the install fails, the alog -t bosinst -o shows these errors, Network installation manager customization. 
rc=175 0042-175 c_script: An unexpected result was returned by the "/usr/sbin/mount" command: mount: 1831-011 access denied for NEMESIS.ALMADEN.IBM.COM:/export/nim/scripts/xwing.script mount: 1831-008 giving up on: NEMESIS.ALMADEN.IBM.COM:/export/nim/scripts/xwing.script The file access permissions do not allow the specified action. I backed up and did each step one at a time, that is, I did the nim -o allocate for the lpp_source, my Do-setupnet script, and spot, and checked /etc/exports & /etc/xtab. All was ok. I then did the nim -o bos_inst -a source=spot -a no_client_boot=yes xwing and things were also ok. All four lines were in both /etc/exports and /etc/xtab as expected. And sure enough, the install went fine. Shit! I can't seem to get it to fail again. We agreed to close this problem and open up another one should it fail again. Meanwhile, I've written a little script and stuck it at the end of my nimadd & nimadd-isa scripts to check for this situation. Also, Kyle of the Support Center suggested lsnim -Fl machine_name to check for the correct number of "exported = ..." lines. ========================================================================================= CB0404 Darrell Long (7-2376 in H2-805) wanted to install AIX 4.1.4 6/26/96 on his 7012-340, which has a working ethernet adapter, but Sunita the IPL ROM Emulation diskette doesn't recognize it. It's got a 2-8 on the back, which isn't in my list of adapters. A lscfg on Darrel's machine (running 3.2.5) shows ent0 00-00-0E Standard Ethernet Adapter The adapter is labeled "Thick/Thin" & has one of those 2-bank jumpers on it to switch it from one to the other. It's correctly on thick now. This card is only about 4 inches square with a diagonal notch taken out of it. Adapter number 2-8 isn't in my list of adapters. Curiously, *my* 340 also has an ethernet adapter in the back, but it's labeled 2-9, not 2-8. Also, a lscfg on my machine shows this to be a ent0 00-00-0E Integrated Ethernet Adapter I'm not using the card. My 340 is connected to token ring. I've tried my 2-9 card in kipling and the IPL ROM Emulation code didn't recognize that either. I pulled my ethernet card out of aix-test to let Darrell Long borrow it just for the install. Afterwards, we switched to his 2-8 adapter and AIX 4.1.4 is happilly running with it. I told Sunita all this Thursday (6-27), but told her I wanted to pursue it. Either say we don't support the 2-8 and 2-9 adapters, else fix the IPL ROM Emulation diskette to recognize them. Ok, pass the crow. The 340 *does* have BOOTP-enabled IPL ROM, so you *can* boot without the IPL ROM Emulation diskette. Why did I think otherwise? I called the Support Center and had them cancel this PMR. ========================================================================================= 3147X,49R Called to get on the Interested Parties list for ADSM APAR IC13320. 9/05/96 In the past, this has been a simple thing that the Support Center handled on a routine basis. But this time, the first time I called, the woman said to call 1-800-879-2755, which turned out to be the "Software Manufacturing Solutions" line, option 1 got to the "National Publication Ordering Center", and option 2 the "Software Support Center", who had never even heard of an Interested Parties list and "we use RETAIN all the time." 
I called back to the AIX Support Center and talked to a different woman, who also didn't know how to add me to the IP list, but at least she created a PMR (#3147X, branch office 49R) and called the ADSM Support people, who said *they* would add me. Whew! What a hassle. Anyway, here's the append from the ADSM FORUM on IBMUNIX. ----- ADSM FORUM appended at 03:52:02 on 96/08/22 GMT (by MARNATTI at LEXVM2) - Subject: Passwordaccess = GENERATE Ref: Append at 18:27:18 on 96/08/14 GMT (by MARNATTI at LEXVM2) This is a follow up to the note I appended to this forum asking about conditions that would cause a generated password to need to be reset on the client side, since I was seeing where this needed to be done at strange times. It turns out that there is an open APAR, IC13320, for this problem with no PTF ready yet. This APAR describes what I was seeing; during repeated ADSM archiving operations the client's password file will up and disappear. In case anyone else is experiencing this problem, a temporary work around, according to this problem record, is to place a 2 second delay between ADSM client calls for archive operations. John Marnatti ISSC, Lexington, KY ========================================================================================= CE9917 Ron Moore's 43P won't install via NIM. He's got a token ring machine (dinorm) 9/18/96 that is configured correctly. It contacts nim, tftp's the initial boot image Choon over, grays out the lower portion of the screen, but doesn't do anything after that. I allocated the debug-spot and traced it using the console concentrator and ate so that I can trap all the console messages. What's happening is between led's 610 and 612, the following mount is done mount -r NEMESIS.ALMADEN.IBM.COM:/export/debug-spot/usr /SPOT/usr Since there's precious little in this initial nucleus to debug stuff with, I changed the /export/debug-spot/usr/lib/methods/showled module to the normal ls command from /usr/bin, added a couple of ${SHOWLED} -ld /SPOT/usr commands to /export/debug-spot/usr/lpp/bos/inst_root/sbin/rc.boot, and rebuilt the initial tftp'd nucleus with a nim -o check -F -a debug=yes debug-spot command to see the state of affairs both before and after the mount (around line 760). Here are the relevant lines from that trace + /usr/lib/methods/showled -lR /SPOT <---- Remember, this is the ls command total 16 -rw-r--r-- 1 0 0 1060 Dec 1 14:29 niminfo drwxr-xr-x 2 0 0 512 Sep 18 1996 usr <---- All's ok before. /SPOT/usr: total 0 + /usr/lib/methods/showled 0x610 0x610 not found + mount -r NEMESIS.ALMADEN.IBM.COM:/export/debug-spot/usr /SPOT/usr + [ 0 -ne 0 ] + /usr/lib/methods/showled 0x612 0x612 not found + [ -d /SPOT ] + /usr/lib/methods/showled -lR /SPOT total 8 -rw-r--r-- 1 0 0 1060 Dec 1 14:29 niminfo ---------- 0 0 0 0 Jan 1 1970 usr <---- Weird !!! + [ -d /SPOT/usr ] + cp /SPOT/usr/lib/boot/network/rc.bos_inst /etc <---- Fails (of course). cp: /SPOT/usr/lib/boot/network/rc.bos_inst: Not a directory ... The rc.boot script continues on for a little bit more, and actually fails with a "Illegal Trap Instruction Interrupt in Kernel", and if you do a "g" to get it going, you quickly get one more line, LED{0A8}, which I have no idea where that's coming from. I've rebooted nim. That didn't do any good. I've installed another token ring connected 43P (Jim Hafner's triumph-t). That's going fine. I could try to install with the built-in ethernet, but haven't done that yet. Time to get help. 
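
(For the record, the debug-spot instrumentation above boils down to something
like this -- a sketch; the cp source path inside the SPOT is my assumption, and
the nim check command is what rebuilds the debug boot image:)

    # replace the showled method with a real ls so rc.boot can show directory state
    cp /export/debug-spot/usr/bin/ls /export/debug-spot/usr/lib/methods/showled
    # (then add "${SHOWLED} -ld /SPOT/usr" lines around the mount in
    #  /export/debug-spot/usr/lpp/bos/inst_root/sbin/rc.boot, near line 760)
    # rebuild the tftp'd boot image with debugging enabled
    nim -o check -F -a debug=yes debug-spot
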
Background info: I've installed the latest microcode from the Austin afs tree,
from /afs/austin/depts/19yb/public/images/PowerSeries830_850/v1.10, which is
what I put on Jim's triumph-t machine.  The token ring network adapter
configuration is also correct.  It shows

    Adapter MAC Address    0004ac356d32
    I/O Address            a20
    Interrupt              9
    RAM Address            d00000
    RAM Size               64K
    ROM Address            cc000
    Remote IPL             Disabled
    Token Ring Data Rate   16
    Auto Sense             Disabled

Choon wanted me to apply the latest service, so I got a half gig of updates and
applied them to both nemesis and to the debug-spot.  The install still fails the
same way, a zero-length /SPOT/usr.  I've also tried the install via ethernet.
Same failure.

Robert          I sent Robert my trace (kapture9) showing the ls -lR /SPOT both
9/23/96         before and after the apparently-successful mount command.  He's
9:30 am         going to forward that to somebody else and they should call back
PMR 5183X,      sometime later today.  He's also upped the severity to 2 (I
BO 49R          thought 2 was the default; evidently this was a 3).
                BUCKNER@AUSTIN.IBM.COM or BUCKNER at AUSVMR.

Baltazar        I talked with this guy, too, and sent him the trace to
9/23/96         BALT@AUSTIN.IBM.COM.  He asked a few questions, and went away to
10:15 am        think about it a bit.  Baltazar called back and asked for some
11:50 am        ls -ld commands to check on each directory in
                /export/debug-spot/usr/lib/boot/network/rc.bos_inst.  He also
                wanted me to check the system date & time on the 43P after
                booting the SMS diskette.  It was 01/11/39 and 21:29:00 (this
                being 09/23/96 and 11:54:00).

The problem is fixed now.  It was due to that date being so far off.
Evidently, NFS can't handle date differences that great (or maybe it was dates
past the "epoch"; we weren't quite sure what the details were).  I booted the
machine in maintenance mode from a CD, was able to import the root volume group
from the AIX 4.1.3 system that was on there, and set the date.  FYI, if you ever
have a machine that *doesn't* have an OS on it, you can import the date command
via floppy by following these steps (Baltazar said he was going to document
these steps in the PMR):
    1) From a good machine, dd of=/dev/fd0 if=/usr/bin/date.
    2) Boot from a CD and get into a Limited Function shell.
    3) In order for the new date command to have the execute permission set,
       first cp /bin/t /date.
    4) dd of=/date if=/dev/fd0.
    5) Then you can date mmddHHMMyy.
==============================================================================
CG0223     Flick's complaining about some include file in his C++ program not
10/09/96   working correctly.  He's already got it narrowed down to the socket.h
file, which is part of the bos.adt.include fileset, which was updated in the
latest fixes I got from fixdist.  We have bos.adt.include version 4.1.4.14,
which is bad.  Flick says that 4.1.4.8 was ok.  This is Flick's note to AIX
service:

    socket.h for bos.adt.include 4.1.4.14 does not honor _XOPEN_EXTENDED_SOURCE
    for C++ source.  This works for socket.h at bos.adt.include 4.1.4.8.  In
    particular the third argument to accept is a size_t * (4.1.4.14) when using
    the C++ compiler even though _XOPEN_EXTENDED_SOURCE is undefined.  Can
    someone tell me if this is fixed in a later release of bos.adt.include?

See also the three appends in the AIX4 FORUM with the subject
    Subject: Why did the accept() prototype change from 4.1 to 4.2 ?

12/05/96   Called again 'cause Da Li is having the same problem with some other
4:00 pm    program which includes , which sucks in /usr/include/sys/socket.h.
Da is getting the same kind of errors Flick was getting about incompatible function definitions. I left a message with Dennis somebody asking for him to call me back. Fixdist and anonymous rftp to aix.boulder.ibm.com still showed 4.1.4.16 being the latest bos.adt.include. ============================================================================== CI8707 Rosa says qprt on AIX 4.1.4 doesn't honor the -Y0 (that's a zero) 12/06/96 option. It's suppose to force simplex printing, but it's coming Wes out duplex. I tried it on my machine, jasper, running AIX 4.1.4 using the command qprt -c -P3116c1a -Y0 public_html/test.ps and everything came out as expected. I also tried on maui (AIX 3.2.5) and without the -Y0 flag. Rosa is investigating why 3116c1a works, but 3116g1a and 3116b1a doesn't. ============================================================================== bu0419 Tom Griffin (7-1444) with ocrx1 (root pw=foobar) and K. Mohiuddin 02/11/97 with moidin6k (no root pw), reported by Sandeep Gopisetty (7-2680), 12:50 both report the same problem with their identical machines. Anthony After installing the latest AIX 4.1.4 image, the one I serviced on 1-24-97 with the January fixes from fixdist, they have two problems. 1) Have a corrupted /var/adm/ras/errlog. If you do an errpt, you get the msg The supplied error log file is not valid: /var/adm/ras/errlog. A /usr/lib/errdemon -l (to get the error log attributes) shows, Error Log Attributes --------------------------------------------- Log File /var/adm/ras/errlog Log Size 507 bytes Memory Buffer Size 8192 bytes What's different from a normal system is the Log Size. A /usr/lib/errdemon -s 1048576 to reset the size fixes this problem. -->> Turns out what caused this problem was the /var file system in -->> the spot was full (1 4-MB partition), mostly due to the file -->> /var/adm/ras/installp.log. I changed /export/spot1/usr/lpp/bosinst/image.template -->> (as I have documented in the inside front cover of my nim manual) -->> to make /var initially 8MB. This fixed the corrupted error log. But, what's worse is 2) X doesn't start up. They both have a RS/6000 model 250 (7011-26-38806 in the case of moidin6k), with a built-in GXT150 Graphics Adapter. Luckily, Sandeep showed me a recently-installed (on 1-11-97, before my January service upgrade to the spot), also identically-configured RS/6000-250 named lalita, that works. Before that, according to the /var/adm/ras/nim.installp log, the last time I installed service was on October 25th, 1996, which was probably the September service level. 02/17/97 Haven't heard from anybody for almost a week, so I called in to 12:00 see what's going on. I was queued to Barbara, who helped me a Barbara little bit. She wanted to run diagnostics, but the AIX diag Patterson command as well as the hard disk-based diagnostics (what you get when you boot in service mode) both fail to test the graphics adapter 'cause they think the device is busy. I wasn't able to run my diskette-based diagnostics (nor Dale's) 'cause I got flashing 888's after reading the first diskette. Turns out I need a newer level of diagnostics, orderable through the FE, which I've done. Once I get them, I'll make a bunch of copies for everybody else. 02/19/97 After calling FE in and getting Bob to run his diagnostics, we found out the hardware was fine all along. I called Barbara back and told her this. Meanwhile, I got Bob, the FE, to get me a more current set of diagnostic diskettes so I could run on the 250's. He gave me two sets. 
There's a third problem with these machines -- well, with moidin6k & lalita at
least.  Those two both have token ring adapters in them as well as the ethernet
adapters built into the motherboard.  Those two machines are actually using
their token-ring adapters, with nothing connected to the ethernet.  The problem
is, both machines are getting numerous ethernet adapter errors reported in the
error log, about one every two minutes.  This causes the error log to fill up
and recycle in about 50 hours.  This isn't related to the graphics adapter
problem, 'cause lalita's graphics adapter works fine.
==============================================================================
BU2275     I had another run-in with a 43P.  This one (ananda) is ethernet-
02/17/97   attached, and it wouldn't install.  After tracing it with the debug
10:30      spot, I saw the same problem that the NIM FORUM gave a workaround to
Choon      back in Jan, 1996, that is, add the following 2 lines to
/export/spot1/usr/lpp/bos/inst_root/sbin/rc.boot:

    BOOT_SERV_IP=129.33.24.63
    E802=0

By the way, to connect the portable console to a 43P,
    1) From the 9-pin connector on the back of the 43P, plug in Dale's
       universal, white, sex change cable, the end without the red tape, then
    2) Plug *the same end* - not the other end - (evidently the cable crosses
       some lines over in the middle) into Dale's 22-foot, 25-pin male/female
       to 25-pin male/female, flat grey cable he keeps wound up in the bottom
       drawer in his office,
    3) Into a null modem, then
    4) Into the portable console.
To connect the 43P to Pine, instead of steps 3) and 4) above, plug Dale's flat
grey cable from step 2) into a connection on the back of pine's null-modem
block, top row where the grey connector converts to the flat cable.

I put this fix in last year and haven't had any problems with it since, but this
time, it wasn't enough to fix the garbage the bootinfo -c command was returning.
Here's the relevant stuff from using ate's ctrl-b to capture output:

    + bootinfo -c
    + set -- 255.255.255.255 129.27.20.51 129.27.20.253 0 0 1
      /tftpboot/ananda.almaden.ibm.com
      1.4.255.255.248.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.
      0
    + CLIENT_IPADDR=255.255.255.255
    + BOOT_SERV_IP=129.27.20.51
    + BOOT_GATE_IP=129.27.20.253

The fix I've got in fixes BOOT_SERV_IP, but as you can see, the CLIENT_IPADDR
and BOOT_GATE_IP addresses are both wrong as well.  This causes the network
interface to be set up incorrectly, and the tftp to nim later on fails with
    tftp: sendto: No route to host
and looping on LED 608.  I put in a fix in rc.boot, but this is a very specific
fix for this one machine.  Right after the other two-line fix, I added

    first3="${CLIENT_IPADDR%.*}"        # Get the first 3 octets.
    case "$first3" in
        129\.33\.[89] | 129\.33\.1[0-5])   BOOT_GATE_IP=129.33.8.253  ;;
        129\.33\.2[4-9] | 129\.33\.3[01])  BOOT_GATE_IP=129.33.24.253 ;;
        129\.33\.160)                      BOOT_GATE_IP=129.33.72.253 ;;
        129\.33\.7[2-9])                   BOOT_GATE_IP=129.33.72.253 ;;
        255\.255\.255)  CLIENT_IPADDR=129.33.29.64; BOOT_GATE_IP=129.33.24.253 ;;
        *) ;;
    esac

As you can guess, I initially noticed the BOOT_GATE_IP was wrong and tried
making a generic fix based on the CLIENT_IPADDR, but since that was screwed up
too, I added the last case.  I just updated the firmware to their latest level,
1.11, 'cause a 97/01/09 append in the NIM FORUM suggested this would fix the bad
bootinfo data.  I even called Dick Chimenti, who started the thread, and he said
that yes, it did fix it.  Well, it didn't for me.
I mailed the debug script output to choon@austin.ibm.com. I pointed him to the pertinent lines where the data is all wrong and he's gonna look into it, confer with others, and call me back. 12:00 Choon called back and asked if we put in the I.P. addresses in the 43P network setup screen, with leading zeros or not. We should not have leading zeros. I checked and yep, we did have leading zeros. Changed it and rebooted from nim and things worked fine. Damn! With Power PC's, you don't put in leading zeros, with RS/6000's, you do. How's one suppose to know? Especially if pings & bootp's to nim & tftp's from nim all worked ok. ============================================================================== 85553,49R Having a problem with the Firewall code installed on the patgate 12/30/97 machine (9.1.8.252 & 192.168.56.252). The symptoms of the problem 11:30 are, from root@patgate, I could nfs-mount something from the CWS, mount 192.168.56.65:/spdata/sys1/install /mnt then wc /mnt/default/lppsource/raj, which is a tiny, 9-byte file, but when I try wc /mnt/default/lppsource/.toc, which is a bigger, 1 MB file, the command hangs - that is, it never completes and ctl-c-ing out of it leaves the system in a funny state. That is, subsequent NFS reads, unmounts, and mounts also hang. I traced a subsequent, identical mount command and it was getting data from the .toc file, in other words, some process (nfsd?) on the CWS was remembering that we wanted to read the .toc file, and was shoving us data even though we had killed the wc socket and was trying to do something else now. I iptrace'd a good wc .toc command from another machine (as0073e0) and the failing wc .toc from patgate and saw that a UDP packet was being denied by a seemingly-unrelated firewall rule. The details are - The iptrace (/tmp/bad.wc.toc.iptrace10) from patgate shows the first part of the .toc file data being transmitted in the following three packets Note the fragmentation of the UDP packet, resulting in the same ip_id number and different ip_off(set) values. These 3 packets were all coming sequentially from 192.168.56.65, port 2049, to 192.168.56.252, port 1226. Timestamp Size ip_len ip_id ip_off Packet Fate ------------------ ---- ------ ----- ------ ---------------- 22:04:26.758164736 1514 1500 199 0+ Passed 22:04:26.759396864 1514 1500 199 1480+ Denied by Rule 2 22:04:26.760448768 1306 1292 199 2960 Denied by Rule 2 - The firewall log (/var/adm/sng/logs/971229_220x.log) show the first packet getting passed on through, due to my PermitAll rule #4, but the second & third packets being denied due to rule #2. Dec 29 22:04:27 patgate : 1997;4014: 2073;ICA1036i;#:;4;R:p; i:;192.168.56.252;s:;192.168.56.65;d:;192.168.56.252;p:;udp; sp:;2049;dp:;1226;r:;l;a:;n;f:;y;T:;0;e:;n;l:;1500; Dec 29 22:04:28 patgate : 1997;4014: 2073;ICA1036i;#:;2;R:d; i:;192.168.56.252;s:;192.168.56.65;d:;192.168.56.252;p:;udp; sp:;0;dp:;0;r:;l;a:;n;f:;y;T:;0;e:;n;l:;1500; Dec 29 22:04:28 patgate : 1997;4014: 2073;ICA1036i;#:;2;R:d; i:;192.168.56.252;s:;192.168.56.65;d:;192.168.56.252;p:;udp; sp:;0;dp:;0;r:;l;a:;n;f:;y;T:;0;e:;n;l:;1292; - According to the /etc/security/fwfilters.cfg file, Rule #2 is # Between anySource and anyDestination # Service : Syslog # Description : deny Syslog deny 0.0.0.0 0.0.0.0 0.0.0.0 0.0.0.0 udp any 0 eq 514 non-secure both inbound l=y f=y This rule doesn't look like it should be denying these packets. - I've got FW.base, cfgcli, & libraries at the latest 3.1.1.2 level. 
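
(For the record, the traces above were gathered with the stock iptrace/ipreport
pair -- this is just a sketch of the shape of it; the interface name en0 is a
stand-in for whichever interface carries the 192.168.56 traffic:)

    # capture traffic to/from the CWS while reproducing the hang
    iptrace -a -i en0 -d 192.168.56.65 -b /tmp/bad.wc.toc.iptrace10
    wc /mnt/default/lppsource/.toc        # this hangs; interrupt it after a bit
    kill $(ps -ef | awk '/[i]ptrace/ {print $2}')    # stop the trace daemon
    ipreport -rns /tmp/bad.wc.toc.iptrace10 > /tmp/bad.wc.toc.report
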
Until I get a call back from the Support Center, I tried changing the nfsd
daemon to limit the read packet size to 1480 bytes with the command

    chssys -s nfsd -a "-r 1480 8"

You can do an odmget -q subsysname=nfsd SRCsubsys | grep cmdargs to see what
it's set to.  It was just 8 before I changed it.  ... A little while later ...
That didn't seem to change anything, but it was probably due to the fact that
patgate was screwed up (see the first paragraph) -- I couldn't remount the
directory.

I read a bit about "Fragmentation Control" on Firewall rules and changed the
"nonsecure Syslog" rule from yes to headers.  The settings refer to what type of
packets the rule applies to:

                                |         S E T T I N G
    Applies to Packets          | Yes | No | Only | Headers
    ----------------------------+-----+----+------+---------
    Non-Fragments               |  x  |  x |      |    x
    Fragment Headers            |  x  |    |  x   |    x
    Fragments Without Headers   |  x  |    |  x   |

So now, rule #2 doesn't apply to Fragments Without Headers any more and voila!
Things now work fine.  How 'bout that?  A little bit of RTFM works (I just wish
I had the manuals).  This still leaves one question, tho'.  Why did this rule
get applied to UDP "Fragments Without Header" packets?  The rule specifically
says destination port 514 only.  When the Support Center calls back, I'll ask
them.  The answer is: since UDP fragment packets don't have port numbers in
them, the port clause of a rule is ignored, so that rule was denying *all* UDP
fragments from and to anywhere.  One could argue that's the wrong thing to do.
The reasoning could have been "since UDP fragment packets don't have port
numbers in them and this rule has a specific port number clause, don't apply
this rule".  But that's not the rationale.
==============================================================================
FL0563           x  On my AIX 4.3.0 system, with DCE & DFS installed, when I'm
03/19/98         x  authenticated to DCE (via integrated login) and then my
14:45            x  authentication expires, I cannot klist or kinit any longer.
Donovan          x  I get the message "IOT/Abort trap".  I have the official
                 x  PTF set 22 installed, from a CD the Support Center sent
                 x  Dale.
xxxxxxxxxxxxx

To debug an expired DCE identity, I went to the security registry as cell_admin
under dcecp and saw what the default and minimum ticket lifetimes were by

    dcecp> registry show -att
    {deftktlife +1-06:00:00.000I-----}
    ... 4 more lines ...
    {mintktlife +0-00:05:00.000I-----}
    ... 2 more lines ...

Then to change it, I

    dcecp> registry mod -mintktlife +0-00:00:01.000

to first change the minimum to 1 second (Donovan said the default can't be less
than the minimum), then I set up to log on again, and quickly (so as to not
affect anybody else),

    dcecp> registry mod -deftktlife +0-00:00:10.000

to change the default ticket lifetime to 10 seconds, then logged on, getting my
10-second ticket lifetime, then

    dcecp> registry mod -deftktlife +1-06:00:00.000

to set the default ticket lifetime back to what it was.  I can do these two
steps as many times as I'd like.  When I was done, I

    dcecp> registry mod -mintktlife +0-00:05:00.000

to set the minimum back to its original 5 minutes.

Donovan called back and left three fixes for me to get: IX74434, IX74068, and
IX72275.  Rick downloaded them from the web fixdist, I installed and tested
them, and things now work fine.
==============================================================================
21769            x  I had an AIX 4.3.0 system with DCE/DFS 2.1 working fine.
05/28/98 x I updated to AIX 4.3.1 and that appeared to go ok, but a 13:00 x previously-fixed bug (see problem above) appeared again. Julian Owens x I decided to upgrade to DCE/DFS 2.2. I smitty update_all'd t/l 421-7141 x from the CD, but now DCE doesn't want to start. xxxxxxxxxxxxxxx Looking at the /opt/dcelocal/etc/cfgdce.log, I saw that it complained about not finding some configuration files The file, /opt/dcelocal/var/dfs/BosConfig, was not found. The file, /opt/dcelocal/var/dced/cdscache.inf, was not found. The file, /opt/dcelocal/var/dced/clksynch.inf, was not found. and making some bad assumptions (Information only) Unable to determine the security server type. Replica will be assumed. I was not configured as a security server before doing this. Before, (taken from ech) a lsdce command returned Current state of DCE configuration: cds_cl COMPLETE CDS Clerk dts_cl COMPLETE DTS Clerk rpc COMPLETE RPC Endpoint Mapper sec_cl COMPLETE Security Client Now a lsdce command shows Gathering component state information... Component Summary for Host: jasper.almaden.ibm.com Component Configuration State Running State Security client Configured Running RPC Configured Running Integrated Login (dceunixd) Configured Not Running Initial Directory server Configured Not Running Directory client Configured Running Security Replica server Configured Not Running DTS client Configured Not Running Global Directory Agent Configured Not Running I attempted to rmdce -F -o local all_srv, and got some errors. I even tried rmdce -F -g all_srv and got other errors. Now lsdce shows Gathering component state information... Component Summary for Host: jasper.almaden.ibm.com Component Configuration State Running State Security client Configured Running RPC Configured Running Initial Directory server Configured Not Running Directory client Configured Running Security Replica server Configured Not Running DTS client Configured Not Running Global Directory Agent Partial Not Running so it looks like it's getting worse. I decided to call it in to get real help. Julian Owens (good, chatty, level 1 guy) looked around and didn't find thing obvious. He sent me to level 2. 6/2/98 Claudia called. I gave her root access (root's password Claudia Barrett is now claudia). She's going to poke around and pass it off 8-678-0910 to somebody else. ============================================================================== FZ0339 x After doing a fresh AIX 4.2.1/PSSP 2.4 install on the new 07/01/98 x SP/2 nodes in the Patent Server complex, I get an extra, x unwanted route in both the ODM & (of course) the routing Russell x table. This route comes back after a reboot, despite me McDonald x having deleted it from both the route table itself and the xxxxxxxxxxxxxxx ODM. As background, there are two ethernet adapters on each node, connecting to two different subnets. A netstat -rn shows ... Route Tree for Protocol Family 2: default 192.168.56.251 UG 0 1 en1 - - 9.1.10.17 192.168.56.252 UGHD 0 20 en1 - - 127 127.0.0.1 U 6 177 lo0 - - 192.168.55 192.168.55.17 U 4 7235 en0 - - 192.168.56 192.168.55.65 UG 1 180 en0 - - => 192.168.56 192.168.56.17 U 1 0 en1 - - Note the second-to-last line, where it says to use the CWS as a router to get to subnet 192.168.56, which makes the native connection to 192.168.56 unused. 
Useful commands to look around & fix things are

    netstat -rn
    odmget -q 'name=inet0 AND attribute=route' CuAt
    odmdelete -o CuAt -q "name=inet0 AND value='net,192.168.56.0,-netmask,255.255.255.0,192.168.55.65'"
    route -f; route add -net 0 192.168.56.251

Russell and I tried deleting everything we could out of /etc/inittab, leaving in
just init, brc, and cons, and still the bogus route came back.  I tried putting
debug code in /sbin/rc.boot, and couldn't figure out where/when the bogus route
was being put into the ODM.  Turns out the answer was that the extra route is in
the ODM that's stored in the boot image!  To resave that version of the ODM, run
the savebase command.  We did this and verified that the route didn't come back.
Now why that bad route got put in that ODM is left for the Support Center to
figure out.  I now know how to overcome the situation.
==============================================================================
HB3410              x  Since installing the second frame to the SP/2, all the
07/16/98            x  other nodes have been logging a lot of extra crap in
03:00               x  their /var/ha/log/hats.15.103329.as0000e0 file.  Lines
Jon Meyer           x  like this show up every 5 seconds, filling /var:
meyerjw@us.ibm.com  x    07/17 14:56:18 hatsd[0]: Received a Group Proclaim
8-421-7157          x    from (192.168.55.65:0x4591719e) in group
                    x    (192.168.55.65:0x45af716a).
xxxxxxxxxxxxxxxxxxxx

Jon investigated it a bit and claims it's normal at this level of PSSP (2.2) to
see all this crap in the error log, and it's not an indication of anything
wrong.  In later versions of PSSP, they cleaned things up to not log this.  He
closed this PMR on 7/27/98, but I called and left a message asking what I should
do, since my /var file system is getting filled with this junk.  See PMR
34340,49R.
==============================================================================
Item # HC9258           x  When trying to configure DCE on the J50 (ar0081e0)
PMR # 34353, 49R        x  the system crashes.  The last thing Ed's mkdcecl.pl
07/27/98                x  script does is the command
09:00                   x      mkdce -o local -h ar0081e0
Sandy Comsudi           x            -c ar0073e0.patent.ibm.com
sancom@austin.ibm.com   x            -s ar0073e0.patent.ibm.com
8-523-4130              x            -n patent.ibm.com sec_cl cds_cl
xxxxxxxxxxxxxxxxxxxxxxx

I'm also seeing sporadic problems using vi.  Sometimes vi works fine.
Sometimes, it hangs up for 30 seconds or so, then works normally.  Other times,
it hangs completely and never comes back (ok, I don't know about "never" -- I
waited 14 minutes 'till the system crashed when I ran dced).  I used smitty to
copy the dump over to /dfscache, and Sandy had me use the script command to
capture the console output from these commands:

    script /tmp/crash.info
    crash /dfscache/vi_and_mkdce.dump
        cpu
        status
        stat
        trace -m
        errpt
        symptom
        od vmmerrlog 9 a
        quit            # To exit out of the crash command, and
    exit                # To finish the script command.

I then sent her that file, ala

    mail -s 'Item # HC9258' sancom@austin.ibm.com < /tmp/crash.info

After taking a few minutes to look at that dump info, Sandy called back and had
me ship her the complete dump itself via anonymous ftp.  To collect the
information, I first ran this snap command,

    snap -a -N -d /dfscache/snap.command

It turns out I maybe shouldn't have specified "/snap.command".  I was under the
impression this command would generate one file.  Instead it created a
/dfscache/snap.command directory with 16 subdirectories underneath it.  Anyway,
I then ran

    snap -c -d /dfscache/snap.command

which tar'd up and compressed that directory, and created the file
/dfscache/snap.command/snap.tar.Z.
I renamed that file to my PMR number.brach office number.tar.Z, cd /dfscache/snap.command mv snap.tar.Z 34353.49R.tar.Z then anonymously rftp'd it to Sandy. Sandy said I could either put it in the /incoming directory of the IBM-internal cia.austin.ibm.com server, or the /aix directory of the external testcase.boulder.ibm.com server. I chose the latter. 07/29/98 This item got moved around a lot and now is with Dwayne McConnell, (DWAYNE at IBMUSM26, tie-line 8-678-2720). He suggested I needed to get APAR IX79277, which brings bos.mp up to 4.2.1.12. I had 4.2.1.10. I used fixdist to get the fix, which actually drug down 4.2.1.14, installed it, but still got a system crash, this time with just vi running/hanging. Also, with this boot, root didn't have a password. Strange. I know it had a password. And to prove it, when the system rebooted after this dump, root again had the correct password. More strangeness. I left a message for Dwayne. 08/04/98 Called Dwayne McConnell again to follow up on last week's call, and he decided he wanted the dump, so I erased the old stuff from /dfscache, and copied the dump to /dfscache/vi.dump, and ran these commands snap -a -N -d /dfscache/snap snap -c -d /dfscache/snap cd /dfscache/snap mv snap.tar.Z PMR34353.49R.000.tar.Z (000 is USA's country code) /local/bin/rftp testcase.boulder.ibm.com anonymous jasper@almaden.ibm.com cd /aix bin put PMR34353.49R.000.tar.Z I also offered reinstalling AIX to Dwayne. He said he didn't think it would fix things, but it's worth a try. 08/06/98 I've still got the same problems after the reinstall. Called Dwayne back to inform him and also to see what's up. Had to leave a msg. 08/07/98 Called Dwayne back directly to inquire when I should expect to hear back from him, especially with a solution. 08/12/98 Called the Support Center again to get a status. The problem's been transferred to Manoj Kumar (tie-line 678-3708). Manoj called back and left this message, "Regarding the PMR with system crashes. I can't really see any code problem or any memory corruption. I have a very strong feeling in this case you're running into some problem with hardware, somewhere in the planar board where things are going haywire in the sense of what's in the memory is not what the CPU is seeing, so when it does address calculation, it's messing up. ... Tell service that you're having strange and random crashes ... I can't see anyway that we can run into these kind of problems. Three random crashes in completely different parts of the code and unexplained by the memory, at least at the time of the dump. Everything in memory looks ok, (but) by the time the processor saw it, some of these things had (changed). I called it into service at 4 pm Wednesday, 8/12/98. J50-2605593. Reference # 31WB5KH. Ernie Garcia came Thursday morning, 8/13. He ran diagnostics, which didn't show anything wrong (as expected), and he ordered a new system planar, which got installed Thursday afternoon. By Friday noon, I had re-installed AIX and had an up & running, fully-configured system. I called the Support Center and closed the problem. On Thursday, 8/20/98, I called in this J50 yet again. There's still something wrong with it, the same thing. Yes, I was able to last week get by the mkdce and fully configure the system, but then the system went south on me. I reinstalled and tried to configure it again, but again, I die on the mkdce. Reference # 32TJHB2. Anthony DeMott's gonna look at it on Friday morning, 8/21/98, but I'll be on vacation. 
============================================================================== HE4429 x I've got two machines (as0202e0 & as0203e0) that 08/05/98 x have new SSA arrays defined on them Both are at 4:45 x hdisk2 and are 27.3GB big. When trying to put them Aaron & Brenda x into a volume group, I get two different error --mail address-- x messages. Repeated "cfgmgr" & "diag -a" commands --phone-- x don't change anything. Even deleting the SSA RAID x array and redefining it didn't change anything. xxxxxxxxxxxxxxxxxxxxxxx The SP/2 is a 9076, serial # 0277261. as0202 gives 0516-796 mkvg: Making hdisk2 a physical volume. Please wait. mkvg: An invalid physical volume ID been detected on hdisk2. 0516-862 mkvg: Unable to create volume group. as0203 gives 0516-796 mkvg: Making hdisk2 a physical volume. Please wait. 0516-304 mkvg: Unable to find device id 55aa75c78bf5ea00 in the Device Configuration Database. 0516-324 mkvg: Unable to find PV identifier for physical volume. The Device Configuration Database is inconsistent for physical volumes. They keyed on the fact that a lspv command returned hdisk0 000018505d4d6920 rootvg ( 9.1 GB rootvg) hdisk1 none None ( 4.5 GB unused) hdisk2 0000000000000000 None (27.3 GB SSA ) Note the 16 zeros for the PVID. They focused on getting the real PVID in this command's output. They first tried dd if=/dev/zero of=/dev/hdisk2 bs=512 count=1 chdev -l hdisk2 -a pv=yes but when that didn't work, they had me do rmdev -dl hdisk2 after which, the lspv command didn't return anything for hdisk2, as expected. cfgmgr after which, the lspv command showed "none None" hdisk2. That's good. chdev -l hdisk2 -a pv=yes and lo and behold, the lspv command returned hdisk2 0000185018f665db None (for as0202e0) hdisk2 00001102190c0f13 None (for as0203e0) as expected. I was then able to do the mkvg and the rest of what I needed to do to define the file system, fine. ============================================================================== ------ x In the patent server domain, I've lost my quorum --/--/98 x among the DCE servers. ar0073e0 is supposed to be my --:-- x master, with 71 & 72 replicas, but I can't define a --who-- x new volume for example. I get the messages --mail address-- x Could not lock FLDB entry (id=0,,16, type=0, op=32) --phone-- x Error: no quorum elected (dfs / ubk) x Error in release: no quorum elected (dfs / ubk) xxxxxxxxxxxxxxxxxxxxxxx I get the same thing when I try a fts release users. That was at 4pm on 8/12. The next morning, I started investigating. The web help pages say to check time on each server (it's correct) and to type udebug -rpcgroup /.:/fs -long "to analyze the FLDB and check if the different machines have the same version of the database." They don't. ar0071e0 shows Host 192.168.56.71, his time is 0 Vote: Last yes vote for 255.255.255.255 at -3 (not sync site); Last vote started at -903033334 Local db version is 902640226.1 I am not sync site Lowest host 255.255.255.255 at -903033334 ar0072e0 shows Host 192.168.56.72, his time is 0 Vote: Last yes vote for 255.255.255.255 at -5 (not sync site); Last vote started at -903033378 Local db version is 902617959.1 I am not sync site Lowest host 255.255.255.255 at -903033378 ar0073e0 shows Host 192.168.56.73, his time is 0 Vote: Last yes vote for 192.168.56.71 at -28 (not sync site); Last vote started at -38 Local db version is 902640226.1 I am not sync site Lowest host 192.168.56.71 at -28 But then when I was poking around some more, everything came into synch. 
The above udebug -rpcgroup /.:/fs -long command showed Host 192.168.56.73, his time is 0 Vote: Last yes vote for 192.168.56.71 at -8 (sync site); Last vote started at -9 Local db version is 903033621.1 I am not sync site Lowest host 192.168.56.71 at -8 on all 3 servers. I don't know what fixed things. The only clue maybe is these lines found in /var/dce/dfs/adm/BosLog.old, Sun Aug 2 04:00:05 1998: /opt/dcelocal/bin/bosserver: beginning logging Sun Aug 2 04:00:05 1998: Server directory access is okay Sun Aug 9 04:00:00 1998: reBossvrWatchThread: no error; simple restart exit compat_UnregisterServer: unexpected error from rpc_ep_unregister: Not registered in endpoint map (dce / rpc) Sun Aug 9 04:00:00 1998: /opt/dcelocal/bin/bosserver: error unregistering self: Not registered in endpoint map (dce / rpc) Sun Aug 9 04:00:02 1998: /opt/dcelocal/bin/bosserver: error destroying bnode timeout condition variable; errno = 16 Sun Aug 9 04:00:02 1998: childWatchThread: exception or cancellation in (cma_)sigwait (bossvr_thread_childWatch.c: 245) Sun Aug 9 04:00:03 1998: /opt/dcelocal/bin/bosserver: application exit I wound up never calling this in since things seemed to fix themselves, but I did want to document what I saw in case it happened next time. ============================================================================== 42790 x I've tried four times now on two different systems 08/20/98 x to install Net.Commerce v3.1. On as0204e0, I learned 10:00 x how to smoothly install Net.Commerce (have 260Mb free Tillman Baldwin x in /usr and have ipfx 2.2 pre-installed), but when I TJBALDWI at IBMUSM20 x get to the step of configuring Net.Commerce from the 8-444-7687 x NT web browser, I get an error message, "Cannot x modify web server configuration file." I got the xxxxxxxxxxxxxxxxxxxxxxx same thing on as0206e0. I'm thinking now, that this was my own fault. I thought I was selecting the "DB2 UDB Workgroup", but I was doing just the "DB2 Client Application Enabler" instead. I'm redoing it now. I tried reinstalling on as0204, but got another system crash (see below). Tried again on as0206e0, but I get the same error - "Cannot modify web server configuration file." ============================================================================== HI7185 x Sometimes after rebooting a silver SP/2 node, I 08/28/98 x wind up with a zero-length /etc/inetd.conf. Just 12:00 x prior to the install, I installed a bunch of Donovan x additional software, mostly X things, and the latest x AIX service that I had, so I don't really know what x of those 3 things caused the zero-length inetd.conf, x installing software, applying service, or rebooting. xxxxxxxxxxxxxxxxxxxxxxx The Support Center hadn't heard of others complaining of the same thing, and it didn't happen at the next reboot, so we just closed the incident. At least if somebody else gets the same symptom, they may get a hit in RETAIN. ============================================================================== HR2493 x Kin says ar0143e1 took a dump on Saturday, 9-26-98 09/29/98 x at 6:17 pm, that he wanted me to call in. ar0143e1 15:15 x is a 43P-240 (not a 530 like I told the Support Leslie Devlin x Center). 
I copied the dump over to ldevlin@austin.ibm.com x /var/adm/ras/system_dump_on_Sep26 and did the normal 8-523-4253 x script /tmp/crash.out x crash /var/adm/ras/system_dump_on_Sep26 x cpu xxxxxxxxxxxxxxxxxxxxxxxx status stat trace -m errpt symptom od vmmerrlog 9 a quit (to exit out of crash) exit (to exit out of script) I then mail -s'PMR HR2493,bo49R' ldevlin@austin.ibm.com < /tmp/crash.out. Later, Leslie called back to say the system crashed in the middle of PHXENTBD (whatever that is), which is in the fileset devices.pci.23100020.rte, which is the PCI 10/100 Ethernet Device Driver. She suggested installing the latest level, which is 4.2.1.4. We had 4.2.1.2 on, so I went ahead and got and installed that version, but I wonder if it's really going to fix the problem. It turns out that the system crashed overnight, so we're now running with the 4.2.1.4 version of devices.pci.23100020.rte. Looking at it a bit more closely, I see we've had 11 crashes on ar0135e0 this month. Here's a synopsis from the error log, IDENTIFIER TIMESTAMP T C RESOURCE_NAME DESCRIPTION --- Failing Module --- 3573A829 0930003398 U S CMDCRASH SYSTEM DUMP dfscore 3573A829 0926182298 U S CMDCRASH SYSTEM DUMP phxentdd 3573A829 0925130898 U S CMDCRASH SYSTEM DUMP if_en/netinet/dfscorep 3573A829 0924195798 U S CMDCRASH SYSTEM DUMP dfscore 3573A829 0923204198 U S CMDCRASH SYSTEM DUMP dfscore 3573A829 0922125298 U S CMDCRASH SYSTEM DUMP nfs.ext 3573A829 0903172598 U S CMDCRASH SYSTEM DUMP dfscore 3573A829 0903152098 U S CMDCRASH SYSTEM DUMP ??? 3573A829 0903125098 U S CMDCRASH SYSTEM DUMP dfscore 3573A829 0902213098 U S CMDCRASH SYSTEM DUMP dfscore 3573A829 0902103898 U S CMDCRASH SYSTEM DUMP csa/nfs.ext/vnop_rdwr I did the following to package both dumps together, document things, ftp the tar file to Boulder, then I called Leslie back. snap -a -N -d /ips/junk (This compressed last night's dump and wrote it to /ips/junk/dump/dump.Z, so to include the prior dump there, I did ... ) compress /var/adm/ras/system_dump_on_Sep26 cp -p /var/adm/ras/system_dump_on_Sep26.Z /ips/junk/dump vi /ips/junk/other/README.PROBLEM snap -c -d /ips (This tar'd up /ips/junk & wrote /ips/junk/snap.tar.Z) cd /ips/junk mv snap.tar.Z PMR-HR2493.49R.000.tar.Z To ftp it, rftp testcase.boulder.ibm.com anonymous jasper@almaden.ibm.com cd /aix bin put PMR-HR2493.49R.000.tar.Z I learned later that my "HR2493" number is not a PMR number, but what the Support Center calls an "X-menu" number. I was suppose to call Leslie and have assign me a PMR number, then use that in the tar file's name. Oh, well. Also, since we have at least 2 different problems, Leslie says she's opening up a second PMR. The phxentdd dump is PMR 48613 and the dfscore dump is PMR 48642. We eventually replaced the 43P with another idle 43P, moving over all the adapter cards & disk drives, and the new 43P is running fine. The old 43P that we now suspect there's something wrong with, is being used by Kin and/or Bruce for something or rather. ============================================================================== JF2646 x When migrating from PSSP 2.2 to PSSP 2.4 on one of 11/23/98 x our older nodes (as0101 in this case), "Step 10" in 13:30 x the PSSP book has you run /tmp/pssp_script, which is Cliff x really /spdata/sys1/install/pssp/pssp_script. --mail address-- x There were a few problems running that script. --phone-- x x xxxxxxxxxxxxxxxxxxxxxxx 1) It failed in handling of fbcheck in /etc/inittab when it didn't exist. 
"Fixed" by adding "fbcheck:2:wait:/usr/sbin/fbcheck" in inittab before running pssp_script. 2) Also had to insure /spdata/sys1/install was NFS exported. 3) Also had to manually install devices.chrp.base.rte on the node. 4) Had to insure xlC.rte was at 3.1.4.8 and a bunch of other stuff was updated, so I just applied latest service. After all the above, pssp_script ran ok. ============================================================================== PMR 59782,49R x I forgot to document this PMR and frankly, forgot 11/23/98 x all about it 'till I got a note on 2/25/99 saying x that PTF U463038 for PMR 59782 is closed. Huh??? xxxxxxxxxxxxxxxxxxxxxxx This was a problem that I spent 12/3/98 debugging. Turns out that /etc/rc.sp does a stopsrc and a startsrc -s inetd, but the startsrc wasn't working 'cause inetd wasn't stopped yet, it was in the "inoperative" state. I fixed /etc/rc.sp by adding these lines after the stopsrc while [[ -z `lssrc -s inetd | grep inoperative` ]] do sleep 1 done and these after the startsrc while [[ -z `/local/bin/lsof -i -n | grep 'kshell (LISTEN)'` ]] do sleep 1 done to wait for inetd to come up. The APAR is IX85054. What they did is update /usr/lpp/ssp/install/bin/rc.sp in level 2.4.0.7 to essentially put in my wait loop after the stopsrc. They didn't address waiting after the startsrc. I wonder if that'll be a problem for me. ============================================================================== Item # JK3543 x My machine, jasper, a 7043-140 (43P) running AIX PMR # 67742,49R x 4.3.1 with DCE 2.2.0.2 and all the latest fixes, 12/29/98 x still occasionally hangs while trying to contact a 9:20 x DFS server. I get these messages, Dot x dfs: lost contact with server 9.1.24.165 support@transarc.com x in cell: almaden.ibm.com (412) 667-4500 x7391 x I called this problem into the support center a xxxxxxxxxxxxxxxxxxxxxxx couple of weeks ago, but at that time, they just told me to reboot & install the latest fixes, which I did. I'm now at the latest, DCE 2.2 with Fix Pack 2, that I can be at. Also last time, the dfsbind process was huge, so I set up a little cron job to hourly go see what the size of the dfsbind process was. Here was the job, #!/bin/ksh file=/tmp/monitor_dfsbind # Put in the header line if the file doesn't exist. if [[ ! -f $file ]] then echo "------------------------------->" $(ps aux | head -1) > $file fi # Find the dfsbind PID with the inner ps -ef command (the ps aux command # doesn't give the full name of the running program, so we've got to use # the ps -ef command to find the PID, then look for that PID in the # ps aux command). # Keep that line in the designated file. echo $(date) "-->" $(ps aux | grep -v grep \ | grep "^root *$(ps -ef | grep -v grep | grep -v monitor_dfsbind | \ grep dfsbind | awk '{print $2}')") >> /tmp/monitor_dfsbind with this crontab entry, # Run a frequent job that monitors the size of the dfsbind daemon. 0 * * * * /monitor_dfsbind.sh 1>/dev/null 2>& I had this running since 12-10-98 and had 2 reboots since then. Without boring you with details, the size of the dfsbind process during the first boot, stayed pretty constant at 1172-1500Kb. I had to reboot 12-14 to replace the processor heat sink & fan (it was making noise). The second boot saw the dfsbind process slowly grow from 1172Kb to 1812Kb over 8.5 days. I had to reboot yesterday, 12-28, 'cause the system was apparently hung over the Christmas holidays. 
I had taken off 12/22 - 12/28 and the last entry in the /tmp/monitor_dfsbind log was dated Wednesday morning at 1 am. This last time, the system has only been up for on day and the hourly watch on dfsbind has seen it mushroom. Here are the lines from the last reboot (slightly edited), ----------------------> USER PID %CPU %MEM SZ RSS STIME TIME COMMAND Mon Dec 28 09:00:00 --> root 8792 0.0 0.0 1136 472 08:16:37 0:00 /opt/dcelocal/bin Mon Dec 28 10:00:00 --> root 8792 0.0 1.0 1180 664 08:16:37 0:00 /opt/dcelocal/bin Mon Dec 28 11:00:00 --> root 8792 0.2 6.0 8004 7788 08:16:37 0:17 /opt/dcelocal/bin Mon Dec 28 12:00:01 --> root 8792 0.8 13.0 16936 16728 08:16:37 1:51 /opt/dcelocal/bi Mon Dec 28 13:00:00 --> root 8792 1.1 16.0 21352 21244 08:16:37 3:05 /opt/dcelocal/bi Mon Dec 28 14:00:00 --> root 8792 1.6 22.0 28300 28192 08:16:37 5:31 /opt/dcelocal/bi Mon Dec 28 15:00:00 --> root 8792 1.8 24.0 31648 31540 08:16:37 7:05 /opt/dcelocal/bi Mon Dec 28 16:00:00 --> root 8792 2.1 28.0 37064 36944 08:16:37 9:48 /opt/dcelocal/bi Mon Dec 28 17:00:00 --> root 8792 2.7 33.0 43756 43560 08:16:37 13:57 /opt/dcelocal/bi Mon Dec 28 18:00:00 --> root 8792 2.9 37.0 48040 47796 08:16:37 17:04 /opt/dcelocal/bi Mon Dec 28 19:00:00 --> root 8792 3.4 41.0 53816 53372 08:16:37 21:44 /opt/dcelocal/bi Mon Dec 28 20:00:00 --> root 8792 3.8 45.0 59152 58468 08:16:37 26:40 /opt/dcelocal/bi Mon Dec 28 21:00:00 --> root 8792 4.4 49.0 65412 63756 08:16:37 33:21 /opt/dcelocal/bi Mon Dec 28 22:00:00 --> root 8792 4.9 53.0 71392 69608 08:16:37 40:33 /opt/dcelocal/bi Mon Dec 28 23:00:00 --> root 8792 5.4 57.0 76920 74428 08:16:37 47:47 /opt/dcelocal/bi Tue Dec 29 00:00:00 --> root 8792 6.0 60.0 82812 78396 08:16:37 56:16 /opt/dcelocal/bi Tue Dec 29 01:00:00 --> root 8792 6.6 60.0 89008 78784 08:16:37 66:10 /opt/dcelocal/bi Tue Dec 29 02:00:00 --> root 8792 6.8 60.0 92644 79144 08:16:37 72:01 /opt/dcelocal/bi Tue Dec 29 03:00:00 --> root 8792 7.1 61.0 97176 80200 08:16:37 79:59 /opt/dcelocal/bi Tue Dec 29 04:00:00 --> root 8792 7.3 63.0 100264 81632 08:16:37 85:50 /opt/dcelocal/b Tue Dec 29 05:00:00 --> root 8792 7.4 63.0 103408 82740 08:16:37 91:31 /opt/dcelocal/b Tue Dec 29 06:00:02 --> root 8792 7.5 65.0 106280 84932 08:16:37 97:14 /opt/dcelocal/b Tue Dec 29 07:00:01 --> root 8792 7.6 66.0 109640 86588 08:16:37 104:00 /opt/dcelocal/ Tue Dec 29 08:00:08 --> root 8792 7.7 68.0 112364 89308 08:16:37 109:39 /opt/dcelocal/ Tue Dec 29 09:00:03 --> root 8792 7.7 68.0 114820 88780 Dec 28 114:52 /opt/dcelocal/ As you can see, the dfsbind process is taking over all my resident memory, causing huge paging delays. This was the same symptom I saw 3 weeks ago. I restarted the dfsbind process via the smit menus to get me by, 'till I can get on the horn with the Support folks about this. 01/14/99 After playing phone tag with the Support Center for a Reggie Clinton few weeks (I haven't pushed the issue), Reggie finally 8-444-5257 got hold of me. I told him about my monitor_dfsbind log l2dce@us.ibm.com file and he told me about 2 commands he wanted me to type in to dump some internal tracing. The 2 commands are dfstrace dump -file /tmp/dfsdump and send a kill -30 signal to the dfsbind process, which will write /var/dce/dfs/adm/icl.bind. Both are painless to run, so I ran both and mailed him all 3 files (/tmp/monitor_dfsbind, /tmp/dfsdump, and /var/dce/dfs/adm/icl.bind). When dfsbind starts to get bloated again, I'm to run the above 2 commands again and mail him the 3 files again. 
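Since I'll have to repeat this whenever dfsbind bloats again, here's a minimal ksh sketch of the collection-and-mail step (the PMR number, file names, and Reggie's l2dce@us.ibm.com address are the ones above; the PID grep is the same trick my monitor script uses):

   #!/bin/ksh
   # Collect the dfsbind diagnostics and mail them off (a sketch, not a supported tool).
   dfstrace dump -file /tmp/dfsdump
   # kill -30 makes dfsbind write its in-core trace to /var/dce/dfs/adm/icl.bind.
   bindpid=$(ps -ef | grep -v grep | grep dfsbind | awk '{print $2}')
   kill -30 $bindpid
   sleep 3    # give dfsbind a moment to write the file
   for f in /tmp/monitor_dfsbind /tmp/dfsdump /var/dce/dfs/adm/icl.bind
   do
       mail -s "PMR 67742,49R $f" l2dce@us.ibm.com < $f
   done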
01/15/99 Dale talked to Liz Hughes (who I met in San Antonio), Liz Hughes who's looking at the problem. 8-678-3483 I talked with Liz, who was more interested in my complaint of loosing contact with a server, than the dfsbind process growing. I told her that the loosing contact was more of a transient thing, that the dfsbind problem occurred more often. Liz wanted to close this item due to the confusion of symptoms, and if/when it happened again, to open up a new PMR. See Item # JP9021 below for a continuation of the apparent dfsbind memory leak. See also Item # 84648,49R below for the "Lost contact with " msg. ============================================================================== Item # x x I've got problems installing a VeriSign SSL PMR # 67918,49R x certificate on two machines, baboon & penguin. 12/31/98 x Baboon has the Lotus Domino Go Webserver 4.6.1.0 --:-- x installed. Penguin has Internet Connection Server Paul Kelsey x 4.2.1.7, just like the patent site does. Both appear --mail address-- x to receive the certificates fine, but give the 8-444-5399 x following error in a popup window whenever you try to xxxxxxxxxxxxxxxxxxxxxxx get some page via https/SSL: The security library has encountered an improperly formatted DER-encoded message. and the bottom line is, it don't work. I tried calling the VeriSign Support number (1-650-429-3400, options 1,2,2) and they sent me a URL (http://www.StructuredArts.com/edu/ssleay-pkcs10.htm) to look at, but that wasn't too useful 'cause it was too technical. It was purported to "give a brief description of how you can generate a PKCS-10 certificate request using SSLeay -.6.6 or later for use with the Netscape Certificate Server 1.01 product." The solutions it offered was too technical to follow and I'm not sure it really applied to my problem. The other solution the VeriSign folks offered was to specify a different web server when you request the certificate. On one of their web pages, they ask you to specify which web server you've got. Two of the choices are Lotus and IBM. So which should I select for the "Lotus Domino Go Webserver", which is put out by IBM? Initially, I chose IBM, so I went through VeriSign's web pages to reissue the certificate (free within 30 days), specifying Lotus. After re-receiving the new certificate, I got the same thing. 1-5-99 Paul Kelsey called and said yes, this is a known problem and he 10:25 knows how to take care of it. I gave him root's password on baboon and he's gonna do his magic to fix it. 2:00 Paul fixed it and sent me replacement keyfile.kyr & keyfile.sth files. Both seem to work. I left a phone msg with Paul to ask 1) What did he do exactly, so I'm able to do it myself. Answer: Use the /usr/sbin/mkkf utility distributed with the new webserver code, and delete the obviously bad key. It's the one that when you cycle through the keys in the keyfile, has a name like Current Key Name: CN = penguin.almaden.ibm.com, OU = Almaden Research Center, O = International Business Machines Corporation, L = San Jose, ST = California, C = US instead of the more normal looking, Current Key Name: Patent Server You can then select the "Patent Server" key. When you "Show" the information about that key, it should *NOT* have the "Key has a certificate." line at the bottom. When you back out a menu or two, you can select "R - Receive a Certificate into a Key Ring File" to properly receive the certificate. Save the keyfile and use it. 2) How does one upgrade to 4.6.2.5? 
Answer: You've got to get the code through the Support Center. I've now got it and installed it on baboon. Be careful to save the following files: /etc/httpd.conf /etc/ics_pics.conf /etc/lgw_fcgi.conf <--- Called /etc/ics_fcgi.conf for ICS 4.2.1.7. /etc/servlet.conf <--- I didn't save this one & it got overwritten. /usr/lpp/internet/server_root/protect/webadmin.passwd <--- Surprisingly, this file apparently got overwritten. 3) How can I get penguin, running ICS 4.2.1.7 to work correctly? Answer: Once I installed the 4.6.2.5 Lotus Domino Go Webserver on baboon, I could then get penguin's keyfile.kyr file over to baboon and manipulate it with the mkkf utility mentioned above. It was easy enough to figure out and it fixed it beautifully. 4) Can he help with www4.patents.ibm.com. I had generated the request on as0110e1 on 9-15-98 and received it and (I thought) had tested it to insure it worked (I could be mistaken about the test), and sure enough, the keyfile.sth & keyfile.kyr files are all dated 9-15-98 at 15:50, but I don't know how to tell if the certificate has been received yet or not. The problem I have with that is the keyfile password doesn't even unlock it. Answer: I had mistyped the password and got lucky and found my error and was able to unlock it & change the keyring password to what it should have been. It turns out I never received the certificate, so I did that as well. ============================================================================== Item # JL1215 x I've got a problem with dsh-ing commands from the PMR # xxxxx,49R x CWS. Even though we're authenticated as root.admin, --/--/99 x the dsh command doesn't work to most machines. --:-- x It only works to as0201-5, ar0079e0, and ar0081-4e0. --who-- x For all the other machines, you get the error --mail address-- x message, --phone-- x 3004-609 Your password has expired. xxxxxxxxxxxxxxxxxxxxxxx /usr/lpp/ssp/rcmd/bin/rsh: 0041-004 Kerberos rcmd failed: rcmd protocol failure. It's almost as if there's a password associated with each machine that has expired, but I've never heard about that. I spent a good hour and a half on the phone with the Support Center guy who was worse than useless. After poking around and not getting anywhere - he was convinced it had something to do with the fact these machines were DCE client machines and the PSSP Kerberos code was getting confused with the DCE Kerberos code - it wasn't, he finally said he was going to fax me instructions on rebuilding the DCE Kerberos database and that should fix things. "We've had a high success rate after doing this," he said, so he expected this to fix my problem. Turns out he never sent me those directions. Meanwhile Tom discovered that if you "su" to root, you get the "Your password has expired." message, which you don't get if you login to root directly. The problem was that root's password was 26 weeks old. It's time to change root's password on all machines. ============================================================================== Item # x A Shell Patent Server customer called to complain PMR # 70943,49R x about problems getting to the site. I'm not sure --/--/99 x what his problem really is, but I do see these error --:-- x messages in the /arc/ipnfb/logs/httpd-errors.* files Bill Polomchak x for him, --mail address-- x [PUT NOT ALLOWED] [host: gate2.shellus.com] 8-526-1619 x SSL Handshake failed. xxxxxxxxxxxxxxxxxxxxxxx so I called the Support Center to ask them what it meant. They said that was a generic message and it could be a lot of things. 
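Just to quantify it for myself, a quick tally against the error logs quoted above shows how often this customer is actually hitting it (a sketch; the host string and log path are the ones from the problem description):

   # Count the SSL handshake failures involving the Shell gateway, per log file.
   for log in /arc/ipnfb/logs/httpd-errors.*
   do
       echo "$log: $(grep 'gate2.shellus.com' $log | grep -c 'SSL Handshake failed')"
   done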
They suggested I go to 4.2.1.9, which fixed a lot of things in the code and will probably eliminate those messages. They also said there was another level coming out in a week or so, 4.2.1.10, which may help. I told them I'd take both and Bill said he was gonna put 4.2.1.9 on an FTP site and send me a note with the info, but I never heard from him. I left a message with him Monday morning, 2/1/99. Bill made the 4.2.1.9 code ftp-able from wp5.raleigh.ibm.com in /usr/rjasper. I got a copy & installed it on as0103e0. The server appears fine, but I still get hundreds of "SSL Handshake failed." messages in the httpd-errors log. I'm also getting a lot of "Request parsing failed." messages that I'd like to understand. I sent Bill a note on 2/12/99 to ask if the 4.2.1.10 code was available yet, and he's out of town 'till 2/22. I called the Support Center back and asked for Diane Swan (8-526-1933) to call back. 3/03/99 We got the 4.2.1.10 code & installed it on the gold nodes, 4:00 as0103 & as0104, and we're still getting the same Diane Swan [PUT NOT ALLOWED] [host: ...] SSL Handshake failed. 8-526-1933 as well as [OK] [host: ...] Request parsing failed. messages. I called Diane back & left a phone message saying this. 4/02/99 Got a note from Kevin asking for our config file & what 9:00 level of Go we have. I sent him the config file & told him we Kevin Vaughan didn't have Go, we had ICS 4.2.1.10. Raleigh The bottom line with this "SSL Handshake failed." message is that it's probably caused by the user interrupting the page download, and it's not worth exploring and tracking it down any further, especially since, as the Support Center guys said, we're on such a down-level web server. ICS has been replaced by the Lotus Domino Go web server, and that's now being replaced by the IBM HTTP Web Server Powered by Apache.
==============================================================================
Item # JP9021 x This is a continuation of the apparent dfsbind PMR # 71121,49R x memory leak problem I opened on 12/29/98 (see 02/04/99 x Item # JK3543, PMR # 67742,49R above). Level 1 14:15 x wanted me to package up what I've got & drop it off Claudia Barrett x as JP9021.tar.Z in their testcase.boulder.ibm.com zeclaw@us.ibm.com x anonymous ftp site. This, I did, along with a little 8-441-3455 x README file to tell them what it was. An interesting xxxxxxxxxxxxxxxxxxxxxxx tidbit with the way they have their drop-off ftp site set up is that you can't overwrite stuff that's already there. I put the file with the README there as JP9021-with-README2.tar.Z. I got a call back from the Support Center asking me to repackage the files and this time use relative path names, instead of absolute. I did. Claudia asked me to pick up /aix/fromibm/71121.b49r.tar.Z from testcase.boulder.ibm.com. I've untar'd it in ~jasper/71121 for now. I've got to read the README carefully, as it seems to be very dangerous. Sure enough, it is dangerous. What the package does is to modify /usr/ccs/lib/libc.a in a /tmp directory, and mount the file (not a directory) /tmp/libc.a.debug/libc.a over /usr/ccs/lib/libc.a. When this happens, login or anything that tries to authenticate (su, telnet, ftp, etc) quits working. This in itself I could live with for a while, but when I tried restarting dfsbind, my system froze. One advantage of mounting this single file is that when you reboot, the file "mount" goes away and things revert to normal.
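I haven't tried it by hand, but the file-over-file trick the package uses presumably boils down to something like this (AIX lets you mount one file over another; an umount, or a reboot, undoes it):

   # Hide the real libc.a behind the debug copy; the real file is untouched underneath.
   mount /tmp/libc.a.debug/libc.a /usr/ccs/lib/libc.a
   # ... run the debug scenario ...
   # Back out without rebooting.
   umount /usr/ccs/lib/libc.a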
I did have to reboot, but even now after a reboot, my system is giving me "/bin/ksh: A remote host did not respond within the timeout period." messages whenever I execute a command (probably while trying to update the shell history file). I called the Support Center again, when Claudia seemed to have gone home. Also, X seems very slow. X events like highlighting a line of text don't work as quickly as they should. Oftentimes, the highlighting doesn't happen. 2/08/99 Monday morning. Went back to trying to use this debug version of /usr/ccs/lib/libc.a, which evidently is used for all kinds of commands, like ls, cp, mv, etc, so when trying to move/copy this file in place, you have to mv it to get the existing one out of the way, then since there is no libc.a, you have to use this new feature/construct of ksh variable assignment, variable=value command where variable=value holds for this one command only. The original libc.a is at /usr/ccs/lib/libc.a.orig, and /tmp/libc.a.SAVE/libc.a. The debug libc.a is at /usr/ccs/lib/libc.a.debug and /tmp/libc.a.debug/libc.a. To move the debug version of libc.a in place, cd /usr/ccs/lib mv libc.a libc.a.orig LIBPATH=/tmp/libc.a.SAVE cp -p /tmp/libc.a.debug/libc.a . To start up the dfsbind process, kill -9 the currently running dfsbind process, if there is one. export DEBUGMALLOC=3 /opt/dcelocal/bin/dfsbind To move the original libc.a back in place, cd /usr/ccs/lib mv libc.a libc.a.debug LIBPATH=/tmp/libc.a.SAVE cp -p /tmp/libc.a.SAVE/libc.a . The problem is the dfsbind process crashed whenever I tried running it with the debug version of libc.a. Claudia sent me the senddata.pl Perl script (available from their web/ftp site - See the "DCE for AIX" link at http://www.software.ibm.com/enetwork/dce/downloads) which I had to fix for my machine (Perl in /local/bin/perl, not /usr/local, and I had to escape 2 occurrences of "@austin"). I put the fixed version in /usr/bin on jasper, as well as in /usr/local/bin (which is a link to /dfs/apps/userlocal, so it's at /dfs/apps/userlocal/bin/senddata.pl) on the Patent Server site. The core file was at /var/dce/dfs/adm/dfsbind/core. I ran senddata.pl & Claudia, who happened to be logged on, got a copy of the packaged stuff, which I put in /tmp/71121. 2/11/99 Was informed that they found a bug in an AIX library that causes dfsbind to core dump when run with the debug library. They've opened up APAR IX87547 to fix that. The description in RETAIN says In libs_threads.c, the code for __libs_child_post_fork frees a global variable _sec_rmutex by calling _rec_mutex_free(). _sec_rmutex is declared as: struct rec_mutex _sec_rmutex; While _sec_rmutex never get malloc'd, therefore it should never be freed. As I understand this, this won't fix my original problem, it will just allow me to run dfsbind with the debug library. We'll see. 4/05/99 Closed 3/18/99 as a duplicate of IX85363. Called the Support Center to get a status on IX85363, specifically whether or not I could get the PTF. 4/27/99 Got a phone call from Bill Smith, DCE Level 2 in Raleigh, Bill Smith saying that PTF U464373 has been closed for IX85363, and he's 8-441-4096 sending me the CD. 4/28/99 Got the CD, installed the fix, and rebooted jasper (good thing too, 'cause it was hosed up in the can't talk to DFS state - see PMR # 84648,49R below). The phone messages I've been getting from Bill Smith indicate he wants to close this PMR, but there's a note in the PMR in RETAIN that says "Hi library folks, The info at the end of the pmr is what pertains to aix.
When this has been fixed, please requeue the secondary back to l2dce,109. Thanks." So my PMR shouldn't be closed. I tried calling Bill and left a msg, so I called the Support Center, who put this note on the last page, "Update: Rick is requesting this pmr NOT be closed, also requesting that it be queued to the correct queue l2dce,109 (ref. pg17) And, would like a callback asap." I think what we need to do is try running dfsbind again with the debug library. ============================================================================== PMR # 72384,49R x Got a problem with multiple Patent Server nodes. 02/08/99 x The dceunixd process either dies completely, or it 1:00 x still shows up in the process table, but isn't doing Julian x its thing of translating userid number to userid name. --mail address-- x For example, "id jasper" returns --phone-- x 3004-820 User not found in /etc/passwd file xxxxxxxxxxxxxxxxxxxxxxx which it doesn't, jasper is a DCE id. When this happens to one of our web servers, the httpd process just stops returning data (not sure why, exactly, but I know if I kill the dceunixd process & restart it, things will start working again). This has happened to 4 nodes last week, and another two today, all different. Each node is running AIX 4.2.1 with DCE 2.1.0.26, which is really current. Julian said one thing we could do is run dceunixd in debug mode, which runs dceunixd in the foreground and spits a bunch of lines at the console. Here are the lines I saw when I did that & logged in from another window as root, which seems to indicate an error, but Julian wasn't interested in it (see below to see what he *was* interested in). dceunixd -d9 Main: Initialization complete serve_client (4): expecting msg of 12 bytes process_req (4): got req_type 1002 do_gnam (4): groupname = tty do_gnam (4): cellname = /.../patent.ibm.com serve_client (8): expecting msg of 12 bytes process_req (8): got req_type 1002 do_gnam (8): groupname = tty do_gnam (8): cellname = /.../patent.ibm.com serve_client (9): expecting msg of 12 bytes process_req (9): got req_type 1002 do_gnam (9): groupname = tty do_gnam (9): cellname = /.../patent.ibm.com do_gnam (4): sec_rgy_site_open returns 387063931 do_gnam (4): sec_rgy_pgo_get_members returns 387063931 do_gnam (4): override_get_group_info returns 387063931 do_gnam (4): reply buf is 47 bytes do_gnam (4): reply buf -- 47:0011:Registry server unavailable (dce / sec) process_req (4): reply buf of size 47 serve_client (4): sending reply, 47 bytes do_gnam (8): sec_rgy_site_open returns 387063931 do_gnam (8): sec_rgy_pgo_get_members returns 387063931 do_gnam (8): override_get_group_info returns 387063931 do_gnam (8): reply buf is 47 bytes do_gnam (8): reply buf -- 47:0011:Registry server unavailable (dce / sec) process_req (8): reply buf of size 47 serve_client (8): sending reply, 47 bytes do_gnam (9): sec_rgy_site_open returns 387063931 do_gnam (9): sec_rgy_pgo_get_members returns 387063931 do_gnam (9): override_get_group_info returns 387063931 do_gnam (9): reply buf is 47 bytes do_gnam (9): reply buf -- 47:0011:Registry server unavailable (dce / sec) process_req (9): reply buf of size 47 serve_client (9): sending reply, 47 bytes Julian called somebody in level 3 & they suggested sending the hung dceunixd process a kill -6 signal, which causes it to create a core file in /var/dce/security/adm/dceunixd. I then tar'd & compressed it up, and dropped it in their testcase.boulder.ibm.com anonymous ftp server at /aix/JR3477.tar.Z. 
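For the record, the whole capture-and-drop-off sequence was roughly this (a sketch of what I did; as it turns out below, the bare core wasn't enough):

   # Force the hung dceunixd to dump core, then package and ship it.
   unixdpid=$(ps -ef | grep -v grep | grep dceunixd | awk '{print $2}')
   kill -6 $unixdpid            # core lands in /var/dce/security/adm/dceunixd
   cd /var/dce/security/adm/dceunixd
   tar cvf /tmp/JR3477.tar core
   compress /tmp/JR3477.tar
   cd /tmp
   rftp testcase.boulder.ibm.com    # anonymous login
       cd /aix
       bin
       put JR3477.tar.Z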
Reggie Clinton Reggie tells me that the core file I packaged up and 8-444-5257 sent to Julian is worthless without the corresponding libraries, which is what their senddata.pl program does (see PMR # 71121 above), that is, it packages up the core file along with the libraries that exist on the system (e.g. libc.a). I found 2 dceunixd core files, on as0111 & as0112, and tried sending them with senddata.pl, but senddata.pl complained that the core files were incomplete. Evidently, I've got to enlarge the /var/dce filesystem. 2/24/99 I finally had time to get back to this. I found a 130MB 11:45 am core file on as0107 at /var/dce/security/adm/dceunixd/core, so I ran /usr/local/bin/senddata.pl to package it up and dropped off the /arc/senddata/datapkg.tar.Z.uu file at testcase.boulder.ibm.com at /aix/JR3477.49r-L3DCE-datapkg.tar.Z.uu. 3/10/99 Got a call from Bob saying that yes, they did want the 11:00 am other dceunixd core file packaged up to look at. Bob Breeze (I had forgotten I had told them I had another one and 8-523-6189 whether or not they wanted it.) I packaged up the one on as0110 dated 2/26/99 using their /usr/local/bin/senddata.pl and sent it to them at testcase.boulder.ibm.com, creating the file /aix/72384.b49R-L3DCE-datapkg.tar.Z.uu.
==============================================================================
Item # JR5749 x Kin has given me a simple-enough C program that PMR # xxxxx,49R x does a gethostbyname library call which, if run as 2/09/99 x root, takes virtually no time at all to run, but if 4:00 x run as a non-root id, takes 2 minutes 20 seconds to Clayton Briggs x run. I tried doing a kernel trace, but couldn't see --mail address-- x anything useful. 8-421-7172 x xxxxxxxxxxxxxxxxxxxxxxx Here is his gethostbyname program #include <stdio.h> #include <stdlib.h> #include <sys/types.h> #include <sys/socket.h> #include <netinet/in.h> #include <netdb.h> #include <time.h> main(int argc, char**argv){ struct hostent h, *p; long now; if (argc!=2) { printf("Usage: %s hostname\n", argv[0]); exit(-1); } time(&now); printf("starting : gethostbyname(%s) at %s", argv[0], ctime(&now)); p=gethostbyname(argv[1]); if (p!=NULL) { printf("HOST=>%s\n",p->h_name); } else { printf("UNKNOWN HOST\n"); } time(&now); printf("ended : gethostbyname(%s) at %s", argv[0], ctime(&now)); } Which produces output like this when run from my non-root userid, jasper@ar0141e1> gethostbyname ar0141e11.patent.ibm.com starting : gethostbyname(/ips/bin/gethostbyname) at Tue Feb 9 12:30:38 1999 HOST=>ar0141e1.patent.ibm.com ended : gethostbyname(/ips/bin/gethostbyname) at Tue Feb 9 12:33:08 1999 Clayton compared the version of xlC.rte on ar0141e1 (3.1.4.8) versus what it was on the machine it was compiled on, which was Kin's machine spartan, which had 3.1.4.4. Clayton pointed me to http://ftp.software.ibm.com/cgi-bin/support/rs6000.support/downloads where I was able to search on PTF U453695 & download 5 installp images, xlC.C++.heapview.3.1.4.6 xlC.C++.iclui.lib.3.1.4.2 xlC.C++.iclui.samples.3.1.4.1 xlC.C++.lib.3.1.4.6 xlC.rte.3.1.4.8 I got Kin's permission and installed them on his spartan machine, but it didn't do any good. Thanks to a hint from Glenn Deen, namely that if something takes 2.5 minutes to run, then you should be thinking TCP/IP timeout, I was able to figure it out. According to Glenn, TCP takes 2 minutes to time out, after which, for DNS at least, it tries UDP, which times out after 30 seconds (or so). While it's hung, said Glenn, do a netstat -A (or better yet, an lsof -i -n, then you see the running process), and you might see attempted DNS activity.
What I saw were lines like this raj2 54092 jasper 3uc inet 0x01e65100 0t0 UDP localhost:4916->localhost:domain (I had named the program raj2 and was running it as jasper) This said that yes, the program was hung up trying to do domain(=53) I.P. traffic to itself. Turned out that the permissions on /etc/resolv.conf were 600, so non-root users couldn't see it, which resulted (evidently) in programs trying to talk to/connect with a named daemon on the local machine, which wasn't there. After 2 minutes, the TCP attempt timed out, then after another 20-30 seconds, the UDP attempt failed. chmod-ing /etc/resolv.conf to 644, fixed everything. Now the only question remaining is how did the permissions get changed? It evidently happened when I applied service, but that's kinda bizarre. ============================================================================== PMR # 72971,49R x For the last 2 days, when ar0073e0 runs the nightly 2/11/99 x ADSM incremental backups, there are more than 22,000 2:15 x lines in the log file saying it's "Expiring" a bunch --who-- x of files in different --mail address-- x /.../patent.ibm.com/fs/.rw/patent/cache directories. --phone-- x First of all, we don't want ADSM to backup anything xxxxxxxxxxxxxxxxxxxxxxx in DSF, and it doesn't appear it really is. If you look at when ADSM thinks any of these files got backed up, it says it got backed up on the previous ADSM incremental backup. But if you look at the console log from the previous incremental backup, you don't see any lines saying it backed up the file. Very puzzling. We were running ADSM version 3.1.0.1 (aka fileset level 3.1.20.1), and at Darrel's suggestion, I quickly updated to the latest version, 3.1.0.6 (fileset level 3.1.20.6), ftp'd from shasta.sanjose.ibm.com, file /adsm/fixes/v3r1/aix/4.2/U461721.adsm.client.aix42. It didn't change anything. Each incremental backup run says it's expiring thousands of DFS files, purportedly backed up by the previous run, yet the previous run doesn't say it backed it up. 2/15/99 Greg called to ask a few questions. I explained things to Greg Keys him and sent him my dsm.opt, dsm.sys, inclexcl.dsm, 6-0895 dsmc.incr.backup.990209 & .990210, and the output of a Level 2 Rep dsmc q b -inactive command for one of those files. I tar'd and compressed the whole thing & mailed it to him. 3/25/99 (Thursday) Since I hadn't heard from these guys in 5 weeks or so, Greg Keys I called the Support Center (big mistake!) to get a call 6-0895 back. After a half hour on hold, they said they would leave a msg for Greg. 3/29/99 (Monday) Greg called back finally and asked for the output from Greg Keys df & ls -l / commands, as well as a "service trace", which 6-0895 is accomplished by putting these two lines to dsm.opt, gkeys@us.ibm.com TRACEFLAGS SERVICE TRACEFILE /tmp/dsmc.incremental.service.trace 5/12/99 (Wednesday) I got a call from a Carolyn, asking me to call Greg Greg Keys back. He hasn't done anything since we last talked on 6-0895 3/29. Although my notes don't say so, I *did* generate gkeys@us.ibm.com the "service trace" as he asked and evidently, ftp'd it to index.storsys.ibm.com, and put a tar file into the adsm/incoming directory, with the PMR # as part of the file name. Anyway, Greg had it. Greg now is saying he wants another service trace, just to compare the two. I created a PMR72971,49R.service.trace.tar.Z containing our dsmc.incremental.service.trace dsm.opt dsmerror.log dsm.sys inclexcl.dsm and put it out there for him. 
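For next time, the trace-and-package step boils down to this (a sketch; it assumes the dsm option files live in /usr/lpp/adsm/bin as they do on ar0073e0, and that TRACEFILE in dsm.opt pointed the trace at /tmp as shown above):

   # Bundle the ADSM service trace plus the client config files for Greg.
   cd /usr/lpp/adsm/bin
   cp /tmp/dsmc.incremental.service.trace .
   tar cvf PMR72971,49R.service.trace.tar \
       dsmc.incremental.service.trace dsm.opt dsmerror.log dsm.sys inclexcl.dsm
   compress PMR72971,49R.service.trace.tar
   rftp index.storsys.ibm.com       # anonymous login
       cd adsm/incoming
       bin
       put PMR72971,49R.service.trace.tar.Z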
Just for the record, I don't see the nightly ADSM backups displaying the offending behaviour now, nor (I believe) have we seen it since February. But the two dsmc.incr.backup.990209 & .990210 runs I sent in on 2/15 do definitely document what I described. 5/19/99 I got another phone call saying she didn't see the Carolyn PMR72971,49R.service.trace.tar.Z file I sent Greg last 6-0958 week, so I posted it to index.storsys.ibm.com in the adsm/incoming directory again for her. I had kept the file in ar0073e0's /usr/lpp/adsm/bin directory, so it wasn't too much trouble. 5/27/99 Carolyn called again. Since this phenomenon isn't Carolyn affecting us any more and they really don't have enough 6-0958 information to guess what went wrong, I agreed they can close this problem. ============================================================================== PMR # 74709,49R x We have a "hostname-leak" using the URL 2/25/99 x http://www.patent.ibm.com/promo/vendors/smartpat, 4:45 x in that the URL the browser displays after the page Diane Swan x is loaded, shows the SP/2 node name of the server --mail address-- x that serviced your request. It doesn't do this all 8-526-1933 x the time and it's not consistent. Sometimes you see xxxxxxxxxxxxxxxxxxxxxxx http://www.patent.ibm.com/... as you should, other times you see http://as0112e0.patent.ibm.com. We're running ICSS 4.2.1.7 on the SP/2 nodes, but I just upgraded to ICSS 4.2.1.10 on as0116 to test that, and it does the same thing. The page source is at /dfs/prod/ipn/htdocs/promo/vendors/smartpat/index.html, which is a link to smartpatent-main.html.en in the same directory. I got a note from Diane saying to run the httpd daemon with the -vv option, which will generate a trace at the console. When I tried this, the web server refused to "leak" the node name, in other words, it worked fine. Frustrating! I left a message with Diane saying I can no longer recreate this behaviour, so she's leaving the PMR open for a couple of weeks before calling back, and she'll probably close it then. ============================================================================== PMR # 81417,49R x I'm calling on behalf of Matt Morris in Raleigh, 3/10/99 x who works at a customer site, Rational, and is 9:27 x having troubles with a 320 booting between AIX 4.1 Sandra x and AIX 4.3. Matt's office phone is (919) 845-3236 --mail address-- x and his pager is 1-888-857-2288 --phone-- x xxxxxxxxxxxxxxxxxxxxxxx Here is the sequence of events, - AIX 4.1.3 is installed ok on hdisk2/3. - AIX 4.3.1 is installed on hdisk0/1. - Bootlist is switched to hdisk0 & AIX 4.3 booted ok. - Bootlist is switched back to hdisk2 & AIX 4.1 is booted ok. - Bootlist is switched back to hdisk0 & the machine won't boot. It appears as if hdisk0 isn't bootable 'cause it sits at LED 223-229, I got Matt & Sandra on a conference call and Sandra talked Matt through booting in service mode & rewriting the AIX 4.3 boot image on hdisk0. After that, Matt was able to boot AIX 4.3 ok. We left it at that, but if Matt has further problems, he now knows how to call back to the Support Center, reference this call & talk to Sandra again. ============================================================================== PMR # 81783,49R x Henry tells me the web server on 206 dies every so 3/12/99 x often. 
The only symptom is in the error log, where 3:00 x it says LABEL=SRC, IDENTIFIER=E18E984F, Class=S, McCloskey, Sharon x Type=PERM, Resource Name=SRC, --mail address-- x Description=SOFTWARE PROGRAM ERROR, Symptom Code=0, 8-444-4866 x Software Error Code=-9017, Error Code=1536, xxxxxxxxxxxxxxxxxxxxxxx Detecting Module='srchevn.c'@line:'288', and Failing Module=httpd. This is the 4.6.2.3 version of the Lotus Domino Go Webserver. 4/06/99 Nothing was ever done and this isn't as big a problem as we at first perceived, so I called Sharon and had her cancel the PMR.
==============================================================================
PMR # 81789,49R x I also called in to ask about the other problems 3/12/99 x we're having with our Net Commerce server, as0206. 3:00 x There are 2, both with the Net Commerce "server" Mike Karamanolis x binary (/usr/lpp/NetCommerce3/bin/server). coachk@us.ibm.com x 8-441-4595 x As background, /etc/rc.local starts the xxxxxxxxxxxxxxxxxxxxxxx /usr/lpp/NetCommerce3/bin/srvrctrl process, which monitors & babysits the above "server" process and a back_server process, which we don't care about right now. All three of these guys have configuration files in the /usr/lpp/internet/server_root/pub/ directory and log files in /arc/NetCommerce3/instance/patents/logs. I forget where I found the core file, but I saved the core file for when this server process died, along with its log file. Its PID was 6954. See the 2 *6954* files in /tmp, server.core.6954 & ncommerce19990304145305_6954.log. Mike said he wanted the dump & core file, so I sent it to him.
==============================================================================
PMR # 84648,49R x Called in again (see PMR # 67742,49R above) about the 4/05/99 x problem that keeps plaguing me on my 43P-140, jasper --:-- x machine. Besides the apparent memory leak in dfsbind --who-- x (see Item # JP9021 above), possibly related is another --mail address-- x problem, namely --phone-- x dfs: lost contact with server 9.1.24.168 xxxxxxxxxxxxxxxxxxxxxxx in cell: almaden.ibm.com 9.1.24.168 is almdfs4, which is where my home directory is. I've also seen it with another server, 9.1.24.165, when my home directory was on that server, namely almdfs1. Another symptom when things go to hell on jasper is that X starts temporarily hanging. That is, mouse movements/window focus gets delayed for a minute or so, but always eventually comes back. After spending an hour and a half with the Level 1 guy, we finally got things back to normal by logging off & logging back in with the "Command Line" login. I told him logging off was essentially the same amount of pain, as far as I was concerned, as rebooting. The Level 1 guy forwarded me to Chien Yu, who I am to call back when things go to hell again. 4/07/99 As predicted, this situation happened again. X was very 1:00:00 slow in changing focus when the mouse changed windows, and I Reggie Clinton couldn't cd (at times, this symptom was transient) to my DFS home directory. I called Chien Yu back and let him log on as root to poke around, and he didn't find anything unusual as far as performance things go (paging, file systems, CPU load, errors, etc.). He gave up and called the DCE guys back in. Reggie poked around some more and still didn't find anything out of the ordinary (but then again, Reggie didn't seem that good - see my dealings with him for past PMR's).
What we did finally learn was that things got "fixed" after doing a kdestroy & re-dce_login-ing back in, at least for that one window. Doing a kinit wasn't enough. Makes me wonder if the default 30-hour token lifetime had anything to do with it. It did take just under 2 days for this problem to manifest itself. And I do remember Jim Hafner & Rick Haeckel saying one needs to kinit in the "original" window, whatever that means when one logs on through the CDE desktop. The dce_login command created another credential file (see your KRB5CCNAME environment variable). The kdestroy command erased the old credential file, which is what you get when you create another aixterm window, say, so all old windows as well as new ones are screwed. And just copying the new credential file to the old didn't fix things, either. 4/09/99 Finally gave up fighting this and upgraded my machine to Robin Redden AIX 4.3.2 and DCE 2.2.0.4, which just became available this 8-678-1542 week (it wasn't there on Monday). Also got a call from Robin and she suggested changing root's ulimits, making root's stanza in /etc/security/limits, root: fsize = -1 core = -1 cpu = -1 data = -1 rss = -1 stack = -1 nofiles = -1 (On AIX 4.3 only, not on AIX 4.2.1) I used this command to make the changes on my machine, for i in fsize core cpu data rss stack nofiles do chsec -f /etc/security/limits -s root -a $i=-1 done omitting the nofiles for the Patent Server site. 4/15/99 Am experiencing the same problems, so I called the Support Reggie Clinton Center yet again (sigh!). Reggie is going to poke around again. Reggie talked to Robin, who said she won't be able to learn anything by poking around my system (sounds like she doesn't have the time), so Reggie's gonna poke around. The symptoms include .... - Certain X events are really slow, like focus changing in a timely manner (I have focus set to mouse). System upgraded to DCE 2.2.0.5, which Dale had on CD. 4/26/99 This seemed to help for a bit, but I had to shutdown on Reggie Clinton Friday 4/16 & reboot 4/19 for a power down, and my system is acting strange again on Monday, 4/26, so it only lasted a week. I'm getting very slow X response & "A remote host did not respond within the timeout period" messages. 4/27/99 Reggie called back and I gave him access again to root on my jasper machine to poke around. I don't know if he really ever did, much less learned, anything. 4/28/99 Meanwhile, I got an AIX update CD that updated primarily libc.a, per the PMR related to this one, 71121 above. I installed that and rebooted, cleaning up the "A remote host did not respond within the timeout period" error messages I've been getting the last few days. 5/17/99 It's been a few weeks and everything was running normally with AIX 4.3.2 and DCE 2.2.0.5, but the infamous "dfs: lost contact with server 9.1.24.168 in cell: almaden.ibm.com" messages are back (9.1.24.168 = almdfs4, which is where my home directory is). I called the Support Center back, who left a callback with Reggie. (sigh) 5/17/99 Reggie called back and pointed me to a DCE/DFS Level 3 web 14:10 page at http://guero.austin.ibm.com. Following the directions on a DFS tracing page a few links down, http://guero.austin.ibm.com/dfs-support/dfs-traces/dfstracing.html#Tracing DFS Client Failure (that's a bizarre URL, with blanks in it).
It boils down to doing - dfstrace setset -set cm -active - cm checkf - cd - dfstrace dump -file /tmp/dfstrace.out - dfstrace setset -set cm -dormant Then I mailed him that trace file via a mail l3dce@us.ibm.com < /tmp/dfstrace.out 5/27/99 Reggie says he's got some instructions to follow the next 15:20 time this situation happens. He's also got somebody on deck Reggie that is willing to login when it next happens, and poke around herself. It's about time somebody did that! 6/30/99 The problem came back again. I noticed this morning that Liz Hughes things were not as snappy, so I started to plow through the 8-678-3483 things they suggested I do in the 5/27/99 note from Chris Dodson. When I got to the part of the note that asks the question, "If he runs these commands, does the problem clear up?", those commands being dfs.clean dfsbind to stop dfsbind & rc.dfs dfsbind to start it up again, X started to *really* misbehave, not accepting keystrokes and eventually not even following the mouse. I could telnet in and saw the dfs.clean process with a spawned "dcecp -s /usr/bin/show.cfg dfs" that wouldn't go away, even after I killed -9 it. I called the Support Center and Reggie Clinton is no longer there, but when I didn't get a call back from level 2 for a half hour or so, I called Liz directly. She spent a couple of hours on my system as root, poking around, and was able to get dfsbind started up again. From a phone msg (so these details are unreliable), she started up another instance of dfsbind from the command line, let it start, which freed up the hung dcecp. She then shut it down & restarted it normally (which I guess means, rc.dfs dfsbind??). Anyway, after she did her thing, I was able to use my userid without logging off or back on, or rebooting. Everything appeared normal. I'm not sure what exactly Liz learned from all this. 7/19/99 Just for the record, my machine started to hang up again, 9:00 getting the "A remote host did not respond within the timeout period." error messages. I restarted dfsbind & it cleared up. ============================================================================== PMR # 86271,49R x The find command has a -fstype clause, which is 4/16/99 x suppose to use the file systems defined in /etc/vfs. 11:10 x This works for -fstype afs, but doesn't work for dfs. Vani Ramagiri x Seems like it should. vani@austin.ibm.com x 8-523-4168 x After 25 minutes on the phone, the guy said he'd xxxxxxxxxxxxxxxxxxxxxxx investigate it. What a waste of my time for such a trivial problem. Further investigation yieled that this did work fine under AIX 4.1 & 4.2, but fails only under AIX 4.3. Vani called and she also needed some guidance/convincing that this was a problem, but she finally came around. 4/30/99 In the PMR, Vani has written, "While I was looking at the defects that were fixed after 4.2., I came across defects 246300 for bos43D and 25289 for bos42G for - find`s option `-fstype nfs` does not work w/ NFS v3. I was checking to see if the fix for this defect might have caused this behavior change in 4.3.x. * Quoting from the material from this defect 246300 and 25289: "Right now the only valid options to -fstype argument are jfs and nfs Sometime ago a fix was dropped into find, so that -fstype will consider only 'jfs' and 'nfs' as the valid options. This fix was made so that the backup command will skip the non-jfs filesystems when doing a mksysb." * I talked to the developer who fixed this defect to confirm that it is not a defect that AIX4.3.x is returning zero for #find . 
-fstype dfs -print | wc He confirmed that this is working as designed." I don't know if this is a reasonable argument or not, so I put in an append in the DFSADMIN FORUM on IBMUNIX to get their opinion, before I argue the point with the Support Center. ============================================================================== PMR # 87091,49R x ar0071e0 is dying every few minutes, 7 in the last 04/23/99 x hour and a half, starting at 1:30 am on 4/23/99. 3:00 a.m. (!!) x This system has been rock solid ever since the sight Jessie Ball x came up, except for the last month, we've had 1363 --mail address-- x SCSI RAID adapter errors on scraid0 over the last 20 --phone-- x days. So at 9pm on Wednesday, 4/21/99, Anthony xxxxxxxxxxxxxxxxxxxxxxx DeMott & I replaced the scraid0 adapter. Except for (an unrelated?) problem with not being able to install the latest microcode, which we called in & somebody is suppose to be looking at, everything seemed fine. We stopped getting the 60 errors/day. Reference the old Hardware Reference # 31W3HHV, called in on 4/6/99. Then, 25 hours after running fine on it, ar0071e0 started to crash every 10-15 minutes. Here's the output from an errpt -N CMDCRASH command. IDENTIFIER TIMESTAMP T C RESOURCE_NAME DESCRIPTION 3573A829 0423042699 U S CMDCRASH SYSTEM DUMP 3573A829 0423041399 U S CMDCRASH SYSTEM DUMP 3573A829 0423030399 U S CMDCRASH SYSTEM DUMP 3573A829 0423025099 U S CMDCRASH SYSTEM DUMP 3573A829 0423023699 U S CMDCRASH SYSTEM DUMP 3573A829 0423022399 U S CMDCRASH SYSTEM DUMP 3573A829 0423021099 U S CMDCRASH SYSTEM DUMP 3573A829 0423015799 U S CMDCRASH SYSTEM DUMP 3573A829 0423012999 U S CMDCRASH SYSTEM DUMP I called this into the Software Support to get them to agree that it's probably hardware related. The details on each dump were the same, csa:2ff3b400 [dcelfs.ext:MarkFrags] 200 [dcelfs.ext:MarkFrags] 58 [dcelfs.ext:MarkBlock] 50 [dcelfs.ext:epib_Free] 274 [dcelfs.ext:epia_Truncate] 734 [dcelfs.ext:epif_ChangeLink] 264 [dcelfs.ext:vnm_Inactive] 228 SYMPTOM CODE PIDS/576565500 LVLS/420 PCSS/SPI1 MS/700 FLDS/[dcelfs.e VALU/7c8e7008 FLDS/[dcelfs.e VALU/50 They agreed. It seems that DFS is writing into/onto some disk, most likely that SCSI RAID drive on scraid0. I called this into Hardware Support. I don't know what they're gonna replace (I suggested backplane), but that's their problem. Hardware Reference # 30YPP62, Meanwhile, I commented out the starting up of DCE from /etc/inittab, so I can get a stable system, then carefully commented out /dev/scsi0lv from /var/dce/dfs/dfstab and then brought DCE back up. I had to run salvage on each of the other aggregates, most notably, scsi1, which had some critical filesets that were unreplicated (oops!). 4/23/99 We spent the day trying to get the new SCSI adapter working, without success. We finally reinstalled the old adapter, and returned to the state of getting dozens of error messages per day. At least the machine stayed up. Anthony DeMott is pursuing with his support people, the problem of getting the new adapter to work. 4/26/99 I called the Support Center to ask them to look at the CSA dump we had generated, 'cause hardware support wanted us to. Meanwhile, the Software support person noticed that we were one level down for our device support. We had devices.pci.14102e00.rte 4.2.1.5 and devices.pci.14102e00.diag 4.2.1.4, and there's a .6 & .5 respectively. I installed those fixes, and all the newer fixes I had from Fixdist, and set the system to be rebooted at 1:01 am tomorrow. 
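A quick way to keep an eye on the error rate while we fiddle with adapters: errpt -N filters by resource name (same as the CMDCRASH query above) and the timestamp's first four digits are the month and day, so a per-day tally is a one-liner (a sketch):

   # Count scraid0 errors per day (MMDD) from the error log.
   errpt -N scraid0 | awk 'NR > 1 { print substr($2, 1, 4) }' | sort | uniq -c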
4/27/99 Roxanne called and we walked through the Roxanne Merizalde script crash.out roxannem@austin.ibm.com crash csa-dump-1 8-523-4141 stat status cpu symptom p -m sequence to get a quick snapshot of the CSA dump. One problem we had was since I had installed the latest service last night, the crash command didn't like the dump, when run on ar0071e0, the machine that caused the dump. But since ar0072e0 is identical to what ar0071e0 was yesterday, I moved the dump file to ar0072e0 & ran the crash command there. Seemed to work fine. I mailed both the crash.out script file and a 1.9 MB errpt -a output to her. She called back a few minutes later & said that yes, she wanted the dump, but to run "snap" on ar0072e0, since the nucleus had changed on ar0071e0. The snap command is /usr/sbin/snap, part of the bos.rte.serv_aid fileset. She said to snap -gfkD < -d path if /tmp isn't big enough> -c Then mv /dfscache/snap.tar.Z PMR87091.49R.tar.Z and rftp it to testcase.boulder.ibm.com. What I had to do was to modify /usr/sbin/snap, cause it did a sysdumpdev -L to see where & what the previous dump was, and since it happened on ar0071e0, I just cut & pasted the output of 71's sysdump -L command to 72's snap cmd. Anyway, I got it to work. Chris Kime Chris & Liz called to report what they learned from the dump CKIME@AUSTIN I had sent in, and the bottom line was, they can't help point 8-678-2268 to any specific hardware. DFS was trying to clean out some and entry in their playback log, and was looking at inconsistent Liz Hughes data. Whether that was due to a file being erased (in my mind, 8-678-3483 most likely) or file being closed, they can't say. They also can't tell for sure when that data was written to disk. Was it written just a few minutes ago or 2 days ago? They can't tell. The tidbits they did tell me were 1) When running salvage, use the -verbose option & pipe the output to a file (or tee it), and also use the -salvage option, which would get the salvage command to do its most extensive checking possible. E.G. salvage -aggregate scsi0 -salvage -verbose | tee /tmp/salvage.output 2) To detach an aggregate from DFS, dfsexport -aggregate scsi0 -detach -force They also suggested that I detach scsi0 from DFS & run salvage -salvage as shown above, before replacing the adapter again, just to insure we're starting at a fresh, known, good state. It's possible that there are "land mines" still waiting to be stepped on, and this will insure there's not. 6/01/99 Anthony DeMott & Bob Olmstead came in at 7pm this Tuesday 30YPP62 night to attempt again to resolve this hardware issue with ar0071e0. We couldn't get the new adapter (08L1319) with the latest microcode (98348) to accept the array configuration information on the drives. It always produced an error in the error log. We also could not load the new microcode on the old adapter. Sigh. 6/03/99 A new wrinkle in this problem. I noticed that ar0072e0 is 31VZN8Y getting the same SCSI_ARRAY_ERR1 & SCSI_ARRAY_ERR9 errors that we're getting on ar0071e0. I called it in as a separate hardware problem, but referenced the other 30YPP62 problem. I can't tell when the errors started 'cause the error log has wrapped, but since 5/14/99, ar0072e0 has gotten these number of errors. | ERR1 | ERR9 | Total --------+------+------+------- scraid0 | 27 | 195 | 222 scraid1 | 251 | 987 | 1238 Just for the record, ar0071e0 = 7025 (F40) #05506. ar0072e0 = 7025 (F40) #05507. 
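Putting Chris & Liz's two tips together, a clean pre-swap pass over an aggregate looks roughly like this (a sketch using their commands, with scsi0 as the example aggregate):

   # Take the aggregate away from DFS, then run the most thorough salvage it has.
   dfsexport -aggregate scsi0 -detach -force
   salvage -aggregate scsi0 -salvage -verbose | tee /tmp/salvage.scsi0.output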
==============================================================================
PMR # 88666,49R x I have a couple of problems with the upgrade to the 5/05/99 x latest version of eNetwork Firewall on the Patent 11:45 x Server's socks/ssh/mail gateway machine, ar0135e0/1. Pat Krohn x I was at version 3.1.1.2 of the Firewall code, and I PKROHN@us.ibm.com x upgraded to eNetwork Firewall 3.2.3.0, which is the 8-444-5847 (Raleigh) x latest I could find. xxxxxxxxxxxxxxxxxxxxxxx The first anomaly I saw was when I went under smitty, - IBM e-Network Firewall for AIX - System Administration - Secure Interface - List (which just runs the fwlistadptr command), and got the following error messages, 192.168.56.135 Unrecognized Interface Pager message cannot be greater than carrier maximum message length. If you go under the GUI (by typing fwconfig), under - System Administration - Interfaces, it shows the 2 interfaces correctly, IP Address Type Name =============== ==================== ==== 204.146.135.135 Non-Secure Interface en0 192.168.56.135 Secure Interface en1 I'm not sure this in itself is affecting me, but ... The second anomaly I saw was more severe. Incoming ssh packets from the Internet are being filtered. First of all, there was a bug in this build of eNetwork Firewall where the wrong message catalog was used, so error messages and such aren't correct. Pat had me download from testcase.boulder.ibm.com the following two files to fix this problem, both from the /aix/fromibm directory, cat_323.tar.Z & cat_323.readme.txt. Essentially, it replaced a bunch of binaries to use the correct msg numbers. This fixed the fwlistadptr problem described above and fixed the wrong messages in the /var/adm/sng/logs/fwreg_l4.log file, but didn't fix the underlying problem, that some packets are being denied where they weren't before. As proof of this, an nsa scan I did from ar0176e0/1 to the 204.146.135.135 side on Saturday did indeed show that the ports that should be open were open. That is, ssh (22), SMTP (25), sshfail (2222), http proxy (8080). But when I ran the same command now, nothing shows open. I rebooted ar0135e0/1 and all is back to normal again, i.e. everything's working as it should, which is the same thing I saw on Saturday. It appears that things only go wrong after a day or so??? I've implemented a check on ar0176e0/1 to hourly go out and see if port 22 is still alive. See root's crontab & the /tmp/ssh.135.log file.
==============================================================================
PMR # 90599,49R x On only one machine, ar0081e0, one of the J50 5/19/99 x Verity servers, for some reason, every week or so, 15:15 x we lose the ability to "see" the /dfs/admin Julian Owens x directory, or anything (of course) underneath it. --mail address-- x I'm not sure whether it's only the root userid or not, but 8-421-7141 x I know a reboot clears things up, that is, we can xxxxxxxxxxxxxxxxxxxxxxx see /dfs/admin again. This affects the periodic recycles we do to the vserverprod server, as that thing needs /dfs/admin/bin/subsysfuncs.sh to start up correctly. The permissions to /dfs/admin & /dfs/admin/bin are the same, namely {mask_obj rwxcid} {user_obj rwxcid} {user cell_admin rwxcid} {group_obj r-x---} {group admin rwxcid} {other_obj r-x---} where root falls under the other_obj permissions, and thus has rx to the directory. The permissions to /dfs/admin/bin/subsysfuncs.sh are also correct - root has r, due to other_obj.
When I saw this situation on 5/19, it cleared itself up in a couple of hours or so (I wasn't watching it that closely), and I didn't have to reboot. ar0081e0 is running AIX 4.2.1.0 & DCE 2.1.0.28,which is the latest. Julian says that when this happens again, do a kill -30 to the dfsbind process to create that /var/dce/dfs/adm/icl.bind file, and send that to him. I put in the following crontab entry to monitor it, 1,6,11,16,21,26,31,36,41,46,51,56 * * * * /monitor_dfsbind where /monitor_dfsbind is #!/bin/ksh file=/tmp/monitor_dfs if [ ! -x /dfs/admin/bin/subsysfuncs.sh ] then if [ ! -w /var/dce/dfs/adm/icl.bind.PMR90599,49R ] then # Find the dfsbind PID and send it a kill -30 signal to create icl.bind. dfsbind_PID=$(ps -ef | grep -v grep | grep -v monitor_dfsbind | grep dfsbind | awk '{print $2} ') kill -30 $dfsbind_PID # The dfsbind process takes a second or so to create icl.bind, so # wait around for it. sleep 3 mv /var/dce/dfs/adm/icl.bind /var/dce/dfs/adm/icl.bind.PMR90599,49R fi echo "$(date) Can't see /dfs/admin/bin/subsysfuncs.sh" >> $file fi So we'll see how often it happens that we don't see it, and how long the situation lasts. 5/21/99 Julian called to see if this trap I set up had tripped yet. It hadn't. 5/26/99 Julian called again. Nope, still not yet. 6/07/99 My /monitor_dfsbind script finally triggered at 12:21 on Saturday, 9:00 June 5, 1999. I called the Support Center back and found out that Julian had closed this PMR on 6/1/99, the day I was out on vacation at Yosemite. I reopened it, left a phone message with Julian, and e-mailed Julian the 1800 line /var/dce/dfs/adm/icl.bind.PMR90599,49R. 10:45 Julian wanted me to package up the icl.bind file, tarring and compressing it up into a 90599.49R.tar.Z file, and drop it off at testcase.boulder.ibm.com. I did this with a cd /var/dce/dfs/adm tar cvf 90599.49R.tar icl.bind.PMR90599,49R compress 90599.49R.tar command & rftp'd 90599.49R.tar.Z over via rftp testcase.boulder.ibm.com cd /aix bin put 90599.49R.tar.Z Doing some debugging of my own, the only non-zero "exit code" I see in the generated icl.bind file is time 897.064064, pid 7: do_auth_request: exit code:382312679 time 897.064103, pid 7: ProcessRequest: took 0 seconds, exit code:382312679 and a "dce_err 382312679" command says that error number means dce_err: 382312679: Authentication ticket expired (dce / rpc) Since this is root we're running from, the dced daemon is the one that's suppose to keep root authenticated as "self". 2:45 Coincidently, this same situation happened on one of our other Verity servers, ar0078e0. I did another kill -30 signal to the dfsbind process and renamed the icl.bind file to /var/dce/dfs/adm/icl.bind.PMR90599,49R.2 and like earlier, rftp'd it to testcase.boulder.ibm.com. 6/08/99 Reggie called and said he wanted me to collect more 11:00 debugging/dumping info the next time this happens (sigh). Reggie Clinton He pointed me to this guy's DCE/DFS debugging web page at 8-444-5257 http://guero.austin.ibm.com/dfs-support/dfs-traces/icltraces.html. I enhanced my /monitor_dfsbind script & copied it to 78 to also run there. When looking at it, I saw the following errors pid 21026: ERR: dfs: fileset (0,,28) error, code 691089410, on server : 192.168.56.71, in cell: patent.ibm.com, . According to dce_err, 691089410 is fileset not present and exported on server: already deleted/moved (dfs / xvl) I don't know what PID 21026 is/was. It's not there now and it wasn't in the ps -ef command I did when I took this snapshot. Hmmmmm ... 
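Since I keep doing this tar / compress / rftp dance for testcase drop-offs,
here's a throwaway helper sketch.  It only strings together the same commands
used above, and it leaves the actual rftp step to do by hand.

#!/bin/ksh
# Package files for a testcase drop-off.
# Usage:  shipit 90599.49R file1 [file2 ...]
pmr=$1; shift
tar cvf /tmp/$pmr.tar "$@"
compress /tmp/$pmr.tar
echo "Now:  rftp testcase.boulder.ibm.com, cd /aix, bin, put /tmp/$pmr.tar.Z"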
Also, there was a bunch of pid 21026: INF: CM cm_write error 19 on fid 0.28.40071.2570365 Oh, I see. This log file entries are from a week ago when I had different DCE/DFS servers down. Pay attention to the time, which is the number of seconds since the last time printed, e.g. Current time: Tue Jun 1 12:09:52 1999 and is a counter that wraps at 1024 seconds. 6/09/99 Hey, alright! We didn't have to wait too long before 2:20 this happened again. It tripped on ar0081e0 on 6/9/99 Julian at 12:21 pm (what a coincidence this is the exact same time it tripped last Saturday. Hmmmmm. Anyway, I got a lot of good (I hope) debugging info for the Support Center. I called in and warned them I was dropping it off at testcase.boulder.ibm.com. I had to call it 90599.49R.3.tar.Z 'cause the original stuff I dropped off there 2 days ago, was still around. Here's what was in there ... -rw-r--r-- 0 0 25457 Jun 09 12:21:09 1999 PMR90599,49R.dfstrace -rw-r--r-- 0 0 27374 Jun 08 12:39:19 1999 PMR90599,49R.dfstrace.normal -rw-r--r-- 0 0 5169 Jun 09 12:21:06 1999 PMR90599,49R.general.info -rw-r--r-- 0 0 5169 Jun 08 12:39:17 1999 PMR90599,49R.general.info.normal -rw-r--r-- 0 0 112123 Jun 09 12:21:01 1999 PMR90599,49R.icl.bind -rw-r--r-- 0 0 111207 Jun 08 12:39:11 1999 PMR90599,49R.icl.bind.normal -rw-r--r-- 0 0 828 Jun 09 13:16:00 1999 PMR90599,49R.log The *.normal files are a "snapshot" of when things were running normally yesterday. In particular, the PMR90599,49R.general.info.normal file shows a normal klist command where the "Identity Info Expires" at 1999/06/09:12:08:07. The PMR90599,49R.log file shows that root was running authenticated at 12:16 (there's no log entry) and unauthenticated at 12:21. It stayed that way until sometime between 13:16:00 and 13:21. The dfstrace does have these lines in it time 608.410054, pid 0: Current time: Wed Jun 9 12:09:52 1999 time 608.410054, pid 19682: RPC: krpc_ReadHelper returns code 382312679 time 608.410179, pid 19682: RPC: sec_auth, done opcode 3, st 382312679 time 608.410232, pid 19682: dfs: ticket has expired; running unauthenticated. Which makes it appear that the "unauthenticaton" of root was before 12:11. Hmmmm. 6/15/99 Chris called and left a message asking to get access to the 3:00 machine when this situation happened again. I left a message Chris Kime back telling him sure, and that I would call him when it CKIME@AUSTIN happened next. 8-678-2268 ... 6/17/99 It struck ar0081e0 again, but Chris's phonemail message says 12:21 he's out of town 'till Tuesday. I tried getting ahold of Liz Hughes at T/L 8-678-3483, but she didn't answer either. Since this situation appears to "fix" itself within an hour and I didn't notice this 'till 30 minutes into the hour, there was no time to really get somebody else involved. We'll have to wait 'till next time. 6/17/99 Surprisingly, this situation has persisted on ar0081e0 for 15:11 the last 3 hours. I went ahead and called it into the Support Center to document it & to see if anybody else is available. I guess nobody was. They requeued a secondary call, but nobody called me back in time. It finally did clear itself up between 15:17:48 & 15:18:48 after almost 3 hours. Liz did call me eventually & I gave her access to root on ar0081e0 so she can poke around. I predict that ar0078e0 will be the next to go, and on Monday afternoon. I added additional debugging code to my /monitor_dfsbind script, so we'll see if it catches anything new. Liz said she would put a reminder to herself for Monday. 
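Note to self for next time:  the quick way to spot the expired-ticket
signature in whatever gets collected is to grep for the magic number and let
dce_err translate it.  Sketch only; the filenames are the PMR-prefixed ones
my script and I have been leaving in /var/dce/dfs/adm.

    # Which collected trace/icl files mention the "ticket expired" code?
    grep -l 382312679 /var/dce/dfs/adm/PMR90599,49R.* 2>/dev/null
    # And what that code means, straight from dce_err.
    dce_err 382312679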
7/27/99 Chris called back to see what was happening here. I haven't 12:00 gotten any notes from my /monitor_dfsbind script that I hacked Chris Kime together, but checking the /var/dce/dfs/adm/PMR90599,49R.log CKIME@AUSTIN file, I see that it has indeed happened many times. Hmmmmm, 8-678-2268 my script has a bug in it? Hmmm, there were 2 pieces of unsent mail in the mailq on 78. Oh, I see. My script keys off the /var/dce/dfs/adm/PMR90599,49R.icl.bind file. Once that file exists, I don't collect any more debugging code, nor send anymore mail. I erased /var/dce/dfs/adm/PMR90599,49R.icl.bind on both 78 & 81 to start fresh. ============================================================================== PMR # 90825,49R x On the CWS, when I do any iptrace command, I get 5/21/99 x this error message, 10:30 x iptrace: 0827-877 setsockopt -: There is not --who-- x enough buffer space for the requested socket --mail address-- x operation. --phone-- x I have no idea what "buffer space" this is referring xxxxxxxxxxxxxxxxxxxxxxx to, so I asked the Support Center. It doesn't appear to be mbufs. I was looking into a problem we were getting trying to install AIX on a J50 (ar0085e0). It appeared to not correctly install a piece of AIX networking code used to do NFS mounts. Also, when we tried reinstalling it, the CWS wasn't answering bootp requests, although I could see the bootpd daemon being launched on the CWS. It wasn't mbufs. It was sb_max. At the bottom of /etc/rc.net, was a bunch of no -o commands to tune the network options, including no -o sb_max=163840. The Support Center had me raise this to 1048576. We also saw, via an entstat en1 command, ETHERNET STATISTICS (en1) : Device Type: Ethernet High Performance LAN Adapter Hardware Address: 02:60:8c:f5:35:92 Elapsed Time: 20 days 18 hours 8 minutes 46 seconds Transmit Statistics: Receive Statistics: -------------------- ------------------- Packets: 11245370 Packets: 7854577 Bytes: 8017791451 Bytes: 1995424467 Interrupts: 11179718 Interrupts: 7817828 Transmit Errors: 0 Receive Errors: 238 Packets Dropped: 0 Packets Dropped: 0 Max Packets on S/W Transmit Queue: 512 Bad Packets: 0 S/W Transmit Queue Overflow: 138125 Current S/W+H/W Transmit Queue Length: 0 ... Note the Overflow count, 138,125 in 20 days. Also, the "Current .. Queue Length" line might have been interesting to look at yesterday. I went through the process of - ssh-ing into root on the en0 (SP/2) side of the CWS - ifconfig en1 down - ifconfig en1 detach - rmdev -l en1 - smitty ethernet - Adapter - Change / Show Characteristics of an Ethernet Adapter (or smitty chgenet) - ent1 And changed the - "TRANSMIT queue size" from 512 to 150. Beware of a discrepancy with what the help text (PF1) says for this field, namely that the "Valid values range from 20 through 150", and what the prompt (PF4) says, that the valid range is 20-2048. 150 is the limit. The guy didn't know, but guessed that perhaps the fact this was set to 512, which is over the 150 limit, was causing us grief. - "RECEIVE buffer pool size" from the default 37 to the max of 64. - mkdev -l en1 - route add default 192.168.56.251 - route add 9.0.0.0 192.168.56.252 We'll see if we get any more network flakiness. I'll check the overflow count again in a few days (if I remember). Three days later, on 5/24/99, the relevant lines of the entstat en1 cmd were Max Packets on S/W Transmit Queue: 100 Bad Packets: 0 S/W Transmit Queue Overflow: 0 Current S/W+H/W Transmit Queue Length: 0 so things look good. 
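So I don't have to remember to go look at the overflow counter by hand, here's
a crontab-able one-liner for the CWS.  Sketch; the history file name is just
an example.

    # Append a timestamped snapshot of the en1 transmit-queue counters.
    (date; entstat en1 | grep "Transmit Queue") >> /tmp/entstat.en1.history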
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - Changing /etc/rc.net to do a no -o sb_max=1048576 may also help a web server stalling problem we were having with the free side nodes on 6/25/99. We'll see ... I also changed /tftpboot/tuning.cust on the CWS from /usr/sbin/no -o sb_max=163840 to /usr/sbin/no -o sb_max=1048576 so that if/when the node ever gets rebuilt or re-cust'd, it'll keep the change. Hmmmm. After a reboot, my changes, which were still there in /etc/rc.net, didn't seem to take effect. Turns out I (also?) need them in the /tftpboot/tuning.cust file on each node. ============================================================================== PMR # 92405,49R x Apparently, since the power up after the Memorial 6/03/99 x Day power outage, the ssa0 array on ar0072e0 has been 10:30 x down. ssa0 = hdisk4 = pdisks 0-7 = 8 * 9.1 GB Wahid x A varyonvg ssa0vg command yields this error message, --mail address-- x PV Status: hdisk4 009001466be3b171 PVNOTFND --phone-- x 0516-013 varyonvg: The volume group cannot be xxxxxxxxxxxxxxxxxxxxxxx varied on because there are no good copies of the descriptor area. There is nothing in the error log for these disks. The Support Center had me do a lqueryvg -Atp hdisk4 command, but that gave 0516-024 lqueryvg: Unable to open physical volume. Either PV was not configured or could not be opened. Run diagnostics. A "Link Verification" (under diags) didn't show anything connected to the A1/A2 Adapter Ports. A physical inspection showed everything was powered up and connected correctly (solid green lights). Finally did rmdev -dl hdisk4 cfgmgr lspv hdisk4 showed up with a good PVID. importvg -y ssa0vg hdisk4 To get vg back online. dfsexport -all To get it DFS-exported again. And it all came back ok, including the pdisks showing up in the Link Verification test. It's interesting to note that just a software rmdev/cfgmgr was enough to get the drives back in the Link Verification test. Bruce and I didn't expect that would work. - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 6/07/99 That's funny. This exact same thing happened to ar0073e0, ssa0. 1:00 ar0073e0 ssa0 = hdisk3 = pdisks 0-7 = 8 * 9.1 GB, just like ssa0 was on ar0072e0, except. I had to go through the same scenario, namely rmdev -dl hdisk3 cfgmgr varyonvg ssa0vg dfsexport -all To get it back. For the record, as of 2:15 on 6/7/99, all 3 DFS servers are dfsexport-ing scsi0-1 and ssa0-2. But on 9/24/99, ar0071e0 got ssa3 added. ============================================================================== PMR # 92627,49R x On the Patent site, Becky remove her rsosa userid 6/04/99 x from the patent group, thereby removing her DCE 1:00 x account (remember, an account consists of 3 things, Mike x account = principal + primary group + organization --mail address-- x and removing any of the 3, removes the account). --phone-- x I recreated her account by getting into dcecp and xxxxxxxxxxxxxxxxxxxxxxx typing account create rsosa -group patent -organization none -mypwd Br1g1t -password new4now Note that account create is one of those commands that have to be done from within dcecp, or from a TCL script. You can't do it from the command line via dcecp -c. Anyway, both before and after I recreated her account, a principal show rsosa dcecp command only works half the time. The other half of the time, I get the message "Error: Registry object not found". Dale suspects the "CDS cache" on one of my DCE/CDS replicas is stale. 
He says there's a procedure to flush it and force it to resynch. What I got from the Support Center was a process to shut down DFS & DCE on each replica, erase files from a bunch of directories, then restart. Not exactly what I was hoping for. In poking around myself, I found the following commands under dcecp ... registry cat returns a list of replicas, e.g. /.../patent.ibm.com/subsys/dce/sec/ar0071e0 /.../patent.ibm.com/subsys/dce/sec/ar0072e0 /.../patent.ibm.com/subsys/dce/sec/master registry show -replica /.:/subsys/dce/sec/ar0071e returns a lot of information about each replica, a key field in this case being the lastupdtime. For ar0071e0, that line was {lastupdtime 1999-06-04-12:40:28.000-07:00I-----} For ar0072e0, that line was {lastupdtime 1999-05-31-16:46:03.000-07:00I-----} so I know it's ar0072e0 that's bad. Another easier way to determine this is to simply do a registry verify which "Returns a list of replicas not up-to-date with the master." Sure enough, it returns /.../patent.ibm.com/subsys/dce/sec/ar0072e Poking around some more, I found these incantations/options of registry show, help registry show returns -attributes Returns the attributes of a replica or master registry. -master Returns all propagation info kept by the master replica. -policies Returns the policies of a replica or master registry. -replica Returns propagation info kept by the specified replica and registry show -master shows {name /.../patent.ibm.com/subsys/dce/sec/ar0071e0} {type slave} {propstatus update} {lastupdtime 1999-06-04-12:40:28.000-07:00I-----} {lastupdseqsent 0.1747} {numupdtogo 0} {lastcommstatus 0} {name /.../patent.ibm.com/subsys/dce/sec/ar0072e0} {type slave} {propstatus update} {lastupdtime 1999-05-31-16:46:03.000-07:00I-----} {lastupdseqsent 0.1740} {numupdtogo 7} {lastcommstatus {Data integrity error (invalid password is specified)}} {name /.../patent.ibm.com/subsys/dce/sec/master} {type master} Note the "lastcommstatus" for ar0072e0. What does that mean? On ar0072e0, there are 164 lines in /dceconfig/vardce/svc/warning.log. # Lines That Say ------- --------------------------------------------------------- 85 Protocol version mismatch 58 in header > fragbuf data size 10 Caught signal 1. Exiting. 11 Cannot NSI unexport Object UUID for this dced server. On ar0071e0, there are 197 lines in /dceconfig/vardce/svc/warning.log. # Lines That Say ------- --------------------------------------------------------- 85 Protocol version mismatch 58 in header > fragbuf data size 21 Caught signal 1. Exiting. 22 Cannot NSI unexport Object UUID for this dced server. 2 Thread routine error Which are pretty similar, so it appears that nothing there explains why 72 is sick. 6/04/99 I called back ('cause it was getting late and I wanted this fixed, 4:30 that's why) and Julian talked me through removing the sec_srv DCE Julian service, via smitty dce on ar0072e0, Unconfigure DCE/DFS 2 local only unconfiguration for this machine and in the "COMPONENTS to Remove" field, select just sec_srv, which is the "Security Server (Replica)". Then on the master, ar0073e0, smitty dce Unconfigure DCE/DFS 3 admin only unconfiguration for another machine in the "COMPONENTS to Remove" field, select just sec_srv again, in the "Client Machine DCE HOSTNAME" field, select ar0072e0. Julian then had me check smitty's work by 3 series of commands dcecp -c rpcgroup list /.:/sec /.../patent.ibm.com/subsys/dce/sec/master /.../patent.ibm.com/subsys/dce/sec/ar0071e0 "Good," he said, "just the two. ar0072e0 isn't there." 
Then it was dcecp -c cdsli -cworld | grep /.:/subsys/dce/sec d /.:/subsys/dce/sec o /.:/subsys/dce/sec/ar0071e0 o /.:/subsys/dce/sec/master "Good," again. dcecp -c registry cat /.../patent.ibm.com/subsys/dce/sec/ar0071e0 /.../patent.ibm.com/subsys/dce/sec/master "Good," yet again. Then, to add the replica back, on the server, ar0072e0, smitty dce Configure DCE/DFS Configure DCE/DFS Servers SECURITY Server 2 secondary Accept all the defaults and go for it. Checking it back on ar0073e0 via dcecp -c registry verify and dcecp -c registry show -master showed that everything was back in sequence. Great. ============================================================================== PMR # 05370,49R x ERS (Emergency Response Service) evidently has 6/23/99 x changed the NSA code they use for their monthly 4:00 x scans, 'cause now eagle is highlighted in their --who-- x 6/22/99 ERS scan. Note also some other servers are --mail address-- x highlighted as well, a couple of alphaworks servers, --phone-- x our patent socks/mail/ssh gateway machine xxxxxxxxxxxxxxxxxxxxxxx ar0135e0/1, and www2.patents.ibm.com. I gotta go check those out separately. Here is ERS's claim ... The following new problems were reported. o 198.4.83.38 (eagle.almaden.ibm.com): [low] HTTP-Proxy is active on TCP port 80. o 198.4.83.81 (jcentral.alphaworks.ibm.com): [low] HTTP-Proxy is active on TCP port 80. o 198.4.83.82 (jcentral2.alphaworks.ibm.com): [low] HTTP-Proxy is active on TCP port 80. o 204.146.135.135: [low] HTTP-Proxy is active on TCP port 8080. o 204.146.135.162: [low] HTTP-Proxy is active on TCP port 80. What is happening is, if you set your browser to use www as your proxy server (who would do that, first of all?), and then try to get to some web site, say www.apple.com, eagle would answer with whatever it is you asked for, from www, not from Apple. I.E. http://www.apple.com gets you the same thing as http://www, that is, Almaden's home page. Tony did some tracing and determined that the only difference between a normal GET request and a proxified GET request, is the GET statement. Normal would be GET / HTTP/1.0 proxified would be GET /www.apple.com HTTP/1.0 What's really wrong is the web server isn't paying attention to the server piece of the GET request. Sounds like a bug in the Apache web server to me, else a configuration problem on my end, so I thought I'd call it in to the Support Center to ask them. Meanwhile, Tony pushed back on ERS to ask them what we can do about this. Our servers aren't really acting as proxy servers and there's no documentation on what we can do to prevent it from failing their test. They consider a server to "fail" unless it returns an error code of 200 or 407. Alan Rich Alan called me too early this morning, and we've missed each 6/23/99 other twice each, so I sent him a note explaining the situation 5:10 am to him. 8-526-0362 ============================================================================== PMR # 06458,49R x The performance of all our free side web servers, 6/30/99 x as0107 & as0109-15, as well as our two gold servers, 15:30 x as0103 & as0104. The symptoms of the problem are Daniel / John x various. The first thing we see when this problem --mail address-- x hits, is a large number of I.P. connections that --phone-- x don't go away. They appear to be in various stages xxxxxxxxxxxxxxxxxxxxxxx of termination. If you do a netstat -An command, you see the normal 'ESTABLISHED', but you also see a bunch of CLOSE_WAIT, LAST_ACK, FIN_WAIT_1, FIN_WAIT_2 & TIME_WAIT states. 
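A quick way to tally how many connections are sitting in each state when this
hits (sketch; it assumes the state is the last column of the tcp lines in the
netstat -An output, which is how it looks here):

    # Count TCP connections per state, busiest states first.
    netstat -An | awk '/tcp/ {print $NF}' | sort | uniq -c | sort -rn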
John gave me a little education on TCP socket closures. He said there were 2 types of closing, active and passive. Active is when the application (the web server in this case) does the close. Passive is when the client does the close. In both, here is what happens and the states the connection is in; Who Does What Socket State Afterwards ---------------------------- ----------------------- Active: Server Decides to Terminate Connection & Sends a FIN FIN_WAIT_1 Client Sends ACK FIN_WAIT_2 Client Sends a FIN (???) TIME_WAIT Server Sends ACK (???) Closed Passive: Client Decides to Terminate Connection & Sends a FIN. Server sees FIN, sends FIN to application, application (or is it AIX??) sends ACK. CLOSE_WAIT Application does terminating cleanup, then sends a FIN. LAST_ACK Client sends ACK. Closed. The points to note are FIN_WAIT_1 or 2 and TIME_WAIT are Active close states. CLOSE_WAIT & LAST_ACK are passive close states. Also, if sockets are "stuck" in CLOSE_WAIT, it's due to the server application (httpd) not finishing its socket cleanup and sending the last FIN. In other words, it's an application problem. The other thing John focused on was trying to clean up the many errors we're seeing with an "entstat en1" command, specifically under the "Receive Statistics:" column, the "No Resource Errors:" line. Most of the web servers have many thousands of "No Resource Errors". Just like PMR #90825,49R (see 5/21/99 10:30 entry above), when I had troubles on the CWS, I changed the transmit & receive buffer size on the card itself to the max. The process, repeated from above, is to - Take node out of rotation (duh!), - ssh-ing into root on the en0 (SP/2) side, - ifconfig en1 down detach - rmdev -l en1 (I don't think this is necessary, is it?) - smitty chgenet - ent1 And changed the - "TRANSMIT queue size" from 512 to 150. Beware of a discrepancy with what the help text (PF1) says for this field, namely that the "Valid values range from 20 through 150", and what the prompt (PF4) says, that the valid range is 20-2048. 150 is the limit. The guy didn't know, but guessed that perhaps the fact this was set to 512, which is over the 150 limit, was causing us grief. - "RECEIVE buffer pool size" from the default 37 to the max of 64. - mkdev -l en1 - route add default 192.168.56.251 - route add -net 9 -netmask 255.0.0.0 192.168.56.252 I made these changes on nodes 3 & 4, our gold web servers, since they were the two that have been giving us the most troubles lately. ============================================================================== PMR # 07981,49R x While converting the CWS from DCE 2.1 to DCE 2.2, --/--/99 x (I had just successfully converted the AIX level from --:-- x 4.2.1 to 4.3.2), I commented out the starting up of --who-- x DCE/DFS from /etc/inittab and rebooted so DCE wasn't --mail address-- x running at all, then installed the new DCE 2.2 code, --phone-- x including PTF set 5, then ran migrate.dce, then xxxxxxxxxxxxxxxxxxxxxxx migrate.dfs. Both appear to have worked fine. I then defined dceunixd as well. Now when I start DCE/DFS by running "/etc/rc.dce all", everything but the dfsd daemon starts. The last few lines in /etc/dce/cfgdce.log are ... Starting the DFSBIND daemon... The DFS kernel extension dfscore.ext has successfully loaded. Waiting up to 120 seconds for the daemon to start. DFSBIND daemon successfully started. Starting the DFSD daemon... The DFS kernel extension dfscore.ext has successfully loaded. readRPC_SUPPORTED_NETADDRS Waiting up to 120 seconds for the daemon to start. 
Waited 5 seconds. Waited 10 seconds. ... Waited 120 seconds. 0x113155ed: The following component is not running, and is not registered in DCED as running: DFS client. unknown math function "DCF_MESSAGE" 0x1138da69: Unable to start the DFS client. 0x1138da5d: The components on DFS host, as0000e0 did not start successfully. 0x113159fb: Start did not complete successfully. We tried unconfiguring DFS with a rmdfs -l -F all and reconfiguring it, but that didn't seem to work either. Liz called back, poked around and said dfsd *was* running, it was just cleaning out the old dfscache, which takes a long time the first time. She thinks that there was just a "disconnect" between the start-up script waiting "only" 2 minutes, and the dfsd daemon taking longer than that the first time cleaning out the DFS cache. ============================================================================== PMR # 14498,49R x This is a continuation of the DCE 2.1 to DCE 2.2 8/31/99 x conversion problems I was having before. This time 9:00 x it was as0206e0/1 that I converted AIX 4.2.1 to 4.3.2, Mike Patton x DCE 2.1 to 2.2, and PSSP 2.4 to 3.1, on Saturday, vdpatton@us.ibm.com x August 28. See PMR 07981,49R above. 8-442-7072 x xxxxxxxxxxxxxxxxxxxxxxx Back then, Liz Hughes focused on the messages you see in /etc/dce/cfgdce.log, and indeed, there are error messages in there. As near as I can piece together, here's what happened Saturday, 7:37 System booted (error logging turned on). 7:46 System shutdown by user. 7:51 System booted (error logging turned on). 7:54 /usr/lpp/dce/bin/migcheck runs, according to cfgdce.log. Gets 0x11315066: The system call (chdir) failed with a return code of -1 ... 0x11315069: An error occurred creating the file /opt/dcelocal/tmp/cfgdce.sem. Because the /opt/dcelocal/tmp (=/var/dce/tmp) directory didn't exist. 7:54 System shutdown (error logging turned off). 8:30 System Upgraded from AIX 4.2.1 to AIX 4.3.2, DCE & PSSP included. 9:10 System shutdown by user. 9:15 Finished upgrading. System booted (error logging turned on). 9:16 start.dce runs. Gets 0x1131504a: A failure occurred during the copy of /opt/dcelocal/var/dced/Acl.db to /opt/dcelocal/var/dced/backup/Acl.tdb. 0x11315073: The directory, /opt/dcelocal/var/dced/backup, was not found. Because /opt/dcelocal/var/dced/backup (=/var/dce/dced/backup) didn't exist. 9:29 show.cfg (lsdce) ran, saying "0x11315b69: A new release of DCE has been installed. The DCE configuration data needs to be migrated. Please run migrate.dce. 9:39 start.dce ran again. Got .../backup directory doesn't exist error again. 9:39 /var/dce/tmp & /opt/dcelocal created. 9:40 start.dce ran again. This time, DCE got converted & set up ok, with 9:41 show.cfg running to show the DCE components configured ok, and then 9:41 show.cfg again showing no DFS components configured. I don't know what was happening between 9:41 & 9:48, but at 9:48 show.cfg ran again, showing the DCE components configured ok, and 9:48 show.cfg again, showing no DFS components configured. Then 9:48 I probably invoked /etc/rc.dce by hand, causing start.dce to run, which after about a hundred, normal-looking messages, gets these error message. ... Starting the DFSD daemon... The DFS kernel extension dfscore.ext has successfully loaded. readRPC_SUPPORTED_NETADDRS Waiting up to 120 seconds for the daemon to start. Waited 5 seconds. Waited 10 seconds. ... Waited 120 seconds. 0x113155ed: The following component is not running, and is not registered in DCED as running: DFS client. 
unknown math function "DCF_MESSAGE" 0x1138da69: Unable to start the DFS client. 0x1138da5d: The components on DFS host, as0206e0 did not start successfully. 0x113159fb: Start did not complete successfully. 0x11315066: The system call (chdir) failed with a return code of -1 and error number of 2. 10:17 System shutdown by user. 10:21 System booted (error logging turned on). 10:22 /etc/rc.dce runs again from /etc/inittab. Gets same errors as above. 12:09 start.dce ran again, I guess by hand? Gets same errors as above. 12:14 start.dce ran again, I guess by hand? Gets same errors as above. 12:25 System shutdown by user. 12:30 System booted (error logging turned on). 12:30 /etc/rc.dce runs again from /etc/inittab. Gets same errors as above. Hmmmm. I thought I had to run /etc/rc.dce by hand a few minutes after the 12:30 boot. Was my problem all along, that I had :once: rather than :wait: in /etc/inittab? Hmmmmmm. Since I wasn't sure, I had Julian close the PMR. ============================================================================== PMR # 14610,49R x I chose to reinstall AIX 4.3.2 on our new S70, and 8/31/99 x afterwards I noticed that the new 32 GB SSA drives 5:15 x did not configure correctly. So, I Ishmall x mount as0000e0:/spdata/sys1/install/aix432/lppsource /mnt --mail address-- x & cfgmgr -i /mnt --phone-- x but I got an err message saying there are the xxxxxxxxxxxxxxxxxxxxxxx following missing filesets. devices.pci.0c000c02 devices.pci.14105800 devices.pci.14109100 = Something to do with the SSA ?? devices.pci.ssa I hunted around and found devices.pci.14109100 on some CD that came with the S70, but didn't find the other 3 anywhere. Kimberly from the install group came on the line, but I figured things out by the time she was of any help. I applied the AIX service I had & reran cfgmgr, and it installed the right filesets. Evidently, the base 4.3.2 cfgmgr code had the wrong list of filesets for these devices. ============================================================================== PMR # 23385,L11 x Note the different Branch Office number, due to my 9/08/99 x using Customer # IN00018, rather than my normal 11:40 x 5336519, 'cause this stupid first level AIX Support Diane Swan x guy claimed 5336519 didn't have support for the 8-526-1933 x Lotus Domino Go webserver. But then, he initially Also Alan Rich x claimed that IN00018 didn't either, but when I at 8-526-0362 x finally told him that LDG was under the Websphere Both in Raleigh x umbrella, he said, "Oh, yeah. You do have Websphere xxxxxxxxxxxxxxxxxxxxxxx Support, so you're ok." Sounds like I may have troubles with this in the future. I was told that Wayne Tippery (8-262-3804) was the person to call to get IBM Internal customer numbers updated. Anyway, I didn't call Wayne this time, 'cause I finally got through. All I wanted to do is get the latest Lotus Domino Go Webserver, which is at least 4.6.2.6, to update our Net.Commerce servers, especially as0206e0/1, which is today running LDG 4.6.2.51. This is so I can get a modern SSL certificate from Equifax. Diane told me I could get it by rftp-ing to service2.boulder.ibm.com as userid=webuser, password=code4you, cd-ing to /internet/DGW/aix, and picking up both the *tar file & the service.txt. 
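Boiled down, and assuming rftp takes the same ftp-style subcommands I use for
the testcase sessions elsewhere in this file, the pickup session looks
something like this (log in as webuser / code4you however rftp prompts for
it):

    rftp service2.boulder.ibm.com
    cd /internet/DGW/aix
    bin
    get service.txt
    mget *tar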
============================================================================== PMR # 24566,49R x When updating as0204e1 from AIX 4.2.1 to AIX 4.3.2, 10/06/99 x something I did on 9/9/99 (cute) on as0201e1 and --:-- x everything worked fine, I had lots of troubles with Donovan & Frank x the NIM/PSSP scripts not setting up the NFS stuff --mail address-- x correctly. The directories that should be --phone-- x NFS-exported include xxxxxxxxxxxxxxxxxxxxxxx /spdata/sys1/install/aix432/lppsource /spdata/sys1/install/aix432/spot/spot_aix432/usr /spdata/sys1/install/pssp/bosinst_data_migrate /export/nim/scripts/as0204e0.script All with -ro,root=as0204e0.patent.ibm.com,access=as0204e0.patent.ibm.com at the end. I was getting different things, e.g. the lppsource line not having the extra stuff on it, or missing lines altogether. Frank called on 10/13 and we ran a bunch of commands, picking on as0201=2 2 1=18 spchvgobj -i bos.obj.ssp.432 -r rootvg -v aix432 -p PSSP-3.1 2 2 1 and either a full spbootins -r migrate 2 2 1, which would run setup_server, or more simply, just allnimres -l 18 which does the same thing. All we were looking for/at is whether or not /etc/exports or exportfs shows the -ro,root=as0202e0.patent.ibm.com:,access=as0202e0.patent.ibm.com: for the spot, bosinst_data, & script lines. While Frank was on the phone, they all did. Everything worked correctly. Use spchvgobj -i bos.obj.ssp.421 -r rootvg -v aix421 -p PSSP-2.4 2 2 1 spbootins -s no -r disk 2 2 1 and unallnimres -l 18 to reset things back to normal. Since we couldn't get it to fail, we decided to close the PMR until it happens again. ============================================================================== PMR # 27516,49R x For the last two nights, ar0143e1 has crashed at 10/26/99 x around 2:00 - 2:30 in the morning. The Support 9:40 x Center had me run sysdumpdev -L to get the dump Chet Holt x device (/dev/dumplv), then script crash.out to chetholt@austin.ibm.com x start recording these next commands. 8-523-4138 x xxxxxxxxxxxxxxxxxxxxxxxxx crash /dev/dumplv, then once inside crash, stat, status, symptom, cpu, trace -m, od prog_log 8, errpt -a, then quit. I then sent it all to Chet via mail -s'PMR 27516,49R' chetholt@austin.ibm.com < /tmp/crash.out Chet called back a few minutes later & said we needed to run a fsck on one (some?) of our file systems. I know that on Wednesday of last week, Kin was trying to run fsck on /usr. I sent him a note telling him he needed to get into a limited function shell, but I don't know what he did. Turns out it was the /ips file system that was bad. I rebooted with nothing getting started up, ran fsck to fix things, and all was well again. ============================================================================== PMR # 29192,49R x Got a crash on ar0144 this morning, so I called 11/05/99 x it in. Got Chet again (see previous PMR). We went 9:10 x through the same sequence of commands to collect Chet Holt x data, then mailed him the stuff. chetholt@austin.ibm.com x sysdumpdev -L To determine the dump device, 8-523-4138 x which was /dev/dumplv, the dump size (46MB), and xxxxxxxxxxxxxxxxxxxxxxxxx the dump status (0), which said the dump completed successfully. The commands he wanted mailed to him were, script crash.out crash /dev/dumplv stat status symptom cpu t -m errpt -a quit To leave crash quit To finish script command mail -s'PMR 29192,49R' chetholt@austin.ibm.com < crash.out Chet called back at 1:00 to say this has been fixed. 
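So I don't have to retype Chet's sequence the next time one of these machines
dumps, here's a wrapper sketch.  It assumes crash will read its subcommands
from stdin the same way it takes them at its own prompt; if that turns out not
to hold, just drive crash by hand inside script like before.

#!/bin/ksh
# Collect the crash-dump info Chet asks for and mail it to him.
# Usage:  crashinfo /dev/dumplv 29192,49R
dumpdev=$1
pmr=$2
out=/tmp/crash.out
{
  sysdumpdev -L
  crash $dumpdev <<EOF
stat
status
symptom
cpu
t -m
errpt -a
quit
EOF
} > $out 2>&1
mail -s "PMR $pmr" chetholt@austin.ibm.com < $out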
I've intentionally kept these image servers a bit downlevel 'cause Kin said
applying some libpthreads fix on them would break them.  Kin didn't tell me
'till a week ago when I asked, that this problem has been fixed when they went
to a later version of the MQM(?) code.  Anyway, Chet's gonna send me a CD with
the AIX 4.2.1 fixes on them, 'cause I don't have room on the CWS to store both
the 4.2.1 & 4.3.2 fixes, so I've blown away the 4.2.1 fixes.
==============================================================================
PMR # 40421,49R           x ar0072e0 crashes soon after DFS comes up.  I have
11/13/99                  x just upgraded its AIX from 4.2.1 to 4.3.2, keeping
2:30                      x it at DCE 2.1.  Julian says it's a known problem
Julian                    x with AIX 4.3.2 & DCE 2.1 and to just apply the
--mail address--          x latest fixes.  I did, converting to DCE 2.2 along
--phone--                 x the way.
xxxxxxxxxxxxxxxxxxxxxxxxx
Now, DCE won't come up at all.  The problem is with the DCE Migration step.
The /opt/dcelocal/etc/cfgdce.log says
    0x11315b5a: DCE migration cannot be performed because the following files
                were not found:
                /opt/dcelocal/etc/mkdce.data
                /lpp/save.config/etc/dce/rc.dce
After much digging, I discovered that when I installed the DCE 2.2 code, the
installation replaced the link Ed had set up at /etc/dce, pointing to
/dceconfig/etcdce, with a directory.  This directory is where all the config
files & other stuff is.  After much fooling around, trying to combine the old
/dceconfig/etc/dce directory with the new /etc/dce, I'm waiting on level 3.
Meanwhile, I noticed that the /var/dce link is also now a directory.  The next
time I upgrade DCE on 71 or 73, undo these links first,
    lrwxrwxrwx  1 root system 17 Oct 11 1997 /etc/dce -> /dceconfig/etcdce
    lrwxrwxrwx  1 root system 17 Oct 11 1997 /var/dce -> /dceconfig/vardce
    lrwxrwxrwx  1 root system 15 Oct 11 1997 /krb5    -> /dceconfig/krb5
Bill first had me backup what we have, namely
    /etc/dce  /krb5  /usr/lib  /usr/ccs  /var/dce  /etc/dce.2.2.orig
    /dceconfig  /etc/objrepos  /usr/lpp/dce*
which was quite a bit of data.  I had to
    tar cvf - $(cat /tmp/root) | compress > /usr/bill.tar.Z
    mv /usr/bill.tar.Z /dceconfig
and even then it was 120MB.
Then he wanted to disable any replica attempts to ar0072e0, so as root on
ar0073, I tried doing these dcecp cmds,
    clearinghouse cat
    clearinghouse disable /.../patent.ibm.com/ar0072e0_ch
but that didn't work, I guess 'cause 73 didn't think that 72 was alive.
Bill then had me do this command,
    cdscp set dir /.:/ to new epoch master /.:/ar0073e0_ch readonly /.:/ar0071e0_ch exclude /.:/ar0072e0_ch
Then running his /tmp/whannon2 script, which was
    dcecp -c dir synch /.:/
    for i in `cdsli -Rd`            (cdsli -Rd returns 59 lines:  /.:/hosts,
    do                               50 /.:/hosts/ lines, and 8 others,
      echo Synching $i               e.g. /.:/users)
      dcecp -c dir synch $i
    done
This kept failing, apparently timing out, with these messages
    Synching /.:/hosts
    Error: Unable to communicate with any CDS server
    Synching /.:/hosts/ar0071e0
    Error: Unable to communicate with any CDS server
    Synching /.:/hosts/ar0072e0
    Error: Unable to communicate with any CDS server
    Synching /.:/hosts/ar0073e0
Bill then came up with
    cat > /tmp/finddirs << EOF
    for i in \`cdsli -Rd \`
    do
    cdscp show dir \$i CDS_Replicas
    done
    EOF
which created /tmp/finddirs; running it produced 914 lines of output.
Anyway, Bill lost me in what he was trying to do.  The point of this last
script is that you can run it and see whether ar0072 still shows up as a
replica anywhere or not.
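For what it's worth, here's how I'd actually use that, sketched under the
assumption that the CDS_Replicas output mentions replicas by their
clearinghouse names, the same names the cdscp commands above use:

    # Run the generated script and see how many lines still mention
    # ar0072e0's clearinghouse as a replica.
    sh /tmp/finddirs > /tmp/finddirs.out
    grep -c ar0072e0_ch /tmp/finddirs.out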
For the 27 CDS directories that 72 was replicating (stored in /tmp/whannon4), I ran for i in $(cat /tmp/whannon4);do export dit=$i;/tmp/whannon1;done where whannon1 was cdscp set dir $dit to new epoch master /.:/ar0073e0_ch readonly /.:/ar0071e0_ch exclude /.:/ar0072e0_ch Then Bill got under dcecp to do dcecp dcecp> sec_admin which changes the prompt. Interesting. sec-admin> si /.: -u To "bind" to the master sec_admin> lrep -all That showed that 73 was the master & 71 and 72 were replicas and by the way, the last update was on 11/8/99. Bill had me do a sec_admin> delrep subsys/dce/sec/ar0072e0 to delete the 72 security replica. Now a lrep -all shows that 72 is "marked for deletion". Bill suggests following these steps 1) Change /etc/dce, /var/dce, & /krb5 to directories. Copy command was cp -pRh /dceconfig/etcdce/.orig* /etc/dce to preserve the links. cp -pRh /dceconfig/vardce/* /var/dce cp -pRh /dceconfig/krb5/* /krb5 2) Unmount /dceconfig. 3) Force install DCE 2.1 (will lslpp now say DCE 2.2 is uninstalled?? - Yes.) Had to get the DCE 2.1 base code from Ed's execute.adtech.internet.ibm.com:/export/nim/lpp_source/aix432/PROD/DCEDFS-2.1 Got an error trying to NFS-mount something from the CWS. What now?? mount cws:/spdata/sys1/install/aix421/lppsource /mnt exec(): 0509-036 Cannot load program /usr/lib/drivers/nfs.ext because of the following errors: 0509-025 The /usr/lib/drivers/nfs.ext file is not executable or not in correct XCOFF format. nfsmnthelp: Cannot run a file that does not have a valid format. Shit! /usr/lib/drivers/nfs.ext is in the bos.net.nfs.client fileset, which did get more updated (4.3.2.10) on ar0072e0, than it did on 71 (4.3.2.6). I scp -p root@ar0071e0:/usr/lib/drivers/nfs.ext /usr/lib/drivers/nfs.ext and was then able to mount. I'll have to call this problem in later. 71 can mount ok. This is fixed by bos.net.nfs.client.4.3.2.11, closed in the last week or two. 4) Apply fixes, too (if not done earlier). Had to download them from http://www-4.ibm.com/software/network/dce/support/fixes/dceaix.html (DCE PTF Set 27 = IY87874) 5) Copy the 3 directories from /dceconfig to mount /dceconfig cp -pRh /dceconfig/etcdce.orig/* /etc/dce cp -pRh /dceconfig/vardce/* /var/dce cp -pRh /dceconfig/krb5/* /krb5 6) Try the rmdce -o local sec_srv to deconfigure the security server. but don't be surprised if it doesn't work. (It worked fine.) 7) Try to start DCE DCE & DFS seemed to start up ok (Yay!), but not all my aggregates are there. I'm missing my SSA devices. Even a lscfg doesn't show them. fts lsaggr ar0072e0 There are 2 aggregates on the server ar0072e0 (ar0072e0.patent.ibm.com): scsi0 (/dev/scsi0lv): id=2 (LFS) scsi1 (/dev/scsi1lv): id=3 (LFS) Missing are ssa0-2. lspv hdisk0 009000733866c1cb rootvg hdisk1 009001466cdd2e2d scsi1vg hdisk2 009001466cdc8cde scsi0vg hdisk3 009001466c667a49 rootvg Missing are hdisk4-6. Noted that the filesets for the SSA device drivers (devices.mca.8f97.*) weren't the same on 72 as they were on 71. 72 had funky 98.2.1.1008 levels, evidently from when we picked up pre-release drivers back who-knows-when. On 71, they were the normal 4.3.2.0 levels. I went back to the base 4.3.2 code on the CWS and force-installed the right filesets, and rebooted. 73 is ok, too (4.2.1.x). That didn't fix things, cfgmgr still won't configure the drives. Called the Support Center back. PMR 40457,49R. Greg talked me through picking up the latest SSA code from http://www.hursley.ibm.com/~ssa/rs6k. 
I got the tar file, expanded it on 72, which put stuff in /usr/sys/inst.images, then completely removed all SSA filesets (installp) and devices (rmdev), rebooted, installed most of the stuff from /usr/sys/inst.images, then ran cfgmgr, which now saw all the devices. While the drives were not in use, we took the opportunity to insure the drives' microcode was up to date (the 16 GB drives needed to be updated). Tried to start DCE/DFS again, and secd is dying. Shit! I thought we got past this. Oh, yeah. I had to rmdce -o local sec_srv again, then /etc/rc.dce all came up with all aggregates exported ok. fts lsaggr ar0072e0 There are 5 aggregates on the server ar0072e0 (ar0072e0.patent.ibm.com): ssa0 (/dev/ssa0lv): id=1 (LFS) scsi0 (/dev/scsi0lv): id=2 (LFS) scsi1 (/dev/scsi1lv): id=3 (LFS) ssa1 (/dev/ssa1lv): id=4 (LFS) ssa2 (/dev/ssa2lv): id=5 (LFS) 8) If not done earlier, (it was done earlier) rmdce -o local sec_srv To deconfigure the security server. then mkdce -o full sec_srv To recreate the security replica & synch it. This gives me the message Enter password to be assigned to initial DCE accounts: What's this asking for? I gave it the current cell_admin password. It then went on to say Cannot configure sec_srv until sec_cl is unconfigured Current state of DCE configuration: cds_cl COMPLETE CDS Clerk cds_second COMPLETE Additional CDS Server rpc COMPLETE RPC Endpoint Mapper sec_cl COMPLETE Security Client so I don't know what to make of this. Bob from the Support Center talked me through doing this from smitty. It's a lot easier that way. I think the command should have been mkdce -R -s ar0073e0 sec_srv 9) If all's ok, then fix CDS replicas by cdscp set dir /.:/ to new epoch master /.:/ar0073e0_ch readonly /.:/ar0071e0_ch readonly /.:/ar0072e0_ch This wasn't necessary. The step above essentially did this. 10) And on 73, modify /tmp/whannon1 (exclude -> readonly) and re-run for i in $(cat /tmp/whannon4);do export dit=$i;/tmp/whannon1;done ============================================================================== PMR # 53242,49R x Some servlet pages on the Net.Commerce server, 12/29/99 x as0206e0/1, quit working. A test URL to use is 9:00 x http://as0206e1/servlet/com.ibm.ipnfb/servlets.IPNAdminServlet John Mahoney x The only other clue were these lines in the httpd jmahone@us.ibm.com x log located in /arc/httpds/logs/httpd-errors.Dec291999. 8-444-4635 x [was error] ose_init : Failed in timebomb validation. xxxxxxxxxxxxxxxxxxxxxxxxx ose_init : Your timebomb is corrupted or has expired. ose_init : Please obtain another copy of the product. John had evidently seen this before and quickly sent me a file to fix it. Just replace the trx.properties file in the /properties directory. This being an IBM Websphere Application Server problem, "install_root" is /usr/WebSphere/AppServer, so the file that needed to be replaced is /usr/WebSphere/AppServer/properties/trx.properties. The old one contained 2IOJFOVZGubqx The new one contains 4669803 Whatever that file is, replacing it & recycling the web server with a startsrc -s httpd -e DB2INSTANCE=inst1 command fixed the problem. I went to the other websphere machines, ar0079e0, baboon, and ncc-312, to replace this file, but the file didn't exist. The file belongs to the IBMWebAS.base.core fileset, which on as0206, is at 2.0.3.1, whereas on the other 3 machines, it's only at 2.0.0.0. Perhaps it's just something with the newer release? 
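Boiled down, the fix on as0206 was just this (sketch; the copy source is
wherever John's replacement file got saved, and "recycle" here means stopsrc
followed by the startsrc John gave me):

    # Drop in the replacement trx.properties and recycle the web server.
    cp /tmp/trx.properties /usr/WebSphere/AppServer/properties/trx.properties
    stopsrc -s httpd
    startsrc -s httpd -e DB2INSTANCE=inst1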
============================================================================== PMR # 57093,49R x Since converting to DCE 2.2, my pretty_fts_lsfldb.tcl 01/25/00 x script that I run weekly on ar0073e0, doesn't run. 10:45 x After running for 36 hours, it core dumps. I've Jeff Pickering x tried looking at it, but there's nothing wrong on --mail address-- x my end and this used to work with DCE 2.1. 8-442-7243 x xxxxxxxxxxxxxxxxxxxxxxxxx I've got DCE 2.2.0.7 installed and 73 is AIX 4.3. ============================================================================== PMR # 59486,49R x There are multiple things wrong with the Patent --/--/00 x server DFS cell. First of all, I spent 10 days --:-- x trying to move the 60 GB patent.verity fileset from Paul Brennfleck x 73:ssa1 to 73:ssa3. The move should take about --mail address-- x 15 hours to finish, but after 3-5 days, I either 8-989-6897 x killed the move myself, or the system would get xxxxxxxxxxxxxxxxxxxxxxxxx rebooted due to other problems (who knows? Maybe it's due to the same causes). After fin x ============================================================================== PMR # 24630,L11 x The cdsclerk died on ar0071 on 4/4/00 at 3:53 AM. 4/05/00 x I had to stop (even tho' it wasn't running) the 10:10 x cdsclerk, restart it, and also restart the repserver Julian Owens x process. Things looked ok after all that. jowens1@us.ibm.com x --phone-- x I searched rshelp.austin.ibm.com (what a site!) xxxxxxxxxxxxxxxxxxxxxxxxx but didn't find anything, so I packaged up everything using their senddata.pl utility, which I had tucked away at /dfs/apps/userlocal/bin/senddata.pl. Here's the README I created to send with the package. I pointed senddata to /dumpfs/data and it created a 74 MB /dumpfs/data/datapkg.tar.Z.uu file, which I rftp'd to testcase.software.ibm.com and put at /aix/toibm/24630.L11-L3DCE-datapkg.tar.Z.uu. Notes on this core file, by Rick Jasper, System Admin Almaden Research Center IBM in San Jose, California (408) 927-2731 or tieline 457-2731 This core file was found at /var/dce/adm/directory/cds/cdsclerk/core and occurred on Tuesday, April 4th, 2000 at 3:53 AM when nobody was in, so there was nothing too interesting going on. The core file is 269,429,847 bytes big and is complete. Things returned to normal on the server once I restarted the cdsclerk & repserver, and are running fine now. The machine this occurred on (ar0071e0) is one of three DCE/DFS servers in my cell, the other two being ar0072e0 & ar0073e0, which is the master. All machines are running AIX 4.3.2 and DCE 2.2 at the latest levels, specifically DCE is at PTF set 7. This is the associated line from the /opt/dcelocal/var/svc/fatal.log file 2000-04-04-02:35:57.632-08:00I----- cdsclerk(30362) FATAL cds general clerk_client.c 1557 0x000011ae msgID=0x10D0AB83 Routine pthread_mutex_lock failed : status = -1. 
Finally, here is the associated error log entry --------------------------------------------------------------------------- LABEL: CORE_DUMP IDENTIFIER: C60BB505 Date/Time: Tue Apr 4 03:53:24 Sequence Number: 17317 Machine Id: 009000734C00 Node Id: ar0071e0 Class: S Type: PERM Resource Name: SYSPROC Description SOFTWARE PROGRAM ABNORMALLY TERMINATED Probable Causes SOFTWARE PROGRAM User Causes USER GENERATED SIGNAL Recommended Actions CORRECT THEN RETRY Failure Causes SOFTWARE PROGRAM Recommended Actions RERUN THE APPLICATION PROGRAM IF PROBLEM PERSISTS THEN DO THE FOLLOWING CONTACT APPROPRIATE SERVICE REPRESENTATIVE Detail Data SIGNAL NUMBER 6 USER'S PROCESS ID: 30362 FILE SYSTEM SERIAL NUMBER 13 INODE NUMBER 8194 PROGRAM NAME cdsclerk ADDITIONAL INFORMATION pthread_k 88 ?? _p_raise 64 raise 34 abort B8 dce_svc_p 3F0 link_free 28C _pthread_ 40 clerk_cli 440 _pthread_ C4 ?? Symptom Data REPORTABLE 1 INTERNAL ERROR 1 SYMPTOM CODE PIDS/576539300 LVLS/410 PCSS/SPI2 FLDS/cdsclerk SIG/6 FLDS/link_free VALU/28c --------------------------------------------------------------------------- ============================================================================== PMR # 24740,L11 x After installing AIX service to the S80, I didn't 04/27/00 x notice that the bosboot command didn't run. 4:00 x Normally, you see these msgs after an update_all, Alex x 0503-409 installp: bosboot verification starting... --mail address-- x installp: bosboot verification completed. --phone-- x 0503-408 installp: bosboot process starting... xxxxxxxxxxxxxxxxxxxxxxxxx bosboot: Boot image is 5331 512 byte blocks. Well, when we rebooted the machine next, the network and other things didn't come up. Logging on through the serial port, we were able to poke around and looking at the install msgs in /smit.log, we didn't see those bosboot msgs. Alex had me run bosboot -ad /dev/ipldevice and I rebooted and everything was fine. ============================================================================== PMR # xxxxx,L11 x x --/--/00 x x --:-- x x --who-- x x --mail address-- x x --phone-- x x xxxxxxxxxxxxxxxxxxxxxxxxx x x ============================================================================== PMR # xxxxx,L11 x x --/--/00 x x --:-- x x --who-- x x --mail address-- x x --phone-- x x xxxxxxxxxxxxxxxxxxxxxxxxx x x