Notes on Jim's Work File Searching                                     10-21-2004
=================================================================================
The jwl_patsearch3 fcgi-bin Perl program will run on all the main web servers
(www in production & www7 in stage), which will "tunnel" to the jwl_patquery3
binary running on the Verity broker machines.

Although Jim started with the production patsearch, he's apparently stripped out
all the conditional coding of default variables based on which environment we
were running in.  E.g., in /dfs/stage/ipn/fcgi-bin/jwl_patsearch3, we have just
    $default_verity_server="localhost";
    $default_tunnel = "jwl_patquery3";
    $default_pqdatabase = "pdbfree";
whereas in /dfs/stage/ipn/fcgi-bin/patsearch, there's all kinds of logic to
detect the development nodes versus production versus ...  It looks like we need
to add that logic back in.

The jwl_patsearch3 Perl program uses the WlCacheDir line from the properties
file (see the grid below) to possibly create, and definitely use, the .dat
files.  Thus it needs R/W access to the /dfs/wlcache directory.
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
The jwl_patquery3 binary will only run on the Verity broker machines (dephds041
& dephds042), because it needs to read the local copy of the full Verity
collection at /ips/coll/v4.  Since that is the only place it'll be running, I
will also put all hdb files in the local /ips/coll file system, at
/ips/coll/hdb.  (I'll have to verify this decision with Carol.)

jwl_patquery3 also needs R/W access to the WlCacheDir directory, and R/O access
to the VerityCollDir & HdbDir directories (see the grid below).
=================================================================================
Jim right now has stuff in /dfs/cdrom, which of course isn't where it should be
permanently.
See his /dfs/stage/ipn/config/wlsearch.properties file, which specifies
    # Root directory for Verity collections
    export VerityCollDir=/dfs/verity/v4
    # Home directory for wlsearch binaries
    export WlsearchScriptDir=/dfs/cdrom/wlsearch
    # Jim's database files of the Verity index index (gets /hdb appended)
    export HdbDir=/dfs/cdrom
    # Working Directory.  Unauthenticated ipsrun needs R/W access
    export WlCacheDir=/dfs/cdrom/wlsearch/wlcache
    # Verity binaries.  Only used at build/link time.
    export VdkHome=/dfs/verity/verityk222/_rs6k41/bin
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
I'm not positive about this grid.  Jim looked it over and thought it was right,
but he was doing it from memory and I don't trust its absolute accuracy.  To the
best of my knowledge, these are the times/events at which these
wlsearch.properties variables are used.

    Time / Event VerityCollDir WlsearchScriptDir HdbDir WlCacheDir VdkHome
    ================ ============= ================= ====== ========== =======
    hdb creation-ksh Yes-RO
    -hdb Yes-RO Yes-RW
    hdb update Yes-RO Yes-RW
    patsearch Yes-RW
    patquery Yes-RO Yes-RO Yes-RW
    getpatn Yes-RO
    patquery Compile Yes
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
One thing Jim said, which didn't prove to be true for all of his programs, was
HOW each program knew where to find the properties file.  Jim said it always
found it by starting with your current directory, going up one directory level,
then back down to config/wlsearch.properties, i.e.
    ../config/wlsearch.properties
Thus for the web environment, it belonged in /dfs/prod/ipn/config (or stage).
But I found out that the getpatn program (and maybe others, I don't know)
insisted you be cd'd to where the wlsearch.properties file was.  Not good.
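The two lookup behaviors described above (../config relative to the current
directory, versus the current directory itself) could be folded into one search
order.  What follows is only a sketch under assumptions: the function names and
the WLSEARCH_PROPERTIES override variable are hypothetical and do not exist in
Jim's programs.  It relies only on the fact that the properties file is plain
"export NAME=value" lines, which a shell can source.

```shell
# Hypothetical helpers; none of these names exist in Jim's code.

# Print the path of the first readable wlsearch.properties, checking in order:
#   1. $WLSEARCH_PROPERTIES           (an assumed override, not in the original)
#   2. ../config/wlsearch.properties  (the rule Jim described)
#   3. ./wlsearch.properties          (what getpatn appeared to insist on)
find_wlsearch_properties() {
    for candidate in \
        "${WLSEARCH_PROPERTIES:-}" \
        ../config/wlsearch.properties \
        ./wlsearch.properties
    do
        if [ -n "$candidate" ] && [ -r "$candidate" ]; then
            printf '%s\n' "$candidate"
            return 0
        fi
    done
    echo "wlsearch.properties not found" >&2
    return 1
}

# Source the file that was found, so VerityCollDir, HdbDir, WlCacheDir, etc.
# become shell variables in the calling script.
load_wlsearch_properties() {
    props=$(find_wlsearch_properties) || return 1
    . "$props"
}
```

A wrapper script could call load_wlsearch_properties before starting any of the
tools, so that nobody has to remember which directory to cd to first.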
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
IMO, the permanent homes for the wlsearch.properties directories should be

    Variable Name      Production                        Staging
    =================  ================================  =================================
    VerityCollDir      /ips/coll/v4                      /ips/coll/v4
    WlsearchScriptDir  /dfs/prod/ipn/workfiles/wlsearch  /dfs/stage/ipn/workfiles/wlsearch
    HdbDir             /ips/coll                         /ips/coll
    WlCacheDir         /dfs/.rw/wlcache                  /dfs/.rw/wlcache

Be aware that /dfs/stage/ipn/workfiles is a link to
/dfs/prod/ipn/workfiles/stage (!!) so /dfs/stage/ipn/workfiles is actually under
a /dfs/prod/ipn/workfiles subdirectory.

To start implementing this scheme, I
    mkdir /dfs/prod/ipn/workfiles/wlsearch
    mkdir /dfs/prod/ipn/workfiles/stage/wlsearch
keeping the two (prod & stage) separate.  In contrast, for the hdb directory I
decided there's no need to keep separate production and stage copies of these
hdb files.
    mkdir /dfs/prod/ipn/workfiles/hdb
    ln -s ../hdb /dfs/prod/ipn/workfiles/stage/hdb

Similar to how we today handle the Verity collection, with a copy in /dfs and
shadowed to local disks for each machine that needs it, we should have a copy of
the hdb directory in /dfs/verity/hdb which would be copied to local disks on
both 41 & 42 (in their /ips/coll/hdb).  The problem is the verity fileset is
187GB out of 193GB full and has come under a lot of scrutiny lately.  It may be
difficult to talk Carol out of another 4-10GB (today, Jim's /dfs/cdrom/hdb
directory has 4GB of hdb files and at least 3GB of working space).

I also need to ask for two new wlcache filesets, keeping with the scheme we have
today in /dfs/.rw where dlcache & gifcache are links into the real directories
(and # is the DFS server number where the fileset is).  The quota on these guys
doesn't have to be large.  1GB should be more than enough.
=================================================================================
Notes on building Jim's patquery program.  I had to do these steps manually.
Jim was able to compile patquery on penguin, but when I tried doing so on my
jasper machine, I was missing the Verity libraries, which his makefile expects
to be at /ips/coll/verity.  On my machine, /ips was a link into /dfs/ips, so I
    ln -s verity261 /dfs/ips/coll/verity
and from penguin,
    cp -pRl /ips/coll/verity261 /dfs/ips/coll/verity
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Combine the ~/ipn-CVS/patquery.common & ~/ipn-CVS/patquery/patquery.fcgi.multi
directories into one "Build" directory.  This is how I did it.  Jim's "Build"
directory can be found at /u/jhees/patquery.
    cd ~/ipn-CVS
    mkdir patquery.build
    cd patquery.build
    cp -p ../patquery.common/[a-z]* .
    cp -p ../patquery/patquery.fcgi.multi/[a-z]* .
Had to add
    ln -s ../wlsearch/c_source/workfile.h
to ln.ksh, then run it to link in the other needed files from
../wlsearch/c_source

We finally had all the pieces in one place so the makefile would work.
    make -f makefile.db PQBND=jwlpq
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
But I had to modify his makefile.db into makefile.dbv7 'cause this week, 59 & 61
are still at DB2 version 7.2.9, and one cannot bind a DB2 version 8 bind file on
a version 7 database.  The primary change was to change
    /home/inst1/sqllib/lib      to  /usr/lpp/db2_07_01/lib      and
    /home/inst1/sqllib/include  to  /usr/lpp/db2_07_01/include

But I had to do the "db2 prep" command on a DB2 version 7 machine.  I tried
playing around with my environment variables to pick up the DB2 version 7 db2
command, but failed doing it that way.  Had to do things by hand, referring to
the makefile for details.

This strips out backslash-newlines, creating patquery.db1.sqc
    sed -f strip.sed patquery.db.sqc >patquery.db1.sqc

Then move that patquery.db1.sqc file to Southbury so I can do the "db2 prep"
command there.
    scp -p patquery.db1.sqc rickjas@dephds059:

As rickjas on 59,
    db2 connect to patent user ipsrun using ipsrun_password
    db2 prep patquery.db1.sqc bindfile using jwlpq.bnd package using jwlpq \
        queryopt 9 blocking all isolation CS insert buf
That command generated good patquery.db1.c and jwlpq.bnd files, using
patquery.db1.sqc as input.

Bring those good files back to San Jose so I can compile it.
    cd ~/ipn-CVS/patquery.build
    scp -p rickjas@dephds059:patquery.db1.c .
    scp -p rickjas@dephds059:jwlpq.bnd .
Let the makefile compile it,
    make -f makefile.dbv7 PQBND=jwlpq

To put what you need in Southbury,
    scp -p jwlpq.bnd rickjas@dephds059:/dfs/stage/ipn/fcgi-bin
    scp -p patquery.db.jwlpq rickjas@dephds059:/dfs/stage/ipn/fcgi-bin/jwl_patquery3
(patquery.db.jwlpq is the stripped version of patquery.db)

To bind, for both 59 & 61, login as rickjas so that I'll have my DCE
credentials,
    cd /dfs/stage/ipn/fcgi-bin
    db2 connect to patent user ipsrun using ipsrun_password
    db2 bind jwlpq.bnd
If this was the first time this package had been bound, tweak the permissions
    db2 grant execute on package jwlpq to public
---------------------------------------------------------------------------------
When I connected to the database as ipsadmin instead of ipsrun, we got these
errors doing the bind

    LINE    MESSAGES FOR patquery.db1.sqc
    ------  --------------------------------------------------------------------
            SQL0060W  The "C" precompiler is in progress.
      1234  SQL0204N  "IPSADMIN.HITLISTS" is an undefined name.  SQLSTATE=42704
      1251  SQL0204N  "IPSADMIN.HITLISTS" is an undefined name.  SQLSTATE=42704
      1622  SQL0204N  "IPSADMIN.HITLISTS" is an undefined name.  SQLSTATE=42704
      1631  SQL0204N  "IPSADMIN.HITLISTS" is an undefined name.  SQLSTATE=42704
            SQL0095N  No bind file was created because of previous errors.
            SQL0092N  No package was created because of previous errors.
            SQL0091W  Precompilation or binding was ended with "6" errors and "0" warnings.

I'm not sure why Jim is referencing the hitlists.  Is this necessary?
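The build steps above run "sed -f strip.sed" to strip backslash-newlines before
the "db2 prep", but the contents of strip.sed are not recorded in these notes.
As a rough sketch only, and not necessarily what Jim's strip.sed does, the same
kind of joining can be written as a sed loop.  The function name below is
hypothetical.

```shell
# Hypothetical stand-in for strip.sed (its real contents are not shown in
# these notes).  Reads stdin and joins every line that ends in a backslash
# with the line that follows it, dropping the backslash and the newline.
strip_backslash_newlines() {
    # :a        label to loop back to
    # /\\$/N    if the line ends in a backslash, append the next line
    # s/\\\n//  delete the backslash-newline pair
    # ta        if something was deleted, go back and check again
    sed -e ':a' -e '/\\$/N' -e 's/\\\n//' -e 'ta'
}
```

It would be used the same way as the original step, for example
"strip_backslash_newlines < patquery.db.sqc > patquery.db1.sqc".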
---------------------------------------------------------------------------------
When we tried to use the jwlpq.bnd file generated by Jim's makefile.do file,
which connects to the DB2 version 8 pdbtran, we got a DB2 version 8 bind file
that works only on DB2 v8.  If you try using that bind file on the version 7
59 or 61, you get these errors,

    LINE    MESSAGES FOR jwlpq.bnd
    ------  --------------------------------------------------------------------
            SQL0061W  The binder is in progress.
            SQL0028C  The release number of the bind file is not valid.
            SQL0082C  An error has occurred which has terminated processing.
            SQL0092N  No package was created because of previous errors.
            SQL0091N  Binding was ended with "3" errors and "0" warnings.
=================================================================================
To build Jim's other programs,
    cd ~/ipn-CVS/wlsearch/c_source
    makeall.ksh
or for individual files,
    make -f make_hdb
This will build things into the directory one level up (e.g.,
~/ipn-CVS/wlsearch).  Jim built things in /u/jhees/wlsearch.
=================================================================================
To test Work File Searching, authenticate to https://www7.delphion.com as
jhees/jhees and go to
    https://www7.delphion.com/jwl_simple3
Select one of his work files, say "mag needle", and type something in the "Work
File Search" field, say
    needle
and hit the "Search" button.  That button calls fcgi-bin/jwl_patsearch3, but
this sometimes times out after 30 seconds.  We need to add stuff to the Verity
web server config files and probably increase the fcgi-bin timeout value, but I
get lost in this stuff.  Sometimes, you would get back a screen that made it
look like it really did something, but Jim said no, it wasn't working.  It
really failed.

Jim mentioned that we need to resynch the fcgi-bin/jwl_patsearch3 and maybe the
/dfs/stage/ipn/htdocs/jwl_simple3.html.en, and I also see three
/dfs/stage/ipn/htdocs/hti/jwl* files to look at.
=================================================================================
To test Jim's patquery, you can run it from a command line.  Best is to login as
ipsrun on dephds041 or dephds042 and kdestroy, so that you most closely reflect
the environment the web server runs in.
    cd /dfs/stage/ipn/fcgi-bin
    jwl_patquery3 -c /dfs/cdrom/wlsearch/wlcache/wl_3625614_2003-05-20_18-10-28-149646.dat book
(The above fails if you're not on 41 or 42.  Something having to do with the
gateway, but who cares?)
=================================================================================
Use wlsearch/init_wlsearch.ksh to initially create all the hdb files.  Jim did
this on October 20, 2004.  See the /dfs/cdrom/hdb/*hdb files.

To keep it current, log onto the system where you can find Rebecca's *lst file
(the list of newly-indexed patents), make sure you're authenticated to write
into the hdb directory (now /dfs/cdrom/hdb, but later it'll be different),
    cd /dfs/stage/ipn/config
so that the hdb_updwithlst can find the wlsearch.properties file in the
../config directory (yes, I know that's stupid), then
    /dfs/cdrom/wlsearch/hdb_updwithlst deapps < /ips/coll/prep/deapps5.lst
    /dfs/cdrom/wlsearch/hdb_updwithlst inpadoc < /ips/coll/prepi/inpadoc*lst /ips/coll/prepi/inpadoc*del

Reading the hdb files from /dfs ran incredibly slowly.  I've got to get this
process running from the local file system.  Carol thought it might be easy
enough to get the gateway to spit out a file of patn-Doc ID.  That should help.

When all this gets streamlined, it will, of course, need to be integrated into
Rebecca's Verity update process.
=================================================================================
A note on Jim's hdb file naming scheme of starting & ending patent numbers.

    U_US04130808___US04769604__.hdb   <-- Note how each name ends with the first patn
    U_US04769604___US05410506__.hdb   <-- of the next file (e.g. US04769604).  The ending
    U_US05410506___US06056244__.hdb   <-- patn is non-inclusive.  So US04769604 is
    U_US06056244___US06696216__.hdb   <-- not in U_US04130808___US04769604__.hdb,
    U_US06696216___US24034315A1.hdb   <-- it's in U_US04769604___US05410506__.hdb
    U_US24034315A1_none.hdb           <--
=================================================================================
Jim's hdb utility has a few useful options.

-create  Useful for when you want to cut off one hdb to start another.  E.g., if
         I decided to cut off U_US24034315A1_none.hdb and I knew the last patent
         was US24035000A1 (how would I know this?) then I would
             mv U_US24034315A1_none.hdb U_US24034315A1_US24035001A1.hdb
             hdb -create U_US24035001A1_none.hdb

-dump    Spits out lots of lines like this, evidently one per patent.
             15:32:52 US24156575A1 bib_docid: 1014242 bib_colln: usapps(83) ft_docid: 260987 ft_colln: usappsft.recent(86)
             15:32:52 US24189343A1 bib_docid: 1047001 bib_colln: usapps(83) ft_docid: 293746 ft_colln: usappsft.recent(86)
             15:32:52 USD0262199__ bib_docid: 703523 bib_colln: bibonly1(1) ft_docid: 186020 ft_colln: usft4(90)

-stats   To see some statistics on this hdb file.  For example,
             hdb -stats /dfs/cdrom/hdb/U_US24034315A1_none.hdb
         shows
             maxInUseCt=14        <-- Maximum # records in any hash slot (14 is a lot)
             alernateReadCt=0 (should equal->) computed=0
             maxLinkedListCt=1
             maxllUseCt=14        <-- Same as maxInUseCt above??
             zeroCt=0             <-- Number of hash slots that are empty (???)
             recCt=902986         <-- Number of records in this hdb

-time    Times how long it takes to query one or any number of patents from an
         hdb file.
         For example,
             echo US24034315A1 | /dfs/cdrom/wlsearch/hdb -time /dfs/cdrom/hdb/U_US24034315A1_none.hdb
         returns
             16:13:42 Query Elapsed 0.000 secs

-struct  Takes no options and returns
             AllocItemCt=31 HashSize = 65536
             HdbHeader size=16
                 type h start=0 length=1
                 hashSize start=4 length=4
                 allocItemCt start=8 length=4
                 size start=12 length=1
             HdbItem size=28
                 type i start=0 length=1
                 patn start=1 length=13
                 bibDocid start=16 length=4
                 ftDocid start=20 length=4
                 bibCollid start=24 length=1
                 ftCollid start=25 length=1
                 size start=26 length=1
             HdbItemArray size=884
                 type a start=0 length=1
                 item start=4 length=868
                 inUseCt start=872 length=2
                 next start=876 length=4
                 size start=880 length=2
             HdbHandle size=1172
                 type n start=0 length=1
                 header start=4 length=16
                 arraybuff start=20 length=884
                 filehandle start=904 length=4
                 stream start=908 length=4
                 fileName start=912 length=256
                 size start=1168 length=2
         I have no idea what all that stuff means.
=================================================================================
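The hdb file naming scheme noted earlier (starting patn inclusive, ending patn
non-inclusive, "none" for the open-ended last file) lends itself to a small
lookup that answers "which hdb file should hold this patent?".  This is only a
sketch under assumptions: the function names are hypothetical, patent numbers
are assumed to be padded to 12 characters with trailing underscores (as the file
names and the -dump output suggest), and ranges are assumed to compare in plain
byte order.  How Jim's own code compares patent numbers is not known here.

```shell
# Hypothetical helpers; not part of Jim's hdb tools.

# Pad a patent number to 12 characters with trailing underscores,
# e.g. US04769604 becomes US04769604__ (assumed from the file names).
pad_patn() {
    p=$1
    while [ "${#p}" -lt 12 ]; do p="${p}_"; done
    printf '%s\n' "$p"
}

# Succeed if string $1 sorts strictly before string $2 in byte order.
str_lt() {
    LC_ALL=C expr "x$1" \< "x$2" >/dev/null
}

# Usage: hdb_for_patn PATN FILE...
# Prints the first FILE named U_<start>_<end>.hdb whose range holds PATN,
# treating <start> as inclusive and <end> as non-inclusive ("none" = no end).
hdb_for_patn() {
    patn=$(pad_patn "$1")
    shift
    for name in "$@"; do
        base=${name##*/}              # drop any directory part
        base=${base#U_}               # drop the U_ prefix
        base=${base%.hdb}             # drop the .hdb suffix
        rest=${base#????????????}     # what follows the 12-character start patn
        start=${base%"$rest"}
        end=${rest#_}
        if str_lt "$patn" "$start"; then continue; fi
        if [ "$end" != none ] && ! str_lt "$patn" "$end"; then continue; fi
        printf '%s\n' "$name"
        return 0
    done
    return 1
}
```

Called as "hdb_for_patn US04769604 /dfs/cdrom/hdb/U_*.hdb", this should pick the
file that starts with US04769604 rather than the one that ends with it, matching
the note on the naming scheme.
=================================================================================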