pdd/00000832.pdd
/.../..../....../assists/00000000.abt
/.../..../....../......./00000000.wld        wld = Spanning Word List
"Style" files
=========================================================================
Verity has zones and fields.  There are three kinds of fields: date,
text, and integer.  See Carol's /afs/d/u/cht/misc/zones.txt file.

Verity knows which you're searching based on the syntax you've entered:

- Zone searches look like
      search-term IN field_name
  or
      search-term IN (fn1,fn2,fn3 ...)
  The IN is case-insensitive, and the zone list can be bracketed, a la
  the second form.

- Field searches look like
      field-name field-operator search-term
  e.g. Date field    ===>  PD    = <> < > <= >=  4-22-1955
       Text field    ===>  assg  = <> < > <= >=  jasper
       Integer field ===>  docid = <> < > <= >=  nnnnnn
  Other text field operators: starts, contains, substr, matches, ends.

Field searches don't search the entire collection; they only search the
specified field.  That matters when trying to stay under Verity's
maximum number of documents returned.

Data can be both a zone and a field.  You would want this if you wanted
to be able to search for ranges on certain data, e.g.
      pd >= 2000-01-01 and pd <= 2000-12-31
Another reason for defining fields (as in the case of Assignee and
Title) is to be able to return that data in a result set.  You can't
return zones.

The patent data that are both zones and fields are

  English Description   Zone Full & Short Names   Field Full & Short Names
  ===================   =======================   ========================
  International Class   Main & MC                 CLAS
  Title                 & TTL                     TTL
  Application Date      appldate & AD             AD
  Issue Date            issuedate & PD            PD
  Priority Date         & DP                      DP
  Patent Number         & PN                      VdkVgwKey

=========================================================================
There's Verity documentation on systems that have the Verity code
installed.  On rhino, it's in the /ips/coll/verity/tdk/doc/html
directory.
Since these are local files, you need to run your web browser on the
Verity system, exporting your display back to your X server.  The root
of this documentation is
  on rhino, file:///ips/coll/verity/tdk/doc/html/index.htm
  in JAPIO, file:///dfs/prod/verity/tdk251/REMOVE_THIS/tdk251/doc/html/index.htm
=========================================================================
Digging into the above on-line documentation, Appendix D in the
"Collection Building Guide" is titled "Custom Thesaurus".

The Verity thesaurus file has an extension of *.syd, e.g.
  in JAPIO, it's at /dfs/prod/verity/tdk261/common/english/vdk20.syd.
  on rhino, it's at /ips/coll/verity/common/english/vdk20.syd.

To decode it, you can do this command (done in JAPIO),

  cd /dfs/prod/verity/tdk261/common/english
  /dfs/prod/verity/tdk261/_rs6k41/bin/mksyd -locale english -dump \
      -syd vdk20.syd -f raj.ctl

This created a 5,420-line raj.ctl file that can then be used to add your
own synonyms, modify what's there, or just to see what's defined.  Be
aware that special Verity keywords are 'single-quoted', e.g. not, and,
or, all, greater, any, word, etc.

Another interesting aspect of the thesaurus is that one word can be in
multiple lists, i.e. it can be a synonym for many different words with
different meanings.  For example, the word "clear" appears on 27
different synonym lists.  The html pages say Verity will substitute all
27 sets of (a,b,c) (d,e,f) (g,h,i) ... but you never see any of these
expansions.  Doing a search on, e.g.

  clear title

will return more than 250,000 patents, due to such non-intuitive
synonyms as pay, hurdle, brighten, and pass.
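Once you have a dumped raj.ctl, a quick way to count how many synonym lists a word sits on is to grep for it as a whole list entry (bounded by a quote or comma, so "clear" doesn't match "clearance").  A sketch using a small five-list fragment and a scratch path in /tmp; the real file has thousands of lists:

```shell
# Build a small fragment in the "list:" format that mksyd -dump emits.
cat > /tmp/raj.ctl <<'EOF'
list: "pay,clear,return,produce,net,draw,earn,gain,gross,yield,realize,repay,bring in"
list: "clear,negotiate,hurdle"
list: "clear,brighten,clear up"
list: "clear,pass,carry"
list: "open,clear"
EOF

word=clear
# Match the word only as a whole entry, i.e. between quotes or commas.
count=$(grep -c "list:.*[\",]${word}[\",]" /tmp/raj.ctl)
echo "$word appears in $count synonym lists"
```

Against the fragment above this reports 5 lists; run it against the full 5,420-line dump to get the real count.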
The lines from the synonym file were

  list: "pay,clear,return,produce,net,draw,earn,gain,gross,yield,realize,repay,bring in"
  list: "clear,negotiate,hurdle"
  list: "clear,brighten,clear up"
  list: "clear,pass,carry"
  list: "open,clear"

=========================================================================
The significant parts of a collection are:

- The "partitions", located in the /ips/coll//parts directory,
  consisting of *.ddd ("Document Dataset Description") and *.did
  ("Document Instance Data") files.  The *.ddd files contain the fields
  (not zones) information and all the internal information about the
  collection.  The *.did files are all the tokenized words in
  alphabetical order.

- The "spanning word list", located at /ips/coll//assists/00000000.wld,
  by far the largest file.

- The "partition descriptor" file, located in the /ips/coll//pdd
  directory.

- The "style" files, located in the /ips/coll//style directory, which
  control how Verity works.  Interesting files include
  - style.vgw, which points to the gateway file and has the userid and
    password to use when connecting to the database.
  - The "stop word list", style.stp, which may contain a list of words
    to not index.  Words like a, is, an, of, on, the, and, it, from,
    used to be in that list, but we may be going towards an empty list,
    because it affects searching, e.g. "Bank of America".

- The "transaction" directory, /ips/coll/, which always contains
  - the "transaction file", data.trn.  Whenever Verity does anything,
    e.g. indexing some new patents into the collection, Verity puts
    those commands in this file.  Should the operation abort, these
    queued commands remain in this data.trn file, and they are
    re-executed the next time anybody runs a mkvdk command.
  - and may contain locks (*.lck) when a collection is being updated,
    for example.

There are two primary Verity programs, the indexer and the searcher.
For each patent in a given list of patents, the indexer program extracts
information from a DB/2 database and adds all the words it finds that
are not in its stop word list to its collection.  The searcher program
is run by the web servers to search that collection.

When either program runs, a connection to the DB/2 database is
attempted.  This DB/2 connection is done by calling the "gateway"
program, typically located at /ips/coll/gateway, using the userid and
password defined in the style.vgw file.
=========================================================================
To check the integrity of a collection, e.g. after a collection is
updated, run Verity's rcvdk utility program.

- Login to ipsadmin on the Verity server.
- Invoke the program with the collection name, e.g.
      rcvdk /ips/coll/coll_wo
  You should see "Successfully attached to 1 collection."
- At the "RC> " prompt, search for some phrase, e.g.
      s thin-film
  You should see something along the lines of
      Search update: finished (100%). Retrieved: 500(3290)/2192963.
  which is telling you that it finished the search and returned the
  first 500 of the 3,290 patents that contained "thin-film", out of the
  2,192,963 patents it has indexed in the collection.  Your actual
  numbers, of course, will vary.
- At the "RC> " prompt, enter quit to exit the rcvdk program.
=========================================================================
To do more with rcvdk, you can get into expert mode.  For example,

  rcvdk testjpft
  RC> x                   To get into expert mode.
  Expert mode enabled
  RC> s jasper            To search on something.
  Search update: finished (100%). Retrieved: 3(3)/3625.
  RC> r                   To retrieve what your search found.
  Retrieved: 3(3)/3625
   Number  SCORE  VdkVgwKey
       1:   0.80  US06263102__
       2:   0.77  US06263000__
       3:   0.77  US06262610__

Note that by default, all you see are those three columns.  To see more,

  RC> fields score 5 vdkvgwkey 12 datasrc 4 clas 12 assg 25 title 28

Now when you say r (retrieve your search results), you get more data.
  RC> r
  Retrieved: 3(3)/3625
   Number score vdkvgwkey    data clas       assg                      title
       1:  0.80 US06263102__      G06K00900  U.S. Philips Corporation  Color and contour signal gen
       2:  0.77 US06263000__      H04M01100  Hitachi Telecom Technolog Remote trading call terminal
       3:  0.77 US06262610__      G11C02702  National Semiconductor Co Voltage sample and hold circ

There are three questions this brings up.

1) How do you know what fields you can specify in the "fields" command?
   Answer: See the style.ufl (and style.ddd, too) for the collection.

2) What do patquery and patsearch do to get the columns they want?
   Answer: They use different -i options to specify the columns they
   want Verity to return.  (I think it's controlled by the -g option,
   too, but I don't know those details.)

3) How do you know the zones/fields you can search for with rcvdk?
   See the ReplaceAbbreviations subroutine in the patquery source,
   /afs/d/projects/search/patquery.common/patquery.c.  You'll see, e.g.

   Abbreviation => Verity Zone    English Description
   ------------    -------------  --------------------------------------
   AB           => abstract       Abstract
   AD           => appldate       Application Date
   AN           => applnumber     Application Number
   CP           => priority       Priority Country
   DS           => designated     Designated Countries
   IC           => class          All Classes (or International Classes??)
   IN           => inventor       Inventor
   KI           => kind           Kind
   MC           => mainclass      Main Class
   NC           => nationalclass  National Class
   PA           => assignee       Applicant
   PC           => country        Publication Country
   PD           => issuedate      Publication Date
   ******> PD   => priority       Publication Date
           This is a bug in patquery.  This should be DP, not PD.
           PD = "Publication Date" according to the NPO Advanced Search
           Screen.
   PN           => number         Publication Number
   PR           => priority       Priority Number
   TI           => title          Title
   UP           => upnumber       Publication Number Only

For example, after doing a "Priority Number" search from the Boolean
Search page on the web, I can see that it uses PR.  But when using rcvdk
to search, you need to know that PR = priority.
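For quick rcvdk work, the abbreviation-to-zone translation can be sketched as a shell lookup.  This is an illustrative rendition of part of the table above, not the real ReplaceAbbreviations C code:

```shell
# Translate a patquery-style abbreviation to the Verity zone name you
# would type at the RC> prompt.  Subset of the mapping table; sketch only.
abbrev_to_zone() {
    case $1 in
        AB) echo abstract ;;
        AD) echo appldate ;;
        IN) echo inventor ;;
        PA) echo assignee ;;
        PD) echo issuedate ;;
        PR) echo priority ;;
        TI) echo title ;;
        UP) echo upnumber ;;
        *)  echo "unknown abbreviation: $1" >&2; return 1 ;;
    esac
}

abbrev_to_zone PR    # the zone to use for a Priority Number search
```

So a web "Priority Number" (PR) search becomes an rcvdk search against the priority zone.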
=========================================================================
To test out Verity's parser, you can use the testqp utility available on
any machine with Verity installed, logged in as, say, ipsadmin (so that
the Verity code is in your PATH).

  $ testqp
  > a or b
   a
   b

Says Verity would parse "a or b" with stemming, or-ing the two terms
together.

  > pd=1 2 2000
   pd = 1 2 2000

A normal date field.  But it turns out that this Verity parsing checker
doesn't know squat about what you're searching, i.e. it doesn't know
that PD is a date field.  All it did was figure out that this was a
field search, not a zone search, and it strung the rest of it together
as your search argument.  Witness

  > junk=1 2 3 4 5 6 7
   junk = 1 2 3 4 5 6 7
or
  > junk=1 2 3 4 5 6 7 adsf=4jk3lk43 43;lk432
   junk = 1 2 3 4 5 6 7
   adsf = 4jk3lk43 43;lk432

You can use rcvdk to see how Verity would actually handle the different
strings,

  $ rcvdk /ips/coll/v4/bibonly4
  rc v2.6.1
  Attaching to collection: /ips/coll/v4/bibonly4
  Successfully attached to 1 collection.
  Type 'help' for a list of commands.
  RC> s (pd=2 29 2000)
  Search update: finished (100%). Retrieved: 500(3195)/697982.

A normal date search.  Leap Day 2000 was a Tuesday.

What does Verity do with a partial date search argument?  It fills in
ones for the missing pieces, so "2 2000" is the same as "2 1 2000", and
"2000" = "1 1 2000", which always returns no hits -- the US PTO never
issues patents on a holiday.

  RC> s (pd=2 2000)
  Search update: finished (100%). Retrieved: 500(3176)/697982.
  RC> s (pd=2 1 2000)
  Search update: finished (100%). Retrieved: 500(3176)/697982.
  RC> s (pd=2000)
  Search update: finished (100%). Retrieved: 0(0)/697982.

But it's interesting that s (pd=2) gives errors.  Hmmmm.  Only if
999 < nnnn < 3000 will Verity interpret a single number as the year.
=========================================================================
Verity's mkvdk utility program is used to perform most operations on a
collection.
It's used to

- Optimize a collection, e.g.
      mkvdk -collection /ips/coll/coll_wo -optimize tuneup

- Squeeze unused space from a collection, e.g.
      mkvdk -collection /ips/coll/coll_wo -squeeze tuneup

- Index more patents into a collection, optionally optimizing the
  collection when finished,
      mkvdk -collection /ips/coll/coll_us1 -insert US0H000803__
  for one patent, or more commonly
      mkvdk -collection /ips/coll/coll_us1 -insert @patn_list_file_name
  for lots of patents, where patn_list_file_name is the file name of a
  patent list.

- Get some basic information about the collection,
      mkvdk -collection /ips/coll/coll_wo -about

=========================================================================
When a Verity indexing step (/ips/bin/bibupdidx) fails, the following
steps need to be done to clean up junk before retrying it.

The Verity indexing process creates a new, temporary partition in the
/ips/coll//parts directory, named numerically one more than the last
active partition, updates that new partition, then merges it into the
last active partition.

For example, suppose 00000156.ddd and 00000156.did were the last good
partition in our collection, and an attempt to index more patents into
the collection aborted.  You would likely see a bunch of new files named
00000157.* in the /ips/coll//parts directory.  To clean up from this
aborted indexing process, you would remove all these 00000157 files,
i.e. rm /ips/coll//00000157*.  Obviously, you want to be very careful
that you are erasing only newly-created files (ls -ltr) and only the
files with the highest partition name.

There are other steps as well.  Here's the full procedure.

- Logon as ipsadmin on the Verity machine.
- cd to where all the Verity collections live, cd /ips/coll.
- cd to the collection that was being updated, e.g. coll_wo.
- Clean up any garbage files in the parts directory as described above,
  e.g. ls -ltr parts and rm parts/00000nnn*
- Remove any lock files in trans, i.e.
      rm trans/*lck
- vi the trans/data.trn file to get rid of any queued transactions
  (everything after the first 4 lines).  Ensure any lines you are
  deleting correspond to what you just cleaned up in the parts
  directory.  Normally, Verity is not doing anything and this file is
  only four lines long.  But when Verity is doing something like
  indexing a bunch of patents, those transactions can make this file
  very large.  Even when it's very large, though, the same first four
  lines are in the data.trn file.
- Remove any files in the temp and work directories,
      rm temp/* work/*
- Double-check, via rcvdk, that you have an "intact" collection, e.g.
  rcvdk /ips/coll/coll_wo and s thin-film.
=========================================================================
Here's all I know about "binding" a gateway.

The source for the Verity gateways is in /afs/d/projects/search/gateways.
When Carol or Azam create a gateway, (at least) the following files are
created:

- gatejpft.bnd => The "bind" file.  This file is used to "bind" with
  DB/2, e.g. db2 bind gatejpft.bnd.  What the db2 bind command does is
  register this "bind" file in the DB/2 syscat table.  When the
  corresponding gatejpft.so program runs and first tries to do something
  (typically a connect) with DB/2, DB/2 uses this bound package to
  understand the C-program variable names and DB/2 commands used by the
  program.  A bind file has in it enough information to identify the
  program that uses it, so that when that program runs, this bind file
  is used.

- gatejpft.so => The gateway itself.  This is the executable that is
  called by the Verity mkvdk command when a collection is updated.  It
  is pointed to by that collection's .../style/style.vgw file (see
  below).  Verity passes as input to gatejpft.so the patent name,
  userid, and password as given in style.vgw, in order to connect to the
  database.

- gatejpft.so.debug => A debug version of gatejpft.so.  To use this you
  would use a modified style.vgw file.
  Conveniently, it uses the identical bind file, so you don't have to
  bind anything else.

Some of the nitty-gritty details, if you want to understand the source
and the process (the details may not be 100% accurate):

  C source --> C compiler --> .o --> AIX linker --> .so file (in our case)

OR

  C source --> m4 macro preprocessor --> .sqc file (static SQL C program)

  .sqc file --> db2 prep --> .bnd file
                         \--> real .c file

  real .c file --> C compiler --> .o

  .bnd file --> db2 bind --> syscat table
            \--> db2bfd --> .bfd file, an English translation of the
                            .bnd file, for human consumption.

To "bind" a gateway, connect to the database and issue the db2 bind
command.  (Note: Binding a new patquery is different than binding a
gateway!!!  See your notes in the patquery file for those details.)
E.g.

  db2 connect to patent user ipsadmin using inst1_password
      (or inst1, it doesn't matter)
  db2 bind gateipcc.bnd
      (or whatever your "bind" file is named)
  db2 grant execute on package GATEIPCC to public
      (package name must be all uppercase)

This adds an entry to DB/2's syscat.packages table that you can see with
this command,

  db2 "select substr(pkgschema,1,10) as PKGSCHEMA,pkgname,substr(boundby,1,9) as boundby,\
  last_bind_time,explicit_bind_time from syscat.packages where pkgname like 'GATE%'"

You can see the grants on this package with this command,

  db2 "select substr(grantor,1,10) as GRANTOR,substr(grantee,1,10) as GRANTEE,\
  granteetype,substr(pkgschema,1,10) as PKGSCHEMA,pkgname,controlauth,\
  bindauth,executeauth from syscat.packageauth where pkgname like 'GATE%'"

Verity's style.vgw file (e.g. /ips/coll/coll_us1/style/style.vgw) points
to the gateway.so file and also contains the userid and DB/2 password to
connect to the database with.  Here is a sample style.vgw file.
  # style.vgw -- IBM DB2 driver for US Patent database (V4)
  $control: 1
  gateway:
  {
    dda: "DLL:/dfs/prod/verity/gateway/gateway.so:IBMVgwPatV4Db"
    repository: "PATENT"
    user: "ipsrun inst1_password"
  }
  $$

This file seems to be identical for all the Japanese collections.  Azam
says this is because JAPIO's EPA, EPB, US Bib-only, and WO collections
are all bib-only, i.e. not full-text.

When we installed the US Full Text in Japan, we put the Full Text
gateway at /dfs/prod/verity/gateway/gatejpft.  The executable,
shared-object file is typically named gateway.so (thus multiple gateways
must be in different directories), but the bind file must be unique
among all bind files "registered" in DB/2.  So for example, the JAPIO US
Full Text bind file was named gatejpft.bnd.  It showed up in DB/2 as

  PKGNAME  BOUNDBY   LAST_BIND_TIME             EXPLICIT_BIND_TIME
  -------- --------- -------------------------- --------------------------
  GATEJP   IPSADMIN  2001-03-30-04.28.58.628739 2001-03-30-04.28.58.628739
  GATEJPFT INST1     2001-09-11-07.47.03.952751 2001-09-11-07.47.03.952751

    2 record(s) selected.

The first time I tried binding it, I got errors.

  db2 connect to patent user ipsadmin using inst1_password

     Database Connection Information

   Database server       = DB2/6000 6.1.0
   SQL authorization ID  = IPSADMIN
   Local database alias  = PATENT

  db2 bind gatejpft.bnd

  LINE  MESSAGES FOR gatejpft.bnd
  ----- ----------------------------------------------------------------
        SQL0061W  The binder is in progress.
  4785  SQL0204N  "INST1.CLAS_ECL" is an undefined name.  SQLSTATE=42704
  4861  SQL0551N  "IPSADMIN" does not have the privilege to perform
                  operation "SELECT" on object "INST1.FAMI".
                  SQLSTATE=42501
  5045  SQL0551N  "IPSADMIN" does not have the privilege to perform
                  operation "SELECT" on object "INST1.OABS".
                  SQLSTATE=42501
        SQL0082C  An error has occurred which has terminated processing.
        SQL0092N  No package was created because of previous errors.
        SQL0091N  Binding was ended with "5" errors and "0" warnings.
Turns out there were two problems.  First, there was no CLAS_ECL table.
It's called CLAS_ECLA in the NPO world and is unused anyway.  I told
Azam not to use it.  The other two errors were permission problems.  See
below on how I corrected them.

The second time I tried binding it, I got different errors.  It's that
CLAS_ECL table again.  It was defined without a NUM column.  Sander and
I decided to just kill the table, since it was only used in the UK
Patent Office, and that is now dead.

Then when I tried to index using this gateway, I got the following
errors,

  Error: Could not get text for patent US0D435715__. rc = -551.
  Error E2-0531 (Document Index): Document 2 (US0D435715__):
      Unable to index document - SKIPPING
  db2GetMain returned -551

It turns out I didn't know about the "db2 grant execute" command.
=========================================================================
Permissions on DB/2 tables are kept in the syscat.tabauth table.  The
description (evidently changed from what is documented in the DB/2 book)
is

  Column name                    Type schema  Type name  Length  Scale  Nulls
  ------------------------------ ------------ ---------- ------- ------ -----
  GRANTOR                        SYSIBM       VARCHAR        128      0  No
  GRANTEE                        SYSIBM       VARCHAR        128      0  No
  GRANTEETYPE                    SYSIBM       CHARACTER        1      0  No
  TABSCHEMA                      SYSIBM       VARCHAR        128      0  No
  TABNAME                        SYSIBM       VARCHAR        128      0  No
  CONTROLAUTH                    SYSIBM       CHARACTER        1      0  No
  ALTERAUTH                      SYSIBM       CHARACTER        1      0  No
  DELETEAUTH                     SYSIBM       CHARACTER        1      0  No
  INDEXAUTH                      SYSIBM       CHARACTER        1      0  No
  INSERTAUTH                     SYSIBM       CHARACTER        1      0  No
  SELECTAUTH                     SYSIBM       CHARACTER        1      0  No
  REFAUTH                        SYSIBM       CHARACTER        1      0  No
  UPDATEAUTH                     SYSIBM       CHARACTER        1      0  No

To query it,

  db2 "select substr(grantor,1,11),substr(grantee,1,11),substr(tabname,1,11),SELECTAUTH\
  from syscat.tabauth where tabname='FAMI'"

Do a db2 list tables to see the possible tables, and notice that the
tabname column must be given in upper case.
To fix the above DB/2 bind problem, I had to, minimally,

  db2 "grant select on fami to ipsadmin"
and
  db2 "grant select on oabs to ipsadmin"

but the correct full permissions are

  db2 "grant alter,delete,index,insert,select,references,update on fami to ipsadmin"
and
  db2 "grant alter,delete,index,insert,select,references,update on oabs to ipsadmin"

=========================================================================
The way the Verity collection is tied to the pull-down menu in the
advanced search screen is by a file that is included,
/ips/include/coll_general.lst.  For example, on 9-14-2001, before the US
Full Text upgrade, Japan's was

We, of course, will add a line.
=========================================================================
The following came from Japan's
/dfs/download/Verity_Changes_For_USFT/README for when we gave them the
US Full Text stuff in September, 2001.  It talks about our plan for
their USFT collection and has details on how to build a collection.

This directory contains the following files and directories:

  README                         - This file.
  Index_US_FT_Collection.sh      - Utility to index the US Full Text
                                   collections.
  Make_Empty_USFT_Collections.sh - Utility to generate the empty Verity
                                   collections and the virtual
                                   collection file.
  Make_Patent_Lists_From_DB2.sh  - Utility to generate the 9 bulk load
                                   files at
                                   /ips/coll/prep/coll_usftx.bulk.
  style.usft Directory           - Contains the style files for the new
                                   US Full Text Collections.

The plan for the US Full Text Verity Collection is to split it up into 9
pieces, /ips/coll/coll_usft1 through /ips/coll/coll_usft9.  This will
keep the number of patents in each collection down to a reasonable
number.  There is a virtual collection file, /ips/coll/coll_usft, that
lists each of these real collections.  The plan also has us creating a
new US Full Text collection for each new year, e.g. for the first US
week in 2002, we will create /ips/coll/coll_usft10, adding that to the
/ips/coll/coll_usft file.
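The nine piece names, and the virtual collection file that lists them, follow a simple numbered pattern, so generating them can be sketched with a loop.  NOTE: the one-path-per-line layout and the /tmp stand-in path are assumptions for illustration; the real format of /ips/coll/coll_usft isn't captured in these notes.

```shell
# Build a stand-in virtual collection file listing the nine real
# collections.  ASSUMPTION: one collection path per line; the real
# /ips/coll/coll_usft layout may differ.
VCOLL=/tmp/coll_usft        # stand-in for /ips/coll/coll_usft
: > "$VCOLL"                # truncate/create
for i in 1 2 3 4 5 6 7 8 9
do
    echo "/ips/coll/coll_usft$i" >> "$VCOLL"
done
wc -l < "$VCOLL"
```

Adding coll_usft10 for 2002 would then just be one more echo onto the end of the file.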
As of August 31, 2001, here are the sizes of the 9 collections.

  collection name  issue date                # of documents
  ---------------  ------------------------  --------------
  coll_usft1       Before 1991-12-31                106,882
  coll_usft2       1992-01-01 to 1993-06-30         160,832
  coll_usft3       1993-07-01 to 1994-12-31         170,348
  coll_usft4       1995-01-01 to 1996-06-30         170,654
  coll_usft5       1996-07-01 to 1997-12-31         189,277
  coll_usft6       1998-01-01 to 1998-12-31         163,263
  coll_usft7       1999-01-01 to 1999-12-31         169,251
  coll_usft8       2000-01-01 to 2000-12-31         176,191
  coll_usft9       2001-01-01 to present            123,188
                                             ==============
                                     Total:       1,429,886

As ipsadmin on ips05i, we have already run the
Make_Patent_Lists_From_DB2.sh script to create 9 lists of US patents at
/ips/coll/prep/coll_usft1.bulk through /ips/coll/prep/coll_usft9.bulk,
which will be used to "bulk" index each Verity collection.

The next step is, again as ipsadmin on ips05i, to run
Make_Empty_USFT_Collections.sh to initialize the 9 collections.

Finally, the Index_US_FT_Collection.sh script is called to index the 9
collections.  This will take hours or days to run for each collection.
When running Index_US_FT_Collection.sh, you should probably use nohup
and save your output.  For example,

  nohup Index_US_FT_Collection.sh 1 > coll_usft1.out 2>&1 &

Don't run Index_US_FT_Collection.sh for more than one collection at a
time.  It's been our experience that the indexing is more likely to fail
with multiple instances running, for example, by running out of paging
space.  Also keep an eye on your disk space utilization.
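The run-one-at-a-time rule can be enforced by driving the nine runs from a sequential loop that stops on the first failure, rather than backgrounding nine jobs.  A sketch; a stub function stands in for the real Index_US_FT_Collection.sh so the loop itself is runnable, and /tmp output paths are illustrative:

```shell
# Index the nine USFT collections strictly one at a time (parallel runs
# risked exhausting paging space).  Stub stands in for the real script.
Index_US_FT_Collection() { echo "indexing coll_usft$1"; }

for i in 1 2 3 4 5 6 7 8 9
do
    Index_US_FT_Collection "$i" > "/tmp/coll_usft$i.out" 2>&1 ||
        { echo "coll_usft$i failed -- stopping" >&2; break; }
done
echo "all collections processed"
```

With the real script, the whole loop would itself go under nohup, since each collection can take hours or days.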
=========================================================================
The Make_Empty_USFT_Collections.sh command above basically just did

  for i in 1 2 3 4 5 6 7 8 9
  do
      mkvdk -create -collection /ips/coll/coll_usft$i \
            -style /dfs/download/Verity_Changes_For_USFT/style.usft
  done

where /dfs/download/Verity_Changes_For_USFT/style.usft was a directory
containing all the style files for the collection, namely

  -rwxr-xr-x  1 jasper  patent  2170 Sep 29 1999  style.ddd
  -rw-rw-r--  1 jasper  patent    71 Jul 17 1998  style.dft
  -rwxrwxr-x  1 jasper  patent  1700 Apr 27 1998  style.did
  -rw-rw-rw-  1 jasper  patent  1845 Jul 25 2001  style.lex
  -rwxr-xr-x  1 jasper  patent  1128 Jul 01 1998  style.pdd
  -rw-rw-rw-  1 jasper  patent  1109 Jul 29 2001  style.plc
  -rw-rw-r--  1 jasper  patent  3714 Aug 10 1998  style.prm
  -rwxrwxr-x  1 jasper  patent   816 Apr 27 1998  style.sid
  -rw-rw-r--  1 jasper  patent   101 Sep 29 1999  style.stp
  -rw-rw-r--  1 jasper  patent   996 Sep 05 10:03 style.ufl
  -rw-rw-rw-  1 nobody  nobody   225 Sep 11 06:54 style.vgw
  -rwxrwxr-x  1 jasper  patent  1338 Apr 27 1998  style.wld

=========================================================================
On 2-8-2002, I created new pieces for the EPA, EPB, US (bib-only), and
WO collections.  I also created a totally new collection for US Apps
called usapps at /ips/coll/coll_usapps.  Here's what I did.

First of all, the data I was giving them was EPA for the years
1979-1992, EPB for the years 1980-1992, PCT for the years 1978-1990, US
for the years 1971-1990, and US Apps for the years 2001 onward.  The
plan is, for the US Applications, to create a completely new collection,
and for the rest, to create another index piece and insert it in front
of the already-existing collection.

When creating a new piece of an existing collection, you should copy the
style files from the other pieces, even if they are out of date.  For
the new US Apps collection, I'll use the existing US Full Text style
files, which were created just a few months ago.
To create the new collections, logon as ipsadmin on ips05i and type

  mkvdk -create -collection /ips/coll/coll_epa0   -style /ips/coll/coll_epa/style
  mkvdk -create -collection /ips/coll/coll_epb0   -style /ips/coll/coll_epb/style
  mkvdk -create -collection /ips/coll/coll_wo0    -style /ips/coll/coll_wo/style
  mkvdk -create -collection /ips/coll/coll_us0    -style /ips/coll/coll_us2/style
  mkvdk -create -collection /ips/coll/coll_usapps -style /ips/coll/coll_usft9/style

This created empty collections at the various /ips/coll/coll_*
directories.

Now, to index the patents, I ran into a little headache.  When I loaded
this data, I carefully did a db2 import ... update, which intentionally
did NOT load any rows that were already in the database.  The result was
many rows rejected because they already existed.  These are the reject
counts for the main table.

  Collection | # Rows Rejected | # Rows Added | # Rows Extracted | Reject Ratio
  ===========+=================+==============+==================+=============
  EPA        |              16 |      623,452 |          623,468 |       0.003%
  EPB        |               4 |      207,906 |          207,910 |       0.002%
  PCT        |          16,106 |       66,558 |           82,664 |        19.4%
  US         |               0 |    1,435,525 |        1,435,525 |           0%
  US Apps    |               0 |       63,893 |           63,893 |           0%
  ===========+=================+==============+==================+=============
  5 Collections        16,126       2,397,334        2,413,460          0.672%

This means that I cannot simply use the load files as the list of
patents that need to be indexed.  Especially for the EPB and PCT
collections, there would be many patents that would be indexed twice: in
my new collection piece and in the already-existing collection.

The problem is, how do I cull the list of patents I did successfully add
from the whole list?  The DB/2 msg file I kept from the db2 import only
gives line numbers.  It would be possible, but troublesome, to craft a
Perl script to combine the two, and I predict it would take a while to
run.
I used Sander's tool at /dfs/tools/listfields to create a list of
patents that are now in the Verity collection, and bounced that list off
the complete list of patents.  As ipsadmin on ips05i,

  cd /dfs/tools    # Needed due to the funky way the script is written.
  echo $PATH=$PATH:
  listfields -s /ips/coll/coll_epa > verity.epa.patents
  mv verity.epa.patents /dfs/download/new_old_verity
  listfields -s /ips/coll/coll_epb > verity.epb.patents
  mv verity.epb.patents /dfs/download/new_old_verity
  listfields -s /ips/coll/coll_wo > verity.pct.patents
  mv verity.pct.patents /dfs/download/new_old_verity
  listfields -s /ips/coll/coll_us1 > us1
  listfields -s /ips/coll/coll_us2 > us2
  sort -u us1 us2 > /dfs/download/new_old_verity/verity.us.patents

More cut-ing and comm-ing finally produced

  EPA_Verity.list  with   623,452 lines
  EPB_Verity.list  with   207,906 lines
  PCT_Verity.list  with    66,558 lines
  US_Verity.list   with 1,435,525 lines
  Apps_Verity.list with    63,893 lines
                        =========
                        2,397,334

Good.  It checks.

Finally, to produce the Verity bulk load files,

  sed 's/^/vdkvgwkey: /;a\^J<>' Apps_Verity.list > /ips/coll/prep/coll_usapps.bulk
  sed 's/^/vdkvgwkey: /;a\^J<>' EPA_Verity.list  > /ips/coll/prep/coll_epa0.bulk
  sed 's/^/vdkvgwkey: /;a\^J<>' EPB_Verity.list  > /ips/coll/prep/coll_epb0.bulk
  sed 's/^/vdkvgwkey: /;a\^J<>' PCT_Verity.list  > /ips/coll/prep/coll_wo0.bulk
  sed 's/^/vdkvgwkey: /;a\^J<>' US_Verity.list   > /ips/coll/prep/coll_us0.bulk

where this --------------->^J is a real control-J (linefeed) character.

With all these pieces in place, we're now ready to use the doBulkLoad.sh
script to index the list of patents in the $BULKFILE, e.g.
/ips/coll/prep/coll_us0.bulk.  See
~jasper/aixnotes/ksh_examples/doBulkLoad.sh for the script as it existed
on 2-11-2002.
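The cull-and-convert step above can be sketched with comm(1) and a printf loop instead of the literal-control-J sed trick.  The three-patent lists and /tmp paths are fixtures for illustration, and my reading of the sed command (each key line followed by a "<>" separator line) is an assumption about the bulk-file format:

```shell
# Full load list and already-indexed list; comm(1) requires sorted input.
printf 'EP0000001A1\nEP0000002A1\nEP0000003A1\n' | sort > /tmp/all.list
printf 'EP0000002A1\n'                           | sort > /tmp/already.list

# -23: keep only lines unique to the full list = patents still to index.
comm -23 /tmp/all.list /tmp/already.list > /tmp/todo.list

# Bulk-file format (assumed from the sed above): "vdkvgwkey: <patent>"
# followed by a "<>" line; printf's \n replaces the embedded control-J.
while read -r patn
do
    printf 'vdkvgwkey: %s\n<>\n' "$patn"
done < /tmp/todo.list > /tmp/coll_demo.bulk

cat /tmp/coll_demo.bulk
```

The same comm -23 pattern would bounce an EPB or PCT load list off the listfields output to avoid double-indexing.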
=========================================================================
To remove patents from DB2 and Verity:

As inst1 or ipsadmin on ips03i,

  db2 connect to patent
  db2 "delete from main where patn in ('US21032345A1','US21047528A1',
      'US21047529A1','US21049835A1','US21049836A1')"

Then, as ipsadmin on ips05i, create a raj.del file with these patents,
one per line.

Before running the mkvdk command, it's always best to first check to see
if there's a "time bomb" left over from some other failed Verity
transaction.  Do a

  wc -l /ips/coll/coll_usapps/trans/data.trn

and verify it's 4 lines or less.  If it's not, then investigate to see
what transaction is currently running or last failed.  See the notes
above on how to clean this up.  If all is clear, then you can

  mkvdk -collection /ips/coll/coll_usapps -delete @raj.del

to delete the patents.
=========================================================================
One way to test whether a certain week's worth of patents has been
successfully indexed is to login as ipsadmin on ips05i (Patolis example)
and

  rcvdk /ips/coll/coll_us3
  rc v2.6.1
  Attaching to collection: /ips/coll/coll_us3
  Successfully attached to 1 collection.
  Type 'help' for a list of commands.
  RC> s pd=2002-07-23
  Search update: finished (100%). Retrieved: 0(0)/586939.
  RC> s pd=2002-07-16
  Search update: finished (100%). Retrieved: 500(3643)/586939.

I did this when they claimed problems after loading and indexing the US
week 30 data (ISD=PD=2002-07-23).  Verity claimed not to know anything
about that Issue Date, telling me their indexing effort failed.  I had
to look at their log at /ips/coll/list/usa02week30.out to see that
nothing happened when they indexed.
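The delete procedure lends itself to a small guard script: check the data.trn "time bomb" first, then derive the DB2 in-list and the raj.del file from one master list so DB2 and Verity always delete the same patents.  A sketch; the /tmp fixture paths and the three-patent list are illustrative:

```shell
# Fixture standing in for a healthy collection's trans directory.
mkdir -p /tmp/demo_coll/trans
printf 'l1\nl2\nl3\nl4\n' > /tmp/demo_coll/trans/data.trn

# A clean data.trn is 4 lines or less; more means queued transactions.
lines=$(wc -l < /tmp/demo_coll/trans/data.trn)
[ "$lines" -le 4 ] || { echo "time bomb: $lines lines in data.trn" >&2; exit 1; }

# One master list feeds both the db2 delete and mkvdk -delete @raj.del.
cat > /tmp/raj.del <<'EOF'
US21032345A1
US21047528A1
US21047529A1
EOF

# Quote each patent and join with commas for the SQL in-list.
inlist=$(sed "s/^/'/;s/\$/'/" /tmp/raj.del | tr '\n' ',' | sed 's/,$//')
echo "delete from main where patn in ($inlist)"
```

The emitted statement goes to db2, and the same raj.del then goes to mkvdk -delete.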
=======================================================================
To debug a bit and see what Carol cooks up for Verity to index, you can
logon to ipsadmin on grumpy, klog jasper.

To test the current gateway (as of March 8, 2004, that is),

  cd /afs/d/projects/search/gateways/gateway.bib
  mkvdk -collection /ips/coll/v4/testbib -nooptimize -update USD0477395__
or
  mkvdk -collection /ips/coll/v4/testinpa -nooptimize -update ITMI000099A0

To test the new stuff,

  cd /afs/d/projects/search/gateways/gateway3.bib
  mkvdk -collection /ips/coll/v4/testde -nooptimize -update USD0477395__

I was interested in the text string created and passed to Verity to
index in the number zone, so it helped to tack this onto the mkvdk
command,

  ... | gnugrep -A8 'BegZ number'

This gave me stuff like (indentation added by me for clarity),

  BegZ number
    BegZ upnumber
    Buffer: RE28481
    EndZ upnumber
  Text: USRE028481__ USRE28481 US RE28481 E1
  EndZ number

So the upnumber zone had only RE28481, but the number zone had all of
RE28481 USRE028481__ USRE28481 US RE28481 E1.  That was the new stuff.
The old stuff only had RE028481 and USRE028481__.
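The gnugrep -A8 trick needs a guess at the zone's line count.  An awk range keyed on the matching EndZ line avoids that; a sketch, fed a fixture file in /tmp that mimics the debug output shown above:

```shell
# Sample of the BegZ/EndZ debug stream (unindented, as mkvdk emits it).
cat > /tmp/dbg.out <<'EOF'
BegZ number
BegZ upnumber
Buffer: RE28481
EndZ upnumber
Text: USRE028481__ USRE28481 US RE28481 E1
EndZ number
EOF

# Print from "BegZ number" through its matching "EndZ number", however
# many lines that turns out to be.
zone=$(awk '/^BegZ number/{show=1} show{print} /^EndZ number/{show=0}' /tmp/dbg.out)
echo "$zone"
```

In real use, the awk would sit at the end of the mkvdk pipe in place of the gnugrep.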
=======================================================================
Carol keeps a log file on the dsearch machines in the /ips/vserver/logs
directory, named patqry_log.

  PID     RC #Hits  Out of#  #Q Secs Referer     #Cols Collections
  ------- -- ----- -------- --- ---- ----------- ----- ------------------
  1380356  0    63  4881213  28   32 scroll          2 bibonly usapps
   487606  0   152  9032923  62    1 refine          5 epaft epbft patentft pctft usappsft
  1126640  0     0  3879073  24    1 quicksearch     1 bibonly
  1605736  0     0 11061524  24    4 boolquery       2 bibonly japan
  1253382  0    32 40127817  48   36 advquery        9 deappsft deft epaft epbft inpadoc japan patentft pctft usappsft
   688156  0    43 37614731  46    3 scroll         11 bibonly de epa epb inpadoc inpadup2 inpadup3 inpadup4 inpadup5 pct usapps
  1355786  0    82  9032917  98    5 scroll          5 bibonly epa epb pct usapps
  1380356  0     0  2890353  42    3 advquery        2 epa epb
   487606  0    32 40127817  51    2 choose          9 deappsft deft epaft epbft inpadoc japan patentft pctft usappsft
  1126640  0     0  3879073  13    0 quicksearch     1 bibonly

where #Q = the number of characters in the query.  The first line, for
example, had 28 characters in the query, searched two collections
(bibonly and usapps), took 32 seconds, and returned 63 hits out of
4,881,213 patents.
=======================================================================
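Since the log is whitespace-delimited up through the Referer column, awk can tally it per referer.  A sketch using three rows copied from the log above into a /tmp fixture (field positions follow the header: 3rd = #Hits, 6th = Secs, 7th = Referer):

```shell
# Small patqry_log fragment, rows taken from the table above.
cat > /tmp/patqry_log <<'EOF'
1380356 0 63 4881213 28 32 scroll 2 bibonly usapps
1126640 0 0 3879073 24 1 quicksearch 1 bibonly
1380356 0 0 2890353 42 3 advquery 2 epa epb
EOF

# Sum hits and seconds per referer; sort for stable output order.
summary=$(awk '{h[$7]+=$3; s[$7]+=$6}
               END {for (r in h) printf "%s %d hits %d secs\n", r, h[r], s[r]}' /tmp/patqry_log | sort)
echo "$summary"
```

Run against the full patqry_log, this gives a quick picture of which entry points (scroll, advquery, quicksearch, ...) generate the load.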