Steps to merge Patolis's imag table with Southbury's imag table. Used when we did the "Get Rid Of Jukeboxes" project for Patolis. ============================================================================= A Snapshot of the Patolis IMAG Table Collection CD Label Year Range # CDs # Patents Comment ========== ======== ========== ===== ========= ======= EP epb 1980-1999 554 276,666 EP esp 1979-1999 597 399,104 EP mepa 2000+ 273 238,535 PDF EP mepb 2000+ 166 122,320 PDF WO wld 1978-1999 960 406,807 WO miwo 2000 261 92,522 CDs WO miwo 2001+ 133 293,473 DVDs ===== ========= ======= 2944 1,829,427 ============================================================================= What non-IPN-SBY Images Does Patolis Have? In Patolis, as inst1 on ips03i, db2 "select substr(patn,1,2),substr(patn,11,1),count(*) from imag \ where cdlabel<>'IPN-SBY' group by substr(patn,1,2),substr(patn,11,1) \ order by substr(patn,1,2),substr(patn,11,1)" (1 minute) Patn Count -- - ----------- EP A 632,526 EP B 395,481 WO A 777,987 ============================================================================= How are the two imag tables defined, e.g. columns & restraints? db2look -d patent -a -e -m > raj.db2look Differences are 1) Southbury has DATASRC CHAR(3) NOT NULL WITH DEFAULT 'XXX' NAMESRC VARCHAR(30) NOT NULL WITH DEFAULT 'UNKNOWN' after BIB_ISD, and Patolis has XKIND CHAR(2) 2) Southbury has a Primary Key of PATN, DATASRC, CDLABEL Patolis has a Unique Index of PATN, CDLABEL ============================================================================= How is the NAMESRC & CDLABEL in Southbury's imag table used? How many NAMESRC's are UNKNOWN? Are they from the pre-DATASRC/NAMESRC days? db2 "select substr(patn,1,2),count(*) from imag where namesrc='UNKNOWN' \ group by substr(patn,1,2) order by substr(patn,1,2)" (2 minutes) patn Count Min(patn) ISD Max(patn) ISD -- ----------- ------------ ---------- ------------ ---------- EP 1,342,507 EP00000001B1 01/07/1981 EP01033917B1 01/30/2002 fixed JP 4,817,637 US 2,029,412 US00004670__ /1846 USX5583937__ 10/07/1997 WO 97,440 WO09508556A1 03/30/1995 WO09944409A1 09/02/1999 fixed ========= 8,286,996 total - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - How many cdlabels are UNKNOWN? db2 "select substr(patn,1,2),count(*) from imag where cdlabel='UNKNOWN' \ group by substr(patn,1,2) order by substr(patn,1,2)" (1 minutes) patn Count Patent Numbers ISD -- ----------- --------------------------------------------------- ---- EP 12 EP00735825B1 EP00754001B1 EP00778778B1 EP00779039B1 2002 (3 deleted) EP00786209B1 EP00796045B1 EP00845951B1 EP00853890B1 EP00904339B1 EP00915663B1 EP00915668B1 EP01033917B1 US 3,799,047 US00000001__ to US06415228__ and USX3512100__ 2002 WO 4 WO09508556A1 WO09925688A1 WO09925732A2 WO09925877A1 ========= 3,799,063 total - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - How many have both namesrc and cdlabel UNKNOWN? db2 "select substr(patn,1,2),count(*) from imag where \ namesrc='UNKNOWN' and cdlabel='UNKNOWN' group by substr(patn,1,2) order by substr(patn,1,2)" (1 minute) patn Count Patent Number Range Insert_TS -- ----------- -------------------------- --------- EP 9 EP00735825B1 - EP01033917B 2002 fixed (both =) US 18 US01607383__ - US03289512__ 2002 WO 4 WO09508556A1 - WO09925877A1 2002 fixed (both =) - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - How many have namesrc different than cdlabel, but namesrc and cdlabel not UNKNOWN? db2 "select substr(patn,1,2),count(*) from imag where cdlabel<>namesrc \ and cdlabel<>'UNKNOWN' and namesrc<>'UNKNOWN' \ group by substr(patn,1,2) order by substr(patn,1,2)" (2 minutes) patn Count -- ----------- JP 3 US 1,172,129 1,148 of these have cdlabel='Unknown' (not UNKNOWN) 1,169,906 of these have a good cdlabel but namesrc=bib* 1,074 of these have a good cdlabel but namesrc=usp* WO 201,550 201,548 of these have a good cdlabel but namesrc=ericvs fixed ========= "good cdlabel" means not Unknown|UNKNOWN 1,373,682 ================================================================================ How many patents in Patolis have more than one image? db2 "select substr(patn,1,10),count(*) from imag where cdlabel<>'IPN-SBY' \ group by substr(patn,1,10) having count(*)>1 \ order by substr(patn,1,10)" > raj.dup.images (2 minutes) patn Count 2 Images 3 Images 4 Images 5 Images 6 Images -- ----------- -------- -------- -------- -------- -------- EP 269,041 222,292 46,482 266 1 0 WO 101,678 93,423 7,745 488 21 1 ========= ======= ====== === == = 370,719 315,715 54,227 754 22 1 The patent with the most images (6), was WO09915684, PATN SRH ABS DESC AMEND DRAWING CLAIM BIBLIO IMAGE_PAGES CDLABEL INSERT_TS ------------ --- --- ---- ----- ------- ----- ------ ----------- ---------- ---------- WO09915684A2 - 1 3 - 198 195 1 279 wld1999049 2001-02-07 WO09915684A3 3 1 - - - - 1 8 wld1999087 2001-02-07 WO09915684AB - 1 2 1 - - 1 2 wld1999121 2001-02-07 WO09915684AC 280 1 3 1 198 187 1 285 wld1999140 2001-02-07 WO09915684AC 279 1 3 1 236 233 1 284 wld1999188 2001-02-08 WO09915684AC 279 1 3 1 197 194 1 284 miwo2000073 2001-02-08 If we bring down all copies of these multiple images, there are 797,243. - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - Actually, that may have been the wrong thing to look at. Yes, there may be more than one image associated with WO09915684, but really, how many duplicates are there? E.G. there's really only 1 set of 3 duplicates for WO09915684AC. db2 "select patn,count(*) from imag where cdlabel<>'IPN-SBY' \ group by patn having count(*)>1 order by patn" > raj.dup.images.2 (2 minutes) patn Count 2 Images 3 Images 4 Images -- ----------- -------- -------- -------- EP 634 631 3 WO 13,553 13,072 477 4 ====== ====== === = 14,187 13,703 480 4 If we bring down all copies of these multiple images, there are only 32,200. ============================================================================= How many EP or WO images does Patolis have, that Southbury doesn't? Step 1) Bring Patolis's imag table, over to Southbury. db2 "export to /dfs/download/patolis.imag.del of del \ select patn,srh,abs,desc,amend,drawing,claim,biblio,image_pages,cdlabel,xkind \ from imag where substr(patn,1,2) in ('WO','EP') and cdlabel<>'IPN-SBY'" compress /dfs/download/patolis.imag.del Compressed 87 MB down to 15 MB. (3 minutes total) Just for interest, on 2-26-2003, this produced 1,028,007 EP and 777,987 WO rows. ========= 1,805,994 total images. - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - Step 2) Create, load, and index the japioimag table in Southbury. In Southbury, as inst1 on dephds059, cd Japio.US.Update scp1 -p inst1@ips06i.ips4db2.com:/dfs/download/patolis.imag.del.Z . This took about 8 minutes to download at 30 kiloBytes/second. Hmmmmm. Interesting. Last November, I was able to get 55kB/second. uncompress patolis.imag.del.Z db2 "connect to patent" db2 "drop table japioimag" db2 "create table japioimag (patn character(12), srh smallint, abs smallint, \ desc smallint, amend smallint, drawing smallint, claim smallint, \ biblio smallint, image_pages smallint, cdlabel character(15), xkind character(2))" db2 "import from patolis.imag.del of del commitcount 10000 insert into japioimag" This took 12 minutes to load the 1,805,994 rows, which is 150,000 rows/minute. rm patolis.imag.del To create the index, db2 "create unique index ji1 on japioimag(patn,cdlabel)" which took 6 minutes each, then also db2 "create index ji2 on japioimag(cdlabel)" which took less than 1 minute (why?). Also db2 "runstats on table inst1.japioimag with distribution and \ detailed indexes all shrlevel change" ============================================================================= Doing that "Multiple" query again, this time against the japioimag table in Southbury, db2 "select patn,count(*) from japioimag \ group by patn having count(*)>1 order by patn" > new.JAPAN.Multiple.Counts patn Count 2 Images 3 Images 4 Images ---- ----------- -------- -------- -------- EP 786 782 4 0 WO 15,268 14,729 534 5 ====== ====== === = 16,054 15,511 538 5 And for all the detailed data, db2 "select patn,srh,abs,desc,amend,drawing,claim,biblio,image_pages,cdlabel,xkind \ from japioimag where patn in \ (select patn from japioima group by patn having count(*)>1) \ order by patn,cdlabel" > new.JAPAN.Multiple.Data If we bring down all copies of these multiple images, there are 32,656. But Patolis loaded the imag data for miwo2003020, but not the images themselves in /dfs/images, so those images are not available. Outside of those miwo2003020 images, we have 1,576 EP Images and 30,972 WO Images to get ====== 32,518 Images Total across 1,061 CDs CD Label CD Count Patent Count Comments ======== ======== ============ ================================= epb 29 31 If not there, then get from TPS. esp 111 136 If not there, then get from TPS. mepa 169 807 If not there, then get from TPS. mepb 128 602 If not there, then get from TPS. wld 242 473 If not there, then get from TPS. miwo2000 246 2,711 If not there, then get from TPS. === ===== 925 4,760 in new.JAPAN.Multiple.Data.no.miwo2003020.cd.patn.TPS.list Running check_all.pl now to see how many of this group are left to do (340/1652 so far ...) miwo2001 53 9,872 Still getable from DFS miwo2002 61 14,619 Still getable from DFS miwo2003 17 3,234 Still getable from DFS === ====== 131 27,725 in new.JAPAN.Multiple.Data.no.miwo2003020.cd.patn.DFS.list Running check_all.pl now to see how many of this group are left to do (none so far ...) impact 4 18 Recently Added From Southbury. Don't have to get. unknown 1 15 Recently Added From Southbury. Don't have to get. = == 5 33 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - Step 3) How many EP & WO images does Patolis have that Southbury doesn't? db2 "select substr(j.patn,1,2) as patn,count(j.patn) as count \ from japioimag j where \ not exists (select i.patn from imag i where i.patn=j.patn) \ group by substr(j.patn,1,2) order by substr(j.patn,1,2)" PATN COUNT ---- ------- EP 30,353 WO 27,442 ====== 57,795 Different Patents (not images) Or to also capture and sort by the Japio CD label, db2 "select j.cdlabel,count(j.patn) as count \ from japioimag j where \ not exists (select i.patn from imag i where i.patn=j.patn) \ group by j.cdlabel order by j.cdlabel" CDLABEL Count ========= ======== wld 120 Different CDs esp 110 Different CDs miwo 27 Different CDs epb 4 Different CDs mepa 1 Different CDs ================= 262 Different CDs - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - What about the other way? Does Southbury have any EP or WO images that Patolis is missing? db2 "select substr(i.patn,1,2) as patn,count(i.patn) as count \ from imag i where substr(i.patn,1,2) in ('EP','WO') and \ not exists (select j.patn from japioimag j where j.patn=i.patn) \ group by substr(i.patn,1,2) order by substr(i.patn,1,2)" PATN COUNT ---- ------- EP 767,218 WO 69,672 ======= 836,890 Curious. ============================================================================= What do we have? | Southbury | Patolis ---+-------------+----------- EP | 1,766,932 | 1,028,007 WO | 806,339 | 777,987 | ========= | ========= 2,573,271 | 1,805,994 Cases: (First, I resolved all EP and WO rows where cdlabel=UNKNOWN or namesrc=UNKNOWN or cdlabel<>namesrc) Now, all EP & WO imag rows have cdlabel=namesrc and none are UNKNOWN) 1A) EP with all rows identical = 0 (db2 -cvf count1.sql) 1B) WO with all rows identical = 10,448 2A) EP in Patolis, but missing from Southbury = 30,353 (db2 -cvf count2.sql) 2B) WO in Patolis, but missing from Southbury = 27,442 3A) EP in Southbury, but missing from Patolis = 767,218 (db2 -cvf count3.sql) 3B) WO in Southbury, but missing from Patolis = 69,672 4A) EP all rows identical except cdlabel = 0 (db2 -cvf count4.sql) 4B) WO all rows identical except cdlabel = 20 ... Other A) EP Other 1,000,000 Other B) WO Other 700,000 =============================================================================