| From: | Rajesh Kumar Mallah <mallah(at)trade-india(dot)com> | 
|---|---|
| To: | pgsql-sql(at)postgresql(dot)org | 
| Subject: | Tsearch2 question: getting histogram of the vector elements | 
| Date: | 2004-03-10 19:54:17 | 
| Message-ID: | 404F7269.9010301@trade-india.com | 
| Lists: | pgsql-sql | 
Greetings!
My original problem is to deduplicate a list of around 0.3 million
company names. Since a company name can potentially be (mis)spelt in
numerous ways, exact matching obviously won't work.
To make the searches faster I am using tsearch2. For each company name I
want to search for other companies whose names are similar to the company
in question. Since including all the vector elements of a given company in
the query reduces the chance of a match, I am thinking of excluding the
high-frequency words from the query.
Hence I need to find the high-frequency elements, say 'consulting',
'limited', 'Private' and 'Industries', that occur commonly in company names.
In my table I have populated the co_name_vec field as
strip(to_tsvector(co_name)).
Can anyone help me analyze co_name_vec for the high-frequency words?
I would also like to know of any alternative or better solution to this
problem.
Regds
Mallah.
SAMPLE DATA.
+-----------------------------------------------------+----------------------------------------------------------+
|                       co_name                       |                       co_name_vec                        |
+-----------------------------------------------------+----------------------------------------------------------+
| European Trade Partner & Consulting                 | 'trade' 'consult' 'partner' 'european'                   |
| Gulbrandsen Chemicals Pvt. Ltd.                     | 'ltd' 'pvt' 'chemic' 'gulbrandsen'                       |
| Govt. of Karnataka, Vision Group on Biotechnology   | 'govt' 'group' 'vision' 'karnataka' 'biotechnolog'       |
| Digital Globalsoft Ltd. (A Hewlett Packard Company) | 'ltd' 'digit' 'compani' 'hewlett' 'packard' 'globalsoft' |
| Shanon Construction Material Industries             | 'materi' 'shanon' 'industri' 'construct'                 |
| singpore india trade rsources company               | 'india' 'trade' 'rsourc' 'compani' 'singpor'             |
| RGV TELECOM CONSULTANTS PVT. LTD.                   | 'ltd' 'pvt' 'rgv' 'consult' 'telecom'                    |
| avid information search and documents (p) ltd.      | 'p' 'ltd' 'avid' 'inform' 'search' 'document'            |
| Tavant Technologies India (P) Ltd.                  | 'p' 'ltd' 'india' 'tavant' 'technolog'                   |
| Maschinen Fabrik (India) Pvt. Ltd                   | 'ltd' 'pvt' 'india' 'fabrik' 'maschinen'                 |
| Manishri Refractories and Ceramics Pvt. Ltd.        | 'ltd' 'pvt' 'ceram' 'manishri' 'refractori'              |
| xavier export  import  management  institute        | 'manag' 'export' 'import' 'xavier' 'institut'            |
| Best InformationTechnology ltd.                     | 'ltd' 'best' 'informationtechnolog'                      |
| FutureCalls Technology Private Limited              | 'limit' 'privat' 'futurecal' 'technolog'                 |
| mak controls and systems pvt ltd                    | 'ltd' 'mak' 'pvt' 'system' 'control'                     |
| NATIONAL RESEARCH CENTRE FOR CASHEW                 | 'centr' 'cashew' 'nation' 'research'                     |
| The Madras Aluminium Company Ltd.                   | 'ltd' 'madra' 'compani' 'aluminium'                      |
| Shriram Institute for Industrial Research           | 'shriram' 'industri' 'institut' 'research'               |
| All India Carpet Trade Fair Committee               | 'fair' 'india' 'trade' 'carpet' 'committe'               |
| Tuff Security & Allied Services                     | 'alli' 'tuff' 'secur' 'servic'                           |
+-----------------------------------------------------+----------------------------------------------------------+
(20 rows)
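
To make the requested histogram concrete, here is a minimal sketch in Python (not part of the original message, and not a tsearch2 feature). It assumes the co_name_vec values have been fetched as text in the stripped form shown above, with no position information and no spaces or quotes inside a lexeme. The names sample_vectors and lexeme_histogram are hypothetical, chosen only for this illustration. It counts, for each lexeme, the number of company names that contain it.

```python
from collections import Counter

# A few co_name_vec values copied from the sample data, as plain text.
sample_vectors = [
    "'trade' 'consult' 'partner' 'european'",
    "'ltd' 'pvt' 'chemic' 'gulbrandsen'",
    "'ltd' 'pvt' 'rgv' 'consult' 'telecom'",
    "'p' 'ltd' 'india' 'tavant' 'technolog'",
    "'ltd' 'pvt' 'india' 'fabrik' 'maschinen'",
]


def lexeme_histogram(vectors):
    """Return a Counter mapping each lexeme to the number of vectors containing it."""
    counts = Counter()
    for vec in vectors:
        # Assumption: lexemes are separated by whitespace and wrapped in single quotes.
        lexemes = {token.strip("'") for token in vec.split()}
        counts.update(lexemes)  # a set, so each lexeme counts once per company name
    return counts


histogram = lexeme_histogram(sample_vectors)

# Print the two most frequent lexemes with their document counts.
for lexeme, ndoc in histogram.most_common(2):
    print(lexeme, ndoc)
```

On these five rows 'ltd' appears in four names and 'pvt' in three, so those would be the first candidates to drop from a similarity query. Where to put the cut-off is a judgment call that depends on the full data set.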