Tsearch2 question: getting histogram of the vector elements

From: Rajesh Kumar Mallah <mallah(at)trade-india(dot)com>
To: pgsql-sql(at)postgresql(dot)org
Subject: Tsearch2 question: getting histogram of the vector elements
Date: 2004-03-10 19:54:17
Message-ID: 404F7269.9010301@trade-india.com
Lists: pgsql-sql


Greetings!

My original problem is to deduplicate a list of around 0.3 million
company names.

Since a company name can potentially be (mis)spelt in numerous ways,
exact matching obviously won't work.

To make the searches faster I am using tsearch2. For each company name I
want to search for other companies whose names are similar to it.

Since including all the vector elements of a given company in the query
reduces the chance of a match, I am thinking of excluding the high-frequency
words from the query.
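
To make the idea concrete, here is a minimal client-side sketch in Python of
dropping the common lexemes before building the query. The function name
build_query, the hard-coded set of common lexemes, and the choice to join the
remaining lexemes with '&' are assumptions for illustration only; the real set
of common lexemes would have to come from an actual frequency count.

```python
def build_query(lexemes, common):
    """Return a tsquery-style string using only the lexemes not in `common`.

    `lexemes` is the list of lexemes for one company name and `common`
    is a (hypothetical) set of high-frequency lexemes to leave out.
    """
    rare = [lex for lex in lexemes if lex not in common]
    # If every lexeme is common, fall back to the full list so the
    # query is never empty.
    kept = rare or list(lexemes)
    return " & ".join(kept)


# Hypothetical set of high-frequency lexemes.
common = {"ltd", "pvt"}
print(build_query(["ltd", "pvt", "chemic", "gulbrandsen"], common))
# prints: chemic & gulbrandsen
```

The fallback keeps a name such as "Pvt. Ltd." searchable even though all of
its lexemes are common.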

Hence I need to find the high-frequency elements, such as 'consulting',
'limited', 'private' and 'industries', that occur commonly in company names.

In my table I have populated the co_name_vec field as
strip(to_tsvector(co_name)).
Can anyone help me analyze co_name_vec for the high-frequency words?
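
The histogram I am after could be sketched client-side in Python as below,
working on the text form of the stripped vectors. This is only an illustration
under assumptions: the helper names lexeme_histogram and most_frequent are made
up, the four input rows are copied from the sample data in this message, and
the regular expression assumes that no lexeme contains an embedded quote.
Later PostgreSQL releases with built-in full text search provide a ts_stat()
function for this kind of statistic inside the database; whether an equivalent
is available depends on the installed version.

```python
import re
from collections import Counter

# Matches each quoted lexeme in a stripped tsvector such as "'ltd' 'pvt'".
# Assumption: lexemes never contain an embedded single quote.
LEXEME = re.compile(r"'([^']*)'")


def lexeme_histogram(vectors):
    """Count in how many rows each lexeme appears."""
    counts = Counter()
    for vec in vectors:
        # A set counts each lexeme at most once per row.
        counts.update(set(LEXEME.findall(vec)))
    return counts


def most_frequent(counts, n):
    """Return the n most frequent (lexeme, count) pairs, ties in name order."""
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:n]


# Four rows taken from the sample data in this message.
vectors = [
    "'trade' 'consult' 'partner' 'european'",
    "'ltd' 'pvt' 'chemic' 'gulbrandsen'",
    "'ltd' 'pvt' 'rgv' 'consult' 'telecom'",
    "'ltd' 'pvt' 'india' 'fabrik' 'maschinen'",
]
hist = lexeme_histogram(vectors)
print(most_frequent(hist, 3))
# prints: [('ltd', 3), ('pvt', 3), ('consult', 2)]
```

The top entries of such a list would be the candidates to leave out of the
search query.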

I would also like to know of any alternate or better solutions to this problem.

Regds
Mallah.

SAMPLE DATA.

+-----------------------------------------------------+----------------------------------------------------------+
| co_name                                             | co_name_vec                                              |
+-----------------------------------------------------+----------------------------------------------------------+
| European Trade Partner & Consulting                 | 'trade' 'consult' 'partner' 'european'                   |
| Gulbrandsen Chemicals Pvt. Ltd.                     | 'ltd' 'pvt' 'chemic' 'gulbrandsen'                       |
| Govt. of Karnataka, Vision Group on Biotechnology   | 'govt' 'group' 'vision' 'karnataka' 'biotechnolog'       |
| Digital Globalsoft Ltd. (A Hewlett Packard Company) | 'ltd' 'digit' 'compani' 'hewlett' 'packard' 'globalsoft' |
| Shanon Construction Material Industries             | 'materi' 'shanon' 'industri' 'construct'                 |
| singpore india trade rsources company               | 'india' 'trade' 'rsourc' 'compani' 'singpor'             |
| RGV TELECOM CONSULTANTS PVT. LTD.                   | 'ltd' 'pvt' 'rgv' 'consult' 'telecom'                    |
| avid information search and documents (p) ltd.      | 'p' 'ltd' 'avid' 'inform' 'search' 'document'            |
| Tavant Technologies India (P) Ltd.                  | 'p' 'ltd' 'india' 'tavant' 'technolog'                   |
| Maschinen Fabrik (India) Pvt. Ltd                   | 'ltd' 'pvt' 'india' 'fabrik' 'maschinen'                 |
| Manishri Refractories and Ceramics Pvt. Ltd.        | 'ltd' 'pvt' 'ceram' 'manishri' 'refractori'              |
| xavier export import management institute           | 'manag' 'export' 'import' 'xavier' 'institut'            |
| Best InformationTechnology ltd.                     | 'ltd' 'best' 'informationtechnolog'                      |
| FutureCalls Technology Private Limited              | 'limit' 'privat' 'futurecal' 'technolog'                 |
| mak controls and systems pvt ltd                    | 'ltd' 'mak' 'pvt' 'system' 'control'                     |
| NATIONAL RESEARCH CENTRE FOR CASHEW                 | 'centr' 'cashew' 'nation' 'research'                     |
| The Madras Aluminium Company Ltd.                   | 'ltd' 'madra' 'compani' 'aluminium'                      |
| Shriram Institute for Industrial Research           | 'shriram' 'industri' 'institut' 'research'               |
| All India Carpet Trade Fair Committee               | 'fair' 'india' 'trade' 'carpet' 'committe'               |
| Tuff Security & Allied Services                     | 'alli' 'tuff' 'secur' 'servic'                           |
+-----------------------------------------------------+----------------------------------------------------------+
(20 rows)
