Re: Tsearch2 question: getting histogram of the vector elements

From: Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
To: Rajesh Kumar Mallah <mallah(at)trade-india(dot)com>
Cc: pgsql-sql(at)postgresql(dot)org
Subject: Re: Tsearch2 question: getting histogram of the vector elements
Date: 2004-03-11 09:01:37
Message-ID: Pine.GSO.4.58.0403111159240.25727@ra.sai.msu.su
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-sql

Read
http://www.sai.msu.su/~megera/oddmuse/index.cgi/Tsearch_V2_Notes
("Check words statistics")

tsearch2 comes with stat function.

for example - top 10 words from apod (Astronomy picture of the day):

select * from stat('select ti from apod') order by ndoc desc, nentry desc,word limit 10;

word | ndoc | nentry
--------+------+--------
year | 1283 | 2064
star | 1273 | 3908
imag | 1267 | 1967
light | 1232 | 2206
pictur | 1177 | 1442
earth | 1059 | 1798
visibl | 992 | 1259
bright | 936 | 1335
котор | 903 | 1441
эт | 879 | 1397

Oleg

On Thu, 11 Mar 2004, Rajesh Kumar Mallah wrote:

>
> Greetings!
>
> My original problem is to de duplicate a list of around 0.3 million
> company names.
>
> Since a company name can be potentially (mis)spelt in numerous ways
> exactmatch
> obviously wont work.
>
> To make the searches faster i am using tsearch. For each company name i
> want to
> search other companies whose name is similar to the company in question.
>
> Since inclusion of all the vector elements of a given company reduces the
> chance of matching i am thinking of excluding the high frequency words
> from the query.
>
> Hence i need to find the high frequency elements like say 'consulting' ,
> 'limited' , 'Private'
> 'Industries' that occur commonly in company names.
>
> In my table i have populated the co_name_vec feild as
> strip(to_tsvector(co_name))
> can anyone help me analyzing the co_name_vec for the high frequency words?
>
> Also i would like to know alternate / better solution to this problem.
>
>
> Regds
> Mallah.
>
>
>
> SAMPLE DATA.
>
> +-----------------------------------------------------+----------------------------------------------------------+
> | co_name
> | co_name_vec |
> +-----------------------------------------------------+----------------------------------------------------------+
> | European Trade Partner & Consulting | 'trade'
> 'consult' 'partner' 'european' |
> | Gulbrandsen Chemicals Pvt. Ltd. | 'ltd' 'pvt'
> 'chemic' 'gulbrandsen' |
> | Govt. of Karnataka, Vision Group on Biotechnology | 'govt' 'group'
> 'vision' 'karnataka' 'biotechnolog' |
> | Digital Globalsoft Ltd. (A Hewlett Packard Company) | 'ltd' 'digit'
> 'compani' 'hewlett' 'packard' 'globalsoft' |
> | Shanon Construction Material Industries | 'materi'
> 'shanon' 'industri' 'construct' |
> | singpore india trade rsources company | 'india' 'trade'
> 'rsourc' 'compani' 'singpor' |
> | RGV TELECOM CONSULTANTS PVT. LTD. | 'ltd' 'pvt'
> 'rgv' 'consult' 'telecom' |
> | avid information search and documents (p) ltd. | 'p' 'ltd' 'avid'
> 'inform' 'search' 'document' |
> | Tavant Technologies India (P) Ltd. | 'p' 'ltd'
> 'india' 'tavant' 'technolog' |
> | Maschinen Fabrik (India) Pvt. Ltd | 'ltd' 'pvt'
> 'india' 'fabrik' 'maschinen' |
> | Manishri Refractories and Ceramics Pvt. Ltd. | 'ltd' 'pvt'
> 'ceram' 'manishri' 'refractori' |
> | xavier export import management institute | 'manag' 'export'
> 'import' 'xavier' 'institut' |
> | Best InformationTechnology ltd. | 'ltd' 'best'
> 'informationtechnolog' |
> | FutureCalls Technology Private Limited | 'limit' 'privat'
> 'futurecal' 'technolog' |
> | mak controls and systems pvt ltd | 'ltd' 'mak'
> 'pvt' 'system' 'control' |
> | NATIONAL RESEARCH CENTRE FOR CASHEW | 'centr' 'cashew'
> 'nation' 'research' |
> | The Madras Aluminium Company Ltd. | 'ltd' 'madra'
> 'compani' 'aluminium' |
> | Shriram Institute for Industrial Research | 'shriram'
> 'industri' 'institut' 'research' |
> | All India Carpet Trade Fair Committee | 'fair' 'india'
> 'trade' 'carpet' 'committe' |
> | Tuff Security & Allied Services | 'alli' 'tuff'
> 'secur' 'servic' |
> +-----------------------------------------------------+----------------------------------------------------------+
> (20 rows)
>
>
> ---------------------------(end of broadcast)---------------------------
> TIP 6: Have you searched our list archives?
>
> http://archives.postgresql.org
>

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg(at)sai(dot)msu(dot)su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83

In response to

Browse pgsql-sql by date

  From Date Subject
Next Message Rute Solipa 2004-03-11 15:01:29 Re: About pg_dump
Previous Message Tom Lane 2004-03-11 06:40:06 Re: randomized order in select?