Re: Google Summer of Code 2008

From: Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
To: Jan Urbański <j(dot)urbanski(at)students(dot)mimuw(dot)edu(dot)pl>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Google Summer of Code 2008
Date: 2008-03-08 19:29:36
Lists: pgsql-hackers

On Sat, 8 Mar 2008, Jan Urbański wrote:

> Oleg Bartunov wrote:
>> Jan,
>> the problem is well known and often requested. From your proposal it's not
>> clear what the idea is.
>>> Tom Lane wrote:
>>>> Jan Urbański <j(dot)urbanski(at)students(dot)mimuw(dot)edu(dot)pl>
>>>> writes:
>>>>> 2. Implement better selectivity estimates for FTS.
> OK, after reading through some of the code, the idea is to write a custom
> typanalyze function for tsvector columns. It could look inside the tsvectors,
> compute the most commonly appearing lexemes and store that information in
> pg_statistic. Then there should be a custom selectivity function for @@ and
> friends that would look at the lexemes in pg_statistic, see if the tsquery
> it got matches some/any of them and return a result based on that.

Such a function already exists: it's ts_stat(). The problem with ts_stat() is
its performance, since it sequentially scans ALL tsvectors. It's possible to
write a special function for the tsvector data type, which would be used by
ANALYZE, but I'm not sure sampling is a good approach here.
One way we could improve the performance of gathering stats with ts_stat() is
to process only new documents. It may not be as fast as it looks because of
a lot of updates, so one needs to think more about it.
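
To make the trade-off concrete, here is a minimal Python sketch (not
PostgreSQL code) of the two options discussed above: counting lexemes over
every document, as ts_stat() does, versus counting over a random sample, as
a typanalyze function run by ANALYZE would. The function name
most_common_lexemes, its parameters, and the modelling of a tsvector as a
plain set of lexemes are all assumptions made for illustration.

```python
from collections import Counter
import random


def most_common_lexemes(docs, k, sample_size=None, seed=0):
    """Return [(lexeme, fraction_of_docs)] for the k most frequent lexemes.

    docs        -- list of sets of lexemes (hypothetical stand-in for tsvectors)
    sample_size -- if given, inspect only a random sample of the documents;
                   if None, scan every document (the full-scan case)
    """
    if sample_size is not None and sample_size < len(docs):
        rng = random.Random(seed)
        docs = rng.sample(docs, sample_size)
    if not docs:
        return []
    counts = Counter()
    for lexemes in docs:
        # Count each lexeme once per document, not once per occurrence.
        counts.update(set(lexemes))
    n = len(docs)
    return [(lexeme, count / n) for lexeme, count in counts.most_common(k)]
```

With sample_size=None the cost grows with the whole table; with a fixed
sample size the cost is bounded, but rare lexemes may be missed entirely,
which is the concern about sampling raised above.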

> I have a feeling that in many cases identifying the top 50 to 300 lexemes
> would be enough to talk about text search selectivity with a degree of
> confidence. At least we wouldn't give overly low estimates for queries
> looking for very popular words, which I believe is worse than giving an overly
> high estimate for an obscure query (am I wrong here?).

Unfortunately, selectivity estimation for a query is much more difficult than
just estimating the frequency of an individual word.
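
As a rough illustration of why, here is a Python sketch that combines
per-lexeme frequencies into an estimate for a boolean query. It assumes
that lexemes occur independently of each other, which is my simplifying
assumption and not something proposed in this thread; correlated words
(say, two words that almost always appear together) break it. The function
name estimate_selectivity, the tuple encoding of the query, and the default
frequency for unknown lexemes are likewise hypothetical.

```python
def estimate_selectivity(query, freq, default=0.001):
    """Estimate the fraction of documents matching a boolean lexeme query.

    query   -- a lexeme string, ("not", q), ("and", q1, q2) or ("or", q1, q2)
    freq    -- dict mapping known common lexemes to their document fraction
    default -- assumed fraction for lexemes missing from freq (hypothetical)
    """
    if isinstance(query, str):
        return freq.get(query, default)
    op = query[0]
    if op == "not":
        return 1.0 - estimate_selectivity(query[1], freq, default)
    left = estimate_selectivity(query[1], freq, default)
    right = estimate_selectivity(query[2], freq, default)
    if op == "and":
        # Independence assumption: P(a and b) = P(a) * P(b)
        return left * right
    if op == "or":
        # Inclusion-exclusion under the same independence assumption
        return left + right - left * right
    raise ValueError(f"unknown operator: {op!r}")
```

Knowing that two words each appear in half of the documents says nothing
about how often they appear together; the true fraction for "and" can be
anywhere between 0 and 0.5, while this sketch always answers 0.25.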

Oleg Bartunov, Research Scientist, Head of AstroNet,
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg(at)sai(dot)msu(dot)su,
phone: +007(495)939-16-83, +007(495)939-23-83
