|From:||Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>|
|To:||Jan Urbański <j(dot)urbanski(at)students(dot)mimuw(dot)edu(dot)pl>|
|Cc:||Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org|
|Subject:||Re: Google Summer of Code 2008|
On Sat, 8 Mar 2008, Jan Urbański wrote:
> Oleg Bartunov wrote:
>> The problem is well known and a fix is often requested. From your proposal
>> it's not clear what the idea is?
>>> Tom Lane wrote:
>>>> Jan Urbański <j(dot)urbanski(at)students(dot)mimuw(dot)edu(dot)pl>
>>>>> 2. Implement better selectivity estimates for FTS.
> OK, after reading through some of the code, the idea is to write a custom
> typanalyze function for tsvector columns. It could look inside the tsvectors,
> compute the most commonly appearing lexemes and store that information in
> pg_statistic. Then there should be a custom selectivity function for @@ and
> friends that would look at the lexemes in pg_statistic, see if the tsquery
> it got matches some/any of them and return a result based on that.
Such a function already exists: ts_stat(). The problem with ts_stat() is
its performance, since it sequentially scans ALL tsvectors. It's possible to
write a special typanalyze function for the tsvector data type, which would be
used by ANALYZE, but I'm not sure sampling is a good approach here.
The way we could improve the performance of gathering stats with ts_stat() is
to process only new documents. It may not be as fast as it looks because of
the large number of updates, so one needs to think more about it.
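To make the "process only new documents" idea concrete, here is a rough
sketch in Python rather than backend code. Everything in it is an assumption
for illustration: tsvectors are modeled as plain lists of lexemes, and the
LexemeStats class is hypothetical, not an existing PostgreSQL interface. It
shows why updates are the awkward part: every UPDATE has to subtract the old
tsvector before adding the new one.

```python
from collections import Counter


class LexemeStats:
    """Hypothetical running document-frequency table (not a PostgreSQL API)."""

    def __init__(self):
        self.ndoc = Counter()   # lexeme -> number of documents containing it
        self.total_docs = 0

    def add(self, lexemes):
        # Count each lexeme once per document, however often it occurs.
        self.ndoc.update(set(lexemes))
        self.total_docs += 1

    def remove(self, lexemes):
        # Needed for DELETE, and for the old version of an UPDATEd row.
        unique = set(lexemes)
        self.ndoc.subtract(unique)
        self.total_docs -= 1
        for lex in unique:
            if self.ndoc[lex] <= 0:
                del self.ndoc[lex]   # drop lexemes no document contains

    def top(self, n):
        # The n most common lexemes with their document counts.
        return self.ndoc.most_common(n)


stats = LexemeStats()
stats.add(["star", "galaxi", "star"])
stats.add(["star", "planet"])
# An UPDATE of the first document: remove the old tsvector, add the new one.
stats.remove(["star", "galaxi"])
stats.add(["star", "nebula"])
```

With many updates the remove/add pairs dominate the work, which is the
reason the incremental approach may not be as cheap as it first looks.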
> I have a feeling that in many cases identifying the top 50 to 300 lexemes
> would be enough to talk about text search selectivity with a degree of
> confidence. At least we wouldn't give overly low estimates for queries
> looking for very popular words, which I believe is worse than giving an overly
> high estimate for an obscure query (am I wrong here?).
Unfortunately, selectivity estimation for a query is much more difficult than
just estimating the frequency of an individual word.
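A small illustration of the difficulty, again only a sketch in Python under
stated assumptions: documents are modeled as sets of lexemes, the sample data
is made up, and the doc_freq helper is hypothetical. Combining per-lexeme
frequencies under an independence assumption (one possible approach, not
something PostgreSQL is claimed to do here) gives the wrong answer as soon as
the lexemes are correlated.

```python
def doc_freq(docs, lexeme):
    """Fraction of documents that contain the given lexeme."""
    return sum(lexeme in d for d in docs) / len(docs)


# Made-up sample: 'new' and 'york' tend to appear together.
docs = [
    {"new", "york", "citi"},
    {"new", "york", "time"},
    {"new", "car"},
    {"old", "car"},
]

f_new = doc_freq(docs, "new")
f_york = doc_freq(docs, "york")

# Query 'new & york': the independence assumption multiplies the frequencies.
estimated_and = f_new * f_york
actual_and = sum({"new", "york"} <= d for d in docs) / len(docs)

# Query 'new | york': inclusion-exclusion under the same assumption.
estimated_or = f_new + f_york - f_new * f_york
actual_or = sum(bool({"new", "york"} & d) for d in docs) / len(docs)

print(estimated_and, actual_and)
print(estimated_or, actual_or)
```

Here the AND estimate is too low and the OR estimate too high, even though
both single-word frequencies are exact.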
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg(at)sai(dot)msu(dot)su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83