Re: Google Summer of Code 2008

From: Jan Urbański <j(dot)urbanski(at)students(dot)mimuw(dot)edu(dot)pl>
To: Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Google Summer of Code 2008
Date: 2008-03-08 22:06:55
Message-ID: 47D30DFF.9080907@students.mimuw.edu.pl
Lists: pgsql-hackers

Oleg Bartunov wrote:
> On Sat, 8 Mar 2008, Jan Urbański wrote:
>> OK, after reading through some of the code, the idea is to write a
>> custom typanalyze function for tsvector columns. It could look inside

> such a function already exists: it's ts_stat(). The problem with ts_stat() is
> its performance, since it sequentially scans ALL tsvectors. It's possible to
> write a special function for the tsvector data type, which would be used by
> analyze, but I'm not sure sampling is a good approach here.

I too was concerned about whether looking through the tsvectors of 100
random documents would be enough to get satisfying results. But maybe it
would? I haven't seen many databases in my life, but I have a feeling
that in many cases the documents stored in them have a lot in common:
they use a similar vocabulary, and so on. For example, a database
containing the PostgreSQL documentation would have a lot of records with
the words 'tuple' or 'column', and a sampling approach should catch at
least those most common words.
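
For what it's worth, ts_stat() already makes it easy to experiment with
this. A minimal sketch, assuming a hypothetical "documents" table with a
tsvector column "body_tsv":

  -- lexeme frequencies from a random 100-document sample;
  -- table and column names are made up
  SELECT word, ndoc, nentry
    FROM ts_stat($$SELECT body_tsv FROM documents
                   ORDER BY random() LIMIT 100$$)
   ORDER BY ndoc DESC
   LIMIT 20;

If the top words from the sample roughly match what a full scan returns,
sampling should be good enough for catching the common lexemes.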

By the way, does ALTER TABLE SET STATISTICS influence the number of
tuples ANALYZE samples to do its work, or just the amount of statistics
extracted from that sample? It wasn't obvious to me while reading the
code.
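
For reference, I mean the per-column form of the command (made-up names
again):

  ALTER TABLE documents ALTER COLUMN body_tsv SET STATISTICS 1000;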

> The way we could improve the performance of gathering stats using ts_stat()
> is to process only new documents. It may not be as fast as it looks, because
> of lots of updates, so one needs to think more about it.

That's the second approach I thought about. Suppose we let the user
create a table by doing an INSERT ... SELECT * FROM ts_stat(...) and
associate it somehow with the table containing the indexed documents. We
could then give quite precise estimates for text search queries by
looking up lexeme frequencies in that table. We could also give users a
trigger function, similar to tsvector_update_trigger, that could be
installed for inserts, updates and deletes and would "merge" a
document's changes into that statistics table. Or should we just cram
that data into pg_statistic and make the trigger operate on that? A
rough sketch of the side-table variant follows.
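
A minimal sketch, with the same made-up names as before:

  -- materialize lexeme statistics once
  CREATE TABLE documents_tsstat AS
    SELECT word, ndoc, nentry
      FROM ts_stat('SELECT body_tsv FROM documents');

  -- estimating a single lexeme's selectivity then becomes a cheap lookup
  SELECT ndoc::float8 / (SELECT count(*) FROM documents) AS sel
    FROM documents_tsstat
   WHERE word = 'tuple';

The trigger would have to decrement ndoc/nentry for the old tsvector and
increment them for the new one on every update, which is exactly where
the extra write overhead would come from.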

This looks like a more complicated approach. It would surely give much
better results, but I'm worried about the additional overhead on every
non-read-only operation. Anyway, I'm not even sure this approach is
sane: if we do it for tsvectors (essentially keeping an extra table with
the computed frequencies of the values in a column), then why not do it
for integers too?

> Unfortunately, selectivity estimation for a query is much more difficult
> than just estimating the frequency of an individual word.

Sure, given something like 'cats & dogs'::tsquery, the frequencies of
'cat' and 'dog' alone won't suffice. But at least it's a starting point,
and if we estimate that 80% of the documents have 'dog' and 70% have
'cat', then by inclusion-exclusion at least 50% (0.80 + 0.70 - 1.00)
must have both, and that's a lot better than the 0.1% that's being
returned now.
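
Expressed against the hypothetical documents_tsstat table from above:

  -- lower bound for 'cat & dog' via inclusion-exclusion:
  -- sel(A & B) >= sel(A) + sel(B) - 1
  SELECT greatest(c.sel + d.sel - 1.0, 0.0) AS lower_bound
    FROM (SELECT ndoc::float8 / (SELECT count(*) FROM documents) AS sel
            FROM documents_tsstat WHERE word = 'cat') AS c,
         (SELECT ndoc::float8 / (SELECT count(*) FROM documents) AS sel
            FROM documents_tsstat WHERE word = 'dog') AS d;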

Regards,
Jan

--
Jan Urbanski
GPG key ID: E583D7D2

ouden estin
