Re: Google Summer of Code 2008

From: Jan Urbański <j(dot)urbanski(at)students(dot)mimuw(dot)edu(dot)pl>
To: Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Google Summer of Code 2008
Date: 2008-03-04 18:39:20
Message-ID: 47CD9758.4070403@students.mimuw.edu.pl
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Oleg Bartunov wrote:
> Jan,
>
> the problem is known and well requested. From your promotion it's not
> clear what's an idea ?

I guess the first approach could be to populate some more columns in
pg_statistics for tables with tsvectors. I see there are some statistics
already being gathered (pg_stat's histogram_bounds are populated for
tsvector columns), so maybe one could use that?
Even remembering a few of the most frequently appearing lexemes could in
my opinion help. I plotted distinct lexemes against the number documents
containing them (basically the output of stat()) in one of our databases
and came out with this:
http://www.fiok.pl/~jurbanski/kaired-depesze.png
The three high values are really stopwords, and partially because of
that I wrote my first FTS patch, but this shows that if we'd remember
the ~40 most frequent lexemes, we could give much better estimates for
popular queries (and I think are the ones that hurt performance most are
those which underestimate the row count).

As for a more general solution I'd have to read deeper into the tsearch
code to understand how the tsvector type and @@ operator work and give
it a bit more thought. I'm planning to do that in the next three weeks
(read: before the student applications period starts). Maybe some kind
of heuristic could be implemented? Possibly someone could load some
information specific to her language, which would tell the planner how
common (more or less) a given word is?

Another attempt at it would be: return lower estimates for tsqueries
consisting of more parts - 'X'::tsquery is usually far less selective
than 'X & Y & Z & V'::tsquery.

I searched through the list archives, but couldn't find any other
attempts at this problem - were there any?

Jan

--
Jan Urbanski
GPG key ID: E583D7D2

ouden estin

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Zdenek Kotala 2008-03-04 19:30:43 Re: How to handle error message in PG_CATCH
Previous Message Tom Lane 2008-03-04 17:49:25 Re: "could not open relation 1663/16384/16584: No such file or directory" in a specific combination of transactions with temp tables