Re: gsoc, text search selectivity and dllist enhancments

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Jan Urbański <j(dot)urbanski(at)students(dot)mimuw(dot)edu(dot)pl>
Cc: Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Heikki Linnakangas <heikki(at)enterprisedb(dot)com>, Postgres - Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: gsoc, text search selectivity and dllist enhancments
Date: 2008-07-10 21:02:36
Message-ID: 14238.1215723756@sss.pgh.pa.us
Lists: pgsql-hackers

Jan Urbański <j(dot)urbanski(at)students(dot)mimuw(dot)edu(dot)pl> writes:
> Still, there's a decision to be made: after how many lexemes should the
> pruning occur?

The way I think it ought to work is that the number of lexemes stored in
the final pg_statistic entry is statistics_target times a constant
(perhaps 100). I don't like having it vary depending on tsvector width
--- why for example should a column having a few wide tsvectors get a
bigger stats entry than one with many narrow ones? (Not to mention the
issue of having to estimate the average or max width before you can
start the counting run.)
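
(To put illustrative numbers on that: with statistics_target = 100 and a
constant of 100, the entry would hold at most 10,000 lexemes, whether the
column's tsvectors average ten lexemes apiece or ten thousand.)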

But in any case, given a target number of lexemes to accumulate,
I'd suggest pruning with that number as the bucket width (pruning
distance). Or perhaps use some multiple of the target number, but
the number itself seems about right. The LC paper says that the
bucket width w is equal to ceil(1/e) where e is the maximum frequency
estimation error, and that the maximum number of table entries needed
is log(eN)/e after N lexemes have been scanned. For the values of e
and N we are going to be dealing with, this is likely to work out to
a few times 1/e, in other words the table size is a few times w.
(They prove it's at most 7w given reasonable assumptions about data
distribution, regardless of how big N gets; though I think our values
for N aren't large enough for that to matter.)
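
To put rough numbers on that bound (purely illustrative, and taking log
as the natural log): with a target of 10,000 lexemes and w = 10,000 we
have e = 1/10,000; after N = 1,000,000 lexemes have been scanned,
log(eN)/e = ln(100) * 10,000, which is about 46,000 entries, i.e. about
4.6 w.

For anyone who wants the bookkeeping spelled out, here is a minimal
sketch of the LC scheme with pruning done once per bucket of w lexemes.
The struct, the flat array, and the names (LCEntry, lc_add, W,
MAX_TRACKED) are assumptions made for brevity, not the proposed patch;
a real implementation would hash the lexemes instead of scanning
linearly.

    #include <stdio.h>
    #include <string.h>

    #define W           1000    /* bucket width = pruning distance = ceil(1/e) */
    #define MAX_TRACKED 8000    /* a few times W, per the bound above */

    typedef struct
    {
        char    lexeme[64];     /* tracked lexeme (assumed short) */
        long    count;          /* occurrences seen since entry was created */
        long    delta;          /* worst-case undercount at creation time */
    } LCEntry;

    static LCEntry entries[MAX_TRACKED];
    static int  n_entries = 0;
    static long n_seen = 0;     /* N: total lexemes scanned so far */

    static void
    lc_add(const char *lexeme)
    {
        long    bucket;
        int     i;

        n_seen++;
        bucket = (n_seen + W - 1) / W;  /* current bucket id, ceil(N/w) */

        /* bump the counter if we already track this lexeme ... */
        for (i = 0; i < n_entries; i++)
        {
            if (strcmp(entries[i].lexeme, lexeme) == 0)
            {
                entries[i].count++;
                break;
            }
        }

        /* ... otherwise start tracking it, remembering the undercount */
        if (i == n_entries && n_entries < MAX_TRACKED)
        {
            strncpy(entries[i].lexeme, lexeme, sizeof(entries[i].lexeme) - 1);
            entries[i].lexeme[sizeof(entries[i].lexeme) - 1] = '\0';
            entries[i].count = 1;
            entries[i].delta = bucket - 1;
            n_entries++;
        }

        /* prune once per bucket, i.e. every W lexemes scanned */
        if (n_seen % W == 0)
        {
            int     kept = 0;

            for (i = 0; i < n_entries; i++)
            {
                if (entries[i].count + entries[i].delta > bucket)
                    entries[kept++] = entries[i];
            }
            n_entries = kept;
        }
    }

    int
    main(void)
    {
        const char *sample[] = {"postgres", "index", "postgres",
                                "vacuum", "postgres", "index"};
        int         i;

        for (i = 0; i < 6; i++)
            lc_add(sample[i]);

        for (i = 0; i < n_entries; i++)
            printf("%s\t%ld\n", entries[i].lexeme, entries[i].count);
        return 0;
    }

The delta field records the worst-case undercount for an entry created
partway through the scan, which is what makes it safe to drop entries
whose count + delta has fallen to the current bucket number or below.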

The existing compute_minimal_stats code uses a table size of twice the
target number of values, so setting w to maybe a half or a third of the
target number would reproduce the current space usage. I don't see a
problem with letting it get a little bigger though, especially since we
can expect that the lexemes aren't very long. (compute_minimal_stats
can't assume that for arbitrary data types...)
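
(Illustrative arithmetic for that comparison: with a target of 10,000
lexemes, the twice-the-target rule gives 20,000 tracked values; setting
w to a third of the target, roughly 3,300, keeps the LC table's "few
times w" in the same ballpark, while w = 10,000 would let it grow
several times larger.)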

regards, tom lane
