Re: gsoc, text search selectivity and dllist enhancments

From: Jan Urbański <j(dot)urbanski(at)students(dot)mimuw(dot)edu(dot)pl>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Heikki Linnakangas <heikki(at)enterprisedb(dot)com>, Postgres - Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: gsoc, text search selectivity and dllist enhancments
Date: 2008-07-11 06:18:25
Lists: pgsql-hackers

Tom Lane wrote:
> Jan Urbański <j(dot)urbanski(at)students(dot)mimuw(dot)edu(dot)pl> writes:
>> Tom Lane wrote:
> Well, (1) the normal measure would be statistics_target *tsvectors*,
> and we'd have to translate that to lexemes somehow; my proposal is just
> to use a fixed constant instead of tsvector width as in your original
> patch. And (2) storing only statistics_target lexemes would be
> uselessly small and would guarantee that people *have to* set a custom
> target on tsvector columns to get useful results. Obviously broken
> defaults are not my bag.

Fair enough, I'm fine with a multiplication factor.

>> Also, the existing code decides which elements are worth storing as most
>> common ones by discarding those that are not frequent enough (that's
>> where num_mcv can get adjusted downwards). I mimicked that for lexemes
>> but maybe it just doesn't make sense?
> Well, that's not unreasonable either, if you can come up with a
> reasonable definition of "not frequent enough"; but that adds another
> variable to the discussion.

The current definition is "more occurrences than 0.001 of the total row
count, but no fewer than 2". I copied it straight from
compute_minimal_stats(), and I have no problem with removing it. I think
its point is to guard against a situation where all elements are more or
less unique, so that taking the top N would just give you random noise.
It doesn't hurt, so I'd be for keeping the mechanism, but if people feel
differently, I'll just drop it.
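
To make the rule concrete, here is a minimal standalone sketch of the cutoff as described above. It is not the patch code: LexemeCount, lexeme_is_frequent_enough() and adjust_num_mcv() are hypothetical names made up for illustration, and the input is assumed to be already sorted by descending count.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-lexeme tally; not an actual PostgreSQL struct. */
typedef struct
{
    const char *lexeme;
    int         count;
} LexemeCount;

/*
 * The rule as stated in the mail: keep a lexeme only if it occurs more
 * often than 0.001 of the total row count, and at least twice.
 */
static int
lexeme_is_frequent_enough(int count, double total_rows)
{
    return count >= 2 && (double) count > 0.001 * total_rows;
}

/*
 * Adjust num_mcv downwards. Assumes counts[] is sorted by descending
 * count, so the scan can stop at the first lexeme that fails the test.
 * Returns the number of leading entries worth storing.
 */
static int
adjust_num_mcv(const LexemeCount *counts, int n, int num_mcv,
               double total_rows)
{
    int i;

    assert(counts != NULL || n == 0);

    if (num_mcv > n)
        num_mcv = n;

    for (i = 0; i < num_mcv; i++)
    {
        if (!lexeme_is_frequent_enough(counts[i].count, total_rows))
            break;
    }
    return i;
}
```

With mostly unique lexemes every count is 1, so the function returns 0 and nothing is stored, which is the "random noise" case the check is meant to guard against.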

Jan Urbanski
GPG key ID: E583D7D2

ouden estin
