Re: gsoc, text search selectivity and dllist enhancments

From: "Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Jan Urbański <j(dot)urbanski(at)students(dot)mimuw(dot)edu(dot)pl>, "Postgres - Hackers" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: gsoc, text search selectivity and dllist enhancments
Date: 2008-07-04 19:20:08
Message-ID: 486E77E8.6010404@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Tom Lane wrote:
> "Heikki Linnakangas" <heikki(at)enterprisedb(dot)com> writes:
>> Tom Lane wrote:
>>> The data structure I'd suggest is a simple array of pointers
>>> to the underlying hash table entries. Since you have a predetermined
>>> maximum number of lexemes to track, you can just palloc the array once
>>> --- you don't need the expansibility properties of a list.
>
>> The number of lexemes isn't predetermined. It's 2 * (longest tsvector
>> seen so far), and we don't know beforehand how long the longest tsvector is.
>
> Hmm, I had just assumed without looking too closely that it was stats
> target times a fudge factor. What is the rationale for doing it as
> above? I don't think I like the idea of the limit varying over the
> course of the scan --- that means that lexemes in different places
> in the input will have significantly different probabilities of
> surviving to the final result.

Well, clearly if the list is smaller than the longest tsvector,
inserting all elements of that long tsvector will flush out all other
entries. Or if we throw away the newly inserted entries, some elements
will never have a chance to climb up the list. I'm not sure where the
"times two" figure comes from, maybe it's just a fudge factor, but the
bottom line is that the minimum size needed depends on the size of the
longest tsvector.

(Jan is offline until Saturday...)

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message David E. Wheeler 2008-07-05 05:39:54 Re: PATCH: CITEXT 2.0
Previous Message Alvaro Herrera 2008-07-04 19:12:01 Re: pgsql: Fix a couple of bugs in win32 shmem name generation: * Don't cut