Re: gsoc, text search selectivity and dllist enhancments

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Cc: Jan Urbański <j(dot)urbanski(at)students(dot)mimuw(dot)edu(dot)pl>, Heikki Linnakangas <heikki(at)enterprisedb(dot)com>, Postgres - Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: gsoc, text search selectivity and dllist enhancments
Date: 2008-07-10 20:37:28
Message-ID: 13892.1215722248@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Alvaro Herrera <alvherre(at)commandprompt(dot)com> writes:
> Jan Urbaski wrote:
>> Oh, one important thing. You need to choose a bucket width for the LC
>> algorithm, that is decide after how many elements will you prune your
>> data structure. I chose to prune after every twenty tsvectors.

> Do you prune after X tsvectors regardless of the numbers of lexemes in
> them? I don't think that preserves the algorithm properties; if there's
> a bunch of very short tsvectors and then long tsvectors, the pruning
> would take place too early for the initial lexemes. I think you should
> count lexemes, not tsvectors.

Yeah. I haven't read the Lossy Counting paper in detail yet, but I
suspect that the mathematical proof of limited error doesn't work if the
pruning is done on a variable spacing. I don't see anything very wrong
with pruning intra-tsvector; the effects ought to average out, since the
point where you prune is going to move around with respect to the
tsvector boundaries.

regards, tom lane

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2008-07-10 21:02:36 Re: gsoc, text search selectivity and dllist enhancments
Previous Message Jan Urbański 2008-07-10 20:32:26 Re: gsoc, text search selectivity and dllist enhancments