From: Jan Urbański <j(dot)urbanski(at)students(dot)mimuw(dot)edu(dot)pl>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>, Postgres - Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: gsoc, text search selectivity and dllist enhancments
Date: 2008-07-10 20:15:04
Message-ID: 48766DC8.1000006@students.mimuw.edu.pl
Lists: pgsql-hackers

Tom Lane wrote:
> Jan Urbański <j(dot)urbanski(at)students(dot)mimuw(dot)edu(dot)pl> writes:
>> Do you think it's worthwhile to implement the LC algorithm in C and send
>> it out, so others could try it out? Heck, maybe it's worthwhile to
>> replace the current compute_minimal_stats() algorithm with LC and see
>> how that compares?
>
> Very possibly. I repeat that the current implementation of
> compute_minimal_stats is very ad-hoc code and wasn't written with an eye
> to high performance. Replacing it with an algorithm that someone
> actually thought about might well be worth doing.

Here's a patch that combines the two patches posted here:
http://archives.postgresql.org/message-id/48649261.5040703@students.mimuw.edu.pl
and adds a C implementation of the Lossy Counting algorithm.

It defines two typanalyze functions, ts_typanalyze_std and
ts_typanalyze_lc, so you can see what statistics are gathered by each of
them. It's meant to be applied to HEAD, followed by updating pg_type and
running ANALYZE on a few tables with tsvectors (i.e. for testing, not
committing).

My observations: the LC algorithm beats the previous one by a fairly
large margin (20-30%) timewise. The results are almost identical; I got
discrepancies of about 0.05 in some lexemes' frequencies. I intend to
stick with LC for tsvectors, which will allow me to throw away the
Dllist changes.
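
For anyone who wants to follow along without reading the patch, here is
a minimal Python sketch of the core Lossy Counting algorithm (Manku &
Motwani). This is only an illustration of the technique, not the C code
from the patch; the epsilon value and the sample stream are made up:

```python
# Illustrative sketch of Lossy Counting, not the patch's C implementation.
from math import ceil

def lossy_count(stream, epsilon):
    w = ceil(1 / epsilon)            # bucket width
    counts = {}                      # item -> (frequency, max underestimation)
    n = 0
    for item in stream:
        n += 1
        bucket_id = ceil(n / w)      # id of the current bucket
        if item in counts:
            f, delta = counts[item]
            counts[item] = (f + 1, delta)
        else:
            counts[item] = (1, bucket_id - 1)
        if n % w == 0:               # bucket boundary: prune rare items
            counts = {k: (f, d) for k, (f, d) in counts.items()
                      if f + d > bucket_id}
    return counts, n

# Items whose true frequency exceeds epsilon * n are guaranteed to survive,
# and each surviving count underestimates the true one by at most epsilon * n.
counts, n = lossy_count(["a"] * 50 + ["b"] * 30 + list("cdefg") * 4,
                        epsilon=0.1)
```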

If I want to keep to my GSoC schedule I won't be able to implement LC
for general statistics gathering, but it's trivial. If no one gets
around to it before then, I can do it after the Summer of Code (if only
to see how it'll work).

Oh, one important thing. You need to choose a bucket width for the LC
algorithm, that is, decide after how many elements you will prune your
data structure. I chose to prune after every twenty tsvectors. You might
consider:
- picking some other arbitrary value
- making it depend on the largest tsvector size
- making it depend on the statistics_target
- pruning after every X lexemes instead of after every Y tsvectors.
With per-tsvector pruning the buckets vary in width, so you can argue
that the order of input makes a difference again. OTOH the situation
here is a bit different: you get streams of mutually distinct elements
(lexemes inside a tsvector are all different), and pruning in the middle
of such a stream might be unfair to lexemes that are still to be
processed. Hmm, dunno.
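
The prune-once-per-twenty-tsvectors variant could be sketched like this
(a hypothetical Python illustration, not the patch's C code; the
function and constant names are made up). Because pruning happens only
between tsvectors, a bucket boundary never falls in the middle of one
tsvector's run of distinct lexemes:

```python
# Hypothetical sketch of pruning between tsvectors rather than after a
# fixed number of lexemes; names and the interval are illustrative only.
PRUNE_INTERVAL = 20    # "after every twenty tsvectors" from the text above

def analyze_tsvectors(tsvectors):
    counts = {}        # lexeme -> (frequency, max underestimation)
    bucket = 0         # pruning rounds completed so far
    for i, tsv in enumerate(tsvectors, start=1):
        for lexeme in tsv:             # lexemes in one tsvector are distinct
            if lexeme in counts:
                f, d = counts[lexeme]
                counts[lexeme] = (f + 1, d)
            else:
                counts[lexeme] = (1, bucket)
        if i % PRUNE_INTERVAL == 0:    # prune only between tsvectors
            bucket += 1
            counts = {k: (f, d) for k, (f, d) in counts.items()
                      if f + d > bucket}
    return counts
```

The downside discussed above is visible here: buckets now span a
variable number of lexemes, so the pruning threshold is no longer tied
to a fixed element count.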

Cheers,
Jan

--
Jan Urbanski
GPG key ID: E583D7D2

ouden estin

Attachment Content-Type Size
gsoc08-tss-03-with-dllist.diff.gz application/gzip 9.5 KB
