Re: gsoc, text search selectivity and dllist enhancments

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Jan Urbański <j(dot)urbanski(at)students(dot)mimuw(dot)edu(dot)pl>
Cc: Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>, Heikki Linnakangas <heikki(at)enterprisedb(dot)com>, Postgres - Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: gsoc, text search selectivity and dllist enhancments
Date: 2008-07-14 00:54:19
Message-ID: 18435.1215996859@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

=?UTF-8?B?SmFuIFVyYmHFhHNraQ==?= <j(dot)urbanski(at)students(dot)mimuw(dot)edu(dot)pl> writes:
> OK, here's the (hopefully final) version of the typanalyze function for
> tsvectors. It applies to HEAD and passes regression tests.

> I now plan to move towards a selectivity function that'll use the
> gathered statistics.

Applied with some revisions.

Rather than making pg_statistic stakind 4 be specific to tsvector,
I thought it'd be better to define it as "most common elements", with
the idea that it could be used for array and array-like types as well as
tsvector. (I'm not actually planning to go off and make that happen
right now, but it seems like a pretty obvious extension.)

I thought it was a bit schizophrenic to repurpose
pg_stats.most_common_freqs for element frequencies while creating a
separate column for the elements themselves. What I've done for the
moment is to define both most_common_vals and most_common_freqs as
referring to the elements in the case of tsvector (or anything else that
has stakind 4 in place of stakind 1). You could make an argument for
inventing *two* new pg_stats columns instead, but I think that is
probably overkill; I doubt it'll be useful to have both MCV and MCELEM
stats for the same column. This could easily be changed though.

I removed the prune step after the last tsvector. I'm not convinced
that the LC algorithm's guarantees still hold if we prune partway
through a bucket, and anyway it's far from clear that we'd save enough
in the sort step to compensate for more HASH_REMOVE operations. I'm
open to being convinced otherwise.

I made some other cosmetic changes, but those were the substantive ones.

regards, tom lane

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2008-07-14 01:17:35 Re: PATCH: CITEXT 2.0 v3
Previous Message David E. Wheeler 2008-07-13 23:13:50 Re: PATCH: CITEXT 2.0 v3