Heikki Linnakangas wrote:
> Jan Urbański wrote:
>> So right now the idea is to:
>> (1) pre-sort STATISTIC_KIND_MCELEM values
>> (2) build an array of pointers to detoasted values in tssel()
>> (3) use binary search when looking for MCELEMs during tsquery analysis
> Sounds like a plan. In (2), it's even better to detoast the values
> lazily. For a typical one-word tsquery, the binary search will only look
> at a small portion of the elements.
Here's another version.
Most common lexemes get sorted before storing in pg_statistic. The
ordering is on length first and value second. That way we can avoid
strncmp() calls when the lexemes have different lengths (and lexemes
know their lengths, so the data is readily available). Also, in the
binary search routine during selectivity estimation we can sometimes
avoid detoasting (I think) Datums from the pg_statistic MCELEM array.
See comments in code.
Pre-sorting introduced one problem (see XXX in code): it's not easy
anymore to get the minimal frequency of MCELEM values. I was using it to
assert that the selectivity of a tsquery node containing a lexeme not in
MCELEM is no more that min(MCELEM freqs) / 2. That's only significant
when the minimum frequency is less than DEFAULT_TS_SEL * 2, so I'm kind
of inclined to ignore it and maybe drop a comment in the code that this
may be a potential problem.
If nothing is fundamentally broken with this, I'll repeat my profiling
tests to see if anything has been gained.
GPG key ID: E583D7D2
In response to
pgsql-hackers by date
|Next:||From: Jan Urbański||Date: 2008-08-14 20:27:55|
|Subject: Re: gsoc, oprrest function for text search take 2|
|Previous:||From: Tom Lane||Date: 2008-08-14 19:04:35|
|Subject: Re: WIP: patch to create explicit support for semi and anti joins |