ANALYZE versus expression indexes with nondefault opckeytype

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: pgsql-hackers(at)postgreSQL(dot)org
Subject: ANALYZE versus expression indexes with nondefault opckeytype
Date: 2010-07-31 03:53:26
Message-ID: 16619.1280548406@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

I've been poking at the FTS problem example recently submitted by
Artur Dabrowski. It contains an index declared like so:

CREATE INDEX idx_keywords_ger ON search_tab
USING gin (to_tsvector('german'::regconfig, keywords));

which can be queried like so

select * from search_tab where
to_tsvector('german', keywords) @@ to_tsquery('german', 'whatever');

which is all fine and matches suggestions in our documentation. But I
was dismayed to discover that ANALYZE wasn't storing any statistics
for this expression index, which rather hobbles the planner's ability
to produce useful estimates for the selectivity of such WHERE clauses.

The reason for that turns out to be this bit in analyze.c:

/*
* Can't analyze if the opclass uses a storage type
* different from the expression result type. We'd get
* confused because the type shown in pg_attribute for
* the index column doesn't match what we are getting
* from the expression. Perhaps this can be fixed
* someday, but for now, punt.
*/
if (exprType(indexkey) !=
Irel[ind]->rd_att->attrs[i]->atttypid)
continue;

Since a tsvector index does have a storage type different from tsvector
(in fact, it's shown as TEXT), this code decides to punt and not analyze
the column. I think it's past time we did something about this.

After a bit of study of the code, it appears to me that it's not too
difficult to fix: we just have to use the expression's result type
rather than the index column's atttypid in the subsequent processing.
ANALYZE never actually looks at the index column contents (indeed
cannot, since our index AMs provide no API for extracting the contents),
so atttypid isn't really all that relevant to it. However, this'll
imply an API/ABI break for custom typanalyze functions: we'll need to
store the actual type OID in struct VacAttrStats, and use that rather
than looking to attr->atttypid or any of the other type-dependent
fields of the Form_pg_attribute. So that seems to make it not practical
to back-patch.

I thought of an ugly hack that would avoid an API/ABI break: since
VacAttrStats.attr is a copy of the pg_attribute data, we could scribble
on it, changing atttypid, attlen, attbyval, etc to match the index
expression result type. This seems pretty grotty, but it would allow
the fix to be back-patched into existing branches.

Comments? I'm not sure which way to jump here.

regards, tom lane

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2010-07-31 04:40:03 rbtree code breaks GIN's adherence to maintenance_work_mem
Previous Message Robert Haas 2010-07-31 01:55:45 Re: reducing NUMERIC size for 9.1