Re: gincostestimate and hypothetical indexes

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Julien Rouhaud <julien(dot)rouhaud(at)dalibo(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: gincostestimate and hypothetical indexes
Date: 2015-12-01 17:38:34
Message-ID: 14875.1448991514@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Julien Rouhaud <julien(dot)rouhaud(at)dalibo(dot)com> writes:
> On 01/12/2015 00:37, Tom Lane wrote:
>> Maybe we could do something along the lines of pretending that 90% of the
>> index size given by the plugin is entry pages? Don't know what a good
>> ratio would be exactly, but we could probably come up with one with a bit
>> of testing.

> I used zero values because gincostestimate already handle empty
> statistics, and pretend that 100% of the pages are entry pages:

Yeah, but that code is pretty bogus. It was never intended to do more
than minimally hold the fort until someone had vacuumed. If we're trying
to support hypothetical-index plugins with this code, it should try to
do something a bit more realistic.

I did a bit of investigation using some sample data I had laying around
(JSONB and TSVECTOR data). It looks like assuming that entry pages are
90% of the index is not too awful; I saw actual values ranging from
80% to 94%. The real weak spot of the current code, however, is

numEntries = numTuples; /* bogus, but no other info available */

which is just as bogus as it says, because numTuples is going to be the
heap tuple count not anything specific to GIN. Often you'd expect the
number of entries to be some multiple of the number of tuples, because
the whole point of GIN is to be able to index components of the indexed
column's values. But on the other hand if there are not a lot of
distinct component values, you could get many fewer entries than tuples.

Based on what I saw in this small sample, I'm inclined to propose setting
numEntries to 100 times numEntryPages, that is, assume 100 entries per
entry page. That could easily be off by a factor of 2, but it seems more
robust than just blindly using the heap tuple count.

(Of course, all of this is predicated on the assumption that the
hypothetical-index plugin gives us some realistic value for the index's
size in pages, but if it doesn't it's the plugin's fault.)

regards, tom lane

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Pavel Stehule 2015-12-01 17:51:42 Re: proposal: multiple psql option -c
Previous Message Greg Stark 2015-12-01 17:17:21 Re: [RFC] overflow checks optimized away