Re: GIN improvements part2: fast scan

From: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: GIN improvements part2: fast scan
Date: 2014-01-24 18:21:08
Message-ID: 52E2AF14.9020306@vmware.com
Lists: pgsql-hackers

On 01/24/2014 01:58 PM, Alexander Korotkov wrote:
> On Thu, Jan 23, 2014 at 8:22 PM, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com
>> wrote:
>
>> In summary, these are fairly small patches, and useful on their own, so I
>> think these should be committed now. But please take a look and see if the
>> logic in scanGetItem/keyGetItem looks correct to you. After this, I think
>> the main fast scan logic will go into keyGetItem.
>
> Good, thanks! Now I can reimplement fast scan based on these patches.

I hope we're not wasting effort doing the same thing, but I was also
hacking on that; here's what I got. It applies on top of the previous
set of patches.

I decided to focus on the ginget.c changes, and worry about the catalog
changes later. Instead, I added an abstraction for calling the ternary
consistent function, with a shim implementation that checks if there is
only one UNKNOWN argument, and tries the regular boolean consistent
function "both ways" for the UNKNOWN argument. That's the same strategy
we were already using when dealing with a key with one lossy page, so I
refactored that to also use the new ternary consistent function.
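To illustrate the idea, here's a simplified sketch of such a shim. The types and signatures are hypothetical stand-ins, not the actual ones from the patch; the real functions take the usual consistent-function arguments (query keys, strategy, etc.):

```c
#include <assert.h>
#include <stdbool.h>

typedef enum { GIN_FALSE, GIN_TRUE, GIN_MAYBE } GinTernaryValue;

/* simplified stand-in for the boolean consistent function */
typedef bool (*BoolConsistentFn)(bool *check, int nkeys);

/*
 * Shim: if exactly one argument is MAYBE (unknown), call the boolean
 * consistent function with that argument set to both false and true.
 * If the two calls agree, the unknown argument doesn't matter and we
 * can return a definite answer; otherwise the result is MAYBE.
 */
static GinTernaryValue
shimTriConsistent(BoolConsistentFn consistent, GinTernaryValue *check, int nkeys)
{
    bool boolcheck[32];
    int  maybeIdx = -1;

    for (int i = 0; i < nkeys; i++)
    {
        if (check[i] == GIN_MAYBE)
        {
            if (maybeIdx >= 0)
                return GIN_MAYBE;   /* more than one unknown: give up */
            maybeIdx = i;
        }
        boolcheck[i] = (check[i] == GIN_TRUE);
    }

    if (maybeIdx < 0)
        return consistent(boolcheck, nkeys) ? GIN_TRUE : GIN_FALSE;

    boolcheck[maybeIdx] = false;
    bool resFalse = consistent(boolcheck, nkeys);
    boolcheck[maybeIdx] = true;
    bool resTrue = consistent(boolcheck, nkeys);

    if (resFalse == resTrue)
        return resFalse ? GIN_TRUE : GIN_FALSE;
    return GIN_MAYBE;
}

/* example boolean consistent functions for testing the shim */
static bool andConsistent(bool *check, int n)
{
    for (int i = 0; i < n; i++)
        if (!check[i])
            return false;
    return true;
}
static bool orConsistent(bool *check, int n)
{
    for (int i = 0; i < n; i++)
        if (check[i])
            return true;
    return false;
}
```

For an AND query, {FALSE, MAYBE} resolves to a definite FALSE, while {TRUE, MAYBE} stays MAYBE; for an OR query, {TRUE, MAYBE} resolves to TRUE.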

That abstraction can be used to do other optimizations in the future.
For example, building the truth table like I suggested a long time ago,
or simply checking for some common cases, like if the consistent
function implements plain AND logic. Or caching the results of expensive
consistent functions. And obviously, that is where the call to the
opclass-provided tri-state consistent function will go.

Then, I rewrote keyGetItem to make use of the ternary consistent
function, to perform the "pre-consistent" check. That is the core logic
from your patch. Currently (ie. without the patch), we loop through all
the entries, and advance them to the next item pointer > advancePast,
and then perform the consistent check for the smallest item among those.
With the patch, we first determine the smallest item pointer among the
entries with curItem > advancePast, and call that minItem. The others
are considered "unknown", as they might contain an item X where
advancePast < X < minItem. Normally, we would have to load the next item
from that entry to determine if X exists, but we can skip it if the
ternary pre-consistent function says that X cannot match anyway.
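In pseudocode terms, the pre-consistent check looks roughly like this (a hand-wavy sketch with made-up names and plain integers standing in for item pointers, not the actual ginget.c code):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef enum { GIN_FALSE, GIN_TRUE, GIN_MAYBE } GinTernaryValue;

/* ternary AND consistent: FALSE dominates, then MAYBE, else TRUE */
static GinTernaryValue
triAndConsistent(GinTernaryValue *check, int nkeys)
{
    GinTernaryValue res = GIN_TRUE;
    for (int i = 0; i < nkeys; i++)
    {
        if (check[i] == GIN_FALSE)
            return GIN_FALSE;
        if (check[i] == GIN_MAYBE)
            res = GIN_MAYBE;
    }
    return res;
}

/*
 * Pre-consistent check: decide whether minItem could possibly match.
 * Entries already positioned at minItem are TRUE, entries that have
 * moved past it are FALSE, and entries not yet advanced beyond
 * advancePast are MAYBE. If the ternary consistent function returns
 * FALSE, minItem cannot match, and we need not fetch the lagging
 * entry streams for it.
 */
static bool
preConsistentCheck(uint64_t *curItem, int nentries,
                   uint64_t advancePast, uint64_t minItem)
{
    GinTernaryValue check[8];

    for (int i = 0; i < nentries; i++)
    {
        if (curItem[i] <= advancePast)
            check[i] = GIN_MAYBE;       /* not advanced yet: unknown */
        else if (curItem[i] == minItem)
            check[i] = GIN_TRUE;
        else
            check[i] = GIN_FALSE;       /* stream already past minItem */
    }
    return triAndConsistent(check, nentries) != GIN_FALSE;
}
```

With an AND query, if another entry stream has already skipped past minItem, the check refutes minItem outright; if the other stream merely lags behind, the result is MAYBE and we have to fetch it as before.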

In addition to that, I'm using the ternary consistent function to check
if minItem is a match, even if we haven't loaded all the entries yet.
That's less important, but I think for something like "rare1 | (rare2 &
frequent)" it might be useful. It would allow us to skip fetching
'frequent', when we already know that 'rare1' matches for the current
item. I'm not sure if that's worth the cycles, but it seemed like an
obvious thing to do, now that we have the ternary consistent function.
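Concretely, evaluating that example query in three-valued logic (again just an illustrative sketch, with the query hard-coded rather than coming from the opclass):

```c
#include <assert.h>

typedef enum { GIN_FALSE, GIN_TRUE, GIN_MAYBE } GinTernaryValue;

/* Kleene three-valued AND and OR */
static GinTernaryValue triAnd2(GinTernaryValue a, GinTernaryValue b)
{
    if (a == GIN_FALSE || b == GIN_FALSE)
        return GIN_FALSE;
    if (a == GIN_MAYBE || b == GIN_MAYBE)
        return GIN_MAYBE;
    return GIN_TRUE;
}
static GinTernaryValue triOr2(GinTernaryValue a, GinTernaryValue b)
{
    if (a == GIN_TRUE || b == GIN_TRUE)
        return GIN_TRUE;
    if (a == GIN_MAYBE || b == GIN_MAYBE)
        return GIN_MAYBE;
    return GIN_FALSE;
}

/* "rare1 | (rare2 & frequent)" evaluated in three-valued logic */
static GinTernaryValue
queryConsistent(GinTernaryValue rare1, GinTernaryValue rare2,
                GinTernaryValue frequent)
{
    return triOr2(rare1, triAnd2(rare2, frequent));
}
```

If rare1 is TRUE at the current item, the result is TRUE even with frequent still MAYBE, so 'frequent' need not be fetched; if rare1 is FALSE and rare2 is TRUE, the result stays MAYBE and we do have to load it.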

(hmm, I should put the above paragraphs in a comment in the patch)

This isn't exactly the same structure as in your patch, but I found the
concept easier to understand when written this way. I did not implement
the sorting of the entries. It seems like a sensible thing to do, but
I'd like to see a test case that shows the difference before bothering
with it. If we do it, a binary heap probably makes more sense than
keeping the array fully sorted.

There's one tradeoff here that should be mentioned: In most cases, it's
extremely cheap to fetch the next item from an entry stream. We load a
page worth of items into the array, so it's just a matter of pulling the
next item from the array. Instead of trying to "refute" such items based
on other entries, would it be better to load them and call the
consistent function the normal way for them? Refuting might slash all
the entries in one consistent check, but OTOH, when refuting fails, the
consistent check was a waste of cycles. If we only tried to refute items
when the alternative would be to load a new page, there would be less
chance of a performance regression from this patch.

Anyway, how does this patch look to you? Did I get the logic correct?

Do you have some test data and/or queries that you could share easily?

- Heikki

Attachment Content-Type Size
0001-Add-the-concept-of-a-ternary-consistent-check-and-us.patch text/x-diff 23.8 KB
