Re: Resjunk sort columns, Heikki's index-only quals patch, and bug #5000

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: Re: Resjunk sort columns, Heikki's index-only quals patch, and bug #5000
Date: 2009-09-15 11:53:13
Message-ID: 603c8f070909150453g4b3a9871x50b05314945b6d55@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, Sep 15, 2009 at 5:47 AM, Heikki Linnakangas
<heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
> Robert Haas wrote:
>> Hi, I'm reviewing this patch for the 2009-09 CommitFest.
>
> Thank you!
>
>> It doesn't seem to compile.
>
> Looks like the patch has bitrotted, sorry about that. Attached is an
> updated version. I've also pushed this to my git repository at
> git://git.postgresql.org/git/users/heikki/postgres.git, branch "heapfetch".

OK, I'll pull from that.

>> Actually, before I even tried compiling this, I was looking through
>> the joinpath.c changes, since that is an area of the code with which I
>> have some familiarity.  As I'm sure you're aware, the lack of
>> commenting makes it quite difficult to understand what this is trying
>> to do, and the functions are poorly named.  It isn't self-explanatory
>> what "bubbling up" means, even in the limited context of joinpath.c.
>
> Yeah, understood :-(. The bubbling up refers to moving HeapFetch nodes
> above a join, which I explained earlier in this thread:
> http://archives.postgresql.org/message-id/4A90448F.4060601@enterprisedb.com,
>
>> Leaving that aside, I think that the approach here is likely wrong;
>> the decision about when to perform a heap fetch doesn't seem to be
>> based on cost, which I think it needs to be.
>
> Right, cost estimation and making choices based on it still missing.
>
>>  Consider A IJ B, with
>> the scan over A implemented as an index scan.  It seems to me that if
>> the join selectivity is < 1, then assuming there's a choice, we
>> probably want to join A to B and then do the heap fetches against A
>> afterwards.  But if the join selectivity is > 1 (consider, for
>> example, a cross join), we probably want to do the heap fetches first.
>
> Hmm, good point. I didn't consider that join selectivity can be > 1.
>
> A more common scenario is that there's an additional filter condition on
> the HeapFetch, with a selectivity < 1. It can then cheaper to perform
> the heap fetches first and only join the remaining rows that satisfy the
> filter condition.

Well, again, it seems to me that it entirely depends on whether the IJ
increases or decreases the number of rows. You want to do the heap
fetches at the point where there are the fewest of them to do, and you
can't know that a priori. When you start talking about "more common"
scenarios, what you really mean is "more common in the queries I
normally do", and that's not the same as what other people's queries
do. (See, for example, previous discussions on -performance, where it
turns out that my suggested "fix" for a particular kind of planner
problem is the exact opposite of Kevin Grittner's fix for a problem
with the same code; the existing code bounds a certain value from
below at 1 - I suggested raising it to 2, he suggested lowering it to
0.)

> It could also be cheaper to perform HeapFetch first if there's a lot of
> dead tuples in the table, as the heap fetch weeds them out. I'm not sure
> if we can estimate that in a meaningful way.

Depends whether the selectivity calculations take this into account -
off the top of my head, I'm not sure.

>> Am I right to think that the heap fetch node is not optional?  i.e.
>> even if only columns in the index are ever used, we need to make sure
>> that a heap fetch node gets inserted to check visibility.  Is that
>> right?
>
> Correct. The plan is to eventually use the visibility map to skip the
> heap access altogether, but we'll need to make the visibility map 100%
> reliable first. We'll still need a HeapFetch node to perform the
> visibility check, but it could perform it against the visibility map
> instead of the heap.
>
>> I also think that the use of the term index-only quals is misleading.
>> It seemed good when you first posted to -hackers about it, but looking
>> at the patch, it means we have both index quals and index-only quals,
>> and it seems like that might not be the greatest naming.  Maybe we
>> could call them index-tuple quals or index-evaluated quals or
>> something.
>
> Hmm, I'm not too fond of the naming either. Not sure I like those
> alternatives any better, though.

Maybe someone else will come up with a better idea. :-)

...Robert

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Brendan Jurd 2009-09-15 12:52:26 Re: WIP: generalized index constraints
Previous Message Marcos Luis Ortiz Valmaseda 2009-09-15 11:48:48 Re: PGCluster-II Progress