From: | Robert Haas <robertmhaas(at)gmail(dot)com> |
---|---|

To: | Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com> |

Cc: | pgsql-hackers(at)postgresql(dot)org, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |

Subject: | Re: Resjunk sort columns, Heikki's index-only quals patch, and bug #5000 |

Date: | 2009-09-21 00:43:45 |

Message-ID: | 603c8f070909201743r1fcc3405w1980962d2bd30455@mail.gmail.com |

Views: | Raw Message | Whole Thread | Download mbox | Resend email |

Thread: | |

Lists: | pgsql-hackers |

On Tue, Sep 15, 2009 at 7:53 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:

>>> Consider A IJ B, with

>>> the scan over A implemented as an index scan. It seems to me that if

>>> the join selectivity is < 1, then assuming there's a choice, we

>>> probably want to join A to B and then do the heap fetches against A

>>> afterwards. But if the join selectivity is > 1 (consider, for

>>> example, a cross join), we probably want to do the heap fetches first.

>>

>> Hmm, good point. I didn't consider that join selectivity can be > 1.

>>

>> A more common scenario is that there's an additional filter condition on

>> the HeapFetch, with a selectivity < 1. It can then cheaper to perform

>> the heap fetches first and only join the remaining rows that satisfy the

>> filter condition.

>

> Well, again, it seems to me that it entirely depends on whether the IJ

> increases or decreases the number of rows. You want to do the heap

> fetches at the point where there are the fewest of them to do, and you

> can't know that a priori. When you start talking about "more common"

> scenarios, what you really mean is "more common in the queries I

> normally do", and that's not the same as what other people's queries

> do. (See, for example, previous discussions on -performance, where it

> turns out that my suggested "fix" for a particular kind of planner

> problem is the exact opposite of Kevin Grittner's fix for a problem

> with the same code; the existing code bounds a certain value from

> below at 1 - I suggested raising it to 2, he suggested lowering it to

> 0.)

I've been mulling over this problem all week. I haven't gotten all

that far, but here are my thoughts.

Suppose we're planning some joinrel within a large join nest. We have

two partial paths A and B with identical pathkeys. Path A does not

involve an index scan, so we know its exact cost. Path B involves an

index scan, so a heap fetch will have to be done at some point, but

since we can't yet know where the best place to do that heap fetch is,

so we can't know the exact cost of B. Basically, the decision we face

here is whether to keep both plans A and B or to discard one of them

as clearly inferior to the other. As a preliminary observation, this

is the logic that is performed by add_path(), which gets called A LOT

in planning large join nests, so the performance impact of what

happens there needs to be measured carefully.

We can, however, put some bounds on the cost of B. We know what the

cost of B is disregarding the heap fetch (or heap fetches) that will

eventually need to be performed. This cost is also the minimum total

cost of B, in the case where some yet-to-be-performed join is

estimated to return no rows, and thus the heap fetch is estimated to

not actually need to fetch anything. We can also bound the maximum

cost of B: it certainly can't be any higher than the cost of doing all

the heap fetches immediately above the index fetches. We could

probably compute a tighter upper bound by considering various possible

positions for the heap fetches at or below the level of the joinrel,

but I doubt that it's worth the effort.

Instead, what we can do is compare the low-estimate for B to the

estimate for A. If it's higher, discard B. Otherwise, compare the

high-estimate for B to the estimate for A. If it's lower, discard A.

Otherwise, keep both paths. More generally, we can conceive of each

path as having low and high estimates, and we can say that for two

paths with identical pathkeys, P dominates P' if the high-estimate for

P is less than the low-estimate for P'.

In this view of the world, we don't actually need to represent the

heap-fetches in the path trees during joinrel planning. Instead, we

build up a set of possible paths for the whole joinrel, and then at

the end, we go back and look at any remaining paths that require heap

fetches to be inserted and figure out the best place to put them. The

major problem I see with this approach is that it may slow down

add_path() too much, but if that turns out to be the case I don't have

another idea short of attacking the problem using some kind of

heuristics that will probably be less accurate than an analysis of the

type described above.

Thoughts?

...Robert

- Re: Resjunk sort columns, Heikki's index-only quals patch, and bug #5000 at 2009-09-15 11:53:13 from Robert Haas

From | Date | Subject | |
---|---|---|---|

Next Message | Robert Haas | 2009-09-21 01:25:17 | Re: Resjunk sort columns, Heikki's index-only quals patch, and bug #5000 |

Previous Message | Jeff Davis | 2009-09-21 00:21:51 | Re: operator exclusion constraints [was: generalized index constraints] |