Re: Unexpected sequential scan on an indexed column

From: Eddy Escardo-Raffo <eescardo(at)kikini(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-performance <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Unexpected sequential scan on an indexed column
Date: 2009-11-16 11:18:25
Message-ID: 4eaa4a5e0911160318r68a3b61er5406f0886a7de0@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

OK, I think that after reading this
doc<http://www.postgresql.org/files/developer/optimizer.pdf> (which
I hadn't encountered before) about the optimizer, something clicked in my
brain and I think I can answer my own question. I was basically thinking
from my own perspective rather than from the query planner's perspective:
- From my perspective I know that the subselect will return very few values,
so naively I expected that the planner would be able to do a bitmap index
scan with the small set of values returned, without needing to do a join
(such as the nested loop join it ended up choosing).
- However (and this is probably obvious to all of you), the query
planner doesn't really know for a fact that a sub-select will result in a
small number of rows, so it guesses based on its statistics what the best
kind of join would be. A 'bitmap index scan' is not one of the choices for a
join, I'm guessing because a 'nested loop join with inner index scan' is a
more generally applicable strategy that can get the same order of magnitude
of performance in restriction cases that end up being as simple as an IN
(list) restriction. However, there are more competing possibilities for
picking an appropriate join strategy than for picking a strategy to apply an
IN (list) restriction, so the planner may not pick the 'nested loop join
with inner index scan' if the ANALYZE statistics don't guide it that way,
even if that would be the best strategy in the end.
I guess the only way I can think of to make a generic planner that would
have performend well even in the lopsided statistics case is to create some
plan nodes with contingency conditions. E.g.:

Plan: Nested loop join with sequential scan
Assumption: all table values are the same
Contingency plan: nested loop join with index scan

Then, if the assumption for the plan is violated early enough while
executing the plan, the query executor would abort that plan node execution
and start over with the contingency plan.

I guess implementing this kind of system in a generic way could get pretty
hairy, and given my limited experience I don't know if the proportion of
query plans that would be improved by having these kinds of contingency
plans is significant enough to warrant the cost of developing this system,
but I'm gathering that most query planners (including the postgres planner)
don't do this kind of contingency planning :)

Thanks!
Eddy
On Sun, Nov 15, 2009 at 5:46 PM, Eddy Escardo-Raffo <eescardo(at)kikini(dot)com>wrote:

> I was using VALUES in my examples to more closely mirror the results of a
> sub-select (I abstracted everything else away and noticed that even just
> using VALUES syntax instead of a sub-select, the performance was bad). The
> full statement I had that led me into this more narrow investigation in the
> first place looks more like:
>
> explain analyze SELECT u.userid FROM users u, (SELECT locid FROM locations
> WHERE ...) l WHERE u.location = l.locid LIMIT 10;
>
> Based on the investigation so far, it seems like this kind of statement
> will perform well when the users.location distribution is not overwhelmingly
> lopsided, but not otherwise. However, using the IN (list) notation with a
> list of integer literals seems to perform well no matter what is the
> specific distribution of values in the users.location column.
>
> I would like to understand why this is so, to help me write better queries
> in the future.
>
> Thanks,
> Eddy
> On Sun, Nov 15, 2009 at 5:23 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>
>> Eddy Escardo-Raffo <eescardo(at)kikini(dot)com> writes:
>> > For C, the planner estimated 10 thousand rows. For D, the planner
>> estimated
>> > 100 thousand rows, yet for E the planner estimated only 1 row, which is
>> the
>> > closest to reality. So, is there any way to specify a query that has a
>> > SUB-SELECT that returns a small set of values so that the planner treats
>> it
>> > similar to how it treats statement E, or does statement E get its
>> additional
>> > edge precisely from the fact that the restriction is defined by integer
>> > literals?
>>
>> Currently there is no attempt to look at the exact contents of a VALUES
>> construct for planning purposes. For the examples you're showing it
>> seems like the IN (list) notation is more compact and more widely used,
>> so improving the VALUES alternative doesn't seem that exciting.
>>
>> regards, tom lane
>>
>
>

In response to

Responses

Browse pgsql-performance by date

  From Date Subject
Next Message Wayne Beaver 2009-11-16 16:13:49 Re: Manual vacs 5x faster than autovacs?
Previous Message Laurent Laborde 2009-11-16 09:00:56 Re: limiting performance impact of wal archiving.