Re: Unexpected sequential scan on an indexed column

From: Eddy Escardo-Raffo <eescardo(at)kikini(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-performance <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Unexpected sequential scan on an indexed column
Date: 2009-11-16 01:06:26
Message-ID: 4eaa4a5e0911151706s31b2f9e1w23dae77f2db8c001@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

Yeah, that was it. Thanks! I do have one more question at the bottom,
though, if anyone has enough time to read through my analysis

If I create the table as:

CREATE TABLE users
(
userid integer NOT NULL,
location integer NOT NULL,
CONSTRAINT pk_users PRIMARY KEY (userid)
)
WITH (
OIDS=FALSE
);

CREATE INDEX idx_users_location
ON users
USING btree
(location);

INSERT INTO users (userid,location) SELECT GENERATE_SERIES(1,1000000) ,
GENERATE_SERIES(1,1000000)/100000;
UPDATE users SET location=76543 WHERE userid=100092;
UPDATE users SET location=76543 WHERE userid=997000;

So, now we have 10 distinct location values, evenly distributed, one value
(10) with only one row and one value (76543) with 2 rows. If, after this
setup I do *statement C:*

explain analyze SELECT userid FROM users, (VALUES (76543), (892), (10)) val
(id) WHERE location = val.id;

Nested Loop (cost=0.00..17277.21 rows=300000 width=4) (actual
time=0.023..0.06 rows=3 loops=1)
-> Values Scan on "*VALUES*" (cost=0.00..0.04 rows=3 width=4) (actual
time0.002..0.004 rows=3 loops=1)
-> Index Scan using idx_users_location on users (cost=0.00..4509.06
rows=10000 width=8) (actual time=0.008..0.009 rows=1 loops=3)
Index Cond: (users.location = "*VALUES*".column1)
Total runtime: 0.078 ms
(5 rows)

*and if I do statement D:*

explain analyze SELECT userid FROM users WHERE location IN (VALUES (76543),
(892), (10));
Nested Loop (cost=0.05..17277.24 rows=300000 width=4) (actual
time=0.033..0.056 rows=3 loops=1)
-> HashAggregate (cost=0.05..0.08 rows=3 width=4) (actual
time=0.012..0.015 rows=3 loops=1)
-> Values Scan on "*VALUES*" (cost=0.00..0.04 rows=3 width=4)
(actual time=0.002..0.004 rows=3 loops=1)
-> Index Scan using idx_users_location on users (cost=0.00..4509.06
rows=100000 width=8) (actual time=0.007..0.009 rows=1 loops=3)
Index Cond: (users.location = "*VALUES*".column1)
Total runtime: 0.094 ms
(6 rows)

Where C has a slight edge over D (I ran them both about 5 times and it seems
like C is approx. 20% faster for this specific data set). So, I think this
will work pretty good. However, I'm still curious (for my own education) as
to why something like the following has even more of an edge over the
previous two alternatives. *Statement E*:

explain analyze SELECT userid FROM users WHERE location IN (76543, 892, 10);

Bitmap Heap Scan on users (cost=12.91..16.93 rows=1 width=4) (actual
time=0.035..0.038 rows=3 loops=1)
Recheck Cond: (location = ANY ('{76543,892,10}'::integer[]))
-> Bitmap Index Scan on idx_users_location (cost=0.00..12.91 rows=1
width=0) (actual time=0.027..0.027 rows=3 loops=1)
Index Cond: (location = ANY ('{76543,892,10}'::integer[]))
Total runtime: 0.072 ms
(5 rows)

For C, the planner estimated 10 thousand rows. For D, the planner estimated
100 thousand rows, yet for E the planner estimated only 1 row, which is the
closest to reality. So, is there any way to specify a query that has a
SUB-SELECT that returns a small set of values so that the planner treats it
similar to how it treats statement E, or does statement E get its additional
edge precisely from the fact that the restriction is defined by integer
literals? If so, I think it's ok, because it seems like statements C or D
will work well enough when the location distribution is realistic, but I'd
like to be educated for the future :)

Thanks again!
Eddy

On Sun, Nov 15, 2009 at 3:59 PM, Eddy Escardo-Raffo <eescardo(at)kikini(dot)com>wrote:

> Thanks, Tom. I had discarded the possibility of data type mismatch already,
> which was your first guess, but was wondering if the lopsided distribution
> of location values would lead the planner to make a decision that is good on
> average but bad for this particular query, as you point out in your second
> guess.
>
> I'll try populating the test users with a more evenly distributed location
> field, which will be more realistic anyway, and see if that works out
> better.
>
> BTW, the -1 is not really a dummy value, but it's just a value that we have
> been using in tests for "fake test location ID". I just started performance
> measurement for my application and so far had measured performance with
> every user being in the same default location and things seemed to be going
> well, so I tried to switch a couple users to a different location and see
> what happened, and that made performance drop significantly.
> (even more detail: my queries also limit results to 10 approx, so DB
> quickly found 10 rows that match location -1, but it took a while to
> discover there weren't more than 2 rows with the other value).
>
> Thanks!
> Eddy
>
> On Sun, Nov 15, 2009 at 3:33 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>
>> Eddy Escardo-Raffo <eescardo(at)kikini(dot)com> writes:
>> > The table used in this query is called "users", and it has columns
>> "userid"
>> > (primary key) and "location".
>> > The "location" column is indexed.
>> > The users table has 1 million rows, and all rows have integer typed
>> value
>> > '-1' for "location" column, except for 2 rows that have the integer
>> value
>> > '76543'.
>>
>> Oh, after poking at it a bit more, I realize the problem: the planner
>> doesn't want to use an indexscan because it assumes there's a
>> significant probability that the search will be for -1 (in which case
>> the indexscan would be slower than a seqscan, as indeed your results
>> prove). Even though it could know in this particular case that the
>> comparison value isn't -1, I doubt that teaching it that would help your
>> real queries where it will probably be impossible to determine the
>> comparison values in advance.
>>
>> I would suggest considering using NULL rather than inventing a dummy
>> value for unknown locations. The estimation heuristics will play a
>> lot nicer with that choice.
>>
>> regards, tom lane
>>
>
>

In response to

Responses

Browse pgsql-performance by date

  From Date Subject
Next Message Tom Lane 2009-11-16 01:23:19 Re: Unexpected sequential scan on an indexed column
Previous Message Eddy Escardo-Raffo 2009-11-15 23:59:31 Re: Unexpected sequential scan on an indexed column