Interesting case of index un-usage

From: Claudio Freire <klaussfreire(at)gmail(dot)com>
To: postgres performance list <pgsql-performance(at)postgresql(dot)org>
Subject: Interesting case of index un-usage
Date: 2014-04-09 18:56:23
Message-ID: CAGTBQpb_jfZbfM3JPYZEwphrqScJwMCasx3D6Lx+Vw46obArcQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

I have this table that is quite big (several gig).

I was looking for a row manually (because a query would take too long)
- I know there is correlation between id and date, so I was doing
manual binary search for the id range that holds certain date, and I
found an interesting case where the planner makes a significant snafu:

select created from non_bid_logs where id >= 788991892 order by id limit 100;

> Limit (cost=0.00..185.15 rows=100 width=16)
> -> Index Scan using non_bid_logs_pkey on non_bid_logs (cost=0.00..33973433.99 rows=18349427 width=16)
> Index Cond: (id >= 788991892)

That uses the pk over id to get the first 100 rows above that. Quite
straightforward and correct - and fast.

Now... I originally tried:

select created from non_bid_logs where id >= 788991892 limit 100;

The same plan should work, and still be fast. But I get:

> Limit (cost=0.00..12.30 rows=100 width=8)
> -> Seq Scan on non_bid_logs (cost=0.00..2257215.96 rows=18350037 width=8)
> Filter: (id >= 788991892)

This seems like a snafu of cost estimation. The planner should know
about the spatial correlation of "id", it's not clustered manually,
but quite naturally clustered, and yet it estimates the limit will
find the rows so fast?

If I do:

select correlation from pg_stats where tablename = 'non_bid_logs' and
attname = 'id';

I get:

0.272682

I don't know if that's realistic, I don't really know how to interpret
that number. But, experimentally, the seqscan performs horribly.

If I set enable_seqscan=off, and retry, I get:

> Limit (cost=0.00..185.16 rows=100 width=8)
> -> Index Scan using non_bid_logs_pkey on non_bid_logs (cost=0.00..33978925.99 rows=18351396 width=8)
> Index Cond: (id >= 788991892)

So the planner knows about the index, it's just that it believes
(somehow foolishly) that the seqscan will be faster.

Browse pgsql-performance by date

  From Date Subject
Next Message Robert Burgholzer 2014-04-09 19:48:57 Re: Optimizing Time Series Access
Previous Message Bruce Momjian 2014-04-09 16:51:27 Re: PGSQL, checkpoints, and file system syncs