Query planner suggestion, for indexes with similar but not exact ordering.

From: Andrew Barnham <andrew(dot)barnham(at)gmail(dot)com>
To: pgsql-performance(at)postgresql(dot)org
Subject: Query planner suggestion, for indexes with similar but not exact ordering.
Date: 2011-11-14 22:22:46
Message-ID: CAC9_YpafQD5kEsG8XXyVK2iDd0aKgUF6m-DpN7yPMAhpZm3=7w@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

Hi all. Been using postgres for years, and lurking on this newsgroup for a
short while now to help me gain the benefit of your expertise and
experience and learn how to get most out of postgresql possible.

I do a fair bit of work on tables using composite keys. I have discovered
a couple of things lately that may or may not be worth looking at in terms
of query planner.

Consider my archetypal table. Based on real data.

table master
franchise smallint,
partnumber varchar

It is a spare parts list, combining part numbers from multiple suppliers
(franchise). Part numbers are typically unique but sometimes there are
duplicates. I have use cases which concern both finding parts by a
specific franchise or finding parts system wide. In my table I have follow
stats:
* Number of records : 2,343,569
* Number of unique partnumber records : 2,130,379 (i.e. for a given
partnumber there is on average, 1.1 records. i.e. a partnumber is used by
1.1 suppliers. The partnumber with the most number of records = 8 records.
* Number of unique suppliers : 35

Now consider following query: its purpose is to render next 20 rows at an
aribtrary position. The position being after record matching franchise=10,
partnumber='1' in partnumber then franchise order.

select * from master where partnum>='1' and (partnum>'1' or franchise>10)
order by partnum,franchise limit 20;

Now if I have a composite index on partnum + franchise. This query performs
the way you would expect and very quickly.

But if I have an index on partnum only the system seqscan's master. And
yields poor performance. i.e.:
============
Limit (cost=143060.23..143060.28 rows=20 width=93) (actual
time=2307.986..2307.998 rows=20 loops=1)
-> Sort (cost=143060.23..148570.14 rows=2203967 width=93) (actual
time=2307.982..2307.986 rows=20 loops=1)
Sort Key: partnum, franchise
Sort Method: top-N heapsort Memory: 19kB
-> Seq Scan on master (cost=0.00..84413.46 rows=2203967
width=93) (actual time=0.019..1457.001 rows=2226792 loops=1)
Filter: (((partnum)::text >= '1'::text) AND
(((partnum)::text > '1'::text) OR (franchise > 10)))
Total runtime: 2308.118 ms

I wonder, if it is possible and worthwhile, to setup the query planner to
recognize that because of the stats I indicate above, that a sort by
partnum is almost exactly the same as a sort by partnum+franchise. And
doing a Index scan on partnum index, and sorting results in memory will be
dramatically faster. The sort buffer only needs to be very small, will
only grow to 8 records only at most in my above example. The buffer will
scan partnum index, and as long as partnum is the same, it will sort that
small segment, as soon as the partnum increments when walking the index,
the buffer zeros out again for next sort group.

Artificially simulating this in SQL (only works with foreknowledge of max
count of records for a given part. i.e. +8 ) shows the dramatic theoretical
performance gain over the above.
explain analyze select * from (select * from master where partnum>='1'
order by partnum limit 20+8) x where partnum>'1' or franchise>10 order by
partnum,franchise limit 20;
=====
Limit (cost=77.71..77.75 rows=16 width=230) (actual time=0.511..0.555
rows=20 loops=1)
-> Sort (cost=77.71..77.75 rows=16 width=230) (actual
time=0.507..0.524 rows=20 loops=1)
Sort Key: x.partnum, x.franchise
Sort Method: quicksort Memory: 21kB
-> Subquery Scan x (cost=0.00..77.39 rows=16 width=230) (actual
time=0.195..0.367 rows=28 loops=1)
Filter: (((x.partnum)::text > '1'::text) OR (x.franchise >
10))
-> Limit (cost=0.00..76.97 rows=28 width=93) (actual
time=0.180..0.282 rows=28 loops=1)
-> Index Scan using master_searchpartkey on master
(cost=0.00..6134000.35 rows=2231481 width=93) (actual time=0.178..0.240
rows=28 loops=1)
Index Cond: ((partnum)::text >= '1'::text)
Total runtime: 0.695 ms

Of course I could just make sure I create indexes with match my order by
fields perfectly; which is exactly what I am doing right now. But I
thought that maybe it might be worth while considering looking at allowing
some sort of in memory sort to be overlaid on an index if the statistics
indicate that the sorts are very nearly ordered.

Andrew

Responses

Browse pgsql-performance by date

  From Date Subject
Next Message Tomas Vondra 2011-11-14 22:57:49 Re: Slow queries / commits, mis-configuration or hardware issues?
Previous Message Cody Caughlan 2011-11-14 21:58:52 Re: Slow queries / commits, mis-configuration or hardware issues?