Re: Very specialised query

From: Matthew Wakeling <matthew(at)flymine(dot)org>
To: pgsql-performance(at)postgresql(dot)org
Subject: Re: Very specialised query
Date: 2009-03-30 15:57:16
Message-ID: alpine.DEB.2.00.0903301618190.21772@aragorn.flymine.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

>> Shouldn't Postgres favour a "between" index scan over an open-ended
>> one?

On Fri, 27 Mar 2009, Tom Lane wrote:
> Currently the planner only notices that for a range check that involves
> comparisons of the same variable expression to two constants (or
> pseudoconstants anyway). In principle it might be reasonable to have a
> heuristic that reduces the estimated selectivity in the example above,
> but it looks to me like it'd make clauselist_selectivity() a lot slower
> and more complicated. When you see (l1.end > l2.start), how do you know
> which variable to try to match up against others? And if you try to
> match both, what do you do when you get matches for both?

Arguably, having multiple "greater than" constraints on a field should not
affect the selectivity much, because the separate constraints will all be
throwing away the same set of rows. So having a single "greater than" will
halve the number of rows, while two "greater than"s will divide the number
of rows by three, etc. That's a vast approximation of course, and should
be skewed by the statistics.

However, combining a "greater than" with a "less than" has a different
effect on selectivity. If the numbers were completely random with
identical value spread in each field, then a single "greater than" would
halve the number of rows and an additional "less than" would halve the
number again. However, in most real life situations the values will not be
completely random, but will be very skewed, as in our case. Unless the
planner starts collecting statistics on the standard distribution of the
difference between two fields, that will be hard to account for though.

Have a look at the following EXPLAIN ANALYSE:

SELECT
l1.id AS id1,
l2.id AS id2
FROM
location l1,
location l2
WHERE
l1.objectid = 228000093
AND l2.objectid = 228000093
AND l1.id <> l2.id
AND l1.start < l2.end
AND l1.end > l2.start
AND l1.start < l2.start
UNION ALL
SELECT
l1.id AS id1,
l2.id AS id2
FROM
location l1,
location l2
WHERE
l1.objectid = 228000093
AND l2.objectid = 228000093
AND l1.id <> l2.id
AND l1.start < l2.end
AND l1.end > l2.start
AND l1.start >= l2.start;

QUERY PLAN
----------------------------------------------------------------------------------------------------------
Append (cost=0.00..20479179.74 rows=138732882 width=8)
(actual time=0.055..2362748.298 rows=258210 loops=1)
-> Nested Loop (cost=0.00..9545925.46 rows=69366441 width=8)
(actual time=0.053..627.038 rows=99436 loops=1)
Join Filter: ((l1.id <> l2.id) AND (l1.start < l2.end))
-> Index Scan using location_test_obj_end on location l1
(cost=0.00..55966.07 rows=43277 width=12)
(actual time=0.025..58.604 rows=43534 loops=1)
Index Cond: (objectid = 228000093)
-> Index Scan using location_test_obj_start on location l2
(cost=0.00..123.10 rows=4809 width=12)
(actual time=0.003..0.006 rows=2 loops=43534)
Index Cond: ((l2.objectid = 228000093) AND (l1.end > l2.start) AND (l1.start < l2.start))
-> Nested Loop (cost=0.00..9545925.46 rows=69366441 width=8)
(actual time=0.041..2361632.540 rows=158774 loops=1)
Join Filter: ((l1.id <> l2.id) AND (l1.start < l2.end))
-> Index Scan using location_test_obj_end on location l1
(cost=0.00..55966.07 rows=43277 width=12)
(actual time=0.009..80.814 rows=43534 loops=1)
Index Cond: (objectid = 228000093)
-> Index Scan using location_test_obj_start on location l2
(cost=0.00..123.10 rows=4809 width=12)
(actual time=0.012..29.823 rows=21768 loops=43534)
Index Cond: ((l2.objectid = 228000093) AND (l1.end > l2.start) AND (l1.start >= l2.start))
Total runtime: 2363015.959 ms
(14 rows)

Note again the two leaf index scans that really matter for performance. On
one of them, there is a difference, and the other is open ended.

Expected rows Seen rows
difference 4809 2
open-ended 4809 21768

The first query fragment takes 700ms to execute, and the second takes
about forty minutes. It is effectively doing a nested loop join with
hardly any benefit gained from the indexes at all, over a simple index on
objectid. I may as well make the query a lot simpler, and do:

SELECT
l1.id AS id1,
l2.id AS id2
FROM
location l1,
location l2
WHERE
l1.objectid = 228000093
AND l2.objectid = 228000093
AND l1.id <> l2.id
AND l1.start < l2.end
AND l1.end > l2.start

Unless this particular issue is improved in the planner, I don't think
this particular style of query is going to work for us. I know that the
database could in theory return a result in about 1400ms. I'll see how
close to that I can get with plpgsql.

Matthew

--
First law of computing: Anything can go wro
sig: Segmentation fault. core dumped.

In response to

Responses

Browse pgsql-performance by date

  From Date Subject
Next Message Matthew Wakeling 2009-03-30 15:59:05 Re: Very specialised query
Previous Message Marc Mamin 2009-03-30 15:35:47 Re: Very specialised query