From: | Decibel! <decibel(at)decibel(dot)org> |
---|---|
To: | Alexey Nalbat <nalbat(at)price(dot)ru> |
Cc: | pgsql-general(at)postgresql(dot)org |
Subject: | Re: unnesesary sorting after Merge Full Join |
Date: | 2008-02-23 20:49:19 |
Message-ID: | D317B25B-7995-43A2-8863-55A8505F827A@decibel.org |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-general |
On Feb 21, 2008, at 4:08 AM, Alexey Nalbat wrote:
> I'd like to use ORDER BY in any specified order and LIMIT, OFFSET
> for paging query results.
> The query is FULL OUTER JOIN of two tables by field id. I think the
> results of Merge Full Join
> to be ordered by some "combined id". And there is no need in extra
> Sort if I specify ORDER BY
> that "combined id". But unfortunately it is not so.
>
> Here is a simple example:
> -- BEGIN
> create table t1 as select generate_series(1,1000000,2) as id;
> create table t2 as select generate_series(1,1000000,3) as id;
>
> create index i1 on t1 ( id );
> create index i2 on t2 ( id );
>
> analyze t1;
> analyze t2;
>
> explain analyze
> select id, t1.*, t2.*
> from t1 natural full join t2
> order by 1 limit 10 offset 10;
>
> drop table t1;
> drop table t2;
> -- END
>
> Postgresql chooses such plan:
> Limit (cost=44080.12..44080.15 rows=10 width=8) (actual
> time=6724.850..6724.906 rows=10 loops=1)
> -> Sort (cost=44080.10..45330.10 rows=500000 width=8) (actual
> time=6724.806..6724.845 rows=20 loops=1)
> Sort Key: (COALESCE(t1.id, t2.id))
> Sort Method: top-N heapsort Memory: 25kB
> -> Merge Full Join (cost=0.00..30775.28 rows=500000
> width=8) (actual time=0.142..5237.289 rows=666667 loops=1)
> Merge Cond: (t1.id = t2.id)
> -> Index Scan using i1 on t1 (cost=0.00..15212.30
> rows=500000 width=4) (actual time=0.079..1188.601 rows=500000 loops=1)
> -> Index Scan using i2 on t2 (cost=0.00..10146.30
> rows=333334 width=4) (actual time=0.051..793.635 rows=333334 loops=1)
>
> The desired plan is much faster:
> Limit (cost=0.62..1.23 rows=10 width=8) (actual time=0.262..0.366
> rows=10 loops=1)
> -> Merge Full Join (cost=0.00..30775.28 rows=500000 width=8)
> (actual time=0.156..0.303 rows=20 loops=1)
> Merge Cond: (t1.id = t2.id)
> -> Index Scan using i1 on t1 (cost=0.00..15212.30
> rows=500000 width=4) (actual time=0.088..0.120 rows=15 loops=1)
> -> Index Scan using i2 on t2 (cost=0.00..10146.30
> rows=333334 width=4) (actual time=0.056..0.078 rows=11 loops=1)
>
> I found comment in src/backend/optimizer/path/pathkeys.c:
> * EXCEPTION: in a FULL or RIGHT join, we cannot treat the result as
> * having the outer path's path keys, because null lefthand rows may be
> * inserted at random points. It must be treated as unsorted.
>
> How can I get rid of this sorting? Or could this behavior of Merge
> Full Join be improved?
Theoretically, this can be improved, but I suspect it would be non-
trivial. I suspect that the problem is the planner doesn't realize
that the join key could never be null, which is often not the case.
Consider this example:
decibel=# create table t1 as select generate_series(1,1000000,2) as id1;
SELECT
decibel=# create table t2 as select generate_series(1,1000000,3) as id2;
SELECT
Create index, etc.
explain analyze
select id1, id2
from t1 full join t2 on (t1.id1=t2.id2)
order by 1 limit 10 offset 10;
Note that in this case you have to re-sort, because NULLs sort
differently.
As a workaround, I suggest creating a table that contains all the IDs
from t1 and t2. You could maintain this table via a trigger if you
wanted. You could then quickly determine the exact IDs you wanted,
and then join against the two real tables.
--
Decibel!, aka Jim C. Nasby, Database Architect decibel(at)decibel(dot)org
Give your computer some brain candy! www.distributed.net Team #1828
From | Date | Subject | |
---|---|---|---|
Next Message | dvanatta | 2008-02-24 00:33:01 | Return Query with simple function |
Previous Message | Decibel! | 2008-02-23 20:07:29 | Re: Approaches for Lookup values (codes) in OLTP application |