Re: PoC: Partial sort

From: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To: Marti Raudsepp <marti(at)juffo(dot)org>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Andreas Karlsson <andreas(at)proxel(dot)se>, David Rowley <dgrowleyml(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Andres Freund <andres(at)2ndquadrant(dot)com>, Martijn van Oosterhout <kleptog(at)svana(dot)org>
Subject: Re: PoC: Partial sort
Date: 2014-02-09 17:37:29
Message-ID: CAPpHfds7j4_=aAQixSDbWyim1588kboB8LOB8wj+Ei0RbcvozQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Thu, Feb 6, 2014 at 12:39 PM, Marti Raudsepp <marti(at)juffo(dot)org> wrote:

> On Thu, Feb 6, 2014 at 5:31 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> > Hmm, sounds a little steep. Why is it so expensive? I'm probably
> > missing something here, because I would have thought that planner
> > support for partial sorts would consist mostly of considering the same
> > sorts we consider today, but with the costs reduced by the batching.
>
> I guess it's because the patch undoes some optimizations in the
> mergejoin planner wrt caching merge clauses and adds a whole lot of
> code to find_mergeclauses_for_pathkeys. In other code paths the
> overhead does seem to be negligible.
>
> Notice the removal of:
> /* Select the right mergeclauses, if we didn't already */
> /*
> * Avoid rebuilding clause list if we already made one;
> * saves memory in big join trees...
> */
>

This is not only place that worry me about planning overhead.
See get_cheapest_fractional_path_for_pathkeys. I had to estimate number of
groups for each sorting column in order to get right fractional path. For
partial sort path, cost of first batch should be included into initial
cost.
If don't do so, optimizer can pick up strange plans basing on assumption
that it need only few rows from inner node. See an example.

create table test1 as (
select id,
(random()*100)::int as v1,
(random()*10000)::int as v2
from generate_series(1,1000000) id);

create table test2 as (
select id,
(random()*100)::int as v1,
(random()*10000)::int as v2
from generate_series(1,1000000) id);

create index test1_v1_idx on test1 (v1);

Plan without fraction estimation in
get_cheapest_fractional_path_for_pathkeys:

postgres=# explain select * from test1 t1 join test2 t2 on t1.v1 = t2.v1
order by t1.v1, t1.id limit 10;
QUERY PLAN

----------------------------------------------------------------------------------------------------------
Limit (cost=198956893.20..198956913.33 rows=10 width=24)
-> Partial sort (cost=198956893.20..19909637942.82 rows=9791031169
width=24)
Sort Key: t1.v1, t1.id
Presorted Key: t1.v1
-> Nested Loop (cost=0.42..19883065506.84 rows=9791031169
width=24)
Join Filter: (t1.v1 = t2.v1)
-> Index Scan using test1_v1_idx on test1 t1
(cost=0.42..47600.84 rows=1000000 width=12)
-> Materialize (cost=0.00..25289.00 rows=1000000 width=12)
-> Seq Scan on test2 t2 (cost=0.00..15406.00
rows=1000000 width=12)
(9 rows)

Current version of patch:

postgres=# explain select * from test1 t1 join test2 t2 on t1.v1 = t2.v1
order by t1.v1, t1.id limit 10;
QUERY PLAN

----------------------------------------------------------------------------------------------------------
Limit (cost=3699913.43..3699913.60 rows=10 width=24)
-> Partial sort (cost=3699913.43..173638549.67 rows=9791031169
width=24)
Sort Key: t1.v1, t1.id
Presorted Key: t1.v1
-> Merge Join (cost=150444.79..147066113.70 rows=9791031169
width=24)
Merge Cond: (t1.v1 = t2.v1)
-> Index Scan using test1_v1_idx on test1 t1
(cost=0.42..47600.84 rows=1000000 width=12)
-> Materialize (cost=149244.84..154244.84 rows=1000000
width=12)
-> Sort (cost=149244.84..151744.84 rows=1000000
width=12)
Sort Key: t2.v1
-> Seq Scan on test2 t2 (cost=0.00..15406.00
rows=1000000 width=12)
(11 rows)

I don't compare actual execution times because I didn't wait until first
plan execution ends up :-)
But anyway costs are extraordinary and inner sequential scan of 1000000
rows is odd.

------
With best regards,
Alexander Korotkov.

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2014-02-09 17:38:18 Re: specifying repeatable read in PGOPTIONS
Previous Message Robert Haas 2014-02-09 17:31:46 Re: specifying repeatable read in PGOPTIONS