Re: PoC: Partial sort

From: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To: Marti Raudsepp <marti(at)juffo(dot)org>
Cc: Andreas Karlsson <andreas(at)proxel(dot)se>, David Rowley <dgrowleyml(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Andres Freund <andres(at)2ndquadrant(dot)com>, Martijn van Oosterhout <kleptog(at)svana(dot)org>
Subject: Re: PoC: Partial sort
Date: 2014-01-14 19:28:28
Message-ID: CAPpHfduKE=-5cimGBQo=LftD-ervkao=H+HmCRkRPzVCLvaiAA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, Jan 14, 2014 at 11:16 PM, Marti Raudsepp <marti(at)juffo(dot)org> wrote:

> On Tue, Jan 14, 2014 at 5:49 PM, Alexander Korotkov <aekorotkov(at)gmail(dot)com>
> wrote:
> >> I implemented a new
> >> enable_partialsort GUC to make it easier to turn on/off
>
> > I though about such option. Generally not because of testing convenience,
> > but because of overhead of planning. This way you implement it is quite
> > naive :) For instance, merge join rely on partial sort which will be
> > replaced with simple sort.
>
> Oh, this actually highlights a performance regression with the partial
> sort patch. I assumed the planner will discard the full sort because of
> higher costs, but it looks like the new code always assumes that a Partial
> sort will be cheaper than a Join Filter without considering costs. When
> doing a join USING (unique_indexed_value, something), the new plan is
> significantly worse.
>
> Unpatched:
> marti=# explain analyze select * from release a join release b using (id,
> name);
> Merge Join (cost=0.85..179810.75 rows=12 width=158) (actual
> time=0.011..1279.596 rows=1232610 loops=1)
> Merge Cond: (a.id = b.id)
> Join Filter: ((a.name)::text = (b.name)::text)
> -> Index Scan using release_id_idx on release a (cost=0.43..79120.04
> rows=1232610 width=92) (actual time=0.005..211.928 rows=1232610 loops=1)
> -> Index Scan using release_id_idx on release b (cost=0.43..79120.04
> rows=1232610 width=92) (actual time=0.004..371.592 rows=1232610 loops=1)
> Total runtime: 1309.049 ms
>
> Patched:
> Merge Join (cost=0.98..179810.87 rows=12 width=158) (actual
> time=0.037..5034.158 rows=1232610 loops=1)
> Merge Cond: ((a.id = b.id) AND ((a.name)::text = (b.name)::text))
> -> Partial sort (cost=0.49..82201.56 rows=1232610 width=92) (actual
> time=0.013..955.938 rows=1232610 loops=1)
> Sort Key: a.id, a.name
> Presorted Key: a.id
> Sort Method: quicksort Memory: 25kB
> -> Index Scan using release_id_idx on release a
> (cost=0.43..79120.04 rows=1232610 width=92) (actual time=0.007..449.332
> rows=1232610 loops=1)
> -> Materialize (cost=0.49..85283.09 rows=1232610 width=92) (actual
> time=0.019..1352.377 rows=1232610 loops=1)
> -> Partial sort (cost=0.49..82201.56 rows=1232610 width=92)
> (actual time=0.018..1223.251 rows=1232610 loops=1)
> Sort Key: b.id, b.name
> Presorted Key: b.id
> Sort Method: quicksort Memory: 25kB
> -> Index Scan using release_id_idx on release b
> (cost=0.43..79120.04 rows=1232610 width=92) (actual time=0.004..597.258
> rows=1232610 loops=1)
> Total runtime: 5166.906 ms
> ----
>

Interesting. Could you share the dataset?

There's another "wishlist" kind of thing with top-N heapsort bounds; if I
> do a query with LIMIT 1000 then every sort batch has Tuplesortstate.bound
> set to 1000, but it could be reduced after each batch. If the first batch
> is 900 rows then the 2nd batch only needs the top 100 rows at most.
>

Right. Just didn't implement it yet.

> Also, I find the name "partial sort" a bit confusing; this feature is not
> actually sorting *partially*, it's finishing the sort of partially-sorted
> data. Perhaps "batched sort" would explain the feature better? Because it
> does the sort in multiple batches instead of all at once. But maybe that's
> just me.
>

I'm not sure. For me "batched sort" sounds like we're going to sort in
batch something that we sorted separately before. Probably I'm wrong
because I'm far from native english :)

------
With best regards,
Alexander Korotkov.

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Robert Haas 2014-01-14 19:32:09 Re: shared memory message queues
Previous Message Stephen Frost 2014-01-14 19:27:44 Re: [Lsf-pc] Linux kernel impact on PostgreSQL performance