Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From: James Coleman <jtc331(at)gmail(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Rafia Sabih <rafia(dot)pghackers(at)gmail(dot)com>, Peter Geoghegan <pg(at)bowt(dot)ie>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Shaun Thomas <shaun(dot)thomas(at)2ndquadrant(dot)com>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Andreas Karlsson <andreas(at)proxel(dot)se>
Subject: Re: [PATCH] Incremental sort (was: PoC: Partial sort)
Date: 2020-03-31 22:00:00
Message-ID: CAAaqYe-KHM=AL6dCWKQgjHYk2JfTDXOZMLWa3seOJnJrhRNukQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, Mar 31, 2020 at 5:53 PM Tomas Vondra
<tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>
> On Tue, Mar 31, 2020 at 02:23:15PM -0400, James Coleman wrote:
> >On Mon, Mar 30, 2020 at 9:14 PM Tomas Vondra
> ><tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
> >>
> >> The main thing I've been working on today is benchmarking how this
> >> affects planning. And I'm seeing a regression that worries me a bit,
> >> unfortunately.
> >>
> >> The test I'm doing is pretty simple - build a small table with a bunch
> >> of columns:
> >>
> >> create table t (a int, b int, c int, d int, e int, f int, g int);
> >>
> >> insert into t select 100*random(), 100*random(), 100*random(),
> >> 100*random(), 100*random(), 100*random(), 100*random()
> >> from generate_series(1,100000) s(i);
> >>
> >> and then a number of indexes on subsets of up to 3 columns, as generated
> >> using the attached build-indexes.py script. And then run a bunch of
> >> explains (so no actual execution) sorting the data by at least 4 columns
> >> (to trigger incremental sort paths), measuring timing of the script.
> >>
> >> I did a bunch of runs on current master and v46 with incremental sort
> >> disabled and enabled, and the results look like this:
> >>
> >> master off on
> >> --------------------------
> >> 34.609 37.463 37.729
> >>
> >> which means about 8-9% regression with incremental sort. Of course, this
> >> is only for planning time, for execution the impact is going to be much
> >> smaller. But it's still a bit annoying.
> >>
> >> I've suspected this might be either due to the add_partial_path changes
> >> or the patch adding incremental sort to additional places, so I tested
> >> those parts individually and the answer is no - add_partial_path changes
> >> have very small impact (~1%, which might be noise). The regression comes
> >> mostly from the 0002 part that adds incremental sort. At least in this
> >> particular test - different tests might behave differently, of course.
> >>
> >> The annoying bit is that the overhead does not disappear after disabling
> >> incremental sort. That suggests this is not merely due to considering
> >> and creating higher number of paths, but due to something that happens
> >> before we even look at the GUC ...
> >>
> >> I think I've found one such place - if you look at compare_pathkeys, it
> >> has this check right before the forboth() loop:
> >>
> >> if (keys1 == keys2)
> >> return PATHKEYS_EQUAL;
> >>
> >> But with incremental sort we don't call pathkeys_contained_in, we call
> >> pathkeys_common_contained_in instead. And that does not call
> >> compare_pathkeys and does not have the simple equality check. Adding
> >> the following check seems to cut the overhead in half, which is nice:
> >>
> >> if (keys1 == keys2)
> >> {
> >> *n_common = list_length(keys1);
> >> return true;
> >> }
> >>
> >> Not sure where the rest of the regression comes from yet.
> >
> >I noticed in the other function we also optimize by checking if either
> >keys list is NIL, so I tried adding that, and it might have made a
> >minor difference, but it's hard to tell as it was under 1%.
> >
>
> Which other function? I don't see this optimization in compare_pathkeys,
> that only checks key1/key2 after the forboth loop, but that's something
> different.

pathkeys_useful_for_ordering checks both inputs.

> I may be wrong, but my guess would be that this won't save much, because
> the loop should terminate immediately. The whole point is not to loop
> over possibly many pathkeys when we know it's exactly the same pathkeys
> list (same pointer). When one of the lists is NIL the loop should
> terminate right away ...
>
> >I also ran perf with a slightly modified version of your test that
> >uses psql, and after the above changes was seeing something like a
> >3.5% delta between master and this patch series. Nothing obvious in
> >the perf report though.
> >
>
> Yeah, I don't think there's anything obviously more expensive.
>
> >This test is intended to be somewhat worst case, no? At what point do
> >we consider the trade-off worth it (given that it's not plausible to
> >have zero impact)?
> >
>
> Yes, more or less. It was definitely designed to do that - it merely
> plans the query (no execution), with many applicable indexes etc. It's
> definitely true that once the exection starts to take more time, the
> overhead will get negligible. Same for reducing number of indexes.
>
> And of course, for queries that can benefit from incremental sort, the
> speedup is likely way larger than this.
>
> In general, I think it'd be naive that we can make planner smarter with
> no extra overhead spent on planning, and we can never accept patches
> adding even tiny overhead. With that approach we'd probably end up with
> a trivial planner that generates just a single query plan, because
> that's going to be the fastest planner. A realistic approach needs to
> consider both the planning and execution phase, and benefits of this
> patch seem to be clear - if you have queries that do benefit from it.
>
> Let's try to minimize the overhead a bit more, and then we'll see.

Any thoughts you have already on an approach for this?

James

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Alvaro Herrera 2020-03-31 22:02:57 Re: [HACKERS] Restricting maximum keep segments by repslots
Previous Message Andres Freund 2020-03-31 21:55:36 Re: Improving connection scalability: GetSnapshotData()