Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: James Coleman <jtc331(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Rafia Sabih <rafia(dot)pghackers(at)gmail(dot)com>, Peter Geoghegan <pg(at)bowt(dot)ie>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Shaun Thomas <shaun(dot)thomas(at)2ndquadrant(dot)com>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>, PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Andreas Karlsson <andreas(at)proxel(dot)se>
Subject: Re: [PATCH] Incremental sort (was: PoC: Partial sort)
Date: 2020-04-01 01:59:25
Message-ID: 20200401015925.bmbbqc3ple5rf3yn@development
Lists: pgsql-hackers

On Tue, Mar 31, 2020 at 08:42:47PM -0400, James Coleman wrote:
>On Tue, Mar 31, 2020 at 8:38 PM Tomas Vondra
><tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>>
>> On Tue, Mar 31, 2020 at 08:11:15PM -0400, James Coleman wrote:
>> >On Tue, Mar 31, 2020 at 7:56 PM Tomas Vondra
>> ><tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>> >>
>> >> On Tue, Mar 31, 2020 at 07:09:04PM -0400, James Coleman wrote:
>> >> >On Tue, Mar 31, 2020 at 6:54 PM Tomas Vondra
>> >> ><tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>> >> >>
>> >> >> On Tue, Mar 31, 2020 at 06:35:32PM -0400, Tom Lane wrote:
>> >> >> >Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> writes:
>> >> >> >> In general, I think it'd be naive to expect that we can make the planner
>> >> >> >> smarter with no extra overhead spent on planning, and that we can never
>> >> >> >> accept patches adding even tiny overhead. With that approach we'd probably
>> >> >> >> end up with a trivial planner that generates just a single query plan,
>> >> >> >> because that's going to be the fastest planner. A realistic approach needs
>> >> >> >> to consider both the planning and execution phases, and the benefits of
>> >> >> >> this patch seem clear - if you have queries that do benefit from it.
>> >> >> >
>> >> >> >I think that's kind of attacking a straw man, though. The thing that
>> >> >> >people push back on, or should push back on IMO, is when a proposed
>> >> >> >patch adds significant slowdown to queries that it has no or very little
>> >> >> >hope of improving. The trick is to do expensive stuff only when
>> >> >> >there's a good chance of getting a better plan out of it.
>> >> >> >
>> >> >>
>> >> >> Yeah, I agree with that. I think the main issue is that we don't really
>> >> >> know what the "expensive stuff" is in this case, so it's not really
>> >> >> clear how to be smarter :-(
>> >> >
>> >> >To add to this: I agree that ideally you'd check cheaply to know
>> >> >you're in a situation that might help, and then do more work. But here
>> >> >the question is always going to be simply "would we benefit from an
>> >> >ordering, and, if so, do we have it already partially sorted". It's
>> >> >hard to imagine that reducing much conceptually, so we're left with
>> >> >optimizations of that check.
>> >> >
>> >>
>> >> I think it depends on what exactly the expensive part is. For example, if
>> >> it's the construction of IncrementalSort paths, then maybe we could do a
>> >> quick/cheap check of whether the path can even be useful by estimating the
>> >> cost, and only then build the path.
>> >>
>> >> That's what we do for join algorithms, for example - we first compute
>> >> initial_cost_nestloop and only when that seems cheap enough we do the
>> >> more expensive stuff.
>> >>
>> >> But I'm not sure the path construction is the expensive part, as it
>> >> should be disabled by enable_incrementalsort=off. But the regression
>> >> does not seem to disappear, at least not entirely.
>> >>
>> >>
>> >> >> One possibility is that it's just one of those regressions due to a change
>> >> >> in binary layout, but I'm not sure how to verify that.
>> >> >
>> >> >If we are testing with a case that can't actually add more paths (due
>> >> >to it checking the guc before building them), doesn't that effectively
>> >> >leave one of these two options:
>> >> >
>> >> >1. Binary layout/cache/other untraceable change, or
>> >> >2. Changes due to refactored function calls.
>> >> >
>> >>
>> >> Hmm, but in case of (1) the overhead should be there even with tests
>> >> that don't really have any additional paths to consider, right? I've
>> >> tried with such test (single table with no indexes) and I don't quite
>> >> see any regression (maybe ~1%).
>> >
>> >Not necessarily, if the cost is in sort costing or useful pathkeys
>> >checking, right? We run that code even without incremental sort, but
>> >it's changed from master.
>> >
>>
>> Ah, I should have mentioned I've done most of the tests on just the
>> basic incremental sort patch (0001+0002), without the additional useful
>> paths. I initially tested the whole patch series, but after discovering
>> the regression I removed the last part (which I suspected might be the
>> root cause). But the regression is still there, so it's not that.
>>
>> It might be in the reworked costing, yeah. But then I'd expect those
>> functions to show up in the perf profile.
>
>Right. I'm just grasping at straws on that.
>
>> >> (2) might have an impact, but I don't see any immediate suspects. Did you
>> >> have some functions in mind?
>> >
>> >I guess this is where the lines blur: I didn't see anything obvious
>> >either, but the changes to sort costing...should probably not have
>> >real impact...but...
>> >
>>
>> :-(
>>
>> >> BTW I see the patch adds pathkeys_common but it's not called from
>> >> anywhere. It's probably leftover from an earlier patch version.
>> >>
>
>BTW, I think I'm going to rename the pathkeys_common_contained_in
>function to something like pathkeys_count_contained_in, unless you
>have an objection to that. The name doesn't seem obvious at all to me.
>

WFM
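
Just so we're on the same page, here's roughly the shape I'd expect for the
renamed function - only a sketch modeled on the existing pathkeys_contained_in(),
assuming the count is reported via an out parameter; it's not the actual patch
code:

    /*
     * Sketch: like pathkeys_contained_in(), but also report how many
     * leading pathkeys the two lists have in common.
     */
    bool
    pathkeys_count_contained_in(List *keys1, List *keys2, int *n_common)
    {
        ListCell   *key1;
        ListCell   *key2;
        int         n = 0;

        forboth(key1, keys1, key2, keys2)
        {
            PathKey    *pathkey1 = (PathKey *) lfirst(key1);
            PathKey    *pathkey2 = (PathKey *) lfirst(key2);

            /* canonical pathkeys can be compared by pointer */
            if (pathkey1 != pathkey2)
                break;
            n++;
        }

        *n_common = n;

        /* keys1 is contained in keys2 iff every keys1 entry matched */
        return (n == list_length(keys1));
    }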

>> >> >There's not anything obvious in point (2) that would be a big cost,
>> >> >but there are definitely changes there. I was surprised that just
>> >> >eliminating the loop through the pathkeys on the query and the index
>> >> >was enough to save us ~4%.
>> >> >
>> >> >Tomas: Earlier you'd wondered whether we should try to shortcut the
>> >> >changes in costing...I was skeptical of that originally, but maybe
>> >> >it's worth looking into? I'm going to try backing that out and see
>> >> >what the numbers look like.
>> >> >
>> >
>> >BTW, I did this test, and it looks like we can get back something
>> >close to 1% by reverting that initial fix on partial path costing. But
>> >we can't get rid of it all the time, at the very least. *Maybe* we
>> >could condition it on incremental sort, but I'm not convinced that's
>> >the only place it's needed as a fix.
>> >
>>
>> Sounds interesting. I actually checked how much the add_partial_path
>> change accounts for, and you're right, it was quite a bit. But I forgot
>> about that when investigating the rest.
>>
>> I wonder how large would the regression be without add_partial_path and
>> with the fix in pathkeys_common_contained_in.
>>
>> I'm not sure how much we want to make add_partial_path() dependent on
>> particular GUCs, but I guess if it gets rid of the regression, allows us
>> to commit incremental sort and we can reasonably justify that only
>> incremental sort needs those paths, it might be acceptable.
>
>That's a good point.
>
>> >> I've described the idea of doing something like initial_cost_nestloop
>> >> and so on. But I'm a bit skeptical about it, considering that the GUC
>> >> only has a limited effect.
>> >>
>> >>
>> >> A somewhat related note is that the number of indexes has a pretty
>> >> significant impact on planning time, even on master. This is the timing
>> >> of the same explain script (similar to the one shown before) with
>> >> different numbers of indexes on master:
>> >>
>> >>    0 indexes    7 indexes    49 indexes
>> >>    ---------------------------------------
>> >>        10.85        12.56         27.83
>> >>
>> >> The way I look at incremental sort is that it allows using indexes for
>> >> queries that couldn't use it before, possibly requiring a separate
>> >> index. So incremental sort might easily reduce the number of indexes
>> >> needed, compensating for the overhead we're discussing here. Of course,
>> >> that may or may not be true.
>> >
>> >One small idea, but I'm not yet sure it helps us a whole lot: if the
>> >query pathkeys is only length 1, then we could skip the additional
>> >path creation.
>> >
>>
>> I don't follow. Why would we create incremental sort in this case at
>> all? With single-element query_pathkeys the path is either unsorted or
>> fully sorted - there's no room for incremental sort. No?
>
>Well, we shouldn't, that's what I'm getting at. But I didn't see anything
>in the code now that explicitly excludes that case when deciding
>whether or not to create an incremental sort path, unless I'm missing
>something obvious.

Well, my point is that create_ordered_paths() looks like this:

is_sorted = pathkeys_common_contained_in(root->sort_pathkeys, ...);

if (is_sorted)
{
    ... old code
}
else
{
    if (input_path == cheapest_input_path)
    {
        ... old code
    }

    /* With incremental sort disabled, don't build those paths. */
    if (!enable_incrementalsort)
        continue;

    /* Likewise, if the path can't be used for incremental sort. */
    if (!presorted_keys)
        continue;

    ... incremental sort path
}

Now, with single-item sort_pathkeys, the input path can't be partially
sorted. It's either fully sorted, in which case it's handled by the first
branch, or it's not sorted at all, in which case presorted_keys == 0 and we
never get to the incremental sort path.
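
If we wanted to make that shortcut explicit anyway, it would amount to a
one-line guard before building the incremental sort path - hypothetical, and
I think redundant given the presorted_keys check above:

    /* hypothetical guard; with a single sort pathkey the input can't be partially sorted */
    if (list_length(root->sort_pathkeys) < 2)
        continue;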

Or did you mean to use the optimization somewhere else?

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
