Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From: James Coleman <jtc331(at)gmail(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: Michael Paquier <michael(at)paquier(dot)xyz>, Rafia Sabih <rafia(dot)pghackers(at)gmail(dot)com>, Peter Geoghegan <pg(at)bowt(dot)ie>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Shaun Thomas <shaun(dot)thomas(at)2ndquadrant(dot)com>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Andreas Karlsson <andreas(at)proxel(dot)se>
Subject: Re: [PATCH] Incremental sort (was: PoC: Partial sort)
Date: 2020-03-24 03:13:59
Message-ID: CAAaqYe-r_xoTUBUE4iz03kyJdrFQzQXCLqvw5L7s74RN63VVQw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Sun, Mar 22, 2020 at 10:17 PM James Coleman <jtc331(at)gmail(dot)com> wrote:
>
> On Fri, Mar 20, 2020 at 8:56 PM Tomas Vondra
> <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
> >
> > Hi,
> >
> > I've looked at v38 but it seems it's a bit broken by some recent explain
> > changes (mostly missing type in declarations). Attached is v39 fixing
> > those issues, and including a bunch of fixes based on a review - most of
> > the changes is in comments, so I've instead kept them in separate "fix"
> > patches after each part.
> >
> > In general I'm mostly happy with the current shape of the patch, and
> > unless there are some objections I'd like to get some of it committed
> > sometime next week.
> >
> > I've done a fair amount of testing with various queries, and the plan
> > changes seem pretty sensible. I'm still not entirely sure whether to be
> > a bit conservative and only tweak the first patch adding incremental
> > sort to extra places, or commit both.
> >
> > The main thing I still have on my plate is assessment of how much more
> > expensive can the planning due to increased number of paths we
> > generate/keep (due to considering extra pathkeys). I haven't seen any
> > significant slowdowns, but I plan to look at some extreme cases (many
> > similar and applicable indexes etc.).
>
> I'm currently incorporating all of the fixes you proposed into the
> main patch series, as well as doing a thorough read-through of the
> current state of the patch. I'm hoping to reply tomorrow with:
>
> - Fix patches of my own to clean up and add additional comments.
> - Catalog all of the current open questions (XXX, etc.) in the patch
> to more easily discuss them in the mailing list.

Here's the above.

Current TODOs:

1. src/backend/optimizer/util/pathnode.c add_partial_path()
* XXX Perhaps we could do this only when incremental sort is enabled,
* and use the simpler version (comparing just total cost) otherwise?

I don't have a strong opinion here. It doesn't seem like a significant
difference in terms of cost?

2. Not marked in the patch, but in nodeIncrementalSort.c
ExecIncrementalSort() I wonder if perhaps we should move the algorithm
discussion comments up to the file header comment. On the other hand,
I suppose it could be valuable to leave the the file header comment
more high level about the mathematical properties of incremental sort
rather than discussing the details of the hybrid mode.

3. nodeIncrementalSort.c ExecIncrementalSort() in the main for loop:
* TODO: do we need to check for interrupts inside these loops or
* will the outer node handle that?

4. nodeIncrementalSort.c ExecReScanIncrementalSort: This whole chunk
is suspect. I've mentioned previously I don't have a great mental
model of how rescan works and its invariants (IIRC someone said it was
about moving around a result set in a cursor). Regardless I'm pretty
sure this code just doesn't work correctly. Additionally the sort_Done
variable is poorly named; it probably would make more sense to call it
something like "scanHasBegun". I'm waiting to change it though until
cleaning up this code more holistically.

5. planner.c create_ordered_paths:
* XXX This only looks at sort_pathkeys. I wonder if it needs to look at the
* other pathkeys (grouping, ...) like generate_useful_gather_paths.

6. regress/expected/incremental_sort.out:
-- TODO if an analyze happens here the plans might change; should we
-- solve by inserting extra rows or by adding a GUC that would somehow
-- forcing the time of plan we expect.

Maybe this isn't an actual issue (I have vague memory of auto-analyze
being disabled during these regression tests)?

7. Not listed as a comment in the patch, but I need to modify the
testing for analyze output to parse out the memory/disk stats to the
tests are stable.

8. optimizer/path/allpaths.c get_useful_pathkeys_for_relation:
* XXX At the moment this can only ever return a list with a single element,
* because it looks at query_pathkeys only. So we might return the pathkeys
* directly, but it seems plausible we'll want to consider other orderings
* in the future.

This might be something we just leave in as a comment?

9. In the same function as the above:
* Considering query_pathkeys is always worth it, because it might let us
* avoid a local sort.

That seems like a copy from the fdw code; I didn't remove it in the
attached patchset, because I think I have a diff on a branch somewhere
that standardizes some of this shared code between here and the
postgres_fdw code.

10. optimizer/path/allpaths.c generate_useful_gather_paths:
* XXX I wonder if we need to consider adding a projection here, as
* create_ordered_paths does.

11. In the same function as the above:
* XXX Can't we skip this (maybe only for the cheapest partial path)
* when the path is already sorted? Then it's likely duplicate with
* the path created by generate_gather_paths.

12. In the same function as the above:
* XXX This is not redundant with the gather merge path created in
* generate_gather_paths, because that merely preserves ordering of
* the cheapest partial path, while here we add an explicit sort to
* get match the useful ordering.

13. planner.c create_ordered_paths:
* XXX This is probably duplicate with the paths we already generate
* in generate_useful_gather_paths in apply_scanjoin_target_to_paths.

Attached is v40, including quite a bit of comment cleanup primarily.

James

Attachment Content-Type Size
v40-0001-Consider-low-startup-cost-when-adding-partial-pa.patch text/x-patch 4.6 KB
v40-0004-fix-lots-of-comments.patch text/x-patch 31.3 KB
v40-0002-comment-rewording-typo.patch text/x-patch 1.7 KB
v40-0005-Consider-incremental-sort-paths-in-additional-pl.patch text/x-patch 13.7 KB
v40-0003-Implement-incremental-sort.patch text/x-patch 154.2 KB
v40-0006-A-couple-more-places-for-incremental-sort.patch text/x-patch 8.9 KB
v40-0007-fix.patch text/x-patch 851 bytes

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Michael Paquier 2020-03-24 03:22:16 Re: [Patch] pg_rewind: options to use restore_command from recovery.conf or command line
Previous Message Tomas Vondra 2020-03-24 03:13:09 Re: Parallel grouping sets