Re: Ordered Partitioned Table Scans

From: David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>
To: Antonin Houska <ah(at)cybertec(dot)at>
Cc: Julien Rouhaud <rjuju123(at)gmail(dot)com>, Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Ordered Partitioned Table Scans
Date: 2018-11-01 03:14:11
Message-ID: CAKJS1f-2=8wCzMSXtp+K8qE4hdAOaqBTaSTTi9N3xhrve9d3BQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox
Thread:
Lists: pgsql-hackers

On 1 November 2018 at 04:01, Antonin Houska <ah(at)cybertec(dot)at> wrote:
> * As for the logic, I found generate_mergeappend_paths() to be the most
> interesting part:
>
> Imagine table partitioned by "i", so "partition_pathkeys" is {i}.
>
> partition 1:
>
> i | j
> --+--
> 0 | 0
> 1 | 1
> 0 | 1
> 1 | 0
>
> partition 2:
>
> i | j
> --+--
> 3 | 0
> 2 | 0
> 2 | 1
> 3 | 1
>
> Even if "pathkeys" is {i, j}, i.e. not contained in "partition_pathkeys", the
> ordering of the subpaths should not change the way tuples are split into
> partitions.
>
> Obviously a problem is if "partition_pathkeys" and "pathkeys" lists start with
> different items. To propose more generic rule, I used this example of
> range-partitioned table, where "i" and "j" are the partitioning keys:
>
> partition 1:
>
> i | j | k
> ---+---+---
> 0 | 0 | 1
> 0 | 0 | 0
>
> partition 2:
>
> i | j | k
> ---+---+---
> 0 | 1 | 0
> 0 | 1 | 1
>
> If the output "pathkey" is {i, k}, then the Append path makes rows of both
> partitions interleave:
>
> i | j | k
> ---+---+---
> 0 | 0 | 0
> 0 | 1 | 0
> 0 | 0 | 1
> 0 | 1 | 1
>
> So in general I think the restriction is that no valid position of "pathkeys"
> and "partition_pathkeys" may differ. Or in other words: the shorter of the 2
> pathkey lists must be contained in the longer one. Does it make sense to you?

I understand what you're saying. I just don't understand what you
think is wrong with the patch in this area.

> Another problem I see is that build_partition_pathkeys() continues even if it
> fails to create a pathkey for some partitioning column. In the example above
> it would mean that the table can have "partition_pathkeys" equal to {j}
> instead of {i, j} on some EC-related conditions. However such a key does not
> correspond to reality - this is easier to imagine if another partition is
> considered.
>
> partition 3:
>
> i | j | k
> ---+---+---
> 1 | 0 | 1
> 1 | 0 | 0
>
> So I think no "partition_pathkeys" should be generated in that case. On the
> other hand, if the function returned the part of the list it could construct
> so far, it'd be wrong because such incomplete pathkeys could pass the checks I
> proposed above for reasons unrelated to the partitioning scheme.

Oops. That's a mistake. We should return what we have so far if we
can't make one of the pathkeys. Will fix.

> The following comments are mostly on coding:
>
> * Both qsort_partition_list_value_cmp() and qsort_partition_rbound_cmp() have
> this sentence in the header comment:
>
> Note: If changing this, see build_partition_pathkeys()
>
> However I could not find other use besides that in
> RelationBuildPartitionDesc().

While the new code does not call those directly, the new code does
depend on the sort order of the partitions inside the PartitionDesc,
which those functions are responsible for. Perhaps there's a better
way to communicate that.

Actually, I think the partitioning checking code I added to pathkeys.c
does not belong there. Likely those checks should live with the other
partitioning code in the form of a bool returning function. I'll
change that now. It means we don't have to work that out twice as I'm
currently running it once for forward and once for the backwards scan
case. Currently the code is very simple but if we start analysing
list partition bounds then it will become slower.

> * create_append_path():
>
> /*
> * Apply query-wide LIMIT if known and path is for sole base relation.
> * (Handling this at this low level is a bit klugy.)
> */
> if (root != NULL && pathkeys != NULL &&
> bms_equal(rel->relids, root->all_baserels))
> pathnode->limit_tuples = root->limit_tuples;
> else
> pathnode->limit_tuples = -1.0;
>
> I think this optimization is not specific to AppendPath / MergeAppendPath,
> so it could be moved elsewhere (as a separate patch of course). But
> specifically for AppendPath, why do we have to test pathkeys? The pathkeys
> of the AppendPath do not necessarily describe the order of the set to which
> LIMIT is applied, so their existence should not be important here.

The pathkeys != NULL could be removed. I was just trying to maintain
the status quo for Appends without pathkeys. In reality it currently
does not matter since that's only used as a parameter for cost_sort().
There'd be no reason previously to have a Sort path as a subpath in an
Append node since the order would be destroyed after the Append.
Perhaps we should just pass it through as one day it might be useful.
I just can't currently imagine why.

> * If pathkeys is passed, shouldn't create_append_path() include the
> cost_sort() of subpaths just like create_merge_append_path() does? And if
> so, then create_append_path() and create_merge_append_path() might
> eventually have some common code (at least for the subpath processing) to be
> put into a separate function.

It does. It's just done via the call to cost_append().

> * Likewise, create_append_plan() / create_merge_append_plan() are going to be
> more similar then before, so some refactoring could also make sense.
>
> Although it's not too much code, I admit the patch is not trivial, so I'm
> curious about your opinion.

I think the costing code is sufficiently different to warant not
sharing more. For example, the startup costing is completely
different. Append can start on the startup cost of the first subpath,
but MergeAppend takes the sum of the startup cost of all subpaths.

I've attached v4 of the patch. I think this addresses all that you
mentioned apart from the first one, due to not understanding the
problem.

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachment Content-Type Size
v4-0001-Allow-Append-to-be-used-in-place-of-MergeAppend-f.patch application/octet-stream 52.2 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2018-11-01 03:17:35 Re: PG vs macOS Mojave
Previous Message Steve Singer 2018-11-01 03:07:44 Re: heap_sync seems rather oblivious to partitioned tables (wal_level=minimal)