Re: Ordered Partitioned Table Scans

From: David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>
To: Julien Rouhaud <rjuju123(at)gmail(dot)com>
Cc: Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Ordered Partitioned Table Scans
Date: 2018-10-29 00:44:33
Views: Raw Message | Whole Thread | Download mbox
Lists: pgsql-hackers

Thanks for looking at this.

On 28 October 2018 at 03:49, Julien Rouhaud <rjuju123(at)gmail(dot)com> wrote:
> I just had a look at your patch. I see that you implemented only a
> subset of the possible optimizations (only the case for range
> partitionoing without subpartitions). This has been previously
> discussed, but we should be able to do similar optimization for list
> partitioning if there's no interleaved values, and also for some cases
> of multi-level partitioning.

I had thought about these cases but originally had thought they would
be more complex to implement than I could justify. On review, I've
found some pretty cheap ways to handle both sub-partitions and for
LIST partitioned tables. Currently, with LIST partitioned tables I've
coded it to only allow the optimisation if there's no DEFAULT
partition and all partitions are defined with exactly 1 Datum. This
guarantees that there are no interleaved values, but it'll just fail
to optimise cases like FOR VALUES IN(1,2) + FOR VALUES In(3,4). The
reason that I didn't go to the trouble of the additional checks was
that I don't really want to add any per-partition overhead to this.
If RelOptInfo had a Bitmapset of live partitions then we could just
check the partitions that survived pruning. Amit Langote has a
pending patch which does that and some other useful stuff, so maybe we
can delay fixing that until the dust settles a bit in that area. Amit
and I are both working hard to remove all these per-partition
overheads. I imagine he'd also not be in favour of adding code that
does something for all partitions when we've pruned down to just 1.
I've personally no objection to doing the required additional
processing for the non-pruned partitions only. We could also then fix
the case where we disable the optimisation if there's a DEFAULT
partition without any regards to if it's been pruned or not.

> Concerning the implementation, there's at least one issue: it assumes
> that each subpath of a range-partitioned table will be ordered, with
> is not guaranteed. You need to to generate explicit Sort nodes nodes
> (in the same order as the query_pathkey) for partitions that don't
> have an ordered path and make sure that this path is used in the
> Append. Here's a simplistic case showing the issue (sorry, the
> partition names are poorly chosen):

Thanks for noticing this. I had been thrown off due to the fact that
Paths are never actually created for these sorts. On looking further I
see that we do checks during createplan to see if the path is
suitability sorted and just create a sort node if it's not. This seems
to go against the whole point of paths, but I'm not going to fight for
changing it, so I've just done the Append the same way as MergeAppend
handles it.

> Also, if a LIMIT is specified, it should be able to give better
> estimates, at least if there's no qual. For instance:
> >
> ------------------------------------------------------------------------------------------------------->
> Limit (cost=0.00..0.35 rows=10 width=15)
> -> Append (cost=0.00..7705.56 rows=219999 width=15)
> -> Seq Scan on simple_0_1 (cost=0.00..309.00 rows=20000 width=15)
> -> Index Scan using simple_1_2_id_idx on simple_1_2
> (cost=0.29..3148.28 rows=99999 width=14)
> -> Index Scan using simple_2_3_id_idx on simple_2_3
> (cost=0.29..3148.29 rows=100000 width=16)
> (5 rows)
> In this case, we should estimate that the SeqScan (or in a corrected
> version the Sort) node should not return more than 10 rows, and each
> following partition should be scanned at all, and cost each path
> accordingly. I think that this is quite important, for instance to
> make sure that natively sorted Append is chosen over a MergeAppend
> when there are some subpath with explicit sorts, because with the
> Append we probably won't have to execute all the sorts if the previous
> partition scans returned enough rows.

In my patch, I'm not adding any additional paths. I'm just adding an
Append instead of a MergeAppend. For what you're talking about the
limit only needs to be passed into any underlying Sort so that it can
become a top-N sort. This is handled already in create_limit_path().
Notice in the plan you pasted above that the limit has a lower total
cost than its Append subnode. That's because create_limit_path()
weighted the Limit total cost based on the row count of the limit and
its subpath. ... 7705.56 / 219999 * 10 = ~0.35.

> FWIW, both those cases were handled (probably with some bugs though)
> in the previous patches Ronan and I sent some time ago. Also, I did
> not forget about this feature, I planned to work on it in hope to have
> it included in pg12. However, I won't have a lot of time to work on
> it before December.

I apologise for not noticing your patch. I only went as far as
checking the November commitfest to see if anything existed already
and I found nothing there. I have time to work on this now, so likely
it's better if I continue, just in case your time in December does not

v2 of the patch is attached. I've not had time yet to give it a
throughout post write review, but on first look it seems okay.

The known limitations are:

* Disables the optimisation even if the DEFAULT partition is pruned.
* Disables the optimisation if LIST partitioned tables have any
partitions allowing > 1 value.
* Fails to optimise UNION ALLs with partitioned tables.

David Rowley
PostgreSQL Development, 24x7 Support, Training & Services

Attachment Content-Type Size
v2-0001-Allow-Append-to-be-used-in-place-of-MergeAppend-f.patch application/octet-stream 48.3 KB

In response to


Browse pgsql-hackers by date

  From Date Subject
Next Message Masahiko Sawada 2018-10-29 01:16:03 Re: [HACKERS] Transactions involving multiple postgres foreign servers, take 2
Previous Message Michael Paquier 2018-10-29 00:05:47 Re: Multiple Wait Events for extensions