Re: Ordered Partitioned Table Scans

From: Julien Rouhaud <rjuju123(at)gmail(dot)com>
To: david(dot)rowley(at)2ndquadrant(dot)com
Cc: Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Ordered Partitioned Table Scans
Date: 2018-10-30 23:24:59
Message-ID: CAOBaU_Za0yGuOSvZktE7k=FPBD2PydcCGKMRRfExNuNJ5O=jpg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Mon, Oct 29, 2018 at 1:44 AM David Rowley
<david(dot)rowley(at)2ndquadrant(dot)com> wrote:
>
> On 28 October 2018 at 03:49, Julien Rouhaud <rjuju123(at)gmail(dot)com> wrote:
> > I just had a look at your patch. I see that you implemented only a
> > subset of the possible optimizations (only the case for range
> > partitionoing without subpartitions). This has been previously
> > discussed, but we should be able to do similar optimization for list
> > partitioning if there's no interleaved values, and also for some cases
> > of multi-level partitioning.
>
> I had thought about these cases but originally had thought they would
> be more complex to implement than I could justify. On review, I've
> found some pretty cheap ways to handle both sub-partitions and for
> LIST partitioned tables. Currently, with LIST partitioned tables I've
> coded it to only allow the optimisation if there's no DEFAULT
> partition and all partitions are defined with exactly 1 Datum. This
> guarantees that there are no interleaved values, but it'll just fail
> to optimise cases like FOR VALUES IN(1,2) + FOR VALUES In(3,4). The
> reason that I didn't go to the trouble of the additional checks was
> that I don't really want to add any per-partition overhead to this.

I see, but the overhead you mention is because you're doing that check
during the planning in build_partition_pathkeys(). As advised by
Robert quite some time ago
(https://www.postgresql.org/message-id/CA+TgmobOWgT1=zyjx-q=7s8akXNODix46qG0_-YX7K369P6ADA@mail.gmail.com),
we can store that information when the PartitionDesc is built, so
that would it wouldn't be problematic. Since checking for overlapping
values is straightforward with the BoundInfoData infrastructure, it'd
be a pity to miss this optimization in such cases, which I believe
would not be rare.

> If RelOptInfo had a Bitmapset of live partitions then we could just
> check the partitions that survived pruning. Amit Langote has a
> pending patch which does that and some other useful stuff, so maybe we
> can delay fixing that until the dust settles a bit in that area. Amit
> and I are both working hard to remove all these per-partition
> overheads. I imagine he'd also not be in favour of adding code that
> does something for all partitions when we've pruned down to just 1.
> I've personally no objection to doing the required additional
> processing for the non-pruned partitions only. We could also then fix
> the case where we disable the optimisation if there's a DEFAULT
> partition without any regards to if it's been pruned or not.

Those are quite worthwhile enhancements, and being able to avoid a
MergeAppend if the problematic partitions have been prune would be
great! I didn't followed thoroughly all the discussions about the
various optimization Amit and you are working on, but I don't think it
would be incompatible with a new flag and the possibility to have the
sorted append with multi valued list partitions?

>
> > Concerning the implementation, there's at least one issue: it assumes
> > that each subpath of a range-partitioned table will be ordered, with
> > is not guaranteed. You need to to generate explicit Sort nodes nodes
> > (in the same order as the query_pathkey) for partitions that don't
> > have an ordered path and make sure that this path is used in the
> > Append. Here's a simplistic case showing the issue (sorry, the
> > partition names are poorly chosen):
>
> Thanks for noticing this. I had been thrown off due to the fact that
> Paths are never actually created for these sorts. On looking further I
> see that we do checks during createplan to see if the path is
> suitability sorted and just create a sort node if it's not. This seems
> to go against the whole point of paths, but I'm not going to fight for
> changing it, so I've just done the Append the same way as MergeAppend
> handles it.

Yes, I had quite the same reaction when I saw how MergeAppend handles it.

>
> > Also, if a LIMIT is specified, it should be able to give better
> > estimates, at least if there's no qual. For instance:
> >
> > EXPLAIN SELECT * FROM simple ORDER BY id LIMIT 10;
> > QUERY PLAN
> > >
> > ------------------------------------------------------------------------------------------------------->
> > Limit (cost=0.00..0.35 rows=10 width=15)
> > -> Append (cost=0.00..7705.56 rows=219999 width=15)
> > -> Seq Scan on simple_0_1 (cost=0.00..309.00 rows=20000 width=15)
> > -> Index Scan using simple_1_2_id_idx on simple_1_2
> > (cost=0.29..3148.28 rows=99999 width=14)
> > -> Index Scan using simple_2_3_id_idx on simple_2_3
> > (cost=0.29..3148.29 rows=100000 width=16)
> > (5 rows)
> >
> > In this case, we should estimate that the SeqScan (or in a corrected
> > version the Sort) node should not return more than 10 rows, and each
> > following partition should be scanned at all, and cost each path
> > accordingly. I think that this is quite important, for instance to
> > make sure that natively sorted Append is chosen over a MergeAppend
> > when there are some subpath with explicit sorts, because with the
> > Append we probably won't have to execute all the sorts if the previous
> > partition scans returned enough rows.
>
> In my patch, I'm not adding any additional paths. I'm just adding an
> Append instead of a MergeAppend. For what you're talking about the
> limit only needs to be passed into any underlying Sort so that it can
> become a top-N sort. This is handled already in create_limit_path().
> Notice in the plan you pasted above that the limit has a lower total
> cost than its Append subnode. That's because create_limit_path()
> weighted the Limit total cost based on the row count of the limit and
> its subpath. ... 7705.56 / 219999 * 10 = ~0.35.

Yes. But the cost of the first partition in this example is wrong
since there was no additional sort on top of the seq scan.

However, I now realize that, as you said, what your patch does is to
generate an Append *instead* of a MergeAppend if the optimization was
possible. So there can't be the problem of a MergeAppend chosen over
a cheaper Append in some cases, sorry for the noise. I totally missed
that because when I worked on the same topic last year we had to
generate both Append and MergeAppend. At that time Append were not
parallel-aware yet, so there could be faster parallel MergeAppend in
some cases.

> > FWIW, both those cases were handled (probably with some bugs though)
> > in the previous patches Ronan and I sent some time ago. Also, I did
> > not forget about this feature, I planned to work on it in hope to have
> > it included in pg12. However, I won't have a lot of time to work on
> > it before December.
>
> I apologise for not noticing your patch. I only went as far as
> checking the November commitfest to see if anything existed already
> and I found nothing there.

No worries, it's more than a year old now (I'm quite ashamed I didn't
come back on this sooner).

> I have time to work on this now, so likely
> it's better if I continue, just in case your time in December does not
> materialise.

I entirely agree.

> v2 of the patch is attached. I've not had time yet to give it a
> throughout post write review, but on first look it seems okay.

I've registered as a reviewer. I still didn't have a deep look at
the patch yet, but thanks a lot for working on it!

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Daniel Gustafsson 2018-10-30 23:29:25 Re: [HACKERS] Optional message to user when terminating/cancelling backend
Previous Message David Rowley 2018-10-30 22:41:35 Re: Super PathKeys (Allowing sort order through precision loss functions)