Re: Ordered Partitioned Table Scans

From: Julien Rouhaud <rjuju123(at)gmail(dot)com>
To: david(dot)rowley(at)2ndquadrant(dot)com
Cc: Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Ordered Partitioned Table Scans
Date: 2018-10-27 14:49:07
Message-ID: CAOBaU_bjV9iCidOgsrpgVjY+AXgNxFO+y6=PLLh5x8d2_d1wpQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

On Fri, Oct 26, 2018 at 1:01 PM Julien Rouhaud <rjuju123(at)gmail(dot)com> wrote:
> On Fri, Oct 26, 2018 at 6:40 AM David Rowley
> <david(dot)rowley(at)2ndquadrant(dot)com> wrote:
> >
> > On 26 October 2018 at 16:52, Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp> wrote:
> > > I recall Ronan Dunklau and Julien Rouhaud had proposed a patch for this
> > > last year, but the partitioning-related planning code hadn't advanced then
> > > as much as it has today, so they sort of postponed working on it.
> > > Eventually their patch was returned with feedback last November. Here's
> > > the link to their email in case you wanted to read some comments their
> > > proposal and patch got, although some of them might be obsolete.
> > >
> > > https://www.postgresql.org/message-id/2401607.SfZhPQhbS4%40ronan_laptop
> >
> > Thanks. I wasn't aware, or ... forgot. Looks like back then was tricky
> > times to be doing this. Hopefully, the dust has settled a little bit
> > now.
>
> As far as I remember, the biggest problems we had was
> to handle multi-level partitionning, when the query is ordered by all
> or a subset of the partition keys, and/or with a mix of ASC/DESC
> clauses. It also required some extra processing on the cost part for
> queries that can be naturally ordered and contain a LIMIT clause,
> since we can estimate how many partitions will have to be scanned.

I just had a look at your patch. I see that you implemented only a
subset of the possible optimizations (only the case for range
partitionoing without subpartitions). This has been previously
discussed, but we should be able to do similar optimization for list
partitioning if there's no interleaved values, and also for some cases
of multi-level partitioning.

Concerning the implementation, there's at least one issue: it assumes
that each subpath of a range-partitioned table will be ordered, with
is not guaranteed. You need to to generate explicit Sort nodes nodes
(in the same order as the query_pathkey) for partitions that don't
have an ordered path and make sure that this path is used in the
Append. Here's a simplistic case showing the issue (sorry, the
partition names are poorly chosen):

CREATE TABLE simple (id integer, val text) PARTITION BY RANGE (id);
CREATE TABLE simple_1_2 PARTITION OF simple FOR VALUES FROM (1) TO (100000);
CREATE TABLE simple_2_3 PARTITION OF simple FOR VALUES FROM (100000)
TO (200000);
CREATE TABLE simple_0_1 PARTITION OF simple FOR VALUES FROM (-100000) TO (1);

INSERT INTO simple SELECT id, 'line ' || id FROM
generate_series(-19999, 199999) id;

CREATE INDEX ON simple_1_2 (id);
CREATE INDEX ON simple_2_3 (id);

EXPLAIN SELECT * FROM simple ORDER BY id ;
QUERY PLAN
---------------------------------------------------------------------------------------------------
Append (cost=0.00..7705.56 rows=219999 width=15)
-> Seq Scan on simple_0_1 (cost=0.00..309.00 rows=20000 width=15)
-> Index Scan using simple_1_2_id_idx on simple_1_2
(cost=0.29..3148.28 rows=99999 width=14)
-> Index Scan using simple_2_3_id_idx on simple_2_3
(cost=0.29..3148.29 rows=100000 width=16)
(4 rows)

Also, if a LIMIT is specified, it should be able to give better
estimates, at least if there's no qual. For instance:

EXPLAIN SELECT * FROM simple ORDER BY id LIMIT 10;
QUERY PLAN
>
------------------------------------------------------------------------------------------------------->
Limit (cost=0.00..0.35 rows=10 width=15)
-> Append (cost=0.00..7705.56 rows=219999 width=15)
-> Seq Scan on simple_0_1 (cost=0.00..309.00 rows=20000 width=15)
-> Index Scan using simple_1_2_id_idx on simple_1_2
(cost=0.29..3148.28 rows=99999 width=14)
-> Index Scan using simple_2_3_id_idx on simple_2_3
(cost=0.29..3148.29 rows=100000 width=16)
(5 rows)

In this case, we should estimate that the SeqScan (or in a corrected
version the Sort) node should not return more than 10 rows, and each
following partition should be scanned at all, and cost each path
accordingly. I think that this is quite important, for instance to
make sure that natively sorted Append is chosen over a MergeAppend
when there are some subpath with explicit sorts, because with the
Append we probably won't have to execute all the sorts if the previous
partition scans returned enough rows.

FWIW, both those cases were handled (probably with some bugs though)
in the previous patches Ronan and I sent some time ago. Also, I did
not forget about this feature, I planned to work on it in hope to have
it included in pg12. However, I won't have a lot of time to work on
it before December.

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Andres Freund 2018-10-27 15:22:03 Re: Resetting PGPROC atomics in ProcessInit()
Previous Message Alvaro Herrera 2018-10-27 14:45:57 Re: BUG #15446: Crash on ALTER TABLE