[Proposal] Make the optimiser aware of partitions ordering

From: Ronan Dunklau <ronan(dot)dunklau(at)dalibo(dot)com>
To: pgsql-hackers Hackers <pgsql-hackers(at)postgresql(dot)org>
Cc: julien(dot)rouhaud(at)dalibo(dot)com
Subject: [Proposal] Make the optimiser aware of partitions ordering
Date: 2017-03-20 10:31:10
Message-ID: 2401607.SfZhPQhbS4@ronan_laptop
Views: Raw Message | Whole Thread | Download mbox
Thread:
Lists: pgsql-hackers

Hello,

With native partitioning landing in Postgres 10, we (Julien Rouhaud and
myself) had the idea for the
following simple optimisation. This simple optimisation comes from a real use
case.

===== Proposal =====

With range partitioning, we guarantee that each partition contains non-
overlapping values. Since we know the range allowed for each partition, it is
possible to sort them according to the partition key (as is done already for
looking up partitions).

Thus, we ca generate sorted Append plans instead of MergeAppend when sorting
occurs on the partition key.

For example, consider the following model and query:

CREATE TABLE parent (id int) PARTITION BY RANGE (id);
CREATE TABLE child2 PARTITION OF parent FOR VALUES FROM (10) TO (20);
CREATE TABLE child1 PARTITION OF parent FOR VALUES FROM (1) TO (10);
CREATE INDEX ON child1(id);
CREATE INDEX ON child2(id);

EXPLAIN (COSTS OFF) SELECT * FROM parent ORDER BY id desc;

QUERY PLAN
--------------------------------------------------------------
Merge Append
Sort Key: parent.id DESC
-> Sort
Sort Key: parent.id DESC
-> Seq Scan on parent
-> Index Only Scan Backward using child2_id_idx on child2
-> Index Only Scan Backward using child1_id_idx on child

We can guarantee that every value found in child2 will have a greater id than
any value found in child1. Thus, we could generate a plan like this:

QUERY PLAN
---------------------------------------------------------------
Append
Sort Key: child2.id DESC
-> Index Only Scan Backward using child2_id_idx1 on child2
-> Index Only Scan Backward using child1_id_idx1 on child1

Skipping the merge step altogether.

This has the added benefit of providing an "early stop": if we add a limit to
the query, we will only scan the indexes that are necessary for fetching the
required amount of tuples. This is especially useful if we add a WHERE clause
not covered by the index with the limit. Consider the following example, with
a table "webvisits" partitioned by a ts colum. If we want to retrieve the
latest 100 hits from a specific ip:

EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM webvisits WHERE ipaddr =
'93.184.216.34' ORDER BY ts DESC LIMIT 10;


----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=0.43..32.32 rows=10 width=104) (actual time=0.080..0.109 rows=10
loops=1)
Buffers: shared hit=7
-> Append (cost=0.43..401009.65 rows=125740 width=104) (actual
time=0.078..0.105 rows=10 loops=1)
Sort Key: webvisits_201712.ts DESC
Buffers: shared hit=7
-> Index Scan Backward using webvisits_201712_ipaddr_ts_idx on
webvisits_201712 (cost=0.43..133617.70 rows=41901 width=104) (actual
time=0.077..0.101 rows=10 loops=1)
Index Cond: (ipaddr = '93.184.216.34'::inet)
Buffers: shared hit=7
-> Index Scan Backward using webvisits_201711_ipaddr_ts_idx on
webvisits_201711 (cost=0.43..131514.18 rows=41243 width=104) (never executed)
Index Cond: (ipaddr = '93.184.216.34'::inet)
-> Index Scan Backward using webvisits_201710_ipaddr_ts_idx on
webvisits_201710 (cost=0.43..135801.71 rows=42587 width=104) (never executed)
........

We have developed a proof-of-concept implementation for this optimisation.

For sub partitioning, intermediate Append nodes can be generated.

Consider the following examples:

CREATE TABLE parentparent (c1 int, c2 int) PARTITION BY range (c1);

-- Subpartition only on second column
CREATE TABLE parent1 PARTITION OF parentparent FOR VALUES FROM (1) TO (10)
PARTITION BY range (c2);

-- Subpartition on both columns
CREATE TABLE parent2 PARTITION OF parentparent FOR VALUES FROM (10) TO (20)
PARTITION BY range (c1, c2);

CREATE TABLE child11 PARTITION OF parent1 FOR VALUES FROM (1) TO (10);
CREATE TABLE child12 PARTITION OF parent1 FOR VALUES FROM (10) TO (20);
CREATE TABLE child21 PARTITION OF parent2 FOR VALUES FROM (10, 1) TO (15, 10);
CREATE TABLE child22 PARTITION OF parent2 FOR VALUES FROM (15, 10) TO (20,
20);

EXPLAIN (COSTS OFF) SELECT * FROM parentparent ORDER BY c1;
QUERY PLAN
---------------------------------------
Append
Sort Key: parent1.c1
-> Sort
Sort Key: parent1.c1
-> Append
-> Seq Scan on parent1
-> Seq Scan on child11
-> Seq Scan on child12
-> Append
Sort Key: child21.c1
-> Sort
Sort Key: child21.c1
-> Seq Scan on child21
-> Sort
Sort Key: child22.c1
-> Seq Scan on child22

EXPLAIN (COSTS OFF) SELECT * FROM parentparent ORDER BY c1, c2;
QUERY PLAN
------------------------------------------------
Append
Sort Key: parent1.c1, parent1.c2
-> Sort
Sort Key: parent1.c1, parent1.c2
-> Append
-> Seq Scan on parent1
-> Seq Scan on child11
-> Seq Scan on child12
-> Append
Sort Key: child21.c1, child21.c2
-> Sort
Sort Key: child21.c1, child21.c2
-> Seq Scan on child21
-> Sort
Sort Key: child22.c1, child22.c2
-> Seq Scan on child22

===== Patch design =====

First, we don't really know what we're doing :). But this is a proof of
concept, and if there is interest in this patch, let us know how we should
have done things and we're willing to rewrite it entirely if needed.

==== Overview ====

In the optimiser, we generate PathKeys representing the two sort orders by
wich partition can be ordered (ASC and DESC), only for "native" range-based
partitioned tables, and store them in RelOptInfo associated with every parent
table, along with a list of OIDs corresponding to the partitions.

Then, when the time comes to generate append paths, we add another step which
tries to match the query_pathkeys to those of the partition, and generate
another append node with the sorted paths for each partitions, in the expected
order, and a pathkey to the append path itself to indicate its already sorted.

==== Known Problems with the patch ====

Once again, keep in mind it's a proof-of-concept

- It is in conflict with the existing patch designed to avoid adding the
parent partitions
to the append node. As such, it will almost certainly need a full rewrite to
adapt to that since that is done in this patch in a probably less than ideal
fashion.
- As of now, the patch doesn't try to match the partition pathkey to
mergejoinable pathkeys.
- maybe a new node should be created altogether, instead of porting most of
the MergeAppend struct fields to the basic Append ?
- the way we store the PathKeys and partitions OID directly in RelOptInfo is
probably wrong
- the major impact on existing queries is the fact that to handle sub-
partitioning, RangeTblEntries representing intermediate append nodes are added
(but just in the case of native partitioning, regular inheritance is not
affected). See updated prepunion.c for what that means. Those RTE are then
ignored
when generating the real Append nodes.
- new regression tests have not been written yet, and existing ones haven't
been "fixed" to take into account the different partition ordering: in case of
sub partitioning, the order is now "depth-first" rather than "breadth-first"
like it was earlier.
- no documentation has been added, although we don't really know where it
should go
- the patch lacks a lot of comments
- the key displayed in EXPLAIN output comes from the first partition, not the
parent.
- maybe a "real" SortPath should be generated, instead of generating Sort
nodes
out of nowhere when creating the plan. This has been done to conform to what
MergeAppend already did.

===== Questions =====

Is there interest for such an optimization ?
Should we work from the patch proposed to remove parent tables already ?
Should there be a new Node for such a "SortedAppend" operation or is it fine
keeping it with the Append node already ? Should that instead be an
optimization of the MergeAppend Node ?
What is or is not acceptable with regards to storing things in RelOptInfo
(among others) ?

More generally, any kind of feedback that will help us get on the right track
will be appreciated.

--
Ronan Dunklau & Julien Rouhaud
http://dalibo.com - http://dalibo.org

Attachment Content-Type Size
sorted_append_nodes_for_partitioning.patch text/x-patch 49.0 KB

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Stas Kelvich 2017-03-20 11:10:09 Re: logical decoding of two-phase transactions
Previous Message Pavel Stehule 2017-03-20 10:30:56 Re: PoC plpgsql - possibility to force custom or generic plan