Re: [HACKERS] path toward faster partition pruning

From: Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>, Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>, Rajkumar Raghuwanshi <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com>, Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>, Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>, Amit Langote <amitlangote09(at)gmail(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Beena Emerson <memissemerson(at)gmail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [HACKERS] path toward faster partition pruning
Date: 2018-03-27 10:42:42
Message-ID: 21e00a6a-1e8d-d866-ae40-07c2058512de@lab.ntt.co.jp
Views: Raw Message | Whole Thread | Download mbox
Thread:
Lists: pgsql-hackers

On 2018/03/23 21:19, Amit Langote wrote:
> On 2018/03/21 6:29, Robert Haas wrote:
>> + /*
>> + * If there are multiple pruning steps, we perform them one after another,
>> + * passing the result of one step as input to another. Based on the type
>> + * of pruning step, perform_pruning_step may add or remove partitions from
>> + * the set of partitions it receives as the input.
>> + */
>>
>> The comment sounds great, but the code doesn't work that way; it
>> always calls bms_int_members to intersect the new result with any
>> previous result. I'm baffled as to how this manages to DTRT if
>> COMBINE_OR is used. In general I had hoped that the list of pruning
>> steps was something over which we were only going to iterate, not
>> recurse. This definitely recurses for the combine steps, but it's
>> still (sorta) got the idea of a list of iterable steps. That's a
>> weird mix.
>
> At the top-level (in get_matching_partitions), it is assumed that the
> steps in the input list come from implicitly AND'd clauses, so the
> intersection between partition sets that we get for each.
>
> Anyway, after David's rewrite of this portion of the patch incorporated in
> the latest patch, things look a bit different here, although there is
> still recursion for combine steps. I'm still considering how to make the
> recursion go away.

I have managed to make the recursion go away in the attached updated
version. I guess that's the result of employing the idea of a "output
register" for individual pruning steps as mentioned in Robert's email
upthread where he detailed the "pruning steps" approach [1].

With the new patch, pruning steps for arguments of, say, an OR clause are
not performed recursively. Instead, each pruning step is performed
independently and its output is stored in a slot dedicated to it. Combine
steps are always executed after all of the steps corresponding to its
arguments have been executed. That's ensured by the way steps are allocated.

Jesper, off-list, reported an unused variable which has been removed in
the updated patch. Thanks Jesper! He also pointed out a case with a
list-partitioned table where pruning doesn't a produce a result as one
would expect and what constraint exclusion would produce.

create table lp (a char) partition by list (a);
create table lp_ad partition of lp for values in ('a', 'd');
create table lp_bc partition of lp for values in ('b', 'c');
create table lp_default partition of lp default;
explain (costs off) select * from lp where a > 'a' and a < 'd';
QUERY PLAN
-----------------------------------------------------------
Append
-> Seq Scan on lp_ad
Filter: ((a > 'a'::bpchar) AND (a < 'd'::bpchar))
-> Seq Scan on lp_bc
Filter: ((a > 'a'::bpchar) AND (a < 'd'::bpchar))
-> Seq Scan on lp_default
Filter: ((a > 'a'::bpchar) AND (a < 'd'::bpchar))
(7 rows)

One would expect that lp_ad is not scanned.

With the implementation where a > 'a' and a < 'd' are used to prune
separately, this cannot be avoided given the way
get_partitions_for_keys_list() added by the patch works. What happens is
we prune first with a step generated for a > 'a', which returns partitions
for all datums in the table's boundinfo greater than 'a' that have a
partition assigned, which means we'll include the partition that accepts
'd'. Then when pruning with a < 'd', we select partitions for all datums
less than 'd' that have a partition assigned, which means we end up
including the partition that accepts 'a'. We intersect the result of
running these independent steps, but lp_ad is present in the result sets
of both the sets, so it ends up in the final result. Maybe there is a way
to fix that, but I haven't done anything about it yet.

Thanks,
Amit

[1]
https://www.postgresql.org/message-id/CA%2BTgmoahUxagjeNeJTcJkD0rbk%2BmHTXROzWcEd%2BtZ8DuQG83cg%40mail.gmail.com

Attachment Content-Type Size
v41-0001-Add-partsupfunc-to-PartitionSchemeData.patch text/plain 3.4 KB
v41-0002-Add-more-tests-for-partition-pruning.patch text/plain 16.3 KB
v41-0003-Faster-partition-pruning.patch text/plain 119.7 KB
v41-0004-Add-only-unpruned-partitioned-child-rels-to-part.patch text/plain 24.4 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Simon Riggs 2018-03-27 10:46:55 Re: [HACKERS] MERGE SQL Statement for PG11
Previous Message Andrew Dunstan 2018-03-27 10:35:53 Re: Changing WAL Header to reduce contention during ReserveXLogInsertLocation()