Quick Links

Re: [HACKERS] path toward faster partition pruning

From:	Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>, David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>, Rajkumar Raghuwanshi <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com>, Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>, Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>, Amit Langote <amitlangote09(at)gmail(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Beena Emerson <memissemerson(at)gmail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: [HACKERS] path toward faster partition pruning
Date:	2018-03-02 11:21:21
Message-ID:	39224135-ddb7-1254-3c5e-e8696d5346ff@lab.ntt.co.jp
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

Thanks for your comments.

On 2018/03/02 4:13, Robert Haas wrote:
> On Wed, Feb 28, 2018 at 11:53 PM, Amit Langote
> <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp> wrote:
>> Attached updated patches.
>
> + memcpy(part_scheme->partsupfunc, partkey->partsupfunc,
> + sizeof(FmgrInfo) * partnatts);
>
> You can't copy an FmgrInfo by just applying memcpy() to it. Use fmgr_info_copy.

Oops, will fix.

> I don't like the comments at the top of partprune.c very much. It
> seems strange to document individual functions here; those functions
> can (and should) be documented in their individual header comments.
> What we should have at the top of the file is a discussion of the
> overall theory of operation of this module, something that currently
> seems not to exist anywhere in the patch.

Sorry about that. Thanks to your comments here and David's reply, I will
try to address that in the next version.

> I tried to figure that out
> by looking at the new data structures the patch introduces:
> PartitionPruneContext, PartScanKeyInfo, PartitionClauseInfo, and
> PartClause. It looks like the general idea idea is that code that
> wants to use these facilities does the following:
>
> Step 1. Generate a PartitionPruneContext. In this patch, this seems
> to consist entirely of copying information from the RelOptInfo or its
> PartitionScheme.
> Step 2. Call generate_partition_clauses() to extract relevant clauses
> from rel->baserestrictinfo and generate a PartClauseInfo.
> Step 3. Call get_partitions_from_clauses() to generate a set of
> unpruned partition indexes. Internally, that function will first
> populate a PartScanKeyInfo from the PartClauseInfo by calling
> extract_bounding_datums(). Then it calls get_partitions_for_keys()
> which generates the set of unpruned partition indexes from the
> PartitionPruneContext and PartScanKeyInfo.
>
> I guess there are two main things about this that seem odd to me:
>
> 1. I don't see why the partition pruning code should ever be
> responsible for evaluating anything itself, as it does currently via
> evaluate_expr(). For plan-time partition pruning, we are already
> using eval_const_expressions() to perform as much Const-simplification
> as possible. If we see an OpExpr with a partitioning column on one
> side, then the other side is either a Const, in which case we can
> perform pruning, or it's something whose value can't be known until
> runtime, in which case we cannot. There should be no case in which
> eval_const_expressions() fails to simplify something to a constant yet
> we know the value at plan time; if such a case exists, then it's an
> opportunity to improve eval_const_expressions(), not a reason to do be
> inconsistent with the rules it applies. For run-time pruning, it is
> probably semantically wrong and certainly undesirable from a
> performance perspective to spin up a new ExecutorState and a new
> ExprState every time we need to reassess the decision about which
> partitions need to be scanned. Instead, the expressions whose values
> are inputs to the pruning process should be computed in the
> ExecutorState/ExprState for the main query and the results should be
> passed to the partition-pruning code as inputs to the decision-making
> process.

As I said in my earlier reply, I have removed the part that involved the
pruning code calling evaluate_expr().

> 2. All processing of clauses should happen at plan time, not run time,
> so that we're not repeating work. If, for example, a prepared query
> is executed repeatedly, the clauses should get fully processed when
> it's planned, and then the only thing that should happen at execution
> time is that we take the values to which we now have access and use
> them to decide what to prune. With this design, we're bound to repeat
> at least the portion of the work done by get_partitions_from_clauses()
> at runtime, and as things stand, it looks like the current version of
> the run-time partition pruning patch repeats the
> generate_partition_clauses() work at runtime as well.
>
> It seems to me as though what we ought to be doing is extracting
> clauses that are of the correct general form and produces a list of
> <partition-column-index>, <partition-operator-strategy>, <expression>
> triplets sorted by increasing partition column index and operator
> strategy number and in a form that can be represented as a Node tree.
> This can be attached to the plan for use at runtime or, if there are
> at least some constants among the expressions, used at plan time for
> partition exclusion. So the main interfaces would be something like
> this:
>
> extern List *extract_partition_pruning_clauses(RelOptInfo *rel);
> extern PartitionPruneContext *CreatePartitionPruneContext(List *);
> extern Bitmapset *PerformStaticPartitionPruning(PartitionPruneContext *);
> extern Bitmapset *PerformDynamicPartitionPruning(PartitionPruneContext
> *, Datum *values, bool *isnull);
>
> In PerformStaticPartitionPruning(), we'd just ignore anything for
> which the expression was non-constant; to call
> PerformDynamicPartitionPruning(), the caller would need to evaluate
> the expressions from the PartitionPruneContext using the appropriate
> EState and then pass the results into this function.
>
> I realize that I'm hand-waving a bit here, or maybe more than a bit,
> but in general I think it's right to imagine that the work here needs
> to be segregated into two very well-separated phases. In the first
> phase, we do all possible work to assess which clauses are relevant
> and put them in a form where the actual pruning work can be done as
> cheaply as possible once we know the values. This phase always
> happens at plan time. In the second phase, we get the values, either
> by extracting them from Const nodes at plan time or evaluating
> expressions at runtime, and then decide which partitions to eliminate.
> This phase happens at plan time if we have enough constants available
> to do something useful and at runtime if we have enough non-constants
> to do something useful. Right now there doesn't seem to be a clean
> separation between these two phases and I think that's not good.

Hmm, I see that things can be improved and various points you've mentioned
here to improve the high-level interfaces seem useful. I'll try to rework
the patch based on those and submit one early next week.

Looking at the rough interface sketch in your message, it seems that the
product of whatever steps we end up grouping into phase 1 should be
something that can be put into a Node tree (PartitionPruneContext?),
because we may need to attach it to the plan if some of the needed values
will only be made available during execution.

Given the patch's implementation, we'll have to make the structure of that
Node tree a bit more complex than a simple List. For one thing, the patch
handles OR clauses by performing pruning separately for each arm and them
combining partitions selected across OR arms using set union. By
"performing pruning" in the last sentence I meant following steps similar
to ones you wrote in your message:

1. Segregating pruning clauses into per-partition-key Lists, that is,
generate_partition_clauses() producing a PartitionClauseInfo,

2. Removing redundant clauses from each list, that is,
remove_redundant_clauses() to produce lists with just one member per
operator strategy for each partition key,

3. Extracting Datum values from the clauses to form equal/min/max tuples
and setting null or not null bits for individual keys, that is,
extract_bounding_datums() producing a PartScanKeyInfo, and

4. Finally pruning with those Datum tuples and null/not null info, that
is, get_partitions_for_keys().

Steps 2-4 are dependent on clauses providing Datums, which all the clauses
may or may not do. Depending on whether or not, we'll have to defer those
steps to run time.

So,

* What do we encode into the Node tree attached to the plan? Clauses that
haven't gone through steps 2 and 3 (something like PartitionClauseInfo)
or the product of step 3 (something like PartScanKeyInfo)?

* How do we account for OR clauses? Perhaps by having the aforementioned
Node trees nested inside the top-level one, wherein there will be one
nested node per arm of an OR clause.

Thanks,
Amit

In response to

Re: [HACKERS] path toward faster partition pruning at 2018-03-01 19:13:39 from Robert Haas

Responses

Re: [HACKERS] path toward faster partition pruning at 2018-03-02 15:47:58 from Robert Haas

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Magnus Hagander	2018-03-02 11:23:58	Re: [PATCH] Verify Checksums during Basebackups
Previous Message	Arthur Zakirov	2018-03-02 11:19:25	Re: [PROPOSAL] Shared Ispell dictionaries