Re: [HACKERS] path toward faster partition pruning

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>
Cc: Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>, David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>, Rajkumar Raghuwanshi <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com>, Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>, Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>, Amit Langote <amitlangote09(at)gmail(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Beena Emerson <memissemerson(at)gmail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [HACKERS] path toward faster partition pruning
Date: 2018-03-01 19:13:39
Message-ID: CA+Tgmoa4D1c4roj7L8cx8gkkeBWAZD=MTcXKxTwBnsLRHD3rig@mail.gmail.com
Lists: pgsql-hackers

On Wed, Feb 28, 2018 at 11:53 PM, Amit Langote
<Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp> wrote:
> Attached updated patches.

+ memcpy(part_scheme->partsupfunc, partkey->partsupfunc,
+ sizeof(FmgrInfo) * partnatts);

You can't copy an FmgrInfo by just applying memcpy() to it. Use fmgr_info_copy.
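
Something along these lines should do it (the memory context argument
is just whatever context part_scheme is being built in;
CurrentMemoryContext below is only a placeholder):

    int     i;

    for (i = 0; i < partnatts; i++)
        fmgr_info_copy(&part_scheme->partsupfunc[i],
                       &partkey->partsupfunc[i],
                       CurrentMemoryContext);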

I don't like the comments at the top of partprune.c very much. It
seems strange to document individual functions here; those functions
can (and should) be documented in their individual header comments.
What we should have at the top of the file is a discussion of the
overall theory of operation of this module, something that currently
seems not to exist anywhere in the patch. I tried to figure that out
by looking at the new data structures the patch introduces:
PartitionPruneContext, PartScanKeyInfo, PartitionClauseInfo, and
PartClause. It looks like the general idea is that code that
wants to use these facilities does the following:

Step 1. Generate a PartitionPruneContext. In this patch, this seems
to consist entirely of copying information from the RelOptInfo or its
PartitionScheme.
Step 2. Call generate_partition_clauses() to extract relevant clauses
from rel->baserestrictinfo and generate a PartClauseInfo.
Step 3. Call get_partitions_from_clauses() to generate a set of
unpruned partition indexes. Internally, that function will first
populate a PartScanKeyInfo from the PartClauseInfo by calling
extract_bounding_datums(). Then it calls get_partitions_for_keys()
which generates the set of unpruned partition indexes from the
PartitionPruneContext and PartScanKeyInfo.
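
In code form, my understanding of the call sequence is roughly the
following -- I'm guessing at the argument lists, so don't take these
signatures literally:

    /* Step 1: build the context from the RelOptInfo / PartitionScheme. */
    PartitionPruneContext context;
    /* ... copy partnatts, partsupfunc, bounds, etc. from the rel ... */

    /* Step 2: pull the relevant clauses out of baserestrictinfo. */
    PartitionClauseInfo *clauseinfo =
        generate_partition_clauses(&context, rel->baserestrictinfo);

    /* Step 3: compute the set of unpruned partitions; internally this
     * runs extract_bounding_datums() and then get_partitions_for_keys(). */
    Bitmapset *unpruned = get_partitions_from_clauses(&context, clauseinfo);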

I guess there are two main things about this that seem odd to me:

1. I don't see why the partition pruning code should ever be
responsible for evaluating anything itself, as it does currently via
evaluate_expr(). For plan-time partition pruning, we are already
using eval_const_expressions() to perform as much Const-simplification
as possible. If we see an OpExpr with a partitioning column on one
side, then the other side is either a Const, in which case we can
perform pruning, or it's something whose value can't be known until
runtime, in which case we cannot. There should be no case in which
eval_const_expressions() fails to simplify something to a constant yet
we know the value at plan time; if such a case exists, then it's an
opportunity to improve eval_const_expressions(), not a reason to be
inconsistent with the rules it applies. For run-time pruning, it is
probably semantically wrong and certainly undesirable from a
performance perspective to spin up a new ExecutorState and a new
ExprState every time we need to reassess the decision about which
partitions need to be scanned. Instead, the expressions whose values
are inputs to the pruning process should be computed in the
ExecutorState/ExprState for the main query and the results should be
passed to the partition-pruning code as inputs to the decision-making
process.
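
Concretely, for the run-time case I'd expect something more like the
following, reusing the parent plan node's ExprContext rather than
spinning up new executor state; pruning_expr and node here are just
placeholders:

    /* Once, at executor startup: */
    ExprState  *exprstate = ExecInitExpr(pruning_expr, &node->ps);

    /* Each time we need to re-decide which partitions to scan: */
    bool        isnull;
    Datum       value = ExecEvalExprSwitchContext(exprstate,
                                                  node->ps.ps_ExprContext,
                                                  &isnull);
    /* ... then pass value/isnull to the pruning code ... */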

2. All processing of clauses should happen at plan time, not run time,
so that we're not repeating work. If, for example, a prepared query
is executed repeatedly, the clauses should get fully processed when
it's planned, and then the only thing that should happen at execution
time is that we take the values to which we now have access and use
them to decide what to prune. With this design, we're bound to repeat
at least the portion of the work done by get_partitions_from_clauses()
at runtime, and as things stand, it looks like the current version of
the run-time partition pruning patch repeats the
generate_partition_clauses() work at runtime as well.

It seems to me as though what we ought to be doing is extracting
clauses that are of the correct general form and producing a list of
<partition-column-index>, <partition-operator-strategy>, <expression>
triplets, sorted by increasing partition column index and operator
strategy number, in a form that can be represented as a Node tree.
This can be attached to the plan for use at runtime or, if there are
at least some constants among the expressions, used at plan time for
partition exclusion. So the main interfaces would be something like
this:

extern List *extract_partition_pruning_clauses(RelOptInfo *rel);
extern PartitionPruneContext *CreatePartitionPruneContext(List *);
extern Bitmapset *PerformStaticPartitionPruning(PartitionPruneContext *);
extern Bitmapset *PerformDynamicPartitionPruning(PartitionPruneContext *,
                                                 Datum *values, bool *isnull);
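
To make the triplet representation concrete, I'm imagining a node
shaped more or less like this (entirely hypothetical, of course):

typedef struct PartPruneClause
{
    NodeTag         type;
    int             keyno;      /* partition column index */
    StrategyNumber  strategy;   /* partition operator strategy number */
    Expr           *expr;       /* a Const, or something only computable
                                 * at run time */
} PartPruneClause;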

In PerformStaticPartitionPruning(), we'd just ignore anything for
which the expression was non-constant; to call
PerformDynamicPartitionPruning(), the caller would need to evaluate
the expressions from the PartitionPruneContext using the appropriate
EState and then pass the results into this function.
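
So a typical caller would end up doing something like this, where
values/isnull are arrays the caller has already filled in from the
context's expressions:

    /* At plan time: use whatever Consts we have; anything non-constant
     * is simply ignored. */
    Bitmapset  *plantime_parts = PerformStaticPartitionPruning(context);

    /* At run time: evaluate the context's expressions in the query's
     * EState/ExprState first, then pass the results in. */
    Bitmapset  *runtime_parts = PerformDynamicPartitionPruning(context,
                                                               values,
                                                               isnull);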

I realize that I'm hand-waving a bit here, or maybe more than a bit,
but in general I think it's right to imagine that the work here needs
to be segregated into two very well-separated phases. In the first
phase, we do all possible work to assess which clauses are relevant
and put them in a form where the actual pruning work can be done as
cheaply as possible once we know the values. This phase always
happens at plan time. In the second phase, we get the values, either
by extracting them from Const nodes at plan time or evaluating
expressions at runtime, and then decide which partitions to eliminate.
This phase happens at plan time if we have enough constants available
to do something useful and at runtime if we have enough non-constants
to do something useful. Right now there doesn't seem to be a clean
separation between these two phases and I think that's not good.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
