Quick Links

Re: [HACKERS] path toward faster partition pruning

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>
Cc:	Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>, David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>, Rajkumar Raghuwanshi <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com>, Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>, Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>, Amit Langote <amitlangote09(at)gmail(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Beena Emerson <memissemerson(at)gmail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: [HACKERS] path toward faster partition pruning
Date:	2018-03-02 15:47:58
Message-ID:	CA+TgmoahUxagjeNeJTcJkD0rbk+mHTXROzWcEd+tZ8DuQG83cg@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

On Fri, Mar 2, 2018 at 6:21 AM, Amit Langote
<Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp> wrote:
> Looking at the rough interface sketch in your message, it seems that the
> product of whatever steps we end up grouping into phase 1 should be
> something that can be put into a Node tree (PartitionPruneContext?),
> because we may need to attach it to the plan if some of the needed values
> will only be made available during execution.

Right. You might also end up with two representations: a
Node-tree-style representation that contains all of the information we
need, and another, faster form into which it gets converted before
use.

> Given the patch's implementation, we'll have to make the structure of that
> Node tree a bit more complex than a simple List. For one thing, the patch
> handles OR clauses by performing pruning separately for each arm and them
> combining partitions selected across OR arms using set union. By
> "performing pruning" in the last sentence I meant following steps similar
> to ones you wrote in your message:
>
> 1. Segregating pruning clauses into per-partition-key Lists, that is,
> generate_partition_clauses() producing a PartitionClauseInfo,
>
> 2. Removing redundant clauses from each list, that is,
> remove_redundant_clauses() to produce lists with just one member per
> operator strategy for each partition key,
>
> 3. Extracting Datum values from the clauses to form equal/min/max tuples
> and setting null or not null bits for individual keys, that is,
> extract_bounding_datums() producing a PartScanKeyInfo, and
>
> 4. Finally pruning with those Datum tuples and null/not null info, that
> is, get_partitions_for_keys().
>
> Steps 2-4 are dependent on clauses providing Datums, which all the clauses
> may or may not do. Depending on whether or not, we'll have to defer those
> steps to run time.

I don't see that there's a real need to perform step 2 at all. I
mean, if you have x > $1 and x > $2 in the query, you can just compute
the set of partitions for the first clause, compute the set of
partitions for the second clause, and then intersect. That doesn't
seem obviously worse than deciding which of $1 and $2 is greater and
then pruning only based on whichever one is greater in this case.

> * What do we encode into the Node tree attached to the plan? Clauses that
> haven't gone through steps 2 and 3 (something like PartitionClauseInfo)
> or the product of step 3 (something like PartScanKeyInfo)?
>
> * How do we account for OR clauses? Perhaps by having the aforementioned
> Node trees nested inside the top-level one, wherein there will be one
> nested node per arm of an OR clause.

Suppose we define the notion of a pruning program. A pruning program
can use any number of registers, which have integer numbers starting
with 0 and counting upward as high as necessary. Each register holds
a Bitmapset. The result of a pruning program is the value of register
0 when the program completes. A pruning program consists of a list of
steps, each of which is either a PruningBaseStep or a
PruningCombineStep. A PruningCombineStep modifies the contents of the
target register based on the contents of a source register in one of
the following three ways: (1) UNION -- all bits set in source become
set in target; (2) INTERSECT -- all bits clear in source become clear
in target; (3) DIFFERENCE -- all bits set in source become clear in
target. A PruningBaseStep consists of a strategy (equality,
less-than, etc.), an output register, and list of expressions --
either as many as there are partition keys, or for range partitioning
perhaps fewer; it prunes based on the strategy and the expressions and
overwrites the output register with the partitions that would be
selected.

Example #1. Table is hash-partitioned on a and b. Given a query like
SELECT * FROM tab WHERE a = 100 AND b = 233, we create a single-step
program:

1. base-step (strategy =, register 0, expressions 100, 233)

If there were an equality constraint on one of the two columns, we
would not create a pruning program at all, because no pruning is
possible.

Example #2. Table is list-partitioned on a. Given a query like SELECT
* FROM tab WHERE (a = $1 OR a = $2) AND a != $3, we create this
program:

1. base-step (strategy =, register 0, expressions $1)
2. base-step (strategy =, register 1, expressions $2)
3. base-step (strategy !=, register 2, expressions $3)
4. combine-step (target-register 0, source-register 1, strategy union)
5. combine-step (target-register 0, source-register 2, strategy difference)

(This is unoptimized -- one could do better by reversing steps 3 and 4
and using reusing register 1 instead of needing register 2, but that
kind of optimization is probably not too important.)

Example #3. Table is range-partitioned on a and b. Given a query like
SELECT * FROM tab WHERE (a = 40 AND b > $1) OR (a = $2 AND b = $3), we
do this:

1. base-step (strategy >, register 0, expressions 40, $1)
2. base-step (strategy =, register 1, expressions $2, $3)
3. combine-step (target-register 0, source-register 1, strategy union)

You might need a few extra gadgets here to make all of this work --
e.g. another base-step strategy to handle ScalarArrayOpExpr; I'm just
trying to convey the basic idea here. It's pretty easy to see how to
store a program like this as a node tree: just create PruningBaseStep
and PruningCombineStep nodes and stick them into a List. At execution
time transform the List into an array and loop over it.

Or possibly it would be better to have two lists, one of base steps
without explicit register numbers, where step N always outputs to
register N, and then a second list of combine steps. Then at
execution time you could have an array of PruningBaseStep * and an
array of PruningCombineStep * instead of a combined array of Node *,
which might be quicker to process.

But regardless of what you do exactly, I think you should try to come
up with some kind of representation that is basically uniform,
handling all the things you support in a similar fashion. The current
patch has basically separate and somewhat ad-hoc representations for
the regular case, the <> case, and the OR case, which I think is not
ideal because you end up with more code and a certain amount of
repeated logic.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

In response to

Re: [HACKERS] path toward faster partition pruning at 2018-03-02 11:21:21 from Amit Langote

Responses

Re: [HACKERS] path toward faster partition pruning at 2018-03-02 16:03:18 from David Rowley
Re: [HACKERS] path toward faster partition pruning at 2018-03-03 00:32:46 from David Rowley
Re: [HACKERS] path toward faster partition pruning at 2018-03-05 08:38:17 from Amit Langote

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Tom Lane	2018-03-02 15:49:13	Re: perltidy version
Previous Message	Alexander Kuzmenkov	2018-03-02 15:47:44	Re: [patch] BUG #15005: ANALYZE can make pg_class.reltuples inaccurate.