Re: [HACKERS] path toward faster partition pruning

From: David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>
To: Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>, Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>, Amit Langote <amitlangote09(at)gmail(dot)com>, Rajkumar Raghuwanshi <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Beena Emerson <memissemerson(at)gmail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [HACKERS] path toward faster partition pruning
Date: 2018-02-18 02:25:47
Message-ID: CAKJS1f9By2G=WZ5Yk12fvHHVGq7WnjGCdwWbRrFmoCmiOcSYFQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 17 February 2018 at 22:39, David Rowley
<david(dot)rowley(at)2ndquadrant(dot)com> wrote:
> 10. I'd rather see bms_next_member() used here:
>
> /* Add selected partitions' RT indexes to result. */
> while ((i = bms_first_member(partindexes)) >= 0)
> result = bms_add_member(result, rel->part_rels[i]->relid);
>
> There's not really much use for bms_first_member these days. It can be
> slower due to having to traverse the unset lower significant bits each
> loop. bms_next_member starts where the previous loop left off.
>
> Will try to review more tomorrow.

As I mentioned yesterday, here's the remainder of the review:

11. The following comment is misleading. It says: "We currently handle
two such cases:", then it goes on to say the 2nd case is not handled.

/*
* Handle cases where the clause's operator does not belong to
* the partitioning operator family. We currently handle two
* such cases: 1. Operators named '<>' are not listed in any
* operator family whatsoever, 2. Ordering operators like '<'
* are not listed in the hash operator families. For 1, check
* if list partitioning is in use and if so, proceed to pass
* the clause to the caller without doing any more processing
* ourselves. 2 cannot be handled at all, so the clause is
* simply skipped.
*/

12. The following code should test for LIST partitioning before doing
anything else:

if (!op_in_opfamily(opclause->opno, partopfamily))
{
Oid negator;

/*
* To confirm if the operator is really '<>', check if its
* negator is a equality operator. If it's a btree
* equality operator *and* this is a list partitioned
* table, we can use it prune partitions.
*/
negator = get_negator(opclause->opno);
if (OidIsValid(negator) &&
op_in_opfamily(negator, partopfamily))
{
Oid lefttype;
Oid righttype;
int strategy;

get_op_opfamily_properties(negator, partopfamily,
false,
&strategy,
&lefttype, &righttype);

if (strategy == BTEqualStrategyNumber &&
context->strategy == PARTITION_STRATEGY_LIST)
is_ne_listp = true;
}

/* Cannot handle this clause. */
if (!is_ne_listp)
continue;
}

The code should probably be in the form of:

if (!op_in_opfamily(opclause->opno, partopfamily))
{
if (context->strategy != PARTITION_STRATEGY_LIST)
continue;

...

if (strategy == BTEqualStrategyNumber)
is_ne_listp = true;
}

that way we'll save 3 syscache lookups when a <> clause appears in a
RANGE or HASH partitioned table.

13. The following code makes assumptions that the partitioning op
family is btree:

/*
* In case of NOT IN (..), we get a '<>', which while not
* listed as part of any operator family, we are able to
* handle it if its negator is an equality operator that
* is in turn part of the operator family.
*/
if (!op_in_opfamily(saop_op, partopfamily))
{
Oid negator = get_negator(saop_op);
int strategy;
Oid lefttype,
righttype;

if (!OidIsValid(negator))
continue;
get_op_opfamily_properties(negator, partopfamily, false,
&strategy,
&lefttype, &righttype);
if (strategy != BTEqualStrategyNumber)
continue;
}

this might not be breakable today, but it could well break in the
future, for example, if hash op family managed to grow two more
strategies, then we could get a false match on the matching strategy
numbers (both 3).

14. The following code assumes there will be a right op:

if (IsA(clause, OpExpr))
{
OpExpr *opclause = (OpExpr *) clause;
Expr *leftop,
*rightop,
*valueexpr;
bool is_ne_listp = false;

leftop = (Expr *) get_leftop(clause);
if (IsA(leftop, RelabelType))
leftop = ((RelabelType *) leftop)->arg;
rightop = (Expr *) get_rightop(clause);

This'll crash with the following:

create function nonzero(p int) returns bool as $$ begin return (p <>
0); end;$$ language plpgsql;
create operator # (procedure = nonzero, leftarg = int);
create table listp (a int) partition by list (a);
create table listp1 partition of listp for values in(1);
select * from listp where a#;
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.

You need to ensure that there are 2 args.

15. Various if tests in extract_partition_clauses result in a
`continue` when they should perhaps be a `break` instead.

For example:

if (!PartCollMatchesExprColl(partcoll, opclause->inputcollid))
continue;

This clause does not depend on the partition key, so there's no point
in trying to match this again for the next partition key.

This item is perhaps not so important, as it's only a small
inefficiency, but I just wanted to point it out. Also, note that plain
Var partition keys cannot be duplicated, but expressions can, so there
may be cases that you don't want to change to `break`

Other conditions which possibly should change to `break` instead of
`continue` include:

/* Only IS [NOT] TRUE/FALSE are any good to us */
if (btest->booltesttype == IS_UNKNOWN ||
btest->booltesttype == IS_NOT_UNKNOWN)
continue;

16. The following code looks a bit fragile:

leftop = IsA(clause, Var)
? (Expr *) clause
: (Expr *) get_notclausearg((Expr *) clause);

This appears to assume that the partition key will be a plain Var and
not an expression. I tried to break this with:

create table bp (a bool, b bool) partition by ((a < b));
create table bp_true partition of bp for values in('t');
explain select * from bp where (a < b);

however, naturally, the parser builds an OpExpr instead of a
BooleanTest for this case. If it had built a BooleanTest, then the
above code would mistakenly call get_notclausearg on the (a < b) Expr.

Do you have reason to believe that the code is safe and a good idea?

17. Which relation is the comment talking about?

/*
* get_partitions_from_args
*
* Returns the set of partitions of relation, each of which satisfies some
* clause in or_args.
*/
static Bitmapset *
get_partitions_from_or_args(PartitionPruneContext *context, List *or_args)

18. "sets a field", would it not be better to mention constfalse?:

* returns right after finding such a clause and before returning, sets a field
* in context->clauseinfo to inform the caller that we found such clause.

19. "clauses"

* partitioning, we don't require all of eqkeys to be operator clausses.

20. There does not seem to be a need to palloc0 here. palloc seems fine.

keys->ne_datums = (Datum *)
palloc0(list_length(clauseinfo->ne_clauses) *
sizeof(Datum));

This, of course, may leave unset memory in any unused items, but you
never iterate beyond what n_ne_datums gets set to anyway, so I don't
see the need to zero any extra elements.

21. A code comment should be added to the following code to mention
that these are not arrays indexed by partition key as they're only
ever used for LIST partitioning, which only supports a single key.

/* Datum values from clauses containing <> operator */
Datum *ne_datums;
int n_ne_datums;

22. Can you include: "'keyisnotnull' may also be set for the given
partition key when a strict OpExpr is encountered" in the following
comment?

* Based on a IS NULL or IS NOT NULL clause that was matched to a partition
* key, the corresponding bit in 'keyisnull' or 'keyisnotnull' is set.

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Peter Eisentraut 2018-02-18 03:43:38 Re: BUG #15044: materialized views incompatibility with logical replication in postgres 10
Previous Message Michael Paquier 2018-02-18 00:53:27 Re: missing toast table for pg_policy