Re: [HACKERS] path toward faster partition pruning

From: Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>
To: David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>, Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>, Amit Langote <amitlangote09(at)gmail(dot)com>, Rajkumar Raghuwanshi <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Beena Emerson <memissemerson(at)gmail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [HACKERS] path toward faster partition pruning
Date: 2018-02-19 09:19:29
Message-ID: 9cd2f096-b174-8b4c-0b68-b597ef5d93f0@lab.ntt.co.jp
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi David.

Thanks a lot for the review comments. Replying to all of your comments.

On 2018/02/17 18:39, David Rowley wrote:
> On 15 February 2018 at 18:57, Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp> wrote:
>> Here is an updated version.
>
> Thanks for sending v27. I've had a quick look over it while I was
> working on the run-time prune patch. However, I've not quite managed a
> complete pass of this version yet
>
> A couple of things so far:
>
> 1. Following loop;
>
> for (i = 0; i < partnatts; i++)
> {
> if (bms_is_member(i, keys->keyisnull))
> {
> /* Only the default partition accepts nulls. */
> if (partition_bound_has_default(boundinfo))
> return bms_make_singleton(boundinfo->default_index);
> else
> return NULL;
> }
> }
>
> could become:
>
> if (partition_bound_has_default(boundinfo) &&
> !bms_is_empty(keys->keyisnull)
> return bms_make_singleton(boundinfo->default_index);
> else
> return NULL;
>
> 2. Is the following form of loop necessary?
>
> for (i = 0; i < partnatts; i++)
> {
> if (bms_is_member(i, keys->keyisnull))
> {
> keys->n_eqkeys++;
> keyisnull[i] = true;
> }
> }
>
> Can't this just be:
>
> i = -1;
> while ((i = bms_next_member(keys->keyisnull, i)) >= 0)
> {
> keys->n_eqkeys++;
> keyisnull[i] = true;
> }
>
> Perhaps you can just Assert(i < partnatts), if you're worried about that.
>
> Similar code exists in get_partitions_for_keys_range

Both 1 and 2 are good suggestions, so done that way.

>
> 3. Several comments mention partition_bound_bsearch, but there is now
> no such function.
>
> 4. "us" should be "is"
>
> * not be any unassigned range to speak of, because the range us unbounded

Fixed.

> 5. The following code is more complex than it needs to be:
>
> /*
> * Since partition keys with nulls are mapped to the default range
> * partition, we must include the default partition if some keys
> * could be null.
> */
> if (keys->n_minkeys < partnatts || keys->n_maxkeys < partnatts)
> {
> for (i = 0; i < partnatts; i++)
> {
> if (!bms_is_member(i, keys->keyisnotnull))
> {
> include_def = true;
> break;
> }
> }
> }
>
>
> Instead of the for loop, couldn't you just write:
>
> include_def = (bms_num_members(keys->keyisnotnull) < partnatts);

Indeed, it's much simpler. Though, I wrote it as:

+ if (bms_num_members(keys->keyisnotnull) < partnatts)
+ include_def = true;

> 6. The following comment is not well written:
>
> * get_partitions_excluded_by_ne_datums
> *
> * Returns a Bitmapset of indexes of partitions that can safely be removed
> * due to each such partition's every allowable non-null datum appearing in
> * a <> opeartor clause.
>
> Maybe it would be better to write:
>
> * get_partitions_excluded_by_ne_datums
> *
> * Returns a Bitmapset of partition indexes that can safely be removed due to
> * the discovery of <> clauses for each datum value allowed in the partition.
>
> if not, then "opeartor" needs the spelling fixed.

Sure, your rewrite sounds much better.

> 7. "The following"
>
> * Followig entry points exist to this module.

Fixed.

>
> Are there any other .c files where we comment on all the extern
> functions in this way? I don't recall seeing it before.

Hmm, not like the way this patch does, but some files in the executor do
have such introductory comments using description of functions exported by
the module. See for example, nodeSeqscan.c, execMain.c, etc.

Not saying that this is the best way to introduce the module, but this is
the one I went with for now. If this format is not very informative, I'm
willing to rewrite it some other way.

> 8. The following may as well just: context.partnatts = partnatts;
>
> context.partnatts = rel->part_scheme->partnatts;
>
> 9. Why palloc0? Wouldn't palloc be ok?
>
> context.partkeys = (Expr **) palloc0(sizeof(Expr *) *
> context.partnatts);
>
> Also, no need for context.partnatts, just partnatts should be fine.

Fixed both. Yes, palloc suffices.

> 10. I'd rather see bms_next_member() used here:
>
> /* Add selected partitions' RT indexes to result. */
> while ((i = bms_first_member(partindexes)) >= 0)
> result = bms_add_member(result, rel->part_rels[i]->relid);
>
> There's not really much use for bms_first_member these days. It can be
> slower due to having to traverse the unset lower significant bits each
> loop. bms_next_member starts where the previous loop left off.

Thanks for clarifying. Used bms_next_member().

> Will try to review more tomorrow.
>
On 2018/02/18 11:25, David Rowley wrote:
> As I mentioned yesterday, here's the remainder of the review:
>
> 11. The following comment is misleading. It says: "We currently handle
> two such cases:", then it goes on to say the 2nd case is not handled.
>
> /*
> * Handle cases where the clause's operator does not belong to
> * the partitioning operator family. We currently handle two
> * such cases: 1. Operators named '<>' are not listed in any
> * operator family whatsoever, 2. Ordering operators like '<'
> * are not listed in the hash operator families. For 1, check
> * if list partitioning is in use and if so, proceed to pass
> * the clause to the caller without doing any more processing
> * ourselves. 2 cannot be handled at all, so the clause is
> * simply skipped.
> */

You're right. We don't really "handle" 2. Rewrote the comment like this:

+ * Normally we only bother with operators that are listed as
+ * being part of the partitioning operator family. But we
+ * make an exception in one case -- operators named '<>' are
+ * not listed in any operator family whatsoever, in which
+ * case, we try to perform partition pruning with it only if
+ * list partitioning is in use.

>
> 12. The following code should test for LIST partitioning before doing
> anything else:
>
> if (!op_in_opfamily(opclause->opno, partopfamily))
> {
> Oid negator;
>
> /*
> * To confirm if the operator is really '<>', check if its
> * negator is a equality operator. If it's a btree
> * equality operator *and* this is a list partitioned
> * table, we can use it prune partitions.
> */
> negator = get_negator(opclause->opno);
> if (OidIsValid(negator) &&
> op_in_opfamily(negator, partopfamily))
> {
> Oid lefttype;
> Oid righttype;
> int strategy;
>
> get_op_opfamily_properties(negator, partopfamily,
> false,
> &strategy,
> &lefttype, &righttype);
>
> if (strategy == BTEqualStrategyNumber &&
> context->strategy == PARTITION_STRATEGY_LIST)
> is_ne_listp = true;
> }
>
> /* Cannot handle this clause. */
> if (!is_ne_listp)
> continue;
> }
>
> The code should probably be in the form of:
>
> if (!op_in_opfamily(opclause->opno, partopfamily))
> {
> if (context->strategy != PARTITION_STRATEGY_LIST)
> continue;
>
> ...
>
> if (strategy == BTEqualStrategyNumber)
> is_ne_listp = true;
> }
>
> that way we'll save 3 syscache lookups when a <> clause appears in a
> RANGE or HASH partitioned table.

Good idea, done.

> 13. The following code makes assumptions that the partitioning op
> family is btree:
>
> /*
> * In case of NOT IN (..), we get a '<>', which while not
> * listed as part of any operator family, we are able to
> * handle it if its negator is an equality operator that
> * is in turn part of the operator family.
> */
> if (!op_in_opfamily(saop_op, partopfamily))
> {
> Oid negator = get_negator(saop_op);
> int strategy;
> Oid lefttype,
> righttype;
>
> if (!OidIsValid(negator))
> continue;
> get_op_opfamily_properties(negator, partopfamily, false,
> &strategy,
> &lefttype, &righttype);
> if (strategy != BTEqualStrategyNumber)
> continue;
> }
>
> this might not be breakable today, but it could well break in the
> future, for example, if hash op family managed to grow two more
> strategies, then we could get a false match on the matching strategy
> numbers (both 3).

I guess we'd be able to anything useful with a NOT IN (..) only with list
partitioning, so I changed the code and surrounding comments like I did
per your comment above. So, if we process it only if list partitioning is
used, the existing code can safely assume that partitioning operator
family is btree.

> 14. The following code assumes there will be a right op:
>
> if (IsA(clause, OpExpr))
> {
> OpExpr *opclause = (OpExpr *) clause;
> Expr *leftop,
> *rightop,
> *valueexpr;
> bool is_ne_listp = false;
>
> leftop = (Expr *) get_leftop(clause);
> if (IsA(leftop, RelabelType))
> leftop = ((RelabelType *) leftop)->arg;
> rightop = (Expr *) get_rightop(clause);
>
> This'll crash with the following:
>
> create function nonzero(p int) returns bool as $$ begin return (p <>
> 0); end;$$ language plpgsql;
> create operator # (procedure = nonzero, leftarg = int);
> create table listp (a int) partition by list (a);
> create table listp1 partition of listp for values in(1);
> select * from listp where a#;
> server closed the connection unexpectedly
> This probably means the server terminated abnormally
> before or while processing the request.
> The connection to the server was lost. Attempting reset: Failed.
>
> You need to ensure that there are 2 args.

Oops, so we can't prune with unary operator clause. Added the check on
number of args.

> 15. Various if tests in extract_partition_clauses result in a
> `continue` when they should perhaps be a `break` instead.
>
> For example:
>
> if (!PartCollMatchesExprColl(partcoll, opclause->inputcollid))
> continue;
>
> This clause does not depend on the partition key, so there's no point
> in trying to match this again for the next partition key.
>
> This item is perhaps not so important, as it's only a small
> inefficiency, but I just wanted to point it out. Also, note that plain
> Var partition keys cannot be duplicated, but expressions can, so there
> may be cases that you don't want to change to `break`
>
> Other conditions which possibly should change to `break` instead of
> `continue` include:
>
> /* Only IS [NOT] TRUE/FALSE are any good to us */
> if (btest->booltesttype == IS_UNKNOWN ||
> btest->booltesttype == IS_NOT_UNKNOWN)
> continue;

You're quite right. Replaced all those continue's with break.

> 16. The following code looks a bit fragile:
>
> leftop = IsA(clause, Var)
> ? (Expr *) clause
> : (Expr *) get_notclausearg((Expr *) clause);
>
> This appears to assume that the partition key will be a plain Var and
> not an expression. I tried to break this with:
>
> create table bp (a bool, b bool) partition by ((a < b));
> create table bp_true partition of bp for values in('t');
> explain select * from bp where (a < b);
>
> however, naturally, the parser builds an OpExpr instead of a
> BooleanTest for this case. If it had built a BooleanTest, then the
> above code would mistakenly call get_notclausearg on the (a < b) Expr.
>
> Do you have reason to believe that the code is safe and a good idea?

Thanks for the example. I think there were problems here. The way the
current code matched these specially shaped clauses with a Boolean
partition key wouldn't have resulted in correct pruning for your example
and perhaps more cases.

I looked at match_boolean_index_clause() and decided that we needed
something similar here. So, in the updated patch, I've added a
match_boolean_partition_clause(), which will be called at the beginning of
the loop to recognize such specially shaped clauses as applicable to a
Boolean partition key and add them to the set of pruning clauses. With
the new code:

create table bp (a int, b int) partition by list ((a < b));
create table bp_true partition of bp for values in ('true');
create table bp_false partition of bp for values in ('false');

explain select * from bp where (a < b) is true;
QUERY PLAN
----------------------------------------------------------------
Append (cost=0.00..38.25 rows=753 width=8)
-> Seq Scan on bp_true (cost=0.00..38.25 rows=753 width=8)
Filter: ((a < b) IS TRUE)
(3 rows)

explain select * from bp where (a < b) is false;
QUERY PLAN
------------------------------------------------------------------
Append (cost=0.00..38.25 rows=1507 width=8)
-> Seq Scan on bp_false (cost=0.00..38.25 rows=1507 width=8)
Filter: ((a < b) IS FALSE)
(3 rows)

explain select * from bp where (a < b) is not false;
QUERY PLAN
----------------------------------------------------------------
Append (cost=0.00..38.25 rows=753 width=8)
-> Seq Scan on bp_true (cost=0.00..38.25 rows=753 width=8)
Filter: ((a < b) IS NOT FALSE)
(3 rows)

explain select * from bp where (a < b) = true;
QUERY PLAN
----------------------------------------------------------------
Append (cost=0.00..38.25 rows=753 width=8)
-> Seq Scan on bp_true (cost=0.00..38.25 rows=753 width=8)
Filter: (a < b)
(3 rows)

explain select * from bp where (a < b) = false;
QUERY PLAN
-----------------------------------------------------------------
Append (cost=0.00..38.25 rows=753 width=8)
-> Seq Scan on bp_false (cost=0.00..38.25 rows=753 width=8)
Filter: (a >= b)
(3 rows)

explain select * from bp where (a < b);
QUERY PLAN
----------------------------------------------------------------
Append (cost=0.00..38.25 rows=753 width=8)
-> Seq Scan on bp_true (cost=0.00..38.25 rows=753 width=8)
Filter: (a < b)
(3 rows)

explain select * from bp where not (a < b);
QUERY PLAN
-----------------------------------------------------------------
Append (cost=0.00..38.25 rows=753 width=8)
-> Seq Scan on bp_false (cost=0.00..38.25 rows=753 width=8)
Filter: (a >= b)
(3 rows)

> 17. Which relation is the comment talking about?
>
> /*
> * get_partitions_from_args
> *
> * Returns the set of partitions of relation, each of which satisfies some
> * clause in or_args.
> */
> static Bitmapset *
> get_partitions_from_or_args(PartitionPruneContext *context, List *or_args)

Fixed the comment.

> 18. "sets a field", would it not be better to mention constfalse?:
>
> * returns right after finding such a clause and before returning, sets a field
> * in context->clauseinfo to inform the caller that we found such clause.

You're right.

> 19. "clauses"
>
> * partitioning, we don't require all of eqkeys to be operator clausses.

Fixed.

> 20. There does not seem to be a need to palloc0 here. palloc seems fine.
>
> keys->ne_datums = (Datum *)
> palloc0(list_length(clauseinfo->ne_clauses) *
> sizeof(Datum));
>
> This, of course, may leave unset memory in any unused items, but you
> never iterate beyond what n_ne_datums gets set to anyway, so I don't
> see the need to zero any extra elements.

Agreed about using palloc.

> 21. A code comment should be added to the following code to mention
> that these are not arrays indexed by partition key as they're only
> ever used for LIST partitioning, which only supports a single key.
>
> /* Datum values from clauses containing <> operator */
> Datum *ne_datums;
> int n_ne_datums;

OK, done. Also adjusted the comment above the struct's definition.

> 22. Can you include: "'keyisnotnull' may also be set for the given
> partition key when a strict OpExpr is encountered" in the following
> comment?
>
> * Based on a IS NULL or IS NOT NULL clause that was matched to a partition
> * key, the corresponding bit in 'keyisnull' or 'keyisnotnull' is set.

Done.

Attached updated patches. Thanks again!

Regards,
Amit

Attachment Content-Type Size
v28-0001-Modify-bound-comparision-functions-to-accept-mem.patch text/plain 6.5 KB
v28-0002-Refactor-partition-bound-search-functions.patch text/plain 8.2 KB
v28-0003-Add-parttypid-partcollation-partsupfunc-to-Parti.patch text/plain 5.2 KB
v28-0004-Faster-partition-pruning.patch text/plain 105.2 KB
v28-0005-Add-only-unpruned-partitioned-child-rels-to-part.patch text/plain 24.2 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message David Rowley 2018-02-19 09:40:02 extern keyword incorrectly used in some function definitions
Previous Message David Rowley 2018-02-19 09:02:24 Re: [HACKERS] Removing [Merge]Append nodes which contain a single subpath