Re: [Sender Address Forgery]Re: [HACKERS] path toward faster partition pruning

From: Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>
To: David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>
Cc: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>, Rajkumar Raghuwanshi <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Beena Emerson <memissemerson(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [Sender Address Forgery]Re: [HACKERS] path toward faster partition pruning
Date: 2018-01-11 09:52:09
Message-ID: 943fc1d6-f828-424d-890a-6f70c3a54dc4@lab.ntt.co.jp
Views: Raw Message | Whole Thread | Download mbox
Thread:
Lists: pgsql-hackers

Hi David.

Thanks a lot of the review. I'm replying to your multiple emails here.

On 2017/12/28 11:07, David Rowley wrote:
> I've just completed a pass over the v17 patch set. I've found a number
> of things that need to be addressed. Some might seem a bit nit-picky,
> sorry about that. However, many of the others genuinely need to be
> addressed.
>
> 1. The following code calls RelationGetPartitionQual, but the output
> of that is only needed when partition_bound_has_default(boundinfo) is
> true.
>
> Can you change this to only get the RelationGetPartitionQual when it's required?

OK, done that way.

> 2. The header comment in match_clauses_to_partkey() does not give any
> warning that 'clauses' is modified within the function.
>
> The function should create a copy of the clauses before modifying
> them. This will save you having to do any list_copy calls when you're
> calling the function.

OK, added a list_copy on clauses at the beginning of the function.

>
> The header comment is also not very clear about what the return value
> of the function is.

Fixed the comment to describe return values.

> 3. "method" I think should be "strategy". We've pretty much
> standardised on that term everywhere else, so let's keep to standard.
>
> /*
> * Does the query specify a key to be null or not null? Partitioning
> * handles null partition keys specially depending on the partitioning
> * method in use, we store this information.
> */

Fixed.

> 4. "relation" should be in single quotes, since you're talking about
> the parameter named "relation". Likewise with "partclauses", otherwise
> it just seems like bad English.
>
> * Determine the set of partitions of relation that will satisfy all
> * the clauses contained in partclauses

Fixed.

> 5. partdesc's assignment can be delayed until it's needed. This will
> save generating it when constfalse == true
>
> static Bitmapset *
> get_partitions_from_clauses_recurse(Relation relation, int rt_index,
> List *clauses)
> {
> PartitionDesc partdesc = RelationGetPartitionDesc(relation);

Moved the whole line to a tiny block where partdesc is used.

> 6. In the following comment, I'd have expected almost identical code,
> just with some other List, but the code probably differs a bit too
> much to use "Ditto".
>
> /*
> * Ditto, but this time or_clauses.
> */

Hmm, OK. I tried rewriting the comment a bit.

>
> 7. Comment claims we use "set union", but we're really just collecting
> the members from the other set in:
>
> /*
> * Partition sets obtained from mutually-disjunctive clauses are
> * combined using set union.
> */
> result = bms_add_members(result, arg_partset);

Guess the comment doesn't really add much in this case, so just deleted it.

> 8. These arrays could just be initialized up to partkey->partnatts.
> I'd imagine most of the time this will be just 1, so would save
> needlessly setting the 31 other elements, although, perhaps it's a bad
> idea to optimize this.
>
> memset(keyclauses_all, 0, sizeof(keyclauses_all));
> /* false means we don't know if a given key is null */
> memset(keyisnull, false, sizeof(keyisnull));
> /* false means we don't know if a given key is not null */
> memset(keyisnotnull, false, sizeof(keyisnull));
>
> The last two of these could just be Bitmapsets and you'd not need any
> memset at all. PARTITION_MAX_KEYS just so happens to be the same as
> BITS_PER_BITMAPWORD in a standard build, so you'd always be able to
> mark everything with a single bitmap word. This would help a bit in
> the various places that you're counting the true elements, for
> example, the following code:
>
> for (i = 0; i < partkey->partnatts; i++)
> {
> if (!keys->keyisnotnull[i])
> {
> include_def = true;
> break;
> }
> }
>
> could become:
>
> include_def = !bms_is_empty(keys->keyisnotnull);
>
> if you converted all these to Bitmapsets.

I liked the idea of using Bitmapset for keyisnull and keyisnotnull, so
implemented it.

> 9. The following comment would be better read as: /* clause does not
> match this partition key */
>
> /* Clause not meant for this column. */
>
> 10. The following comment talks about handling less than operators for
> hash opfamilies, but the code only handles <> for btree and list
> partitioning.
>
> * Handle some cases wherein the clause's operator may not
> * belong to the partitioning operator family. For example,
> * operators named '<>' are not listed in any operator
> * family whatsoever. Also, ordering opertors like '<' are
> * not listed in the hash operator family.
>
> "opertors" should be spelled "operators"

Rewrote the comments here a bit.

> 11. In the following comment "operator" should be "operators":
>
> * are constrained by clauses containing equality operator, unless hash
>
> Likewise for:
>
> * IS NULL clauses while remaining have clauses with equality operator.

Fixed.

> 12. The following code in classify_partition_bounding_keys probably
> should use EXPR_MATCHES_PARTKEY.
>
> /* Does leftop match with this partition key column? */
> if ((IsA(arg, Var) &&
> ((Var *) arg)->varattno == partattno) ||
> equal(arg, partexpr))

Done.

> 13. The comment for classify_partition_bounding_keys does not seem to
> define well what the meaning of the return value is:
>
> * classify_partition_bounding_keys
> * Classify partition clauses into equal, min, and max keys, along with
> * any Nullness constraints and return that information in the output
> * argument keys (number of keys is the return value)
>
> I had thought all along that this must mean the number of distinct
> partition keys that we've found useful clauses for, but it's possible
> to fool this with duplicate IS NULL checks.
>
> create table hp (a int, b text) partition by hash (a, b);
> create table hp0 partition of hp for values with (modulus 4, remainder 0);
> create table hp3 partition of hp for values with (modulus 4, remainder 3);
> create table hp1 partition of hp for values with (modulus 4, remainder 1);
> create table hp2 partition of hp for values with (modulus 4, remainder 2);
>
> explain select * from hp where a is null and a is null;
>
> This causes classify_partition_bounding_keys to return 2. I can't see
> any bad side effects of this, but I think the comment needs to be
> improved. Perhaps it's better just to make the function return bool,
> returning true if there are any useful keys?

OK, I made classify_partition_bounding_keys() return a bool instead.

> 14. The switch statement in partition_op_strategy could be simplified
> a bit and be written more as:
>
> case BTLessEqualStrategyNumber:
> *incl = true;
> /* fall through */
> case BTLessStrategyNumber:
> result = PART_OP_LESS;
> break;

That's better, done.

> 15. No default case statements in partition_op_strategy() could result
> in "result" not being set.

I added a default case, but it's simply an elog(ERROR, ...).

> 16. I'm unsure what the following comment means:
>
> * To set eqkeys, we must have found the same for partition key columns.
>
> The word "for" seems wrong, or I'm not sure what it wants to ensure it
> finds the same for.

I rewrote this comment a bit, since like you, I too am no longer able to
make sense of it. Just wanted to say here that to set keys->eqkeys at
all, we must have found matching clauses containing equality operators for
all partition keys columns.

> 17. Header comment for partition_op_strategy is out-of-date.
>
> /*
> * Returns -1, 0, or 1 to signify that the partitioning clause has a </<=,
> * =, and >/>= operator, respectively. Sets *incl to true if equality is
> * implied.
> */
>
> It'll never return -1.

Oops, fixed.

> 18. The following comment has some grammar issues:
>
> /*
> * If couldn't coerce to the partition key type, that is, the type of
> * datums stored in PartitionBoundInfo, no hope of using this
> * expression for anything partitioning-related.
> */
>
> Would be better with:
>
> /*
> * If we couldn't coerce the partition key type, that is, the type
> * of datums stored in PartitionBoundInfo, then there's no hope of
> * using this expression for anything partitioning-related.
> */

OK, done.

> 19. In partkey_datum_from_expr() the Assert(false) at the end seems
> overkill. You could just get rid of the default in the switch
> statement and have it fall through to the final return. This would
> save 4 lines of code.

Actually, let's just get rid of the switch. If I remove the default case
like you suggest, there are NodeTag enum values not handled warnings.

> 20. The word "and" I think needs removed from the following comment:
>
> * Couldn't compare; keep hash_clause set to the previous value and
> * so add this one directly to the result. Caller would

Actually, it seems better to keep the "and" and remove the "so".

> Probably also needs a comment after "value"

Done.

> 21. The following comment seems to indicate 'cur' is an operator, but
> it's a PartClause:
>
> /* The code below is for btree operators, which cur is not. */
>
> It might be better to write
>
> /* The code below handles Btree operators which are not relevant to a
> hash-partitioned table. */

Agreed, done.

> 22. "Stuff" should be "The code" in:
>
> * Stuff that follows closely mimics similar processing done by

Done.

> 23. In remove_redundant_clauses() and various other places, you have a
> variable named partattoff. What's the meaning of this name? I'd
> imagine it is short for "partition attribute offset", but it's a
> "partition key index", so why not "partkeyidx"?
>
> Of course, you might claim that an array index is just an offset, but
> it is a little confusing if you think of attribute offsets in a
> TupleDesc.

Hmm, OK. Replaced partattoff with partkeyidx.

> 24. get_partitions_for_keys() you're using "return" and "break" in the
> switch statement. You can remove the breaks;

Oops, fixed.

> 25. The Assert(false) in get_partitions_for_keys seems overkill. I
> think it's fine to just do:
>
> return NULL; /* keep compiler quiet */

Done, too.

> 26. The following code comment in get_partitions_for_keys_hash() does
> not seem very well written:
>
> * Hash partitioning handles puts nulls into a normal partition and
> * doesn't require to define a special null-accpting partition.
> * Caller didn't count nulls as a valid key; do so ourselves.
>
> Maybe "puts" should be "storing"?
>
> Also, does hash partitioning actually support a NULL partition? This
> seems to say it doesn't require, but as far as I can see it does not
> *support* a NULL partition. The comment is a bit misleading.
> "accpting" should be "accepting".

I rewrote that comment. I wanted to say that, unlike range and list
partitioning which have special handling for null values (there is only
one list/range partition at any given time that could contain nulls in the
partition key), hash partitioning does not. All hash partitions could
contain nulls in one or more of the partition keys and hence we must
consider nulls as regular equality keys.

> 27. In get_partitions_for_keys_hash() why is it possible to get a
> result_index below 0? In that case, the code will end up triggering
> the Assert(false), but if you want to Assert something here then maybe
> it should be Assert(result_index >= 0)?
>
> if (result_index >= 0)
> return bms_make_singleton(result_index);

I too used to think that it's impossible to get result_index < 0, but it's
actually possible, because not all required hash partitions may have been
defined yet:

create table hashp (a int) partition by hash (a);
create table hashp0 partition of hashp for values with (modulus 4,
remainder 0);

If user only defines one partition like shown above, then there are no
partitions for when the remainders for a given partition key turns out to
be 1, 2, or 3.

create table hashp (a int) partition by hash (a);
create table hashp0 partition of hashp for values with (modulus 4,
remainder 0);

insert into hashp values (1);
INSERT 0 1
explain select * from hashp where a = 1;
QUERY PLAN
--------------------------------------------------------------
Append (cost=0.00..41.88 rows=13 width=4)
-> Seq Scan on hashp0 (cost=0.00..41.88 rows=13 width=4)
Filter: (a = 1)
(3 rows)

insert into hashp values (2);
ERROR: no partition of relation "hashp" found for row
DETAIL: Partition key of the failing row contains (a) = (2).

explain select * from hashp where a = 2;
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.

I guess I forgot to remove that Assert(false) when fixing the code after
realizing that it's indeed possible to not get a hash partition for some keys.

Removed the Assert.

> 28. Is there any point in the following loop in get_partitions_for_keys_list()?
>
> /*
> * We might be able to get the answer sooner based on the nullness of
> * keys, so get that out of the way.
> */
> for (i = 0; i < partkey->partnatts; i++)
> {
>
> Won't partkey->partnatts always be 1? Maybe you can Assert that
> instead, just in case someone forgets to change the code if LIST is to
> support multiple partition keys.

I guess that's a leftover from when all partition strategies were handled
in one function. Got rid of the loop and added the Assert.

>
> 29. In get_partitions_for_keys_list and get_partitions_for_keys_range
> partdesc is only used for an Assert. Maybe you can just:
>
> Assert(RelationGetPartitionDesc(rel)->nparts > 0);

OK, done.

> 30. The Assert(false) in get_partitions_for_keys_list() looks
> suspiciously easy to hit...
>
> create table lp (a int not null) partition by list(a);
> create table lp1 partition of lp for values in(1);
> explain select * from lp where a > 11; -- Assert fail!
>
> 31. ranget?
>
> * get_partitions_for_keys_range
> * Return partitions of a ranget partitioned table for requested keys

The Assert indeed was easy to hit. Removed.

> 32. I'm not quite sure what this comment is saying:
>
> /*
> * eqoff is gives us the bound that is known to be <= eqkeys,
> * given how partition_bound_bsearch works. The bound at eqoff+1,
> * then, would be the upper bound of the only partition that needs
> * to be scanned.
> */

Rewrote the comment as:

/*
* The bound at eqoff is known to be <= eqkeys, given the way
* partition_bound_bsearch works. Considering the same as the lower
* bound of the partition that eqkeys falls into, the bound at
* eqoff + 1 would be its upper bound, so use eqoff + 1 to get the
* desired partition's index.
*/

Any better?

> 33. "one" -> "the one"
>
> * Find the leftmost bound that satisfies the query, i.e., one that
> * satisfies minkeys.

Fixed.

> 34. Why "will" there be multiple partitions sharing the same prefix?
>
> * If only a prefix of the whole partition key is provided, there will
> * be multiple partitions whose bound share the same prefix. If minkey
>
> Perhaps "will" should be "may"?

I guess you're right.

> 35. I think "would've" in the following comment is not correct. This
> seems to indicate that this has already happened, which it has not.
> "will have to be" might be more correct?
>
> * satisfy the query, but don't have a valid partition assigned. The
> * default partition would've been included to cover those values.

"will have to be included" sounds correct to me too.

> 36. "will" -> "with"?
>
> * Since partition keys will nulls are mapped to default range

Fixed.

> 37. "If no" -> "If there's no"
>
> * If no tuple datum to compare with the bound, consider
> * the latter to be greater.

>
> 38. I don't see anything about setting keynullness here:
>
> /*
> * Get the clauses that match the partition key, including information
> * about any nullness tests against partition keys. Set keynullness to
> * a invalid value of NullTestType, which 0 is not.
> */
> partclauses = match_clauses_to_partkey(root, rel,
> list_copy(rel->baserestrictinfo),
> &contains_const,
> &constfalse);

I guess the comment is too ancient. Fixed.

>
> 39. "paritions" -> "partitions"
>
> * Else there are no clauses that are useful to prune any paritions,

Fixed.

> 40. In match_clauses_to_partkey, the following code pulls the varnos
> from each operand of the expression. It would be better to just pull
> the side that's needed (if any) a bit later rather than always doing
> both.
>
> left_relids = pull_varnos((Node *) leftop);
> right_relids = pull_varnos((Node *) rightop);

Ah, done.

> 41. I think a list_copy is missing here:
>
> /*
> * Accumulate the live partitioned children of this child, if it's
> * itself partitioned rel.
> */
> if (childrel->part_scheme)
> partitioned_rels = list_concat(partitioned_rels,
> childrel->live_partitioned_rels);
>
> Surely if we don't perform a list_copy of
> childrel->live_partitioned_rels then subsequent additions to the
> resulting list will inadvertently add new items to the
> childrel->live_partitioned_rels?

Yeah, I think what's in the patch now is clearly hazardous. Added a
list_copy on childrel->live_partitioned_rels.

> 42. Is this because of an existing bug?
>
> @@ -1906,11 +1904,13 @@ explain (costs off) select * from mcrparted
> where abs(b) = 5; -- scans all parti
> Filter: (abs(b) = 5)
> -> Seq Scan on mcrparted3
> Filter: (abs(b) = 5)
> + -> Seq Scan on mcrparted4
> + Filter: (abs(b) = 5)
> -> Seq Scan on mcrparted5
> Filter: (abs(b) = 5)

Actually, to constraint exclusion's eyes, mcrparted4's partition
constraint is refuted by abs(b) = 5.

With the new patch, we simply do not perform any partition pruning for
that clause, because we do not have a constraint on an earlier partition
key column.

> 43. In partition_prune.sql you have a mix of /* */ and -- comments.
> Please just use --

Oops, fixed.

On 2017/12/29 15:47, David Rowley wrote:
> Just a few extras that I found:
>
> 44. In match_clauses_to_partkey you're making use of
> estimate_expression_value(), I don't think this is safe.
>
> if (IsA(estimate_expression_value(root, rightop), Const))
> *contains_const = true;
>
> The only other places I see using this in the planner are for costing
> purposes. Also, the header comment for that function says it's not
> safe. Particularly "This effectively means that we plan using the
> first supplied value of the Param.". If that's the case, then if we're
> planning a generic plan, then wouldn't it be possible that the planner
> chooses the current supplied parameter value and prune away partitions
> based on that value. That would make the plan invalid for any other
> parameter, but it's meant to be a generic plan, so we can't do that.

You might be right. Perhaps, I was thinking of eval_const_expressions()
there, which I guess should be enough for this purpose.

>
> 45. Why use a list_copy() here?
>
> /*
> * For a nested ArrayExpr, we don't know how to get the
> * actual scalar values out into a flat list, so we give
> * up doing anything with this ScalarArrayOpExpr.
> */
> if (arrexpr->multidims)
> continue;
>
> elem_exprs = list_copy(arrexpr->elements);

I guess I was just being paranoid there. No need, so removed.

On 2018/01/10 7:35, David Rowley wrote:
> There's a small problem with get_partitions_for_keys_list(), the
> following case hits the Assert(false) at the bottom of that function.
>
> create table ab_c (a int not null, b char) partition by list(a);
> create table abc_a2 (b char, a int not null) partition by list(b);
> create table abc_a2_b3 partition of abc_a2 for values in ('3');
> alter table ab_c attach partition abc_a2 for values in (2);
>
> select * from ab_c where a between 1 and 2 and b <= '2';

I removed that Assert per your comment above (in an earlier email of yours).

On 2018/01/10 10:55, David Rowley wrote:
> One more thing I discovered while troubleshooting a bug Beena reported
> in the run-time partition pruning patch is that
> classify_partition_bounding_keys properly does;
>
> if (EXPR_MATCHES_PARTKEY(leftop, partattno, partexpr))
> constexpr = rightop;
> else if (EXPR_MATCHES_PARTKEY(rightop, partattno, partexpr))
> constexpr = leftop;
> else
> /* Clause not meant for this column. */
> continue;
>
> for OpExpr clauses, but does not do the same for leftop for the
> ScalarArrayOpExpr test.

I'm not sure why we'd need to do that? Does the syntax of clauses that
use a ScalarArrayOpExpr() allow them to have the partition key on RHS?
Can you point me to the email where Beena reported the problem in question?

On 2018/01/10 13:18, David Rowley wrote:
> Beena's tests on the run-time partition pruning patch also indirectly
> exposed a problem with this patch.
>
> Basically, the changes to add_paths_to_append_rel() are causing
> duplication in partition_rels.
>
> A test case is:
>
> create table part (a int, b int) partition by list(a);
> create table part1 partition of part for values in(1) partition by list (b);
> create table part2 partition of part1 for values in(1);
>
> select * from part;
>
> partition_rels ends up with 3 items in the list, but there's only 2
> partitions here. The reason for this is that, since planning here is
> recursively calling add_paths_to_append_rel, the list for part ends up
> with itself and part1 in it, then since part1's list already contains
> itself, per set_append_rel_size's "rel->live_partitioned_rels =
> list_make1_int(rti);", then part1 ends up in the list twice.

It seems that I found the problem. Currently, the set_append_rel_size()
step already accumulates the full list in the root partitioned table's
rel->live_partitioned_rels, that is, the list of RT indexes of *all*
partitioned relations in the tree. Then, when add_paths_to_append_rel()
tries to accumulate child rel's live_partitioned_rels into the parent's,
duplication occurs, because the latter already contains all the entries as
compiled by the earlier step. I think having only the latter do the
accumulation is better, because even partition-wise join code needs this
facility and it only ever calls add_paths_to_append_rel().

Attached updated version of the patch set containing fixes for almost all
the things mentioned above.

Thanks again for your thoughtful review comments and sorry that I couldn't
reply sooner.

Thanks,
Amit

Attachment Content-Type Size
v18-0001-Some-interface-changes-for-partition_bound_-cmp-.patch text/plain 11.6 KB
v18-0002-Introduce-a-get_partitions_from_clauses.patch text/plain 63.8 KB
v18-0003-Move-some-code-of-set_append_rel_size-to-separat.patch text/plain 8.6 KB
v18-0004-More-refactoring-around-partitioned-table-Append.patch text/plain 12.7 KB
v18-0005-Teach-planner-to-use-get_partitions_from_clauses.patch text/plain 44.6 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Sergei Kornilov 2018-01-11 09:55:27 Re: [HACKERS] Restricting maximum keep segments by repslots
Previous Message Antonin Houska 2018-01-11 09:34:42 Re: [HACKERS] Secondary index access optimizations