Re: [HACKERS] path toward faster partition pruning

From: Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>
To: David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>, Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>, Amit Langote <amitlangote09(at)gmail(dot)com>, Rajkumar Raghuwanshi <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Beena Emerson <memissemerson(at)gmail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [HACKERS] path toward faster partition pruning
Date: 2018-02-20 07:04:23
Message-ID: 2ebcdf32-66eb-2f82-5f6b-e8a464cfe6a7@lab.ntt.co.jp
Views: Raw Message | Whole Thread | Download mbox
Thread:
Lists: pgsql-hackers

Hi David.

Thanks for the review.

On 2018/02/19 22:40, David Rowley wrote:
> On 19 February 2018 at 22:19, Amit Langote
> <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp> wrote:
>> Attached updated patches. Thanks again!
>
> Thanks for making those changes. I've made another pass of v28 and
> have a few more comments.
>
> The patch is starting to look good, but there are some new changes in
> recent versions which still don't look quite right.
>
> 1. This does not fully make sense:
>
> /*
> * Remove the indexes of partitions excluded due to each of
> * those partitions' *all* of allowed datums appearing in
> * keys->ne_datums, that is compared to the partition key
> * using <> operator.
> */
>
> "each of those partitions' *all* of allowed" is not correct.
>
> Maybe better to write:
>
> /*
> * Remove the indexes of any partitions which cannot possibly
> * contain rows matching the clauses due to key->ne_datums containing
> * all datum values which are allowed in the given partition. This
> * is only possible to do in LIST partitioning as it's the only
> * partitioning strategy which allows the specification of exact values.
> */

Ah, your rewrite sounds much better.

> 2. Mine does not, but some compilers may complain about
> get_partitions_for_keys() result variable being uninitalised in
> get_partitions_for_keys. Probably the easiest fix would be to just
> assign to NULL in the default case.

Done.

> 3. Did you mean to put this Assert() inside the loop?
>
> memset(keyisnull, false, sizeof(keyisnull));
> i = -1;
> while ((i = bms_next_member(keys->keyisnull, i)) >= 0)
> {
> keys->n_eqkeys++;
> keyisnull[i] = true;
> }
> Assert(i < partnatts);
>
> i will always be -2 at the end of the loop. Seems like a useless
> Assert in its current location.

Face palm! Fixed.

> 4. Can you add a comment here to say: "Note: LIST partitioning only
> supports a single partition key, therefore this function requires no
> looping over the partition keys"
>
> /*
> * get_partitions_for_keys_list
> * Return partitions of a list partitioned table for requested keys
> *
> * This interprets the keys and looks up partitions in the partition bound
> * descriptor using the list partitioning semantics.
> */

OK, done.

> 5. The following comment contains a bit of duplication to the comment
> which comes after it. Maybe the following:
>
> /*
> * If the query is looking for null keys, there can only be one such
> * partition. Return the same if one exists.
> */
>
> can be changed to:
>
> /* Handle clauses requesting a NULL valued partition key */

You're right. Changed.

> 6. The following comment does not quite make sense:
>
> /* Exactly matching datum exists. */
>
> Probably better to write:
>
> /* An exact matching datum exists. */

OK, done.

> 7. "If it's not equal (<)" I think you mean (>), not (<), in:
>
> * The bound at minoff is <= minkeys, given the way
> * partition_list_bsearch() works. If it's not equal (<), then
> * increment minoff to make it point to the datum on the right
> * that necessarily satisfies minkeys. Also do the same if it is
> * equal but minkeys is exclusive.
>
> However, the comment is a bit clumsy. Maybe the following is better?
>
> /*
> * partition_list_bsearch returning a positive number means that
> * minkeys[0] must be greater than or equal to the smallest datum.
> * If we didn't find an exact matching datum (!is_equal) or if the
> * operator used was non-inclusive (>), then in both of these
> * cases we're not interested in the datum pointed to by minoff,
> * but we may start getting matches in the partition which the
> * next datum belongs to, so point to that one instead. (This may
> * be beyond the last datum in the array, but we'll detect that
> * later.)
> */

Your rewrite is much better.

> 8. The following comment could be improved:
>
> * minkeys is greater than the datums of all non-default partitions,
> * meaning there isn't one to return. Return the default partition if
> * one exists.
>
> how about:
>
> * The value of minkeys[0] is greater than all of the datums we have
> * partitions for. The only possible partition that could contain a
> * match is the default partition. Return that, if it exists.

OK, adopted your rewrite.

> 9. The following could also be improved:
>
> * The bound at maxoff is <= maxkeys, given the way
> * partition_list_bsearch works. If the bound at maxoff exactly
> * matches maxkey (is_equal), but the maxkey is exclusive, then
> * decrement maxoff to point to the bound on the left.
>
> how about:
>
> * partition_list_bsearch returning a positive number means that
> * maxkeys[0] must be greater than or equal to the smallest datum.
> * If the match found is an equal match, but the operator used is
> * non-inclusive of that value (<), then the partition belonging
> * to maxoff cannot match, so we'll decrement maxoff to point to
> * the partition belonging to the previous datum. We might end up
> * decrementing maxoff down to -1, but we'll handle that later.

OK, done.

> 10. Can you append " This may not technically be true for some data
> types (e.g. integer types), however, we currently lack any sort of
> infrastructure to provide us with proofs that would allow us to do
> anything smarter here." to:
>
> * For range queries, always include the default list partition,
> * because list partitions divide the key space in a discontinuous
> * manner, not all values in the given range will have a partition
> * assigned.

Hmm, that seems to make sense, so added the text.

I guess you know it already, but I'm trying to say in the comment that
list partitioning, by definition, does not force you to enumerate all
values that a data type may specify to exist.

> 11. get_partitions_for_keys_range seems to prefer to do "minoff -= 1",
> but get_partitions_for_keys_list likes to "minoff--", can this be made
> the same? Personally, I like -- over -= 1 as it's shorter. Although I
> do remember having an argument with my university professor about
> this. He claimed -= 1 was clearer... I'm still unsure what he found so
> confusing about -- ...

I will go with minoff-- for consistency as you say.

> 12. The following code could be optimised a little for the case when
> there's no default:
>
> /*
> * There may exist a range of values unassigned to any non-default
> * partition between the datums at minoff and maxoff.
> */
> for (i = minoff; i <= maxoff; i++)
> {
> if (boundinfo->indexes[i] < 0)
> {
> include_def = true;
> break;
> }
> }
>
> /*
> * Since partition keys with nulls are mapped to the default range
> * partition, we must include the default partition if some keys
> * *could* be null.
> */
> if (bms_num_members(keys->keyisnotnull) < partnatts)
> include_def = true;
>
> if (include_def && partition_bound_has_default(boundinfo))
> result = bms_add_member(result, boundinfo->default_index);
>
> return result;
>
> Maybe something more like:
>
> if (!partition_bound_has_default(boundinfo))
> return result;
>
> /*
> * There may exist a range of values unassigned to any non-default
> * partition between the datums at minoff and maxoff.
> */
> for (i = minoff; i <= maxoff; i++)
> {
> if (boundinfo->indexes[i] < 0)
> return bms_add_member(result, boundinfo->default_index);
> }
>
> /*
> * Since partition keys with nulls are mapped to the default range
> * partition, we must include the default partition if some keys
> * *could* be null.
> */
> if (bms_num_members(keys->keyisnotnull) < partnatts)
> return bms_add_member(result, boundinfo->default_index);
>
> Which is saves a bit of needless work when there's no default to add,
> and also saves a few lines, including the line where you declare the
> include_def variable.

Ah, that's neat. Done that way. I also changed other code that sets
include_def to instead add the default partition index to result right
away, instead of just setting include_def. So, include_def is now gone.

> 13. Variable name:
>
> Bitmapset *partitioned_rels_bms = NULL;
>
> This should likely be called partitioned_relids, and be of type Relids
> instead of Bitmapset.

Makes sense, done.

> 14. This line removal seems surplus. It should probably be fixed
> independently of this patch.
>
> parent_roots[appinfo->child_relid] = subroot;
> -
> continue;

Hadn't noticed that. Fixed.

> 15. I'm unsure how safe the following code is:
>
> while ((parent_rti = bms_first_member(partitioned_rels_bms)) >= 0)
> partitioned_rels = lappend_int(partitioned_rels, parent_rti);
>
> You're now putting this list in ascending order of relid, but some
> code in set_plan_refs assumes the root partition is the first element:
>
> root->glob->rootResultRelations =
> lappend_int(root->glob->rootResultRelations,
> linitial_int(splan->partitioned_rels));
>
> By luck, the first element might today be the root due to the way we
> expand the inheritance hierarchy, but I think this code is wrong to
> rely on that.
>
> I'm not really a fan of having the root partition be the first element
> in the List. I would much rather see a Relids type and a special Index
> field for the root, but that might be more changes that you'd like to
> make here. I just don't think what you have now is correct.

Hmm, so you're saying that it's not future-proof for the code here to
assume that the root table will always get the smallest RT index of the
tables in a given partition tree. I guess it is a concern that exists
independently of the changes this patch makes, but I agree with the
concern. We can write another patch to break that assumption by using the
method you suggest of adding a Index variable to the ModifyTable node to
store the root table index.

> 16. This should probably return Relids rather than Bitmapset *.
>
> Bitmapset *
> prune_append_rel_partitions(PlannerInfo *root, RelOptInfo *rel)
>
> Please update the mention of Bitmapset in the comment at the top of
> the function too.

OK, done.

> 17. hmm, my patch did palloc(), not palloc0(). My original patch was
> broken and missed this, but v2 got it.
>
> context->clauseinfo = partclauseinfo =
> palloc0(sizeof(PartitionClauseInfo));
>
> There's no need to palloc0() here. You're setting all the fields to
> zero just below. As far as I understand it, it's only Node types that
> we have to go through the rigmarole of doing both.

Ah, you're right. Used palloc.

> 18. I tentatively agree with you having changed the continue to break
> in the following:
>
> /* We can't use any volatile value to prune partitions. */
> if (contain_volatile_functions((Node *) valueexpr))
> break;
>
> I believe it's not wrong to break here, but keep in mind you're
> testing valueexpr rather than something with the OpExpr itself. The
> reason I'm not saying this is wrong is that if the valueexpr is
> volatile then it cannot possibly match another partition key anyway,
> so there's likely no point in continuing to look for another match...
> You should likely write a comment to explain this a bit I think all> of the other places you've changed to break look fine. The
> ScalarArrayOpExpr volatile function test is fine to break from since
> the operands cannot be reversed in that case, so rightop certainly
> can't match a partition key.

When working on this comment, I realized we shouldn't really "break" on
the collation mismatch. Multiple keys can have the same expression, but
different collation and "break"ing in this case would mean, we'd miss
matching it to another key that has the matching collation. IOW, order in
which clauses appear in the input list determines if they're matched to
partition keys correctly or not.

See this somewhat made up example:

create table rp (a text) partition by range (substr(a, 1) collate "en_GB",
substr(a, 1) collate "en_US");
create table rp1 partition of rp for values from ('a', 'a') to ('a', 'e');
create table rp2 partition of rp for values from ('a', 'e') to ('a', 'z');
create table rp3 partition of rp for values from ('b', 'a') to ('b', 'e');

For the following query:

select * from rp where substr(a, 1) = 'e' collate "en_US" and substr(a, 1)
= 'a' collate "en_GB";

With the current code, we'll end up discarding (via break) the 1st clause
after its collation (en_US) doesn't match the 1st key's collation (en_GB)
and thus we'll end with only one clause, that is the 2nd one, being added
to matched clauses. That would result in only p3 being pruned, whereas
both rp1 and rp3 should be pruned.

I've fixed that. With the new code, both the expression and the collation
should match before we conclude that the clause matched the partition key
and then check other properties of the clause like operator strictness,
valueexpr volatility, etc. If those other properties are not satisfied,
we can "break", because with those properties they won't be useful for
pruning, even if it matched some other key.

> 19. I might have caused this, but there's no such variable as 'cur'
>
> /* cur is more restrictive, so replace the existing. */

Fixed this and a few other instances.

> 20. Is there a difference between
> partition_bound_has_default(context->boundinfo) and
> context->has_default_part? Any reason for both? Your code uses both. I
> don't yet understand why has_default_part was added.

One cannot use partition_bound_has_default(context->boundinfo) within
partprune.c, because PartitionBoundInfo's definition is private to
partition.c. Maybe, if someday we expose it by exporting it (and related
partition bound manipulating functions) in, say, a partbound.h, we won't
need to have to resort to the same value being stored in two different
places like this.

> That's all I have for now.
>
> It's getting close. Good work!
>
> I keep having to turn up my strictness level each time I review. I'd
> like to be able to turn it up all the way earlier, but I fear I may
> still be reviewing v1 if I'd done that :-)

It's really great that you've looked into these patches in such great
detail, suggesting various performance improvements (big and small) and
also pointing out various edge cases. Thank you very much! :)

Attached updated version.

Regards,
Amit

Attachment Content-Type Size
v29-0001-Modify-bound-comparision-functions-to-accept-mem.patch text/plain 6.5 KB
v29-0002-Refactor-partition-bound-search-functions.patch text/plain 8.2 KB
v29-0003-Add-parttypid-partcollation-partsupfunc-to-Parti.patch text/plain 5.2 KB
v29-0004-Faster-partition-pruning.patch text/plain 110.1 KB
v29-0005-Add-only-unpruned-partitioned-child-rels-to-part.patch text/plain 23.9 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Rajkumar Raghuwanshi 2018-02-20 07:56:11 Re: [HACKERS] path toward faster partition pruning
Previous Message Andrew Dunstan 2018-02-20 06:59:04 Re: ALTER TABLE ADD COLUMN fast default