Re: [Sender Address Forgery]Re: [Sender Address Forgery]Re: [HACKERS] path toward faster partition pruning

From: David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>
To: Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>
Cc: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>, Rajkumar Raghuwanshi <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Beena Emerson <memissemerson(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [Sender Address Forgery]Re: [Sender Address Forgery]Re: [HACKERS] path toward faster partition pruning
Date: 2018-01-12 03:30:38
Views: Raw Message | Whole Thread | Download mbox
Lists: pgsql-hackers

On 12 January 2018 at 15:27, Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp> wrote:
> On 2018/01/11 19:23, David Rowley wrote:

>> ERROR: operator 531 is not a member of opfamily 1976
> You'll be able to see that the error no longer appears with the attached
> updated set of patches, but I'm now seeing that the resulting plan with
> patched for this particular query differs from what master (constraint
> exclusion) produces. Master produces a plan with no partitions (as one
> would think is the correct plan), whereas patched produces a plan
> including the xy1 partition. I will think about that a bit and post
> something later.

Thanks for looking at that.

I've got a few more things for you. I'm only partway through another
pass, but it makes sense to post what I have now if you're working on
a new version.

1. partitioing -> partitioning

* Strategy of a partition clause operator per the partitioing operator class

2. get_partitions_from_clauses() modifies partclauses without
mentioning it in the header. I think you need to either:

a) warn about this in the header comment; or
b) do a list_copy() before list_concat()
c) do list_truncate back to the original length after you're done with the list.

3. get_partitions_from_clauses_recurse(), with:

result = bms_add_range(result, 0, partdesc->nparts - 1);

You could change that to bms_add_range(NULL, ...) and ditch the
assignment of result to NULL at the start of the function.

4. classify_partition_bounding_keys() now returns bool, but the return
statement is still:

return keys->n_eqkeys + keys->n_minkeys + keys->n_maxkeys + n_keynullness;

my compiler didn't warn about that, but I'd imagine some might.

Instead, can you make it:

if (keys->n_eqkeys > 0 || keys->n_minkeys > 0 ||
keys->n_maxkeys > 0 || n_keynullness > 0)
return true;

return false;

probably equal keys are the most likely case, so it'll be good to
short circuit instead of performing addition on a bunch of stuff we
don't care about anymore.

5. In classify_partition_bounding_keys, why do we "continue" here?

clause = rinfo->clause;
if (rinfo->pseudoconstant &&
!DatumGetBool(((Const *) clause)->constvalue))
*constfalse = true;

Is there any point in searching further?

Also, if you were consistent with the return value for
classify_partition_bounding_keys when you've set *constfalse = true;
you wouldn't need to handle the case twice like you are in

6. I think it would be nicer if get_partitions_from_ne_clauses returns
a set of partitions that could be excluded.

So instead of:

* get_partitions_from_ne_clauses
* Return partitions of relation that satisfy all <> operator clauses in
* ne_clauses. Only ever called if relation is a list partitioned table.


* get_partitions_from_ne_clauses
* Returns a Bitmapset of partitions that can be safely excluded due to
* not-equal clauses existing for all possible partition values. It is only
* valid to call this for LIST partitioned tables.

and instead of:

result = bms_add_range(NULL, 0, partdesc->nparts - 1);
result = bms_del_members(result, excluded_parts);

return result;

Just do:

return excluded_parts;

and in get_partitions_from_clauses_recurse(), do bms_del_members
instead of bms_int_members.

there's less bit shuffling and it seems cleaner. Perhaps the function
name would need to be changed if we're inverting the meaning too.

(I've attached a patch which makes this change along with an idea in #8 below)

7. The following comment claims the function sets *datum, but there's
no param by that name:

* partkey_datum_from_expr
* Extract constant value from expr and set *datum to that value
static bool
partkey_datum_from_expr(PartitionKey key, int partkeyidx,
Expr *expr, Datum *value)

8. The code in get_partitions_from_ne_clauses() does perform quite a
few nested loops. I think a more simple way to would be to track the
offsets you've seen in a Bitmapset. This would save you having to
check for duplicates, as an offset can only contain a single datum.
You'd just need to build a couple of arrays after that, one to sum up
the offsets found per partition, and one for the total datums allowed
in the partition. If the numbers match then you can remove the

I've written this and attached it to this email. It saves about 50
lines of code and should perform much better for complex cases, for
example, a large NOT IN list. This also implements #6.

9. "the same" -> "it"

* In case of NOT IN (..), we get a '<>', which while not
* listed as part of any operator family, we are able to
* handle the same if its negator is indeed a part of the
* partitioning operator family.

10. in classify_partition_bounding_keys: "0" -> "false"

/* Return if no work to do below. */
if (!will_compute_keys || *constfalse)
return 0;

Likewise for:

if (*constfalse)
return 0;

11. I don't see partition_bound_bsearch used anywhere below the
following comment:

* Generate bounding tuple(s).
* We look up partitions in the partition bound descriptor using, say,
* partition_bound_bsearch(), which expects a Datum (or Datums if multi-
* column key). So, extract the same out of the constant argument of
* each clause.

I also don't know what the comment is trying to say.


* operator and sets *incl if equality is implied

should be:

* operator and set *incl to true if the operator's strategy is inclusive.

13. What does "the same" mean in:

* and add this one directly to the result. Caller would
* arbitrarily choose one of the many and perform
* partition-pruning with the same. It's possible that mutual

I think you quite often use "the same" to mean "it". Can you change that?

14. Not sure what parameter you're talking about here.

* Evaluate 'leftarg op rightarg' and set *result to its value.
* leftarg and rightarg referred to above actually refer to the constant
* operand (Datum) of the clause contained in the parameters leftarg and
* rightarg below, respectively. And op refers to the operator of the
* clause contained in the parameter op below.

15. "the latter" is normally used when you're referring to the last
thing in a list which was just mentioned. In this case, leftarg_const
and rightarg_const is the list, so "the latter" should mean
rightarg_const, but I think you mean to compare them using the

* If the leftarg_const and rightarg_const are both of the type expected
* by op's operator, then compare them using the latter.

16. There are a few things to improve with the following comment:

* Hash partitioning stores partition keys containing nulls in regular
* partitions. That is, the code that determines the hash partition for
* a given row admits nulls in the partition key when computing the key's
* hash. So, here we treat any IS NULL clauses on partition key columns as
* equality keys, along with any other non-null values coming from equality
* operator clauses.

"admits" is not the correct word here, and "hash" should be "correct",
but there are more mistakes, so might be easier just to rewrite to:

* Since tuples with NULL values in the partition key columns are
stored in regular partitions,
* we'll treat any IS NULL clauses here as regular equality clauses.

17. The following example will cause get_partitions_for_keys_hash to misbehave:

create table hashp (a int, b int) partition by hash (a, b);
create table hashp1 partition of hashp for values with (modulus 4, remainder 0);
create table hashp2 partition of hashp for values with (modulus 4, remainder 1);
create table hashp3 partition of hashp for values with (modulus 4, remainder 3);
create table hashp4 partition of hashp for values with (modulus 4, remainder 2);
explain select * from hashp where a = 1 and a is null;

The following code assumes that you'll never get a NULL test for a key
that has an equality test, and ends up trying to prune partitions
thinking we got compatible clauses for both partition keys.

memset(keyisnull, false, sizeof(keyisnull));
for (i = 0; i < partkey->partnatts; i++)
if (bms_is_member(i, keys->keyisnull))
keyisnull[i] = true;

* Can only do pruning if we know all the keys and they're all equality
* keys including the nulls that we just counted above.
if (keys->n_eqkeys == partkey->partnatts)

The above code will need to be made smarter. It'll likely crash if you
change "b" to a pass-by-ref type.

18. The following code:

int other_idx = -1;

* Only a designated partition accepts nulls, which if there
* exists one, return the same.
if (partition_bound_accepts_nulls(boundinfo) ||
other_idx = partition_bound_accepts_nulls(boundinfo)
? boundinfo->null_index
: boundinfo->default_index;
if (other_idx >= 0)
return bms_make_singleton(other_idx);
return NULL;

should be simplified to:

* NULLs may only exist in the NULL partition, or in the
* default, if there's no NULL partition.
if (partition_bound_accepts_nulls(boundinfo))
return bms_make_singleton(boundinfo->null_index);
else if (partition_bound_has_default(boundinfo))
return bms_make_singleton(boundinfo->default_index);
return NULL;

19. "exists" -> "are"

* If there are no datums to compare keys with, but there exist
* partitions, it must be the default partition.

also, instead of writing "it must be the default partition." it should
be better to say "just return the default partition."

20. I don't think the return NULL should ever hit, is it worth putting
a comment to say /* shouldn't happen */

if (boundinfo->ndatums == 0)
if (partition_bound_has_default(boundinfo))
return bms_make_singleton(boundinfo->default_index);
return NULL;

21. Can the following comment does not explain the situation well:

* boundinfo->ndatums - 1 is the last valid list partition datums
* index.

There's really no possible non-default partition for this case, so
perhaps we should just return the default, if one exists. We do go on
to check the n_maxkeys needlessly for this case. At the very least the
comment should be changed to:

* minkeys values are greater than any non-default partition.
* We'll check that for case below.

but I think it's worth just doing the default partition check there
and returning it, or NULL. It should help reduce confusion.

Can you also perform a self-review of the patch? Some of the things
I'm picking up are leftovers from a previous version of the patch. We
might never get through this review if you keep leaving those around!

I won't continue reviewing again until next week, so don't rush.

David Rowley
PostgreSQL Development, 24x7 Support, Training & Services

Attachment Content-Type Size
get_partitions_excluded_by.patch application/octet-stream 8.8 KB

In response to


Browse pgsql-hackers by date

  From Date Subject
Next Message Stephen Frost 2018-01-12 03:35:22 Re: [HACKERS] Crash on promotion when recovery.conf is renamed
Previous Message Robert Haas 2018-01-12 03:19:02 Re: [HACKERS] [BUGS] BUG #14825: enum type: unsafe use?