Re: [POC] hash partitioning

From: Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>
To: amul sul <sulamul(at)gmail(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Yugo Nagata <nagata(at)sraoss(dot)co(dot)jp>, David Steele <david(at)pgmasters(dot)net>, Greg Stark <stark(at)mit(dot)edu>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [POC] hash partitioning
Date: 2017-05-12 17:09:34
Message-ID: CAFjFpRfHqSGBjNgJV2p+C4Yr5Qxvwygdsg4G_VQ6q9NTB-i3MA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Fri, May 12, 2017 at 6:08 PM, amul sul <sulamul(at)gmail(dot)com> wrote:
> Hi,
>
> Please find the following updated patches attached:
>
> 0001-Cleanup.patch : Does some cleanup and code refactoring required
> for hash partition patch. Otherwise, there will be unnecessary diff in
> 0002 patch

Thanks for splitting the patch.

+ if (isnull[0])
+ cur_index = partdesc->boundinfo->null_index;
This code assumes that null_index will be set to -1 when has_null is false. Per
RelationBuildPartitionDesc() this is true. But probably we should write this
code as
if (isnull[0])
{
if (partdesc->boundinfo->has_null)
cur_index = partdesc->boundinfo->null_index;
}
That way we are certain that when has_null is false, cur_index = -1 similar to
the original code.

Additional arguement to ComputePartitionAttrs() isn't used anywhere in this
patch, so may be this better be part of 0002. If we do this the only change
that will remain in patch is the refactoring of RelationBuildPartitionDesc(),
so we may consider merging it into 0002, unless we find that some more
refactoring is needed. But for now, having it as a separate patch helps.

Here's some more comments on 0002

+ * In the case of hash partitioning, datums is a 2-D array, stores modulus and
+ * remainder values at datums[x][0] and datums[x][1] respectively for each
+ * partition in the ascending order.

This comment about datums should appear in a paragraph of itself and may be
rephrased as in the attached patch. May be we could also add a line about
ndatums for hash partitioned tables as in the attached patch.

+ * (see the above enum); NULL for has and list
typo s/has/hash

+ if (key->strategy == PARTITION_STRATEGY_HASH)
+ {
+ ndatums = nparts;
+ hbounds = (PartitionHashBound **) palloc(nparts *
+
sizeof(PartitionHashBound *));
+ i = 0;
+ foreach (cell, boundspecs)
+ {
+ PartitionBoundSpec *spec = lfirst(cell);
+
[ clipped ]
+ hbounds[i]->index = i;
+ i++;
+ }
For list and range partitioned table we order the bounds so that two
partitioned tables have them in the same order irrespective of order in which
they are specified by the user or hence stored in the catalogs. The partitions
then get indexes according the order in which their bounds appear in ordered
arrays of bounds. Thus any two partitioned tables with same partition
specification always have same PartitionBoundInfoData. This helps in
partition-wise join to match partition bounds of two given tables. Above code
assigns the indexes to the partitions as they appear in the catalogs. This
means that two partitioned tables with same partition specification but
different order for partition bound specification will have different
PartitionBoundInfoData represenation.

If we do that, probably partition_bounds_equal() would reduce to just matching
indexes and the last element of datums array i.e. the greatest modulus datum.
If ordered datums array of two partitioned table do not match exactly, the
mismatch can be because missing datums or different datums. If it's a missing
datum it will change the greatest modulus or have corresponding entry in
indexes array as -1. If the entry differs it will cause mismatching indexes in
the index arrays.

+ * is not a factor of 15.
+ *
+ *
+ * Get greatest bound in array boundinfo->datums which is
An extra line here.

+ if (offset < 0)
+ {
+ nmod = DatumGetInt32(datums[0][0]);
+ valid_bound = (nmod % spec->modulus) == 0;
+ }
+ else
+ {
+ pmod = DatumGetInt32(datums[offset][0]);
+ valid_bound = (spec->modulus % pmod) == 0;
+
+ if (valid_bound && (offset + 1) < ndatums)
+ {
+ nmod = DatumGetInt32(datums[offset + 1][0]);
+ valid_bound = (nmod % spec->modulus) == 0;
+ }
+ }
May be name the variables as prev_mod(ulus) and next_mod(ulus) for better
readability.

+ * for p_p1: satisfies_hash_partition(2, 1, hash_fn(a), hash_fn(b))
+ * for p_p2: satisfies_hash_partition(4, 2, hash_fn(a), hash_fn(b))
+ * for p_p3: satisfies_hash_partition(8, 0, hash_fn(a), hash_fn(b))
+ * for p_p4: satisfies_hash_partition(8, 4, hash_fn(a), hash_fn(b))
The description here may be read as if we are calling the same hash function
for both a and b, but that's not true. So, you may want to clarify that
in hash_fn(a) hash_fn means hash function specified for key a.

+ if (key->partattrs[i] != 0)
+ {
+ keyCol = (Node *) makeVar(1,
+ key->partattrs[i],
+ key->parttypid[i],
+ key->parttypmod[i],
+ key->parttypcoll[i],
+ 0);
+
+ /* Form hash_fn(value) expression */
+ keyCol = (Node *) makeFuncExpr(key->partsupfunc[i].fn_oid,
+ get_fn_expr_rettype(&key->partsupfunc[i]),
+ list_make1(keyCol),
+ InvalidOid,
+ InvalidOid,
+ COERCE_EXPLICIT_CALL);
+ }
+ else
+ {
+ keyCol = (Node *) copyObject(lfirst(partexprs_item));
+ partexprs_item = lnext(partexprs_item);
+ }
I think we should add FuncExpr for column Vars as well as expressions.

The logic to compare two bounds is duplicated in qsort_partition_hbound_cmp()
and partition_bound_cmp(). Should we separate it into a separate function
accepting moduli and remainders. That way in case we change it in future, we
have to change only one place.

I think we need more comments for compute_hash_value(), mix_hash_value() and
satisfies_hash_partition() as to what each of them accepts and what it
computes.

+ /* key's hash values start from third argument of function. */
+ if (!PG_ARGISNULL(i + 2))
+ {
+ values[i] = PG_GETARG_DATUM(i + 2);
+ isnull[i] = false;
+ }
+ else
+ isnull[i] = true;
You could write this as
isnull[i] = PG_ARGISNULL(i + 2);
if (isnull[i])
values[i] = PG_GETARG_DATUM(i + 2);

+ * Identify a btree or hash opclass to use. Currently, we use only
+ * btree operators, which seems enough for list and range partitioning,
+ * and hash operators for hash partitioning.

The wording, if not read carefully, might be read as "we use only btree
operators". I suggest we rephrase it as "Identify opclass to use. For
list and range
partitioning we use only btree operators, which seems enough for those. For
hash partitioning, we use hash operators." for clarity.

+ foreach (lc, $5)
+ {
+ DefElem *opt = (DefElem *) lfirst(lc);
A search on WITH in gram.y shows that we do not handle WITH options in gram.y.
Usually they are handled at the transformation stage. Why is this an exception?
If you do that, we can have all the error handling in
transformPartitionBound().

+DATA(insert OID = 5028 ( satisfies_hash_partition PGNSP PGUID 12 1 0
2276 0 f f f f f f i s 3 0 16 "23 23 2276" _null_ _null_ _null_ _null_
_null_ satisfies_hash_partition _null_ _null_ _null_ ));
Why is third argument to this function ANY? Shouldn't it be INT4ARRAY (variadic
INT4)?

I am yet to review the testcases and thumb through all the places using
PARTITION_STRATEGY_RANGE/LIST to make sure that we are handling
PARTITION_STRATEGY_HASH in all those places.

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

Attachment Content-Type Size
0002_additional_changes.patch text/x-patch 1.8 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Robert Haas 2017-05-12 17:12:01 Re: eval_const_expresisions and ScalarArrayOpExpr
Previous Message Robert Haas 2017-05-12 17:09:19 Re: [POC] hash partitioning