Re: [POC] hash partitioning

From: amul sul <sulamul(at)gmail(dot)com>
To: Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Yugo Nagata <nagata(at)sraoss(dot)co(dot)jp>, David Steele <david(at)pgmasters(dot)net>, Greg Stark <stark(at)mit(dot)edu>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [POC] hash partitioning
Date: 2017-05-14 07:00:58
Message-ID: CAAJ_b97NK6MB9fr0esOohJ_XFFYVcTBgRR99byPdr=eT6ygERA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Fri, May 12, 2017 at 10:39 PM, Ashutosh Bapat
<ashutosh(dot)bapat(at)enterprisedb(dot)com> wrote:
> On Fri, May 12, 2017 at 6:08 PM, amul sul <sulamul(at)gmail(dot)com> wrote:
>> Hi,
>>
>> Please find the following updated patches attached:
>>
>> 0001-Cleanup.patch : Does some cleanup and code refactoring required
>> for hash partition patch. Otherwise, there will be unnecessary diff in
>> 0002 patch
>
> Thanks for splitting the patch.
>
> + if (isnull[0])
> + cur_index = partdesc->boundinfo->null_index;
> This code assumes that null_index will be set to -1 when has_null is false. Per
> RelationBuildPartitionDesc() this is true. But probably we should write this
> code as
> if (isnull[0])
> {
> if (partdesc->boundinfo->has_null)
> cur_index = partdesc->boundinfo->null_index;
> }
> That way we are certain that when has_null is false, cur_index = -1 similar to
> the original code.
>
Okay will add this. I still don't understood point of having has_null
variable, if no null accepting partition exists then null_index is
alway set to -1 in RelationBuildPartitionDesc. Anyway, let not change
the original code.

> Additional arguement to ComputePartitionAttrs() isn't used anywhere in this
> patch, so may be this better be part of 0002. If we do this the only change
> that will remain in patch is the refactoring of RelationBuildPartitionDesc(),
> so we may consider merging it into 0002, unless we find that some more
> refactoring is needed. But for now, having it as a separate patch helps.
>
Okay.

> Here's some more comments on 0002
>
> + * In the case of hash partitioning, datums is a 2-D array, stores modulus and
> + * remainder values at datums[x][0] and datums[x][1] respectively for each
> + * partition in the ascending order.
>
> This comment about datums should appear in a paragraph of itself and may be
> rephrased as in the attached patch. May be we could also add a line about
> ndatums for hash partitioned tables as in the attached patch.
>
Thanks, looks good to me; will include this.

[...]
>
> + if (key->strategy == PARTITION_STRATEGY_HASH)
> + {
> + ndatums = nparts;
> + hbounds = (PartitionHashBound **) palloc(nparts *
> +
> sizeof(PartitionHashBound *));
> + i = 0;
> + foreach (cell, boundspecs)
> + {
> + PartitionBoundSpec *spec = lfirst(cell);
> +
> [ clipped ]
> + hbounds[i]->index = i;
> + i++;
> + }
> For list and range partitioned table we order the bounds so that two
> partitioned tables have them in the same order irrespective of order in which
> they are specified by the user or hence stored in the catalogs. The partitions
> then get indexes according the order in which their bounds appear in ordered
> arrays of bounds. Thus any two partitioned tables with same partition
> specification always have same PartitionBoundInfoData. This helps in
> partition-wise join to match partition bounds of two given tables. Above code
> assigns the indexes to the partitions as they appear in the catalogs. This
> means that two partitioned tables with same partition specification but
> different order for partition bound specification will have different
> PartitionBoundInfoData represenation.
>
> If we do that, probably partition_bounds_equal() would reduce to just matching
> indexes and the last element of datums array i.e. the greatest modulus datum.
> If ordered datums array of two partitioned table do not match exactly, the
> mismatch can be because missing datums or different datums. If it's a missing
> datum it will change the greatest modulus or have corresponding entry in
> indexes array as -1. If the entry differs it will cause mismatching indexes in
> the index arrays.
>
Make sense, will fix this.

[...]
>
> + if (offset < 0)
> + {
> + nmod = DatumGetInt32(datums[0][0]);
> + valid_bound = (nmod % spec->modulus) == 0;
> + }
> + else
> + {
> + pmod = DatumGetInt32(datums[offset][0]);
> + valid_bound = (spec->modulus % pmod) == 0;
> +
> + if (valid_bound && (offset + 1) < ndatums)
> + {
> + nmod = DatumGetInt32(datums[offset + 1][0]);
> + valid_bound = (nmod % spec->modulus) == 0;
> + }
> + }
> May be name the variables as prev_mod(ulus) and next_mod(ulus) for better
> readability.
>
Okay, will rename to prev_modulus and next_modulus resp.

> + * for p_p1: satisfies_hash_partition(2, 1, hash_fn(a), hash_fn(b))
> + * for p_p2: satisfies_hash_partition(4, 2, hash_fn(a), hash_fn(b))
> + * for p_p3: satisfies_hash_partition(8, 0, hash_fn(a), hash_fn(b))
> + * for p_p4: satisfies_hash_partition(8, 4, hash_fn(a), hash_fn(b))
> The description here may be read as if we are calling the same hash function
> for both a and b, but that's not true. So, you may want to clarify that
> in hash_fn(a) hash_fn means hash function specified for key a.
>
Okay.

>
> + if (key->partattrs[i] != 0)
> + {
> + keyCol = (Node *) makeVar(1,
> + key->partattrs[i],
> + key->parttypid[i],
> + key->parttypmod[i],
> + key->parttypcoll[i],
> + 0);
> +
> + /* Form hash_fn(value) expression */
> + keyCol = (Node *) makeFuncExpr(key->partsupfunc[i].fn_oid,
> + get_fn_expr_rettype(&key->partsupfunc[i]),
> + list_make1(keyCol),
> + InvalidOid,
> + InvalidOid,
> + COERCE_EXPLICIT_CALL);
> + }
> + else
> + {
> + keyCol = (Node *) copyObject(lfirst(partexprs_item));
> + partexprs_item = lnext(partexprs_item);
> + }
> I think we should add FuncExpr for column Vars as well as expressions.
>
Okay, will fix this.

> The logic to compare two bounds is duplicated in qsort_partition_hbound_cmp()
> and partition_bound_cmp(). Should we separate it into a separate function
> accepting moduli and remainders. That way in case we change it in future, we
> have to change only one place.
>
Okay.

> I think we need more comments for compute_hash_value(), mix_hash_value() and
> satisfies_hash_partition() as to what each of them accepts and what it
> computes.
>
> + /* key's hash values start from third argument of function. */
> + if (!PG_ARGISNULL(i + 2))
> + {
> + values[i] = PG_GETARG_DATUM(i + 2);
> + isnull[i] = false;
> + }
> + else
> + isnull[i] = true;
> You could write this as
> isnull[i] = PG_ARGISNULL(i + 2);
> if (isnull[i])
> values[i] = PG_GETARG_DATUM(i + 2);
>
Okay.

>
> + * Identify a btree or hash opclass to use. Currently, we use only
> + * btree operators, which seems enough for list and range partitioning,
> + * and hash operators for hash partitioning.
>
> The wording, if not read carefully, might be read as "we use only btree
> operators". I suggest we rephrase it as "Identify opclass to use. For
> list and range
> partitioning we use only btree operators, which seems enough for those. For
> hash partitioning, we use hash operators." for clarity.
>
Okay

> + foreach (lc, $5)
> + {
> + DefElem *opt = (DefElem *) lfirst(lc);
> A search on WITH in gram.y shows that we do not handle WITH options in gram.y.
> Usually they are handled at the transformation stage. Why is this an exception?
> If you do that, we can have all the error handling in
> transformPartitionBound().
>
If so, ForValues need to return list for hash and PartitionBoundSpec
for other two; wouldn't that break code consistency? And such
validation is not new in gram.y see xmltable_column_el.

> +DATA(insert OID = 5028 ( satisfies_hash_partition PGNSP PGUID 12 1 0
> 2276 0 f f f f f f i s 3 0 16 "23 23 2276" _null_ _null_ _null_ _null_
> _null_ satisfies_hash_partition _null_ _null_ _null_ ));
> Why is third argument to this function ANY? Shouldn't it be INT4ARRAY (variadic
> INT4)?
>
Will use INT4ARRAY in next patch, but I am little sceptical of it. we
need an unsigned int32, but unfortunately there is not variadic uint32
support. How about INT8ARRAY?

Regards,
Amul

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message amul sul 2017-05-14 08:00:24 Re: [POC] hash partitioning
Previous Message Tom Lane 2017-05-14 04:57:26 Re: Latest Data::Dumper breaks hstore_plperl regression test