Re: [POC] hash partitioning

From: amul sul <sulamul(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>, Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>, Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>, Thom Brown <thom(at)linux(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Yugo Nagata <nagata(at)sraoss(dot)co(dot)jp>, David Steele <david(at)pgmasters(dot)net>, Greg Stark <stark(at)mit(dot)edu>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [POC] hash partitioning
Date: 2017-10-30 12:22:35
Message-ID: CAAJ_b956m4E58n7pehpavN=uvxCYjeSmLFtsfBKfi+5VdGTf-g@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Sun, Oct 29, 2017 at 12:38 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Tue, Oct 24, 2017 at 1:21 PM, amul sul <sulamul(at)gmail(dot)com> wrote:
>> Updated patch attached.
>
> This patch needs a rebase.

Sure, thanks a lot for your review.

>
> It appears that satisfies_hash_func is declared incorrectly in
> pg_proc.h. ProcedureCreate seems to think that provariadic should be
> ANYOID if the type of the last element is ANYOID, ANYELEMENTOID if the
> type of the last element is ANYARRAYOID, and otherwise the element
> type corresponding to the array type. But here you have the last
> element as int4[] but provariadic is any.

Actually, int4[] is also inappropriate type as we have started using a 64bit
hash function. We need something int8[] which is not available, so that I
have used ANYARRAYOID in the attached patch(0004).

> I wrote the following query
> to detect problems of this type, and I think we might want to just go
> ahead and add this to the regression test suite, verifying that it
> returns no rows:
>
> select oid::regprocedure, provariadic::regtype, proargtypes::regtype[]
> from pg_proc where provariadic != 0
> and case proargtypes[array_length(proargtypes, 1)-1]
> when 2276 then 2276 -- any -> any
> when 2277 then 2283 -- anyarray -> anyelement
> else (select t.oid from pg_type t where t.typarray =
> proargtypes[array_length(proargtypes, 1)-1]) end
> != provariadic;
>

Added in 0001 patch.

> The simple fix is change provariadic to int4 and call it good. It's
> tempting to go the other way and actually make it
> satisfies_hash_partition(int4, int4, variadic "any"), passing the
> column values directly and letting satisfies_hash_partition doing the
> hashing itself. Any arguments that had a partition key type different
> from the column type would have a RelabelType node placed on top of
> the column, so that get_fn_expr_argtype would return the partition key
> type. Then, the function could look up the hash function for that
> type and call it directly on the value. That way, we'd be doing only
> one function call instead of many, and the partition constraint would
> look nicer in \d+ output, too. :-) On the other hand, that would
> also mean that we'd have to look up the extended hash function every
> time through this function, though maybe that could be prevented by
> using fn_extra to cache FmgrInfos for all the hash functions on the
> first time through. I'm not sure how that would compare in terms of
> speed with what you have now, but maybe it's worth trying.
>

One advantage of current implementation is that we can see which hash
function are used for the each partitioning column and also we don't need to
worry about user specified opclass and different input types.

Something similar I've tried in my initial patch version[1], but I have missed
user specified opclass handling for each partitioning column. Do you want me
to handle opclass using RelabelType node? I am afraid that, that would make
the \d+ output more horrible than the current one if non-default opclass used.

> The second paragraph of the CREATE TABLE documentation for PARTITION
> OF needs to be updated like this: "The form with <literal>IN</literal>
> is used for list partitioning, the form with <literal>FROM</literal>
> and <literal>TO</literal> is used for range partitioning, and the form
> with <literal>WITH</literal> is used for hash partitioning."
>

Fixed in the attached version(0004).

> The CREATE TABLE documentation says "When using range partitioning,
> the partition key can include multiple columns or expressions (up to
> 32,"; this should be changed to say "When using range or hash
> partitioning".
>

Fixed in the attached version(0004).

> - expression. If no B-tree operator class is specified when creating a
> - partitioned table, the default B-tree operator class for the
> datatype will
> - be used. If there is none, an error will be reported.
> + expression. If no operator class is specified when creating a
> partitioned
> + table, the default operator class of the appropriate type (btree for list
> + and range partitioning, hash for hash partitioning) will be used. If
> + there is none, an error will be reported.
> + </para>
> +
> + <para>
> + Since hash operator class provides only equality, not ordering, collation
> + is not relevant for hash partitioning. The behaviour will be unaffected
> + if a collation is specified.
> + </para>
> +
> + <para>
> + Hash partitioning will use support function 2 routines from the operator
> + class. If there is none, an error will be reported. See <xref
> + linkend="xindex-support"> for details of operator class support
> + functions.
>
> I think we should rework this a little more heavily. I suggest the
> following, starting after "a single column or expression":
>
> <para>
> Range and list partitioning require a btree operator class, while hash
> partitioning requires a hash operator class. If no operator class is
> specified explicitly, the default operator class of the appropriate
> type will be used; if no default operator class exists, an error will
> be raised. When hash partitioning is used, the operator class used
> must implement support function 2 (see <xref linkend="xindex-support">
> for details).
> </para>
>

Thanks again, added in the attached version(0004).

> I think we can leave out the part about collations. It's possibly
> worth a longer explanation here at some point: for range partitioning,
> collation can affect which rows go into which partitions; for list
> partitioning, it can't, but it can affect the order in which
> partitions are expanded (which is a can of worms I'm not quite ready
> to try to explain in user-facing documentation); for hash
> partitioning, it makes no difference at all. Although at some point
> we may want to document this, I think it's a job for a separate patch,
> since (1) the existing documentation doesn't document the precise
> import of collations on existing partitioning types and (2) I'm not
> sure that CREATE TABLE is really the best place to explain this.
>

Okay.

> The example commands for creating a hash-partitioned table are missing
> spaces between WITH and the parenthesis which follows.
>

Fixed in the attached version(0004).

> In 0003, the changes to partition_bounds_copy claim that I shouldn't
> worry about the fact that typlen is set to 4 because datumCopy won't
> use it for a pass-by-value datatype, but I think that calling
> functions with incorrect arguments and hoping that they ignore them
> and therefore nothing bad happens doesn't sound like a very good idea.
> Fortunately, I think the actual code is fine; I think we just need to
> change the comments. For hash partitioning, the datums array always
> contains two integers, which are of type int4, which is indeed a
> pass-by-value type of length 4 (note that if we were using int8 for
> the modulus and remainder, we'd need to set byval to FLOAT8PASSBYVAL).
> I would just write this as:
>
> if (hash_part)
> {
> typlen = sizeof(int32); /* always int4 */
> byval = true; /* int4 is pass-by-value */
> }
>

Fixed in the attached version (now patch number is 0005).

> + for (i = 0; i < nkeys; i++)
> + {
> + if (!isnull[i])
> + rowHash = hash_combine64(rowHash,
> DatumGetUInt64(hash_array[i]));
> + }
>
> Excess braces.
>

Fixed in the attached version(0004).

> I think it might be better to inline the logic in mix_hash_value()
> into each of the two callers. Then, the callers wouldn't need Datum
> hash_array[PARTITION_MAX_KEYS]; they could just fold each new hash
> value into a uint64 value. That seems likely to be slightly faster
> and I don't see any real downside.
>

Fixed in the attached version(0004).

> rhaas=# create table natch (a citext, b text) partition by hash (a);
> ERROR: XX000: missing support function 2(16398,16398) in opfamily 16437
> LOCATION: RelationBuildPartitionKey, relcache.c:954
>
> It shouldn't be possible to reach an elog() from SQL, and this is not
> a friendly error message.
>

How about an error message in the attached patch(0004)?

1] https://postgr.es/m/CAAJ_b96AQBAxSQ2mxnTmx9zXh79GdP_dQWv0aupjcmz+jpiGjw@mail.gmail.com

Regards,
Amul

Attachment Content-Type Size
0001-provariadic-sanity-check.patch application/octet-stream 2.7 KB
0002-Add-PG_GETARG_UINT64-macro.patch application/octet-stream 957 bytes
0003-partition_bounds_copy-code-refactoring-v1.patch application/octet-stream 2.1 KB
0004-hash-partitioning_another_design-v26.patch application/octet-stream 89.3 KB
0005-Enable-partition-wise-join-support-v4.patch application/octet-stream 10.8 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Alexander Kuzmenkov 2017-10-30 12:25:50 Re: PoC: full merge join on comparison clause
Previous Message Michael Paquier 2017-10-30 11:58:58 Re: pow support for pgbench