Re: [POC] hash partitioning

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: amul sul <sulamul(at)gmail(dot)com>
Cc: Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>, Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>, Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>, Thom Brown <thom(at)linux(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Yugo Nagata <nagata(at)sraoss(dot)co(dot)jp>, David Steele <david(at)pgmasters(dot)net>, Greg Stark <stark(at)mit(dot)edu>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [POC] hash partitioning
Date: 2017-10-29 07:08:28
Message-ID: CA+TgmoZ_UGXfq5ygeDDMdUSJ4J_VX7nFnjC6mfY6BgOJ3qZCmw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, Oct 24, 2017 at 1:21 PM, amul sul <sulamul(at)gmail(dot)com> wrote:
> Updated patch attached.

This patch needs a rebase.

It appears that satisfies_hash_func is declared incorrectly in
pg_proc.h. ProcedureCreate seems to think that provariadic should be
ANYOID if the type of the last element is ANYOID, ANYELEMENTOID if the
type of the last element is ANYARRAYOID, and otherwise the element
type corresponding to the array type. But here you have the last
element as int4[] but provariadic is any. I wrote the following query
to detect problems of this type, and I think we might want to just go
ahead and add this to the regression test suite, verifying that it
returns no rows:

select oid::regprocedure, provariadic::regtype, proargtypes::regtype[]
from pg_proc where provariadic != 0
and case proargtypes[array_length(proargtypes, 1)-1]
when 2276 then 2276 -- any -> any
when 2277 then 2283 -- anyarray -> anyelement
else (select t.oid from pg_type t where t.typarray =
proargtypes[array_length(proargtypes, 1)-1]) end
!= provariadic;

The simple fix is change provariadic to int4 and call it good. It's
tempting to go the other way and actually make it
satisfies_hash_partition(int4, int4, variadic "any"), passing the
column values directly and letting satisfies_hash_partition doing the
hashing itself. Any arguments that had a partition key type different
from the column type would have a RelabelType node placed on top of
the column, so that get_fn_expr_argtype would return the partition key
type. Then, the function could look up the hash function for that
type and call it directly on the value. That way, we'd be doing only
one function call instead of many, and the partition constraint would
look nicer in \d+ output, too. :-) On the other hand, that would
also mean that we'd have to look up the extended hash function every
time through this function, though maybe that could be prevented by
using fn_extra to cache FmgrInfos for all the hash functions on the
first time through. I'm not sure how that would compare in terms of
speed with what you have now, but maybe it's worth trying.

The second paragraph of the CREATE TABLE documentation for PARTITION
OF needs to be updated like this: "The form with <literal>IN</literal>
is used for list partitioning, the form with <literal>FROM</literal>
and <literal>TO</literal> is used for range partitioning, and the form
with <literal>WITH</literal> is used for hash partitioning."

The CREATE TABLE documentation says "When using range partitioning,
the partition key can include multiple columns or expressions (up to
32,"; this should be changed to say "When using range or hash
partitioning".

- expression. If no B-tree operator class is specified when creating a
- partitioned table, the default B-tree operator class for the
datatype will
- be used. If there is none, an error will be reported.
+ expression. If no operator class is specified when creating a
partitioned
+ table, the default operator class of the appropriate type (btree for list
+ and range partitioning, hash for hash partitioning) will be used. If
+ there is none, an error will be reported.
+ </para>
+
+ <para>
+ Since hash operator class provides only equality, not ordering, collation
+ is not relevant for hash partitioning. The behaviour will be unaffected
+ if a collation is specified.
+ </para>
+
+ <para>
+ Hash partitioning will use support function 2 routines from the operator
+ class. If there is none, an error will be reported. See <xref
+ linkend="xindex-support"> for details of operator class support
+ functions.

I think we should rework this a little more heavily. I suggest the
following, starting after "a single column or expression":

<para>
Range and list partitioning require a btree operator class, while hash
partitioning requires a hash operator class. If no operator class is
specified explicitly, the default operator class of the appropriate
type will be used; if no default operator class exists, an error will
be raised. When hash partitioning is used, the operator class used
must implement support function 2 (see <xref linkend="xindex-support">
for details).
</para>

I think we can leave out the part about collations. It's possibly
worth a longer explanation here at some point: for range partitioning,
collation can affect which rows go into which partitions; for list
partitioning, it can't, but it can affect the order in which
partitions are expanded (which is a can of worms I'm not quite ready
to try to explain in user-facing documentation); for hash
partitioning, it makes no difference at all. Although at some point
we may want to document this, I think it's a job for a separate patch,
since (1) the existing documentation doesn't document the precise
import of collations on existing partitioning types and (2) I'm not
sure that CREATE TABLE is really the best place to explain this.

The example commands for creating a hash-partitioned table are missing
spaces between WITH and the parenthesis which follows.

In 0003, the changes to partition_bounds_copy claim that I shouldn't
worry about the fact that typlen is set to 4 because datumCopy won't
use it for a pass-by-value datatype, but I think that calling
functions with incorrect arguments and hoping that they ignore them
and therefore nothing bad happens doesn't sound like a very good idea.
Fortunately, I think the actual code is fine; I think we just need to
change the comments. For hash partitioning, the datums array always
contains two integers, which are of type int4, which is indeed a
pass-by-value type of length 4 (note that if we were using int8 for
the modulus and remainder, we'd need to set byval to FLOAT8PASSBYVAL).
I would just write this as:

if (hash_part)
{
typlen = sizeof(int32); /* always int4 */
byval = true; /* int4 is pass-by-value */
}

+ for (i = 0; i < nkeys; i++)
+ {
+ if (!isnull[i])
+ rowHash = hash_combine64(rowHash,
DatumGetUInt64(hash_array[i]));
+ }

Excess braces.

I think it might be better to inline the logic in mix_hash_value()
into each of the two callers. Then, the callers wouldn't need Datum
hash_array[PARTITION_MAX_KEYS]; they could just fold each new hash
value into a uint64 value. That seems likely to be slightly faster
and I don't see any real downside.

rhaas=# create table natch (a citext, b text) partition by hash (a);
ERROR: XX000: missing support function 2(16398,16398) in opfamily 16437
LOCATION: RelationBuildPartitionKey, relcache.c:954

It shouldn't be possible to reach an elog() from SQL, and this is not
a friendly error message.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Robert Haas 2017-10-29 07:31:12 Re: Implementing pg_receivewal --no-sync
Previous Message Amit Kapila 2017-10-29 06:32:02 Re: Parallel worker error