Quick Links

Re: [POC] hash partitioning

From:	amul sul <sulamul(at)gmail(dot)com>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	Yugo Nagata <nagata(at)sraoss(dot)co(dot)jp>, David Steele <david(at)pgmasters(dot)net>, Greg Stark <stark(at)mit(dot)edu>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: [POC] hash partitioning
Date:	2017-05-12 12:38:36
Message-ID:	CAAJ_b97mTb=dG2pv6+1ougxEVZFVnZJajW+0QHj46mEE7WsoOQ@mail.gmail.com
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

Hi,

Please find the following updated patches attached:

0001-Cleanup.patch : Does some cleanup and code refactoring required
for hash partition patch. Otherwise, there will be unnecessary diff in
0002 patch

0002-hash-partitioning_another_design-v3.patch: Addressed review
comments given by Ashutosh and Robert.

On Wed, May 10, 2017 at 11:39 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Wed, May 3, 2017 at 9:09 AM, amul sul <sulamul(at)gmail(dot)com> wrote:
>> Fixed in the attached version.
>
> +[ PARTITION BY { HASH | RANGE | LIST } ( { <replaceable
> class="parameter">column_name</replaceable> | ( <replaceable
> class="parameter">expression</replaceable> ) } [ COLLATE <replaceable
>
> In the department of severe nitpicking, I would have expected this to
> either use alphabetical order (HASH | LIST | RANGE) or to add the new
> method at the end on the theory that we probably did the important
> ones first (RANGE | LIST | HASH).
>
Fixed in the attached version.

> + WITH ( MODULUS <replaceable class="PARAMETER">value</replaceable>,
> REMAINDER <replaceable class="PARAMETER">value</replaceable> ) }
>
> Maybe value -> modulus and value -> remainder?
>
Fixed in the attached version.

> <para>
> + When creating a hash partition, <literal>MODULUS</literal> should be
> + greater than zero and <literal>REMAINDER</literal> should be greater than
> + or equal to zero. Every <literal>MODULUS</literal> must be a factor of
> + the next larger modulus.
> [ ... and it goes on from there ... ]
>
> This paragraph is fairly terrible, because it's a design spec that I
> wrote, not an explanation intended for users. Here's an attempt to
> improve it:
>
> ===
> When creating a hash partition, a modulus and remainder must be
> specified. The modulus must be a positive integer, and the remainder
> must a non-negative integer less than the modulus. Typically, when
> initially setting up a hash-partitioned table, you should choose a
> modulus equal to the number of partitions and assign every table the
> same modulus and a different remainder (see examples, below).
> However, it is not required that every partition have the same
> modulus, only that every modulus which occurs among the children of a
> hash-partitioned table is a factor of the next larger modulus. This
> allows the number of partitions to be increased incrementally without
> needing to move all the data at once. For example, suppose you have a
> hash-partitioned table with 8 children, each of which has modulus 8,
> but find it necessary to increase the number of partitions to 16. You
> can detach one of the modulus-8 partitions, create two new modulus-16
> partitions covering the same portion of the key space (one with a
> remainder equal to the remainder of the detached partition, and the
> other with a remainder equal to that value plus 8), and repopulate
> them with data. You can then repeat this -- perhaps at a later time
> -- for each modulus-8 partition until none remain. While this may
> still involve a large amount of data movement at each step, it is
> still better than having to create a whole new table and move all the
> data at once.
> ===
>
Thanks a lot, added in attached version.

> +CREATE TABLE postal_code (
> + code int not null,
> + city_id bigint not null,
> + address text
> +) PARTITION BY HASH (code);
>
> It would be fairly silly to hash-partition the postal_code table,
> because there aren't enough postal codes to justify it. Maybe make
> this a lineitem or order table, and partition on the order number.
> Also, extend the example to show creating 4 partitions with modulus 4.
>
Understood, added order table example.

> + if (spec->strategy != PARTITION_STRATEGY_HASH)
> + elog(ERROR, "invalid strategy in partition bound spec");
>
> I think this should be an ereport() if it can happen or an Assert() if
> it's supposed to be prevented by the grammar.
>
Used Assert() in the attach version patch, also changed same for RANGE
and LIST in 0001- cleanup patch.

> + if (!(datumIsEqual(b1->datums[i][0], b2->datums[i][0],
> + true, sizeof(int)) &&
>
> It doesn't seem necessary to use datumIsEqual() here. You know the
> datums are pass-by-value, so why not just use == ? I'd include a
> comment but I don't think using datumIsEqual() adds anything here
> except unnecessary complexity. More broadly, I wonder why we're
> cramming this into the datums arrays instead of just adding another
> field to PartitionBoundInfoData that is only used by hash
> partitioning.
>
Fixed in the attached version.

> /*
> + * Check rule that every modulus must be a factor of the
> + * next larger modulus. For example, if you have a bunch
> + * of partitions that all have modulus 5, you can add a new
> + * new partition with modulus 10 or a new partition with
> + * modulus 15, but you cannot add both a partition with
> + * modulus 10 and a partition with modulus 15, because 10
> + * is not a factor of 15. However, you could
> simultaneously
> + * use modulus 4, modulus 8, modulus 16, and modulus 32 if
> + * you wished, because each modulus is a factor of the next
> + * larger one. You could also use modulus 10, modulus 20,
> + * and modulus 60. But you could not use modulus 10,
> + * modulus 15, and modulus 60 for the same reason.
> + */
>
> I think just the first sentence is fine here; I'd nuke the rest of this.
>
Fixed in the attached version.

> The block that follows could be merged into the surrounding block.
> There's no need to increase the indentation level here, so let's not.
> I also suspect that the code itself is wrong. There are two ways a
> modulus can be invalid: it can either fail to be a multiple of the
> next lower-modulus, or it can fail to be a factor of the next-higher
> modulus. I think your code only checks the latter. So for example,
> if the current modulus list is (4, 36), your code would correctly
> disallow 3 because it's not a factor of 4 and would correctly disallow
> 23 because it's not a factor of 36, but it looks to me like it would
> allow 9 because that's a factor of 36. However, then the list would be
> (4, 9, 36), and 4 is not a factor of 9.
>
This case is already handled in previous patch and similar regression
test does exists in create_table.sql, see this in v2 patch.

+-- check partition bound syntax for the hash partition
+CREATE TABLE hash_parted (
+ a int
+) PARTITION BY HASH (a);
+CREATE TABLE hpart_1 PARTITION OF hash_parted FOR VALUES WITH
(modulus 10, remainder 1);
+CREATE TABLE hpart_2 PARTITION OF hash_parted FOR VALUES WITH
(modulus 50, remainder 0);
+-- modulus 25 is factor of modulus of 50 but 10 is not factor of 25.
+CREATE TABLE fail_part PARTITION OF hash_parted FOR VALUES WITH
(modulus 25, remainder 2);

> + greatest_modulus = DatumGetInt32(datums[ndatums - 1][0]);
>
> Here, insert: /* Normally, the lowest remainder that could conflict
> with the new partition is equal to the remainder specified for the new
> partition, but when the new partition has a modulus higher than any
> used so far, we need to adjust. */
>
> + place = spec->remainder;
> + if (place >= greatest_modulus)
> + place = place % greatest_modulus;
>
Fixed in the attached version.

> Here, insert: /* Check every potentially-conflicting remainder. */
>
> + do
> + {
> + if (boundinfo->indexes[place] != -1)
> + {
> + overlap = true;
> + with = boundinfo->indexes[place];
> + break;
> + }
> + place = place + spec->modulus;
>
> Maybe use += ?
>
Fixed.

> + } while (place < greatest_modulus);
>
> + * Used when sorting hash bounds across all hash modulus
> + * for hash partitioning
>
> This is not a very descriptive comment. Maybe /* We sort hash bounds
> by modulus, then by remainder. */
>
Fixed.

> +cal_hash_value(FmgrInfo *partsupfunc, int nkeys, Datum *values, bool *isnull)
>
> I agree with Ashutosh's critique of this name.
>
Fixed.

> + /*
> + * Cache hash function information, similar to how record_eq() caches
> + * equality operator information. (Perhaps no SQL syntax could cause
> + * PG_NARGS()/nkeys to change between calls through the same FmgrInfo.
> + * Checking nkeys here is just defensiveness.)
> + */
>
> Unless I'm missing something, this comment does not actually describe
> what the code does. Each call to the function repeats the same
> TypeCacheEntry lookups. I'm not actually sure whether caching here
> can actually help - is there any situation in which the same FmgrInfo
> will get used repeatedly here? But if it is possible then this code
> fails to achieve its intended objective.
>
This code is no longer exists in new satisfies_hash_partition() code.

> Another problem with this code is that, unless I'm missing something,
> it completely ignores the opclass the user specified and just looks up
> the default hash opclass. I think you should create a non-default
> hash opclass for some data type -- maybe create one for int4 that just
> returns the input value unchanged -- and test that the specifying
> default hash opclass routes tuples according to hash_uint32(val) %
> modulus while specifying your customer opclass routes tuples according
> to val % modulus.
>
> Unless I'm severely misunderstanding the situation this code is
> seriously undertested.
>
You are correct, I've missed to opclass handling. Fixed in the
attached version, and added same case regression test.

> + * Identify a btree opclass to use. Currently, we use only btree
> + * operators, which seems enough for list and range partitioning.
>
> This comment is false, right?
>
Not really, this has been re-added due to indentation change.

> + appendStringInfoString(buf, "FOR VALUES");
> + appendStringInfo(buf, " WITH (modulus %d,
> remainder %d)",
> + spec->modulus, spec->remainder);
>
> You could combine these.
>
I am not sure about this, I've used same code style exist in
get_rule_expr() for range and list. Do you want me to change this for
other partitioning as well?

> +ALTER TABLE hash_parted2 ATTACH PARTITION fail_part FOR VALUES WITH
> (modulus 0, remainder 1);
> +ERROR: invalid bound specification for a hash partition
> +HINT: modulus must be greater than zero
> +ALTER TABLE hash_parted2 ATTACH PARTITION fail_part FOR VALUES WITH
> (modulus 8, remainder 8);
> +ERROR: invalid bound specification for a hash partition
> +HINT: modulus must be greater than remainder
> +ALTER TABLE hash_parted2 ATTACH PARTITION fail_part FOR VALUES WITH
> (modulus 3, remainder 2);
> +ERROR: invalid bound specification for a hash partition
> +HINT: every modulus must be factor of next largest modulus
>
> It seems like you could merge the hint back into the error:
>
> ERROR: hash partition modulus must be greater than 0
> ERROR: hash partition remainder must be less than modulus
> ERROR: every hash partition modulus must be a factor of the next larger modulus
>
Added same in the attached version. Thanks again.

> +DETAIL: Partition key of the failing row contains (HASHa, b) = (c, 5).
>
> That's obviously garbled somehow.
>
Oops. Fixed in the attached version.

> +hash_partbound_elem:
> + NonReservedWord Iconst
> + {
> + $$ = makeDefElem($1, (Node *)makeInteger($2), @1);
> + }
> + ;
> +
> +hash_partbound:
> + hash_partbound_elem ',' hash_partbound_elem
> + {
> + $$ = list_make2($1, $3);
> + }
> + ;
>
> I don't think that it's the grammar's job to enforce that exactly two
> options are present. It should allow any number of options, and some
> later code, probably during parse analysis, should check that the ones
> you need are present and that there are no invalid ones. See the code
> for EXPLAIN, VACUUM, etc.
>
Tried to fixed in the attached version.

> Regarding the test cases, I think that you've got a lot of tests for
> failure scenarios (which is good) but not enough for success
> scenarios. For example, you test that inserting a row into the wrong
> hash partition fails, but not (unless I missed it) that tuple routing
> succeeds. I think it would be good to have a test where you insert
> 1000 or so rows into a hash partitioned table just to see it all work.
>
I am quite unsure about this test, now sure how can we verify correct
tuple routing?

> Also, you haven't done anything about the fact that constraint
> exclusion doesn't work for hash partitioned tables, a point I raised
> in http://postgr.es/m/CA+Tgmob7RsN5A=ehgYbLPx--c5CmptrK-dB=Y-v--o+TKyfteA@mail.gmail.com
> and which I still think is quite important. I think that to have a
> committable patch for this feature that would have to be addressed.
>
Do you mean, we should come up with special handling(pre-pruning) for
hash partitioning or modify constraints exclusion so that it will
handle hash partition expression and cases that you have discussed in
thread[1] as well? I was under the impression that we might going to
have this as a separate feature proposal.

1]. https://www.postgresql.org/message-id/CA%2BTgmoaE9NZ_RiqZQLp2aJXPO4E78QxkQYL-FR2zCDop96Ahdg%40mail.gmail.com

Regards,
Amul Sul

Attachment	Content-Type	Size
0001-Cleanup.patch	application/octet-stream	6.0 KB
0002-hash-partitioning_another_design-v3.patch	application/octet-stream	91.7 KB

In response to

Re: [POC] hash partitioning at 2017-05-10 18:09:04 from Robert Haas

Responses

Re: [POC] hash partitioning at 2017-05-12 17:09:34 from Ashutosh Bapat
Re: [POC] hash partitioning at 2017-05-13 06:41:09 from Dilip Kumar

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	amul sul	2017-05-12 12:46:33	Re: [POC] hash partitioning
Previous Message	tushar	2017-05-12 12:35:24	Getting error at the time of dropping subscription and few more issues