Re: Skip partition tuple routing with constant partition key

From: Amit Langote <amitlangote09(at)gmail(dot)com>
To: David Rowley <dgrowleyml(at)gmail(dot)com>
Cc: Zhihong Yu <zyu(at)yugabyte(dot)com>, Greg Stark <stark(at)mit(dot)edu>, "houzj(dot)fnst(at)fujitsu(dot)com" <houzj(dot)fnst(at)fujitsu(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, "tsunakawa(dot)takay(at)fujitsu(dot)com" <tsunakawa(dot)takay(at)fujitsu(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
Subject: Re: Skip partition tuple routing with constant partition key
Date: 2022-07-22 13:22:48
Message-ID: CA+HiwqEnsRQA=N3eiWSRss-H2RutM99cJDYLZ=q_rVvPKM96UA@mail.gmail.com
Lists: pgsql-hackers

On Thu, Jul 14, 2022 at 2:31 PM David Rowley <dgrowleyml(at)gmail(dot)com> wrote:
> I've spent some time looking at the v10 patch, and to be honest, I
> don't really like the look of it :(

Thanks for the review and sorry for the delay in replying.

> 1. I think we should be putting the cache fields in PartitionDescData
> rather than PartitionDispatch. Having them in PartitionDescData allows
> caching between statements.

Looking at your patch, yes, that makes sense. Initially, I didn't see
much point in being able to cache between (supposedly simple OLTP)
statements, because the tuple-routing binary search is such a
minuscule portion of their execution, but now I agree: why not.
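
To make the idea concrete, here is a rough sketch (abbreviated, with a
made-up struct name; the cache fields themselves are the ones quoted
further below) of what it means for the cache to live in the
relcache-level partition descriptor, which survives across statements,
rather than in the per-statement PartitionDispatch state:

typedef struct SketchPartitionDescData
{
    int     nparts;                  /* number of partitions */
    /* ... bounds, partition OIDs, etc. elided ... */

    /* cache of the most recently routed-to partition */
    int     last_found_datum_index;  /* index into the bound datum array */
    int     last_found_part_index;   /* partition index, or -1 if unset */
    int     last_found_count;        /* consecutive hits on that partition */
} SketchPartitionDescData;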

> 2. The function name maybe_cache_partition_bound_offset() fills me
> with dread. It's very unconcise. I don't think anyone should ever use
> that word in a function or variable name.

Yeah, we can live without this one for sure, as your patch
demonstrates, but to be fair, it's not like we don't have "maybe_"
used in variable and function names in arguably even trickier parts
of our code; see `git grep maybe_`.

> 3. I'm not really sure why there's a field named n_tups_inserted.
> That would lead me to believe that ExecFindPartition is only executed
> for INSERTs. UPDATEs need to know the partition too.

Hmm, (cross-partition) UPDATEs internally use an INSERT that does
ExecFindPartition(). I don't see ExecUpdate() directly calling
ExecFindPartition(). Well, yes, apply_handle_tuple_routing() in a way
does, but apparently I didn't worry about that function.

> 4. The fields you're adding to PartitionDispatch are very poorly
> documented. I'm not really sure what n_offset_changed means.

My intention with that variable was to count the number of partition
switches that happened over the course of inserting N tuples. The
theory was that if the ratio of partition switches to tuples inserted
is too close to 1, the dataset being loaded is not in an order that
would benefit from caching. That was an attempt at some adaptability,
to account for cases where the ordering in the dataset is not
consistent, but your approach seems just as adaptive, and your code is
much simpler.
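
Roughly, the v10 heuristic was along these lines (a standalone sketch
with made-up names and an arbitrary cutoff, not the actual patch
code):

#include <stdbool.h>

/*
 * Decide whether the routing cache is worth consulting, based on how
 * often the target partition changed while inserting tuples.  The 0.9
 * cutoff is purely illustrative.
 */
static bool
cache_looks_useful(long n_offset_changed, long n_tups_inserted)
{
    /* No data yet; give the cache the benefit of the doubt. */
    if (n_tups_inserted == 0)
        return true;

    /*
     * If (nearly) every tuple switched partitions, the input is
     * effectively unordered, so just do the binary search directly.
     */
    return (double) n_offset_changed / n_tups_inserted < 0.9;
}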

> Why
> can't you just keep track by recording the last used partition, the
> last index into the datum array, and then just a count of the number
> of times we've found the last used partition in a row? When the found
> partition does not match the last partition, just reset the counter
> and when the counter reaches the cache threshold, use the cache path.

Yeah, it makes sense and is easier to understand.
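
For the archives, here is a minimal standalone sketch of that
streak-counting scheme; the names, the threshold value, and the toy
integer partitioning are all illustrative (values assumed
non-negative), not the patch code:

#include <stdbool.h>

#define CACHE_FIND_THRESHOLD 16

typedef struct RouteCache
{
    int last_part;      /* last partition index found, -1 if none yet */
    int hit_count;      /* how many times in a row last_part was found */
} RouteCache;

/*
 * Stand-in for the real binary search over partition bounds: pretend
 * partition i holds the values [i*10, i*10 + 10).
 */
static int
binary_search_partition(int value)
{
    return value / 10;
}

/* Re-check that 'value' still falls within partition 'part'. */
static bool
value_fits_partition(int part, int value)
{
    return value >= part * 10 && value < (part + 1) * 10;
}

static int
route_tuple(RouteCache *rc, int value)
{
    int part;

    /* Once the streak is long enough, try the cached partition first. */
    if (rc->hit_count >= CACHE_FIND_THRESHOLD &&
        value_fits_partition(rc->last_part, value))
    {
        rc->hit_count++;
        return rc->last_part;
    }

    part = binary_search_partition(value);

    if (part == rc->last_part)
        rc->hit_count++;          /* extend the winning streak */
    else
    {
        rc->last_part = part;     /* different partition: reset the streak */
        rc->hit_count = 1;
    }

    return part;
}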

> I've taken a go at rewriting this, from scratch, into what I think it
> should look like. I then looked at what I came up with and decided
> the logic for finding partitions should all be kept in a single
> function. That way there's much less chance of someone forgetting to
> update the double-checking logic during cache hits when they update
> the logic for finding partitions without the cache.
>
> The 0001 patch is my original attempt. I then rewrote it and came up
> with 0002 (applies on top of 0001).

Thanks for these patches. I've been reading and can't really find
anything to complain about at a high level.

> After writing a benchmark script, I noticed that the performance of
> 0002 was quite a bit worse than 0001. I noticed that the benchmark
> where the partition changes each time got much worse with 0002. I can
> only assume that's due to the increased code size, so I played around
> with likely() and unlikely() to see if I could use those to shift the
> code layout around in such a way as to make 0002 faster. Surprisingly,
> using likely() for the cache hit path made it faster. I'd have assumed
> it would be unlikely() that would work.

Hmm, I too would have thought that unlikely() on that condition, not
likely(), would have helped the unordered case more.
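
For reference, likely()/unlikely() are thin wrappers around the
compiler's branch-prediction hint; PostgreSQL's own versions in c.h
are roughly along these lines, and the usage below is illustrative
rather than the actual ExecFindPartition() code:

#ifdef __GNUC__
#define likely(x)   __builtin_expect((x) != 0, 1)
#define unlikely(x) __builtin_expect((x) != 0, 0)
#else
#define likely(x)   (x)
#define unlikely(x) (x)
#endif

static int
find_with_cache(int cached_part, int cache_is_valid)
{
    /*
     * Marking the cache-hit branch as the expected one lets the
     * compiler lay it out on the straight-line (fall-through) path.
     */
    if (likely(cache_is_valid))
        return cached_part;   /* fast path: reuse the cached partition */

    return -1;                /* caller falls back to the full search */
}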

> cache_partition_bench.png shows the results. I tested with master @
> a5f9f1b88. The "Amit" column is your v10 patch.
> copybench.sh is the script I used to run the benchmarks. This tries
> all 3 partitioning strategies and performs 2 COPY FROMs, one with the
> rows arriving in partition order and another where the next row always
> goes into a different partition. I'm expecting to see the "ordered"
> case get better for LIST and RANGE partitions and the "unordered" case
> not to get any slower.
>
> With all of the attached patches applied, it does seem like I've
> managed to slightly speed up all of the unordered cases.
> This might be noise, but I did manage to remove some redundant code
> that needlessly checked if the HASH partitioned table had a DEFAULT
> partition, which it cannot. This may account for some of the increase
> in performance.
>
> I do need to stare at the patch a bit more before I'm confident that
> it's correct. I just wanted to share it before I go and do that.

The patch looks good to me. I thought a bit about whether the cache
fields in PartitionDesc can ever be "wrong", for example, the index
values going out of bounds after a partition is DETACHed. Even though
there are some PartitionDesc-preserving cases in
RelationClearRelation(), I don't think a preserved PartitionDesc would
ever contain a wrong value.

Here are some comments.

PartitionBoundInfo boundinfo; /* collection of partition bounds */
+ int last_found_datum_index; /* Index into the owning
+ * PartitionBoundInfo's datum array
+ * for the last found partition */

What does "owning PartitionBoundInfo's" mean? Maybe the "owning" is
unnecessary?

+ int last_found_part_index; /* Partition index of the last found
+ * partition or -1 if none have been
+ * found yet or if we've failed to
+ * find one */

-1 if none *has* been...?

+ int last_found_count; /* Number of times in a row have we found
+ * values to match the partition

Number of times in a row *that we have* found.

+ /*
+ * The Datum has changed. Zero the number of times we've
+ * found last_found_datum_index in a row.
+ */
+ partdesc->last_found_count = 0;

+ /* Zero the "winning streak" on the cache hit count */
+ partdesc->last_found_count = 0;

Might it be better for the two comments to say the same thing? Also,
I wonder which value you intend for resetting last_found_count: 0 or
1? I can see that the stanza at the end of the function sets it to 1
to start a new cycle.

+ /* Check if the value is equal to the lower bound */
+ cmpval = partition_rbound_datum_cmp(key->partsupfunc,
+ key->partcollation,
+ lastDatums,
+ kind,
+ values,
+ key->partnatts);

The function does not merely check for equality, so maybe better to
say the following instead:

Check if the value is >= the lower bound.

Perhaps, just like you've done in the LIST stanza, even mention that
the lower bound is the same as the last found one, like:

Check if the value >= the last found lower bound.

And likewise, change the nearby comment that says this:

+ /* Check if the value is below the upper bound */

to say:

Now check if the value is below the upper bound corresponding to the
last found lower bound.
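
To spell out what that check amounts to, here is a simplified
standalone illustration, with plain ints standing in for the Datum
arrays and the comparison support functions that
partition_rbound_datum_cmp() uses, and ignoring the default partition
and NULLs:

#include <stdbool.h>

static bool
range_cache_hit(const int *bounds, int nbounds,
                int last_datum_index, int value)
{
    /* The value must be >= the last found lower bound ... */
    if (value < bounds[last_datum_index])
        return false;

    /*
     * ... and strictly below the corresponding upper bound, i.e. the
     * next datum in the bounds array; when the cached bound is the
     * last one, the range is unbounded above.
     */
    if (last_datum_index + 1 < nbounds &&
        value >= bounds[last_datum_index + 1])
        return false;

    return true;
}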

+ * No caching of partitions is done when the last found partition is th

the

+ * Calling this function can be quite expensive for LIST and RANGE partitioned
+ * tables have many partitions.

having many partitions

Many of the use cases for LIST and RANGE
+ * partitioned tables mean that the same partition is likely to be found in

mean -> are such that

we record the partition index we've found in the
+ * PartitionDesc

we record the partition index we've found *for given values* in the
PartitionDesc

--
Thanks, Amit Langote
EDB: http://www.enterprisedb.com
