Re: Skip partition tuple routing with constant partition key

From: Amit Langote <amitlangote09(at)gmail(dot)com>
To: David Rowley <dgrowleyml(at)gmail(dot)com>
Cc: Zhihong Yu <zyu(at)yugabyte(dot)com>, Greg Stark <stark(at)mit(dot)edu>, "houzj(dot)fnst(at)fujitsu(dot)com" <houzj(dot)fnst(at)fujitsu(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, "tsunakawa(dot)takay(at)fujitsu(dot)com" <tsunakawa(dot)takay(at)fujitsu(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
Subject: Re: Skip partition tuple routing with constant partition key
Date: 2022-07-27 12:50:23
Message-ID: CA+HiwqFH8EPbJAfmMmg+Y91Wc73VFAFpk=zjohBMWu3neGRCcA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, Jul 27, 2022 at 7:28 AM David Rowley <dgrowleyml(at)gmail(dot)com> wrote:
> On Sat, 23 Jul 2022 at 01:23, Amit Langote <amitlangote09(at)gmail(dot)com> wrote:
> > + /*
> > + * The Datum has changed. Zero the number of times we've
> > + * found last_found_datum_index in a row.
> > + */
> > + partdesc->last_found_count = 0;
> >
> > + /* Zero the "winning streak" on the cache hit count */
> > + partdesc->last_found_count = 0;
> >
> > Might it be better for the two comments to say the same thing? Also,
> > I wonder which one do you intend as the resetting of last_found_count:
> > setting it to 0 or 1? I can see that the stanza at the end of the
> > function sets to 1 to start a new cycle.
>
> I think I've addressed all of your comments. The above one in
> particular caused me to make some larger changes.
>
> The reason I was zeroing the last_found_count in LIST partitioned
> tables when the Datum was not equal to the previous found Datum was
> due to the fact that the code at the end of the function was only
> checking the partition indexes matched rather than the bound_offset vs
> last_found_datum_index. The reason I wanted to zero this was that if
> you had a partition FOR VALUES IN(1,2), and you received rows with
> values alternating between 1 and 2 then we'd match to the same
> partition each time, however the equality test with the current
> 'values' and the Datum at last_found_datum_index would have been false
> each time. If we didn't zero the last_found_count we'd have kept
> using the cache path even though the Datum and last Datum wouldn't
> have been equal each time. That would have resulted in always doing
> the cache check and failing, then doing the binary search anyway.

Thanks for the explanation. So, in a way the caching scheme works for
LIST partitioning only if the same value appears consecutively in the
input set, whereas it does not for *a set of* values belonging to the
same partition appearing consecutively. Maybe that's a reasonable
restriction for now.

> I've now changed the code so that instead of checking the last found
> partition is the same as the last one, I'm now checking if
> bound_offset is the same as last_found_datum_index. This will be
> false in the "values alternating between 1 and 2" case from above.
> This caused me to have to change how the caching works for LIST
> partitions with a NULL partition which is receiving NULL values. I've
> coded things now to just skip the cache for that case. Finding the
> correct LIST partition for a NULL value is cheap and no need to cache
> that. I've also moved all the code which updates the cache fields to
> the bottom of get_partition_for_tuple(). I'm only expecting to do that
> when bound_offset is set by the lookup code in the switch statement.
> Any paths, e.g. HASH partitioning lookup and LIST or RANGE with NULL
> values shouldn't reach the code which updates the partition fields.
> I've added an Assert(bound_offset >= 0) to ensure that stays true.

Looks good.

> There's probably a bit more to optimise here too, but not much. I
> don't think the partdesc->last_found_part_index = -1; is needed when
> we're in the code block that does return boundinfo->default_index;
> However, that only might very slightly speedup the case when we're
> inserting continuously into the DEFAULT partition. That code path is
> also used when we fail to find any matching partition. That's not one
> we need to worry about making go faster.

So this is about:

if (part_index < 0)
- part_index = boundinfo->default_index;
+ {
+ /*
+ * Since we don't do caching for the default partition or failed
+ * lookups, we'll just wipe the cache fields back to their initial
+ * values. The count becomes 0 rather than 1 as 1 means it's the
+ * first time we've found a partition we're recording for the cache.
+ */
+ partdesc->last_found_datum_index = -1;
+ partdesc->last_found_part_index = -1;
+ partdesc->last_found_count = 0;
+
+ return boundinfo->default_index;
+ }

I wonder why not to leave the cache untouched in this case? It's
possible that erratic rows only rarely occur in the input sets.

> I also ran the benchmarks again and saw that most of the use of
> likely() and unlikely() no longer did what I found them to do earlier.
> So the weirdness we saw there most likely was just down to random code
> layout changes. In this patch, I just dropped the use of either of
> those two macros.

Ah, using either seems to be trying to fit the code one or the other
pattern in the input set anyway, so seems fine to keep them out for
now.

Some minor comments:

+ * The number of times the same partition must be found in a row before we
+ * switch from a search for the given values to just checking if the values

How about:

switch from using a binary search for the given values to...

Should the comment update above get_partition_for_tuple() mention
something like the cached path is basically O(1) and the non-cache
path O (log N) as I can see in comments in some other modules, like
pairingheap.c?

+ * so bump the count by one. If all goes well we'll eventually reach

Maybe a comma is needed after "well", because I got tricked into
thinking the "well" is duplicated.

+ * PARTITION_CACHED_FIND_THRESHOLD and we'll try the cache path next time

"we'll" sounds redundant with the one in the previous line.

+ * found yet, the last found was the DEFAULT partition, or there was no

Adding "if" to both sentence fragments might make this sound better.

--
Thanks, Amit Langote
EDB: http://www.enterprisedb.com

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Aleksander Alekseev 2022-07-27 13:07:46 Re: Trying to add more tests to gistbuild.c
Previous Message David Geier 2022-07-27 12:46:24 Re: [PoC] Reducing planning time on tables with many indexes