Re: [PoC] Reducing planning time when tables have many partitions

From: David Rowley <dgrowleyml(at)gmail(dot)com>
To: Yuya Watari <watari(dot)yuya(at)gmail(dot)com>
Cc: Andrey Lepikhov <a(dot)lepikhov(at)postgrespro(dot)ru>, Thom Brown <thom(at)linux(dot)com>, Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>, Zhang Mingli <zmlpostgres(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: [PoC] Reducing planning time when tables have many partitions
Date: 2022-12-04 00:34:44
Message-ID: CAApHDvoM-6VukeCpwzO9ddcVEwn4-nc4Cyorgs4-SAJ83zOBcA@mail.gmail.com
Lists: pgsql-hackers

On Tue, 29 Nov 2022 at 21:59, Yuya Watari <watari(dot)yuya(at)gmail(dot)com> wrote:
> Thank you for testing the patch with an actual query. This speedup is
> very impressive. When I used an original query with 1024 partitions,
> its planning time was about 200ms. Given that each partition is also
> partitioned in your workload, I think the result of 1415ms is
> reasonable.

I was looking again at the v9-0001 patch and I think we can do a
little better when building the Bitmapset of matching EMs. For
example, in the v9 patch, the code for get_ecmember_indexes_strict()
is doing:

+ if (!with_children)
+ matching_ems = bms_copy(ec->ec_nonchild_indexes);
+ else
+ matching_ems = bms_copy(ec->ec_member_indexes);
+
+ i = -1;
+ while ((i = bms_next_member(relids, i)) >= 0)
+ {
+ RelOptInfo *rel = root->simple_rel_array[i];
+
+ matching_ems = bms_int_members(matching_ems, rel->eclass_member_indexes);
+ }

It seems reasonable that if there are a large number of partitions
then ec_member_indexes will contain a large number of bitmapwords.
When we do bms_int_members() on that, we're probably going to end up
with a bunch of trailing zero words in the set. In the v10 patch,
I've changed this to become:

+ int i = bms_next_member(relids, -1);
+
+ if (i >= 0)
+ {
+ RelOptInfo *rel = root->simple_rel_array[i];
+
+ /*
+ * bms_intersect to the first relation to try to keep the resulting
+ * Bitmapset as small as possible. This saves having to make a
+ * complete bms_copy() of one of them. One may contain significantly
+ * more words than the other.
+ */
+ if (!with_children)
+ matching_ems = bms_intersect(rel->eclass_member_indexes,
+ ec->ec_nonchild_indexes);
+ else
+ matching_ems = bms_intersect(rel->eclass_member_indexes,
+ ec->ec_member_indexes);
+
+ while ((i = bms_next_member(relids, i)) >= 0)
+ {
+ rel = root->simple_rel_array[i];
+ matching_ems = bms_int_members(matching_ems,
+ rel->eclass_member_indexes);
+ }
+ }

so, effectively, we first bms_intersect to the first member of relids
before masking out the bits for the remaining ones. This should mean
we'll have a Bitmapset with fewer words in many complex planning
problems. There's no longer the dilemma of having to decide whether we
should start with the RelOptInfo's eclass_member_indexes or the
EquivalenceClass's member indexes. When using bms_int_members(), we
really want to start with the smallest of those so that we get the
smallest resulting set. With bms_intersect(), it will always make a
copy of the smallest set. v10 does that instead of bms_copy()ing the
EquivalenceClass's member Bitmapset.

I also wondered how much we're losing to the fact that
bms_int_members() zeros the trailing words and does not trim the
Bitmapset down.

The problem there is two-fold:
1) we have to zero the trailing words on the left input. That'll
pollute the CPU cache a bit, as it may have to fetch a bunch of extra
cache lines; and
2) subsequent bms_int_members() calls done afterwards may have to mask
out additional words. If we can make the shortest input really short,
then subsequent bms_int_members() calls are going to be very fast.

You might argue there that setting nwords to the shortest length may
cause us to have to repalloc the Bitmapset if we need to later add
more members again, but if you look at the repalloc() code, it's
effectively a no-op when the allocated size >= the requested size, so
repalloc() should be very fast in this case. So, worst case, there's
an additional "no-op" repalloc() (which should be very fast) followed
by maybe a bms_add_members() which has to zero the words instead of
bms_int_members(). I changed this in the v10-0002 patch. I'm not sure
if we should do this or not.

I also changed v10-0001 so that we still store the EquivalenceClass's
members list. There were a few places where the code just wanted to
get the first member and having to look at the Bitmapset index and
fetch the first match from PlannerInfo seemed convoluted. If the
query is simple, it seems like it's not going to be very expensive to
add a few EquivalenceMembers to this list. When planning more complex
problems, there's probably enough other extra overhead that we're
unlikely to notice the extra lappend()s. This also allows v10-0003 to
work, see below.

In v10-0003, I experimented with the iterator concept that I mentioned
earlier. Since v10-0001 is now storing the EquivalenceMember list in
EquivalenceClass again, it's now quite simple to have the iterator
decide if it should be scanning the index or doing a loop over all
members to find the ones matching the search. We can make this
decision based on list_length(ec->ec_members). This should be a more
reliable check than checking root->simple_rel_array_size as we could
still have classes with just a few members even when there's a large
number of rels in simple_rel_array. I was hoping that v10-0003 would
allow us to maintain the same planner performance for simple queries.
It just does not seem to change the performance much. Perhaps it's not
worth the complexity if there are no performance benefits. It probably
needs more performance testing than what I've done to know if it helps
or hinders, however.

Overall, I'm not quite sure if this is any faster than your v9 patch.
I think more performance testing needs to be done. I think
v10-0001 + v10-0002 is faster than v9-0001, but perhaps the changes
you've made in v9-0002 and v9-0003 are worth redoing. I didn't test. I
was hoping to keep the logic about which method to use to find the
members in the iterator code and not litter it around the tree.

I did run the test you mentioned in [1] and I got:

$ echo Master @ 29452de73 && ./partbench.sh | grep -E "^(Testing|latency)"
Master @ 29452de73
Testing with 2 partitions...
latency average = 0.231 ms
Testing with 4 partitions...
latency average = 0.303 ms
Testing with 8 partitions...
latency average = 0.454 ms
Testing with 16 partitions...
latency average = 0.777 ms
Testing with 32 partitions...
latency average = 1.576 ms
Testing with 64 partitions...
latency average = 3.574 ms
Testing with 128 partitions...
latency average = 9.504 ms
Testing with 256 partitions...
latency average = 37.321 ms
Testing with 512 partitions...
latency average = 171.660 ms
Testing with 1024 partitions...
latency average = 1021.990 ms

$ echo Master + v10-0001 && ./partbench.sh | grep -E "^(Testing|latency)"
Master + v10-0001
Testing with 2 partitions...
latency average = 0.239 ms
Testing with 4 partitions...
latency average = 0.315 ms
Testing with 8 partitions...
latency average = 0.463 ms
Testing with 16 partitions...
latency average = 0.757 ms
Testing with 32 partitions...
latency average = 1.481 ms
Testing with 64 partitions...
latency average = 2.563 ms
Testing with 128 partitions...
latency average = 5.618 ms
Testing with 256 partitions...
latency average = 16.229 ms
Testing with 512 partitions...
latency average = 38.855 ms
Testing with 1024 partitions...
latency average = 85.705 ms

$ echo Master + v10-0001 + v10-0002 && ./partbench.sh | grep -E
"^(Testing|latency)"
Master + v10-0001 + v10-0002
Testing with 2 partitions...
latency average = 0.241 ms
Testing with 4 partitions...
latency average = 0.312 ms
Testing with 8 partitions...
latency average = 0.459 ms
Testing with 16 partitions...
latency average = 0.755 ms
Testing with 32 partitions...
latency average = 1.464 ms
Testing with 64 partitions...
latency average = 2.580 ms
Testing with 128 partitions...
latency average = 5.652 ms
Testing with 256 partitions...
latency average = 16.464 ms
Testing with 512 partitions...
latency average = 37.674 ms
Testing with 1024 partitions...
latency average = 84.094 ms

$ echo Master + v10-0001 + v10-0002 + v10-0003 && ./partbench.sh |
grep -E "^(Testing|latency)"
Master + v10-0001 + v10-0002 + v10-0003
Testing with 2 partitions...
latency average = 0.240 ms
Testing with 4 partitions...
latency average = 0.318 ms
Testing with 8 partitions...
latency average = 0.465 ms
Testing with 16 partitions...
latency average = 0.763 ms
Testing with 32 partitions...
latency average = 1.486 ms
Testing with 64 partitions...
latency average = 2.858 ms
Testing with 128 partitions...
latency average = 5.764 ms
Testing with 256 partitions...
latency average = 16.995 ms
Testing with 512 partitions...
latency average = 38.012 ms
Testing with 1024 partitions...
latency average = 88.098 ms

$ echo Master + v9-* && ./partbench.sh | grep -E "^(Testing|latency)"
Master + v9-*
Testing with 2 partitions...
latency average = 0.237 ms
Testing with 4 partitions...
latency average = 0.313 ms
Testing with 8 partitions...
latency average = 0.460 ms
Testing with 16 partitions...
latency average = 0.780 ms
Testing with 32 partitions...
latency average = 1.468 ms
Testing with 64 partitions...
latency average = 2.701 ms
Testing with 128 partitions...
latency average = 5.275 ms
Testing with 256 partitions...
latency average = 17.208 ms
Testing with 512 partitions...
latency average = 37.183 ms
Testing with 1024 partitions...
latency average = 90.595 ms

David

[1] https://postgr.es/m/CAJ2pMkZNCgoUKSE%2B_5LthD%2BKbXKvq6h2hQN8Esxpxd%2Bcxmgomg%40mail.gmail.com

Attachment Content-Type Size
v10-0001-Add-Bitmapset-indexes-for-faster-lookup-of-Equiv.patch text/plain 82.3 KB
v10-0003-Add-iterators-for-looping-over-EquivalenceMember.patch text/plain 22.2 KB
v10-0002-Adjust-bms_int_members-so-that-it-shortens-the-l.patch text/plain 2.5 KB
