Re: [PoC] Reducing planning time when tables have many partitions

From: Yuya Watari <watari(dot)yuya(at)gmail(dot)com>
To: David Rowley <dgrowleyml(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andrey Lepikhov <a(dot)lepikhov(at)postgrespro(dot)ru>, PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: [PoC] Reducing planning time when tables have many partitions
Date: 2022-08-26 00:39:32
Message-ID: CAJ2pMka2PBXNNzUfe0-ksFsxVN+gmfKq7aGQ5v35TcpjFG3Ggg@mail.gmail.com
Lists: pgsql-hackers

Dear David,

I really appreciate your reply and your revisions to the patch. The
performance improvements are quite impressive, and I believe they will
help PostgreSQL users. Thank you again.

> The 2 partition does regress slightly. There might be a few
> things we can do about that

I tried to solve this regression. From here on, I will refer to the
patch you sent on August 16th as the v3 patch, and to the patch
attached to this email as the v4 patch, which I will discuss later.

Additionally, I will give the queries names:
* Query A: the query we have been using in previous emails, which
joins the students, scores, and gpas tables.
* Query B: the query attached to this email.

Query B is as follows:

===
SELECT *
FROM testtable_1, testtable_2, testtable_3, testtable_4, testtable_5,
testtable_6, testtable_7, testtable_8
WHERE testtable_1.x = testtable_2.x AND testtable_1.x = testtable_3.x
AND testtable_1.x = testtable_4.x AND testtable_1.x = testtable_5.x
AND testtable_1.x = testtable_6.x AND testtable_1.x = testtable_7.x
AND testtable_1.x = testtable_8.x;
===

Query A joins three tables, whereas Query B joins eight tables. Since
EquivalenceClass is used when handling chained join conditions, I
thought queries joining many tables, such as Query B, would have
greater performance impacts.
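To illustrate, the way chained equalities collapse into a single class
can be sketched with a toy union-find in C. This is only an
illustration of the concept, not PostgreSQL's actual EquivalenceClass
machinery, which works on expression members rather than integers; all
names here are hypothetical:

```c
/* Toy union-find: each of the 8 tables' x columns starts in its own set,
 * and each equality condition merges two sets. */
#define NREL 8

static int parent[NREL];

/* put every column in its own singleton set */
static void
init(void)
{
    for (int i = 0; i < NREL; i++)
        parent[i] = i;
}

/* find the set representative, with path halving for speed */
static int
find(int i)
{
    while (parent[i] != i)
        i = parent[i] = parent[parent[i]];
    return i;
}

/* an equality condition a = b merges the two sets */
static void
merge(int a, int b)
{
    parent[find(a)] = find(b);
}
```

Applying Query B's seven join conditions as merge(0, k) for k = 1..7
leaves every column with the same representative, i.e. one equivalence
class covering all eight tables.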

I have investigated the v3 patch with these queries. As a result, I
did not observe any regressions in Query A in my environment. However,
the v3 patch showed significant degradation in Query B.

The following table and Figures 1 and 2 show the results. The v3
patch resulted in a regression of 8.7% for one partition and 4.8% for
two partitions. Figure 2 shows the distribution of planning times for
the 1-partition case, indicating that the 8.7% regression is not
measurement noise.

Table 1: Planning time of Query B
(n: number of partitions; times in milliseconds)
----------------------------------------------------------------
   n |   Master |      v3 |      v4 | Master / v3 | Master / v4
----------------------------------------------------------------
   1 |   54.926 |  60.178 |  55.275 |       91.3% |       99.4%
   2 |   53.853 |  56.554 |  53.519 |       95.2% |      100.6%
   4 |   57.115 |  57.829 |  55.648 |       98.8% |      102.6%
   8 |   64.208 |  60.945 |  58.025 |      105.4% |      110.7%
  16 |   79.818 |  65.526 |  63.365 |      121.8% |      126.0%
  32 |  136.981 |  77.813 |  76.526 |      176.0% |      179.0%
  64 |  371.991 | 108.058 | 110.202 |      344.2% |      337.6%
 128 | 1449.063 | 173.326 | 181.302 |      836.0% |      799.3%
 256 | 6245.577 | 333.480 | 354.961 |     1872.8% |     1759.5%
----------------------------------------------------------------

This performance degradation is due to the heavy processing of the
get_ec*_indexes*() functions. These functions are the core of the
optimization we are working on in this thread, but they are relatively
expensive when the number of partitions is small.

I noticed that these functions were called repeatedly with the same
arguments. While planning Query B with one partition,
get_ec_source_indexes_strict() was called 2087 times with exactly the
same parameters, and such repeated calls occurred many times within a
single query.

To address this problem, the v4 patch introduces a caching mechanism:
it caches each Bitmapset once it has been computed, so subsequent
calls only need to read the cached value instead of repeating the
computation. Of course, the caching itself must be cheap. A hash table
would be a simple way to accomplish this, but its lookup overhead is
unacceptable in the current case, where even microsecond-level
degradation is a problem. Therefore, my patch adopts another approach,
which I will explain using the following function as an example.

===
Bitmapset *get_ecmember_indexes(PlannerInfo *root, EquivalenceClass *ec,
                                Relids relids, bool with_children,
                                bool with_norel_members);
===

My idea is to cache the returned Bitmapset "in" the Relids: if the
Relids held the result Bitmapset, we could reach it quickly through a
pointer. Of course, this description is not literally accurate; Relids
is just an alias for Bitmapset, so we cannot change its layout.

I will describe the precise mechanism. In the v4 patch, I changed the
signature of the get_ecmember_indexes() function as follows.

===
Bitmapset *get_ecmember_indexes(PlannerInfo *root, EquivalenceClass *ec,
                                Relids relids, bool with_children,
                                bool with_norel_members, ECIndexCache *cache);
===

ECIndexCache is storage for cached return values, and each
ECIndexCache has a one-to-one relationship with a Relids. This
relationship is achieved by placing the ECIndexCache alongside the
Relids: for example, the ECIndexCache corresponding to a RelOptInfo's
relids lives in that same RelOptInfo, and when calling
get_ecmember_indexes() for a RelOptInfo, we pass its ECIndexCache
along. On the other hand, since Relids appear in many places, it is
sometimes difficult to find a corresponding ECIndexCache; in such
cases, we give up on caching and pass NULL.

In addition, one ECIndexCache maps to only one EquivalenceClass: it
caches results only for the first EquivalenceClass it encounters and
does not cache for any other EC.
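The overall shape of this mechanism might be sketched as follows. This
is a self-contained, simplified illustration, not the actual patch:
Bitmapset, EquivalenceClass, ECIndexCache, and get_ecmember_indexes()
are reduced to hypothetical stand-ins so the caching logic can be
shown in isolation:

```c
#include <stdlib.h>

/* simplified stand-ins for the real planner types */
typedef struct Bitmapset { unsigned long words[1]; } Bitmapset;
typedef Bitmapset *Relids;
typedef struct EquivalenceClass { int id; } EquivalenceClass;

/* lives alongside a Relids, e.g. inside a RelOptInfo */
typedef struct ECIndexCache
{
    const EquivalenceClass *ec;  /* EC this cache is bound to, or NULL */
    Bitmapset *member_indexes;   /* cached result, or NULL if not computed */
} ECIndexCache;

static int ncomputations = 0;

/* stand-in for the expensive scan that builds the index bitmapset */
static Bitmapset *
compute_member_indexes(const EquivalenceClass *ec, Relids relids)
{
    (void) ec;
    (void) relids;
    ncomputations++;
    return calloc(1, sizeof(Bitmapset));
}

static Bitmapset *
get_ecmember_indexes(const EquivalenceClass *ec, Relids relids,
                     ECIndexCache *cache)
{
    Bitmapset *result;

    /* fast path: the cache is bound to this EC and already filled */
    if (cache && cache->ec == ec && cache->member_indexes)
        return cache->member_indexes;

    result = compute_member_indexes(ec, relids);

    /* bind the cache to the first EC it sees; later ECs are not cached */
    if (cache && cache->ec == NULL)
        cache->ec = ec;
    if (cache && cache->ec == ec)
        cache->member_indexes = result;

    return result;
}
```

A second call with the same EC and cache is then served from the
cache; calls with a different EC, or with a NULL cache, recompute the
result every time, which matches the "first EC wins" behavior
described above.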

My method abandons full caching to prevent overhead. However, it
overcame the regression problem for Query B. As can be seen from
Figure 2, the regression with the v4 patch is either non-existent or
negligible. Furthermore, the v4 patch is faster than the v3 patch when
the number of partitions is 32 or less.

In addition to Query B, the results for Query A are shown in Figure 3,
where I cannot detect any regression. Please note that these results
were obtained on my machine and may differ in other environments.

However, when the number of partitions was relatively large, my patch
was slightly slower than the v3 patch. This may be due to overly
frequent memory allocation. ECIndexCache is a large struct containing
13 pointers, and in the current implementation it is embedded in
commonly used structs such as RelOptInfo, so it is allocated even if
it is never used. With 256 partitions in Query B, 88509 ECIndexCache
instances were allocated but only 2295 were actually used, meaning
95.4% were wasted. I think on-demand allocation would solve this
problem. Similar waste could also occur with other workloads,
including OLTP, so I am going to try this approach soon.
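The on-demand idea could look roughly like this; the names and the
13-pointer size are taken from the description above, but the code
itself is a hypothetical sketch, not part of the patch:

```c
#include <stdlib.h>

/* large cache struct: ~13 pointers, per the measurements above */
typedef struct ECIndexCache { void *slots[13]; } ECIndexCache;

typedef struct RelOptInfo
{
    /* ... other planner fields ... */
    ECIndexCache *eclass_cache;  /* NULL until the cache is first needed */
} RelOptInfo;

/* return the rel's cache, allocating it lazily on first access */
static ECIndexCache *
get_eclass_cache(RelOptInfo *rel)
{
    if (rel->eclass_cache == NULL)
        rel->eclass_cache = calloc(1, sizeof(ECIndexCache));
    return rel->eclass_cache;
}
```

RelOptInfos whose caches are never requested would then pay only for
one NULL pointer instead of a full 13-pointer struct.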

I apologize for not yet commenting on the rest of your reply. I will
continue to consider those points.

--
Best regards,
Yuya Watari

Attachment Content-Type Size
v4-0001-Apply-eclass_member_speedup_v3.patch.patch application/octet-stream 88.5 KB
v4-0002-Implement-ECIndexCache.patch application/octet-stream 26.5 KB
figure-1.png image/png 133.1 KB
figure-2.png image/png 238.1 KB
figure-3.png image/png 203.0 KB
query-b-create-tables.sql application/octet-stream 913 bytes
query-b-query.sql application/octet-stream 523 bytes
