[Patch] Optimize dropping of relation buffers using dlist

From: "k(dot)jamison(at)fujitsu(dot)com" <k(dot)jamison(at)fujitsu(dot)com>
To: "'pgsql-hackers(at)postgresql(dot)org'" <pgsql-hackers(at)postgresql(dot)org>
Subject: [Patch] Optimize dropping of relation buffers using dlist
Date: 2019-10-28 08:13:19
Message-ID: OSBPR01MB3207DCA7EC725FDD661B3EDAEF660@OSBPR01MB3207.jpnprd01.prod.outlook.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

Currently, we need to scan the WHOLE shared buffers when VACUUM
truncated off any empty pages at end of transaction or when relation
is TRUNCATEd.
As for our customer case, we periodically truncate thousands of tables,
and it's possible to TRUNCATE single table per transaction. This can be
problematic later on during recovery which could take longer, especially
when a sudden failover happens after those TRUNCATEs and when we
have to scan a large-sized shared buffer. In the performance test below,
it took almost 12.5 minutes for recovery to complete for 100GB shared
buffers. But we want to keep failover very short (within 10 seconds).

Previously, I made an improvement in speeding the truncates of relation
forks from 3 scans to one scan.[1] This time, the aim of this patch is
to further speedup the invalidation of pages, by linking the cached pages
of the target relation in a doubly-linked list and just traversing it
instead of scanning the whole shared buffers. In DropRelFileNodeBuffers,
we just get the number of target buffers to invalidate for the relation.
There is a significant win in this patch, because we were able to
complete failover and recover in 3 seconds more or less.

I performed similar tests to what I did in the speedup truncates of
relations forks.[1][2] However, this time using 100GB shared_buffers.

[Machine spec used in testing]
Intel(R) Xeon(R) CPU E5-2667 v3 @ 3.20GHz
CPU: 16, Number of cores per socket: 8
RHEL6.5, Memory: 256GB++

[Test]
1. (Master) Create table (ex. 10,000 tables). Insert data to tables.
2. (Master) DELETE FROM TABLE (ex. all rows of 10,000 tables)
(Standby) To test with failover, pause the WAL replay on standby server.
(SELECT pg_wal_replay_pause();)
3. (M) psql -c "\timing on" (measures total execution of SQL queries)
4. (M) VACUUM (whole db)
5. (M) Stop primary server. pg_ctl stop -D $PGDATA -w
6. (S) Resume wal replay and promote standby.[2]

[Results]

A. HEAD (origin/master branch)
A1. Vacuum execution on Primary server
Time: 730932.408 ms (12:10.932) ~12min 11s
A2. Vacuum + Failover (WAL Recovery on Standby)
waiting for server to promote...........................
.................................... stopped waiting
pg_ctl: server did not promote in time
2019/10/25_12:13:09.692─┐
2019/10/25_12:25:43.576─┘
-->Total: 12min34s

B. PATCH
B1. Vacuum execution on Primary/Master
Time: 6.518333s = 6518.333 ms
B2. Vacuum + Failover (WAL Recovery on Standby)
2019/10/25_14:17:21.822
waiting for server to promote...... done
server promoted
2019/10/25_14:17:24.827
2019/10/25_14:17:24.833
-->Total: 3.011s

[Other Notes]
Maybe one disadvantage is that we can have a variable number of
relations, and allocated the same number of relation structures as
the size of shared buffers. I tried to reduce the use of memory when
doing hash table lookup operation by having a fixed size array (100)
or threshold of target buffers to invalidate.
When doing CachedBufLookup() to scan the count of each buffer in the
dlist, I made sure to reduce the number of scans (2x at most).
First, we scan the dlist of cached buffers of relations.
Then store the target buffers in buf_id_array. Non-target buffers
would be removed from dlist but added to temporary dlist.
After reaching end of main dlist, we append the temporary dlist to
tail of main dlist.
I also performed pgbench buffer test, and this patch did not cause
overhead to normal DB access performance.

Another one that I'd need feedback of is the use of new dlist operations
for this cached buffer list. I did not use in this patch the existing
Postgres dlist architecture (ilist.h) because I want to save memory space
as much as possible especially when NBuffers become large. Both dlist_node
& dlist_head are 16 bytes. OTOH, two int pointers for this patch is 8 bytes.

Hope to hear your feedback and comments.

Thanks in advance,
Kirk Jamison

[1] https://www.postgresql.org/message-id/flat/D09B13F772D2274BB348A310EE3027C64E2067%40g01jpexmbkw24
[2] https://www.postgresql.org/message-id/D09B13F772D2274BB348A310EE3027C6502672%40g01jpexmbkw24

Attachment Content-Type Size
v1-Optimize-dropping-of-relation-buffers-using-dlist.patch application/octet-stream 20.2 KB

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Amit Langote 2019-10-28 08:16:54 Re: [PATCH] Do not use StdRdOptions in Access Methods
Previous Message Michael Paquier 2019-10-28 07:55:46 Re: update ALTER TABLE with ATTACH PARTITION lock mode