Re: sequences vs. synchronous replication

From: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
To: Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: sequences vs. synchronous replication
Date: 2022-01-11 16:07:37
Message-ID: 9fb080d5-f509-cca4-1353-fd9da85db1d2@enterprisedb.com
Lists: pgsql-hackers

On 12/22/21 18:50, Fujii Masao wrote:
>
>
> On 2021/12/22 21:11, Tomas Vondra wrote:
>> Interesting idea, but I think it has a couple of issues :-(
>
> Thanks for the review!
>
>> 1) We'd need to know the LSN of the last WAL record for any given
>> sequence, and we'd need to communicate that between backends somehow.
>> Which seems rather tricky to do without affecting performance.
>
> How about using the page lsn for the sequence? nextval_internal()
> already uses that to check whether it's less than or equal to checkpoint
> redo location.
>

I explored the idea of using page LSN a bit, and there's some good and
bad news.

The patch from 22/12 simply checks whether the change would have to wait
for a sync replica, and if so it WAL-logs the sequence increment.
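In rough pseudo-C, the idea looks like this (a hypothetical
reconstruction for illustration, not the actual diff - the exact
signature of SyncRepNeedsWait() is an assumption on my side):

    /* in nextval_internal(), when deciding whether to WAL-log:
     * force a WAL record whenever a commit at the current page LSN
     * would have to wait for the sync replica
     */
    if (SyncRepNeedsWait(PageGetLSN(page)))
        logit = true;

There are a couple of problems with this, unfortunately: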

1) Imagine a high-concurrency environment, with a lot of sessions doing
nextval('s') at the same time. One session WAL-logs the increment, but
before that WAL gets flushed / sent to the replica, another session
calls nextval. SyncRepNeedsWait() says true, so it WAL-logs it again,
moving the page LSN forward. And so on. So in high-concurrency
environments this simply makes matters worse - it causes an avalanche
of WAL writes instead of saving anything.

(You don't even need multiple sessions - a single session calling
nextval would have the same issue, WAL-logging every call.)

2) It assumes there is a synchronous replica, but that's wrong. That's
partially my fault, because I formulated this issue as if it were just
about sync replicas, but that's only one symptom - it applies even to
systems without any replicas.

Imagine you do

BEGIN;
SELECT nextval('s') FROM generate_series(1,40);
ROLLBACK;

SELECT nextval('s');

and then you murder the server by "kill -9". If you restart it and do a
nextval('s') again, the value will likely go backwards, generating
duplicate values :-( That's because nothing forced the WAL covering
those increments to be flushed, so after crash recovery the sequence
may revert to an older state.
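
The root cause is the pre-logging nextval_internal() already does: each
WAL record covers SEQ_LOG_VALS (32) future values, so calls that still
fit into the pre-logged range write no WAL at all - and nothing forces
the earlier record to be flushed. Simplified excerpt from sequence.c:

    /* simplified from nextval_internal() in sequence.c */
    logit = false;
    if (log < fetch)
    {
        /* ran out of pre-logged values, log SEQ_LOG_VALS (32) more */
        fetch = log = fetch + SEQ_LOG_VALS;
        logit = true;
    }
    else if (PageGetLSN(page) <= GetRedoRecPtr())
    {
        /* last sequence WAL record predates the latest checkpoint */
        fetch = log = fetch + SEQ_LOG_VALS;
        logit = true;
    }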

So I think this approach is not really an improvement over WAL-logging
every increment. But there's a better way, I think - we don't need to
generate WAL, we just need to ensure we wait for it to be flushed at
transaction end in RecordTransactionCommit().

That is, instead of generating more WAL, simply update XactLastRecEnd
and then ensure RecordTransactionCommit() flushes/waits etc. Attached is
a patch doing that - the changes in sequence.c are trivial, and the
changes in RecordTransactionCommit() simply ensure we flush/wait even
without an XID (this actually raises some additional questions that I'll
discuss in a separate message in this thread).
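
To make the shape concrete, here is roughly what the two sides of the
patch do (a sketch based on my description above, not the attached diff
verbatim):

    /* sequence.c, nextval_internal(): remember the LSN of the WAL
     * record covering this increment (the page LSN), so that commit
     * waits for it even if this call did not write any WAL itself
     */
    {
        XLogRecPtr  seq_lsn = PageGetLSN(page);

        if (seq_lsn > XactLastRecEnd)
            XactLastRecEnd = seq_lsn;
    }

    /* xact.c, RecordTransactionCommit(): when no XID was assigned (so
     * no commit record is written), flush up to XactLastRecEnd and
     * wait for sync rep, instead of merely requesting an async flush
     */
    if (!markXidCommitted && XactLastRecEnd != InvalidXLogRecPtr)
    {
        XLogFlush(XactLastRecEnd);
        SyncRepWaitForLSN(XactLastRecEnd, false);
    }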

I repeated the benchmark measurements with the nextval/insert workloads,
to compare this with the other patch (WAL-logging every increment). I
had to use a different machine, so the results are not directly
comparable to the numbers presented earlier.

On btrfs, it looks like this (the numbers next to nextval are the
sequence cache size, with 1 being the default). The log-all column is
the first patch, page-lsn the new patch using page LSN. The first three
columns are raw pgbench tps values, the last two compare each patch to
master.
client  test          master   log-all  page-lsn   log-all  page-lsn
---------------------------------------------------------------------
     1  insert           829       807       802       97%       97%
        nextval/1      16491       814     16465        5%      100%
        nextval/32     24487     16462     24632       67%      101%
        nextval/64     24516     24918     24671      102%      101%
        nextval/128    32337     33178     32863      103%      102%

client  test          master   log-all  page-lsn   log-all  page-lsn
---------------------------------------------------------------------
     4  insert          1577      1590      1546      101%       98%
        nextval/1      45607      1579     21220        3%       47%
        nextval/32     68453     49141     51170       72%       75%
        nextval/64     66928     65534     66408       98%       99%
        nextval/128    83502     81835     82576       98%       99%

The results seem clearly better, I think.

For "insert" there's no drop at all (same as before), because as soon as
a transaction generates any WAL, it has to flush/wait anyway.

And for "nextval" there's a drop, but only with 4 clients, and it's much
smaller (53% instead of 97%). And increasing the cache size eliminates
even that.

Out of curiosity I ran the tests on tmpfs too, which should show the
overhead not related to I/O. The results are similar:

client  test          master   log-all  page-lsn   log-all  page-lsn
---------------------------------------------------------------------
     1  insert         44033     43740     43215       99%       98%
        nextval/1      58640     48384     59243       83%      101%
        nextval/32     61089     60901     60830      100%      100%
        nextval/64     60412     61315     61550      101%      102%
        nextval/128    61436     61605     61503      100%      100%

client  test          master   log-all  page-lsn   log-all  page-lsn
---------------------------------------------------------------------
     4  insert         88212     85731     87350       97%       99%
        nextval/1     115059     90644    113541       79%       99%
        nextval/32    119765    118115    118511       99%       99%
        nextval/64    119717    119220    118410      100%       99%
        nextval/128   120258    119448    118826       99%       99%

Seems pretty nice, I guess. Even the original patch did pretty well
here (only about a 20% drop for nextval/1).

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment: sequences-page-lsn.patch (text/x-patch, 1.5 KB)
