Re: Syncrep and improving latency due to WAL throttling

From: Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Jakub Wartak <jakub(dot)wartak(at)enterprisedb(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>, Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>
Subject: Re: Syncrep and improving latency due to WAL throttling
Date: 2023-01-26 08:03:27
Message-ID: CALj2ACVNo_Mh-AJLp=eKA-GQmiVmtLzpzF_tdGnrWjSxxeKkqA@mail.gmail.com
Lists: pgsql-hackers

On Thu, Jan 26, 2023 at 12:35 AM Andres Freund <andres(at)anarazel(dot)de> wrote:
>
> Hi,
>
> On 2023-01-25 14:32:51 +0100, Jakub Wartak wrote:
> > In other words it allows slow down of any backend activity. Any feedback on
> > such a feature is welcome, including better GUC name proposals ;) and
> > conditions in which such feature should be disabled even if it would be
> > enabled globally (right now only anti- wraparound VACUUM comes to mind, it's
> > not in the patch).
>
> Such a feature could be useful - but I don't think the current place of
> throttling has any hope of working reliably:

+1 for the feature, as it keeps replication lag in check and provides a
way to define an RPO (recovery point objective).

> You're blocking in the middle of an XLOG insertion. We will commonly hold
> important buffer lwlocks, it'll often be in a critical section (no cancelling
> / terminating the session!). This'd entail loads of undetectable deadlocks
> (i.e. hard hangs). And even leaving that aside, doing an unnecessary
> XLogFlush() with important locks held will seriously increase contention.
>
> My best idea for how to implement this in a somewhat safe way would be for
> XLogInsertRecord() to set a flag indicating that we should throttle, and set
> InterruptPending = true. Then the next CHECK_FOR_INTERRUPTS that's allowed to
> proceed (i.e. we'll not be in a critical / interrupts off section) can
> actually perform the delay. That should fix the hard deadlock danger and
> remove most of the increase in lock contention.

We've discussed this feature quite extensively in a recent thread -
https://www.postgresql.org/message-id/CAHg%2BQDcO_zhgBCMn5SosvhuuCoJ1vKmLjnVuqUEOd4S73B1urw%40mail.gmail.com.
Folks were agreeing to Andres' suggestion there -
https://www.postgresql.org/message-id/20220105174643.lozdd3radxv4tlmx%40alap3.anarazel.de.
However, that thread got lost from my radar. It's good that someone
else started working on it and I'm happy to help with this feature.

> I don't think doing an XLogFlush() of a record that we just wrote is a good
> idea - that'll sometimes significantly increase the number of fdatasyncs/sec
> we're doing. To make matters worse, this will often cause partially filled WAL
> pages to be flushed out - rewriting a WAL page multiple times can
> significantly increase overhead on the storage level. At the very least this'd
> have to flush only up to the last fully filled page.
>
> Just counting the number of bytes inserted by a backend will make the overhead
> even worse, as the flush will be triggered even for OLTP sessions doing tiny
> transactions, even though they don't contribute to the problem you're trying
> to address.

Right.

> How about counting how many bytes of WAL a backend has inserted
> since the last time that backend did an XLogFlush()?

Seems reasonable.

> A bulk writer won't do a lot of XLogFlush()es, so the time/bytes since the
> last XLogFlush() will increase quickly, triggering a flush at the next
> opportunity. But short OLTP transactions will do XLogFlush()es at least at
> every commit.

Right.

> I also suspect the overhead will be more manageable if you were to force a
> flush only up to a point further back than the last fully filled page. We
> don't want to end up flushing WAL for every page, but if you just have a
> backend-local accounting mechanism, I think that's inevitably what you'd end
> up with when you have a large number of sessions. But if you'd limit the
> flushing to be done to synchronous_commit_flush_wal_after / 2 boundaries, and
> only ever to a prior boundary, the amount of unnecessary WAL flushes would be
> proportional to synchronous_commit_flush_wal_after.

When the WAL records are large, the backend inserting them would
throttle frequently, generating more flushes, more waits for the sync
replication ack, and more contention on WALWriteLock. I'm not sure that
is acceptable unless the impact is measured.

Few more thoughts:

1. This feature keeps replication lag in check at the cost of
throttling write traffic. I'd be curious to see the improvement in
replication lag versus the drop in transaction throughput, say, with
pgbench running a custom insert script against one or more async/sync
standbys.

2. Heavy WAL throttling can turn into a lot of writes and fsyncs.
Eventually, every backend that generates WAL gets throttled,
irrespective of whether there's any replication lag at all. How about
letting backends throttle themselves based not just on
wal_distance_from_last_flush but also on the replication lag at that
moment, say, only when the lag crosses
wal_throttle_replication_lag_threshold bytes?

3. Should a backend wait indefinitely for the sync rep ack once it has
crossed wal_throttle_threshold? I don't think so; it must eventually
get a chance to do its own work instead of favouring other backends by
throttling itself forever.

4. I'd prefer adding a TAP test for this feature to check that WAL
throttling is actually picked up by a backend.

5. Can we also extend this WAL throttling feature to logical
replication to keep replication lag in check there as well?

6. Backends can ignore throttling for WAL records marked as unimportant, no?

7. We should not let backends throttle too frequently even when they
have crossed wal_throttle_threshold bytes. The best way is to rely on
replication lag; after all, the goal of this feature is to keep
replication lag in check - say, throttle only when wal_distance >
wal_throttle_threshold && replication_lag >
wal_throttle_replication_lag_threshold.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
