Re: Syncrep and improving latency due to WAL throttling

From: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>, Jakub Wartak <jakub(dot)wartak(at)enterprisedb(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>
Subject: Re: Syncrep and improving latency due to WAL throttling
Date: 2023-01-26 11:08:16
Message-ID: 341e5208-58f6-e0ba-eb81-5e04e536e344@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 1/25/23 20:05, Andres Freund wrote:
> Hi,
>
> On 2023-01-25 14:32:51 +0100, Jakub Wartak wrote:
>> In other words it allows slow down of any backend activity. Any feedback on
>> such a feature is welcome, including better GUC name proposals ;) and
>> conditions in which such feature should be disabled even if it would be
>> enabled globally (right now only anti- wraparound VACUUM comes to mind, it's
>> not in the patch).
>
> Such a feature could be useful - but I don't think the current place of
> throttling has any hope of working reliably:
>
>> @@ -1021,6 +1025,21 @@ XLogInsertRecord(XLogRecData *rdata,
>> pgWalUsage.wal_bytes += rechdr->xl_tot_len;
>> pgWalUsage.wal_records++;
>> pgWalUsage.wal_fpi += num_fpi;
>> +
>> + backendWalInserted += rechdr->xl_tot_len;
>> +
>> + if ((synchronous_commit == SYNCHRONOUS_COMMIT_REMOTE_APPLY || synchronous_commit == SYNCHRONOUS_COMMIT_REMOTE_WRITE) &&
>> + synchronous_commit_flush_wal_after > 0 &&
>> + backendWalInserted > synchronous_commit_flush_wal_after * 1024L)
>> + {
>> + elog(DEBUG3, "throttling WAL down on this session (backendWalInserted=%d)", backendWalInserted);
>> + XLogFlush(EndPos);
>> + /* XXX: refactor SyncRepWaitForLSN() to have different waitevent than default WAIT_EVENT_SYNC_REP */
>> + /* maybe new WAIT_EVENT_SYNC_REP_BIG or something like that */
>> + SyncRepWaitForLSN(EndPos, false);
>> + elog(DEBUG3, "throttling WAL down on this session - end");
>> + backendWalInserted = 0;
>> + }
>> }
>
> You're blocking in the middle of an XLOG insertion. We will commonly hold
> important buffer lwlocks, it'll often be in a critical section (no cancelling
> / terminating the session!). This'd entail loads of undetectable deadlocks
> (i.e. hard hangs). And even leaving that aside, doing an unnecessary
> XLogFlush() with important locks held will seriously increase contention.
>

Yeah, I agree the sleep would have to happen elsewhere.

It's not clear to me how could it cause deadlocks, as we're not waiting
for a lock/resource locked by someone else, but it's certainly an issue
for uninterruptible hangs.

>
> My best idea for how to implement this in a somewhat safe way would be for
> XLogInsertRecord() to set a flag indicating that we should throttle, and set
> InterruptPending = true. Then the next CHECK_FOR_INTERRUPTS that's allowed to
> proceed (i.e. we'll not be in a critical / interrupts off section) can
> actually perform the delay. That should fix the hard deadlock danger and
> remove most of the increase in lock contention.
>

The solution I've imagined is something like autovacuum throttling - do
some accounting of how much "WAL bandwidth" each process consumed, and
then do the delay/sleep in a suitable place.

>
> I don't think doing an XLogFlush() of a record that we just wrote is a good
> idea - that'll sometimes significantly increase the number of fdatasyncs/sec
> we're doing. To make matters worse, this will often cause partially filled WAL
> pages to be flushed out - rewriting a WAL page multiple times can
> significantly increase overhead on the storage level. At the very least this'd
> have to flush only up to the last fully filled page.
>

I don't see why would that significantly increase the number of flushes.
The goal is to do this only every ~1MB of a WAL or so, and chances are
there were many (perhaps hundreds or more) commits in between. At least
in a workload with a fair number of OLTP transactions (which is kinda
the point of all this).

And the "small" OLTP transactions don't really do any extra fsyncs,
because the accounting resets at commit (or places that flush WAL).

Same for the flushes of partially flushed pages - if there's enough of
small OLTP transactions, we're already having this issue. I don't see
why would this make it measurably worse. But if needed, we can simply
backoff to the last page boundary, so that we only flush the complete
page. That would work too - the goal is not to flush everything, but to
reduce how much of the lag is due to the current process (i.e. to wait
for sync replica to catch up).

>
> Just counting the number of bytes inserted by a backend will make the overhead
> even worse, as the flush will be triggered even for OLTP sessions doing tiny
> transactions, even though they don't contribute to the problem you're trying
> to address. How about counting how many bytes of WAL a backend has inserted
> since the last time that backend did an XLogFlush()?
>

No, we should reset the counter at commit, so small OLTP transactions
should not reach really trigger this. That's kinda the point, to only
delay "large" transactions producing a lot of WAL.

> A bulk writer won't do a lot of XLogFlush()es, so the time/bytes since the
> last XLogFlush() will increase quickly, triggering a flush at the next
> opportunity. But short OLTP transactions will do XLogFlush()es at least at
> every commit.
>

Right, that's kinda the approach the patch is trying to do.

>
> I also suspect the overhead will be more manageable if you were to force a
> flush only up to a point further back than the last fully filled page. We
> don't want to end up flushing WAL for every page, but if you just have a
> backend-local accounting mechanism, I think that's inevitably what you'd end
> up with when you have a large number of sessions. But if you'd limit the
> flushing to be done to synchronous_commit_flush_wal_after / 2 boundaries, and
> only ever to a prior boundary, the amount of unnecessary WAL flushes would be
> proportional to synchronous_commit_flush_wal_after.
>

True, that's kinda what I suggested above as a solution for partially
filled WAL pages.

I agree this is crude and we'd probably need some sort of "balancing"
logic that distributes WAL bandwidth between backends, and throttles
backends producing a lot of WAL.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Pavel Borisov 2023-01-26 11:13:24 Re: Considering additional sort specialisation functions for PG16
Previous Message Andrei Zubkov 2023-01-26 11:02:43 Re: [PATCH] Tracking statements entry timestamp in pg_stat_statements