Re: Syncrep and improving latency due to WAL throttling

From: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Jakub Wartak <jakub(dot)wartak(at)enterprisedb(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>
Subject: Re: Syncrep and improving latency due to WAL throttling
Date: 2023-01-26 18:53:20
Message-ID: bc92154a-28e3-8e66-ce8f-54fee1713aed@enterprisedb.com
Lists: pgsql-hackers

On 1/26/23 16:40, Andres Freund wrote:
> Hi,
>
> On 2023-01-26 12:08:16 +0100, Tomas Vondra wrote:
>> It's not clear to me how it could cause deadlocks, as we're not waiting
>> for a lock/resource locked by someone else, but it's certainly an issue
>> for uninterruptible hangs.
>
> Maybe not. But I wouldn't want to bet on it. It's a violation of all kinds of
> lock-ordering rules.
>

Not sure what lock ordering issues you have in mind, but I agree it's
not the right place for the sleep, no argument here.

>
>>> My best idea for how to implement this in a somewhat safe way would be for
>>> XLogInsertRecord() to set a flag indicating that we should throttle, and set
>>> InterruptPending = true. Then the next CHECK_FOR_INTERRUPTS that's allowed to
>>> proceed (i.e. we'll not be in a critical / interrupts off section) can
>>> actually perform the delay. That should fix the hard deadlock danger and
>>> remove most of the increase in lock contention.
>>>
>>
>> The solution I've imagined is something like autovacuum throttling - do
>> some accounting of how much "WAL bandwidth" each process consumed, and
>> then do the delay/sleep in a suitable place.
>
> Where would such a point be, except for ProcessInterrupts()? Iteratively
> adding a separate set of "wal_delay()" points all over the executor,
> commands/, ... seems like a bad plan.
>

I haven't thought about where to do the work, TBH. ProcessInterrupts()
may very well be a good place.
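A minimal standalone sketch of that flag-and-CHECK_FOR_INTERRUPTS pattern (hypothetical names and threshold, a simulation rather than actual backend code): the WAL-insert path only does accounting and raises a flag, and the sleep itself happens later, at a ProcessInterrupts()-like point where interrupts are allowed.

```c
#include <stdbool.h>
#include <stdint.h>

static bool InterruptPending = false;      /* stand-in for the real global */
static bool WalThrottlePending = false;    /* hypothetical new flag */
static uint64_t wal_bytes_since_flush = 0; /* hypothetical per-backend accounting */

#define WAL_THROTTLE_THRESHOLD (1024u * 1024u)  /* e.g. ~1MB, per the thread */

/* Called from the WAL-insert path: accounting only, no sleeping here,
 * so nothing blocks inside a critical section. */
static void
wal_account_insert(uint64_t nbytes)
{
    wal_bytes_since_flush += nbytes;
    if (wal_bytes_since_flush >= WAL_THROTTLE_THRESHOLD)
    {
        WalThrottlePending = true;
        InterruptPending = true;   /* make CHECK_FOR_INTERRUPTS notice us */
    }
}

/* Called from a ProcessInterrupts()-like point where sleeping is safe.
 * Returns true if a throttle delay would have been performed. */
static bool
wal_maybe_throttle(void)
{
    if (!WalThrottlePending)
        return false;
    WalThrottlePending = false;
    InterruptPending = false;
    wal_bytes_since_flush = 0;     /* reset accounting after the delay */
    /* ... here: flush up to a prior boundary, wait for the sync replica ... */
    return true;
}
```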

I should have been clearer, but the main benefit of autovacuum-like
throttling is IMHO that it involves all processes and a global limit,
while the current approach is per-backend.
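For comparison, the autovacuum-style variant could look roughly like this (a simulation with hypothetical names; in a real patch the balance would live in shared memory and be updated under a lock, like autovacuum's cost balance), so that the limit is global across all WAL-producing processes rather than per-backend:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names: a shared "WAL bandwidth" balance and limit,
 * analogous to vacuum_cost_limit / vacuum_cost_delay. */
static uint64_t shared_wal_balance = 0;
static const uint64_t wal_cost_limit = 1024u * 1024u;

/* Each process reports its WAL bytes here. Returns true when the caller
 * pushed the shared balance over the limit and should pay the delay;
 * the balance then resets, so the cost is spread across all processes. */
static bool
wal_global_account(uint64_t nbytes)
{
    shared_wal_balance += nbytes;
    if (shared_wal_balance >= wal_cost_limit)
    {
        shared_wal_balance = 0;
        return true;
    }
    return false;
}
```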

>
>>> I don't think doing an XLogFlush() of a record that we just wrote is a good
>>> idea - that'll sometimes significantly increase the number of fdatasyncs/sec
>>> we're doing. To make matters worse, this will often cause partially filled WAL
>>> pages to be flushed out - rewriting a WAL page multiple times can
>>> significantly increase overhead on the storage level. At the very least this'd
>>> have to flush only up to the last fully filled page.
>>>
>>
>> I don't see why that would significantly increase the number of flushes.
>> The goal is to do this only every ~1MB of WAL or so, and chances are
>> there were many (perhaps hundreds or more) commits in between. At least
>> in a workload with a fair number of OLTP transactions (which is kinda
>> the point of all this).
>
> Because the accounting is done separately in each process. Even if you just
> add a few additional flushes for each connection, in aggregate that'll be a
> lot.
>

How come? Imagine the backend flushes only after generating e.g. 1MB
of WAL. Small transactions won't do any additional flushes at all
(because commit resets the accounting). Large transactions will do an
extra flush every 1MB, i.e. 16x per 16MB WAL segment. But in between
there will be many commits from the small transactions. If we back off
to the last complete page, that should eliminate even most of these
flushes.

So where would the additional flushes come from?
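The arithmetic can be checked with a toy simulation (hypothetical names; assumes the ~1MB threshold discussed above and that commit resets the accounting): a 16MB bulk transaction pays 16 extra flushes, while small committed transactions pay none.

```c
#include <stdint.h>

#define FLUSH_EVERY (1024u * 1024u)   /* assumed ~1MB throttling threshold */

static uint64_t wal_since_flush = 0;  /* per-backend accounting */
static unsigned extra_flushes = 0;    /* extra flush calls incurred */

/* WAL inserted by the current transaction; throttle-flush every
 * FLUSH_EVERY bytes of accumulated WAL. */
static void
sim_insert(uint64_t nbytes)
{
    wal_since_flush += nbytes;
    while (wal_since_flush >= FLUSH_EVERY)
    {
        wal_since_flush -= FLUSH_EVERY;
        extra_flushes++;              /* the throttling flush */
    }
}

/* Commit flushes WAL anyway, so it resets the accounting for free. */
static void
sim_commit(void)
{
    wal_since_flush = 0;
}
```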

>
>> And the "small" OLTP transactions don't really do any extra fsyncs,
>> because the accounting resets at commit (or places that flush WAL).
>> [...]
>>> Just counting the number of bytes inserted by a backend will make the overhead
>>> even worse, as the flush will be triggered even for OLTP sessions doing tiny
>>> transactions, even though they don't contribute to the problem you're trying
>>> to address. How about counting how many bytes of WAL a backend has inserted
>>> since the last time that backend did an XLogFlush()?
>>>
>>
>> No, we should reset the counter at commit, so small OLTP transactions
>> should not really trigger this. That's kinda the point, to only
>> delay "large" transactions producing a lot of WAL.
>
> I might have missed something, but I don't think the first version of patch
> resets the accounting at commit?
>

Yeah, it doesn't. It's mostly a minimal PoC patch, sufficient to test
the behavior / demonstrate the effect.

>
>> Same for the flushes of partially flushed pages - if there's enough
>> small OLTP transactions, we're already having this issue. I don't see
>> why this would make it measurably worse.
>
> Yes, we already have that problem, and it hurts. *Badly*. I don't see how v1
> could *not* make it worse. Suddenly you get a bunch of additional XLogFlush()
> calls to partial boundaries by every autovacuum, by every process doing a bit
> more bulky work. Because you're flushing the LSN after a record you just
> inserted, you're commonly not going to be "joining" the work of an already
> in-process XLogFlush().
>

Right. We do ~16 additional flushes per 16MB segment (or something like
that, depending on the GUC value). Considering we do thousands of
commits per segment, each of which does a flush, I don't see how this
would make it measurably worse.

>
>> But if needed, we can simply backoff to the last page boundary, so that we
>> only flush the complete page. That would work too - the goal is not to flush
>> everything, but to reduce how much of the lag is due to the current process
>> (i.e. to wait for sync replica to catch up).
>
> Yes, as I proposed...
>

Right.
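Backing off to the last complete page is just rounding the flush target down to the WAL page size (a sketch with assumed names; PostgreSQL's actual WAL page size is XLOG_BLCKSZ, 8kB by default):

```c
#include <stdint.h>

typedef uint64_t XLogRecPtr;    /* same underlying type as PostgreSQL's */
#define WAL_PAGE_SIZE 8192      /* XLOG_BLCKSZ default */

/*
 * Round a flush target down to the start of its (possibly partially
 * filled) WAL page, so we only ever flush complete pages and avoid
 * rewriting the same partial page repeatedly at the storage level.
 */
static XLogRecPtr
backoff_to_page(XLogRecPtr lsn)
{
    return lsn - (lsn % WAL_PAGE_SIZE);
}
```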

>
>>> I also suspect the overhead will be more manageable if you were to force a
>>> flush only up to a point further back than the last fully filled page. We
>>> don't want to end up flushing WAL for every page, but if you just have a
>>> backend-local accounting mechanism, I think that's inevitably what you'd end
>>> up with when you have a large number of sessions. But if you'd limit the
>>> flushing to be done to synchronous_commit_flush_wal_after / 2 boundaries, and
>>> only ever to a prior boundary, the amount of unnecessary WAL flushes would be
>>> proportional to synchronous_commit_flush_wal_after.
>>>
>>
>> True, that's kinda what I suggested above as a solution for partially
>> filled WAL pages.
>
> I think flushing only up to a point further in the past than the last page
> boundary is somewhat important to prevent unnecessary flushes.
>

Not sure I agree with that. Yes, we should not be doing flushes unless
we need to, but OTOH we should not delay sending WAL to the standby too
much - because that's what affects syncrep latency for small
transactions.
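The variant quoted above generalizes that rounding from the page size to half the throttling threshold (sketch with hypothetical names; assumes the threshold is at least 2 bytes), so the number of unnecessary flushes stays proportional to the GUC value rather than to the number of WAL pages:

```c
#include <stdint.h>

typedef uint64_t XLogRecPtr;    /* same underlying type as PostgreSQL's */

/*
 * Hypothetical helper: round the flush target down to a multiple of
 * half the throttling threshold, only ever flushing up to a prior
 * boundary rather than to the just-inserted record's LSN.
 */
static XLogRecPtr
backoff_to_boundary(XLogRecPtr lsn, uint64_t throttle_after)
{
    uint64_t step = throttle_after / 2;
    return lsn - (lsn % step);
}
```

The tradeoff is the one raised just above: the further back the boundary, the fewer flushes, but the longer un-sent WAL sits in front of small concurrent transactions waiting on the sync replica.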

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
