Re: Syncrep and improving latency due to WAL throttling

From: Andres Freund <andres(at)anarazel(dot)de>
To: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
Cc: Jakub Wartak <jakub(dot)wartak(at)enterprisedb(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>
Subject: Re: Syncrep and improving latency due to WAL throttling
Date: 2023-01-26 15:40:29
Message-ID: 20230126154029.nhsjlzzyjcg4wc4t@awork3.anarazel.de
Lists: pgsql-hackers

Hi,

On 2023-01-26 12:08:16 +0100, Tomas Vondra wrote:
> It's not clear to me how it could cause deadlocks, as we're not waiting
> for a lock/resource locked by someone else, but it's certainly an issue
> for uninterruptible hangs.

Maybe not. But I wouldn't want to bet on it. It's a violation of all kinds of
lock-ordering rules.

> > My best idea for how to implement this in a somewhat safe way would be for
> > XLogInsertRecord() to set a flag indicating that we should throttle, and set
> > InterruptPending = true. Then the next CHECK_FOR_INTERRUPTS that's allowed to
> > proceed (i.e. we'll not be in a critical / interrupts off section) can
> > actually perform the delay. That should fix the hard deadlock danger and
> > remove most of the increase in lock contention.
> >
>
> The solution I've imagined is something like autovacuum throttling - do
> some accounting of how much "WAL bandwidth" each process consumed, and
> then do the delay/sleep in a suitable place.

Where would such a point be, except for ProcessInterrupts()? Iteratively
adding a separate set of "wal_delay()" points all over the executor,
commands/, ... seems like a bad plan.
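
To make that concrete, here is a standalone sketch of the pattern (purely
illustrative: in PostgreSQL the real pieces would be XLogInsertRecord(),
InterruptPending, CHECK_FOR_INTERRUPTS() and ProcessInterrupts(), and every
identifier below is made up for the example). The insert path only records
that a delay is owed, and the sleep happens at the next point where we are
not inside a critical section:

/*
 * Sketch of "defer the sleep to a safe interrupt check".  Everything here
 * is simplified and renamed; it is not the posted patch.
 */
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <unistd.h>

static volatile bool interrupt_pending = false;     /* like InterruptPending */
static volatile bool wal_throttle_pending = false;  /* new: a delay is owed */
static int      crit_section_count = 0;             /* like CritSectionCount */

/* WAL-insert path: never sleep here, just remember that we owe a delay. */
static void
wal_insert_record(size_t rec_len, size_t *bytes_since_throttle)
{
    *bytes_since_throttle += rec_len;
    if (*bytes_since_throttle >= 1024 * 1024)   /* e.g. every ~1MB of WAL */
    {
        wal_throttle_pending = true;
        interrupt_pending = true;               /* make the check look at us */
        *bytes_since_throttle = 0;
    }
}

/* The equivalent of CHECK_FOR_INTERRUPTS() / ProcessInterrupts(). */
static void
check_for_interrupts(void)
{
    if (!interrupt_pending || crit_section_count > 0)
        return;                                 /* not a safe point */

    interrupt_pending = false;
    if (wal_throttle_pending)
    {
        wal_throttle_pending = false;
        /* Here the real code would wait for the sync standby to catch up. */
        usleep(10 * 1000);
    }
}

int
main(void)
{
    size_t      pending_bytes = 0;

    for (int i = 0; i < 10000; i++)
    {
        wal_insert_record(512, &pending_bytes); /* unsafe to sleep in here */
        check_for_interrupts();                 /* safe point, no locks held */
    }
    printf("done\n");
    return 0;
}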

> > I don't think doing an XLogFlush() of a record that we just wrote is a good
> > idea - that'll sometimes significantly increase the number of fdatasyncs/sec
> > we're doing. To make matters worse, this will often cause partially filled WAL
> > pages to be flushed out - rewriting a WAL page multiple times can
> > significantly increase overhead on the storage level. At the very least this'd
> > have to flush only up to the last fully filled page.
> >
>
> I don't see why that would significantly increase the number of flushes.
> The goal is to do this only every ~1MB of WAL or so, and chances are
> there were many (perhaps hundreds or more) commits in between. At least
> in a workload with a fair number of OLTP transactions (which is kinda
> the point of all this).

Because the accounting is done separately in each process. Even if you just
add a few additional flushes for each connection, in aggregate that'll be a
lot.

> And the "small" OLTP transactions don't really do any extra fsyncs,
> because the accounting resets at commit (or at other places that flush WAL).
> [...]
> > Just counting the number of bytes inserted by a backend will make the overhead
> > even worse, as the flush will be triggered even for OLTP sessions doing tiny
> > transactions, even though they don't contribute to the problem you're trying
> > to address. How about counting how many bytes of WAL a backend has inserted
> > since the last time that backend did an XLogFlush()?
> >
>
> No, we should reset the counter at commit, so small OLTP transactions
> should not really trigger this. That's kinda the point, to only
> delay "large" transactions producing a lot of WAL.

I might have missed something, but I don't think the first version of the
patch resets the accounting at commit?
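
For what it's worth, here is a minimal sketch of the kind of accounting being
discussed, assuming a backend-local counter that is reset wherever the backend
flushes WAL anyway (all names below are hypothetical, not taken from the
posted patch):

/* Sketch only: backend-local WAL accounting, reset at commit/flush. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

static uint64_t wal_bytes_since_reset = 0;

/* Call from the WAL-insert path; returns true if throttling is now owed. */
static bool
account_wal_insert(uint64_t rec_len, uint64_t throttle_threshold)
{
    wal_bytes_since_reset += rec_len;
    if (wal_bytes_since_reset >= throttle_threshold)
    {
        wal_bytes_since_reset = 0;
        return true;        /* defer the actual sleep, as sketched above */
    }
    return false;
}

/* Call wherever the backend flushes WAL anyway, e.g. at commit. */
static void
account_wal_reset(void)
{
    wal_bytes_since_reset = 0;
}

int
main(void)
{
    /* A small transaction: a few kB of WAL, then commit -> never throttled. */
    bool        throttled = account_wal_insert(4096, 1024 * 1024);

    account_wal_reset();
    printf("small xact throttled: %d\n", (int) throttled);

    /* A bulk write: keeps inserting without committing -> gets throttled. */
    for (int i = 0; i < 300; i++)
        throttled |= account_wal_insert(8192, 1024 * 1024);
    printf("bulk xact throttled: %d\n", (int) throttled);
    return 0;
}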

> Same for the flushes of partially flushed pages - if there's enough of
> small OLTP transactions, we're already having this issue. I don't see
> why would this make it measurably worse.

Yes, we already have that problem, and it hurts. *Badly*. I don't see how v1
could *not* make it worse. Suddenly you get a bunch of additional XLogFlush()
calls to partial page boundaries from every autovacuum and every process doing
a bit more bulky work. And because you're flushing the LSN of a record you
just inserted, you're commonly not going to be "joining" the work of an
already in-progress XLogFlush().

> But if needed, we can simply back off to the last page boundary, so that we
> only flush complete pages. That would work too - the goal is not to flush
> everything, but to reduce how much of the lag is due to the current process
> (i.e. to wait for the sync replica to catch up).

Yes, as I proposed...

> > I also suspect the overhead will be more manageable if you were to force a
> > flush only up to a point further back than the last fully filled page. We
> > don't want to end up flushing WAL for every page, but if you just have a
> > backend-local accounting mechanism, I think that's inevitably what you'd end
> > up with when you have a large number of sessions. But if you'd limit the
> > flushing to synchronous_commit_flush_wal_after / 2 boundaries, and
> > only ever to a prior boundary, the amount of unnecessary WAL flushes would be
> > proportional to synchronous_commit_flush_wal_after.
> >
>
> True, that's kinda what I suggested above as a solution for partially
> filled WAL pages.

I think flushing only up to a point further in the past than the last page
boundary is somewhat important to prevent unnecessary flushes.
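
A sketch of that flush-target computation (the GUC name
synchronous_commit_flush_wal_after is taken from the discussion above; the
helper itself is hypothetical):

/*
 * Compute a flush target aligned to synchronous_commit_flush_wal_after / 2
 * that lies strictly before the record just inserted.  XLogRecPtr is
 * PostgreSQL's 64-bit WAL position type; here LSNs are plain byte offsets.
 */
#include <stdint.h>
#include <stdio.h>

typedef uint64_t XLogRecPtr;

/*
 * Highest multiple of (throttle_after / 2) strictly below insert_lsn, or 0
 * if there is no such boundary yet (meaning: don't flush at all).  Flushing
 * only to such prior boundaries avoids rewriting the partially filled page
 * the backend is currently inserting into.
 */
static XLogRecPtr
throttle_flush_target(XLogRecPtr insert_lsn, uint64_t throttle_after)
{
    uint64_t    boundary = throttle_after / 2;

    if (insert_lsn <= boundary)
        return 0;
    return (insert_lsn - 1) - ((insert_lsn - 1) % boundary);
}

int
main(void)
{
    uint64_t    throttle_after = 1024 * 1024;   /* 1MB -> 512kB boundaries */

    /* Inserted up to ~1.3MB: flush only up to the 1MB boundary. */
    printf("%llu\n",
           (unsigned long long) throttle_flush_target(1363148, throttle_after));
    return 0;
}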

Greetings,

Andres Freund
