Re: walsender performance regression due to logical decoding on standby changes

From: Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, "Drouvot, Bertrand" <bertranddrouvot(dot)pg(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org, Jeff Davis <pgsql(at)j-davis(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>
Subject: Re: walsender performance regression due to logical decoding on standby changes
Date: 2023-05-18 14:41:11
Message-ID: CALj2ACXtHuS9PcP+R2TiMdSx_GepYSfx8n1O0pZSAPZQY2PWJA@mail.gmail.com
Lists: pgsql-hackers

On Thu, May 18, 2023 at 1:23 AM Andres Freund <andres(at)anarazel(dot)de> wrote:
>
> On 2023-05-15 20:09:00 +0530, Bharath Rupireddy wrote:
> > > [1]
> > > max_wal_senders = 100
> > > before regression(ms) after regression(ms) v2 patch(ms)
> > > 13394.4013 14141.2615 13455.2543
> > > Compared with before regression 5.58% 0.45%
> > >
> > > max_wal_senders = 200
> > > before regression(ms) after regression(ms) v2 patch(ms)
> > > 13280.8507 14597.1173 13632.0606
> > > Compared with before regression 9.91% 1.64%
> > >
> > > max_wal_senders = 300
> > > before regression(ms) after regression(ms) v2 patch(ms)
> > > 13535.0232 16735.7379 13705.7135
> > > Compared with before regression 23.65% 1.26%
> >
> > Yes, the numbers with the v2 patch look close to where we were
> > before. Thanks for confirming. Just wondering, where is this extra
> > 0.45%/1.64%/1.26% coming from?
>
> We still do more work for each WAL record than before, so I'd expect something
> small. I'd say right now the main overhead with the patch comes from the
> spinlock acquisitions in ConditionVariableBroadcast(), which happen even when
> nobody is waiting.

> > + ConditionVariableInit(&WalSndCtl->physicalWALSndCV);
> > + ConditionVariableInit(&WalSndCtl->logicalWALSndCV);
>
> It's not obvious to me that it's worth having two CVs, because it's more
> expensive to find no waiters in two CVs than to find no waiters in one CV.

I disagree. In the tight per-WAL-record recovery loop, WalSndWakeup
wakes up logical walsenders for every WAL record, but it wakes up
physical walsenders only when the applied record causes a TLI switch.
The extra cost of the spinlock acquire-release per WAL record
therefore applies only to logical walsenders. With a single CV, we
would instead be waking up physical walsenders (if any are sleeping)
for every WAL record - which seems needlessly costly IMO.

I still think separate CVs are worthwhile for selective wakeups, given
that TLI-switch WAL records are rare.
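
To make that concrete, the wakeup path with two CVs looks roughly like
this (a simplified sketch, not the exact patch code - see the
attachment for that):

    /* Simplified sketch of the two-CV wakeup path. */
    void
    WalSndWakeup(bool physical, bool logical)
    {
        /*
         * Each ConditionVariableBroadcast() acquires the CV's spinlock, so
         * skipping the physical CV for ordinary WAL records keeps that cost
         * off the per-record path; recovery asks for physical wakeups only
         * on TLI-switch records.
         */
        if (physical)
            ConditionVariableBroadcast(&WalSndCtl->physicalWALSndCV);
        if (logical)
            ConditionVariableBroadcast(&WalSndCtl->logicalWALSndCV);
    }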

> > + *
> > + * XXX: When available, WaitEventSetWait() can be replaced with its CV's
> > + * counterpart.
>
> I don't really understand that XXX - the potential bright future would be to
> add support for CVs into wait event sets, not to replace WES with a CV?

Yes, that's what I meant; I've modified that part to:

* XXX: A desirable future improvement would be to add support for CVs into
* WaitEventSetWait().

> FWIW, if I just make WalSndWakeup() do nothing, I still see a very small, but
> reproducible, overhead: 1.72s - that's just the cost of the additional
> external function call.

Hm, that function-call overhead is unavoidable.

> If I add a no-waiters fastpath using proclist_is_empty() to
> ConditionVariableBroadcast(), I get 1.77s. So the majority of the remaining
> slowdown indeed comes from the spinlock acquisition in
> ConditionVariableBroadcast().

Acquiring a spinlock for every replayed WAL record doesn't seem great.
More generally, this problem exists for the CV infrastructure as a
whole whenever ConditionVariableBroadcast()/CVB() is called in a
loop or hot path. It can be demonstrated easily with something like -
select
pg_replication_origin_session_setup()/pg_wal_replay_resume()/pg_create_physical_replication_slot()
from generate_series(1, 1000000...); - all of these functions end up
in CVB().

Do you think a fastpath exit from CVB() when there are no waiters is
worth implementing? Something like: add an atomic state variable to
each CV, similar to LWLock's state variable, set it when adding
waiters and reset it when removing them, and have CVB() read the state
atomically for the fastpath (of course, we'd need memory barriers
here). The downside is added complexity: each CV structure grows an
extra state variable (4 bytes) and picks up memory barriers and atomic
reads and writes.
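
A rough sketch of what I have in mind (the nwaiters field and the
barrier placement are hypothetical, just to illustrate):

    /* Hypothetical sketch: 'nwaiters' is a new field, not existing code. */
    typedef struct
    {
        slock_t          mutex;
        proclist_head    wakeup;
        pg_atomic_uint32 nwaiters;  /* bumped when adding a waiter,
                                     * dropped when removing one */
    } ConditionVariable;

    void
    ConditionVariableBroadcast(ConditionVariable *cv)
    {
        /*
         * Fastpath: skip the spinlock when no one is waiting. The barrier
         * pairs with the one a waiter issues when registering itself, so
         * we can't miss a waiter that registered before we got here.
         */
        pg_memory_barrier();
        if (pg_atomic_read_u32(&cv->nwaiters) == 0)
            return;

        /* ... existing slowpath: SpinLockAcquire(&cv->mutex), wake all ... */
    }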

Alternatively, given that no current caller of CVB() other than
recovery (with the patch) invokes it in a hot path, maybe we can add
an atomic waiter count to WalSndCtl: increment it atomically in
WalSndWait() before the wait, decrement it after the wait, and have
WalSndWakeup() read it atomically as a fastpath to avoid the CVB()
calls entirely. IIUC, this is what Jeff proposed upthread.
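
A rough sketch of that alternative (the field name and exact placement
are hypothetical):

    /* Hypothetical sketch: 'nwaiters' in WalSndCtlData is a new field. */

    /* In WalSndWait(): bracket the wait with the counter updates. */
    pg_atomic_fetch_add_u32(&WalSndCtl->nwaiters, 1);
    (void) WaitEventSetWait(FeBeWaitSet, timeout, &event, 1, wait_event);
    pg_atomic_fetch_sub_u32(&WalSndCtl->nwaiters, 1);

    /* In WalSndWakeup(): bail out early when no walsender is waiting. */
    if (pg_atomic_read_u32(&WalSndCtl->nwaiters) == 0)
        return;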

Thoughts?

Please find attached the v4 patch, which addresses the review comment
(but not the fastpath one).

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachment Content-Type Size
v4-0001-Optimize-walsender-wake-up-logic-with-Conditional.patch application/octet-stream 6.3 KB
