Re: Logical replication keepalive flood

From: Greg Nancarrow <gregn4422(at)gmail(dot)com>
To: Peter Smith <smithpb2250(at)gmail(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Abbas Butt <abbas(dot)butt(at)enterprisedb(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Zahid Iqbal <zahid(dot)iqbal(at)enterprisedb(dot)com>
Subject: Re: Logical replication keepalive flood
Date: 2021-09-14 05:39:20
Message-ID: CAJcOf-fdJizi7gwvpPa+rZzOuFaHkYpKWC2GyipJxHOwNCjrOw@mail.gmail.com
Lists: pgsql-hackers

On Thu, Aug 12, 2021 at 12:33 PM Peter Smith <smithpb2250(at)gmail(dot)com> wrote:
>
> This data shows the special keepalives are now greatly reduced from
> 1000s to just 10s.
>
> Thoughts?
>

I could easily see the flood of keepalives with the test setup
described by the original problem reporter (Abbas Butt).
I found that the
"v1-0002-WIP-skip-the-keepalive-on-FIRST-loop-iteration.patch" patch
reduced the keepalives by about 50% in this case.
I also tried the pub/sub setup with the publication on the
pgbench_history table.
With this setup, I found that the patch dramatically reduced the
keepalives sent, similar to what Peter reported.
Results (using Kyotaro's keepalive-counting patch) are below:

PUB/SUB, publishing the pgbench_history table

(1) without patch, 10s pgbench run:

2021-09-08 15:21:56.643 AEST [63720] LOG: Total records: 47019
2021-09-08 15:21:56.643 AEST [63720] LOG: 8: 8 / 16 / 8: 8571 / 882048
2021-09-08 15:21:56.643 AEST [63720] LOG: 16: 5 / 10 / 5: 3649 / 764892
2021-09-08 15:21:56.643 AEST [63720] LOG: 24: 6271 / 12561 / 6331: 113989 / 886115
2021-09-08 15:21:56.643 AEST [63720] LOG: 195: 2 / 0 / 112: 72 / 10945
2021-09-08 15:21:56.643 AEST [63720] LOG: 6856: 1 / 0 / 1: 666232176 / 666232176
2021-09-08 15:21:56.643 AEST [63720] LOG: 7477: 2 / 0 / 298: 27 / 3303
2021-09-08 15:21:56.643 AEST [63720] LOG: 8159: 19 / 32 / 6073: 15 / 1869

(2) with patch, 10s pgbench run

2021-09-08 15:39:14.008 AEST [71431] LOG: Total records: 45858
2021-09-08 15:39:14.008 AEST [71431] LOG: 24: 61 / 18278 / 6168: 108034 / 115228
2021-09-08 15:39:14.008 AEST [71431] LOG: 84: 1 / 1 / 7: 2256 / 295230
2021-09-08 15:39:14.008 AEST [71431] LOG: 110: 1 / 1 / 3: 10629 / 708293
2021-09-08 15:39:14.008 AEST [71431] LOG: 7477: 18 / 18 / 4577: 53 / 7850

Where columns are:

size: sent / notsent / calls: write lag / flush lag

- size is the requested data length passed to logical_read_xlog_page()
- sent is the number of keepalives sent in the loop in WalSndWaitForWal
- notsent is the number of runs of the loop in WalSndWaitForWal without
sending a keepalive (see the sketch after this list)
- calls is the number of calls to WalSndWaitForWal
- write lag is the number of bytes MyWalSnd->write is behind sentPtr at
the first run of the loop, per call to logical_read_xlog_page
- flush lag is the same as the above, but for MyWalSnd->flush
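
To make the sent / notsent distinction concrete, here is a rough sketch
of where those two counters would be bumped in the WalSndWaitForWal loop
(hypothetical illustration only, not Kyotaro's actual counting patch;
"stats" is a made-up per-size counter struct):

    /* hypothetical counters, keyed by the requested read size ("size" column) */
    if (MyWalSnd->flush < sentPtr &&
        MyWalSnd->write < sentPtr &&
        !waiting_for_ping_response)
    {
        WalSndKeepalive(false);
        stats->sent++;        /* "sent": a keepalive was issued on this iteration */
    }
    else
        stats->notsent++;     /* "notsent": this loop run skipped the keepalive */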

However, the problem I found is that, with the patch applied, there is
a test failure when running "make check-world":

t/006_logical_decoding.pl ............ 4/14
# Failed test 'pg_recvlogical acknowledged changes'
# at t/006_logical_decoding.pl line 117.
# got: 'BEGIN
# table public.decoding_test: INSERT: x[integer]:5 y[text]:'5''
# expected: ''
# Looks like you failed 1 test of 14.
t/006_logical_decoding.pl ............ Dubious, test returned 1 (wstat 256, 0x100)
Failed 1/14 subtests

To investigate this, I added some additional logging to
pg_recvlogical.c and PostgresNode.pm and re-ran
006_logical_decoding.pl without and with the patch (logs attached).

When the patch is NOT applied, and pg_recvlogical is invoked by the
test for a 2nd time with the same "--endpos" LSN, it gets a keepalive,
detects walEnd >= endpos, and thus returns an empty record. The test
is expecting an empty record, so all is OK.
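
That matches the keepalive ('k' message) handling in pg_recvlogical's
StreamLogicalLog loop, which goes roughly like this (paraphrased for
illustration, not the verbatim source; helper names may differ by branch):

    /* keepalive message: walEnd is the first field after the msgtype byte */
    walEnd = fe_recvint64(&copybuf[1]);
    output_written_lsn = Max(walEnd, output_written_lsn);
    ...
    if (endpos != InvalidXLogRecPtr && walEnd >= endpos)
        endposReached = true;       /* nothing more can commit before endpos */

    if (replyRequested || endposReached)
    {
        if (!flushAndSendFeedback(conn, &now))
            goto error;
    }

    if (endposReached)
    {
        /* stop streaming; nothing further is emitted, so the test sees an empty result */
        time_to_abort = true;
        break;
    }
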
When the patch is applied, and pg_recvlogical is invoked by the test
for a 2nd time with the same "--endpos" LSN, it gets a WAL record with
THE SAME LSN (== endpos) as the last WAL record obtained when it was
invoked the 1st time, but the record data is actually the first row of
some records written after endpos, which it wasn't meant to read.
This doesn't seem right to me - how can pg_recvlogical receive two
different WAL records with the same LSN?
With the patch applied, I was expecting pg_recvlogical to get WAL
records with LSN > endpos, but this doesn't happen.
I'm thinking that the patch must have broken walsender in some way,
possibly by missing out on calls to ProcessStandbyReplyMessage()
because the sending of some keepalives is avoided (see below), so
that the MyWalSnd flush and write locations are not kept up-to-date.
The code comment below seems to hint at this. I don't really like the
way keepalives are used for this, but it seems to be the existing
functionality. Maybe someone else can confirm that this could indeed
break walsender?

walsender.c
WalSndWaitForWal()

        /*
         * We only send regular messages to the client for full decoded
         * transactions, but a synchronous replication and walsender shutdown
         * possibly are waiting for a later location. So, before sleeping, we
         * send a ping containing the flush location. If the receiver is
         * otherwise idle, this keepalive will trigger a reply. Processing the
         * reply will update these MyWalSnd locations.
         */
        if (!loop_first_time &&    /* avoid keepalive on first iteration */   <--- added by the patch
            MyWalSnd->flush < sentPtr &&
            MyWalSnd->write < sentPtr &&
            !waiting_for_ping_response)
        {
            WalSndKeepalive(false);
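
For reference, when the receiver does reply, ProcessStandbyReplyMessage()
is what advances those locations; simplified (not verbatim), it boils
down to:

    /* ProcessStandbyReplyMessage() in walsender.c, simplified */
    writePtr = pq_getmsgint64(&reply_message);
    flushPtr = pq_getmsgint64(&reply_message);
    applyPtr = pq_getmsgint64(&reply_message);

    SpinLockAcquire(&MyWalSnd->mutex);
    MyWalSnd->write = writePtr;     /* the locations compared against sentPtr above */
    MyWalSnd->flush = flushPtr;
    MyWalSnd->apply = applyPtr;
    SpinLockRelease(&MyWalSnd->mutex);

So if the keepalive that would have prompted a reply is skipped, those
fields can stay stale until something else triggers a reply.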

Regards,
Greg Nancarrow
Fujitsu Australia

Attachment Content-Type Size
006_logs.tar.gz application/x-gzip 5.3 KB
