Re: Logical replication keepalive flood

From: Peter Smith <smithpb2250(at)gmail(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Abbas Butt <abbas(dot)butt(at)enterprisedb(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Zahid Iqbal <zahid(dot)iqbal(at)enterprisedb(dot)com>
Subject: Re: Logical replication keepalive flood
Date: 2021-08-12 02:32:52
Message-ID: CAHut+PtyMBzweYUpb_QazVL6Uze2Yc5M5Ti2Xwee_eWM3Jrbog@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi.

By using Kyotaro's "counting" patch I was able to reproduce very
similar results to what he had earlier posted [1].

AFAIK I have the same test scenario that he was using.

Test setup:
- using async pub/sub
- subscription is for the pgbench_history table
- pgbench is run for 10 seconds
- config for all the wal_sender/receiver timeout GUCs are just default values

Results (HEAD + Kyotaro counting patch)
=======================================

[postgres(at)CentOS7-x64 ~]$ 2021-08-10 16:36:23.733 AEST [32436] LOG:
Total records: 18183
2021-08-10 16:36:23.733 AEST [32436] LOG: 8: 2 / 0 /
1: 440616 / 580320
2021-08-10 16:36:23.733 AEST [32436] LOG: 16: 4 / 8 /
4: 4524 / 288688
2021-08-10 16:36:23.733 AEST [32436] LOG: 24: 2916 / 5151 /
2756: 31227 / 323190
2021-08-10 16:36:23.733 AEST [32436] LOG: 187: 2 / 0 /
51: 157 / 10629
2021-08-10 16:36:23.733 AEST [32436] LOG: 2960: 1 / 0 /
1: 49656944 / 49656944
2021-08-10 16:36:23.733 AEST [32436] LOG: 7537: 2 / 0 /
36: 231 / 7028
2021-08-10 16:36:23.733 AEST [32436] LOG: 7577: 1 / 2 /
78: 106 / 106
2021-08-10 16:36:23.733 AEST [32436] LOG: 8280: 1 / 2 /
3: 88 / 88

//////

That base data is showing there are similar numbers of keepalives sent
as there are calls made to WalSndWaitForWal. IIUC it means that mostly
the loop is sending the special keepalives on the *first* iteration,
but by the time of the *second* iteration the ProcessRepliesIfAny()
will have some status already received, and so mostly sending another
keepalive will be deemed unnecessary.

Based on this, our idea was to simply skip sending the
WalSndKeepalive(false) for the FIRST iteration of the loop only! PSA
the patch 0002 which does this skip.

With this skip patch (v1-0002) applied the same pgbench tests were run
again. The results look like below.

Results (HEAD + Kyotaro patch + Skip-first keepalive patch)
===========================================================

RUN #1
------
[postgres(at)CentOS7-x64 ~]$ 2021-08-11 16:32:59.827 AEST [20339] LOG:
Total records: 19367
2021-08-11 16:32:59.827 AEST [20339] LOG: 24: 10 / 9232 /
3098: 19 / 440
2021-08-11 16:32:59.827 AEST [20339] LOG: 102: 1 / 1 /
32: 257 / 16828
2021-08-11 16:32:59.827 AEST [20339] LOG: 187: 1 / 1 /
52: 155 / 9541

RUN #2
------
[postgres(at)CentOS7-x64 ~]$ 2021-08-11 16:36:03.983 AEST [25513] LOG:
Total records: 17815
2021-08-11 16:36:03.983 AEST [25513] LOG: 24: 73 / 8683 /
2958: 1647 / 3290
2021-08-11 16:36:03.983 AEST [25513] LOG: 8280: 1 / 1 /
3: 88 / 88

RUN #3
------
[postgres(at)CentOS7-x64 ~]$ 2021-08-11 16:39:27.655 AEST [31061] LOG:
Total records: 19906
2021-08-11 16:39:27.655 AEST [31061] LOG: 24: 18 / 8546 /
2890: 61 / 1530
2021-08-11 16:39:27.655 AEST [31061] LOG: 83: 1 / 3 /
1: 8664 / 8664

~~

This data shows the special keepalives are now greatly reduced from
1000s to just 10s.

Thoughts?

------
[1] https://www.postgresql.org/message-id/20210610.150016.1709823354377067679.horikyota.ntt%40gmail.com

Kind Regards,
Peter Smith.
Fujitsu Australia

Attachment Content-Type Size
v1-0002-WIP-skip-the-keepalive-on-FIRST-loop-iteration.patch application/octet-stream 1.4 KB
v1-0001-WIP-count-keepalives-by-Kyotaro.patch application/octet-stream 5.4 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Masahiko Sawada 2021-08-12 02:52:32 Re: Small documentation improvement for ALTER SUBSCRIPTION
Previous Message Masahiko Sawada 2021-08-12 02:09:10 Re: pgsql: pgstat: Bring up pgstat in BaseInit() to fix uninitialized use o