Re: [PATCH] fix a performance issue with multiple logical-decoding walsenders

From: Marc Cousin <cousinmarc(at)gmail(dot)com>
To: Pierre Ducroquet <p(dot)psql(at)pinaraf(dot)info>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Julien Rouhaud <rjuju123(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [PATCH] fix a performance issue with multiple logical-decoding walsenders
Date: 2020-01-06 19:21:01
Message-ID: ff8d58cc-8f5a-b0f2-9705-98d7ff923fad@gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

I spent a little bit of time trying to explain the problem we are facing clearly, and provide a reproducible benchmark.

So here it is.

What triggered our investigation is that we have a PostgreSQL cluster containing about 15 databases, most of them being used as sources for logical replication. This means we have about as many WAL senders active on the cluster at the same time.

What we saw was that we had very high spikes of CPU activity, with very high level of SYS (up to 50% of the whole system was SYS, with 100% load on all CPUs) on the server, which is a dual Xeon Silver 4110, so 16 cores, 32 threads. That seemed insane as our usual load isn't that high on average (like 20% total cpu use, with ~ 1% of SYS), and mostly on 2 of those 15 databases. This mostly occured when creating an index, or doing batch updates, copys, etc. And all WAL senders consumed about the same amount, even those connected to databases where nothing happened. And while testing, we noticed things were worse with PG 12 than with PG 10 (that was the cause for the first patch Pierre posted, which was more of a way to get PostmasterIsAlive out of the way for PG 10 to get the same behaviour on both databases and confirm what was different between the two versions).

So that was what the first benchmark we did, what Pierre posted a few days ago. With the second patch (reducing calls to GetFlushRecPtr), on PostgreSQL 12, with statements affecting lots of records at once, we managed to reduce the WAL senders' consumption by a factor of 15 (if the patch is correct of course). SYS was down to more sensible (near 0) values. WAL senders for databases which had no decoding to do didn't consume that much anymore, only the one connected to the database doing the work used a lot of CPU, but that's expected. This solves the problem we are facing. Without this, we won't be able to upgrade to PG 12, as the impact of GetFlushRecPtr is even worse than with PG 10.

I've now tried to measure the impact of the patch on a more evenly balanced activity on several databases, where the contention on GetFlushRecPtr is less severe, to see if there are wins in all cases. Simple scripts to test this are provided as an attachment.

Just set the "max" environment variable to the amount of databases/WAL senders you want, and then run create_logical.sh (creates the databases and the slots), then connect_logical.sh (connects pg_recvlogical processes to these slots), and run_stress.sh (runs a pgbench on each database, each doing 100000 transactions, all in parallel). drop_logical.sh does the cleanup. Sorry, they are very basic...

Here are the results on patched/unpatched PG 12. You have the messages from all pgbenchs, then the consumption of all WAL senders at the end of a run.

As you can see, pgbench runs are ~ 20% faster. That's because with unpatched, we are CPU starved (the server is 100% CPU), which we aren't with patched. WAL senders needed about half the CPU in the patched case as shown by pidstat.

UNPATCHED:

transaction type: <builtin: TPC-B (sort of)>
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
number of transactions per client: 100000
number of transactions actually processed: 100000/100000
latency average = 1.943 ms
tps = 514.796454 (including connections establishing)
tps = 514.805610 (excluding connections establishing)
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
number of transactions per client: 100000
number of transactions actually processed: 100000/100000
latency average = 1.949 ms
tps = 513.130790 (including connections establishing)
tps = 513.168135 (excluding connections establishing)
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
number of transactions per client: 100000
number of transactions actually processed: 100000/100000
latency average = 1.950 ms
tps = 512.946425 (including connections establishing)
tps = 512.961746 (excluding connections establishing)
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
number of transactions per client: 100000
number of transactions actually processed: 100000/100000
latency average = 1.951 ms
tps = 512.643065 (including connections establishing)
tps = 512.678765 (excluding connections establishing)
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
number of transactions per client: 100000
number of transactions actually processed: 100000/100000
latency average = 1.953 ms
tps = 512.159794 (including connections establishing)
tps = 512.178075 (excluding connections establishing)
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
number of transactions per client: 100000
number of transactions actually processed: 100000/100000
latency average = 1.953 ms
tps = 512.024962 (including connections establishing)
tps = 512.034590 (excluding connections establishing)
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
number of transactions per client: 100000
number of transactions actually processed: 100000/100000
latency average = 1.953 ms
tps = 512.016413 (including connections establishing)
tps = 512.034438 (excluding connections establishing)
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
number of transactions per client: 100000
number of transactions actually processed: 100000/100000
latency average = 1.954 ms
tps = 511.728080 (including connections establishing)
tps = 511.760138 (excluding connections establishing)
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
number of transactions per client: 100000
number of transactions actually processed: 100000/100000
latency average = 1.954 ms
tps = 511.655046 (including connections establishing)
tps = 511.678533 (excluding connections establishing)
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
number of transactions per client: 100000
number of transactions actually processed: 100000/100000
latency average = 1.955 ms
tps = 511.604767 (including connections establishing)
tps = 511.617562 (excluding connections establishing)
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
number of transactions per client: 100000
number of transactions actually processed: 100000/100000
latency average = 1.955 ms
tps = 511.525593 (including connections establishing)
tps = 511.558150 (excluding connections establishing)
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
number of transactions per client: 100000
number of transactions actually processed: 100000/100000
latency average = 1.955 ms
tps = 511.496498 (including connections establishing)
tps = 511.505871 (excluding connections establishing)
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
number of transactions per client: 100000
number of transactions actually processed: 100000/100000
latency average = 1.956 ms
tps = 511.334434 (including connections establishing)
tps = 511.363141 (excluding connections establishing)
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
number of transactions per client: 100000
number of transactions actually processed: 100000/100000
latency average = 1.956 ms
tps = 511.256908 (including connections establishing)
tps = 511.284577 (excluding connections establishing)
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
number of transactions per client: 100000
number of transactions actually processed: 100000/100000
latency average = 1.957 ms
tps = 511.021219 (including connections establishing)
tps = 511.041905 (excluding connections establishing)
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
number of transactions per client: 100000
number of transactions actually processed: 100000/100000
latency average = 1.957 ms
tps = 510.977176 (including connections establishing)
tps = 511.004730 (excluding connections establishing)
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
number of transactions per client: 100000
number of transactions actually processed: 100000/100000
latency average = 1.958 ms
tps = 510.818341 (including connections establishing)
tps = 510.870735 (excluding connections establishing)
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
number of transactions per client: 100000
number of transactions actually processed: 100000/100000
latency average = 1.958 ms
tps = 510.719611 (including connections establishing)
tps = 510.741600 (excluding connections establishing)
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
number of transactions per client: 100000
number of transactions actually processed: 100000/100000
latency average = 1.959 ms
tps = 510.460165 (including connections establishing)
tps = 510.504649 (excluding connections establishing)
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
number of transactions per client: 100000
number of transactions actually processed: 100000/100000
latency average = 1.961 ms
tps = 509.962481 (including connections establishing)
tps = 509.978043 (excluding connections establishing)

~$ pidstat -C postgres -l -u -T CHILD
Linux 4.9.0-11-amd64 (hardy) 01/06/2020 _x86_64_ (32 CPU)

02:39:30 PM UID PID usr-ms system-ms guest-ms Command
02:39:30 PM 111 78644 2232190 284870 0 /tmp/pg-12-unpatched/bin/postgres
02:39:30 PM 111 78646 400 340 0 postgres: checkpointer
02:39:30 PM 111 78647 530 1330 0 postgres: background writer
02:39:30 PM 111 78648 300 150 0 postgres: walwriter
02:39:30 PM 111 78649 30 40 0 postgres: autovacuum launcher
02:39:30 PM 111 78650 130 440 0 postgres: stats collector
02:39:30 PM 111 78790 86340 20560 0 postgres: walsender postgres [local] idle
02:39:30 PM 111 78792 89170 23270 0 postgres: walsender postgres [local] idle
02:39:30 PM 111 78794 86990 20740 0 postgres: walsender postgres [local] idle
02:39:30 PM 111 78795 91900 22540 0 postgres: walsender postgres [local] idle
02:39:30 PM 111 78797 92000 23190 0 postgres: walsender postgres [local] idle
02:39:30 PM 111 78799 94060 22520 0 postgres: walsender postgres [local] idle
02:39:30 PM 111 78801 95300 21500 0 postgres: walsender postgres [local] idle
02:39:30 PM 111 78803 93120 21360 0 postgres: walsender postgres [local] idle
02:39:30 PM 111 78805 95420 21920 0 postgres: walsender postgres [local] idle
02:39:30 PM 111 78807 94400 21350 0 postgres: walsender postgres [local] idle
02:39:30 PM 111 78809 88850 20390 0 postgres: walsender postgres [local] idle
02:39:30 PM 111 78811 90030 20690 0 postgres: walsender postgres [local] idle
02:39:30 PM 111 78812 94310 22660 0 postgres: walsender postgres [local] idle
02:39:30 PM 111 78813 94080 22470 0 postgres: walsender postgres [local] idle
02:39:30 PM 111 78814 95370 21520 0 postgres: walsender postgres [local] idle
02:39:30 PM 111 78815 94780 21470 0 postgres: walsender postgres [local] idle
02:39:30 PM 111 78816 92440 21080 0 postgres: walsender postgres [local] idle
02:39:30 PM 111 78817 94360 20700 0 postgres: walsender postgres [local] idle
02:39:30 PM 111 78818 92230 20760 0 postgres: walsender postgres [local] idle
02:39:30 PM 111 78819 90280 20780 0 postgres: walsender postgres [local] idle

PATCHED:

transaction type: <builtin: TPC-B (sort of)>
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
number of transactions per client: 100000
number of transactions actually processed: 100000/100000
latency average = 1.680 ms
tps = 595.090858 (including connections establishing)
tps = 595.131449 (excluding connections establishing)
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
number of transactions per client: 100000
number of transactions actually processed: 100000/100000
latency average = 1.683 ms
tps = 594.156492 (including connections establishing)
tps = 594.198624 (excluding connections establishing)
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
number of transactions per client: 100000
number of transactions actually processed: 100000/100000
latency average = 1.684 ms
tps = 593.927387 (including connections establishing)
tps = 593.946829 (excluding connections establishing)
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
number of transactions per client: 100000
number of transactions actually processed: 100000/100000
latency average = 1.686 ms
tps = 593.209506 (including connections establishing)
tps = 593.257556 (excluding connections establishing)
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
number of transactions per client: 100000
number of transactions actually processed: 100000/100000
latency average = 1.686 ms
tps = 593.144977 (including connections establishing)
tps = 593.162018 (excluding connections establishing)
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
number of transactions per client: 100000
number of transactions actually processed: 100000/100000
latency average = 1.686 ms
tps = 593.084403 (including connections establishing)
tps = 593.122178 (excluding connections establishing)
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
number of transactions per client: 100000
number of transactions actually processed: 100000/100000
latency average = 1.686 ms
tps = 593.134657 (including connections establishing)
tps = 593.199432 (excluding connections establishing)
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
number of transactions per client: 100000
number of transactions actually processed: 100000/100000
latency average = 1.687 ms
tps = 592.908760 (including connections establishing)
tps = 592.923386 (excluding connections establishing)
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
number of transactions per client: 100000
number of transactions actually processed: 100000/100000
latency average = 1.687 ms
tps = 592.802027 (including connections establishing)
tps = 592.814300 (excluding connections establishing)
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
number of transactions per client: 100000
number of transactions actually processed: 100000/100000
latency average = 1.687 ms
tps = 592.678874 (including connections establishing)
tps = 592.769759 (excluding connections establishing)
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
number of transactions per client: 100000
number of transactions actually processed: 100000/100000
latency average = 1.687 ms
tps = 592.642501 (including connections establishing)
tps = 592.723261 (excluding connections establishing)
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
number of transactions per client: 100000
number of transactions actually processed: 100000/100000
latency average = 1.688 ms
tps = 592.249597 (including connections establishing)
tps = 592.262962 (excluding connections establishing)
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
number of transactions per client: 100000
number of transactions actually processed: 100000/100000
latency average = 1.690 ms
tps = 591.867795 (including connections establishing)
tps = 591.958672 (excluding connections establishing)
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
number of transactions per client: 100000
number of transactions actually processed: 100000/100000
latency average = 1.690 ms
tps = 591.835950 (including connections establishing)
tps = 591.908940 (excluding connections establishing)
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
number of transactions per client: 100000
number of transactions actually processed: 100000/100000
latency average = 1.690 ms
tps = 591.799816 (including connections establishing)
tps = 591.824497 (excluding connections establishing)
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
number of transactions per client: 100000
number of transactions actually processed: 100000/100000
latency average = 1.690 ms
tps = 591.738978 (including connections establishing)
tps = 591.780258 (excluding connections establishing)
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
number of transactions per client: 100000
number of transactions actually processed: 100000/100000
latency average = 1.691 ms
tps = 591.530490 (including connections establishing)
tps = 591.570876 (excluding connections establishing)
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
number of transactions per client: 100000
number of transactions actually processed: 100000/100000
latency average = 1.691 ms
tps = 591.452142 (including connections establishing)
tps = 591.498424 (excluding connections establishing)
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
number of transactions per client: 100000
number of transactions actually processed: 100000/100000
latency average = 1.693 ms
tps = 590.674305 (including connections establishing)
tps = 590.708951 (excluding connections establishing)
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
number of transactions per client: 100000
number of transactions actually processed: 100000/100000
latency average = 1.693 ms
tps = 590.517240 (including connections establishing)
tps = 590.563531 (excluding connections establishing)

~$ pidstat -C postgres -l -u -T CHILD
Linux 4.9.0-11-amd64 (hardy) 01/06/2020 _x86_64_ (32 CPU)

02:29:02 PM UID PID usr-ms system-ms guest-ms Command
02:29:02 PM 111 75810 2185430 294190 0 /tmp/pg-12-patched/bin/postgres
02:29:02 PM 111 75812 410 320 0 postgres: checkpointer
02:29:02 PM 111 75813 410 1230 0 postgres: background writer
02:29:02 PM 111 75814 270 80 0 postgres: walwriter
02:29:02 PM 111 75815 30 20 0 postgres: autovacuum launcher
02:29:02 PM 111 75816 130 360 0 postgres: stats collector
02:29:02 PM 111 75961 35890 27240 0 postgres: walsender postgres [local] idle
02:29:02 PM 111 75963 37390 27950 0 postgres: walsender postgres [local] idle
02:29:02 PM 111 75965 38360 28110 0 postgres: walsender postgres [local] idle
02:29:02 PM 111 75966 38350 28160 0 postgres: walsender postgres [local] idle
02:29:02 PM 111 75968 38370 28160 0 postgres: walsender postgres [local] idle
02:29:02 PM 111 75970 37820 28110 0 postgres: walsender postgres [local] idle
02:29:02 PM 111 75972 38250 27330 0 postgres: walsender postgres [local] idle
02:29:02 PM 111 75974 36870 27640 0 postgres: walsender postgres [local] idle
02:29:02 PM 111 75976 36890 26850 0 postgres: walsender postgres [local] idle
02:29:02 PM 111 75979 36920 26330 0 postgres: walsender postgres [local] idle
02:29:02 PM 111 75980 37090 27240 0 postgres: walsender postgres [local] idle
02:29:02 PM 111 75981 38040 28210 0 postgres: walsender postgres [local] idle
02:29:02 PM 111 75982 36530 27460 0 postgres: walsender postgres [local] idle
02:29:02 PM 111 75983 37560 27330 0 postgres: walsender postgres [local] idle
02:29:02 PM 111 75984 36660 27170 0 postgres: walsender postgres [local] idle
02:29:02 PM 111 75985 36370 27020 0 postgres: walsender postgres [local] idle
02:29:02 PM 111 75986 36960 27000 0 postgres: walsender postgres [local] idle
02:29:02 PM 111 75987 36460 26820 0 postgres: walsender postgres [local] idle
02:29:02 PM 111 75988 36290 27140 0 postgres: walsender postgres [local] idle
02:29:02 PM 111 75989 36320 26750 0 postgres: walsender postgres [local] idle

Regard,

Marc

Attachment Content-Type Size
perf_test_logical.tar.gz application/gzip 411 bytes

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Li, Zheng 2020-01-06 19:34:08 Re: NOT IN subquery optimization
Previous Message Julien Rouhaud 2020-01-06 19:16:08 Re: [PATCH] fix a performance issue with multiple logical-decoding walsenders