Re: Improving connection scalability: GetSnapshotData()

From: Konstantin Knizhnik <k(dot)knizhnik(at)postgrespro(dot)ru>
To: pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: Improving connection scalability: GetSnapshotData()
Date: 2020-09-04 15:24:12
Message-ID: 4f245382-2f04-3b2e-ae94-d075d2eb7868@postgrespro.ru
Lists: pgsql-hackers

On 03.09.2020 11:18, Michael Paquier wrote:
> On Sun, Aug 16, 2020 at 02:26:57PM -0700, Andres Freund wrote:
>> So we get some buildfarm results while thinking about this.
> Andres, there is an entry in the CF for this thread:
> https://commitfest.postgresql.org/29/2500/
>
> A lot of work has been committed with 623a9ba, 73487a6, 5788e25, etc.
> Now that PGXACT is done, how much work is remaining here?
> --
> Michael

Andres,
First of all, many thanks for this work.
Improving Postgres connection scalability is very important.

The reported results look very impressive.
But when I tried to reproduce them, I did not observe similar behavior.
So I am wondering what the difference may be and what I am doing wrong.

I tried two different systems.
The first one is an IBM Power2 server with 384 cores and 8TB of RAM.
I ran the same read-only pgbench test as you did. I do not think the size of the database matters here, so I used scale 100 -
it seems to be large enough to avoid frequent buffer conflicts.
Then I ran the same scripts as you:

 for ((n=100; n < 1000; n+=100)); do echo $n; pgbench -M prepared -c $n -j $n -T 100 -S -n postgres ; done
 for ((n=1000; n <= 5000; n+=1000)); do echo $n; pgbench -M prepared -c $n -j $n -T 100 -S -n postgres ; done
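The TPS numbers in the tables below come from the two "tps = ..." lines that pgbench (as of v13) prints at the end of a run. A minimal parsing sketch (the sample output text here is illustrative, not an actual run):

```python
import re

# Illustrative fragment of pgbench (v13-era) output; only the two
# "tps =" lines matter for the tables in this mail.
sample = """
transaction type: <builtin: select only (SELECT)>
number of clients: 1000
tps = 1105750.312 (including connections establishing)
tps = 1163292.845 (excluding connections establishing)
"""

def extract_tps(text):
    """Return (including, excluding) TPS as floats from pgbench output."""
    inc = float(re.search(r"tps = ([\d.]+) \(including", text).group(1))
    exc = float(re.search(r"tps = ([\d.]+) \(excluding", text).group(1))
    return inc, exc

inc, exc = extract_tps(sample)
print(int(inc), int(exc))
```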

I compared current master with the version of Postgres prior to your scalability-improvement commits: a9a4a7ad56

For all numbers of connections the older version shows slightly better results; for example, for 500 clients: 475k TPS vs. 450k TPS for current master.

This is quite an exotic server and I currently do not have access to it,
so I repeated the experiments on an Intel server.
It has 160 cores (Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz) and 256GB of RAM.

The same database, the same script, results are the following:

Clients old/incl old/excl new/incl new/excl
1000 1105750 1163292 1206105 1212701
2000 1050933 1124688 1149706 1164942
3000 1063667 1195158 1118087 1144216
4000 1040065 1290432 1107348 1163906
5000 943813 1258643 1103790 1160251

I show results including/excluding connection establishment separately,
because in the new version there is almost no difference between them,
while for the old version the gap is noticeable.

The configuration file has the following differences from the default postgres config:

max_connections = 10000 # (change requires restart)
shared_buffers = 8GB # min 128kB

These results contradict yours, which makes me ask the following questions:

1. Why is performance in your case almost two times higher (2 million TPS vs. 1 million)?
The hardware in my case seems to be at least no worse than yours...
Maybe there are some other improvements in the version you tested which are not yet committed to master?

2. You wrote: "This is on a machine with 2
Intel(R) Xeon(R) Platinum 8168, but virtualized (2 sockets of 18 cores/36 threads)"

According to Intel's specification, the Intel® Xeon® Platinum 8168 processor has 24 cores:
https://ark.intel.com/content/www/us/en/ark/products/120504/intel-xeon-platinum-8168-processor-33m-cache-2-70-ghz.html

And in your graph we can see an almost linear increase in speed up to 40 connections.

But the most suspicious word for me is "virtualized". What is the actual hardware, and how is it virtualized?

Do you have any idea why in my case the master version (with your commits) behaves almost the same as the non-patched version?
Below is yet another table showing scalability from 10 to 100 connections, combining your results (first two columns) and mine (last two columns):

Clients  old master  pgxact-split-cache  current master  revision a9a4a7ad56
10          367883              375682          358984               347067
20          748000              810964          668631               630304
30          999231             1288276          920255               848244
40          991672             1573310         1100745               970717
50         1017561             1715762         1193928              1008755
60          993943             1789698         1255629               917788
70          971379             1819477         1277634               873022
80          966276             1842248         1266523               830197
90          901175             1847823         1255260               736550
100         803175             1865795         1241143               736756
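To put the discrepancy in numbers, here is a quick arithmetic check on the 100-client row of the table (your speedup from old master to pgxact-split-cache vs. my speedup from a9a4a7ad56 to current master):

```python
# Speedup at 100 clients, patched vs. unpatched, from the combined table.
andres_speedup = 1865795 / 803175  # old master -> pgxact-split-cache, ~2.3x
my_speedup = 1241143 / 736756      # revision a9a4a7ad56 -> current master, ~1.7x
print(round(andres_speedup, 2), round(my_speedup, 2))
```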

Maybe it is because of the more complex architecture of my server?

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
