Re: Improving connection scalability: GetSnapshotData()

From: Andres Freund <andres(at)anarazel(dot)de>
To: Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>
Cc: Peter Geoghegan <pg(at)bowt(dot)ie>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Improving connection scalability: GetSnapshotData()
Date: 2020-04-06 13:39:59
Message-ID: 20200406133959.viql5fqecog6mppj@alap3.anarazel.de

Hi,

These benchmarks are on my workstation. The larger VM I used in the last
round wasn't available this time.

HW:
2 x Intel(R) Xeon(R) Gold 5215 (each 10 cores / 20 threads)
192GB RAM.
data directory is on a Samsung SSD 970 PRO 1TB

A bunch of terminals, emacs, mutt are open while the benchmark is
running. No browser.

Unless mentioned otherwise, relevant configuration options are:
max_connections=1200
shared_buffers=8GB
max_prepared_transactions=1000
synchronous_commit=local
huge_pages=on
fsync=off # to make it more likely to see scalability bottlenecks

Independent of the effects of this patch (i.e. including master) I had a
fairly hard time getting reproducible numbers for *low* client cases. I
found the numbers to be more reproducible if I pinned server/pgbench
onto the same core :(. I chose to do that for the -c1 cases, to
benchmark the optimal behaviour, as that seemed to have the biggest
potential for regressions.
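
For reference, the pinning looks roughly like this (core number and
exact invocations are illustrative, not necessarily what I used):

taskset -c 2 postgres -D $PGDATA &
taskset -c 2 pgbench -c 1 -j 1 -M prepared -T 180 ...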

All numbers are best of three. Each test starts in a freshly created
cluster.

On 2020-03-30 17:04:00 +0300, Alexander Korotkov wrote:
> The following pgbench scripts come first to my mind:
> 1) SELECT txid_current(); (artificial but good for checking corner case)

-M prepared -T 180
(did a few longer runs, but doesn't seem to matter much)
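
The script is just the single statement, so roughly (file name
illustrative):

cat > ~/tmp/txid.sql <<'EOF'
SELECT txid_current();
EOF
pgbench -n -M prepared -T 180 -c $CLIENTS -j $CLIENTS -f ~/tmp/txid.sql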

clients    tps master    tps pgxact
      1         46118         46027
     16        377357        440233
     40        373304        410142
    198        103912        105579

btw, there's some pretty horrible cacheline bouncing in txid_current(),
because backends first call ReadNextFullTransactionId() (acquires
XidGenLock in shared mode, reads ShmemVariableCache->nextFullXid) and
then separately trigger GetNewTransactionId() (acquires XidGenLock
exclusively, reads & writes nextFullXid).

With fsync=off (and also with synchronous_commit=off) the numbers are,
at lower client counts, severely depressed and variable, due to
walwriter going completely nuts (using more CPU than the backend doing
the queries). Because WAL writes are so fast on my storage, individual
XLogBackgroundFlush() calls are very quick. This leads to a *lot* of
kill()s from the backend, from within XLogSetAsyncXactLSN(). There's
got to be a bug here. But it's unrelated.

> 2) Single insert statement (as example of very short transaction)

CREATE TABLE testinsert(c1 int not null, c2 int not null, c3 int not null, c4 int not null);
INSERT INTO testinsert VALUES(1, 2, 3, 4);

-M prepared -T 360
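
Put together, roughly (file name illustrative):

cat > ~/tmp/insert.sql <<'EOF'
INSERT INTO testinsert VALUES(1, 2, 3, 4);
EOF
pgbench -n -M prepared -T 360 -c $CLIENTS -j $CLIENTS -f ~/tmp/insert.sql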

fsync on:
clients    tps master    tps pgxact
      1           653           658
     16          5687          5668
     40         14212         14229
    198         60483         62420

fsync off:
clients    tps master    tps pgxact
      1         59356         59891
     16        290626        299991
     40        348210        355669
    198        289182        291529

clients    tps master    tps pgxact
   1024         47586         52135

-M simple
fsync off:
clients    tps master    tps pgxact
     40        289077        326699
    198        286011        299928

> 3) Plain pgbench read-write (you already did it for sure)

-s 100 -M prepared -T 700
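
I.e. along these lines (initialized once per test at scale 100, then the
builtin tpcb-like script, with the client count varied):

pgbench -i -s 100
pgbench -n -M prepared -T 700 -c $CLIENTS -j $CLIENTS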

autovacuum=off, fsync on:
clients    tps master    tps pgxact
      1           474           479
     16          4356          4476
     40          8591          9309
    198         20045         20261
   1024         17986         18545

autovacuum=off, fsync off:
clients    tps master    tps pgxact
      1          7828          7719
     16         49069         50482
     40         68241         73081
    198         73464         77801
   1024         25621         28410

I chose autovacuum off because otherwise the results vary much more
widely, and autovacuum isn't really needed for the workload.

> 4) pgbench read-write script with an increased number of SELECTs. Repeat the
> select from pgbench_accounts, say, 10 times with different aids.

I interspersed all server-side statements in the script with two
selects of other pgbench_accounts rows each.
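
A reconstructed fragment to give the idea (not the exact script I ran):

\set aid random(1, 100000 * :scale)
\set aid2 random(1, 100000 * :scale)
\set aid3 random(1, 100000 * :scale)
\set delta random(-5000, 5000)
BEGIN;
UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;
SELECT abalance FROM pgbench_accounts WHERE aid = :aid2;
SELECT abalance FROM pgbench_accounts WHERE aid = :aid3;
...

with each of the remaining tpcb-like statements likewise followed by two
such SELECTs.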

-s 100 -M prepared -T 700
autovacuum=off, fsync on:
clients    tps master    tps pgxact
      1           365           367
    198         20065         21391

-s 1000 -M prepared -T 700
autovacuum=off, fsync on:
clients    tps master    tps pgxact
     16          2757          2880
     40          4734          4996
    198         16950         19998
   1024         22423         24935

> 5) 10% pgbench read-write, 90% of pgbench read-only

-s 100 -M prepared -T 100 -bselect-only@9 -btpcb-like@1

autovacuum=off, fsync on:
clients    tps master    tps pgxact
     16         37289         38656
     40         81284         81260
    198        189002        189357
   1024        143986        164762

> > That definitely needs to be measured, due to the locking changes around procarrayadd/remove.
> >
> > I don't think regressions besides perhaps 2pc are likely - there's nothing really getting more expensive but procarray add/remove.
>
> I agree that ProcArrayAdd()/Remove() should be first subject of
> investigation, but other cases should be checked as well IMHO.

I'm not sure I really see the point. If a simple prepared transaction
doesn't show up as a negative difference, a more complex one won't
either, since the ProcArrayAdd()/Remove() related bottlenecks will play
a smaller and smaller role.

> Regarding 2pc, the following scenarios come to my mind:
> 1) pgbench read-write modified so that every transaction is prepared
> first, then commit prepared.

The numbers here are -M simple, because I wanted to use
PREPARE TRANSACTION 'ptx_:client_id';
COMMIT PREPARED 'ptx_:client_id';
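
The script is essentially the builtin tpcb-like script with the final
END replaced by those two statements, i.e. roughly:

\set aid random(1, 100000 * :scale)
\set bid random(1, 1 * :scale)
\set tid random(1, 10 * :scale)
\set delta random(-5000, 5000)
BEGIN;
UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;
SELECT abalance FROM pgbench_accounts WHERE aid = :aid;
UPDATE pgbench_tellers SET tbalance = tbalance + :delta WHERE tid = :tid;
UPDATE pgbench_branches SET bbalance = bbalance + :delta WHERE bid = :bid;
INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);
PREPARE TRANSACTION 'ptx_:client_id';
COMMIT PREPARED 'ptx_:client_id';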

-s 100 -M simple -T 700 -f ~/tmp/pgbench-write-2pc.sql
autovacuum=off, fsync on:
clients    tps master    tps pgxact
      1           251           249
     16          2134          2174
     40          3984          4089
    198          6677          7522
   1024          3641          3617

> 2) 10% of 2pc pgbench read-write, 90% normal pgbench read-write

-s 100 -M prepared -T 100 -f ~/tmp/pgbench-write-2pc.sql@1 -btpcb-like@9

clients    tps master    tps pgxact
    198         18625         18906

> 3) 10% of 2pc pgbench read-write, 90% normal pgbench read-only

-s 100 -M prepared -T 100 -f ~/tmp/pgbench-write-2pc.sql@1 -bselect-only@9

clients    tps master    tps pgxact
    198         84817         84350

I also benchmarked connection overhead, using pgbench with -C (a new
connection per transaction) executing SELECT 1.
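
I.e. something like (file name illustrative):

cat > ~/tmp/select1.sql <<'EOF'
SELECT 1;
EOF
pgbench -n -C -T 10 -c $CLIENTS -j $CLIENTS -f ~/tmp/select1.sql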

-T 10
clients    tps master    tps pgxact
      1           572           587
     16          2109          2140
     40          2127          2136
    198          2097          2129
   1024          2101          2118

These numbers seem pretty decent to me. The regressions seem mostly
within noise. The one possible exception to that is plain pgbench
read/write with fsync=off and only a single session. I'll run more
benchmarks around that tomorrow (but now it's 6am :().

Greetings,

Andres Freund
