Improving connection scalability: GetSnapshotData()

From: Andres Freund <andres(at)anarazel(dot)de>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Improving connection scalability: GetSnapshotData()
Date: 2020-03-01 08:36:01
Message-ID: 20200301083601.ews6hz5dduc3w2se@alap3.anarazel.de

Hi,

I think postgres' issues with scaling to larger numbers of connections
are a serious problem in the field. While poolers can address some of
that, given the issues around prepared statements, transaction state,
etc., I don't think they are sufficient in many cases. They also add
latency.

Nor do I think the argument that one shouldn't have more than a few
dozen connections holds much water. As clients have think time, as
database results have to be sent/received (most clients don't use
pipelining), and as many applications have many application servers
with individual connection pools, it's very common to need more
connections than postgres can easily deal with.

The largest reason for that is GetSnapshotData(). It scales poorly to
larger connection counts. Part of that is obviously its O(connections)
nature, but I always suspected there had to be more to it. I've seen
production workloads spending > 98% of the cpu time in
GetSnapshotData().

After a lot of analysis and experimentation I figured out that the
primary reason for this is PGXACT->xmin. Even the simplest transaction
modifies MyPgXact->xmin several times during its lifetime (IIRC twice
(snapshot & release) for exec_bind_message(), the same for
exec_execute_message(), then again as part of EOXact processing). That
means a backend doing GetSnapshotData() on a system with a number of
other active connections is very likely to hit PGXACT cachelines that
are owned by another cpu / set of cpus / socket. The larger the system,
the worse the consequences.

This problem is most prominent (and hardest to fix) for xmin, but it
also exists for the other fields in PGXACT. We rarely have xid, nxids,
overflowed, or vacuumFlags set to interesting values, yet we constantly
write them, leading to cross-node traffic.
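
For reference, the struct in question (storage/proc.h, comments
abridged). At ~12 bytes it is small enough that several backends'
PGXACTs share each 64-byte cacheline, so one backend storing to its
xmin invalidates the line for every other backend scanning the array:

    typedef struct PGXACT
    {
        TransactionId xid;          /* currently executing top-level xid,
                                     * or InvalidTransactionId */
        TransactionId xmin;         /* minimal running XID as of snapshot
                                     * acquisition */
        uint8         vacuumFlags;  /* vacuum-related flags */
        bool          overflowed;   /* subxid cache overflowed? */
        bool          delayChkpt;   /* delaying checkpoint start? */
        uint8         nxids;        /* # of cached subxids */
    } PGXACT;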

The second biggest problem is that the indirection through pgprocnos
that GetSnapshotData() has to go through to get each backend's xmin is
very unfriendly to a pipelined CPU (i.e., every CPU postgres runs on).
There's basically a stall at the end of every loop iteration - which is
exacerbated by there being so many cache misses.
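
Roughly, the hot loop does the following (heavily abridged from
procarray.c); the second load cannot even be issued until the first
one completes:

    for (index = 0; index < numProcs; index++)
    {
        /* load #1: which PGXACT to look at */
        int         pgprocno = arrayP->pgprocnos[index];

        /* load #2 depends on load #1 - on a miss the pipeline waits */
        PGXACT     *pgxact = &allPgXact[pgprocno];
        TransactionId xid = pgxact->xid;

        /* ... xmin/xid/subxid processing ... */
    }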

It's fairly easy to avoid unnecessarily dirtying cachelines for all the
PGXACT fields except xmin, because xmin actually needs to be visible to
other backends.

While it sounds almost trivial in hindsight, it took me a long while to
arrive at a solution to a big part of this problem: We don't actually
need to look at PGXACT->xmin to compute a snapshot. The only reason
GetSnapshotData() does so is that it also computes
RecentGlobal[Data]Xmin.

But we don't actually need those all that frequently. They're primarily
used as horizons for heap_page_prune_opt() etc. For one, while pruning
is really important, it doesn't happen *all* the time. More importantly,
a RecentGlobalXmin from an earlier transaction is actually sufficient
for most pruning requests, especially when a larger percentage of
transactions read rather than update (very common).

By having GetSnapshotData() compute an accurate upper bound at or above
which we are certain not to be able to prune (basically the
transaction's xmin, slot horizons, etc.), and a conservative lower bound
below which we are definitely able to prune, we can still allow most
pruning actions to happen. If a pruning request (or something similar)
encounters an xid between those bounds, an accurate lower bound can be
computed on demand.

That allows us to avoid looking at PGXACT->xmin when building a
snapshot.
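
A minimal sketch of such a test; the boundary variables and function
names here are mine, not necessarily the patch's:

    /*
     * Sketch: xids below 'definitely_removable' are prunable, xids at
     * or above 'maybe_removable' are definitely not. Only the window
     * between the two forces an accurate (expensive) recomputation.
     */
    static bool
    XidIsRemovable(FullTransactionId xid)
    {
        if (FullTransactionIdPrecedes(xid, definitely_removable))
            return true;
        if (!FullTransactionIdPrecedes(xid, maybe_removable))
            return false;

        /* in between: compute an accurate horizon, then retest */
        ComputeAccurateRemovalHorizon();
        return FullTransactionIdPrecedes(xid, definitely_removable);
    }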

To address the second big problem (the indirection), we can instead
pack the contents of PGXACT tightly, just like we do for pgprocnos. In
the attached series, I introduced separate arrays for xids, vacuumFlags,
and nsubxids.

The reason for splitting them is that they change at different rates
and have different sizes. In a read-mostly workload, most backends are
not going to have an xid assigned, making the xids array almost
constant. As long as all xids are unassigned, GetSnapshotData() doesn't
need to look at anything else, so it makes sense to check the xid
first.
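
In sketch form (array names as in the attached series, modulo later
renaming):

    /* dense, parallel arrays in ProcGlobal, one entry per in-use backend */
    TransactionId  *xids;        /* all InvalidTransactionId in read-mostly
                                  * workloads, so the lines stay shared */
    uint8          *vacuumFlags;

    for (index = 0; index < numProcs; index++)
    {
        TransactionId xid = xids[index];    /* plain sequential access */

        if (!TransactionIdIsValid(xid))
            continue;                       /* common case: done */

        /* only backends with an xid require looking any further */
        if (vacuumFlags[index] & PROC_IN_VACUUM)
            continue;

        /* ... add xid & subxids to the snapshot ... */
    }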

Here are some numbers for the submitted patch series. I had to cull
some further improvements to make it more manageable, but I think the
numbers are still quite convincing.

The workload is a read-only pgbench run, i.e. pgbench -M prepared -c
$conns -j $conns -S -n for each client count. This is on a machine with
2 Intel(R) Xeon(R) Platinum 8168 CPUs, but virtualized.

conns       tps master   tps pgxact-split

    1     26842.492845       26524.194821
   10    246923.158682      249224.782661
   50    695956.539704      709833.746374
  100   1054727.043139     1903616.306028
  200    964795.282957     1949200.338012
  300    906029.377539     1927881.231478
  400    845696.690912     1911065.369776
  500    812295.222497     1926237.255856
  600    888030.104213     1903047.236273
  700    866896.532490     1886537.202142
  800    863407.341506     1883768.592610
  900    871386.608563     1874638.012128
 1000    887668.277133     1876402.391502
 1500    860051.361395     1815103.564241
 2000    890900.098657     1775435.271018
 3000    874184.980039     1653953.817997
 4000    845023.080703     1582582.316043
 5000    817100.195728     1512260.802371

I think these are pretty nice results.

Note that the patchset currently does not implement
snapshot_too_old; the rest of the regression tests pass.

One further cool consequence of the fact that GetSnapshotData()'s
results can be made to depend only on the set of xids in progress is
that caching the results of GetSnapshotData() becomes almost trivial at
that point: We only need to recompute snapshots when a toplevel
transaction commits/aborts.

So we can avoid rebuilding a snapshot when no commit has happened since
it was last built. That amounts to assigning a current 'commit sequence
number' to the snapshot, and checking it against the current number at
the time of the next GetSnapshotData() call. Well, it turns out there's
this "LSN" thing we assign to commits (there are some small issues with
that though). I've experimented with that, and it considerably further
improves the numbers above - both with a higher peak throughput, and,
more importantly, by almost entirely removing the throughput regression
from 2000 connections onwards.
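
In (hypothetical) sketch form - neither the shared counter nor the
functions below exist under these names:

    static uint64       cached_snapshot_csn = 0; /* CSN cache is based on */
    static SnapshotData cached_snapshot;

    Snapshot
    GetSnapshotDataCached(void)
    {
        /* hypothetical shared counter bumped on every commit/abort */
        uint64  csn = pg_atomic_read_u64(&ShmemVariableCache->commitSeqNo);

        if (csn != cached_snapshot_csn)
        {
            /* something committed: do the O(#backends) work once */
            BuildSnapshot(&cached_snapshot);
            cached_snapshot_csn = csn;
        }
        return &cached_snapshot;
    }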

I'm still working on cleaning that part of the patch up, I'll post it in
a bit.

The series currently consists of:

0001-0005: Fixes and assert improvements that are independent of the patch, but
are hit by the new code (see also separate thread).

0006: Move delayChkpt from PGXACT to PGPROC; it's rarely checked & frequently modified

0007: WIP: Introduce abstraction layer for "is tuple invisible" tests.

This is the most crucial piece. Instead of code directly using
RecentGlobalXmin, there's a new set of functions for testing
whether an xid is visible (InvisibleToEveryoneTestXid() et al).

Those functions use new horizon boundaries computed as part of
GetSnapshotData(), and recompute an accurate boundary when the
tested xid falls in between.

There's a bit more infrastructure needed - we need to limit how
often an accurate snapshot is computed. Probably to once per
snapshot? Or once per transaction?

To avoid issues with the lower boundary getting too old and
presenting a wraparound danger, I made all the xids be
FullTransactionIds. That imo is a good thing?

This patch currently breaks old_snapshot_threshold, as I've not
yet integrated it with the new functions. I think we can make the
old-snapshot stuff a lot more precise with this - instead of
always triggering conflicts when RecentGlobalXmin is too old, we
can do so only in the cases where we actually remove a row. I ran
out of energy threading that through heap_page_prune() and
HeapTupleSatisfiesVacuum().

0008: Move PGXACT->xmin back to PGPROC.

Now that GetSnapshotData() doesn't access xmin anymore, we can
make it a normal field in PGPROC again.

0009: Improve GetSnapshotData() performance by avoiding indirection for xid access.
0010: Improve GetSnapshotData() performance by avoiding indirection for vacuumFlags
0011: Improve GetSnapshotData() performance by avoiding indirection for nsubxids access

These successively move the remaining PGXACT fields into separate
arrays in ProcGlobal, and adjust GetSnapshotData() to take
advantage. Those arrays are dense in the sense that they only
contain data for PGPROCs that are in use (i.e., when a backend
disconnects, the array contents are moved around).

I think xid and vacuumFlags are pretty reasonable, but they
obviously need cleanup:
- The biggest cleanup would be to add a few helper functions for
accessing the values, rather than open coding that.
- Perhaps we should call the ProcGlobal ones 'cached', and name
the PGPROC ones as the one true source of truth?

For subxids I thought it'd be nice to have nxids and overflow be
only one number. But that probably was the wrong call? Now
TransactionIdIsInProgress() cannot look at the subxids that did
fit in PGPROC.subxids. I'm not sure that's important, given the
likelihood of misses? But I'd probably still have the subxid
array entries be a {uint8 nsubxids; bool overflowed} pair
instead.

To keep the arrays dense, they copy the logic for pgprocnos, which
means that ProcArrayAdd/Remove move entries around. Unfortunately
that currently requires holding both ProcArrayLock and XidGenLock
(to avoid GetNewTransactionId() having to hold ProcArrayLock). But
that doesn't seem too bad?
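
I.e., on disconnect ProcArrayRemove() shifts the tails of the
dense arrays down, mirroring what it already does for pgprocnos
(sketch; the xids/vacuumFlags array names are the series', and
may still change):

    memmove(&ProcGlobal->xids[index],
            &ProcGlobal->xids[index + 1],
            (numProcs - index - 1) * sizeof(TransactionId));
    memmove(&ProcGlobal->vacuumFlags[index],
            &ProcGlobal->vacuumFlags[index + 1],
            (numProcs - index - 1) * sizeof(uint8));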

0012: Remove now unused PGXACT.

There's no reason to have it anymore.

The patchseries is also available at
https://github.com/anarazel/postgres/tree/pgxact-split

Greetings,

Andres Freund

Attachment Content-Type Size
v1-0001-WIP-Ensure-snapshot-is-registered-within-ScanPgRe.patch.gz application/x-patch-gzip 886 bytes
v1-0002-TMP-work-around-missing-snapshot-registrations.patch.gz application/x-patch-gzip 2.6 KB
v1-0003-TMP-don-t-build-snapshot_too_old-module-it-s-curr.patch.gz application/x-patch-gzip 499 bytes
v1-0004-Improve-and-extend-asserts-for-a-snapshot-being-s.patch.gz application/x-patch-gzip 2.2 KB
v1-0005-Fix-unlikely-xid-wraparound-issue-in-heap_abort_s.patch.gz application/x-patch-gzip 803 bytes
v1-0006-Move-delayChkpt-from-PGXACT-to-PGPROC-it-s-rarely.patch.gz application/x-patch-gzip 2.9 KB
v1-0007-WIP-Introduce-abstraction-layer-for-is-tuple-invi.patch.gz application/x-patch-gzip 9.1 KB
v1-0008-Move-PGXACT-xmin-back-to-PGPROC.patch.gz application/x-patch-gzip 4.0 KB
v1-0009-Improve-GetSnapshotData-performance-by-avoiding-i.patch.gz application/x-patch-gzip 8.8 KB
v1-0010-Improve-GetSnapshotData-performance-by-avoiding-i.patch.gz application/x-patch-gzip 4.7 KB
v1-0011-Improve-GetSnapshotData-performance-by-avoiding-i.patch.gz application/x-patch-gzip 4.8 KB
v1-0012-Remove-now-unused-PGXACT.patch.gz application/x-patch-gzip 3.0 KB
