From: Andres Freund <andres(at)anarazel(dot)de>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Mithun Cy <mithun(dot)cy(at)enterprisedb(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Masahiko Sawada <masahiko(dot)sawada(at)2ndquadrant(dot)com>
Subject: Re: [HACKERS] Moving relation extension locks out of heavyweight lock manager
Date: 2020-02-12 04:54:37
Message-ID: 20200212045437.ya2d3pgpwssn3xcc@alap3.anarazel.de
Lists: pgsql-hackers

Hi,

On 2020-02-11 08:01:34 +0530, Amit Kapila wrote:
> I don't see much downside with the patch, rather there is a
> performance increase of 3-9% in various scenarios.

As I wrote in [1], I started to look at this patch. My problem with it is
that it just seems like the wrong direction architecturally to
me. There are two main aspects to this:

1) It basically builds another, more lightweight but less capable, lock
manager that can lock more objects than we can have distinct
locks for. It is faster because it uses *one* hashtable without
conflict handling, because it has fewer lock modes, and because it
doesn't support deadlock detection. And probably some other things.

2) A lot of the contention around file extension comes from us doing
multiple expensive things under one lock (determining the current
relation size, searching for a victim buffer, extending the file), and
in tiny increments (growing a 1TB table by 8kb). This patch doesn't
address that at all.

I've focused on 1) in the email referenced above ([1]). Here I'll focus
on 2).
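
To make that concrete, the extension path currently looks roughly like
the following (a simplified paraphrase of the RelationGetBufferForTuple()
path, from memory; FSM retries, the RelationAddExtraBlocks() bulk
extension path and error handling are omitted):

    LockRelationForExtension(relation, ExclusiveLock);

    /*
     * P_NEW means: look up the current relation size (smgrnblocks()),
     * find a victim buffer (possibly FlushBuffer(), i.e. a write and,
     * for logged relations, often a WAL flush), and smgrextend() an 8kb
     * block of zeroes - all of that while the extension lock is held.
     */
    buffer = ReadBufferExtended(relation, MAIN_FORKNUM, P_NEW,
                                RBM_ZERO_AND_LOCK, bistate->strategy);

    /* initialize the new page, also still under the extension lock */
    PageInit(BufferGetPage(buffer), BufferGetPageSize(buffer), 0);

    UnlockRelationForExtension(relation, ExclusiveLock);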

To quantify my concerns I instrumented postgres to measure the time
spent in the various operations that are part of extending a file (all
per process). The hardware is a pretty fast NVMe SSD in a 20 core / 40
thread machine; the tables are unlogged. The workload is copying a
scale 10 pgbench_accounts into an unindexed, unlogged table using
pgbench.
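
The instrumentation itself isn't included here, but the gist is a few
per-backend instr_time counters around the extension path, roughly like
this (counter names and the reporting via elog() are mine, not from an
actual patch; the filesystem and buffersearch times below are
accumulated the same way around smgrextend() and the victim buffer
search):

    #include "portability/instr_time.h"

    static instr_time total_lock_wait;
    static instr_time total_lock_held;

    /* around LockRelationForExtension() in the extension path: */
        instr_time  before_lock, after_lock, after_unlock;

        INSTR_TIME_SET_CURRENT(before_lock);
        LockRelationForExtension(relation, ExclusiveLock);
        INSTR_TIME_SET_CURRENT(after_lock);
        /* time spent waiting for the extension lock */
        INSTR_TIME_ACCUM_DIFF(total_lock_wait, after_lock, before_lock);

        /* ... victim buffer search, smgrextend(), PageInit() ... */

        UnlockRelationForExtension(relation, ExclusiveLock);
        INSTR_TIME_SET_CURRENT(after_unlock);
        /* time spent with the extension lock held */
        INSTR_TIME_ACCUM_DIFF(total_lock_held, after_unlock, after_lock);

    /* reported once per backend, e.g. at exit: */
        elog(LOG, "extension time: lock wait: %.2f lock held: %.2f",
             INSTR_TIME_GET_DOUBLE(total_lock_wait),
             INSTR_TIME_GET_DOUBLE(total_lock_held));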

Here are the per-process measurements for various client counts, over a
20s run:

1 client:
LOG: extension time: lock wait: 0.00 lock held: 3.19 filesystem: 1.29 buffersearch: 1.58

2 clients:
LOG: extension time: lock wait: 0.47 lock held: 2.99 filesystem: 1.24 buffersearch: 1.43
LOG: extension time: lock wait: 0.60 lock held: 3.05 filesystem: 1.23 buffersearch: 1.50

4 clients:
LOG: extension time: lock wait: 3.92 lock held: 2.69 filesystem: 1.10 buffersearch: 1.29
LOG: extension time: lock wait: 4.40 lock held: 2.02 filesystem: 0.81 buffersearch: 0.93
LOG: extension time: lock wait: 3.86 lock held: 2.59 filesystem: 1.06 buffersearch: 1.22
LOG: extension time: lock wait: 4.00 lock held: 2.65 filesystem: 1.08 buffersearch: 1.26

8 clients:
LOG: extension time: lock wait: 6.94 lock held: 1.74 filesystem: 0.70 buffersearch: 0.80
LOG: extension time: lock wait: 7.16 lock held: 1.81 filesystem: 0.73 buffersearch: 0.82
LOG: extension time: lock wait: 6.93 lock held: 1.95 filesystem: 0.80 buffersearch: 0.89
LOG: extension time: lock wait: 7.08 lock held: 1.87 filesystem: 0.76 buffersearch: 0.86
LOG: extension time: lock wait: 6.95 lock held: 1.95 filesystem: 0.80 buffersearch: 0.89
LOG: extension time: lock wait: 6.88 lock held: 2.01 filesystem: 0.83 buffersearch: 0.93
LOG: extension time: lock wait: 6.94 lock held: 2.02 filesystem: 0.82 buffersearch: 0.93
LOG: extension time: lock wait: 7.02 lock held: 1.95 filesystem: 0.80 buffersearch: 0.89

16 clients:
LOG: extension time: lock wait: 10.37 lock held: 0.88 filesystem: 0.36 buffersearch: 0.39
LOG: extension time: lock wait: 10.53 lock held: 0.90 filesystem: 0.37 buffersearch: 0.40
LOG: extension time: lock wait: 10.72 lock held: 1.01 filesystem: 0.42 buffersearch: 0.45
LOG: extension time: lock wait: 10.45 lock held: 1.25 filesystem: 0.52 buffersearch: 0.55
LOG: extension time: lock wait: 10.66 lock held: 0.94 filesystem: 0.38 buffersearch: 0.41
LOG: extension time: lock wait: 10.50 lock held: 1.27 filesystem: 0.53 buffersearch: 0.56
LOG: extension time: lock wait: 10.53 lock held: 1.19 filesystem: 0.49 buffersearch: 0.53
LOG: extension time: lock wait: 10.57 lock held: 1.22 filesystem: 0.50 buffersearch: 0.53
LOG: extension time: lock wait: 10.72 lock held: 1.17 filesystem: 0.48 buffersearch: 0.52
LOG: extension time: lock wait: 10.67 lock held: 1.32 filesystem: 0.55 buffersearch: 0.58
LOG: extension time: lock wait: 10.95 lock held: 0.92 filesystem: 0.38 buffersearch: 0.40
LOG: extension time: lock wait: 10.81 lock held: 1.24 filesystem: 0.51 buffersearch: 0.56
LOG: extension time: lock wait: 10.62 lock held: 1.27 filesystem: 0.53 buffersearch: 0.56
LOG: extension time: lock wait: 11.14 lock held: 0.94 filesystem: 0.38 buffersearch: 0.41
LOG: extension time: lock wait: 11.20 lock held: 0.96 filesystem: 0.39 buffersearch: 0.42
LOG: extension time: lock wait: 10.75 lock held: 1.41 filesystem: 0.58 buffersearch: 0.63
(The "lock held" times above sum to 17.89s.)
In *none* of these cases does the drive get even close to being
saturated. Not even to 1/3 of its capacity.

If you compare the total time spent with the lock held to the total
duration of the test, it quickly becomes obvious that we spend the
majority of the total time with the lock held:
client count 1: 3.18/20 = 0.16
client count 2: 6.04/20 = 0.30
client count 4: 9.95/20 = 0.50
client count 8: 15.30/20 = 0.76
client count 16: 17.89/20 = 0.89

In other words, the reason that relation extension scales terribly
isn't, to a significant degree, that the locking is slow. It's that we
hold the lock for the majority of the benchmark's runtime, starting at
just 4 clients. Focusing on making the locking faster is optimizing the
wrong thing; Amdahl's law restricts the possible benefit to a pretty
small amount.
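
(Back-of-the-envelope, treating the time with the lock held as the
serial fraction in Amdahl's terms: at 16 clients roughly 89% of the
wall clock time is spent with the lock held, so even a zero-cost lock
acquisition path couldn't speed this workload up by more than about
1/0.89 ~= 1.12x.)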

Looking at a CPU time profile (i.e. it will not include the time spent
waiting for a lock once the process is sleeping in the kernel) of the
time spent within RelationGetBufferForTuple():

- 19.16% 0.29% postgres postgres [.] RelationGetBufferForTuple
- 18.88% RelationGetBufferForTuple
- 13.18% ReadBufferExtended
- ReadBuffer_common
+ 5.02% mdextend
+ 4.77% FlushBuffer.part.0
+ 0.61% BufTableLookup
0.52% __memset_avx2_erms
+ 1.65% PageInit
- 1.18% LockRelationForExtension
- 1.16% LockAcquireExtended
- 1.07% WaitOnLock
- 1.01% ProcSleep
- 0.88% WaitLatchOrSocket
0.52% WaitEventSetWait
0.65% RecordAndGetPageWithFreeSpace

The same workload using an assert-enabled build, to get a profile
that's simpler to interpret:
- 13.28% 0.19% postgres postgres [.] RelationGetBufferForTuple
- 13.09% RelationGetBufferForTuple
- 8.35% RelationAddExtraBlocks
- 7.67% ReadBufferBI
- 7.54% ReadBufferExtended
- 7.52% ReadBuffer_common
- 3.64% BufferAlloc
+ 2.39% FlushBuffer
+ 0.27% BufTableLookup
+ 0.24% BufTableDelete
+ 0.15% LWLockAcquire
0.14% StrategyGetBuffer
+ 0.13% BufTableHashCode
- 2.96% smgrextend
+ mdextend
+ 0.52% __memset_avx2_erms
+ 0.14% smgrnblocks
0.11% __GI___clock_gettime (inlined)
+ 0.57% RecordPageWithFreeSpace
- 1.23% RecordAndGetPageWithFreeSpace
- 1.03% fsm_set_and_search
+ 0.50% fsm_readbuf
+ 0.20% LockBuffer
+ 0.18% UnlockReleaseBuffer
0.11% fsm_set_avail
0.19% fsm_search
- 0.86% ReadBufferBI
- 0.72% ReadBufferExtended
- ReadBuffer_common
- 0.58% BufferAlloc
+ 0.20% BufTableLookup
+ 0.10% LWLockAcquire
+ 0.81% PageInit
- 0.67% LockRelationForExtension
- 0.67% LockAcquire
- LockAcquireExtended
+ 0.60% WaitOnLock

Which, I think, pretty clearly shows a few things:

1) It's crucial to move acquiring a victim buffer to the outside of the
extension lock, as for COPY, acquiring the victim buffer will commonly
cause a buffer to have to be written out, due to the ring buffer. This
is even more crucial when using a logged table, as the writeout will
then often also trigger a WAL flush. (A rough sketch of this
reordering is included after point 3 below.)

While doing so will sometimes add a round of acquiring the buffer
mapping locks, having to do the FlushBuffer() while holding the
extension lock is a huge problem.

This'd also move a good bit of the cost of finding (i.e. clock sweep
/ ringbuffer replacement) and invalidating the old buffer mapping out
of the lock.

2) We need to make smgrwrite() more efficient; it is costing a lot of
time. A small additional experiment shows the cost of doing 8kb
writes:

I wrote a small program that just iteratively writes a 32GB file (a
reconstructed sketch of it is included after point 3 below):

pwrite using 8kb blocks:
0.24user 17.88system 0:18.16 elapsed 99%CPU

pwrite using 128kb blocks:
0.00user 16.71system 0:17.01 elapsed 98%CPU

pwrite using 256kb blocks:
0.00user 15.95system 0:16.03 elapsed 99%CPU

pwritev() using 16 8kb blocks to write 128kb at once:
0.02user 15.94system 0:16.09 elapsed 99%CPU

pwritev() using 32 8kb blocks to write 256kb at once:
0.01user 14.90system 0:14.93 elapsed 99%CPU

pwritev() using 128 8kb blocks to write 1MB at once:
0.00user 13.96system 0:13.96 elapsed 99%CPU

if I instead just use posix_fallocate() with 8kb blocks:
0.28user 23.49system 0:23.78elapsed 99%CPU (0avgtext+0avgdata 1212maxresident)k
0inputs+0outputs (0major+66minor)pagefaults 0swaps

if I instead just use posix_fallocate() with 32 8kb blocks:
0.01user 1.18system 0:01.19elapsed 99%CPU (0avgtext+0avgdata 1200maxresident)k
0inputs+0outputs (0major+67minor)pagefaults 0swaps

obviously fallocate doesn't quite have the same behaviour, and may incur
a bit higher overhead for a later write.

using a version that instead uses O_DIRECT + async IO, I get (but
only when also doing posix_fallocate in larger chunks):
0.05user 5.53system 0:12.53 elapsed 44%CPU

So we get considerably higher write throughput, at a considerably
lower CPU usage (because DMA replaces the CPU doing a memcpy()).

So it looks like extending the file with posix_fallocate() might be a
winner, but only if we actually can do so in larger chunks than 8kb
at once.

Alternatively it could be worthwhile to rejigger things so we don't
first extend the file with zeroes, just to then immediately overwrite
them with actual content. For some users it's probably possible to
pre-generate a page with contents when extending the file (that would
need fiddling with block numbers etc).

3) We should move the PageInit() that's currently done with the
extension lock held to the outside. Since we get the buffer with
RBM_ZERO_AND_LOCK these days, that should be safe. Also, we don't
need to zero the entire buffer both in RelationGetBufferForTuple()'s
PageInit() and in ReadBuffer_common() before calling smgrextend().
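
To make the reordering suggested in 1) and 3) a bit more concrete,
here's roughly the shape it could take (hand-wavy sketch only, not a
worked-out patch; GetVictimBufferForExtension() is a hypothetical
helper, and the buffer-mapping / FSM / error handling details are
glossed over). Writing the already initialized page also avoids
zeroing the block twice, along the lines of the "pre-generate a page
with contents" idea above:

    /*
     * Sketch: acquire and initialize the victim buffer before taking the
     * extension lock, so FlushBuffer() (and a possible WAL flush) and
     * PageInit() no longer happen with the lock held.  The extension lock
     * then only covers determining the new block number and the
     * smgrextend() itself.  'relation' and 'bistate' come from the caller.
     */
    Buffer      buffer;
    BlockNumber newblkno;

    /* hypothetical helper: clock sweep / ring buffer, FlushBuffer() if dirty */
    buffer = GetVictimBufferForExtension(relation, bistate);
    PageInit(BufferGetPage(buffer), BufferGetPageSize(buffer), 0);

    LockRelationForExtension(relation, ExclusiveLock);
    newblkno = smgrnblocks(relation->rd_smgr, MAIN_FORKNUM);
    smgrextend(relation->rd_smgr, MAIN_FORKNUM, newblkno,
               (char *) BufferGetPage(buffer), false);
    /* enter (relation, MAIN_FORKNUM, newblkno) into the buffer mapping here */
    UnlockRelationForExtension(relation, ExclusiveLock);

And since the test program from 2) isn't attached, here's roughly the
shape of the pwrite()/pwritev() loops behind the numbers above (a
reconstruction, not the original; file name, alignment and the lack of
error checking / O_DIRECT are mine):

    /*
     * Reconstructed sketch of the write test.  Error checking omitted;
     * in a timed run only one of the two loops would be executed.
     */
    #include <fcntl.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/uio.h>
    #include <unistd.h>

    #define BLCKSZ      8192
    #define FILE_SIZE   (32UL * 1024 * 1024 * 1024)

    int
    main(void)
    {
        char       *block = aligned_alloc(4096, BLCKSZ);
        int         fd = open("testfile", O_CREAT | O_TRUNC | O_WRONLY, 0600);
        off_t       off;

        memset(block, 0, BLCKSZ);

        /* variant 1: one 8kb pwrite() per block */
        for (off = 0; off < (off_t) FILE_SIZE; off += BLCKSZ)
            pwrite(fd, block, BLCKSZ, off);

        /* variant 2: pwritev() submitting 16 8kb blocks (128kb) at once */
        for (off = 0; off < (off_t) FILE_SIZE; off += 16 * BLCKSZ)
        {
            struct iovec iov[16];

            for (int i = 0; i < 16; i++)
            {
                iov[i].iov_base = block;
                iov[i].iov_len = BLCKSZ;
            }
            pwritev(fd, iov, 16, off);
        }

        close(fd);
        free(block);
        return 0;
    }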

Greetings,

Andres Freund

[1] https://www.postgresql.org/message-id/20200211042229.msv23badgqljrdg2%40alap3.anarazel.de
