Re: Add spin_delay() implementation for Arm in s_lock.h

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Blake, Geoff" <blakgeof(at)amazon(dot)com>
Cc: "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Add spin_delay() implementation for Arm in s_lock.h
Date: 2022-01-07 02:12:13
Message-ID: 1269414.1641521533@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

"Blake, Geoff" <blakgeof(at)amazon(dot)com> writes:
> Hope everything is well going into the new year. I'd like to pick this discussion back up and your thoughts on the patch with the data I posted 2 weeks prior. Is there more data that would be helpful? Different setup? Data on older versions of Postgresql to ascertain if it makes more sense on versions before the large re-work of the snapshot algorithm that exhibited quite a bit of synchronization contention?

I spent some time working on this. I don't have a lot of faith in
pgbench as a measurement testbed for spinlock contention, because over
the years we've done a good job of getting rid of that in our main
code paths (both the specific change you mention, and many others).
After casting around a bit and thinking about writing a bespoke test
framework, I landed on the idea of adding some intentional spinlock
contention to src/test/modules/test_shm_mq, which is a prefab test
framework for passing data among multiple worker processes. The
attached quick-hack patch makes it grab and release a spinlock once
per passed message. I'd initially expected that this would show only
marginal changes, because you'd hope that a spinlock acquisition would
be reasonably cheap compared to shm_mq_receive plus shm_mq_send.
Turns out not though.

The proposed test case is

(1) patch test_shm_mq as below

(2) time this query:

SELECT test_shm_mq_pipelined(16384, 'xyzzy', 10000000, n);

for various values of "n" up to about how many cores you have.
(You'll probably need to bump up max_worker_processes.)

For context, on my Intel-based main workstation (8-core Xeon W-2245),
the time to do this with stock test_shm_mq is fairly level.
Reporting best-of-3 runs:

regression=# SELECT test_shm_mq_pipelined(16384, 'xyzzy', 10000000, 1);
Time: 1386.413 ms (00:01.386)

regression=# SELECT test_shm_mq_pipelined(16384, 'xyzzy', 10000000, 4);
Time: 1302.503 ms (00:01.303)

regression=# SELECT test_shm_mq_pipelined(16384, 'xyzzy', 10000000, 8);
Time: 1373.121 ms (00:01.373)

However, after applying the contention patch:

regression=# SELECT test_shm_mq_pipelined(16384, 'xyzzy', 10000000, 1);
Time: 1346.362 ms (00:01.346)

regression=# SELECT test_shm_mq_pipelined(16384, 'xyzzy', 10000000, 4);
Time: 3313.490 ms (00:03.313)

regression=# SELECT test_shm_mq_pipelined(16384, 'xyzzy', 10000000, 8);
Time: 7660.329 ms (00:07.660)

So this seems like (a) it's a plausible model for code that has
unoptimized spinlock contention, and (b) the effects are large
enough that you needn't fret too much about measurement noise.

I tried this out on a handy Apple M1 mini, which I concede
is not big iron but it's pretty decent aarch64 hardware.
With current HEAD's spinlock code, I get (again best-of-3):

regression=# SELECT test_shm_mq_pipelined(16384, 'xyzzy', 10000000, 1);
Time: 1630.255 ms (00:01.630)

regression=# SELECT test_shm_mq_pipelined(16384, 'xyzzy', 10000000, 4);
Time: 3495.066 ms (00:03.495)

regression=# SELECT test_shm_mq_pipelined(16384, 'xyzzy', 10000000, 8);
Time: 19541.929 ms (00:19.542)

With your spin-delay patch:

regression=# SELECT test_shm_mq_pipelined(16384, 'xyzzy', 10000000, 1);
Time: 1643.524 ms (00:01.644)

regression=# SELECT test_shm_mq_pipelined(16384, 'xyzzy', 10000000, 4);
Time: 3404.625 ms (00:03.405)

regression=# SELECT test_shm_mq_pipelined(16384, 'xyzzy', 10000000, 8);
Time: 19260.721 ms (00:19.261)

So I don't see a lot of reason to think your patch changes anything.
Maybe on something with more cores?

For grins I also tried this same test with the use-CAS-for-TAS patch
that was being discussed November before last, and it didn't
really show up as any improvement either:

regression=# SELECT test_shm_mq_pipelined(16384, 'xyzzy', 10000000, 1);
Time: 1608.642 ms (00:01.609)

regression=# SELECT test_shm_mq_pipelined(16384, 'xyzzy', 10000000, 4);
Time: 3396.564 ms (00:03.397)

regression=# SELECT test_shm_mq_pipelined(16384, 'xyzzy', 10000000, 8);
Time: 20092.683 ms (00:20.093)

Maybe that's a little better in the uncontended (single-worker)
case, but it's worse at the high end.

I'm really curious to hear if this measurement method shows
any interesting improvements on your hardware.

regards, tom lane

Attachment Content-Type Size
test-spinlock-contention.patch text/x-diff 2.5 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message tanghy.fnst@fujitsu.com 2022-01-07 02:12:23 RE: Support tab completion for upper character inputs in psql
Previous Message Euler Taveira 2022-01-07 01:20:55 Re: Postgres restart in the middle of exclusive backup and the presence of backup_label file