Re: Possible performance regression in version 10.1 with pgbench read-write tests.

From: Andres Freund <andres(at)anarazel(dot)de>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Mithun Cy <mithun(dot)cy(at)enterprisedb(dot)com>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Possible performance regression in version 10.1 with pgbench read-write tests.
Date: 2018-07-20 19:50:42
Message-ID: 20180720195042.qf3ne7z3qyame5im@alap3.anarazel.de
Lists: pgsql-hackers

On 2018-07-20 15:35:39 -0400, Tom Lane wrote:
> Andres Freund <andres(at)anarazel(dot)de> writes:
> > On 2018-07-21 00:53:28 +0530, Mithun Cy wrote:
> >> I did a quick test applying the patch with same settings as initial mail I
> >> have reported (On postgresql 10 latest code)
> >> 72 clients
> >>
> >> CASE 1:
> >> Without Patch : TPS 29269.823540
> >>
> >> With Patch : TPS 36005.544960. --- 23% jump
> >>
> >> Just Disabling using unnamed POSIX semaphores: TPS 34481.207959
>
> >> So it seems that is the issue as the test is being run on 8 node numa
> >> machine.
>
> > Cool. I think we should just backpatch that then. Does anybody want to
> > argue against?
>
> Not entirely clear to me what change is being proposed here?

Adding padding to struct PGSemaphoreData, so multiple semas don't share
a cacheline.
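
Concretely, the shape of it is something like the below (just a sketch typed up here, not the actual patch; the union name is made up, and it assumes PG_CACHE_LINE_SIZE, 128 in pg_config_manual.h, is a good enough upper bound for the cache line size):

#include <semaphore.h>

#ifndef PG_CACHE_LINE_SIZE
#define PG_CACHE_LINE_SIZE 128    /* matches pg_config_manual.h */
#endif

/* Hypothetical padded variant of PGSemaphoreData (posix_sema.c): each
   semaphore occupies a full cache line, so a futex fast-path update of
   one semaphore can no longer invalidate its neighbors' lines. */
typedef union PaddedPGSemaphoreData
{
  sem_t pgsem;
  char  pad[PG_CACHE_LINE_SIZE];
} PaddedPGSemaphoreData;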

> In any case, I strongly resist making performance-based changes on
> the basis of one test on one kernel and one hardware platform.
> We should reproduce the results on a few different machines before
> we even think of committing anything. I'm happy to test on what
> I have, although I'd be the first to agree that what I'm checking
> is relatively low-end cases. (Too bad hydra's gone.)

Sure, it'd be good to do more of that. But from a theoretical POV it's
quite logical that POSIX semas sharing cachelines is bad for
performance if there's any contention. When backed by futexes -
i.e. all non-ancient Linux machines - the hot path just does a cmpxchg
of the *userspace* data (I've copied the relevant code below). Given
that we don't have a large number of semas these days, that there are
reasons to make the change even without measuring it, that we have
benchmark results, and that it's hard to see how it'd cause regressions,
I don't think going for a fix quickly is unreasonable.
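
To put rough numbers on it (my own toy example, assuming x86-64 glibc,
where sizeof(sem_t) is 32 bytes and a cache line is typically 64 bytes):

#include <semaphore.h>
#include <stdio.h>

int
main (void)
{
  /* Assumed typical x86-64 cache line size; not queried at runtime. */
  size_t cacheline = 64;

  printf ("sizeof(sem_t) = %zu, semaphores per cache line = %zu\n",
          sizeof (sem_t), cacheline / sizeof (sem_t));
  /* On x86-64 glibc this prints 32 and 2: two adjacent array entries
     share a line, so a cmpxchg on one bounces the line away from a
     backend operating on its neighbor. */
  return 0;
}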

You could argue it'd be better to make semaphores embeddable in
bigger structures like PGPROC, rather than allocated in an array. While
I suspect you'd get a bit of a performance benefit from that, it seems
like a far bigger change than we'd want to do in a minor release.
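
Roughly the contrast, as a toy (none of this is real PostgreSQL code,
just the shapes involved):

#include <semaphore.h>

/* Current shape, roughly: per-process state points into a dense
   shared array of semaphores, where neighbors can share cache lines. */
typedef struct ToyProcArrayStyle
{
  sem_t *sem;                 /* points into a shared sem_t array */
} ToyProcArrayStyle;

/* Alternative shape: the semaphore lives inside the larger per-process
   struct, so adjacent processes' semaphores end up far apart in memory
   by construction. */
typedef struct ToyProcEmbeddedStyle
{
  sem_t sem;                  /* embedded directly */
  char  other_state[256];     /* stand-in for the rest of PGPROC */
} ToyProcEmbeddedStyle;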

int
__new_sem_wait (sem_t *sem)
{
  /* We need to check whether we need to act upon a cancellation request here
     because POSIX specifies that cancellation points "shall occur" in
     sem_wait and sem_timedwait, which also means that they need to check
     this regardless whether they block or not (unlike "may occur"
     functions).  See the POSIX Rationale for this requirement: Section
     "Thread Cancellation Overview" [1] and austin group issue #1076 [2]
     for thoughts on why this may be a suboptimal design.

     [1] http://pubs.opengroup.org/onlinepubs/9699919799/xrat/V4_xsh_chap02.html
     [2] http://austingroupbugs.net/view.php?id=1076 */
  __pthread_testcancel ();

  if (__new_sem_wait_fast ((struct new_sem *) sem, 0) == 0)
    return 0;
  else
    return __new_sem_wait_slow ((struct new_sem *) sem, NULL);
}

/* Fast path: Try to grab a token without blocking.  */
static int
__new_sem_wait_fast (struct new_sem *sem, int definitive_result)
{
  /* We need acquire MO if we actually grab a token, so that this
     synchronizes with all token providers (i.e., the RMW operation we read
     from or all those before it in modification order; also see sem_post).
     We do not need to guarantee any ordering if we observed that there is
     no token (POSIX leaves it unspecified whether functions that fail
     synchronize memory); thus, relaxed MO is sufficient for the initial load
     and the failure path of the CAS.  If the weak CAS fails and we need a
     definitive result, retry.  */
#if __HAVE_64B_ATOMICS
  uint64_t d = atomic_load_relaxed (&sem->data);
  do
    {
      if ((d & SEM_VALUE_MASK) == 0)
        break;
      if (atomic_compare_exchange_weak_acquire (&sem->data, &d, d - 1))
        return 0;
    }
  while (definitive_result);
  return -1;
#else
  unsigned int v = atomic_load_relaxed (&sem->value);
  do
    {
      if ((v >> SEM_VALUE_SHIFT) == 0)
        break;
      if (atomic_compare_exchange_weak_acquire (&sem->value,
                                                &v, v - (1 << SEM_VALUE_SHIFT)))
        return 0;
    }
  while (definitive_result);
  return -1;
#endif
}

/* See sem_wait for an explanation of the algorithm.  */
int
__new_sem_post (sem_t *sem)
{
  struct new_sem *isem = (struct new_sem *) sem;
  int private = isem->private;

#if __HAVE_64B_ATOMICS
  /* Add a token to the semaphore.  We use release MO to make sure that a
     thread acquiring this token synchronizes with us and other threads that
     added tokens before (the release sequence includes atomic RMW operations
     by other threads).  */
  /* TODO Use atomic_fetch_add to make it scale better than a CAS loop?  */
  uint64_t d = atomic_load_relaxed (&isem->data);
  do
    {
      if ((d & SEM_VALUE_MASK) == SEM_VALUE_MAX)
        {
          __set_errno (EOVERFLOW);
          return -1;
        }
    }
  while (!atomic_compare_exchange_weak_release (&isem->data, &d, d + 1));

  /* If there is any potentially blocked waiter, wake one of them.  */
  if ((d >> SEM_NWAITERS_SHIFT) > 0)
    futex_wake (((unsigned int *) &isem->data) + SEM_VALUE_OFFSET, 1, private);
#else
  /* Add a token to the semaphore.  Similar to 64b version.  */
  unsigned int v = atomic_load_relaxed (&isem->value);
  do
    {
      if ((v >> SEM_VALUE_SHIFT) == SEM_VALUE_MAX)
        {
          __set_errno (EOVERFLOW);
          return -1;
        }
    }
  while (!atomic_compare_exchange_weak_release
         (&isem->value, &v, v + (1 << SEM_VALUE_SHIFT)));

  /* If there is any potentially blocked waiter, wake one of them.  */
  if ((v & SEM_NWAITERS_MASK) != 0)
    futex_wake (&isem->value, 1, private);
#endif

  return 0;
}

Greetings,

Andres Freund
