Re: [HACKERS] Moving relation extension locks out of heavyweight lock manager

From: Konstantin Knizhnik <k(dot)knizhnik(at)postgrespro(dot)ru>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: [HACKERS] Moving relation extension locks out of heavyweight lock manager
Date: 2018-06-05 09:47:40
Message-ID: c4b2d00a-c9f0-6eaf-bb1b-c05e0c8cbf0a@postgrespro.ru
Lists: pgsql-hackers

On 04.06.2018 21:42, Andres Freund wrote:
> Hi,
>
> On 2018-06-04 16:47:29 +0300, Konstantin Knizhnik wrote:
>> We at PostgresPro ran into the relation extension lock contention problem at
>> two more customers and tried to use this patch (v13) to address it.
>> Unfortunately, replacing the heavyweight lock with an lwlock did not completely
>> eliminate the contention; now most backends are blocked on a condition variable:
>>
>> 0x00007fb03a318903 in __epoll_wait_nocancel () from /lib64/libc.so.6
>> #0  0x00007fb03a318903 in __epoll_wait_nocancel () from /lib64/libc.so.6
>> #1  0x00000000007024ee in WaitEventSetWait ()
>> #2  0x0000000000718fa6 in ConditionVariableSleep ()
>> #3  0x000000000071954d in RelExtLockAcquire ()
> That doesn't necessarily mean that the postgres code is at fault
> here. It's entirely possible that the filesystem or storage is the
> bottleneck. Could you briefly describe workload & hardware?

The workload is a combination of inserts and selects.
It looks like the shared locks obtained by selects starve the inserts,
which are trying to get the exclusive relation extension lock.
The problem is fixed by the fair lwlock patch implemented by Alexander
Korotkov. This patch prevents granting a shared lock if the wait queue is
not empty.
Maybe we should use this patch, or find some other way to prevent
starvation of writers on relation extension locks for such workloads.
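
A toy model can illustrate the effect. The following is only a single-threaded sketch in Python under simplifying assumptions; it is not the lwlock.c code and not the fair lwlock patch itself, and the names ToyLock and writer_wait are hypothetical. It models a lock whose shared requests either ignore the wait queue (unfair) or are queued behind existing waiters (fair), and then feeds it a stream of readers whose hold times overlap.

```python
from collections import deque


class ToyLock:
    """Single-threaded model of grant decisions for a shared/exclusive lock."""

    def __init__(self, fair):
        self.fair = fair        # True: shared requests wait behind queued requests
        self.shared = 0         # number of shared holders
        self.exclusive = False  # whether an exclusive holder exists
        self.queue = deque()    # waiting requests, "S" or "X", in arrival order

    def acquire(self, mode):
        """Grant at once if allowed, otherwise enqueue. Returns True if granted."""
        if mode == "X":
            ok = not self.exclusive and self.shared == 0
        else:
            # The fair policy refuses a shared grant while anyone is waiting.
            ok = not self.exclusive and not (self.fair and self.queue)
        if ok:
            self._grant(mode)
        else:
            self.queue.append(mode)
        return ok

    def release(self, mode):
        if mode == "X":
            self.exclusive = False
        else:
            self.shared -= 1
        self._wake()

    def _grant(self, mode):
        if mode == "X":
            self.exclusive = True
        else:
            self.shared += 1

    def _wake(self):
        # Grant queued requests in FIFO order until the head cannot be granted.
        while self.queue:
            head = self.queue[0]
            if head == "X" and (self.exclusive or self.shared):
                break
            if head == "S" and self.exclusive:
                break
            self._grant(self.queue.popleft())
            if head == "X":
                break


def writer_wait(fair, max_steps=1000):
    """Return the step at which the waiting writer gets the lock, or None."""
    lock = ToyLock(fair)
    lock.acquire("S")      # one reader already holds the lock
    lock.acquire("X")      # the writer arrives and has to wait
    for step in range(1, max_steps + 1):
        lock.acquire("S")  # a new reader arrives...
        lock.release("S")  # ...before the oldest reader releases
        if lock.exclusive:
            return step
    return None
```

In this model writer_wait(fair=False) never sees the writer granted within the step limit, because each new reader is admitted while another still holds the lock. With writer_wait(fair=True) the new reader queues behind the writer, so the writer is granted as soon as the pre-existing reader releases.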

>
>
>> The second problem we observed was even more critical: if a backend is granted
>> the relation extension lock and then gets some error before releasing it,
>> then abort of the current transaction doesn't release this lock (unlike a
>> heavyweight lock) and the relation is kept locked.
>> So the database is actually stalled and the server has to be restarted.
> That obviously needs to be fixed...

Sorry, it looks like the problem is more obscure than I expected.
What we have observed is that all backends are blocked in an lwlock (sorry,
the stack trace is not complete):

#0 0x00007ff5a9c566d6 in futex_abstimed_wait_cancelable (private=128, abstime=0x0, expected=0, futex_word=0x7ff3c57b9b38) at ../sysdeps/unix/sysv/linux/futex-internal.h:205
#1 do_futex_wait (sem=sem@entry=0x7ff3c57b9b38, abstime=0x0) at sem_waitcommon.c:111
#2 0x00007ff5a9c567c8 in __new_sem_wait_slow (sem=sem@entry=0x7ff3c57b9b38, abstime=0x0) at sem_waitcommon.c:181
#3 0x00007ff5a9c56839 in __new_sem_wait (sem=sem@entry=0x7ff3c57b9b38) at sem_wait.c:42
#4 0x000056290c901582 in PGSemaphoreLock (sema=0x7ff3c57b9b38) at pg_sema.c:310
#5 0x000056290c97923c in LWLockAcquire (lock=0x7ff3c7038c64, mode=LW_SHARED) at ./build/../src/backend/storage/lmgr/lwlock.c:1233

It happens after an error in a disk write operation. Unfortunately, we do not have core files and are not able to reproduce the problem.
All LW locks should be released by LWLockReleaseAll, but ... for some reason that doesn't happen.
We will continue the investigation and try to reproduce the problem.
I will let you know if we find the cause of the problem.

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
