Postgres gets stuck in deadlock detection

From: Konstantin Knizhnik <k(dot)knizhnik(at)postgrespro(dot)ru>
To: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Postgres gets stuck in deadlock detection
Date: 2018-04-04 08:54:14
Message-ID: c9f840f4-b7fe-19c6-76e6-65c02a0c230c@postgrespro.ru
Lists: pgsql-hackers

Hi hackers,
Please note that this is not an April Fools' joke ;)

Several times, we and our customers have suffered from the problem of
Postgres getting stuck in deadlock detection.
One scenario is a YCSB workload with a Zipfian distribution, where many
clients try to update the same record and compete for its lock.
Another scenario is a large number of clients performing inserts into
the same table. In this case the source of the problem is the relation
extension lock.
In both cases the number of connections is large enough: several hundred.

So what happens? Due to the high contention, backends are not able to
obtain the requested lock within the deadlock detection timeout (1
second by default).
The wait is interrupted by the timeout and the backend tries to perform
deadlock detection. CheckDeadLock acquires exclusive locks on all lock
partitions... The avalanche of deadlock timeout expirations in the
backends and their contention for the exclusive partition locks cause
Postgres to get stuck.
Throughput falls almost to zero and it is not even possible to log in
to Postgres.

It is a well-known fact that Postgres does not scale well to such a
large number of connections, and that it is necessary to use pgbouncer
or some other connection pooler to limit the number of backends. But
modern systems have hundreds of CPU cores, and to utilize all these
resources we need hundreds of active backends.
So this is not an artificial problem, but a real showstopper which
occurs on real workloads.

There are several ways to solve this problem.
The first is trivial: increase the deadlock detection timeout. In the
YCSB case it helps. But in the case of many concurrent inserts, some
backends wait for a lock for several minutes, so there is no realistic
value of the deadlock detection timeout which can completely solve the
problem.
Also, significantly increasing the deadlock detection timeout may block
applications for an unacceptable amount of time when a real deadlock
occurs.
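For reference, the timeout in question is the deadlock_timeout GUC (default 1s, settable only by superusers), so the trivial mitigation looks like:

```sql
-- Reduces the timeout avalanche under contention, but delays
-- detection of a real deadlock by the same amount:
SET deadlock_timeout = '10s';
```

As argued above, no single value works for both goals at once.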

There is a patch in the commitfest proposed by Yury Sokolov:
https://commitfest.postgresql.org/18/1602/
It performs the deadlock check in two phases: first under shared locks
and then under exclusive locks.

I am proposing a much simpler patch (attached) which uses an atomic
flag to prevent concurrent deadlock detection by more than one backend.
The obvious drawback of this solution is that detection of unrelated
deadlock loops may take a longer time. But a deadlock is an abnormal
situation in any case, and I do not know of applications which consider
deadlocks normal behavior. Also, I have never seen a situation where
more than one independent deadlock happened at the same time (though it
is obviously possible).

So, I see three possible ways to fix this problem:
1. Yury Sokolov's patch with the two-phase deadlock check
2. Avoid concurrent deadlock detection
3. Avoid concurrent deadlock detection + let CheckDeadLock detect all
deadlocks, not only those in which the current transaction is involved.

I would like to know the opinion of the community concerning these
approaches (or maybe there are some other solutions).

Thanks in advance,

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Attachment Content-Type Size
deadlock.patch text/x-patch 2.4 KB
