Re: Postgres stucks in deadlock detection

From: Konstantin Knizhnik <k(dot)knizhnik(at)postgrespro(dot)ru>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Postgres stucks in deadlock detection
Date: 2018-04-13 16:13:07
Message-ID: 2e13a34b-0fb4-72f6-0b1b-a4458b5ca33b@postgrespro.ru
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 13.04.2018 18:41, Andres Freund wrote:
> Hi,
>
> On 2018-04-13 16:43:09 +0300, Konstantin Knizhnik wrote:
>> Updated patch is attached.
>> + /*
>> + * Ensure that only one backend is checking for deadlock.
>> + * Otherwise under high load cascade of deadlock timeout expirations can cause stuck of Postgres.
>> + */
>> + if (!pg_atomic_test_set_flag(&ProcGlobal->activeDeadlockCheck))
>> + {
>> + enable_timeout_after(DEADLOCK_TIMEOUT, DeadlockTimeout);
>> + return;
>> + }
>> + inside_deadlock_check = true;
> I can't see that ever being accepted. This means there's absolutely no
> bound for deadlock checks happening even under light concurrency, even
> if there's no contention for a large fraction of the time.

It may cause problems only if
1. There is large number of active sessions
2. They perform deadlock-prone queries (so no attempts to avoid
deadlocks at application level)
3. Deadlock timeout is set to be very small (10 msec?)

Otherwise either probability that all backends  once and once again are
trying to check deadlocks concurrently is very small (and can be even
more reduced by using random timeout for subsequent deadlock checks),
either system can not normally function in any case because large number
of clients fall into deadlock.

>
> If you want to improve this, improve the efficiency of the
> implementation, check multiple lockers at the same time (set a bit
> afterwards that they don't recheck themselves). There's plenty
> approaches that don't result in a significantly worse user experience.

It seems to me that the best best solution will be to perform deadlock
by some dedicated process (deadlock checker) with specified frequency,
or do it in backends themselves but ensure that interval between
deadlock checks exceeds deadlock_timeout.
On both cases it is necessary to check for all deadlock loops, not only
deadlocks involving self transaction.
And definitely we should be able to abort "foreign" transaction, not
only transaction performed by this backend.
I do not know how difficult it will be to implement. I expect that it
will be not so easy, otherwise why deadlock detection was not
implemented in this way in Postgres from the very beginning...

I completely agree that there are plenty of different approaches, but
IMHO the currently used strategy is the worst one, because it can stall
system even if there are not deadlocks at all.
I always think that deadlock is a programmer's error rather than normal
situation. May be it is wrong assumption but I can not believe that some
system can normally work when deadlocks happen quite frequently. 
Consider classical example: transfer money from one account to another:

start transaction;
update accounts set balance=balance+? where aid=?;
update accounts set balance=balance-? where aid=?;
commit transactions;

With thousands accounts and random choice of account IDs concurrent
execution of this transaction by multiple clients (~100) cause deadlocks
in less than few percents of transactions.
My patch doesn't somehow affect performance and latency in this case.
Obviously that if number of accounts is changed to 10 and number of
clients to 1000, then deadlocks will happen permanently. But in this
case system will not be able to work normally either with my patch,
either without it.

So before implementing some complicated solution of the problem9too slow
deadlock detection), I think that first it is necessary to understand
whether there is such problem at al and under which workload it can happen.

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Amit Langote 2018-04-13 16:25:00 Re: crash with sql language partition support function
Previous Message Nikolay Samokhvalov 2018-04-13 16:07:03 Re: Built-in connection pooling