Re: Two pass CheckDeadlock in contentent case

From: Sokolov Yura <y(dot)sokolov(at)postgrespro(dot)ru>
To: pgsql-hackers(at)postgresql(dot)org
Cc: pgsql-hackers-owner(at)postgresql(dot)org
Subject: Re: Two pass CheckDeadlock in contentent case
Date: 2017-10-03 15:06:30
Message-ID: 77de6c23263a06093ebb4a590c8bc4ae@postgrespro.ru
Views: Raw Message | Whole Thread | Download mbox
Thread:
Lists: pgsql-hackers

On 2017-10-03 17:30, Sokolov Yura wrote:
> Good day, hackers.
>
> During hard workload sometimes process reaches deadlock timeout
> even if no real deadlock occurred. It is easily reproducible with
> pg_xact_advisory_lock on same value + some time consuming
> operation (update) and many clients.
>
> When backend reaches deadlock timeout, it calls CheckDeadlock,
> which locks all partitions of lock hash in exclusive mode to
> walk through graph and search for deadlock.
>
> If hundreds of backends reaches this timeout trying to acquire
> advisory lock on a same value, it leads to hard-stuck for many
> seconds, cause they all traverse same huge lock graph under
> exclusive lock.
> During this stuck there is no possibility to do any meaningful
> operations (no new transaction can begin).
>
> Attached patch makes CheckDeadlock to do two passes:
> - first pass uses LW_SHARED on partitions of lock hash.
> DeadLockCheck is called with flag "readonly", so it doesn't
> modify anything.
> - If there is possibility of "soft" or "hard" deadlock detected,
> ie if there is need to modify lock graph, then partitions
> relocked with LW_EXCLUSIVE, and DeadLockCheck is called again.
>
> It fixes hard-stuck, cause backends walk lock graph under shared
> lock, and found that there is no real deadlock.
>
> Test on 4 socket xeon machine:
> pgbench_zipf -s 10 -c 800 -j 100 -M prepared -T 450 -f
> ./ycsb_read_zipf(dot)sql(at)50 -f ./ycsb_update_lock2_zipf(dot)sql(at)50 -P 5
>
> ycsb_read_zipf.sql:
> \set i random_zipfian(1, 100000 * :scale, 1.01)
> SELECT abalance FROM pgbench_accounts WHERE aid = :i;
> ycsb_update_lock2_zipf.sql:
> \set i random_zipfian(1, 100000 * :scale, 1.01)
> select lock_and_update( :i );
>
> CREATE OR REPLACE FUNCTION public.lock_and_update(i integer)
> RETURNS void
> LANGUAGE sql
> AS $function$
> SELECT pg_advisory_xact_lock( $1 );
> UPDATE pgbench_accounts SET abalance = 1 WHERE aid = $1;
> $function$
>
> Without attached patch:
>
> progress: 5.0 s, 45707.0 tps, lat 15.599 ms stddev 83.757
> progress: 10.0 s, 51124.4 tps, lat 15.681 ms stddev 78.353
> progress: 15.0 s, 52293.8 tps, lat 15.327 ms stddev 77.017
> progress: 20.0 s, 51280.4 tps, lat 15.603 ms stddev 78.199
> progress: 25.0 s, 47278.6 tps, lat 16.795 ms stddev 83.570
> progress: 30.0 s, 41792.9 tps, lat 18.535 ms stddev 93.697
> progress: 35.0 s, 12393.7 tps, lat 33.757 ms stddev 169.116
> progress: 40.0 s, 0.0 tps, lat -nan ms stddev -nan
> progress: 45.0 s, 0.0 tps, lat -nan ms stddev -nan
> progress: 50.0 s, 1.2 tps, lat 2497.734 ms stddev 5393.166
> progress: 55.0 s, 0.0 tps, lat -nan ms stddev -nan
> progress: 60.0 s, 27357.9 tps, lat 160.622 ms stddev 1866.625
> progress: 65.0 s, 38770.8 tps, lat 20.829 ms stddev 104.453
> progress: 70.0 s, 40553.2 tps, lat 19.809 ms stddev 99.741
>
> (autovacuum led to trigger deadlock timeout,
> and therefore, to stuck)
>
> Patched:
>
> progress: 5.0 s, 56264.7 tps, lat 12.847 ms stddev 66.980
> progress: 10.0 s, 55389.3 tps, lat 14.329 ms stddev 71.997
> progress: 15.0 s, 50757.7 tps, lat 15.730 ms stddev 78.222
> progress: 20.0 s, 50797.3 tps, lat 15.736 ms stddev 79.296
> progress: 25.0 s, 48485.3 tps, lat 16.432 ms stddev 82.720
> progress: 30.0 s, 45202.1 tps, lat 17.733 ms stddev 88.554
> progress: 35.0 s, 40035.8 tps, lat 19.343 ms stddev 97.787
> progress: 40.0 s, 14240.1 tps, lat 47.625 ms stddev 265.465
> progress: 45.0 s, 30150.6 tps, lat 31.140 ms stddev 196.114
> progress: 50.0 s, 38975.0 tps, lat 20.543 ms stddev 104.281
> progress: 55.0 s, 40573.9 tps, lat 19.770 ms stddev 99.487
> progress: 60.0 s, 38361.1 tps, lat 20.693 ms stddev 103.141
> progress: 65.0 s, 39793.3 tps, lat 20.216 ms stddev 101.080
> progress: 70.0 s, 38387.9 tps, lat 20.886 ms stddev 104.482
>
> (autovacuum led to trigger deadlock timeout,
> but postgresql did not stuck)
>
> I believe, patch will positively affect other heavy workloads
> as well, although I have not collect benchmarks.
>
> `make check-world` passes with configured `--enable-tap-tests
> --enable-casserts`

Excuse me, corrected version is in attach.

--
Sokolov Yura
Postgres Professional: https://postgrespro.ru
The Russian Postgres Company

Attachment Content-Type Size
shared_deadlock_check.diff text/x-diff 2.7 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2017-10-03 15:20:35 Re: Postgresql gives error that role goes not exists while it exists
Previous Message Robert Haas 2017-10-03 15:06:08 Re: SendRowDescriptionMessage() is slow for queries with a lot of columns