Excessive CPU usage in StandbyReleaseLocks()

From: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
To: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Excessive CPU usage in StandbyReleaseLocks()
Date: 2018-06-19 05:43:42
Message-ID: CAEepm=1mL0KiQ2KJ4yuPpLGX94a4Ns_W6TL4EGRouxWibu56pA@mail.gmail.com
Lists: pgsql-hackers

Hello hackers,

Andres Freund diagnosed a case of $SUBJECT in a customer's 9.6 system.
I've written a minimal reproducer and a prototype patch to address the
root cause.

The problem is that StandbyReleaseLocks() does a linear search of all
known AccessExclusiveLocks when a transaction ends. Luckily, since
v10 (commit 9b013dc2) that is skipped for transactions that haven't
taken any AELs and aren't using 2PC, but that doesn't help all users.

It's fine if the AEL list is short, but if you do something that takes
a lot of AELs, such as restoring a database with many tables or
truncating a lot of partitions while other transactions are in flight,
then we start doing O(txrate * nlocks * nsubxacts) work and that can
hurt.
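
To make the cost concrete, here is a toy Python model of the behaviour
described above. It is not the PostgreSQL code: the function name and the
(xid, lock_id) tuple layout are made up for illustration, and it ignores
subtransactions, so it only shows the txrate * nlocks part of the bound.

```python
# Toy model (not PostgreSQL source): one flat list of (xid, lock_id) entries
# that is scanned in full whenever any transaction ends.

def release_locks_flat(lock_list, xid):
    """Drop every entry owned by xid; return (remaining entries, entries examined)."""
    examined = 0
    remaining = []
    for owner, lock_id in lock_list:
        examined += 1                  # every entry is inspected, owned or not
        if owner != xid:
            remaining.append((owner, lock_id))
    return remaining, examined


# One long-lived transaction (xid 1) holds 6,000 locks, as in the reproducer.
locks = [(1, n) for n in range(6000)]

total_examined = 0
for xid in range(2, 1002):             # 1,000 short transactions holding no locks
    locks, examined = release_locks_flat(locks, xid)
    total_examined += examined

print(total_examined)                  # 6,000 entries per commit -> 6000000
```

Each short transaction owns nothing in the list, yet still pays for a full
scan of the long-lived transaction's entries.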

The reproducer script I've attached creates one long-lived transaction
that acquires 6,000 AELs and takes a nap, while 48 connections run
trivial 2PC transactions (I was also able to reproduce the effect
without 2PC by creating a throw-away temporary table in every
transaction, but it was unreliable due to contention slowing
everything down). For me, the standby's startup process is pegged at
100% CPU, replay_lag begins to climb and perf says something like:

+ 97.88% 96.96% postgres postgres [.] StandbyReleaseLocks

The attached patch splits the AEL list into one list per xid and
sticks them in a hash table. That makes perf say something like:

+ 0.60% 0.00% postgres postgres [.] StandbyReleaseLocks
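
The idea of one lock list per xid, kept in a hash table, can be sketched
in Python with a dict. This is only an illustration under assumptions:
the class and method names are hypothetical and do not come from the
patch, and a Python dict stands in for whatever hash table the patch
actually uses.

```python
# Toy model (not the attached patch): locks grouped per xid in a dict, so
# releasing a transaction only touches that transaction's own entries.

class RecoveryLockTable:
    """Hypothetical stand-in for a hash table keyed by xid."""

    def __init__(self):
        self._by_xid = {}              # xid -> list of lock ids
        self.examined = 0              # lock entries touched by release()

    def acquire(self, xid, lock_id):
        self._by_xid.setdefault(xid, []).append(lock_id)

    def release(self, xid):
        """Remove and return the locks held by xid (empty list if none)."""
        held = self._by_xid.pop(xid, [])
        self.examined += len(held)
        return held

    def held_count(self):
        return sum(len(v) for v in self._by_xid.values())


table = RecoveryLockTable()
for n in range(6000):                  # long-lived transaction, xid 1
    table.acquire(1, n)

for xid in range(2, 1002):             # 1,000 short transactions holding no locks
    table.release(xid)

print(table.examined)                  # 0: no unrelated entries were scanned
print(table.held_count())              # 6000: xid 1 still holds all its locks
```

In this model the cost of ending a transaction depends only on the number
of locks that transaction holds, not on the total number of known locks.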

This seems like something we'd want to back-patch because the problem
affects all branches (the older releases more severely because they
lack the above-mentioned optimisation).

Thoughts?

--
Thomas Munro
http://www.enterprisedb.com

Attachment                                          Content-Type              Size
repro.py                                            text/x-python-script      1.1 KB
0001-Move-RecoveryLockList-into-a-hash-table.patch  application/octet-stream  9.1 KB
