pg_stat_statements locking

From: Andrey Borodin <x4mmm(at)yandex-team(dot)ru>
To: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: pg_stat_statements locking
Date: 2022-09-12 06:52:28
Message-ID: 1AEEB240-9B68-44D5-8A29-8F9FDB22C801@yandex-team.ru

Hi hackers!

Recently I observed a very peculiar incident.

=== Incident description ===

An ETL database had been operating fine for many months and was regularly updated. The workload was not changing much, but since it is an ETL database, most of the queries are different all the time.
On the night of September 7th the database got stuck: no monitoring query could be executed. The DBAs started to deal with the incident, but there is not much you can do with a database service when you cannot execute a single query. According to VM metrics, the VM was writing a lot to disk.
The on-call engineer was summoned. He observed a lot of backends stuck with a similar backtrace:
#5 LWLockAcquire()
#6 pgss_store()
#7 pgss_post_parse_analyze()
#8 parse_analyze()
#9 pg_analyze_and_rewrite()

After a restart the problem reproduced within 50 minutes. This time monitoring queries were working, which showed that all backends were stuck on the LWLock pg_stat_statements. It was impossible to disable pgss with SQL, so the engineer altered postgresql.auto.conf and restarted the database. This resolved the incident.

Later I worked on analyzing the incident. Enabling pgss again showed traces of the problem:
Fri 09 Sep 2022 08:52:31 AM MSK (every 2s)

     usename     |        state        |     wait_event     | cnt
-----------------+---------------------+--------------------+-----
 content         | active              | DataFileRead       |   1
 content         | active              | pg_stat_statements |  42
 content         | idle in transaction | ClientRead         |   2
 pgwatch_monitor | active              | BufFileRead        |   1
 pgwatch_monitor | active              | [null]             |   5
 pgwatch_monitor | active              | pg_stat_statements |  85
 postgres        | active              | [null]             |   1
 repl            | active              | WalSenderMain      |   2
 [null]          | active              | [null]             |   2
 [null]          | active              | VacuumDelay        |   7
(10 rows)

pgwatch was querying 60 databases every minute, and each call to pg_stat_statements() took approximately 3-4 seconds.
The backend holding the lock everyone was waiting on looked like this in pg_stat_activity:

datid            | 127782
pid              | 4077900
usename          | pgwatch_monitor
application_name | pgwatch2 - 10.96.17.68
wait_event_type  | IO
wait_event       | BufFileWrite
state            | active
backend_xid      | [null]
backend_xmin     | 67048029

The contents of the pg_stat_statements view overran work_mem and were materialized in a tuplestore on disk. This is what caused a lot of disk writes on a database that was not accepting any user queries.
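
For reference, the reading side looks roughly like this. This is a simplified sketch of pg_stat_statements_internal(), not the actual code (the wrapper name and elided parts are mine, and details differ between versions), but it shows that the tuplestore is filled, and may spill to disk, entirely under the shared lock:

/*
 * Simplified sketch of the reading side (pg_stat_statements_internal());
 * not verbatim, details differ across PostgreSQL versions.
 */
static void
pgss_read_sketch(Tuplestorestate *tupstore, TupleDesc tupdesc)
{
	HASH_SEQ_STATUS hash_seq;
	pgssEntry  *entry;

	LWLockAcquire(pgss->lock, LW_SHARED);

	hash_seq_init(&hash_seq, pgss_hash);
	while ((entry = hash_seq_search(&hash_seq)) != NULL)
	{
		Datum	values[PG_STAT_STATEMENTS_COLS];
		bool	nulls[PG_STAT_STATEMENTS_COLS];

		/* ... fill values[]/nulls[] from *entry and the query text file ... */

		/* bounded by work_mem; beyond that this writes to a temp file */
		tuplestore_putvalues(tupstore, tupdesc, values, nulls);
	}

	LWLockRelease(pgss->lock);	/* released only after every row is materialized */
}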

</Incident description>

TL;DR: the LWLock "pg_stat_statements" blocked all SQL queries.

=== How the problem develops ===
Prerequisite 1. pgwatch is querying pgss often.
Prerequisite 2. pgss becomes big enough that the tuplestore is written to disk while the shared lock is held.
Prerequisite 3. Someone calls reset() or pgss_store() needing an exclusive lock.

Consequence. The exclusive lock request queues behind the long-held shared lock and prevents any further shared locks from being taken.
Result. Any query calling pgss hooks hangs.

=== Reproduction for development purposes ===
0. Set up a database with pg_stat_statements.track = all.
1. Modify pg_stat_statements_internal() to wait for a long time under LWLockAcquire(pgss->lock, LW_SHARED).
2. select * from pg_stat_statements()
3. select pg_stat_statements_reset()

Now the database is bonked. Any query will hang until pg_stat_statements() finishes.
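
For step 1, an extra sleep in pg_stat_statements_internal() is enough to widen the window; something like this (reproduction aid only, the sleep duration is arbitrary):

	LWLockAcquire(pgss->lock, LW_SHARED);
	pg_usleep(60 * 1000000L);	/* hold LW_SHARED for 60 seconds */

Then run step 2 in one session, step 3 in another, and any query from a third session hangs in the pgss hooks.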

=== How to fix ===
pgss uses an LWLock to protect its hashtable.
When the hashtable is reset or a new query is inserted in pgss_store(), an exclusive lock is taken.
When stats are updated for a known query or pg_stat_statements is read, a shared lock is taken.
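
In code terms the store path is roughly this (a simplified sketch of pgss_store(), not verbatim; the wrapper name is made up and details differ between versions):

static void
pgss_store_sketch(pgssHashKey key)
{
	pgssEntry  *entry;

	/* Look up the entry for a (hopefully) known query under the shared lock. */
	LWLockAcquire(pgss->lock, LW_SHARED);
	entry = (pgssEntry *) hash_search(pgss_hash, &key, HASH_FIND, NULL);

	if (!entry)
	{
		/* New query: trade the shared lock for an exclusive one to insert it. */
		LWLockRelease(pgss->lock);
		LWLockAcquire(pgss->lock, LW_EXCLUSIVE);
		/* ... entry_alloc() creates the entry, evicting old ones if needed ... */
	}

	/* ... the counters themselves are updated under entry->mutex (a spinlock) ... */

	LWLockRelease(pgss->lock);
}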

I propose to swap the shared lock for stats updates to a conditional shared lock.
It cannot be taken during reset(), and that's totally fine. It cannot be taken while a new query is being entered, and I think it's OK to lose some stats in that case. PFA the patch.
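
The attached patch is the authoritative version; in miniature the idea for the stats-update path is just this (hypothetical fragment, reusing the existing pgss->lock):

	/*
	 * Take the shared lock only if it can be granted immediately;
	 * otherwise skip this sample instead of making the query wait.
	 */
	if (!LWLockConditionalAcquire(pgss->lock, LW_SHARED))
		return;				/* drop these stats rather than hang */

	/* ... find the entry and update its counters exactly as before ... */

	LWLockRelease(pgss->lock);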

This patch prevents the disaster caused by the locking described here.

=== Other possible mitigation ===
The incident would not have occurred if the tuplestore had not spilled to its on-disk representation. But I don't see how to reliably protect against that.

What do you think?

Thanks!

Best regards, Andrey Borodin.

Attachment Content-Type Size
v1-0001-Demonstrate-and-fix-lock-of-all-SQL-queries-by-pg.patch application/octet-stream 2.3 KB
