Re: Fwd: "SELECT COUNT(*) FROM" still causing issues (deadlock) in PostgreSQL 14.3/4?

From: Marco Boeringa <marco(at)boeringa(dot)demon(dot)nl>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Bruce Momjian <bruce(at)momjian(dot)us>, pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: Fwd: "SELECT COUNT(*) FROM" still causing issues (deadlock) in PostgreSQL 14.3/4?
Date: 2022-07-24 07:55:29
Message-ID: 2dbccee7-9616-997d-cf19-fa23e29e0497@boeringa.demon.nl
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

Hi Tom,

Thanks for the response, but please review the exact conditions I
already mentioned in my previous mails:

- When this issue happens, there is absolutely no other activity going
on than the three active sessions I mentioned: the autovacuum worker
with no wait event, and the two "SELECT COUNT(*) FROM <table>" related
sessions with both a wait event, see the original post. There are no
other active sessions doing any kind of editing work, no INSERTS,
UPDATES, DELETES or whatever in this point of the processing that could
generate WAL.

- Note that this is a custom written geoprocessing workflow with just
one user on the database, not a public database with hundreds of users
emitting whatever unknown queries against the database, so I know
exactly at what point in my processing flow it fails and what goes on then.

- The database affected itself is just a few dozen GBs. While I
appreciate, if I understand PostgreSQL and the concept of WAL good
enough (I don't consider myself a PostgreSQL expert), that WAL might
potentially exceed the size of the database when heavy editing is going,
890 GB of WAL being written seems like an anomaly given in the context
of the first points.

- This problem only first reared its head after the issues starting in
PG14.2 related to SELECT COUNT(*)

So, does your suggested option (b) still make sense in this context?

If not, and we assume this is a bug needing reporting, what exact
information will you guys need to pinpoint the issue besides the
information already given? What is the best course of action? I have
never before reported a bug for PostgreSQL, so I am slightly at loss as
to what exact information you will need. E.g., besides your suggestion
of activating 'log_checkpoints', what other suggestions for specific
logging?

I fully appreciate the main answer will be to submit the typical
"smallest reproducible case", but that will be extremely hard in this
particular case, as the geoprocessing workflow processing OpenStreetMap
data goes through a whole chain of largely auto-generated SQL statements
(based on settings in the input tool), that are nearly impossible to
share. Although it is also again questionable if it is actually
relevant, as the point where it fails only has the mentioned sessions
and single SELECT COUNT(*) SQL statement going on. The issues is
intermittent as well, so there wouldn't be guarantees it would reproduce
on the first try, even if I could share it.

I also appreciate I might need to hire an expert for some remote
debugging, but before going that way, I appreciate some more insights.

Marco

Op 23-7-2022 om 17:33 schreef Tom Lane:
> Marco Boeringa <marco(at)boeringa(dot)demon(dot)nl> writes:
>> Ok, I found it out using 'sudo baobab'. It is the:
>> 'var/lib/postgresql/14/main/pg_wal'
>> folder that is filled up with 890 GB of data... causing the file system
>> root to run out of space and Ubuntu opening the disk usage analyzer and
>> a warning as a consequence.
> The most likely explanations for this are
> (a) misconfiguration of WAL archiving, so that the server thinks
> it should keep WAL files till they've been archived, only that
> never happens.
> (b) inability to complete checkpoints for some reason, preventing
> WAL files from being recycled.
>
> It doesn't look like you have wal_archiving on, so (a) *should*
> be ruled out, but you never know. If there are a ton of "nnn.ready"
> files underneath pg_wal then trouble here would be indicated.
>
> As for (b), you might try enabling log_checkpoints and seeing if
> the log messages give any clue.
>
> regards, tom lane

In response to

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message Ajin Cherian 2022-07-24 08:16:42 Re: Excessive number of replication slots for 12->14 logical replication
Previous Message Zsolt Ero 2022-07-23 16:53:53 Re: could not link file in wal restore lines