Re: PG12 autovac issues

From: Julien Rouhaud <rjuju123(at)gmail(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Justin King <kingpin867(at)gmail(dot)com>, pgsql-general(at)lists(dot)postgresql(dot)org, michael(at)paquier(dot)xyz, kgrittn(at)gmail(dot)com
Subject: Re: PG12 autovac issues
Date: 2020-03-23 15:22:47
Message-ID: 20200323152247.GB52612@nol
Lists: pgsql-admin pgsql-general

On Fri, Mar 20, 2020 at 12:03:17PM -0700, Andres Freund wrote:
> Hi,
>
> On 2020-03-20 12:42:31 -0500, Justin King wrote:
> > When we get into this state again, is there some other information
> > (other than what is in pg_stat_statement or pg_stat_activity) that
> > would be useful for folks here to help understand what is going on?
>
> If it's actually stuck on a single table, and that table is not large,
> it would be useful to get a backtrace with gdb.

FTR, we're facing a very similar issue at work (adding Michael and Kevin in Cc)
during performance tests since a recent upgrade to pg12.

What seems to be happening is that after reaching 200M transactions, a first
pass of autovacuum freeze is run, bumping pg_database.datfrozenxid by ~800k
(age(datfrozenxid) still being more than autovacuum_freeze_max_age afterwards).
After that point, all available information seems to indicate that no
autovacuum worker is scheduled anymore:

- log_autovacuum_min_duration is set to 0 and no activity is logged (while
having thousands of those per hour before that)
- 15 min interval snapshots of pg_database and pg_class show that the age of
datfrozenxid/relfrozenxid keeps increasing at a consistent rate and never
goes down
- 15 min interval snapshot of pg_stat_activity doesn't show any autovacuum
worker
- the autovacuum launcher is up and running and doesn't show any sign of
problem
- n_mod_since_analyze keeps growing at a consistent rate, never going down
- 15 min delta of tup_updated and tup_deleted shows that the global write
activity doesn't change before and after the autovacuum problem
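
For anyone wanting to collect the same kind of evidence, the checks above could
be approximated roughly as follows. This is only a sketch: the query text, the
Snapshot type and the autovacuum_looks_stalled() helper are illustrative
assumptions, not the tooling used for these tests. The catalogs and views
(pg_database, pg_class, pg_stat_activity, pg_stat_database) are the standard
ones, and 200M is the default autovacuum_freeze_max_age.

```python
# Hypothetical sketch; not the scripts actually used in this report.
from dataclasses import dataclass

# Queries one might run every 15 minutes. The exact text is an assumption.
SNAPSHOT_QUERIES = {
    "db_age": "SELECT datname, age(datfrozenxid) FROM pg_database",
    "rel_age": (
        "SELECT oid::regclass, age(relfrozenxid) FROM pg_class "
        "WHERE relkind IN ('r', 't', 'm')"
    ),
    "workers": (
        "SELECT count(*) FROM pg_stat_activity "
        "WHERE backend_type = 'autovacuum worker'"
    ),
    "writes": "SELECT datname, tup_updated, tup_deleted FROM pg_stat_database",
}


@dataclass
class Snapshot:
    """One periodic sample for a single database (hypothetical structure)."""
    db_age: int              # age(datfrozenxid)
    autovacuum_workers: int  # autovacuum workers seen in pg_stat_activity


def autovacuum_looks_stalled(snapshots, freeze_max_age=200_000_000):
    """Return True if every sample is past freeze_max_age, no sample shows a
    worker, and the age only ever grows between consecutive samples."""
    if len(snapshots) < 2:
        return False  # a single sample cannot show a trend
    over_limit = all(s.db_age > freeze_max_age for s in snapshots)
    no_workers = all(s.autovacuum_workers == 0 for s in snapshots)
    only_growing = all(
        later.db_age > earlier.db_age
        for earlier, later in zip(snapshots, snapshots[1:])
    )
    return over_limit and no_workers and only_growing
```

The helper only encodes the pattern described in the list (age past the limit,
no worker visible, age never decreasing); it says nothing about the cause.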

The situation continues for ~2h, at which point the bloat is so heavy that the
main filesystem becomes full, and postgres panics after a failed write in the
pg_logical directory or similar.

The same bench was run against pg11 many times and never triggered this issue.
So far our best guess is a side effect of 2aa6e331ead7.

Michael and I have been trying to reproduce this issue locally (drastically
reducing the various freeze_age parameters) for hours, but no luck for now.

This is using a vanilla pg 12.1, with some OLTP workload. The only possibly
relevant configuration changes are quite aggressive autovacuum settings on some
tables (no delay, analyze/vacuum threshold to 1k and analyze/vacuum scale
factor to 0, for both heap and toast).
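
To make the per-table settings concrete, here is a rough sketch of the kind of
statement they correspond to. The helper name and the table name are
hypothetical, and mapping "no delay" to autovacuum_vacuum_cost_delay = 0 is an
assumption; the storage parameter names themselves are the standard ones. Note
that the analyze parameters have no toast.* variant, so only the vacuum ones
are repeated for toast.

```python
# Hypothetical helper; builds an ALTER TABLE statement, does not execute it.
HEAP_SETTINGS = {
    "autovacuum_vacuum_cost_delay": 0,      # assumed meaning of "no delay"
    "autovacuum_vacuum_threshold": 1000,
    "autovacuum_vacuum_scale_factor": 0,
    "autovacuum_analyze_threshold": 1000,
    "autovacuum_analyze_scale_factor": 0,
}


def aggressive_autovacuum_sql(table):
    """Return ALTER TABLE text applying the settings to heap and toast.

    `table` is interpolated as-is, so it must be a trusted identifier.
    """
    params = dict(HEAP_SETTINGS)
    for name, value in HEAP_SETTINGS.items():
        # Analyze storage parameters have no toast.* counterpart.
        if "analyze" not in name:
            params["toast." + name] = value
    body = ", ".join(f"{name} = {value}" for name, value in params.items())
    return f"ALTER TABLE {table} SET ({body});"


if __name__ == "__main__":
    print(aggressive_autovacuum_sql("some_table"))
```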
