Re: pgsql: Improve autovacuum logging for aggressive and anti-wraparound ru

From: Michael Paquier <michael(at)paquier(dot)xyz>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Andrew Dunstan <andrew(dot)dunstan(at)2ndquadrant(dot)com>, Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>, sawada(dot)mshk(at)gmail(dot)com, tgl(at)sss(dot)pgh(dot)pa(dot)us, alvherre(at)2ndquadrant(dot)com, sk(at)zsrv(dot)org, nasbyj(at)amazon(dot)com, robertmhaas(at)gmail(dot)com, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: pgsql: Improve autovacuum logging for aggressive and anti-wraparound ru
Date: 2020-03-24 04:40:16
Message-ID: 20200324044016.GD1836@paquier.xyz
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-committers pgsql-hackers

On Mon, Mar 23, 2020 at 06:41:50PM -0700, Andres Freund wrote:
> Which valid scenario can lead to this? Neither the comment, nor commit
> message explain it.

The commit message mentions that concurrent autovacuum jobs can lead
to the creation of non-aggressive and anti-wraparound jobs, which have
no sense because an aggressive and anti-wraparound job was already
done in parallel with a different worker, and that this was possible
because of inconsistent relcache lookups across concurrent jobs. This
was mentioned upthread.

> Unless you're thinking of scenarios where autovacuum
> and manual vacuum are mixed, I don't really see valid reasons? Normally
> autovacuum's locking + the table_recheck_autovac() check should prevent
> problematic scenarios.
>
> I do see a few scenarios that can trigger this - but they all more or
> less are bugs.

Hmm. OK.

> It doesn't strike me as a good idea to work around such bugs by silently
> neutering heap_vacuum_rel(). The likelihood of that temporarily covering
> up more severe problems seems significant - they're likely to then later
> bite you with a cluster shutdown.

Saying that, I have been thinking about this one for a couple of days
now and it seems to me that this is a factor contributing to what we
are seeing in [1], and I agree that this is just an incorrect approach
that makes easier to trigger the real underlying issues, while
table_recheck_autovac() ought to be the only code path doing the skip
job. Note that I have failed to reproduce the behavior of the other
thread though, always finishing with a non-aggressive anti-wraparound
skipped because of an aggressive and anti-wraparound job happened just
before in parallel, and autovacuum was always able to continue
triggering new jobs, keeping the relfrozenxid age at bay.

So I would like to first revert that part, to have a cleaner state to
work on the underlying issues. A pure revert means also adding back
the log message for non-aggressive and anti-wraparound jobs that
should never exist, which should be replaced by an assertion once all
the holes are fixed. What do you think?

[1]: https://www.postgresql.org/message-id/CAE39h23RTX1jkYjWc5tccv34HwwraizaCUxOmdQdPM+Zt5-2Qg@mail.gmail.com
--
Michael

In response to

Browse pgsql-committers by date

  From Date Subject
Next Message Andres Freund 2020-03-24 05:59:37 pgsql: Report wait event for cost-based vacuum delay.
Previous Message Fujii Masao 2020-03-24 03:50:04 pgsql: Prefer standby promotion over recovery pause.

Browse pgsql-hackers by date

  From Date Subject
Next Message Amit Kapila 2020-03-24 04:52:49 Re: error context for vacuum to include block number
Previous Message Paul Jungwirth 2020-03-24 04:23:31 Re: range_agg