From: | Michael Paquier <michael(at)paquier(dot)xyz> |
---|---|
To: | Andres Freund <andres(at)anarazel(dot)de> |
Cc: | Andrew Dunstan <andrew(dot)dunstan(at)2ndquadrant(dot)com>, Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>, sawada(dot)mshk(at)gmail(dot)com, tgl(at)sss(dot)pgh(dot)pa(dot)us, alvherre(at)2ndquadrant(dot)com, sk(at)zsrv(dot)org, nasbyj(at)amazon(dot)com, robertmhaas(at)gmail(dot)com, pgsql-hackers(at)lists(dot)postgresql(dot)org |
Subject: | Re: pgsql: Improve autovacuum logging for aggressive and anti-wraparound ru |
Date: | 2020-03-24 04:40:16 |
Message-ID: | 20200324044016.GD1836@paquier.xyz |
Lists: | pgsql-committers pgsql-hackers |
On Mon, Mar 23, 2020 at 06:41:50PM -0700, Andres Freund wrote:
> Which valid scenario can lead to this? Neither the comment, nor commit
> message explain it.
The commit message mentions that concurrent autovacuum jobs can lead
to the creation of non-aggressive and anti-wraparound jobs, which make
no sense because an aggressive and anti-wraparound job was already
done in parallel by a different worker, and that this was possible
because of inconsistent relcache lookups across concurrent jobs. This
was mentioned upthread.
> Unless you're thinking of scenarios where autovacuum
> and manual vacuum are mixed, I don't really see valid reasons? Normally
> autovacuum's locking + the table_recheck_autovac() check should prevent
> problematic scenarios.
>
> I do see a few scenarios that can trigger this - but they all more or
> less are bugs.
Hmm. OK.
> It doesn't strike me as a good idea to work around such bugs by silently
> neutering heap_vacuum_rel(). The likelihood of that temporarily covering
> up more severe problems seems significant - they're likely to then later
> bite you with a cluster shutdown.
That said, I have been thinking about this one for a couple of days
now and it seems to me that this is a factor contributing to what we
are seeing in [1], and I agree that this is just an incorrect approach
that makes it easier to trigger the real underlying issues, while
table_recheck_autovac() ought to be the only code path doing the skip
job. Note that I have failed to reproduce the behavior of the other
thread though: I always ended up with a non-aggressive anti-wraparound
job skipped because an aggressive and anti-wraparound job had happened
just before in parallel, and autovacuum was always able to continue
triggering new jobs, keeping the relfrozenxid age at bay.
So I would like to first revert that part, to have a cleaner state to
work on the underlying issues. A pure revert also means adding back
the log message for non-aggressive and anti-wraparound jobs that
should never exist, which should be replaced by an assertion once all
the holes are fixed. What do you think?
[1]: https://www.postgresql.org/message-id/CAE39h23RTX1jkYjWc5tccv34HwwraizaCUxOmdQdPM+Zt5-2Qg@mail.gmail.com
--
Michael