Re: Berserk Autovacuum (let's save next Mandrill)

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: David Rowley <dgrowleyml(at)gmail(dot)com>
Cc: Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com>, Laurenz Albe <laurenz(dot)albe(at)cybertec(dot)at>, Andres Freund <andres(at)anarazel(dot)de>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>, Justin Pryzby <pryzby(at)telsasoft(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Masahiko Sawada <masahiko(dot)sawada(at)2ndquadrant(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, Darafei Komяpa Praliaskouski <me(at)komzpa(dot)net>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Michael Banck <mbanck(at)gmx(dot)net>
Subject: Re: Berserk Autovacuum (let's save next Mandrill)
Date: 2020-04-02 14:44:53
Message-ID: 24467.1585838693@sss.pgh.pa.us
Lists: pgsql-hackers

David Rowley <dgrowleyml(at)gmail(dot)com> writes:
> On Thu, 2 Apr 2020 at 16:13, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> Quite :-(. While it's too early to declare victory, we've seen no
>> more failures of this ilk since 0936d1b6f, so it's sure looking like
>> autovacuum did have something to do with it.

> How about [1]? It seems related to me and also post 0936d1b6f.

That looks much like the first lousyjack failure, which as I said
I wasn't trying to account for at that point.

After looking at those failures, though, I believe that the root cause
may be the same, i.e., small changes in pg_class.reltuples due to
autovacuum not seeing all pages of the tables. The test structure
is a bit different, but it is accessing the tables in between EXPLAIN
attempts, so it could be preventing a concurrent autovacuum from seeing
all pages.

I see your fix at cefb82d49, but it feels a bit brute-force. Unlike
stats_ext.sql, we're not (supposed to be) dependent on exact planner
estimates in this test. So I think the real problem here is crappy test
case design. Namely, that these various sub-tables are exactly the
same size, despite which the test is expecting that the planner will
order them consistently --- with a planning algorithm that prefers
to put larger tables first in parallel appends (cf. create_append_path).
It's not surprising that the result is unstable in the face of small
variations in the rowcount estimates.

I'd be inclined to undo what you did in favor of initializing the
test tables to contain significantly different numbers of rows,
because that would (a) achieve plan stability more directly,
and (b) demonstrate that the planner is actually ordering the
tables by cost correctly. Maybe somewhere else we have a test
that is verifying (b), but these test cases abysmally fail to
check that point.
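
To make the instability concrete, here is a toy model in Python. It is
only a sketch under stated assumptions: it is not PostgreSQL code, it
uses the estimated row count as a stand-in for path cost, and the names
(append_order, perturb, part_a and friends, the drift of 7 rows) are
all hypothetical. It shows that a "largest first" ordering over
equal-sized tables flips when one estimate drifts slightly, while
well-separated sizes keep the same order.

```python
def append_order(estimates):
    """Order table names largest-estimate-first.

    Stand-in for a planner that puts bigger children first; ties keep
    their input order because Python's sort is stable.
    """
    ranked = sorted(estimates.items(), key=lambda item: -item[1])
    return [name for name, _ in ranked]


def perturb(estimates, deltas):
    """Return a copy of the estimates with small per-table adjustments."""
    return {name: rows + deltas.get(name, 0) for name, rows in estimates.items()}


# Hypothetical row estimates: identical sizes versus clearly distinct sizes.
equal = {"part_a": 1000, "part_b": 1000, "part_c": 1000}
distinct = {"part_a": 3000, "part_b": 2000, "part_c": 1000}

# Hypothetical small drift in one table's estimate, of the kind a partial
# autovacuum pass might cause.
drift = {"part_b": 7}

equal_before = append_order(equal)
equal_after = append_order(perturb(equal, drift))
distinct_before = append_order(distinct)
distinct_after = append_order(perturb(distinct, drift))

print("equal sizes:   ", equal_before, "->", equal_after)
print("distinct sizes:", distinct_before, "->", distinct_after)
```

With equal sizes the drifted table jumps to the front; with distinct
sizes the same drift is too small to change the order, which is the
property the differently sized test tables would rely on.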

I'm not really on board with disabling autovacuum in the regression
tests anywhere we aren't absolutely forced to do so. It's not
representative of real world practice (or at least not real world
best practice ;-)) and it could help hide actual bugs. We don't seem
to have much choice with the stats_ext tests as they are constituted,
but those tests look really fragile to me. Let's not adopt that
technique where we have other possible ways to stabilize test results.

regards, tom lane
