Re: Stabilizing the test_decoding checks, take N

From: Andres Freund <andres(at)anarazel(dot)de>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: Stabilizing the test_decoding checks, take N
Date: 2022-04-17 14:31:04
Message-ID: 20220417143104.eftldj7d73rqtqoc@alap3.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

On 2022-04-16 13:11:59 -0400, Tom Lane wrote:
> My pet dinosaur prairiedog just failed in the contrib/test_decoding
> tests [1]:
>
> diff -U3 /Users/buildfarm/bf-data/HEAD/pgsql.build/contrib/test_decoding/expected/stream.out /Users/buildfarm/bf-data/HEAD/pgsql.build/contrib/test_decoding/results/stream.out
> --- /Users/buildfarm/bf-data/HEAD/pgsql.build/contrib/test_decoding/expected/stream.out 2022-04-15 07:59:17.000000000 -0400
> +++ /Users/buildfarm/bf-data/HEAD/pgsql.build/contrib/test_decoding/results/stream.out 2022-04-15 09:06:36.000000000 -0400
> @@ -77,10 +77,12 @@
> streaming change for transaction
> streaming change for transaction
> streaming change for transaction
> + closing a streamed block for transaction
> + opening a streamed block for transaction
> streaming change for transaction
> closing a streamed block for transaction
> committing streamed transaction
> -(13 rows)
> +(15 rows)
>
> Looking at the postmaster log, it's obvious where this extra transaction
> came from: auto-analyze ran on pg_type concurrently with the test step
> just before this one. That could only happen if the tests ran long enough
> for autovacuum_naptime to elapse, but prairiedog is a pretty slow machine.
> (And I hasten to point out that some other animals, such as those running
> valgrind or CLOBBER_CACHE_ALWAYS, are even slower.)
>
> We've seen this sort of problem before [2], and attempted to fix it [3]
> by making these tests ignore empty transactions. But of course
> auto-analyze's transaction wasn't empty, so that didn't help.

I don't quite understand this bit - the logic test_decoding uses to
decide if a transaction is "empty" is just whether a tuple was
output. And there shouldn't be any as part of auto-analyze, because we
don't decode catalog changes. I suspect there's something broken in the
streaming logic (potentially just in test_decoding) around
skip_empty_xacts.

> I think the most expedient way to prevent this type of failure is to run
> the test_decoding tests with autovacuum_naptime cranked up so far as to
> make it a non-issue, like maybe a day. Since test_decoding already adds
> some custom settings to postgresql.conf, this'll take just a one-line
> addition to test_decoding/logical.conf.

I'm a bit worried about this approach - we've IIRC had past bugs that
came only to light because of autovacuum starting. I wonder if we rather
should do the opposite and reduce naptime so it'll be seen on fast
machines, rather than very slow ones.

Greetings,

Andres Freund

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Andres Freund 2022-04-17 14:32:36 Re: Stabilizing the test_decoding checks, take N
Previous Message Andres Freund 2022-04-17 13:31:12 Re: TRAP: FailedAssertion("HaveRegisteredOrActiveSnapshot()", File: "toast_internals.c", Line: 670, PID: 19403)