Re: Fixing WAL instability in various TAP tests

From: Mark Dilger <mark(dot)dilger(at)enterprisedb(dot)com>
To: Noah Misch <noah(at)leadboat(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Fixing WAL instability in various TAP tests
Date: 2021-09-25 14:12:08
Message-ID: 2E9D28A3-1B54-446B-AC12-3BF86A2B522A@enterprisedb.com
Lists: pgsql-hackers

> On Sep 24, 2021, at 10:21 PM, Noah Misch <noah(at)leadboat(dot)com> wrote:
>
>> I would
>> expect tests which fail under legal alternate GUC settings to be hardened to
>> explicitly set the GUCs as they need, rather than implicitly relying on the
>> defaults.
>
> That is not the general practice in PostgreSQL tests today. The buildfarm
> exercises some settings, so we keep the tests clean for those. Coping with
> max_wal_size=2 that way sounds reasonable. I'm undecided about the value of
> hardening tests against all possible settings.

Leaving the tests brittle wastes developer time.

I ran into this problem when I changed the storage underlying bloom indexes and ran the contrib/bloom/t/001_wal.pl test with wal_consistency_checking=all. The test failed with errors about missing WAL files, and it took time to backtrack and see that it fails under this setting even without my storage layer changes. Ordinarily, errors about missing WAL files would have led me to suspect the TAP test sooner, but since I had been changing storage and WAL code, it initially seemed plausible that my changes were the problem. The real problem is that the test does not use a replication slot.

The failure in src/test/recovery/t/015_promotion_pages.pl has the same cause: the test should use a replication slot but does not.
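
For illustration, here is a minimal sketch of what using a slot could look like in a TAP test. It assumes the PostgresNode and TestLib modules from src/test/perl, so it only runs inside the source tree with TAP tests enabled. The node names, slot name, and backup name are hypothetical, and this is not a proposed patch for either test:

```perl
# Sketch only: not taken from 001_wal.pl or 015_promotion_pages.pl.
use strict;
use warnings;
use PostgresNode;
use TestLib;
use Test::More tests => 1;

my $slot = 'standby_slot';    # hypothetical slot name

my $node_primary = get_new_node('primary');
$node_primary->init(allows_streaming => 1);
$node_primary->start;

# Passing true for immediately_reserve makes the slot hold WAL from the
# moment it is created, rather than from the standby's first connection.
$node_primary->safe_psql('postgres',
	"SELECT pg_create_physical_replication_slot('$slot', true);");

my $backup_name = 'my_backup';    # hypothetical backup name
$node_primary->backup($backup_name);

my $node_standby = get_new_node('standby');
$node_standby->init_from_backup($node_primary, $backup_name,
	has_streaming => 1);
$node_standby->append_conf('postgresql.conf',
	"primary_slot_name = '$slot'");
$node_standby->start;

# Once the standby has caught up, it is streaming through the slot.
$node_primary->wait_for_catchup($node_standby, 'replay',
	$node_primary->lsn('insert'));
is( $node_primary->safe_psql(
		'postgres',
		"SELECT active FROM pg_replication_slots WHERE slot_name = '$slot';"),
	't',
	'standby is streaming through the replication slot');
```

With the slot in place, the primary keeps the WAL the standby still needs regardless of how aggressively other settings would otherwise recycle it.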

The failure in src/bin/pg_basebackup/t/010_pg_basebackup.pl stems from not heeding the documented requirement for pg_basebackup -X fetch that wal_keep_size "be set high enough that the required log data is not removed before the end of the backup". The test simply assumes the setting is high enough, because that tends to be true under default GUC settings. I think this can be fixed by setting wal_keep_size=<SOMETHING_BIG_ENOUGH>, but (a) you say this is not the general practice in PostgreSQL tests today, and (b) there doesn't seem to be any principled way to decide what value would be big enough. Sure, we can use something that is big enough in practice, and we'll probably have to go with that, but it feels like we're just papering over the problem.
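
To make the "big enough in practice" option concrete, the test could pin the setting on its primary node with a fragment along these lines. The value shown is an arbitrary assumption chosen only for illustration, which is exactly the weakness described in (b):

```
# postgresql.conf fragment for the test's primary node (sketch only).
# 1GB is an arbitrary illustrative value, not a derived bound.
wal_keep_size = '1GB'
```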

I'm inclined to guess that the problem in src/bin/pg_basebackup/t/020_pg_receivewal.pl is similar.


Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
