Fixing WAL instability in various TAP tests

From: Mark Dilger <mark(dot)dilger(at)enterprisedb(dot)com>
To: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Fixing WAL instability in various TAP tests
Date: 2021-09-25 00:33:13
Message-ID: 32A1FDD1-9C7B-43B1-B3EE-49198DD3F887@enterprisedb.com
Lists: pgsql-hackers

Hackers,

A few TAP tests in the project appear to be sensitive to reductions in PostgresNode's max_wal_size setting, resulting in tests failing because WAL files have been removed too soon. The failures in the logs are typically of the "requested WAL segment %s has already been removed" variety. I would expect tests which fail under legal alternate GUC settings to be hardened to explicitly set the GUCs they need, rather than implicitly relying on the defaults. As far as missing WAL files go, I would expect the TAP test to prevent this with the use of replication slots or some other mechanism, and not simply to rely on checkpoints not happening too soon. I'm curious whether others on this list disagree with that point of view.

Failures in src/test/recovery/t/015_promotion_pages.pl can be fixed by creating a physical replication slot on node "alpha" and using it from node "beta", a technique already used in other TAP tests and apparently merely overlooked in this one.
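
A minimal sketch of that slot-based setup, written as configuration rather than as the test's Perl code. The slot name `beta_slot` is a hypothetical name chosen for illustration, and the exact way the test would issue these steps is an assumption, not something shown in this message.

```
# On node "alpha" (the primary), create the slot once, via SQL:
#   SELECT pg_create_physical_replication_slot('beta_slot');

# In postgresql.conf on node "beta" (the standby), stream through that
# slot so that alpha retains WAL that beta has not yet received:
primary_slot_name = 'beta_slot'
```

With the slot in place, a checkpoint on alpha should no longer recycle segments that beta still needs, regardless of how small max_wal_size is.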

The first two tests in src/bin/pg_basebackup/t fail, and it's not clear that physical replication slots are the appropriate solution, since no replication is happening. It's not immediately obvious that the tests are at fault anyway. On casual inspection, it seems they might be detecting a live bug which simply doesn't manifest under larger values of max_wal_size. Test 010 appears to show a bug with `pg_basebackup -X`, and test 020 with `pg_receivewal`.

The test in contrib/bloom/t/ is deliberately disabled in contrib/bloom/Makefile with a comment that the test is unstable in the buildfarm, but when I chased down the email thread that gave rise to the related commit, I didn't find anything to explain what exactly those buildfarm failures might have been. That test happens to be stable on my laptop until I change GUC settings both to reduce max_wal_size to 32MB and to set wal_consistency_checking=all.
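
For reference, the two settings named above can be written as a configuration fragment like the one below. How the fragment reaches the test nodes is an assumption here: one plausible route is a file named in the TEMP_CONFIG environment variable, and the file path shown is purely hypothetical. The message itself does not say how the settings were applied.

```
# Hypothetical file, e.g. /tmp/small_wal.conf
# Shrink the WAL budget so that checkpoints recycle segments sooner:
max_wal_size = 32MB
# Verify WAL replay against full-page images for all resource managers:
wal_consistency_checking = all
```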

Thoughts?


Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
