Re: Add basic tests for the low-level backup method.

From: David Steele <david(at)pgmasters(dot)net>
To: Michael Paquier <michael(at)paquier(dot)xyz>
Cc: pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: Add basic tests for the low-level backup method.
Date: 2024-03-14 20:40:38
Message-ID: 1c85f906-85d4-4479-9791-062989fec352@pgmasters.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 3/14/24 20:00, Michael Paquier wrote:
> On Thu, Mar 14, 2024 at 09:12:52AM +1300, David Steele wrote:
>> I think you are right that the start message is better since it can only
>> appear once when the backup_label is found. The completed message could in
>> theory appear after a restart, though the backup_label must have been found
>> at some point.
>
> So, I've given a try to this patch with 99b4a63bef94, to note that
> sidewinder failed because of a timing issue on Windows: the recovery
> of the node without backup_label, expected to fail, would try to
> backup the last segment it has replayed, because, as it has no
> backup_label, it behaves like the primary. It would try to use the
> same archive location as the primary, leading to a conflict failure on
> Windows. This one was easy to fix, by overwritting postgresql.conf on
> the node to not do archiving.

Hmmm, I wonder why this did not show up in the Windows tests on CI?

> Following that, I've noticed a second race condition: we don't wait
> for the segment after pg_switch_wal() to be archived. This one can be
> easily avoided with a poll on pg_stat_archiver.

Ugh, yeah, good change.

> After that, also because I've initially managed to, cough, forget an
> update of meson.build to list the new test, I've noticed a third
> failure on Windows for the case of the node that has a backup_label.
> Here is one of the failures:
> https://cirrus-ci.com/task/5245341683941376

Is the missing test in meson the reason we did not see test failures for
Windows in CI?

> regress_log_042_low_level_backup and
> 042_low_level_backup_replica_success.log have all the information
> needed, that can be summarized like that:
> The system cannot find the file specified.
> 2024-03-14 06:02:37.670 GMT [560][startup] FATAL: invalid data in file "backup_label"
>
> The first message is something new to me, that seems to point to a
> corruption failure of the file system. Why don't we see something
> similar in other tests, then? Leaving that aside..
>
> The second LOG is something that can be acted on. I've added some
> debugging to the parsing of the backup_label file in the backend, and
> noticed that the first fscanf() for START WAL LOCATION is failing
> because the last %c is detected as \r rather than \n. Tweaking the
> contents stored from pg_backend_stop() with a sed won't help, because
> the issue is that we write the CRLFs with append_to_file, and the
> startup process cannot cope with that. The simplest method I can
> think of is to use binmode, as of the attached.

Yeah, that makes sense.

> It is the first time that we'd take the contents received from a
> BackgroundPsql and write them to a file parsed by the backend, so
> perhaps we should try to do that in a more general way, but I'm not
> sure how, tbh, and the case of this test is special while adding
> handling for \r when reading the backup_label got discussed in the
> past but we were OK with what we are doing now on HEAD.

I think it makes sense to leave the parsing code as is and make the
change in the test. If we add more tests to this module we'll probably
need a function to avoid repeating code.

> On top of all that, note that I have removed remove_tree as I am not
> sure if this would be OK in all the buildfarm animals, added a quit()
> for BackgroundPsql, moved queries to use less BackgroundPsql, as well
> as a few other things like avoiding the hardcoded segment names.
> meson.build is.. Cough.. Updated now.

OK.

> I am attaching an updated patch with all that fixed, which is stable
> in the CI and any tests I've run. Do you have any comments about

These changes look good to me. Sure wish we had an easier to way to test
commits in the build farm.

Regards,
-David

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Jeff Davis 2024-03-14 20:42:20 Re: Built-in CTYPE provider
Previous Message Robert Haas 2024-03-14 20:38:35 Re: Possibility to disable `ALTER SYSTEM`