Re: Race condition in recovery?

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Dilip Kumar <dilipbalaut(at)gmail(dot)com>
Cc: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, hlinnaka <hlinnaka(at)iki(dot)fi>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Race condition in recovery?
Date: 2021-05-26 16:10:06
Message-ID: CA+TgmoatTeD++2p_9-9N=YxFxrHYCvFOV7gtMk8BQr9N-dqrKQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, May 26, 2021 at 2:44 AM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
> I think we need to create some content on promoted standby and check
> whether the cascade standby is able to get that or not, that will
> guarantee that it is actually following the promoted standby, I have
> added the test for that so that it matches the comments.

OK, but I ran this test against an unpatched server and it passed.
That's not so great, because the test should fail without the bug fix.
It seems to be because there's only one actual test in this test case.
Looking at the log file,
src/test/recovery/tmp_check/log/regress_log_025_timeline_issue, the
only "ok" nor "not ok" line is:

ok 1 - check streamed content on cascade standby

So either that test is not right or some other test is needed. I think
there's something else going wrong here, because when I run my
original test.sh script, I see this:

2021-05-26 11:37:47.794 EDT [57961] LOG: restored log file
"00000002.history" from archive
...
2021-05-26 11:37:47.916 EDT [57961] LOG: redo starts at 0/2000028
...
2021-05-26 11:37:47.927 EDT [57966] LOG: replication terminated by
primary server
2021-05-26 11:37:47.927 EDT [57966] DETAIL: End of WAL reached on
timeline 1 at 0/3000000

But in the src/test/recovery/tmp_check/log/025_timeline_issue_cascade.log
file generated by this test case:

cp: /Users/rhaas/pgsql/src/test/recovery/tmp_check/t_025_timeline_issue_primary_data/archives/00000002.history:
No such file or directory
...
2021-05-26 11:41:08.149 EDT [63347] LOG: fetching timeline history
file for timeline 2 from primary server
...
2021-05-26 11:41:08.288 EDT [63344] LOG: new target timeline is 2
...
2021-05-26 11:41:08.303 EDT [63344] LOG: redo starts at 0/2000028
...
2021-05-26 11:41:08.331 EDT [63347] LOG: restarted WAL streaming at
0/3000000 on timeline 2

So it doesn't seem like the test is actually reproducing the problem
correctly. The timeline history file isn't available from the archive,
so it streams it, and then the problem doesn't occur. I guess that's
because there's nothing to guarantee that the history file reaches the
archive before 'cascade' is started. The code just does:

# Promote the standby.
$node_standby->psql('postgres', 'SELECT pg_promote()');

# Start cascade node
$node_cascade->start;

...which has a clear race condition.
src/test/recovery/t/023_pitr_prepared_xact.pl has logic to wait for a
WAL file to be archived, so maybe we can steal that logic and use it
here.

I suggest we rename the test to something a bit more descriptive. Like
instead of 025_timeline_issue.pl, perhaps
025_stuck_on_old_timeline.pl? Or I'm open to other suggestions, but
"timeline issue" is a bit too vague for my taste.

--
Robert Haas
EDB: http://www.enterprisedb.com

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Dilip Kumar 2021-05-26 16:26:24 Re: Race condition in recovery?
Previous Message Tom Lane 2021-05-26 15:37:08 Re: Replacing pg_depend PIN entries with a fixed range check