Re: Race condition in recovery?

From: Dilip Kumar <dilipbalaut(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, hlinnaka <hlinnaka(at)iki(dot)fi>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Race condition in recovery?
Date: 2021-05-26 16:26:24
Message-ID: CAFiTN-vcTODEFuyktzvURgSRkULzfkNP6k=zHv4w96-s-5AX5A@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, May 26, 2021 at 9:40 PM Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>
> On Wed, May 26, 2021 at 2:44 AM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
> > I think we need to create some content on promoted standby and check
> > whether the cascade standby is able to get that or not, that will
> > guarantee that it is actually following the promoted standby, I have
> > added the test for that so that it matches the comments.
>
> OK, but I ran this test against an unpatched server and it passed.
> That's not so great, because the test should fail without the bug fix.
> It seems to be because there's only one actual test in this test case.
> Looking at the log file,
> src/test/recovery/tmp_check/log/regress_log_025_timeline_issue, the
> only "ok" nor "not ok" line is:
>
> ok 1 - check streamed content on cascade standby
>
> So either that test is not right or some other test is needed. I think
> there's something else going wrong here, because when I run my
> original test.sh script, I see this:

Thats strange, when I ran the test I can see below in log of cascade
node (which shows that cascade get the history file but not the WAL
file and then it select the old timeline and never go to the new
timeline)

...
2021-05-26 21:46:54.412 IST [84080] LOG: restored log file
"00000002.history" from archive
cp: cannot stat
‘/home/dilipkumar/work/PG/postgresql/src/test/recovery/tmp_check/t_025_timeline_issue_primary_data/archives/00000003.history’:
No such file or directory
2021-05-26 21:46:54.415 IST [84080] LOG: entering standby mode
2021-05-26 21:46:54.419 IST [84080] LOG: restored log file
"00000002.history" from archive
.....
2021-05-26 21:46:54.429 IST [84085] LOG: started streaming WAL from
primary at 0/2000000 on timeline 1 -> stream using previous TL
2021-05-26 21:46:54.466 IST [84080] LOG: redo starts at 0/2000028
2021-05-26 21:46:54.466 IST [84080] LOG: consistent recovery state
reached at 0/3000000
2021-05-26 21:46:54.467 IST [84079] LOG: database system is ready to
accept read only connections
2021-05-26 21:46:54.483 IST [84085] LOG: replication terminated by
primary server
2021-05-26 21:46:54.483 IST [84085] DETAIL: End of WAL reached on
timeline 1 at 0/3000000.
cp: cannot stat
‘/home/dilipkumar/work/PG/postgresql/src/test/recovery/tmp_check/t_025_timeline_issue_primary_data/archives/00000003.history’:
No such file or directory
cp: cannot stat
‘/home/dilipkumar/work/PG/postgresql/src/test/recovery/tmp_check/t_025_timeline_issue_primary_data/archives/000000010000000000000003’:
No such file or directory
2021-05-26 21:46:54.498 IST [84085] LOG: primary server contains no
more WAL on requested timeline 1

<failure continues as it never go to timeline 2>

And finally the test case fails because the cascade can never get the changes.

I will check if there is any timing dependency in the test case.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Robert Haas 2021-05-26 16:35:01 Re: Replacing pg_depend PIN entries with a fixed range check
Previous Message Robert Haas 2021-05-26 16:10:06 Re: Race condition in recovery?