Re: Race condition in recovery?

From: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
To: tgl(at)sss(dot)pgh(dot)pa(dot)us
Cc: robertmhaas(at)gmail(dot)com, dilipbalaut(at)gmail(dot)com, hlinnaka(at)iki(dot)fi, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: Race condition in recovery?
Date: 2021-06-11 05:07:45
Message-ID: 20210611.140745.1586374762292931816.horikyota.ntt@gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

At Thu, 10 Jun 2021 21:53:18 -0400, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote in
> Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com> writes:
> > At Thu, 10 Jun 2021 09:56:51 -0400, Robert Haas <robertmhaas(at)gmail(dot)com> wrote in
> >> Thanks for the analysis and the patches. I have committed them.
>
> > Thanks for committing it.
>
> Please note that conchuela and jacana are still failing ...
>
> conchuela's failure is evidently not every time, but this test
> definitely postdates the "fix":
>
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=conchuela&dt=2021-06-10%2014%3A09%3A08

A different test is failing there. Maybe from different issue.

> ==~_~===-=-===~_~== pgsql.build/src/test/recovery/tmp_check/log/regress_log_002_archiving ==~_~===-=-===~_~==
...
> # Postmaster PID for node "standby2" is 342349
> ### Promoting node "standby2"
> # Running: pg_ctl -D /home/pgbf/buildroot/HEAD/pgsql.build/src/test/recovery/tmp_check/t_002_archiving_standby2_data/pgdata -l /home/pgbf/buildroot/HEAD/pgsql.build/src/test/recovery/tmp_check/log/002_archiving_standby2.log promote
> waiting for server to promote................................................................................................................ stopped waiting
> pg_ctl: server did not promote in time
> Bail out! system pg_ctl failed

> ==~_~===-=-===~_~== pgsql.build/src/test/recovery/tmp_check/log/002_archiving_standby2.log ==~_~===-=-===~_~==
...
> 2021-06-10 16:21:21.870 CEST [342350:9] LOG: received promote request
> 2021-06-10 16:21:21.870 CEST [342350:10] LOG: redo done at 0/3030200 system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.07 s
> 2021-06-10 16:21:21.870 CEST [342350:11] LOG: last completed transaction was at log time 2021-06-10 16:21:21.010599+02
> 2021-06-10 16:21:21.893 CEST [342350:12] LOG: restored log file "000000010000000000000003" from archive
> cp: /home/pgbf/buildroot/HEAD/pgsql.build/src/test/recovery/tmp_check/t_002_archiving_primary_data/archives/00000003.history: No such file or directory
> 2021-06-10 16:21:21.896 CEST [342350:13] LOG: selected new timeline ID: 3
(log ends here)

> ==~_~===-=-===~_~== pgsql.build/src/test/recovery/tmp_check/log/002_archiving_primary.log ==~_~===-=-===~_~==
...
> 2021-06-10 16:21:21.107 CEST [342322:4] 002_archiving.pl LOG: disconnection: session time: 0:00:00.022 user=pgbf database=postgres host=[local]
> 2021-06-10 16:23:21.965 CEST [342279:4] LOG: received immediate shutdown request

So the standby2 was stuck after selecting the new timeline and before
updating control file and its postmaster couldn't even respond to
SIGQUIT.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message vignesh C 2021-06-11 05:12:36 Re: locking [user] catalog tables vs 2pc vs logical rep
Previous Message Noah Misch 2021-06-11 05:00:20 Re: libpq debug log