Re: BUG #17226: hot standby mode, create database and drop database cause standby restrat failed

From: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
To: bchen90(at)163(dot)com, PostgreSQL mailing lists <pgsql-bugs(at)lists(dot)postgresql(dot)org>
Subject: Re: BUG #17226: hot standby mode, create database and drop database cause standby restrat failed
Date: 2021-10-14 05:05:58
Message-ID: CAD21AoDXGFKmRNPj-OuqTu28rEKMhWfJm0-p52gts=UKwjp-zw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

On Wed, Oct 13, 2021 at 6:49 PM PG Bug reporting form
<noreply(at)postgresql(dot)org> wrote:
>
> The following bug has been logged on the website:
>
> Bug reference: 17226
> Logged by: Bo Chen
> Email address: bchen90(at)163(dot)com
> PostgreSQL version: 11.8
> Operating system: Suse linux 11.3 x86_64
> Description:
>
> Hi, all
>
> Recently, I got a FATAL while restarts standby, which can be reproduced by
> the following steps, based on pg 11:
> 1. create a database test
> 2. create another database test1 using test as template
> 3. drop database test
> 4. stop standby using 'pg_ctl stop -mi'
> 5. start standby, start failed
> 6. stop primary using 'pg_ctl stop -mi'
> 7. start primary, start OK
>
> The starting log for standby failed is:
>
> 2021-10-13 15:57:43.294 CST startup 2420219 1/0 0 %LOG: 00000: redo starts
> at 0/6000024 xlog.c (7043 StartupXLOG)
> 2021-10-13 15:57:43.297 CST startup 2420219 1/0 0 %FATAL: 58P01: could not
> open directory "base/16385": No such file or directory fd.c (2714
> ReadDirExtended)
> 2021-10-13 15:57:43.297 CST startup 2420219 1/0 0 %CONTEXT: WAL redo at
> 0/7001200 for Database/CREATE: copy dir 16385/1663 to 16386/1663
> 2021-10-13 15:57:43.298 CST postmaster 2420180 0 %LOG: 00000: startup
> process (PID 2420219) exited with exit code 1 postmaster.c (3772
> LogChildExit)
>
>
> I think the reason for the above error is as follows:
> For standby, the redo of XLOG_DBASE_DROP and XLOG_DBASE_CREATE just remove
> and copy the databse directory but not peform checkpoint like create or drop
> database in primary. So, when we restart, it redo from the log of create
> database but now the directory of the source database has been deleted, so
> redo failed.

Thank you for the report!

I could reproduce this issue. If WAL records are produced in the
following order and the standby restarts the crash recovery from #2
after replaying all records, the replay of #3 fails due to missing the
database directory (base/16384) used as a template:

1. CREATEDB copy 1/1663 to 16384/1663
2. CHECKPOINT (and becomes the restart point for the standby)
3. CREATEDB copy 16384/1663 to 16385/1663
4. DROPDB 16384/1663

This issue is related to the discussion on pgsql-hackers[1] and the
patch is already proposed[2]. I’ve applied v12-0003 patch and
confirmed that it fixes this issue as well (although not looked at how
to fix this issue). So probably this issue can be fixed together with
that issue.

Regards,

[1] https://www.postgresql.org/message-id/CAEET0ZGx9AvioViLf7nbR_8tH9-%3D27DN5xWJ2P9-ROH16e4JUA%40mail.gmail.com
[2] https://www.postgresql.org/message-id/SA1PR05MB85821DEC48A6841277BDBF57C3F29%40SA1PR05MB8582.namprd05.prod.outlook.com

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/

In response to

Browse pgsql-bugs by date

  From Date Subject
Next Message Masahiko Sawada 2021-10-14 05:22:21 Re: Inconsistent behavior of pg_dump/pg_restore on DEFAULT PRIVILEGES
Previous Message Peter Geoghegan 2021-10-14 04:18:30 Re: BUG #17212: pg_amcheck fails on checking temporary relations