Quick Links

Re: standby recovery fails (tablespace related) (tentative patch and discussion)

From:	Anastasia Lubennikova <a(dot)lubennikova(at)postgrespro(dot)ru>
To:	Paul Guo <pguo(at)pivotal(dot)io>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Cc:	Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>, PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: standby recovery fails (tablespace related) (tentative patch and discussion)
Date:	2019-08-22 16:13:05
Message-ID:	bdca9c56-ec72-8fa3-b6ec-8a4ede8e4105@postgrespro.ru
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

22.08.2019 16:13, Paul Guo wrote:
> Thanks. I updated the patch to v5. It passes install-check testing and
> recovery testing.
Hi,
Thank you for working on this fix.
The overall design of the latest version looks good to me.
But during the review, I found a bug in the current implementation.
New behavior must apply to crash-recovery only, now it applies to
archiveRecovery too.
That can cause a silent loss of a tablespace during regular standby
operation
since it never calls CheckRecoveryConsistency().

Steps to reproduce:
1) run master and replica
2) create dir for tablespace:
mkdir /tmp/tblspc1

3) create tablespace and database on the master:
create tablespace tblspc1 location '/tmp/tblspc1';
create database db1 tablespace tblspc1 ;

4) wait for replica to receive this changes and pause replication:
select pg_wal_replay_pause();

5) move replica's tablespace symlink to some empty directory, i.e.
/tmp/tblspc2
mkdir /tmp/tblspc2
ln -sfn /tmp/tblspc2 postgresql_data_replica/pg_tblspc/16384

6) create another database in tblspc1 on master:
create database db2 tablespace tblspc1 ;

7) resume replication on standby:
select pg_wal_replay_resume();

8) try to connect to db2 on standby

It's expected that dbase_redo() will fail because the directory on
standby is not found.
While with the patch it suppresses the error until we attempt to connect
db2 on the standby:

2019-08-22 18:34:39.178 MSK [21066] HINT: Execute
pg_wal_replay_resume() to continue.
2019-08-22 18:42:41.656 MSK [21066] WARNING: Skip creating database
directory "pg_tblspc/16384/PG_13_201908012". The dest tablespace may
have been removed before abnormal shutdown. If the removal is illegal
after later checking we will panic.
2019-08-22 18:42:41.656 MSK [21066] CONTEXT: WAL redo at 0/3027738 for
Database/CREATE: copy dir base/1 to pg_tblspc/16384/PG_13_201908012/16390
2019-08-22 18:42:46.096 MSK [21688] FATAL:
"pg_tblspc/16384/PG_13_201908012/16390" is not a valid data directory
2019-08-22 18:42:46.096 MSK [21688] DETAIL: File
"pg_tblspc/16384/PG_13_201908012/16390/PG_VERSION" is missing.

Also some nitpicking about code style:
1) Please, add comment to forget_missing_directory().

2) + elog(LOG, "Directory \"%s\" was missing during
directory copying "
I think we'd better update this message elevel to WARNING.

3) Shouldn't we also move FlushDatabaseBuffers(xlrec->src_db_id); call under
if (do_copydir) clause?
I don't see a reason to flush pages that we won't use later.

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

In response to

Re: standby recovery fails (tablespace related) (tentative patch and discussion) at 2019-08-22 13:13:20 from Paul Guo

Responses

Re: standby recovery fails (tablespace related) (tentative patch and discussion) at 2019-09-03 15:58:55 from Alvaro Herrera
Re: standby recovery fails (tablespace related) (tentative patch and discussion) at 2019-09-10 11:42:55 from Asim R P

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Dilip Kumar	2019-08-22 16:14:10	Re: POC: Cleaning up orphaned files using undo logs
Previous Message	Kohei KaiGai	2019-08-22 16:05:19	Re: Asymmetric partition-wise JOIN