Re: standby recovery fails (tablespace related) (tentative patch and discussion)

From: Paul Guo <pguo(at)pivotal(dot)io>
To: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
Cc: Asim Praveen <apraveen(at)pivotal(dot)io>, PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: standby recovery fails (tablespace related) (tentative patch and discussion)
Date: 2019-04-28 07:33:13
Message-ID: CAEET0ZFCQp5JMOS7Tx9v1-4F3pESbpFAYmWyFxhQ=6+YBEHiuA@mail.gmail.com
Thread:
Lists: pgsql-hackers

On Wed, Apr 24, 2019 at 4:14 PM Kyotaro HORIGUCHI <
horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp> wrote:

> Mmm. I posted to wrong thread. Sorry.
>
> At Tue, 23 Apr 2019 16:39:49 +0900 (Tokyo Standard Time), Kyotaro
> HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp> wrote in <
> 20190423(dot)163949(dot)36763221(dot)horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
> > At Tue, 23 Apr 2019 13:31:58 +0800, Paul Guo <pguo(at)pivotal(dot)io> wrote in
> <CAEET0ZEcwz57z2yfWRds43b3TfQPPDSWmbjGmD43xRxLT41NDg(at)mail(dot)gmail(dot)com>
> > > Hi Kyotaro, ignoring the MakePGDirectory() failure will fix this
> database
> > > create redo error, but I suspect some other kind of redo, which
> depends on
> > > the files under the directory (they are not copied since the directory
> is
> > > not created) and also cannot be covered by the invalid page mechanism,
> > > could fail. Thanks.
> >
> > If recovery starts from just after tablespace creation, that's
> > simple. The symlink to the removed tablespace is already removed
> > in that case. Hence the server innocently creates files directly
> > under pg_tblspc, not in the tablespace. Finally, all files that
> > were supposed to be created in the removed tablespace are removed
> > later in recovery.
> >
> > If recovery starts by recalling a page in a file that had been
> > in the tablespace, XLogReadBufferExtended creates one (perhaps
> > directly in pg_tblspc as described above) and the files are
> > removed later in recovery the same way as above. This case doesn't
> > cause FATAL/PANIC during recovery even in master.
> >
> > XLogReadBufferExtended(at)xlogutils(dot)c
> > | * Create the target file if it doesn't already exist. This lets us
> cope
> > | * if the replay sequence contains writes to a relation that is later
> > | * deleted. (The original coding of this routine would instead suppress
> > | * the writes, but that seems like it risks losing valuable data if the
> > | * filesystem loses an inode during a crash. Better to write the data
> > | * until we are actually told to delete the file.)
> >
> > So buffered access cannot be a problem for the reason above. The
> > remaining possible issue is non-buffered access to files in
> > removed tablespaces. This is what I mentioned upthread:
> >
> > me> but I haven't checked this covers all places where need the same
> > me> treatment.
>
> RM_DBASE_ID is fixed by the patch.
>
> XLOG/XACT/CLOG/MULTIXACT/RELMAP/STANDBY/COMMIT_TS/REPLORIGIN/LOGICALMSG:
> - are not relevant.
>
> HEAP/HEAP2/BTREE/HASH/GIN/GIST/SEQ/SPGIST/BRIN/GENERIC:
> - Resource managers that work on buffers are not affected.
>
> SMGR:
> - Both CREATE and TRUNCATE seem fine.
>
> TBLSPC:
> - We don't nest tablespace directories. No problem.
>
> I don't find a similar case.

I took some time digging into the related code. It seems that ignoring the
failure to create the dst directory directly should be fine, since the smgr
redo code eventually creates the tablespace path by calling
TablespaceCreateDbspace(). What's more, I found some more issues.

1) The error message below is actually misleading.

2019-04-17 14:52:14.951 CST [23030] FATAL: could not create directory
"pg_tblspc/65546/PG_12_201904072/65547": No such file or directory
2019-04-17 14:52:14.951 CST [23030] CONTEXT: WAL redo at 0/3011650 for
Database/CREATE: copy dir 1663/1 to 65546/65547

That should be due to dbase_desc(). It could be fixed simply by following
the code logic in GetDatabasePath().

2) It seems that the src directory could be missing, and then
dbase_redo()->copydir() would error out. For example,

\!rm -rf /tmp/tbspace1
\!mkdir /tmp/tbspace1
\!rm -rf /tmp/tbspace2
\!mkdir /tmp/tbspace2
create tablespace tbs1 location '/tmp/tbspace1';
create tablespace tbs2 location '/tmp/tbspace2';
create database db1 tablespace tbs1;
alter database db1 set tablespace tbs2;
drop tablespace tbs1;

Let's say the standby finishes all replay, but the redo lsn in pg_control
still points at the 'alter database' record, and then postgres is killed.
In theory, at the next startup dbase_redo()->copydir() will ERROR out,
since 'drop tablespace tbs1' has already removed the directories (and the
symlink) of tbs1. The simple code change below could fix that.

diff --git a/src/backend/commands/dbcommands.c
b/src/backend/commands/dbcommands.c
index 9707afabd9..7d755c759e 100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -2114,6 +2114,15 @@ dbase_redo(XLogReaderState *record)
*/
FlushDatabaseBuffers(xlrec->src_db_id);

+ /*
+ * It is possible that the source directory is missing if
+ * we are re-replaying WAL that was applied before and a
+ * subsequent record dropped the tablespace during the
+ * previous replay.  In that case just skip the copy.
+ */
+ if (!(stat(src_path, &st) == 0 && S_ISDIR(st.st_mode)))
+ return;
+
/*
* Copy this subdirectory to the new location
*

If we want to fix the issue by ignoring the failure to create the dst
path, I do not think we should do that in copydir(), since copydir() seems
to be a common function. I'm not sure whether it is used by any
extensions; if not, maybe we should move the dst path creation logic out
of copydir().

Also, I'd suggest we use pg_mkdir_p() in TablespaceCreateDbspace() to
replace the code block that involves a lot of get_parent_directory(),
MakePGDirectory(), etc. Even though that is not fixing a bug, the
pg_mkdir_p() version seems more graceful and simpler.

Whether we ignore the mkdir failure or use mkdir_p, I found that these
fixes seem to be error-prone as postgres evolves, since they are hard to
test and it is not easy to think of all the potential bad cases. Would it
be possible to do a real checkpoint (flush & update the redo lsn) when the
standby sees the checkpoint records for these operations? This would slow
down the standby, but the master also does this, these operations are not
common, and it seems it would not usually slow down WAL receiving.
