Re: standby recovery fails (tablespace related) (tentative patch and discussion)

From: Paul Guo <pguo(at)pivotal(dot)io>
To: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
Cc: Asim Praveen <apraveen(at)pivotal(dot)io>, PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: standby recovery fails (tablespace related) (tentative patch and discussion)
Date: 2019-04-30 06:33:47
Message-ID: CAEET0ZGhmDKrq7JJu2rLLqcJBR8pA4OYrKsirZ5Ft8-deG1e8A@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

I updated the original patch to

1) skip copydir() if either src path or dst parent path is missing in
dbase_redo(). Both missing cases seem to be possible. For the src path
missing case, mkdir_p() is meaningless. It seems that moving the directory
existence check step to dbase_redo() has less impact on other code.

2) Fixed dbase_desc(). Now the xlog output looks correct.

rmgr: Database len (rec/tot): 42/ 42, tx: 486, lsn:
0/016386A8, prev 0/01638630, desc: CREATE copy dir base/1 to
pg_tblspc/16384/PG_12_201904281/16386

rmgr: Database len (rec/tot): 34/ 34, tx: 487, lsn:
0/01638EB8, prev 0/01638E40, desc: DROP dir
pg_tblspc/16384/PG_12_201904281/16386

I'm not familiar with the TAP test details previously. I learned a lot
about how to test such case from Kyotaro's patch series.👍

On Sun, Apr 28, 2019 at 3:33 PM Paul Guo <pguo(at)pivotal(dot)io> wrote:

>
> On Wed, Apr 24, 2019 at 4:14 PM Kyotaro HORIGUCHI <
> horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp> wrote:
>
>> Mmm. I posted to wrong thread. Sorry.
>>
>> At Tue, 23 Apr 2019 16:39:49 +0900 (Tokyo Standard Time), Kyotaro
>> HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp> wrote in <
>> 20190423(dot)163949(dot)36763221(dot)horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
>> > At Tue, 23 Apr 2019 13:31:58 +0800, Paul Guo <pguo(at)pivotal(dot)io> wrote
>> in <CAEET0ZEcwz57z2yfWRds43b3TfQPPDSWmbjGmD43xRxLT41NDg(at)mail(dot)gmail(dot)com>
>> > > Hi Kyotaro, ignoring the MakePGDirectory() failure will fix this
>> database
>> > > create redo error, but I suspect some other kind of redo, which
>> depends on
>> > > the files under the directory (they are not copied since the
>> directory is
>> > > not created) and also cannot be covered by the invalid page mechanism,
>> > > could fail. Thanks.
>> >
>> > If recovery starts from just after tablespace creation, that's
>> > simple. The Symlink to the removed tablespace is already removed
>> > in the case. Hence server innocently create files directly under
>> > pg_tblspc, not in the tablespace. Finally all files that were
>> > supposed to be created in the removed tablespace are removed
>> > later in recovery.
>> >
>> > If recovery starts from recalling page in a file that have been
>> > in the tablespace, XLogReadBufferExtended creates one (perhaps
>> > directly in pg_tblspc as described above) and the files are
>> > removed later in recoery the same way to above. This case doen't
>> > cause FATAL/PANIC during recovery even in master.
>> >
>> > XLogReadBufferExtended(at)xlogutils(dot)c
>> > | * Create the target file if it doesn't already exist. This lets us
>> cope
>> > | * if the replay sequence contains writes to a relation that is later
>> > | * deleted. (The original coding of this routine would instead
>> suppress
>> > | * the writes, but that seems like it risks losing valuable data if the
>> > | * filesystem loses an inode during a crash. Better to write the data
>> > | * until we are actually told to delete the file.)
>> >
>> > So buffered access cannot be a problem for the reason above. The
>> > remaining possible issue is non-buffered access to files in
>> > removed tablespaces. This is what I mentioned upthread:
>> >
>> > me> but I haven't checked this covers all places where need the same
>> > me> treatment.
>>
>> RM_DBASE_ID is fixed by the patch.
>>
>> XLOG/XACT/CLOG/MULTIXACT/RELMAP/STANDBY/COMMIT_TS/REPLORIGIN/LOGICALMSG:
>> - are not relevant.
>>
>> HEAP/HEAP2/BTREE/HASH/GIN/GIST/SEQ/SPGIST/BRIN/GENERIC:
>> - Resources works on buffer is not affected.
>>
>> SMGR:
>> - Both CREATE and TRUNCATE seems fine.
>>
>> TBLSPC:
>> - We don't nest tablespace directories. No Problem.
>>
>> I don't find a similar case.
>
>
> I took some time in digging into the related code. It seems that ignoring
> if the dst directory cannot be created directly
> should be fine since smgr redo code creates tablespace path finally by
> calling TablespaceCreateDbspace().
> What's more, I found some more issues.
>
> 1) The below error message is actually misleading.
>
> 2019-04-17 14:52:14.951 CST [23030] FATAL: could not create directory
> "pg_tblspc/65546/PG_12_201904072/65547": No such file or directory
> 2019-04-17 14:52:14.951 CST [23030] CONTEXT: WAL redo at 0/3011650 for
> Database/CREATE: copy dir 1663/1 to 65546/65547
>
> That should be due to dbase_desc(). It could be simply fixed following the
> code logic in GetDatabasePath().
>
> 2) It seems that src directory could be missing then
> dbase_redo()->copydir() could error out. For example,
>
> \!rm -rf /tmp/tbspace1
> \!mkdir /tmp/tbspace1
> \!rm -rf /tmp/tbspace2
> \!mkdir /tmp/tbspace2
> create tablespace tbs1 location '/tmp/tbspace1';
> create tablespace tbs2 location '/tmp/tbspace2';
> create database db1 tablespace tbs1;
> alter database db1 set tablespace tbs2;
> drop tablespace tbs1;
>
> Let's say, the standby finishes all replay but redo lsn on pg_control is
> still the point at 'alter database', and then
> kill postgres, then in theory when startup, dbase_redo()->copydir() will
> ERROR since 'drop tablespace tbs1'
> has removed the directories (and symlink) of tbs1. Below simple code
> change could fix that.
>
> diff --git a/src/backend/commands/dbcommands.c
> b/src/backend/commands/dbcommands.c
> index 9707afabd9..7d755c759e 100644
> --- a/src/backend/commands/dbcommands.c
> +++ b/src/backend/commands/dbcommands.c
> @@ -2114,6 +2114,15 @@ dbase_redo(XLogReaderState *record)
> */
> FlushDatabaseBuffers(xlrec->src_db_id);
>
> + /*
> + * It is possible that the source directory is missing if
> + * we are re-replaying the xlog while subsequent xlogs
> + * drop the tablespace in previous replaying. For this
> + * we just skip.
> + */
> + if (!(stat(src_path, &st) == 0 && S_ISDIR(st.st_mode)))
> + return;
> +
> /*
> * Copy this subdirectory to the new location
> *
>
> If we want to fix the issue by ignoring the dst path create failure, I do
> not think we should do
> that in copydir() since copydir() seems to be a common function. I'm not
> sure whether it is
> used by some extensions or not. If no maybe we should move the dst patch
> create logic
> out of copydir().
>
> Also I'd suggest we should use pg_mkdir_p() in TablespaceCreateDbspace()
> to replace
> the code block includes a lot of
> get_parent_directory(), MakePGDirectory(), etc even it
> is not fixing a bug since pg_mkdir_p() code change seems to be more
> graceful and simpler.
>
> Whatever ignore mkdir failure or mkdir_p, I found that these steps seem to
> be error-prone
> along with postgre evolving since they are hard to test and also we are
> not easy to think out
> various potential bad cases. Is it possible that we should do real
> checkpoint (flush & update
> redo lsn) when seeing checkpoint xlogs for these operations? This will
> slow down standby
> but master also does this and also these operations are not usual,
> espeically it seems that it
> does not slow down wal receiving usually?
>
>
>
>

Attachment Content-Type Size
v2-0001-skip-copydir-if-either-src-directory-or-dst-direc.patch application/octet-stream 5.7 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Andres Freund 2019-04-30 07:05:52 Re: REINDEX INDEX results in a crash for an index of pg_class since 9.6
Previous Message John Naylor 2019-04-30 06:12:11 Re: Unhappy about API changes in the no-fsm-for-small-rels patch