Re: Problem while setting the fpw with SIGHUP

From: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
To: michael(at)paquier(dot)xyz
Cc: amit(dot)kapila16(at)gmail(dot)com, robertmhaas(at)gmail(dot)com, hlinnaka(at)iki(dot)fi, dilipbalaut(at)gmail(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Problem while setting the fpw with SIGHUP
Date: 2018-08-01 07:25:45
Message-ID: 20180801.162545.234930371.horiguchi.kyotaro@lab.ntt.co.jp
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Thank you, Amit, Michael.

At Sun, 29 Jul 2018 08:19:11 +0900, Michael Paquier <michael(at)paquier(dot)xyz> wrote in <20180728231911(dot)GB1471(at)paquier(dot)xyz>
> On Sat, Jul 28, 2018 at 07:10:24PM +0530, Amit Kapila wrote:
> > I have just responded to your first patch (0001). Can you once again
> > summarize what the 0002 exactly accomplishes? I think one of the
> > goals is to fix the original problem reported in this thread and other
> > is you have found the concurrency issue. Is it possible to have
> > separate patches for those or you think they are interrelated and
> > needs to be fixed together?
>
> That would be nice. The last time I read this thread I have been rather
> confused about what was being discussed, what were the set of problems,
> and what was being fixed. Speaking of which, this is one of the bugfix
> patches I wanted to look at once I have untanggled the autovacuum one
> for temporary relations and the DOS issues with lock lookups.

It's a long time ago.. Let me have a bit of time to blow dust off..

https://www.postgresql.org/message-id/20180420.173354.43303926.horiguchi.kyotaro@lab.ntt.co.jp

...done..i working..

The original problem here was UpdateFullPageWrites() can call
RecoveryInProgress(), which does palloc in a critical
section. This happens when standy is commanded to reload with
change of full_pages_writes during recovery.

While exploring it, I found that update of fullPageWrite flag is
updated concurrently against its design. A race condition happens
in the following steps in my diagnosis.

1. While the startup process is running recovery, we didn't
consider that checkpointer may be running concurrently, but it
happens for standby.

2. Checkpointer considers itself (or designed) as the *only*
updator of shared config including fillPageWrites. In reality
the startup is another concurent updator on standby. Addition to
that, checkpointer assumes that it is started under WAL-emitting
state, that is, InitXLogInsert() has been called elsewhere, but
it is not the case on standby.

Note that checkpointer mustn't update FPW flag before recovery
ends because startup will overrides the flag.

3. As the result, when standby gets full_page_writes changed and
SIGHUP during recovery, checkpointer tries to update FPW flag
and calls RecoveryInProgress() on the way. bang!

With the 0002-step1.patch, checkpointer really becomes the only
updator of the FPW flag after recovery ends. StartupXLOG()
updates it just once before checkpointer starts to update it.

Under the normal(slow?) promotion mode, checkpointer is woken up
explicitly to make the config setting effective as soon as
possible. (This might be unnecessary.)

In checkpointer, RecoveryInProgress() is called preior to
UpdateFPW() to disable update FPW during recovery, so the crash
that is the issue here is fixed.

FYI I think that 0002-tesp2.patch is rejected.

Please find the attacehd revised patch. It is rebased and
provided with a commit message.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachment Content-Type Size
v2-0001-Fix-FPW-flags-updates-during-recovery.patch text/x-patch 5.6 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Konstantin Knizhnik 2018-08-01 07:32:14 Re: [HACKERS] Cached plans and statement generalization
Previous Message Peter Eisentraut 2018-08-01 07:04:20 Re: Alter index rename concurrently to