Re: Problem while setting the fpw with SIGHUP

From: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
To: michael(at)paquier(dot)xyz
Cc: dilipbalaut(at)gmail(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Problem while setting the fpw with SIGHUP
Date: 2018-03-28 06:40:59
Message-ID: 20180328.154059.146076703.horiguchi.kyotaro@lab.ntt.co.jp
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

At Tue, 27 Mar 2018 22:02:26 +0900, Michael Paquier <michael(at)paquier(dot)xyz> wrote in <20180327130226(dot)GA1105(at)paquier(dot)xyz>
> On Tue, Mar 27, 2018 at 09:01:20PM +0900, Kyotaro HORIGUCHI wrote:
> > The current UpdateFullPageWrites is safe on standby and promotion
> > so what we should consider is only the non-standby case. I think
> > what we should do is just calling RecoveryInProgress() at the
> > beginning of CheckPointerMain, which is just the same thing with
> > InitPostgres, but before setting up signal handler to avoid
> > processing SIGHUP before being ready to insert xlog.
>
> Your proposal does not fix the issue for a checkpointer process started
> on a standby. After a promotion, if SIGHUP is issued with a change in
> full_page_writes, then the initialization of InitXLogInsert() would
> happen again in the critical section of UpdateFullPageWrites(). The
> window is rather small for normal promotions as the startup process
> requests a checkpoint which would do the initialization, and much larger
> for fallback_promote where the startup process is in charge of doing the
> end-of-recovery checkpoint.

Yeah. I realized that after sending the mail.

I looked closer and found several problems there.

- On standby, StartupXLOG calls UpdateFullPageWrites and
checkpointer can call the same function simultaneously, but it
doesn't assume concurrent call.

- StartupXLOG can make a concurrent write to
Insert->fullPageWrite so it needs to be locked.

- At the time of the very end of recovery, the startup process
ignores possible change of full_page_writes GUC. It sticks with
the startup value. It leads to loss of
XLOG_CHANGE_FPW. (StartXLOG is not considering GUC changes by
reload)

- If checkpointer calls UpdateFullPageWrites concurrently with
change of SharedRecoveryInProgress to false in StartupXLOG the
change may lose corresponding XLOG_CHANGE_FPW.

So, if we don't accept the current behavior, what I think we
should do are all of the follows.

A. In StartupXLOG, protect write to Insert->fullPageWrites with
wal insert exlusive lock. Do the same thing for read in
UpdateFullPageWrites.

B. Surround the whole UpdateFullPageWrites with any kind of lock
to exclude concurrent calls. The attached uses ControlFileLock.
This also exludes the function with chaging of
SharedRecoveryInProgress to avoid loss of XLOG_CHANGE_FPW.

C. After exiting recovery mode, call UpdateFullPageWrites from
StartupXLOG if shared fullPageWrites is found changed from the
last known value. If checkponiter did the same thing at the
same time, one of them completes the work.

D. Call RecoveryInProgress to set up xlog working area.

The attached does that. I don't like that it uses ControlFileLock
to exlucde concurrent UpdateFullPageWrites and StartupXLOG but
WALInsertLock cannot be used since UpdateFullPageWrites may take
the same lock.

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tatsuo Ishii 2018-03-28 06:41:24 Re: [HACKERS] [PATCH] Lockable views
Previous Message Michael Paquier 2018-03-28 06:09:24 Re: Enhance pg_stat_wal_receiver view to display connected host