Re: [BUG] standby node can not provide service even it replays all log files

From: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
To: thunder1(at)126(dot)com
Cc: robertmhaas(at)gmail(dot)com, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: [BUG] standby node can not provide service even it replays all log files
Date: 2019-10-29 04:57:19
Message-ID: 20191029.135719.784886453123056051.horikyota.ntt@gmail.com
Lists: pgsql-hackers

At Thu, 24 Oct 2019 17:37:52 +0800 (CST), Thunder <thunder1(at)126(dot)com> wrote in
> Thanks for the reply. I am confused about the snapshot.
>
> At 2019-10-23 11:51:19, "Kyotaro Horiguchi" <horikyota(dot)ntt(at)gmail(dot)com> wrote:
> >Hello.
> >
> >At Tue, 22 Oct 2019 20:42:21 +0800 (CST), Thunder <thunder1(at)126(dot)com> wrote in
> >> Update the patch.
> >>
> >> 1. The STANDBY_SNAPSHOT_PENDING state is set when we replay the first XLOG_RUNNING_XACTS record and its subtransaction ids have overflowed.
> >> 2. When we log XLOG_RUNNING_XACTS on the master node, can we assume that all xact ids < oldestRunningXid are finished?
> >
> >Unfortunately we can't. Standby needs to know that the *standby's*
> >oldest active xid exceeds the pending xmin, not master's. And it is
> >already processed in ProcArrayApplyRecoveryInfo. We cannot assume that
> >the oldest xids are the same on both sides of a replication pair.
>
>
> This issue occurs when the master has an uncommitted transaction with many subtransactions while we restart or create a new standby node.
> The standby node cannot provide service because of this issue.
> Can the standby have any active xid while it cannot provide service?

The problem is not the xid but the snapshot, that is, the information
about which xids are not yet committed on the master. Without that
information the standby cannot determine which rows should be visible.
The xid list is maintained using incoming commit records and is lost
on restart. So a restarted standby needs an XLOG_RUNNING_XACTS record
that is not subxid-overflowed to be sure that the xid list is
complete.
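
To make the above concrete, here is a toy model in Python. It is only
a sketch of the behaviour described in this mail, not PostgreSQL code:
the class, field and method names are made up for illustration, the
pending-xmin rule is simplified, and xid wraparound is ignored.

```python
from dataclasses import dataclass, field


@dataclass
class RunningXacts:
    """Toy stand-in for one XLOG_RUNNING_XACTS record (fields simplified)."""
    xids: set                 # xids the master reports as still running
    oldest_running_xid: int
    next_xid: int
    subxid_overflow: bool     # True when the subxid part of the list is incomplete


@dataclass
class ToyStandby:
    """Hypothetical standby that tracks only what this mail talks about."""
    state: str = "INITIALIZED"
    known_assigned: set = field(default_factory=set)
    pending_xmin: int = 0

    def restart(self):
        # The xid list lives only in memory, so a restart loses it.
        self.state = "INITIALIZED"
        self.known_assigned = set()
        self.pending_xmin = 0

    def replay_commit(self, xid):
        # Incoming commit records remove xids from the list.
        self.known_assigned.discard(xid)

    def replay_running_xacts(self, rec):
        if self.state == "SNAPSHOT_READY":
            return
        if not rec.subxid_overflow:
            # Complete list: the standby can build snapshots from it.
            self.known_assigned = set(rec.xids)
            self.state = "SNAPSHOT_READY"
        elif self.state == "INITIALIZED":
            # Incomplete list: remember where we started and wait.
            self.known_assigned = set(rec.xids)
            self.pending_xmin = rec.next_xid
            self.state = "SNAPSHOT_PENDING"
        elif rec.oldest_running_xid >= self.pending_xmin:
            # Everything that was running when we started has finished.
            self.state = "SNAPSHOT_READY"

    def can_serve_queries(self):
        return self.state == "SNAPSHOT_READY"
```

With a long-lived transaction (say xid 100) that has overflowed its
subxids, every record replayed by this model keeps the standby in
SNAPSHOT_PENDING until that transaction ends, which is the situation
reported in this thread.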

> >> 3. If we can assume this, when we replay XLOG_RUNNING_XACTS and change standbyState to STANDBY_SNAPSHOT_PENDING, can we record oldestRunningXid in a shared variable, like procArray->oldest_running_xid?
> >> 4. On the standby node, when GetSnapshotData is called and procArray->oldest_running_xid is valid, can we set xmin to procArray->oldest_running_xid?
> >>
> >> I would appreciate any suggestions on this issue.

So we somehow need to make KnownAssignedTransactionIds complete even
when there are subxid-overflowed transactions. As mentioned upthread,
I think we have at least the following choices.

- Send back the complete xid list in response to the START_REPLICATION
  command from walreceiver.

- Have the first XLOG_RUNNING_XACTS logged after a standby comes in
  carry the complete xid list, even while a subxid-overflowed
  transaction is alive.

I think the first is better.
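
A rough sketch of what the first choice could look like, again as a
toy Python model rather than a patch. Everything here is an
assumption for illustration: the function names are invented, and no
existing replication message carries such a list today.

```python
def complete_xid_list(running):
    """Flatten {top-level xid: [subxids]} into one complete set of xids.

    Hypothetical master-side step: unlike an overflowed
    XLOG_RUNNING_XACTS record, nothing is dropped here.
    """
    xids = set()
    for top_xid, subxids in running.items():
        xids.add(top_xid)
        xids.update(subxids)
    return xids


def seed_standby(reply_xids):
    """Hypothetical standby-side step: start from the complete list."""
    return {"state": "SNAPSHOT_READY", "known_assigned": set(reply_xids)}


def xid_in_progress(xid, standby):
    """An xid in the list is treated as not yet committed on the master."""
    return xid in standby["known_assigned"]
```

For example, one open transaction 100 with subxids 101..200 would
give the standby a list of 101 xids right at connection time, so it
would not have to wait for that transaction to end.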

Any suggestions?

--
Kyotaro Horiguchi
NTT Open Source Software Center
