Re:Re:Re: [BUG] standby node can not provide service even it replays all log files

From: Thunder <thunder1(at)126(dot)com>
To: "Robert Haas" <robertmhaas(at)gmail(dot)com>
Cc: "PostgreSQL Hackers" <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re:Re:Re: [BUG] standby node can not provide service even it replays all log files
Date: 2019-10-28 13:54:51
Message-ID: 78a38648.8cd4.16e12a5e053.Coremail.thunder1@126.com
Lists: pgsql-hackers

Hi,
In our usage scenario, the standby node can be OOM-killed, and we then have to create a new standby node.
If the master node has a long-running uncommitted transaction, the new standby node cannot provide service.
So for us this is a critical issue.

I would appreciate any suggestions on this issue.
Could anyone help review the attached patch?
Thanks.

At 2019-10-22 20:42:21, "Thunder" <thunder1(at)126(dot)com> wrote:

I have updated the patch.

1. The STANDBY_SNAPSHOT_PENDING state is set when we replay the first XLOG_RUNNING_XACTS record and its subtransaction IDs have overflowed.
2. When we log XLOG_RUNNING_XACTS on the master node, can we assume that all xact IDs < oldestRunningXid are finished?
3. If so, when we replay XLOG_RUNNING_XACTS and change standbyState to STANDBY_SNAPSHOT_PENDING, can we record oldestRunningXid in a shared variable, such as procArray->oldest_running_xid?
4. On the standby node, when GetSnapshotData is called and procArray->oldest_running_xid is valid, can we set xmin to procArray->oldest_running_xid?

I would appreciate any suggestions on this issue.

At 2019-10-22 01:27:58, "Robert Haas" <robertmhaas(at)gmail(dot)com> wrote:
>On Mon, Oct 21, 2019 at 4:13 AM Thunder <thunder1(at)126(dot)com> wrote:
>> Can we fix this issue like the following patch?
>>
>> $git diff src/backend/access/transam/xlog.c
>> diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
>> index 49ae97d4459..0fbdf6fd64a 100644
>> --- a/src/backend/access/transam/xlog.c
>> +++ b/src/backend/access/transam/xlog.c
>> @@ -8365,7 +8365,7 @@ CheckRecoveryConsistency(void)
>> * run? If so, we can tell postmaster that the database is consistent now,
>> * enabling connections.
>> */
>> - if (standbyState == STANDBY_SNAPSHOT_READY &&
>> + if ((standbyState == STANDBY_SNAPSHOT_READY || standbyState == STANDBY_SNAPSHOT_PENDING) &&
>> !LocalHotStandbyActive &&
>> reachedConsistency &&
>> IsUnderPostmaster)
>
>I think that the issue you've encountered is design behavior. In
>other words, it's intended to work that way.
>
>The comments for the code you propose to change say that we can allow
>connections once we've got a valid snapshot. So presumably the effect
>of your change would be to allow connections even though we don't have
>a valid snapshot.
>
>That seems bad.
>
>--
>Robert Haas
>EnterpriseDB: http://www.enterprisedb.com
>The Enterprise PostgreSQL Company

Attachment Content-Type Size
standby_service.patch application/octet-stream 2.3 KB
