Re: Hot Standby: too many KnownAssignedXids

From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Cc: Joachim Wieland <joe(at)mcknight(dot)de>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Subject: Re: Hot Standby: too many KnownAssignedXids
Date: 2010-11-24 11:38:03
Message-ID: 4CECF91B.3040105@enterprisedb.com
Lists: pgsql-hackers

On 24.11.2010 12:48, Heikki Linnakangas wrote:
> On 24.11.2010 06:56, Joachim Wieland wrote:
>> On Tue, Nov 23, 2010 at 8:45 AM, Heikki Linnakangas
>> <heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
>>> On 19.11.2010 23:46, Joachim Wieland wrote:
>>>>
>>>> FATAL: too many KnownAssignedXids. head: 0, tail: 0, nxids: 9978,
>>>> pArray->maxKnownAssignedXids: 6890
>>>
>>> Hmm, that's a lot of entries in KnownAssignedXids.
>>>
>>> Can you recompile with WAL_DEBUG, and run the recovery again with
>>> wal_debug=on? That will print all the replayed WAL records, which is
>>> a lot of data, but it might give a hint what's going on.
>>
>> Sure, but this gives me only one more line:
>>
>> [...]
>> LOG: redo starts at 1F8/FC00E978
>> LOG: REDO @ 1F8/FC00E978; LSN 1F8/FC00EE90: prev 1F8/FC00E930; xid
>> 385669; len 21; bkpb1: Heap - insert: rel 1663/16384/18373; tid
>> 3829898/23
>> FATAL: too many KnownAssignedXids
>> CONTEXT: xlog redo insert: rel 1663/16384/18373; tid 3829898/23
>> LOG: startup process (PID 4587) exited with exit code 1
>> LOG: terminating any other active server processes
>
> Thanks, I can reproduce this now. This happens when you have a wide gap
> between the oldest still active xid and the latest xid.
>
> When recovery starts, we fetch the oldestActiveXid from the checkpoint
> record. Let's say that it's 100. We then start replaying WAL records
> from the Redo pointer, and the first record (heap insert in your case)
> contains an Xid that's much larger than 100, say 10000. We call
> RecordKnownAssignedXids() to make note that all xids between that range
> are in-progress, but there isn't enough room in the array for that.
>
> We normally get away with a smallish array because the array is trimmed
> at commit and abort records, and the special xid-assignment record to
> handle the case of a lot of subtransactions. We initialize the array
> from the running-xacts record that's written at a checkpoint. That
> mechanism fails in this case because the heap insert record is seen
> before the running-xacts record, causing all those xids in the range
> 100-10000 to be considered in-progress. The running-xacts record that
> comes later would prune them, but we don't have enough slots to hold
> them until that.
>
> Hmm. I'm not sure off the top of my head how to fix that. Perhaps stash
> the xids we see during WAL replay in private memory instead of putting
> them in the KnownAssignedXids array until we see the running-xacts record.

Looking closer at RecordKnownAssignedTransactionIds(), there's a related,
much more serious bug there too. When latestObservedXid is initialized
to the oldest still-running xid, oldestActiveXid, at the start of WAL
recovery, we zero the CLOG starting from oldestActiveXid. That will zap
away the clog bits of any old transactions that had already committed
before the checkpoint started, but were younger than the oldest
still-running transaction. Those transactions will be lost :-(.
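
To make the failure mode concrete, here is a toy model of it. This is
not PostgreSQL code: every name below is hypothetical, the "clog" is a
flat per-xid array rather than paged storage, and xid wraparound is
ignored. It only illustrates how resetting status for everything after
an old latestObservedXid can erase commits that are already on record.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Toy model only: none of these names are PostgreSQL APIs. */
#define TOY_MAX_XID 1024

typedef enum { TOY_IN_PROGRESS = 0, TOY_COMMITTED = 1 } ToyXidStatus;

typedef struct {
    ToyXidStatus status[TOY_MAX_XID];   /* stands in for the clog bits */
    uint32_t     latest_observed_xid;
} ToyRecovery;

/* Mark one xid as committed, as if it had committed before the checkpoint. */
void toy_set_committed(ToyRecovery *r, uint32_t xid)
{
    assert(xid < TOY_MAX_XID);
    r->status[xid] = TOY_COMMITTED;
}

/* Count committed xids in the inclusive range [lo, hi]. */
int toy_count_committed(const ToyRecovery *r, uint32_t lo, uint32_t hi)
{
    int n = 0;
    for (uint32_t xid = lo; xid <= hi; xid++)
        if (r->status[xid] == TOY_COMMITTED)
            n++;
    return n;
}

/*
 * Models the problematic step: every xid after latest_observed_xid up to
 * and including the replayed xid is assumed to be newly assigned, so its
 * status is reset to "in progress".
 */
void toy_observe_xid(ToyRecovery *r, uint32_t xid)
{
    assert(xid < TOY_MAX_XID);
    for (uint32_t next = r->latest_observed_xid + 1; next <= xid; next++)
        r->status[next] = TOY_IN_PROGRESS;
    if (xid > r->latest_observed_xid)
        r->latest_observed_xid = xid;
}

/*
 * Scenario: oldest_active is still running, xids up to last_committed
 * committed before the checkpoint, and first_replayed is the xid of the
 * first replayed WAL record. Returns the number of commits erased.
 */
int toy_demo_lost_commits(uint32_t oldest_active, uint32_t last_committed,
                          uint32_t first_replayed)
{
    ToyRecovery r;
    memset(&r, 0, sizeof r);

    for (uint32_t xid = oldest_active + 1; xid <= last_committed; xid++)
        toy_set_committed(&r, xid);
    int before = toy_count_committed(&r, oldest_active + 1, last_committed);

    r.latest_observed_xid = oldest_active;   /* the dangerous initialization */
    toy_observe_xid(&r, first_replayed);

    return before - toy_count_committed(&r, oldest_active + 1, last_committed);
}
```

In this model, toy_demo_lost_commits(100, 150, 500) reports that all 50
commits between the oldest running xid and the first replayed xid are
gone.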

It's dangerous to initialize latestObservedXid to an older value. The
idea of keeping the seen xids in a temporary list private to the startup
process until we see the running-xacts record would solve that problem
too. ProcArrayInitRecoveryInfo() would not be needed anymore; the
KnownAssignedXids tracking would start at the first running-xacts
record (or shutdown checkpoint) we see, not any sooner than that.
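
A rough sketch of what that could look like, under several assumptions:
all names and capacities here are made up for illustration, the "known"
array merely stands in for the shared KnownAssignedXids array, xid
wraparound is ignored, and the filling-in of intermediate xids once
tracking has started is left out. Carrying the stashed xids over when
the running-xacts record arrives is a conservative choice of this
sketch, not something settled above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch; these names are not PostgreSQL APIs. */
#define STASH_CAPACITY 64   /* list private to the startup process */
#define KNOWN_CAPACITY 64   /* stands in for the shared KnownAssignedXids */

typedef struct {
    uint32_t stash[STASH_CAPACITY];
    size_t   nstash;
    uint32_t known[KNOWN_CAPACITY];
    size_t   nknown;
    bool     tracking;              /* true after the first running-xacts */
    uint32_t latest_observed_xid;
} SketchState;

/* Append xid unless already present; returns false if the array is full. */
bool sketch_add_unique(uint32_t *arr, size_t *n, size_t cap, uint32_t xid)
{
    for (size_t i = 0; i < *n; i++)
        if (arr[i] == xid)
            return true;
    if (*n >= cap)
        return false;
    arr[(*n)++] = xid;
    return true;
}

/* Remove xid if present (order is not preserved). */
void sketch_remove(uint32_t *arr, size_t *n, uint32_t xid)
{
    for (size_t i = 0; i < *n; i++)
        if (arr[i] == xid) {
            arr[i] = arr[--(*n)];
            return;
        }
}

/* Called with the xid of each replayed WAL record. */
bool sketch_observe_xid(SketchState *s, uint32_t xid)
{
    if (!s->tracking)   /* before running-xacts: remember only this xid */
        return sketch_add_unique(s->stash, &s->nstash, STASH_CAPACITY, xid);
    if (xid > s->latest_observed_xid)
        s->latest_observed_xid = xid;
    return sketch_add_unique(s->known, &s->nknown, KNOWN_CAPACITY, xid);
}

/* Called for commit and abort records. */
void sketch_end_xid(SketchState *s, uint32_t xid)
{
    if (s->tracking)
        sketch_remove(s->known, &s->nknown, xid);
    else
        sketch_remove(s->stash, &s->nstash, xid);
}

/* Called for the first running-xacts record: tracking starts here. */
bool sketch_running_xacts(SketchState *s, const uint32_t *running,
                          size_t nrunning, uint32_t latest_xid)
{
    s->nknown = 0;
    for (size_t i = 0; i < nrunning; i++)
        if (!sketch_add_unique(s->known, &s->nknown, KNOWN_CAPACITY, running[i]))
            return false;
    /* Conservatively carry over stashed xids that have not ended yet. */
    for (size_t i = 0; i < s->nstash; i++)
        if (!sketch_add_unique(s->known, &s->nknown, KNOWN_CAPACITY, s->stash[i]))
            return false;
    s->nstash = 0;
    s->tracking = true;
    s->latest_observed_xid = latest_xid;
    return true;
}
```

With the numbers from the example above, replaying a record for xid
10000 before the running-xacts record costs one stash slot instead of
roughly 9900 array entries, and no clog state is touched before tracking
starts.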

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
