Re: Deriving Recovery Snapshots

From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Deriving Recovery Snapshots
Date: 2008-10-22 09:29:46
Message-ID: 48FEF28A.5060803@enterprisedb.com
Lists: pgsql-hackers

Simon Riggs wrote:
> On Thu, 2008-10-16 at 18:52 +0300, Heikki Linnakangas wrote:
>> Simon Riggs wrote:
>>> * The backend slot may not be reused for some time, so we should take
>>> additional actions to keep state current and true. So we choose to log a
>>> snapshot from the master into WAL after each checkpoint. This can then
>>> be used to cleanup any unobserved xids. It also provides us with our
>>> initial state data, see later.
>> We don't need to log a complete snapshot, do we? Just oldestxmin should
>> be enough.
>
> Possibly, but you're thinking that once we're up and running we can use
> less info.
>
> Trouble is, you don't know when/if the standby will crash/be shutdown.
> So we need regular full snapshots to allow it to re-establish full
> information at regular points. So we may as well drop the whole snapshot
> to WAL every checkpoint. To do otherwise would mean more code and less
> flexibility.

Surely it's less code to write the OldestXmin to the checkpoint record,
rather than a full snapshot, no? And to read it off the checkpoint record.

>>> UnobservedXids is maintained as a sorted array. This comes for free
>>> since xids are always added in xid assignment order. This allows xids to
>>> be removed via bsearch when WAL records arrive for the missing xids. It
>>> also allows us to stop searching for xids once we reach
>>> latestCompletedXid.
>> If we're going to have an UnobservedXids array, why don't we just treat
>> all in-progress transactions as Unobserved, and forget about the dummy
>> PROC entries?
>
> That's a good question and I expected some debate on that.
>
> The main problem is fatal errors that don't write abort records. By
> reusing the PROC entries we can keep those to a manageable limit. If we
> don't have that, the number of fatal errors could cause that list to
> grow uncontrollably and we might overflow any setting, causing snapshots
> to stall and new queries to hang. We really must have a way to place an
> upper bound on the number of unobserved xacts. So we really need the
> proc approach. But we also need the UnobservedXids array.

If you write the oldestxmin (or a full snapshot, including the
oldestxmin) to each checkpoint record, you can crop out any unobserved
xids older than that, when you replay the checkpoint record.
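The cropping step could look roughly like the following. This is only a sketch under stated assumptions: `Xid`, `xid_precedes` and `prune_unobserved` are hypothetical names for illustration, the array is assumed to be sorted in xid assignment order, and the comparison ignores the special (non-normal) xids.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef uint32_t Xid;           /* hypothetical stand-in for TransactionId */

/* Wraparound-aware "a is older than b"; special xids are ignored here. */
static int xid_precedes(Xid a, Xid b)
{
    return (int32_t) (a - b) < 0;
}

/*
 * Called when a checkpoint record carrying oldest_xmin is replayed.
 * Drops every entry older than oldest_xmin from the sorted array and
 * returns the new number of entries.
 */
static size_t prune_unobserved(Xid *xids, size_t n, Xid oldest_xmin)
{
    size_t first_kept = 0;

    assert(xids != NULL);

    /* The array is sorted, so all stale entries sit at the front. */
    while (first_kept < n && xid_precedes(xids[first_kept], oldest_xmin))
        first_kept++;

    /* Slide the surviving tail down to the start of the array. */
    memmove(xids, xids + first_kept, (n - first_kept) * sizeof(Xid));
    return n - first_kept;
}
```

With this, xids left behind by backends that died without writing an abort record stay in the array for at most one checkpoint interval.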

> Having only an UnobservedXid array was my first thought and I said
> earlier I would do it without using procs. Bad idea. Using the
> UnobservedXids array means every xact removal requires a bsearch,
> whereas with procs we can do a direct lookup, removing all xids in one
> stroke. Much better for typical cases.

How much does that really matter? Under normal circumstances, the array
would be quite small anyway. A bsearch of a relatively small array isn't
that expensive. Or we could use a hash table, so that removing/inserting
items doesn't need to shift all the following entries.
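To make the cost concrete, removing one xid from a sorted array is a standard bsearch() followed by a memmove() of the tail. A minimal sketch, where `Xid`, `xid_cmp` and `remove_xid` are made-up names and not existing functions:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef uint32_t Xid;           /* hypothetical stand-in for TransactionId */

/* Comparator for bsearch(); uses the modulo-2^32 ordering of xids. */
static int xid_cmp(const void *pa, const void *pb)
{
    int32_t diff = (int32_t) (*(const Xid *) pa - *(const Xid *) pb);

    return (diff > 0) - (diff < 0);
}

/*
 * Remove xid from the sorted array if present.
 * Returns the new number of entries.
 */
static size_t remove_xid(Xid *xids, size_t n, Xid xid)
{
    Xid    *hit;
    size_t  idx;

    assert(xids != NULL);

    hit = (Xid *) bsearch(&xid, xids, n, sizeof(Xid), xid_cmp);
    if (hit == NULL)
        return n;               /* not in the array, nothing to do */

    /* This shift of the following entries is what a hash table would avoid. */
    idx = (size_t) (hit - xids);
    memmove(hit, hit + 1, (n - idx - 1) * sizeof(Xid));
    return n - 1;
}
```

The search is O(log n) and the shift is O(n), so for an array of a few dozen entries both are cheap.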

>> Also, I can't help thinking that this would be a lot simpler if we just
>> treated all subtransactions the same as top-level transactions. The only
>> problem with that is that there can be a lot of subtransactions, which
>> means that we'd need a large UnobservedXids array to handle the worst
>> case, but maybe it would still be acceptable?
>
> Yes, you see the problem. Without subtransactions, this would be a
> simple issue to solve.
>
> In one sense, I do as you say. When we make a snapshot we stuff the
> UnobservedXids into the snapshot *somewhere*. We don't know whether they
> are top level or subxacts. But we need a solution for when we run out of
> top-level xid places in the snapshot. Which has now been provided,
> luckily.
>
> If we have no upper bound on snapshot size then *all* backends would
> need a variable size snapshot. We must solve that problem or accept
> having people wait maybe minutes for a snapshot in worst case. I've
> found one way of placing a bound on the number of xids we need to keep
> in the snapshot. If there is another, better way of keeping it bounded I
> will happily adopt it. I spent about 2 weeks sweating this issue...

How about:

1. Keep all transactions and subtransactions in UnobservedXids.
2. If it fills up, remove from it all xids that the startup process
knows to be subtransactions and whose parents it knows, and update
subtrans. Mark the array as overflowed.

To take a snapshot, a backend simply copies the UnobservedXids array and
the flag. If it hasn't overflowed, a transaction is considered to be in
progress if it's in the array. If it has overflowed, and the xid is not
in the array, check subtrans.
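The check a backend would do against such a snapshot can be sketched as below. Everything here is hypothetical: `RecoverySnapshot`, `toy_subtrans` and `xid_in_progress` are illustration-only names, the tiny static table stands in for subtrans, and I'm assuming that "check subtrans" means walking up the parent chain until an ancestor is found in the array. The usual xmin/xmax range checks are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t Xid;           /* hypothetical stand-in for TransactionId */
#define INVALID_XID ((Xid) 0)
#define MAX_UNOBSERVED 8

/* What a backend copies out of shared memory: the array plus the flag. */
typedef struct
{
    Xid     xids[MAX_UNOBSERVED];
    size_t  nxids;
    bool    overflowed;
} RecoverySnapshot;

/* Toy stand-in for subtrans: 201 is a child of 200, 202 a child of 201. */
typedef struct { Xid child; Xid parent; } SubtransEntry;
static const SubtransEntry toy_subtrans[] = { {201, 200}, {202, 201} };

static Xid toy_parent_of(Xid xid)
{
    size_t i;

    for (i = 0; i < sizeof(toy_subtrans) / sizeof(toy_subtrans[0]); i++)
        if (toy_subtrans[i].child == xid)
            return toy_subtrans[i].parent;
    return INVALID_XID;         /* top-level, or not recorded */
}

static bool in_array(const RecoverySnapshot *snap, Xid xid)
{
    size_t i;

    for (i = 0; i < snap->nxids; i++)
        if (snap->xids[i] == xid)
            return true;
    return false;
}

static bool xid_in_progress(const RecoverySnapshot *snap, Xid xid)
{
    Xid p;

    assert(snap->nxids <= MAX_UNOBSERVED);

    if (in_array(snap, xid))
        return true;
    if (!snap->overflowed)
        return false;           /* the array is complete, so xid is done */

    /* Overflowed: xid may be an evicted subxid of a listed transaction. */
    for (p = toy_parent_of(xid); p != INVALID_XID; p = toy_parent_of(p))
        if (in_array(snap, p))
            return true;
    return false;
}
```

The subtrans lookup is only reached in the overflowed case, so the common path stays a plain array search.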

Note that the startup process sees all WAL records, so it can do
arbitrarily complex bookkeeping in backend-private memory, and only
expose the necessary parts in shared mem. For example, it can keep track
of the parent-child relationships of the xids in UnobservedXids, but the
backends taking snapshots don't need to know about that. For step 2 to
work, that's exactly what the startup process needs to keep track of.
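A rough sketch of how the startup process could implement step 2 with a private array kept in parallel with the shared one. All names (`SharedUnobserved`, `PrivateBookkeeping`, `evict_known_subxids`, `add_unobserved`) are invented for illustration, and `record_subtrans_parent` only counts calls where real code would update subtrans.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t Xid;           /* hypothetical stand-in for TransactionId */
#define INVALID_XID ((Xid) 0)
#define CAPACITY 4              /* tiny on purpose, to force an overflow */

/* The part exposed in shared memory and copied by backends. */
typedef struct
{
    Xid     xids[CAPACITY];
    size_t  nxids;
    bool    overflowed;
} SharedUnobserved;

/* Private to the startup process: parent[i] is the known parent of
 * xids[i], or INVALID_XID if it is top-level or not yet known. */
typedef struct
{
    Xid     parent[CAPACITY];
} PrivateBookkeeping;

static size_t subtrans_updates = 0;

/* Placeholder: real code would store child -> parent in subtrans. */
static void record_subtrans_parent(Xid child, Xid parent)
{
    (void) child;
    (void) parent;
    subtrans_updates++;
}

/* Step 2: drop all entries known to be subxids; returns how many went. */
static size_t evict_known_subxids(SharedUnobserved *sh, PrivateBookkeeping *pv)
{
    size_t i, kept = 0, evicted;

    for (i = 0; i < sh->nxids; i++)
    {
        if (pv->parent[i] != INVALID_XID)
        {
            record_subtrans_parent(sh->xids[i], pv->parent[i]);
            continue;
        }
        sh->xids[kept] = sh->xids[i];   /* compacting keeps xid order */
        pv->parent[kept] = INVALID_XID;
        kept++;
    }
    evicted = sh->nxids - kept;
    sh->nxids = kept;
    if (evicted > 0)
        sh->overflowed = true;
    return evicted;
}

/* Step 1: track every xid; falls back to step 2 when the array is full. */
static bool add_unobserved(SharedUnobserved *sh, PrivateBookkeeping *pv,
                           Xid xid, Xid parent)
{
    assert(sh->nxids <= CAPACITY);

    if (sh->nxids == CAPACITY && evict_known_subxids(sh, pv) == 0)
        return false;           /* full of top-level xids, nothing to evict */

    sh->xids[sh->nxids] = xid;
    pv->parent[sh->nxids] = parent;
    sh->nxids++;
    return true;
}
```

When and how the overflowed flag gets cleared again is left open in this sketch.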

For the startup process to know about the parent-child relationships,
we'll need something like the WAL changes you suggested. I'm not too
thrilled about adding a new field to all WAL records. It seems simpler
to just rely on the new WAL records written in AssignTransactionId(),
and we could write one only, say, every 100 subtransactions, if we make
the UnobservedXids array big enough (100*max_connections).

This isn't actually that different from your proposal. The big
difference is that instead of PROC entries and UnobservedXids, all
transactions are tracked in UnobservedXids, and instead of caching
subtransactions in the subxids array in PROC entries, they're cached in
UnobservedXids as well.

Another, completely different approach would be to forget about xid
arrays altogether, and change the way snapshots are taken: just do a
full memcpy of the clog between xmin and xmax. That might be pretty slow
if xmax-xmin is big, though.
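For a feel of what that would involve: clog stores two status bits per transaction, four transactions per byte, so the copy is about (xmax-xmin)/4 bytes. A toy sketch over a flat in-memory byte array, ignoring clog pages, locking and xid wraparound; `clog_copy_range` and `clog_copy_status` are hypothetical names:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef uint32_t Xid;           /* hypothetical stand-in for TransactionId */

enum
{
    XACTS_PER_BYTE = 4,         /* two status bits per transaction */
    BITS_PER_XACT = 2,
    STATUS_MASK = 0x03
};

/*
 * Copy the clog bytes covering xids in [xmin, xmax) into dst.
 * Returns the number of bytes copied; dst must be large enough.
 */
static size_t clog_copy_range(uint8_t *dst, const uint8_t *clog,
                              Xid xmin, Xid xmax)
{
    size_t first, end;

    assert(xmin <= xmax);       /* wraparound is not handled in this toy */

    first = xmin / XACTS_PER_BYTE;
    end = ((size_t) xmax + XACTS_PER_BYTE - 1) / XACTS_PER_BYTE;
    memcpy(dst, clog + first, end - first);
    return end - first;
}

/* Look up the 2-bit status of xid in a copy taken with the given xmin. */
static int clog_copy_status(const uint8_t *copy, Xid xmin, Xid xid)
{
    size_t   byte = xid / XACTS_PER_BYTE - xmin / XACTS_PER_BYTE;
    unsigned shift = (xid % XACTS_PER_BYTE) * BITS_PER_XACT;

    return (copy[byte] >> shift) & STATUS_MASK;
}
```

At four xids per byte, a range of a million xids between xmin and xmax means copying about 250 kB for every snapshot.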

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
