Re: Usage of epoch in txid_current

From: Stephen Frost <sfrost(at)snowman(dot)net>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Usage of epoch in txid_current
Date: 2017-12-05 18:01:43
Message-ID: 20171205180143.GN4628@tamriel.snowman.net
Lists: pgsql-hackers

Andres,

* Andres Freund (andres(at)anarazel(dot)de) wrote:
> On 2017-12-05 16:21:27 +0530, Amit Kapila wrote:
> > On Tue, Dec 5, 2017 at 2:49 PM, Alexander Korotkov
> > <a(dot)korotkov(at)postgrespro(dot)ru> wrote:
> > > On Tue, Dec 5, 2017 at 6:19 AM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> > >>
> > >> Currently, txid_current and friends export a 64-bit format of
> > >> transaction id that is extended with an “epoch” counter so that it
> > >> will not wrap around during the life of an installation. The epoch
> > >> value it uses is based on the epoch that is maintained by checkpoint
> > >> (aka only checkpoint increments it).
> > >>
> > >> Now if epoch changes multiple times between two checkpoints
> > >> (practically the chances of this are bleak, but there is a theoretical
> > >> possibility), then won't the computation of xids go wrong?
> > >> Basically, it can give the same value of txid after wraparound if the
> > >> checkpoint doesn't occur between the two calls to txid_current.
> > >
> > >
> > > AFAICS, yes, if the epoch changes multiple times between two checkpoints,
> > > then the computation will go wrong. And it doesn't look like a purely
> > > theoretical possibility to me, because I think I know of a couple of
> > > instances on the edge of this...
>
> I think it's not terribly likely in practice, due to the required WAL
> size. You need at least a commit record for each of 4 billion
> transactions. Each commit record is at least 24 bytes long, and in a
> non-artificial scenario you additionally would have a few hundred bytes
> of actual content of WAL. So we're talking about a distance of at least
> 0.5-2TB within a single checkpoint here. Not impossible, but not likely
> either.

At the bottom end, with a 30-minute checkpoint, that's about 300MB/s.
Certainly quite a bit, and we might have trouble getting there for other
reasons, but definitely something that can be accomplished with even a
single SSD these days.

> > Okay, it is quite strange that we haven't discovered this problem till
> > now. I think we should do something to fix it. One idea is that we
> > track epoch change in shared memory (probably in the same data
> > structure (VariableCacheData) where we track nextXid). We need to
> > increment it when the xid wraps around during xid allocation (in
> > GetNewTransactionId). Also, we need to make it persistent, which
> > means we need to log it in the checkpoint xlog record and write a
> > separate xlog record for the epoch change.
>
> I think it makes a fair bit of sense to not do the current crufty
> tracking of xid epochs. I don't really know how we got there, but it doesn't
> make terribly much sense. Don't think we need additional WAL logging
> though - we should be able to piggyback this onto the already existing
> clog logging.

Don't you mean xact logging? ;)

> I kinda wonder if we shouldn't just track nextXid as a 64bit integer
> internally, instead of bothering with tracking the epoch
> separately. Then we can "just" truncate it in the cases where it's
> stored in space constrained places etc.

This sounds reasonable to me, at least, but I've not been in these
depths much.

Thanks!

Stephen
