Re: Freezing without write I/O

From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Freezing without write I/O
Date: 2013-06-01 14:17:43
Message-ID: CA+U5nMJJB8QxYuM87UMmhjV1Bfu8LUAOaaqzhYn2DuK_bfM9mA@mail.gmail.com
Lists: pgsql-hackers

On 30 May 2013 19:39, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Thu, May 30, 2013 at 9:33 AM, Heikki Linnakangas
> <hlinnakangas(at)vmware(dot)com> wrote:
>> The reason we have to freeze is that otherwise our 32-bit XIDs wrap around
>> and become ambiguous. The obvious solution is to extend XIDs to 64 bits, but
>> that would waste a lot of space. The trick is to add a field to the page header
>> indicating the 'epoch' of the XID, while keeping the XIDs in tuple header
>> 32-bit wide (*).
>
> Check.
>
>> The other reason we freeze is to truncate the clog. But with 64-bit XIDs, we
>> wouldn't actually need to change old XIDs on disk to FrozenXid. Instead, we
>> could implicitly treat anything older than relfrozenxid as frozen.
>
> Check.
>
>> That's the basic idea. Vacuum freeze only needs to remove dead tuples, but
>> doesn't need to dirty pages that contain no dead tuples.
>
> Check.

Yes, this is the critical point. Large insert-only tables don't need
to be completely re-written twice.
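
A minimal sketch of the "implicitly frozen" rule quoted above, assuming
tuple XIDs have already been widened to 64 bits. The Xid64 type and the
function name are hypothetical, not existing PostgreSQL identifiers:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical 64-bit XID type; not an existing PostgreSQL typedef. */
typedef uint64_t Xid64;

/*
 * Treat a tuple as frozen when its widened xmin precedes the table's
 * relfrozenxid. The check only reads values, so a page that holds no
 * dead tuples is never dirtied by it.
 */
static bool
xid_is_implicitly_frozen(Xid64 tuple_xmin, Xid64 relfrozenxid)
{
    return tuple_xmin < relfrozenxid;
}
```

Because nothing is rewritten on disk, an insert-only table would not need
a second full rewrite just to stamp FrozenXid on its tuples.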

>> Since we're not storing 64-bit wide XIDs on every tuple, we'd still need to
>> replace the XIDs with FrozenXid whenever the difference between the smallest
>> and largest XID on a page exceeds 2^31. But that would only happen when
>> you're updating the page, in which case the page is dirtied anyway, so it
>> wouldn't cause any extra I/O.
>
> It would cause some extra WAL activity, but it wouldn't dirty the page
> an extra time.
>
>> This would also be the first step in allowing the clog to grow larger than 2
>> billion transactions, eliminating the need for anti-wraparound freezing
>> altogether. You'd still want to truncate the clog eventually, but it would
>> be nice to not be pressed against the wall with "run vacuum freeze now, or
>> the system will shut down".
>
> Interesting. That seems like a major advantage.
>
>> (*) "Adding an epoch" is inaccurate, but I like to use that as my mental
>> model. If you just add a 32-bit epoch field, then you cannot have xids from
>> different epochs on the page, which would be a problem. In reality, you
>> would store one 64-bit XID value in the page header, and use that as the
>> "reference point" for all the 32-bit XIDs on the tuples. See existing
>> convert_txid() function for how that works. Another method is to store the
>> 32-bit xid values in tuple headers as offsets from the per-page 64-bit
>> value, but then you'd always need to have the 64-bit value at hand when
>> interpreting the XIDs, even if they're all recent.
>
> As I see it, the main downsides of this approach are:
>
> (1) It breaks binary compatibility (unless you do something to
> provide for it, like put the epoch in the special space).
>
> (2) It consumes 8 bytes per page. I think it would be possible to get
> this down to say 5 bytes per page pretty easily; we'd simply decide
> that the low-order 3 bytes of the reference XID must always be 0.
> Possibly you could even do with 4 bytes, or 4 bytes plus some number
> of extra bits.

Yes, the idea of having a "base Xid" on every page is complicated and
breaks compatibility. The same idea can work well if we do this via tuple
headers.
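
For reference, the per-page "base Xid" scheme discussed above could look
roughly like the sketch below. It is an illustration only: the function
names are hypothetical, special XIDs are ignored, and every tuple XID is
assumed to lie within 2^31 of the page's reference value. The last two
functions show Robert's 5-byte variant, where the low-order 3 bytes of
the reference XID are required to be zero:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper (not PostgreSQL code): widen a 32-bit tuple XID
 * using the page's 64-bit reference XID. Assumes the tuple XID lies
 * within 2^31 of the reference; special XIDs are not handled.
 */
static uint64_t
expand_xid(uint32_t tuple_xid, uint64_t page_ref_xid)
{
    /* Signed distance from the reference's low 32 bits, modulo 2^32. */
    int32_t delta = (int32_t) (tuple_xid - (uint32_t) page_ref_xid);

    return page_ref_xid + (uint64_t) (int64_t) delta;
}

/*
 * 5-byte variant: if the low-order 3 bytes of the reference are always
 * zero, only the upper 40 bits need to be stored in the page header.
 */
static uint64_t
pack_ref_xid(uint64_t page_ref_xid)
{
    assert((page_ref_xid & 0xFFFFFF) == 0);
    return page_ref_xid >> 24;      /* value fits in 5 bytes */
}

static uint64_t
unpack_ref_xid(uint64_t stored)
{
    return stored << 24;
}
```

Note that the reader must have the 64-bit reference at hand for every
interpretation of a tuple XID, which is part of the complexity I am
referring to.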

> (3) You still need to periodically scan the entire relation, or else
> have a freeze map as Simon and Josh suggested.

I don't think that is needed with this approach.

(The freeze map was Andres' idea, not mine. I just accepted it as what
I thought was the only way forward. Now I see other ways.)

> The upsides of this approach as compared with what Andres and I are
> proposing are:
>
> (1) It provides a stepping stone towards allowing indefinite expansion
> of CLOG, which is quite appealing as an alternative to a hard
> shut-down.

I would be against expansion of the CLOG beyond its current size. If
we have removed all aborted rows and marked hints, then we don't need
the CLOG values and can trim that down.

I don't mind the hints; it's the freezing we don't need.

>> convert_txid() function for how that works. Another method is to store the
>> 32-bit xid values in tuple headers as offsets from the per-page 64-bit
>> value, but then you'd always need to have the 64-bit value at hand when
>> interpreting the XIDs, even if they're all recent.

You've touched here on the idea of putting the epoch in the tuple
header, which is where what I posted comes in. We don't need
anything at page level; we just need something on each tuple.

Please can you look at my recent post on how to put this in the tuple header?

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
