Re: Multixid hindsight design

From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
Subject: Re: Multixid hindsight design
Date: 2015-06-24 15:30:47
Message-ID: CANP8+jKppa-S+qBJQtDmK5SYJcsRacHKVVE1TWZyPDaP_3H6sw@mail.gmail.com
Lists: pgsql-hackers

On 24 June 2015 at 14:57, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:

> On Fri, Jun 5, 2015 at 10:46 AM, Robert Haas <robertmhaas(at)gmail(dot)com>
> wrote:
> > It would be a great deal nicer if we didn't have to keep ANY of the
> > transactional data for a tuple around once it's all-visible. Heikki
> > defined ephemeral as "only needed when xmin or xmax is in-progress",
> > but if we extended that definition slightly to "only needed when xmin
> > or xmax is in-progress or committed but not all-visible" then the
> > amount of non-ephemeral data in the tuple header is 5 bytes (infomasks +
> > t_hoff).
>
> OK, I was wrong here: if you only have that stuff, you can't
> distinguish between a tuple that is visible to everyone and a tuple
> that is visible to no one. I think the minimal amount of data we need
> in order to distinguish visibility once no relevant transactions are
> in progress is one XID: either XMIN, if the tuple was never updated at
> all or only by the inserting transaction or one of its subxacts; or
> XMAX, if the inserting transaction committed. The other visibility
> information -- including (1) the other of XMIN and XMAX, (2) CMIN and
> CMAX, and (3) the CTID -- are only interesting until the transactions
> involved are no longer running and, if they committed, visible to all
> running transactions.
>
> Heikki's proposal is basically to merge the 4-byte CID field and the
> first 4 bytes of the CTID that currently store the block number into
> one 8-byte field that can store a pointer into this new TED structure.
> And after mulling it over, that sounds pretty good to me. It's true
> (as has been pointed out by several people) that the TED will need to
> be persistent because of prepared transactions. But it would still be
> a big improvement over the status quo, because:
>
> (1) We would no longer need to freeze MultiXacts. TED wouldn't need
> to be frozen either. You'd just truncate it whenever RecentGlobalXmin
> advances.
>
> (2) If the TED becomes horribly corrupted, you can recover by
> committing or aborting any prepared transactions, shutting the system
> down, and truncating it, with no loss of data integrity. Nothing in
> the TED is required to determine whether tuples are visible to an
> unrelated transaction - you only need it (a) to determine whether
> tuples are visible to a particular command within a transaction that
> has inserted, updated, or deleted the tuple and (b) determine whether
> tuples are locked.
>
> (3) As a bonus, we'd eliminate combo CIDs, because the TED could have
> space to separately store CMIN and CMAX. Combo CIDs required special
> handling for logical decoding, and they are one of the nastier
> barriers to making parallelism support writes (because they are stored
> in backend-local memory of unbounded size and therefore can't easily
> be shared with workers), so it wouldn't be very sad if they went away.
>
> I'm not quite sure how to decide whether something like this is worth (a)
> the work and (b) the risk of creating new bugs, but the more I think
> about it, the more the principle of the thing seems sound to me.
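
A minimal sketch of the field merge described in the quoted text, in
Python purely for illustration. The function names and the little-endian
layout are assumptions, not anything from the proposal; the only point it
shows is that a 4-byte CID plus a 4-byte block number occupy the same 8
bytes as a single 64-bit TED pointer:

```python
import struct

def pack_cid_and_block(cid: int, block: int) -> bytes:
    """Today's layout (simplified): 4-byte CID followed by 4-byte block number."""
    return struct.pack("<II", cid, block)

def pack_ted_pointer(ptr: int) -> bytes:
    """Proposed reuse (hypothetical encoding): the same 8 bytes as one pointer."""
    return struct.pack("<Q", ptr)

def unpack_ted_pointer(raw: bytes) -> int:
    """Read the 8 bytes back as a single unsigned 64-bit value."""
    (ptr,) = struct.unpack("<Q", raw)
    return ptr
```

Both encodings are exactly 8 bytes, so the tuple header would not grow.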

Splitting multitrans into persistent (xmax) and ephemeral (TED) is
something I already proposed so I support the concept; TED is a much better
suggestion, so I support TED.

Your addition of removing combocids is good also, since everything is
public.

I think we need to see a detailed design and we also need to understand the
size of this new beast. I'm worried it might become very big very quickly,
causing problems for us in other ways. We would need to be certain that
truncation can actually occur reasonably frequently and that there are no
edge cases that cause it to bloat.
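
To illustrate the kind of edge case meant here, the following toy model
(Python, not PostgreSQL code) assumes the TED holds one entry per
transaction and is truncated up to the oldest running XID, as a stand-in
for RecentGlobalXmin. ToyTED, horizon and simulate are hypothetical
names, and XID wraparound is ignored:

```python
from dataclasses import dataclass, field

@dataclass
class ToyTED:
    """Hypothetical model: one TED entry per transaction ID."""
    entries: dict = field(default_factory=dict)

    def record(self, xid: int, info: str) -> None:
        self.entries[xid] = info

    def truncate(self, horizon_xid: int) -> int:
        """Drop entries older than the horizon; return how many were dropped."""
        old = [x for x in self.entries if x < horizon_xid]
        for x in old:
            del self.entries[x]
        return len(old)

def horizon(running_xids, next_xid):
    """Oldest XID still running, or next_xid when nothing is running."""
    return min(running_xids, default=next_xid)

def simulate(long_running: bool, n: int = 1000) -> int:
    """Run n short transactions; return the TED size at the end."""
    ted = ToyTED()
    running = {1} if long_running else set()  # XID 1 never finishes
    for xid in range(2, n + 2):
        running.add(xid)
        ted.record(xid, "lock/cid info")
        running.discard(xid)  # the short transaction commits immediately
        ted.truncate(horizon(running, xid + 1))
    return len(ted.entries)
```

In this model a single transaction that stays open pins the horizon, so
the TED keeps every entry written after it started, while the same
workload without it leaves the TED empty.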

Though TED sounds nice, the way to avoid another round of on-disk bugs is
by using a new kind of testing, not simply by moving the bits around.

It might be argued that we are increasing the diagnostic/forensic
capabilities by making CIDs more public. We can use that...

The good thing I see from TED is it allows us to test the on-disk outcome
of concurrent activity. Currently we have isolationtester, but that is not
tied in any way to the on-disk state, which allows the situation where
isolationtester passes yet the on-disk state is corrupted. We should
specify the on-disk tuple representation as a state machine and work out
how to recheck that the new on-disk state matches the state transition
that we performed.
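
A rough sketch of what such a recheck could look like, in Python. The
states and transitions below are a deliberately simplified, hypothetical
model of a tuple's life cycle, not the real set of infomask combinations;
how the observed state would be decoded from the page is also left as an
assumption:

```python
from enum import Enum, auto

class TupleState(Enum):
    INSERT_IN_PROGRESS = auto()
    LIVE = auto()
    INSERT_ABORTED = auto()
    LOCKED = auto()
    DELETE_IN_PROGRESS = auto()
    DELETED = auto()

S = TupleState

# Hypothetical transition table: (state before, operation) -> state after.
TRANSITIONS = {
    (S.INSERT_IN_PROGRESS, "commit"): S.LIVE,
    (S.INSERT_IN_PROGRESS, "abort"): S.INSERT_ABORTED,
    (S.LIVE, "lock"): S.LOCKED,
    (S.LIVE, "delete"): S.DELETE_IN_PROGRESS,
    (S.LOCKED, "unlock"): S.LIVE,
    (S.LOCKED, "delete"): S.DELETE_IN_PROGRESS,
    (S.DELETE_IN_PROGRESS, "commit"): S.DELETED,
    (S.DELETE_IN_PROGRESS, "abort"): S.LIVE,
}

def expected_state(before: TupleState, op: str) -> TupleState:
    """State the model predicts after applying op; ValueError if not allowed."""
    try:
        return TRANSITIONS[(before, op)]
    except KeyError:
        raise ValueError(f"illegal transition: {before.name} + {op}") from None

def recheck(before: TupleState, op: str, observed: TupleState) -> bool:
    """True if the state read back from disk matches the model's prediction."""
    return expected_state(before, op) is observed
```

A test driver would apply an operation, read the tuple back, map it to
one of these states, and fail whenever recheck() returns False or the
operation is not in the table at all.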

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
