Re: UNDO and in-place update

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Bruce Momjian <bruce(at)momjian(dot)us>, "Tsunakawa, Takayuki" <tsunakawa(dot)takay(at)jp(dot)fujitsu(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: UNDO and in-place update
Date: 2017-01-09 18:17:02
Message-ID: CA+TgmobgeujZ_OFS2bKnqM=Z35dCJ3qrvUf4jmoky8oi0qyP7Q@mail.gmail.com
Lists: pgsql-hackers

On Mon, Jan 9, 2017 at 7:50 AM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> Okay, I see the point. I think here the UNDO pointer can be marked
> invalid either during page pruning or during vacuum.

I explicitly want to avoid that, because it means re-dirtying the
page. The UNDO pointer becomes invalid by removing the UNDO log to
which it points; the page itself does not need to be dirtied for the
UNDO pointer to become invalid.
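
A minimal sketch of how a reader could test that, assuming an ordered
64-bit UndoRecPtr and a shared oldest_retained_undo value that advances
at discard time (both names are my own illustration, not the patch's):

    /*
     * Hypothetical sketch: an UNDO pointer goes "invalid" when the undo
     * log it references is discarded, not by rewriting the heap page.
     * Readers compare the pointer against shared discard state, so no
     * page needs to be dirtied.
     */
    static bool
    undo_ptr_is_valid(UndoRecPtr ptr)
    {
        /* assume oldest_retained_undo advances whenever undo is discarded */
        return ptr >= oldest_retained_undo;
    }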

> Another point which I think has not been discussed till now is how
> the locking will work. As of now, we first lock the tuple and then
> the XID on which we have to wait. In the new design, we don't know
> the XID which has updated the tuple, and I am not sure there is any
> easy way to find that information.

Yes, there is: you can get it from the UNDO pointer. Each UNDO
pointer is of the form <epoch, XID, offset>. If the page has a regular
UNDO pointer, then you can get the XID for any modified tuple right
out of the UNDO pointer itself. If the page has a TPD pointer, the
TPD contains a series of UNDO pointers and you can get all of the XIDs
that have touched the page by iterating through those UNDO pointers.
There is some work to do to make it cheap to find which XID pertains
to which updated tuple, but I think that problem can be solved by
choosing carefully what to store in the TPD. I think, in fact, that
for performance it's absolutely essential to solve that problem well.
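
For illustration, one plausible shape for such a pointer (the field
names and widths here are my assumptions, not something specified in
this thread):

    typedef struct UndoPtr
    {
        uint32          epoch;      /* XID epoch, to disambiguate wraparound */
        TransactionId   xid;        /* transaction that wrote the undo record */
        uint64          offset;     /* record's location within that xact's log */
    } UndoPtr;

    /* The lock-wait path can recover the blocking XID with no extra lookup. */
    static TransactionId
    undo_ptr_xid(UndoPtr ptr)
    {
        return ptr.xid;
    }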

> One idea could be that we have some fixed number of
> slots (I think we can make it variable as well, but for simplicity,
> let's consider it as fixed) in the page header which will store the
> offset to the transaction ID inside a TPD entry of the page. Consider
> a TPD entry of a page that contains four transactions: we will just
> store enough information in the heap page header to reach the
> transaction ID for each of those four transactions. I think each such
> page header slot could be three or four bits long depending upon how
> many concurrent transactions we want to support on a page, after
> which a new transaction has to wait (I think in most workloads
> supporting eight simultaneous transactions on a page should be
> sufficient). Then we can have an additional byte (or less than a
> byte) in the tuple header to store lock info, which is nothing but an
> offset to the slot in the page header. We might find some other
> locking technique as well, but I think keeping it the same as the
> current scheme has benefits.

Yes, something like this can be done. You don't really need any new
page-level header data, because you can get the XIDs from the TPD
entry (or from the page itself if there's only one). But you could
expand the single "is-modified" bit that I've proposed adding to each
tuple to multiple bits. 0 means not recently modified. 1 means
modified by the first or only transaction that has recently modified
the page. 2 means modified by the second transaction that has
recently modified the page. Etc.
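
Something along these lines, sketched with invented names (TPDEntry,
TUPLE_XACT_MASK, and the field layout are mine, just to make the
lookup concrete):

    /*
     * Per-tuple indicator: 0 = not recently modified; k > 0 = modified
     * by the k-th transaction that has recently modified the page.
     */
    #define TUPLE_XACT_BITS     3                           /* up to 7 recent xacts */
    #define TUPLE_XACT_MASK     ((1 << TUPLE_XACT_BITS) - 1)

    static TransactionId
    tuple_recent_modifier(uint8 infobits, TPDEntry *tpd, TransactionId page_xid)
    {
        int     slot = infobits & TUPLE_XACT_MASK;

        if (slot == 0)
            return InvalidTransactionId;    /* not recently modified */
        if (tpd == NULL)
            return page_xid;                /* one recent xact: take it from
                                             * the page's regular UNDO pointer */
        return tpd->xids[slot - 1];         /* otherwise index the TPD's XID array */
    }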

What I was thinking about doing instead is storing an array in the TPD
containing the same information. There would be one byte, or half a
byte, or whatever, per TID, and it would contain the index of the XID
in the TPD that had most recently modified or locked that TID. Your
solution might be better, though, at least for cases where the number
of transactions that have recently modified the page is small.
However, I'm not
totally sure. I think it's important to keep the tuple headers VERY
small, like 3 bytes. Or 2 bytes. Or maybe even variable size but
only 1 byte in common cases. So I expect bit space in those places to
be fairly scarce and precious.
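
A rough shape for that TPD-side array, again with illustrative names
and sizes only (MAX_TPD_XACTS is invented; MaxHeapTuplesPerPage is the
existing heap limit):

    typedef struct TPDEntry
    {
        uint16          nxids;                  /* XIDs recently active on the page */
        TransactionId   xids[MAX_TPD_XACTS];    /* one per recent UNDO pointer */
        uint8           last_xact[MaxHeapTuplesPerPage];    /* TID -> xids[] index */
    } TPDEntry;

    /* Which XID most recently modified or locked line pointer 'off'? */
    static TransactionId
    tpd_modifier_for_tid(TPDEntry *tpd, OffsetNumber off)
    {
        return tpd->xids[tpd->last_xact[off - 1]];
    }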

Now that might be the wrong idea -- maybe it's worth expanding that
header in order to speed things up. On the other hand, having to read
more database pages in order to process the same amount of data is
*really* expensive, especially when you start talking about data sets
that probably don't fit in memory, like a 10 TB or 100 TB table. If
you've got 100 tuples per page (~81 bytes per tuple), increasing the
size of a tuple by 1 byte causes that 100 TB table to increase in size
by about 1.2 TB (modulo padding effects). An extra byte of space (or
even an extra ten bytes) doesn't matter much for a table with a
million tuples in it because the whole table is going to fit in memory
either way, but when you have 10+ billion tuples those extra bytes
start to matter a LOT. And the problem is not only or even primarily
the space consumption - the issue is that all of your queries run that
much slower because of the extra I/O.
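
Spelling out that arithmetic (assuming the default 8192-byte page):

    bytes per tuple  : 8192 bytes/page / 100 tuples/page  =  ~81 bytes
    relative growth  : 1 extra byte / 81 bytes            =  ~1.23%
    absolute growth  : 100 TB * 1.23%                     =  ~1.2 TB

and every full scan of the table pays that extra ~1.2 TB in I/O.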

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
