undo: zedstore vs. zheap

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: undo: zedstore vs. zheap
Date: 2019-06-03 15:53:35
Message-ID: CA+TgmoYeTeQSmALox0PmSm5Gh03oe=UNjhmL+K+btofY_U2jFQ@mail.gmail.com
Lists: pgsql-hackers

Hi,

On Saturday, we had a nice in-person conversation about the
requirements that zedstore has for an undo facility vs. the
requirements that zheap has vs. the current design of the undo patch
set. The people present were: Heikki Linnakangas, Amit Kapila, Thomas
Munro, Kuntal Ghosh, Andres Freund, and me. The remainder of this
email consists of the notes that I took during that conversation, with
some subsequent editing that I did to try to make them more broadly
understandable.

zedstore needs a very fast way to determine whether an undo pointer is
old enough that it doesn't matter. Perhaps we should keep a
backend-local cache of discard pointers so that we can test an undo
pointer against the discard horizon without needing to touch shared
memory.
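
To make that concrete, the check might look roughly like the sketch
below. Everything here (the type names, the fixed-size cache,
GetSharedDiscardPointer, and passing the log number separately instead
of extracting it from the undo pointer) is an invented stand-in for
illustration, not the actual patch set API:

#include <stdbool.h>
#include <stdint.h>

typedef uint64_t UndoRecPtr;            /* stand-in for the real type */

#define MAX_UNDO_LOGS 1024              /* arbitrary illustrative bound */

/* Backend-local cache of the last discard pointer seen per undo log. */
static UndoRecPtr cached_discard[MAX_UNDO_LOGS];

/* Pretend this is the slower lookup that reads shared memory. */
static UndoRecPtr shared_discard[MAX_UNDO_LOGS];

static UndoRecPtr
GetSharedDiscardPointer(int logno)
{
    return shared_discard[logno];
}

/*
 * Is the undo record at 'ptr' already behind the discard horizon, i.e.
 * old enough that visibility checks can ignore it?  Discard pointers
 * only move forward, so a stale cached value can give a false "not yet
 * discarded" answer but never a false "discarded" one; we only refresh
 * from shared memory when the fast path fails.
 */
static bool
UndoRecPtrIsDiscarded(int logno, UndoRecPtr ptr)
{
    if (ptr < cached_discard[logno])
        return true;                    /* fast path: no shared memory */

    cached_discard[logno] = GetSharedDiscardPointer(logno);
    return ptr < cached_discard[logno];
}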

zedstore needs a very fast way of obtaining the XID when it looks up
an undo pointer. It seems inefficient to store the full XID
(especially an 8-byte FullTransactionId) in every undo record, but
storing it only in the transaction header makes getting the XID
prohibitively expensive. Heikki had the idea of storing a flag in
each undo record saying 'same X as the first tuple on the page', where
X might be XID, CID, relation OID, block number, etc. That seems like
a good idea. We'd need to decide exactly how many such flags to have
and which fields they cover; and it's probably best not to depend on
an undo record for another transaction that might be independently
discarded. Another option is to put a "default" for each of these
values in the page header and then store a value in individual undo
records only if it differs from the value in the page header. We
don't have a way to apply a similar optimization to whatever
individual clients of the undo machinery choose to store in the undo
record's payload; that would be nice to have, if only we had a good
idea of how to do it.
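
As a strawman for how the per-field compression might look (the names
and the exact flag layout below are invented for illustration, not
what the patch set actually does):

#include <stdint.h>

typedef uint64_t FullTransactionId;     /* stand-ins for the real types */
typedef uint32_t CommandId;
typedef uint32_t Oid;
typedef uint32_t BlockNumber;

/* One bit per field that can fall back to the page-level default. */
#define UREC_SAME_XID       0x01
#define UREC_SAME_CID       0x02
#define UREC_SAME_RELOID    0x04
#define UREC_SAME_BLOCK     0x08

/* Defaults stored once in the undo page header. */
typedef struct UndoPageDefaults
{
    FullTransactionId xid;
    CommandId   cid;
    Oid         reloid;
    BlockNumber block;
} UndoPageDefaults;

/*
 * Effective XID of a record: take the page default when the flag says
 * "same", otherwise the explicit value stored in the record itself
 * (fetching that value from the packed record is elided here).
 */
static FullTransactionId
UndoRecordGetXid(uint8_t flags, const UndoPageDefaults *defaults,
                 FullTransactionId explicit_xid)
{
    return (flags & UREC_SAME_XID) ? defaults->xid : explicit_xid;
}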

zedstore intends to store an undo pointer per tuple, whereas the
current zheap code stores an undo pointer per transaction slot.
Therefore, zheap wants an undo record to point to the previous undo
record for that transaction slot; whereas zedstore wants an undo
record to point to the previous undo record for the same tuple (which
might not belong to the same transaction). The undo code that assumes
per-block chaining (uur_blkprev) needs to be made more generic.
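
In other words, the chain field should just mean "previous undo record
in whatever chain the table AM maintains", something like the sketch
below, where uur_prevundo is only a hypothetical rename of the
existing uur_blkprev field:

#include <stdint.h>

typedef uint64_t UndoRecPtr;

typedef struct UnpackedUndoRecord
{
    /* ... other header fields ... */

    /*
     * Previous undo record in the AM-defined chain: for zheap, the
     * previous record for the same transaction slot; for zedstore,
     * the previous record for the same tuple, which may belong to a
     * different transaction.
     */
    UndoRecPtr  uur_prevundo;
} UnpackedUndoRecord;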

For either zedstore or zheap, newly-inserted tuples could potentially
point to an XID/CID rather than an undo record pointer, because
visibility checks don't need to look at the actual undo record. An
undo record still needs to be written in case we abort, but we don't
need to be able to refer to it for visibility purposes.
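
A rough sketch of what that could look like at the tuple level
(layout and names invented purely for illustration):

#include <stdbool.h>
#include <stdint.h>

typedef uint64_t UndoRecPtr;
typedef uint32_t TransactionId;
typedef uint32_t CommandId;

typedef struct TupleVisibilityInfo
{
    bool        is_inserted_only;   /* true: xid/cid below are valid */
    union
    {
        struct
        {
            TransactionId xid;      /* inserting transaction */
            CommandId     cid;      /* inserting command */
        }           insert;
        UndoRecPtr  undo_ptr;       /* otherwise: follow the undo chain */
    }           u;
} TupleVisibilityInfo;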

zedstore and zheap have different batching requirements. Both want to
process undo records in TID order, but zedstore doesn't care about
pages. The current undo patch set calls the RMGR-specific callback
once per page; that needs to be generalized. Is it possible that some
other table AM might want to sort by something other than TID?
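
One possible generalization is for the undo layer to hand the whole
sorted batch to the rmgr callback and let the AM decide how to group
it, zheap by block and zedstore simply in TID order. The names below
are invented, not the current patch set interface:

#include <stddef.h>

/* Opaque here; in reality this is the unpacked undo record structure. */
struct UnpackedUndoRecord;

typedef struct UndoRecordBatch
{
    struct UnpackedUndoRecord **records;    /* sorted as the AM requested */
    size_t      nrecords;
} UndoRecordBatch;

/* Rmgr callback invoked once per batch instead of once per page. */
typedef void (*UndoApplyCallback) (const UndoRecordBatch *batch);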

Right now, zedstore stores a separate btree for each column, and an
extra btree for the visibility information, but it could have column
groups instead. If you put all of the columns and the visibility
information into a single column group, you'd have a row store. How
would the performance of that solution compare with zheap?

Maybe zheap should forget about having transaction slots, and just
store an undo pointer in each tuple. That would be worse in cases
such as bulk loading, where all the tuples on the page are modified by
the same transaction ID. We could optimize that case by using
something like a transaction slot, but then there's a problem if, say,
every tuple on the page is updated or deleted by a separate
transaction: the tuples need to get bigger to accommodate separate
undo pointers, and the page might overflow. Perhaps we should design
a solution that allows for the temporary use of overflow space in such
cases.

Tuple locking is complicated for both zheap and zedstore. Storing the
locker XIDs in the tuple or page is tempting, but you can run out of
space; where do you put the extras? Writing an undo record per lock
is also tempting, but that could generate a very large amount of undo
if there are many transactions that are each locking a whole table,
whereas the heap doesn't have that problem, because it uses
MultiXacts. On the other hand, in the current heap, it's easily
possible for N transactions to burn through O(N^2) MultiXactIds, which
is worse than an undo record per lock.

A perfect system would avoid permanently bloating the table when many
transactions each take many locks, but it's hard to design a system
that has that property AND ALSO is maximally compact for a table that
is bulk-loaded and then never updated again. (Perhaps it's OK for the
table to expand a little bit if we find that we need to make space for
a MultiXactId pointer or similar in each tuple? Thomas later dubbed
this doubtless-uncontroversial design non-in-place select, and we're
all looking forward to review comments on noninplace_select.c.)

Heikki proposed an out-of-line btree of tuple locks, with compression
to represent locks on TID ranges via a single entry, stored in memory
if small and spilling to disk if large. It could be discarded on
startup, except for locks held by prepared transactions. Updates might
not need to enter their own tuple locks in this btree, but they would
need to propagate key share locks forward to new TIDs.
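
For illustration, an entry in such a lock table might look roughly
like this; everything here is hypothetical, including the idea that a
single entry covers a contiguous TID range:

#include <stdint.h>

typedef uint32_t TransactionId;
typedef uint32_t Oid;
typedef uint64_t ZSTid;             /* stand-in for a (logical) tuple ID */

/* Mirrors the SQL row-lock strengths: FOR KEY SHARE .. FOR UPDATE. */
typedef enum TupleLockMode
{
    TUPLOCK_KEY_SHARE,
    TUPLOCK_SHARE,
    TUPLOCK_NO_KEY_UPDATE,
    TUPLOCK_UPDATE
} TupleLockMode;

/*
 * One btree entry: a single lock of a given strength, held by a single
 * transaction, on a contiguous range of TIDs in one relation.  A lock
 * on an individual tuple is just the case first_tid == last_tid.
 */
typedef struct TupleLockEntry
{
    Oid           reloid;       /* which relation */
    ZSTid         first_tid;    /* first locked TID (inclusive) */
    ZSTid         last_tid;     /* last locked TID (inclusive) */
    TransactionId locker;       /* transaction holding the lock */
    TupleLockMode mode;         /* lock strength */
} TupleLockEntry;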

Robert and Thomas came up with the idea of having a special kind of
undo record that lives outside of a transaction; a backend could
attach to an additional undo log, insert one of these special records,
and then detach. These special undo records would have a special rule
for when they could be discarded; an rmgr callback would decide when
it's OK to discard one. This could serve as a replacement for
MultiXactIds; you could store a collection of XIDs in the payload of
one of these special records. (Andres was not very impressed by this
idea.) Later, after Heikki left, we talked about perhaps also
allowing these special undo records to have a special rule for when
undo actions are executed, so that you could use the undo framework
for on-commit actions. It doesn't seem desirable to have to perform
on-commit actions frequently; one of the motivating ideas behind zheap
is that commits should be as cheap as possible, even if that costs
something in the abort case. But it might be useful to have the
option available for use in rare or exceptional cases.
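
In rough terms, the special record and its discard rule might look
like the sketch below; again, the names and layout are invented to
illustrate the idea, not an actual design:

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t UndoRecPtr;

/*
 * A special undo record that is not tied to any transaction's commit
 * or abort.  The payload could hold, say, a list of locker XIDs,
 * making it a possible replacement for a MultiXact.
 */
typedef struct SpecialUndoRecord
{
    uint8_t     rmid;           /* which rmgr owns this record */
    size_t      payload_len;    /* length of payload in bytes */
    char        payload[];      /* flexible array member */
} SpecialUndoRecord;

/*
 * Per-rmgr callback consulted by the discard worker: may this record
 * be thrown away yet?  For a MultiXact-style payload, it might check
 * that none of the member XIDs can still be of interest to anyone.
 */
typedef bool (*SpecialUndoDiscardCheck) (UndoRecPtr ptr,
                                         const SpecialUndoRecord *rec);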

After this discussion, Heikki is thinking that he might just allow
each tuple in zedstore to store both an undo log pointer and a
MultiXactId, elided when not needed. The MultiXact system would need
to be updated to make MultiXactIds 64 bits wide in order to avoid needing
to freeze. No one seemed keen to build a new storage engine that
still requires freezing.

When subtransactions are used, the undo system intends to use the
toplevel XID for everything, rather than requiring additional XIDs for
subtransactions. After some discussion, this seems like it should
work well for both zheap and zedstore.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
