Re: UNDO and in-place update

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
Cc: Peter Geoghegan <pg(at)heroku(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: UNDO and in-place update
Date: 2016-11-24 16:39:45
Message-ID: CA+TgmobNFH7qDzd2b-O5Y3cCCVr1uRNKULx+EhBNfMJ-kTgZtA@mail.gmail.com
Lists: pgsql-hackers

On Wed, Nov 23, 2016 at 5:18 PM, Thomas Munro
<thomas(dot)munro(at)enterprisedb(dot)com> wrote:
> On Wed, Nov 23, 2016 at 6:01 PM, Peter Geoghegan <pg(at)heroku(dot)com> wrote:
>> * Our behavior with many duplicates in secondary indexes is pretty bad
>> in general, I suspect.
>
> From the pie-in-the-sky department: I believe there are
> snapshot-based systems that don't ever have more than one entry for a
> logical row in a btree. Instead, they do UNDO-log based snapshot
> reconstruction for btrees too, generalising what is being discussed
> for heap tables in this thread. That means that whenever you descend
> a btree, at each page you have to be prepared to go and look in the
> UNDO log if necessary on a page-by-page basis. Alternatively, some
> systems use a shadow table rather than an UNDO log for the old
> entries, but still privilege queries that need the latest version by
> keeping only those in the btree. I don't know the details but suspect
> that both of those approaches might require CSN (= LSN?) based
> snapshots, so that you can make page-level visibility determinations
> while traversing the 'current' btree based on a page header. That
> would reduce both kinds of btree bloat: the primary kind being
> entries for dead tuples that still exist in the heap, and the
> secondary kind being the resultant permanent btree fragmentation that
> remains even after vacuum removes them. Presumably once you have all
> that, you can allow index-only-scans without a visibility map. Also,
> I suspect that such a "3D time travelling" btree would be a
> requirement for index ordered tables in a snapshot-based RDBMS.
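
If I'm reading that right, the descent-time check would be something
like the sketch below. To be clear, this is purely illustrative: the
CSN-stamped special area, the CommitSeqNo type, and
undo_reconstruct_page() are all invented names, not anything that
exists today.

typedef uint64 CommitSeqNo;          /* assumed CSN type, not in core */

typedef struct TimeTravelOpaqueData /* invented per-page special area */
{
    CommitSeqNo last_csn;            /* CSN of the newest change on the page */
} TimeTravelOpaqueData;

/*
 * Return the page image a scan with snapshot_csn should see: the current
 * page if nothing newer than the snapshot has touched it, otherwise an
 * older image rebuilt from the UNDO log (undo_reconstruct_page() is a
 * made-up placeholder for that machinery).
 */
static Page
btree_page_for_snapshot(Page page, CommitSeqNo snapshot_csn)
{
    TimeTravelOpaqueData *opaque =
        (TimeTravelOpaqueData *) PageGetSpecialPointer(page);

    if (opaque->last_csn <= snapshot_csn)
        return page;                 /* fast path: whole page visible as-is */

    /* Something on the page is too new; rebuild an older version. */
    return undo_reconstruct_page(page, snapshot_csn);
}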

I don't really understand how a system of this type copes with page
splits. Suppose there are entries 1000, 2000, ..., 1e6 in the btree.
I now start two overlapping transactions, one inserting all the
positive odd values < 1e6 and the other inserting all the positive
even values < 1e6 that are not already present. The insertions are
interleaved with each other. Once both transactions have finished
inserting, what does the 3D time traveling btree look like?
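
Just to spell out what those two transactions leave behind in key
order, here's a standalone toy (plain C, nothing to do with the server
code): nearly every pair of adjacent keys ends up owned by a different
transaction, with a pre-existing entry only once every thousand keys.

#include <stdio.h>

int
main(void)
{
    /* Print who owns each of the first dozen keys after both inserts. */
    for (int key = 1; key <= 12; key++)
    {
        const char *owner;

        if (key % 1000 == 0)
            owner = "pre-existing entry";
        else if (key % 2 == 1)
            owner = "odd-inserting transaction";
        else
            owner = "even-inserting transaction";

        printf("%d\t%s\n", key, owner);
    }
    return 0;
}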

The problem I'm getting at here is that the heap, unlike an index, is
unordered. So once the index contains all of the values 1..1e6, they
have to be stored in key order, which means the entries from the
in-flight transactions are necessarily interleaved with each other and
with the old data.
If they're not, I can no longer support efficient searches by the
in-flight transactions, plus I have an unbounded amount of cleanup
work to do at commit time. But if I *do* have all of those values
interleaved in the proper order, then an abort leaves fragmented free
space. I can go and kill all of the inserted tuples during UNDO of an
aborted transaction, but because page splits have happened, the index
tuples I need to kill may not be on the same pages where I inserted
them, so I might need to search from the top of the tree, which is
expensive. And even if I do that anyway, the pages are left half-full.
The point is that the splits are the joint result of the two
concurrent transactions taken together - you can't blame individual
splits on individual transactions.
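
Concretely, undoing one index insert for the aborted transaction would
have to be something along these lines (a sketch only, where
btree_delete_from_page() and btree_descend_to_leaf() are invented
placeholders and locking is omitted):

static void
undo_index_insert(Relation index, IndexTuple itup, BlockNumber orig_blkno)
{
    /* Try the block recorded in the UNDO record first. */
    Buffer      buf = ReadBuffer(index, orig_blkno);

    if (!btree_delete_from_page(buf, itup))
    {
        /*
         * A page split has moved the tuple since we inserted it, so the
         * only safe fallback is a fresh descent from the root, which is
         * the expensive case described above.
         */
        ReleaseBuffer(buf);
        buf = btree_descend_to_leaf(index, itup);
        (void) btree_delete_from_page(buf, itup);
    }

    ReleaseBuffer(buf);
}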

Another way to put this is - if you could make all of the index
operations appear to happen instantaneously at commit time, then I see
how we could make what you're talking about work. And if you never
needed to split pages, then you could do that in the same way that you
can do it for the heap. But that's not realistic.

Even for an UNDO-based heap, we have a mild form of this problem.
Suppose we have a full page and delete a tuple, creating free space.
Now another transaction inserts a tuple, using up that free space.
Well, if the first transaction aborts and the second one commits,
we've got trouble: the UNDO of the aborted delete needs to put its
tuple back, but the space is now occupied by a committed tuple. So we
can't allow that. We have to keep track of
the amount of space that could potentially be gobbled up by UNDO
actions and leave enough space on the page to guarantee that such
actions will succeed. If that means that some new tuple has to be
placed on a different page instead, so be it. (This could create some
minor bloat, but I think it's not really that bad.) For an index, we
can't manage so easily, because we don't have the same kind of
flexibility about where to put inserted tuples.
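
In code, the insert path would need a check along these lines, where
page_undo_reserved() is an invented placeholder for however we end up
tracking the space that pending UNDO actions might still need on the
page:

static bool
heap_page_has_room(Page page, Size needed)
{
    Size        freespace = PageGetFreeSpace(page);
    Size        reserved = page_undo_reserved(page);   /* pending UNDO needs */

    /* Refuse the insert unless both the new tuple and the reservation fit. */
    return freespace >= needed + reserved;
}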

Do you see some trick here that I'm missing?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
