Re: UNDO and in-place update

From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Bruce Momjian <bruce(at)momjian(dot)us>, "Tsunakawa, Takayuki" <tsunakawa(dot)takay(at)jp(dot)fujitsu(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: UNDO and in-place update
Date: 2016-12-05 09:49:53
Message-ID: CAA4eK1JMw=Eupx9OXB808fQ6kn3jAgBMUu=XaAKSDis5fJpXnA@mail.gmail.com
Lists: pgsql-hackers

On Thu, Dec 1, 2016 at 8:55 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Tue, Nov 29, 2016 at 12:21 AM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>
> I see what you're going for, but I'm not sure it's worth it. I mean,
> say you just have one bit per index tuple. If it's set, the heap
> tuple is definitely all-visible; if it's clear, then the tuple may or
> may not be visible to your snapshot. You insert tuples with the bit
> clear, and clear the bit again when the tuple is deleted or the
> indexed column is updated. You set the bit when you follow the index
> pointer and discover that the tuple is all-visible, so that next time
> you don't need to follow it again. Of the advantages you mention
> above, this still allows (a), but less efficiently. It allows (b).
> It does not allow (c). So, it's not as good. But, it saves you the
> overhead of storing xmin and/or xmax for each tuple in the page. Even
> if you have an optimization to list the XIDs once per page and then
> refer to them from the tuples, you are going to need a byte in each
> tuple for the xmin-index and a byte for the xmax-index, or something
> like that, plus the space for the XID list itself. That's not a ton
> of space, maybe, but it's definitely more. It's also a significantly
> bigger change to the tuple format, I think. What I'm proposing could
> be done by repurposing the LP_* bits available in the line pointer.
> So I think the questions are (1) will the extra space consumption hurt
> us more than the benefit we will get? and (2) is it really worth the
> work to change the page format to that degree?
>
> As you can probably tell, I'm leaning toward "no". I think a bit
> per-tuple -- whether we call it delete-mark or definitely-all-visible
> -- should be enough to get most of the benefit that is available here
> for substantially less work and no extra space.
>

I can well understand the reason for not doing so (IIUC, it is about
the complexity and the time needed to get it right when we are already
trying to solve one big and complex problem in the system), but the
claim that most of the benefit can be realized by a simple
optimization seems less convincing. I don't think simple
optimizations will solve the multi-pass vacuum problem, whereas
having the visibility information in the index will solve it and
combat index bloat much more aggressively.

>> I think with proposed design there is a good chance that for many of
>> the index scans, there is a need for the extra hop to decide
>> visibility of index tuple. I think as of today, during index scan if
>> the corresponding heap page is not-all-visible, then we check the
>> corresponding heap tuple to decide about the visibility of index
>> tuple, whereas with proposed design, I think it first needs to check
>> heap page and then TPD, if there is more than one transaction
>> operating on page.
>
> There's a couple of possible designs here, but there is the
> possibility for extra hops in some cases. But there are things we can
> do to mitigate that.
>
> 1. If each tuple has a definitely-all-visible bit, you can check that
> first; if it's set, the tuple is definitely all-visible and you don't need
> to do anything further.
>
> 2. If we still maintain a visibility map, you can check that next. If
> the bit is set for the target page, then the tuple is definitely
> all-visible and you don't need to do anything further.
>
> 3. Otherwise, you need to look at the heap page. But if the heap
> page's UNDO pointer is invalid, then the whole page is all-visible, so
> you're done. (You can also set the visibility map bit and/or the
> index tuple's definitely-all-visible bit to avoid having to recheck
> the heap page in the future.)
>
> 4. Each tuple will have a bit indicating whether it's been recently
> modified; that is, whether the transaction whose UNDO log is
> referenced by the page's UNDO pointer did anything to that tuple; or
> if the page points to a TPD entry, whether any of the transactions in
> the TPD modified that tuple. If that bit is not set for that
> particular tuple, it's all-visible and you're done.
>

I think the 4th optimization is especially good and can cover many
cases, but I am not sure how we unset the bit once it is set. For
example, once the effects of the transaction that touched that row
have become all-visible, how will we clear it? It might be better to
store the offset (slot number) of the transaction that last modified
the tuple: if the last transaction that touched the row is visible to
the snapshot, then we don't need to process the UNDO.
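The alternative suggested here might look something like the sketch
below: each tuple stores the slot number of its last modifier in the
page's transaction-slot array, rather than a single bit. All names,
layouts, and the simplified xid-cutoff "snapshot" test are my own
assumptions for illustration:

```c
#include <stdbool.h>
#include <stdint.h>

#define NO_MODIFIER_SLOT 0xFF   /* tuple untouched by any slot's xact */
#define MAX_XACT_SLOTS   4

typedef uint32_t TxId;

/* Stand-in for a TPD-style array of transactions on the page. */
typedef struct PageXactSlots
{
    TxId xids[MAX_XACT_SLOTS];
} PageXactSlots;

typedef struct TupleHeader
{
    uint8_t modifier_slot;  /* NO_MODIFIER_SLOT or index into xids */
} TupleHeader;

/* Simplified snapshot test: a transaction is "visible" if its xid is
 * below the snapshot's cutoff.  A real check is much more involved. */
static bool
xid_visible_to_snapshot(TxId xid, TxId snapshot_xmin)
{
    return xid < snapshot_xmin;
}

/* True when the tuple can be decided without consulting UNDO: either
 * no slot transaction touched it, or its last modifier is already
 * visible to our snapshot. */
static bool
tuple_skips_undo(const TupleHeader *tup, const PageXactSlots *slots,
                 TxId snapshot_xmin)
{
    if (tup->modifier_slot == NO_MODIFIER_SLOT)
        return true;
    return xid_visible_to_snapshot(slots->xids[tup->modifier_slot],
                                   snapshot_xmin);
}
```

Unlike a bare recently-modified bit, the stored slot never needs to be
"unset": it simply becomes irrelevant once the referenced transaction
is visible to every snapshot.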

> 5. Otherwise you have to look at the TPD entry (if any) and UNDO logs.
>
> If we put in the visibility information in the index as you are
> proposing, then we can always resolve the visibility of the index
> tuple at step 1; we just test xmin and xmax against the snapshot. For
> index-only scans, that's great, because we never need to consult the
> heap at all. For regular index scans, it's not nearly as great,
> because we're still going to need to consult the TPD and UNDO logs to
> determine which version of the tuple is visible to that snapshot.
>

Yeah, but we might think of extending this in the future so that the
index also has UNDO logs, which would help in such a case. Also, I
think the cases where the tuple is visible but not yet marked
definitely-all-visible will still be much cheaper.
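The layered check in steps 1-5 above could be sketched as a simple
decision cascade. Each boolean input below stands in for a real page
or map lookup that I've abstracted away; the names are hypothetical,
and the only point is the order of the fallbacks and where the scan
can stop early:

```c
/* Hypothetical sketch of the step 1-5 fallback order quoted above. */
typedef enum VisibilityStep
{
    RESOLVED_BY_INDEX_BIT,  /* step 1: per-tuple all-visible bit set */
    RESOLVED_BY_VISMAP,     /* step 2: visibility map bit set */
    RESOLVED_BY_PAGE_UNDO,  /* step 3: page's UNDO pointer invalid */
    RESOLVED_BY_TUPLE_BIT,  /* step 4: recently-modified bit clear */
    NEEDS_TPD_AND_UNDO      /* step 5: must consult TPD/UNDO logs */
} VisibilityStep;

static VisibilityStep
resolve_visibility(int index_bit_set,
                   int vismap_bit_set,
                   int page_undo_ptr_valid,
                   int tuple_recently_modified)
{
    if (index_bit_set)
        return RESOLVED_BY_INDEX_BIT;       /* cheapest: no heap visit */
    if (vismap_bit_set)
        return RESOLVED_BY_VISMAP;          /* one visibility-map probe */
    if (!page_undo_ptr_valid)
        return RESOLVED_BY_PAGE_UNDO;       /* whole page all-visible */
    if (!tuple_recently_modified)
        return RESOLVED_BY_TUPLE_BIT;       /* this tuple untouched */
    return NEEDS_TPD_AND_UNDO;              /* the expensive path */
}
```

Only the final case pays the extra-hop cost being debated in this
thread; the rest resolve with at most one page visit.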

One more thing that needs some thought is how rollback will work if
we keep just one bit for delete-marking. For the heap we can apply
the undo directly from the UNDO log, but for the index I think we
also need to undo the change in some way: for example, search the
tree to remove the delete-marking from the old value and to
delete-mark the new value. If that is what we need to do, then first
we have to traverse the tree twice, which is okay considering
rollback is a seldom-used operation; the second challenge is how to
make such an operation atomic with respect to WAL.
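The two-traversal rollback described here might look like the sketch
below. A flat array with linear search stands in for the btree
descent (everything here is a hypothetical stand-in), and the WAL
question remains open: a real implementation would need both flag
flips covered atomically, e.g. by a single WAL record or an
intermediate state that recovery can repair:

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical stand-in for an index entry with a delete mark. */
typedef struct IndexEntry
{
    char key[16];
    bool delete_marked;
} IndexEntry;

/* Linear search standing in for a btree descent. */
static IndexEntry *
index_search(IndexEntry *entries, int n, const char *key)
{
    for (int i = 0; i < n; i++)
        if (strcmp(entries[i].key, key) == 0)
            return &entries[i];
    return (IndexEntry *) 0;
}

/* Roll back an update that delete-marked old_key and inserted
 * new_key: two tree searches, two flag flips.  Making both flips
 * atomic w.r.t. WAL is the open problem noted above. */
static bool
undo_index_key_update(IndexEntry *entries, int n,
                      const char *old_key, const char *new_key)
{
    IndexEntry *old_e = index_search(entries, n, old_key);
    IndexEntry *new_e = index_search(entries, n, new_key);

    if (old_e == (IndexEntry *) 0 || new_e == (IndexEntry *) 0)
        return false;

    old_e->delete_marked = false;   /* old value becomes live again */
    new_e->delete_marked = true;    /* new value is retired */
    return true;
}
```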

>>>>> That's OK, but we're in real trouble if
>>>>> step-3 inserted an additional index tuple pointing to that chain
>>>>> rather than simply noting that one already exists. If it adds an
>>>>> additional one, then we'll visit two index tuples for that TID. Each
>>>>> time, the visibility information in UNDO will cause us to return the
>>>>> correct tuple, but we've erred by returning it twice (once per index
>>>>> entry) rather than just once.
>>>>
>>>> Why will scan traverse the UNDO chain if it starts after step-3 commit?
>>>
>>> Why wouldn't it? I think that if an index scan hits a page with an
>>> UNDO chain, it always needs to traverse the whole thing.
>>
>> If you notice the scan has started after all the writes have
>> committed, so what is the guarantee that there will be a UNDO chain?
>> Yes, it could be there if there is some live transaction which started
>> before step, otherwise, I am not sure if we can guarantee that UNDO
>> chain will be present.
>
> The UNDO chain has to remain until all transactions that modified the
> page are all-visible, not just until the transactions are committed.
>

Okay, I think this will work if we avoid the second insertion of the
same value-TID pair into the index, as you mentioned above. Without
that, I think we might not be able to distinguish which of the two
rows is visible to a transaction that can see the latest (step-3)
row.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
