Re: Toast issues with OldestXmin going backwards

From: Andrew Gierth <andrew(at)tao11(dot)riddles(dot)org(dot)uk>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Pavan Deolasee <pavan(dot)deolasee(at)gmail(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Toast issues with OldestXmin going backwards
Date: 2018-04-21 22:58:53
Message-ID: 87d0yscktp.fsf@news-spur.riddles.org.uk
Lists: pgsql-hackers

>>>>> "Andres" == Andres Freund <andres(at)anarazel(dot)de> writes:

Andres> Please wait for a bit. This isn't a trivial change, and I would
Andres> like to wrap my head around it.

Sure.

Andres> At the very least this seems like it could cause extreme
Andres> slowdowns for large tables that have the right update/delete
Andres> patterns?

I'm working on the basis that fetching the same set of toast rows twice
in succession will at least mean they are likely to be in cache the
second time, so we shouldn't be doing any more actual I/O than before.
(This assumes that a system with really big toast values will also have
plenty of RAM, which is generally safe since such values are copied many
times in memory by pg_dump.) There is still some overhead in the check,
of course; I will see about doing some benchmarks.

>> + /*
>> + * OldestXmin going backwards can mean we have a RECENTLY_DEAD row that
>> + * contains toast pointers whose toast rows have already been vacuumed
>> + * away (or in the worst but unlikely case, recycled). If so, then the
>> + * row must really be dead to all snapshots that could access it, so
>> + * treat it as DEAD instead.
>> + */

Andres> How is it guaranteed that the recheck doesn't find a different
Andres> toast tuple at the relevant position?

It's not, but it doesn't matter as long as the new toast rows can pass
through toast_fetch_datum without throwing an error (which is why
toast_validate_datum has to scan all the entries and check all their
lengths).
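
For concreteness, here is a rough standalone sketch of the length rules
involved. It is not the patch's actual toast_validate_datum (the function
name, the array-of-sizes interface and the hardcoded 1996-byte chunk size
are placeholders for illustration), but it shows the per-chunk checks that
toast_fetch_datum effectively enforces and that any validation pass
therefore has to replicate: every chunk except the last must be exactly
TOAST_MAX_CHUNK_SIZE bytes, the last must carry the remainder, and the
number of chunks must match the stored external size.

#include <stdbool.h>
#include <stdint.h>

/* build-dependent in reality; 1996 is the usual value for 8K blocks */
#define TOAST_MAX_CHUNK_SIZE 1996

/*
 * Illustrative sketch only, not the patch's actual code: the chunk
 * length rules that toast_fetch_datum enforces.  chunk_sizes[i] is the
 * data length of the chunk with chunk_seq == i, and expected_extsize is
 * the external size recorded in the toast pointer.
 */
static bool
toast_chunks_look_valid(const int32_t *chunk_sizes, int nchunks,
                        int32_t expected_extsize)
{
    int     expected_chunks = (expected_extsize + TOAST_MAX_CHUNK_SIZE - 1)
                              / TOAST_MAX_CHUNK_SIZE;
    int32_t total = 0;

    if (nchunks != expected_chunks)
        return false;       /* missing or extra chunks */

    for (int seq = 0; seq < nchunks; seq++)
    {
        int32_t want = (seq < nchunks - 1)
            ? TOAST_MAX_CHUNK_SIZE
            : expected_extsize - seq * TOAST_MAX_CHUNK_SIZE;

        if (chunk_sizes[seq] != want)
            return false;   /* wrong-sized chunk */

        total += chunk_sizes[seq];
    }

    return total == expected_extsize;
}

The real check of course has to walk the toast index for the value's
chunk_id rather than take an array, but the length arithmetic is the same.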

Firstly, we can assume the following statements (or at least if we
can't, we have far worse problems than this):

1. OldestXmin going backwards cannot make any tuple visible in any
(non-vacuum) snapshot that it wasn't visible in before;

2. If a tuple is non-vacuumable itself, it won't have its toast
entries vacuumed away;

3. Once a tuple has become vacuumable at any instant, it will never
again become visible in any snapshot.

So the only case where a recently-dead tuple can have pointers to a
now-recycled toast entry OID is if it was at some point vacuumable, such
that a toast-table-only vacuum removed its toast entries, and a new
insert or update subsequently stored a value of the same length that was
assigned the same OID.

Since the tuple was vacuumable at that point, it will never again become
visible in any snapshot, so the only code that should ever access its
toast values is vacuum full/cluster, which only ever copies the value
and does not attempt to decompress or otherwise interpret it. So the
only possible consequences of copying the wrong value, as long as we
don't throw an error in the process, are that (a) some space is possibly
wasted in the toast table - which actually happens already in far more
common circumstances - and (b) someone doing forensics or other
non-standard access to dead tuples might (in this vanishingly rare case)
see something unexpected. At no time would incorrect data be visible in
any query.

(regarding wasted space in the toast table: vacuum full / cluster will
insert live toast entries for copied recently-dead tuples, and if those
toast entries are not also referenced from any live tuple, they will
never be removed by subsequent normal vacuums, since nothing can ever
mark those entries dead; only a later vacuum full / cluster can remove
them.)

--
Andrew (irc:RhodiumToad)
