Re: Vacuum dead tuples that are "between" transactions

From: Simon Riggs <simon(at)2ndquadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Paul Tillotson <spam1011(at)adelphia(dot)net>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Vacuum dead tuples that are "between" transactions
Date: 2006-02-28 13:22:35
Message-ID: 1141132955.27729.119.camel@localhost.localdomain
Lists: pgsql-hackers

On Tue, 2006-02-28 at 01:32 -0500, Tom Lane wrote:
> Paul Tillotson <spam1011(at)adelphia(dot)net> writes:
> > The topic of improving vacuum for use in heavy-update environments seems
> > to come up frequently on the list. Has anyone weighed the costs of
> > allowing VACUUM to reclaim tuples that are not older than the oldest
> > transaction but are nonetheless invisible to all running transactions?
> > It seems that it's not that hard....
>
> It's not that easy either --- you are assuming that every process
> advertises far more of its internal state than it actually does.
>
> > Suppose that in the system, we have a serializable transaction with xid
> > 1000 and a read committed transaction with xid 1001. Other than these
> > two, the oldest running xid is 2000.
>
> > Suppose we consider a tuple with xmin 1200 and xmax 1201. We will
> > assume that xid 1201 committed before xid 2000 began to run.
>
> > So:
>
> > (A) This tuple is invisible to the serializable transaction, since its
> > snapshot can't ever advance.
>
> Wrong --- you can't assume that simply from the transaction numbering,
> even assuming that you know that 1000 is serializable. 1000 might not
> have set its snapshot until quite some time after it started. (This is
> even pretty likely, if it waited for some locks before setting the
> snapshot.) You'd need access to the snapshot 1000 is actually using to
> be sure which "later" transactions are invisible to it.
>
> While advertising whole snapshots (rather than just xmin) in shared
> memory is at least theoretically possible, the costs of doing that seem
> nontrivial to me ... and they'd have to be paid whether any savings
> ensued or not.
>
> > (B) The read committed transaction might be able to see it. However, if
> > its current command started AFTER xid 1201 committed, it can't.
>
> Another issue is that there's not just "one single snapshot" to worry
> about per backend. Cursors for instance capture their own snaps.
> So a backend would have to somehow keep track of the oldest live
> snapshot it has internally.
>
> > The read committed ones are a more difficult matter, but I think you can
> > treat a tuple as dead if it was inserted after the read committed
> > transaction started to run AND the tuple was deleted before the
> > transaction's currently running command started to run.
>
> To do that requires not just that you have access to a backend's oldest
> snapshot, but that you have access to *all* its active snapshots;
> because such a transient tuple might be visible in some newer snap even
> though it's too new for the oldest snap. Doing that will create very
> significant problems of shared memory management, as well as performance
> and locking issues.
>
> There's been some talk of distinguishing "global" and "within database"
> xmin values, so that a long-lived transaction wouldn't interfere with
> vacuuming tables in other databases that that xact couldn't possibly
> access. That seems doable to me, but I think any finer-grained analysis
> is probably going to be a net loss because of the distributed overhead
> it'd impose on every transaction.

Paul raises some thoughts that are worth considering, even with the
usual minefield of difficulties.
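
To restate the core of Paul's idea in rough code terms (purely
illustrative -- XidVisibleInSnapshot() and the rest of the names below
are made up, and the real checks live in HeapTupleSatisfiesVacuum): a
committed-deleted tuple could be reclaimed as soon as no live snapshot
can still see it, even if its xmax is newer than the global oldest xmin.

    /*
     * Illustrative sketch only.  Assumes the caller already knows that
     * both xmin and xmax committed, and that every live snapshot in the
     * system is available -- which is exactly the part Tom points out
     * we do not have today.
     */
    static bool
    tuple_reclaimable(TransactionId xmin, TransactionId xmax,
                      Snapshot *live_snaps, int nsnaps)
    {
        int     i;

        for (i = 0; i < nsnaps; i++)
        {
            bool    insert_visible = XidVisibleInSnapshot(xmin, live_snaps[i]);
            bool    delete_visible = XidVisibleInSnapshot(xmax, live_snaps[i]);

            /* some snapshot still sees the insert but not the delete */
            if (insert_visible && !delete_visible)
                return false;
        }
        return true;            /* dead to every live snapshot */
    }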

Paul, you mention serializable transactions, but your root issue seems
to be that VACUUM reclaims fewer rows while pg_dump is running, yes?
Have you tried an on-line hot backup with archive_command set (PITR)?
That doesn't suffer from the same issue and is faster too.
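
For the archives: the PITR route needs little more than an
archive_command in postgresql.conf, plus pg_start_backup()/
pg_stop_backup() around the base file copy. The destination path below
is only an example:

    # postgresql.conf -- ship each completed WAL segment to an archive area
    archive_command = 'cp %p /mnt/server/archivedir/%f'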

OTOH a few hackers discussed this recently and found that nobody used
serializable transactions (ST) except during pg_dump. It seems a
reasonable observation that *most* STs are pg_dumps, or at the very
least that the longest-running STs are pg_dumps. So rather than changing all
transaction modes, or even special-casing STs, why not put in some
infrastructure to cope specifically with the problems that pg_dump can
cause?

A general facility that would allow STs to identify which tables they
would/would not touch again could be used by pg_dump to advertise useful
information. That information could then be picked up by VACUUM: when
it takes the lock to compute xmin it would notice the ST, retrieve the
advertised list, and work out a per-table xmin from it. Non-VACUUM
transactions would ignore the ST information entirely, so the overhead
for normal operation is very low (just a check of whether each running
transaction is an ST, a branch the CPU will almost always predict
correctly as "no").
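
A hand-waving sketch of what the VACUUM side might look like (nothing
like this exists in the tree; every name below except
TransactionIdPrecedes is invented for illustration):

    typedef struct STAdvert
    {
        TransactionId   st_xmin;    /* xmin of the advertising ST      */
        int             nrels;      /* relations it may still read     */
        Oid             rels[1];    /* variable-length in practice     */
    } STAdvert;

    /*
     * Start from an oldest-xmin computed while ignoring advertising STs,
     * then fold back in any ST that says it may still read this table.
     */
    static TransactionId
    table_oldest_xmin(Oid relid, TransactionId xmin_ignoring_sts,
                      STAdvert **adverts, int nadverts)
    {
        TransactionId   result = xmin_ignoring_sts;
        int             i, j;

        for (i = 0; i < nadverts; i++)
            for (j = 0; j < adverts[i]->nrels; j++)
                if (adverts[i]->rels[j] == relid &&
                    TransactionIdPrecedes(adverts[i]->st_xmin, result))
                    result = adverts[i]->st_xmin;

        return result;
    }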

You could take that further and get pg_dump to use a list file like
pg_restore does. You would then be able, where possible, to alter the
sequence of data dumping so that heavily updated tables are dumped
first; the dumping ST could then advertise "no further access" to those
tables, and VACUUMs could proceed as if the ST were not there at all.
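
For reference, the list-file mechanism being borrowed already exists on
the restore side; the idea would be to let pg_dump honour a similar,
re-ordered list (database and file names below are only examples):

    pg_dump -Fc mydb > mydb.dump
    pg_restore -l mydb.dump > mydb.list
    # edit mydb.list: move the heavily-updated tables to the top
    pg_restore -L mydb.list -d newdb mydb.dump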

Or, at the very least, the idea of some special-case ST behaviour might
be worthy of some thought.

I've no intention of working on this myself, especially since PITR
provides an alternate backup solution anyway (even in combination with
other techniques), but the idea seems worth recording for others to
discuss.

Best Regards, Simon Riggs
