Tom brought this up during the PGCon developer meet. After thinking a
bit about it, I think it's actually possible to avoid the second heap
scan, especially now that we've HOT. If we can remove the second pass,
not only would that speed up vacuum, but also reduce lots of redundant
read and write IO.
Currently second heap scan is required to remove the dead tuples from
the heap. We can not do this in the first scan because we haven't yet
removed the index pointers pointing to them. HOT now prunes and
defrags the pages in the first phase itself and what is left behind is
just a bunch of DEAD line pointers. The line pointers are marked
"UNUSED" in the second heap scan. Since we don't repair any line
pointer bloat, no additional free space is created in the second pass.
So frankly there is not much left to be done in the second phase. Of
course we also update the FSM information at the end of second pass.
If we want to remove the second pass, what we need is a mechanism to
reclaim the DEAD line pointers. But to this correctly, we must ensure
that the DEAD line pointers are reclaimed only and only after the
index entries pointing to them are removed.
Tom's idea was to store the vacuum-xid in the tuple header and check
that xid to see if the vacuum successfully removed the index pointers
or not. Heikki had some brilliant idea to store the xid in the line
pointer itself. These ideas are good, but would require xid wraparound
handling. I am thinking of a solution on the following lines to handle
DEAD line pointers. Other ideas are welcome too.
1. Before VACUUM starts, it updates the pg_class row of the target
table, noting that VACUUM_IN_PROGRESS for the target table.
2. It then waits for all the existing transactions to finish to make
sure that everyone can see the change in the pg_class row,
3. It then scans the heap, prunes and defrags the pages. The normal
pruning would reclaim all the dead tuples and mark their line pointers
as DEAD. Since VACUUM is going to remove the index pointers pointing
to these DEAD line pointers, it now marks these DEAD line pointers
with additional flag, say DEAD_RECLAIMED.
4. At the end of first scan, VACUUM updates FSM information for heap pages.
5. It then proceeds with the index scan and removes index pointers
pointing to the DEAD line pointers collected in the heap scan.
6. Finally, it again updates the pg_class row and clears the
Any other backend, when invokes page pruning, would check if the
VACUUM is in progress by looking at the VACUUM_IN_PROGRESS flag. Note
that if the previous vacuum had failed or database crashed before
vacuum completed, the VACUUM_IN_PROGRESS flag would remain set until
the next vacuum successfully completes on the table and resets the
flag (VACUUM_NOT_IN_PROGRESS state).
Since vacuum waits for the existing transactions to finish before
marking any DEAD line pointers DEAD_RECLAIMED, for a backend which
sees VACUUM_NOT_IN_PROGRESS, any DEAD_RECLAIMED line pointer it finds
must be left over from the previously successfully completed vacuum.
Since the previous vacuum must have removed the index pointers
pointing to it, the backend can now safely reclaim the line pointer
itself. The backend can potentially do this any time it sees a
DEAD_RECLAIMED line pointer, but we may restrict this only during the
pruning activity to keep things simple. This operation need not be WAL
logged if we appropriately handle DEAD_RECLAIMED line pointer during
redo recovery (if it's reused for some other insert/update activity).
I think this scheme guarantees that a backend would always see
VACUUM_IN_PROGRESS if vacuum is currently in progress on the table or
the last vacuum has failed. There might be situations when a backend
sees VACUUM_IN_PROGRESS when if fact there is no vacuum is progress
and the last vacuum finished successfully, but that won't have any
correctness implication, but would only delay reclaiming
DEAD_RECLAIMED line pointers.
pgsql-hackers by date
|Next:||From: Simon Riggs||Date: 2008-05-28 11:31:33|
|Subject: Re: Hint Bits and Write I/O|
|Previous:||From: Matthew Wakeling||Date: 2008-05-28 10:45:14|
|Subject: Re: Outer joins and equivalence|