
From: Simon Riggs <simon(at)2ndquadrant(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Segment Visibility Map for VACUUM
Date: 2008-02-06 15:48:13
Message-ID: 1202312893.29242.84.camel@ebony.site
Lists: pgsql-hackers

Earlier, I proposed a Segment Visibility Map as part of my design of new
partitioning feature proposals. This idea has been spun off into a
completely *separate* proposal because SVM has merit enough to be
worthy of implementation on its own:

http://developer.postgresql.org/index.php/Segment_Visibility_Map
(full text of this is copied below also)

This then allows it to be compared directly with other proposals:
- Dead Space Map (Itagaki)
- Visibility Maps (Heikki)
and the other variants on this idea proposed down the years.

ISTM that SVM would offer many of the advantages of the other
approaches, with a less invasive, lower contention design. The
implementation is fairly straightforward also.

I'd like to get this sorted out as one of the first things in 8.4
development, so that all of the other *potential* uses of the visibility
information can also be considered and possibly implemented in the same
release.

SEGMENT VISIBILITY MAP

= Use Case =

Larger tables frequently need VACUUMing, but only over a small range of
blocks. It is common for much of a table to be read-only, so if we can
identify which parts are read-only then we can improve the performance
of VACUUM, as well as of other utilities.

= Segment Visibility Map =

We would keep a dynamic visibility map at *segment* level, showing which
segments have all rows as 100% visible.
This is known as the Segment Visibility Map (SVM).

We have at least two bits for each segment in the table, rather than
each block.
The two bits allow us to represent four states:
* read_write
* read_only
* read_only_pending (explained later)
* read_only_frozen

Some additional states that have been discussed are

* explicitly marked read only
* clustered
* compressed
* off-line (indicates high cost to access)

We should allow for 1 byte (8 bits) per segment.
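
As a rough illustration (the names and bit assignments here are mine,
not part of any patch), the per-segment byte might be laid out like
this:

  /* Illustrative sketch only: one possible layout of the per-segment
   * byte.  Names and bit assignments are hypothetical. */
  #include <stdint.h>
  #include <stdio.h>

  typedef uint8_t svm_state;

  #define SVM_STATE_MASK          0x03  /* low two bits: visibility state */
  #define SVM_READ_WRITE          0x00
  #define SVM_READ_ONLY_PENDING   0x01
  #define SVM_READ_ONLY           0x02
  #define SVM_READ_ONLY_FROZEN    0x03

  /* remaining bits left free for the extra states discussed above */
  #define SVM_FLAG_EXPLICIT_RO    0x04
  #define SVM_FLAG_CLUSTERED      0x08
  #define SVM_FLAG_COMPRESSED     0x10
  #define SVM_FLAG_OFFLINE        0x20

  int
  main(void)
  {
      svm_state seg = SVM_READ_ONLY | SVM_FLAG_CLUSTERED;

      printf("state = %d, clustered = %s\n",
             seg & SVM_STATE_MASK,
             (seg & SVM_FLAG_CLUSTERED) ? "yes" : "no");
      return 0;
  }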

That's possibly small enough to store directly on pg_class, even for
fairly large tables. However, we want to avoid bloat when the SVM
changes, plus we may need to handle some very big tables. So we would
store the SVM as a single column in a table pg_svm, with storage
attribute MAIN, so it is stored inline and is compressible.

A segment is a range of blocks, the size of which can be altered as
required. The initial suggestion is that a segment would be 1GB in
size, the same as a physical data segment. We might allow this to be
user settable, or we might automatically adjust it to minimise the
overhead and maximise the usefulness.

It would be possible to change the SVM segment size dynamically,
recalculating the bit positions accordingly, if that proved desirable
to optimise the overhead and benefit of this approach.
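
To give a feel for the arithmetic: with the default 8KB block size a
1GB segment covers 131072 blocks, so mapping a block number to its SVM
entry is a single division. A sketch under those assumptions:

  /* Sketch: map a heap block to its SVM segment, assuming 8KB blocks
   * and a 1GB SVM segment.  BLCKSZ would really come from the build
   * configuration; this is illustration only. */
  #include <stdint.h>
  #include <stdio.h>

  #define BLCKSZ              8192
  #define SVM_SEGMENT_BYTES   (1024 * 1024 * 1024)
  #define BLOCKS_PER_SEGMENT  (SVM_SEGMENT_BYTES / BLCKSZ)  /* 131072 */

  static uint32_t
  svm_segment_for_block(uint32_t blkno)
  {
      return blkno / BLOCKS_PER_SEGMENT;
  }

  int
  main(void)
  {
      printf("block 500000 is in SVM segment %u\n",
             svm_segment_for_block(500000));  /* segment 3 */
      return 0;
  }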

= Access to the SVM =

Since the map is very small and hardly ever changes for a table we can
cache it easily on each backend.

If the map does change we can perform a relcache invalidation to make
everybody re-read the map. (A few improvements are needed there also,
but they are not relevant here.)

No dynamic shared memory cache is required because any concurrent
changes to the table would be ignored by a scan anyway, so it doesn't
matter if an INSERT, UPDATE or DELETE occurs while we are scanning. Any
new scans that start will attempt to lock the table and then perform a
relcache check before continuing, so the visibility will be set
correctly for *that* scan at least.

In most cases the visibility map can be summarised as a single boolean
to show whether *any* 100% visible segments exist. That makes accessing
the map very cheap in the common, unset case. This would be an
additional boolean entry on the Relation struct.
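
A minimal sketch of that fast path, with hypothetical field and
function names:

  /* Sketch of the cheap common-case test: a single cached boolean on
   * the Relation tells us whether the per-segment map is worth
   * consulting at all.  All names here are hypothetical. */
  #include <stdbool.h>
  #include <stdint.h>
  #include <stdio.h>

  typedef struct SvmCache
  {
      bool           has_read_only_segments;  /* summary boolean */
      uint32_t       nsegments;
      const uint8_t *segment_state;           /* one byte per segment */
  } SvmCache;

  static bool
  segment_is_all_visible(const SvmCache *svm, uint32_t segno)
  {
      if (!svm->has_read_only_segments)
          return false;                /* common, unset case: no lookup */
      if (segno >= svm->nsegments)
          return false;
      return (svm->segment_state[segno] & 0x03) != 0;  /* not read_write */
  }

  int
  main(void)
  {
      const uint8_t states[4] = {0x02, 0x02, 0x00, 0x00};
      SvmCache      svm = {true, 4, states};

      printf("segment 1: %d, segment 3: %d\n",
             segment_is_all_visible(&svm, 1),
             segment_is_all_visible(&svm, 3));  /* 1, 0 */
      return 0;
  }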

The check to see whether the relcache has changed is essentially free,
since we do it already when we grab a lock. Since INSERT, UPDATE and
DELETE only spoil the cache, they don't need to have a completely
up-to-date picture. (That aspect may change.)

If we want backends running non-utility commands to see the SVM then
we would need to check the relcache each time we access the table,
not just each time we lock the table. We need not do that in all cases,
only in those that would benefit from knowing the SVM information.

= UnSetting the Segment Visibility Map =

The SVM will be "unset", i.e. set to read_write, if an INSERT, UPDATE
or DELETE occurs within a segment marked read_only or read_only_pending.
This catalog change can be handled with a non-transactional overwrite
- the bit always exists already, so the overwritten data is always the
same size. That's pessimistic, since it resets the state even if the
UPDATE/DELETE aborts, but it's better than holding the row locked until
the transaction completes, which might prevent other things from
happening. Alternatively, we could make the change at transaction
commit.

DML commands would first check the boolean rel.has_read_only_segments
to see if any non-read_write segments exist. If so, we would calculate
which segment is being written to by the DML operation and set its
state accordingly.
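
Sketched in the same illustrative style, the write-side path might look
like this:

  /* Sketch of the unset path run by INSERT/UPDATE/DELETE: a cheap
   * boolean test first, then clear the state for the one segment
   * actually being written.  Hypothetical names throughout. */
  #include <stdbool.h>
  #include <stdint.h>
  #include <stdio.h>

  #define BLOCKS_PER_SEGMENT  131072   /* 1GB of 8KB blocks */
  #define SVM_READ_WRITE      0x00

  typedef struct SvmCache
  {
      bool      has_read_only_segments;
      uint32_t  nsegments;
      uint8_t  *segment_state;
  } SvmCache;

  static void
  svm_note_write(SvmCache *svm, uint32_t blkno)
  {
      uint32_t segno;

      if (!svm->has_read_only_segments)
          return;                      /* nothing to unset */

      segno = blkno / BLOCKS_PER_SEGMENT;
      if (segno < svm->nsegments &&
          svm->segment_state[segno] != SVM_READ_WRITE)
      {
          /* the non-transactional catalog overwrite and relcache
           * invalidation would happen here; the sketch only clears
           * the cached copy */
          svm->segment_state[segno] = SVM_READ_WRITE;
      }
  }

  int
  main(void)
  {
      uint8_t  states[2] = {0x02, 0x00};  /* segment 0 read_only */
      SvmCache svm = {true, 2, states};

      svm_note_write(&svm, 42);           /* block 42 is in segment 0 */
      printf("segment 0 state = %d\n", states[0]);  /* now 0 */
      return 0;
  }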

The relcache invalidation caused by making a segment read_only_pending
can flush the rd_targBlock for the relation in each backend. Other
INSERTs are not possible since all current code paths either use
extension or the FSM when the table is non-empty, and neither of those
will cause a problem. Extension only touches the last segment, which
will never be read_only_pending. The FSM contains no entries for a
read_only_pending segment.

SHARE locks currently write to data blocks, but since they represent
transient state we do not need to unset either the partitioning or the
visibility map.

= Setting the Segment Visibility Map =

Setting the visibility map is tricky since this must work whilst
concurrent changes occur to the tables.

We do this by having an intermediate state, similar to the way that
CREATE INDEX CONCURRENTLY works.

First, VACUUM will scan a segment and if it finds all tuples are visible
then it will set the state of that segment to read_only_pending. We also
record the xid of the VACUUM, so we know when this was set.

If the next VACUUM still sees the read_only_pending state then we
VACUUM the segment a second time. If all rows are still visible, and
read_only_pending is still set at the end of that second VACUUM, then
we set the read_only state. (If changes to the SVM are immediate we
don't have to wait; if we only make SVM changes at end of transaction
then we have to wait for all concurrent lockers of the table to
complete before we can set the read_only state.)

Why do we do this? It's possible the first VACUUM missed some concurrent
changes. However, once read_only_pending is set any further changes will
clear it. So at the end of the second VACUUM we are guaranteed to have
seen any changes between the start of the first VACUUM and the end of
the second VACUUM.
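
The handshake can be sketched as follows (illustrative names only; the
two scan helpers are stand-ins for the real per-segment scans):

  /* Sketch of the two-pass, CREATE INDEX CONCURRENTLY-style handshake
   * for setting read_only.  The stand-in helpers pretend every tuple
   * is visible but not yet frozen. */
  #include <stdbool.h>
  #include <stdint.h>
  #include <stdio.h>

  enum svm_state { READ_WRITE, READ_ONLY_PENDING, READ_ONLY, READ_ONLY_FROZEN };

  static bool segment_all_tuples_visible(uint32_t segno) { (void) segno; return true; }
  static bool segment_all_tuples_frozen(uint32_t segno)  { (void) segno; return false; }

  static enum svm_state
  vacuum_one_segment(uint32_t segno, enum svm_state cur)
  {
      if (!segment_all_tuples_visible(segno))
          return READ_WRITE;            /* recently changed: stay writable */

      if (cur == READ_WRITE)
          return READ_ONLY_PENDING;     /* first clean pass */

      if (cur == READ_ONLY_PENDING)
          /* pending survived a whole VACUUM cycle; any intervening
           * write would have cleared it, so the segment is safe */
          return segment_all_tuples_frozen(segno) ? READ_ONLY_FROZEN
                                                  : READ_ONLY;

      return cur;                       /* already read_only[_frozen] */
  }

  int
  main(void)
  {
      enum svm_state s = READ_WRITE;

      s = vacuum_one_segment(0, s);     /* first VACUUM: pending */
      s = vacuum_one_segment(0, s);     /* second VACUUM: read_only */
      printf("final state = %d\n", s);  /* prints 2 (READ_ONLY) */
      return 0;
  }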

If we discover that a segment is completely frozen, then we set
read_only_frozen rather than just read_only.

If we perform a VACUUM FREEZE then we re-scan read_only segments,
expecting to set them to read_only_frozen.

VACUUM will skip read_only and read_only_frozen segments.

VACUUM FREEZE will skip read_only_frozen segments.

VACUUM FULL will always scan all segments, so that we always have a safe
option to fall back to in case of suspected problems.
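
Taken together, the skip rules amount to a small decision function; an
illustrative sketch:

  /* Sketch of the skip rules above: which segments does each flavour
   * of VACUUM need to scan?  Illustrative names only. */
  #include <stdbool.h>
  #include <stdio.h>

  enum svm_state { READ_WRITE, READ_ONLY_PENDING, READ_ONLY, READ_ONLY_FROZEN };
  enum vacuum_kind { VACUUM_LAZY, VACUUM_FREEZE, VACUUM_FULL };

  static bool
  vacuum_should_scan_segment(enum vacuum_kind kind, enum svm_state state)
  {
      switch (kind)
      {
          case VACUUM_FULL:
              return true;                       /* always scan everything */
          case VACUUM_FREEZE:
              return state != READ_ONLY_FROZEN;  /* skip only frozen */
          case VACUUM_LAZY:
          default:
              return state != READ_ONLY &&
                     state != READ_ONLY_FROZEN;  /* skip all read-only */
      }
  }

  int
  main(void)
  {
      printf("lazy scans read_only segment?   %d\n",
             vacuum_should_scan_segment(VACUUM_LAZY, READ_ONLY));    /* 0 */
      printf("freeze scans read_only segment? %d\n",
             vacuum_should_scan_segment(VACUUM_FREEZE, READ_ONLY));  /* 1 */
      return 0;
  }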

We never mark the highest numbered segment read_only_pending, since it
is likely to expand when INSERTs occur. This is also a useful way of
excluding smaller tables from the overheads involved for this feature.

= VACUUM Benefits =

In an insert-only table this would mean that only the last one or two
segments would be read_write, so VACUUM would always run in a bounded
time, no matter how large the table.

Frequently, tables have areas that are deleted en masse, so this would
help direct VACUUM only to the areas of change.

We recognise that read-only segments will change the autovacuum
calculation - we simply exclude them entirely. This will mean that
VACUUMs are run more regularly than they are now, but when they do run
they will be quicker. (Note that currently, VACUUMs on a growing table
get further and further apart as the table grows, assuming a constant
write rate.)
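
For example, the usual autovacuum trigger is roughly
dead_tuples > threshold + scale_factor * reltuples; excluding
read-only segments just means counting only the tuples that live in
read_write segments. A sketch, with made-up numbers:

  /* Sketch: autovacuum test restricted to read_write segments.
   * The formula mirrors the documented threshold + scale_factor test;
   * the restriction to read_write segments is my illustration. */
  #include <stdbool.h>
  #include <stdio.h>

  static bool
  needs_autovacuum(double dead_tuples_rw, double live_tuples_rw,
                   double vac_base_thresh, double vac_scale_factor)
  {
      double vacthresh = vac_base_thresh +
                         vac_scale_factor * live_tuples_rw;

      return dead_tuples_rw > vacthresh;
  }

  int
  main(void)
  {
      /* a large table, but only 1M rows sit in read_write segments */
      printf("vacuum? %d\n", needs_autovacuum(60000, 1000000, 50, 0.2));   /* 0 */
      printf("vacuum? %d\n", needs_autovacuum(250000, 1000000, 50, 0.2));  /* 1 */
      return 0;
  }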

A later VACUUM will still be required to freeze the rows, though that
can happen later when we hit the frozen limit, or earlier if we run an
explicit VACUUM FREEZE. We would still store only one oldest xid for
each table. There's no need to store a specific xid for each segment.

= Interaction with FSM =

This departs completely from the idea that the visibility map and the
freespace map are related somehow. However, there is a change required
in the FSM to ensure the proposed technique is stable and useful: if we
scan a segment and find it 100% visible, yet there is freespace in even
one of its blocks, then that freespace will soon be reused and the
visibility bit will be cleared again very quickly.

If there is more than 5% freespace in total in the segment then we will
not set read_only_pending, nor report blocks in that segment to the
FSM. That way, small amounts of freespace won't destroy the benefits of
making segments read-only.

VACUUM currently overwrites FSM information anyway; if that were not
the case then we would have to actively remove it for the newly
read-only segments. Perhaps the limit should be (100 - fillfactor)%
rather than a flat 5%, since fillfactor is normally set according to a
user's expectation of the number of updates likely on a table.
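
An illustrative sketch of that guard, expressed with (100 - fillfactor)%
as the limit (names and structure are mine):

  /* Sketch: decide whether a just-scanned, all-visible segment has
   * little enough free space to be worth marking read_only_pending
   * (and keeping out of the FSM). */
  #include <stdbool.h>
  #include <stdio.h>

  static bool
  segment_worth_marking(double free_bytes, double segment_bytes,
                        int fillfactor)           /* 10..100 */
  {
      double limit = (100 - fillfactor) / 100.0;  /* or a flat 0.05 */

      return (free_bytes / segment_bytes) <= limit;
  }

  int
  main(void)
  {
      double seg = 1024.0 * 1024.0 * 1024.0;      /* 1GB segment */

      /* fillfactor 95 => tolerate up to 5% free space */
      printf("%d\n", segment_worth_marking(0.03 * seg, seg, 95));  /* 1 */
      printf("%d\n", segment_worth_marking(0.20 * seg, seg, 95));  /* 0 */
      return 0;
  }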

= Utilities =

It would be possible to have a segment-aware
CLUSTER. That would require us to use another flag in the visibility
map to indicate clustered/unclustered. That might assist us in improving
REINDEX time, since the rows for read_only_clustered segments would
not require sorting.

--
Simon Riggs
2ndQuadrant http://www.2ndQuadrant.com
