Happy New Year, everybody.
This proposal follows on from previous thinking about partitioning,
where I've taken up Andrew Sullivan's suggestion to re-examine the
current partitioning concept of using tables as partitions. So I've come
up with an alternative concept to allow us to discuss the particular
merits of each. ISTM that this new proposal has considerable potential.
Recap: Very Large Table Use Case
Many large tables have a regular access pattern:
- new rows inserted frequently
- some updates and deletes on the "recent" data
- older data becomes effectively read-only
- at some point data may be removed in bulk from the table
Most tables vary on what "recent" means; some are insert only, many not.
A typical example might be a large history table where the last 6 months
data is updated, but after that the data has almost zero change.
Partitioning relies on knowledge of the physical location of data to
exclude sections of data from a scan.
Currently the partitioning knowledge is provided by user declarations in
the form of constraints on tables linked together by views or
inheritance. We also currently do this in "constraint-first" fashion,
i.e. we put the data where the constraints say we should, rather than
deriving the constraints from the data.
In the common use-case there is a clear affinity between certain data
columns and the physical placement of data in a table. This is true for
columns such as LOG_DATE or TRANSACTION_DATE, but is also true for
any columns where the id column is assigned by a Sequence. HOT helps
this to remain true over time.
Given the above use case, when there is an affinity between data values
and data placement *and* we know that older data tends to become read
only, we can then realise that certain physical sections of a table will
become effectively read only. (My experience is that the larger the
table, the less random the write pattern - whatever the guys that wrote
If we were able to keep track of which sections of a table are now
read-only then we would be able to record information about the data in
that section and use it to help solve queries. This is turning the
current thinking on its head: we could derive the constraints from the
data, rather than placing the data according to the constraints. That
would be much more natural: load data into the table and have the system
work out the rest.
We can imagine that we do this within a single table. Imagine: split a
table up into 100 sections, then record the min/max values of all tuples
in those sections. When we perform a SeqScan we could skip over sections
that could be proved would never satisfy the query predicate. Same thing
as partitioning, but within the table.
Currently tables are already broken up into 1GB file segments, so it
seems fairly natural to pick that as our section size. That fits
reasonably well with recommendations for ideal partition size.
After we note that a segment is read-only we can scan the segment and
record min/max values for all columns. These are then "implicit
constraints", which can then be used for segment exclusion in a similar
manner as we do with the current constraint exclusion mechanism.
The implicit constraints can then allow SeqScans to assess segment
exclusion for each segment at execution time, i.e. a scan can skip a
whole segment if constraints don't match. If a segment does not (yet)
have implicit constraints it will always be scanned in full. This allows
partitioning, synch scans and buffer recycling to work much better
together than they currently do.
Other scans types might also use segment exclusion, though this would
only be useful for scans retrieving many rows, otherwise the overhead of
segment exclusion might not be worthwhile.
This would also allow a Merge Join to utilise exclusion for the inner
plan at execution time. Instead of scanning the whole inner table plan
from the start, we would be able to start the scan from the appropriate
segment. This would require us to pass the current value of the outer
plan down into the inner plan. The typical executor nodes on the inner
plan would be a SortNode and below that a SeqScan. In that case the
executor would need to pass the outer value from the MergeJoinNode down
thru the SortNode to the SeqScan node. The SeqScan node could then
perform partition exclusion, reducing the time for that scan and also
reducing the time for the resulting sort. This sounds like it might be
worth doing in normal cases also. It might turn out that the potentially
applicable cases are already excluded during planning, I haven't thought
about that aspect in enough detail yet.
If we collect data for all columns then many of our implicit constraints
would be useless. e.g. if a column only has a few values and these are
present in all segments. Matching our predicate against all constraints
would be expensive, so we must weed out poor constraints. We would do
this by removing any constraint that overlapped more than 10% of other
segments. Various heuristics would likely need to be discovered during
development to make this work without resorting to manual commands.
Note that all of this exclusion is happening in the executor, not the
planner. That allows this form of exclusion to work with stable
functions and parameters without problem.
Noting which segments are read-only
Everything so far has relied upon our ability to note which segments of
a table are read-only. We could do this in two ways
1) have the system automatically keep track of non-changing data
2) have the DBA issue a command to "mark" a segment as read-only now
Probably a combination of the two is better, so we have three states for
- EFFECTIVE_READ_ONLY (set by 1 only)
- MARKED_READ_ONLY (set by 2 only)
Having said that I want to concentrate on (1), though consider the other
one also in case requested by reviewers.
So *how* would we mark segments as read only? It turns out that
attempting to do this is extremely similar to Heikki's Visibility Map
idea, so I'll describe it in those terms, even though there are some
differences. It also brings to mind Andrew Sullivan's read-only tuples
We would keep a dynamic visibility map at *segment* level, showing which
segments have all rows as 100% visible. No freespace map data would be
held at this level.
Most everything about Heikki's Visibility Map idea holds, just that we
have a bit for each segment in the table, rather than each block. The
map is very small and hardly ever changes for a table, so we can cache
it easily on each backend. If the map does change we can perform a
relcache invalidation to make everybody re-read the map.
No dynamic shared memory cache is required because any concurrent
changes to the table would be ignored by a Scan anyway, so it doesn't
matter if an INSERT, UPDATE or DELETE occurs while we are scanning. Any
new scans that start will attempt to lock the table and then perform a
rel cache check before continuing. So the visibility will be set
correctly for *that* scan at least.
In most cases the visibility map can be summarised as a single boolean
to show whether any 100% visible segments exist. That makes accessing
the map very cheap in the common, unset case.
Setting the Visibility Map
VACUUM would scan all READ_WRITE_ALLOWED segments and mark some of
them as EFFECTIVE_READ_ONLY if 100% of the remaining rows are visible to
all current backends. Marking the map will cause a rel cache
We would never mark the highest numbered segment
EFFECTIVE_READ_ONLY, since it is likely to expand when INSERTs occur.
This is also a useful way of excluding smaller tables from the overheads
involved for this feature.
In an insert-only table this would mean that only the last 1 or 2
segments would be read write, so VACUUM would always run in a bounded
time, no matter how large the table.
When would we attempt to set the visibility map? If we do this
aggressively, then it will be quickly unset. If we wait too long then we
might lose some of the benefits of this approach.
First, we recognise that read only segments will change the autovacuum
calculation - we simply exclude them entirely. This will mean that
VACUUMs are run more regularly than they are now, but when they do run
they will be quicker. (Note that currently, VACUUMs on a growing table
get further and further apart as the table grows, assuming constant
My feeling is that VACUUMs are sufficiently infrequent that we should
set the visibility map aggressively after each VACUUM. If the map is
quickly unset, then we wait some time before trying again. Unsetting
seems to be relatively low contention (see below), so that seems
We would then re-scan newly marked segments to derive min/max values for
implicit constraints, performed by autovacuum worker processes. The best
way seems to be that an ANALYZE command would automatically detect that
a segment has been recently set EFFECTIVE_READ_ONLY and perform an
all-rows scan rather than just a sample. (We would keep min/max values
only for datatypes with valid btree sort operators, just as ANALYZE
already does). The data for this can be stored as part of the stats
histogram for each column. pg_class.reltuples would become an array of
row counts in each segment, known accurate for read only segments.
A later VACUUM will still be required to freeze the rows, though that
can happen later when we hit the frozen limit, or earlier if we run an
explicit VACUUM FREEZE. We would need to store the oldest xid for each
segment, so we know when to kick off a FREEZE on that part of the table.
We would do this by making pg_class.relfrozenxid an array also.
This departs completely from the idea that the visibility map and the
freespace map are related somehow. However, there is a change required
in the FSM to ensure the proposed technique is stable and useful: If we
scan a segment and it is 100% visible, if there is freespace in just one
of the blocks in the segment then we will soon find that our
visibility-bit will be set to off again very quickly.
If there is more than 5% freespace total in the segment then we will not
set EFFECTIVE_READ_ONLY, nor report blocks in a segment to the FSM. That
way, small amounts of freespace won't destroy the benefits of segment
exclusion. VACUUM currently overwrites FSM information, though if this
was not the case then we would have to actively remove it for the newly
read only segments. Perhaps that limit would be (100 - fillfactor)%,
which is normally set according to a user's expectation of the number of
updates likely on a table.
Unsetting the visibility map
For EFFECTIVE_READ_ONLY, INSERTs, UPDATEs and DELETEs are still
possible. For MARKED_READ_ONLY it would be explicitly refused.
The relcache invalidation caused by making a segment EFFECTIVE_READ_ONLY
can flush the rd_targBlock for the relation in each backend. Other
INSERTs are not possible since all current code-paths either use
extension or the FSM when the table is non-empty, and neither of those
will cause a problem. Extension only touches last segment, which will
never by EFFECTIVE_READ_ONLY. The FSM contains no entries for an
UPDATEs or DELETEs are possible on EFFECTIVE_READ_ONLY segments, but we
don't expect it often. If this occurs, we will simply reset the
visibility map for the table. This can be handled with a
non-transactional overwrite - the bit always exists already, so the
overwritten data is always same size. That's pessimistic, since it
resets the state even if the UPDATE/DELETE aborts, but its better than
holding the row locked until the transaction completes, which might
prevent other things from happening. We probably also want VACUUM to
clear up any aborted transactions anyway.
The user may also wish to clear down very old data, so allowing DELETEs
can ensure the user can still remove old data from the table. By
carefully choosing the values to be deleted, a whole segment can be
deleted and then returned to the FSM.
Running a huge DELETE will likely perform a SeqScan that avoids all but
the deleting data. The following VACUUM will also ignore the other
segments, so removing older data seems fairly well optimised already, in
comparison with now. It would be possible to optimise this to perform a
segment-level truncate, but since the partitioning will already improve
the DELETE and VACUUM, I'm not proposing that yet, if ever.
SHARE locks currently write to data blocks, but since they represent
transient state we do not need to unset either the partitioning or the
Overlap with Visibility Map Proposal
This proposal offers many of the advantages of the earlier Visibility
Map proposal, yet without major changes to heap structure. Since the
segment-level visibility map is more granular it would only be 80% as
effective as the more detailed block-level map. Having said that, the
bookkeeping overheads will also be considerably reduced, so it does seem
a good joint approach. It also allows freezing to be handled fully,
which was a problem with the original visibility map proposal. WAL
logging visibility map changes is also now much easier.
My thoughts have been targeted directly at partitioning, yet I have to
admit that this idea overlaps, and in my world view, replaces the
Visibility Map proposal. I very much like the name Visibility Map
though. I've re-read the earlier thread and agree with Jeff Davis that
we *could* have multiple kinds of visibility map, but also agree with
Heikki that it seems like too much work. It was never my intent to
address both challenges in one proposal and this is just pure
coincidence, or possibly just luck, depending upon how you like the
If we do need to differentiate between the two proposals, we can refer
to this one as the Segment Visibility Map (SVM).
Index-only scans would look at the Visibility Map, just as they would
have done in the earlier proposal.
It would be costly for an index scan to repeatedly check the visibility
map, but probably worth it to do so.
We can handle select count(*) by scanning the non-100% visible segments
of a table, then adding the stored counts for each segment to get a
final total. Not sure if its really worth doing, but it does sound like
an added bonus.
There would be additional complexity in selectivity estimation and plan
costing. The above proposal allows dynamic segment exclusion, which
cannot be assessed at planning time anyway, so suggestions welcome...
I think it would be prudent to have an override option on VACUUM to
allow us to scan a table in full, checking rather than using the
visibility map. This mode would be called VACUUM CHECK.
None of the above blocks earlier proposals for read only tables, and the
two features could work together very straightforwardly.
Comparison with other Partitioning approaches
Once I realised this was possible in fairly automatic way, I've tried
hard to keep away from manual overrides, commands and new DDL.
Declarative partitioning is a big overhead, though worth it for large
databases. No overhead is *much* better though.
This approach to partitioning solves the following challenges
- allows automated table extension, so works automatically with Slony
- responds dynamically to changing data
- allows stable functions, nested loop joins and parametrised queries
- allows RI via SHARE locks
- avoids the need for operator push-down through Append nodes
- allows unique indexes
- allows both global indexes (because we only have one table)
- allows advanced planning/execution using read-only/visible data
- works naturally with synchronous scans and buffer recycling
All of the above are going to take considerably longer to do in any of
the other ways I've come up with so far...
And yes, this is a different conclusion to earlier discussions,
particularly those in 2005 where I was still thinking in terms of
This technique would be useful for any table with historical data keyed
by date or timestamp. It would also be useful for data where a
time-of-insert component is implicit, such as many major entity tables
where the object ids are assigned by a sequence. e.g. an Orders table
with an OrderId as PK. Once all orders placed in a period have been
shipped/resolved/closed then the segments will be marked read-only.
Its not really going to change the code path much for small tables, yet
seems likely to work reasonably well for large tables of all shapes and
sizes. If a segment is being updated, we leave it alone, and maybe never
actually set the visibility map at all. So overall, this idea seems to
cover the main use case well, yet with only minimal impact on the
existing use cases we support.
As before, I will maintain this proposal on the PG developer Wiki, once
we get down to detailed design.
pgsql-hackers by date
|Next:||From: Simon Riggs||Date: 2008-01-02 18:12:47|
|Subject: Re: Slow count(*)|
|Previous:||From: Pavel Stehule||Date: 2008-01-02 16:42:53|
|Subject: Re: [HACKERS] Slow count(*)|