Simon Riggs wrote:
> I've come
> up with an alternative concept to allow us to discuss the particular
> merits of each. ISTM that this new proposal has considerable potential.
Hm.. interesting idea.
> If we were able to keep track of which sections of a table are now
> read-only then we would be able to record information about the data in
> that section and use it to help solve queries. This is turning the
> current thinking on its head: we could derive the constraints from the
> data, rather than placing the data according to the constraints. That
> would be much more natural: load data into the table and have the system
> work out the rest.
Yeah, but that's also the most limiting factor of your approach: it
covers only horizontal partitioning by time (or, more precisely, by
columns which are very likely to increase or decrease with time). All
other columns will very likely contain values from the full range in
every segment.
As you have pointed out, that might be a very frequent use case. I can't
argue with that; however, I think it's important to be well aware of
this limitation.
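To make sure I'm reading the mechanism right, here's a minimal sketch
of the check I believe you're proposing, in C with all names and types
hypothetical (this is not actual Postgres code), assuming per-segment
min/max values for a single tracked column:

#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-segment metadata kept for Segment Exclusion. */
typedef struct SegmentInfo
{
    int64_t min_val;    /* smallest tracked column value in the segment */
    int64_t max_val;    /* largest tracked column value in the segment */
    bool    read_only;  /* segment has seen no recent writes */
} SegmentInfo;

/*
 * A read-only segment can be skipped entirely when the scan's range
 * predicate [scan_lo, scan_hi] cannot overlap the segment's
 * [min_val, max_val] range.
 */
static bool
segment_excluded(const SegmentInfo *seg, int64_t scan_lo, int64_t scan_hi)
{
    return seg->read_only &&
           (scan_hi < seg->min_val || scan_lo > seg->max_val);
}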
> Other scans types might also use segment exclusion, though this would
> only be useful for scans retrieving many rows, otherwise the overhead of
> segment exclusion might not be worthwhile.
Uh.. the overhead of checking against min/max values doesn't seem that
big to me.
I rather think the gain for index scans would be vanishingly small,
because (given frequent enough vacuuming) an index scan shouldn't return
many pointers to tuples in segments which could be optimized away by
segment exclusion anyway.
> If we collect data for all columns then many of our implicit constraints
> would be useless. e.g. if a column only has a few values and these are
> present in all segments. Matching our predicate against all constraints
> would be expensive, so we must weed out poor constraints. We would do
> this by removing any constraint that overlapped more than 10% of other
> segments. Various heuristics would likely need to be discovered during
> development to make this work without resorting to manual commands.
Uh, well, that's about the limitation I've pointed out above. But is it
really worth maintaining statistics about overlapping values and
removing min/max checks for certain columns?
It would save the min/max check per segment per scan, but at the cost
of maintaining the statistics and checking against them once per scan.
AFAICS the block with the min/max tuple per segment will often remain
cached anyway... dunno.
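For reference, here's a rough sketch of the 10%-overlap heuristic as I
read it, reusing the hypothetical SegmentInfo from above (threshold and
bookkeeping are obviously up for discussion):

/*
 * Keep the implicit constraint on a segment's column only if its
 * [min, max] range overlaps at most 10% of the other segments;
 * otherwise it will rarely exclude anything and only adds overhead.
 */
static bool
constraint_worth_keeping(const SegmentInfo *segs, int nsegs, int this_seg)
{
    int overlaps = 0;

    for (int i = 0; i < nsegs; i++)
    {
        if (i == this_seg)
            continue;
        if (segs[i].min_val <= segs[this_seg].max_val &&
            segs[i].max_val >= segs[this_seg].min_val)
            overlaps++;
    }
    return overlaps * 10 <= nsegs - 1;  /* at most 10% overlap */
}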
> Noting which segments are read-only
> Everything so far has relied upon our ability to note which segments of
> a table are read-only. We could do this in two ways
> 1) have the system automatically keep track of non-changing data
> 2) have the DBA issue a command to "mark" a segment as read-only now
> Probably a combination of the two is better, so we have three states
> for each segment:
> - READ_WRITE_ALLOWED
> - EFFECTIVE_READ_ONLY (set by 1 only)
> - MARKED_READ_ONLY (set by 2 only)
> Having said that I want to concentrate on (1), though consider the other
> one also in case requested by reviewers.
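Restating the proposed states as a C enum, just so we're talking about
the same thing:

typedef enum SegmentState
{
    READ_WRITE_ALLOWED,     /* normal state, writes permitted */
    EFFECTIVE_READ_ONLY,    /* set automatically by the system (1) */
    MARKED_READ_ONLY        /* set explicitly by the DBA (2) */
} SegmentState;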
Hm.. AFAICT, horizontal partitioning often serves manageability, which
is quite limited when all the data stays in one table (you can't move a
single segment to a different tablespace, for example). Thus I think
option 2 is pretty constrained in usability. What would the DBA gain by
setting a segment to read-only? And how would the DBA figure out which
segment a given tuple is stored in (so she can decide to mark it
read-only)?
> The user may also wish to clear down very old data, so allowing DELETEs
> can ensure the user can still remove old data from the table. By
> carefully choosing the values to be deleted, a whole segment can be
> deleted and then returned to the FSM.
Oh, yeah, that sounds like a good optimization. Bulk deletes, yummie!
> This proposal offers many of the advantages of the earlier Visibility
> Map proposal, yet without major changes to heap structure. Since the
> segment-level visibility map is more granular it would only be 80% as
> effective as the more detailed block-level map. Having said that, the
> bookkeeping overheads will also be considerably reduced, so it does seem
> a good joint approach. It also allows freezing to be handled fully,
> which was a problem with the original visibility map proposal. WAL
> logging visibility map changes is also now much easier.
I generally agree, although I'm somewhat dubious about the 80% factor.
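Just for scale, assuming the default 1 GB segments and 8 kB blocks:
1 GB / 8 kB = 131,072 blocks per segment, so a segment-level map needs
only 1/131,072th as many entries as a block-level one. That makes the
reduced bookkeeping plausible; where the 80% figure comes from, I don't
know.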
> My thoughts have been targeted directly at partitioning, yet I have to
> admit that this idea overlaps, and in my world view, replaces the
> Visibility Map proposal. I very much like the name Visibility Map
I would even say that partitioning is somewhat of a misleading term
here, because it normally lets the DBA decide where to split. Given that
we are operating on segments here, about which the DBA has very limited
information and over which she has very limited control, I prefer the
term "Segment Exclusion". I think of it as an optimization of sequential
scans on tables with the above-mentioned characteristics.
> If we do need to differentiate between the two proposals, we can refer
> to this one as the Segment Visibility Map (SVM).
I'm clearly in favor of keeping the two proposals separate. SVM is a
good name, IMHO.
> We can handle select count(*) by scanning the non-100% visible segments
> of a table, then adding the stored counts for each segment to get a
> final total. Not sure if it's really worth doing, but it does sound like
> an added bonus.
Yup, sounds tempting, although it's contrary to Postgres's position so
far. And one could argue that you'd then have to maintain not only
count(), but also avg(), sum(), etc. for all tuples in the 100% visible
segments.
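In code terms, the count(*) shortcut would look roughly like this
(again, all names hypothetical; includes as above):

/* Per-segment visibility summary, as I understand the proposal. */
typedef struct Segment
{
    bool    all_visible;    /* segment is 100% visible to everyone */
    int64_t stored_count;   /* tuple count recorded for the segment */
} Segment;

extern int64_t count_tuples_by_scan(const Segment *seg); /* normal scan */

/* count(*): trust stored counts for fully visible segments, scan the
 * rest. */
static int64_t
fast_count(const Segment *segs, int nsegs)
{
    int64_t total = 0;

    for (int i = 0; i < nsegs; i++)
    {
        if (segs[i].all_visible)
            total += segs[i].stored_count;
        else
            total += count_tuples_by_scan(&segs[i]);
    }
    return total;
}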
> There would be additional complexity in selectivity estimation and plan
> costing. The above proposal allows dynamic segment exclusion, which
> cannot be assessed at planning time anyway, so suggestions welcome...
Hm.. that looks like a rather bad downside of an executor-only optimization.
> Comparison with other Partitioning approaches
> Once I realised this was possible in fairly automatic way, I've tried
> hard to keep away from manual overrides, commands and new DDL.
> Declarative partitioning is a big overhead, though worth it for large
> databases. No overhead is *much* better though.
> This approach to partitioning solves the following challenges
> - allows automated table extension, so works automatically with Slony
> - responds dynamically to changing data
> - allows stable functions, nested loop joins and parametrised queries
> - allows RI via SHARE locks
> - avoids the need for operator push-down through Append nodes
> - allows unique indexes
> - allows global indexes (because we only have one table)
> - allows advanced planning/execution using read-only/visible data
> - works naturally with synchronous scans and buffer recycling
> All of the above are going to take considerably longer to do in any of
> the other ways I've come up with so far...
I fully agree. But as I tried to point out above, the gains in
manageability from Segment Exclusion are also pretty close to zero. So
I'd argue it only fulfills part of the needs for general horizontal
partitioning.
> This technique would be useful for any table with historical data keyed
> by date or timestamp. It would also be useful for data where a
> time-of-insert component is implicit, such as many major entity tables
> where the object ids are assigned by a sequence. e.g. an Orders table
> with an OrderId as PK. Once all orders placed in a period have been
> shipped/resolved/closed then the segments will be marked read-only.
Agreed. Just a minor note: I find "marked read-only" too strong, as it
implies that writing is impossible. I propose speaking of mostly-read
segments, or segments optimized for reading, or similar.
> It's not really going to change the code path much for small tables,
> yet seems likely to work reasonably well for large tables of all
> shapes and sizes.
That sounds a bit too optimistic to me. For Segment Exclusion, it takes
only *one* tuple to enlarge the min/max range dramatically in either
direction: a single late-arriving row with, say, a years-old key is
enough to make the segment overlap almost every range scan. So it's not
the overall correlation between column values and storage location that
matters, but only the min/max column values. Have you ever checked how
well these min/max values correlate with the segment number?
Pretty much the same argument applies to SVM: an update to a single
tuple in a segment is enough to remove the read optimization (the
EFFECTIVE_READ_ONLY property) for the whole segment. The underlying
assumption here (that updates happen mostly in newer segments) is not
quite the same as the one above.
Maybe a combination with CLUSTERing would be worthwhile? Or even
enforced CLUSTERing for the older segments?
> If a segment is being updated, we leave it alone, and maybe never
> actually set the visibility map at all. So overall, this idea seems to
> cover the main use case well, yet with only minimal impact on the
> existing use cases we support.
> As before, I will maintain this proposal on the PG developer Wiki, once
> we get down to detailed design.
Thanks for working out yet another great proposal. I hope to have been
of help with my questions and remarks. ;-)