Re: Dynamic Partitioning using Segment Visibility Maps

From: Andrew Sullivan <ajs(at)crankycanuck(dot)ca>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Dynamic Partitioning using Segment Visibility Maps
Date: 2008-01-07 15:41:46
Message-ID: 20080107154146.GA18581@crankycanuck.ca
Lists: pgsql-hackers

On Sat, Jan 05, 2008 at 08:02:41PM +0100, Markus Schiltknecht wrote:
> Well, management of relations is easy enough, known to the DBA and most
> importantly: it already exists. Having to set up something which is
> *not* tied to a relation complicates things just because it's an
> additional concept.

But we're already dealing with some complicated concepts.

There isn't anything that will prevent current-style partitioning strategies
from continuing to work in the face of Simon's proposal. But let me see if
I can outline the sort of cases where I see real value in what he's
outlined.

There is a tendency in data systems to gather all manner of data that, in
retrospect, _might_ turn out to be valuable; but which, at the time, is
not really valuable at all. Moreover, the value later on might be
relatively low: if you can learn something much later from that data, and do
so easily, then it will be worth doing. But if the work involved passes
some threshold (say, half a day), it's suddenly not worth it any more. It's
simple economics: below a certain cost, the data is valuable. Above a
certain cost, you simply shouldn't keep the data in the first place, because
the cost of using it is higher than any value you'll likely be able to
extract.

Simon's proposal changes the calculations you have to do. If keeping some
data online longer does not impose administrative or operational overhead
(you have it marked read only, so there's no I/O for vacuum; you don't need
to do anything to get the data marked read only; &c.), then all it costs is
a little more disk, which is relatively cheap these days. More importantly,
if the longer-term effect of this strategy is to make it possible to move
such data offline _without imposing a big cost_ when moving it back online,
then the value is potentially very high.

Without even trying, I can think of a dozen examples in the past 5 years
where I could have used that sort of functionality. Because the cost of
data retrieval was high enough, we had to decide that the question wasn't
worth answering. Some of those answers might have been quite valuable
indeed to the Internet community, to be frank; but because I had to pay the
cost without getting much direct benefit, it just wasn't worth the effort.

The thing about Simon's proposal that is beguiling is that it is aimed at
a very common use pattern. The potential for automatic management under
such a use pattern makes it seem to me to be worth exploring in some detail.

> Agreed. I'd say that's why the DBA needs to be able to define the split
> point between partitions: only he knows the meaning of the data.

I think this is only partly true. A casual glance at the -general list will
reveal all manner of false assumptions on the parts of administrators about
how their data is structured. My experience is that, given that the
computer has way more information about the data than I do, it is more
likely to make the right choice. To the extent it doesn't do so, that's a
problem in the planning (or whatever) algorithms, and it ought to be fixed
there.

A
