Re: Dynamic Partitioning using Segment Visibility Maps

From: Gavin Sherry <swm(at)alcove(dot)com(dot)au>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Dynamic Partitioning using Segment Visibility Maps
Date: 2008-01-10 01:22:39
Message-ID: 20080110012238.GA2837@europa.idg.com.au
Lists: pgsql-hackers

Hi Simon,

On Wed, Jan 02, 2008 at 05:56:14PM +0000, Simon Riggs wrote:
> Segment Exclusion
> -----------------
>
> After we note that a segment is read-only we can scan the segment and
> record min/max values for all columns. These are then "implicit
> constraints", which can then be used for segment exclusion in a similar
> manner as we do with the current constraint exclusion mechanism.

What about columns that are varlena? The maximum value could be very long -- 1
GB, in fact. Say we decide not to support those. Well, what about
numeric? What about user-defined types?
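
To make sure I'm reading the proposal right, I imagine the per-segment
test looks roughly like the following (illustrative C only, my names,
nothing like it exists today):

#include "postgres.h"
#include "fmgr.h"

/* One of these per segment, per column tracked for exclusion. */
typedef struct SegmentMinMax
{
    Datum   min;        /* smallest value VACUUM saw in the segment */
    Datum   max;        /* largest value VACUUM saw in the segment */
    bool    valid;      /* false until VACUUM has scanned the segment */
} SegmentMinMax;

/*
 * Can we provably skip this segment for a qual of the form "col = value"?
 * cmpfn is the column type's btree comparison function.
 */
static bool
segment_excluded(SegmentMinMax *smm, Datum value, FmgrInfo *cmpfn)
{
    if (!smm->valid)
        return false;   /* no implicit constraint, must scan */

    if (DatumGetInt32(FunctionCall2(cmpfn, value, smm->min)) < 0)
        return true;    /* value below the segment's minimum */
    if (DatumGetInt32(FunctionCall2(cmpfn, value, smm->max)) > 0)
        return true;    /* value above the segment's maximum */

    return false;       /* overlaps [min,max], have to scan it */
}

For pass-by-reference types those min/max Datums are full copies of the
values, which is exactly where the 1 GB worry above comes from.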

> This would also allow a Merge Join to utilise exclusion for the inner
> plan at execution time. Instead of scanning the whole inner table plan
> from the start, we would be able to start the scan from the appropriate
> segment. This would require us to pass the current value of the outer
> plan down into the inner plan. The typical executor nodes on the inner
> plan would be a SortNode and below that a SeqScan. In that case the
> executor would need to pass the outer value from the MergeJoinNode down
> thru the SortNode to the SeqScan node. The SeqScan node could then
> perform partition exclusion, reducing the time for that scan and also
> reducing the time for the resulting sort. This sounds like it might be
> worth doing in normal cases also. It might turn out that the potentially
> applicable cases are already excluded during planning, I haven't thought
> about that aspect in enough detail yet.

I don't think that would work at all. Say you have skew in the data
across the partitions: you may end up doing a seq scan for every outer
tuple. It would look like a really expensive nested loop.
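
Some rough numbers, purely to illustrate the worst case (1 GB segments,
8 KB blocks, one million outer tuples whose join keys don't arrive in
segment order):

    1 segment = 131,072 blocks
    1 partial rescan per outer tuple, even touching only one
    segment each time = 131,072 * 1,000,000 block reads

That's nested loop arithmetic, not merge join arithmetic.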

> If we collect data for all columns then many of our implicit constraints
> would be useless. e.g. if a column only has a few values and these are
> present in all segments. Matching our predicate against all constraints
> would be expensive, so we must weed out poor constraints. We would do
> this by removing any constraint that overlapped more than 10% of other
> segments. Various heuristics would likely need to be discovered during
> development to make this work without resorting to manual commands.
>
> Note that all of this exclusion is happening in the executor, not the
> planner. That allows this form of exclusion to work with stable
> functions and parameters without problem.

So, in that case, you've totally undercut the planner. All the steps up
the tree would be based on the stats for scanning the entire table, but
you've only returned part of it.

To solve that, you could put the min/max values in a catalog somewhere.
For a 16 TB table, that's 16000 min/max values. That's pretty expensive.
And how do we handle making things non-read-only again? You'd have to
wait for all current queries to finish... that could take a long time.
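
To put numbers on that (1 GB segments, and say we track 10 columns per
segment):

    16 TB / 1 GB per segment      = 16,384 segments
    16,384 segments * 10 columns  = 163,840 min/max pairs

and every query against the table gets to compare its quals against some
share of those at executor startup, never mind keeping them up to date.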

How do we handle crashes during setting of min/max? I presume it needs
WAL support.

What happens if the free space map gives us a free block in a read-only
segment? Then we need to look at concurrency and transactionality of the
min/max values, don't we? If changing the segment makes it not read-only
(which seems to be the case), it would be trivial to totally degrade your
partitioning scheme to a full seq scan -- on a 16 TB table. And getting
the performance back might involve a day or more of VACUUMing. Impossible.
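
As far as I can tell, every path that is handed a target block for an
INSERT or UPDATE would need something like this (hypothetical helpers,
nothing like them exists today):

#include "postgres.h"
#include "storage/block.h"
#include "utils/inval.h"
#include "utils/rel.h"

/* Hypothetical -- these would have to be invented for the proposal. */
extern bool segment_is_effective_read_only(Relation rel, int segno);
extern void clear_segment_visibility(Relation rel, int segno);

/* Called with whatever block the FSM (or relation extension) gave us. */
static void
maybe_demote_segment(Relation rel, BlockNumber targetblk)
{
    int     segno = targetblk / RELSEG_SIZE;

    if (segment_is_effective_read_only(rel, segno))
    {
        /* Demotion: does this need WAL?  What locks?  Who waits? */
        clear_segment_visibility(rel, segno);
        CacheInvalidateRelcache(rel);
    }
}

Once that fires on enough segments you are back to scanning the whole
16 TB until VACUUM earns the flags back.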

> Visibility Map
> --------------
[snip]
> No dynamic shared memory cache is required because any concurrent
> changes to the table would be ignored by a Scan anyway, so it doesn't
> matter if an INSERT, UPDATE or DELETE occurs while we are scanning. Any
> new scans that start will attempt to lock the table and then perform a
> rel cache check before continuing. So the visibility will be set
> correctly for *that* scan at least.

Is that true in the presence of READ COMMITTED?

>
> In most cases the visibility map can be summarised as a single boolean
> to show whether any 100% visible segments exist. That makes accessing
> the map very cheap in the common, unset case.

What do we do in failure scenarios? Does this go into WAL? Is it
replicated via log shipping techniques? You mention Slony support below.
I don't know that Slony's target is Very Big Tables (up to 16 TB), but
even if it were being used for a 'small' system of a few hundred GB, when
you fail over the system has degraded if you aren't vacuuming it.
Moreover, the layout of the data may be different, and therefore so may
the performance of segment exclusion. That's a bit of a surprise.

>
> Setting the Visibility Map
> --------------------------
>
> VACUUM would scan all READ_WRITE_ALLOWED segments and mark some of
> them as EFFECTIVE_READ_ONLY if 100% of the remaining rows are visible to
> all current backends. Marking the map will cause a rel cache
> invalidation.

Argh. Remember, we're talking about 16 TB tables here. With the best of
tuning, a VACUUM of that takes all day. So, day 1, load the data; day 2,
VACUUM it? With the existing partitioning approach there's no such step.
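
Back-of-envelope, just for scale, assuming an optimistic 200 MB/s of
sustained vacuum throughput and ignoring index passes and
vacuum_cost_delay entirely:

    16 TB = 16,777,216 MB
    16,777,216 MB / 200 MB/s ~= 84,000 s ~= 23 hours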

>
> We would never mark the highest numbered segment
> EFFECTIVE_READ_ONLY, since it is likely to expand when INSERTs occur.
> This is also a useful way of excluding smaller tables from the overheads
> involved for this feature.

What about concurrent updates or deletes during VACUUM? Surely we don't
have to VACUUM FULL? Concurrent updates and deletes could change the
partition boundaries, and that will impact the accuracy of segment
exclusion (I do not think it leads to data corruption, but it might lead
to the inclusion of a segment that shouldn't have been included). That
leads to unpredictable behaviour.

>
> In an insert-only table this would mean that only the last 1 or 2
> segments would be read write, so VACUUM would always run in a bounded
> time, no matter how large the table.

You said elsewhere that unload was low cost. What about segments which
cross the unload boundary? Half of their data needs to be removed, say.
That sounds expensive. And they'll have to be vacuumed.

[snip]

> Unsetting the visibility map
> ----------------------------
>

[snip]

With the existing partitioning, we could just truncate a partition. This
seems really expensive in comparison.

> There would be additional complexity in selectivity estimation and plan
> costing. The above proposal allows dynamic segment exclusion, which
> cannot be assessed at planning time anyway, so suggestions welcome...

As I've noted elsewhere, you cannot just undermine the planner like
this. Your plans will be totally bogus. I hope you're not proposing
that. If you're proposing tracking information for each segment, then
there are problems there too.
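
To make the first point concrete with made-up numbers: the planner will
cost the scan as reading every segment, while at run time the implicit
constraints might exclude 90% of them.

    planner's cost basis:  16,384 segments * 131,072 blocks ~= 2.1e9 block reads
    what actually happens:   1,638 segments * 131,072 blocks ~= 2.1e8 block reads

Every decision taken against that scan -- seq scan versus index scan,
join method, join order -- was made with a cost estimate that is an
order of magnitude off.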

> Comparison with other Partitioning approaches
> ---------------------------------------------
>
> Once I realised this was possible in fairly automatic way, I've tried
> hard to keep away from manual overrides, commands and new DDL.
>
> Declarative partitioning is a big overhead, though worth it for large
> databases. No overhead is *much* better though.

I think you're making this out to be harder than it is. Dealing with
lots of data is difficult; issuing one DDL command is not such a big
deal.

Thanks,

Gavin
