On Fri, Mar 8, 2013 at 8:58 AM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> I'd be interested in something slightly related that is doing HOT on a
> per-index basis. Currently we don't do hot if any index is updated even
> though quite possibly most of the indexes don't change.
> I think that might result in some quite nice speedups...
Yeah, I agree. The thing that seems tricky is that, at present, once
you have the CTID of a tuple in a HOT chain, you can follow the chain
all the way to the end without inspecting the tuple contents - just
xmin/xmax; and you know that all the tuples in the chain agree on all
indexed attributes. If we relax that, consider a chain A -> B -> C,
where A and B agree on all indexed attributes but B and C agree on
only some of them. At present, only A will have any index pointers.
If we break that assumption, then we the ability to HOT-prune away B
and eventually C without index vac, which is a bummer. If we keep
that assumption, then we can have an index pointer whose attributes
don't match those of the tuple it references, which threatens to break
index-only scans (at least). I haven't been able to think of a way to
make this work out cleanly.
I think it's important to understand whether we're trying to extend
the benefits of HOT update (skipping index insertions) or HOT cleanup
(page-at-a-time vacuuming). They're both significant. If the goal is
to optimize away index maintenance for the columns that aren't
updated, we could change the way vacuum works: when vacuum finds a
dead tuple, instead of scanning for the index pointers and removing
them, it could scan for the index pointers and update them to point to
the new CTID (which is available from looking at the original tuple).
Then, potentially, updates don't need to insert into indexes on
unchanged columns, because scans can follow the CTID pointers forward
where needed. For sanity's sake, it might be best to limit this to
cases where the old and new tuples are on the same page, both to
minimize the overhead during scans, and also because there's not
enough space available in the line pointer to store a full CTID (and
we definitely don't want to postpone truncating tuples to line
pointers to get this optimization). I'm waving my hands wildly here;
I might be totally off-base in thinking that any of this can work.
More broadly, I'm not sure if it's the right thing to optimize. It
essentially aims to speed up updates, but the cost would be slower
scans; indeed, it's hard to think of anything we could do in this area
that wouldn't make scans more complex and therefore potentially
slower, and it's not clear that's the right way to go. HOT cleanups -
or hint bit setting - during scans are already a frequent cause of
performance complaints. We could skip them during read-only
operations but we do them for a good reason: otherwise, repeated scans
can get very painful. So I think there's a decent argument that we're
already over-optimized for writes at the expense of reads.
I'm increasingly tempted to think that our heap representation needs a
much broader rethink. We hear endless complaints in this forum to
the effect that hint bits cause problems, both directly with
performance and indirectly by greatly complicating things for other
features, such as page checksums. We all love HOT, but it doesn't
cover enough cases, and the HOT cleanups can occasionally cause pain.
Freezing sucks. The often-expressed desire to cluster a table and
have it stay clustered is unimplementable. Returning space to the OS
requires painfully expensive maintenance. The 24-byte-per-tuple
overhead seems badly overpriced for insert-only tables. These are not
fundamental truths of the universe, or even of PostgreSQL; they are
specific consequences of the representation we've chosen for heaps.
Many of them are things that we've grown into, rather than designed
with malice aforethought: for example, freezing is a consequence of
the after-the-fact desire to be able to support more than 4bn
transactions over the lifetime of the database. So it's way better
than what we had before, and yet, if we all sat down and designed a
new on-disk storage format for a new product today, I'm sure none of
us would pick one that expires after 2bn transactions.
We can continue to whittle away at these problems incrementally, and I
hope we do, but I've started to feel like we're bumping up against the
limits of what is feasible given the design decisions to which we've
already committed. To take one concrete example, suppose we sat down
and designed a new heap format, and suppose we made it our principal
goal to make it write-once and zero-maintenance. In other words, if a
user sat down and bulk-loaded data into a table, and never modified
it, there would be no further writes to that table after the initial
flush to disk (no hint bits, no visibility map bits, no freezing) and
no scans of the data except as a result of user-initiated activity
(i.e. no scans to see whether there is data needing to be frozen).
Could we do it? I bet we could, or darn close. Will we ever get
there through incremental improvements to what we have now? I doubt
it. Instead, we end up with things like COPY FREEZE. Now that's a
good tool, and, importantly, it's real, whereas everything I'm saying
here is pie in the sky. But it's clearly also a special-case hack,
and more generally I don't know how much more we can do that's going
to be anything BUT a special-case hack. Some, certainly. But I'm not
sure how much.
Curious to hear your thoughts. Is anyone else feeling this same itch?
The Enterprise PostgreSQL Company
In response to
pgsql-hackers by date
|Next:||From: Robert Haas||Date: 2013-03-08 15:31:30|
|Subject: Re: Materialized views WIP patch|
|Previous:||From: Bruce Momjian||Date: 2013-03-08 14:45:59|
|Subject: Re: Materialized views WIP patch|