On 8/30/07, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> "Pavan Deolasee" <pavan(dot)deolasee(at)gmail(dot)com> writes:
> > Please see the version 14 of HOT patch attached.
> I expected to find either a large new README, or some pretty substantial
> additions to existing README files, to document how this all works.
> The comments included do not represent nearly enough documentation.
I shall take that up. There are couple of documents posted by Heikki and
apart from several email threads and original design doc by Simon. I shall
consolidate everything in a single README.
One thing I was unable to discern from the comments is how CREATE INDEX
> can work at all. A new index might mean that tuples that could formerly
> legally be part of the same hot-chain no longer can. I can't find any
> code here that looks like it's addressing that.
You are right - a new index might mean that an existing HOT chain
is broken as far as the new index is concerned. The way we address
that is by indexing the root tuple of the chain, but the index key is
extracted from the last tuple in the chain. The index is marked "unusable"
for all those existing transactions which can potentially see any
tuples in the chain.
Please see this document written by Greg Stark. He has nicely summarized
how CREATE INDEX and CREATE INDEX CONCURRENTLY works with HOT.
I also don't think I
> believe the reasoning for not indexing DELETE_IN_PROGRESS hot-updated
> tuples: what if the index completion commits, but the concurrent delete
> rolls back? Then you've got a valid tuple that's not in the index.
Since CREATE INDEX works with ShareLock on the relation, we can
safely assume that we can't find any DELETE_IN_PROGRESS tuples except
those deleted by our own transaction. The only exception is system relation,
but we don't do HOT updates on system relation. Given that, it seems OK
to me to ignore these tuples because if the transaction aborts,
CREATE INDEX is aborted as well. Am I overlooking something here ?
There is a comment to this regard in the current code as well.
> I also took this opportunity to remove the modularity invasion caused
> > by heap_check_idxupdate() since it was using resultRelInfo. We now
> > build the list of attributes that must be checked to satisfy HOT update.
> > This list includes all the index columns, columns in the partial index
> > predicates and expression index expressions and is built in the
> > executor.
> The executor is the wrong place for that: I'm not sure why you think
> that making heapam depend on the executor is a modularity improvement.
Earlier (in the previous version of HOT patch) we were passing ResultRelInfo
to heap_update, which was more ugly. The comment is in that context :-)
Furthermore this approach requires recalculating the list during
> each query, which is wasteful when it could only change during schema
> updates. I'd suggest making the relcache responsible for computing and
> saving this data, along the same lines as RelationGetIndexList().
> (That also means that heapam can get the data from the relcache, saving
> a lot of API-bloating from passing around Attrids explicitly.)
> Also, rather than inventing Attrids, I'd suggest just using one
> Bitmapset with the convention that its contents are offset by
I liked all these suggestions. I know I thought about computing
the attribute list in relcache, but probably avoided to keep things simple.
I shall make these changes.
The redefinition of the value of MaxHeapTuplesPerPage seems pretty
> ugly and unprincipled. I think it'd be better to leave it as-is,
> and just enforce that we don't allow more than that many line pointers
> on a heap page (as indeed you have to do anyway, so it's not clear
> what the change is buying).
The only reason to redefine MaxHeapTuplesPerPage to higher side is
because HOT allows us to accommodate more tuples per page because
of redirect-dead line pointers. For a table with sufficiently large tuples,
the original bound would work well, but for very small tuples - we might
hit the line pointer limit even if there is free space available in the
page. Doubling the value is chosen as a balance between heap page
utilization, line pointer bloating and overhead for bitmap scans. But
I agree that the factor choice is rather arbitrary.
> Is it really necessary to add hot_update to xl_heap_update? Seems the
> information should be available from the tuple header fields.
I think its not necessary. The reason I did that way because the new block
might be backed up in the WAL record and hence we might have recorded
the new tuple infomasks. But we can surely handle that corner case. I shall
Have you demonstrated that the "prune_hard" logic is worth its weight?
> Considering that many of us want to drop VACUUM FULL, adding more
> complexity to it doesn't seem like a profitable use of time.
The prune_hard code is lot more simple in this version (it was horrible
before Heikki and I reworked the pruning logic). We just follow the HOT
chain to the last definitely DEAD tuple and remove all tuples preceding
that. This simplifies the second phase of VACUUM FULL since
we don't need to handle any broken HOT chains because of intermediate
DEAD tuples. Also, prune_hard allows us to cleanup the redirected
Another reason for doing this is to avoid any significant changes to
VACUUM FULL code since the code itself is complex and we might
just drop this in future. But until that time, we need to make sure that
HOT works fine with VACUUM FULL.
Is it really safe, or productive, to run heap_page_prune when the buffer
> is not locked for cleanup (ie, there are other people with pins on it)?
It looks safe to me. We don't move tuples around during chain pruning.
We only fix the HOT chains by removing intermediate DEAD tuples.
Other places where we follow HOT chains, we do so while holding
at-least SHARE lock on the buffer. So there are no race conditions here.
A place where this bite me is that the heap-scan with SnapshotAny
may return a tuple which gets pruned (and marked ~LP_USED)
before the caller can see it. Fortunately there are only limited users
of SnapshotAny. We make sure that IndexBuildHeapScan confirms that
the ItemIdIsValid() after acquiring lock on the buffer. Cluster works with
exclusive lock on the relation and hence there is no concurrent pruning
Even if it's safe, ISTM what you're mostly accomplishing there is to
> expend a lot of cycles while holding exclusive lock on the page, when
> there is good reason to think that you're blocking other people who are
> interested in using the page. Eliminating the separation between that
> and cleanup would also allow eliminating the separate "PD_FRAGMENTED"
> page state.
The reason we did it that way because repairing fragmentation seems
much more costly that pruning. Please note that we prune a single
chain during index fetch. Its only for heap-scans (and VACUUM) that
we try to prune all chains in the page. So unless we are doing heap-scan,
I am not sure if we are spending too much time holding the exclusive
lock. I agree we don't have any specific numbers to prove that though.
Another reasoning behind separating these two steps is: pruning
requires exclusive lock whereas repairing fragmentation requires
cleanup lock. Also we want to prune the chains aggressively
because the visible tuple is most likely at the end of the chain. Longer
the chain, greater the cost to fetch the tuple.
PlanSetValidForThisTransaction is completely bogus --- it's not
> re-entrant, and it needs to be. I think you need some state in
> PlannerGlobal instead.
I was not happy with this - frankly I don't understand the planning code
much. Thanks for pointing me to the appropriate code. I shall try fix this,
unless you would like to do it yourself.
rd_avgfsm seems fairly bogus ... when does it get updated?
Yeah, its not the best solution. What I wanted to do is delay repairing
page fragmentation as far as possible. But I noticed that fsmrel->avgRequest
(rd_avgfsm is initialized to it) starts with a small value (I guess 256).
For a table with large tuples, we may not repair fragmentation at all, thus
causing unnecessary COLD updates.
rd_avgfsm is updated when the relcache is rebuilt. Since we don't
send out any relcache invalidation when fsmrel->avgRequest changes,
rd_avgfsm may never get updated in a session.
Should we just revert to checking if the free space is less than a certain
percentage of BLCKSZ and trigger defragmentation ? Or may be do a
combination of both ?
In response to
pgsql-patches by date
|Next:||From: Gregory Stark||Date: 2007-08-30 08:58:23|
|Subject: Re: HOT patch - version 14|
|Previous:||From: Tom Lane||Date: 2007-08-29 22:01:45|
|Subject: Re: HOT patch - version 14 |