Re: [PERFORM] DELETE vs TRUNCATE explanation

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Daniel Farina <daniel(at)heroku(dot)com>, Craig Ringer <ringerc(at)ringerc(dot)id(dot)au>, Harold A(dot) Giménez <harold(dot)gimenez(at)gmail(dot)com>
Subject: Re: [PERFORM] DELETE vs TRUNCATE explanation
Date: 2012-07-19 12:56:51
Message-ID: CA+TgmoYg1t3jAOVQLwiCCG6eNcJ8DbVdi_3Vfr=-8-cvxm3zag@mail.gmail.com
Lists: pgsql-hackers pgsql-performance

On Wed, Jul 18, 2012 at 5:17 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> I've been chewing on this issue some more, and no longer like my
> previous proposal, which was
>
>>>> ... What I'm thinking about
>>>> is reducing the hash key to just RelFileNodeBackend + ForkNumber,
>>>> so that there's one hashtable entry per fork, and then storing a
>>>> bitmap to indicate which segment numbers need to be sync'd. At
>>>> one gigabyte to the bit, I think we could expect the bitmap would
>>>> not get terribly large. We'd still have a "cancel" flag in each
>>>> hash entry, but it'd apply to the whole relation fork not each
>>>> segment.
>
> The reason that's not so attractive is the later observation that what
> we really care about optimizing is FORGET_RELATION_FSYNC for all the
> forks of a relation at once, which we could produce just one request
> for with trivial refactoring of smgrunlink/mdunlink. The above
> representation doesn't help for that. So what I'm now thinking is that
> we should create a second hash table, with key RelFileNode only,
> carrying two booleans: a cancel-previous-fsyncs bool and a
> please-unlink-after-checkpoint bool. (The latter field would allow us
> to drop the separate pending-unlinks data structure.) Entries would
> be made in this table when we got a FORGET_RELATION_FSYNC or
> UNLINK_RELATION_REQUEST message -- note that in 99% of cases we'd get
> both message types for each relation, since they're both created during
> DROP. (Maybe we could even combine these request types.) To use the
> table, as we scan the existing per-fork-and-segment hash table, we'd
> have to do a lookup in the per-relation table to see if there was a
> later cancel message for that relation. Now this does add a few cycles
> to the processing of each pendingOpsTable entry in mdsync ... but
> considering that the major work in that loop is an fsync call, it is
> tough to believe that anybody would notice an extra hashtable lookup.
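
To make those two layouts concrete, here is a rough sketch of what the
hash entries might look like -- the type and field names are invented
for illustration and come from no patch, though RelFileNode,
RelFileNodeBackend, ForkNumber, and Bitmapset are the existing backend
types:

#include "postgres.h"
#include "nodes/bitmapset.h"
#include "storage/relfilenode.h"

/* First design: one entry per relation fork, plus a bitmap of segments. */
typedef struct ForkTag
{
    RelFileNodeBackend rnode;
    ForkNumber  forknum;
} ForkTag;

typedef struct PendingForkEntry
{
    ForkTag     key;            /* hash key */
    Bitmapset  *requests;       /* which 1GB segments still need fsync */
    bool        canceled;       /* a cancel applies to the whole fork */
} PendingForkEntry;

/* Second design: a separate per-relation table carrying two flags. */
typedef struct PendingRelEntry
{
    RelFileNode rnode;          /* hash key: the relation, all forks */
    bool        cancel_fsyncs;      /* saw FORGET_RELATION_FSYNC */
    bool        unlink_after_ckpt;  /* saw UNLINK_RELATION_REQUEST */
} PendingRelEntry;

At one gigabyte per bit, even a 32 GB relation needs only 32 bits of
bitmap in the first layout, while the second layout lets a single
relation-level lookup answer "was there a later cancel?" for every
fork of the relation.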

Seems a bit complex, but it might be worth it. Keep in mind that I
eventually want to be able to make an unlogged table logged or vice
versa, which will probably entail unlinking just the init fork (for
the unlogged -> logged direction).

> However, I also came up with an entirely different line of thought,
> which unfortunately seems incompatible with either of the improved
> table designs above. It is this: instead of having a request queue
> that feeds into a hash table hidden within the checkpointer process,
> what about storing the pending-fsyncs table as a shared hash table
> in shared memory? That is, ForwardFsyncRequest would not simply
> try to add the request to a linear array, but would do a HASH_ENTER
> call on a shared hash table. This means the de-duplication occurs
> for free and we no longer need CompactCheckpointerRequestQueue at all.
> Basically, this would amount to saying that the original design was
> wrong to try to micro-optimize the time spent in ForwardFsyncRequest,
> and that we'd rather pay a little more per ForwardFsyncRequest call
> to avoid the enormous response-time spike that will occur when
> CompactCheckpointerRequestQueue has to run. (Not to mention that
> the checkpointer would eventually have to do HASH_ENTER anyway.)
> I think this would address your observation above that the request
> queue tends to contain an awful lot of duplicates.
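
As a sketch of what that would mean for the request path -- every name
below (FsyncRequestKey, SharedFsyncTable, ForwardFsyncRequestShared) is
invented for illustration; only hash_search()/HASH_ENTER_NULL,
ShmemInitHash(), and the LWLock calls are real APIs:

#include "postgres.h"
#include "storage/block.h"
#include "storage/lwlock.h"
#include "storage/relfilenode.h"
#include "utils/hsearch.h"

typedef struct FsyncRequestKey
{
    RelFileNodeBackend rnode;
    ForkNumber  forknum;
    BlockNumber segno;          /* 1GB segment number, as in the queue today */
} FsyncRequestKey;

/* Built once at startup with ShmemInitHash(); the entry is just the key. */
static HTAB *SharedFsyncTable;

static bool
ForwardFsyncRequestShared(RelFileNodeBackend rnode, ForkNumber forknum,
                          BlockNumber segno)
{
    FsyncRequestKey key;
    bool        found;

    MemSet(&key, 0, sizeof(key));   /* zero padding so hash lookups match */
    key.rnode = rnode;
    key.forknum = forknum;
    key.segno = segno;

    LWLockAcquire(CheckpointerCommLock, LW_EXCLUSIVE);

    /*
     * HASH_ENTER_NULL returns NULL instead of erroring out when the table
     * is full, so the caller can fall back (say, to doing its own fsync).
     * A duplicate request simply finds the existing entry: de-duplication
     * for free, as described above.
     */
    if (hash_search(SharedFsyncTable, &key, HASH_ENTER_NULL, &found) == NULL)
    {
        LWLockRelease(CheckpointerCommLock);
        return false;
    }

    LWLockRelease(CheckpointerCommLock);
    return true;
}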

I'm not concerned about the queue *containing* a large number of
duplicates; I'm concerned about the large number of duplicate
*requests*. Under either the current system or this proposal, every
time we write a block, we must take and release CheckpointerCommLock.
Now, I have no evidence that there's actually a bottleneck there, but
if there is, this proposal won't fix it. In fact, I suspect on the
whole it would make things worse, because while it's true that
CompactCheckpointerRequestQueue is expensive, it shouldn't normally be
happening at all, because the checkpointer should be draining the
queue regularly enough to prevent it from filling. So except when the
system is in the pathological state where the checkpointer becomes
unresponsive because it's blocked in-kernel on a very long fsync and
there is a large amount of simultaneous write activity, each process
that acquires CheckpointerCommLock holds it for just long enough to
slam a few bytes of data into the queue, which is very cheap. I
suspect that updating a hash table would be significantly more
expensive, and we'd pay whatever that extra overhead is on every fsync
request, not just in the unusual case where we manage to fill the
queue. So I don't think this is likely to be a win.
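
For comparison, the existing path amounts to something like the
following (a simplified paraphrase, not the actual checkpointer.c code;
QueuedFsyncRequest and FsyncRequestQueue are stand-ins for the real
request structures):

#include "postgres.h"
#include "storage/block.h"
#include "storage/lwlock.h"
#include "storage/relfilenode.h"

typedef struct QueuedFsyncRequest
{
    RelFileNodeBackend rnode;
    ForkNumber  forknum;
    BlockNumber segno;
} QueuedFsyncRequest;

typedef struct FsyncRequestQueue
{
    int         num_requests;   /* current occupancy */
    int         max_requests;   /* allocated length of requests[] */
    QueuedFsyncRequest requests[1];     /* variable-length array */
} FsyncRequestQueue;

/* Lives in shared memory; located at startup (cf. ShmemInitStruct). */
static FsyncRequestQueue *RequestQueue;

static bool
queue_fsync_request(RelFileNodeBackend rnode, ForkNumber forknum,
                    BlockNumber segno)
{
    QueuedFsyncRequest *req;

    LWLockAcquire(CheckpointerCommLock, LW_EXCLUSIVE);

    if (RequestQueue->num_requests >= RequestQueue->max_requests)
    {
        /* Only the rare full-queue case reaches the expensive handling. */
        LWLockRelease(CheckpointerCommLock);
        return false;
    }

    /* Common case: copy a few words into the next free slot and go. */
    req = &RequestQueue->requests[RequestQueue->num_requests++];
    req->rnode = rnode;
    req->forknum = forknum;
    req->segno = segno;

    LWLockRelease(CheckpointerCommLock);
    return true;
}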

If you think about the case of an UPDATE statement that hits a large
number of blocks in the same relation, it sends an fsync request for
every single block. Really, it's only necessary to send a new fsync
request if the checkpointer has begun a new checkpoint cycle in the
meantime; otherwise, the old request is still pending and will cover
the new write as well. But there's no way for the backend doing the
writes to know whether that's happened, so it just sends a request
every time. That's not necessarily a problem, because, again, I have
no evidence whatsoever that CheckpointerCommLock is contended, or that
the overhead of sending those requests is significant. But if it is
then we need a solution that does not require acquisition of a
system-wide lwlock on every block write.
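
Purely to illustrate the check a backend cannot make today: if there
were a cheaply readable shared checkpoint-cycle counter (there isn't,
and every name below is hypothetical), each backend could suppress the
redundant per-block requests with something like:

#include "postgres.h"
#include "storage/block.h"
#include "storage/relfilenode.h"

typedef struct LastFsyncRequest
{
    RelFileNodeBackend rnode;
    ForkNumber  forknum;
    BlockNumber segno;
    uint32      ckpt_cycle;     /* cycle in which we last sent a request */
} LastFsyncRequest;

static LastFsyncRequest last_request;   /* backend-local cache */

static bool
fsync_request_needed(RelFileNodeBackend rnode, ForkNumber forknum,
                     BlockNumber segno, uint32 current_cycle)
{
    if (last_request.ckpt_cycle == current_cycle &&
        last_request.segno == segno &&
        last_request.forknum == forknum &&
        last_request.rnode.backend == rnode.backend &&
        RelFileNodeEquals(last_request.rnode.node, rnode.node))
        return false;           /* the earlier request still covers this write */

    last_request.rnode = rnode;
    last_request.forknum = forknum;
    last_request.segno = segno;
    last_request.ckpt_cycle = current_cycle;
    return true;
}

Maintaining and reading such a counter cheaply and without races is
exactly the part that does not exist, which is why backends just send
a request every time.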

> But I only see how to make that work with the existing hash table
> structure, because with either of the other table designs, it's
> difficult to set a predetermined limit on the amount of shared
> memory needed. The segment-number bitmaps could grow uncomfortably
> large in the first design, while in the second there's no good way
> to know how large the per-relation table has to be to cover a given
> size for the per-fork-and-segment table. (The sore spot here is that
> once we've accepted a per-fork entry, failing to record a relation-level
> cancel for it is not an option, so we can't just return failure.)

Moreover, even if it were technically an option, we know from
experience that failure to absorb fsync requests has disastrous
performance consequences.

> So if we go that way it seems like we still have the problem of
> having to do hash_seq_search to implement a cancel. We could
> possibly arrange for that to be done under shared rather than
> exclusive lock of the hash table, but nonetheless it's not
> really fixing the originally complained-of O(N^2) problem.

Yep. In fact it's making it worse, because AIUI the existing
hash_seq_search calls are happening in backend-private memory while
holding no lock. Doing it on a shared-memory hash table while holding
a high-traffic LWLock figures to be much worse.
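
For reference, the cancel path under discussion is essentially a
full-table walk along these lines (the entry layout is illustrative;
the real table and loop live in md.c):

#include "postgres.h"
#include "storage/block.h"
#include "storage/relfilenode.h"
#include "utils/hsearch.h"

typedef struct PendingOpTag
{
    RelFileNodeBackend rnode;
    ForkNumber  forknum;
    BlockNumber segno;
} PendingOpTag;

typedef struct PendingOpEntry
{
    PendingOpTag tag;           /* hash key */
    bool        canceled;
} PendingOpEntry;

static HTAB *pendingOpsTable;   /* today: private to the checkpointer */

static void
cancel_pending_fsyncs(RelFileNodeBackend rnode, ForkNumber forknum)
{
    HASH_SEQ_STATUS hstat;
    PendingOpEntry *entry;

    /* One FORGET_RELATION_FSYNC costs a scan of every pending entry... */
    hash_seq_init(&hstat, pendingOpsTable);
    while ((entry = (PendingOpEntry *) hash_seq_search(&hstat)) != NULL)
    {
        if (entry->tag.forknum == forknum &&
            entry->tag.rnode.backend == rnode.backend &&
            RelFileNodeEquals(entry->tag.rnode.node, rnode.node))
            entry->canceled = true;     /* mdsync() will skip it later */
    }
    /* ...so dropping N relations means N full scans: the O(N^2) above. */
}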

> Another issue, which might be fatal to the whole thing, is that
> it's not clear that a shared hash table similar in size to the
> existing request array is big enough. The entries basically need
> to live for about one checkpoint cycle, and with a slow cycle
> you could need an arbitrarily large number of them.

Yep.

> A variant that might work a little better is to keep the main
> request table still in checkpointer private memory, but to have
> *both* a small hash table and a request queue in shared memory.
> The idea is that you first try to enter your request in the hash
> table; if successful, done (and de-duping has happened automatically).
> If no room left in the hash table, add it to the request queue as
> normal. The checkpointer periodically empties both the hash table
> and the queue. The hash table probably doesn't have to be too huge
> to be effective at de-duping requests ... but having said that,
> I have no idea exactly how to size it.

I think this is just over-engineered. The originally complained-of
problem was all about the inefficiency of manipulating the
checkpointer's backend-private data structures, right? I don't see
any particular need to mess with the shared memory data structures at
all. If you wanted to add some de-duping logic to retail fsync
requests, you could probably accomplish that more cheaply by having
each such request look at the last half-dozen or so items in the queue
and skip inserting the new request if any of them match the new
request. But I think that'd probably be a net loss, because it would
mean holding the lock for longer.
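
A minimal sketch of that look-back idea, reusing the QueuedFsyncRequest
and RequestQueue definitions from the queue sketch earlier in this mail
(illustrative only; whether the longer hold on CheckpointerCommLock
pays for itself is exactly the question):

#define DEDUP_LOOKBACK 6        /* "last half-dozen or so items" */

static bool
queue_fsync_request_dedup(RelFileNodeBackend rnode, ForkNumber forknum,
                          BlockNumber segno)
{
    QueuedFsyncRequest *req;
    int         start;
    int         i;

    LWLockAcquire(CheckpointerCommLock, LW_EXCLUSIVE);

    /* Peek at the most recent few entries for an identical request. */
    start = Max(RequestQueue->num_requests - DEDUP_LOOKBACK, 0);
    for (i = start; i < RequestQueue->num_requests; i++)
    {
        req = &RequestQueue->requests[i];
        if (req->segno == segno &&
            req->forknum == forknum &&
            req->rnode.backend == rnode.backend &&
            RelFileNodeEquals(req->rnode.node, rnode.node))
        {
            LWLockRelease(CheckpointerCommLock);
            return true;        /* duplicate; nothing new to record */
        }
    }

    if (RequestQueue->num_requests >= RequestQueue->max_requests)
    {
        LWLockRelease(CheckpointerCommLock);
        return false;           /* full; caller falls back as today */
    }

    req = &RequestQueue->requests[RequestQueue->num_requests++];
    req->rnode = rnode;
    req->forknum = forknum;
    req->segno = segno;

    LWLockRelease(CheckpointerCommLock);
    return true;
}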

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
