Re: [PERFORM] DELETE vs TRUNCATE explanation

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Daniel Farina <daniel(at)heroku(dot)com>, Craig Ringer <ringerc(at)ringerc(dot)id(dot)au>, Harold A(dot) Giménez <harold(dot)gimenez(at)gmail(dot)com>
Subject: Re: [PERFORM] DELETE vs TRUNCATE explanation
Date: 2012-07-19 18:57:26
Message-ID: 10862.1342724246@sss.pgh.pa.us
Lists: pgsql-hackers pgsql-performance

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> On Thu, Jul 19, 2012 at 10:09 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> What about checking just the immediately previous entry? This would
>> at least fix the problem for bulk-load situations, and the cost ought
>> to be about negligible compared to acquiring the LWLock.

> 2. You say "fix the problem" but I'm not exactly clear what problem
> you think this fixes.

What I'm concerned about is that there is going to be a great deal more
fsync request queue traffic in 9.2 than there ever was before, as a
consequence of the bgwriter/checkpointer split. The design expectation
for this mechanism was that most fsync requests would be generated
locally inside the bgwriter and thus go straight into the hash table
without having to go through the shared-memory queue. I admit that
we have seen no benchmarks showing that there's a problem, but that's
because up till yesterday the bgwriter was failing to transmit such
messages at all. So I'm looking for ways to cut the overhead.

But having said that, maybe we should not panic until we actually see
some benchmarks showing the problem.
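
Just to make that previous-entry idea concrete, here is a toy
standalone sketch (the struct and function names are made up for
illustration; this is not the actual checkpointer code):

/* Toy model of the shared fsync request queue; illustration only. */
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

typedef struct FsyncRequest
{
    unsigned int relNode;       /* which relation file */
    int          forknum;       /* which fork */
    unsigned int segno;         /* which 1GB segment */
} FsyncRequest;

#define MAX_REQUESTS 1024

static FsyncRequest queue[MAX_REQUESTS];
static int  num_requests = 0;

/*
 * Append a request unless it exactly duplicates the immediately previous
 * one.  Bulk loads tend to emit long runs of identical requests, so this
 * one cheap comparison collapses most of that traffic; the cost is a
 * memcmp made while we'd already be holding the queue lock anyway.
 */
static bool
forward_request(FsyncRequest req)
{
    if (num_requests > 0 &&
        memcmp(&queue[num_requests - 1], &req, sizeof(req)) == 0)
        return true;            /* duplicate of previous entry: skip it */

    if (num_requests >= MAX_REQUESTS)
        return false;           /* queue full; caller must cope */

    queue[num_requests++] = req;
    return true;
}

int
main(void)
{
    /* simulate a bulk load hammering the same segment */
    for (int i = 0; i < 1000; i++)
    {
        FsyncRequest req = {16384, 0, 0};

        forward_request(req);
    }
    printf("requests queued: %d\n", num_requests);  /* prints 1, not 1000 */
    return 0;
}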

Meanwhile, we do know there's a problem with FORGET_RELATION_FSYNC.
I have been looking at the two-hash-tables design I suggested before,
and realized that there's a timing issue: if we just stuff "forget"
requests into a separate table, there is no method for determining
whether a given fsync request arrived before or after a given forget
request. This is problematic if the relfilenode gets recycled: we
need to be able to guarantee that a previously-posted forget request
won't cancel a valid fsync for the new relation. I believe this is
soluble though, if we merge the "forget" requests with unlink requests,
because a relfilenode can't be recycled until we do the unlink.
So as far as the code goes:

1. Convert the PendingUnlinkEntry linked list to a hash table keyed by
RelFileNode. It acts the same as before, and shouldn't be materially
slower to process, but now we can determine in O(1) time whether there
is a pending unlink for a relfilenode. (Points 1 and 2 are sketched in
toy form below, after point 3.)

2. Treat the existence of a pending unlink request as a relation-wide
fsync cancel; so the loop in mdsync needs one extra hashtable lookup
to determine validity of a PendingOperationEntry. As before, this
should not matter much considering that we're about to do an fsync().

3. Tweak mdunlink so that it does not send a FORGET_RELATION_FSYNC
message if it is sending an UNLINK_RELATION_REQUEST. (A side benefit
is that this gives us another 2X reduction in fsync queue traffic,
and not just any queue traffic but the type of traffic that we must
not fail to queue.)
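
To make points 1 and 2 concrete, here is a toy standalone sketch of
the pending-unlink lookup (the real thing would presumably be a
dynahash table in md.c; the names and the fixed-size table here are
purely illustrative):

/*
 * Toy model of the proposed pending-unlink hash table; illustration only.
 * A simplified RelFileNode is the key, and the existence of an entry
 * doubles as a relation-wide fsync cancel.
 */
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

typedef struct RelFileNodeKey
{
    unsigned int spcNode;       /* tablespace */
    unsigned int dbNode;        /* database */
    unsigned int relNode;       /* relation */
} RelFileNodeKey;

typedef struct PendingUnlinkSlot
{
    bool        used;
    RelFileNodeKey key;
    int         cycle_ctr;      /* which checkpoint cycle it arrived in, so
                                 * mdpostckpt can skip entries that are too new */
} PendingUnlinkSlot;

#define TABLE_SIZE 64           /* toy fixed size; a real table would grow */

static PendingUnlinkSlot unlink_table[TABLE_SIZE];

static unsigned int
key_hash(const RelFileNodeKey *k)
{
    /* trivial mixing, standing in for a proper hash function */
    return (k->spcNode * 131071u) ^ (k->dbNode * 8191u) ^ k->relNode;
}

/* Register a pending unlink: what handling UNLINK_RELATION_REQUEST would do. */
static void
remember_unlink(RelFileNodeKey key, int cycle_ctr)
{
    unsigned int pos = key_hash(&key) % TABLE_SIZE;

    while (unlink_table[pos].used &&
           memcmp(&unlink_table[pos].key, &key, sizeof(key)) != 0)
        pos = (pos + 1) % TABLE_SIZE;   /* linear probing */

    unlink_table[pos].used = true;
    unlink_table[pos].key = key;
    unlink_table[pos].cycle_ctr = cycle_ctr;
}

/* The O(1) lookup the mdsync loop would add for each PendingOperationEntry. */
static bool
unlink_pending(RelFileNodeKey key)
{
    unsigned int pos = key_hash(&key) % TABLE_SIZE;

    while (unlink_table[pos].used)
    {
        if (memcmp(&unlink_table[pos].key, &key, sizeof(key)) == 0)
            return true;
        pos = (pos + 1) % TABLE_SIZE;
    }
    return false;
}

int
main(void)
{
    RelFileNodeKey doomed = {1663, 12345, 67890};
    RelFileNodeKey live = {1663, 12345, 11111};

    remember_unlink(doomed, 1);

    /* mdsync: one extra lookup decides whether each fsync is still wanted */
    printf("fsync for doomed rel canceled? %s\n",
           unlink_pending(doomed) ? "yes" : "no");
    printf("fsync for live rel canceled?   %s\n",
           unlink_pending(live) ? "yes" : "no");
    return 0;
}

Point 3 then falls out almost for free: since this lookup already gives
us the forget-fsync behavior, mdunlink need only queue the
UNLINK_RELATION_REQUEST.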

The FORGET_RELATION_FSYNC code path will still exist, and will still
require a full hashtable scan, but we don't care because it isn't
being used in common situations. It would only be needed for stuff
like killing an init fork.

The argument that this is safe involves these points:

* mdunlink cannot send UNLINK_RELATION_REQUEST until it's done
ftruncate on the main fork's first segment, because otherwise that
segment could theoretically get unlinked from under it before it can do
the truncate. But this is okay since the ftruncate won't cause any
fsync the checkpointer might concurrently be doing to fail. The
request *will* be sent before we unlink any other files, so mdsync
will be able to recover if it gets an fsync failure due to concurrent
unlink. (This ordering is restated as a stubbed sketch after these
points.)

* Because a relfilenode cannot be recycled until we process and delete
the PendingUnlinkEntry during mdpostckpt, it is not possible for valid
new fsync requests to arrive while the PendingUnlinkEntry still exists
to cause them to be considered canceled.

* Because we only process and delete PendingUnlinkEntrys that have been
there since before the checkpoint started, we can be sure that any
PendingOperationEntrys referring to the relfilenode will have been
scanned and deleted by mdsync before we remove the PendingUnlinkEntry.
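
Restating the first point's ordering as stubbed-out code (the printf
calls obviously stand in for the real md.c operations; paths and
relfilenode values are made up):

#include <stdio.h>

/* Illustration only: the ordering mdunlink has to respect. */

static void
truncate_first_segment(void)
{
    /* 1. ftruncate the main fork's first segment to zero length */
    printf("ftruncate base/12345/67890 to 0 bytes\n");
}

static void
send_unlink_request(void)
{
    /*
     * 2. Queue UNLINK_RELATION_REQUEST.  With the change above this also
     * serves as the relation-wide fsync cancel, so no separate
     * FORGET_RELATION_FSYNC message is sent on this path.
     */
    printf("forward UNLINK_RELATION_REQUEST for that relfilenode\n");
}

static void
unlink_other_files(void)
{
    /* 3. Only now unlink the remaining segments and the other forks */
    printf("unlink base/12345/67890.1, base/12345/67890_fsm, ...\n");
}

int
main(void)
{
    /*
     * The ordering is the whole point: the ftruncate precedes the request,
     * so the segment can't be unlinked out from under us before we truncate
     * it; and the request precedes the other unlinks, so mdsync can treat a
     * concurrent fsync failure on those files as recoverable.
     */
    truncate_first_segment();
    send_unlink_request();
    unlink_other_files();
    return 0;
}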

Unless somebody sees a hole in this logic, I'll go make this happen.

regards, tom lane
