Re: [PERFORM] DELETE vs TRUNCATE explanation

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Daniel Farina <daniel(at)heroku(dot)com>, Craig Ringer <ringerc(at)ringerc(dot)id(dot)au>, Harold A(dot) Giménez <harold(dot)gimenez(at)gmail(dot)com>
Subject: Re: [PERFORM] DELETE vs TRUNCATE explanation
Date: 2012-07-19 20:03:20
Message-ID: CA+TgmobkLaBJa5+HG9VsV_MNJGiVEBymio-msdKETZiPk1q36Q@mail.gmail.com
Lists: pgsql-hackers pgsql-performance

On Thu, Jul 19, 2012 at 2:57 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
>> On Thu, Jul 19, 2012 at 10:09 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>>> What about checking just the immediately previous entry? This would
>>> at least fix the problem for bulk-load situations, and the cost ought
>>> to be about negligible compared to acquiring the LWLock.
>
>> 2. You say "fix the problem" but I'm not exactly clear what problem
>> you think this fixes.
>
> What I'm concerned about is that there is going to be a great deal more
> fsync request queue traffic in 9.2 than there ever was before, as a
> consequence of the bgwriter/checkpointer split. The design expectation
> for this mechanism was that most fsync requests would be generated
> locally inside the bgwriter and thus go straight into the hash table
> without having to go through the shared-memory queue. I admit that
> we have seen no benchmarks showing that there's a problem, but that's
> because up till yesterday the bgwriter was failing to transmit such
> messages at all. So I'm looking for ways to cut the overhead.
>
> But having said that, maybe we should not panic until we actually see
> some benchmarks showing the problem.

+1 for not panicking. I'm prepared to believe that there could be a
problem here, but I'm not prepared to believe that we've characterized
it well enough to be certain that any changes we choose to make will
make things better, not worse.
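
(For concreteness, here is what I understand the proposed check to be --
a sketch only, with the shared-memory struct and field names
approximated, not actual tested code:)

    /* in ForwardFsyncRequest, after acquiring CheckpointerCommLock */
    if (CheckpointerShmem->num_requests > 0)
    {
        CheckpointerRequest *prev =
            &CheckpointerShmem->requests[CheckpointerShmem->num_requests - 1];

        /* identical to the request we're about to queue?  skip it */
        if (RelFileNodeEquals(prev->rnode, rnode) &&
            prev->forknum == forknum &&
            prev->segno == segno)
        {
            LWLockRelease(CheckpointerCommLock);
            return true;        /* already queued; nothing more to do */
        }
    }

That way a run of identical bulk-load requests collapses to a single
queue entry without taking any extra locks.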

> Meanwhile, we do know there's a problem with FORGET_RELATION_FSYNC.
> I have been looking at the two-hash-tables design I suggested before,
> and realized that there's a timing issue: if we just stuff "forget"
> requests into a separate table, there is no method for determining
> whether a given fsync request arrived before or after a given forget
> request. This is problematic if the relfilenode gets recycled: we
> need to be able to guarantee that a previously-posted forget request
> won't cancel a valid fsync for the new relation. I believe this is
> soluble though, if we merge the "forget" requests with unlink requests,
> because a relfilenode can't be recycled until we do the unlink.
> So as far as the code goes:
>
> 1. Convert the PendingUnlinkEntry linked list to a hash table keyed by
> RelFileNode. It acts the same as before, and shouldn't be materially
> slower to process, but now we can determine in O(1) time whether there
> is a pending unlink for a relfilenode.
>
> 2. Treat the existence of a pending unlink request as a relation-wide
> fsync cancel; so the loop in mdsync needs one extra hashtable lookup
> to determine validity of a PendingOperationEntry. As before, this
> should not matter much considering that we're about to do an fsync().
>
> 3. Tweak mdunlink so that it does not send a FORGET_RELATION_FSYNC
> message if it is sending an UNLINK_RELATION_REQUEST. (A side benefit
> is that this gives us another 2X reduction in fsync queue traffic,
> and not just any queue traffic but the type of traffic that we must
> not fail to queue.)
>
> The FORGET_RELATION_FSYNC code path will still exist, and will still
> require a full hashtable scan, but we don't care because it isn't
> being used in common situations. It would only be needed for stuff
> like killing an init fork.
>
> The argument that this is safe involves these points:
>
> * mdunlink cannot send UNLINK_RELATION_REQUEST until it's done
> ftruncate on the main fork's first segment, because otherwise that
> segment could theoretically get unlinked from under it before it can do
> the truncate. But this is okay since the ftruncate won't cause any
> fsync that the checkpointer might concurrently be doing to fail. The
> request *will* be sent before we unlink any other files, so mdsync
> will be able to recover if it gets an fsync failure due to concurrent
> unlink.
>
> * Because a relfilenode cannot be recycled until we process and delete
> the PendingUnlinkEntry during mdpostckpt, it is not possible for valid
> new fsync requests to arrive while the PendingUnlinkEntry still exists
> to cause them to be considered canceled.
>
> * Because we only process and delete PendingUnlinkEntrys that have been
> there since before the checkpoint started, we can be sure that any
> PendingOperationEntrys referring to the relfilenode will have been
> scanned and deleted by mdsync before we remove the PendingUnlinkEntry.
>
> Unless somebody sees a hole in this logic, I'll go make this happen.
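
(If I'm reading the plan right, the key pieces would look roughly like
this -- a sketch with names approximated, not working code:)

    /* 1. pending unlinks keyed by relfilenode, so lookup is O(1) */
    typedef struct PendingUnlinkEntry
    {
        RelFileNode rnode;      /* hash key */
        CycleCtr    cycle_ctr;  /* checkpoint cycle it was queued in */
    } PendingUnlinkEntry;

    /* 2. in the mdsync loop, a pending unlink cancels the fsync */
    if (hash_search(pendingUnlinks, &entry->tag.rnode,
                    HASH_FIND, NULL) != NULL)
        continue;               /* relation is going away; skip it */

    /* 3. in mdunlink, the UNLINK_RELATION_REQUEST itself implies the
     * fsync cancel, so no separate FORGET_RELATION_FSYNC is sent */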

What if we change the hash table to have RelFileNode as the key and an
array of MAX_FORKNUM bitmapsets as the value? Then when you get a
"forget" request, you can just zap all the sets to empty. That seems
a whole lot simpler than your proposal and I don't see any real
downside. I can't actually poke a hole in your logic at the moment
but a simpler system that requires no assumptions about filesystem
behavior seems preferable to me.
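
Roughly what I have in mind (a sketch only, names approximated):

    /* one hash entry per relfilenode; per-fork bitmapsets of the
     * segment numbers that still need fsync'ing */
    typedef struct PendingOperationEntry
    {
        RelFileNode rnode;                      /* hash key */
        Bitmapset  *requests[MAX_FORKNUM + 1];  /* dirty segnos per fork */
        CycleCtr    cycle_ctr;
    } PendingOperationEntry;

    /* remembering an fsync request */
    entry->requests[forknum] =
        bms_add_member(entry->requests[forknum], (int) segno);

    /* handling FORGET_RELATION_FSYNC: zap all the sets to empty */
    for (forknum = 0; forknum <= MAX_FORKNUM; forknum++)
    {
        bms_free(entry->requests[forknum]);
        entry->requests[forknum] = NULL;
    }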

You can still make an unlink request imply a corresponding
forget-request if you want, but now that's a separate optimization.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
