Re: [PERFORM] DELETE vs TRUNCATE explanation

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Daniel Farina <daniel(at)heroku(dot)com>, Craig Ringer <ringerc(at)ringerc(dot)id(dot)au>, Harold A(dot) Giménez <harold(dot)gimenez(at)gmail(dot)com>
Subject: Re: [PERFORM] DELETE vs TRUNCATE explanation
Date: 2012-07-15 18:29:27
Message-ID: 19949.1342376967@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers pgsql-performance

Jeff Janes <jeff(dot)janes(at)gmail(dot)com> writes:
> On Thu, Jul 12, 2012 at 9:55 PM, Jeff Janes <jeff(dot)janes(at)gmail(dot)com> wrote:
>> The topic was poor performance when truncating lots of small tables
>> repeatedly on test environments with fsync=off.
>>
>> On Thu, Jul 12, 2012 at 6:00 PM, Jeff Janes <jeff(dot)janes(at)gmail(dot)com> wrote:
>>> I think the problem is in the Fsync Absorption queue. Every truncate
>>> adds a FORGET_RELATION_FSYNC to the queue, and processing each one of
>>> those leads to sequential scanning the checkpointer's pending ops hash
>>> table, which is quite large. It is almost entirely full of other
>>> requests which have already been canceled, but it still has to dig
>>> through them all. So this is essentially an N^2 operation.

> The attached patch addresses this problem by deleting the entry when
> it is safe to do so, and flagging it as canceled otherwise.

I don't like this patch at all. It seems ugly and not terribly safe,
and it won't help at all when the checkpointer is in the midst of an
mdsync scan, which is a nontrivial part of its cycle.

I think what we ought to do is bite the bullet and refactor the
representation of the pendingOps table. What I'm thinking about
is reducing the hash key to just RelFileNodeBackend + ForkNumber,
so that there's one hashtable entry per fork, and then storing a
bitmap to indicate which segment numbers need to be sync'd. At
one gigabyte to the bit, I think we could expect the bitmap would
not get terribly large. We'd still have a "cancel" flag in each
hash entry, but it'd apply to the whole relation fork not each
segment.

If we did this then the FORGET_RELATION_FSYNC code path could use
a hashtable lookup instead of having to traverse the table
linearly; and that would get rid of the O(N^2) performance issue.
The performance of FORGET_DATABASE_FSYNC might still suck, but
DROP DATABASE is a pretty heavyweight operation anyhow.

I'm willing to have a go at coding this design if it sounds sane.
Comments?

> Also, I still wonder if it is worth memorizing fsyncs (under
> fsync=off) that may or may not ever take place. Is there any
> guarantee that we can make by doing so, that couldn't be made
> otherwise?

Yeah, you have a point there. It's not real clear that switching fsync
from off to on is an operation that we can make any guarantees about,
short of executing something like the code recently added to initdb
to force-sync the entire PGDATA tree. Perhaps we should change fsync
to be PGC_POSTMASTER (ie frozen at postmaster start), and then we could
skip forwarding fsync requests when it's off?

regards, tom lane

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Gurjeet Singh 2012-07-15 18:53:08 Re: Getting rid of pre-assignment of index names in CREATE TABLE LIKE
Previous Message Tom Lane 2012-07-15 15:49:54 Re: Getting rid of pre-assignment of index names in CREATE TABLE LIKE

Browse pgsql-performance by date

  From Date Subject
Next Message Tom Lane 2012-07-15 22:37:12 Re: [PERFORM] DELETE vs TRUNCATE explanation
Previous Message Ioannis Anagnostopoulos 2012-07-15 01:14:45 Index slow down insertions...