Re: Process local hint bit cache

From: Merlin Moncure <mmoncure(at)gmail(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Greg Stark <gsstark(at)mit(dot)edu>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Process local hint bit cache
Date: 2011-04-01 16:43:14
Message-ID: AANLkTimURatxi+cboMeK+do7-eU4k4UACREhjVHW=ON+@mail.gmail.com
Lists: pgsql-hackers

On Thu, Mar 31, 2011 at 5:38 PM, Merlin Moncure <mmoncure(at)gmail(dot)com> wrote:
> On Thu, Mar 31, 2011 at 5:33 PM, Merlin Moncure <mmoncure(at)gmail(dot)com> wrote:
>> On Wed, Mar 30, 2011 at 2:35 PM, Merlin Moncure <mmoncure(at)gmail(dot)com> wrote:
>>> On Wed, Mar 30, 2011 at 11:23 AM, Heikki Linnakangas
>>> <heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
>>>> On 30.03.2011 18:02, Robert Haas wrote:
>>>>>
>>>>> On Wed, Mar 30, 2011 at 10:40 AM, Greg Stark<gsstark(at)mit(dot)edu>  wrote:
>>>>>>
>>>>>> But one way or another the hint bits have to get set sometime. The
>>>>>> sooner that happens the less clog i/o has to happen in the meantime.
>>>>>
>>>>> I talked about this with Merlin a bit yesterday.  I think that his
>>>>> thought is that most transactions will access a small enough number of
>>>>> distinct CLOG pages, and the cache accesses might be fast enough, that
>>>>> we really wouldn't need to get the hint bits set, or at least that
>>>>> vacuum/freeze time would be soon enough.  I'm not sure if that's
>>>>> actually true, though - I think the overhead of the cache might be
>>>>> higher than he's imagining.  However, there's a sure-fire way to find
>>>>> out... code it up and see how it plays.
>>>>
>>>> I did a little experiment: I hacked SetHintBits() to do nothing, so that
>>>> hint bits are never set. I then created a table with 100000 rows in it:
>>>>
>>>> postgres=# \d foo
>>>>      Table "public.foo"
>>>>  Column |  Type   | Modifiers
>>>> --------+---------+-----------
>>>>  a      | integer |
>>>>
>>>> postgres=# INSERT INTO foo SELECT generate_series(1, 100000);
>>>> INSERT 0 100000
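
The no-op hack mentioned above might look something like this (a
sketch, assuming the stock SetHintBits() signature in
src/backend/utils/time/tqual.c):

static inline void
SetHintBits(HeapTupleHeader tuple, Buffer buffer,
            uint16 infomask, TransactionId xid)
{
    return;             /* experiment: never set hint bits */
}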
>>>>
>>>> And ran queries on the table:
>>>>
>>>> postgres=# do $$
>>>> declare
>>>>  i int4;
>>>> begin
>>>>  loop
>>>>    perform COUNT(*) FROM foo;
>>>>  end loop;
>>>> end;
>>>> $$;
>>>>
>>>> This test case is designed to exercise the visibility tests as much as
>>>> possible. However, all the tuples are loaded in one transaction, so the
>>>> one-element cache in TransactionLogFetch is 100% effective.
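
A sketch of the one-element cache being referred to (the pattern lives
in transam.c's TransactionLogFetch(); treat this as illustrative, not
a verbatim excerpt):

static TransactionId cachedFetchXid = InvalidTransactionId;
static XidStatus cachedFetchXidStatus;

static XidStatus
TransactionLogFetch(TransactionId transactionId)
{
    /* fast path: a repeat lookup of the same xid skips the clog read */
    if (TransactionIdEquals(transactionId, cachedFetchXid))
        return cachedFetchXidStatus;

    /* ...otherwise look the xid up in clog and refill the cache... */
}

Since all the test tuples were created by a single transaction, every
lookup hits this cache.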
>>>>
>>>> I ran oprofile on that. It shows that about 15% of the time is spent in
>>>> HeapTupleSatisfiesMVCC and its subroutines. 6.6% is spent in
>>>> HeapTupleSatisfiesMVCC itself. Here's the breakdown of that:
>>>>
>>>> $ opreport  -c --global-percent
>>>>
>>>> CPU: Intel Architectural Perfmon, speed 2266 MHz (estimated)
>>>> Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit
>>>> mask of 0x00 (No unit mask) count 100000
>>>> samples  %        app name                 symbol name
>>>> ...
>>>> -------------------------------------------------------------------------------
>>>>   2143      0.4419  postgres                 heapgettup_pagemode
>>>>   73277    15.1091  postgres                 heapgetpage
>>>> 31858      6.5688  postgres                 HeapTupleSatisfiesMVCC
>>>>   31858     6.5688  postgres                 HeapTupleSatisfiesMVCC [self]
>>>>   12809     2.6411  postgres                 TransactionIdIsInProgress
>>>>   12091     2.4931  postgres                 XidInMVCCSnapshot
>>>>   7150      1.4743  postgres                 TransactionIdIsCurrentTransactionId
>>>>   7056      1.4549  postgres                 TransactionIdDidCommit
>>>>   1839      0.3792  postgres                 TransactionIdPrecedes
>>>>   1467      0.3025  postgres                 SetHintBits
>>>>   1155      0.2382  postgres                 TransactionLogFetch
>>>> -------------------------------------------------------------------------------
>>>> ...
>>>>
>>>> I then ran the same test again with an unpatched version, to set the hint
>>>> bits. After the hint bits were set, I ran oprofile again:
>>>>
>>>> -------------------------------------------------------------------------------
>>>>  275       0.4986  postgres                 heapgettup_pagemode
>>>>  4459      8.0851  postgres                 heapgetpage
>>>> 3005      5.4487  postgres                 HeapTupleSatisfiesMVCC
>>>>  3005      5.4487  postgres                 HeapTupleSatisfiesMVCC [self]
>>>>  1620      2.9374  postgres                 XidInMVCCSnapshot
>>>>  110       0.1995  postgres                 TransactionIdPrecedes
>>>> -------------------------------------------------------------------------------
>>>>
>>>> So with hint bits set, HeapTupleSatisfiesMVCC accounts for 8% of the total
>>>> CPU time, and without hint bits, 15%.
>>>>
>>>> Even if clog access were free, hint bits would still give a significant
>>>> speedup thanks to skipping all the other overhead like
>>>> TransactionIdIsInProgress() and TransactionIdIsCurrentTransactionId().
>>>> Speeding up clog access is important; when the one-element cache isn't
>>>> saving you, clog access becomes a dominant factor. But you have to
>>>> address that other overhead too before you can get rid of hint bits.
>>>
>>> Here is a patch demonstrating the caching action, but without the
>>> cache table, which isn't done yet -- it only works using the very last
>>> transaction id fetched.  I used macros so I could keep the changes
>>> quite localized.
>>
>> While playing with the data structure to implement an xid cache inside
>> TransactionLogFetch, I learned a nasty lesson.
>> HeapTupleSatisfiesMVCC is super sensitive to any changes, and even
>> jumping through a couple of calls to TransactionLogFetch was
>> measurable in a couple of performance test cases I came up with.  I
>> hauntingly recalled Tom's warnings to beware unexpected performance
>> downsides.
>>
>> In short, regardless of implementation, I unfortunately don't think
>> any caching at or under the clog abstraction is going to work without
>> impacting at least some cases.
>>
>> OTOH, maybe some micro-optimization of HeapTupleSatisfiesMVCC could
>> provide the same benefit I'm going after here.  If you save off the
>> last xid that came back committed in a static variable, you can
>> check that at the same level as if it were a hint bit.  From my
>> testing, this is well and truly 'free' in the sense that it can
>> defer/eliminate clog checks and hint bit i/o.
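
In other words, something like this hypothetical sketch, checked right
alongside the HEAP_XMIN_COMMITTED hint bit in HeapTupleSatisfiesMVCC
(not the actual patch):

static TransactionId last_committed_xid = InvalidTransactionId;

/* inside HeapTupleSatisfiesMVCC(), xmin handling: */
if ((tuple->t_infomask & HEAP_XMIN_COMMITTED) ||
    TransactionIdEquals(HeapTupleHeaderGetXmin(tuple),
                        last_committed_xid))
{
    /* treat xmin as known committed: no clog visit needed */
}
else if (TransactionIdDidCommit(HeapTupleHeaderGetXmin(tuple)))
{
    last_committed_xid = HeapTupleHeaderGetXmin(tuple);
    SetHintBits(tuple, buffer, HEAP_XMIN_COMMITTED,
                HeapTupleHeaderGetXmin(tuple));
}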
>>
>> The patch attached gives all the advantages I was after in the
>
> hm, this isn't quite right -- if you get a 'last xid cache hit', the
> tuple's hint bits should be set if they are not already (but the page
> should not be dirtied).
>
> working on expanding the cache to # of xids > 1.
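
On a cache hit that amendment would look something like this (again a
hypothetical sketch): update the tuple header in the buffer, but skip
the step that schedules the buffer for writing, so the hint helps later
readers of the in-memory page without ever forcing i/o:

tuple->t_infomask |= HEAP_XMIN_COMMITTED;
/* deliberately NOT calling SetBufferCommitInfoNeedsSave(buffer),
 * which SetHintBits() normally does -- the bit is regenerable,
 * so losing it at buffer eviction costs nothing */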

Patch attached. This is essentially my original idea, except it's
injected directly into tqual.c as a kind of expansion of the 'last
seen' concept. Because the cache loops are inlined and very tight (4
int compares), it's actually cheaper than jumping out of line into
clog just to test this_int == that_int.
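
The cache loop itself might look something like this (a hypothetical
sketch of the approach; the attached patch has the real details):

#define XID_CACHE_SIZE 4    /* small enough to stay "4 int compares" */

static TransactionId xid_cache[XID_CACHE_SIZE];

/* tight, inlinable scan over known-committed xids; a hit means
 * "committed" without an out-of-line call into clog */
static inline bool
XidInCommitCache(TransactionId xid)
{
    int     i;

    for (i = 0; i < XID_CACHE_SIZE; i++)
    {
        if (TransactionIdEquals(xid_cache[i], xid))
            return true;
    }
    return false;
}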

Still needs some more code docs and refactoring/cleanup (I'm not even
going to pretend this is up to pg coding standards), but it passes the
regression tests and gets the performance I'm looking for without the
downsides I was seeing with the other implementation.

Maybe some extra cycles could be eked out by doing reads aligned to the
machine word size (32 or 64 bits). It's pretty fast as is though, at
least on x86. Feel free to take a look.

merlin

Attachment Content-Type Size
hbcache_v2.patch application/octet-stream 6.6 KB
