Re: Process local hint bit cache

From: Merlin Moncure <mmoncure(at)gmail(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Greg Stark <gsstark(at)mit(dot)edu>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Process local hint bit cache
Date: 2011-04-01 16:43:14
Message-ID: AANLkTimURatxi+cboMeK+do7-eU4k4UACREhjVHW=ON+@mail.gmail.com
Lists: pgsql-hackers

On Thu, Mar 31, 2011 at 5:38 PM, Merlin Moncure <mmoncure(at)gmail(dot)com> wrote:
> On Thu, Mar 31, 2011 at 5:33 PM, Merlin Moncure <mmoncure(at)gmail(dot)com> wrote:
>> On Wed, Mar 30, 2011 at 2:35 PM, Merlin Moncure <mmoncure(at)gmail(dot)com> wrote:
>>> On Wed, Mar 30, 2011 at 11:23 AM, Heikki Linnakangas
>>> <heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
>>>> On 30.03.2011 18:02, Robert Haas wrote:
>>>>>
>>>>> On Wed, Mar 30, 2011 at 10:40 AM, Greg Stark<gsstark(at)mit(dot)edu>  wrote:
>>>>>>
>>>>>> But one way or another the hint bits have to get set sometime. The
>>>>>> sooner that happens the less clog i/o has to happen in the meantime.
>>>>>
>>>>> I talked about this with Merlin a bit yesterday.  I think that his
>>>>> thought is that most transactions will access a small enough number of
>>>>> distinct CLOG pages, and the cache accesses might be fast enough, that
>>>>> we really wouldn't need to get the hint bits set, or at least that
>>>>> vacuum/freeze time would be soon enough.  I'm not sure if that's
>>>>> actually true, though - I think the overhead of the cache might be
>>>>> higher than he's imagining.  However, there's a sure-fire way to find
>>>>> out... code it up and see how it plays.
>>>>
>>>> I did a little experiment: I hacked SetHintBits() to do nothing, so that
>>>> hint bits are never set. I then created a table with 100000 rows in it:
>>>>
>>>> postgres=# \d foo
>>>>      Table "public.foo"
>>>>  Column |  Type   | Modifiers
>>>> --------+---------+-----------
>>>>  a      | integer |
>>>>
>>>> postgres=# INSERT INTO foo SELECT generate_series(1, 100000);
>>>> INSERT 0 100000
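
The no-op hack mentioned above might look something like this (a
sketch, assuming the stock SetHintBits() signature in
src/backend/utils/time/tqual.c):

static inline void
SetHintBits(HeapTupleHeader tuple, Buffer buffer,
            uint16 infomask, TransactionId xid)
{
    return;             /* experiment: never set hint bits */
}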
>>>>
>>>> And ran queries on the table:
>>>>
>>>> postgres=# do $$
>>>> declare
>>>>  i int4;
>>>> begin
>>>>  loop
>>>>    perform COUNT(*) FROM foo;
>>>>  end loop;
>>>> end;
>>>> $$;
>>>>
>>>> This test case is designed to exercise the visibility tests as much as
>>>> possible. However, all the tuples are loaded in one transaction, so the
>>>> one-element cache in TransactionLogFetch is 100% effective.
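
A sketch of the one-element cache being referred to (the pattern lives
in transam.c's TransactionLogFetch(); treat this as illustrative, not
a verbatim excerpt):

static TransactionId cachedFetchXid = InvalidTransactionId;
static XidStatus cachedFetchXidStatus;

static XidStatus
TransactionLogFetch(TransactionId transactionId)
{
    /* fast path: a repeat lookup of the same xid skips the clog read */
    if (TransactionIdEquals(transactionId, cachedFetchXid))
        return cachedFetchXidStatus;

    /* ...otherwise look the xid up in clog and refill the cache... */
}

Since all the test tuples were created by a single transaction, every
lookup hits this cache.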
>>>>
>>>> I ran oprofile on that. It shows that about 15% of the time is spent in
>>>> HeapTupleSatisfiesMVCC and its subroutines. 6.6% is spent in
>>>> HeapTupleSatisfiesMVCC itself. Here's the breakdown of that:
>>>>
>>>> $ opreport  -c --global-percent
>>>>
>>>> CPU: Intel Architectural Perfmon, speed 2266 MHz (estimated)
>>>> Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit
>>>> mask of 0x00 (No unit mask) count 100000
>>>> samples  %        app name                 symbol name
>>>> ...
>>>> -------------------------------------------------------------------------------
>>>>   2143      0.4419  postgres                 heapgettup_pagemode
>>>>   73277    15.1091  postgres                 heapgetpage
>>>> 31858      6.5688  postgres                 HeapTupleSatisfiesMVCC
>>>>   31858     6.5688  postgres                 HeapTupleSatisfiesMVCC [self]
>>>>   12809     2.6411  postgres                 TransactionIdIsInProgress
>>>>   12091     2.4931  postgres                 XidInMVCCSnapshot
>>>>   7150      1.4743  postgres                 TransactionIdIsCurrentTransactionId
>>>>   7056      1.4549  postgres                 TransactionIdDidCommit
>>>>   1839      0.3792  postgres                 TransactionIdPrecedes
>>>>   1467      0.3025  postgres                 SetHintBits
>>>>   1155      0.2382  postgres                 TransactionLogFetch
>>>> -------------------------------------------------------------------------------
>>>> ...
>>>>
>>>> I then ran the same test again with an unpatched version, to set the hint
>>>> bits. After the hint bits were set, I ran oprofile again:
>>>>
>>>> -------------------------------------------------------------------------------
>>>>  275       0.4986  postgres                 heapgettup_pagemode
>>>>  4459      8.0851  postgres                 heapgetpage
>>>> 3005      5.4487  postgres                 HeapTupleSatisfiesMVCC
>>>>  3005      5.4487  postgres                 HeapTupleSatisfiesMVCC [self]
>>>>  1620      2.9374  postgres                 XidInMVCCSnapshot
>>>>  110       0.1995  postgres                 TransactionIdPrecedes
>>>> -------------------------------------------------------------------------------
>>>>
>>>> So with hint bits set, HeapTupleSatisfiesMVCC accounts for 8% of the total
>>>> CPU time, and without hint bits, 15%.
>>>>
>>>> Even if clog access were free, hint bits would still give a significant
>>>> speedup thanks to skipping all the other overhead like
>>>> TransactionIdIsInProgress() and TransactionIdIsCurrentTransactionId().
>>>> Speeding up clog access is important; when the one-element cache isn't
>>>> saving you, clog access becomes a dominant factor. But you have to
>>>> address that other overhead too before you can get rid of hint bits.
>>>
>>> Here is a patch demonstrating the caching action, but without the
>>> cache table, which isn't done yet -- it only works using the very last
>>> transaction id fetched.  I used macros so I could keep the changes
>>> quite localized.
>>
>> While playing with the data structure to implement an xid cache inside
>> TransactionLogFetch, I learned a nasty lesson.
>> HeapTupleSatisfiesMVCC is super sensitive to any changes, and even
>> jumping through a couple of calls to TransactionLogFetch was
>> measurable in a couple of performance test cases I came up with.  I
>> hauntingly recalled Tom's warnings to beware unexpected performance
>> downsides.
>>
>> In short, regardless of implementation, I unfortunately don't think
>> any caching at or under the clog abstraction is going to work without
>> impacting at least some cases.
>>
>> OTOH, maybe some micro-optimization of HeapTupleSatisfiesMVCC could
>> provide the same benefit I'm going after here.  If you save off the
>> last xid that came back committed in a static variable, you can
>> check that at the same level as if it were a hint bit.  From my
>> testing, this is well and truly 'free' in the sense that it can
>> defer/eliminate clog checks and hint bit i/o.
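
In other words, something like this hypothetical sketch, checked right
alongside the HEAP_XMIN_COMMITTED hint bit in HeapTupleSatisfiesMVCC
(not the actual patch):

static TransactionId last_committed_xid = InvalidTransactionId;

/* inside HeapTupleSatisfiesMVCC(), xmin handling: */
if ((tuple->t_infomask & HEAP_XMIN_COMMITTED) ||
    TransactionIdEquals(HeapTupleHeaderGetXmin(tuple),
                        last_committed_xid))
{
    /* treat xmin as known committed: no clog visit needed */
}
else if (TransactionIdDidCommit(HeapTupleHeaderGetXmin(tuple)))
{
    last_committed_xid = HeapTupleHeaderGetXmin(tuple);
    SetHintBits(tuple, buffer, HEAP_XMIN_COMMITTED,
                HeapTupleHeaderGetXmin(tuple));
}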
>>
>> The patch attached gives all the advantages I was after in the
>
> hm, this isn't quite right -- if you get a 'last xid cache hit', the
> tuple's hint bits should be set if they are not already (but the page
> should not be dirtied).
>
> working on expanding the cache to # of xids > 1.
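
On a cache hit that amendment would look something like this (again a
hypothetical sketch): update the tuple header in the buffer, but skip
the step that schedules the buffer for writing, so the hint helps later
readers of the in-memory page without ever forcing i/o:

tuple->t_infomask |= HEAP_XMIN_COMMITTED;
/* deliberately NOT calling SetBufferCommitInfoNeedsSave(buffer),
 * which SetHintBits() normally does -- the bit is regenerable,
 * so losing it at buffer eviction costs nothing */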

Patch attached. This is essentially my original idea, except it's
injected directly into tqual.c as a kind of expansion of the 'last
seen' concept. Because the cache loops are inlined and very tight (4
int compares), it's actually cheaper than jumping out of line into
clog just to test this_int == that_int.
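
The cache loop itself might look something like this (a hypothetical
sketch of the approach; the attached patch has the real details):

#define XID_CACHE_SIZE 4    /* small enough to stay "4 int compares" */

static TransactionId xid_cache[XID_CACHE_SIZE];

/* tight, inlinable scan over known-committed xids; a hit means
 * "committed" without an out-of-line call into clog */
static inline bool
XidInCommitCache(TransactionId xid)
{
    int     i;

    for (i = 0; i < XID_CACHE_SIZE; i++)
    {
        if (TransactionIdEquals(xid_cache[i], xid))
            return true;
    }
    return false;
}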

Still needs some more code docs and refactoring/cleanup (I'm not even
going to pretend this is up to pg coding standards), but it passes the
regression tests and gets the performance I'm looking for without the
downsides I was seeing with the other implementation.

Maybe some extra cycles could be eked out by doing reads aligned to the
machine word size (32 or 64 bits). It's pretty fast as is though, at
least on x86. Feel free to take a look.

merlin

Attachment Content-Type Size
hbcache_v2.patch application/octet-stream 6.6 KB
