Re: [HACKERS] Re: [PATCHES] Try again: S_LOCK reduced contention

From: dg(at)illustra(dot)com (David Gould)
To: maillist(at)candle(dot)pha(dot)pa(dot)us (Bruce Momjian)
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: [HACKERS] Re: [PATCHES] Try again: S_LOCK reduced contention
Date: 1998-05-12 20:21:30
Message-ID: 9805122021.AA16043@hawk.illustra.com
Lists: pgsql-hackers

Bruce Momjian:
> David Gould:
> > This is always how micro-optimization goes, 1% and 2% gains here and there.
> > I am very familiar with it.
>
> I said 0.01 seconds at a time, not 1% at a time. At this point, a 0.01
> seconds savings is 12%, because the total test takes 0.08 seconds.

I think I may have been unclear here. I was merely attempting to agree that
optimization is a process of accumulating small gains.

> > Have you done call graph profile for this, or just flat profiling? I think
> > you may find the call graph (gprof) output revealing, although perhaps not
> > on this particular topic...
>
> I ran gprof. I did not look at the call graph, just the total number of
> calls. We have a very modular system, and the call overhead can get
> excessive. gprof shows tas() getting called far more than any other
> function. It shows it as 0.01 seconds, on a 0.08 second test! Now, I
> realize that gprof measurement is not perfect, but it certainly shows
> tas as being called a lot.

I agree the system is sometimes excessively layered, resulting in many
trivial calls.

I agree tas() is called a lot. I am trying to understand if the overhead seen
below is in the call itself, or in the actual work of synchronization.

>   %   cumulative    self               self    total
>  time    seconds   seconds    calls  ms/call  ms/call  name
>  20.0       0.02      0.02                             mcount (463)
>  10.0       0.03      0.01      5288    0.00     0.00  _tas [31]
>  10.0       0.04      0.01      2368    0.00     0.00  _hash_search [30]
>  10.0       0.05      0.01      1631    0.01     0.02  _malloc [11]
>  10.0       0.06      0.01       101    0.10     0.10  _sbrk [35]
>  10.0       0.07      0.01        56    0.18     0.20  _heapgettup [25]
>  10.0       0.08      0.01         4    2.50     2.50  _write [32]
>  10.0       0.09      0.01         2    5.00     5.00  ___sysctl [33]
>  10.0       0.10      0.01         1   10.00    10.41  _cnfify [28]
>   0.0       0.10      0.00      1774    0.00     0.00  _call_hash [468]
>   0.0       0.10      0.00      1604    0.00     0.00  _tag_hash [469]
>   0.0       0.10      0.00      1380    0.00     0.00  _OrderedElemPush [470]
>   0.0       0.10      0.00      1380    0.00     0.00  _OrderedElemPushHead [471]
>   0.0       0.10      0.00      1380    0.00     0.00  _OrderedElemPushInto [472]
>   0.0       0.10      0.00      1375    0.00     0.02  _AllocSetAlloc [12]
>   0.0       0.10      0.00      1375    0.00     0.02  _MemoryContextAlloc [13]
>   0.0       0.10      0.00      1353    0.00     0.02  _palloc [14]
>   0.0       0.10      0.00      1322    0.00     0.01  _SpinAcquire [45]

I asked about the call graph for two reasons (the first of which is part
of another thread):

1) I would expect that the 1353 calls to palloc() are also responsible for:

1375 _MemoryContextAlloc
1375 _AllocSetAlloc
1380 _OrderedElemPushInto
1380 _OrderedElemPush
for a total of (1353 + 1375 + 1375 + 1380 + 1380) = 6863 calls.
(not including the further 1631 _malloc and 101 _sbrk calls).

I am curious why these calls do not seem to show up on the cumulative time.
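
To make the layering concrete, here is a minimal sketch (invented bodies, not
the actual PostgreSQL code; the names merely mimic the functions in the profile
above) of how one palloc() fans out into a chain of trivial one-line calls, so
that a single allocation is counted once at each level of the call graph while
each level has almost no self time of its own:

```c
#include <stdlib.h>

/* Hypothetical stand-ins for the allocator layers seen in the profile.
 * Each layer does almost nothing itself; it only forwards the request,
 * so a flat profile shows ~1375 calls at each level but near-zero
 * self seconds for every one of them. */
static void *AllocSetAlloc(size_t size)
{
    return malloc(size);            /* the only layer doing real work */
}

static void *MemoryContextAlloc(size_t size)
{
    return AllocSetAlloc(size);     /* trivial forwarding call */
}

void *palloc(size_t size)
{
    return MemoryContextAlloc(size);    /* trivial forwarding call */
}
```

If this mirrors the real structure, it would explain the numbers: the call
counts accumulate at every layer, but the cumulative seconds stay attached to
the leaf (_malloc and below), which is consistent with the profile showing
0.00 self time for the intermediate layers.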

2) I wonder how fine the resolution of the profile is. I am assuming that all
the overhead of tas() comes from either:
 - the call overhead, or
 - the actual work done in tas().
Given that, I wonder whether the call overhead is the major part, or whether
the bus/cache synchronization is the real cost. As a function, tas() is easy
to identify. As a macro it does not show up in the profile at all: its
contribution to the overhead is distributed among all the callers, which
makes it much less obvious. I was hoping the call graph would help identify
which is the case.

In any case, I will test with it as a macro and as a function. It may also
be instructive to make a dummy tas() that does nothing, to see whether the
overhead is in the actual synchronization or in the calling. I will test
this too.

My intent here is not to be argumentative. My current mental model is that
excess calls are unfortunate but not especially harmful and not usually
worth changing (sometimes of course an inline sequence allows the optimizer
to do something clever and makes more difference than expected). If this view
is incorrect, I would like to know and understand that so that I can adjust
my theory accordingly.

> I don't understand your hesitation. The code WAS inlined. It was
> inlined because gprof showed it as being called a lot. Most of them are
> ASM anyway, so what does it matter if it sits in a *.c or *.h file, an
> asm() call looks the same in a macro or in a function.

I do not feel strongly about this. I prefer the function on the grounds of
clarity and ease of maintenance and porting. But I would be happy to make it
a macro even if the performance difference is not significant.

> If it makes you feel better, put it in something called tas.h, and add
> it as an include in all the files that include s_lock.h, or have
> s_lock.h include tas.h.

I am fine with folding it all into s_lock.h. If you wish, I will do so.

> I am not looking around for 1% optimization. I am using gprof output to
> improve things that gprof shows need improving.

Please, I am not intending _any_ criticism. I have no disagreement with what
you are doing. I am glad to see us working on performance.

-dg

David Gould dg(at)illustra(dot)com 510.628.3783 or 510.305.9468
Informix Software (No, really) 300 Lakeside Drive Oakland, CA 94612
"My life has been full of wonderful moments -
it's only later that they become embarassing."
-- Gerhard Berger
