Re: Limit of bgwriter_lru_maxpages of max. 1000?

From: Gerhard Wiesinger <lists(at)wiesinger(dot)com>
To: Greg Smith <gsmith(at)gregsmith(dot)com>
Cc: Scott Marlowe <scott(dot)marlowe(at)gmail(dot)com>, pgsql-general(at)postgresql(dot)org
Subject: Re: Limit of bgwriter_lru_maxpages of max. 1000?
Date: 2009-10-05 18:28:25
Message-ID: alpine.LFD.2.00.0910051928460.15050@bbs.intern
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

On Mon, 5 Oct 2009, Greg Smith wrote:

> On Sun, 4 Oct 2009, Gerhard Wiesinger wrote:
>
>>> On Fri, 2 Oct 2009, Scott Marlowe wrote:
>>>
>>>> I found that lowering checkpoint completion target was what helped.
>>>> Does that seem counter-intuitive to you?
>>>
>>
>> I set it to 0.0 now.
>
> If you set that to 0.0, the whole checkpoing spreading logic doesn't apply
> like it's supposed to. I'm not sure what the results you posted mean now. If
> you had it set to 0 and saw a bad spike (which is how I read your message),
> I'd say "yes, that's what happens when you do reduce that parameter, so don't
> do that". If you meant something else please clarify.

I think the problem is, that it is done on checkpoint time (whether spread
or not). I should have been already be done by bgwriter.

>
> Thanks for the dtrace example, I suggested we add those checkpoint probes in
> there and someone did, but I hadn't seen anybody use them for anything yet.
>

I think more probes (e.g. on different writing conditions like writing
from bgwriter or on a checkpoint) would be interesting here.

>> Bug1: usage_count is IHMO not consistent
>
> It's a bit hack-ish, but the changes made to support multiple buffer use
> strategies introduced by the "Make large sequential scans and VACUUMs work in
> a limited-size ring" commit are reasonable even if they're not as consistent
> as we'd like. Those changes were supported by benchmarks proving their
> utility, which always trump theoretical "that shouldn't work better!" claims
> when profiling performance.
>
> Also, they make sense to me, but I've spent a lot of time staring at
> pg_buffercache output to get a feel for what shows up in there under various
> circumstances. That's where I'd suggest you go if this doesn't seem right to
> you; run some real database tests and use pg_buffercache to see what's inside
> the cache when you're done. What's in there and what I expected to be in
> there weren't always the same thing, and it's interesting to note how that
> changes as shared_buffers increases. I consider some time studying that a
> pre-requisite to analyzing performance of this code.
>

I have analyzed pg_buffercache (query every second, see below) in parallel
to see what happens but I didn't see expected results in some ways with
the usage_counts. Therefore I analyzed the code and found IHMO the problem
with the usage_count and buffer reallocation. Since the code change is
also new (I think it way 05/2009) it might be that you tested before ...

BTW: Is it possible to get everything in pg_class over all databases as
admin?

>> Bug2: Double iteration of buffers
>> As you can seen in the calling tree below there is double iteration with
>> buffers involved. This might be a major performance bottleneck.
>
> Hmmm, this might be a real bug causing scans through the buffer cache to go
> twice as fast as intended.

That's not twice O(2*n)=O(n) that's a factor n*n (outer and inner loop
iteration) which means overall is O(n^2) which is IHMO too much.

> Since the part you suggest is doubled isn't very
> intensive or called all that often, there's no way it can be a major issue
> though.

It is a major issue since it is O(n^2) and not O(n). E.g. with 2GB share
buffer we have 262144 blocks and 68719476736 calls which is far too much.

> That's based on knowing what the code does and how much it was
> called, as well as some confidence that if it were really a *major* problem,
> it would have shown up on the extensive benchmarks done on all the code paths
> you're investigating.
>

The problem might be hidden for the following reasons:
1.) Buffers values are too low that even n^2 is low for today's machines
2.) Code is not often called in that way
3.) backend writes out pages so that the code is never executed
4.) ...

>> BTW: Are there some tests available how fast a buffer cache hit is and a
>> disk cache hit is (not in the buffer cache but in the disk cache)? I'll
>> asked, because a lot of locking is involved in the code.
>
> I did some once but didn't find anything particularly interesting about the
> results. Since you seem to be on a research tear here, it would be helpful
> to have a script to test that out available, I wasn't able to release mine
> and something dtrace based would probably be better than the approach I used
> (I threw a bunch of gettimeofdata calls into the logs and post-processed them
> with a script).
>

Do you have an where one should set tracepoints inside and outside
PostgreSQL?

>> BTW2: Oracle buffercache and background writer strategy is also
>> interesting.
>
> As a rule, we don't post links to other database implementation details here,
> as those can have patented design details we'd prefer not to intentionally
> re-implement. Much of Oracle's design here doesn't apply here anyway, as it
> was done in the era when all of their writes were synchronous. That required
> them to worry about doing a good job on some things in their background
> writer that we shrug off and let os writes combined with fsync handle
> instead.
>

Ok, no problem.

Ciao,
Gerhard

--
http://www.wiesinger.com/

SELECT
CASE
WHEN datname IS NULL THEN pg_buffercache.reldatabase::text
ELSE datname
END AS database,
CASE
WHEN spcname IS NULL THEN pg_buffercache.reltablespace::text
ELSE spcname
END AS tablespace,
CASE
WHEN relname IS NULL THEN pg_buffercache.relfilenode::text
ELSE relname
END AS relation,
CASE
WHEN relkind IS NULL THEN pg_buffercache.relfilenode::text
ELSE relkind
END AS relkind,
usagecount > 0 AS usagecount_gt_0,
isdirty,
MIN(relblocknumber) AS min_blocknumber,
MAX(relblocknumber) AS max_blocknumber,
ROUND(AVG(relblocknumber),2) AS avg_blocknumber,
ROUND(STDDEV(relblocknumber),2) AS stddev_blocknumber,
COUNT(*) AS count
FROM
pg_buffercache
LEFT JOIN pg_class ON pg_buffercache.relfilenode = pg_class.oid
LEFT JOIN pg_tablespace ON pg_buffercache.reltablespace =
pg_tablespace.oid
LEFT JOIN pg_database ON pg_buffercache.reldatabase = pg_database.oid
WHERE
isdirty = true
GROUP BY
CASE
WHEN datname IS NULL THEN pg_buffercache.reldatabase::text
ELSE datname
END,
CASE
WHEN spcname IS NULL THEN pg_buffercache.reltablespace::text
ELSE spcname
END,
CASE
WHEN relkind IS NULL THEN pg_buffercache.relfilenode::text
ELSE relkind
END,
CASE
WHEN relname IS NULL THEN pg_buffercache.relfilenode::text
ELSE relname
END,
usagecount > 0,
isdirty
--HAVING relkind = 'r'
ORDER BY
database,
tablespace,
relkind DESC,
relation,
usagecount > 0,
isdirty
;

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Guy Rouillier 2009-10-05 19:48:44 Re: How useful is the money datatype?
Previous Message John R Pierce 2009-10-05 18:27:08 Re: