Re: [Testperf-general] BufferSync and bgwriter

From: Jan Wieck <JanWieck(at)Yahoo(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Neil Conway <neilc(at)samurai(dot)com>, Mark Wong <markw(at)osdl(dot)org>, pgsql-hackers(at)postgresql(dot)org, testperf-general(at)pgfoundry(dot)org
Subject: Re: [Testperf-general] BufferSync and bgwriter
Date: 2004-12-15 16:10:22
Message-ID: 41C061EE.5030508@Yahoo.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 12/12/2004 5:08 PM, Simon Riggs wrote:

>> On Sun, 2004-12-12 at 05:46, Neil Conway wrote:
>> Simon Riggs wrote:
>> > If the bgwriter_percent = 100, then we should actually do the sensible
>> > thing and prepare the list that we need, i.e. limit
>> > StrategyDirtyBufferList to finding at most bgwriter_maxpages.
>>
>> Is the plan to make bgwriter_percent = 100 the default setting?
>
> Hmm...must confess that my only plan is:
> i) discover dynamic behaviour of bgwriter
> ii) fix any bugs or wierdness as quickly as possible
> iii) try to find a way to set the bgwriter defaults
>
> I'm worried that we're late in the day for changes, but I'm equally
> worried that a) the bgwriter is very tuning sensitive b) we don't really
> have much info on how to set the defaults in a meaningful way for the
> majority of cases c) there are some issues that greatly reduce the
> effectiveness of the bgwriter in many circumstances.
>
> The 100pct.patch was my first attempt at getting something acceptable in
> the next few days that gives sufficient room for the DBA to perform
> tuning.

Doesn't cranking up the bgwriter_percent to 100 effectively make the
entire shared memory a write-through cache? In other words, with 100%
the bgwriter will allways write all dirty blocks out and it becomes
unlikely to avoid an IO for subsequent modificaitons to the same data block.

Jan

>
> On Sun, 2004-12-12 at 05:46, Neil Conway wrote:
>> I wonder if we even need to retain the bgwriter_percent GUC var. Is
>> there actually a situation in which the combination of bgwriter_maxpages
>> and bgwriter_delay does not give the DBA sufficient flexibility in
>> tuning bgwriter behavior?
>
> Yes, I do now think that only two GUCs are required to tune the
> behaviour; but you make me think - which two? Right now, bgwriter_delay
> is useless because the O(N) behaviour makes it impossible to set any
> lower when you have a large shared_buffers. (I see that as a bug)
>
> Your question has made me rethink the exact objective of the bgwriter's
> actions: The way it is coded now the bgwriter looks for dirty blocks, no
> matter where they are in the list. What we are bothered about is the
> number of clean buffers at the LRU, which has a direct influence on the
> probability that BufferAlloc() will need to call FlushBuffer(), since
> StrategyGetBuffer() returns the first unpinned buffer, dirty or not.
> After further thought, I would prefer a subtle change in behaviour so
> that the bgwriter checks that clean blocks are available at the LRUs for
> when buffer replacement occurs. With that slight change, I'd keep the
> bgwriter_percent GUC but make it mean something different.
>
> bgwriter_percent would be the % of shared_buffers that are searched
> (from the LRU end) to see if they contain dirty buffers, which are then
> written to disk. That means the number of dirty blocks written to disk
> is between 0 and the number of buffers searched, but we're not hugely
> bothered what that number is... [This change to StrategyDirtyBufferList
> resolves the unusability of the bgwriter with large shared_buffers]
>
> Writing away dirty blocks towards the MRU end is more likely to be
> wasted effort. If a block stays near the MRU then it will be dirty again
> in the wink of an eye, so you gain nothing at checkpoint time by
> cleaning it. Also, since it isn't near the LRU, cleaning it has no
> effect on buffer replacement I/O. If a block is at the LRU, then it is
> by definition the least likely to be reused, and is a candidate for
> replacement anyway. So concentrating on the LRU, not the number of dirty
> buffers seems to be the better thing to do.
>
> That would then be a much simpler way of setting the defaults. With that
> definition, we would set the defaults:
>
> bgwriter_percent = 2 (according to my new suggestion here)
> bgwriter_delay = 200
> bgwriter_maxpages = -1 (i.e. mostly ignore it, but keep it for fine
> tuning)
>
> Thus, for the default shared_buffers=1000 the bgwriter would clear a
> space of up to 20 blocks each cycle.
> For a config with shared_buffers=60000, the bgwriter default would clear
> space for 600 blocks (max) each cycle - a reasonable setting.
>
> Overall that would need very little specific tuning, because it would
> scale upwards as you changed the shared_buffers higher.
>
> So, that interpretation of bgwriter_percent gives these advantages:
> - we bound the StrategyDirtyBufferList scan to a small % of the whole
> list, rather than the whole list...so we could realistically set the
> bgwriter_delay lower if required
> - we can set a default that scales, so would not often need to change it
> - the parameter is defined in terms of the thing we really care about:
> sufficient clean blocks at the LRU of the buffer lists
> - these changes are very isolated and actually minor - just a different
> way of specifying which buffers the bgwriter will clean
>
> Patch attached...again for discussion and to help understanding of this
> proposal. Will submit to patches if we agree it seems like the best way
> to allow the bgwriter defaults to be sensibly set.
>
> [...and yes, everybody, I do know where we are in the release cycle]
>
>
>
> ------------------------------------------------------------------------
>
> Index: buffer/bufmgr.c
> ===================================================================
> RCS file: /projects/cvsroot/pgsql/src/backend/storage/buffer/bufmgr.c,v
> retrieving revision 1.182
> diff -d -c -r1.182 bufmgr.c
> *** buffer/bufmgr.c 24 Nov 2004 02:56:17 -0000 1.182
> --- buffer/bufmgr.c 12 Dec 2004 21:53:10 -0000
> ***************
> *** 681,686 ****
> --- 681,687 ----
> {
> BufferDesc **dirty_buffers;
> BufferTag *buftags;
> + int maxdirty;
> int num_buffer_dirty;
> int i;
>
> ***************
> *** 688,717 ****
> if (percent == 0 || maxpages == 0)
> return 0;
>
> /*
> * Get a list of all currently dirty buffers and how many there are.
> * We do not flush buffers that get dirtied after we started. They
> * have to wait until the next checkpoint.
> */
> ! dirty_buffers = (BufferDesc **) palloc(NBuffers * sizeof(BufferDesc *));
> ! buftags = (BufferTag *) palloc(NBuffers * sizeof(BufferTag));
>
> LWLockAcquire(BufMgrLock, LW_EXCLUSIVE);
> - num_buffer_dirty = StrategyDirtyBufferList(dirty_buffers, buftags,
> - NBuffers);
>
> ! /*
> ! * If called by the background writer, we are usually asked to only
> ! * write out some portion of dirty buffers now, to prevent the IO
> ! * storm at checkpoint time.
> ! */
> ! if (percent > 0)
> ! {
> ! Assert(percent <= 100);
> ! num_buffer_dirty = (num_buffer_dirty * percent + 99) / 100;
> ! }
> ! if (maxpages > 0 && num_buffer_dirty > maxpages)
> ! num_buffer_dirty = maxpages;
>
> /* Make sure we can handle the pin inside the loop */
> ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
> --- 689,720 ----
> if (percent == 0 || maxpages == 0)
> return 0;
>
> + /* Set number of buffers we will clean at LRUs of buffer lists
> + * If no limits set, then clean the whole of shared_buffers
> + */
> + if (maxpages > 0)
> + maxdirty = maxpages;
> + else {
> + if (percent > 0) {
> + Assert(percent <= 100);
> + maxdirty = (NBuffers * percent + 99) / 100;
> + }
> + else
> + maxdirty = NBuffers;
> + }
> +
> /*
> * Get a list of all currently dirty buffers and how many there are.
> * We do not flush buffers that get dirtied after we started. They
> * have to wait until the next checkpoint.
> */
> ! dirty_buffers = (BufferDesc **) palloc(maxdirty * sizeof(BufferDesc *));
> ! buftags = (BufferTag *) palloc(maxdirty * sizeof(BufferTag));
>
> LWLockAcquire(BufMgrLock, LW_EXCLUSIVE);
>
> ! num_buffer_dirty = StrategyDirtyBufferList(dirty_buffers, buftags,
> ! maxdirty);
>
> /* Make sure we can handle the pin inside the loop */
> ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
> Index: buffer/freelist.c
> ===================================================================
> RCS file: /projects/cvsroot/pgsql/src/backend/storage/buffer/freelist.c,v
> retrieving revision 1.48
> diff -d -c -r1.48 freelist.c
> *** buffer/freelist.c 16 Sep 2004 16:58:31 -0000 1.48
> --- buffer/freelist.c 12 Dec 2004 21:53:11 -0000
> ***************
> *** 735,741 ****
> * StrategyDirtyBufferList
> *
> * Returns a list of dirty buffers, in priority order for writing.
> - * Note that the caller may choose not to write them all.
> *
> * The caller must beware of the possibility that a buffer is no longer dirty,
> * or even contains a different page, by the time he reaches it. If it no
> --- 735,740 ----
> ***************
> *** 755,760 ****
> --- 754,760 ----
> int cdb_id_t2;
> int buf_id;
> BufferDesc *buf;
> + int i;
>
> /*
> * Traverse the T1 and T2 list LRU to MRU in "parallel" and add all
> ***************
> *** 765,771 ****
> cdb_id_t1 = StrategyControl->listHead[STRAT_LIST_T1];
> cdb_id_t2 = StrategyControl->listHead[STRAT_LIST_T2];
>
> ! while (cdb_id_t1 >= 0 || cdb_id_t2 >= 0)
> {
> if (cdb_id_t1 >= 0)
> {
> --- 765,771 ----
> cdb_id_t1 = StrategyControl->listHead[STRAT_LIST_T1];
> cdb_id_t2 = StrategyControl->listHead[STRAT_LIST_T2];
>
> ! for (i = 0; i < max_buffers; i++)
> {
> if (cdb_id_t1 >= 0)
> {
> ***************
> *** 779,786 ****
> buffers[num_buffer_dirty] = buf;
> buftags[num_buffer_dirty] = buf->tag;
> num_buffer_dirty++;
> - if (num_buffer_dirty >= max_buffers)
> - break;
> }
> }
>
> --- 779,784 ----
> ***************
> *** 799,806 ****
> buffers[num_buffer_dirty] = buf;
> buftags[num_buffer_dirty] = buf->tag;
> num_buffer_dirty++;
> - if (num_buffer_dirty >= max_buffers)
> - break;
> }
> }
>
> --- 797,802 ----
>
>
> ------------------------------------------------------------------------
>
>
> ---------------------------(end of broadcast)---------------------------
> TIP 1: subscribe and unsubscribe commands go to majordomo(at)postgresql(dot)org

--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me. #
#================================================== JanWieck(at)Yahoo(dot)com #

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Jan Wieck 2004-12-15 16:16:26 Re: [Testperf-general] BufferSync and bgwriter
Previous Message simon 2004-12-15 16:10:01 Re: RE: Re: bgwriter changes