Re: Scaling shared buffer eviction

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Scaling shared buffer eviction
Date: 2014-09-02 20:15:30
Message-ID: CA+Tgmoa9=EcNUFLSsweR=MmPOcUQU_8NaZzEcsP_5PH-1HwB6A@mail.gmail.com
Lists: pgsql-hackers

On Thu, Aug 28, 2014 at 7:11 AM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> I have updated the patch to address the feedback. Main changes are:
>
> 1. For populating freelist, have a separate process (bgreclaimer)
> instead of doing it by bgwriter.
> 2. Autotune the low and high threshold values for buffers
> in freelist. I have used the formula as suggested by you upthread.
> 3. Cleanup of locking regimen as discussed upthread (completely
> eliminated BufFreelist Lock).
> 4. Improved comments and general code cleanup.

Overall this looks quite promising to me.

I had thought to call the new process just "bgreclaim" rather than
"bgreclaimer", but perhaps your name is better after all. At least,
it matches what we do elsewhere. But I don't care for the use of
"Bgreclaimer"; let's do "BgReclaimer" if we really need mixed-case, or
else "bgreclaimer".

This is unclear:

+buffers for replacement. Earlier to protect freelist, we use LWLOCK as that
+is needed to perform clock sweep which is a longer operation, however now we
+are using two spinklocks freelist_lck and victimbuf_lck to perform operations
+on freelist and run clock sweep respectively.

I would drop the discussion of what was done before and say something
like this: The data structures relating to buffer eviction are
protected by two spinlocks. freelist_lck protects the freelist and
related data structures, while victimbuf_lck protects information
related to the current clock sweep position.

+always in this list. We also throw buffers into this list if we consider
+their pages unlikely to be needed soon; this is done by background process
+reclaimer. The list is singly-linked using fields in the

I suggest: Allocating pages from this list is much cheaper than
running the "clock sweep" algorithm, which may encounter many buffers
that are poor candidates for eviction before finding a good candidate.
Therefore, we have a background process called bgreclaimer which works
to keep this list populated.

+Background Reclaimer's Processing
+---------------------------------

I suggest titling this section "Background Reclaim".

+The background reclaimer is designed to move buffers to freelist that are

I suggest replacing the first three words of this sentence with "bgreclaimer".

+and move the the unpinned and zero usage count buffers to freelist. It
+keep on doing this until the number of buffers in freelist become equal
+high threshold of freelist.

s/keep/keeps/
s/become equal/reaches the/
s/high threshold/high water mark/
s/of freelist//

Please change the other places that say threshold to use the "water
mark" terminology.
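With that terminology, the trigger and stop conditions read something like this (a hypothetical sketch; the actual autotuned values come from the formula discussed upthread, which is not reproduced here):

```c
/* Hypothetical water marks for the freelist; in the patch the actual
 * values are autotuned rather than fixed. */
typedef struct
{
	int			lowWaterMark;	/* wake bgreclaimer below this length */
	int			highWaterMark;	/* bgreclaimer stops refilling here */
} FreelistWaterMarks;

static int
FreelistNeedsRefill(int freelistLen, const FreelistWaterMarks *wm)
{
	return freelistLen < wm->lowWaterMark;
}

static int
FreelistRefillDone(int freelistLen, const FreelistWaterMarks *wm)
{
	return freelistLen >= wm->highWaterMark;
}
```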

+ if (StrategyMoveBufferToFreeListEnd (bufHdr))

Extra space.

+ * buffers in consecutive cycles.

s/consecutive/later/

+ /* Execute the LRU scan */

s/LRU scan/clock sweep/ ?

+ while (tmp_num_to_free > 0)

I am not sure it's a good idea for this value to be fixed at loop
start and then just decremented. Shouldn't we loop and do the whole
thing over once we reach the high watermark, only stopping when
StrategySyncStartAndEnd() says num_to_free is 0?
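The restructuring being suggested might look roughly like this (a toy sketch, not patch code: BuffersStillNeeded() is a hypothetical stand-in for re-asking StrategySyncStartAndEnd() how much work remains, and the buffer pool is simulated with plain arrays):

```c
#define POOL_SIZE 16
#define HIGH_WATER_MARK 8

static int	usage_count[POOL_SIZE];	/* toy buffer pool */
static int	freelist_len;
static int	clock_hand;

/* Stand-in for StrategySyncStartAndEnd(): how many buffers must still
 * be moved to the freelist to reach the high water mark. */
static int
BuffersStillNeeded(void)
{
	return (freelist_len < HIGH_WATER_MARK) ?
		HIGH_WATER_MARK - freelist_len : 0;
}

/* Re-check the target after each pass instead of snapshotting it once,
 * so the reclaimer stops only when the high water mark is reached. */
static void
ReclaimLoop(void)
{
	int			num_to_free;

	while ((num_to_free = BuffersStillNeeded()) > 0)
	{
		while (num_to_free > 0)
		{
			int			buf = clock_hand;

			clock_hand = (clock_hand + 1) % POOL_SIZE;
			if (usage_count[buf] == 0)
			{
				freelist_len++;		/* move buf to freelist tail */
				num_to_free--;
			}
			else
				usage_count[buf]--;	/* clock sweep decrements usage */
		}
	}
}
```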

+ /* choose next victim buffer to clean. */

This process doesn't clean buffers; it puts them on the freelist.

+ * high threshold of freelsit), we drasticaly reduce the odds for

Two typos.

+ * of buffers in freelist fall below low threshold of freelist.

s/fall/falls/

In freelist.c, it seems like a poor idea to have two spinlocks as
consecutive structure members; they'll be in the same cache line,
leading to false sharing. If we merge them into a single spinlock,
does that hurt performance? If we put them further apart, e.g. by
moving the freelist_lck to the start of the structure, followed by the
latches, and leaving victimbuf_lck where it is, does that help
performance?
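One way to test the layout hypothesis is to force the two locks onto separate cache lines, e.g. (a sketch only; a 64-byte line size is assumed, and slock_t is approximated by int):

```c
#include <stddef.h>

#define CACHE_LINE_SIZE 64		/* assumption: typical x86 line size */

/* Reordered so freelist_lck sits up front with the freelist fields it
 * protects, and padding pushes victimbuf_lck onto a different cache
 * line, so freelist and clock-sweep traffic don't ping-pong one line
 * between CPUs. */
typedef struct
{
	int			freelist_lck;	/* stand-in for slock_t */
	int			firstFreeBuffer;
	int			lastFreeBuffer;
	char		pad[CACHE_LINE_SIZE - 3 * sizeof(int)];
	int			victimbuf_lck;	/* stand-in for slock_t */
	int			nextVictimBuffer;
} PaddedStrategyControl;
```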

+ /*
+  * If the buffer is pinned or has a nonzero usage_count, we cannot use
+  * it; discard it and retry.  (This can only happen if VACUUM put a
+  * valid buffer in the freelist and then someone else used it before
+  * we got to it.  It's probably impossible altogether as of 8.3, but
+  * we'd better check anyway.)
+  */
+

This comment is clearly obsolete.

> I have not yet added statistics (buffers_backend_clocksweep) as
> for that we need to add one more variable in BufferStrategyControl
> structure where I have already added few variables for this patch.
> I think it is important to have such a stat available via
> pg_stat_bgwriter, but not sure if it is worth to make the structure
> bit more bulky.

I think it's worth it.

> Another minor point is about changes in lwlock.h
> lwlock.h
> * if you remove a lock, consider leaving a gap in the numbering
> * sequence for the benefit of DTrace and other external debugging
> * scripts.
>
> As I have removed BufFreelist lock, I have adjusted the numbering
> as well in lwlock.h. There is a message on top of lock definitions
> which suggests leaving a gap if we remove any lock, however I was not
> sure whether this case (removing the first element) can affect anything,
> so for now, I have adjusted the numbering.

Let's leave slot 0 unused, instead.
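Concretely, that means retiring the removed lock's slot rather than shifting everything down (a sketch of the idea only; the neighboring lock names and numbers are abbreviated, not copied from lwlock.h):

```c
/* Slot 0 (formerly BufFreelistLock) is left unused so the numbering of
 * the remaining locks stays stable for DTrace and other external
 * debugging scripts. */
/* 0 reserved: was BufFreelistLock */
#define ShmemIndexLock	1
#define OidGenLock		2
/* ... later locks keep their existing numbers ... */
```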

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
