Re: Scaling shared buffer eviction

From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Scaling shared buffer eviction
Date: 2014-09-05 11:47:49
Message-ID: CAA4eK1JC-ZijSv-3km3C-zjK=GAxk97B6i0aQ-Fow0Tmw7K24A@mail.gmail.com
Lists: pgsql-hackers

On Wed, Sep 3, 2014 at 1:45 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>
> On Thu, Aug 28, 2014 at 7:11 AM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> > I have updated the patch to address the feedback. Main changes are:
> >
> > 1. For populating freelist, have a separate process (bgreclaimer)
> > instead of doing it by bgwriter.
> > 2. Autotune the low and high threshold values for buffers
> > in freelist. I have used the formula as suggested by you upthread.
> > 3. Cleanup of locking regimen as discussed upthread (completely
> > eliminated BufFreelist Lock).
> > 4. Improved comments and general code cleanup.
>
> Overall this looks quite promising to me.
>
> I had thought to call the new process just "bgreclaim" rather than
> "bgreclaimer", but perhaps your name is better after all. At least,
> it matches what we do elsewhere. But I don't care for the use
> "Bgreclaimer"; let's do "BgReclaimer" if we really need mixed-case, or
> else "bgreclaimer".

Changed it to bgreclaimer.

> This is unclear:
>
> +buffers for replacement. Earlier to protect freelist, we use LWLOCK as that
> +is needed to perform clock sweep which is a longer operation, however now we
> +are using two spinklocks freelist_lck and victimbuf_lck to perform operations
> +on freelist and run clock sweep respectively.
>
> I would drop the discussion of what was done before and say something
> like this: The data structures relating to buffer eviction are
> protected by two spinlocks. freelist_lck protects the freelist and
> related data structures, while victimbuf_lck protects information
> related to the current clock sweep condition.

Changed, but I have not used the exact wording mentioned above; let me know
if the new wording is okay.
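
To make the division of labor concrete, here is a rough sketch of how the
two locks split the work; the structure and helper names below are
illustrative only, not the exact code in the patch (it assumes PostgreSQL's
storage/spin.h):

/*
 * Illustrative sketch (assumed names): freelist_lck guards the freelist
 * links and the counters used for the water marks, while victimbuf_lck
 * guards only the clock-sweep hand.
 */
typedef struct
{
    slock_t     freelist_lck;       /* protects the three fields below */
    int         firstFreeBuffer;    /* head of freelist, or -1 if empty */
    int         lastFreeBuffer;     /* tail of freelist */
    int         numFreeListBuffers; /* current length of the freelist */

    slock_t     victimbuf_lck;      /* protects nextVictimBuffer */
    uint32      nextVictimBuffer;   /* clock-sweep hand */
} SketchStrategyControl;

/* Advancing the clock hand needs only victimbuf_lck ... */
static uint32
sketch_advance_clock_hand(SketchStrategyControl *sc, uint32 nbuffers)
{
    uint32      victim;

    SpinLockAcquire(&sc->victimbuf_lck);
    victim = sc->nextVictimBuffer;
    if (++sc->nextVictimBuffer >= nbuffers)
        sc->nextVictimBuffer = 0;
    SpinLockRelease(&sc->victimbuf_lck);
    return victim;
}

/* ... while freelist push/pop takes only freelist_lck, so the two
 * operations no longer contend on a single BufFreelist lock. */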

> +always in this list. We also throw buffers into this list if we consider
> +their pages unlikely to be needed soon; this is done by background process
> +reclaimer. The list is singly-linked using fields in the
>
> I suggest: Allocating pages from this list is much cheaper than
> running the "clock sweep" algorithm, which may encounter many buffers
> that are poor candidates for eviction before finding a good candidate.
> Therefore, we have a background process called bgreclaimer which works
> to keep this list populated.

Changed as per your suggestion.
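
For clarity, the intended interaction is roughly as below; the water-mark
fields and the latch pointer are assumptions for illustration, not the exact
identifiers in the patch:

/*
 * Sketch: a backend that pops the freelist and sees it drop below the low
 * water mark wakes bgreclaimer, which then refills the list up to the high
 * water mark.  The latch is set only after releasing the spinlock, the same
 * pattern used for the bgwriter latch today.
 */
Latch      *reclaimerLatch = NULL;
int         buf;

SpinLockAcquire(&StrategyControl->freelist_lck);
buf = StrategyControl->firstFreeBuffer;
/* ... unlink buf from the freelist ... */
if (StrategyControl->numFreeListBuffers < StrategyControl->lowWaterMark)
    reclaimerLatch = StrategyControl->bgreclaimerLatch;   /* assumed field */
SpinLockRelease(&StrategyControl->freelist_lck);

if (reclaimerLatch != NULL)
    SetLatch(reclaimerLatch);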

> +Background Reclaimer's Processing
> +---------------------------------
>
> I suggest titling this section "Background Reclaim".
>
> +The background reclaimer is designed to move buffers to freelist that are
>
> I suggest replacing the first three words of this sentence with "bgreclaimer".

As per the discussion in the thread, I have kept it as it is.

> +and move the the unpinned and zero usage count buffers to freelist. It
> +keep on doing this until the number of buffers in freelist become equal
> +high threshold of freelist.
>
> s/keep/keeps/
> s/become equal/reaches the/
> s/high threshold/high water mark/
> s/of freelist//

Changed as per your suggestion.

> Please change the other places that say threshold to use the "water
> mark" terminology.
>
> + if (StrategyMoveBufferToFreeListEnd (bufHdr))
>
> Extra space.
>
> + * buffers in consecutive cycles.
>
> s/consecutive/later/
>
> + /* Execute the LRU scan */
>
> s/LRU scan/clock sweep/ ?

Changed as per your suggestion.

>
> + while (tmp_num_to_free > 0)
>
> I am not sure it's a good idea for this value to be fixed at loop
> start and then just decremented. Shouldn't we loop and do the whole
> thing over once we reach the high watermark, only stopping when
> StrategySyncStartAndEnd() says num_to_free is 0?

Okay, changed the loop as per your suggestion.
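
In other words, the loop now looks roughly like the sketch below;
ClockSweepNextBuffer() is a made-up helper standing in for the clock-sweep
step, the StrategySyncStartAndEnd() signature shown is a guess, and buffer
header locking is elided for brevity:

for (;;)
{
    uint32      next_victim;
    int         num_to_free;

    /*
     * Re-read the target each time around: how many buffers are still
     * needed to bring the freelist up to the high water mark?  Stop once
     * the answer is zero.
     */
    StrategySyncStartAndEnd(&next_victim, &num_to_free);
    if (num_to_free <= 0)
        break;

    while (num_to_free > 0)
    {
        /* Hypothetical helper: advance the clock hand, decaying
         * usage_count as the regular clock sweep does, so this inner
         * loop always terminates. */
        BufferDesc *bufHdr = ClockSweepNextBuffer();

        /* Move only unpinned, zero-usage-count buffers to the freelist. */
        if (bufHdr->refcount == 0 && bufHdr->usage_count == 0 &&
            StrategyMoveBufferToFreeListEnd(bufHdr))
            num_to_free--;
    }
}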

> + /* choose next victim buffer to clean. */
>
> This process doesn't clean buffers; it puts them on the freelist.

Right. Changed it to match what it does.

> + * high threshold of freelsit), we drasticaly reduce the odds for
>
> Two typos.

Fixed.

> + * of buffers in freelist fall below low threshold of freelist.
>
> s/fall/falls/

Changed as per your suggestion.

> In freelist.c, it seems like a poor idea to have two spinlocks as
> consecutive structure members; they'll be in the same cache line,
> leading to false sharing. If we merge them into a single spinlock,
> does that hurt performance? If we put them further apart, e.g. by
> moving the freelist_lck to the start of the structure, followed by the
> latches, and leaving victimbuf_lck where it is, does that help
> performance?

As per the discussion, I have kept them as they are and added a comment
indicating that we could consider placing the two locks in separate
cache lines.
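
For reference, that alternative would look roughly like this; the padding
constant and structure below are illustrative only, not what the patch does
today:

/*
 * Sketch of the "separate cache lines" idea.  Padding between the two
 * spinlocks keeps backends that hammer the freelist from invalidating the
 * cache line holding the clock-sweep hand, and vice versa.
 */
#define SKETCH_CACHE_LINE_SIZE 128      /* assumed; large enough for most CPUs */

typedef struct
{
    slock_t     freelist_lck;
    int         firstFreeBuffer;
    int         lastFreeBuffer;

    char        pad[SKETCH_CACHE_LINE_SIZE];    /* push victimbuf_lck onto
                                                 * its own cache line */

    slock_t     victimbuf_lck;
    uint32      nextVictimBuffer;
} PaddedSketchStrategyControl;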

> + /*
> + * If the buffer is pinned or has a nonzero usage_count, we cannot use
> + * it; discard it and retry. (This can only happen if VACUUM put a
> + * valid buffer in the freelist and then someone else used it before
> + * we got to it. It's probably impossible altogether as of 8.3, but
> + * we'd better check anyway.)
> + */
> +
>
> This comment is clearly obsolete.

Removed.

> > I have not yet added statistics (buffers_backend_clocksweep) as
> > for that we need to add one more variable in BufferStrategyControl
> > structure where I have already added few variables for this patch.
> > I think it is important to have such a stat available via
> > pg_stat_bgwriter, but not sure if it is worth to make the structure
> > bit more bulky.
>
> I think it's worth it.

Okay, added the new statistic.
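
Roughly, the counter works as sketched below (the field name is an
assumption): it is bumped whenever a backend finds the freelist empty and
has to run the clock sweep itself, and is later exposed as
buffers_backend_clocksweep in pg_stat_bgwriter alongside buffers_backend
and buffers_alloc.

SpinLockAcquire(&StrategyControl->freelist_lck);
if (StrategyControl->firstFreeBuffer < 0)
{
    /* Freelist is empty; this backend will do its own clock sweep. */
    StrategyControl->numBufferBackendClocksweep++;      /* assumed field */
}
SpinLockRelease(&StrategyControl->freelist_lck);

A high value relative to buffers_alloc would indicate that bgreclaimer is
not keeping the freelist populated fast enough.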

> > Another minor point is about changes in lwlock.h
> > lwlock.h
> > * if you remove a lock, consider leaving a gap in the numbering
> > * sequence for the benefit of DTrace and other external debugging
> > * scripts.
> >
> > As I have removed BufFreelist lock, I have adjusted the numbering
> > as well in lwlock.h. There is a message on top of lock definitions
> > which suggest to leave gap if we remove any lock, however I was not
> > sure whether this case (removing the first element) can affect anything,
> > so for now, I have adjusted the numbering.
>
> Let's leave slot 0 unused, instead.

Sure, that makes sense.
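
So the individual-lock definitions keep their slot numbers, with slot 0
simply left unused; the lines below are just a sketch of that layout, not
the exact header:

/*
 * Sketch only: with BufFreelistLock removed, slot 0 of MainLWLockArray is
 * left unused so the remaining locks keep their historical numbers for
 * DTrace and other external debugging scripts.
 */
/* slot 0: formerly BufFreelistLock -- intentionally unused */
#define ShmemIndexLock        (&MainLWLockArray[1].lock)
#define OidGenLock            (&MainLWLockArray[2].lock)
/* ... remaining locks keep their existing slot numbers ... */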

Apart from the above, I think this patch requires a catversion bump, as I
have modified the system catalogs. However, I have not done that in the
patch yet, as it would otherwise be a bit difficult to take performance
data.

Performance Data with updated patch

Configuration and Db Details

IBM POWER-7 16 cores, 64 hardware threads
RAM = 64GB
Database Locale=C
checkpoint_segments=256
checkpoint_timeout=15min
shared_buffers=8GB
scale factor = 3000
Client Count = number of concurrent sessions and threads (ex. -c 8 -j 8)
Duration of each individual run = 5mins

All data is in tps and was taken using a pgbench read-only load.

Client Count/Patch_Ver (tps)      8      16      32      64     128
HEAD                          58614  107370  140717  104357   65010
Patch                         60092  113564  165014  213848  216065

This data is the median of 3 runs; a detailed report is attached to this
mail. I have not repeated the test for all configurations, as there is no
major change in design/algorithm that could affect performance. Mark has
already taken tpc-b data, which shows that there is no problem with it;
however, I will also take it once more with the latest version.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachment Content-Type Size
scalable_buffer_eviction_v6.patch application/octet-stream 59.2 KB
perf_read_scalability_data_v6.ods application/vnd.oasis.opendocument.spreadsheet 17.7 KB
