Re: Proposal: SLRU to Buffer Cache

From: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
To: Shawn Debnath <sdn(at)amazon(dot)com>
Cc: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Proposal: SLRU to Buffer Cache
Date: 2018-08-15 05:56:19
Lists: pgsql-hackers

Hi Shawn,

On Wed, Aug 15, 2018 at 9:35 AM, Shawn Debnath <sdn(at)amazon(dot)com> wrote:
> At the Unconference in Ottawa this year, I pitched the idea of moving
> components off of SLRU and on to the buffer cache. The motivation
> behind the idea was threefold:
> * Improve performance by eliminating fixed-size caches and simplistic
> scan and eviction algorithms.
> * Ensure durability and consistency by tracking LSNs and checksums
> per block.
> * Consolidate caching strategies in the engine to simplify the
> codebase and benefit from future buffer cache optimizations.

Thanks for working on this. These are good goals, and I've wondered
about doing exactly this myself for exactly those reasons. I'm sure
we're not the only ones, and I heard only positive reactions to your
unconference pitch. As you know, my undo log storage design interacts
with the buffer manager in the same way, so I'm interested in this
subject and will be keen to review and test what you come up with.
That said, I'm fairly new here myself and there are people on this
list with a decade or two more experience hacking on the buffer
manager and transam machinery.

> As the changes are quite invasive, I wanted to vet the approach with the
> community before digging in to implementation. The changes are strictly
> on the storage side and do not change the runtime behavior or protocols.
> Here's the current approach I am considering:
> 1. Implement a generic block storage manager that parameterizes
> several options like segment sizes, fork and segment naming and
> path schemes, concepts entrenched in md.c that are strongly tied to
> relations. To mitigate risk, I am planning on not modifying md.c
> for the time being.

+1 for doing it separately at first.

I've also vacillated between extending md.c and doing my own
undo_file.c thing. It seems plausible that between SLRU and undo we
could at least share a common smgr implementation, and eventually
maybe md.c. There are a few differences though, and the question is
whether we'd want to do yet another abstraction layer with
callbacks/vtable/configuration points to handle that parameterisation,
or just use the existing indirection in smgr and call it good.

I'm keen to see what you come up with. After we have a patch to
refactor and generalise the fsync stuff from md.c (about which more
below), let's see what is left and whether we can usefully combine
some code.

> 2. Introduce a new smgr_truncate_extended() API to allow truncation of
> a range of blocks starting at a specific offset, and an option to
> delete the file instead of simply truncating.

Hmm. In my undo proposal I'm currently implementing only the minimum
smgr interface required to make bufmgr.c happy (basically read and
write blocks), but I'm managing segment files (creating, deleting,
recycling) directly via a separate interface (UndoLogAllocate(),
UndoLogDiscard()) defined in undolog.c. That seemed necessary for me
because that's where I had machinery to track the meta-data (mostly
head and tail pointers) for each undo log explicitly, but I suppose I
could use a wider smgr interface as you are proposing to move the
filesystem operations over there. Perhaps I should reconsider that
split. I look forward to seeing your code.

> 3. I will continue to use the RelFileNode/SMgrRelation constructs
> through the SMgr API. I will reserve OIDs within the engine that we
> can use as DB ID in RelFileNode to determine which storage manager
> to associate for a specific SMgrRelation. To increase the
> visibility of the OID mappings to the user, I would expose a new
> catalog where the OIDs can be reserved and mapped to existing
> components for template db generation. Internally, SMgr wouldn't
> rely on catalogs, but instead will have them defined in code to not
> block bootstrap. This scheme should be compatible with the undo log
> storage work by Thomas Munro, et al. [0].

+1 for the pseudo-DB OID scheme, for now. I think we can reconsider
how we want to structure buffer tags in the longer term as part of
future projects that overhaul buffer mapping. We shouldn't get hung
up on that now.

I was wondering what the point of exposing the OIDs to users in a
catalog would be though. It's not necessary to do that to reserve
them (and even if it were, pg_database would be the place): the OIDs
we choose for undo, clog, ... just have to be in the system reserved
range to be safe from collisions. I suppose one benefit would be the
ability to join e.g. pg_buffercache against it to get a human-readable
name like "clog", but that'd be slightly odd because the DB OID field
would refer to entries in pg_database or pg_storage_manager depending
on the number range.
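
To make the pseudo-DB OID idea concrete, here is a rough sketch of how a
hard-coded table could pick a storage manager from the DB OID field and also
supply the human-readable name mentioned above. The OID values (9000 and up),
the component names and both function names are made up for illustration; the
only real constant is that OIDs below 16384 (FirstNormalObjectId) are reserved
for the system.

```python
from enum import Enum

# Real PostgreSQL constant: OIDs below this are reserved for the system.
FIRST_NORMAL_OBJECT_ID = 16384


class StorageManager(Enum):
    MD = "md"                        # ordinary relation files
    GENERIC_BLOCK = "generic_block"  # hypothetical SLRU replacement
    UNDO = "undo"                    # hypothetical undo log storage


# Hypothetical pseudo-database OIDs, defined in code rather than in a
# catalog so that nothing depends on catalogs during bootstrap.
PSEUDO_DB_OIDS = {
    9000: ("clog", StorageManager.GENERIC_BLOCK),
    9001: ("commit_ts", StorageManager.GENERIC_BLOCK),
    9002: ("undo", StorageManager.UNDO),
}

# Every pseudo OID must sit in the reserved range to avoid collisions
# with OIDs assigned to user-created databases.
assert all(oid < FIRST_NORMAL_OBJECT_ID for oid in PSEUDO_DB_OIDS)


def smgr_for_db_oid(db_oid):
    """Pick the storage manager from the DB OID part of a RelFileNode."""
    entry = PSEUDO_DB_OIDS.get(db_oid)
    return entry[1] if entry is not None else StorageManager.MD


def describe_db_oid(db_oid, pg_database):
    """Name a DB OID: pseudo entries first, then real databases.

    pg_database is a plain dict of oid -> name standing in for the catalog.
    """
    entry = PSEUDO_DB_OIDS.get(db_oid)
    if entry is not None:
        return entry[0]
    return pg_database.get(db_oid, "unknown")
```

The second function shows the oddity described above: the same field has to be
resolved against two different sources depending on the number range.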

> 4. For each component that will be transitioned over to the generic
> block storage, I will introduce a page header at the beginning of
> the block and re-work the associated offset calculations along with
> transitioning from SLRU to buffer cache framework.


As mentioned over in the SLRU checksums thread[1], I think that also
means that dirtied pages need to be registered with xlog so they get
full page writes when appropriate to deal with torn pages. I think
SLRUs and undo will all be able to use REGBUF_WILL_INIT and
RBM_ZERO_XXX flags almost all the time because they're append-mostly.
You'll presumably generate one or two FPWs in each SLRU after each
checkpoint; one in the currently active page where the running xids
live, and occasionally an older page if you recently switched clog
page or have some very long running transactions that eventually get
stamped as committed. In other words, there will be very few actual
full page writes generated by this, but it's something we need to get
right for correctness on some kinds of storage. It might be possible
to skip that if checksums are not enabled (based on the theory that
torn pages can't hurt any current SLRU user due to their
write-without-read access pattern, it's just the checksum failures
that we need to worry about).
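
For reference, the offset re-work in point 4 might look roughly like the
following for clog, which stores 2 status bits per transaction. This is only a
sketch: it assumes the default 8192-byte block size and assumes the new header
is the size of the standard 24-byte page header, which the proposal does not
actually specify. The function names are invented.

```python
BLCKSZ = 8192            # default PostgreSQL block size
PAGE_HEADER_SIZE = 24    # assumption: same size as the standard page header
BITS_PER_XACT = 2        # clog stores 2 status bits per transaction
XACTS_PER_BYTE = 8 // BITS_PER_XACT


def xacts_per_page(header_size):
    """Number of transaction statuses that fit after the header."""
    return (BLCKSZ - header_size) * XACTS_PER_BYTE


def locate(xid, header_size=PAGE_HEADER_SIZE):
    """Map an xid to (page number, byte offset in page, bit shift in byte)."""
    per_page = xacts_per_page(header_size)
    page = xid // per_page
    index_in_page = xid % per_page
    # Status bytes start after the header instead of at offset 0.
    byte_offset = header_size + index_in_page // XACTS_PER_BYTE
    bit_shift = (index_in_page % XACTS_PER_BYTE) * BITS_PER_XACT
    return page, byte_offset, bit_shift
```

With header_size=0 this reduces to the headerless layout of 32768 transactions
per page; with a 24-byte header the capacity drops to 32672, so every page
boundary moves and existing segment files no longer line up.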

> 5. Due to the on-disk format changes, simply copying the segments
> during upgrade wouldn't work anymore. Given the nature of data
> stored within SLRU segments today, we can extend pg_upgrade to
> translate the segment files by scanning from relfrozenxid and
> relminmxid and recording the corresponding values at the new
> offsets in the target segments.


(Hmm, if we're going to change all this stuff, I wonder if there would
be any benefit to switching to 64 bit xids for the xid-based SLRUs
while we're here...)

> 6. For now, I will implement a fsync queue handler specific to generic
> block store manager. In the future, once Andres' fsync queue work
> [1] gets merged in, we can move towards a common handler instead of
> duplicating the work.

I'm looking at that now: more soon.

> 7. Will update impacted extensions such as pageinspect and
> pg_buffercache.


> 8. We may need to introduce new shared buffer access strategies to
> limit the components from thrashing buffer cache.

That's going to be an interesting area. It will be good to get some
real experience. For undo, so far it usually seems to work out OK
because we aggressively try to discard pages (that is, drop buffers
and put them on the freelist) at the same rate we dirty them. I
speculate that for the SLRUs it might work out OK because, even though
the "discard" horizon moves very infrequently, pages are dirtied at a
relatively slow rate. Let's see... you can fit just under 32k
transactions into each clog page, so a 50K TPS nonstop workload would
take about a day to trash 1GB of cache with clog. That said, if it
turns out to be a problem we have a range of different hammers to hit
it with (and a number of hackers interested in that problem space).
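
The back-of-envelope figure above can be reproduced with a few lines. The
sketch assumes 8192-byte blocks, 2 status bits per transaction and a 24-byte
page header (the header size is a guess at what "just under 32k" implies); the
function name is invented.

```python
BLCKSZ = 8192            # default block size
PAGE_HEADER_SIZE = 24    # assumed header size, hence "just under 32k"
XACTS_PER_BYTE = 4       # 2 status bits per transaction


def seconds_to_fill(cache_bytes, tps, header_size=PAGE_HEADER_SIZE):
    """Seconds of nonstop traffic needed to dirty cache_bytes of clog pages."""
    pages = cache_bytes // BLCKSZ
    xacts_per_page = (BLCKSZ - header_size) * XACTS_PER_BYTE
    return pages * xacts_per_page / tps


# 1GB of cache at 50K transactions per second.
hours = seconds_to_fill(1 << 30, 50_000) / 3600
print(f"{hours:.1f} hours")
```

That comes out a little under 24 hours, consistent with "about a day".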

This clog.c comment is interesting:

* This module replaces the old "pg_log" access code, which treated pg_log
* essentially like a relation, in that it went through the regular buffer
* manager. The problem with that was that there wasn't any good way to
* recycle storage space for transactions so old that they'll never be
* looked up again. Now we use specialized access code so that the commit
* log can be broken into relatively small, independent segments.

So it actually did use the regular buffer pool for a decade or so. It
doesn't look like the buffer pool was the problem (not that that would
tell us much if it had been, given how much has changed since commit
2589735da08c): it was just the lack of a way to truncate the front of
the growing relation file, wasting precious turn-of-the-century disk
space.

> The work would be broken up into several smaller pieces so that we can
> get patches out for review and course-correct if needed.
> 1. Generic block storage manager with changes to SMgr APIs and code to
> initialize the new storage manager based on DB ID in RelFileNode.
> This patch will also introduce the new catalog to show the OIDs
> which map to this new storage manager.

Personally I wouldn't worry too much about that catalog stuff in v0
since it's just window dressing and doesn't actually help us get our
hands on the core feature prototype to test...

> 2. Adapt commit timestamp: simple and easy component to transition
> over as a first step, enabling us to test the whole framework.
> Changes will also include patching pg_upgrade to
> translate commit timestamp segments to the new format and
> associated updates to extensions.

+1, seems like as good a place as any to start.

> Will also include functional test coverage, especially, edge
> cases around data on page boundaries, and benchmark results
> comparing performance per component on SLRU vs buffer cache
> to identify regressions.


> 3. Iterate for each component in SLRU using the work done for commit
> timestamp as an example: multixact, clog, subtrans, async
> notifications, and predicate locking.

Without looking, I wonder if clog.c is going to be the trickiest,
since its slots are also involved in some group LSN stuff IIRC.

> 4. If required, implement shared access strategies, i.e., non-backend
> private ring buffers to limit buffer cache usage by these
> components.

I have a suspicion this won't turn out to be necessary for SLRUs as
mentioned, so I'm not too worried about it.

> Would love to hear feedback and comments on the approach above.

I like it. I'm looking forward to some prototype code. Oh, I think I
already said that a couple of times :-)


Thomas Munro
