Re: Proposal: SLRU to Buffer Cache

From: Shawn Debnath <sdn(at)amazon(dot)com>
To: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
Cc: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Proposal: SLRU to Buffer Cache
Date: 2018-08-21 13:53:21
Message-ID: 20180821135050.GA67907@60f81dc409fc.ant.amazon.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Sorry for the delay!

On Wed, Aug 15, 2018 at 05:56:19PM +1200, Thomas Munro wrote:
> +1 for doing it separately at first.
>
> I've also vacillated between extending md.c and doing my own
> undo_file.c thing. It seems plausible that between SLRU and undo we
> could at least share a common smgr implementation, and eventually
> maybe md.c. There are a few differences though, and the question is
> whether we'd want to do yet another abstraction layer with
> callbacks/vtable/configuration points to handle that parameterisation,
> or just use the existing indirection in smgr and call it good.
>
> I'm keen to see what you come up with. After we have a patch to
> refactor and generalise the fsync stuff from md.c (about which more
> below), let's see what is left and whether we can usefully combine
> some code.

There are a few different approaches we can take here. Let me ponder on
it before implementing. We can iterate on the patch once it’s out.

> > 3. I will continue to use the RelFileNode/SMgrRelation constructs
> > through the SMgr API. I will reserve OIDs within the engine that we
> > can use as DB ID in RelFileNode to determine which storage manager
> > to associate for a specific SMgrRelation. To increase the
> > visibility of the OID mappings to the user, I would expose a new
> > catalog where the OIDs can be reserved and mapped to existing
> > components for template db generation. Internally, SMgr wouldn't
> > rely on catalogs, but instead will have them defined in code to not
> > block bootstrap. This scheme should be compatible with the undo log
> > storage work by Thomas Munro, et al. [0].
>
> +1 for the pseudo-DB OID scheme, for now. I think we can reconsider
> how we want to structure buffer tags in the longer term as part of
> future projects that overhaul buffer mapping. We shouldn't get hung
> up on that now.

+1 We should postpone discussing revamping buffer tags for a later date.
This set of patches will be quite a handful already.

> I was wondering what the point of exposing the OIDs to users in a
> catalog would be though. It's not necessary to do that to reserve
> them (and even if it were, pg_database would be the place): the OIDs
> we choose for undo, clog, ... just have to be in the system reserved
> range to be safe from collisions. I suppose one benefit would be the
> ability to join eg pg_buffer_cache against it to get a human readable
> name like "clog", but that'd be slightly odd because the DB OID field
> would refer to entries in pg_database or pg_storage_manager depending
> on the number range.

Good points. However, there are very few cases where our internal
representation using DB OIDs will be exposed, one such being
pg_buffercache. Wondering if updating the documentation here would be
sufficient as pg_buffercache is an extension used by developers and DBEs
rather than by consumers. We can circle back to this after the initial
set of patches are out.

> > 4. For each component that will be transitioned over to the generic
> > block storage, I will introduce a page header at the beginning of
> > the block and re-work the associated offset calculations along with
> > transitioning from SLRU to buffer cache framework.
>
> +1
>
> As mentioned over in the SLRU checksums thread[1], I think that also
> means that dirtied pages need to be registered with xlog so they get
> full page writes when appropriate to deal with torn pages. I think
> SLRUs and undo will all be able to use REGBUF_WILL_INIT and
> RBM_ZERO_XXX flags almost all the time because they're append-mostly.
> You'll presumably generate one or two FPWs in each SLRU after each
> checkpoint; one in the currently active page where the running xids
> live, and occasionally an older page if you recently switched clog
> page or have some very long running transactions that eventually get
> stamped as committed. In other words, there will be very few actual
> full page writes generated by this, but it's something we need to get
> right for correctness on some kinds of storage. It might be possible
> to skip that if checksums are not enabled (based on the theory that
> torn pages can't hurt any current SLRU user due to their
> write-without-read access pattern, it's just the checksum failures
> that we need to worry about).

Yep agreed on FPW, and good point on potentially skipping it if they are
disabled. For most of the components, we are always setting values at
the advancing offset so I believe we should be okay here.

> > 5. Due to the on-disk format changes, simply copying the segments
> > during upgrade wouldn't work anymore. Given the nature of data
> > stored within SLRU segments today, we can extend pg_upgrade to
> > translate the segment files by scanning from relfrozenxid and
> > relminmxid and recording the corresponding values at the new
> > offsets in the target segments.
>
> +1
>
> (Hmm, if we're going to change all this stuff, I wonder if there would
> be any benefit to switching to 64 bit xids for the xid-based SLRUs
> while we're here...)

Do you mean switching or reserving space for it on the block? The latter
I hope :-)

> > 8. We may need to introduce new shared buffer access strategies to
> > limit the components from thrashing buffer cache.
>
> That's going to be an interesting area. It will be good to get some
> real experience. For undo, so far it usually seems to work out OK
> because we aggressively try to discard pages (that is, drop buffers
> and put them on the freelist) at the same rate we dirty them. I
> speculate that for the SLRUs it might work out OK because, even though
> the "discard" horizon moves very infrequently, pages are dirtied at a
> relatively slow rate. Let's see... you can fit just under 32k
> transactions into each clog page, so a 50K TPS nonstop workload would
> take about a day to trash 1GB of cache with clog. That said, if it
> turns out to be a problem we have a range of different hammers to hit
> it with (and a number of hackers interested in that problem space).

Agreed, my plan is to test it without special ring buffers and evaluate
the performance. I just wanted to raise the issue in case we run into
abnormal behavior.

> > 1. Generic block storage manager with changes to SMgr APIs and code to
> > initialize the new storage manager based on DB ID in RelFileNode.
> > This patch will also introduce the new catalog to show the OIDs
> > which map to this new storage manager.
>
> Personally I wouldn't worry too much about that catalog stuff in v0
> since it's just window dressing and doesn't actually help us get our
> hands on the core feature prototype to test...

Yep, agreed. Like I said above, we can circle back on this. The OID
exposure can be settled on once the functionality has gained acceptance.

> > Would love to hear feedback and comments on the approach above.
>
> I like it. I'm looking forward to some prototype code. Oh, I think I
> already said that a couple of times :-)

More than a couple of times :-) It’s in the works!

--
Shawn Debnath
Amazon Web Services (AWS)

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Andres Freund 2018-08-21 14:15:28 Re: Proposal: SLRU to Buffer Cache
Previous Message Alexander Korotkov 2018-08-21 13:10:44 Re: [HACKERS] WIP: long transactions on hot standby feedback replica / proof of concept