From: Shawn Debnath <sdn(at)amazon(dot)com>
To: <pgsql-hackers(at)postgresql(dot)org>
Subject: Proposal: SLRU to Buffer Cache
Date: 2018-08-14 21:35:00
Message-ID: 20180814213500.GA74618@60f81dc409fc.ant.amazon.com
Lists: pgsql-hackers

Hello hackers,

At the Unconference in Ottawa this year, I pitched the idea of moving
components off of SLRU and onto the buffer cache. The motivation
behind the idea was threefold:

* Improve performance by eliminating fixed-size caches and their
simplistic scan and eviction algorithms.
* Ensure durability and consistency by tracking LSNs and checksums
per block.
* Consolidate caching strategies in the engine to simplify the
codebase and benefit from future buffer cache optimizations.

As the changes are quite invasive, I wanted to vet the approach with the
community before digging into the implementation. The changes are strictly
on the storage side and do not change the runtime behavior or protocols.
Here's the current approach I am considering:

1. Implement a generic block storage manager that parameterizes
several options like segment sizes, fork and segment naming, and
path schemes, concepts that are entrenched in md.c and strongly
tied to relations. To mitigate risk, I am planning on not modifying
md.c for the time being.
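
As a rough illustration, the parameterization could look something
like the sketch below; the struct and its fields are hypothetical
placeholders rather than a proposed API:

    /*
     * Hypothetical per-component parameters for the generic block
     * storage manager.  What md.c hardcodes for relations (segment
     * size, file naming, directory layout) becomes data here.
     */
    typedef struct BlockStorageParams
    {
        const char *base_dir;       /* e.g. "pg_commit_ts" */
        int         segment_blocks; /* blocks per segment file */
        bool        has_forks;      /* whether forks are supported */

        /* builds the file path for a given fork and segment number */
        void      (*segment_path) (char *out, size_t outlen,
                                   ForkNumber forknum, int segno);
    } BlockStorageParams;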

2. Introduce a new smgr_truncate_extended() API to allow truncation of
a range of blocks starting at a specific offset, and an option to
delete the file instead of simply truncating it.
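
To make the intent concrete, the signature could look roughly like
the following; the option names are illustrative and the final shape
would be settled during review:

    /* Hypothetical sketch of the extended truncate API. */
    typedef enum SMgrTruncateOption
    {
        SMGR_TRUNCATE_ONLY,    /* shrink segments to the boundary */
        SMGR_TRUNCATE_UNLINK   /* delete segments left fully empty */
    } SMgrTruncateOption;

    extern void smgr_truncate_extended(SMgrRelation reln,
                                       ForkNumber forknum,
                                       BlockNumber startblk,
                                       BlockNumber nblocks,
                                       SMgrTruncateOption option);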

3. I will continue to use the RelFileNode/SMgrRelation constructs
through the SMgr API. I will reserve OIDs within the engine that we
can use as the DB ID in RelFileNode to determine which storage
manager to associate with a specific SMgrRelation. To increase the
visibility of the OID mappings to the user, I would expose a new
catalog where the OIDs can be reserved and mapped to existing
components for template db generation. Internally, SMgr won't rely
on the catalogs; the mappings will be defined in code so as not to
block bootstrap. This scheme should be compatible with the undo log
storage work by Thomas Munro, et al. [0].
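
A minimal sketch of how the in-code mapping could drive storage
manager selection; the OID values and names below are placeholders,
not actual reserved OIDs:

    /*
     * Hypothetical reserved DB OIDs, defined in code so dispatch
     * works during bootstrap, before catalog access is possible.
     */
    #define COMMIT_TS_DB_ID  9001    /* placeholder value */
    #define CLOG_DB_ID       9002    /* placeholder value */

    static const f_smgr *
    smgr_for(RelFileNode rnode)
    {
        switch (rnode.dbNode)
        {
            case COMMIT_TS_DB_ID:
            case CLOG_DB_ID:
                return &generic_block_smgr;  /* new storage manager */
            default:
                return &md_smgr;             /* md.c, unmodified */
        }
    }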

4. For each component that will be transitioned over to the generic
block storage, I will introduce a page header at the beginning of
the block and re-work the associated offset calculations along with
transitioning from SLRU to the buffer cache framework.
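
For example, a component storing fixed-size entries loses a page
header's worth of space per block, and the offset math shifts
accordingly. A sketch, with the entry type and macro names being
illustrative (loosely modeled on commit timestamps):

    #include "storage/bufpage.h"    /* BLCKSZ, SizeOfPageHeaderData */

    /* Hypothetical fixed-size entry stored by a component. */
    typedef struct ComponentEntry
    {
        TimestampTz time;
        RepOriginId nodeid;
    } ComponentEntry;

    /* Per-page capacity shrinks by the size of the page header... */
    #define ENTRY_SIZE        sizeof(ComponentEntry)
    #define ENTRIES_PER_PAGE  ((BLCKSZ - SizeOfPageHeaderData) / ENTRY_SIZE)

    /* ...and every in-page offset now skips past the header. */
    #define EntryOffset(slot) (SizeOfPageHeaderData + (slot) * ENTRY_SIZE)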

5. Due to the on-disk format changes, simply copying the segments
during upgrade wouldn't work anymore. Given the nature of data
stored within SLRU segments today, we can extend pg_upgrade to
translate the segment files by scanning from relfrozenxid and
relminmxid and recording the corresponding values at the new
offsets in the target segments.
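
In rough C-style pseudocode, the translation pass for one component
might look like the following, where both helper functions are
hypothetical:

    /*
     * Sketch of a pg_upgrade translation pass: walk every xid the
     * old cluster still tracks and rewrite its entry at the offset
     * the new page-header-aware format expects.
     */
    for (TransactionId xid = oldest_xid;   /* from relfrozenxid */
         TransactionIdPrecedes(xid, next_xid);
         TransactionIdAdvance(xid))
    {
        ComponentEntry entry;

        read_old_entry(old_dir, xid, &entry);   /* old SLRU offsets */
        write_new_entry(new_dir, xid, &entry);  /* new page offsets */
    }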

6. For now, I will implement an fsync queue handler specific to the
generic block storage manager. In the future, once Andres' fsync
queue work [1] gets merged in, we can move towards a common handler
instead of duplicating the work.

7. Will update impacted extensions such as pageinspect and
pg_buffercache.

8. We may need to introduce new shared buffer access strategies to
limit the components from thrashing buffer cache.
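
The existing ring buffer machinery is the natural model here. For
reference, a backend-private ring is obtained today roughly as
follows; a shared, cross-backend variant is what this item would add:

    #include "storage/bufmgr.h"
    #include "utils/rel.h"

    static void
    read_component_block(Relation rel, BlockNumber blkno)
    {
        /* Backend-private ring, as bulk reads use it today. */
        BufferAccessStrategy strategy = GetAccessStrategy(BAS_BULKREAD);
        Buffer      buf;

        buf = ReadBufferExtended(rel, MAIN_FORKNUM, blkno,
                                 RBM_NORMAL, strategy);
        /* ... inspect the page here ... */
        ReleaseBuffer(buf);
        FreeAccessStrategy(strategy);
    }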

The work would be broken up into several smaller pieces so that we can
get patches out for review and course-correct if needed.

1. Generic block storage manager with changes to SMgr APIs and code to
initialize the new storage manager based on DB ID in RelFileNode.
This patch will also introduce the new catalog to show the OIDs
which map to this new storage manager.

2. Adapt commit timestamp: simple and easy component to transition
over as a first step, enabling us to test the whole framework.
Changes will also include patching pg_upgrade to
translate commit timestamp segments to the new format and
associated updates to extensions.

Will also include functional test coverage, especially edge
cases around data on page boundaries, and benchmark results
comparing per-component performance on SLRU vs buffer cache
to identify regressions.

3. Iterate for each component in SLRU using the work done for commit
timestamp as an example: multixact, clog, subtrans, async
notifications, and predicate locking.

4. If required, implement shared access strategies, i.e., non-backend
private ring buffers to limit buffer cache usage by these
components.

Would love to hear feedback and comments on the approach above.

Thanks,

Shawn Debnath
Amazon Web Services (AWS)

[0] https://github.com/enterprisedb/zheap/tree/undo-log-storage
[1] https://www.postgresql.org/message-id/flat/20180424180054.inih6bxfspgowjuc%40alap3.anarazel.de
