[Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From: Dilip Kumar <dilipbalaut(at)gmail(dot)com>
To: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints
Date: 2021-06-15 11:20:24
Message-ID: CAFiTN-sP6yLVTfjR42mEfvFwJ-SZ2iEtG1t0j=QX09X=BM+KWQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Currently, CREATE DATABASE forces a checkpoint, then copies all the
files, then forces another checkpoint. The comments in the createdb()
function explain the reasons for this. The attached patch fixes this
problem by making CREATE DATABASE completely WAL-logged so that now we
can avoid checkpoints. The patch modifies both CREATE DATABASE and
ALTER DATABASE..SET TABLESPACE to be fully WAL-logged.

One main advantage of this change is that it will be cheaper. Forcing
checkpoints on an idle system is no big deal, but when the system is
under heavy write load, it's very expensive. Another advantage is that
it makes things better for features like TDE, which might want the
pages in the source database to be encrypted using a different key or
nonce than the pages in the target database.

Design Idea:
-----------------
First, create the target database directory along with the version
file and WAL-log this operation. Create the "relation map file" in
the target database and copy the content from the source database. For
this, we can use some modified versions of the write_relmap_file() and
WAL-log the relmap create operation along with the file content. Now,
read the relmap file to find the relfilenode for pg_class and then we
read pg_class block by block and decode the tuples. For reading the
pg_class blocks, we can use ReadBufferWithoutRelCache() so that we
don't need the relcache. Nothing prevents us from checking visibility
for tuples in another database because CLOG is global to the cluster.
And nothing prevents us from deforming those tuples because the column
definitions for pg_class have to be the same in every database. Then
we can get the relfilenode of every file we need to copy, and prepare
a list of all such relfilenode. Next, for each relfilenode in the
source database, create a respective relfilenode in the target
database (for all forks) using smgrcreate, which is already a
WAL-logged operation. Now read the source relfilenode block by block
using ReadBufferWithoutRelCache() and copy the block to the target
relfilenode using smgrextend() and WAL-log them using log_newpage().
For the source database, we can not directly use the smgrread(),
because there could be some dirty buffers so we will have to read them
through the buffer manager interface, otherwise, we will have to flush
all the dirty buffers.

WAL sequence using pg_waldump
----------------------------------------------------
1. (new wal to create db dir and write PG_VERSION file)
rmgr: Database desc: CREATE create dir 1663/16394

2. (new wal to create and write relmap file)
rmgr: RelMap desc: CREATE database 16394 tablespace 1663 size 512

2. (create relfilenode)
rmgr: Storage desc: CREATE base/16394/16384
rmgr: Storage desc: CREATE base/16394/2619

3. (write page data)
rmgr: XLOG desc: FPI , blkref #0: rel 1663/16394/2619 blk 0 FPW
rmgr: XLOG desc: FPI , blkref #0: rel 1663/16394/2619 blk 1 FPW
............
4. (create other forks)
rmgr: Storage desc: CREATE base/16394/2619_fsm
rmgr: Storage CREATE base/16394/2619_vm
.............

I have attached a POC patch, which shows this idea, with this patch
all basic sanity testing and the "check-world" is passing.

Open points:
-------------------
- This is a POC patch so needs more refactoring/cleanup and testing.
- Might need to relook into the SMGR level API usage.

Credits:
-----------
Thanks to Robert Haas, for suggesting this idea and the high-level design.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachment Content-Type Size
POC-0001-WAL-logged-CREATE-DATABASE.patch text/x-patch 42.8 KB

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Julien Rouhaud 2021-06-15 11:31:05 Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints
Previous Message Amit Kapila 2021-06-15 11:18:37 Re: Decoding speculative insert with toast leaks memory