Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
To: Dilip Kumar <dilipbalaut(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints
Date: 2021-06-15 12:04:08
Message-ID: aea01367-d988-fcfa-5224-bbc91839d5d6@iki.fi
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 15/06/2021 14:20, Dilip Kumar wrote:
> Design Idea:
> -----------------
> First, create the target database directory along with the version
> file and WAL-log this operation. Create the "relation map file" in
> the target database and copy the content from the source database. For
> this, we can use some modified versions of the write_relmap_file() and
> WAL-log the relmap create operation along with the file content. Now,
> read the relmap file to find the relfilenode for pg_class and then we
> read pg_class block by block and decode the tuples. For reading the
> pg_class blocks, we can use ReadBufferWithoutRelCache() so that we
> don't need the relcache. Nothing prevents us from checking visibility
> for tuples in another database because CLOG is global to the cluster.
> And nothing prevents us from deforming those tuples because the column
> definitions for pg_class have to be the same in every database. Then
> we can get the relfilenode of every file we need to copy, and prepare
> a list of all such relfilenode.

I guess that would work, but you could also walk the database directory
like copydir() does. How you find the relations to copy is orthogonal to
whether you WAL-log them or use checkpoints. And whether you use the
buffer cache is also orthogonal to the rest of the proposal; you could
issue FlushDatabaseBuffers() instead of a checkpoint.

> Next, for each relfilenode in the
> source database, create a respective relfilenode in the target
> database (for all forks) using smgrcreate, which is already a
> WAL-logged operation. Now read the source relfilenode block by block
> using ReadBufferWithoutRelCache() and copy the block to the target
> relfilenode using smgrextend() and WAL-log them using log_newpage().
> For the source database, we can not directly use the smgrread(),
> because there could be some dirty buffers so we will have to read them
> through the buffer manager interface, otherwise, we will have to flush
> all the dirty buffers.

Yeah, WAL-logging the contents of the source database would certainly be
less weird than the current system. As Julien also pointed out, the
question is, are there people using on "CREATE DATABASE foo TEMPLATE
bar" to copy a large source database, on the premise that it's fast
because it skips WAL-logging?

In principle, we could have both mechanisms, and use the new WAL-logged
system if the database is small, and the old system with checkpoints if
it's large. But I don't like idea of having to maintain both.

- Heikki

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Robert Haas 2021-06-15 12:19:02 Re: Race condition in recovery?
Previous Message Andrew Dunstan 2021-06-15 11:54:49 Re: Race condition in recovery?