Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From: Dilip Kumar <dilipbalaut(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Ashutosh Sharma <ashu(dot)coek88(at)gmail(dot)com>, Maciek Sakrejda <m(dot)sakrejda(at)gmail(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>, Andrew Dunstan <andrew(at)dunslane(dot)net>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints
Date: 2022-03-09 11:07:01
Message-ID: CAFiTN-tp24H_9CiJ+yg+AQJB3VXu9ZD5Ri8F1tn24yvJn+AYsg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, Mar 9, 2022 at 3:12 AM Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>
Thanks for reviewing and valuable feedback.

> Reviewing 0001, the boundaries of the critical section move slightly,
> but only over a memcpy, which can't fail, so that seems fine. But this
> comment looks ominous:
>
> * Note: we're cheating a little bit here by assuming that mapped files
> * are either in pg_global or the database's default tablespace.
>
> It's not clear to me how the code that follows relies on this
> assumption, but the overall patch set would make that not true any
> more, so there's some kind of an issue to think about there.

I think the comments are w.r.t choosing the file path, because here we
assume either it is in the global tablespace or default tablespace of
the database. Here also the comment is partially true because we also
assume that it will be in the default tablespace of the database
(because we do not need to worry about the shared relations). But I
think this comments can move to the caller function where we are
creating the file path.

if (shared)
{
snprintf(mapfilename, sizeof(mapfilename), "global/%s",
RELMAPPER_FILENAME);
}
else
{
snprintf(mapfilename, sizeof(mapfilename), "%s/%s",
dbpath, RELMAPPER_FILENAME);
}

> It's a little asymmetric that load_relmap_file() gets a subroutine
> read_relmap_file() while write_relmap_file() gets a subroutine
> write_relmap_file_internal(). Perhaps we could call the functions
> {load,write}_named_relmap_file() or something of that sort.

Yeah this should be changed.

> Reviewing 0002, your comment updates in relmap_redo() are not
> complete. Note that there's an unmodified comment that says "Write out
> the new map and send sinval" just above where you modify the code to
> only conditionally send sinval. I'm somewhat uncomfortable with the
> shape of this logic, too. It looks weird to be sometimes calling
> write_relmap_file and sometimes write_relmap_file_internal. You'd
> expect functions with those names to be called at different
> abstraction levels, rather than at parallel call sites. The renaming I
> proposed would help with this but it's not just a cosmetic issue: the
> logic to construct mapfilename is in this function in one case, but in
> the called function in the other case. I can't help but think that the
> write_relmap_file()/write_relmap_file_internal() split isn't entirely
> the right thing.
>
> I think part of the confusion here is that, pre-patch,
> write_relmap_file() gets called during both recovery and normal
> running, and the updates to shared_map or local_map are actually
> nonsense during recovery, because the local map at least is local to
> whatever our database is, and we don't have a database connection if
> we're the startup process.

Yeah you are correct about the local map, but I am not sure whether we
can rely on not updating the shared map in the startup process.
Because how can we guarantee that now or in future the startup process
can never look into the map? I agree that it is not connected to the
database so it doesn't make sense to look into the local map but how
we are going to ensure the shared map. Said that I think there are
only 3 function which must be looking at these maps directly
RelationMapOidToFilenode(), RelationMapFilenodeToOid() and
RelationMapUpdateMap() and these functions are called from a very few
places and I don't think these should be called during recovery. So
probably we can put a elog saying they should never be called during
recovery?

After your patch, we're still going through
> write_relmap_file in recovery in some cases, but really those map
> updates don't seem like things that should be happening at all. And on
> the other hand it's not clear to me why the CRC stuff isn't needed in
> all cases, but that's only going to happen when we go through the
> non-internal version of the function. You've probably spent more time
> looking at this code than I have, but I'm wondering if the division
> should be like this: we have one function that does the actual update,
> and another function that does the update plus sets global variables.
> Recovery always uses the first one, and outside of recovery we use the
> first one for the create-database case and the second one otherwise.
> Thoughts?

Right, infact now also if you see the logic, the
write_relmap_file_internal() is taking care of the actual update and
the write_relmap_file() is doing update + setting the global
variables. So yeah we can rename as you suggested in 0001 and here
also we can change the logic as you suggested that the recovery and
createdb will only call the first function which is just doing the
update.

> Regarding 0003, my initial thought was to like the fact that you'd
> avoided duplicating code by using a function parameter, but as I look
> at it a bit more, it's not clear to me that it's enough code that we
> really care about not duplicating it. I would not expect to find a
> function called RelationCopyAllFork() in tablecmds.c.

Okay, actually I see this logic of copying the fork at a few different
places like
index_copy_data() in tablecmds.c. and then in
heapam_relation_copy_data() in heapam_handler.c. So I was not sure
what could be right place for this function so I kept it in the same
file (tablecmds.c) because I splitted it from the function in this
file.

I'd expect to
> find it in storage.c, I think. And I think I'd be surprised to find
> out that it doesn't actually know anything about copying; it's
> basically just a loop over the forks to which you can supply your own
> copy-function.

Yeah but it eventually expects a function pointer to copy storage so
we can not completely deny that it knows nothing about the copy?

And the fact that it's got an argument of type
> copy_relation_storage and the argument name is copy_storage and the
> value is sometimes RelationCopyStorageis a terminological muddle, too.
> So I feel like perhaps this needs more thought.

One option is that we can duplicate this loop in dbcommand.c as well
like we are having it already in tablecmds.c and heapam_handler.c?

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Dong Wook Lee 2022-03-09 11:13:15 Re: Add pg_freespacemap extension sql test
Previous Message Peter Eisentraut 2022-03-09 10:35:32 Re: parse/analyze API refactoring