Re: [HACKERS] Unlogged tables cleanup

From: Andres Freund <andres(at)anarazel(dot)de>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Kuntal Ghosh <kuntalghosh(dot)2007(at)gmail(dot)com>, konstantin knizhnik <k(dot)knizhnik(at)postgrespro(dot)ru>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [HACKERS] Unlogged tables cleanup
Date: 2019-05-13 17:37:35
Message-ID: 20190513173735.3znpqfkb3gasig6v@alap3.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 2019-05-13 13:09:17 -0400, Robert Haas wrote:
> On Mon, May 13, 2019 at 12:50 PM Andres Freund <andres(at)anarazel(dot)de> wrote:
> > > AFAICS ResetUnloggedRelations copies the init fork after replaying WAL,
> > > so it would be sufficient to have the init fork be recovered from WAL
> > > for that to work. However, we also do ResetUnloggedRelations *before*
> > > replaying WAL in order to remove leftover not-init-fork files, and that
> > > process requires that the init fork is present at that time.
> >
> > What scenario are you precisely wondering about? That
> > ResetUnloggedRelations() could overwrite the main fork, while not yet
> > having valid contents (due to the lack of smgrimmedsync())? Shouldn't
> > that only be possible while still in an inconsistent state? A checkpoint
> > would have serialized the correct contents, and we'd not reach HS
> > consistency before having replayed that WAL records resetting the table
> > and the init fork consistency?
>
> I think I see what Alvaro is talking about, or at least I think I see
> *a* possible problem based on his remarks.
>
> Suppose we create an unlogged table and then crash. The main fork
> makes it to disk, and the init fork does not. Before WAL replay, we
> remove any main forks that have init forks, but because the init fork
> was lost, that does not happen. Recovery recreates the init fork.
> After WAL replay, we try to copy_file() each _init fork to the
> corresponding main fork. That fails, because copy_file() expects to be
> able to create the target file, and here it can't do that because it
> already exists.

Ugh, this is all such a mess. But, isn't this broken independently of
the smgrimmedsync() issue? In a basebackup case, the basebackup could
have included the main fork, but not the init fork, and the reverse. WAL
replay *solely* needs to be able to recover from that. At the very
least we'd have to do the cleanup step after becoming consistent, not
just before recovery even started.

> If that's the scenario, I'm not sure the smgrimmedsync() call is
> sufficient. Suppose we log_smgrcreate() but then crash before
> smgrimmedsync()... seems like we'd need to do them in the other order,
> or else maybe just pass a flag to copy_file() telling it not to be so
> picky.

I think I mentioned this elsewhere recently, I think our smgr WAL
logging makes this unneccessarily hard. We should just have the intial
smgrcreate record (be it for the main fork, or just the init fork), have
a flag that tells WAL replay to clean out everything pre-existing

Hm, I'm also a bit confused as to how ResetUnloggedRelations() got to
take its argument as a bitmask - given that we only pass in individual
flags at the moment. Seems pretty clear to me that the call at the end
of recovery would need to be UNLOGGED_RELATION_CLEANUP |
UNLOGGED_RELATION_INIT, unless we make some bigger changes?

Greetings,

Andres Freund

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Andres Freund 2019-05-13 17:38:34 Re: [HACKERS] Unlogged tables cleanup
Previous Message Alvaro Herrera 2019-05-13 17:33:00 Re: [HACKERS] Unlogged tables cleanup