Re: In-place persistance change of a relation

From: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
To: sfrost(at)snowman(dot)net
Cc: osumi(dot)takamichi(at)fujitsu(dot)com, masao(dot)fujii(at)oss(dot)nttdata(dot)com, ashutosh(dot)bapat(dot)oss(at)gmail(dot)com, tsunakawa(dot)takay(at)fujitsu(dot)com, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: In-place persistance change of a relation
Date: 2020-11-12 06:55:37
Message-ID: 20201112.155537.905170546612920882.horikyota.ntt@gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

At Wed, 11 Nov 2020 09:56:44 -0500, Stephen Frost <sfrost(at)snowman(dot)net> wrote in
> Greetings,
>
> * Kyotaro Horiguchi (horikyota(dot)ntt(at)gmail(dot)com) wrote:
> > At Tue, 10 Nov 2020 09:33:12 -0500, Stephen Frost <sfrost(at)snowman(dot)net> wrote in
> > > * Kyotaro Horiguchi (horikyota(dot)ntt(at)gmail(dot)com) wrote:
> > > > For fuel(?) of the discussion, I tried a very-quick PoC for in-place
> > > > ALTER TABLE SET LOGGED/UNLOGGED and resulted as attached. After some
> > > > trials of several ways, I drifted to the following way after poking
> > > > several ways.
> > > >
> > > > 1. Flip BM_PERMANENT of active buffers
> > > > 2. adding/removing init fork
> > > > 3. sync files,
> > > > 4. Flip pg_class.relpersistence.
> > > >
> > > > It always skips table copy in the SET UNLOGGED case, and only when
> > > > wal_level=minimal in the SET LOGGED case. Crash recovery seems
> > > > working by some brief testing by hand.
> > >
> > > Somehow missed that this patch more-or-less does what I was referring to
> > > down-thread, but I did want to mention that it looks like it's missing a
> > > necessary FlushRelationBuffers() call before the sync, otherwise there
> > > could be dirty buffers for the relation that's being set to LOGGED (with
> > > wal_level=minimal), which wouldn't be good. See the comments above
> > > smgrimmedsync().
> >
> > Right. Thanks. However, since SetRelFileNodeBuffersPersistence()
> > called just above scans shared buffers so I don't want to just call
> > FlushRelationBuffers() separately. Instead, I added buffer-flush to
> > SetRelFileNodeBuffersPersistence().
>
> Maybe I'm missing something, but it sure looks like in the patch that
> SetRelFileNodeBuffersPersistence() is being called after the
> smgrimmedsync() call, and I don't think you get to just switch the order
> of those- the sync is telling the kernel to make sure it's written to
> disk, while the FlushBuffer() is just writing it into the kernel but
> doesn't provide any guarantee that the data has actually made it to
> disk. We have to FlushBuffer() first, and then call smgrimmedsync().
> Perhaps there's a way to avoid having to go through shared buffers
> twice, and I generally agreed it'd be good if we could avoid doing so,
> but this approach doesn't look like it actually works.

Yeah, sorry for the rare-baked version.. I was confused about the
order at the time. The next version works like this:

LOGGED->UNLOGGED
<collect reloids to process>

for each relations:
<set buffer persistence to !BM_PERMANENT (wal-logged if walleve > minimal>
<create init fork>
if it is index call ambuildempty() (which syncs the init fork)
else WAL-log smgr_create then sync the init file.
<update catalog>
...
commit time:
<do nogthing>
abort time:
<unlink init fork>
<revert buffer persistence>

UNLOGGED->LOGGED
<collect reloids to process>

for each relations:
<set buffer persistence to !BM_PERMANENT (wal-logged if walleve > minimal>
<record drop-init-fork to pending-deletes>
<sync storage files>
<update catalog>
...
commit time:
<log smgrunlink>
<smgrunlink init fork>
abort time:
<revert buffer persistence>

> > FWIW this is a revised version of the PoC, which has some known
> > problems.
> >
> > - Flipping of Buffer persistence is not WAL-logged nor even be able to
> > be safely roll-backed. (It might be better to drop buffers).
>
> Not sure if it'd be better to drop buffers or not, but figuring out how
> to deal with rollback seems pretty important. How is the persistence
> change in the catalog not WAL-logged though..?

Rollback works as the above. Buffer persistence change is registered
in pending-deletes. Persistence change in catalog is rolled back in
the ordinary way (or automatically).

If wal_level > minimal, persistence change of buffers is propagated to
standbys by WAL. However I'm not sure we need wal-logging otherwise,
the next version emits WAL since SMGR_CREATE is always logged by
existing code.

> > - This version handles indexes but not yet handle toast relatins.
>
> Would need to be fixed, of course.

Fixed.

> > - tableAMs are supposed to support this feature. (but I'm not sure
> > it's worth allowing them not to do so).
>
> Seems like they should.

Init fork of index relations needs a call to ambuildempty() instead of
"log_smgrcreate-smgrimmedsync" after smgrcreate. Instead of adding
similar interface in indexAm, I reverted changes of tableam and make
RelationCreate/DropInitFork() directly do that. That introduces new
include of amapi.h to storage.c, which is a bit uneasy.

The previous version give up the in-place persistence change in the
case where wal_level > minimal and SET LOGGED since that needs WAL to
be emitted. However, in-place change still has the advantage of not
running a table copy. So the next verson always runs persistence
change in-place.

As suggested by Andres, I'll send a summary of this patch. The patch
will be attached to the coming mail.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Peter Smith 2020-11-12 06:59:20 Re: [HACKERS] logical decoding of two-phase transactions
Previous Message Kyotaro Horiguchi 2020-11-12 06:55:24 Re: In-placre persistance change of a relation