Re: In-placre persistance change of a relation

From: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
To: tsunakawa(dot)takay(at)fujitsu(dot)com
Cc: osumi(dot)takamichi(at)fujitsu(dot)com, sfrost(at)snowman(dot)net, masao(dot)fujii(at)oss(dot)nttdata(dot)com, ashutosh(dot)bapat(dot)oss(at)gmail(dot)com, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: In-placre persistance change of a relation
Date: 2020-12-24 08:02:20
Message-ID: 20201224.170220.34532914782663309.horikyota.ntt@gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Thanks for the comment! Sorry for the late reply.

At Fri, 4 Dec 2020 07:49:22 +0000, "tsunakawa(dot)takay(at)fujitsu(dot)com" <tsunakawa(dot)takay(at)fujitsu(dot)com> wrote in
> From: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
> > > No, not really. The issue is more around what happens if we crash
> > > part way through. At crash recovery time, the system catalogs are not
> > > available, because the database isn't consistent yet and, anyway, the
> > > startup process can't be bound to a database, let alone every database
> > > that might contain unlogged tables. So the sentinel that's used to
> > > decide whether to flush the contents of a table or index is the
> > > presence or absence of an _init fork, which the startup process
> > > obviously can see just fine. The _init fork also tells us what to
> > > stick in the relation when we reset it; for a table, we can just reset
> > > to an empty file, but that's not legal for indexes, so the _init fork
> > > contains a pre-initialized empty index that we can just copy over.
> > >
> > > Now, to make an unlogged table logged, you've got to at some stage
> > > remove those _init forks. But this is not a transactional operation.
> > > If you remove the _init forks and then the transaction rolls back,
> > > you've left the system an inconsistent state. If you postpone the
> > > removal until commit time, then you have a problem if it fails,
> >
> > It's true. That are the cause of headache.
> ...
> > The current implement is simple. It's enough to just discard old or
> > new relfilenode according to the current transaction ends with commit
> > or abort. Tweaking of relfilenode under use leads-in some skews in
> > some places. I used pendingDelete mechanism a bit complexified way
> > and a violated an abstraction (I think, calling AM-routines from
> > storage.c is not good.) and even introduce a new fork kind only to
> > mark a init fork as "not committed yet". There might be better way,
> > but I haven't find it.
>
> I have no alternative idea yet, too. I agree that we want to avoid them, especially introducing inittmp fork... Anyway, below are the rest of my review comments for 0001. I want to review 0002 when we have decided to go with 0001.
>
>
> (2)
> XLOG_SMGR_UNLINK seems to necessitate modification of the following comments:
>
> [src/include/catalog/storage_xlog.h]
> /*
> * Declarations for smgr-related XLOG records
> *
> * Note: we log file creation and truncation here, but logging of deletion
> * actions is handled by xact.c, because it is part of transaction commit.
> */

Sure. Rewrote it.

> [src/backend/access/transam/README]
> 3. Deleting a table, which requires an unlink() that could fail.
>
> Our approach here is to WAL-log the operation first, but to treat failure
> of the actual unlink() call as a warning rather than error condition.
> Again, this can leave an orphan file behind, but that's cheap compared to
> the alternatives. Since we can't actually do the unlink() until after
> we've committed the DROP TABLE transaction, throwing an error would be out
> of the question anyway. (It may be worth noting that the WAL entry about
> the file deletion is actually part of the commit record for the dropping
> transaction.)

Mmm. I didn't touched theDROP TABLE (RelationDropStorage) path, but I
added a brief description about INITTMP fork to the file.

====
The INITTMP fork file
--------------------------------

An INITTMP fork is created when new relation file is created to mark
the relfilenode needs to be cleaned up at recovery time. The file is
removed at transaction end but is left when the process crashes before
the transaction ends. In contrast to 4 above, failure to remove an
INITTMP file will lead to data loss, in which case the server will
shut down.
====

> (3)
> +/* This is bit-map, not ordianal numbers */
>
> There seems to be no comments using "bit-map". "Flags for ..." can be seen here and there.

I revmoed the comment and use (1 << n) notation to show the fact
instead.

> (4)
> Some wrong spellings:
>
> swithing -> switching
> alredy -> already
> recovery -> recover
> situaton -> situation

Oops! Fixed them.

> (5)
> + table_close(classRel, NoLock);
> +
> +
> +
> +
> }
>
> These empty lines can be deleted.

s/can/should/ :p. Fixed.

>
> (6)
> +/*
> + * Perform XLogInsert of an XLOG_SMGR_UNLINK record to WAL.
> + */
> +void
> +log_smgrbufpersistence(const RelFileNode *rnode, bool persistence)
> ...
> + * Make an XLOG entry reporting the file unlink.
>
> Not unlink but buffer persistence?

Silly copy-pasto. Fixed.

> (7)
> + /*
> + * index-init fork needs further initialization. ambuildempty shoud do
> + * WAL-log and file sync by itself but otherwise we do that by myself.
> + */
> + if (rel->rd_rel->relkind == RELKIND_INDEX)
> + rel->rd_indam->ambuildempty(rel);
> + else
> + {
> + log_smgrcreate(&rnode, INIT_FORKNUM);
> + smgrimmedsync(srel, INIT_FORKNUM);
> + }
> +
> + /*
> + * We have created the init fork. If server crashes before the current
> + * transaction ends the init fork left alone corrupts data while recovery.
> + * The inittmp fork works as the sentinel to identify that situaton.
> + */
> + smgrcreate(srel, INITTMP_FORKNUM, false);
> + log_smgrcreate(&rnode, INITTMP_FORKNUM);
> + smgrimmedsync(srel, INITTMP_FORKNUM);
>
> If the server crashes between these two processings, only the init fork exists. Is it correct to create the inittmp fork first?

Right. I change it that way, and did the same with the new code added
to RelationCreateStorage.

> (8)
> + if (inxact_created)
> + {
> + SMgrRelation srel = smgropen(rnode, InvalidBackendId);
> + smgrclose(srel);
> + log_smgrunlink(&rnode, INIT_FORKNUM);
> + smgrunlink(srel, INIT_FORKNUM, false);
> + log_smgrunlink(&rnode, INITTMP_FORKNUM);
> + smgrunlink(srel, INITTMP_FORKNUM, false);
> + return;
> + }
>
> smgrclose() should be called just before return.
> Isn't it necessary here to revert buffer persistence state change?

Mmm. it's a thinko. I was confused with the case of
close/unlink. Fixed all instacnes of the same.

> (9)
> +void
> +smgrunlink(SMgrRelation reln, ForkNumber forknum, bool isRedo)
> +{
> + smgrsw[reln->smgr_which].smgr_unlink(reln->smgr_rnode, forknum, isRedo);
> +}
>
> Maybe it's better to restore smgrdounlinkfork() that was removed in the older release. That function includes dropping shared buffers, which can clean up the shared buffers that may be cached by this transaction.

INITFORK/INITTMP forks cannot be loaded to shared buffer so it's no
use to drop buffers. I added a comment like that.

| /*
| * INIT/INITTMP forks never be loaded to shared buffer so no point in
| * dropping buffers for these files.
| */
| log_smgrunlink(&rnode, INIT_FORKNUM);

I removed DropRelFileNodeBuffers from PDOP_UNLINK_FORK branch in
smgrDoPendingDeletes and added an assertion and a comment instead.

| /* other forks needs to drop buffers */
| Assert(pending->unlink_forknum == INIT_FORKNUM ||
| pending->unlink_forknum == INITTMP_FORKNUM);
|
| log_smgrunlink(&pending->relnode, pending->unlink_forknum);
| smgrunlink(srel, pending->unlink_forknum, false);

> (10)
> [RelationDropInitFork]
> + /* revert buffer-persistence changes at abort */
> + pending = (PendingRelDelete *)
> + MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete));
> + pending->relnode = rnode;
> + pending->op = PDOP_SET_PERSISTENCE;
> + pending->bufpersistence = false;
> + pending->backend = InvalidBackendId;
> + pending->atCommit = true;
> + pending->nestLevel = GetCurrentTransactionNestLevel();
> + pending->next = pendingDeletes;
> + pendingDeletes = pending;
> +}
>
> bufpersistence should be true.

RelationDropInitFork() chnages the relation persisitence to
"persistent" so it shoud be reverted to "non-persistent (= false)" at
abort. (I agree that the function name is somewhat confusing...)

> (11)
> + BlockNumber block = 0;
> ...
> + DropRelFileNodeBuffers(rbnode, &pending->unlink_forknum, 1,
> + &block);
>
> "block" is unnecessary and 0 can be passed directly.

I removed the entire function call.

But, I don't think you're right here.

| DropRelFileNodeBuffers(RelFileNodeBackend rnode, ForkNumber *forkNum,
| int nforks, BlockNumber *firstDelBlock)

Doesn't just passing 0 lead to SEGV?

> (12)
> - && pending->backend == InvalidBackendId)
> + && pending->backend == InvalidBackendId &&
> + pending->op == PDOP_DELETE)
> nrels++;
>
> It's better to put && at the beginning of the line to follow the existing code here.

It's terrible.. Fixed.

> (13)
> + table_close(rel, lockmode);
>
> lockmode should be NoLock to retain the lock until transaction completion.

I tried to recall the reason for that, but didn't come up with
anything. Fixed.

> (14)
> + ctl.keysize = sizeof(unlogged_relation_entry);
> + ctl.entrysize = sizeof(unlogged_relation_entry);
> + hash = hash_create("unlogged hash", 32, &ctl, HASH_ELEM);
> ...
> + memset(key.oid, 0, sizeof(key.oid));
> + memcpy(key.oid, de->d_name, oidchars);
> + ent = hash_search(hash, &key, HASH_FIND, NULL);
>
> keysize should be the oid member of the struct.

It's not a problem since the first member is the oid and perhaps it
seems that I thougth to do someting more on that. Now that I don't
recall what is it and in the first place the key should be just Oid in
the context above. Fixed.

The patch is attached to the next message.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Michael Paquier 2020-12-24 08:12:02 Re: doc review for v14
Previous Message Bharath Rupireddy 2020-12-24 07:37:28 Re: Parallel Inserts in CREATE TABLE AS