RE: In-placre persistance change of a relation

From: "tsunakawa(dot)takay(at)fujitsu(dot)com" <tsunakawa(dot)takay(at)fujitsu(dot)com>
To: 'Kyotaro Horiguchi' <horikyota(dot)ntt(at)gmail(dot)com>
Cc: "osumi(dot)takamichi(at)fujitsu(dot)com" <osumi(dot)takamichi(at)fujitsu(dot)com>, "sfrost(at)snowman(dot)net" <sfrost(at)snowman(dot)net>, "masao(dot)fujii(at)oss(dot)nttdata(dot)com" <masao(dot)fujii(at)oss(dot)nttdata(dot)com>, "ashutosh(dot)bapat(dot)oss(at)gmail(dot)com" <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com>, "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: RE: In-placre persistance change of a relation
Date: 2020-12-04 07:49:22
Message-ID: TYAPR01MB299045E3B0C655D4066A9F0AFEF10@TYAPR01MB2990.jpnprd01.prod.outlook.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

From: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
> > No, not really. The issue is more around what happens if we crash
> > part way through. At crash recovery time, the system catalogs are not
> > available, because the database isn't consistent yet and, anyway, the
> > startup process can't be bound to a database, let alone every database
> > that might contain unlogged tables. So the sentinel that's used to
> > decide whether to flush the contents of a table or index is the
> > presence or absence of an _init fork, which the startup process
> > obviously can see just fine. The _init fork also tells us what to
> > stick in the relation when we reset it; for a table, we can just reset
> > to an empty file, but that's not legal for indexes, so the _init fork
> > contains a pre-initialized empty index that we can just copy over.
> >
> > Now, to make an unlogged table logged, you've got to at some stage
> > remove those _init forks. But this is not a transactional operation.
> > If you remove the _init forks and then the transaction rolls back,
> > you've left the system an inconsistent state. If you postpone the
> > removal until commit time, then you have a problem if it fails,
>
> It's true. That are the cause of headache.
...
> The current implement is simple. It's enough to just discard old or
> new relfilenode according to the current transaction ends with commit
> or abort. Tweaking of relfilenode under use leads-in some skews in
> some places. I used pendingDelete mechanism a bit complexified way
> and a violated an abstraction (I think, calling AM-routines from
> storage.c is not good.) and even introduce a new fork kind only to
> mark a init fork as "not committed yet". There might be better way,
> but I haven't find it.

I have no alternative idea yet, too. I agree that we want to avoid them, especially introducing inittmp fork... Anyway, below are the rest of my review comments for 0001. I want to review 0002 when we have decided to go with 0001.

(2)
XLOG_SMGR_UNLINK seems to necessitate modification of the following comments:

[src/include/catalog/storage_xlog.h]
/*
* Declarations for smgr-related XLOG records
*
* Note: we log file creation and truncation here, but logging of deletion
* actions is handled by xact.c, because it is part of transaction commit.
*/

[src/backend/access/transam/README]
3. Deleting a table, which requires an unlink() that could fail.

Our approach here is to WAL-log the operation first, but to treat failure
of the actual unlink() call as a warning rather than error condition.
Again, this can leave an orphan file behind, but that's cheap compared to
the alternatives. Since we can't actually do the unlink() until after
we've committed the DROP TABLE transaction, throwing an error would be out
of the question anyway. (It may be worth noting that the WAL entry about
the file deletion is actually part of the commit record for the dropping
transaction.)

(3)
+/* This is bit-map, not ordianal numbers */

There seems to be no comments using "bit-map". "Flags for ..." can be seen here and there.

(4)
Some wrong spellings:

+ /* we flush this buffer when swithing to PERMANENT */

swithing -> switching

+ * alredy flushed out by RelationCreate(Drop)InitFork called just

alredy -> already

+ * relation content to be WAL-logged to recovery the table.

recovery -> recover

+ * The inittmp fork works as the sentinel to identify that situaton.

situaton -> situation

(5)
+ table_close(classRel, NoLock);
+
+
+
+
}

These empty lines can be deleted.

(6)
+/*
+ * Perform XLogInsert of an XLOG_SMGR_UNLINK record to WAL.
+ */
+void
+log_smgrbufpersistence(const RelFileNode *rnode, bool persistence)
...
+ * Make an XLOG entry reporting the file unlink.

Not unlink but buffer persistence?

(7)
+ /*
+ * index-init fork needs further initialization. ambuildempty shoud do
+ * WAL-log and file sync by itself but otherwise we do that by myself.
+ */
+ if (rel->rd_rel->relkind == RELKIND_INDEX)
+ rel->rd_indam->ambuildempty(rel);
+ else
+ {
+ log_smgrcreate(&rnode, INIT_FORKNUM);
+ smgrimmedsync(srel, INIT_FORKNUM);
+ }
+
+ /*
+ * We have created the init fork. If server crashes before the current
+ * transaction ends the init fork left alone corrupts data while recovery.
+ * The inittmp fork works as the sentinel to identify that situaton.
+ */
+ smgrcreate(srel, INITTMP_FORKNUM, false);
+ log_smgrcreate(&rnode, INITTMP_FORKNUM);
+ smgrimmedsync(srel, INITTMP_FORKNUM);

If the server crashes between these two processings, only the init fork exists. Is it correct to create the inittmp fork first?

(8)
+ if (inxact_created)
+ {
+ SMgrRelation srel = smgropen(rnode, InvalidBackendId);
+ smgrclose(srel);
+ log_smgrunlink(&rnode, INIT_FORKNUM);
+ smgrunlink(srel, INIT_FORKNUM, false);
+ log_smgrunlink(&rnode, INITTMP_FORKNUM);
+ smgrunlink(srel, INITTMP_FORKNUM, false);
+ return;
+ }

smgrclose() should be called just before return.
Isn't it necessary here to revert buffer persistence state change?

(9)
+void
+smgrunlink(SMgrRelation reln, ForkNumber forknum, bool isRedo)
+{
+ smgrsw[reln->smgr_which].smgr_unlink(reln->smgr_rnode, forknum, isRedo);
+}

Maybe it's better to restore smgrdounlinkfork() that was removed in the older release. That function includes dropping shared buffers, which can clean up the shared buffers that may be cached by this transaction.

(10)
[RelationDropInitFork]
+ /* revert buffer-persistence changes at abort */
+ pending = (PendingRelDelete *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete));
+ pending->relnode = rnode;
+ pending->op = PDOP_SET_PERSISTENCE;
+ pending->bufpersistence = false;
+ pending->backend = InvalidBackendId;
+ pending->atCommit = true;
+ pending->nestLevel = GetCurrentTransactionNestLevel();
+ pending->next = pendingDeletes;
+ pendingDeletes = pending;
+}

bufpersistence should be true.

(11)
+ BlockNumber block = 0;
...
+ DropRelFileNodeBuffers(rbnode, &pending->unlink_forknum, 1,
+ &block);

"block" is unnecessary and 0 can be passed directly.

(12)
- && pending->backend == InvalidBackendId)
+ && pending->backend == InvalidBackendId &&
+ pending->op == PDOP_DELETE)
nrels++;

It's better to put && at the beginning of the line to follow the existing code here.

(13)
+ table_close(rel, lockmode);

lockmode should be NoLock to retain the lock until transaction completion.

(14)
+ ctl.keysize = sizeof(unlogged_relation_entry);
+ ctl.entrysize = sizeof(unlogged_relation_entry);
+ hash = hash_create("unlogged hash", 32, &ctl, HASH_ELEM);
...
+ memset(key.oid, 0, sizeof(key.oid));
+ memcpy(key.oid, de->d_name, oidchars);
+ ent = hash_search(hash, &key, HASH_FIND, NULL);

keysize should be the oid member of the struct.

Regards
Takayuki Tsunakawa

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Julien Rouhaud 2020-12-04 08:15:48 Re: pg_stat_statements oddity with track = all
Previous Message Andrey Borodin 2020-12-04 07:33:44 Logical archiving