RE: In-placre persistance change of a relation

From: Jakub Wartak <Jakub(dot)Wartak(at)tomtom(dot)com>
To: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, "tsunakawa(dot)takay(at)fujitsu(dot)com" <tsunakawa(dot)takay(at)fujitsu(dot)com>
Cc: "osumi(dot)takamichi(at)fujitsu(dot)com" <osumi(dot)takamichi(at)fujitsu(dot)com>, "sfrost(at)snowman(dot)net" <sfrost(at)snowman(dot)net>, "masao(dot)fujii(at)oss(dot)nttdata(dot)com" <masao(dot)fujii(at)oss(dot)nttdata(dot)com>, "ashutosh(dot)bapat(dot)oss(at)gmail(dot)com" <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com>, "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: RE: In-placre persistance change of a relation
Date: 2021-12-17 09:10:30
Message-ID: AM8PR07MB8248217A3AFEC0DD4D59ED83F6789@AM8PR07MB8248.eurprd07.prod.outlook.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

> Kyotaro wrote:
> > In this version, I got rid of the "CLEANUP FORK"s, and added a new
> > system "Smgr marks". The mark files have the name of the
> > corresponding fork file followed by ".u" (which means Uncommitted.).
> > "Uncommited"-marked main fork means the same as the
> CLEANUP2_FORKNUM
> > and uncommitted-marked init fork means the same as the
> CLEANUP_FORKNUM
> > in the previous version.x
> >
> > I noticed that the previous version of the patch still leaves an
> > orphan main fork file after "BEGIN; CREATE TABLE x; ROLLBACK; (crash
> > before checkpoint)" since the "mark" file (or CLEANUP2_FORKNUM) is
> > revmoed at rollback. In this version the responsibility to remove the
> > mark files is moved to SyncPostCheckpoint, where main fork files are
> > actually removed.
>
> For the record, I noticed that basebackup could be confused by the mark files
> but I haven't looked that yet.
>

Good morning Kyotaro,

the patch didn't apply clean (it's from March; some hunks were failing), so I've fixed it and the combined git format-patch is attached. It did conflict with the following:
b0483263dda - Add support for SET ACCESS METHOD in ALTER TABLE
7b565843a94 - Add call to object access hook at the end of table rewrite in ALTER TABLE
9ce346eabf3 - Report progress of startup operations that take a long time.
f10f0ae420 - Replace RelationOpenSmgr() with RelationGetSmgr().

I'm especially worried if I didn't screw up something/forgot something related to the last one (rd->rd_smgr changes), but I'm getting "All 210 tests passed".

Basic demonstration of this patch (with wal_level=minimal):
create unlogged table t6 (id bigint, t text);
-- produces 110GB table, takes ~5mins
insert into t6 select nextval('s1'), repeat('A', 1000) from generate_series(1, 100000000);
alter table t6 set logged;
on baseline SET LOGGED takes: ~7min10s
on patched SET LOGGED takes: 25s

So basically one can - thanks to this patch - use his application (performing classic INSERTs/UPDATEs/DELETEs, so without the need to rewrite to use COPY) and perform literally batch upload and then just switch the tables to LOGGED.

Some more intensive testing also looks good, assuming table prepared to put pressure on WAL:
create unlogged table t_unlogged (id bigint, t text) partition by hash (id);
create unlogged table t_unlogged_h0 partition of t_unlogged FOR VALUES WITH (modulus 4, remainder 0);
[..]
create unlogged table t_unlogged_h3 partition of t_unlogged FOR VALUES WITH (modulus 4, remainder 3);

Workload would still be pretty heavy on LWLock/BufferContent,WALInsert and Lock/extend .
t_logged.sql = insert into t_logged select nextval('s1'), repeat('A', 1000) from generate_series(1, 1000); # according to pg_wal_stats.wal_bytes generates ~1MB of WAL
t_unlogged.sql = insert into t_unlogged select nextval('s1'), repeat('A', 1000) from generate_series(1, 1000); # according to pg_wal_stats.wal_bytes generates ~3kB of WAL

so using: pgbench -f <tabletypetest>.sql -T 30 -P 1 -c 32 -j 3 t
with synchronous_commit =ON(default):
with t_logged.sql: tps = 229 (lat avg = 138ms)
with t_unlogged.sql tps = 283 (lat avg = 112ms) # almost all on LWLock/WALWrite
with synchronous_commit =OFF:
with t_logged.sql: tps = 413 (lat avg = 77ms)
with t_unloged.sql: tps = 782 (lat avg = 40ms)
Afterwards switching the unlogged ~16GB partitions takes 5s per partition.

As the thread didn't get a lot of traction, I've registered it into current commitfest https://commitfest.postgresql.org/36/3461/ with You as the author and in 'Ready for review' state.
I think it behaves as almost finished one and apparently after reading all those discussions that go back over 10years+ time span about this feature, and lot of failed effort towards wal_level=noWAL I think it would be nice to finally start getting some of that of it into the core.

-Jakub Wartak.

Attachment Content-Type Size
v7-0001-In-place-table-persistence-change-with-new-comman.patch application/octet-stream 84.3 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Peter Eisentraut 2021-12-17 09:17:09 Re: Transparent column encryption
Previous Message Peter Eisentraut 2021-12-17 09:06:05 Re: [PoC] Delegating pg_ident to a third party