RE: Implement UNLOGGED clause for COPY FROM

From: "osumi(dot)takamichi(at)fujitsu(dot)com" <osumi(dot)takamichi(at)fujitsu(dot)com>
To: 'Tom Lane' <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "'David G(dot) Johnston'" <david(dot)g(dot)johnston(at)gmail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: RE: Implement UNLOGGED clause for COPY FROM
Date: 2020-07-17 03:04:25
Message-ID: OSBPR01MB488867D6F32664057CD59DC6ED7C0@OSBPR01MB4888.jpnprd01.prod.outlook.com
Lists: pgsql-hackers

Hi,

> AFAICS, we can already accomplish basically the same thing as what you want to
> do like this:
>
> alter table foo set unlogged;
> copy foo from ...;
> alter table foo set logged;
This doesn't achieve what I want.
When 'foo' already holds a huge number of rows, this sequence
copies the contents of 'foo' twice, since each ALTER TABLE ...
SET UNLOGGED/LOGGED rewrites the whole table to swap relfilenodes
atomically. When the data loaded by COPY is large as well, the
total execution time becomes much longer still.
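
By contrast, what I have in mind is a single command that skips WAL
only for the newly loaded data and never rewrites the existing rows
of 'foo'. As a rough illustration (the exact grammar is not settled,
so take this only as a sketch):

    copy foo from '/path/to/data.csv' unlogged;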

> You keep on ignoring the indexes... not to mention replication.
Sorry for giving you that impression.

If the server crashes during a COPY UNLOGGED data load, the indexes
must of course be kept consistent with the table.
My idea is to rebuild the indexes on the target table during recovery.
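
Conceptually, the recovery step would be equivalent to the server
running the following on the target table after the crash (shown as
SQL only for illustration):

    reindex table foo;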

In my opinion, the UNLOGGED clause must be designed to guarantee
that the data it loads is written starting at the end of the
relation's existing blocks. In addition, those new blocks must be
protected from writes by other transactions for the duration of the
copy. The server must also record which block is the first new one,
i.e. the range where the load started and ended, in preparation for
a crash.
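
Just to make the boundary concrete: the first block written by the
load is the relation's length in blocks when the command starts. At
the SQL level that corresponds to something like the query below,
though the real bookkeeping would of course live inside the server:

    select pg_relation_size('foo') / current_setting('block_size')::bigint
      as first_new_block;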

During crash recovery, that boundary information makes it possible
to recognize and detach those blocks, resolving the situation where
the loaded data has been only partially synced to disk while the
rest has not. As for the indexes, we can then recreate them from the
relation's data, which the mechanism above keeps consistent.

Another idea for index crash recovery was to copy the target
table's indexes as a backup just before loading, and to write the
index entries produced by the load into those temporary index files,
thereby localizing the new entries. However, my purpose is to
accelerate data loading precisely when the target table initially
holds a huge amount of data, and by that criterion the initial copy
of the indexes would slow execution down. I therefore chose
rebuilding the indexes.

Another point I need to address for recovery is how the startup
process learns that a COPY UNLOGGED was in progress. My current idea
is to reuse some existing system file, or to create a new one in the
cluster, for this clause. At a minimum, postgres must be able to
identify, from the moment the command starts executing, which blocks
would have to be detached, so we need to record that information.

Lastly, I have to admit that in a replication environment the
target table of a COPY UNLOGGED would have to be marked invalid and
the standbys notified from the point in time the operation takes
place. However, I intend to restrict this clause to users with
special privileges (such as DBAs), who can decide that such tables
need not be replicated. Of course, I keep looking for a better idea,
but for now this is all I can say about replication.

Best,
Takamichi Osumi
