Re: Implement UNLOGGED clause for COPY FROM

From: Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com>
To: "osumi(dot)takamichi(at)fujitsu(dot)com" <osumi(dot)takamichi(at)fujitsu(dot)com>, "'pgsql-hackers(at)postgresql(dot)org'" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Implement UNLOGGED clause for COPY FROM
Date: 2020-07-09 13:03:10
Message-ID: 86ea09f8-a5ee-a998-84c8-e4e0c9425edb@oss.nttdata.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 2020/07/09 15:17, osumi(dot)takamichi(at)fujitsu(dot)com wrote:
> Fujii-san
>
> Thank you for your interest in this idea.
>
>> This feature can work safely with wal_level=replica or logical?
>> Or it can work only with wal_level=minimal?
>> If yes, what is the main difference
>> between this method and wal_skip_threshold?
> I'm thinking this feature can be used
> when you set any parameters of wal_level.
> Besides that, data loading into a table *with some data*
> should be allowed. This means I don't want to limit
> the usage of this feature only for initial load
> for empty table or under condition of 'minimal' wal_level in other words.
>
> Let me explain more detail of the background.
>
> I got a report that one of my customers says that
> multiple COPY from multiple sessions have a bottleneck to write WAL.
> Her use case was DWH system using postgres mainly to load dozens of GB (or more) log data
> from multiple data sources to execute night batch processing everyday.
>
> Her scenario included both initial load to empty table
> and load to table that already has records.

Yes, I understand this use case.

>
> In passing, she also used our company's product of parallel loader,
> to load data with dozens of, nearly 100, BGWs at the same time.
> Through investigation of iostat,
> they found the same problem that CPU worked for WAL write intensively.
>
> This could happen after the implementation
> of Parallel copy that is really hotly discussed and reviewed in the mailing lists now.
> So I thought it's good to discuss this in advance.
>
>>> 4. Execute the data loading, bypassing WAL generation for data.
>>> 5. Sync the data to disk by performing checkpoint.
>>
>> What happens if the server crashes before #5? Since no WAL for data-loading can
>> be replayed, the target table should be truncated?
> My answer for this is just to load that COPY data again.
> It's because the application itself knows what kind of data was loaded
> from the command.

When the server crashes before #5, some table and index pages that #4 loaded
the data into might have been partially synced to the disk because of bgwriter
or shared buffer replacement. So ISTM that the target table needs to be
truncated to empty during recovery and users need to load whole the data into
the table again. Is my understanding right? If yes, isn't this the same feature
as that UNLOGGED table provides?

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Amit Kapila 2020-07-09 13:24:47 Re: replication_origin and replication_origin_lsn usage on subscriber
Previous Message torikoshia 2020-07-09 12:59:32 Re: [doc] modifying unit from characters to bytes