RE: Disable WAL logging to speed up data loading

From: "tsunakawa(dot)takay(at)fujitsu(dot)com" <tsunakawa(dot)takay(at)fujitsu(dot)com>
To: 'Stephen Frost' <sfrost(at)snowman(dot)net>
Cc: Laurenz Albe <laurenz(dot)albe(at)cybertec(dot)at>, 'David Steele' <david(at)pgmasters(dot)net>, "osumi(dot)takamichi(at)fujitsu(dot)com" <osumi(dot)takamichi(at)fujitsu(dot)com>, 'Kyotaro Horiguchi' <horikyota(dot)ntt(at)gmail(dot)com>, "sawada(dot)mshk(at)gmail(dot)com" <sawada(dot)mshk(at)gmail(dot)com>, "robertmhaas(at)gmail(dot)com" <robertmhaas(at)gmail(dot)com>, "masao(dot)fujii(at)oss(dot)nttdata(dot)com" <masao(dot)fujii(at)oss(dot)nttdata(dot)com>, "ashutosh(dot)bapat(dot)oss(at)gmail(dot)com" <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com>, "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: RE: Disable WAL logging to speed up data loading
Date: 2021-03-24 08:15:44
Message-ID: OSAPR01MB29771A38CEFAD90E6F3CA2D0FE639@OSAPR01MB2977.jpnprd01.prod.outlook.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

From: Stephen Frost <sfrost(at)snowman(dot)net>
> * tsunakawa(dot)takay(at)fujitsu(dot)com (tsunakawa(dot)takay(at)fujitsu(dot)com) wrote:
> > As Laurenz-san kindly replied, the database server refuses to start with a
> clear message. So, it's similarly very clear what happens. The user will never
> unknowingly resume operation with possibly corrupt data.
>
> No, instead they'll just have a completely broken system that has to be rebuilt or
> restored from a backup. That doesn't strike me as a good result.

Your understanding is correct. What I wanted to answer to confirm is that the behavior is clear: the server refuses to start, and the user knows he/she should restore the database from backup.

> > So, I understood the point boils down to elegance. Could I ask what makes
> you feel ALTER TABLE UNLOGGED/LOGGED is (more) elegant? I'm purely
> asking as a user.
>
> The impact is localized to those specific tables. The rest of the system should
> come up cleanly and there won't be corruption, instead merely the lack of data
> in UNLOGGED tables.

So, I took your point as the ease and fast time of restore after a crash: the user just has to restore the lost table data using COPY FROM from files that was saved before the data loading job using COPY TO.

In that sense, the backup and restoration of the whole database is an option for users when they have some instant backup and restore feature available.

> > (I don't want to digress, but if we consider the number of options for
> > wal_level as an issue, I feel it's not elegant to have separate
> > "replica" and "logical".)
>
> Do you know of a way to avoid having those two distinct levels while still writing
> only the WAL needed depending on if a system is doing logical replication or
> not..? If you've got suggestions on how to eliminate one of those levels, I'm
> sure there would be interest in doing so. I don't see the fact that we have
> those two levels as justification for adding another spelling of 'minimal'.

Sorry, I have almost no knowledge of logical replication implementation. So, being ignorant of its intricacies, I have felt like as a user "Why do I have to set wal_level = logical, because streaming replication and logical replication are both replication features? If the implementation needs some additional WAL for logical replication, why doesn't the server automatically emit the WAL when the target table of DML statements is in a publication?"

> > The elegance of wal_level = none is that the user doesn't have to
> > remember to add ALTER TABLE to the data loading job when they add load
> > target tables/partitions. If they build and use their own (shell)
> > scripts to load data, that won't be burdon or forgotten. But what
> > would they have to do when they use ETL tools like Talend, Pentaho,
> > and Informatica Power Center? Do those tools allow users to add
> > custom processing like ALTER TABLE to the data loading job steps for
> > each table? (AFAIK, not.)
>
> I don't buy the argument that having to 'remember' to do an ALTER TABLE is
> such a burden when it means that the database will still be consistent and
> operational after a crash.

That depends on whether an instant backup and restore feature is at hand. If the user is comfortable with it, wal_level = none is easier and more attractive. At least, I don't want the feature to be denied.

> As for data loading tools, surely they support loading data into UNLOGGED
> tables and it's certainly not hard to have a script run around and flip those
> tables to LOGGED after they're loaded, and I do actually believe some of those
> tools support building processes of which one step could be such a command
> (I'm fairly confident Pentaho, in particular, does as I remember building such
> pipelines myself...).

Oh, Pentaho has such a feature, doesn't it? But isn't it a separate step from the data output step? Here, I assume ETL tools allow users to compose a data loading job from multiple steps: data input, transformation, data output, etc. I guess the user can't directly incorporate ALTER TABLE into the data output step, and has to add separate custom steps for ALTER TABLE. That's burdonsome and forgettable, I think.

Regards
Takayuki Tsunakawa

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Alvaro Herrera 2021-03-24 08:51:14 Re: shared memory stats: high level design decisions: consistency, dropping
Previous Message Justin Pryzby 2021-03-24 08:14:02 Re: PoC/WIP: Extended statistics on expressions