Are ZFS snapshots unsafe when PGSQL is spreading through multiple zpools?

From: HECTOR INGERTO <HECTOR_25E(at)hotmail(dot)com>
To: " pgsql-general(at)postgresql(dot)org" <pgsql-general(at)postgresql(dot)org>
Subject: Are ZFS snapshots unsafe when PGSQL is spreading through multiple zpools?
Date: 2023-01-15 15:47:45
Message-ID: GV1P189MB203648062711BB810D564B05F5C09@GV1P189MB2036.EURP189.PROD.OUTLOOK.COM
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

Hello everybody,

I’m using PostgreSQL on openZFS. I use ZFS snapshots as a backup + hotspare method.

From man zfs-snapshot: “Snapshots are taken atomically, so that all snapshots correspond to the same moment in time.” So if a PSQL instance is started from a zfs snapshot, it will start to replay the WAL from the last checkpoint, in the same way it would do in a crash or power loss scenario. So from my knowledge, ZFS snapshots can be used to rollback to a previous point in time. Also, sending those snapshots to other computers will allow you to have hotspares and remote backups. If I’m wrong here, I would appreciate being told about it because I’m basing the whole question on this premise.

On the other hand, we have the tablespace PGSQL feature, which is great because it allows “unimportant” big data to be written into cheap HDD and frequently used data into fast NVMe.

So far, so good. The problem is when both ideas are merged. Then, snapshots from different pools are NOT atomical, snapshot on the HDD pool isn’t going to be done at the same exact time as the one on the SSD pool, and I don’t know enough about PGSQL internals to know how dangerous this is. So here is where I would like to ask for your help with the following questions:

First of all, what kind of problem can this lead to? Are we talking about potential whole DB corruption or only the loss of a few of the latest transactions?

In second place, if I’m initializing a corrupted PGSQL instance because ZFS snapshots are from different pools and slightly different times, am I going to notice it somehow or is it going to fail silently?

In third and last place, is there some way to quantify the amount of risk taken when snapshotting a PGSQL instance spread across two (or more) different pools?

Thanks for your time,

Héctor

Responses

Browse pgsql-general by date

  From Date Subject
Next Message pf 2023-01-15 19:27:32 pg_upgrade 13.6 to 15.1?
Previous Message Ajin Cherian 2023-01-15 04:09:25 Re: Support logical replication of DDLs