Re: postgres large database backup

From: Michael Loftis <mloftis(at)wgops(dot)com>
To: Mladen Gogala <gogala(dot)mladen(at)gmail(dot)com>
Cc: pgsql-general(at)lists(dot)postgresql(dot)org
Subject: Re: postgres large database backup
Date: 2022-12-01 16:21:08
Message-ID: CAHDg04sGAfTjsDO1Gt6T4Eq5K5X_Emtma+4iZUBMCBF5bWQJWA@mail.gmail.com
Lists: pgsql-general

On Thu, Dec 1, 2022 at 06:40 Mladen Gogala <gogala(dot)mladen(at)gmail(dot)com> wrote:

> On 11/30/22 20:41, Michael Loftis wrote:
>
>
> ZFS snapshots don’t typically have much if any performance impact versus
> not having a snapshot (and already being on ZFS) because it’s already doing
> COW style semantics.
>
> Hi Michael,
>
> I am not sure that such a statement holds water. When a snapshot is taken,
> the number of necessary I/O requests goes up dramatically. For every block
> that the snapshot points to, it is necessary to read the block, write it to
> the spare location and then overwrite it, if you want to write to a block
> pointed to by the snapshot. That gives 3 I/O requests for every block
> written. NetApp is trying to optimize this by using 64MB blocks, but ZFS on
> Linux cannot do that; they have to use standard CoW because they don't have
> the benefit of their own hardware and OS. And the standard CoW triples the
> number of I/O requests for every write to the blocks pointed to by the
> snapshot, for every snapshot. CoW is a very expensive animal, with horns.
>
>

Nope, ZFS does not behave that way. Yup, AFAIK all other snapshotting
filesystems or volume managers do. One major architectural decision of ZFS
is the atomicity of writes: data at rest stays at rest, so it does NOT
overwrite live data. Snapshots do not change the write path or behavior in
ZFS. In ZFS writes are atomic; you're always writing new data to free
space, and then accounting for where the current record/volume block within a
file or volume actually lives on disk. If a filesystem, volume manager, or
RAID system overwrites live data in place and something breaks in the middle
of that write, the write can't be atomic, and you've now destroyed data (the
RAID write hole is one example of this). That's why adding a snapshot isn't
an additional cost for ZFS. For better or worse you're paying that snapshot
cost already, because it already does not overwrite live data. If there's no
snapshot, then once the write is committed and safe (TXG committed) and the
refcount on the old blocks drops to zero, those old blocks go back to the
free pool to be potentially used again.

There's a bunch of optimization to how that actually happens, but at the end
of the day your writes do not overwrite your data in ZFS: writes of data get
directed at free space, and eventually the on-disk structures get an atomic
update that says the data now lives here. In the time between all of that
happening, the ZIL (which may live on its own special devices called a SLOG
-- this is why you often see the terms ZIL/journal/SLOG/log vdev used
interchangeably) is the durable bit, but it's normally never read; it's only
read back during recovery.

This is also where the ZFS filesystem property of recordsize or volblocksize
(independently configurable on every filesystem/volume within a pool) is
important for performance. If you clobber a whole record, ZFS isn't going to
read anything extra when it gets around to committing; it knows the whole
record changed and can safely write a whole new record (every 5s it goes
about this TXG commit, so two 64k writes are still slower with a 128k
recordsize, but still shouldn't pull in that 128k record). There are other
optimizations there, but at the end of the day, as long as the chosen
recordsize/volblocksize matches up to your writes, and your writes are
aligned to that within your file or volume, you'll not see an extra read of
the data as part of its normal flow of committing data. Snapshots don't
change that.
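
If it helps to picture it, here's a rough toy model in Python of that write
path (purely illustrative -- none of this is ZFS code, and all the names are
made up). The point it's meant to show is that a write always lands in a
free block and repoints the tree, and a snapshot just adds a reference that
keeps the old block from being freed; it doesn't change what the write
itself does:

    # Toy model of a CoW write path -- NOT ZFS code, just an illustration.
    # Every write lands in a freshly allocated block; the old block only
    # returns to the free pool when nothing (live tree or snapshot) still
    # references it.

    class ToyPool:
        def __init__(self, nblocks=16):
            self.free = set(range(nblocks))   # free block numbers
            self.blocks = {}                  # block number -> data
            self.refcount = {}                # block number -> references
            self.file_ptr = None              # "block pointer" for one toy file
            self.snapshots = []               # each snapshot pins a block pointer

        def write_record(self, data):
            new_blk = self.free.pop()         # 1. write new data to free space
            self.blocks[new_blk] = data
            self.refcount[new_blk] = 1
            old_blk = self.file_ptr
            self.file_ptr = new_blk           # 2. atomically repoint the file
            if old_blk is not None:
                self._deref(old_blk)          # 3. maybe free the old block

        def snapshot(self):
            # A snapshot just takes another reference to the current block;
            # it copies no data and does not change future writes at all.
            if self.file_ptr is not None:
                self.refcount[self.file_ptr] += 1
            self.snapshots.append(self.file_ptr)

        def _deref(self, blk):
            self.refcount[blk] -= 1
            if self.refcount[blk] == 0:       # no live or snapshot refs left
                del self.refcount[blk], self.blocks[blk]
                self.free.add(blk)            # block goes back to the free pool

    pool = ToyPool()
    pool.write_record("v1")   # first version of the record
    pool.snapshot()           # snapshot pins the "v1" block
    pool.write_record("v2")   # same write path as before: new block, repoint
    # the "v1" block is still allocated (pinned by the snapshot); "v2" is live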

Because of those architectural decisions, CoW behavior is part of ZFS'
existing performance penalty. So when you look at that older Oracle ASM vs
ZFS article, remember that that extra... what was it, 0.5ms?... is already
accounting for most, probably all, of the penalty of a snapshot too, if you
want (or need) one. It's fundamental to how ZFS works and provides data
durability+atomicity. This is why ZFS calls its snapshots essentially free:
you're already paying the performance cost for them. What would ASM do if
it had a snapshot to manage? Or a few dozen on the same data? Obviously
during the first writes to those snapshotted areas you'd see it. Ongoing
performance penalties with those snapshots? Maybe ASM has an optimization
that saves that benchmark a bunch of time when there is no snapshot, but
once one exists it takes a different write path and adds a performance
penalty. What if a snapshot were taken in the middle of the benchmark?
Yeah, there will be some extra IOPS when you take the snapshot to say "a
snapshot now exists" in ZFS, but that doesn't dramatically change its
underlying write path after that point.

That atomicity and data durability also mean that even if you lose the SLOG
devices (which hold the ZIL/journal; if you don't have a SLOG/log vdev, the
ZIL lives in-pool) you do not lose all the data -- only whatever somehow
remained uncommitted after the ZIL failed. Say you had some sort of hard
fault/crash and the SLOG/ZIL devices were destroyed: you can still opt to
mount the ZFS pool, and its filesystems/volumes, without that ZIL. That
could (well, would) still suck, but it would be better than just losing
everything. If the ZIL fails while the system is live, ZFS is going to do
its best to get everything committed ASAP as soon as it knows something is
wrong, and keep it that way. So on a SLOG/ZIL failure your performance WILL
suffer (and boy is it UGLY, but at least it's not dead and destroyed). And
because of the atomicity property, even if there are further failures during
that window of time where it scrambles to commit, ZFS does not wreck the
filesystem. If the devices are still available it'll still give you back
whatever data it can.

So there's a very different approach to what's important with ZFS: it's not
that performance isn't important, it's that your data is more important.
Performance is NOT ignored, but to get that atomicity and durability you ARE
paying some performance cost. Is that worth it for YOUR database or files?
Only you as an admin can decide that. No, ZFS is NOT a great choice for
every database or dataset! For some workloads that penalty is not going to
be acceptable.

So writes in ZFS always go to the journal (ZIL) first, barring config
property tweaks. Once a journal entry is durable, the data is considered
written, but uncommitted. If we crash at that point, journal recovery
brings the pool back to that last written journal entry, so ZFS is never
lying to the application or OS.
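
Conceptually (and only conceptually -- real ZIL records and TXG commits are
far more involved than this), that journal-then-commit behavior looks
something like this sketch:

    # Rough sketch of journal-then-commit semantics (illustrative only; the
    # real ZIL and TXG machinery are far more involved than a Python dict).

    journal = []          # stands in for the ZIL / log vdev
    committed_state = {}  # stands in for the on-disk block tree (last TXG)

    def sync_write(key, value):
        # 1. Make the intent durable in the log; the caller can be told
        #    "written" now, even though the main pool hasn't been updated,
        #    because the data is recoverable from the log.
        journal.append((key, value))

    def txg_commit():
        # 2. Periodically (the ~5s TXG commit mentioned above) the batched
        #    writes go out to free space and the journal can be discarded.
        for key, value in journal:
            committed_state[key] = value
        journal.clear()

    def recover_after_crash():
        # The journal is only ever read here: replay anything that was
        # acknowledged but not yet part of a committed TXG.
        txg_commit()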

The data is always written back in an atomic manner -- new or changed data
goes to a free block, never to the existing block -- so if it bombs in the
middle of THAT, you're fine. When the whole (recordsize or volblocksize)
block of sectors is written at once, or within a window of time, there's no
read of the original data once ZFS finishes coalescing and fully commits the
write. There are of course always the free-space maps and reference counts,
but if they haven't changed they aren't rewritten, and they're needed anyway
to find where the data lives, so they're already present. So yeah, on a
first write, best case, there's an extra write in the area where that
accounting is kept (which may itself be coalesced with other writes headed
that way), but it's not like every write incurs that extra overhead to
maintain the old block, especially if your write replaces that whole
recordsize-sized block. After that, until a snapshot is added or removed,
there are no more changes to the references to that old block. When there
are no more references to a given block it goes back into the free pool.

Having a snapshot doesn’t add more work. It’s a byproduct of the atomic
write behavior that ZFS implements (always write to free blocks, always
write a sector). You're just asking ZFS to not free those blocks.
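
To put the I/O-count argument from earlier in the thread side by side,
here's a deliberately over-simplified sketch (toy accounting, not a
benchmark, and ignoring metadata writes on both sides) of the
copy-on-first-write scheme described above versus the ZFS-style path:

    # Deliberately over-simplified comparison (toy accounting, not a benchmark).

    def overwrite_in_place_write(snapshot_exists):
        # "Copy-on-first-write" style: before overwriting a block that a
        # snapshot still points at, read it and copy it to a spare area.
        ios = 1                      # the overwrite of the live block itself
        if snapshot_exists:
            ios += 2                 # read old block + write the copy
        return ios

    def zfs_style_write(snapshot_exists):
        # CoW style: new data always goes to a free block and the block
        # pointer is updated; a snapshot only decides whether the old block
        # is freed afterwards, not how many I/Os the write itself needs.
        return 1

    for snap in (False, True):
        print("snapshot exists:", snap,
              "| overwrite-in-place I/Os:", overwrite_in_place_write(snap),
              "| CoW I/Os:", zfs_style_write(snap))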

Regards
>
> --
> Mladen Gogala
> Database Consultant
> Tel: (347) 321-1217
> https://dbwhisperer.wordpress.com
>
>
