Journal based VACUUM FULL

From: Ryan David Sheasby <ryan27968(at)gmail(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Journal based VACUUM FULL
Date: 2019-02-20 23:16:46
Message-ID: CANTTaev-LdgYj4uZoy67catS5SF5u_X-dTHiLH7OKwU6Gv3MFA@mail.gmail.com
Lists: pgsql-hackers

Hi Team.

New to contributing so hopefully this is the right place. I've searched the
forum and it seems this is the place for feature requests/suggestions.

I was reading on VACUUM and VACUUM FULL and saw that the current
implementation of VACUUM FULL writes to an entirely new file and then
switches to it, as opposed to rewriting the current file in place. I assume
the reason for this is safety: if the database shuts down in the middle of
the vacuum, the most you lose is the progress of the vacuum and you have to
start from scratch, but otherwise the database retains its integrity and is
not corrupted. This seems effective but somewhat rudimentary, and could be
improved. Most notably, rewriting almost the entire database takes up
roughly double the storage for the duration of the rewrite, which can become
expensive or simply inconvenient in IaaS (and probably other) installations
where drive sizes are scaled on demand. Even if VACUUM FULL doesn't need to
run often, having to reallocate drive space for an effective duplication is
not ideal. My suggestion is a journal-based in-place rewrite of the database
files.

This means that the VACUUM FULL will do a "pre-processing" pass over the
database and figure out, at a fairly low level, what operations need to be
done to compact the database back to its "correct" size. These operations
will be written in their entirety to a journal that records everything about
to be performed, with some mechanism for checking whether each operation has
already been performed, using the same principle described here:
https://en.wikipedia.org/wiki/Journaling_file_system. This allows an
in-place rewrite of the file in a safe manner, such that you can recover
from an unexpected shutdown by resuming the VACUUM FULL from the journal, or
by detecting where the copy hole is in the file and recording/ignoring it.
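
To make the pre-processing pass concrete, here is a rough Python sketch. It
is purely illustrative and not tied to any existing PostgreSQL code; all the
names (CopyOp, plan_compaction, live_extents) are invented for the example.
Given the byte ranges that are still live, it builds the list of copy
operations that would become the journal:

    from dataclasses import dataclass

    @dataclass
    class CopyOp:
        src: int            # current offset of the live bytes
        dst: int            # offset they should end up at (always dst <= src)
        length: int         # number of bytes to move
        done: bool = False  # completion flag, persisted in the journal

    def plan_compaction(live_extents):
        """Given (offset, length) pairs of live data, sorted by offset,
        produce the copy operations that pack them contiguously at the
        start of the file.  This list is the journal, before it is
        written out and fsync'd."""
        ops = []
        write_pos = 0
        for offset, length in live_extents:
            if offset != write_pos:
                ops.append(CopyOp(src=offset, dst=write_pos, length=length))
            write_pos += length
        return ops, write_pos   # write_pos is the final, truncated file size

    # Example: live data at [0,100), [150,250) and [400,500)
    # -> copy [150,250) to offset 100, copy [400,500) to offset 200,
    #    then truncate the file to 300 bytes.
    ops, new_size = plan_compaction([(0, 100), (150, 100), (400, 100)])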

The journal could be something as simple as a record of which byte ranges
need to be copied to which other locations. The journal should record
whenever a byte-range copy completes, for the sake of error recovery.
Obviously, each byte range will have a maximum size equal to the distance
from the source to the destination, so that the destination never overwrites
the source, which would make recovery impossible (how can you know where in
the copy you stopped?). However, this has a snowball effect: the further you
are into the rewrite, the further apart the source and destination ranges
will be, so you can copy bigger chunks at a time and won't have to update
the journal's completion flags as often. In the case of a shutdown during a
copy, you merely read the journal, look for the first copy that isn't marked
complete, and continue rewriting from there. Even if some of the bytes have
already been copied, there is no corruption, because the source bytes
haven't been overwritten at all. Finally, a simple file truncate can take
place once all the copies are complete, and the journal can be deleted. This
means the headroom required in the filesystem would be much smaller, and
would pay for itself on any copy of at least 17 bytes (assuming two 8-byte
pointers plus a bit as a completion flag). The only situation in which this
system would consume more space than a total copy is if the database has
more than 50% garbage, the garbage is perfectly spread out (i.e. isn't in
large chunks that can be copied, and therefore recorded in the journal, at
once), and each piece of garbage is smaller than 17 bytes. Obviously, the
journal itself would need an error-checking mechanism to ensure it was
correctly and completely written, but this can be as simple as a hash of the
whole journal file appended at its end.
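
Continuing the same sketch (again only illustrative; the chunk splitting,
the one-byte-per-entry flag layout, and the helper names are assumptions,
not a worked-out design), the copy loop, the completion flags, and crash
recovery could look roughly like this:

    import os

    def split_for_safety(op):
        """Split one planned copy into pieces no longer than the gap between
        source and destination, so a piece's destination never overwrites
        its own source.  Later entries have bigger gaps, hence bigger
        pieces and fewer completion-flag updates (the snowball effect)."""
        gap = op.src - op.dst
        pieces, moved = [], 0
        while moved < op.length:
            step = min(gap, op.length - moved)
            pieces.append(CopyOp(op.src + moved, op.dst + moved, step))
            moved += step
        return pieces

    def mark_done(journal_path, index):
        """Persist the completion flag for entry 'index' (one byte each)."""
        with open(journal_path, "r+b") as j:
            j.seek(index)
            j.write(b"\x01")
            j.flush()
            os.fsync(j.fileno())

    def replay(data_path, journal_path, entries, new_size):
        """Apply the journal in order.  After a crash, running this again is
        the whole recovery story: entries already flagged are skipped, and
        re-doing a half-finished entry is harmless because its source bytes
        are still intact."""
        fd = os.open(data_path, os.O_RDWR)
        try:
            for i, op in enumerate(entries):
                if op.done:
                    continue
                data = os.pread(fd, op.length, op.src)
                os.pwrite(fd, data, op.dst)
                os.fsync(fd)                  # make the copy durable first...
                mark_done(journal_path, i)    # ...then flip the flag
            os.ftruncate(fd, new_size)        # reclaim space only at the end
        finally:
            os.close(fd)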

An alternative to the completion flags is to compute a hash of the data to
be copied and store it in the journal; during recovery you then compare the
destination with the hash. This has the advantage of not needing to write to
the journal to keep it up to date during the operations, but the
disadvantages of having to compute many hashes while recovering, and of
storing the hashes in the journal, which takes up more space. It's also
arguably less safe, as there is always the chance (albeit extremely
unlikely) of a collision, which would mean the data is not actually
validated. I would argue the risk of this is lower than the risk of bit rot
flipping the completion bit, however.
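
As a variation on the same sketch, the hash-based alternative might look
like the following (the choice of SHA-256 and the helper names are again
just assumptions for illustration):

    import hashlib
    import os

    def build_digests(fd, entries):
        """At journal-creation time, hash the source bytes of every entry
        and store the digests alongside the entries.  No journal writes
        are needed while the copies run."""
        return [hashlib.sha256(os.pread(fd, op.length, op.src)).digest()
                for op in entries]

    def first_incomplete(fd, entries, digests):
        """On recovery, find where to resume by checking which destinations
        already hold the expected bytes.  Costs one hash per entry instead
        of a persisted completion flag, and a (vanishingly unlikely)
        collision would let a bad range pass as valid."""
        for i, (op, digest) in enumerate(zip(entries, digests)):
            dst = os.pread(fd, op.length, op.dst)
            if hashlib.sha256(dst).digest() != digest:
                return i              # resume copying from this entry
        return len(entries)           # all copies done; only truncate remains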

A journaling system like this *might* have performance benefits too,
specifically when running on less intelligent file systems like NTFS, which
can become easily fragmented (causing potentially big performance issues on
spinning disks). Rewriting the same file in place will never require a
file-system defragmentation. The other advantage, as mentioned before, is in
the case of auto-scaling drives used as storage for the DB (as they often
are in IaaS/PaaS services). Not having to scale up rapidly could be a
performance boost in some cases.

Finally, a journaling system like this also lends itself to
stopping/resuming in the middle of the VACUUM FULL. Once the journal is
created and the rewrites have started, assuming the journal's "completion"
flags are kept up to date, you can stop the operation in the middle
(presumably filling the current "gap" with null bytes or otherwise
indicating to the DB that there is a gap in the middle that should be
ignored) and continue using the database as usual. This means you can do a
"soft" recovery wherein the database is only halfway vacuumed but is still
perfectly operational and functional. You can also resume from this soft
recovery by simply continuing from the last copy that was completed.
Obviously you only regain disk space when you reach the end and truncate the
file, but you are at least able to pause/resume the operation, waiting only
for the current copy block to finish instead of for the entire VACUUM FULL
to finish.
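
Pause/resume then falls out of the same loop; a hedged sketch of the only
change needed (a caller-supplied stop check between entries) might be:

    def replay_pausable(data_path, journal_path, entries, stop_requested):
        """Same loop as replay(), but a stop_requested() callback is checked
        between entries.  Stopping between entries is always safe:
        everything before the returned index is compacted, everything after
        is untouched, and the run can be resumed later by calling this
        again.  (Marking the in-file gap so the engine ignores it, and the
        final truncate, are not shown here.)"""
        fd = os.open(data_path, os.O_RDWR)
        try:
            for i, op in enumerate(entries):
                if stop_requested():
                    return i                  # paused before entry i
                if op.done:
                    continue
                data = os.pread(fd, op.length, op.src)
                os.pwrite(fd, data, op.dst)
                os.fsync(fd)
                mark_done(journal_path, i)
            return len(entries)               # finished; caller may truncate
        finally:
            os.close(fd)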

I hope this was a coherent explanation of my suggestion. It's possible and
maybe even likely that there's a glaring misunderstanding or assumption on
my part that means this isn't practical, but I'd still love to get feedback
on it.

Thanks,
Ryan Sheasby
