AdvanceXLInsertBuffer vs. WAL segment compressibility

From: Chapman Flack <chap(at)anastigmatix(dot)net>
To: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: AdvanceXLInsertBuffer vs. WAL segment compressibility
Date: 2016-07-22 22:02:32
Message-ID: 579297F8.7020107@anastigmatix.net

Teaser: a change made in 9.4 to simplify WAL segment compression
made it easier to compress a low-activity-period WAL segment
from 16 MB to about 27 kB ... but much harder to do better than
that, as I had previously been doing (about two orders of magnitude
better).

At $work, we have a usually-low-activity PG database, so the
used portion of each 16 MB WAL segment is almost always far
smaller than 16 MB, and it is a big win for archived-WAL storage
space if an archive-command can be written that compresses those
files effectively.

Our database has also been running on a pre-9.4 version, and I'm
currently migrating to 9.5.3. As I understand it, 9.4 was where
commit 9a20a9b landed, which changed what happens in the unwritten
'tail' of log segments.

In my understanding, before 9.4 the 'tail' of a log segment
on disk just wasn't written, so (as segment recycling simply
involves renaming a file that held some earlier segment) the
rest of the file contained whatever had been there before
recycling. That was never a problem for recovery (which could
tell when it reached the end of real data), but it was not well
compressible with a generic tool like gzip. Specialized tools
like pg_clearxlogtail existed, but they had to know too much
about the internal format, and ended up unmaintained and
therefore difficult to trust.

The change in 9.4 included this, from the commit message:

This has one user-visible change: switching to a new WAL segment
with pg_switch_xlog() now fills the remaining unused portion of
the segment with zeros.

... thus making the segments easily compressible with bog standard
tools. So I can just point gzip at one of our WAL segments from a
light-activity period and it goes from 16 MB down to about 27 kB.
Nice, right?
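
Just to make that concrete, a compressing archive command can be a
wrapper as simple as the sketch below. (The script name and archive
directory are invented for illustration; it isn't exactly what we run.)

#!/usr/bin/env python3
# Sketch of a compressing archive command.  PostgreSQL would invoke it
# roughly as:  archive_command = 'archive_wal.py %p %f'
# where %p is the path of the completed segment and %f its file name.
# The archive directory below is made up for illustration.
import gzip, shutil, sys

ARCHIVE_DIR = "/var/lib/pgarchive"   # assumed location

def main(segment_path, segment_name):
    dest = "%s/%s.gz" % (ARCHIVE_DIR, segment_name)
    # Stream the 16 MB segment through gzip; with the zero-filled tail,
    # a low-activity segment comes out at a few tens of kB.
    with open(segment_path, "rb") as src, gzip.open(dest, "wb") as out:
        shutil.copyfileobj(src, out)
    return 0   # a non-zero exit status tells PostgreSQL archiving failed

if __name__ == "__main__":
    sys.exit(main(sys.argv[1], sys.argv[2]))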

But why does it break my earlier approach, which was doing about
two orders of magnitude better, getting low-activity WAL segments
down to 200 to 300 *bytes*? (Seriously: my last solid year of
archived WAL is contained in a 613 MB zip file.)

That approach was based on using rsync (also bog standard) to
tease apart the changed and unchanged bits of the newly-archived
segment versus the last-seen content of the file with the same
i-number. You would expect that to work at least as well now that
the tail is always zeros as it did before, right?
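
The general shape of it is as in the sketch below; the rsync flags,
paths, and per-inode bookkeeping are simplified for illustration and
are not the exact script I run.

#!/usr/bin/env python3
# Illustration only: keep a "basis" copy keyed by the segment file's
# inode (recycled segments keep their inode), and have rsync emit only
# the delta between the newly archived segment and that basis, using
# rsync's batch mode.
import os, shutil, subprocess, sys

BASIS_DIR = "/var/lib/pgarchive/basis"    # assumed locations
DELTA_DIR = "/var/lib/pgarchive/deltas"

def archive(segment_path, segment_name):
    inode = os.stat(segment_path).st_ino
    basis = "%s/%d" % (BASIS_DIR, inode)  # last-seen content, this inode
    if os.path.exists(basis):
        delta = "%s/%s.batch" % (DELTA_DIR, segment_name)
        # Write only the changes needed to turn `basis` into the new
        # segment; for a mostly-unchanged 16 MB file this is tiny.
        subprocess.check_call(
            ["rsync", "--only-write-batch=" + delta, segment_path, basis])
    else:
        pass  # first segment on this inode: archive it whole (not shown)
    # The new content becomes the basis for the next recycling cycle.
    shutil.copy2(segment_path, basis)
    return 0

if __name__ == "__main__":
    sys.exit(archive(sys.argv[1], sys.argv[2]))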

And what's breaking it now is the tiny bit of fine
print that's in the code comment for AdvanceXLInsertBuffer but
not in the commit message above:

* ... Any new pages are initialized to zeros, with pages headers
* initialized properly.

That innocuous "headers initialized" means that the tail of the
file is *almost* all zeros, but every 8 kB there is a tiny header,
and in each tiny header, there is *one byte* that differs from
its value in the pre-recycle content at the same i-node, because
that one byte in each header reflects the WAL segment number.
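
That byte can be seen directly with a quick scan of the page headers,
something like the sketch below (it assumes the default 8 kB XLOG page
size and my reading of XLogPageHeaderData, with the 8-byte
xlp_pageaddr at offset 8 on a little-endian build; treat the offsets
as illustrative).

import struct, sys

XLOG_BLCKSZ = 8192   # default XLOG page size

def page_addrs(path):
    """Yield xlp_pageaddr from the header of every page in a segment."""
    with open(path, "rb") as f:
        while True:
            page = f.read(XLOG_BLCKSZ)
            if len(page) < XLOG_BLCKSZ:
                break
            (addr,) = struct.unpack_from("<Q", page, 8)
            yield addr

# Comparing a recycled segment with its pre-recycle content: the page
# addresses have advanced by some multiple of the 16 MB segment size,
# so (typically) one byte of each little-endian xlp_pageaddr differs in
# every 8 kB page, even where the rest of the page is all zeros.
if __name__ == "__main__":
    for old, new in zip(page_addrs(sys.argv[1]), page_addrs(sys.argv[2])):
        print(hex(old), hex(new))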

Before the 9.4 change, I see there were still headers there,
and they did contain a byte reflecting the segment number, but in
the unwritten portion it of course matched the pre-recycle
segment number, so rsync easily detected the whole tail of the
file as unchanged. Now there is one changed byte every 8 kB,
and the rsync output, instead of being 100x smaller than vanilla
gzip's, is about 3x larger.

Taking a step back, isn't overwriting the whole unused tail of
each 16 MB segment really just an I/O-intensive way of communicating
to the archive-command where the valid data ends? Could that not
be done more efficiently by adding another code, say %e, in
archive-command, that would be substituted by the offset of the
end of the XLOG_SWITCH record? That way, however archive-command
is implemented, it could simply know how much of the file to
copy.
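
For illustration only (no such %e exists today, and the script and
paths are made up), an archive command handed that offset could be as
trivial as:

#!/usr/bin/env python3
# Suppose archive_command = 'archive_prefix.py %p %f %e', with %e
# substituted by the offset of the end of the XLOG_SWITCH record.
import gzip, sys

ARCHIVE_DIR = "/var/lib/pgarchive"   # assumed location

def main(segment_path, segment_name, end_offset):
    end = int(end_offset)            # bytes of the segment actually used
    dest = "%s/%s.gz" % (ARCHIVE_DIR, segment_name)
    with open(segment_path, "rb") as src, gzip.open(dest, "wb") as out:
        out.write(src.read(end))     # copy only the used portion
    return 0

if __name__ == "__main__":
    sys.exit(main(sys.argv[1], sys.argv[2], sys.argv[3]))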

Would it then be possible to go back to the old behavior (or make
it selectable) of not overwriting the full 16 MB every time?
Or did the 9.4 changes also change enough other logic that stuff
would now break if that isn't done?

-Chap
