Re: [REVIEW] Re: Compression of full-page-writes

From: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Rahila Syed <rahilasyed90(at)gmail(dot)com>, Rahila Syed <rahilasyed(dot)90(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2014-11-25 06:33:07
Message-ID: CAB7nPqRhgwBbdbiOa7uA+AZYRg+g1iu6X8WFf9F9itJfXaQQgA@mail.gmail.com
Lists: pgsql-hackers

On Thu, Nov 13, 2014 at 12:15 AM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
>
> On 2014-11-12 10:13:18 -0500, Robert Haas wrote:
> > On Tue, Nov 11, 2014 at 4:27 AM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> > > The more important thing here is that I see little chance of this
> > > getting in before Heikki's larger rework of the wal format gets
> > > in. Since that'll change everything around anyay I'm unsure how much
> > > point there is to iterate till that's done. I know that sucks, but I
> > > don't see much of an alternative.
> >
> > Why not do this first? Heikki's patch seems quite far from being
> > ready to commit at this point - it significantly increases WAL volume
> > and reduces performance. Heikki may well be able to fix that, but I
> > don't know that it's a good idea to make everyone else wait while he
> > does.
>
> Because it imo builds the infrastructure to do the compression more
> sanely. I.e. provide proper space to store information about the
> compressedness of the blocks and such.

Now that the new WAL format has been committed, here are some comments
about this patch and what we can do. First, xlogrecord.h contains a
short description of what a record looks like. The data portion for a
given block ID looks like this:
1) the block image, if BKPBLOCK_HAS_IMAGE, whose size is BLCKSZ minus
the hole length
2) the data related to the block, if BKPBLOCK_HAS_DATA, with a size
determined by what the caller inserts with XLogRegisterBufData for
that block.
The data associated with a block has a length that cannot be
determined before XLogRegisterBufData is used. We could add a 3rd
parameter to XLogEnsureRecordSpace to allocate a buffer wide enough to
hold the data of a single record before compression (BLCKSZ * number
of blocks + total size of block data), but this seems really
error-prone for new features as well as existing ones. So for those
reasons I think that it would be wise not to include the block data in
what is compressed.
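
To make that concern a bit more concrete, here is a purely
hypothetical sketch of what such a 3rd parameter would imply; the
function name and parameter names are mine and not part of any patch
(the existing XLogEnsureRecordSpace only takes the number of block
references and the number of data chunks):

/* Hypothetical sketch only -- not proposed code; names are mine. */
void
XLogEnsureRecordSpaceForCompression(int nbuffers, int ndatas,
                                    Size total_block_data)
{
    /*
     * The compression buffer would need room for
     *     BLCKSZ * nbuffers + total_block_data
     * bytes, but total_block_data is the sum of everything that is
     * only registered later with XLogRegisterBufData(), so every
     * caller would have to compute it by hand and keep it in sync
     * with its later registrations -- which is what makes this
     * error-prone.
     */
}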

This brings me to the second point: we would need to reorder the
entries in the record chain if we were to compress all the blocks of a
record into a single buffer. That has the following advantage:
- More compression, as proved with measurements on this thread
And the following disadvantages:
- Need to change the record chain entries once again in this release
to something like the following for the block data (note that the
current record chain format is quite elegant, btw):
compressed block images
block data of ID = M
block data of ID = N
etc.
- Slightly longer replay time, because we would need to loop twice
through the block data to fill in DecodedBkpBlock: once to decompress
all the blocks, and once for the data of each block (see the rough
sketch after this list). It is not much, as there are not that many
blocks replayed per record, but still.
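
To illustrate that second disadvantage, replaying a record in which
all block images are compressed together would look roughly like this
in the decoder (a sketch of the idea only, not code from the patch;
the DecodedBkpBlock fields are the ones in xlogreader.h, but the
surrounding logic is made up):

/* Pass 1: decompress the single blob holding every block image. */
/* (Allocation and error handling omitted, only the shape matters.) */

/* Pass 2: walk the decompressed blob again and point each */
/* DecodedBkpBlock at its image. */
for (block_id = 0; block_id <= state->max_block_id; block_id++)
{
    DecodedBkpBlock *blk = &state->blocks[block_id];

    if (!blk->in_use || !blk->has_image)
        continue;
    blk->bkp_image = ptr;       /* points into the decompressed blob */
    ptr += BLCKSZ - blk->hole_length;
}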

So, all those things gathered, plus a couple of hours spent hacking on
this code, make me think that it would be more elegant to do the
compression per block and not per group of blocks in a single record.
I actually found a couple of extra things:
- pg_lzcompress and pg_lzdecompress should be in src/port to make
pg_xlogdump work. Note that pg_lzdecompress has one call to elog;
hence it would be better to have it return a boolean state and let the
caller raise an error if decompression failed (sketched below).
- In the previous patch versions, a WAL record was doing unnecessary
processing: it first built uncompressed block image entries, then
compressed them, and finally replaced the existing uncompressed
entries in the record chain with the compressed ones.
- CompressBackupBlocks capped compression at BLCKSZ, which was
incorrect for groups of blocks; it should have been BLCKSZ *
num_blocks.
- It looks better to add a simple uint16 to XLogRecordBlockImageHeader
to store the compressed length of a block; if it is 0, the block is
not compressed. This helps the new decoder facility track the length
of the data received. If a block has a hole, it is compressed without
it. (A rough sketch of the header change follows after this list.)
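
To be a bit more concrete on that last point and on the decompression
change mentioned above, something like the following is what I have in
mind (a sketch only; the field name and its placement are illustrative
and may differ from the attached patch):

/* xlogrecord.h: block image header with an extra compressed length */
typedef struct XLogRecordBlockImageHeader
{
    uint16      hole_offset;    /* number of bytes before "hole" */
    uint16      hole_length;    /* number of bytes in "hole" */
    uint16      compress_len;   /* compressed length of the image,
                                 * or 0 if the image is not compressed */
} XLogRecordBlockImageHeader;

/* pg_lzcompress.h: decompression reports failure to the caller
 * instead of calling elog(), so that frontend code like pg_xlogdump
 * can use it; otherwise the current PGLZ_Header-based prototype is
 * kept as-is */
extern bool pglz_decompress(const PGLZ_Header *source, char *dest);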

Now here are two patches:
- Move pg_lzcompress.c to src/port to make pg_xlogdump work with the
2nd patch. I imagine that this would be useful as well for client
utilities, similarly to what has been done for pg_crc some time ago.
- The patch itself doing the FPW compression. Note that it passes the
regression tests, but at replay there is still one bug, triggered
roughly before numeric.sql when replaying changes on a standby. I am
still looking at it, but it does not prevent basic testing or a
continuation of the discussion.
For now here are the patches either way, so feel free to comment.
Regards,
--
Michael

Attachment Content-Type Size
0001-Fix-flag-marking-GIN-index-as-being-built-for-new-en.patch text/x-patch 1.1 KB
0002-Support-fillfactor-for-GIN-indexes.patch text/x-patch 6.6 KB
