Re: Compression of full-page-writes

From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: Dimitri Fontaine <dimitri(at)2ndquadrant(dot)fr>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Compression of full-page-writes
Date: 2013-10-11 03:30:41
Message-ID: CAHGQGwGBsYFHptNBiWOLcUfcQ4k8RZphSqMHv1vW1xQMUycJPQ@mail.gmail.com
Lists: pgsql-hackers

On Fri, Oct 11, 2013 at 3:44 AM, Fujii Masao <masao(dot)fujii(at)gmail(dot)com> wrote:
> On Fri, Oct 11, 2013 at 1:20 AM, Dimitri Fontaine
> <dimitri(at)2ndquadrant(dot)fr> wrote:
>> Hi,
>>
>> I did a partial review of this patch, wherein I focused on the patch and
>> the code itself, as I saw other contributors already did some testing on
>> it, so that we know it applies cleanly and works to some good extent.
>
> Thanks a lot!
>
>> In full_page_writes_str() why are you returning "unrecognized" rather
>> than doing an ELOG(ERROR, …) for this unexpected situation?
>
> It's because similar functions such as 'wal_level_str' and 'dbState' also
> return 'unrecognized' in that unexpected situation. I just implemented
> full_page_writes_str() in the same manner.
>
> If we did an elog(ERROR) in that case, pg_xlogdump would fail to dump
> a 'broken' WAL file (i.e., one with an unrecognized fpw value). I think that
> some users will want to use pg_xlogdump to investigate such a broken WAL
> file, so doing an elog(ERROR) does not seem good to me.
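
To make this concrete, the shape I have in mind is roughly the following
(a simplified sketch, not the exact code in the patch; the strings it prints
may differ, see the true/false discussion further down):

static const char *
full_page_writes_str(FullPageWritesLevel fpw)
{
    switch (fpw)
    {
        case FULL_PAGE_WRITES_ON:
            return "on";
        case FULL_PAGE_WRITES_COMPRESS:
            return "compress";
        case FULL_PAGE_WRITES_OFF:
            return "off";
    }

    /*
     * Deliberately no elog(ERROR) here, so that pg_xlogdump can still
     * dump a WAL file containing an unrecognized fpw value.
     */
    return "unrecognized";
}
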
>
>> The code switches to compression (or trying to) when the following
>> condition is met:
>>
>> + if (fpw <= FULL_PAGE_WRITES_COMPRESS)
>> + {
>> + rdt->data = CompressBackupBlock(page, BLCKSZ - bkpb->hole_length, &(rdt->len));
>>
>> We have
>>
>> + typedef enum FullPageWritesLevel
>> + {
>> + FULL_PAGE_WRITES_OFF = 0,
>> + FULL_PAGE_WRITES_COMPRESS,
>> + FULL_PAGE_WRITES_ON
>> + } FullPageWritesLevel;
>>
>> + #define FullPageWritesIsNeeded(fpw) (fpw >= FULL_PAGE_WRITES_COMPRESS)
>>
>> I don't much like using the <= test against an enum and I'm not sure I
>> understand the intention you have here. It somehow looks like a typo and
>> disagrees with the macro.
>
> I thought that FPW should be compressed only when full_page_writes is
> set to 'compress' or 'off'. That is, 'off' implies compression. When it's set
> to 'off', FPW is basically not generated, so there is usually no need to call
> CompressBackupBlock() in that case. However, during an online base backup,
> FPW is forcibly generated even when it's set to 'off'. So I used the check
> "fpw <= FULL_PAGE_WRITES_COMPRESS" there.
>
>> What about using the FullPageWritesIsNeeded
>> macro, and maybe rewriting the macro as
>>
>> #define FullPageWritesIsNeeded(fpw) \
>> (fpw == FULL_PAGE_WRITES_COMPRESS || fpw == FULL_PAGE_WRITES_ON)
>
> I'm OK to change the macro so that the <= test is not used.
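
Concretely, that would give two different explicit tests, something like the
following sketch (illustrative only, not the final patch):

/* Is a full-page image needed at all?  ('compress' or 'on') */
#define FullPageWritesIsNeeded(fpw) \
    ((fpw) == FULL_PAGE_WRITES_COMPRESS || (fpw) == FULL_PAGE_WRITES_ON)

/*
 * Should the full-page image be compressed?  'off' is included because
 * an FPW forcibly generated during an online base backup should also
 * be compressed.
 */
if (fpw == FULL_PAGE_WRITES_COMPRESS || fpw == FULL_PAGE_WRITES_OFF)
    rdt->data = CompressBackupBlock(page, BLCKSZ - bkpb->hole_length,
                                    &(rdt->len));
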
>
>> Also, having "on" imply "compress" is a little funny to me. Maybe we
>> should just finish our testing and be happy to always compress the full
>> page writes. What would the downside be exactly (on a busy IO system,
>> writing less data even if it needs more CPU will be the right trade-off)?
>
> "on" doesn't imply "compress". When full_page_writes is set to "on",
> FPW is not compressed at all.
>
>> I like that you're checking the savings of the compressed data with
>> respect to the uncompressed data and cancelling the compression if there's
>> no gain. I wonder if your test accounts for enough padding and headers
>> though given the results we saw in other tests made in this thread.
>
> I'm afraid that the patch has only a limited effect on WAL reduction and
> performance improvement unless the database contains highly-compressible
> data, such as columns of long blank strings. It really depends on the contents
> of the database. So, obviously, FPW compression should not be the default.
> Maybe we can treat it as just a tuning knob.
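
For the record, the cancellation mentioned above is just a size comparison
after compressing. A rough sketch of the idea, with purely illustrative names
(compress_page() here stands in for the real pglz-based code inside
CompressBackupBlock()):

/*
 * Sketch only: keep the compressed image only when it is actually
 * smaller than the original page (minus the hole).
 */
static char *
compress_or_keep(char *page, uint32 orig_len, uint32 *result_len)
{
    char   *buf = palloc(orig_len);
    int32   clen = compress_page(page, orig_len, buf);    /* hypothetical */

    if (clen > 0 && (uint32) clen < orig_len)
    {
        /* compression saved space: emit the compressed image */
        *result_len = (uint32) clen;
        return buf;
    }

    /* no gain (or compression failed): cancel and keep the original */
    pfree(buf);
    *result_len = orig_len;
    return page;
}
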
>
>> Why do we have both the static function full_page_writes_str() and the
>> macro FullPageWritesStr, with two different implementations issuing
>> either "true" and "false" or "on" and "off"?
>
> At first I was thinking of using "on" and "off" because they are often used
> as the setting values of boolean GUCs. But unfortunately the existing
> pg_xlogdump uses "true" and "false" to show the value of full_page_writes
> in WAL. To avoid breaking backward compatibility, I implemented
> the "true/false" version of the function. I'm really not sure how many people
> want that kind of compatibility in pg_xlogdump, though.
>
>> ! unsigned hole_offset:15, /* number of bytes before "hole" */
>> ! flags:2, /* state of a backup block, see below */
>> ! hole_length:15; /* number of bytes in "hole" */
>>
>> I don't understand that. I wanted to use that patch as leverage to
>> smoothly discover the internals of our WAL system but won't have the
>> time to do that here.
>
> We need a flag indicating whether each FPW is compressed or not.
> If no such flag exists in WAL, the standby cannot determine whether
> it should decompress each FPW, and so cannot properly replay
> WAL containing FPW. That is, I just used spare space in
> the FPW header to hold such a flag.
>
>> That said, I don't even know that C syntax.
>
> The struct 'ItemIdData' uses the same C syntax.
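
For reference, that is C bit-field syntax: the three members share one
32-bit word, using 15 + 2 + 15 bits. ItemIdData in
src/include/storage/itemid.h looks like this:

typedef struct ItemIdData
{
    unsigned    lp_off:15,      /* offset to tuple (from start of page) */
                lp_flags:2,     /* state of item pointer, see below */
                lp_len:15;      /* byte length of tuple */
} ItemIdData;

The patch uses the same trick to fit the 2-bit flags field (which records
whether the block is compressed) into the backup-block header without
making the header any larger.
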
>
>> + #define BKPBLOCK_UNCOMPRESSED 0 /* uncompressed */
>> + #define BKPBLOCK_COMPRESSED 1 /* comperssed */
>>
>> There's a typo in the comment above.
>
> Yep.
>
>>> [time required to replay WAL generated during running pgbench]
>>> 61s (on) .... 1209911 transactions were replayed,
>>> recovery speed: 19834.6 transactions/sec
>>> 39s (compress) .... 1445446 transactions were replayed,
>>> recovery speed: 37062.7 transactions/sec
>>> 37s (off) .... 1629235 transactions were replayed,
>>> recovery speed: 44033.3 transactions/sec
>>
>> How did you get those numbers? pg_basebackup before the test and
>> archiving, then a PITR maybe? Is it possible to do the same test with
>> the same number of transactions to replay, I guess using the -t
>> parameter rather than the -T one for this testing.
>
> Sure. To be honest, when I received the same request from Andres,
> I ran that benchmark. But unfortunately, because of machine trouble,
> I could not report it yet. I will do that again.

Here is the benchmark result:

* Result
[tps]
1317.306391 (full_page_writes = on)
1628.407752 (compress)

[the amount of WAL generated during running pgbench]
1319 MB (on)
326 MB (compress)

[time required to replay WAL generated during running pgbench]
19s (on)
2013-10-11 12:05:09 JST LOG: redo starts at F/F1000028
2013-10-11 12:05:28 JST LOG: redo done at 10/446B7BF0

12s (on)
2013-10-11 12:06:22 JST LOG: redo starts at F/F1000028
2013-10-11 12:06:34 JST LOG: redo done at 10/446B7BF0

12s (on)
2013-10-11 12:07:19 JST LOG: redo starts at F/F1000028
2013-10-11 12:07:31 JST LOG: redo done at 10/446B7BF0

8s (compress)
2013-10-11 12:17:36 JST LOG: redo starts at 10/50000028
2013-10-11 12:17:44 JST LOG: redo done at 10/655AE478

8s (compress)
2013-10-11 12:18:26 JST LOG: redo starts at 10/50000028
2013-10-11 12:18:34 JST LOG: redo done at 10/655AE478

8s (compress)
2013-10-11 12:19:07 JST LOG: redo starts at 10/50000028
2013-10-11 12:19:15 JST LOG: redo done at 10/655AE478

[benchmark]
transaction type: TPC-B (sort of)
scaling factor: 100
query mode: prepared
number of clients: 32
number of threads: 4
number of transactions per client: 10000
number of transactions actually processed: 320000/320000

Regards,

--
Fujii Masao
