Re: Teach pg_receivewal to use lz4 compression

From: gkokolatos(at)pm(dot)me
To: Michael Paquier <michael(at)paquier(dot)xyz>
Cc: Magnus Hagander <magnus(at)hagander(dot)net>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Teach pg_receivewal to use lz4 compression
Date: 2021-09-16 15:17:15
Message-ID: 6GriwrD-LWnn-8G-AdhMyJQh_Dk9uU22mfJS5C_3knfhbL9I35i9aGnBxiY4tvePz-hCdZ45Le0zUSaslCoPEBf3h8zLoB3nQcns1Z5MHrM=@pm.me
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐

On Wednesday, September 15th, 2021 at 08:46, Michael Paquier michael(at)paquier(dot)xyz wrote:

Hi,

thank you for the review.

> On Fri, Sep 10, 2021 at 08:21:51AM +0000, gkokolatos(at)pm(dot)me wrote:
>
> > Agreed. A default value of 5, which is in the middle point of options, has been
> > defined and used.
> > In addition, the tests have been adjusted to mimic the newly added gzip tests.
>
> Looking at lz4frame.h, there is LZ4F_flush() that allows to compress
> immediately any data buffered in the frame context but not compressed
> yet. It seems to me that dir_sync() should be extended to support
> LZ4.

Agreed. LZ4F_flush() calls have been added where appropriate.

>
> export GZIP_PROGRAM=$(GZIP)
> +export LZ4
> [...]
> +PGAC_PATH_PROGS(LZ4, lz4)
> -
> PGAC_PATH_BISON
>
> The part of the test assigning LZ4 is fine, but I'd rather switch to a
> logic à-la-gzip, where we just save "lz4" in Makefile.global.in,
> saving cycles in ./configure.

Reluctantly agreed.

>
> +static bool
> +is_xlogfilename(const char *filename, bool *ispartial,
> - WalCompressionMethod *wal_compression_method)
>
>
> I like the set of simplifications you have done here to detection if a
> segment is partial and which compression method applies to it.

Thank you very much.

>
> + if (compression_method != COMPRESSION_ZLIB && compresslevel != 0)
> + {
> + pg_log_error("can only use --compress together with "
> + "--compression-method=gzip");
> +#ifndef HAVE_LIBLZ4
> + pg_log_error("this build does not support compression via gzip");
> +#endif
>
> s/HAVE_LIBLZ4/HAVE_LIBZ/.
>

Fixed.

> +$primary->command_fails(
> + [
> + 'pg_receivewal', '-D', $stream_dir, '--compression-method', 'lz4',
> + '--compress', '1'
> + ],
> + 'failure if --compression-method=lz4 specified with --compress');
> This would fail when the code is not built with LZ4 with a non-zero
> error code but with an error that is not what we expect. I think that
> you should use $primary->command_fails_like() instead. That's quite
> new, as of de1d4fe. The matching error pattern will need to change
> depending on if we build the code with LZ4 or not. A simpler method
> is to use --compression-method=none, to bypass the first round of
> errors and make that build-independent, but that feels incomplete if
> you want to tie that to LZ4.

Fixed. Now a regex has been added to address both builds.

>
> + pg_log_warning("compressed segment file \\\\"%s\\\\" has incorrect header size %lu, skipping",
> + dirent->d_name, consumed_len);
> + LZ4F_freeDecompressionContext(ctx);
>
> I agree that skipping all those cases when calculating the streaming
> start point is more consistent.

Thanks.

>
> + if (r < 0)
> + pg_log_error("could not read compressed file \\\\"%s\\\\": %m",
> + fullpath);
> + else
> + pg_log_error("could not read compressed file \\\\"%s\\\\": read %d of %lu",
> + fullpath, r, sizeof(buf));
>
> Let's same in translation effort here by just using "could not read",
> etc. by removing the term "compressed".

The string is also present in the gzip compressed case, i.e. prior to this patch.
The translation effort will not be reduced by changing this string only.

> + pg_log_error("can only use --compress together with "
> + "--compression-method=gzip");
>
> Better to keep these in a single line to ease grepping. We don't care
> if error strings are longer than the 72-80 character limit.

Fixed.

> +/* Size of lz4 input chunk for .lz4 */
> +#define LZ4_IN_SIZE 4096
>
> Why this choice? Does it need to use LZ4_COMPRESSBOUND?

This value is used in order to calculate the bound, before any buffer is
received. Then when we receive buffer, we consume them in LZ4_IN_SIZE chunks.
Note the call to LZ4F_compressBound() in dir_open_for_write().

+ ctx_out = LZ4F_createCompressionContext(&ctx, LZ4F_VERSION);
+ lz4bufsize = LZ4F_compressBound(LZ4_IN_SIZE, &lz4preferences);

> - if (dir_data->compression > 0)
> + if (dir_data->compression_method == COMPRESSION_ZLIB)
> gzclose(gzfp);
> else
>
> Hm. The addition of the header in dir_open_for_write() uses
> LZ4F_compressBegin. Shouldn't we use LZ4F_compressEnd() if
> fsync_fname() or fsync_parent_path() fail on top of closing the fd?
> That would be more consistent IMO to do so. The patch does that in
> dir_close(). You should do that additionally if there is a failure
> when writing the header.

Fixed. LZ4_flush() have been added where appropriate.

>
> + pg_log_error("invalid compression-method \"%s\", optarg);
> + exit(1);
>
> This could be "invalid value \"%s\" for option %s", see
> option_parse_int() in fe_utils/option_utils.c.

Fixed.

>
> After running the TAP tests, the LZ4 section is failing as follows:
> pg_receivewal: stopped log streaming at 0/4001950 (timeline 1)
> pg_receivewal: not renaming "000000010000000000000004.partial", segment is not complete
> pg_receivewal: error: could not close file "000000010000000000000004": Undefined error: 0
> ok 26 - streaming some WAL using --compression-method=lz4
> The third log line I am quoting here looks unexpected to me. Saying
> that, the tests integrate nicely with the existing code.

Strange that you got an undefined error. I managed to _almost_ reproduce
with the log line looking like:

pg_receivewal: error: could not close file "000000010000000000000004": Success

This was due to a call to LZ4F_compressEnd() on a partial file. In v5 of
the patch, LZ4F_compressEnd() is called when the WalCloseMethod is CLOSE_NORMAL
otherwise LZ4F_flush is used. This seems to remove the log line and a
more consistent behaviour overall.

In passing, close_walfile() has been taught to consider compression in
the filename, via get_file_name().

> As mentioned upthread, LZ4-compressed files don't store the file size
> by default. I think that we should document that better in the code
> and the documentation, in two ways at least:
>
> - Add some comments mentioning lz4 --content-size, with at least one
> in FindStreamingStart().
> - Add a new paragraph in the documentation of --compression-method.

Apologies, I didn't understood what you meant upstream. Now I do.

How about:

By default, LZ4-compressed files don't store the uncompressed file size.
However, the program pg_receivewal, does store that information. As a
consequence, the file does not need to be decompressed if the external
program is used, e.g. lz4 -t --content-size <file>, will report the
uncompressed file size.

> The name of the compression method is "LZ4" with upper-case
> characters. Some comments in the code and the tests, as well as the
> docs, are not careful about that.

Hopefully fixed.

Cheers,
//Georgios

> --
> Michael
>

Attachment Content-Type Size
v5-0001-Teach-pg_receivewal-to-use-LZ4-compression.patch application/octet-stream 33.3 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Peter Geoghegan 2021-09-16 16:19:17 Re: Partial index "microvacuum"
Previous Message Fujii Masao 2021-09-16 15:13:41 Re: Improve logging when using Huge Pages