Re: refactoring basebackup.c

From: Jeevan Ladhe <jeevan(dot)ladhe(at)enterprisedb(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Mark Dilger <mark(dot)dilger(at)enterprisedb(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>, tushar <tushar(dot)ahuja(at)enterprisedb(dot)com>
Subject: Re: refactoring basebackup.c
Date: 2021-09-13 10:02:45
Message-ID: CAOgcT0Nm9TfigWttAaVdQrydz-SrLQRwSEeqzm6mRU-jUmKJnA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Thanks, Robert for your response.

On Thu, Sep 9, 2021 at 1:09 AM Robert Haas <robertmhaas(at)gmail(dot)com> wrote:

> On Wed, Sep 8, 2021 at 2:14 PM Jeevan Ladhe
> <jeevan(dot)ladhe(at)enterprisedb(dot)com> wrote:
> > To give an example, I put some logging statements, and I can see in the
> log:
> > "
> > bytes remaining in mysink->base.bbs_next->bbs_buffer: 16537
> > input size to be compressed: 512
> > estimated size for compressed buffer by LZ4F_compressBound(): 262667
> > actual compressed size: 16
> > "
>
> That is pretty lame. I don't know why it needs a ~256k buffer to
> produce 16 bytes of output.
>

As I mentioned earlier, I think it has something to do with the lz4
blocksize. Currently, I have chosen it has 256kB, which is 262144 bytes,
and here the LZ4F_compressBound() has returned 262667 for worst-case
accommodation of 512 bytes i.e. 262144(256kB) + 512 + I guess some
book-keeping bytes. If I choose to have blocksize as 64K, then this turns
out to be: 66059 which is 65536(64 kB) + 512 + bookkeeping bytes.

The way the gzip APIs I used work, you tell it how big the output
> buffer is and it writes until it fills that buffer, or until the input
> buffer is empty, whichever happens first. But this seems to be the
> other way around: you tell it how much input you have, and it tells
> you how big a buffer it needs. To handle that elegantly, I think I
> need to make some changes to the design of the bbsink stuff. What I'm
> thinking is that each bbsink somehow tells the next bbsink how big to
> make the buffer. So if the LZ4 buffer is told that its buffer should
> be at least, I don't know, say 64kB. Then it can compute how large an
> output buffer the LZ4 library requires for 64kB. Hopefully we can
> assume that liblz4 never needs a smaller buffer for a larger input.
> Then we can assume that if a 64kB input requires, say, a 300kB output
> buffer, every possible input < 64kB also requires an output buffer <=
> 300 kB.
>

I agree, this assumption is fair enough.

But we can't just say, well, we were asked to create a 64kB buffer (or
> whatever) so let's ask the next bbsink for a 300kB buffer (or
> whatever), because then as soon as we write any data at all into it
> the remaining buffer space might be insufficient for the next chunk.
> So instead what I think we should do is have bbsink_lz4 set the size
> of the next sink's buffer to its own buffer size +
> LZ4F_compressBound(its own buffer size). So in this example if it's
> asked to create a 64kB buffer and LZ4F_compressBound(64kB) = 300kB
> then it asks the next sink to set the buffer size to 364kB. Now, that
> means that there will always be at least 300 kB available in the
> output buffer until we've accumulated a minimum of 64 kB of compressed
> data, and then at that point we can flush.

I think this would be relatively clean and would avoid the need for
> the double copying that the current design forced you to do. What do
> you think?
>

I think this should work.

>
> + /*
> + * If we do not have enough space left in the output buffer for this
> + * chunk to be written, first archive the already written contents.
> + */
> + if (nextChunkLen > mysink->base.bbs_next->bbs_buffer_length -
> mysink->bytes_written ||
> + mysink->bytes_written >= mysink->base.bbs_next->bbs_buffer_length)
> + {
> + bbsink_archive_contents(sink->bbs_next, mysink->bytes_written);
> + mysink->bytes_written = 0;
> + }
>
> I think this is flat-out wrong. It assumes that the compressor will
> never generate more than N bytes of output given N bytes of input,
> which is not true. Not sure there's much point in fixing it now
> because with the changes described above this code will have to change
> anyway, but I think it's just lucky that this has worked for you in
> your testing.

I see your point. But for it to be accurate, I think we need to then
considered the return value of LZ4F_compressBound() to check if that
many bytes are available. But, as explained earlier our output buffer is
already way smaller than that.

>
>
+ /*
> + * LZ4F_compressUpdate() returns the number of bytes written into output
> + * buffer. We need to keep track of how many bytes have been cumulatively
> + * written into the output buffer(bytes_written). But,
> + * LZ4F_compressUpdate() returns 0 in case the data is buffered and not
> + * written to output buffer, set autoFlush to 1 to force the writing to
> the
> + * output buffer.
> + */
> + prefs->autoFlush = 1;
>
> I don't see why this should be necessary. Elsewhere you have code that
> caters to bytes being stuck inside LZ4's buffer, so why do we also
> require this?
>

This is needed to know the actual bytes written in the output buffer. If it
is

set to 0, then LZ4F_compressUpdate() would randomly return 0 or actual

bytes are written to the output buffer, depending on whether it has buffered

or really flushed data to the output buffer.

IIUC, you are referring to the following comment for
bbsink_lz4_end_archive():

"

* There might be some data inside lz4's internal buffers; we need to get

* that flushed out, also finalize the lz4 frame and then get that forwarded

* to the successor sink as archive content.

"

I think it should be modified to:

"

* Finalize the lz4 frame and then get that forwarded to the successor sink
as

* archive content.

"

Regards,
Jeevan Ladhe.

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Rajkumar Raghuwanshi 2021-09-13 10:17:36 Re: Multi-Column List Partitioning
Previous Message Erik Rijkers 2021-09-13 09:41:56 Re: SQL/JSON: JSON_TABLE