Re: refactoring basebackup.c

From: Jeevan Ladhe <jeevan(dot)ladhe(at)enterprisedb(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Mark Dilger <mark(dot)dilger(at)enterprisedb(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>, tushar <tushar(dot)ahuja(at)enterprisedb(dot)com>
Subject: Re: refactoring basebackup.c
Date: 2021-09-21 13:07:37
Message-ID: CAOgcT0PdvW1aV+Pnim-NuqDxU3zkN4EkQJ1Op9ZVKOssuWBQkw@mail.gmail.com
Thread:
Lists: pgsql-hackers

>
> >> + /*
> >> + * LZ4F_compressUpdate() returns the number of bytes written into the
> >> + * output buffer. We need to keep track of how many bytes have been
> >> + * cumulatively written into the output buffer (bytes_written). But
> >> + * LZ4F_compressUpdate() returns 0 in case the data is buffered and not
> >> + * written to the output buffer, so set autoFlush to 1 to force writing
> >> + * to the output buffer.
> >> + */
> >> + prefs->autoFlush = 1;
> >>
> >> I don't see why this should be necessary. Elsewhere you have code that
> >> caters to bytes being stuck inside LZ4's buffer, so why do we also
> >> require this?
> >
> > This is needed to know the actual bytes written in the output buffer. If
> > it is set to 0, then LZ4F_compressUpdate() would return either 0 or the
> > actual number of bytes written to the output buffer, depending on whether
> > it has buffered or really flushed data to the output buffer.
>
> The problem is that if we autoflush, I think it will cause the
> compression ratio to be less good. Try un-lz4ing a file that is
> produced this way and then re-lz4ing it, and compare the size of the
> re-lz4'd file to the original one. Compressors rely on postponing
> decisions about how to compress until they've seen as much of the
> input as possible; flushing forces them to decide earlier, possibly
> making a decision that isn't as good as it could have been. So I
> believe we should look for a way of avoiding this. Now I realize
> there's a problem with doing that while also making sure the
> output buffer is large enough, and I'm not quite sure how we solve
> that problem, but there is probably a way to do it.
>

Yes, you are right here, and I could verify this with an experiment.
When autoFlush is 1, the file gets less compressed, i.e. the compressed
file is larger than the one generated when autoFlush is set to 0.
But, as of now, I cannot think of a solution, because we really do need
to know how many bytes were written to the output buffer so that we can
advance our position and keep writing into it.
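The experiment is easy to reproduce outside PostgreSQL. A minimal sketch is
below; it uses Python's zlib from the standard library rather than LZ4 (the
lz4 bindings are not assumed to be installed), since the same buffering
versus flushing trade-off applies to any streaming compressor: forcing a
flush after every chunk makes the compressor commit to decisions early and
hurts the ratio, just as autoFlush = 1 does in LZ4F.

```python
import zlib

# Repetitive input, fed to the compressor in small chunks, as basebackup
# would feed tar data.
data = b"the quick brown fox jumps over the lazy dog\n" * 2000
chunks = [data[i:i + 512] for i in range(0, len(data), 512)]

def compress(chunks, flush_each_chunk):
    c = zlib.compressobj()
    out = b""
    for chunk in chunks:
        out += c.compress(chunk)
        if flush_each_chunk:
            # Z_SYNC_FLUSH forces buffered data out immediately,
            # analogous to setting autoFlush = 1 in LZ4F preferences.
            out += c.flush(zlib.Z_SYNC_FLUSH)
    return out + c.flush()

flushed = compress(chunks, flush_each_chunk=True)
buffered = compress(chunks, flush_each_chunk=False)
# The per-chunk-flushed stream comes out larger than the buffered one.
print(len(flushed), len(buffered))
```

With per-chunk flushing, every flush ends the current compressed block and
emits a sync marker, so the output grows with the number of chunks even
though the input is identical; this mirrors the size difference seen between
the autoFlush = 1 and autoFlush = 0 backups.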

Regards,
Jeevan Ladhe
