Re: libpq compression

From: Konstantin Knizhnik <knizhnik(at)garret(dot)ru>
To: Robert Haas <robertmhaas(at)gmail(dot)com>, Daniil Zakhlystov <usernamedt(at)yandex-team(dot)ru>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: libpq compression
Date: 2020-11-24 20:59:20
Message-ID: 826687dc-f9f6-e854-5975-65122a7549a4@garret.ru
Lists: pgsql-hackers

On 24.11.2020 21:35, Robert Haas wrote:
> On Tue, Nov 24, 2020 at 12:35 PM Daniil Zakhlystov
> <usernamedt(at)yandex-team(dot)ru> wrote:
>> To sum up, I think that the current implementation already introduces good benefits. As I proposed in the Usability review, we may introduce the new approaches later as separate compression 'algorithms'.
> I don't think the current patch is so close to being committable that
> we shouldn't be considering what we really want to have here. It's one
> thing to say, well, this patch is basically done, let's not start
> redesigning it now. But that's not the case here. For example, I don't
> see any committer accepting the comments in zpq_stream.c as adequate,
> or the documentation, either. Some comments that have been made
> previously, like Andres's remark about the non-standard message
> construction in pq_configure(), have not been addressed, and I do not
> think any committer is going to agree with the idea that the novel
> method chosen by the patch is superior here, not least but not only
> because it seems like it's endian-dependent. That function also uses
> goto, which anybody thinking of committing this will surely try to get
> rid of, and I'm pretty sure the sscanf() isn't good enough to reject
> trailing garbage, and the error message that follows is improperly
> capitalized. I'm sure there's other stuff, too: this is just based on
> a quick look.
>
> Before we start worrying about any of that stuff in too much detail, I
> think it makes a lot of sense to step back and consider the design.
> Honestly, the work of changing the design might be smaller than the
> amount of cleanup the patch needs. But even if it's larger, it's
> probably not vastly larger. And in any case, I quite disagree with the
> idea that we should commit to a user-visible interface that exposes a
> subset of the functionality that we needed and then try to glue the
> rest of the functionality on top of it later. If we make a libpq
> connection option called compression that controls the type of
> compression that is used in both directions, then how exactly would we
> extend that later to allow for different compression in the two
> directions? Some syntax like compression=zlib/none, where the value
> before the slash controls one direction and the value after the slash
> controls the other? Maybe. But on the other hand, maybe it's better to
> have separate connection options for client compression and server
> compression. Or, maybe the kind of compression used by the server
> should be controlled via a GUC rather than a connection option. Or,
> maybe none of that is right and we should stick with the approach the
> patch currently takes. But it's not like we can do something for v1
> and then just change things randomly later: there will be
> backward-compatibility to worry about. So the time to talk about the
> general approach here is now, before anything gets committed, before
> the project has committed itself to any particular design. If we
> decide in that discussion that certain things can be left for the
> future, that's fine. If we have discussed how they could be added
> without breaking backward compatibility, even better. But we can't
> just skip over having that discussion.
>
> --
> Robert Haas
> EDB: http://www.enterprisedb.com

First of all, thank you for the review.
And I completely agree with you that this patch is not ready for committing:
at the very least, the documentation, written in my bad English, has to be
checked and fixed. Also, I have not tested it much on Windows and other
non-Unix systems.

I do not want to discuss small technical things like the way the compression
message is sent in pq_configure. I have already answered Andres why I cannot
use the standard functions in this case: this function is called during
connection initialization, so the connection handle is not ready yet.
But this can certainly be changed (although it is not endian-dependent:
libpq messages use big-endian order).
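
Just to illustrate the byte-order point, here is a rough sketch (this is not
the actual code from the patch; the message type byte 'z' and the helper name
are invented for illustration) of building such a message by hand with its
length in network (big-endian) order:

#include <arpa/inet.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: one type byte, a 4-byte length in network
 * (big-endian) order covering itself plus the payload, then the
 * zero-terminated algorithm name. */
static size_t
build_compression_message(char *buf, const char *algorithm)
{
    uint32_t payload_len = (uint32_t) strlen(algorithm) + 1;
    uint32_t msg_len = htonl(payload_len + 4);

    buf[0] = 'z';                     /* invented message type byte */
    memcpy(buf + 1, &msg_len, 4);     /* length is already big-endian */
    memcpy(buf + 5, algorithm, payload_len);
    return 1 + 4 + payload_len;
}

The result does not depend on the endianness of the host.
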
It also seems strange to me to argue about the presence of "goto" in the code:
we are not "puritans", are we? ;)
Yes, I also try to avoid gotos as much as possible, since they can make code
less readable. But completely prohibiting gotos and replacing them with
artificial loops and redundant checks seems like the kind of radical approach
that rarely leads to anything good. By the way, there are about three thousand
gotos in the Postgres code (maybe some of them are in generated code - I have
not checked :)

So let's discuss the more fundamental things, like your suggestion to
completely redesign the compression support.
I am not against such a discussion, although I personally do not think there
are many open questions here.
I certainly do not want to claim that my implementation is perfect and cannot
be reimplemented in a better way. Certainly it can.
But compression itself, and especially compression of protocol messages, is
not a novel area. It has been implemented many times in many different
systems (recently it was added to MySQL), and it is hard to find much room
for "afflatus" here.

We can suggest many different things that would make compression more complex
and flexible:
- the ability to switch it on/off on the fly
- different compression algorithms in the two directions
- the ability to change the compression algorithm and compression level on the fly
- dynamically choosing the best compression algorithm based on the content of
  the data stream
...

I am sure that after a few moments of thought any of us could add several more
items to this list.
If we try to develop a protocol that supports all of them, it will most likely
end up too complicated, too expensive and inefficient.
There is a general rule: the more flexible something is, the less efficient
it is...

Before suggesting any of these advanced features, we need strong arguments for
why the feature is actually needed and which problem it is trying to solve.

Right now, "statu quo" is that we have few well known compression
algorithms: zlib, zstd, lz4,... which characteristics are not so
different. For some of them it is possible to say that algorithm A is
almost always better than algorithm B, for example
modern zstd almost always better than old zlib. But zlib is available
almost everywhere and is already supported by default by postgres.
Some algorithms are faster, some provide better compression ratio. But
for small enough protocol messages speed is much more
important than compression quality.

Sorry for my perhaps too "chaotic" arguments. I really do not think there is
much sense in supporting a lot of different compression algorithms.
Both zstd and lz4 provide comparable speed and compression ratio,
and zlib can be used as a universal solution.
Any of them is definitely much better than no compression at all.
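
To show how little machinery the "universal" zlib case actually needs, here is
a sketch under simplifying assumptions (plain zlib, an output buffer known to
be large enough, one z_stream kept for the whole connection); it is only an
illustration, not the patch's zpq_stream code:

#include <string.h>
#include <zlib.h>

/* Sketch: compress one protocol message on an already-initialized stream.
 * Z_SYNC_FLUSH makes the compressed bytes available immediately, which is
 * what a request/response protocol needs. */
static int
compress_message(z_stream *zs, const char *msg, size_t msg_len,
                 char *out, size_t out_size, size_t *out_len)
{
    zs->next_in = (Bytef *) msg;
    zs->avail_in = (uInt) msg_len;
    zs->next_out = (Bytef *) out;
    zs->avail_out = (uInt) out_size;

    if (deflate(zs, Z_SYNC_FLUSH) != Z_OK)
        return -1;

    *out_len = out_size - zs->avail_out;
    return 0;
}

The caller does deflateInit(&zs, Z_DEFAULT_COMPRESSION) once per connection
and deflateEnd(&zs) at the end; zstd and lz4 offer similarly small streaming
interfaces.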

What really worries me is that there are several places in Postgres that
require compression, and all of them are implemented in their own way: TOAST
compression, WAL compression, backup compression, ... zedheap compression.
There is now a custom compression methods patch at the commit fest, which
proposes a very flexible way of fixing just one of these cases, and which is
completely orthogonal to the proposed libpq compression.
My main concern is this awful code duplication and the fact that Postgres is
still using its own pglz implementation, which is much worse (according to
zedstore benchmarks and my own measurements) than lz4 and other popular
algorithms. And I do not believe that this situation will change in PG14 or
PG15.

So why do we need to be able to switch compression on/off?
Because of high CPU overhead? Daniil's measurements do not show that...
Compression should not be used for all connections.
There are some well-known kinds of connections for which it will be useful:
first of all the replication protocol.
If a client uploads data to the database using the COPY command, or performs
complex OLAP queries returning huge results, then compression may also be
useful.
I cannot imagine a scenario where it is necessary to switch compression on/off
on the fly, and it is not clear who (the user, the libpq library, the JDBC
driver, ...?) would control it and how.
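
In other words, a per-connection option set once at connect time seems enough.
As a rough illustration (the option name "compression" follows this patch, but
the exact accepted values are up for discussion), a client doing bulk COPY or
reading large results would just do something like:

#include <stdio.h>
#include <libpq-fe.h>

int
main(void)
{
    /* compression=zlib is illustrative; the value syntax depends on the patch */
    PGconn *conn = PQconnectdb("host=example dbname=bulk compression=zlib");

    if (PQstatus(conn) != CONNECTION_OK)
        fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));

    /* ... run COPY or a large query here ... */

    PQfinish(conn);
    return 0;
}

Ordinary short OLTP connections would simply not set the option.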

The possibility of using different algorithms in different directions seems
even less sensible, especially if we do not use data-specific compression
algorithms. And using such algorithms is very doubtful, taking into account
that data between client and server is currently mostly transferred in text
format (ok, in the case of replication this is not true).

Dynamically switching the compression algorithm depending on the content of
the data stream is also an interesting but practically useless idea, because
modern compression algorithms already do their best to adapt to the input
data.

Finally, I just want to note once again that, from my point of view,
compression is not an area where we need to invent something new. It is
enough to provide functionality similar to what is available in most other
communication protocols.
IMHO there is no point in letting the user specify a particular algorithm and
compression level.
It should work like it does in SSL: you can switch compression on or off, but
you cannot specify the algorithm or compression level, much less dynamically
control compression on the fly.

Unfortunately, your proposal to throw away this patch and start the design
discussion from scratch most likely means that we will not have compression
in the foreseeable future. And that is a pity, because I see many real use
cases where compression is actually needed and where Oracle beats Postgres
several times over just because it sends less data over the network.
