Re: libpq compression

From: Daniil Zakhlystov <usernamedt(at)yandex-team(dot)ru>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Konstantin Knizhnik <knizhnik(at)garret(dot)ru>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: libpq compression
Date: 2020-12-08 14:42:14
Message-ID: A1D0D9A4-1FC4-4311-83C7-EEA90749EE07@yandex-team.ru
Lists: pgsql-hackers

Hi, Robert!

First of all, thanks for your detailed reply.

> On Dec 3, 2020, at 2:23 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>
> Like, in the protocol that you proposed previously, you've got a
> four-phase handshake to set up compression. The startup packet carries
> initial information from the client, and the server then sends
> CompressionAck, and then the client sends SetCompressionMethod, and
> then the server sends SetCompressionMethod. This system is fairly
> complex, and it requires some form of interlocking.

I proposed a slightly different handshake (three-phase):

1. First, the client sends the _pq_.compression parameter in the startup packet.
2. The server replies with CompressionAck, followed by a SetCompressionMethod message.
These two might be combined, but I kept them separate for symmetry. In most cases they
will arrive together without any additional delay.
3. The client replies with a SetCompressionMethod message.

A handshake like the one above makes it possible to forbid uncompressed client-to-server and/or server-to-client communication.

For example, if the client did not explicitly specify ‘uncompressed’ in the supported decompression methods list, and
the server does not support any of the other compression algorithms sent by the client, the server will send back
SetCompressionMethod with ‘-1’ index. After receiving this message, the client will terminate the connection.
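
The "-1" rule described above can be sketched roughly as follows. This is only an illustration, not code from the patch: the function names, the plain list-of-strings representation, and the "first match in the client's order wins" policy are all assumptions made for the example.

```python
from typing import Sequence

NO_COMMON_METHOD = -1  # index sent back when no algorithm matches


def choose_method_index(client_algs: Sequence[str],
                        server_algs: Sequence[str]) -> int:
    """Return the index, in the client's list, of the first algorithm the
    server also supports, or -1 if the two lists have nothing in common.

    Hypothetical helper: the selection policy is an assumption."""
    supported = set(server_algs)
    for idx, name in enumerate(client_algs):
        if name in supported:
            return idx
    return NO_COMMON_METHOD


def client_must_terminate(method_index: int) -> bool:
    """The client closes the connection when it receives the -1 index."""
    return method_index == NO_COMMON_METHOD
```

With a client list that omits "uncompressed", such as ["zlib", "zstd"], and a server that supports neither algorithm, the helper returns -1 and the client terminates the connection. If the client had listed "uncompressed" as well, the index of that entry would be returned instead.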

> On Dec 3, 2020, at 2:23 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>
> And again, if you allow the compression method to be switched at any
> time, you just have to know how what to do when you get a
> SetCompressionMethod. If you only allow it to be changed once, to set
> it initially, then you have to ADD code to reject that message the
> next time it's sent. If that ends up avoiding a significant amount of
> complexity somewhere else then I don't have a big problem with it, but
> if it doesn't, it's simpler to allow it whenever than to restrict it
> to only once.

Yes, there is actually some complexity involved in implementing on-the-fly switchable compression.
Currently, compression itself operates at a different level, independently of the libpq protocol. To allow
compression to be switched on the fly, we need to solve these problems:

1. When a new portion of bytes arrives at the decompressor from the socket.read() call, the first part
of these bytes may be a compressed fragment and the other part uncompressed or, worse,
a single portion of new bytes may contain the end of a ZLIB-compressed message and the beginning of a ZSTD-compressed message.
The problem is that we don’t know exactly where the ZLIB-compressed message ends until we decompress the entire chunk of new bytes
and read the SetCompressionMethod message. Moreover, streaming compression by itself may involve some internal buffering,
which further complicates the problem.

2. When sending a new portion of bytes, it may not be sufficient to keep track of only the current compression method.
There may be multiple SetCompressionMethod messages in PqSendBuffer (backend) or conn->outBuffer (frontend) at once.
This means that we must track not only the current compression method but also every compression method
switch in PqSendBuffer or conn->outBuffer. As with decompression,
the internal buffering of streaming compression makes this more complex as well.
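
The first problem can be seen in a small experiment with Python's standard zlib module. This is only a sketch: it assumes the sender fully terminates the zlib stream before switching, and the byte strings are made-up stand-ins for protocol data. The receiver learns where the zlib data ended only after running the whole chunk through the decompressor.

```python
import zlib

# One socket read that happens to contain the end of a zlib stream
# followed by bytes in some other format (here: plain bytes as a stand-in).
first = zlib.compress(b"SELECT 1;" * 10)
second = b"bytes that follow the switch"
chunk = first + second

d = zlib.decompressobj()
plain = d.decompress(chunk)

# Only now does the receiver know where the zlib stream ended:
# everything past the end of the stream is left in unused_data.
boundary = len(chunk) - len(d.unused_data)
leftover = d.unused_data
stream_finished = d.eof
```

Here the boundary is recoverable because the stream was explicitly finished. A streaming compressor that is kept open and only flushed gives the receiver no such end-of-stream marker, which is part of what makes the switchable mode harder.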

Although the above two problems might be solvable, I doubt that we should be obliged to solve them not only in libpq
but also in all third-party Postgres protocol libraries, since the exact areas of application for switchable compression are not clear yet.
I agree with Konstantin’s point of view on this one:

> And more important question - if we really want to switch algorithms on
> the fly: who and how will do it?
> Do we want user to explicitly control it (something like "\compression
> on" psql command)?
> Or there should be some API for application?
> How it can be supported for example by JDBC driver?
> I do not have answers for this questions...

However, as previously mentioned in the thread, it might be useful in the future and we should design a protocol
that supports it so we won’t have any problems with backward compatibility.
So, basically, this was the only reason to introduce the two separate compression modes - switchable and permanent.

In the latest patch, Konstantin introduced the extension part, so in future versions we can introduce switchable compression
handling in this extension part. For now, let permanent compression be the default mode.

> On Dec 3, 2020, at 2:23 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>
> I think that there's no such thing as being able to decompress some
> compression levels with the same algorithm but not others. The level
> only controls behavior on the compression side. So, the client can
> only send zlib data if the server can decompress it, but the server
> need not advertise which levels it can decompress, because it's all or
> nothing.

Depending on the chosen compression algorithm, compression level may affect the decompression speed and memory usage.
That's why I think it may be nice for the server to be able to forbid compression levels that require high CPU / memory usage for decompression.

> On Dec 3, 2020, at 2:23 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>
> On the other hand, one could take a whole different
> approach and imagine the server being in charge of both directions,
> like having a GUC that is set on the server. Clients advertise what
> they can support, and the server tells them to do whatever the GUC
> says they must. That sounds awfully heavy-handed, but it has the
> advantage of letting the server administrator set site policy.

I personally think that this approach is the most practical one. For example:

In the server’s postgresql.conf:

compress_algorithms = 'uncompressed' // means that the server forbids any server-to-client compression
decompress_algorithms = 'zstd:7,8;uncompressed' // means that the server can only decompress zstd with compression levels 7 and 8 or communicate with uncompressed messages

In the client connection string:

“… compression=zlib:1,3,5;zstd:6,7,8;uncompressed …” // means that the client is able to compress/decompress zlib, zstd, or communicate with uncompressed messages
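
For illustration, a value in this format could be parsed into an ordered list of algorithms and levels roughly like this. It is a sketch only: the function name and the returned structure are assumptions, and only the "name:level,level;name" syntax is taken from the example above.

```python
from typing import List, Tuple


def parse_compression_spec(spec: str) -> List[Tuple[str, List[int]]]:
    """Parse e.g. 'zlib:1,3,5;zstd:6,7,8;uncompressed' into an ordered
    list of (algorithm, levels) pairs. Hypothetical helper."""
    result: List[Tuple[str, List[int]]] = []
    for entry in spec.split(";"):
        entry = entry.strip()
        if not entry:
            continue  # tolerate a trailing ';'
        name, _, levels = entry.partition(":")
        level_list = [int(x) for x in levels.split(",")] if levels else []
        result.append((name, level_list))
    return result


parsed = parse_compression_spec("zlib:1,3,5;zstd:6,7,8;uncompressed")
```

Keeping the entries in their original order matters if later messages refer to algorithms and levels by index rather than by name.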

For the sake of simplicity, the client’s “compression” parameter in the connection string is basically an analog of the server’s compress_algorithms and decompress_algorithms.
So the negotiation process for the above example would look like this:

1. Client sends startup packet with “algorithms=zlib:1,3,5;zstd:6,7,8;uncompressed;”
Since there is no compression mode specified, assume that the client wants permanent compression.
In future versions, the client can request switchable compression after the ‘;’ at the end of the message.

2. Server replies with two messages:
- CompressionAck message containing “algorithms=zstd:7,8;uncompressed;”
Where the algorithms section basically matches the “decompress_algorithms” server GUC parameter.
In future versions, the server can specify the chosen compression mode after the ‘;’ at the end of the message

- A following SetCompressionMethod message containing “alg_idx=1;level_idx=1”, which
essentially means that the server chose zstd with compression level 7 for server-to-client compression. Every subsequent message from the server is now compressed with zstd.

3. The client replies with a SetCompressionMethod message containing “alg_idx=0”, which means that the client chose uncompressed
client-to-server messaging. Actually, the client had no other options, because “uncompressed” was the only option left after intersecting the
compression algorithms from the connection string with the algorithms received from the server in the CompressionAck message.
Every subsequent message from the client is now sent uncompressed.


Daniil Zakhlystov
