Re: libpq compression

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Daniil Zakhlystov <usernamedt(at)yandex-team(dot)ru>
Cc: Konstantin Knizhnik <knizhnik(at)garret(dot)ru>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: libpq compression
Date: 2020-12-02 21:23:20
Message-ID: CA+TgmoZMTLydjP3iWBir20OrSXTugxS2CXC1WP1nWM5=Zqz0Nw@mail.gmail.com
Lists: pgsql-hackers

On Thu, Nov 26, 2020 at 8:15 AM Daniil Zakhlystov
<usernamedt(at)yandex-team(dot)ru> wrote:
> However, I don’t mean by this that we shouldn’t support switchable compression. I propose that we can offer two compression modes: permanent (which is implemented in the current state of the patch) and switchable on-the-fly. Permanent compression allows us to deliver a robust solution that is already present in some databases. Switchable compression allows us to support more complex scenarios in cases when the frontend and backend really need it and can afford development effort to implement it.

I feel that one thing that may be getting missed here is that my
suggestions were intended to make this simpler, not more complicated.
Like, in the design I proposed, switchable compression is not a
separate form of compression and doesn't require any special support.
Both sides are just allowed to set the compression method;
theoretically, they could set it more than once. Similarly, I don't
intend the possibility of using different compression algorithms in
the two directions as a request for an advanced feature so much as a
way of simplifying the protocol.

Like, in the protocol that you proposed previously, you've got a
four-phase handshake to set up compression. The startup packet carries
initial information from the client, and the server then sends
CompressionAck, and then the client sends SetCompressionMethod, and
then the server sends SetCompressionMethod. This system is fairly
complex, and it requires some form of interlocking. Once the client
has sent a SetCompressionMethod message, it cannot send any other
protocol message until it receives a SetCompressionMethod message back
from the server. Otherwise, it doesn't know whether the server
actually responded with SetCompressionMethod as well, or whether it
sent say ErrorResponse or NoticeResponse or something. In the former
case it needs to send compressed data going forward; in the latter
uncompressed; but it can't know which until it sees the server's
message. And keep in mind that control isn't necessarily with libpq at
this point, because non-blocking mode could be in use. This is all
solvable, but the way I proposed it, you don't have that problem. You
never need to wait for a message from the other end before being able
to send a message yourself.
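To make the interlock concrete, here is a minimal sketch (in Python, with hypothetical names; the real client is C inside libpq) of what the four-phase design forces on the client: once SetCompressionMethod is sent, outgoing traffic must be buffered until the server's reply reveals which mode is in effect.

```python
from enum import Enum, auto

class HandshakeState(Enum):
    NEGOTIATING = auto()   # SetCompressionMethod sent, awaiting reply
    COMPRESSED = auto()    # server confirmed with SetCompressionMethod
    UNCOMPRESSED = auto()  # server sent ErrorResponse/NoticeResponse instead

class Client:
    def __init__(self):
        self.state = HandshakeState.NEGOTIATING
        self.pending = []  # messages queued while the mode is unknown

    def send(self, msg):
        if self.state is HandshakeState.NEGOTIATING:
            # Cannot transmit yet: we don't know whether to compress.
            self.pending.append(msg)
            return None
        mode = ("compressed" if self.state is HandshakeState.COMPRESSED
                else "plain")
        return (mode, msg)

    def on_server_message(self, msg_type):
        # The server's reply decides the mode for everything queued so far.
        if msg_type == "SetCompressionMethod":
            self.state = HandshakeState.COMPRESSED
        else:  # e.g. ErrorResponse or NoticeResponse
            self.state = HandshakeState.UNCOMPRESSED
        flushed, self.pending = self.pending, []
        return [self.send(m) for m in flushed]
```

In the design proposed above, the NEGOTIATING state and the pending queue simply disappear: each side sets its own sending mode and transmits immediately.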

Similarly, allowing different compression methods in the two
directions may seem to make things more complicated, but I don't think
it really is. Arguably it's simpler. The instant the server gets the
startup packet, it can issue SetCompressionMethod. The instant the
client gets SupportedCompressionTypes, it can issue
SetCompressionMethod. So there's practically no hand-shaking at all.
You get a single protocol message and you immediately respond by
setting the compression method and then you just send compressed
messages after that. Perhaps the time at which you begin receiving
compressed data will be a little different than the time at which you
begin sending it, or perhaps compression will only ever be used in one
direction. But so what? The code really doesn't need to care. You just
need to keep track of the active compression mode in each direction,
and that's it.
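The per-direction bookkeeping really is that small. A sketch (hypothetical names, Python rather than libpq's C) of tracking the two directions independently:

```python
import zlib

class DirectionalCompression:
    """Track the active compression method separately per direction."""

    def __init__(self):
        self.send_method = None  # algorithm we compress outgoing data with
        self.recv_method = None  # algorithm the peer compresses with

    def wrap_outgoing(self, payload: bytes) -> bytes:
        if self.send_method == "zlib":
            return zlib.compress(payload)
        return payload  # this direction is uncompressed

    def unwrap_incoming(self, payload: bytes) -> bytes:
        if self.recv_method == "zlib":
            return zlib.decompress(payload)
        return payload
```

Nothing here depends on the two directions agreeing, or on when each side happened to issue its SetCompressionMethod.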

And again, if you allow the compression method to be switched at any
time, you just have to know what to do when you get a
SetCompressionMethod. If you only allow it to be changed once, to set
it initially, then you have to ADD code to reject that message the
next time it's sent. If that ends up avoiding a significant amount of
complexity somewhere else then I don't have a big problem with it, but
if it doesn't, it's simpler to allow it whenever than to restrict it
to only once.
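Put another way (a toy sketch, not the patch's actual handler): the switchable version is a bare assignment, and it's the once-only rule that adds a code path.

```python
def handle_set_compression_method(state, method, switchable=True):
    # Switchable: the handler is just an assignment, every time.
    # Once-only: an *extra* rejection branch is needed for any
    # SetCompressionMethod that arrives after the first.
    if not switchable and state.get("recv_method") is not None:
        raise ValueError("protocol violation: compression already set")
    state["recv_method"] = method
```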

> 2. List of the compression algorithms which the frontend is able to decompress in the order of preference.
> For example:
> “zlib:1,3,5;zstd:7,8;uncompressed” means that frontend is able to:
> - decompress zlib with 1,3 or 5 compression levels
> - decompress zstd with 7 or 8 compression levels
> - “uncompressed” at the end means that the frontend agrees to receive uncompressed messages. If there is no “uncompressed” compression algorithm specified it means that the compression is required.
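For reference, the advertisement format quoted above could be parsed along these lines (a hypothetical sketch of the proposed syntax, not code from the patch):

```python
def parse_compression_prefs(spec: str):
    """Parse e.g. 'zlib:1,3,5;zstd:7,8;uncompressed' into an ordered
    list of (algorithm, levels); an empty levels list means 'any'."""
    prefs = []
    for entry in spec.split(";"):
        algo, _, levels = entry.partition(":")
        prefs.append(
            (algo, [int(lv) for lv in levels.split(",")] if levels else []))
    return prefs
```

Whether "uncompressed" appears in the parsed list is what decides, under this proposal, whether compression is required.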

I think that there's no such thing as being able to decompress some
compression levels with the same algorithm but not others. The level
only controls behavior on the compression side. So, the client can
only send zlib data if the server can decompress it, but the server
need not advertise which levels it can decompress, because it's all or
nothing.
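This is easy to demonstrate with zlib itself: the level chosen by the compressor never constrains the decompressor.

```python
import zlib

msg = b"the same message, many times over " * 50
fast = zlib.compress(msg, 1)  # compressor picked level 1
best = zlib.compress(msg, 9)  # compressor picked level 9

# One decompressor, no level parameter, handles both streams.
assert zlib.decompress(fast) == msg
assert zlib.decompress(best) == msg
```

So advertising "zlib" is sufficient; advertising "zlib:1,3,5" conveys nothing extra to the peer.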

> Supported compression and decompression methods are configured using GUC parameters:
>
> compress_algorithms = ‘...’ // default value is ‘uncompressed’
> decompress_algorithms = ‘...’ // default value is ‘uncompressed’

This raises an interesting question which I'm not quite sure about. It
doesn't seem controversial to assert that the client must be able to
advertise which algorithms it does and does not support, and likewise
for the server. After all, just because we offer lz4, say, as an
option doesn't mean every PostgreSQL build will be performed
--with-lz4. But, how should the compression algorithm that actually
gets used be controlled? One can imagine that the client is in charge
of the compression algorithm and the compression level in both
directions. If we insist on those being the same, the client says
something like compression=lz4:1 and then it uses that algorithm and
instructs the server to do the same; otherwise there might be separate
connection parameters for client-compression and server-compression,
or some kind of syntax that lets you specify both using a single
parameter. On the other hand, one could take a whole different
approach and imagine the server being in charge of both directions,
like having a GUC that is set on the server. Clients advertise what
they can support, and the server tells them to do whatever the GUC
says they must. That sounds awfully heavy-handed, but it has the
advantage of letting the server administrator set site policy. One can
also imagine combination approaches, like letting the server GUC
define the default but allowing the client to override using a
connection parameter. Or even putting each side in charge of what it
sends: the GUC controls what the server tries to do, provided the
client can support it; and the connection parameter controls the
client behavior, provided the server can support it.

I am not really sure what's best here, but it's probably something we
need to think about a bit before we get too deep into this. I'm
tentatively inclined to think that the server should have a GUC that
defines the *allowable* compression algorithms so that the
administrator can disable algorithms that are compiled into the binary
but which she does not want to permit (e.g. because a security problem
was discovered in a relevant library). The default can simply be
'all', meaning everything the binary supports. And then the rest of
the control should be on the client side, so that the server GUC can
never influence the selection of which algorithm is actually chosen,
but only rule things out. But that is just a tentative opinion; maybe
it's not the right idea.
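Under that tentative scheme, the server-side logic is purely subtractive, something like this sketch (GUC name and shape are illustrative, not from the patch):

```python
def allowed_algorithms(guc_value, compiled_in):
    """Server-side filter: the GUC can rule algorithms out,
    but never picks one. 'all' means everything compiled in."""
    if guc_value == "all":
        return set(compiled_in)
    return {a.strip() for a in guc_value.split(",")} & set(compiled_in)

def negotiate(client_prefs, guc_value, compiled_in):
    allowed = allowed_algorithms(guc_value, compiled_in)
    for algo in client_prefs:  # the client's preference order decides
        if algo in allowed:
            return algo
    return None  # no common algorithm; fall back to uncompressed
```

An administrator who learns of, say, a zlib vulnerability can set the GUC to exclude it, without otherwise influencing which of the remaining algorithms clients choose.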

--
Robert Haas
EDB: http://www.enterprisedb.com
