Re: libpq compression

From: Daniil Zakhlystov <usernamedt(at)yandex-team(dot)ru>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Konstantin Knizhnik <knizhnik(at)garret(dot)ru>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: libpq compression
Date: 2020-12-14 17:53:56
Message-ID: D5354E7A-3B9F-4D32-B3AB-F65058D36500@yandex-team.ru
Lists: pgsql-hackers

> On Dec 10, 2020, at 1:39 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>
> I still think this is excessively baroque and basically useless.
> Nobody wants to allow compression levels 1, 3, and 5 but disallow 2
> and 4. At the very most, somebody might want to set a maximum or
> minimum level. But even that I think is pretty pointless. Check out
> the "Decompression Time" and "Decompression Speed" sections from this
> link:
>
> https://www.rootusers.com/gzip-vs-bzip2-vs-xz-performance-comparison/
>
> This shows that decompression time and speed is basically independent
> of compression method for all three of these compressors; to the
> extent that there is a difference, higher compression levels are
> generally slightly faster to decompress. I don't really see the
> argument for letting either side be proscriptive here. Deciding with
> algorithms you're willing to accept is totally reasonable since
> different things may be supported, security concerns, etc. but
> deciding you're only willing to accept certain levels seems unuseful.
> It's also unenforceable, I think, since the receiving side has no way
> of knowing what the sender actually did.

I agree that decompression time and speed are basically the same across compression levels for most algorithms.
But it seems that this may not be true for memory usage.

Check out these links: http://mattmahoney.net/dc/text.html and https://community.centminmod.com/threads/round-4-compression-comparison-benchmarks-zstd-vs-brotli-vs-pigz-vs-bzip2-vs-xz-etc.18669/

According to these sources, zstd uses significantly more memory when decompressing data that was compressed at higher compression levels.

So I’ll test different ZSTD compression levels with the current version of the patch and post the results later this week.
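
To be concrete about what I plan to measure, here is roughly the kind of standalone level sweep I have in mind (a minimal sketch using only zstd’s one-shot API, completely independent of the patch; the sample payload and the level list are arbitrary):

/*
 * Standalone sketch, not part of the patch: compress an arbitrary sample
 * buffer at several levels and report compressed size and one-shot
 * decompression time for each level.
 * Build with: cc zstd_level_sweep.c -lzstd
 */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <zstd.h>

int
main(void)
{
    size_t  src_size = 1024 * 1024;          /* arbitrary 1 MB sample */
    char   *src = malloc(src_size);
    size_t  bound = ZSTD_compressBound(src_size);
    char   *dst = malloc(bound);
    char   *back = malloc(src_size);
    int     levels[] = {1, 3, 5, 10, 15, 19};

    for (size_t i = 0; i < src_size; i++)    /* mildly compressible input */
        src[i] = (char) (i % 64);

    for (int i = 0; i < 6; i++)
    {
        size_t   csize = ZSTD_compress(dst, bound, src, src_size, levels[i]);
        clock_t  t0, t1;
        size_t   dsize;

        if (ZSTD_isError(csize))
        {
            fprintf(stderr, "compress: %s\n", ZSTD_getErrorName(csize));
            return 1;
        }

        t0 = clock();
        dsize = ZSTD_decompress(back, src_size, dst, csize);
        t1 = clock();

        if (ZSTD_isError(dsize))
        {
            fprintf(stderr, "decompress: %s\n", ZSTD_getErrorName(dsize));
            return 1;
        }

        printf("level %2d: %zu -> %zu bytes, decompression %.3f ms\n",
               levels[i], src_size, csize,
               1000.0 * (double) (t1 - t0) / CLOCKS_PER_SEC);
    }
    return 0;
}

This only captures compressed size and decompression speed; decompression memory usage I’ll measure against the patch itself.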

> On Dec 10, 2020, at 1:39 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>
> Good points. I guess you need to arrange to "flush" at the compression
> layer as well as the libpq layer so that you don't end up with data
> stuck in the compression buffers.

I think that “flushing” the libpq and compression buffers before setting the new compression method will only solve the issue on the compressing (sender) side,
but won’t help much on the decompressing (receiver) side.
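
To illustrate why the sender side is the easy part: pushing everything buffered inside zstd out before switching is straightforward with the streaming API (a minimal sketch, assuming a streaming ZSTD_CCtx already set up by the patch; send_to_libpq_buffer() is a hypothetical placeholder for handing bytes to the libpq send buffer):

/*
 * Sender-side sketch only: flush everything currently buffered inside zstd
 * so no compressed data stays stuck in the compression buffers.
 * send_to_libpq_buffer() is a hypothetical placeholder, not from the patch.
 */
static void
flush_compression_stream(ZSTD_CCtx *cctx)
{
    ZSTD_inBuffer  in = {NULL, 0, 0};
    char           scratch[8192];        /* arbitrary scratch buffer */
    size_t         remaining;

    do
    {
        ZSTD_outBuffer out = {scratch, sizeof(scratch), 0};

        /* ZSTD_e_end could be used instead to also close the current frame */
        remaining = ZSTD_compressStream2(cctx, &out, &in, ZSTD_e_flush);
        if (ZSTD_isError(remaining))
            break;                       /* real code would report the error */

        send_to_libpq_buffer(scratch, out.pos);
    } while (remaining != 0);            /* 0 means fully flushed */
}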

In the current version of the patch, the decompressor acts as a proxy between secure_read and PqRecvBuffer / conn->inBuffer. It is unaware of the Postgres protocol and
cannot do anything other than decompress the bytes received from secure_read and append them to PqRecvBuffer.
So the problem is that we can’t tell the compressed bytes apart from the uncompressed ones (ZSTD actually detects the end of a compressed block, but some other algorithms don’t).

We could introduce some hooks to control the decompressor’s behavior from the protocol level once the SetCompressionMethod message has been read
from PqRecvBuffer, but I don’t think that is the correct approach.

> On Dec 10, 2020, at 1:39 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>
> Another idea is that you could have a new message type that says "hey,
> the payload of this is 1 or more compressed messages." It uses the
> most-recently set compression method. This would make switching
> compression methods easier since the SetCompressionMethod message
> itself could always be sent uncompressed and/or not take effect until
> the next compressed message. It also allows for a prudential decision
> not to bother compressing messages that are short anyway, which might
> be useful. On the downside it adds a little bit of overhead. Andres
> was telling me on a call that he liked this approach; I'm not sure if
> it's actually best, but have you considered this sort of approach?

This may help to solve the above issue. For example, we may introduce the CompressedData message:

CompressedData (F & B)

Byte1(‘m’) // I am not so sure about the ‘m’ identifier :)
Identifies the message as compressed data.

Int32
Length of message contents in bytes, including self.

Byten
Data that forms part of a compressed data stream.

Basically, it wraps some chunk of compressed data (like the CopyData message).

On the sender side, the compressor will wrap all outgoing message chunks into CompressedData messages.
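
On the backend, the wrapping itself would be trivial with the existing pqformat helpers (a sketch only; ‘m’ is the placeholder identifier from above, and how the compressor chunks its output into data/len is a separate question):

/*
 * Backend-side sketch of the wrapping step.  pq_endmessage() (via
 * pq_putmessage) emits the type byte and the Int32 length word.
 */
#include "postgres.h"
#include "libpq/pqformat.h"

static void
send_compressed_chunk(const char *data, int len)
{
    StringInfoData buf;

    pq_beginmessage(&buf, 'm');      /* CompressedData */
    pq_sendbytes(&buf, data, len);
    pq_endmessage(&buf);
}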

On the receiver side, some intermediate component between secure_read and the decompressor will do the following (a rough sketch follows the list):
1. Read the next 5 bytes (type and length) from the buffer
2.1 If the message type is other than CompressedData, forward it straight to the PqRecvBuffer / conn->inBuffer.
2.2 If the message type is CompressedData, forward its contents to the current decompressor.
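
In code, I imagine that intermediate component looking roughly like this (backend-flavoured C, a sketch only; read_fully(), decompress_into_recv_buffer(), append_raw_to_recv_buffer() and append_raw_from_socket() are hypothetical placeholders, not names from the patch, and partial reads / error handling are ignored):

/*
 * Route incoming data: only CompressedData payloads go through the
 * current decompressor, everything else is passed along untouched.
 */
static void
route_incoming_data(void)
{
    for (;;)
    {
        char    hdr[5];
        char    msgtype;
        uint32  msglen;

        if (!read_fully(hdr, sizeof(hdr)))   /* wrapper around secure_read() */
            break;

        msgtype = hdr[0];
        memcpy(&msglen, hdr + 1, 4);
        msglen = pg_ntoh32(msglen);          /* Int32 length includes itself */

        if (msgtype == 'm')                  /* CompressedData */
        {
            /* 2.2: feed only the payload to the current decompressor */
            decompress_into_recv_buffer(msglen - 4);
        }
        else
        {
            /* 2.1: forward header and payload to PqRecvBuffer / conn->inBuffer as-is */
            append_raw_to_recv_buffer(hdr, sizeof(hdr));
            append_raw_from_socket(msglen - 4);
        }
    }
}

The nice property is that the decompressor stays completely unaware of the protocol; only this thin dispatcher needs to know about message boundaries.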

What do you think of this approach?


Daniil Zakhlystov
