Re: libpq compression

From: Konstantin Knizhnik <k(dot)knizhnik(at)postgrespro(dot)ru>
To: Dmitry Dolgov <9erthalion6(at)gmail(dot)com>
Cc: Andreas Karlsson <andreas(at)proxel(dot)se>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: libpq compression
Date: 2019-02-13 14:46:00
Message-ID: dcc86dc6-5401-f2cc-9b3b-b5bcca2d7a45@postgrespro.ru
Lists: pgsql-hackers

Hi Dmitry,

On 13.02.2019 16:56, Dmitry Dolgov wrote:
> On Mon, Feb 11, 2019 at 4:52 PM Andres Freund <andres(at)anarazel(dot)de> wrote:
>> On 2019-02-11 12:46:07 -0300, Alvaro Herrera wrote:
>>> On 2019-Feb-11, Andreas Karlsson wrote:
>>>
>>>> Imagine the following query which uses the session ID from the cookie to
>>>> check if the logged in user has access to a file.
>>>>
>>>> SELECT may_download_file(session_id => $1, path => $2);
>>>>
>>>> When the query with its parameters is compressed, the compressed size will
>>>> depend on the similarity between the session ID and the requested path
>>>> (assuming Zstd works similarly to DEFLATE), so by tricking the web browser
>>>> into making requests with specifically crafted paths while monitoring the
>>>> traffic between the web server and the database, the compressed request size
>>>> can be used to home in on the session ID and steal people's login sessions,
>>>> just like the CRIME attack[1].
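
(To make the size oracle concrete, here is a minimal standalone sketch, my
own illustration rather than code from any patch, that compresses two such
queries with zlib and compares the output sizes:)

    /* build: cc crime_demo.c -lz */
    #include <stdio.h>
    #include <string.h>
    #include <zlib.h>

    static uLong compressed_size(const char *msg)
    {
        Bytef   out[512];
        uLongf  outlen = sizeof(out);

        /* default level 6; DEFLATE, as the attack assumes */
        if (compress2(out, &outlen, (const Bytef *) msg, strlen(msg), 6) != Z_OK)
            return 0;
        return outlen;
    }

    int main(void)
    {
        const char *secret = "session_id=4bf1aa5c";
        char        query[256];

        /* attacker-controlled path that shares bytes with the secret */
        snprintf(query, sizeof(query),
                 "SELECT may_download_file('%s', '/files/session_id=4bf1');",
                 secret);
        printf("matching guess:     %lu bytes\n", compressed_size(query));

        /* same-length path with no overlap compresses worse */
        snprintf(query, sizeof(query),
                 "SELECT may_download_file('%s', '/files/x9q2mw7kp0r4t8v');",
                 secret);
        printf("non-matching guess: %lu bytes\n", compressed_size(query));
        return 0;
    }

(The guess that shares a substring with the secret compresses a few bytes
smaller thanks to DEFLATE's back-references; repeating this over many
crafted paths lets an observer recover the session ID incrementally.)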
>>> I would have said that you'd never let the attacker eavesdrop into the
>>> traffic between webserver and DB, but then that's precisely the scenario
>>> you'd use SSL for, so I suppose that even though this attack is probably
>>> just a theoretical risk at this point, it should definitely be
>>> considered.
>> Right.
>>
>>
>>> Now, does this mean that we should forbid libpq compression
>>> completely? I'm not sure -- maybe it's still usable if you encapsulate
>>> that traffic so that the attackers cannot get at it, and there's no
>>> reason to deprive those users of the usefulness of the feature. But
>>> then we need documentation warnings pointing out the risks.
>> I think it's an extremely useful feature, and can be used in situations
>> where this kind of attack doesn't pose a significant
>> danger. E.g. pg_dump, pg_basebackup seem candidates for that, and even
>> streaming replication seems much less endangered than sessions with lots
>> of very small queries. But I think that means it needs to be
>> controllable from both the server and client, and default to off
>> (although it doesn't seem crazy to allow it in the aforementioned
>> cases).
> Totally agree with this point.
>
> On Thu, Nov 29, 2018 at 8:13 PM Dmitry Dolgov <9erthalion6(at)gmail(dot)com> wrote:
>>> On Mon, Aug 13, 2018 at 8:48 PM Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>>>
>>> I agree with the critiques from Robbie Harwood and Michael Paquier
>>> that the way in that compression is being hooked into the existing
>>> architecture looks like a kludge. I'm not sure I know exactly how it
>>> should be done, but the current approach doesn't look natural; it
>>> looks like it was bolted on.
>> After some time spent reading this patch and investigating the different
>> points mentioned in the discussion, I tend to agree with that. As far as I
>> can see, it's probably the biggest disagreement here, the one that keeps
>> things from progressing.
> To address this point, I've spent some time playing with the patch to see
> how it would look if the compression logic did not incorporate everything
> else. So far I've come up with the buffer juggling that you can find in the
> attachments. It's based on v11 (I'm not able to keep up with Konstantin),
> doesn't change all the relevant code (mostly the stuff around secure
> read/write), and most likely misses a lot of details, but at least it works
> as expected when testing manually with psql (tests look broken with the
> original v11, so I haven't tried to run them with the attached patch
> either). I wonder if this approach could be more natural for the feature?
>
> Also I have few questions/commentaries:
>
> * in this discussion a situation was mentioned a few times where the
> compression logic needs to read more data to decompress the current buffer.
> Do I understand correctly that it's related to Z_BUF_ERROR, when zlib
> couldn't make any progress and needs more data? If yes, where is it handled?
>
> * one of the points was that the code would have to be copy-pasted between
> be/fe for a proper separation, but judging from the attached experimental
> patch it doesn't look like a lot of code would have to be duplicated.
>
> * I've noticed a problem in v11: an attempt to connect to a db without
> compression specified failed with the error "server is not supported
> requested compression algorithm", e.g. when I'm just executing createdb.
>
> I haven't looked at the negotiation logic yet, hope to do that soon.

First of all, thank you for attempting to push this patch forward, because
there really does seem to be some disagreement that blocks its progress.
Unfortunately the first reviewer (Robbie Harwood) thinks that my approach
causes a layering violation and should be done in some other way. Robbie
suggested several times that I look at "how the buffering in TLS or GSSAPI
works" and do it in a similar way. Frankly speaking, I do not see any
principal difference.

Since the precise amount of data required by the decompression algorithm to
produce some output is not known in advance, it is natural (from my point of
view) to pass the tx/rx functions to the stream compressor implementation
and let it read or write data itself. This also makes the streaming
compression usable not only with libpq, but with any other data stream.
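
To illustrate the idea, here is a simplified sketch of such an interface
(the names are modeled on my patch, but treat this as an outline rather
than the exact code):

    #include <stddef.h>
    #include <sys/types.h>

    /* Raw I/O callbacks supplied by the caller (socket read/write). */
    typedef ssize_t (*zpq_tx_func)(void *arg, void const *data, size_t size);
    typedef ssize_t (*zpq_rx_func)(void *arg, void *data, size_t size);

    typedef struct ZpqStream ZpqStream;

    /* The stream owns its buffers and keeps the callbacks... */
    extern ZpqStream *zpq_create(zpq_tx_func tx_func, zpq_rx_func rx_func,
                                 void *arg);

    /*
     * ...so zpq_read can call rx_func itself, as many times as needed,
     * until the decompressor produces at least one byte of output, and
     * zpq_write can flush compressed data through tx_func.
     */
    extern ssize_t zpq_read(ZpqStream *zs, void *buf, size_t size);
    extern ssize_t zpq_write(ZpqStream *zs, void const *buf, size_t size);

With this shape the caller never has to know how many raw bytes the
decompressor must consume before it can return anything, which is exactly
the information we do not have.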

Right now we are reading data in two places: in the frontend and in the
backend. By passing the tx/rx functions to the compression stream
implementation we avoid code duplication. In your implementation the
zpq_read + zpq_read_drain pair is called twice, in pqsecure_read and in
secure_read. Moreover, please notice that your implementation still passes
the tx/rx functions to the stream constructor, so zpq_read is still able to
read data itself. So I do not understand which problem you have solved by
replacing zpq_read with the zpq_read_drain + zpq_read pair.

If we are speaking about layering, then from my point of view it is not a
good idea to let the secure_read function perform decompression as well. At
the very least, the name would be confusing.

Answering your questions:

1. When the decompressor doesn't have enough data to produce any extra
output, it doesn't return an error. It just doesn't advance the output
position in the buffer. In my implementation (and actually in yours as
well, because you kept this code), this is done in the zpq_read function
itself:

    if (out.pos != 0)
    {
        /* If we have some decompressed data, then we return immediately */
        zs->rx_total_raw += out.pos;
        return out.pos;
    }
    if (zs->rx.pos == zs->rx.size)
    {
        zs->rx.pos = zs->rx.size = 0;   /* Reset rx buffer */
    }
    /* Otherwise we try to fetch more data using the rx function */
    rc = zs->rx_func(zs->arg, (char *) zs->rx.src + zs->rx.size,
                     ZSTD_BUFFER_SIZE - zs->rx.size);
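
The zlib case is analogous; Z_BUF_ERROR is not a fatal error, it only means
"no progress was possible right now". A sketch of the equivalent check
(zs->z and ZLIB_BUFFER_SIZE are illustrative names here, not necessarily
the patch's):

    int rc = inflate(&zs->z, Z_NO_FLUSH);

    if (rc == Z_OK || rc == Z_BUF_ERROR)
    {
        size_t produced = ZLIB_BUFFER_SIZE - zs->z.avail_out;

        if (produced != 0)
            return produced;    /* return whatever was decompressed */

        /*
         * No output yet: fall through and fetch more input via rx_func,
         * exactly as in the zstd branch above.
         */
    }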

2. Code duplication is always bad, no matter how much code is copied.
Frankly speaking, I think the duplication of I/O code between the backend
and the frontend is one of the most awful parts of Postgres. It is always
better to avoid duplication whenever possible.

3. Sorry, that was indeed a bug in version 11 of the patch; it is fixed in
version 12.

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
