Re: [proposal] de-TOAST'ing using a iterator

From: John Naylor <john(dot)naylor(at)2ndquadrant(dot)com>
To: Binguo Bao <djydewang(at)gmail(dot)com>
Cc: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Andrey Borodin <x4mmm(at)yandex-team(dot)ru>, Atri Sharma <atri(dot)jiit(at)gmail(dot)com>, Stephen Frost <sfrost(at)snowman(dot)net>, Владимир Лесков <vladimirlesk(at)yandex-team(dot)ru>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [proposal] de-TOAST'ing using a iterator
Date: 2019-07-15 11:20:14
Message-ID: CACPNZCu3+rTURQZ9BprT4pnP8b0KRX0U3kS-GhYz8WTqcA=bZw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, Jun 19, 2019 at 8:51 PM Binguo Bao <djydewang(at)gmail(dot)com> wrote:
> [v4 patch]

Hi Binguo,

I can verify I get no warnings with the v4 patch. I've done some
additional performance testing. First, to sum up your results:

> insert into detoast_c (a) select repeat('1234567890-=abcdefghijklmnopqrstuvwxyz', 1000000)||'321' as a from generate_series(1,100);

When the search pattern was at the beginning, the patch was several
times faster , and when the pattern was at the end, it was 3% slower
when uncompressed and 9% slower when compressed.

First, I'd like to advocate for caution when using synthetic
benchmarks involving compression. Consider this test:

insert into detoast_c (a)
select
'abc'||
repeat(
(SELECT string_agg(md5(chr(i)), '')
FROM generate_series(1,127) i)
, 10000)
||'xyz'
from generate_series(1,100);

The results for the uncompressed case were not much different then
your test. However, in the compressed case the iterator doesn't buy us
much with beginning searches since full decompression is already fast:

master patch
comp. beg. 869ms 837ms
comp. end 14100ms 16100ms
uncomp. beg. 6360ms 800ms
uncomp. end 21100ms 21400ms

and with compression it's 14% slower searching to the end. This is
pretty contrived, but I include it for demonstration.

To test something hopefully a bit more realistic, I loaded 100 records
each containing the 1995 CIA fact book (~3MB of ascii) with a pattern
string put at the beginning and end. For the end search, I used a
longer needle to speed up the consumption of text, hoping to put more
stress on the detoasting algorithms, for example:

select max(position('abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'
in a)) from detoast_*;

comp. beg. 836ms 22ms
comp. end 1510ms 1700ms
uncomp. beg. 185ms 12ms
uncomp. end 851ms 903ms

Here, the "beginning" case is ~15-35x faster, which is very impressive
and much faster than with your generated contents. The "end" case is
up to 13% slower. It would be interesting to see where the break-even
point is, where the results are the same.

Reading the thread where you're working on optimizing partial
decompression [1], it seems you have two separate solutions for the
two problems. Maybe this is fine, but I'd like to bring up the
possibility of using the same approach for both kinds of callers.

I'm not an expert on TOAST, but maybe one way to solve both problems
is to work at the level of whole TOAST chunks. In that case, the
current patch would look like this:

1. The caller requests more of the attribute value from the de-TOAST iterator.
2. The iterator gets the next chunk and either copies or decompresses
the whole chunk into the buffer. (If inline, just decompress the whole
thing)

This seems simpler and also easy to adapt to callers that do know how
big a slice they want. I also suspect this way would be easier to
adapt to future TOAST formats not tied to heap or to a certain
compression algorithm. With less bookkeepping overhead, maybe there'll
be less worst-case performance degradation, while not giving up much
in the best case. (Note also that commit 9556aa01c6 already introduced
some performance degradation in near-end searches, when using
multibyte strings. This patch would add to that.) The regression
doesn't seem large, but I see more than your test showed, and it would
be nice to avoid it.

Thoughts, anyone?

[1] https://www.postgresql.org/message-id/flat/CAL-OGkux7%2BBm_J%3Dt5VpH7fJGGSm%2BPxWJtgs1%2BWU2g6cmLru%3D%3DA%40mail.gmail.com#705d074aa4ae305ed3d992b7e5b7af3c

--
John Naylor https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Antonin Houska 2019-07-15 12:38:42 Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)
Previous Message Antonin Houska 2019-07-15 11:10:10 Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)