Re: Shared detoast Datum proposal

From: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
To: Andy Fan <zhihuifan1213(at)163(dot)com>
Cc: Nikita Malakhov <hukutoc(at)gmail(dot)com>, Michael Zhilin <m(dot)zhilin(at)postgrespro(dot)ru>, Peter Smith <smithpb2250(at)gmail(dot)com>, vignesh C <vignesh21(at)gmail(dot)com>, Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: Shared detoast Datum proposal
Date: 2024-03-04 16:06:23
Message-ID: a70cfbef-45ae-4056-9c3d-cc6191ebc001@enterprisedb.com
Lists: pgsql-hackers

On 3/4/24 02:23, Andy Fan wrote:
>
> Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com> writes:
>
>> On 2/26/24 16:29, Andy Fan wrote:
>>>
>>> ...>
>>> I can understand the benefits of the TOAST cache, but the following
>>> issues look problematic to me, IIUC:
>>>
>>> 1). We can't put the result into tts_values[*], since without the
>>> planner's decision we don't know if this would break the small_tlist
>>> logic. But if we put it into the cache only, the overhead of a cache
>>> lookup is unavoidable.
>>
>> True - if you're comparing having the detoasted value in tts_values[*]
>> directly with having to do a lookup in a cache, then yes, there's a bit
>> of an overhead.
>>
>> But I think from the discussion it's clear having to detoast the value
>> into tts_values[*] has some weaknesses too, in particular:
>>
>> - It requires deciding which attributes to detoast eagerly, which is
>> quite invasive (having to walk the plan, ...).
>>
>> - I'm sure there will be cases where we choose not to detoast even
>> though it would be beneficial.
>>
>> - Detoasting just the initial slices does not seem compatible with this.
>>
>> IMHO the overhead of the cache lookup would be negligible compared to
>> repeatedly detoasting the value (which is the current baseline). I
>> somewhat doubt the difference compared to tts_values[*] will even be
>> measurable.
>>
>>>
>>> 2). It is hard to decide which entry should be evicted without
>>> attaching it to the TupleTableSlot's life cycle; without that, we
>>> can't guarantee the entry we keep is the one we will reuse soon.
>>>
>>
>> True. But is that really a problem? I imagined we'd set some sort of
>> memory limit on the cache (work_mem?), and evict the oldest entries.
>> So the entries would eventually get evicted, and the memory limit
>> would ensure we don't consume arbitrary amounts of memory.
>>
>
> Here is a copy from the shared_detoast_datum.org in the patch. I feel
> that when / which entry to free is the key problem, while the run-time
> (detoast_attr) overhead vs. the createplan.c overhead is a small
> difference. The overhead I paid in createplan.c/setref.c does not look
> huge either.

I decided to whip up a PoC patch implementing this approach, to get a
better idea of how it compares to the original proposal, both in terms
of performance and code complexity.

Attached are two patches that both add a simple cache in detoast.c, but
differ in when exactly the caching happens.

toast-cache-v1 - caching happens in toast_fetch_datum, i.e. before
decompression

toast-cache-v2 - caching happens in detoast_attr, after decompression

I started with v1, but it did not help with the example workloads (from
the original post) at all. Only after I changed it to cache decompressed
data (in v2) did it become competitive.
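
To make the difference concrete, here is roughly where the two hooks
would sit in detoast_attr() - a sketch only, with helper names
(toast_cache_lookup, toast_cache_insert) that are mine, not necessarily
what the attached patches use:

    struct varlena *
    detoast_attr(struct varlena *attr)
    {
        if (VARATT_IS_EXTERNAL_ONDISK(attr))
        {
            struct varlena *toastptr = attr;   /* TOAST pointer = cache key */
            struct varlena *cached;

            /* v2: a hit skips both the I/O and the decompression */
            if (enable_toast_cache &&
                (cached = toast_cache_lookup(toastptr)) != NULL)
                return cached;

            attr = toast_fetch_datum(attr);    /* v1 caches here, compressed */

            if (VARATT_IS_COMPRESSED(attr))
                attr = toast_decompress_datum(attr);

            if (enable_toast_cache)
                toast_cache_insert(toastptr, attr);  /* v2 caches decompressed */
        }

        /* ... handling of indirect / expanded / short varlenas unchanged ... */

        return attr;
    }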

There's a GUC to enable the cache (enable_toast_cache, off by default),
to make experimentation easier.
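
For an in-core GUC that would presumably be an entry in the
ConfigureNamesBool array in guc_tables.c, roughly like this (the group
and description are my guesses, not necessarily what the attached
patches use):

    {
        {"enable_toast_cache", PGC_USERSET, QUERY_TUNING_METHOD,
         gettext_noop("Enables caching of detoasted TOAST values."),
         NULL},
        &enable_toast_cache,
        false,                          /* off by default */
        NULL, NULL, NULL
    },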

The cache is invalidated at the end of a transaction - I think this
should be OK, because within the transaction rows may be deleted, but
they can't be cleaned up (or replaced with a new TOAST value). This
means the cache could cover multiple queries - not sure if that's a good
or a bad thing.
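
One simple way to get exactly that lifetime would be to key the cache by
the (toastrelid, valueid) pair from the TOAST pointer and allocate the
hash table in TopTransactionContext, so all entries go away at
commit/abort automatically. A minimal sketch (struct and field names are
mine, not necessarily the patch's):

    typedef struct toast_cache_key
    {
        Oid         toastrelid;    /* va_toastrelid of the TOAST pointer */
        Oid         valueid;       /* va_valueid of the TOAST pointer */
    } toast_cache_key;

    typedef struct toast_cache_entry
    {
        toast_cache_key key;
        struct varlena *data;      /* detoasted (decompressed) copy */
    } toast_cache_entry;

    static HTAB *toast_cache = NULL;

    static void
    toast_cache_init(void)
    {
        HASHCTL     ctl;

        ctl.keysize = sizeof(toast_cache_key);
        ctl.entrysize = sizeof(toast_cache_entry);
        ctl.hcxt = TopTransactionContext;  /* freed at end of transaction */

        toast_cache = hash_create("toast cache", 128, &ctl,
                                  HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
    }

The toast_cache pointer itself would still need to be reset at
transaction end (e.g. from a RegisterXactCallback() callback), otherwise
we'd be left with a dangling pointer.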

I haven't implemented any sort of cache eviction. I agree that's a hard
problem in general, but I doubt we can do better than LRU. I also didn't
implement any memory limit.
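
For the record, if we did add a limit, the obvious recipe would be an
LRU list next to the hash table plus a running byte count - again just a
sketch, assuming an extra lru_node field in each cache entry:

    static dlist_head  toast_cache_lru;      /* MRU entries at the head */
    static Size        toast_cache_bytes = 0;

    /* on every cache hit, mark the entry as most-recently-used */
    static void
    toast_cache_touch(toast_cache_entry *entry)
    {
        dlist_move_head(&toast_cache_lru, &entry->lru_node);
    }

    /* after inserting, evict from the tail until under the limit */
    static void
    toast_cache_enforce_limit(Size limit)
    {
        while (toast_cache_bytes > limit &&
               !dlist_is_empty(&toast_cache_lru))
        {
            toast_cache_entry *victim =
                dlist_container(toast_cache_entry, lru_node,
                                dlist_tail_node(&toast_cache_lru));

            dlist_delete(&victim->lru_node);
            toast_cache_bytes -= VARSIZE_ANY(victim->data);
            pfree(victim->data);
            hash_search(toast_cache, &victim->key, HASH_REMOVE, NULL);
        }
    }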

FWIW I think it's a fairly serious issue that Andy's approach does not
have any way to limit the memory. It detoasts the values eagerly, but
what if the rows have multiple large values? What if we end up keeping
multiple such rows (from different parts of the plan) at once?

I also haven't implemented caching for sliced data. I think modifying
the code to support this should not be difficult - it'd require tracking
how much of the data we have in the cache, and considering that during
lookup and when adding entries to the cache.
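
E.g. each entry could remember how many leading bytes of the value it
holds, and a slice request would be a hit only if fully covered - a
sketch, assuming an extra cached_len field in the entry from the earlier
sketch, and an assumed copy_slice helper:

    /* a slice request is a cache hit only if the cached prefix covers it */
    static struct varlena *
    toast_cache_lookup_slice(toast_cache_key *key,
                             int32 sliceoffset, int32 slicelength)
    {
        toast_cache_entry *entry;

        entry = hash_search(toast_cache, key, HASH_FIND, NULL);

        if (entry == NULL ||
            (int64) sliceoffset + slicelength > entry->cached_len)
            return NULL;    /* miss, fall back to toast_fetch_datum_slice */

        return copy_slice(entry->data, sliceoffset, slicelength);
    }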

I've implemented the cache as "global". Maybe it should be tied to the
query or executor state, but then it'd be harder to access from
detoast.c (we'd have to pass the cache pointer in some way, and it
wouldn't work for use cases that don't go through the executor).

I think implementation-wise this is pretty non-invasive.

Performance-wise, these patches are slower than Andy's patch. For
example, for the jsonb Q1 I see master at ~500ms, Andy's patch at ~100ms
and v2 at ~150ms. I'm sure there are a number of optimization
opportunities that could get v2 closer to the 100ms.

v1 is not competitive at all in this jsonb/Q1 test - it's just as slow
as master, because the data set is small enough to be fully cached, so
there's no I/O and decompression is the actual bottleneck.

That being said, I'm not convinced v1 would be a bad choice for some
cases. If there's enough memory pressure to evict TOAST data, it's quite
possible I/O would become the bottleneck. At that point it might be a
good trade-off to cache compressed data, because then we could fit more
values into the cache.

OTOH it's likely that access to TOAST values is localized (in the
temporal sense), i.e. we read a value from disk, detoast it a couple of
times in a short interval, and then move on to the next row. That'd mean
v2 is probably the better trade-off.

Not sure.

>
> """
> An alternative design: toast cache
> ----------------------------------
>
> This method was proposed by Tomas during the review process. IIUC, it
> would maintain a local HTAB which maps a toast datum to a detoasted
> datum, with the entry maintained / used in the detoast_attr
> function. With this method the overall design is pretty clear, and the
> code modifications can be confined to the toasting system only.
>

Right.

> I assumed that releasing all of the memory at the end of the executor
> is not an option since it may consume too much memory. Then, when and
> which entry to release becomes a problem for me. For example:
>
>           QUERY PLAN
> ------------------------------
>  Nested Loop
>    Join Filter: (t1.a = t2.a)
>    ->  Seq Scan on t1
>    ->  Seq Scan on t2
> (4 rows)
>
> In this case t1.a needs a longer lifespan than t2.a, since it is in
> the outer relation. Without help from the slot's life-cycle system, I
> can't come up with an answer to the above question.
>

This is true, but how likely are such plans? I mean, surely no one would
do a nested loop with sequential scans on reasonably large tables, so
how representative is this example?

Also, this leads me to the need for some sort of memory limit. If we may
be keeping entries for extended periods of time, and we don't have any
way to limit the amount of memory, that does not seem great.

AFAIK if we detoast everything into tts_values[] there's no way to
implement and enforce such a limit. What happens if there's a row with
multiple large-ish TOAST values? What happens if those rows are in
different (and distant) parts of the plan?

It seems far easier to limit the memory with the toast cache.

> Another difference between the two methods is that my method has many
> modifications in createplan.c/setref.c/execExpr.c/execExprInterp.c, but
> it can save some run-time effort, like the hash_search find / enter of
> method 2, since I put the values directly into tts_values[*].
>
> I'm not sure factor 2 makes a real, measurable difference in practice,
> so my current concern mainly comes from factor 1.
> """
>

This seems a bit dismissive of the (possible) issue. It'd be good to do
some measurements, especially on simple queries that can't benefit from
the detoasting (e.g. because there's no varlena attribute).

In any case, my concern is more about having to do this when creating
the plan at all, the code complexity, etc. - not just the possible
performance impact.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment Content-Type Size
toast-cache-v1.patch text/x-patch 6.4 KB
toast-cache-v2.patch text/x-patch 7.4 KB
