Avoid detoast overhead when possible

From: zhihuifan1213(at)163(dot)com
To: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Avoid detoast overhead when possible
Date: 2023-12-04 06:37:02
Message-ID: 87ttoyihgm.fsf@163.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers


Currently our code can do lazily detoast by design, for example:

SELECT toast_col FROM t;
SELECT toast_col FROM t ORDER BY b;
SELECT toast_col FROM t join t2 using(c);

it is only detoast at {type}_out function. The benefits includes:
1. The life time of detoast datum is pretty short which is good for
general memory usage.
2. In the order by / hash case, the less memory usage can let the
work_mem hold more tuples so it is good for performance aspect.

Recently I run into a user case like this:

create table b(big jsonb);
...
select big->'1', big->'2', big->'3', big->'5', big->'10' from b;

In the above query, we can see the 'big' datum is detoasted 5 times, and
if the toast value is huge, it causes a pretty bad performance. jsonb
will be a common case to access the toast value multi times, but it
is possible for other data type as well. for example:

SELECT f1(big_toast_col), f2(big_toast_col) FROM t;

I attached a POC patch which eagerly detoast the datum during
EEOP_INNER/OUTER/SCAN_VAR step and store the detoast value back to the
original slot->tts_values, so the later call of slot->tts_values[n] will
use the detoast value automatically. With the attached setup.sql and
the patch, the performance is easy to reduced to 310ms from 1600ms.

select big->'1', big->'2', big->'3', big->'5', big->'10' from b;
QUERY PLAN
---------------------------------------------------------------
Seq Scan on b (actual time=1.731..1577.911 rows=1001 loops=1)
Planning Time: 0.099 ms
Execution Time: 1578.411 ms
(3 rows)

set jit to off;

select big->'1', big->'2', big->'3', big->'5', big->'10' from b;
QUERY PLAN
--------------------------------------------------------------
Seq Scan on b (actual time=0.417..309.937 rows=1001 loops=1)
Planning Time: 0.097 ms
Execution Time: 310.255 m

(I used 'jit=off' to turn on this feature just because I'm still not
ready for JIT code.)

However this patch just throws away almost all the benefits of toast, so
how can we draw a line between should vs should not do this code path?
IMO, we should only run the 'eagerly detoast' when we know that we will
have a FuncCall against the toast_col on the current plan node. I think
this information can be get from Qual and TargetList. If so, we can set
the slot->detoast_attrs accordingly.

if we code like this:

SELECT f1(toast_col) FROM t join t2 using(c);

We only apply the code path on the join plan node, so even the join method
is hash / sort merge, the benefit of toast is still there.

'SELECT f1(toast_col) FROM t;' will apply this code path, but nothing
gain and nothing lost. Applying this code path only when the toast
datum is accessed 1+ times needs some extra run-time effort. I don't
implement this so far, I'd like to see if I miss some obvious points.
Any feedback is welcome.

Attachment Content-Type Size
v1-0001-A-PoC-code-to-show-the-benefit-of-eagerly-detoast.patch text/x-diff 4.3 KB
setup.sql application/sql 235 bytes
unknown_filename text/plain 28 bytes

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Pavel Stehule 2023-12-04 06:41:58 proposal: plpgsql - OPEN LOCAL statement
Previous Message Sutou Kouhei 2023-12-04 06:35:48 Make COPY format extendable: Extract COPY TO format implementations