Re: Making jsonb_agg() faster

From: Merlin Moncure <mmoncure(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: Making jsonb_agg() faster
Date: 2025-07-22 17:43:31
Message-ID: CAHyXU0zOLCn_6NXg0ooEm3pijFxZaaiN9A=W4AfFvY4+n1GG3A@mail.gmail.com
Lists: pgsql-hackers

On Tue, Jul 22, 2025 at 10:37 AM Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

> There have been some complaints recently about how jsonb_agg()
> is a lot slower than json_agg() [1]. That's annoying considering
> that the whole selling point of JSONB is to have faster processing
> than the original JSON type, so I poked into that. What I found
> is that jsonb_agg() and its variants are just really inefficiently
> implemented. Basically, for each aggregate input value, they will:
>
> 1. Build a JsonbValue tree representation of the input value.
> 2. Flatten the JsonbValue tree into a Jsonb in on-disk format.
> 3. Iterate through the Jsonb, building a JsonbValue that is part
> of the aggregate's state stored in aggcontext, but is otherwise
> identical to what phase 1 built.
>
> The motivation for this seems to have been to make sure that any
> memory leakage during phase 1 does not happen in the long-lived
> aggcontext. But it's hard not to call it a Rube Goldberg contraption.
>
> The attached patch series gets rid of phases 2 and 3 by refactoring
> pushJsonbValue() and related functions so that the JsonbValue tree
> they construct can be constructed in a context that's not
> CurrentMemoryContext. With that and some run-of-the-mill optimization
> work, I'm getting 2.5X speedup for jsonb_agg on a text column (as
> measured by the attached test script) and a bit over 2X on an int8
> column. It's still a little slower than json_agg, but no longer
> slower by integer multiples.
>
> 0001 is a somewhat invasive refactoring of the API for
> pushJsonbValue and friends. It doesn't in itself have any
> measurable speed consequences as far as I can tell, but I think
> it makes the code nicer in any case. (I really do not like the
> existing coding setup where sometimes it's important to capture
> the result of pushJsonbValue and sometimes it's not; that
> seems awfully confusing and bug-prone.) The real point though
> is to have a way of commanding pushJsonbValue to build the
> JsonbValue tree somewhere other than CurrentMemoryContext.
>
> Having laid the groundwork with 0001, 0002 simply amounts to
> telling pushJsonbValue to put its handiwork in the aggcontext
> and then ripping out phases 2 and 3 of the aggregate transfns.
>
> 0003 is some simple micro-optimization of the datatype conversion
> code in datum_to_jsonb_internal.
>
> Thoughts?
>

Really excited about this -- I'll do some testing. Serialization
performance (generating json output from non-json data) is the main reason
I still recommend json; jsonb is better in most other cases, but
serialization performance is extremely important.
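
For anyone who wants to reproduce the kind of numbers Tom is quoting, a
throwaway benchmark along these lines is roughly what I'll run (the table,
column names, and row count here are just placeholders, not Tom's attached
script):

-- synthetic data: an int8 column and a text column (sizes are arbitrary)
create temp table agg_bench as
  select i::int8 as id, md5(i::text) as t
  from generate_series(1, 1000000) i;

\timing on
select length(json_agg(t)::text)   from agg_bench;  -- json, text column
select length(jsonb_agg(t)::text)  from agg_bench;  -- jsonb, text column
select length(json_agg(id)::text)  from agg_bench;  -- json, int8 column
select length(jsonb_agg(id)::text) from agg_bench;  -- jsonb, int8 column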

The other reason the json type is still needed is the ability to serialize WYSIWYG:

superuser(at)postgres=# select to_json(q), to_jsonb(q) from (select 1 as b, 2 as a, 3 as b) q;
       to_json       |     to_jsonb
---------------------+------------------
 {"b":1,"a":2,"b":3} | {"a": 2, "b": 3}

jsonb output can be obnoxious here for aesthetic reasons, as well as for
the less common technical ones where key order is important to the
receiving processor. Point being, if there were some way to do that through
the jsonb interface with good performance, the need for the json type (or
at least for an independent implementation) would completely evaporate.

The only hypothetical use case for legacy json is precise storage, but that
can be worked around with TEXT.
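
By "worked around with TEXT" I mean something like the sketch below (the
table and column names are made up, and the IS JSON predicate needs v16 or
later):

-- keep the payload byte-for-byte as it arrived, but still reject invalid JSON
create table raw_docs (
    id  bigint generated always as identity primary key,
    doc text not null check (doc is json)  -- validates, does not normalize
);

insert into raw_docs (doc) values ('{"b":1,"a":2,"b":3}');

-- cast to jsonb only at query time, when normalization is acceptable
select doc::jsonb ->> 'a' from raw_docs;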

merlin
