Re: Collecting statistics about contents of JSONB columns

From: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
To: Simon Riggs <simon(dot)riggs(at)enterprisedb(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Nikita Glukhov <n(dot)gluhov(at)postgrespro(dot)ru>, Mahendra Thalor <mahendra(dot)thalor(at)enterprisedb(dot)com>, Oleg Bartunov <obartunov(at)postgrespro(dot)ru>
Subject: Re: Collecting statistics about contents of JSONB columns
Date: 2022-01-06 19:48:39
Message-ID: 30a1fbe7-b7c5-e7df-26c2-ff06b45d1da9@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 1/5/22 21:22, Simon Riggs wrote:
> On Fri, 31 Dec 2021 at 22:07, Tomas Vondra
> <tomas(dot)vondra(at)enterprisedb(dot)com> wrote:
>
>> The patch does something far more
>> elegant - it simply uses stavalues to store an array of JSONB documents,
>> each describing stats for one path extracted from the sampled documents.
>
> Sounds good.
>
>> I'm sure there's plenty open questions - for example I think we'll need
>> some logic to decide which paths to keep, otherwise the statistics can
>> get quite big, if we're dealing with large / variable documents. We're
>> already doing something similar for MCV lists.
>>
>> One of Nikita's patches not included in this thread allow "selective"
>> statistics, where you can define in advance a "filter" restricting which
>> parts are considered interesting by ANALYZE. That's interesting, but I
>> think we need some simple MCV-like heuristics first anyway.
>>
>> Another open question is how deep the stats should be. Imagine documents
>> like this:
>>
>> {"a" : {"b" : {"c" : {"d" : ...}}}}
>>
>> The current patch build stats for all possible paths:
>>
>> "a"
>> "a.b"
>> "a.b.c"
>> "a.b.c.d"
>>
>> and of course many of the paths will often have JSONB documents as
>> values, not simple scalar values. I wonder if we should limit the depth
>> somehow, and maybe build stats only for scalar values.
>
> The user interface for designing filters sounds hard, so I'd hope we
> can ignore that for now.
>

Not sure I understand. I wasn't suggesting any user-defined filtering,
but something done by default, similarly to what we do for regular MCV
lists, based on frequency. We'd include frequent paths while excluding
rare ones.

So no need for a user interface.

That might not work for documents with stable schema and a lot of
top-level paths, because all the top-level paths have 1.0 frequency. But
for documents with dynamic schema (different documents having different
schemas/paths) it might help.

Similarly for the non-scalar values - I don't think we can really keep
regular statistics on such values (for the same reason why it's not
enough for whole JSONB columns), so why to build/store that anyway.

Nikita did implement a way to specify custom filters using jsonpath, but
I did not include that into this patch series. And questions regarding
the interface were one of the reasons.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2022-01-06 19:58:31 Re: Bugs in pgoutput.c
Previous Message Tomas Vondra 2022-01-06 19:26:43 Re: Collecting statistics about contents of JSONB columns