Re: Collecting statistics about contents of JSONB columns

From: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
To: Nikita Glukhov <n(dot)gluhov(at)postgrespro(dot)ru>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Mahendra Thalor <mahendra(dot)thalor(at)enterprisedb(dot)com>, Oleg Bartunov <obartunov(at)postgrespro(dot)ru>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Subject: Re: Collecting statistics about contents of JSONB columns
Date: 2022-01-24 22:19:49
Message-ID: b3919391-4f2d-6c00-0005-2fbbaae65b63@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 1/23/22 01:24, Nikita Glukhov wrote:
> Hi!
>
> I am glad that you found my very old patch interesting and started to
> work on it. We failed to post it in 2016 mostly because we were not
> satisfied with JSONB storage. Also we decided to wait for completion
> of work on extended statistics as we thought that it could help us.
> But in early 2017 we switched to SQL/JSON and forgot about this patch.
>

Understood. Let's see how feasible this idea is and if we can move this
forward.

>
> I think custom datatype is necessary for better performance. With a
> plain JSONB we need to do a lot of work for extraction of path stats:
> - iterate through MCV/Histogram JSONB arrays
> - cast numeric values to float, string to text etc.
> - build PG arrays from extracted datums
> - form pg_statistic tuple.
>
> With a custom data type we could store pg_statistic tuple unmodified
> and use it without any overhead. But then we need modify a bit
> VariableStatData and several functions to pass additional nullfrac
> corrections.
>

I'm not against evaluating/exploring alternative storage formats, but my
feeling is the real impact on performance will be fairly low. At least I
haven't seen this as very expensive while profiling the patch so far. Of
course, I may be wrong, and it may be more significant in some cases.

> Maybe simple record type (path text, data pg_statistic, ext_data jsonb)
> would be enough.
>
Maybe, but then you still need to store a bunch of those, right? So
either an array (likely toasted) or 1:M table. I'm not sure it's goiing
to be much cheaper than JSONB.

I'd suggest we focus on what we need to store first, which seems like
tha primary question, and worry about the exact storage format then.

> Also there is an idea to store per-path separately in pg_statistic_ext
> rows using expression like (jb #> '{foo,bar}') as stxexprs. This could
> also help user to select which paths to analyze simply by using some
> sort of CREATE STATISTICS. But it is really unclear how to:
> * store pg_statistic_ext rows from typanalyze
> * form path expressions for array elements (maybe add new jsonpath
> operator)
> * match various -> operator chains to stxexprs
> * jsonpath-like languages need simple API for searching by stxexprs
>

Sure, you can do statistics on expressions, right? Of course, if that
expression produces JSONB value, that's not very useful at the moment.
Maybe we could have two typanalyze functions - one for regular analyze,
one for extended statistics?

That being said, I'm not sure extended stats are a good match for this.
My feeling was we should collect these stats for all JSONB columns,
which is why I argued for putting that in pg_statistic.

>
>
> Per-path statistics should only be collected for scalars. This can be
> enabled by flag JsonAnalyzeContext.scalarsOnly. But there are is a
> problem: computed MCVs and histograms will be broken and we will not be
> able to use them for queries like (jb > const) in general case. Also
> we will not be and use internally in scalarineqsel() and var_eq_const()
> (see jsonSelectivity()). Instead, we will have to implement own
> estimator functions for JSONB comparison operators that will correctly
> use our hacked MCVs and histograms (and maybe not all cases will be
> supported; for example, comparison to scalars only).
>

Yeah, but maybe that's an acceptable trade-off? I mean, if we can
improve estimates for most clauses, and there's a some number of clauses
that are estimated just like without stats, that's still an improvement,
right?

> It's possible to replace objects and arrays with empty ones when
> scalarsOnly is set to keep correct frequencies of non-scalars.
> But there is an old bug in JSONB comparison: empty arrays are placed
> before other values in the JSONB sort order, although they should go
> after all scalars. So we need also to distinguish empty and non-empty
> arrays here.
>

Hmmm ...
>
>
> I tried to fix a major part of places marked as XXX and FIXME, the new
> version of the patches is attached. There are a lot of changes, you
> can see them in a step-by-step form in the corresponding branch
> jsonb_stats_20220122 in our GitHub repo [1].
>

Thanks! I'll go through the changes soon.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Stephen Frost 2022-01-24 22:21:40 Re: CREATEROLE and role ownership hierarchies
Previous Message Andrew Dunstan 2022-01-24 21:47:28 Re: fairywren is generating bogus BASE_BACKUP commands