Re: Collecting statistics about contents of JSONB columns

From: Mahendra Singh Thalor <mahi6run(at)gmail(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
Cc: Nikita Glukhov <n(dot)gluhov(at)postgrespro(dot)ru>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Mahendra Thalor <mahendra(dot)thalor(at)enterprisedb(dot)com>, Oleg Bartunov <obartunov(at)postgrespro(dot)ru>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Subject: Re: Collecting statistics about contents of JSONB columns
Date: 2022-01-25 16:56:01
Message-ID: CAKYtNArmMg=cU6SogL6u-b21Gx=rMoR9G1ah6kLANRw7LhHukA@mail.gmail.com
Lists: pgsql-hackers

On Tue, 25 Jan 2022 at 03:50, Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
wrote:
>
> On 1/23/22 01:24, Nikita Glukhov wrote:
> > Hi!
> >
> > I am glad that you found my very old patch interesting and started to
> > work on it. We failed to post it in 2016 mostly because we were not
> > satisfied with JSONB storage. Also we decided to wait for completion
> > of work on extended statistics as we thought that it could help us.
> > But in early 2017 we switched to SQL/JSON and forgot about this patch.
> >
>
> Understood. Let's see how feasible this idea is and if we can move this
> forward.
>
> >
> > I think a custom datatype is necessary for better performance. With a
> > plain JSONB we need to do a lot of work for extraction of path stats:
> > - iterate through MCV/Histogram JSONB arrays
> > - cast numeric values to float, string to text etc.
> > - build PG arrays from extracted datums
> > - form pg_statistic tuple.
> >
> > With a custom data type we could store pg_statistic tuple unmodified
> > and use it without any overhead. But then we need to modify
> > VariableStatData and several functions a bit to pass additional
> > nullfrac corrections.
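
For illustration only: in SQL terms, the per-estimation overhead being
described with plain-JSONB storage is roughly the following; the table
layout and key names here are hypothetical, not taken from the patch.

    -- Unpack a stored MCV list from a jsonb blob, casting each value and
    -- rebuilding a Postgres array before it can be used (hypothetical
    -- layout; the patch would do the equivalent work in C):
    SELECT array_agg((elem ->> 'value')::float8)
    FROM jsonb_array_elements(
           (SELECT stats -> 'mcv'
              FROM path_stats          -- hypothetical storage table
             WHERE path = '$.foo.bar')) AS elem;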
> >
>
> I'm not against evaluating/exploring alternative storage formats, but my
> feeling is the real impact on performance will be fairly low. At least I
> haven't seen this as very expensive while profiling the patch so far. Of
> course, I may be wrong, and it may be more significant in some cases.
>
> > Maybe a simple record type (path text, data pg_statistic, ext_data jsonb)
> > would be enough.
> >
> Maybe, but then you still need to store a bunch of those, right? So
> either an array (likely toasted) or 1:M table. I'm not sure it's going
> to be much cheaper than JSONB.
>
> I'd suggest we focus on what we need to store first, which seems like
> the primary question, and worry about the exact storage format then.
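
For concreteness, the record type being discussed might look like the
following (untested sketch; the type name is made up, and embedding
pg_statistic's rowtype may be rejected as-is, since it contains anyarray
pseudo-type columns):

    -- Untested sketch of the suggested record type:
    CREATE TYPE jsonb_path_stats AS (
        path     text,
        data     pg_statistic,   -- pg_statistic's composite rowtype; likely
                                 -- needs relaxed pseudo-type checks
        ext_data jsonb
    );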
>
>
> > Also there is an idea to store per-path separately in pg_statistic_ext
> > rows using expression like (jb #> '{foo,bar}') as stxexprs. This could
> > also help the user to select which paths to analyze simply by using some
> > sort of CREATE STATISTICS. But it is really unclear how to:
> > * store pg_statistic_ext rows from typanalyze
> > * form path expressions for array elements (maybe add new jsonpath
> > operator)
> > * match various -> operator chains to stxexprs
> > * jsonpath-like languages need a simple API for searching by stxexprs
> >
>
> Sure, you can do statistics on expressions, right? Of course, if that
> expression produces a JSONB value, that's not very useful at the moment.
> Maybe we could have two typanalyze functions - one for regular analyze,
> one for extended statistics?
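
For reference, expression statistics of that shape can already be
created on recent versions (table and column names below are
illustrative):

    -- Expression statistics on one extracted path; ANALYZE then gathers
    -- stats for the expression's (still jsonb) result:
    CREATE STATISTICS jb_foo_bar_stats ON ((jb #> '{foo,bar}')) FROM test;
    ANALYZE test;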
>
> That being said, I'm not sure extended stats are a good match for this.
> My feeling was we should collect these stats for all JSONB columns,
> which is why I argued for putting that in pg_statistic.
>
> >
> >
> > Per-path statistics should only be collected for scalars. This can be
> > enabled by the JsonAnalyzeContext.scalarsOnly flag. But there is a
> > problem: computed MCVs and histograms will be broken and we will not be
> > able to use them for queries like (jb > const) in the general case.
> > Also we will not be able to use them internally in scalarineqsel()
> > and var_eq_const() (see jsonSelectivity()). Instead, we will have to
> > implement our own
> > estimator functions for JSONB comparison operators that will correctly
> > use our hacked MCVs and histograms (and maybe not all cases will be
> > supported; for example, comparison to scalars only).
> >
>
> Yeah, but maybe that's an acceptable trade-off? I mean, if we can
> improve estimates for most clauses, and there's some number of clauses
> that are estimated just like without stats, that's still an improvement,
> right?
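
For reference, the clause shape in question looks like this (column name
illustrative):

    -- Whole-value comparison; with scalars-only per-path stats this
    -- would fall back to a default estimate:
    SELECT * FROM test WHERE jb > '{"a": 1}'::jsonb;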
>
> > It's possible to replace objects and arrays with empty ones when
> > scalarsOnly is set to keep correct frequencies of non-scalars.
> > But there is an old bug in JSONB comparison: empty arrays are placed
> > before other values in the JSONB sort order, although they should go
> > after all scalars. So we also need to distinguish empty and non-empty
> > arrays here.
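
That sort-order quirk is easy to reproduce (it is also noted in the
documentation of the jsonb btree ordering):

    -- For historical reasons an empty top-level array sorts below
    -- everything else, even null, instead of above all scalars:
    SELECT '[]'::jsonb < 'null'::jsonb;   -- true
    SELECT '[]'::jsonb < '1'::jsonb;      -- true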
> >
>
> Hmmm ...
> >
> >
> > I tried to fix most of the places marked as XXX and FIXME; the new
> > version of the patches is attached. There are a lot of changes, you
> > can see them in a step-by-step form in the corresponding branch
> > jsonb_stats_20220122 in our GitHub repo [1].
> >
>
> Thanks! I'll go through the changes soon.
>
>

Thanks, Nikita and Tomas, for these patches.

For the last few days, I have been trying to understand these patches,
and based on Tomas's suggestion, I ran some performance tests.

With the attached .sql file, I can see that ANALYZE takes much more time
with these patches.

Setup:
autovacuum = off; all other settings are defaults.

Load the attached file with and without the patch, then compare the time
taken by ANALYZE.
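
The attached test_jsonb_1.sql is the authoritative script; the shape of
the test is roughly the following (an illustrative sketch only, the real
names and data differ):

    -- One jsonb column whose documents spread keys across many distinct
    -- paths (the real data set has 951 of them):
    CREATE TABLE test (jb jsonb);
    INSERT INTO test
    SELECT jsonb_build_object('key_' || (i % 951), i)
    FROM generate_series(1, 100000) AS i;
    \timing on
    ANALYZE test;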

With JSON patches:
postgres=# analyze test ;
ANALYZE
Time: 28464.062 ms (00:28.464)
postgres=# SELECT pg_size_pretty(
               pg_total_relation_size('pg_catalog.pg_statistic') );
 pg_size_pretty
----------------
 328 kB
(1 row)

Without JSON patches:
postgres=# analyze test ;
ANALYZE
Time: 120.864 ms
postgres=# SELECT pg_size_pretty(
               pg_total_relation_size('pg_catalog.pg_statistic') );
 pg_size_pretty
----------------
 272 kB
(1 row)

I haven't found the root cause of this yet, but I suspect the time is
going into a loop over all the paths. In my test data, there are 951
distinct paths in total. While analyzing, we first scan all the sample
rows (30000) and collect all the distinct paths (951 here), and after
that, for every single path, we loop over all the sample rows again to
collect stats for that path. I suspect these nested loops are what take
the time.
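
If that is the cause, the arithmetic fits: 951 paths x 30000 sample rows
is about 28.5 million per-path row visits, versus a single 30000-row
pass without the patch, which is consistent with the slowdown seen above
(28464 ms vs. 121 ms).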

I will run perf and try to find the root cause.

Apart from this performance issue, I haven't found any crashes or other
problems.

Thanks and Regards
Mahendra Singh Thalor
EnterpriseDB: http://www.enterprisedb.com

Attachment Content-Type Size
test_jsonb_1.sql application/octet-stream 21.6 KB
