Re: Collecting statistics about contents of JSONB columns

From: Nikita Glukhov <n(dot)gluhov(at)postgrespro(dot)ru>
To: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>, Mahendra Singh Thalor <mahi6run(at)gmail(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Mahendra Thalor <mahendra(dot)thalor(at)enterprisedb(dot)com>, Oleg Bartunov <obartunov(at)postgrespro(dot)ru>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Subject: Re: Collecting statistics about contents of JSONB columns
Date: 2022-03-10 22:58:54
Message-ID: bf7e1b50-8a35-a98a-251f-2a243558e470@postgrespro.ru
Lists: pgsql-hackers


On 04.02.2022 05:47, Tomas Vondra wrote:
> On 1/25/22 17:56, Mahendra Singh Thalor wrote:
>>
>> ...
>>
>> For the last few days, I was trying to understand these patches, and
>> based on Tomas's suggestion, I was doing some performance tests.
>>
>> With the attached .SQL file, I can see that analyze is taking more
>> time with these patches.
>>
>> I haven't found the root cause yet, but I suspect the time is
>> spent looping over all the paths.
>> In my test data, there is a total of 951 distinct paths. While
>> doing analyze, we first scan all the sample rows (30000) and
>> collect all the distinct paths (here 951), and after that, for
>> every single path, we loop over all the sample rows again to collect
>> stats for that particular path. I suspect these loops are what
>> takes the time.
>>
> Thanks, I've been doing some performance tests too, and you're right,
> it takes quite a bit of time.

That is not surprising at all; I have warned about poor performance in
cases with a large number of paths.
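
For reference, the multi-pass scan has roughly the following shape
(a simplified sketch; collect_paths() and accumulate_path_stats() are
stand-in names, not the actual functions from the patch):

static void
compute_stats_multipass(Jsonb **sample, int nrows)
{
    List       *paths = NIL;
    ListCell   *lc;

    /* pass 1: collect every distinct path present in the sample */
    for (int i = 0; i < nrows; i++)
        paths = collect_paths(sample[i], paths);

    /* pass 2: one full scan of the sample per collected path */
    foreach(lc, paths)
    {
        const char *path = (const char *) lfirst(lc);

        for (int i = 0; i < nrows; i++)
            accumulate_path_stats(path, sample[i]);
    }
}

With ~850k paths and 30k sampled rows, pass 2 alone visits documents
roughly 25 billion times, which is where the time goes.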

> I agree the slowness is largely due to extracting all paths and then
> processing them one by one - which means we have to loop over the
> tuples over and over. In this case there's about 850k distinct paths
> extracted, so we do ~850k loops over 30k tuples. That's gotta take time.
>
> I don't know what exactly to do about this, but I already mentioned we
> may need to pick a subset of paths to keep, similarly to how we pick
> items for MCV. I mean, if we only saw a path once or twice, it's
> unlikely to be interesting enough to build stats for it. I haven't
> tried, but I'd bet most of the 850k paths might be ignored like this.
>
> The other thing we might do is make the loops more efficient. For
> example, we might track which documents contain each path (by a small
> bitmap or something), so that in the loop we can skip rows that don't
> contain the path we're currently processing. Or something like that.
>
>> Apart from this performance issue, I haven't found any crashes or
>> issues.
>>
>
> Well, I haven't seen any crashes either, but as I mentioned, for
> complex documents (2 levels, many distinct keys) ANALYZE starts
> consuming a lot of memory and may get killed by OOM. For example, if
> you generate documents like this
>
>  ./json-generate.py 30000 2 8 1000 6 1000
>
> and then run ANALYZE, that'll take ages and it very quickly gets into
> a situation like this (generated from gdb by calling
> MemoryContextStats on TopMemoryContext):
>
> -------------------------------------------------------------------------
> TopMemoryContext: 80776 total in 6 blocks; 13992 free (18 chunks); 66784 used
>   ...
>   TopPortalContext: 8192 total in 1 blocks; 7656 free (0 chunks); 536 used
>     PortalContext: 1024 total in 1 blocks; 488 free (0 chunks); 536 used: <unnamed>
>       Analyze: 472726496 total in 150 blocks; 3725776 free (4 chunks); 469000720 used
>         Analyze Column: 921177696 total in 120 blocks; 5123256 free (238 chunks); 916054440 used
>           Json Analyze Tmp Context: 8192 total in 1 blocks; 5720 free (1 chunks); 2472 used
>             Json Analyze Pass Context: 8192 total in 1 blocks; 7928 free (0 chunks); 264 used
>           JSON analyze path table: 1639706040 total in 25084 blocks; 1513640 free (33 chunks); 1638192400 used
>       Vacuum: 8192 total in 1 blocks; 7448 free (0 chunks); 744 used
> ...
> Grand total: 3035316184 bytes in 25542 blocks; 10971120 free (352 chunks); 3024345064 used
> -------------------------------------------------------------------------
>
>
> Yes, that's the backend using 3GB of memory, out of which 1.6GB is in
> the "analyze path table" context, 400MB in "analyze" and 900MB in
> "analyze column" contexts. I mean, that seems a bit excessive. And it
> grows over time, so after a while my laptop gives up and kills the
> backend.
>
> I'm not sure if it's a memory leak (which would be fixable), or if
> it's due to keeping stats for all the extracted paths. I mean, in this
> particular case we have 850k paths - even if stats are just 1kB per
> path, that's 850MB. This requires more investigation.

Thank you for the tests and investigation.

I have tried to reduce memory consumption and speed up row scanning:

1. "JSON analyze path table" context contained ~1KB JsonPathAnlStats
structure per JSON path in the global hash table. I have moved
JsonPathAnlStats to the stack of compute_json_stats(), and hash
table now consumes ~70 bytes per path.

2. I have fixed the copying of the resulting JSONB stats into their
   target context, which reduced the size of the "Analyze Column"
   context.

3. I have optimized the memory consumption of the single-pass
   algorithm by storing only the value lists in the non-temporary
   context. That helped to execute the "2 64 64" test case in 30
   seconds. Single-pass is a bit faster in non-TOASTed cases and much
   faster in TOASTed ones, but it consumes much more memory and still
   runs into OOM in cases with more than ~100k paths.

4. I have implemented per-path document lists/bitmaps, which really
   speed up the "2 8 1000" case. A list is converted into a bitmap
   when it becomes larger than the bitmap would be (see the sketch
   after this list).

5. Also, I have fixed some bugs.
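
To illustrate items 1 and 4, the per-path state now looks roughly like
this (a simplified sketch; the field and function names and the exact
conversion threshold are illustrative, not copied from the branch):

typedef struct JsonPathEntry
{
    const char *path;       /* hash key: the path string */
    uint32      freq;       /* number of documents containing the path */
    List       *rows;       /* rows containing the path, or ... */
    Bitmapset  *bitmap;     /* ... a bitmap, once the list outgrows it */
} JsonPathEntry;

static void
json_path_add_row(JsonPathEntry *entry, int row, int nrows)
{
    entry->freq++;

    if (entry->bitmap != NULL)
    {
        entry->bitmap = bms_add_member(entry->bitmap, row);
        return;
    }

    entry->rows = lappend_int(entry->rows, row);

    /* rough size comparison: a bitmap of nrows bits needs ~nrows/8 bytes */
    if (list_length(entry->rows) * sizeof(int) > (size_t) (nrows / 8))
    {
        ListCell   *lc;

        foreach(lc, entry->rows)
            entry->bitmap = bms_add_member(entry->bitmap, lfirst_int(lc));

        list_free(entry->rows);
        entry->rows = NIL;
    }
}

The bulky JsonPathAnlStats is now a single stack variable in
compute_json_stats() that is reused for each path, so the hash table
itself stays small.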

You can find all these changes in commit form in our GitHub repository
on the jsonb_stats_20220310 branch [1].

Updated results of the test:

                                    master   multi-pass    single-pass
levels  keys  uniq keys    paths        ms     ms     MB     ms     MB
----------------------------------------------------------------------
     1     1          1        2       153    122     10     82     14
     1     1       1000     1001       134    105     11     78     38
     1     8          8        9       157    384     19    328     32
     1     8       1000     1001       155    454     23    402     72
     1    64         64       65       129   2889     45   2386    155
     1    64       1000     1001       158   3990     94   1447    177
     2     1          1        3       237    147     10     91     16
     2     1       1000    30577       152    264     32    394    234
     2     8          8       72       245   1943     37   1692    139
     2     8       1000   852333       152   9175    678        OOM
     2    64         64     4161      1784  ~1 hour   53  30018   1750
     2    64       1000  1001001      4715  ~4 hours 1600        OOM

The last two multi-pass results are so slow because the JSONBs become
TOASTed. For measuring master in these tests, I disabled the
WIDTH_THRESHOLD check, which skips TOASTed values larger than 1KB.
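
For context, that check lives in compute_scalar_stats() in
src/backend/commands/analyze.c and, paraphrased, looks like this; on
master, TOASTed JSONBs therefore normally never get detailed stats at
all:

    /* WIDTH_THRESHOLD is 1024 in analyze.c */
    if (toast_raw_datum_size(value) > WIDTH_THRESHOLD)
    {
        toowide_cnt++;      /* counted for the width estimate only */
        continue;           /* excluded from detailed stats */
    }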

Next, I am going to try disabling all-paths collection and implementing
collection of only the most common paths (and/or maybe hashed paths).
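
A possible shape for that pruning (purely illustrative; the names and
the 1% cutoff are assumptions, analogous to the MCV cutoff, not code
from the branch):

    List       *keep = NIL;
    ListCell   *lc;

    foreach(lc, all_paths)
    {
        JsonPathEntry *entry = (JsonPathEntry *) lfirst(lc);

        /* keep only paths seen in at least ~1% of the sampled documents */
        if (entry->freq >= (uint32) (nrows / 100))
            keep = lappend(keep, entry);
    }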

[1] https://github.com/postgrespro/postgres/tree/jsonb_stats_20220310

--
Nikita Glukhov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Attachment Content-Type Size
0001-Add-pg_operator.oprstat-for-derived-operator-statist-20220310.patch text/x-patch 11.9 KB
0002-Add-stats_form_tuple-20220310.patch text/x-patch 3.9 KB
0003-Add-symbolic-names-for-some-jsonb-operators-20220310.patch text/x-patch 7.1 KB
0004-Add-helper-jsonb-functions-and-macros-20220310.patch text/x-patch 9.3 KB
0005-Export-scalarineqsel-20220310.patch text/x-patch 1.5 KB
0006-Add-jsonb-statistics-20220310.patch text/x-patch 137.0 KB
