Re: Collecting statistics about contents of JSONB columns

From: Mahendra Singh Thalor <mahi6run(at)gmail(dot)com>
To: Nikita Glukhov <n(dot)gluhov(at)postgrespro(dot)ru>
Cc: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Mahendra Thalor <mahendra(dot)thalor(at)enterprisedb(dot)com>, Oleg Bartunov <obartunov(at)postgrespro(dot)ru>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Subject: Re: Collecting statistics about contents of JSONB columns
Date: 2022-05-17 11:44:23
Message-ID: CAKYtNAq=Qcy2SGLQKVcaQ8UXu0dVxQUDdTQGyMyZGTPakvGpNg@mail.gmail.com
Lists: pgsql-hackers

On Fri, 11 Mar 2022 at 04:29, Nikita Glukhov <n(dot)gluhov(at)postgrespro(dot)ru>
wrote:
>
>
> On 04.02.2022 05:47, Tomas Vondra wrote:
>
> On 1/25/22 17:56, Mahendra Singh Thalor wrote:
> >
>
> ...
>
> For the last few days, I have been trying to understand these patches, and
> based on Tomas's suggestion, I have been doing some performance tests.
>
> With the attached .SQL file, I can see that ANALYZE is taking more time
> with these patches.
>
> I haven't found the root cause of this, but I suspect the extra time comes
> from looping over all the paths. In my test data, there are 951 distinct
> paths in total. During analysis, we first scan all the sample rows (30000)
> and collect all the distinct paths (here 951), and after that, for every
> single path, we loop over all the sample rows again to collect stats for
> that particular path. I feel these loops might be taking the time.
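
(To make the cost of that shape explicit, here is a rough standalone C
sketch; all names in it are invented for illustration and are not the
patch's identifiers.)

    /*
     * Sketch of the multi-pass shape described above: one pass to collect
     * the distinct paths, then one full scan of the sample per path,
     * i.e. O(npaths * nrows) work.  All identifiers are invented.
     */
    #include <stdbool.h>
    #include <stddef.h>

    typedef struct Row  Row;    /* one sampled jsonb document */
    typedef struct Path Path;   /* one distinct JSON path     */

    extern size_t collect_paths(Row **rows, size_t nrows, Path ***paths);
    extern bool   row_has_path(Row *row, Path *path);
    extern void   accumulate_path_stats(Path *path, Row *row);

    static void
    analyze_json_sample(Row **rows, size_t nrows)
    {
        Path  **paths;
        size_t  npaths = collect_paths(rows, nrows, &paths); /* pass 1 */

        /* pass 2..N: one loop over the whole sample for every path */
        for (size_t p = 0; p < npaths; p++)
            for (size_t r = 0; r < nrows; r++)
                if (row_has_path(rows[r], paths[p]))
                    accumulate_path_stats(paths[p], rows[r]);
    }

With ~850k distinct paths and a 30k-row sample (the case discussed below),
that inner test runs on the order of 25 billion times.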
>
> Thanks, I've been doing some performance tests too, and you're right, it
> takes quite a bit of time.
>
>
> That is absolutely not surprising; I have warned about poor performance
> in cases with a large number of paths.
>
>
> I agree the slowness is largely due to extracting all paths and then
> processing them one by one - which means we have to loop over the tuples
> over and over. In this case there are about 850k distinct paths extracted,
> so we do ~850k loops over 30k tuples. That's gotta take time.
>
> I don't know what exactly to do about this, but I already mentioned we may
> need to pick a subset of paths to keep, similarly to how we pick items for
> MCV. I mean, if we only saw a path once or twice, it's unlikely to be
> interesting enough to build stats for it. I haven't tried, but I'd bet most
> of the 850k paths might be ignored like this.
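
For illustration, the pruning could be as simple as the following sketch
(PathEntry and prune_rare_paths are invented names, not the patch's API):
count during the first pass how many sampled documents contain each path,
and keep only the paths above a cutoff.

    /* Minimal sketch of pruning rare paths before the per-path passes. */
    #include <stddef.h>

    typedef struct PathEntry
    {
        const char *path;   /* normalized path, e.g. "$.a.b" */
        size_t      ndocs;  /* sampled documents containing this path */
    } PathEntry;

    /* Keep only paths seen in at least min_docs sampled documents;
     * compacts the array in place and returns the new entry count. */
    static size_t
    prune_rare_paths(PathEntry *entries, size_t nentries, size_t min_docs)
    {
        size_t kept = 0;

        for (size_t i = 0; i < nentries; i++)
            if (entries[i].ndocs >= min_docs)
                entries[kept++] = entries[i];

        return kept;
    }

Even a small cutoff (say, paths seen in at least a handful of sampled
documents) would presumably discard most of those 850k one-off paths.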
>
> The other thing we might do is make the loops more efficient. For example,
> we might track which documents contain each path (by a small bitmap or
> something), so that in the loop we can skip rows that don't contain the
> path we're currently processing. Or something like that.
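
Perhaps something along these lines (a rough sketch with invented names;
one bit per sampled row, kept for each path):

    /* Sketch of a per-path bitmap of sampled rows, so the per-path loop
     * can skip rows known not to contain the path. */
    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    typedef struct PathRowMap
    {
        uint8_t *bits;      /* nrows bits, set during the first pass */
    } PathRowMap;

    static inline void
    pathrowmap_set(PathRowMap *map, size_t row)
    {
        map->bits[row / 8] |= (uint8_t) (1 << (row % 8));
    }

    static inline bool
    pathrowmap_test(const PathRowMap *map, size_t row)
    {
        return (map->bits[row / 8] >> (row % 8)) & 1;
    }

    extern void accumulate_path_stats_for_row(PathRowMap *map, size_t row);

    /* Per-path pass: only visit rows whose bit is set for this path. */
    static void
    scan_rows_for_path(PathRowMap *map, size_t nrows)
    {
        for (size_t row = 0; row < nrows; row++)
            if (pathrowmap_test(map, row))
                accumulate_path_stats_for_row(map, row);
    }

Nikita's item 4 further down does something in this direction, with a
per-path list that is converted into a bitmap once the list grows larger
than the bitmap.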
>
> Apart from this performance issue, I haven't found any crashes or issues.
>
>
> Well, I haven't seen any crashes either, but as I mentioned, for complex
> documents (2 levels, many distinct keys) the ANALYZE starts consuming a
> lot of memory and may get killed by OOM. For example, if you generate
> documents like this
>
> ./json-generate.py 30000 2 8 1000 6 1000
>
> and then run ANALYZE, that'll take ages and it very quickly gets into a
> situation like this (generated from gdb by calling MemoryContextStats on
> TopMemoryContext):
>
> -------------------------------------------------------------------------
> TopMemoryContext: 80776 total in 6 blocks; 13992 free (18 chunks); 66784 used
> ...
> TopPortalContext: 8192 total in 1 blocks; 7656 free (0 chunks); 536 used
> PortalContext: 1024 total in 1 blocks; 488 free (0 chunks); 536 used: <unnamed>
> Analyze: 472726496 total in 150 blocks; 3725776 free (4 chunks); 469000720 used
> Analyze Column: 921177696 total in 120 blocks; 5123256 free (238 chunks); 916054440 used
> Json Analyze Tmp Context: 8192 total in 1 blocks; 5720 free (1 chunks); 2472 used
> Json Analyze Pass Context: 8192 total in 1 blocks; 7928 free (0 chunks); 264 used
> JSON analyze path table: 1639706040 total in 25084 blocks; 1513640 free (33 chunks); 1638192400 used
> Vacuum: 8192 total in 1 blocks; 7448 free (0 chunks); 744 used
> ...
> Grand total: 3035316184 bytes in 25542 blocks; 10971120 free (352 chunks); 3024345064 used
> -------------------------------------------------------------------------
>
>
> Yes, that's the backend using 3GB of memory, out of which 1.6GB is in the
> "analyze path table" context, 400MB in the "analyze" and 900MB in the
> "analyze column" contexts. I mean, that seems a bit excessive. And it grows
> over time, so after a while my laptop gives up and kills the backend.
>
> I'm not sure if it's a memory leak (which would be fixable), or if it's due
> to keeping stats for all the extracted paths. I mean, in this particular
> case we have 850k paths - even if the stats are just 1kB per path, that's
> 850MB. This requires more investigation.
>
> Thank you for the tests and investigation.
>
> I have tried to reduce memory consumption and speed up row scanning:
>
> 1. "JSON analyze path table" context contained ~1KB JsonPathAnlStats
> structure per JSON path in the global hash table. I have moved
> JsonPathAnlStats to the stack of compute_json_stats(), and hash
> table now consumes ~70 bytes per path.
>
> 2. I have fixed copying of resulting JSONB stats into context, which
> reduced the size of "Analyze Column" context.
>
> 3. I have optimized memory consumption of the single-pass algorithm by storing
> only value lists in the non-temporary context. That helped to
> execute "2 64 64" test case in 30 seconds. Single-pass is a
> bit faster in non-TOASTed cases, and much faster in TOASTed.
> But it consumes much more memory and still goes to OOM in the
> cases with more than ~100k paths.
>
> 4. I have implemented per-path document lists/bitmaps, which really
> speed up the case "2 8 1000". A list is converted into a bitmap when
> it becomes larger than the bitmap.
>
> 5. Also I have fixed some bugs.
>
>
> All these changes can be found in commit form in our GitHub repository
> on the branch jsonb_stats_20220310 [1].
>
>
> Updated results of the test:
>
>  levels  keys  uniq keys    paths   master   multi-pass     single-pass
>                                       (ms)    (ms)   (MB)    (ms)   (MB)
>  ------------------------------------------------------------------------
>       1     1          1        2      153     122     10      82     14
>       1     1       1000     1001      134     105     11      78     38
>       1     8          8        9      157     384     19     328     32
>       1     8       1000     1001      155     454     23     402     72
>       1    64         64       65      129    2889     45    2386    155
>       1    64       1000     1001      158    3990     94    1447    177
>       2     1          1        3      237     147     10      91     16
>       2     1       1000    30577      152     264     32     394    234
>       2     8          8       72      245    1943     37    1692    139
>       2     8       1000   852333      152    9175    678        OOM
>       2    64         64     4161     1784  ~1 hour    53   30018   1750
>       2    64       1000  1001001     4715  ~4 hours 1600        OOM
>
> The last two multi-pass results are too slow because the JSONB values
> become TOASTed. For measuring master in these tests, I disabled the
> WIDTH_THRESHOLD check, which skips TOASTed values > 1KB.
>
>
> Next, I am going to try to disable all-paths collection and implement
> collection of most common paths (and/or hashed paths maybe).

Hi Nikita,
Tomas and I discussed the design for disabling all-paths collection
(collecting stats for only some paths). Below are some
thoughts/doubts/questions.

Point 1) Can you please elaborate on how you are going to implement this
(collecting stats for only some paths)?
Point 2) As collecting JSON stats is taking time, should we add an on/off
switch for collecting JSON stats? (A rough sketch of what such a switch
could look like follows Point 3 below.)
Point 3) We thought of one more design: we could give an explicit path so
that stats are collected only for that particular path, or pass a subset of
the JSON values. However, this would require a lot of code changes (new
syntax and so on), so we think it would be better to collect stats only for
some common paths (by a limit or some other way).
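
Regarding Point 2, the switch itself could be very small: a boolean checked
by the JSONB-specific part of ANALYZE before it does the per-path work.
Whether it should be a GUC, a per-column option, or tied to the statistics
target is open; the sketch below just uses an invented global flag and
invented function names to show the idea.

    /* Rough sketch of Point 2: a flag that lets ANALYZE skip the expensive
     * per-path JSON statistics.  The flag and function names are invented
     * for discussion; a real patch could use a GUC or a column option. */
    #include <stdbool.h>

    static bool collect_json_path_stats = true;

    extern void compute_basic_jsonb_stats(void);
    extern void compute_per_path_jsonb_stats(void);

    static void
    compute_jsonb_stats(void)
    {
        /* ordinary scalar-style statistics are always collected */
        compute_basic_jsonb_stats();

        if (!collect_json_path_stats)
            return;             /* skip the costly per-path analysis */

        compute_per_path_jsonb_stats();
    }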

Thoughts?

--
Thanks and Regards
Mahendra Singh Thalor
EnterpriseDB: http://www.enterprisedb.com
