Re: Collecting statistics about contents of JSONB columns

From: Mahendra Singh Thalor <mahi6run(at)gmail(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
Cc: Nikita Glukhov <n(dot)gluhov(at)postgrespro(dot)ru>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Mahendra Thalor <mahendra(dot)thalor(at)enterprisedb(dot)com>, Oleg Bartunov <obartunov(at)postgrespro(dot)ru>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Subject: Re: Collecting statistics about contents of JSONB columns
Date: 2022-03-11 17:10:43
Message-ID: CAKYtNAo1Pf0c-4t2UYobg73UR+fkkfBA8ZYHkeUGpV5bE2i7Sw@mail.gmail.com
Lists: pgsql-hackers

On Fri, 4 Feb 2022 at 08:30, Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
wrote:
>
>
>
> On 2/4/22 03:47, Tomas Vondra wrote:
> > ./json-generate.py 30000 2 8 1000 6 1000
>
> Sorry, this should be (different order of parameters):
>
> ./json-generate.py 30000 2 1000 8 6 1000
>

Thanks, Tomas, for this test case.

Hi Hackers,

For the last few days, I have been testing on top of these JSON optimizer
patches and taking help from Tomas Vondra to understand the patches and the
testing results.
Thanks, Tomas, for your feedback and suggestions.

Below is the summary:
*Point 1)* analyze takes a very long time for large documents:
For large JSON documents, analyze takes far longer than current HEAD. For
reference, I am attaching the test file (./json-generate.py
30000 2 1000 8 6 1000).

Head: analyze test ; Time: 120.864 ms
With patch: analyze test ; Time: more than 2 hours

analyze takes so long because, with these patches, we first iterate over all
sample rows (30000 in the above case) and store all the paths (around 852k
here).
In a second pass, we take one path at a time and collect stats for that
particular path by analyzing all the sample rows again, and we repeat this
for all 852k paths. In other words, we do 852k loops, and in each loop we
extract values for a single path.
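To make the cost concrete, here is a minimal Python sketch (not the patch's
actual C code) of the work implied by the second pass described above; the
helper name is hypothetical:

```python
# Rough cost model of the second ANALYZE pass: each collected path
# triggers one full re-scan of all sample rows.
def second_pass_cost(num_sample_rows, num_paths):
    """Total single-path extractions performed by the second pass."""
    return num_paths * num_sample_rows

# With the numbers from the test case above:
extractions = second_pass_cost(30000, 852689)
print(extractions)  # ~25.6 billion single-path extractions
```

This is why the runtime jumps from milliseconds to hours: the work grows
with (number of paths) x (number of sample rows), not with the sample size
alone.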

*Point 2)* memory consumption increases rapidly for large documents:
In the above test case, there are 852689 (~852k) paths in total, and to keep
stats for one path, we allocate 1120 bytes.

Total paths: 852689 ~ 852k

Memory for one path's stats: 1120 bytes ~ 1 KB

(sizeof(JsonValueStats) = 1120, allocated in the "Analyze Column" context)

Total memory for all paths: 852689 * 1120 = 955011680 bytes ~ 955 MB

On top of this, each path allocates additional memory while it is analyzed,
based on frequency data and other per-path state.

Keeping all 852k path entries in the hash therefore uses around 1 GB of
memory, which is very large.
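A quick back-of-the-envelope check of the figures above (plain arithmetic,
using the sizeof value reported by the patch):

```python
# Verify the per-path stats memory estimate quoted above.
paths = 852689        # total collected paths in the test case
stats_bytes = 1120    # sizeof(JsonValueStats)

total = paths * stats_bytes
print(total)          # 955011680 bytes
print(total / 10**6)  # ~955 MB, before any per-path extra allocations
```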

*Point 3*) Review comment noticed by Tomas Vondra:

+ oldcxt = MemoryContextSwitchTo(ctx->stats->anl_context);
+ pstats->stats = jsonAnalyzeBuildPathStats(pstats);
+ MemoryContextSwitchTo(oldcxt);

Above should be:
+ oldcxt = MemoryContextSwitchTo(ctx->mcxt);
+ pstats->stats = jsonAnalyzeBuildPathStats(pstats);
+ MemoryContextSwitchTo(oldcxt);

*Response from Tomas Vondra:*
The problem is that "anl_context" is actually "Analyze", i.e. the context
for the whole ANALYZE command, across all the columns. But we only want to
keep those path stats while processing a particular column. At the end,
after processing all paths from a column, we need to "build" the final stats
for the column, and that result needs to go into the "Analyze" context. But
all the partial results need to go into the "Analyze Column" context.

*Point 4)*

+/*
+ * jsonAnalyzeCollectPath
+ *		Extract a single path from JSON documents and collect its values.
+ */
+static void
+jsonAnalyzeCollectPath(JsonAnalyzeContext *ctx, Jsonb *jb, void *param)
+{
+	JsonPathAnlStats *pstats = (JsonPathAnlStats *) param;
+	JsonbValue	jbvtmp;
+	JsonbValue *jbv = JsonValueInitBinary(&jbvtmp, jb);
+	JsonPathEntry *path;
+	JsonPathEntry **entries;
+	int			i;
+
+	entries = palloc(sizeof(*entries) * pstats->depth);
+
+	/* Build entry array in direct order */
+	for (path = &pstats->path, i = pstats->depth - 1;
+		 path->parent && i >= 0;
+		 path = path->parent, i--)
+		entries[i] = path;
+
+	jsonAnalyzeCollectSubpath(ctx, pstats, jbv, entries, 0);
+
+	pfree(entries);
---- Many times we call palloc with zero size here, and entries then points
to invalid memory (because pstats->depth = 0), so I think we should not call
palloc with size 0.

*Fix:*

+	if (pstats->depth)
+		entries = palloc(sizeof(*entries) * pstats->depth);

Given these points, I think we should rethink the design for collecting
stats for all paths.

We could set limits (like MCV) on the number of paths, or accept an explicit
path so stats are collected only for that particular path, or pass a subset
of the JSON values.

In the above case, there are ~852k paths in total, but if we collect stats
for only the 1000 most common paths, we can minimize both time and memory,
and we might even keep at least frequencies for the non-analyzed paths.
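To sketch the MCV-style idea (this is an illustration in Python, not the
patch's C code; the function name and limit are hypothetical):

```python
from collections import Counter

def select_paths_for_stats(path_frequencies, limit=1000):
    """Keep detailed stats only for the `limit` most common paths;
    for all remaining paths, retain just their observed frequency."""
    counter = Counter(path_frequencies)
    common = dict(counter.most_common(limit))
    rest = {p: f for p, f in counter.items() if p not in common}
    return common, rest

# Toy usage: three paths, detailed stats kept for the top two only.
common, rest = select_paths_for_stats(
    {"$.a": 900, "$.a.b": 500, "$.c": 3}, limit=2)
print(sorted(common))  # ['$.a', '$.a.b']
print(rest)            # {'$.c': 3}
```

With a fixed limit, the per-column memory and the number of second-pass
loops stay bounded regardless of how many distinct paths the documents
contain.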

Next, I will take the latest patches from Nikita's last email and do more
tests.

Thanks and Regards
Mahendra Singh Thalor
EnterpriseDB: http://www.enterprisedb.com
