Re: Collecting statistics about contents of JSONB columns

From: Zhihong Yu <zyu(at)yugabyte(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Nikita Glukhov <n(dot)gluhov(at)postgrespro(dot)ru>, Mahendra Thalor <mahendra(dot)thalor(at)enterprisedb(dot)com>, Oleg Bartunov <obartunov(at)postgrespro(dot)ru>
Subject: Re: Collecting statistics about contents of JSONB columns
Date: 2022-01-01 21:16:47
Message-ID: CALNJ-vT8qwNd46Qf+1hSiR+9rz=Sv9w-PQabiwbwM-=CgixyZw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Sat, Jan 1, 2022 at 7:33 AM Zhihong Yu <zyu(at)yugabyte(dot)com> wrote:

>
>
> On Fri, Dec 31, 2021 at 2:07 PM Tomas Vondra <
> tomas(dot)vondra(at)enterprisedb(dot)com> wrote:
>
>> Hi,
>>
>> One of the complaints I sometimes hear from users and customers using
>> Postgres to store JSON documents (as JSONB type, of course) is that the
>> selectivity estimates are often pretty poor.
>>
>> Currently we only really have MCV and histograms with whole documents,
>> and we can deduce some stats from that. But that is somewhat bogus
>> because there's only ~100 documents in either MCV/histogram (with the
>> default statistics target). And moreover we discard all "oversized"
>> values (over 1kB) before even calculating those stats, which makes it
>> even less representative.
>>
>> A couple weeks ago I started playing with this, and I experimented with
>> improving extended statistics in this direction. After a while I noticed
>> a forgotten development branch from 2016 which tried to do this by
>> adding a custom typanalyze function, which seemed like a more natural
>> idea (because it's really a statistics for a single column).
>>
>> But then I went to pgconf NYC in early December, and I spoke to Oleg
>> about various JSON-related things, and he mentioned they've been working
>> on this topic some time ago too, but did not have time to pursue it. So
>> he pointed me to a branch [1] developed by Nikita Glukhov.
>>
>> I like Nikita's branch because it solved a couple architectural issues
>> that I've been aware of, but only solved them in a rather hackish way.
>>
>> I had a discussion with Nikita about his approach what can we do to move
>> it forward. He's focusing on other JSON stuff, but he's OK with me
>> taking over and moving it forward. So here we go ...
>>
>> Nikita rebased his branch recently, I've kept improving it in various
>> (mostly a lot of comments and docs, and some minor fixes and tweaks).
>> I've pushed my version with a couple extra commits in [2], but you can
>> ignore that except if you want to see what I added/changed.
>>
>> Attached is a couple patches adding adding the main part of the feature.
>> There's a couple more commits in the github repositories, adding more
>> advanced features - I'll briefly explain those later, but I'm not
>> including them here because those are optional features and it'd be
>> distracting to include them here.
>>
>> There are 6 patches in the series, but the magic mostly happens in parts
>> 0001 and 0006. The other parts are mostly just adding infrastructure,
>> which may be a sizeable amount of code, but the changes are fairly
>> simple and obvious. So let's focus on 0001 and 0006.
>>
>> To add JSON statistics we need to do two basic things - we need to build
>> the statistics and then we need to allow using them while estimating
>> conditions.
>>
>>
>> 1) building stats
>>
>> Let's talk about building the stats first. The patch does one of the
>> things I experimented with - 0006 adds a jsonb_typanalyze function, and
>> it associates it with the data type. The function extracts paths and
>> values from the JSONB document, builds the statistics, and then stores
>> the result in pg_statistic as a new stakind.
>>
>> I've been planning to store the stats in pg_statistic too, but I've been
>> considering to use a custom data type. The patch does something far more
>> elegant - it simply uses stavalues to store an array of JSONB documents,
>> each describing stats for one path extracted from the sampled documents.
>>
>> One (very simple) element of the array might look like this:
>>
>> {"freq": 1,
>> "json": {
>> "mcv": {
>> "values": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
>> "numbers": [0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1]},
>> "width": 19,
>> "distinct": 10,
>> "nullfrac": 0,
>> "correlation": 0.10449},
>> "path": "$.\"a\"",
>> "freq_null": 0, "freq_array": 0, "freq_object": 0,
>> "freq_string": 0, "freq_boolean": 0, "freq_numeric": 0}
>>
>> In this case there's only a MCV list (represented by two arrays, just
>> like in pg_statistic), but there might be another part with a histogram.
>> There's also the other columns we'd expect to see in pg_statistic.
>>
>> In principle, we need pg_statistic for each path we extract from the
>> JSON documents and decide it's interesting enough for estimation. There
>> are probably other ways to serialize/represent this, but I find using
>> JSONB for this pretty convenient because we need to deal with a mix of
>> data types (for the same path), and other JSON specific stuff. Storing
>> that in Postgres arrays would be problematic.
>>
>> I'm sure there's plenty open questions - for example I think we'll need
>> some logic to decide which paths to keep, otherwise the statistics can
>> get quite big, if we're dealing with large / variable documents. We're
>> already doing something similar for MCV lists.
>>
>> One of Nikita's patches not included in this thread allow "selective"
>> statistics, where you can define in advance a "filter" restricting which
>> parts are considered interesting by ANALYZE. That's interesting, but I
>> think we need some simple MCV-like heuristics first anyway.
>>
>> Another open question is how deep the stats should be. Imagine documents
>> like this:
>>
>> {"a" : {"b" : {"c" : {"d" : ...}}}}
>>
>> The current patch build stats for all possible paths:
>>
>> "a"
>> "a.b"
>> "a.b.c"
>> "a.b.c.d"
>>
>> and of course many of the paths will often have JSONB documents as
>> values, not simple scalar values. I wonder if we should limit the depth
>> somehow, and maybe build stats only for scalar values.
>>
>>
>> 2) applying the statistics
>>
>> One of the problems is how to actually use the statistics. For @>
>> operator it's simple enough, because it's (jsonb @> jsonb) so we have
>> direct access to the stats. But often the conditions look like this:
>>
>> jsonb_column ->> 'key' = 'value'
>>
>> so the condition is actually on an expression, not on the JSONB column
>> directly. My solutions were pretty ugly hacks, but Nikita had a neat
>> idea - we can define a custom procedure for each operator, which is
>> responsible for "calculating" the statistics for the expression.
>>
>> So in this case "->>" will have such "oprstat" procedure, which fetches
>> stats for the JSONB column, extracts stats for the "key" path. And then
>> we can use that for estimation of the (text = text) condition.
>>
>> This is what 0001 does, pretty much. We simply look for expression stats
>> provided by an index, extended statistics, and then - if oprstat is
>> defined for the operator - we try to derive the stats.
>>
>> This opens other interesting opportunities for the future - one of the
>> parts adds oprstat for basic arithmetic operators, which allows deducing
>> statistics for expressions like (a+10) from statistics on column (a).
>>
>> Which seems like a neat feature on it's own, but it also interacts with
>> the extended statistics in somewhat non-obvious ways (especially when
>> estimating GROUP BY cardinalities).
>>
>> Of course, there's a limit of what we can reasonably estimate - for
>> example, there may be statistical dependencies between paths, and this
>> patch does not even attempt to deal with that. In a way, this is similar
>> to correlation between columns, except that here we have a dynamic set
>> of columns, which makes it much much harder. We'd need something like
>> extended stats on steroids, pretty much.
>>
>>
>> I'm sure I've forgotten various important bits - many of them are
>> mentioned or explained in comments, but I'm sure others are not. And I'd
>> bet there are things I forgot about entirely or got wrong. So feel free
>> to ask.
>>
>>
>> In any case, I think this seems like a good first step to improve our
>> estimates for JSOB columns.
>>
>> regards
>>
>>
>> [1] https://github.com/postgrespro/postgres/tree/jsonb_stats
>>
>> [2] https://github.com/tvondra/postgres/tree/jsonb_stats_rework
>>
>> --
>> Tomas Vondra
>> EnterpriseDB: http://www.enterprisedb.com
>> The Enterprise PostgreSQL Company
>
>
> Hi,
> For patch 1:
>
> + List *statisticsName = NIL; /* optional stats estimat.
> procedure */
>
> I think if the variable is named estimatorName (or something similar), it
> would be easier for people to grasp its purpose.
>
> + /* XXX perhaps full "statistics" wording would be better */
> + else if (strcmp(defel->defname, "stats") == 0)
>
> I would recommend (stats sounds too general):
>
> + else if (strcmp(defel->defname, "statsestimator") == 0)
>
> + statisticsOid = ValidateStatisticsEstimator(statisticsName);
>
> statisticsOid -> statsEstimatorOid
>
> For get_oprstat():
>
> + }
> + else
> + return (RegProcedure) InvalidOid;
>
> keyword else is not needed (considering the return statement in if block).
>
> For patch 06:
>
> + /* FIXME Could be before the memset, I guess? Checking
> vardata->statsTuple. */
> + if (!data->statsTuple)
> + return false;
>
> I would agree the check can be lifted above the memset call.
>
> + * XXX This does not really extract any stats, it merely allocates the
> struct?
> + */
> +static JsonPathStats
> +jsonPathStatsGetSpecialStats(JsonPathStats pstats, JsonPathStatsType type)
>
> As comments says, I think allocJsonPathStats() would be better name for
> the func.
>
> + * XXX Why doesn't this do jsonPathStatsGetTypeFreq check similar to what
> + * jsonPathStatsGetLengthStats does?
>
> I think `jsonPathStatsGetTypeFreq(pstats, jbvArray, 0.0) <= 0.0` check
> should be added for jsonPathStatsGetArrayLengthStats().
>
> To be continued ...
>
Hi,

+static JsonPathStats
+jsonStatsFindPathStats(JsonStats jsdata, char *path, int pathlen)

Stats appears twice in the method name. I think findJsonPathStats() should
suffice.
It should check `if (jsdata->nullfrac >= 1.0)` as jsonStatsGetPathStatsStr
does.

+JsonPathStats
+jsonStatsGetPathStatsStr(JsonStats jsdata, const char *subpath, int
subpathlen)

This func can be static, right ?
I think findJsonPathStatsWithPrefix() would be a better name for the func.

+ * XXX Doesn't this need ecape_json too?
+ */
+static void
+jsonPathAppendEntryWithLen(StringInfo path, const char *entry, int len)
+{
+ char *tmpentry = pnstrdup(entry, len);
+ jsonPathAppendEntry(path, tmpentry);

ecape_json() is called within jsonPathAppendEntry(). The XXX comment can be
dropped.

+jsonPathStatsGetArrayIndexSelectivity(JsonPathStats pstats, int index)

It seems getJsonSelectivityWithArrayIndex() would be a better name.

+ sel = scalarineqsel(NULL, operator,
+ operator == JsonbGtOperator ||
+ operator == JsonbGeOperator,
+ operator == JsonbLeOperator ||
+ operator == JsonbGeOperator,

Looking at the comment for scalarineqsel():

* scalarineqsel - Selectivity of "<", "<=", ">", ">=" for scalars.
*
* This is the guts of scalarltsel/scalarlesel/scalargtsel/scalargesel.
* The isgt and iseq flags distinguish which of the four cases apply.

It seems JsonbLtOperator doesn't appear in the call, can I ask why ?

Cheers

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Noah Misch 2022-01-02 00:07:50 Re: Probable memory leak with ECPG and AIX
Previous Message Justin Pryzby 2022-01-01 17:25:05 Re: libpq compression (part 2)