Re: Collecting statistics about contents of JSONB columns

From: Greg Stark <stark(at)mit(dot)edu>
To: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
Cc: Simon Riggs <simon(dot)riggs(at)enterprisedb(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Nikita Glukhov <n(dot)gluhov(at)postgrespro(dot)ru>, Mahendra Thalor <mahendra(dot)thalor(at)enterprisedb(dot)com>, Oleg Bartunov <obartunov(at)postgrespro(dot)ru>
Subject: Re: Collecting statistics about contents of JSONB columns
Date: 2022-01-25 20:06:30
Message-ID: CAM-w4HNeUQu0Dvh=-fd+h2ywZEfnxWpLDxHb8HjvqkMPevkekA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Thu, 6 Jan 2022 at 14:56, Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com> wrote:
>
>
> Not sure I understand. I wasn't suggesting any user-defined filtering,
> but something done by default, similarly to what we do for regular MCV
> lists, based on frequency. We'd include frequent paths while excluding
> rare ones.
>
> So no need for a user interface.

Not sure but I think he was agreeing with you. That we should figure
out the baseline behaviour and get it as useful as possible first then
later look at adding some way to customize it. I agree -- I don't
think the user interface will be hard technically but I think it will
require some good ideas and there could be lots of bikeshedding. And a
lot of users will never even use it anyways so it's important to get
the defaults as useful as possible.

> Similarly for the non-scalar values - I don't think we can really keep
> regular statistics on such values (for the same reason why it's not
> enough for whole JSONB columns), so why to build/store that anyway.

For a default behaviour I wonder if it wouldn't be better to just
flatten and extract all the scalars. So if there's no matching path
then at least we have some way to estimate how often a scalar appears
anywhere in the json document.

That amounts to assuming the user knows the right path to find a given
scalar and there isn't a lot of overlap between keys. So it would at
least do something useful if you have something like {gender: female,
name: {first: nancy, last: reagan], state: california, country: usa}.
It might get things slightly wrong if you have some people named
"georgia" or have names that can be first or last names.

But it would generally be doing something more or less useful as long
as they look for "usa" in the country field and "male" in the gender
field. If they looked for "male" in $.name.first path it would give
bad estimates but assuming they know their data structure they won't
be doing that.

--
greg

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Robert Haas 2022-01-25 20:12:51 Re: refactoring basebackup.c
Previous Message Blake, Geoff 2022-01-25 19:57:33 Re: Add spin_delay() implementation for Arm in s_lock.h