Re: jsonpath

From: "Jonathan S(dot) Katz" <jkatz(at)postgresql(dot)org>
To: Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>, Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>
Cc: Nikita Glukhov <n(dot)gluhov(at)postgrespro(dot)ru>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Oleg Bartunov <obartunov(at)postgrespro(dot)ru>, Michael Paquier <michael(at)paquier(dot)xyz>, Stas Kelvich <s(dot)kelvich(at)postgrespro(dot)ru>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, David Steele <david(at)pgmasters(dot)net>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andrew Dunstan <andrew(dot)dunstan(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>
Subject: Re: jsonpath
Date: 2019-03-17 16:46:01
Message-ID: 04981bad-2156-a927-b04c-50ff2c9709ae@postgresql.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 3/17/19 3:13 AM, Alexander Korotkov wrote:
> On Sat, Mar 16, 2019 at 9:39 PM Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com> wrote:
>> so 16. 3. 2019 v 10:36 odesílatel Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru> napsal:
>>>
>>> On Thu, Mar 14, 2019 at 12:07 PM Alexander Korotkov
>>> <a(dot)korotkov(at)postgrespro(dot)ru> wrote:
>>>> On Sun, Mar 10, 2019 at 1:51 PM Alexander Korotkov
>>>> <a(dot)korotkov(at)postgrespro(dot)ru> wrote:
>>>>> On Wed, Mar 6, 2019 at 12:40 AM Nikita Glukhov <n(dot)gluhov(at)postgrespro(dot)ru> wrote:
>>>>>> Attached 36th version of the patches.
>>>>>
>>>>> Thank yo for the revision!
>>>>>
>>>>> In the attached revision following changes are made:
>>>>>
>>>>>> "unknown" refers here to ordinary three-valued logical Unknown, which is
>>>>>> represented in SQL by NULL.
>>>>>>
>>>>>> JSON path expressions return sequences of SQL/JSON items, which are defined by
>>>>>> SQL/JSON data model. But JSON path predicates (logical expressions), which are
>>>>>> used in filters, return three-valued logical values: False, True, or Unknown.
>>>>>
>>>>> * I've added short explanation of this to the documentation.
>>>>> * Removed no longer present data structures from typedefs.list of the
>>>>> first patch.
>>>>> * Moved GIN support patch to number 3. Seems to be well-isolated and
>>>>> not very complex patch. I propose to consider this to 12 too. I
>>>>> added high-level comment there, commit message and made some code
>>>>> beautification.
>>>>
>>>> I think patches 1 and 2 are in committable shape (I reached Tomas
>>>> off-list, he doesn't have more notes regarding them). While patch 3
>>>> requires more review.
>>>>
>>>> I'm going to push 1 and 2 if no objections.
>>>
>>> So, pushed. Many thanks to reviewers and authors!
>>>
>>> Remaining part I'm proposing for 12 is attached. I appreciate review of it.
>>
>>
>> I tested this patch and I didn't find any issue - just I tested basic functionality and regress tests.
>>
>> looks well
>
> Thank you for your feedback!

Like Pavel, I did some basic testing of the patch (on current HEAD
4178d8b91c) trying out various JSON path expressions, and yes, it all
worked. I had a brief scare while testing on 4178d8b91c where initdb was
failing on the bootstrapping step, but after doing a thorough wipe of
build files and my output directory, it seems to be initializing okay.

I also did some testing of the GIN patch upthread, as the quickness of
retrieval of the data using JSON path is of course important as well.
Using a schema roughly like this:

CREATE TABLE news_feed (
id int GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY,
data jsonb NOT NULL
);
CREATE INDEX news_feed_data_gin_idx ON news_feed USING gin(data);

I loaded in a data set of roughly 420,000 rows. Each row had all the
same keys but differing values (e.g. "length" and "content" as keys)

I tested a few different JSON path scenarios. Some of the index scans
performed way better than the equivalent sequential scans, for instance:

SELECT count(*)
FROM news_feed
WHERE data @? '$.length ? (@ == 200)';

SELECT *
FROM news_feed
WHERE data @? '$.id ? (@ == "22613cbc-d83e-4a29-8b59-3b9f5cd61825")';

Using the index outperformed the sequential scan (and parallel seq scan)
by ~10-100x based on my config + laptop hardware!

However, when I did something a little more complex, like the below:

SELECT count(*)
FROM news_feed
WHERE data @? '$.length ? (@ < 150)';

SELECT count(*)
FROM news_feed
WHERE data @? '$.content ? (@ like_regex "^Start")';

SELECT id, jsonb_path_query(data, '$.content')
FROM news_feed
WHERE data @? '$.content ? (@ like_regex "risk" flag "i")';

I would find that the index scan performed as well as the sequential
scan. Additionally, on my laptop, the parallel sequential scan would
beat the index scan by ~2.5x in some cases.

Reading up on what the GIN patch does, this all makes sense: it's
optimized for equality, I understand there are challenges to be able to
handle inequality, regex exps, etc. And the cases where it really does
work well, it's _incredibly_ fast.

My suggestion would be adding some additional guidance in the user
documentation around how GIN works with the @@ and @? operators so they
can understand where GIN will work very well with JSON path + their data
and not be surprised when other types of JSON path queries are
performing on par with a sequential scan (or worse than a parallel seq
scan).

Thanks,

Jonathan

In response to

  • Re: jsonpath at 2019-03-17 07:13:08 from Alexander Korotkov

Responses

  • Re: jsonpath at 2019-03-17 16:55:20 from Alexander Korotkov

Browse pgsql-hackers by date

  From Date Subject
Next Message Alexander Korotkov 2019-03-17 16:55:20 Re: jsonpath
Previous Message Alexander Korotkov 2019-03-17 16:31:38 Re: jsonpath