Unable to make use of "deep" JSONB index

From: Shaheed Haque <shaheedhaque(at)gmail(dot)com>
To: pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Unable to make use of "deep" JSONB index
Date: 2022-06-02 10:27:26
Message-ID: CAHAc2jdaD63ANRFuQpzZ13XibyP1QPqRUUDHTznRVep48e=7nA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

Hi,

I'm having trouble using an index on the "deep" innards of a JSONB field in
that the matching query it is trying to accelerate never uses it. I did get
some advice on a simplified version of the problem at [1], but the actual
problem remains the same in that "EXPLAIN ANALYZE" never refers to the
index. Here are the details including a test case below...

1. The JSONB can be several MB in size. This works fine for all but one
access pattern.
2. The JSON in the problem use case looks like this:

{
"...stuff...": ...
"employee": {
"999": {"id": 999, "integer attribute": 0, "boolean-may-be-missing":
true, "state": {
"nested-list": [[], [], ...]
}
}
}

3. As per the discussion at [2], using a SELECT with a WHERE on the 3
attributes of interest ("integer attribute", the "boolean-may-be-missing"
and "nested-list") incurs a significant overheard which suggests that the
JSONB storage is being accessed 3 times. In order to optimise for this
case, I constructed a query using the jsonpath support which seems to
successfully avoid the triple-fetch by keeping the logic inside the
jsonpath query like this:

WHERE (snapshot @? '$.employee."999" ? (@.pay_graph <> 0 ||
@.last_run_of_employment
== true || @.state.employment[last][2] == 0)')

4. Then I created an index "matching" this query.
5. According to EXPLAIN ANALYSE, the index is never used.

=== version and platform ===

Version: PostgreSQL 14.2 on x86_64-pc-linux-gnu, compiled by gcc (GCC)
7.3.1 20180712 (Red Hat 7.3.1-12), 64-bit
Platform: Ubuntu 22.04 (Jammy), using OS-supplied build

=== test case ===
CREATE TABLE payrun (
id serial primary key,
snapshot JSONB
);

INSERT INTO payrun(snapshot)
VALUES
('{"employee": {"999": {"id": 999, "state": {"employment":
[["1920-01-01", null, 3]]}, "pay_graph": 0, "last_run_of_employment":
false}, "111": {"id": 111, "state": {"employment": [["1920-01-01", null,
5]]}, "pay_graph": 3, "last_run_of_employment": true}}}'),
('{"employee": {"999": {"id": 999, "state": {"employment":
[["1970-01-01", null, 3]]}, "pay_graph": 6, "last_run_of_employment":
true}, "222": {"id": 222, "state": {"employment": [["1920-01-01", null,
5]]}, "pay_graph": 5, "last_run_of_employment": true}}}'),
('{"employee": {"998": {"id": 998, "state": {"employment":
[["1980-01-01", null, 3]]}, "pay_graph": 7, "last_run_of_employment":
false}, "333": {"id": 333, "state": {"employment": [["1920-01-01", null,
5]]}, "pay_graph": 3, "last_run_of_employment": true}}}')
;

SELECT id,snapshot #>'{employee,999,state,employment}' FROM "payrun" WHERE
(snapshot @? '$.employee."999" ? (@.pay_graph <> 0 ||
@.last_run_of_employment == true || @.state.employment[last][2] == 0)');

--
-- Create index designed to match the query.
--
create index idx1 on payrun using gin ((snapshot->'$.employee.* ?
(@.pay_graph <> 0 || @.last_run_of_employment == true ||
@.state.employment[last][2] == 0)'));

set enable_seqscan = OFF;

--
-- EXPLAIN ANALYZE ...query above...
--
explain analyse SELECT id,snapshot #>'{employee,999,state,employment}' FROM
"payrun" WHERE (snapshot @? '$.employee."999" ? (@.pay_graph <> 0 ||
@.last_run_of_employment
== true || @.state.employment[last][2] == 0)');
QUERY
PLAN

-------------------------------------------------------------------------------------------------------------------------------------------------------------

Seq Scan on payrun (cost=10000000000.00..10000000001.04 rows=1 width=36)
(actual time=0.040..0.042 rows=1 loops=1)
Filter: (snapshot @? '$."employee"."999"?((@."pay_graph" != 0 ||
@."last_run_of_employment" == true) || @."state"."employment"[last][2] ==
0)'::jsonpath)
Rows Removed by Filter: 2
Planning Time: 0.883 ms
Execution Time: 0.078 ms
(5 rows)

=== end test case ===

The expected result is that with enable_seqscan = OFF, the index should be
used, but instead a sequential scan is reported as above. The same happens
without enable_seqscan = OFF on
a moderately large test set of over 2000 rows (with extended run times
circa 10+ seconds on my hardware).

I have tried the same with similar results on PG13.

Have I constructed the index incorrectly, or is there some other way to
convince the query to use it?

Thanks, Shahee

[1]
https://www.postgresql.org/message-id/CAHAc2jdiRtw3qus_rvz1QvcxUJ9AgaBKObP-Fvx1q6Vr80n_xw%40mail.gmail.com
[2]
https://www.postgresql.org/message-id/CAHAc2jf6Yp2uddfKQyGp%3DbyvOuUzYpsHoPjyKXnUhS5%2B%3DAcuww%40mail.gmail.com

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message operations i 2022-06-02 11:00:40 Re: How is this possible "publication does not exist"
Previous Message Amit Kapila 2022-06-02 09:51:17 Re: How is this possible "publication does not exist"