working around JSONB's lack of stats?

From: Josh Berkus <josh(at)agliodbs(dot)com>
To: pgsql-performance(at)postgresql(dot)org
Subject: working around JSONB's lack of stats?
Date: 2015-01-27 07:06:09
Message-ID: 54C738E1.8080405@agliodbs.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

Folks,

Currently, JSONB fields don't have statistics, and estimate a flat 1%
selectivity. This can result in poor query plans, and I'm wondering if
anyone has a suggested workaround for this short of hacking a new
selectivity function. For example, take the common case of using JSONB
to hold a list of "tags" for tagging documents:

Table "public.doc_tags_json"
Column | Type | Modifiers
--------+---------+-----------
doc_id | integer |
tags | jsonb |
Indexes:
"doc_tags_json_doc_id_idx" UNIQUE, btree (doc_id)
"doc_tags_json_tags_idx" gin (tags)

This query:

select doc_id
from doc_tags_json
where tags @> '[ "math", "physics" ]'
order by doc_id desc limit 25;

Uses this plan:

QUERY PLAN

--------------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=0.43..709.79 rows=25 width=4) (actual time=24.529..340.499
rows=25 loops=1)
-> Index Scan Backward using doc_tags_json_doc_id_idx on
doc_tags_json (cost=0.43..283740.95 rows=10000 width=4) (actual
time=24.528..340.483 rows=25 loops=1)
Filter: (tags @> '["math", "physics"]'::jsonb)
Rows Removed by Filter: 1011878
Planning time: 0.090 ms
Execution time: 340.528 ms

It does this because it expects @> '["math", "physics"]' to match 10,000
rows, which means that it expects to only scan 25,000 entries in the
doc_id index to return the top 25. However, the matching condition is
much rarer than it thinks, so it's actually far faster to use the index
on the JSONB column:

drop index doc_tags_json_doc_id_idx;

QUERY PLAN

--------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=10517.08..10517.14 rows=25 width=4) (actual
time=7.594..7.602 rows=25 loops=1)
-> Sort (cost=10517.08..10542.08 rows=10000 width=4) (actual
time=7.593..7.596 rows=25 loops=1)
Sort Key: doc_id
Sort Method: top-N heapsort Memory: 26kB
-> Bitmap Heap Scan on doc_tags_json (cost=92.90..10234.89
rows=10000 width=4) (actual time=6.733..7.475 rows=257 loops=1)
Recheck Cond: (tags @> '["math", "physics"]'::jsonb)
Heap Blocks: exact=256
-> Bitmap Index Scan on doc_tags_json_tags_idx
(cost=0.00..90.40 rows=10000 width=0) (actual time=6.695..6.695 rows=257
loops=1)
Index Cond: (tags @> '["math", "physics"]'::jsonb)
Planning time: 0.093 ms
Execution time: 7.632 ms

On a normal column, I'd raise n_distinct to reflect the higher
selecivity of the search terms. However, since @> uses contsel,
n_distinct is ignored. Anyone know a clever workaround I don't
currently see?

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com

Responses

Browse pgsql-performance by date

  From Date Subject
Next Message Merlin Moncure 2015-01-27 14:11:17 Re: Why is PostgreSQL not using my index?
Previous Message Tomas Vondra 2015-01-27 04:25:02 Re: Why is PostgreSQL not using my index?