Less-silly selectivity for JSONB matching operators

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Less-silly selectivity for JSONB matching operators
Date: 2020-02-27 19:51:14
Message-ID: 12237.1582833074@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

While looking at a recent complaint about bad planning, I was
reminded that jsonb's @> and related operators use "contsel"
as their selectivity estimator. This is really bad, because
(a) contsel is only a stub, yielding a fixed default estimate,
and (b) that default is 0.001, meaning we estimate these operators
as five times more selective than equality, which is surely pretty
silly.

There's a good model for improving this in ltree's ltreeparentsel():
for any "var OP constant" query, we can try applying the operator
to all of the column's MCV and histogram values, taking the latter
as being a random sample of the non-MCV values. That code is
actually 100% generic except for the question of exactly what
default selectivity ought to be plugged in when we don't have stats.

Hence, the attached draft patch moves that logic into a generic
function in selfuncs.c, and then invents "matchsel" and "matchjoinsel"
generic estimators that have a default estimate of twice DEFAULT_EQ_SEL.
(I'm not especially wedded to that number, but it seemed like a
reasonable starting point.)

There were a couple of other operators that seemed to be inappropriately
using contsel, so I changed all of these to use matchsel:

@>(tsquery,tsquery) | tsq_mcontains
<@(tsquery,tsquery) | tsq_mcontained
@@(text,text) | ts_match_tt
@@(text,tsquery) | ts_match_tq
-|-(anyrange,anyrange) | range_adjacent
@>(jsonb,jsonb) | jsonb_contains
?(jsonb,text) | jsonb_exists
?|(jsonb,text[]) | jsonb_exists_any
?&(jsonb,text[]) | jsonb_exists_all
<@(jsonb,jsonb) | jsonb_contained
@?(jsonb,jsonpath) | jsonb_path_exists_opr
@@(jsonb,jsonpath) | jsonb_path_match_opr

Note: you might think that we should just shove this generic logic
into contsel itself, and maybe areasel and patternsel while at it.
However, that would be pretty useless for these functions' intended
usage with the geometric operators, because we collect neither MCV
nor histogram stats for the geometric data types, making the extra
complexity worthless. Pending somebody putting some effort into
estimation for the geometric data types, I think we should just get
out of the business of having non-geometric types relying on these
estimators.

This patch is not complete, because I didn't look at changing
the contrib modules, and grep says at least some of them are using
contsel for non-geometric data types. But I thought I'd put it up
for discussion at this stage.

regards, tom lane

Attachment Content-Type Size
better-jsonb-selectivity-1.patch text/x-diff 16.3 KB

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Andres Freund 2020-02-27 20:26:36 Re: [HACKERS] Doubt in pgbench TPS number
Previous Message Stephen Frost 2020-02-27 19:31:52 Re: [GsoC] Read/write transaction-level routing in Odyssey Project Idea