Re: BUG #16993: Query Planner does not use index for EXISTS-query on a large table with all equal foreign key values

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: a(dot)schnabl(at)synedra(dot)com
Cc: pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: BUG #16993: Query Planner does not use index for EXISTS-query on a large table with all equal foreign key values
Date: 2021-05-05 17:47:41
Message-ID: 4035793.1620236861@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

PG Bug reporting form <noreply(at)postgresql(dot)org> writes:
> CREATE TABLE data_node (id BIGINT PRIMARY KEY);
> CREATE TABLE data_entry (id BIGINT PRIMARY KEY, node_fk BIGINT NOT NULL,
> FOREIGN KEY (node_fk) REFERENCES data_node (id));
> CREATE INDEX node_ix ON data_entry USING BTREE (node_fk);
> INSERT INTO data_node (id) SELECT generate_series(1,10);
> INSERT INTO data_entry (id, node_fk) SELECT s, 2 FROM
> generate_series(1,10000000) s;
> VACUUM ANALYZE data_node;
> VACUUM ANALYZE data_entry;
> ...
> SELECT * FROM data_node
> WHERE EXISTS (
> SELECT 1 FROM data_entry WHERE node_fk = data_node.id
> );

I looked into this. The reason for the strange behavior is that,
thanks to the extremely skewed data distribution in data_entry.node_fk,
the indexscan you're hoping it would use is actually costed as being
more expensive than a seqscan. That causes it to get discarded at
the relation planning level. Later on, when we consider join plans,
there's a special exception that understands that a semijoin with
an inner indexscan that uses all the join quals is particularly
efficient; which is why the desirable plan is correctly given a
small cost, *if* we consider it at all. But if the indexscan was
already thrown away, we won't. Of course, disabling the seqscan
allows the indexscan to survive the scan-level path tournament.

I'm not sure what a good fix for this would look like. We could imagine
trying to mark indexscan paths that satisfy has_indexed_join_quals()
at time of generation, and then giving them a free pass in add_path().
However
(1) as written, has_indexed_join_quals() depends on the particular
join being considered. That might be just an artifact of coding
convenience, but I'm not sure.
(2) add_path() is enough of a hot spot that giving it yet more things
to worry about is fairly unattractive.
(3) allowing more paths to survive the tournament slows things down
overall.

Tweaking the example shows that you need a *very* skewed data
distribution to make this happen, so maybe we should write it off
as a corner case that's not worth fixing. I'm for sure not eager
to slow planning down overall to improve this case, so I don't
much like the above proposal even if it could be made to work.

A different idea is to reconsider the selectivity estimation rules
with an eye to cutting the estimated cost of the indexscan. The
thing that I'm wondering about here is that we're estimating that
using eqsel() rather than eqjoinsel(), and it seems like the latter
might give us a more useful value. AFAIR we never thought hard
about that aspect when we invented parameterized plans. But for
sure, changing that would be a research project with a lot of
potential impact.

A narrower fix would be to hack var_eq_non_const so that it doesn't
assume that the comparison value must be one of the entries in the
column. But it seems like whatever change we made in that line would
be a very unprincipled hack, because what are you going to assume
instead?

Anyway, short answer is that this is more complicated than it
looks, and there isn't any simple improvement we can make.

regards, tom lane

In response to

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message Federico 2021-05-05 19:17:49 Re: BUG #16991: regclass is not case sensitive causing "relation does not exist" error
Previous Message Alvaro Herrera 2021-05-05 16:24:17 Re: ALTER CONSTRAINT on a partitioned FK isn't working