<> join selectivity estimate question

From: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
To: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: <> join selectivity estimate question
Date: 2017-03-17 05:54:46
Message-ID: CAEepm=270ze2hVxWkJw-5eKzc3AB4C9KpH3L2kih75R5pdSogg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi hackers,

While studying a regression reported[1] against my parallel hash join
patch, I noticed that we can also reach a good and a bad plan in
unpatched master. One of the causes seems to be the estimated
selectivity of a semi-join with an extra <> filter qual.

Here are some times I measured for TPCH Q21 at scale 10 and work_mem
of 1GB. That is a query with a large anti-join and a large semi-join.

8 workers = 8.3s
7 workers = 8.2s
6 workers = 8.5s
5 workers = 8.9s
4 workers = 9.5s
3 workers = 39.7s
2 workers = 36.9s
1 worker = 38.2s
0 workers = 47.9s

Please see the attached query plans showing the change in plan from
Hash Semi Join to Nested Loop Semi Join that happens only once we
reach 4 workers and the (partial) base relation size becomes smaller.
The interesting thing is that row estimate for the semi-join and
anti-join come out as 1 (I think this is 0 clamped to 1).

The same thing can be seen with a simple semi-join, if you happen to
have TPCH loaded. Compare these two queries:

SELECT *
FROM lineitem l1
WHERE EXISTS (SELECT *
FROM lineitem l2
WHERE l1.l_orderkey = l2.l_orderkey);

-> estimates 59986012 rows, actual rows 59,986,052 (scale 10 TPCH)

SELECT *
FROM lineitem l1
WHERE EXISTS (SELECT *
FROM lineitem l2
WHERE l1.l_orderkey = l2.l_orderkey
AND l1.l_suppkey <> l2.l_suppkey);

-> estimates 1 row, actual rows 57,842,090 (scale 10 TPCH)

Or for a standalone example:

CREATE TABLE foo AS
SELECT (generate_series(1, 1000000) / 4)::int AS a,
(generate_series(1, 1000000) % 100)::int AS b;

ANALYZE foo;

SELECT *
FROM foo f1
WHERE EXISTS (SELECT *
FROM foo f2
WHERE f1.a = f2.a);

-> estimates 1,000,000 rows

SELECT *
FROM foo f1
WHERE EXISTS (SELECT *
FROM foo f2
WHERE f1.a = f2.a
AND f1.b <> f2.b);

-> estimates 1 row

I'm trying to wrap my brain around the selectivity code, but am too
green to grok how this part of the planner that I haven't previously
focused on works so far, and I'd like to understand whether this is
expected behaviour so that I can figure out how to tackle the reported
regression with my patch. What is happening here?

Thanks for reading.

[1] https://www.postgresql.org/message-id/CAEepm%3D3Og-7-b3WOkiT%3Dc%2B6y3eZ0VVSyb1K%2BSOvF17BO5KAt0A%40mail.gmail.com

--
Thomas Munro
http://www.enterprisedb.com

Attachment Content-Type Size
master_q21_3workers.txt text/plain 5.4 KB
master_q21_4workers.txt text/plain 5.4 KB

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Beena Emerson 2017-03-17 06:08:56 Re: increasing the default WAL segment size
Previous Message Kyotaro HORIGUCHI 2017-03-17 05:23:13 Re: Protect syscache from bloating with negative cache entries