Missed query planner optimization

From: Mathias Kunter <mathiaskunter(at)gmail(dot)com>
To: pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Missed query planner optimization
Date: 2023-06-19 10:14:45
Message-ID: c43bf236-ba37-3978-2845-cf8fcbc024c8@gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi all,

this is about a limitation of the current query planner implementation
which causes big performance declines for certain types of queries.
Affected queries will typically execute about 1000 times slower than
they could. Examples are given below.

After talking about this with David Rowley on the pgsql-bugs mailing
list (bug #17964), it turns out that the reason for the problem
apparently is that eligible IN clauses are always converted into
semi-joins. This is done even in situations where such a conversion
prevents further optimizations to be made.

Hence, it would be desirable that the planner would intelligently decide
based on estimated costs whether or not an IN clause should be converted
into a semi-join. The planner obviously can already correctly estimate
which variant will be faster, as shown in the query plans below.

The tricky part is that it's unfortunately not guaranteed that we'll
find the BEST possible solution if we decide independently for each IN
clause, because estimated total costs will depend on the other IN
clauses of the query as well. However, a simple heuristic solution would
be to restrain from converting an IN clause into a semi-join if the
estimated number of rows returned by the subselect is below a certain
threshold. Then, the planner should make its final decision based on the
estimated total cost of the two possible query variants (i.e. between
applying the heuristic vs. not applying the heuristic).

Example queries follow. Full query plans are provided within the linked
database fiddles.

Example 1: Combining an IN clause with OR.

SELECT * FROM book WHERE
author_id IS NULL OR
author_id IN (SELECT id FROM author WHERE name = 'some_name');

Execution time: 159.227 ms
Execution time (optimized variant): 0.084 ms (1896 times faster)

Estimated cost: 16933.31
Estimated cost (optimized variant): 2467.85 (6.86 times lower)

Full query plans here: https://dbfiddle.uk/SOOJBMwI

Example 2: Combining two IN clauses with OR.

SELECT * FROM book WHERE
author_id IN (SELECT id FROM author WHERE name = 'some_name') OR
publisher_id IN (SELECT id FROM publisher WHERE name = 'some_name');

Execution time: 227.822 ms
Execution time (optimized variant): 0.088 ms (2589 times faster)

Estimated cost: 20422.61
Estimated cost (optimized variant): 4113.39 (4.96 times lower)

Full query plans here: https://dbfiddle.uk/q6_4NuDX

Example 3: Combining an IN clause with UNION.

SELECT * FROM
(SELECT * FROM table1 UNION SELECT * FROM table2) AS q
WHERE id IN (SELECT id FROM table3);

Execution time: 932.412 ms
Execution time (optimized variant): 0.728 ms (1281 times faster)

Estimated cost: 207933.98
Estimated cost (optimized variant): 97.40 (2135 times lower)

Full query plans here: https://dbfiddle.uk/TXASgMZf

Example 4: Complex real-life query from our project.

The full query is linked below.

Execution time: 72436.509 ms
Execution time (optimized variant): 0.201 ms (360381 times faster)

Estimated cost: 3941783.92
Estimated cost (optimized variant): 1515.62 (2601 times lower)

Original query here: https://pastebin.com/raw/JsY1PzG3
Optimized query here: https://pastebin.com/raw/Xvq7zUY2

Now, I'm not familiar with the current planner implementation, but
wanted to know whether there is anybody on this list who would be
willing to work on this. Having the planner consider the costs of
converting IN clauses into semi-joins obviously seems like a worthy
goal. As shown, the performance improvements are gigantic for certain
types of queries.

Thank you very much!

Mathias

Browse pgsql-hackers by date

  From Date Subject
Next Message Jelte Fennema 2023-06-19 10:52:48 Re: [EXTERNAL] Re: Add non-blocking version of PQcancel
Previous Message Amit Kapila 2023-06-19 10:03:37 Re: Synchronizing slots from primary to standby