Re: Condition pushdown: why (=) is pushed down into join, but BETWEEN or >= is not?

From: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
To: Andy Fan <zhihui(dot)fan1213(at)gmail(dot)com>, Justin Pryzby <pryzby(at)telsasoft(dot)com>
Cc: David Rowley <dgrowleyml(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Dmitry Astapov <dastapov(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Condition pushdown: why (=) is pushed down into join, but BETWEEN or >= is not?
Date: 2022-02-05 13:32:40
Message-ID: 1642f540-badc-ef36-5d24-465f3ccb3343@enterprisedb.com
Lists: pgsql-hackers

Hi,

there's been an interesting case [1] of a slow query on pgsql-general,
related to the topic discussed in this thread. It makes the query run
slower by multiple orders of magnitude, and I think it's interesting,
so let me present a simple example demonstrating it.

------------------------------------------------------------------------
create table t1 (a int);
create table t2 (a int);

insert into t1
select i from generate_series(1,100000) s(i);

insert into t2
select mod(i,100000) from generate_series(1,10000000) s(i);

create index on t1(a);
create index on t2(a);

vacuum analyze t1, t2;

-- we need to force mergejoin
set enable_nestloop = off;
------------------------------------------------------------------------

Now, let's run a simple query:

SELECT t1.a, t2.a FROM t1 JOIN t2 USING (a)
WHERE (t1.a > 99000) ORDER BY t1.a LIMIT 100;

QUERY PLAN
------------------------------------------------------------------------
Limit (cost=4.63..224.57 rows=100 width=8)
(actual time=8999.487..8999.707 rows=100 loops=1)
-> Merge Join (cost=4.63..209447.97 rows=95226 width=8)
(actual time=8999.485..8999.620 rows=100 loops=1)
Merge Cond: (t1.a = t2.a)
-> Index Only Scan using t1_a_idx on t1
(cost=0.29..29.25 rows=969 width=4)
(actual time=0.010..0.011 rows=1 loops=1)
Index Cond: (a > 99000)
Heap Fetches: 0
-> Index Only Scan using t2_a_idx on t2
(cost=0.43..183464.09 rows=9999977 width=4)
(actual time=0.026..4594.757 rows=9900200 loops=1)
Heap Fetches: 0
Planning Time: 0.338 ms
Execution Time: 8999.768 ms
(10 rows)

Now, let's do a simple trick and add a condition on t2.a, "implied" by
the join condition (t1.a = t2.a) and the inequality (t1.a > 99000).

SELECT t1.a, t2.a FROM t1 JOIN t2 USING (a)
WHERE (t1.a > 99000) AND (t2.a > 99000) ORDER BY t1.a LIMIT 100;

QUERY PLAN
------------------------------------------------------------------------
Limit (cost=0.77..250.39 rows=100 width=8)
(actual time=0.040..0.294 rows=100 loops=1)
-> Merge Join (cost=0.77..2297.23 rows=920 width=8)
(actual time=0.039..0.172 rows=100 loops=1)
Merge Cond: (t1.a = t2.a)
-> Index Only Scan using t1_a_idx on t1
(cost=0.29..29.25 rows=969 width=4)
(actual time=0.031..0.031 rows=1 loops=1)
Index Cond: (a > 99000)
Heap Fetches: 0
-> Index Only Scan using t2_a_idx on t2
(cost=0.43..2014.87 rows=96596 width=4)
(actual time=0.005..0.052 rows=100 loops=1)
Index Cond: (a > 99000)
Heap Fetches: 0
Planning Time: 0.222 ms
Execution Time: 0.414 ms
(11 rows)

Well, that's quite a difference! From ~9000 ms to well under 1 ms,
pretty good.

What is happening in the first plan is that the merge join needs t2
sorted by t2.a, and the index-only scan looks like a great way to do
that, as it has a low startup cost (which the LIMIT likes). But this
completely misses that (t1.a > 99000) implies (t2.a > 99000) through the
equality in the join condition. So we start scanning t2_a_idx from the
very beginning, only to throw away the first 99% of tuples.
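
For contrast, here's a minimal sketch (my addition, not part of the
original report) of the case the planner already handles: an equality
restriction on t1.a is propagated to t2.a through the equivalence class
built from the join clause, so both index scans get the condition.

EXPLAIN (COSTS OFF)
SELECT t1.a, t2.a FROM t1 JOIN t2 USING (a)
WHERE (t1.a = 99000);

Both scans should show "Index Cond: (a = 99000)" - it's only
inequalities (and BETWEEN) that are not derived this way, which is what
this thread is about.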

In the original report this is particularly egregious, because the index
only scan looks like this:

-> Index Only Scan using data_class_pkey on data_class ta
(cost=0.57..4935483.78 rows=216964862 width=8)
(actual time=0.018..35022.908 rows=151321889 loops=1)
Heap Fetches: 151321889

So yeah, 151M heap fetches, that's bound to be expensive :-/

Adding the condition on t2.a allows just skipping the first chunk of the
index, eliminating the expensive part.

Of course, this breaks the estimates in the faster query, because we now
apply the condition twice - once for the index scan, once as the join
clause. So instead of ~100k rows the join is estimated as ~1000 rows.
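
As a rough cross-check (my arithmetic, not from the plans, assuming the
usual equality-join selectivity of 1/max(ndistinct), i.e.
rows ~ outer * inner / 100000):

without the extra qual:  969 * 9999977 / 100000 ~ 96900   (planner: 95226)
with the extra qual:     969 *   96596 / 100000 ~   940   (planner: 920)

The second estimate is ~100x too low, because the scan qual on t2 and
the join clause are perfectly correlated - every t2 row surviving the
scan qual still has a matching t1 row.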

I'm also not claiming this is 100% worth it - queries with a suitable
combination of clauses (conditions on the join keys) seem rather
uncommon. But it seems like an interesting example, because it may be
seen either as a missed execution optimization (failing to skip the
initial chunk of rows), or a costing issue due to not accounting for
having to process those rows (which would likely result in picking a
different plan).

regards

[1]
https://www.postgresql.org/message-id/CA%2B1Wm9U_sP9237f7OH7O%3D-UTab71DWOO4Qc-vnC78DfsJQBCwQ%40mail.gmail.com

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
