Re: Merge join vs merge semi join against primary key

From: Sean Rhea <sean(dot)c(dot)rhea(at)gmail(dot)com>
To: Igor Neyman <ineyman(at)perceptron(dot)com>
Cc: "pgsql-general(at)postgresql(dot)org" <pgsql-general(at)postgresql(dot)org>
Subject: Re: Merge join vs merge semi join against primary key
Date: 2015-10-09 20:30:00
Message-ID: CACZYdDjgvnyNZnwZNEsgP6D=KtB5fu-iOWMoNHXN9kSWfiQYKQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

It does the merge (not-semi) join:

production=> explain analyze SELECT ac.* FROM balances ac JOIN customers o
ON (o.id = ac.customer_id AND o.group_id = 45);

QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------------
Merge Join (cost=2172.47..19959.82 rows=6 width=80) (actual
time=114.578..243898.199 rows=7318 loops=1)
Merge Cond: (ac.customer_id = o.id)
-> Index Scan using balances_customer_id_index on balances ac
(cost=0.00..727.42 rows=16876 width=80) (actual time=0.025..20.972
rows=16876 loops=1)
-> Index Scan using customers_pkey on customers o (cost=0.00..64811.57
rows=179 width=8) (actual time=92.174..243813.231 rows=7672 loops=1)
Filter: (group_id = 45)
Rows Removed by Filter: 212699113
Total runtime: 243901.595 ms
(7 rows)

Sean

On Fri, Oct 9, 2015 at 1:09 PM, Igor Neyman <ineyman(at)perceptron(dot)com> wrote:

> Hello,
>
>
>
> I'm seeing some inexplicable (to me) behavior in PostgreSQL 9.2. I checked
>
> the archives, but I still can't explain it. Apologies if I missed
> something.
>
>
>
> 1. When I join two tables with "WHERE id IN (...)" versus with an explicit
>
> join, and the join column for the inner table is a primary key, I would
> expect
>
> the same behavior in both cases, but the optimizer is choosing a merge
> join in
>
> one case and a merge semi join in the other. There's at most one customer
>
> with a given id. Why not do a semi join?
>
>
>
> 2. Even though the join methods are different, I would expect about the
> same
>
> performance in either case, but one query takes only a few hundred
>
> milliseconds while the other takes hundreds of seconds. Ouch!
>
>
>
> Can anyone help me explain this behavior?
>
>
>
> Some details are below. Let me know if it would be helpful to gather
> others.
>
>
>
> Sean
>
>
>
>
>
> production=> select version();
>
> version
>
>
> ----------------------------------------------------------------------------------------
>
> PostgreSQL 9.2.13 on i686-pc-linux-gnu, compiled by gcc (Debian 4.7.2-5)
> 4.7.2, 32-bit
>
> (1 row)
>
>
>
> production=> \d customers
>
> Table "public.customers"
>
> Column | Type | Modifiers
>
>
> ---------------+----------+-------------------------------------------------------
>
> id | bigint | not null default
> nextval('customers_id_seq'::regclass)
>
> group_id | bigint |
>
> ...
>
> Indexes:
>
> "customers_pkey" PRIMARY KEY, btree (id)
>
> ...
>
>
>
> production=> select count(*) from customers;
>
> count
>
> --------
>
> 473733
>
> (1 row)
>
>
>
> production=> \d balances
>
> Table "public.balances"
>
> Column | Type | Modifiers
>
>
> -------------------+----------+------------------------------------------------------
>
> id | bigint | not null default
> nextval('balances_id_seq'::regclass)
>
> balance | integer | not null default 0
>
> tracking_number | integer | not null
>
> customer_id | bigint | not null
>
> ...
>
> Indexes:
>
> "balances_pkey" PRIMARY KEY, btree (id)
>
> "balances_customer_tracking_number_index" UNIQUE, btree (customer_id,
> tracking_number)
>
> ...
>
>
>
> production=> select count(*) from balances;
>
> count
>
> -------
>
> 16876
>
> (1 row)
>
>
>
> production=> analyze verbose customers;
>
> INFO: analyzing "public.customers"
>
> INFO: "customers": scanned 14280 of 14280 pages, containing 475288 live
> rows and 1949 dead rows; 300000 rows in sample, 475288 estimated total rows
>
> ANALYZE
>
>
>
> production=> analyze verbose balances;
>
> INFO: analyzing "public.balances"
>
> INFO: "balances": scanned 202 of 202 pages, containing 16876 live rows
> and 0 dead rows; 16876 rows in sample, 16876 estimated total rows
>
> ANALYZE
>
>
>
> production=> explain analyze SELECT * FROM balances where customer_id IN
> (SELECT id from customers WHERE group_id = 45);
>
>
> QUERY PLAN
>
>
> -----------------------------------------------------------------------------------------------------------------------------------------------------
>
> Merge Semi Join (cost=2442.14..19958.30 rows=16876 width=80) (actual
> time=119.905..145.126 rows=7318 loops=1)
>
> Merge Cond: (balances.customer_id = customers.id)
>
> -> Index Scan using balances_customer_id_index on balances
> (cost=0.00..727.79 rows=16876 width=80) (actual time=0.302..9.477
> rows=16876 loops=1)
>
> -> Index Scan using customers_pkey on customers (cost=0.00..64192.97
> rows=184 width=8) (actual time=103.354..126.459 rows=359 loops=1)
>
> Filter: (group_id = 45)
>
> Rows Removed by Filter: 141684
>
> Total runtime: 146.659 ms
>
> (7 rows)
>
>
>
> production=> explain analyze SELECT ac.* FROM balances ac join customers o
> ON o.id = ac.customer_id WHERE o.group_id = 45;
>
>
> QUERY PLAN
>
>
> -----------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
> Merge Join (cost=2214.50..20216.86 rows=30 width=80) (actual
> time=185.615..201991.752 rows=7318 loops=1)
>
> Merge Cond: (ac.customer_id = o.id)
>
> -> Index Scan using balances_customer_tracking_number_index on
> balances ac (cost=0.00..1007.49 rows=16876 width=80) (actual
> time=0.068..25.036 rows=16876 loops=1)
>
> -> Index Scan using customers_pkey on customers o
> (cost=0.00..63836.61 rows=836 width=8) (actual time=159.840..201915.765
> rows=7672 loops=1)
>
> Filter: (group_id = 45)
>
> Rows Removed by Filter: 212699113
>
> Total runtime: 201995.044 ms
>
> (7 rows)
>
>
>
>
>
> What if you rewrite your second query like this:
>
>
>
> SELECT ac.*
>
> FROM balances ac JOIN customers o ON (o.id = ac.customer_id AND
> o.group_id = 45);
>
>
>
> Regards,
>
> Igor Neyman
>
>
>
>
>

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message droberts 2015-10-09 22:16:48 Re: Best practices for aggregate table design
Previous Message Igor Neyman 2015-10-09 20:09:51 Re: Merge join vs merge semi join against primary key