Optimizer internals

From: "John Vincent" <pgsql-performance(at)lusis(dot)org>
To: "PGSQL Performance" <pgsql-performance(at)postgresql(dot)org>
Subject: Optimizer internals
Date: 2006-06-15 18:05:46
Message-ID: c841561b0606151105v459d830era158b9eea5f59a7e@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

I'm not a programmer so understanding the optimizer code is WAY beyond my
limits.

My question, that I haven't seen answered elsewhere, is WHAT things can
affect the choice of an index scan over a sequence scan. I understand that
sometimes a sequence scan is faster and that you still have to get the data
from the disk but my question relates to an issue we had pop up today.

We have 2 tables, which we'll refer to as laaf and laaf_new. The first table
has 220M rows and the second table has 4M rows. What were basically doing is
aggregating the records from the first table into the second one at which
point we're going to drop the first one. This is the same table I mentioned
previously in my post about pg_dump.

laaf_new has one less column than laaf and both were freshly vacuum analyzed
after having an index added on a single column (other than the primary key).

The query we were doing was as follows:

select main_account_status_dim_id, count(*)
from cla_dw.loan_account_agg_fact_new
group by main_account_status_dim_id
order by main_account_status_dim_id;

One of our problems is that we don't have any PGSQL dbas here. All of our
guys are DB2 (we're still looking though).

Now I've been told by our DBA that we should have been able to wholy satisfy
that query via the indexes.

We did regular EXPLAINS on the query with seqscan enabled and disabled and
even in our own tests actually running the queries, the results WERE faster
with a seq scan than an index scan but the question we were discussing is
WHY did it choose the index scan and why is the index scan slower than the
sequence scan? He's telling me that DB2 would have been able to do the whole
thing with indexes.

EXPLAINS:

(the reason for the random_page_cost was that we had the default of 4 in the
.conf file and were planning on changing it to 2 anyway to match our other
server)

set random_page_cost=2;
set enable_seqscan=on;
explain select main_account_status_dim_id, count(*)
from cla_dw.loan_account_agg_fact
group by main_account_status_dim_id
order by main_account_status_dim_id;

"Sort (cost=8774054.54..8774054.66 rows=48 width=4)"
" Sort Key: main_account_status_dim_id"
" -> HashAggregate (cost=8774052.60..8774053.20 rows=48 width=4)"
" -> Seq Scan on loan_account_agg_fact
(cost=0.00..7609745.40rows=232861440 width=4)"

set random_page_cost=2;
set enable_seqscan=off;
explain select main_account_status_dim_id, count(*)
from cla_dw.loan_account_agg_fact
group by main_account_status_dim_id
order by main_account_status_dim_id;

"Sort (cost=108774054.54..108774054.66 rows=48 width=4)"
" Sort Key: main_account_status_dim_id"
" -> HashAggregate (cost=108774052.60..108774053.20 rows=48 width=4)"
" -> Seq Scan on loan_account_agg_fact (cost=
100000000.00..107609745.40 rows=232861440 width=4)"
Here's the DDL for the table laaf:

When the system is not busy again, I'll run a verbose version. The query was
run against each of the tables to compare the results of aggregation change
with the new table.

CREATE TABLE cla_dw.loan_account_agg_fact
(
loan_account_agg_fact_id int8 NOT NULL DEFAULT
nextval('loan_account_agg_fact_loan_account_agg_fact_id_seq'::regclass),
dw_load_date_id int4 NOT NULL DEFAULT 0,
servicer_branch_dim_id int4 NOT NULL DEFAULT 0,
main_account_status_dim_id int4 NOT NULL DEFAULT 0,
product_dim_id int4 NOT NULL DEFAULT 0,
next_due_date_id int4 NOT NULL DEFAULT 0,
account_balance numeric(15,6) NOT NULL DEFAULT 0,
loan_count int4 NOT NULL DEFAULT 0,
principal numeric(15,6) NOT NULL DEFAULT 0,
interest numeric(15,6) NOT NULL DEFAULT 0,
fees numeric(15,6) NOT NULL DEFAULT 0,
gl_principal numeric(15,6) NOT NULL DEFAULT 0,
gl_interest numeric(15,6) NOT NULL DEFAULT 0,
accruable_principal numeric(15,6) NOT NULL DEFAULT 0,
unaccruable_principal numeric(15,6) NOT NULL DEFAULT 0,
calculated_principal numeric(15,6) DEFAULT 0,
current_interest numeric(15,6) NOT NULL DEFAULT 0,
past_due_interest numeric(16,5) NOT NULL DEFAULT 0,
cash_available numeric(15,6) DEFAULT 0,
cash_collected numeric(15,6) DEFAULT 0,
cash_collected_date_id int4 DEFAULT 0,
dw_agg_load_dt timestamp(0) DEFAULT ('now'::text)::timestamp(6) with time
zone,
cash_available_principal numeric(15,6) DEFAULT 0,
cash_available_current numeric(15,6) DEFAULT 0,
cash_available_last numeric(15,6) DEFAULT 0,
cash_available_interest numeric(15,6) DEFAULT 0,
cash_available_fees numeric(15,6) DEFAULT 0,
cash_not_collected numeric(15,6) DEFAULT 0,
number_contacts_total int4 DEFAULT 0,
number_broken_commitments int4 DEFAULT 0,
loc_current_due_total numeric(15,6) DEFAULT 0,
loc_current_due_principal numeric(15,6) DEFAULT 0,
loc_current_due_interest numeric(15,6) DEFAULT 0,
loc_current_due_fees numeric(15,6) DEFAULT 0,
loc_past_due_last numeric(15,6) DEFAULT 0,
loc_past_due_total numeric(15,6) DEFAULT 0,
number_made_commitments int4 DEFAULT 0,
CONSTRAINT loan_account_agg_fact_pkey PRIMARY KEY
(loan_account_agg_fact_id)
)
WITH OIDS;

CREATE INDEX loan_account_agg_fact_main_account_status_dim_id
ON cla_dw.loan_account_agg_fact
USING btree
(main_account_status_dim_id)
TABLESPACE fact_idx_part1_ts;

Here's the DDL for the table laaf_new:

CREATE TABLE cla_dw.loan_account_agg_fact_new
(
loan_account_agg_fact_id bigserial NOT NULL,
dw_load_date_id int4 NOT NULL,
servicer_branch_dim_id int4 NOT NULL,
main_account_status_dim_id int4 NOT NULL,
product_dim_id int4 NOT NULL,
dw_agg_load_dt timestamp,
account_balance numeric(15,6) NOT NULL DEFAULT 0,
loan_count int4 NOT NULL DEFAULT 0,
principal numeric(15,6) NOT NULL DEFAULT 0,
interest numeric(15,6) NOT NULL DEFAULT 0,
fees numeric(15,6) NOT NULL DEFAULT 0,
gl_principal numeric(15,6) NOT NULL DEFAULT 0,
gl_interest numeric(15,6) NOT NULL DEFAULT 0,
accruable_principal numeric(15,6) DEFAULT 0,
unaccruable_principal numeric(15,6) DEFAULT 0,
calculated_principal numeric(15,6) DEFAULT 0,
current_interest numeric(15,6) DEFAULT 0,
past_due_interest numeric(15,6) DEFAULT 0,
cash_available numeric(15,6) DEFAULT 0,
cash_collected numeric(15,6) DEFAULT 0,
cash_available_principal numeric(15,6) DEFAULT 0,
cash_available_current numeric(15,6) DEFAULT 0,
cash_available_last numeric(15,6) DEFAULT 0,
cash_available_interest numeric(15,6) DEFAULT 0,
cash_available_fees numeric(15,6) DEFAULT 0,
cash_not_collected numeric(15,6) DEFAULT 0,
number_contacts_total int4 DEFAULT 0,
number_broken_commitments int4 DEFAULT 0,
loc_current_due_total numeric(15,6) DEFAULT 0,
loc_current_due_principal numeric(15,6) DEFAULT 0,
loc_current_due_interest numeric(15,6) DEFAULT 0,
loc_current_due_fees numeric(15,6) DEFAULT 0,
loc_past_due_last numeric(15,6) DEFAULT 0,
loc_past_due_total numeric(15,6) DEFAULT 0,
number_made_commitments int4 DEFAULT 0,
CONSTRAINT loan_account_agg_fact_pkey_new PRIMARY KEY
(loan_account_agg_fact_id) USING INDEX TABLESPACE default_ts
)
WITH OIDS TABLESPACE fact_data_part1_ts;

CREATE INDEX laafn_main_account_status_dim
ON cla_dw.loan_account_agg_fact_new
USING btree
(main_account_status_dim_id)
TABLESPACE fact_idx_part2_ts;

Responses

Browse pgsql-performance by date

  From Date Subject
Next Message Zoltan Boszormenyi 2006-06-15 18:19:10 Re: Precomputed constants?
Previous Message Bill Moran 2006-06-15 17:53:33 Re: Is it possible to start two instances of postgresql?