Real vs Int performance

From: David Greco <David_Greco(at)harte-hanks(dot)com>
To: "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: Real vs Int performance
Date: 2011-01-26 19:31:58
Message-ID: CDFA9340E95A764E9366B4EDF3A43125BCACF7E68D@VA3DIAXVS091.RED001.local
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

New to Postgres and am prototyping a migration from Oracle to Postgres 9.0.1 on Linux. Starting with the data warehouse. Current process is to load the data from
our OLTP (Oracle), dump it into another instance of Oracle for staging and manipulation, then extract it and load it into Infobright. I am trying
to replace the Oracle instance used for staging and manipulation with Postgres. Kettle (PDI), a Java ETL tool, is used for this process.

Came across a problem I find perplexing. I recreated the dimensional tables in Oracle and the fields that are integers in Oracle became integers
in Postgres. Was experiencing terrible performance during the load and narrowed down to a particular dimensional lookup problem. The table
dim_carrier holds about 80k rows. You can see the actual query issued by Kettle below, but basically I am looking up using the business key from
our OLTP system. This field is carrier_source_id and is indexed as you can see below. If I change this field from an integer to a real, I get
about a 70x increase in performance of the query. The EXPLAIN ANALYZE output is nearly identical, except for the casting of 1 to a real when the column
is a real. In real life, this query is actually bound and parameterized, but I wished to simplify things a bit here (and don't yet know how to EXPLAIN ANALYZE a parameterized
query). Now in terms of actual performance, the same query executed about 25k times takes 7 seconds with the real column, and 500 seconds with the integer column.

What gives here? Seems like integer (or serial) is a pretty common choice for primary key columns, and therefore what I'm experiencing must be an anomoly.

Table "hits_olap.dim_carrier"
Column | Type | Modifiers
-------------------+-----------------------------+-----------
carrier_id | integer | not null
dim_version | smallint |
dim_effect_date | timestamp without time zone |
dim_expire_date | timestamp without time zone |
carrier_source_id | integer |
carrier_name | character varying(30) |
carrier_type | character varying(30) |
carrier_scac | character varying(4) |
carrier_currency | character varying(3) |
current_row | smallint | default 0
Indexes:
"dim_carrier_pk" PRIMARY KEY, btree (carrier_id)
"idx_dim_carrier_lookup" btree (carrier_source_id)

VACUUM
ANALYZE
REINDEX

EXPLAIN ANALYZE SELECT CARRIER_ID, DIM_VERSION FROM HITS_OLAP.DIM_CARRIER WHERE CARRIER_SOURCE_ID = '1' AND now() >= DIM_EFFECT_DATE
AND now() < DIM_EXPIRE_DATE;

QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------
Index Scan using idx_dim_carrier_lookup on dim_carrier (cost=0.00..12.10 rows=2 width=6) (actual time=0.076..0.077 rows=1 loops=1)
Index Cond: (carrier_source_id = 1)
Filter: ((now() >= dim_effect_date) AND (now() < dim_expire_date)) Total runtime: 0.108 ms
(4 rows)

ALTER TABLE
ALTER TABLE
Table "hits_olap.dim_carrier"
Column | Type | Modifiers
-------------------+-----------------------------+-----------
carrier_id | integer | not null
dim_version | smallint |
dim_effect_date | timestamp without time zone |
dim_expire_date | timestamp without time zone |
carrier_source_id | real |
carrier_name | character varying(30) |
carrier_type | character varying(30) |
carrier_scac | character varying(4) |
carrier_currency | character varying(3) |
current_row | smallint | default 0
Indexes:
"dim_carrier_pk" PRIMARY KEY, btree (carrier_id)
"idx_dim_carrier_lookup" btree (carrier_source_id)

VACUUM
ANALYZE
REINDEX

EXPLAIN ANALYZE SELECT CARRIER_ID, DIM_VERSION FROM HITS_OLAP.DIM_CARRIER WHERE CARRIER_SOURCE_ID = '1' AND now() >= DIM_EFFECT_DATE
AND now() < DIM_EXPIRE_DATE;

QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------
Index Scan using idx_dim_carrier_lookup on dim_carrier (cost=0.00..12.10 rows=2 width=6) (actual time=0.068..0.069 rows=1 loops=1)
Index Cond: (carrier_source_id = 1::real)
Filter: ((now() >= dim_effect_date) AND (now() < dim_expire_date)) Total runtime: 0.097 ms
(4 rows)

Thanks for the help,

Dave Greco

Responses

Browse pgsql-performance by date

  From Date Subject
Next Message Kevin Grittner 2011-01-26 21:52:51 Re: Real vs Int performance
Previous Message Shaun Thomas 2011-01-26 17:16:38 Re: FW: Queries becoming slow under heavy load