Re: index usage on queries on inherited tables

From: Samuel Gendler <sgendler(at)ideasculptor(dot)com>
To: Joseph Shraibman <jks(at)selectacast(dot)net>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-performance(at)postgresql(dot)org
Subject: Re: index usage on queries on inherited tables
Date: 2011-04-28 00:18:17
Message-ID: BANLkTiktd3hKAV9ntLWTxPCdz0Lh7usSUw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

On Wed, Apr 27, 2011 at 2:11 PM, Joseph Shraibman <jks(at)selectacast(dot)net>wrote:

> On 04/27/2011 04:32 PM, Robert Haas wrote:
> > In the first case, PostgreSQL evidently thinks that using the indexes
> > will be slower than just ignoring them. You could find out whether
> > it's right by trying it with enable_seqscan=off.
>
> My point is that this is just a problem with inherited tables. It
> should be obvious to postgres that few rows are being returned, but in
> the inherited tables case it doesn't use indexes. This was just an
> example. In a 52 gig table I have a "select id from table limit 1 order
> by id desc" returns instantly, but as soon as you declare a child table
> it tries to seq scan all the tables.
>
>
If I'm understanding correctly, this kind of obviates the utility of
partitioning if you structure a warehouse in a traditional manner. Assuming
a fact table partitioned by time, but with foreign keys to a time dimension,
it is now not possible to gain any advantage from the partitioning if
selecting on columns in the time dimension.

"select * from fact_table f join time_dimension t on f.time_id = t.time_id
where t.quarter=3 and t.year = 2010" will scan all partitions of the fact
table despite the fact that all of the rows would come from 3 partitions,
assuming a partitioning schema that uses one partition for each month.

I use a time id that is calculable from the from the timestamp so it doesn't
need to be looked up, and partitioning on time_id directly is easy enough to
handle, but if I'm understanding the problem, it sounds like nothing short
of computing the appropriate time ids before issuing the query and then
including a 'where f.time_id between x and y' clause to the query will
result in the partitions being correctly excluded. Is that what people are
doing to solve this problem? The alternative is to leave a timestamp column
in the fact table (something I tend to do since it makes typing ad-hoc
queries in psql much easier) and partition on that column and then always
include a where clause for that column that is at least as large as the
requested row range. Both result in fairly ugly queries, though I can
certainly see how I might structure my code to always build queries which
adhere to this.

I'm just in the process of designing a star schema for a project and was
intending to use exactly the structure I described at the top of the email.
Is there a postgres best-practices for solving this problem? There's no way
I can get away without partitioning. I'm looking at a worst case table of
100,000 rows being written every 5 minutes, 24x7 - 29 million rows per day,
a billion rows per month - with most queries running over a single month or
comparing same months from differing years and quarters - so a month based
partitioning. Normal case is closer to 10K rows per 5 minutes.

Suggestions?

--sam

In response to

Browse pgsql-performance by date

  From Date Subject
Next Message Sok Ann Yap 2011-04-28 00:19:01 Re: reducing random_page_cost from 4 to 2 to force index scan
Previous Message Kevin Grittner 2011-04-27 23:23:36 Re: reducing random_page_cost from 4 to 2 to force index scan