Re: Index only scan

From: Ondrej Ivanič <ondrej(dot)ivanic(at)gmail(dot)com>
To: Lars Helge Øverland <larshelge(at)gmail(dot)com>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: Index only scan
Date: 2012-10-10 23:36:20
Message-ID: CAM6mieKhkwhgkzsO2aqe51v_HOHGvB89g9BEsE_fypOBLhSm=w@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

Hi,

On 10 October 2012 23:03, Lars Helge Øverland <larshelge(at)gmail(dot)com> wrote:
> We are now in the process of designing a new component for analytics
> and this feature got me thinking we could utilize postgres over other
> alternatives like column-oriented databases. Basically we will have a
> wide, denormalized table with 20+ columns with relatively low
> cardinality. Typically we will have queries which sums a fact column
> based on where/group by clauses on several dimension columns (standard
> data warehouse stuff). An example would be "select a, b, c, sum(d)
> from analytics where a=1 and b=2 group by a,b,c";
>
> Finally, is there anyone else who are using postgres for this purpose
> and have some good tips to share in order to achieve good performance,
> including index strategies, beyond the standard config best practices?

yes, we had fact table which has around 250 columns and 250mil rows.
The question is if you can partition your data set. For example,
monthly partition. This keeps indexes small but all queries must be
constrained by the same column as is used for partitioning (ie.
monthly partitions -> every query should have "datetime between ...
and ...")

From my experience postgres is not good with large group by queries.
For example, your query:
select a, b, c, sum(d) from analytics where a=1 and b=2 group by a,b,c

could be executed over multiple connections:
insert into t select select a, b, c, sum(d) as d from analytics where
c >= val1 and c < val2 and a=1 and b=2 group by a,b,c
insert into t select select a, b, c, sum(d) as d from analytics where
c >= val2 and c < val3 and a=1 and b=2 group by a,b,c
...
insert into t select select a, b, c, sum(d) as d from analytics where
c >= valN-1 and c < valN and a=1 and b=2 group by a,b,c

and then get the final result:
select a, b, c, sum(d) from t group by a,b,c

You can use pgpool-II parallel query feature instead of manual slicing.

--
Ondrej Ivanic
(ondrej(dot)ivanic(at)gmail(dot)com)
(http://www.linkedin.com/in/ondrejivanic)

In response to

Browse pgsql-general by date

  From Date Subject
Next Message Tom Lane 2012-10-10 23:41:42 Re: Index only scan
Previous Message Greg Sabino Mullane 2012-10-10 23:36:18 Re: Planner chooses multi-column index in 9.2 when maybe it should not