Re: Optimize date query for large child tables: GiST or GIN?

From: Yeb Havinga <yebhavinga(at)gmail(dot)com>
To: David Jarvis <thangalin(at)gmail(dot)com>
Cc: Stephen Frost <sfrost(at)snowman(dot)net>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Thom Brown <thombrown(at)gmail(dot)com>, pgsql-performance(at)postgresql(dot)org
Subject: Re: Optimize date query for large child tables: GiST or GIN?
Date: 2010-05-21 08:38:43
Message-ID: 4BF64693.8050800@gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

David Jarvis wrote:
>
> Also, you're trying to do constraint_exclusion, but have you made sure
> that it's turned on? And have you made sure that those
> constraints are
> really the right ones and that they make sense? You're using a
> bunch of
> extract()'s there too, why not just specify a CHECK constraint on the
> date ranges which are allowed in the table..?
>
>
> I don't know what the date ranges are? So I can't partition them by year?
>
> Right now I created 72 child tables by using the category and month.
> This may have been a bad choice. But at least all the data is in the
> system now so dissecting or integrating it back in different ways
> shouldn't take days.
>
> Thanks everyone for all your help, I really appreciate the time you've
> taken to guide me in the right direction to make the system as fast as
> it can be.

My $0.02 - its hard to comment inline due to the number of responses,
but: the partitioning is only useful for speed, if it matches how your
queries select data. For time based data I would for sure go for year
based indexing. If you want a fixed number of partitions, you could
perhaps do something like year % 64. I did a test to see of the
constraint exclusion could work with extract but that failed:

test=# create table parent(t timestamptz);
test=# create table child1(check ((extract(year from t)::int % 2)=0))
inherits( parent);
test=# create table child2(check ((extract(year from t)::int % 2)=1))
inherits(parent);
test=# explain select * from parent where (extract(year from t)::int %
2) = 0;
QUERY PLAN
---------------------------------------------------------------------------
Result (cost=0.00..158.40 rows=33 width=8)
-> Append (cost=0.00..158.40 rows=33 width=8)
-> Seq Scan on parent (cost=0.00..52.80 rows=11 width=8)
Filter: (((date_part('year'::text, t))::integer % 2) = 0)
-> Seq Scan on child1 parent (cost=0.00..52.80 rows=11 width=8)
Filter: (((date_part('year'::text, t))::integer % 2) = 0)
-> Seq Scan on child2 parent (cost=0.00..52.80 rows=11 width=8)
Filter: (((date_part('year'::text, t))::integer % 2) = 0)

It hits all partitions even when I requested for a single year.

So an extra column would be needed, attempt 2 with added year smallint.

test=# create table parent(t timestamptz, y smallint);
test=# create table child1(check ((y % 2)=0)) inherits( parent);
test=# create table child2(check ((y % 2)=1)) inherits( parent);
test=# explain select * from parent where (y % 2) between 0 and 0;
QUERY
PLAN
---------------------------------------------------------------------------------
Result (cost=0.00..122.00 rows=20 width=10)
-> Append (cost=0.00..122.00 rows=20 width=10)
-> Seq Scan on parent (cost=0.00..61.00 rows=10 width=10)
Filter: ((((y)::integer % 2) >= 0) AND (((y)::integer %
2) <= 0))
-> Seq Scan on child1 parent (cost=0.00..61.00 rows=10 width=10)
Filter: ((((y)::integer % 2) >= 0) AND (((y)::integer %
2) <= 0))

This works: only one child table hit.

That made me think: if you'd scan two consecutive years, you'd always
hit two different partitions. For your use case it'd be nice if some
year wraparounds would fall in the same partition. The following query
shows partition numbers for 1900 - 2010 with 4 consecutive years in the
same partition. It also shows that in this case 32 partitions is enough:

test=# select x, (x / 4) % 32 from generate_series(1900,2010) as x(x);
x | ?column?
------+----------
1900 | 27
1901 | 27
1902 | 27
1903 | 27
1904 | 28
1905 | 28
etc
1918 | 31
1919 | 31
1920 | 0
1921 | 0
etc
2005 | 21
2006 | 21
2007 | 21
2008 | 22
2009 | 22
2010 | 22
(111 rows)

This would mean that a extra smallint column is needed which would
inflate the 300M relation with.. almost a GB, but I still think it'd be
a good idea.

create or replace function yearmod(int) RETURNS int
as 'select (($1 >> 2) %32);'
language sql
immutable
strict;

create table parent(t timestamptz, y smallint);

select 'create table child'||x||'(check (yearmod(y)='||x-1||'))
inherits(parent);' from generate_series(1,32) as x(x);
?column?
---------------------------------------------------------------
create table child1(check (yearmod(y)=0)) inherits(parent);
create table child2(check (yearmod(y)=1)) inherits(parent);
create table child3(check (yearmod(y)=2)) inherits(parent);
etc
create table child30(check (yearmod(y)=29)) inherits(parent);
create table child31(check (yearmod(y)=30)) inherits(parent);
create table child32(check (yearmod(y)=31)) inherits(parent);
(32 rows)

Copy and paste output of this query in psql to create child tables.

Example with period 1970 to 1980:

test=# explain select * from parent where yearmod(y) between
yearmod(1970) and yearmod(1980);
QUERY
PLAN
-----------------------------------------------------------------------------
Result (cost=0.00..305.00 rows=50 width=10)
-> Append (cost=0.00..305.00 rows=50 width=10)
-> Seq Scan on parent (cost=0.00..61.00 rows=10 width=10)
Filter: ((((y / 4) % 32) >= 12) AND (((y / 4) % 32) <= 15))
-> Seq Scan on child13 parent (cost=0.00..61.00 rows=10 width=10)
Filter: ((((y / 4) % 32) >= 12) AND (((y / 4) % 32) <= 15))
-> Seq Scan on child14 parent (cost=0.00..61.00 rows=10 width=10)
Filter: ((((y / 4) % 32) >= 12) AND (((y / 4) % 32) <= 15))
-> Seq Scan on child15 parent (cost=0.00..61.00 rows=10 width=10)
Filter: ((((y / 4) % 32) >= 12) AND (((y / 4) % 32) <= 15))
-> Seq Scan on child16 parent (cost=0.00..61.00 rows=10 width=10)
Filter: ((((y / 4) % 32) >= 12) AND (((y / 4) % 32) <= 15))
(12 rows)

This works: query for 11 consecutive years hits only 4 from 31.

But the between fails for yearmods that wrap the 31 boundary, what
happens here between 1910 and 1920

test=# explain select * from parent where yearmod(y) between
yearmod(1910) and yearmod(1920);
QUERY PLAN
------------------------------------------
Result (cost=0.00..0.01 rows=1 width=0)
One-Time Filter: false
(2 rows)

So for the wraparound case we need a CASE:

test=# explain select * from parent where case when yearmod(1910) <=
yearmod(1920)
then yearmod(y) between yearmod(1910) and yearmod(1920)
else (yearmod(y) >= yearmod(1910) or yearmod(y) <= yearmod(1920)) end;
QUERY
PLAN
-------------------------------------------------------------------------------
Result (cost=0.00..305.00 rows=5665 width=10)
-> Append (cost=0.00..305.00 rows=5665 width=10)
-> Seq Scan on parent (cost=0.00..61.00 rows=1133 width=10)
Filter: ((((y / 4) % 32) >= 29) OR (((y / 4) % 32) <= 0))
-> Seq Scan on child1 parent (cost=0.00..61.00 rows=1133
width=10)
Filter: ((((y / 4) % 32) >= 29) OR (((y / 4) % 32) <= 0))
-> Seq Scan on child30 parent (cost=0.00..61.00 rows=1133
width=10)
Filter: ((((y / 4) % 32) >= 29) OR (((y / 4) % 32) <= 0))
-> Seq Scan on child31 parent (cost=0.00..61.00 rows=1133
width=10)
Filter: ((((y / 4) % 32) >= 29) OR (((y / 4) % 32) <= 0))
-> Seq Scan on child32 parent (cost=0.00..61.00 rows=1133
width=10)
Filter: ((((y / 4) % 32) >= 29) OR (((y / 4) % 32) <= 0))
(12 rows)

This should work for all year ranges and I think is a good solution for
partitioning on year with a fixed amount of partitions.

From the optimizer perspective I wonder what the best access path for
this kind of query would be (if there would be no partitions). Building
on ideas from one of Thom Brown's first replies with indexes on year and
doy, and Tom Lane's remark about the leap year problem. Suppose the leap
years did not exist, having a index on year, and having a different
index on doy, sounds like a bitmap and of a scan of both the year and
doy indexes could provide a optimal path. Maybe this would still be
possible, if the leap year problem could be 'fixed' by a additional
condition in the where clause that filters the surplus records.

regards,
Yeb Havinga

In response to

Responses

Browse pgsql-performance by date

  From Date Subject
Next Message Yeb Havinga 2010-05-21 08:45:36 Re: Optimize date query for large child tables: GiST or GIN?
Previous Message David Jarvis 2010-05-21 04:11:52 Re: Optimize date query for large child tables: GiST or GIN?