Re: indexing for distinct search in timestamp based table

From: "Scott Carey" <scott(at)richrelevance(dot)com>
To: "David Rowley" <dgrowley(at)gmail(dot)com>
Cc: "Rainer Mager" <rainer(at)vanten(dot)com>, pgsql-performance(at)postgresql(dot)org
Subject: Re: indexing for distinct search in timestamp based table
Date: 2008-08-28 23:01:58
Message-ID: a1ec7d000808281601s6077f190s8b6dec8b1f38260f@mail.gmail.com
Lists: pgsql-performance

Another suggestion is to partition the table by date range. If most of the
range queries fall on particular windows of time, this will make those
queries more efficient, and it will improve the locality and efficiency of
all indexes on the table.
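
As a rough sketch (hypothetical child-table names; this assumes the
inheritance-based partitioning available in the 8.x releases, since there is
no native declarative partitioning), monthly partitions on start_time might
look like:

-- Children hold one month each; the parent stays empty.
CREATE TABLE ad_log_2008_07 (
    CHECK (start_time >= '2008-07-01' AND start_time < '2008-08-01')
) INHERITS (ad_log);

CREATE TABLE ad_log_2008_08 (
    CHECK (start_time >= '2008-08-01' AND start_time < '2008-09-01')
) INHERITS (ad_log);

-- Each child needs its own indexes.
CREATE INDEX ad_log_2008_07_start_time ON ad_log_2008_07 (start_time);
CREATE INDEX ad_log_2008_08_start_time ON ad_log_2008_08 (start_time);

-- Lets the planner skip children whose CHECK constraint excludes the
-- queried range. Inserts still have to be routed to the right child
-- (by a trigger, a rule, or the application).
SET constraint_exclusion = on;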

This is more work than a simple table CLUSTER, especially in maintenance
overhead, but it generally helps a lot in cases like these. Additionally, if
rows don't change much after some period of time, the partitions older than
the modification window can be vacuumed, clustered, and reindexed if needed,
making them as efficient as possible and maintenance free after that point
(other than backups and archives).
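
For example, on a partition that has aged out of the modification window
(reusing the hypothetical ad_log_2008_07 from above; CLUSTER ... USING is
the 8.3 syntax, and CLUSTER also rebuilds the table's indexes, so a separate
REINDEX is usually unnecessary):

-- One-time cleanup of a now-static partition.
VACUUM ANALYZE ad_log_2008_07;
CLUSTER ad_log_2008_07 USING ad_log_2008_07_start_time;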

Another benefit of partitioning is in backup / restore. You can incrementally
back up only the table partitions that have changed -- for large databases
this reduces pg_dump and pg_restore times substantially. To do this you
combine table-name patterns with the pg_dump "exclude tables" or "include
tables" flags.

On Thu, Aug 28, 2008 at 3:48 PM, David Rowley <dgrowley(at)gmail(dot)com> wrote:

> I once also had a similar performance problem when looking for all matching
> rows between two timestamps. In fact, that's why I'm here today. The problem
> was with MySQL. I had some tables of around 10 million rows and all my
> searching was timestamp based. MySQL didn't do what I wanted. I found that
> clustering the table in PostgreSQL made it lightning quick, though the sets
> of matching rows I was working with were mostly not much over the 100k mark.
> I'm wondering if clustering the table on ad_log_start_time will help cut
> down on random reads.
>
> That's if you can afford to block the users while PostgreSQL clusters the
> table.
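>
> A minimal sketch of what that would look like here (assuming the existing
> ad_log_start_time index; CLUSTER ... USING is the 8.3 syntax, and older
> releases spell it CLUSTER indexname ON tablename):
>
> -- Rewrites the table in start_time order and rebuilds its indexes;
> -- takes an exclusive lock on the table for the duration.
> CLUSTER ad_log USING ad_log_start_time;
> -- Refresh planner statistics after the rewrite.
> ANALYZE ad_log;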
>
> If you're inserting in order of the start_time column (which I was), then
> the cluster should almost maintain itself (I think), providing you're not
> updating or deleting. I'd assume you're not, since it looks like a log
> table.
>
> David.
>
>
> -----Original Message-----
> From: pgsql-performance-owner(at)postgresql(dot)org
> [mailto:pgsql-performance-owner(at)postgresql(dot)org] On Behalf Of Rainer Mager
> Sent: 28 August 2008 09:06
> To: pgsql-performance(at)postgresql(dot)org
> Subject: [PERFORM] indexing for distinct search in timestamp based table
>
> I'm looking for some help in speeding up searches. My table is pretty simple
> (see below), but somewhat large and continuously growing. Currently it has
> about 50 million rows.
>
> The table is (I know I have excessive indexes; I'm trying to find the
> appropriate ones and drop the extras):
> Table "public.ad_log"
> Column | Type |
> Modifiers
>
> --------------+-----------------------------+-------------------------------
> -----------------------------
> ad_log_id | integer | not null default
> nextval('ad_log_ad_log_id_seq'::regclass)
> channel_name | text | not null
> player_name | text | not null
> ad_name | text | not null
> start_time | timestamp without time zone | not null
> end_time | timestamp without time zone | not null
> Indexes:
> "ad_log_pkey" PRIMARY KEY, btree (ad_log_id)
> "ad_log_channel_name_key" UNIQUE, btree (channel_name, player_name,
> ad_name, start_time, end_time)
> "ad_log_ad_and_start" btree (ad_name, start_time)
> "ad_log_ad_name" btree (ad_name)
> "ad_log_all" btree (channel_name, player_name, start_time, ad_name)
> "ad_log_channel_name" btree (channel_name)
> "ad_log_end_time" btree (end_time)
> "ad_log_player_and_start" btree (player_name, start_time)
> "ad_log_player_name" btree (player_name)
> "ad_log_start_time" btree (start_time)
>
>
>
> The query I'm trying to speed up is below. In it, the <field> tag can be one
> of channel_name, player_name, or ad_name. I'm actually trying to return the
> distinct values, and I found GROUP BY to be slightly faster than using
> DISTINCT. Also, any of those fields may be unspecified in the WHERE clause,
> in which case we use '%', but it seems Postgres optimizes that pretty well.
>
> SELECT <field> FROM ad_log
> WHERE channel_name LIKE :channel_name
> AND player_name LIKE :player_name
> AND ad_name LIKE :ad_name
> AND start_time BETWEEN :start_date AND (date(:end_date) + 1)
> GROUP BY <field> ORDER BY <field>
>
>
> A typical query is:
>
> explain analyze SELECT channel_name FROM ad_log WHERE channel_name LIKE '%'
> AND ad_name LIKE '%' AND start_time BETWEEN '2008-07-01' AND
> (date('2008-07-28') + 1) GROUP BY channel_name ORDER BY channel_name;
>
> with the result being:
>
>                                                                  QUERY PLAN
> ---------------------------------------------------------------------------------------------------------------------------------------------
>  Sort  (cost=1163169.02..1163169.03 rows=5 width=10) (actual time=75460.187..75460.192 rows=15 loops=1)
>    Sort Key: channel_name
>    Sort Method:  quicksort  Memory: 17kB
>    ->  HashAggregate  (cost=1163168.91..1163168.96 rows=5 width=10) (actual time=75460.107..75460.114 rows=15 loops=1)
>          ->  Bitmap Heap Scan on ad_log  (cost=285064.30..1129582.84 rows=13434427 width=10) (actual time=8506.250..65771.597 rows=13701296 loops=1)
>                Recheck Cond: ((start_time >= '2008-07-01 00:00:00'::timestamp without time zone) AND (start_time <= '2008-07-29'::date))
>                Filter: ((channel_name ~~ '%'::text) AND (ad_name ~~ '%'::text))
>                ->  Bitmap Index Scan on ad_log_start_time  (cost=0.00..281705.70 rows=13434427 width=0) (actual time=8488.443..8488.443 rows=13701296 loops=1)
>                      Index Cond: ((start_time >= '2008-07-01 00:00:00'::timestamp without time zone) AND (start_time <= '2008-07-29'::date))
>  Total runtime: 75460.361 ms
>
>
> It seems to me there should be some way to create an index to speed this up,
> but the various ones I've tried so far haven't helped. Any suggestions would
> be greatly appreciated.
>
>
> --
> Sent via pgsql-performance mailing list (pgsql-performance(at)postgresql(dot)org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-performance
