Performance on large, append-only tables

From: David Yeu <david(dot)yeu(at)skype(dot)net>
To: "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Cc: "ops(at)groupme(dot)com" <ops(at)groupme(dot)com>
Subject: Performance on large, append-only tables
Date: 2012-02-08 18:03:04
Message-ID: CB582308.955%david.yeu@skype.net
Lists: pgsql-performance

Hi there,

We've got a pretty large table that sees millions of new rows a day, and
we're trying our best to optimize queries against it. We're hoping to find
some guidance on this list.

Thankfully, the types of queries that we perform against this table are
pretty constrained. We never update rows and we never join against other
tables. The table essentially looks like this:

| id | group_id | created_at | everything else…

Where `id' is an auto-incrementing primary key, `group_id' is the
foreign key we always scope against, and `created_at' is the insertion
time. We have indexes on the primary key and on group_id.
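
For concreteness, here is a rough sketch of the schema; the table name
and column types are illustrative, not our exact DDL:

    CREATE TABLE messages (
        id         bigserial PRIMARY KEY,   -- auto-incrementing id
        group_id   bigint NOT NULL,         -- the key we always scope by
        created_at timestamptz NOT NULL     -- insertion time
        -- plus the "everything else" columns
    );

    CREATE INDEX messages_group_id_idx ON messages (group_id);
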
Our queries essentially fall into the following cases:

* … WHERE group_id = ? ORDER BY created_at DESC LIMIT 20;
* … WHERE group_id = ? AND id > ? ORDER BY created_at DESC;
* … WHERE group_id = ? AND id < ? ORDER BY created_at DESC LIMIT 20;
* … WHERE group_id = ? ORDER BY created_at DESC LIMIT 20 OFFSET ?;

In human words, we're looking for:

* The most recent (20) rows.
* The most recent rows after a given `id'.
* Twenty rows before a given `id'.
* Pages of twenty rows.
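
Written out in full with placeholder values (and the hypothetical table
name from the sketch above), the paging case looks like:

    SELECT *
    FROM messages
    WHERE group_id = 123
    ORDER BY created_at DESC
    LIMIT 20 OFFSET 40;    -- page three, twenty rows per page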

Originally, this table was part of our primary database, but recently we
saw queries take upwards of thirty seconds to complete. Since we're
serving web requests, that's basically unacceptable, and it caused a lot
of requests to back up. Our interim solution has been to carve out a new
database that hosts only this table, and that has worked to some degree.
We aren't seeing thirty-second-plus response times anymore, but some
queries still take many seconds, and spinning up a new master-slave
configuration hasn't been cheap.

In the meantime, we're hoping to investigate other ways to optimize this
table and the queries against it. Heroku's data team has suggested
balling up these rows into arrays, where a single row would represent a
group_id and the data would occupy a single array column. We don't have
any experience with this and were wondering whether anyone here has
tried it.
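
As we understand it, the suggestion would look something like the sketch
below; the table and column names are ours, and text[] is just one
possible element type:

    CREATE TABLE grouped_messages (
        group_id bigint PRIMARY KEY,
        messages text[]    -- one serialized message per element
    );

    -- Appending a message becomes an UPDATE instead of an INSERT:
    UPDATE grouped_messages
    SET messages = array_append(messages, 'serialized message')
    WHERE group_id = 123;

Reads would then fetch a single row per group and slice the array on the
client.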

And finally, we're trying out alternative stores, since it seems like
this data and its retrieval pattern could be well suited to key-value or
document-oriented backends. Redis and DynamoDB are currently the leading
contenders.

Thanks in advance for any help,

Regards,

Dave Yeu & Neil Sarkar
GroupMe
