Re: how many record versions

From: Greg Stark <gsstark(at)mit(dot)edu>
To: David Garamond <lists(at)zara(dot)6(dot)isreserved(dot)com>
Cc: Greg Stark <gsstark(at)mit(dot)edu>, pgsql-general(at)postgresql(dot)org
Subject: Re: how many record versions
Date: 2004-05-24 14:04:10
Message-ID: 87pt8tq5ut.fsf@stark.xeocode.com
Lists: pgsql-general

David Garamond <lists(at)zara(dot)6(dot)isreserved(dot)com> writes:

> Greg Stark wrote:
> >>Actually, each record will be incremented probably only thousands of times a
> >>day. But there are many banners. Each record has a (bannerid, campaignid,
> >>websiteid, date, countrycode) "dimensions" and (impression, click) "measures".
> > In the past when I had a very similar situation we kept the raw impression and
> > click event data. I.e., one record per impression in the impression table and
> > one record per click in the click table.
>
> > That makes the tables insert-only which is efficient and not prone to locking
> > contention. They would never have to be vacuumed except after purging old data.
>
> Assuming there are 10 million impressions per day, the impression table
> will grow at least 200-400MB per day, is that correct? What do you do and how
> often do you purge old data? Do you do a mass DELETE on the impression table
> itself or do you switch to another table? I've found that deleting
> tens/hundreds of thousands of rows, at least in InnoDB, takes a long, long time
> (plus it sucks CPU and slows other queries).

Well, this was actually under Oracle, but I can extrapolate to Postgres given
my experience.

The ideal tool for the job is a feature that Postgres has discussed but hasn't
implemented yet, "partitioned tables". Under Oracle with partitioned tables we
were able to drop entire partitions virtually instantaneously. It also made
copying the data out to near-line backups much more efficient than index
scanning.
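
For example, with the raw table range-partitioned by date, purging a day of
old data is a single DDL statement instead of millions of row deletes. Roughly
what we did under Oracle (the table, column, and partition names here are
invented for illustration):

    -- Raw impression table, partitioned by day (Oracle syntax; names invented)
    CREATE TABLE impressions (
        bannerid     NUMBER,
        campaignid   NUMBER,
        websiteid    NUMBER,
        hit_date     DATE,
        countrycode  CHAR(2)
    )
    PARTITION BY RANGE (hit_date) (
        PARTITION p20040523 VALUES LESS THAN (DATE '2004-05-24'),
        PARTITION p20040524 VALUES LESS THAN (DATE '2004-05-25')
    );

    -- Purging a whole day later is nearly instantaneous:
    ALTER TABLE impressions DROP PARTITION p20040523;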

Before we implemented partitioned tables we used both techniques you
described. At first we had an ad-hoc procedure of creating a new table and
swapping it out. But that involved a short downtime and was a manual process.
Eventually we set up an automated batch job which used deletes.
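
If you do go the swap-in-a-new-table route under Postgres, the DDL is
transactional, so the swap itself can be done atomically. A rough sketch (the
schema is only illustrative):

    BEGIN;
    -- Move the current table aside and put an empty one in its place.
    ALTER TABLE impressions RENAME TO impressions_old;
    CREATE TABLE impressions (
        bannerid     integer,
        campaignid   integer,
        websiteid    integer,
        hit_date     date,
        countrycode  char(2)
    );
    COMMIT;

    -- Once impressions_old has been archived, dropping it is cheap:
    DROP TABLE impressions_old;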

Deletes under Postgres should be fairly efficient. The I/O is unavoidable, so
doing it during off-peak hours would still be wise, but it shouldn't otherwise
interfere with other queries. There should be no locking contention, no
additional work for other queries (like checking rollback segments or logs),
and none of the other problems other databases suffer from with large updates.
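
The batch job itself can be as simple as a nightly cron entry that runs
something along these lines, keeping (say) 30 days of raw data (names are
again just illustrative):

    -- Purge old raw impressions during off-peak hours.
    DELETE FROM impressions WHERE hit_date < current_date - 30;

    -- Reclaim the dead tuples afterwards so the table doesn't bloat.
    VACUUM ANALYZE impressions;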

I find the 10 million impressions per day pretty scary though. That's over
100/s averaged across the entire 24-hour period (10,000,000 / 86,400 is about
116/s), and probably twice that at peak hours. That would have to be one
pretty beefy server just to handle the transaction processing itself. (And
updates under Postgres are essentially inserts where vacuum cleans up the old
tuple later, so they would be no less taxing.) A server capable of handling
that ought to be able to make quick work of deleting a few hundred megabytes
of records.

Another option is simply logging this data to a text file, or to multiple
text files, one per server. Then you can load the text files with batch loads
offline. This avoids slowing down the servers handling the transactions in
the critical path, but it's yet more complex, with more points of failure.
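
The offline load itself would just be COPY, assuming the ad servers write a
simple tab-delimited log (the columns and path are illustrative):

    -- Bulk-load one server's impression log for the day.
    COPY impressions (bannerid, campaignid, websiteid, hit_date, countrycode)
        FROM '/var/log/adserver/impressions.log';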

Something else you might be interested in is using a tool like this:

http://www.danga.com/memcached/

I could see it being useful for caching the counts you were looking to keep so
that the ad server doesn't need to consult the database to calculate which ad
to show. A separate job could periodically sync the counts to the database or
from the database.
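
The database side of that sync job could be as simple as folding the deltas
read out of memcached into the summary table every few minutes (the table,
columns, and values below are purely illustrative):

    -- Apply the impression/click deltas accumulated in memcached
    -- since the last sync to the matching summary row.
    UPDATE banner_stats
       SET impression = impression + 1500,
           click      = click + 37
     WHERE bannerid = 42 AND campaignid = 7 AND websiteid = 3
       AND hit_date = current_date AND countrycode = 'id';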

--
greg
