Re: Huge Data sets, simple queries

From: "Craig A(dot) James" <cjames(at)modgraph-usa(dot)com>
To: Mike Biamonte <mike(at)dbeat(dot)com>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: Huge Data sets, simple queries
Date: 2006-01-30 03:19:46
Message-ID: 43DD85D2.5020207@modgraph-usa.com
Lists: pgsql-performance

Mike Biamonte wrote:
> Does anyone have any experience with extremely large data sets?
> I mean hundreds of millions of rows.
>
> The queries I need to run on my 200 million transactions are relatively
> simple:
>
> select month, count(distinct(cardnum)), count(*), sum(amount) from
> transactions group by month;

This may be heretical to post to a relational-database group, but sometimes a problem can be better solved OUTSIDE of the relational system.
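
To make that concrete with your query: the whole thing can be computed in one streaming pass over the table outside the database. This is only a hypothetical sketch -- the DSN is yours to fill in, psycopg2 with a server-side cursor is my assumption for how the rows get pulled out of Postgres, and the per-month cardnum sets still have to fit in memory.

    # One streaming pass computing, per month: count(distinct cardnum),
    # count(*), and sum(amount).  Hypothetical sketch; the table and column
    # names come from the query above, psycopg2 is an assumption.

    import psycopg2

    def monthly_rollup(dsn):
        conn = psycopg2.connect(dsn)
        cur = conn.cursor(name="txn_stream")   # named cursor: rows stream in batches
        cur.execute("SELECT month, cardnum, amount FROM transactions")

        cards  = {}   # month -> set of distinct cardnums seen
        counts = {}   # month -> number of rows
        sums   = {}   # month -> running sum of amount
        for month, cardnum, amount in cur:
            cards.setdefault(month, set()).add(cardnum)
            counts[month] = counts.get(month, 0) + 1
            sums[month]   = sums.get(month, 0) + amount

        cur.close()
        conn.close()
        return {m: (len(cards[m]), counts[m], sums[m]) for m in counts}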

I had a similar problem recently: I have a set of about 100,000 distinct values, each of which occurs anywhere from once to several million times, for an aggregate total of several hundred million occurrences in the database.

Sorting this into distinct lists ("Which rows contain this value?") proved quite time-consuming (just like your case), but on reflection I realized it was dumb to expect a general-purpose sorting algorithm to sort a list about which I had specialized knowledge. General-purpose sorting takes O(N*log(N)), but with a small number of distinct values you can use "bucket sorting" and finish in O(N) time, a huge improvement. In my case it was even more specialized -- a very small number of the lists contained thousands or millions of items, but about 95% of the lists had only a few items.
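
The core of the idea is a single O(N) grouping pass rather than a sort. A minimal sketch of the bucketing technique (not the actual system I wrote; the data layout here is made up for illustration):

    # Group N (value, row_id) pairs by value in one O(N) pass instead of an
    # O(N*log(N)) sort.  This pays off when the number of distinct values is
    # small relative to N -- roughly 100,000 values vs. hundreds of millions of rows.

    from collections import defaultdict

    def bucket_group(pairs):
        """pairs: iterable of (value, row_id); returns dict value -> [row_ids]."""
        buckets = defaultdict(list)
        for value, row_id in pairs:        # one pass over the data
            buckets[value].append(row_id)  # amortized O(1) per row
        return buckets

    # 95% of the buckets end up with a few items, a handful end up with millions;
    # either way the cost is one append per input row.
    sample = [("A", 1), ("B", 2), ("A", 3), ("C", 4), ("A", 5)]
    print(dict(bucket_group(sample)))
    # {'A': [1, 3, 5], 'B': [2], 'C': [4]}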

Armed with this knowledge, it took me a couple of weeks to write a highly specialized sorting system that used a combination of Postgres, in-memory and disk caching, and algorithms dredged up from Knuth. The final result ran in about four hours.
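
The overall shape was roughly the following, although this is only a hypothetical sketch: the table and column names are invented, and psycopg2 with a server-side cursor stands in for however you actually stream the rows out of Postgres. Buckets that grow large get spilled to per-value files on disk; the many small ones stay in memory until the end.

    # Stream rows out of Postgres and append each value's row ids to its own
    # bucket file, keeping only a bounded in-memory list per value.
    # Hypothetical sketch: table/column names, DSN, and psycopg2 are assumptions.

    import os
    import psycopg2

    FLUSH_AT = 10000   # spill a bucket to disk once it holds this many ids

    def spill(bucket_dir, value, ids):
        # Assumes values are safe to use as file names.
        with open(os.path.join(bucket_dir, str(value)), "a") as f:
            f.write("\n".join(str(i) for i in ids) + "\n")

    def build_buckets(dsn, bucket_dir="buckets"):
        os.makedirs(bucket_dir, exist_ok=True)
        cache = {}                            # value -> row ids not yet on disk
        conn = psycopg2.connect(dsn)
        cur = conn.cursor(name="stream")      # named cursor: rows fetched in batches
        cur.execute("SELECT value, row_id FROM occurrences")
        for value, row_id in cur:
            ids = cache.setdefault(value, [])
            ids.append(row_id)
            if len(ids) >= FLUSH_AT:          # the big buckets go to disk
                spill(bucket_dir, value, ids)
                ids.clear()
        for value, ids in cache.items():      # the ~95% small buckets flush at the end
            if ids:
                spill(bucket_dir, value, ids)
        cur.close()
        conn.close()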

The thing to remember about relational databases is that the designers are constrained by the need for generality, reliability and SQL standards. Given any particular well-defined task where you have specialized knowledge about the data, and/or you don't care about transactional correctness, and/or you're not concerned about data loss, a good programmer can always write a faster solution.

Of course, there's a huge penalty. You lose support, you lose generality, the application takes on complexity that should be in the database, and on and on. A hand-crafted solution should be avoided unless there's simply no other way.

A relational database is a tool. Although powerful, like any tool it has limitations. Use the tool where it's useful, and use other tools when necessary.

Craig
