Re: Performance issues when the number of records are around 10 Million

From: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To: "venu madhav" <venutaurus539(at)gmail(dot)com>
Cc: <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Performance issues when the number of records are around 10 Million
Date: 2010-05-13 13:56:07
Message-ID: 4BEBBEA70200002500031699@gw.wicourts.gov
Lists: pgsql-performance

venu madhav <venutaurus539(at)gmail(dot)com> wrote:
> Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov> wrote:

>> > I calculate the count first.
>>
>> This and other comments suggest that the data is totally static
>> while this application is running. Is that correct?
>>
> No, the data gets added while the application is running. As I've
> mentioned before, it could be as fast as 100-400 records per
> second. And it is an important application which will be running
> 24/7.

Then how can you trust that the count you run before selecting is
accurate when you run the SELECT? Are they both in the same
REPEATABLE READ or SERIALIZABLE transaction?
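
If it helps to see it concretely, something like this (just a sketch;
"event_log", "ts", and "id" are invented names, not taken from your
schema) keeps both statements on the same snapshot, so the count
stays consistent with the SELECT even while rows are being inserted:

BEGIN ISOLATION LEVEL REPEATABLE READ;

-- count taken under this transaction's snapshot
SELECT count(*)
FROM event_log
WHERE ts BETWEEN m AND n;

-- page fetch under the same snapshot, so it agrees with the count
SELECT *
FROM event_log
WHERE ts BETWEEN m AND n
ORDER BY ts, id
LIMIT 20;

COMMIT;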

>> Also, you didn't address the issue of storing enough information
>> on the page to read off either edge in the desired sequence with
>> just a LIMIT and no offset. "Last page" or "page up" would need
>> to reverse the direction on the ORDER BY. This would be very
>> fast if you have appropriate indexes. Your current technique can
>> never be made very fast.
>>
> I actually didn't understand what you meant when you said
> "storing enough information on the page to read off either edge in
> the desired sequence with just a LIMIT and no offset". What kind
> of information can we store to improve the performance?

Well, for starters, it's entirely possible that the "hitlist"
approach posted by Craig James will work better for you than what
I'm about to describe. Be sure to read this post carefully:

http://archives.postgresql.org/pgsql-performance/2010-05/msg00058.php

The reason that might work better than the idea I was suggesting is
that the combination of selecting on timestamp and ordering by
something else might make it hard to use reasonable indexes to
position and limit well enough for the technique I was suggesting to
perform well. It's hard to say without testing.

For what I was describing, you must use an ORDER BY which guarantees
a consistent sequence for the result rows. I'm not sure whether you
always have that currently; if not, that's another nail in the
coffin of your current technique, since the same OFFSET into the
result might be different rows from one time to the next, even if
data didn't change. If your ORDER BY can't guarantee a unique set
of ordering values for every row in the result set, you need to add
any missing columns from a unique index (usually the primary key) to
the ORDER BY clause.
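
Concretely (a sketch; assuming "id" is the primary key column):

-- not deterministic: many rows can share the same ts value
ORDER BY ts

-- deterministic: ties on ts are broken by the primary key
ORDER BY ts, id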

Anyway, once you are sure you have an ORDER BY which is
deterministic, you make sure your software remembers the ORDER BY
values for the first and last entries on the page. Then you can do
something like (abstractly):

SELECT x, y, z
FROM a, b
WHERE ts BETWEEN m AND n
AND a.x = b.a_x
AND (x, y) > (lastx, lasty)
ORDER BY x, y
LIMIT 20;

With the right indexes, data distributions, selection criteria, and
ORDER BY columns -- that *could* be very fast. If not, look at
Craig's post.
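
For completeness, "page up" is the same idea with the comparison and
the sort direction reversed, and the rows flipped back for display
(abstractly again; firstx and firsty are the remembered ORDER BY
values from the first row of the current page):

SELECT * FROM (
SELECT x, y, z
FROM a, b
WHERE ts BETWEEN m AND n
AND a.x = b.a_x
AND (x, y) < (firstx, firsty)
ORDER BY x DESC, y DESC
LIMIT 20
) q
ORDER BY x, y;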

-Kevin
