Re: Finding latest record for a number of groups in an INSERT-only table

From: Daniel Farina <daniel(at)heroku(dot)com>
To: Alban Hertroys <dalroi(at)solfertje(dot)student(dot)utwente(dot)nl>
Cc: David Johnston <polobo(at)yahoo(dot)com>, "pgsql-general(at)postgresql(dot)org" <pgsql-general(at)postgresql(dot)org>
Subject: Re: Finding latest record for a number of groups in an INSERT-only table
Date: 2011-07-05 18:30:33
Message-ID: CAAZKuFb0teHKrc=2mCX0jMFu5Z5p5YO3JhC5egrKVtb=GC9sqw@mail.gmail.com
Lists: pgsql-general

On Tue, Jul 5, 2011 at 9:49 AM, Alban Hertroys
<dalroi(at)solfertje(dot)student(dot)utwente(dot)nl> wrote:
> On 5 Jul 2011, at 9:13, Daniel Farina wrote:
>
>>> Setup a materialized view.
>>
>> This rather defeats the point AFAIK, because keeping the materialized
>> view up to date (being more than thirty seconds out of date is not
>> desirable) will be expensive.  Maintaining the index on the (key,
>> recency) pair is, in effect, a materialization of sorts already.
>
> Except that you can't query an index directly and you can't create an index that only contains the latest rows without marking them as such, which is practically what a materialized view does.

It doesn't need to contain only the latest rows (although that would be
an interesting index type, come to think of it). The point is that
appending an entry to the index is cheaper than locating and rewriting
the original row via UPDATE, and the index is considerably more compact
than the rest of this fairly fat row. Ergo, it's not "practically" doing
the work of a materialized view.
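For illustration, this is roughly the shape of the index and the
per-group lookup being discussed; the table and column names here are
hypothetical, not taken from the actual schema:

```sql
-- Hypothetical INSERT-only table with a (key, recency) index.
CREATE INDEX events_key_recency_idx ON events (key, recorded_at DESC);

-- Latest record for one known group: a single descent of the index,
-- which the planner already handles well.
SELECT *
FROM events
WHERE key = 'some-group'
ORDER BY recorded_at DESC
LIMIT 1;
```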

> Your concerns about the overhead of the materialized view seem rather pessimistic. The materialization adds some overhead to every insert/update/delete to test whether the current record in the materialized view should be replaced or kept, but since it's a much smaller table, that should be quite fast.

It *might* be fast enough, but the point of this thread was to see if
I could write a lucid query, less contorted than the one I have come up
with (an emulated skip scan plus a SELECT per record). Right now the
application is better off issuing a thousand-plus queries (Cn+1),
because the optimizer knows how to find the latest record for a
specific group, but not how to find the latest records over an entire
relation. By wiring this approach into the backend I can avoid the
painful network round-trip overhead.
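The contorted query alluded to here can be sketched as a recursive CTE
that emulates a skip scan: walk the distinct keys one index probe at a
time, then fetch each group's latest timestamp with a correlated
subquery. This is only an illustrative reconstruction, assuming a
hypothetical events(key, recorded_at) table with an index on
(key, recorded_at):

```sql
-- Emulated skip scan: enumerate distinct keys via repeated index probes,
-- then look up the latest recorded_at per key.
WITH RECURSIVE distinct_keys AS (
    (SELECT key FROM events ORDER BY key LIMIT 1)
  UNION ALL
    SELECT (SELECT min(e.key) FROM events e WHERE e.key > d.key)
    FROM distinct_keys d
    WHERE d.key IS NOT NULL
)
SELECT d.key,
       (SELECT max(e.recorded_at)
        FROM events e
        WHERE e.key = d.key) AS latest
FROM distinct_keys d
WHERE d.key IS NOT NULL;
```

Each step of the recursion is one cheap index probe, so the total cost
scales with the number of groups rather than the number of rows, which
is the behavior the planner cannot currently produce on its own for a
whole-relation latest-per-group query.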

We also have other (denormalization-based) approaches in mind that are
likely faster than even proper use of the index. It's unfortunate we
have to take that step, since otherwise the latest-record index
scanning approach would likely be fast enough; but as long as we're
doing contortions, we may as well use the one we think will deliver the
maximum read performance.

> You really only want to know that a row is the latest in a group of similar rows. With MVCC, you can only really achieve that by updating the latest row and the last latest row to reflect their status (latest or not). That means you do two DELETE and INSERT operations every time you want to update what the latest row is.

What you say about MVCC does not follow from the problem.

--
fdr
