Re: Can Postgres Not Do This Safely ?!?

From: Andy Colson <andy(at)squeakycode(dot)net>
To: Merlin Moncure <mmoncure(at)gmail(dot)com>
Cc: Karl Pickett <karl(dot)pickett(at)gmail(dot)com>, pgsql-general(at)postgresql(dot)org
Subject: Re: Can Postgres Not Do This Safely ?!?
Date: 2010-10-29 21:57:24
Message-ID: 4CCB4344.3010602@squeakycode.net
Lists: pgsql-general

On 10/29/2010 9:49 AM, Merlin Moncure wrote:
> On Thu, Oct 28, 2010 at 10:04 PM, Karl Pickett<karl(dot)pickett(at)gmail(dot)com> wrote:
>> Hello Postgres Hackers,
>>
>> We have a simple 'event log' table that is insert only (by multiple
>> concurrent clients). It has an integer primary key. We want to do
>> incremental queries of this table every 5 minutes or so, i.e. "select
>> * from events where id > LAST_ID_I_GOT" to insert into a separate
>> reporting database. The problem is, this simple approach has a race
>> that will forever skip uncommitted events. I.e., if 5000 was
>> committed sooner than 4999, and we get 5000, we will never go back and
>> get 4999 when it finally commits. How can we solve this? Basically
>> it's a phantom row problem but it spans transactions.
>>
>> I looked at checking the internal 'xmin' column but the docs say that
>> is 32 bit, and something like 'txid_current_snapshot' returns a 64 bit
>> value. I don't get it. All I want to is make sure I skip over any
>> rows that are newer than the oldest currently running transaction.
>> Has nobody else run into this before?
>
> You don't have a sequence problem so much as a wrong implementation
> problem. Sequences are always *grabbed* in order but they can hit the
> table out of order and there is a time lag between when the sequence
> value is generated and the transaction commits. If I issue 'begin',
> insert a log record, and hold the commit for 5 minutes you are going
> to skip the record because you are only looking at the last processed
> record. Your algorithm is going to fail if you use a sequence,
> timestamp, or gapless sequence to manage your queue position. You
> need to divide your log records into two logical sets, processed and
> unprocessed, and look at the set as a whole.
>
> I would suggest staging your unprocessed records to a queue table and
> having your writer consume them and move them to a processed table.
> You can also look at already built queuing implementations like PGQ
> written by our spectacularly skilled friends at Skype (haven't used it
> myself, but I've heard it's good!).
>
> merlin
>
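
A rough sketch of the staging/queue-table idea Merlin describes above (table
names are made up; assumes the writers insert into events_queue and the mover
owns events_processed). Running the mover at serializable means the insert
and the delete see the same snapshot, so the rows copied and the rows deleted
are exactly the same set:

begin isolation level serializable;
insert into events_processed select * from events_queue;
delete from events_queue;
commit;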

Yep, you don't want a sequence. You want a flag.

Add a boolean "processed" flag and default it to false.

Then every 5 minutes run this:

begin;
insert into logged select * from events where processed = false;
update events set processed = true where processed = false;
commit;

Or, if you want to select them and do something with them first:

begin;
select * from events where processed = false;
... do your processing on each row, which would include inserting it ...
update events set processed = true where processed = false;
commit;

Just make sure you do it all in the same transaction, and run it at
repeatable read or serializable (or flag and fetch in a single statement), so
the update touches exactly the same set of rows the select saw. At the
default read committed level a row committed in between could get flagged
without ever being copied.
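
For example, a rough sketch of the single-statement version (same table
layout assumed), using update ... returning so the flagged rows and the
returned rows are guaranteed to be identical:

begin;
-- flag the unprocessed rows and hand them back in one statement
update events set processed = true where processed = false returning *;
-- the client copies the returned rows into the reporting database
commit;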

You could also create a partial index covering just the rows where processed
is false, so these queries only have to look at the unprocessed rows instead
of scanning the whole table.
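
Something like this (the index name is just illustrative):

create index events_unprocessed_idx on events (id) where processed = false;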

-Andy
