Re: Efficient processing of staging data

From: "Asko Oja" <ascoja(at)gmail(dot)com>
To: "Joris Dobbelsteen" <joris(at)familiedobbelsteen(dot)nl>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: Efficient processing of staging data
Date: 2008-09-07 23:18:15
Message-ID: ecd779860809071618t3f00d29enc52ae873959affb2@mail.gmail.com
Lists: pgsql-general

That sounds like a good fit for PgQ, which is part of the SkyTools package.
PgQ is a very efficient event-processing system for PostgreSQL. Each event
queue may have several consumers.
What may interest you is that PgQ keeps processed events in rotated tables
and truncates a table once all consumers have processed its events, which
removes the need for deletes. It also provides convenient means for keeping
track of what has been processed.
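To illustrate the rotation-and-truncate idea (a hypothetical in-memory sketch
in Python, not PgQ's actual implementation — the class and method names here
are made up): events accumulate in the current table, tables are rotated
periodically, and a table is truncated only once every consumer has consumed
past it.

```python
# Hypothetical in-memory sketch of PgQ-style rotated event tables.
# Real PgQ stores events in PostgreSQL tables and uses TRUNCATE.

class RotatedQueue:
    def __init__(self, consumers):
        self.tables = [[]]                          # rotated "tables"
        self.positions = {c: 0 for c in consumers}  # next table each consumer will read

    def insert_event(self, event):
        self.tables[-1].append(event)               # events go into the current table

    def rotate(self):
        self.tables.append([])                      # open a fresh current table

    def consume(self, consumer):
        """Return all events in the next unprocessed table for this consumer."""
        idx = self.positions[consumer]
        if idx >= len(self.tables) - 1:
            return []                               # only the open table remains
        events = list(self.tables[idx])
        self.positions[consumer] = idx + 1
        return events

    def maintenance(self):
        """Truncate tables every consumer has moved past (no row-by-row deletes)."""
        done = min(self.positions.values())
        for i in range(done):
            self.tables[i] = []                     # cheap truncate

q = RotatedQueue(["a", "b"])
q.insert_event("ev1")
q.insert_event("ev2")
q.rotate()
print(q.consume("a"))   # ['ev1', 'ev2']
q.maintenance()         # no truncate yet: consumer "b" is still behind
print(q.tables[0])      # ['ev1', 'ev2']
print(q.consume("b"))   # ['ev1', 'ev2']
q.maintenance()
print(q.tables[0])      # []
```

The point of the rotation is that cleanup becomes a constant-cost truncate
instead of a delete proportional to the number of processed rows.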
One descriptive example:
http://kaiv.wordpress.com/2007/10/19/skytools-database-scripting-framework-pgq/

Skytools contains several scripts that may be good starting points when
designing your own:
- table_dispatcher.py - writes incoming events into a partitioned table in
the target database
- queue_mover.py - copies events from one database into another (for
later processing on a less loaded machine than the one where the events are
produced)
- queue_splitter.py - splits one queue from the online database into
multiple queues in the target database, where the events are processed by
various batch jobs.
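As a rough, hypothetical sketch of what a queue_splitter-style consumer does
(the real script is a SkyTools/PgQ consumer working against database queues;
the function and route names below are made up for illustration):

```python
import re

# Hypothetical sketch of queue_splitter-style routing: one incoming
# event stream is split into several target queues, each later
# handled by its own batch job.

def split_events(events, routes, default="unmatched"):
    """Route each event to the first queue whose regex matches it."""
    queues = {name: [] for name in list(routes) + [default]}
    for ev in events:
        for name, pattern in routes.items():
            if re.search(pattern, ev):
                queues[name].append(ev)
                break
        else:
            queues[default].append(ev)   # no route matched
    return queues

events = [
    "sshd[123]: Accepted password for joris",
    "kernel: eth0 link up",
    "sshd[124]: Failed password for root",
]
routes = {"auth": r"sshd\[\d+\]", "net": r"kernel: eth\d"}
out = split_events(events, routes)
print(len(out["auth"]))   # 2
print(out["net"])         # ['kernel: eth0 link up']
```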

So the solution might be to:
- use insert triggers to push the rows into a queue
- let one or more PgQ consumers process them

regards,
Asko
skypename: askoja

On Sat, Sep 6, 2008 at 7:01 PM, Joris Dobbelsteen <
joris(at)familiedobbelsteen(dot)nl> wrote:

> Dear,
>
> I'm looking for an efficient way to process data that I have stored in a
> staging table. This is syslog output, so it consists of text, where I need
> regexes to filter it out. The data will be split into several/a lot of
> tables.
>
> I currently have a set of queries that
> * read from staging and insert into destination table.
> * delete from staging (using destination table).
> This requires two scans over the tables plus the index operations (which
> pay off here). I do the processing incrementally, in order to keep the run
> time under control. But for only 4 filters this takes around 6 to 30
> minutes.
>
> My later idea was to do a (incremental) table scan on the staging table
> and have a function do the processing of the row. The function will then
> either:
> * decide to do nothing
> * decide to insert the (transformed) row into the destination table and
> delete it from the staging table.
>
> An obvious extension would be to put this processing in the INSERT trigger
> of the staging table, saving the I/Os required for the insert into and
> delete from the staging table. For the moment I would like to do this
> afterwards, though.
>
> What are the recommended methods and helpful implementation hints to get it
> working optimally? I.e. would my method work or are there any better
> solutions possible?
>
> How can I write the filtering functions in such a manner that I can later
> transform the solution in a trigger based one?
>
> Regards,
>
> - Joris Dobbelsteen
>
>
> --
> Sent via pgsql-general mailing list (pgsql-general(at)postgresql(dot)org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-general
>
