Re: performance for high-volume log insertion

From: Richard Huxton <dev(at)archonet(dot)com>
To: david(at)lang(dot)hm
Cc: Stephen Frost <sfrost(at)snowman(dot)net>, Greg Smith <gsmith(at)gregsmith(dot)com>, pgsql-performance(at)postgresql(dot)org
Subject: Re: performance for high-volume log insertion
Date: 2009-04-21 08:56:23
Message-ID: 49ED8A37.4030509@archonet.com
Lists: pgsql-performance

david(at)lang(dot)hm wrote:
> On Tue, 21 Apr 2009, Stephen Frost wrote:
>
>> * david(at)lang(dot)hm (david(at)lang(dot)hm) wrote:
>>> while I fully understand the 'benchmark your situation' need, this isn't
>>> that simple.
>>
>> It really is. You know your application, you know its primary use
>> cases, and probably have some data to play with. You're certainly in a
>> much better situation to at least *try* and benchmark it than we are.
>
> rsyslog is a syslog server. it replaces (or for debian and fedora, has
> replaced) your standard syslog daemon. it receives log messages from
> every app on your system (and possibly others), filters, manipulates
> them, and then stores them somewhere. among the places that it can store
> the logs are database servers (native support for MySQL, PostgreSQL, and
> Oracle, plus libdbi for others)

Well, from a performance standpoint the obvious things to do are:
1. Keep a connection open, do NOT reconnect for each log statement
2. Batch log statements together where possible
3. Use prepared statements
4. Partition the tables by day/week/month/year (configurable I suppose)

The first two are vital; the third takes you a step further. The fourth
is a long-term admin thing.
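
For illustration, a minimal sketch of the first three points in Python
with psycopg2 (rsyslog itself is C, and the "logs" table and its columns
here are made up, but the pattern is the same):

import psycopg2

# Point 1: one connection, opened once and reused.
conn = psycopg2.connect("dbname=logs")
cur = conn.cursor()

# Point 3: a server-side prepared statement, parsed and planned once.
cur.execute("""
    PREPARE log_insert (timestamptz, text, text, text, text) AS
    INSERT INTO logs (ts, host, facility, priority, message)
    VALUES ($1, $2, $3, $4, $5)
""")

def flush(batch):
    # Point 2: many inserts per transaction, one commit per batch.
    for row in batch:
        cur.execute("EXECUTE log_insert (%s, %s, %s, %s, %s)", row)
    conn.commit()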

And possibly
5. Have two connections, one for fatal/error etc. and one for info/debug
level log statements (configurable split?). Then you can turn off
synchronous_commit on the less important one. Might buy you some
performance on a busy system.

http://www.postgresql.org/docs/8.3/interactive/runtime-config-wal.html#RUNTIME-CONFIG-WAL-SETTINGS
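
Something like this, say (psycopg2 again, names made up as before;
synchronous_commit is settable per-session, and turning it off only
risks losing the last few commits on a crash, never corrupting the
database):

import psycopg2

# A second, lower-priority connection for info/debug messages.
bulk_conn = psycopg2.connect("dbname=logs")
bulk_cur = bulk_conn.cursor()

# Session-level setting (8.3+): commits on this connection no longer
# wait for the WAL flush, so a crash can lose the last few messages.
bulk_cur.execute("SET synchronous_commit TO off")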

> other apps then search and report on the data after it is stored. what
> apps? I don't know either. pick your favorite reporting tool and you'll
> be a step ahead of me (I don't know a really good reporting tool)
>
> as for sample data, you have syslog messages, just like I do. so you
> have the same access to data that I have.
>
> how would you want to query them? how would people far less experienced
> than you want to query them?
>
> I can speculate that some people would do two columns (time, everything
> else), others will do three (time, server, everything else), and others
> will go further (I know some who would like to extract IP addresses
> embedded in a message into their own column). some people will index on
> the time and host, others will want to do full-text searches of everything.

Well, assuming it looks much like traditional syslog, I would do
something like: (timestamp, host, facility, priority, message). It's
easy enough to stitch back together if people want that.

PostgreSQL's full-text indexing is quite well suited to logfiles, I'd
have thought, since it already knows about filenames, URLs etc.
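
In concrete terms, something like this (via psycopg2 as above; the
names are illustrative, and the index uses 8.3's built-in tsearch):

import psycopg2

conn = psycopg2.connect("dbname=logs")
cur = conn.cursor()

# The core table: one row per syslog message.
cur.execute("""
    CREATE TABLE logs (
        ts        timestamptz NOT NULL,
        host      text        NOT NULL,
        facility  text,
        priority  text,
        message   text
    )
""")

# Full-text search over the message body.
cur.execute("""
    CREATE INDEX logs_message_fts ON logs
    USING gin (to_tsvector('english', message))
""")
conn.commit()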

If you want to get fancy, add a msg_type column and one subsidiary table
for each msg_type. So - you might have smtp_connect_from (hostname,
ip_addr). A set of perl regexps can match and extract the fields for
these extra tables, or you could do it with triggers inside the
database. I think it makes sense to do it in the application. Easier for
users to contribute new patterns/extractions. Meanwhile, the core table
is untouched so you don't *need* to know about these extra tables.
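
As a sketch of the application-side approach (Python regexps here
rather than perl, and both the pattern and the subsidiary table are
hypothetical):

import re
import psycopg2

conn = psycopg2.connect("dbname=logs")
cur = conn.cursor()

# One pattern per message type; users can contribute more.
SMTP_CONNECT = re.compile(r"connect from (?P<hostname>\S+) \[(?P<ip>[\d.]+)\]")

def extract(message):
    # Feed matching messages into the subsidiary table; everything
    # still goes into the core logs table regardless.
    m = SMTP_CONNECT.search(message)
    if m:
        cur.execute(
            "INSERT INTO smtp_connect_from (hostname, ip_addr)"
            " VALUES (%s, %s)",
            (m.group("hostname"), m.group("ip")))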

If you have subsidiary tables, you'll want to partition those too and
perhaps stick them in their own schema (logs200901, logs200902 etc).
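
For example (8.3-era inheritance partitioning, names again
illustrative):

import psycopg2

conn = psycopg2.connect("dbname=logs")
cur = conn.cursor()

# One schema per month, holding that month's partitions.
cur.execute("CREATE SCHEMA logs200901")

# A child table inheriting from the core table; the CHECK constraint
# lets constraint_exclusion skip irrelevant months at query time.
cur.execute("""
    CREATE TABLE logs200901.logs (
        CHECK (ts >= '2009-01-01' AND ts < '2009-02-01')
    ) INHERITS (logs)
""")
conn.commit()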

--
Richard Huxton
Archonet Ltd
