Skip site navigation (1) Skip section navigation (2)

Re: performance for high-volume log insertion

From: Richard Huxton <dev(at)archonet(dot)com>
To: david(at)lang(dot)hm
Cc: Stephen Frost <sfrost(at)snowman(dot)net>, Greg Smith <gsmith(at)gregsmith(dot)com>, pgsql-performance(at)postgresql(dot)org
Subject: Re: performance for high-volume log insertion
Date: 2009-04-21 08:56:23
Message-ID: 49ED8A37.4030509@archonet.com (view raw or flat)
Thread:
Lists: pgsql-performance
david(at)lang(dot)hm wrote:
> On Tue, 21 Apr 2009, Stephen Frost wrote:
> 
>> * david(at)lang(dot)hm (david(at)lang(dot)hm) wrote:
>>> while I fully understand the 'benchmark your situation' need, this isn't
>>> that simple.
>>
>> It really is.  You know your application, you know it's primary use
>> cases, and probably have some data to play with.  You're certainly in a
>> much better situation to at least *try* and benchmark it than we are.
> 
> rsyslog is a syslog server. it replaces (or for debian and fedora, has 
> replaced) your standard syslog daemon. it recieves log messages from 
> every app on your system (and possibly others), filters, maniulates 
> them, and then stores them somewhere. among the places that it can store 
> the logs are database servers (native support for MySQL, PostgreSQL, and 
> Oracle. plus libdbi for others)

Well, from a performance standpoint the obvious things to do are:
1. Keep a connection open, do NOT reconnect for each log-statement
2. Batch log statements together where possible
3. Use prepared statements
4. Partition the tables by day/week/month/year (configurable I suppose)

The first two are vital, the third takes you a step further. The fourth 
is a long-term admin thing.

And possibly
5. Have two connections, one for fatal/error etc and one for info/debug 
level log statements (configurable split?). Then you can use the 
synchronous_commit setting on the less important ones. Might buy you 
some performance on a busy system.

http://www.postgresql.org/docs/8.3/interactive/runtime-config-wal.html#RUNTIME-CONFIG-WAL-SETTINGS

> other apps then search and report on the data after it is stored. what 
> apps?, I don't know either. pick your favorite reporting tool and you'll 
> be a step ahead of me (I don't know a really good reporting tool)
> 
> as for sample data, you have syslog messages, just like I do. so you 
> have the same access to data that I have.
> 
> how would you want to query them? how would people far less experianced 
> that you want to query them?
> 
> I can speculate that some people would do two columns (time, everything 
> else), others will do three (time, server, everything else), and others 
> will go further (I know some who would like to extract IP addresses 
> embedded in a message into their own column). some people will index on 
> the time and host, others will want to do full-text searches of everything.

Well, assuming it looks much like traditional syslog, I would do 
something like: (timestamp, host, facility, priority, message). It's 
easy enough to stitch back together if people want that.

PostgreSQL's full-text indexing is quite well suited to logfiles I'd 
have thought, since it knows about filenames, urls etc already.

If you want to get fancy, add a msg_type column and one subsidiary table 
for each msg_type. So - you might have smtp_connect_from (hostname, 
ip_addr). A set of perl regexps can match and extract the fields for 
these extra tables, or you could do it with triggers inside the 
database. I think it makes sense to do it in the application. Easier for 
users to contribute new patterns/extractions. Meanwhile, the core table 
is untouched so you don't *need* to know about these extra tables.

If you have subsidiary tables, you'll want to partition those too and 
perhaps stick them in their own schema (logs200901, logs200902 etc).

-- 
   Richard Huxton
   Archonet Ltd

In response to

Responses

pgsql-performance by date

Next:From: Matthew WakelingDate: 2009-04-21 10:40:19
Subject: Re: GiST index performance
Previous:From: davidDate: 2009-04-21 08:26:23
Subject: Re: performance for high-volume log insertion

Privacy Policy | About PostgreSQL
Copyright © 1996-2014 The PostgreSQL Global Development Group