
Re: performance for high-volume log insertion

From: Richard Huxton <dev(at)archonet(dot)com>
To: david(at)lang(dot)hm
Cc: Stephen Frost <sfrost(at)snowman(dot)net>, Greg Smith <gsmith(at)gregsmith(dot)com>, pgsql-performance(at)postgresql(dot)org
Subject: Re: performance for high-volume log insertion
Date: 2009-04-21 08:56:23
Lists: pgsql-performance
david(at)lang(dot)hm wrote:
> On Tue, 21 Apr 2009, Stephen Frost wrote:
>> * david(at)lang(dot)hm (david(at)lang(dot)hm) wrote:
>>> while I fully understand the 'benchmark your situation' need, this isn't
>>> that simple.
>> It really is.  You know your application, you know its primary use
>> cases, and probably have some data to play with.  You're certainly in a
>> much better situation to at least *try* and benchmark it than we are.
> rsyslog is a syslog server. it replaces (or for Debian and Fedora, has 
> replaced) your standard syslog daemon. it receives log messages from 
> every app on your system (and possibly others), filters, manipulates 
> them, and then stores them somewhere. among the places that it can store 
> the logs are database servers (native support for MySQL, PostgreSQL, and 
> Oracle, plus libdbi for others)

Well, from a performance standpoint the obvious things to do are:
1. Keep a connection open, do NOT reconnect for each log-statement
2. Batch log statements together where possible
3. Use prepared statements
4. Partition the tables by day/week/month/year (configurable I suppose)

The first two are vital, the third takes you a step further. The fourth 
is a long-term admin thing.
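
For 2 and 3, a rough sketch of what the SQL side might look like (table 
and column names here are made up purely for illustration):

  -- prepare once, after connecting
  PREPARE log_insert (timestamptz, text, text, text, text) AS
    INSERT INTO logs (ts, host, facility, priority, message)
    VALUES ($1, $2, $3, $4, $5);

  -- then one cheap EXECUTE per message
  EXECUTE log_insert (now(), 'web01', 'mail', 'info', 'connect from 10.0.0.1');

  -- or batch several messages into a single statement/transaction
  BEGIN;
  INSERT INTO logs (ts, host, facility, priority, message) VALUES
    (now(), 'web01', 'mail', 'info', 'connect from 10.0.0.1'),
    (now(), 'web01', 'mail', 'info', 'connect from 10.0.0.2');
  COMMIT;

The prepared statement saves parsing/planning every insert; batching 
saves a network round-trip and a commit per message.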

And possibly
5. Have two connections, one for fatal/error etc and one for info/debug 
level log statements (configurable split?). Then you can use the 
synchronous_commit setting on the less important ones. Might buy you 
some performance on a busy system.
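
Something like this on the low-priority connection only (assuming the 
info/debug connection is the one you can afford to lose a few rows from 
after a crash):

  SET synchronous_commit = off;

The fatal/error connection keeps the default, so those commits still 
wait for the WAL to hit disk.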

> other apps then search and report on the data after it is stored. what 
> apps? I don't know either. pick your favorite reporting tool and you'll 
> be a step ahead of me (I don't know a really good reporting tool)
> as for sample data, you have syslog messages, just like I do. so you 
> have the same access to data that I have.
> how would you want to query them? how would people far less experienced 
> than you want to query them?
> I can speculate that some people would do two columns (time, everything 
> else), others will do three (time, server, everything else), and others 
> will go further (I know some who would like to extract IP addresses 
> embedded in a message into their own column). some people will index on 
> the time and host, others will want to do full-text searches of everything.

Well, assuming it looks much like traditional syslog, I would do 
something like: (timestamp, host, facility, priority, message). It's 
easy enough to stitch back together if people want that.
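
A minimal sketch of such a table (the types are just a guess at what 
rsyslog actually has to hand):

  CREATE TABLE logs (
    ts        timestamptz NOT NULL,
    host      text        NOT NULL,
    facility  text,
    priority  text,
    message   text
  );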

PostgreSQL's full-text indexing is quite well suited to logfiles, I'd 
have thought, since it already knows about filenames, URLs etc.
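
For example (the 'english' configuration is just the default choice, and 
the search term is made up):

  CREATE INDEX logs_message_fts
    ON logs USING gin (to_tsvector('english', message));

  SELECT ts, host, message
  FROM logs
  WHERE to_tsvector('english', message)
        @@ to_tsquery('english', 'sshd & refused');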

If you want to get fancy, add a msg_type column and one subsidiary table 
for each msg_type. So - you might have smtp_connect_from (hostname, 
ip_addr). A set of perl regexps can match and extract the fields for 
these extra tables, or you could do it with triggers inside the 
database. I think it makes sense to do it in the application. Easier for 
users to contribute new patterns/extractions. Meanwhile, the core table 
is untouched so you don't *need* to know about these extra tables.
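
Roughly, for the smtp example (the table layout, msg_type value and the 
regexp are only guesses at what the messages would contain):

  CREATE TABLE smtp_connect_from (
    log_ts    timestamptz,
    hostname  text,
    ip_addr   inet
  );

  -- whether it lives in Perl or in a trigger, the extraction itself is
  -- just a regexp with a capturing group, e.g. in SQL:
  SELECT substring(message from 'connect from ([0-9.]+)')::inet
  FROM logs
  WHERE msg_type = 'smtp_connect_from';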

If you have subsidiary tables, you'll want to partition those too and 
perhaps stick them in their own schema (logs200901, logs200902 etc).
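
With inheritance-based partitioning that might look something like this 
(dates and names purely for illustration):

  CREATE SCHEMA logs200904;

  CREATE TABLE logs200904.logs (
    CHECK (ts >= '2009-04-01' AND ts < '2009-05-01')
  ) INHERITS (logs);

  CREATE TABLE logs200904.smtp_connect_from (
    CHECK (log_ts >= '2009-04-01' AND log_ts < '2009-05-01')
  ) INHERITS (smtp_connect_from);

With constraint_exclusion on, queries that filter on a date range only 
touch the relevant partitions.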

   Richard Huxton
   Archonet Ltd
