> On Tue, 21 Apr 2009, Stephen Frost wrote:
>> * david(at)lang(dot)hm (david(at)lang(dot)hm) wrote:
>>> while I fully understand the 'benchmark your situation' need, this isn't
>>> that simple.
>> It really is. You know your application, you know it's primary use
>> cases, and probably have some data to play with. You're certainly in a
>> much better situation to at least *try* and benchmark it than we are.
> rsyslog is a syslog server. it replaces (or for debian and fedora, has
> replaced) your standard syslog daemon. it recieves log messages from
> every app on your system (and possibly others), filters, maniulates
> them, and then stores them somewhere. among the places that it can store
> the logs are database servers (native support for MySQL, PostgreSQL, and
> Oracle. plus libdbi for others)
Well, from a performance standpoint the obvious things to do are:
1. Keep a connection open, do NOT reconnect for each log-statement
2. Batch log statements together where possible
3. Use prepared statements
4. Partition the tables by day/week/month/year (configurable I suppose)
The first two are vital, the third takes you a step further. The fourth
is a long-term admin thing.
5. Have two connections, one for fatal/error etc and one for info/debug
level log statements (configurable split?). Then you can use the
synchronous_commit setting on the less important ones. Might buy you
some performance on a busy system.
> other apps then search and report on the data after it is stored. what
> apps?, I don't know either. pick your favorite reporting tool and you'll
> be a step ahead of me (I don't know a really good reporting tool)
> as for sample data, you have syslog messages, just like I do. so you
> have the same access to data that I have.
> how would you want to query them? how would people far less experianced
> that you want to query them?
> I can speculate that some people would do two columns (time, everything
> else), others will do three (time, server, everything else), and others
> will go further (I know some who would like to extract IP addresses
> embedded in a message into their own column). some people will index on
> the time and host, others will want to do full-text searches of everything.
Well, assuming it looks much like traditional syslog, I would do
something like: (timestamp, host, facility, priority, message). It's
easy enough to stitch back together if people want that.
PostgreSQL's full-text indexing is quite well suited to logfiles I'd
have thought, since it knows about filenames, urls etc already.
If you want to get fancy, add a msg_type column and one subsidiary table
for each msg_type. So - you might have smtp_connect_from (hostname,
ip_addr). A set of perl regexps can match and extract the fields for
these extra tables, or you could do it with triggers inside the
database. I think it makes sense to do it in the application. Easier for
users to contribute new patterns/extractions. Meanwhile, the core table
is untouched so you don't *need* to know about these extra tables.
If you have subsidiary tables, you'll want to partition those too and
perhaps stick them in their own schema (logs200901, logs200902 etc).
In response to
pgsql-performance by date
|Next:||From: Matthew Wakeling||Date: 2009-04-21 10:40:19|
|Subject: Re: GiST index performance|
|Previous:||From: david||Date: 2009-04-21 08:26:23|
|Subject: Re: performance for high-volume log insertion|