Re: performance for high-volume log insertion

From: david(at)lang(dot)hm
To: Stephen Frost <sfrost(at)snowman(dot)net>
Cc: Greg Smith <gsmith(at)gregsmith(dot)com>, pgsql-performance(at)postgresql(dot)org
Subject: Re: performance for high-volume log insertion
Date: 2009-04-21 07:49:59
Message-ID: alpine.DEB.1.10.0904210018010.12662@asgard.lang.hm
Lists: pgsql-performance

On Tue, 21 Apr 2009, Stephen Frost wrote:

> * david(at)lang(dot)hm (david(at)lang(dot)hm) wrote:
>> while I fully understand the 'benchmark your situation' need, this isn't
>> that simple.
>
> It really is. You know your application, you know its primary use
> cases, and probably have some data to play with. You're certainly in a
> much better situation to at least *try* and benchmark it than we are.

rsyslog is a syslog server. It replaces (or, for Debian and Fedora, has
replaced) your standard syslog daemon. It receives log messages from every
app on your system (and possibly others), filters and manipulates them, and
then stores them somewhere. Among the places it can store the logs are
database servers (native support for MySQL, PostgreSQL, and Oracle, plus
libdbi for others).

Other apps then search and report on the data after it is stored. What
apps? I don't know either. Pick your favorite reporting tool and you'll
be a step ahead of me (I don't know of a really good reporting tool).

As for sample data, you have syslog messages, just like I do, so you have
the same access to data that I have.

How would you want to query them? How would people far less experienced
than you want to query them?

I can speculate that some people would use two columns (time, everything
else), others three (time, server, everything else), and others would go
further (I know some who would like to extract IP addresses embedded in a
message into their own column). Some people will index on time and host;
others will want to do full-text searches of everything.
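
For concreteness, here is a rough sketch of two of those layouts being
created over libpq. All table, column, and index names are invented for
illustration; nothing here is part of rsyslog or any actual schema.

  /* Sketch only: two hypothetical log-table layouts, created over libpq.
   * Names are invented for illustration. */
  #include <libpq-fe.h>
  #include <stdio.h>

  static void run(PGconn *conn, const char *sql)
  {
      PGresult *res = PQexec(conn, sql);
      if (PQresultStatus(res) != PGRES_COMMAND_OK)
          fprintf(stderr, "%s: %s", sql, PQerrorMessage(conn));
      PQclear(res);
  }

  int main(void)
  {
      PGconn *conn = PQconnectdb("dbname=syslog");
      if (PQstatus(conn) != CONNECTION_OK)
          return 1;

      /* "time, everything else" style */
      run(conn, "CREATE TABLE log_narrow (ts timestamptz, msg text)");

      /* slightly wider: time, host, message, indexed for time/host lookups */
      run(conn, "CREATE TABLE log_host (ts timestamptz, host text, msg text)");
      run(conn, "CREATE INDEX log_host_ts_host ON log_host (ts, host)");

      PQfinish(conn);
      return 0;
  }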

I can talk about the particular use case I have at work, but that would be
far from normal (full-text searches on tens of TB of data, plus reports,
etc.). We don't (currently) use postgres to do that, and I'm not sure how
I would configure postgres for that sort of load, so I don't think that my
personal situation is a good fit. I looked at Bizgres a few years ago, but
I didn't know enough about what I was trying to do, or how much data I was
trying to deal with, to go forward with it at that time.

Do I run the benchmark on the type of hardware that I use for the system
above (after spending how much time experimenting to find the correct
tuning)? Or on a stock, untuned postgres running on a desktop-type system
(we all know how horrible the defaults are)? How am I supposed to know
whether the differences I see in my 'benchmarks' are the result of the
differences between the commands, and not that I missed some critical knob
to turn?

Benchmarking is absolutely the right answer in some cases, especially when
someone is asking exactly how something will work for them. But in this
case I don't have the specific use case. I am trying to find out where the
theoretical advantages are for these things that 'everybody knows you
should do with a database', to understand the probability that they will
make a difference in this case.

>> in this case we are trying to decide what API/interface to use in an
>> infrastructure tool that will be distributed in common distros (it's now
>> the default syslog package of debian and fedora), there are so many
>> variables in hardware, software, and load that trying to benchmark it
>> becomes effectively impossible.
>
> You don't need to know how it will perform in every situation. The main
> question you have is if using prepared queries is faster or not, so pick
> a common structure, create a table, get some data, and test. I can say
> that prepared queries will be more likely to give you a performance
> boost with wider tables (more columns).

This is very helpful. I can't say what the schema will look like, but I
would guess that it will tend towards the narrow side (or at least not
update very many columns explicitly).
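
To make the prepared-query option concrete, here is a minimal libpq sketch
of what a narrow insert path might look like. The table, columns, statement
name, and sample values are all assumptions made up for illustration.

  /* Minimal sketch: prepare an INSERT once, then execute it per message.
   * Assumes a hypothetical table:
   *   CREATE TABLE log_narrow (ts timestamptz, host text, msg text)
   * Names and values are invented for illustration. */
  #include <libpq-fe.h>
  #include <stdio.h>

  int main(void)
  {
      PGconn *conn = PQconnectdb("dbname=syslog");
      if (PQstatus(conn) != CONNECTION_OK) {
          fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
          return 1;
      }

      /* Prepare once ... */
      PGresult *res = PQprepare(conn, "ins_log",
          "INSERT INTO log_narrow (ts, host, msg) VALUES ($1, $2, $3)",
          3, NULL);
      if (PQresultStatus(res) != PGRES_COMMAND_OK)
          fprintf(stderr, "prepare failed: %s", PQerrorMessage(conn));
      PQclear(res);

      /* ... then execute it for each log message, passing values as text. */
      const char *params[3] = { "2009-04-21 00:18:01", "asgard", "example log line" };
      res = PQexecPrepared(conn, "ins_log", 3, params, NULL, NULL, 0);
      if (PQresultStatus(res) != PGRES_COMMAND_OK)
          fprintf(stderr, "insert failed: %s", PQerrorMessage(conn));
      PQclear(res);

      PQfinish(conn);
      return 0;
  }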

>> based on Stephan's explanation of where binary could help, I think the
>> easy answer is to decide not to bother with it (the postgres text-to-X
>> converters get far more optimization attention than anything rsyslog
>> could deploy)
>
> While that's true, there's no substitute for not having to do a
> conversion at all. After all, it's a lot cheaper to do a bit of
> byte-swapping on an integer value that's already an integer in memory
> than to sprintf and atoi it.

But it's not an integer in memory; it's text that arrived over the network
or through a socket as a log message from another application.
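
For reference, this is roughly what the two paths look like in libpq for a
single integer column (the statement name "ins_pri" and the column are
invented for illustration); the binary path only saves work if the value
already exists as an integer in memory rather than as text.

  /* Sketch: the same integer parameter sent as text vs. as binary.
   * Assumes a prepared statement "ins_pri" like
   *   INSERT INTO log_pri (priority) VALUES ($1)
   * where priority is an integer column. Error handling omitted for brevity. */
  #include <libpq-fe.h>
  #include <arpa/inet.h>   /* htonl */
  #include <stdint.h>

  /* Text path: the value arrived as text, so pass it straight through and
   * let the server's input converter parse it. */
  static void insert_text(PGconn *conn, const char *pri_text)
  {
      const char *values[1] = { pri_text };
      PGresult *res = PQexecPrepared(conn, "ins_pri", 1, values, NULL, NULL, 0);
      PQclear(res);
  }

  /* Binary path: only a win if we already hold an int; the wire format for
   * int4 is a 4-byte big-endian integer. */
  static void insert_binary(PGconn *conn, int pri)
  {
      uint32_t net = htonl((uint32_t) pri);
      const char *values[1] = { (const char *) &net };
      int lengths[1] = { sizeof(net) };
      int formats[1] = { 1 };            /* 1 = binary */
      PGresult *res = PQexecPrepared(conn, "ins_pri", 1, values, lengths, formats, 0);
      PQclear(res);
  }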

David Lang
