Re: performance for high-volume log insertion

From: Stephen Frost <sfrost(at)snowman(dot)net>
To: david(at)lang(dot)hm
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: performance for high-volume log insertion
Date: 2009-04-21 04:10:08
Message-ID: 20090421041008.GU8123@tamriel.snowman.net
Lists: pgsql-performance

David,

* david(at)lang(dot)hm (david(at)lang(dot)hm) wrote:
> the database structure is not being defined by (or specifically for)
> rsyslog. so at compile time we have _no_ idea how many variables of what
> type there are going to be. my example of ($timestamp,$msg) was intended
> to just be a sample (avoiding typing out some elaborate set of
> parameters)

That's fine. I don't see any reason the API I suggested would need to
be a compile-time option. Certainly, libpq has no idea how many, or
what kind of, types are being passed to it by the caller; that's why it
*has* the API that it does. You just need to work out what each
prepared query's parameters are and construct the necessary arrays.

> rsyslog provides the following items, which can be sliced and diced with
> substitutions, substrings, and additional inserted text.

[...]

Looks like mainly text fields still, so you might want to just stick
with text for now.

> rsyslog message formatting provides tools for doing the necessary
> escaping (and is using it for the single insert messages today)
>
> prepared statements in text mode have similar problems (although they
> _are_ better in defending against sql injection attacks, so a bit safer).

Uhh, if you use prepared statements with PQexecPrepared, there *is no*
escaping necessary. I'm not sure what you mean by 'similar problems'.
Can you elaborate on that? If you mean doing 'prepared queries' by
constructing a string and then using PQexec with
'EXECUTE blah (1,2,3);', then you really need to go read the
documentation I suggested. That's *not* what I'm getting at when I say
'prepared queries'; I'm talking about a protocol-level, well-defined
format for passing arguments independently of commands. A call to
PQexecPrepared looks like this:

PQprepare(conn, "myquery", "INSERT INTO TAB1 VALUES ($1, $2);", 0, NULL);

const char *values[2];
values[0] = "a";
values[1] = "b";
PQexecPrepared(conn, "myquery", 2, values, NULL, NULL, 0);

Note that we don't ever send an 'EXECUTE myquery (1,2,3);' type of thing
to libpq. libpq will handle the execute and the parameters and whatnot
as part of the PG 3.0 protocol.

> I don't see how you would easily use the API that you pointed me at above
> without having to know the database layout at compile time.

The arrays don't have to be of fixed length. You can malloc() them at
runtime based on the number of parameters being used in a particular
message. Perhaps what I'm missing here is exactly what you're
expecting the user to provide you with versus what you're going to be
giving to libpq. I have been assuming that you have a format
definition system already in place that looks something like:

log_filename = 'postgresql-%Y-%m-%d_%H%M%S.log'

Which you then parse, figure out what the escape codes mean, and build
some structure which understands the whole thing at an abstract level.
For example, you should know that '%Y' is the first variable, that it's
an integer, etc.

From this, you can determine that there are 6 parameters, at runtime.
You can then malloc() an array with 6 char* pointers. Then, when you
have those 6 strings somewhere, you just set each of your array
parameters to the appropriate in-memory string address, eg:

array = malloc(sizeof(char *) * num_of_params);
for (int i = 0; i < num_of_params; i++) {
    switch (params[i]) {
        case 'Y': array[i] = my_year_string;  break;
        case 'm': array[i] = my_month_string; break;
    }
}

etc, until you eventually have your array of pointers, with a valid
in-memory string somewhere for each pointer, that you can then turn
around and pass to PQexecPrepared. Obviously, you don't have to
malloc() every time, if you keep track of each type of message.
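
As one possible sketch of the sizing step (count_placeholders is a
made-up helper, and it naively ignores any '$' inside quoted literals),
you can find out how many pointers to malloc() by scanning the
statement text for its highest $n placeholder:

```c
#include <stdlib.h>
#include <string.h>

/* Return the highest $n placeholder number found in a prepared
 * statement string, i.e. the number of parameters to allocate for.
 * Assumes placeholders are numbered 1..N without gaps, and does not
 * skip '$' characters appearing inside quoted literals. */
static int count_placeholders(const char *sql)
{
    int max = 0;
    for (const char *p = sql; (p = strchr(p, '$')) != NULL; p++) {
        int n = atoi(p + 1);   /* digits follow the '$' */
        if (n > max)
            max = n;
    }
    return max;
}
```

The result feeds directly into the malloc() above in place of
num_of_params.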

> I agree that defining a fixed table layout and compiling that knowledge
> into rsyslog is the safest (and probably most efficient) way to do
> things, but there is no standard for log messages in a database, and
> different people will want to do different things with the logs, so I
> don't see how a fixed definition could work.

I didn't intend to imply that you have to use a fixed definition, just
that if you currently only have 1 then you might as well. It's entirely
possible to support any definition using the API I suggested.

> so the binary mode only makes a difference on things like timestamps and
> numbers? (i.e. no significant added efficiency in processing the command
> itself?)

I'm slightly confused by what you mean here. Binary mode is for
parameters only; commands are never 'binary'. Binary mode just means
that the application gives the value to the database in the binary
format the database expects, so the database doesn't have to translate
a textual representation of the value into that format itself.
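
To make that concrete, here's a minimal sketch (pack_int32 is my own
hypothetical helper, as is the statement name) of what passing an int4
parameter in binary rather than text would look like. PostgreSQL
expects binary int4 values in network (big-endian) byte order:

```c
#include <stdint.h>

/* Pack a host-order int32 into the 4-byte big-endian form that
 * PostgreSQL expects for a binary int4 parameter. */
static void pack_int32(int32_t v, unsigned char buf[4])
{
    buf[0] = (unsigned char)((uint32_t)v >> 24);
    buf[1] = (unsigned char)((uint32_t)v >> 16);
    buf[2] = (unsigned char)((uint32_t)v >> 8);
    buf[3] = (unsigned char)((uint32_t)v);
}

/* With a live PGconn *conn and a statement "myquery" prepared to take
 * a single int4 parameter, the execute call would then be:
 *
 *   unsigned char buf[4];
 *   pack_int32(2009, buf);
 *   const char *values[1]  = { (const char *) buf };
 *   int         lengths[1] = { 4 };
 *   int         formats[1] = { 1 };   -- 1 = binary, 0 = text
 *   PQexecPrepared(conn, "myquery", 1, values, lengths, formats, 0);
 */
```

Note that lengths and formats must be supplied for binary parameters,
whereas for text parameters both can be NULL as in the earlier example.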

> thanks for taking the time to answer, I was trying to keep the problem
> definition small and simple, and from your reply it looks like I made it
> too simple.

Yes, probably. You likely assumed that I knew something about how
rsyslog works with databases, and to be honest, I have no idea. :)

> I think the huge complication is that when RedHat compiles rsyslog to
> ship it in the distro, they have no idea how it is going to be used (if
> it will go to a database, what database engine it will interface with, or
> what the schema of that database would look like). Only the
> sysadmin(s)/dba(s) know that and they need to be able to tell rsyslog
> what to do to get the data where they want it to be, and in the format
> they want it to be in.

That's really all fine. You just need to get from the user, at runtime,
what they want their commands to look like. Once you have that, it
should be entirely possible to dynamically construct the prepared
queries, most likely without the user having to know anything about COPY
or prepared statements or anything. For my part, I'd want something
like:

table = "mytable";
data = "$Y, $m, $d, $H, $msg";

I'd avoid having the user provide actual SQL, because that becomes
difficult to deal with unless you embed an SQL parser in rsyslog, and
I don't really see the value in that. If the user wants to do
something fancy with the data in the database, I would encourage them
to put an 'ON INSERT' trigger on 'mytable' to do whatever they want with
the data that's coming in. This gives you the freedom necessary to
build an appropriate statement for any database you're connecting to,
dynamically, using prepared queries, and even binary mode if you want.
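
Given the table name and the number of fields taken from that user
configuration, building the statement text at runtime is
straightforward. A rough sketch (build_insert is a hypothetical
helper; it assumes the table name has been validated elsewhere, since
identifiers can't be passed as parameters):

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Build "INSERT INTO <table> VALUES ($1, ..., $n);" at runtime from a
 * user-supplied table name and parameter count.  Caller frees the
 * returned string.  The table name must be validated/whitelisted by
 * the caller, as identifiers cannot be bound as parameters. */
static char *build_insert(const char *table, int nparams)
{
    size_t cap = strlen(table) + 32 + (size_t)nparams * 8;
    char *sql = malloc(cap);
    if (sql == NULL)
        return NULL;
    int off = snprintf(sql, cap, "INSERT INTO %s VALUES (", table);
    for (int i = 1; i <= nparams; i++)
        off += snprintf(sql + off, cap - off,
                        i < nparams ? "$%d, " : "$%d", i);
    snprintf(sql + off, cap - off, ");");
    return sql;
}
```

The resulting string goes straight into PQprepare(), and from then on
every log message of that type is just a PQexecPrepared() call with the
values array filled in.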

Thanks,

Stephen
