Re: performance for high-volume log insertion

From: david(at)lang(dot)hm
To: Stephen Frost <sfrost(at)snowman(dot)net>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: performance for high-volume log insertion
Date: 2009-04-21 06:00:54
Message-ID: alpine.DEB.1.10.0904202239130.12662@asgard.lang.hm
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

On Tue, 21 Apr 2009, Stephen Frost wrote:

> David,
>
> * david(at)lang(dot)hm (david(at)lang(dot)hm) wrote:
>> the database structure is not being defined by (or specificly for)
>> rsyslog. so at compile time we have _no_ idea how many variables of what
>> type there are going to be. my example of ($timestamp,$msg) was intended
>> to just be a sample (avoiding typing out some elaberate set of
>> parameters)
>
> That's fine. I don't see any reason that the API I suggested be a
> compile-time option. Certainly, libpq has no idea how many, or what
> kind, of types are being passed to it by the caller, that's why it *has*
> the API that it does. You just need to work out what each prepared
> queries' parameters are and construct the necessary arrays.

I misunderstood what you were saying (more below)

>> rsyslog provides the following items, which can be sliced and diced with
>> substatutions, substrings, and additional inserted text.
>
> [...]
>
> Looks like mainly text fields still, so you might want to just stick
> with text for now.

yes, almost exclusivly text fields, and the fields that could be numbers
(or dates) are in text formats when we get them anyway.

>> rsyslog message formatting provides tools for doing the nessasary
>> escaping (and is using it for the single insert messages today)
>>
>> prepared statements in text mode have similar problems (although they
>> _are_ better in defending against sql injection attacks, so a bit safer).
>
> Uhh, if you use prepared statements with PQexecPrepared, there *is no*
> escaping necessary. I'm not sure what you mean by 'similar problems'.
> Can you elaborate on that? If you mean doing 'prepared queries' by
> using creating a string and then using PQexec with
> 'EXECUTE blah (1,2,3);' then you really need to go read the
> documentation I suggested. That's *not* what I'm getting at when I say
> 'prepared queries', I'm talking about a protocol-level well-defined
> format for passing arguments independently of commands. A call to
> PQexecPrepared looks like this:
>
> PQprepare(conn, "myquery", "INSERT INTO TAB1 VALUES ($1, $2);", 0, NULL);
>
> values[0] = "a";
> values[1] = "b";
> PQexecPrepared(conn, "myquery", 2, values, NULL, NULL, 0);
>
> Note that we don't ever send an 'EXECUTE myquery (1,2,3);' type of thing
> to libpq. libpq will handle the execute and the parameters and whatnot
> as part of the PG 3.0 protocol.

when you said to stick with text mode, I thought you were meaning that we
would create a string with EXECUTE.... in it and send that. it would have
similar escaping issues (although with fewer vunerabilities if they mess
up)

>> so the binary mode only makes a difference on things like timestamps and
>> numbers? (i.e. no significant added efficiancy in processing the command
>> itself?)
>
> I'm slightly confused by what you mean by this? Binary mode is for
> parameters only, commands are never 'binary'. Binary mode just means
> that the application gives the value to the database in the format which
> it expects, and so the database doesn't have to translate a textual
> representation of a value into the binary format the database needs.

I thought that part of the 'efficiancy' and 'performance' to be gained
from binary modes were avoiding the need to parse commands, if it's only
the savings in converting column contents from text to specific types,
it's much less important.

>> I think the huge complication is that when RedHat compiles rsyslog to
>> ship it in the distro, they have no idea how it is going to be used (if
>> it will go to a database, what database engine it will interface with, or
>> what the schema of that database would look like). Only the
>> sysadmin(s)/dba(s) know that and they need to be able to tell rsyslog
>> what to do to get the data where they want it to be, and in the format
>> they want it to be in.
>
> That's really all fine. You just need to get from the user, at runtime,
> what they want their commands to look like. Once you have that, it
> should be entirely possible to dynamically construct the prepared
> queries, most likely without the user having to know anything about COPY
> or prepared statements or anything. For my part, I'd want something
> like:
>
> table = "mytable";
> data = "$Y, $m, $d, $H, $msg";

if the user creates the data this way, you just reintroduced the escaping
problem. they would have to do something like

data = "$Y"
data = "$m"
data = "$d"
data = "$H"
data = "$msg"

one key thing is that it's very probable that the user will want to
manipulate the string, not just send a single variable as-is

> I'd avoid having the user provide actual SQL, because that becomes
> difficult to deal with unless you embed an SQL parser in rsyslog, and
> I don't really see the value in that.

there's no need for rsyslog to parse the SQL, just to be able to escape it
appropriately and then pass it to the database for execution

> If the user wants to do
> something fancy with the data in the database, I would encourage them
> to put an 'ON INSERT' trigger on 'mytable' to do whatever they want with
> the data that's coming in. This gives you the freedom necessary to
> build an appropriate statement for any database you're connecting to,
> dynamically, using prepared queries, and even binary mode if you want.

one huge advantage of putting the sql into the configuration is the
ability to work around other users of the database.

for example, what if the database has additional columns that you don't
want to touch (say an item# sequence), if the SQL is in the config this is
easy to work around, if it's seperate (or created by the module), this is
very hard to do.

I guess you could give examples of the SQL in the documentation for how to
create the prepared statement etc in the databases, but how is that much
better than having it in the config file?

for many users it's easier to do middlein -fancy stuff in the SQL than
loading things into the database (can you pre-load prepared statements in
the database? or are they a per-connection thing?)

so back to the main questions of the advantages

prepared statements avoid needing to escape things, but at the
complication of a more complex API.

there's still the question of the performance difference. I have been
thinking that the overhead of doing the work itself would overwelm the
performance benifits of prepared statements.

as I understand it, the primary performance benifit is the ability to
avoid the parsing and planning stages of the command. for simple commands
(which I assume inserts to be, even if inserting a lot of stuff), the
planning would seem to be cheap compared to the work of doing the inserts

on a fully tuned database are we talking about 10% performance? 1%? 0.01%?

any ideas

In response to

Responses

Browse pgsql-performance by date

  From Date Subject
Next Message david 2009-04-21 06:05:33 Re: performance for high-volume log insertion
Previous Message Stephen Frost 2009-04-21 04:10:08 Re: performance for high-volume log insertion