Re: Redacting information from logs

From: Andres Freund <andres(at)anarazel(dot)de>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Jeff Davis <pgsql(at)j-davis(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Redacting information from logs
Date: 2019-08-04 18:17:57
Message-ID: 20190804181757.jkkqfew2ucqgkvxe@alap3.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

On 2019-08-03 19:14:13 -0400, Tom Lane wrote:
> Andres Freund <andres(at)anarazel(dot)de> writes:
> > I.e. something very roughly like
>
> > ereport(ERROR,
> > errmsg_rich("string with %{named}s references to %{parameter}s"),
> > errparam("named", somevar),
> > errparam("parameter", othervar, .redact = CONTEXT));
>
> I'm not terribly attracted by that specific approach, though. With
> this, we'd get to maintain *two* variants of snprintf, and I think
> the one with annotation knowledge would have significant performance
> problems. (Something I'm sensitive to, on account of certain persons
> beating me over the head about snprintf.c's performance.)

I don't think that the performance issues of *printf in general - which
we use in a lot of hot paths like output functions - are the same as
when we log messages/throw errors. My gut feeling that at least >=
DEBUG1 the difference would be hard to measure. And given how we
e.g. build up the log_line_prefix, we could probably also recoup a lot
potentially lost performance, around logging.

We would probably just allow both positional references and named
references. I.e. for smaller messages we can just use the current
format, and just for the very long ones with many parameters we'd
normally use a positional reference - but still have the name associated
with the parameter. Even if we don't want to allow reference parameters
by name in the format string, I think it's worthwhile. See below.

It's really kind of a shame that we constantly re-parse the format
strings. With something like modern C++ it'd be fairly easy to not do
that (by using compile time evaluated constexprs when possible) - but I
wonder if we couldn't also manage to do that within plain C, at the cost
of a branch and some macro magic. I'm imagining something like

#define pg_printf(fmt, ...) \
do { \
if ( __builtin_constant_p(fmt)) \
{ \
static processed_fmt processed_fmt_ = {.format = fmt}; \
if (unlikely(!processed_fmt_.parsed)) \
preprocess_format_string(&processed_fmt_) \
pg_printf_processed(&processed_fmt_, __VA_ARGS__); \
} \
else \
pg_printf_unprocessed(fmt, __VA_ARGS__); \
} while (0) \

having to use a static variable for this purpose sucks somewhat, but I
can't think of something in C that'd let us avoid that.

> It'd be an amazing pain in the rear for translators, too, on account
> of their tools not understanding this format-string language.

The limitations of that world are starting to be really frustrating.

> It seems to me that it'd be sufficient to do the annotation by
> inserting wrapper functions, like the errparam() you suggest above.
> If we just had errparam() choosing whether to return "..." instead of
> its argument string, we'd have what we need, without messing with
> the format language.

I agree that if redaction were the only reason, that the expanded format
strings wouldn't be worth it. And perhaps they still aren't.

I think we've quite repeatedly had requests for a log format that can be
parsed reasonably, and annotated with additional information that's too
verbose for the main message. It's a pain to have to parse quite
complex strings in production, just to extract the value of the actual
information in the message.

As an *easy* example of something I wanted to do in the past: Plot the
total wait time of backends that had to wait for locks longer than
deadlock_timeout. In that case, because we saw the system failing with
too much waiting, but didn't really know when that started become problematic, and what the neormal amount of waiting is
So we need to parse
ereport(LOG,
(errmsg("process %d acquired %s on %s after %ld.%03d ms",
MyProcPid, modename, buf.data, msecs, usecs)));
into something usable. First problem is of course that the
log_line_prefix is variable, and that we'll likely need information from
it to hunt down the source of locks. Parsing the above with regexes
isn't too hard - but then we also need to deal with the fact that the
message can be translated. Oh, and finding the associated statement
requires looking forward in the log stream by a hard to predict amount,
and requires knowledge about log verbosity was set to.

In the past I had also wanted to parse the log to understand how much
work manual and auto-vacuum an installation was doing, and whether
that's perhaps correlated to problems. But for that one need a parser
that can handle this mess:

https://git.postgresql.org/gitweb/?p=postgresql.git;a=blob;f=src/backend/access/heap/vacuumlazy.c;h=a3c4a1df3b4b28953ab1dfcd43f3d566eecac555;hb=HEAD#l396

which is obviously much harder.

If we instead had the relevant variables separated out in a named
fashion, we could - like many other tools these days - have a richer log
target, that'd first have typical log message as a human readable
string, but then also all the interpolated values separately.

E.g. in theaforementioned simpler case it could be something like

ts="2019-08-04 10:19:21.286 PDT" pid=32416 vxid=12/9 sev=LOG \
msg="Prozess 15027 erlangte AccessShareLock-Sperre auf Relation 1259 der Datenbank 13416 nach 29808,865 ms" \
fmt="process %d acquired %s on %s after %ld.%03d ms" \
p:modename="AccessShareLock-Sperre auf Relation 1259" \
p:msec=18343 p:usec=173 \
p:locktag=190345/134343/0/0/relation/default \
c:statement="LOCK TABLE pg_class;" \
l:func=ProcSleep l:file=proc.c l:line=1495

So, the prefix fields are just included as k=v, the evaluated message as
one field, the *untranslated* format as another. All the parameters are
included prefixed with p:, context information prefixed with c:, and
location with l:. Note how e.g. p:modename is a pre-translated string,
which is why p:locktag was added as an intentionall untranslated
paramter that's not included in the message.

This isn't perfect, but still a heck of a lot easier to parse than what
we have now. All of the data is one block rather than in consecutive log
lines that may or may not belong together, it's key=value in a way that
can be parsed, the untranslated format string is available, so tools can
actually analyze logs independent of the current locale.

This isn't a proposal to adopt precisely the above log format - that's
just something I came up with while writing this email - but to have
enough information available to actually produce something like it when
emitting logs.

If we have enough information to pass to the logging hook, we don't even
need to define how all of this is going to look like exactly (although
I'd probably argue that a logfmt or json target ought to be in core).

The cool part - and people from other projects/company projects might be
laughing about me describing it as cool in this day and age - is that
once you have a format that can readily be parsed it's pretty simple to
write tools that process a logfile and extract just the information you
want.

Greetings,

Andres Freund

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Alvaro Herrera 2019-08-04 19:00:53 Re: pgsql: Improve pruning of a default partition
Previous Message Michail Nikolaev 2019-08-04 18:13:46 Re: Optimize single tuple fetch from nbtree index