Re: [HACKERS] CSV Logging questions

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Greg Stark <stark(at)mit(dot)edu>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [HACKERS] CSV Logging questions
Date: 2017-11-30 21:09:42
Message-ID: CA+TgmoZzUjL=2ZB5WWaSxqyNHyhaKVmjwES5e7aEmrHSAd_1OA@mail.gmail.com
Lists: pgsql-hackers

On Mon, Sep 4, 2017 at 12:27 PM, Greg Stark <stark(at)mit(dot)edu> wrote:
> 1) Why do we gather a per-session log line number? Is it just to aid
> people importing to avoid duplicate entries from partial files? Is
> there some other purpose given that entries will already be sequential
> in the csv file?

I think the idea is that if you see a line for a session you can tell
whether there are any earlier lines for the same session. It might
not be obvious, because they could be much earlier in the log if a
session was idle for a while. I've certainly run into this problem in
real-world troubleshooting.
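
To make that concrete, here is a minimal sketch in C. It assumes each
CSV record has already been parsed into a hypothetical LogRecord struct
holding the session_id and session_line_num columns, and that line
numbers start at 1 within a session. The struct and both function names
are made up for illustration; they are not PostgreSQL code.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical, simplified view of one parsed CSV log record. */
typedef struct LogRecord
{
    const char *session_id;        /* csvlog session_id column */
    long        session_line_num;  /* csvlog session_line_num column */
} LogRecord;

/*
 * True if the session logged something before this record.
 * Assumes per-session line numbers start at 1.
 */
bool
has_earlier_lines(const LogRecord *rec)
{
    assert(rec != NULL);
    return rec->session_line_num > 1;
}

/*
 * Scan backward from recs[idx] for the nearest earlier record of the
 * same session.  Returns its index, or -1 if none is present in this
 * array (for example because it lives in an older log file).
 */
int
find_previous_line(const LogRecord *recs, int idx)
{
    assert(recs != NULL && idx >= 0);
    for (int i = idx - 1; i >= 0; i--)
    {
        if (strcmp(recs[i].session_id, recs[idx].session_id) == 0)
            return i;
    }
    return -1;
}
```

If has_earlier_lines() is true but find_previous_line() returns -1, the
earlier output for that session is somewhere you have not loaded yet,
which is exactly the case that is easy to miss when a session sat idle.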

> 2) Why is the file error conditional on log_error_verbosity? Surely
> the whole point of a structured log is that you can log everything and
> choose what to display later -- i.e. why csv logging doesn't look at
> log_line_prefix to determine which other bits to display. There's no
> added cost to include this information unconditionally and they're far
> from the largest piece of data being logged either.
>
> 3) Similarly I wonder if the statement should always be included even
> when hide_stmt is set so that users can write sensible queries against
> the data even if it means duplicating data.

I think the principle that the CSV log should contain all of the
output fields can be taken too far, and I'd put both of these ideas in
that category. I don't see any reason to believe there couldn't be a
user who wants CSV logging but not at maximum verbosity -- and
hide_stmt is used for cases like this:

    ereport(LOG,
            (errmsg("statement: %s", query_string),
             errhidestmt(true),
             errdetail_execute(parsetree_list)));

Actually, I think it's poor design to force the CSV log to contain all
of the output fields. For some users, that might make it unusable by
making the output too big. I think it would be better if the data
were self-identifying - e.g. by sticking a header line on each log
file - and perhaps complete by default, but still configurable. We've
had the idea of adding new %-escapes shot down on the grounds that
that would force us to include them all the time in CSV output and
they are too specialized to justify this, but that seems to me to be a
case of the tail wagging the dog.
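
As a rough sketch of the header-line idea, and not something PostgreSQL
does today, the function below builds a header naming whichever columns
have been selected. The function name build_csv_header and the notion
of a configurable column list are assumptions made for illustration;
the column names in the usage note are taken from the existing CSV log
format.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Write a comma-separated header naming the selected columns into buf.
 * Returns the header length (excluding the terminating NUL), or -1 if
 * buf is too small.  buf is always left NUL-terminated.
 */
int
build_csv_header(const char *const *cols, size_t ncols,
                 char *buf, size_t buflen)
{
    size_t used = 0;

    assert(cols != NULL && buf != NULL && buflen > 0);
    buf[0] = '\0';

    for (size_t i = 0; i < ncols; i++)
    {
        size_t len = strlen(cols[i]);
        size_t need = len + (i > 0 ? 1 : 0);   /* name plus separator */

        /* Keep one byte in reserve for the terminating NUL. */
        if (used + need + 1 > buflen)
        {
            buf[used] = '\0';
            return -1;
        }
        if (i > 0)
            buf[used++] = ',';
        memcpy(buf + used, cols[i], len);
        used += len;
    }
    buf[used] = '\0';
    return (int) used;
}
```

Called with the columns log_time, process_id, session_id and message,
this yields "log_time,process_id,session_id,message". A reader of the
file could then map fields by name instead of by position, so adding or
dropping a column would not break existing import scripts.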

> 4) Why the session start time? Is this just so that <process_id,
> session_start_time> uniquely identifies a session? Should we perhaps
> generate a unique session identifier instead?

We could do that, but it doesn't seem better...
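
For reference, the documentation describes the existing session ID (the
%c escape, also the session_id CSV column) as the process start time
and the process ID, each printed in hexadecimal and separated by a dot.
A minimal sketch of that formatting follows; the function name
format_session_id is hypothetical, and this is not a copy of the server
code.

```c
#include <assert.h>
#include <stdio.h>
#include <time.h>

/*
 * Format a session identifier as "<start time in hex>.<pid in hex>".
 * Returns what snprintf returns: the length the full string needs,
 * which is >= buflen if the output was truncated.
 */
int
format_session_id(char *buf, size_t buflen, time_t start_time, int pid)
{
    assert(buf != NULL && buflen > 0);
    return snprintf(buf, buflen, "%lx.%x",
                    (unsigned long) start_time, (unsigned int) pid);
}
```

So the pair in question is already collapsed into one field, and a
separately generated identifier would mostly duplicate it.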

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
