Re: Patch to implement pg_current_logfile() function

From: "Karl O(dot) Pinc" <kop(at)meme(dot)com>
To: Gilles Darold <gilles(dot)darold(at)dalibo(dot)com>
Cc: Christoph Berg <myon(at)debian(dot)org>, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Shulgin, Oleksandr" <oleksandr(dot)shulgin(at)zalando(dot)de>, Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Patch to implement pg_current_logfile() function
Date: 2016-10-27 03:25:13
Message-ID: 20161026222513.74cd3def@slate.meme.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Thu, 27 Oct 2016 00:31:56 +0200
Gilles Darold <gilles(dot)darold(at)dalibo(dot)com> wrote:

> Thanks a lot for the documentation fixes, I've also patched some of
> your changes, see v7 of the patch and explanations bellow.

Thanks. Sorry if I've not kept up on your latest decisions.

> > Put pg_log_file in alphabetical order in the file name listing
> > and section body.
>
> This file is now named current_logfile, I have changed the named in
> the documentation, especially in storage.sgml.

Since it now can contain multiple pathnames perhaps the name of the
file should be "current_logfiles"? Seems more descriptive.

> One other change in documentation is the explanation of values stored
> in this file. This is a path: log_directory/log_filename, and no more
> the log file name only. This will help to get full path of the log at
> system command level. This is also the value returned by function the
> pg_current_logfile() to be able to read file directly with a simple:
>
> SELECT pg_read_file(pg_current_logfile());

Sounds good.

> > I also have a couple of preliminary comments.

> > Since pg_log_file may contain only one line, and that
> > line may be either the filename of the csv log file or
> > the file name of the stderr file name it's impossible
> > to tell whether that single file is in csv or stderr
> > format. I suppose it might be possible based on file
> > name suffix, but Unix expressly ignores file name
> > extensions and it seems unwise to force dependence
> > on them. Perhaps each line could begin with
> > the "type" of the file, either 'csv' or 'stderr'
> > followed by a space and the file name?
>
> The current_logfile may contain 2 lines, one for csvlog and an other
> for stderr when they are both defined in log_destination. As said
> above, the csvlog file will always have the .csv suffix, I guess it
> is enough to now the format but I agree that it will not be true in
> all cases. To keep things simple I prefer to only register the path
> to the log file, external tools can easily detect the format or can
> ask for the path to a specific log format using SELECT
> pg_current_logfile('stderr'); for example. This is my point of view,
> but if there's a majority to add the log format into the
> current_logfile why not.

Let me explain my reasoning: The current_logfile file's location
is not typically going to change. The cluster gets created and
current_logfile then has a known location.

Having a known location in the filesystem for current_logfile is sort
of the point of current_logfile. A script knows where to find it,
to look at it's content and find out where to get the log file
it needs. Execing a psql process or otherwise creating a database
connection, just in order to execute "SELECT
pg_current_logfile('stderr');" to discover where the current logs
are, is _way_ more complicated and error prone that reading
a file's content.

But what if current_logfile contains only a single line? What
sort of file format does the logfile have? If you don't know
you can't process the logfile content.

When there's multiple lines in current_logfile your script might
be looking for a logfile in a particular format. How is the
script supposed to know the file format of each logfile listed?
Relying on a '.csv' file name extension is adequate, but
only under a number of assumptions. What if PG adds another
logfile format that's different from stderr; how would this
new format be distinguished from stderr? What if MS Excel
changes the CSV format (yet again) or a non MS Excel CSV
format is supported by PG, so that PG supports
multiple CSV formats; how would the different CSV
file formats be distinguished from each other. Without
going on and on, there's good reasons why Unix does not
rely on filename extensions to determine the format of file
content. Sure, scripts could always do "something", like
exec the "file" program, to discover, or at least guess
at, the format of the log file. Or we could have PG
be forever dependent upon using filename extensions
when introducing new logfile formats. There's always
kludges which will handle new circumstances. But why
go there? PG knows the format of the files it's writing into
current_logfile.

My argument is strongly related to the "Explicit is better
than implicit" philosophy of design.

It's not hard to cleanly expose the logfile format to the
reader of current_logfile; guessing would never be required
no matter the future of PG.
"current_logfile" must then be parsed, but if you _are_ going
to expose the file format then putting it into the same file
as the logfile pathnames is the way to go.

"format <space> path" is as easy to parse as looking for
a ".csv" suffix, and a lot more clear and future proof. (Even
works with spaces in the "path", in shell, using the "read"
builtin.) It does mean that every user of current_logfile _must_
parse. If you don't put an explicit format into the file content
readers who already know what file format they're going
to get don't have to parse. Making these readers parse
does not seem like too high a price.

What it comes down to is I don't buy the adequacy of the
".csv" suffix test and think that "keeping things simple" now
is a recipe for future breakage, or at least significant future
complication and confusion when it come to processing logfile
content.

Regards the data structure to use to expose the file format
I can't vouch that "format path" is most future-proof.
It's what I came up with on the spur of the moment.
Something like: "format <format>: path <path>",
where ":" is the field separator and each data element is
tagged, would still be parseable by the shell "read" built-in
so long as the path comes last. I don't really care about
the exact data structure but I do think the file format
meta-information should be included.

> I have copy/paste here your other comments to limit the number of
> messages:

Thanks. I was not very disciplined and wrote a lot of emails.

> > You're writing Unix eol characters into pg_log_file. (I think.)
> > Does this matter on MS Windows? (I'm not up on MS Windows,
> > and haven't put any thought into this at all. But didn't
> > want to forget about the potential issue.)
>
> This doesn't matter as the file is opened in text mode (see
> logfile_open() in syslogger.c) and use of LF in fprintf() will be
> automatically converted to CRLF on MS Windows.

This I should have recalled. Thanks for the explanation.

> > While you're at it, it wouldn't hurt to provide another
> > function that tells you the format of the file returned
> > by pg_current_logfile(), since pg_current_logfile()
> > called without arguments could return either a stderr
> > formatted file or a csvlog formatted file.
>
> The log format can be check using something like:
>
> postgres=# SELECT pg_current_logfile(), pg_current_logfile('csvlog');
> pg_current_logfile |
> pg_current_logfile
> -----------------------------------------+-----------------------------------------
> pg_log/postgresql-2016-10-26_231700.log |
> pg_log/postgresql-2016-10-26_231700.csv
> (1 row)
>
> postgres=# SELECT CASE WHEN (pg_current_logfile() =
> pg_current_logfile('stderr')) THEN 'stderr' ELSE 'csvlog' END AS
> log_format; log_format
> ------------
> stderr
> (1 row)
>
> but if we want we can have an other internal function with a call
> like:
>
> SELECT pg_log_format(pg_current_logfile());
>
> that will return stderr or csvlog but I'm not sure this is necessary.

Why not just: SELECT pg_log_format(); since you only ever need to
know what log format is returned by pg_current_logfile() when called
without arguments? Otherwise you already know the log format because
you asked for something specific.

My thoughts are as follows: Either you know the log format because
you configured the cluster or you don't. If you don't know the log
format having the log file is halfway useless. You can do something
like back it up, but you can't ever look at it's contents (in some
sense) without knowing what data structure you're looking at.

Therefore pg_current_logfile() without any arguments is, in the sense
of any sort of automated processing of the logfile content, useless.
You must know the format of the returned logfile. Anybody
who's doing automated log file processing, on a cluster which they
did not configure, must write a "pg_log_format()" equivalent
function. (Probably, as you say, by using the pg_current_logfile(text)
function.) Not everybody's going to be doing this sort of logfile
processing but I'd think it'd be common enough that a pg_log_format()
function would be in some demand.

In a lot of ways this argument is related to the one above. I see
a need for automated processing of logfiles in arbitrarily configured
PG clusters.

> Thanks for your review, let me know if there's any thing to adapt.

I'll look over your v7 patch very soon, I hope. If I have documentation
edits I'll probably send these first. Knowing what the patch is
supposed to do helps me when reading the code. :)

Regards,

Karl <kop(at)meme(dot)com>
Free Software: "You don't pay back, you pay forward."
-- Robert A. Heinlein

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Michael Paquier 2016-10-27 04:21:54 Re: [BUG] pg_basebackup from disconnected standby fails
Previous Message Amit Langote 2016-10-27 02:08:24 Re: Declarative partitioning - another take