BUG #7624: Misleading Log Message & Inconsistent Configuration Design

From: tixu(at)cs(dot)ucsd(dot)edu
To: pgsql-bugs(at)postgresql(dot)org
Subject: BUG #7624: Misleading Log Message & Inconsistent Configuration Design
Date: 2012-10-28 07:35:36
Message-ID: E1TSNPE-0003BZ-Sb@wrigleys.postgresql.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

The following bug has been logged on the website:

Bug reference: 7624
Logged by: Tianyin Xu
Email address: tixu(at)cs(dot)ucsd(dot)edu
PostgreSQL version: 9.2.1
Operating system: Linux (actually doesn't matter)
Description:

Hi, Postgresql,

I'm here to report a very misleading log message and inconsistent
configuration design. It confuses users, unnecessarily complicating
configuration error diagnosis and causing frustration. As a computer system
grad student familiar with C code and GDB, it took me ~1.8 hours to figure
out the problem.

--------------------------

So, in the postgresql.conf, I had the following entries:

data_directory = 'local/pgsql/data/'
hba_file = 'local/pgsql/data/pg_hba.conf'
ident_file = 'local/pgsql/data/pg_ident.conf'
unix_socket_directory = 'local/pgsql/data/'

Without the “unix_socket_directory” setting, when I started the server,
everything was fine.

However, with the “unix_socket_directory” setting, the server refused to
start with the following message:

FATAL: could not create lock file "local/pgsql/data/.s.PGSQL.5432.lock": No
such file or directory

Since the sever definitely could find “data_directory”, “hba_file”, and
“ident_file”, the “local/pgsql/data” was definitely considered as
***valid*** path.

The weird thing to me is that typing “#touch
local/pgsql/data/.s.PGSQL.5432.lock” could generate the file, and even I
generated the file, the server still complained with the same log message.

Then, I used GDB to trace the error point and found the message was printed
because of the failure of “fopen”,

/* src/backend/utils/init/miscinit.c */
static void
CreateLockFile(const char *filename, bool amPostmaster,
bool isDDLock, const char *refName)
{
….
fd = open(filename, O_RDWR | O_CREAT | O_EXCL, 0600);
if ((errno != EEXIST && errno != EACCES) || ntries > 100)
ereport(FATAL,
(errcode_for_file_access(),
errmsg("could not create lock file
\"%s\": %m",
filename)));
….
}

I copied this piece of code out into a test.c file, compiled, and ran. It
did generate the “local/pgsql/data/.s.PGSQL.5432.lock”!!!, which means the
function “open” definitely works.

Then, I print the errno and found it's 2, i.e., ENOENT:

O_CREAT is not set and the named file does not exist; or O_CREAT is set and
either the path prefix does not exist or the path argument points to an
empty string.

But I thought (I was stupid) this's bullshit. Because path was not an empty
string, and it definitely existed! Otherwise how could “data_directory”,
“hba_file”, “ident_file” and the other similar file/directory paths work?

No idea...

I installed the newest postgres-9.2.1 again, but nothing changed. I
suspected it's because of some weird settings on my machine, so I installed
postgres again on a new server machine, but still nothing changed!

No idea...

Found a senior student and explained this problem. He tried the same thing
as I did and also had no idea.

What the hell...

Go back to the code, trace more carefully (patient, patient, patient), and
finally found the problem!!!

On the execution path, the server first called:

/*src/backend/postmaster/postmaster.c*/
757 ChangeToDataDir();

And then, called

/*src/backend/utils/init/miscinit.c*/
965 CreateLockFile(lockfile, amPostmaster, false, socketfile);

So, when it called open(1), the working directory is changed to the data
directory!

Actually the errno makes sense. I was stupid!!

-----------------------------

But how can users know this without reading the code... The only interface
exposed to users are the configuration file which is not consistent at all.
And the log message does not help but mislead users (the directory **does**
exist...).

I really suggest to make the configuration file consistent in this case. But
I understand it might not be easy. But at least I think we should print a
better log message which pinpoints to the absolute path like

FATAL: 58P01: could not create lock file
"/home/tianyin/postgresql-9.2.1/local/pgsql/data/local/pgsql/data/.s.PGSQL.5432.lock":
No such file or directory

In this case, the users can definitely figure out the problem by themselves.

Here is the patch:

diff --git a/miscinit_old.c b/miscinit.c
index fb376a0..8df83a8 100644
--- a/miscinit_old.c
+++ b/miscinit.c
@@ -736,11 +736,13 @@ CreateLockFile(const char *filename, bool
amPostmaster,
/*
* Couldn't create the pid file. Probably it already
exists.
*/
- if ((errno != EEXIST && errno != EACCES) || ntries > 100)
+ if ((errno != EEXIST && errno != EACCES) || ntries > 100) {
+ char* abs_filename = make_absolute_path(filename);
ereport(FATAL,
(errcode_for_file_access(),
errmsg("could not create lock file
\"%s\": %m",
- filename)));
+ abs_filename)));
+ }

/*
* Read the file to get the old owner's PID. Note race
condition

Thanks a lot!
Tianyin

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message tixu 2012-10-29 00:07:01 BUG #7625: corrupted pid file prevents postgreSQL from normal starting
Previous Message Craig Ringer 2012-10-28 07:30:54 Re: BUG #7620: array_to_json doesn't support heterogeneous arrays