Re: A couple of random BF failures in kerberosCheck

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: A couple of random BF failures in kerberosCheck
Date: 2019-08-03 21:04:16
Message-ID: 21009.1564866256@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Thomas Munro <thomas(dot)munro(at)gmail(dot)com> writes:
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=elver&dt=2019-07-24%2003%3A22%3A17
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=crake&dt=2019-08-02%2007%3A17%3A25

> I wondered if this might be like the recently fixed problem with slapd
> not being ready to handle requests yet, since we start up krb5kdc
> first and then don't do anything explicit to wait for it, but it
> doesn't look like an obvious failure to reach it.

I spent a bit of time trying to reproduce these failures, using a
Fedora 29 install that should pretty nearly match what crake is
running. I didn't yet match the shown failures, but I did get this
after a bunch of attempts:

# Running: /usr/sbin/krb5kdc -P /home/tgl/pgsql/src/test/kerberos/tmp_check/krb5kdc.pid
Bail out! system /usr/sbin/krb5kdc failed

Looking into the tmp_check/krb5kdc.log file finds

Aug 03 16:04:01 mini12.sss.pgh.pa.us krb5kdc[14340](info): setting up network...
krb5kdc: Address already in use - Cannot bind server socket on 127.0.0.1.55324
Aug 03 16:04:01 mini12.sss.pgh.pa.us krb5kdc[14340](Error): Failed setting up a TCP socket (for 127.0.0.1.55324)
krb5kdc: Address already in use - Error setting up network

So this leads to two points:

* kerberos/t/001_auth.pl just blithely assumes that it can pick
any random port above 48K and that's guaranteed to be free.
Maybe we should split out the code in get_new_node for finding
a free TCP port, so we can call it here?

* AFAICS, the only provision for shutting down krb5kdc at the end of
the test run is

END
{
kill 'INT', `cat $kdc_pidfile` if -f $kdc_pidfile;
}

I wonder how reliable that is, especially in contexts where the calling
script might do "rm -rf tmp_check" shortly afterwards. Maybe it'd be
better to try to shut down krb5kdc explicitly before we exit the test
script. I'd suggest waiting for krb5kdc to remove its pidfile, except
it seems not to do so :-(

Despite my suspicions about the shutdown provisions, I found that this is
somewhat reproducible and the problematic port is *not* the one assigned
in the previous iteration of 001_auth.pl. However, I notice that after
running this test in a loop for awhile, there are an awful lot of local
loopback connections in TIME_WAIT state. I hypothesize that the failures
correspond to cases where we try to re-use a port number that some
previous test iteration used, possibly on the client side not the server
side of GSS. I wonder whether we are doing something that keeps those
GSS query connections from being closed more cleanly/rapidly. (The
sockets do go away after a minute or so, but why are they in TIME_WAIT
at all?)

None of these points seem to explain the buildfarm failures, though,
especially not elver's where only one connection attempt failed.

regards, tom lane

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tomas Vondra 2019-08-03 21:34:46 Re: Redacting information from logs
Previous Message Etsuro Fujita 2019-08-03 19:45:47 Re: partition routing layering in nodeModifyTable.c