Re: Support for NSS as a libpq TLS backend

From: Daniel Gustafsson <daniel(at)yesql(dot)se>
To: Jacob Champion <pchampion(at)vmware(dot)com>
Cc: "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>, "hlinnaka(at)iki(dot)fi" <hlinnaka(at)iki(dot)fi>, "andrew(dot)dunstan(at)2ndquadrant(dot)com" <andrew(dot)dunstan(at)2ndquadrant(dot)com>, "michael(at)paquier(dot)xyz" <michael(at)paquier(dot)xyz>, "thomas(dot)munro(at)gmail(dot)com" <thomas(dot)munro(at)gmail(dot)com>, "andres(at)anarazel(dot)de" <andres(at)anarazel(dot)de>, "sfrost(at)snowman(dot)net" <sfrost(at)snowman(dot)net>
Subject: Re: Support for NSS as a libpq TLS backend
Date: 2021-06-16 19:52:47
Message-ID: FBBFDB5D-644C-45E7-B8F0-016068464B0C@yesql.se
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

> On 16 Jun 2021, at 18:15, Jacob Champion <pchampion(at)vmware(dot)com> wrote:
>
> On Wed, 2021-06-16 at 15:31 +0200, Daniel Gustafsson wrote:
>>> On 16 Jun 2021, at 01:50, Jacob Champion <pchampion(at)vmware(dot)com> wrote:
>>> I've been tracking down reference leaks in the client. These open
>>> references prevent NSS from shutting down cleanly, which then makes it
>>> impossible to open a new context in the future. This probably affects
>>> other libpq clients more than it affects psql.
>>
>> Ah, nice catch, that's indeed a bug in the frontend implementation. The
>> problem is that the NSS trustdomain cache *must* be empty before shutting down
>> the context, else this very issue happens. Note this in be_tls_destroy():
>>
>> /*
>> * It reads a bit odd to clear a session cache when we are destroying the
>> * context altogether, but if the session cache isn't cleared before
>> * shutting down the context it will fail with SEC_ERROR_BUSY.
>> */
>> SSL_ClearSessionCache();
>>
>> Calling SSL_ClearSessionCache() in pgtls_close() fixes the error.
>
> That's unfortunate. The session cache is global, right? So I'm guessing
> we'll need to refcount and lock that call, to avoid cleaning up out
> from under a thread that's actively using the the cache?

I'm not sure, the documentation doesn't give any answers and implementations of
libnss tend to just clear the cache without consideration. In libcurl we do
just that, and haven't had any complaints - which doesn't mean it's correct but
it's a datapoint.

>> There is another resource leak left (visible in one test after the above is
>> added), the SECMOD module needs to be unloaded in case it's been loaded.
>> Implementing that with SECMOD_UnloadUserModule trips a segfault in NSS which I
>> have yet to figure out (when acquiring a lock with NSSRWLock_LockRead).
>>
>> [...]
>>
>> With your patches I'm seeing a couple of these:
>>
>> SSL error: The one-time function was previously called and failed. Its error code is no longer available
>
> Hmm. Adding SSL_ClearSessionCache() (without thread-safety at the
> moment) fixes all of the SSL tests for me, and I don't see either the
> SECMOD leak or the "one-time function" error that you've mentioned.

Reading the code I don't think a loaded user module is considered a resource
that must've been released prior to closing the context. I will dig for what
showed up in my tests, but I don't think it was caused by this.

> What version of NSS are you running? I'm on 3.63.

Right now I'm using what Debian 10 is packaging which is 3.42. Admittedly not
hot off the press but I've been trying to develop off a packaged version which
we might see users wanting to deploy against should this get shipped.

--
Daniel Gustafsson https://vmware.com/

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Matthias van de Meent 2021-06-16 20:08:39 Re: pg14b1 stuck in lazy_scan_prune/heap_page_prune of pg_statistic
Previous Message Tom Lane 2021-06-16 19:47:56 Re: Improving the isolationtester: fewer failures, less delay