Re: narwhal and PGDLLIMPORT

From: Noah Misch <noah(at)leadboat(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Dave Page <dpage(at)pgadmin(dot)org>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Hiroshi Inoue <inoue(at)tpf(dot)co(dot)jp>, Andrew Dunstan <andrew(at)dunslane(dot)net>, Craig Ringer <craig(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Andres Freund <andres(at)2ndquadrant(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: narwhal and PGDLLIMPORT
Date: 2014-10-20 05:03:31
Message-ID: 20141020050331.GA259313@tornado.leadboat.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, Oct 15, 2014 at 12:53:03AM -0400, Noah Misch wrote:
> On Tue, Oct 14, 2014 at 07:07:17PM -0400, Tom Lane wrote:
> > Dave Page <dpage(at)pgadmin(dot)org> writes:
> > > On Tue, Oct 14, 2014 at 11:38 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> > >> I think we're hoping that somebody will step up and investigate how
> > >> narwhal's problem might be fixed.
>
> I have planned to look at reproducing narwhal's problem once the dust settles
> on orangutan, but I wouldn't mind if narwhal went away instead.

> > No argument here. I would kind of like to have more than zero
> > understanding of *why* it's failing, just in case there's more to it
> > than "oh, probably a bug in this old toolchain". But finding that out
> > might well take significant time, and in the end not tell us anything
> > very useful.
>
> Agreed on all those points.

I reproduced narwhal's problem using its toolchain on another 32-bit Windows
Server 2003 system. The crash happens at the SHGetFolderPath() call in
pqGetHomeDirectory(). A program can acquire that function via shfolder.dll or
via shell32.dll; we've used the former method since commit 889f038, for better
compatibility[1] with Windows NT 4.0. On this system, shfolder.dll's version
loads and unloads shell32.dll. In PostgreSQL built using this older compiler,
shfolder.dll:SHGetFolderPath() unloads libpq in addition to unloading shell32!
That started with commit 846e91e. I don't expect to understand the mechanism
behind it, but I recommend we switch back to linking libpq with shell32.dll.
The MSVC build already does that in all supported branches, and it feels right
for the MinGW build to follow suit in 9.4+. Windows versions that lack the
symbol in shell32.dll are now ancient history.

I happened to try the same contrib/dblink test suite on PostgreSQL built with
modern MinGW-w64 (i686-4.9.1-release-win32-dwarf-rt_v3-rev1). That, too, gave
a crash-like symptom starting with commit 846e91e. Specifically, a backend
that LOADed any module linked to libpq (libpqwalreceiver, dblink,
postgres_fdw) would suffer this after calling exit(0):

===
3056 2014-10-20 00:40:15.163 GMT LOG: disconnection: session time: 0:00:00.515 user=cyg_server database=template1 host=127.0.0.1 port=3936

This application has requested the Runtime to terminate it in an unusual way.
Please contact the application's support team for more information.

This application has requested the Runtime to terminate it in an unusual way.
Please contact the application's support team for more information.
9300 2014-10-20 00:40:15.163 GMT LOG: server process (PID 3056) exited with exit code 3
===

The mechanism turned out to be disjoint from the mechanism behind the
ancient-compiler crash. Based on the functions called from exit(), my best
guess is that exit() encountered recursion and used something like an abort()
to escape. (I can send the gdb transcript if anyone is curious to see the
gory details.) The proximate cause was commit 846e91e allowing modules to use
shared libgcc. A 32-bit libpq acquires 64-bit integer division from libgcc.
Passing -static-libgcc to the link restores the libgcc situation as it stood
before commit 846e91e. The main beneficiary of shared libgcc is C++/Java
exception handling, so PostgreSQL doesn't care. No doubt there's some deeper
bug in libgcc or in PostgreSQL; loading a module that links with shared libgcc
should not disrupt exit(). I'm content with this workaround.

Patches attached. For readers inclined toward animal husbandry, we could
benefit from more MinGW buildfarm coverage. There are three compiler strains
(classic MinGW, 32-bit MinGW-w64, 64-bit MinGW-w64) and three supported
Windows Server versions.

[1] http://www.postgresql.org/message-id/flat/6BCB9D8A16AC4241919521715F4D8BCE476665(at)algol(dot)sollentuna(dot)se

Attachment Content-Type Size
shell32-v1.patch text/plain 3.0 KB
static-libgcc-v1.patch text/plain 1.9 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Etsuro Fujita 2014-10-20 05:54:05 Typos in comments
Previous Message Jim Nasby 2014-10-20 01:57:36 Proposal: Log inability to lock pages during vacuum