Re: narwhal and PGDLLIMPORT

From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Noah Misch <noah(at)leadboat(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Dave Page <dpage(at)pgadmin(dot)org>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Hiroshi Inoue <inoue(at)tpf(dot)co(dot)jp>, Andrew Dunstan <andrew(at)dunslane(dot)net>, Craig Ringer <craig(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: narwhal and PGDLLIMPORT
Date: 2014-10-20 20:24:47
Message-ID: 20141020202447.GH7176@awork2.anarazel.de
Lists: pgsql-hackers

On 2014-10-20 01:03:31 -0400, Noah Misch wrote:
> On Wed, Oct 15, 2014 at 12:53:03AM -0400, Noah Misch wrote:
> > On Tue, Oct 14, 2014 at 07:07:17PM -0400, Tom Lane wrote:
> > > Dave Page <dpage(at)pgadmin(dot)org> writes:
> > > > On Tue, Oct 14, 2014 at 11:38 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> > > >> I think we're hoping that somebody will step up and investigate how
> > > >> narwhal's problem might be fixed.
> >
> > I have planned to look at reproducing narwhal's problem once the dust settles
> > on orangutan, but I wouldn't mind if narwhal went away instead.
>
> > > No argument here. I would kind of like to have more than zero
> > > understanding of *why* it's failing, just in case there's more to it
> > > than "oh, probably a bug in this old toolchain". But finding that out
> > > might well take significant time, and in the end not tell us anything
> > > very useful.
> >
> > Agreed on all those points.
>
> I reproduced narwhal's problem using its toolchain on another 32-bit Windows
> Server 2003 system. The crash happens at the SHGetFolderPath() call in
> pqGetHomeDirectory(). A program can acquire that function via shfolder.dll or
> via shell32.dll; we've used the former method since commit 889f038, for better
> compatibility[1] with Windows NT 4.0. On this system, shfolder.dll's version
> loads and unloads shell32.dll. In PostgreSQL built using this older compiler,
> shfolder.dll:SHGetFolderPath() unloads libpq in addition to unloading shell32!
> That started with commit 846e91e. I don't expect to understand the mechanism
> behind it, but I recommend we switch back to linking libpq with shell32.dll.
> The MSVC build already does that in all supported branches, and it feels right
> for the MinGW build to follow suit in 9.4+. Windows versions that lack the
> symbol in shell32.dll are now ancient history.

Ick. Nice detective work on an ugly situation.

> I happened to try the same contrib/dblink test suite on PostgreSQL built with
> modern MinGW-w64 (i686-4.9.1-release-win32-dwarf-rt_v3-rev1). That, too, gave
> a crash-like symptom starting with commit 846e91e. Specifically, a backend
> that LOADed any module linked to libpq (libpqwalreceiver, dblink,
> postgres_fdw) would suffer this after calling exit(0):
>
> ===
> 3056 2014-10-20 00:40:15.163 GMT LOG: disconnection: session time: 0:00:00.515 user=cyg_server database=template1 host=127.0.0.1 port=3936
>
> This application has requested the Runtime to terminate it in an unusual way.
> Please contact the application's support team for more information.
>
> This application has requested the Runtime to terminate it in an unusual way.
> Please contact the application's support team for more information.
> 9300 2014-10-20 00:40:15.163 GMT LOG: server process (PID 3056) exited with exit code 3
> ===
>
> The mechanism turned out to be disjoint from the mechanism behind the
> ancient-compiler crash. Based on the functions called from exit(), my best
> guess is that exit() encountered recursion and used something like an abort()
> to escape.

Hm.

> (I can send the gdb transcript if anyone is curious to see the
> gory details.)

That would be interesting.

> The proximate cause was commit 846e91e allowing modules to use
> shared libgcc. A 32-bit libpq acquires 64-bit integer division from libgcc.
> Passing -static-libgcc to the link restores the libgcc situation as it stood
> before commit 846e91e. The main beneficiary of shared libgcc is C++/Java
> exception handling, so PostgreSQL doesn't care. No doubt there's some deeper
> bug in libgcc or in PostgreSQL; loading a module that links with shared libgcc
> should not disrupt exit(). I'm content with this workaround.

I'm unconvinced by this reasoning. Popular postgres extensions like
postgis do use C++. It's imo not hard to imagine situations where
switching to a statically linked libgcc could cause problems.
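
For anyone following along, a minimal sketch of where the libgcc
dependency comes from. The helper names div64 and mod64 are hypothetical
and not taken from libpq; the point is only that plain 64-bit arithmetic
on a 32-bit x86 target is typically compiled by GCC into calls to libgcc
routines such as __divdi3 and __moddi3, so the module needs libgcc in
either shared or static form.

```c
#include <stdint.h>

/*
 * Hypothetical helpers for illustration only (not libpq code).
 *
 * On a 32-bit x86 target, GCC usually cannot do 64-bit division inline,
 * so the '/' below typically becomes a call to __divdi3 and the '%' a
 * call to __moddi3, both provided by libgcc.  On a 64-bit target the
 * same source compiles to native instructions and no helper is needed.
 */
int64_t
div64(int64_t num, int64_t den)
{
	return num / den;
}

int64_t
mod64(int64_t num, int64_t den)
{
	return num % den;
}
```

Compiling this with a 32-bit GCC and inspecting the object file with nm
should show the helper routines as undefined symbols, assuming the
compiler does not inline or constant-fold them. Linking with
-static-libgcc copies those routines into the module, while a default
link may resolve them from the shared libgcc instead.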

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
