Re: PostgreSQL crashes with SIGSEGV

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Andreas Seltenreich <andreas(dot)seltenreich(at)credativ(dot)de>
Cc: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Bernd Helmle <mailings(at)oopsware(dot)de>, PostgreSQL mailing lists <pgsql-bugs(at)postgresql(dot)org>
Subject: Re: PostgreSQL crashes with SIGSEGV
Date: 2017-12-15 01:58:39
Message-ID: CAH2-WzmR=Aucv9mjstXRD8SyP0mfw00PSK+bBnpTE3TRwqR67Q@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs pgsql-hackers

On Thu, Dec 14, 2017 at 12:06 PM, Peter Geoghegan <pg(at)bowt(dot)ie> wrote:
> I can reproduce this against REL9_6_STABLE, once work_mem is set to
> 4MB, and replacement_sort_tuples is set to 150000.

I've figured this one out. It's a double free. Consider this code:

(Although REL9_5_STABLE is also affected, the following remarks
reference REL9_6_STABLE-only code.)

* The slot receives a copied tuple (sometimes allocated in caller memory
* context) that will stay valid regardless of future manipulations of the
* tuplesort's state.
*/
bool
tuplesort_gettupleslot(/* SNIP */)

I wrote that comment block, in commit a5f0bd77a, an early bugfix for
9.6 (I actually mentioned this specific commit to Bernd in passing
already). This bug is very similar to the one fixed by a5f0bd77a, and
is arguably a broader version of that same bug. Note that I didn't do
anything with external sorts until 9.6, so I expect that this affects
all stable branches. That said, it may not be possible to produce a
hard crash on all versions. More or less by accident.

The function opens with this code:

MemoryContext oldcontext = MemoryContextSwitchTo(state->sortcontext);
SortTuple stup;
bool should_free;

if (!tuplesort_gettuple_common(state, forward, &stup, &should_free))
stup.tuple = NULL;

MemoryContextSwitchTo(oldcontext);
/* SNIP */
...

As long as the slot receives a copied tuple that may or may not be in
caller's memory context (it may instead get allocated in a
tuplesort-private memory context), and as long as destroying a
tuplesort via tuplesort_end() counts as a "future manipulation", then
you it's possible to crash with a double-pfree() while not violating
the tuplesort_gettupleslot() contract. The memory may or may not be
freed as part of tuplesort_end() teardown, based only on accidental
implementation factors (that's the first time the memory gets freed
when we do crash). Specifically, if state->status == TSS_SORTEDONTAPE,
we risk this crash on versions before Postgres 10 (more on why 10
appears to be okay tomorrow), when tuplesort_end() just so happens to
have been called (destroying the 2 memory contexts) before the
executor's superfluous pfree() within ExecResetTupleTable().

The ambiguity about who owns the tuple memory when is the basic
problem, once again. One could argue that it's the caller's fault for
not knowing to not pfree() the tuple after tuplesort_end() is called,
but I don't buy that because it seems like an undue burden on callers
to do that. It seems okay to either provide a very weak, simple
guarantee about tuple memory lifetime, or very strong simple guarantee
about tuple memory lifetime. That's what routines like
tuplesort_gettupleslot() and tuplesort_getindextuple() should limit
their contracts to -- we've had enough problems in this area over the
years that that seems pretty clear to me.

ISTM that an appropriate fix is one that results in having
tuplesort_gettupleslot() tuple memory that is *consistently* allocated
within caller's own memory context, at least on versions prior to
Postgres 10 (it's a bit more complicated in Postgres 10). I'll begin
work at a patch for this tomorrow. It's not obvious what approach to
take, because on 9.6+, memory is allocated in tuplecontext, not
sortcontext. A well-targeted fix will not burden low-level READTUP()
routines (that use tuplecontext) with direct knowledge of the problem.

--
Peter Geoghegan

In response to

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message dbaber 2017-12-15 10:26:04 BUG #14979: PostGres does not start at reboot
Previous Message John R Pierce 2017-12-15 00:58:01 Re: BUG #14978: Import/Export button is disabled in pdAdmin 4.2.0

Browse pgsql-hackers by date

  From Date Subject
Next Message Andres Freund 2017-12-15 02:30:59 Re: [HACKERS] [COMMITTERS] pgsql: Fix freezing of a dead HOT-updated tuple
Previous Message Tom Lane 2017-12-15 01:55:31 Re: pearltidy source code has been removed (pgindent)