Re: [sqlsmith] Unpinning error in parallel worker

From: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
To: Jonathan Rudenberg <jonathan(at)titanous(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Andreas Seltenreich <seltenreich(at)gmx(dot)de>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [sqlsmith] Unpinning error in parallel worker
Date: 2018-04-17 22:38:14
Message-ID: CAEepm=0gtExezsnVabv79hKSzn61dbdEbzWRxqyJf1nf8hzppQ@mail.gmail.com
Lists: pgsql-hackers

On Wed, Apr 18, 2018 at 8:52 AM, Jonathan Rudenberg
<jonathan(at)titanous(dot)com> wrote:
> Hundreds of queries stuck with a wait_event of DynamicSharedMemoryControlLock and pg_terminate_backend did not terminate the queries.
>
> In the log:
>
>> FATAL: cannot unpin a segment that is not pinned

Thanks for the report. That error is reachable via two paths:

1. Cleanup of a DSA area at the end of a query, giving back all
segments. This is how the bug originally reported in this thread
reached it: we tried to double-destroy the DSA area when the refcount
went down to zero, then back up again, and then back to zero (a
late-starting parallel worker that attached in a narrow time window).
That was fixed in fddf45b3: once the refcount reaches zero we
recognise the area as already destroyed and don't let anyone attach.
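
As a rough illustration of the rule described in path 1 (a toy model,
not the dsa.c code; the toy_area type and every function name here are
hypothetical), a reference count that refuses new attachments once it
has reached zero can only ever trigger cleanup once:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the shared control state of an area. */
typedef struct toy_area
{
    int refcnt;         /* number of currently attached backends */
    int destroy_count;  /* how many times cleanup has run */
} toy_area;

/* The creating backend counts as the first attachment. */
void
toy_area_init(toy_area *area)
{
    area->refcnt = 1;
    area->destroy_count = 0;
}

/*
 * Attach to the area.  Returns false if the refcount has already
 * dropped to zero, i.e. the area is treated as destroyed.
 */
bool
toy_area_attach(toy_area *area)
{
    if (area->refcnt == 0)
        return false;
    area->refcnt++;
    return true;
}

/* Detach; the last backend to detach runs the cleanup exactly once. */
void
toy_area_detach(toy_area *area)
{
    assert(area->refcnt > 0);
    if (--area->refcnt == 0)
        area->destroy_count++;
}
```

In this model a worker that calls toy_area_attach() after the last
detach gets false back, so the count can never go zero, one, zero and
run the cleanup a second time.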

2. In destroy_superblock(), called by dsa_free(), at the point where
we've determined that a 64kB superblock can be given back to the DSM
segment, and that the DSM segment is now entirely free and so can be
given back to the operating system. To decide that, after we put the
pages back into the free page manager we test
fpm_largest(segment_map->fpm) == segment_map->header->usable_pages to
see whether the largest span of free pages is now the same size as the
whole segment.
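
The check in path 2 can be sketched with a toy model (this is not the
free page manager from freepage.c; the toy_segment type, the 64-page
segment size and all function names are assumptions made up for the
example). The idea is only that a segment counts as entirely free when
its largest contiguous run of free pages covers every usable page:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical segment size, chosen only to keep the example small. */
#define TOY_USABLE_PAGES 64

typedef struct toy_segment
{
    bool page_free[TOY_USABLE_PAGES];   /* true if the page is free */
} toy_segment;

/* Length of the longest contiguous run of free pages. */
size_t
toy_largest_free_run(const toy_segment *seg)
{
    size_t best = 0;
    size_t run = 0;

    for (size_t i = 0; i < TOY_USABLE_PAGES; i++)
    {
        if (seg->page_free[i])
        {
            run++;
            if (run > best)
                best = run;
        }
        else
            run = 0;
    }
    return best;
}

/* Give npages starting at page 'first' back to the toy segment. */
void
toy_put_pages(toy_segment *seg, size_t first, size_t npages)
{
    assert(first + npages <= TOY_USABLE_PAGES);
    for (size_t i = first; i < first + npages; i++)
        seg->page_free[i] = true;
}

/* Analogue of the "largest span == usable pages" comparison. */
bool
toy_segment_entirely_free(const toy_segment *seg)
{
    return toy_largest_free_run(seg) == TOY_USABLE_PAGES;
}
```

If the bookkeeping behind either side of that comparison were wrong,
the comparison could report a segment as entirely free at the wrong
moment, which is why this path is a candidate.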

I don't have any theories about how that could be going wrong right
now, but I'm looking into it. There could be a logic bug in dsa.c, or
a logic bug in client code running an invalid sequence of
dsa_allocate() and dsa_free() calls that corrupts state (I wonder if a
well-timed double dsa_free() could produce this effect), or a
common-or-garden overrun bug somewhere that trashes control state.
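
Whatever the root cause, the reported symptom is an unpin request for
a segment that is no longer marked as pinned. A toy model of that
guard (not the dsm.c implementation; the type, the enum and the
function name are hypothetical) shows why a second unpin of the same
segment is rejected rather than silently ignored:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef enum toy_unpin_result
{
    TOY_UNPIN_OK,
    TOY_UNPIN_NOT_PINNED    /* where the real code raises an error */
} toy_unpin_result;

/* Hypothetical stand-in for a segment's slot in shared control state. */
typedef struct toy_dsm_segment
{
    bool pinned;
} toy_dsm_segment;

/* Clear the pinned flag, refusing if it is already clear. */
toy_unpin_result
toy_unpin(toy_dsm_segment *seg)
{
    assert(seg != NULL);
    if (!seg->pinned)
        return TOY_UNPIN_NOT_PINNED;
    seg->pinned = false;
    return TOY_UNPIN_OK;
}
```

In this model any sequence that ends up calling toy_unpin() twice on
the same segment, for whatever reason, trips the guard on the second
call.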

> I don't have a backtrace yet, but I will provide them if/when the issue happens again.

Thanks, that would be much appreciated, as would any clues about what
workload you're running. Do you know what the query plan looks like
for the queries that crashed?

--
Thomas Munro
http://www.enterprisedb.com
