Re: DSM robustness failure (was Re: Peripatus/failures)

From: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Larry Rosenman <ler(at)lerctr(dot)org>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Subject: Re: DSM robustness failure (was Re: Peripatus/failures)
Date: 2018-10-18 02:58:06
Message-ID: CAEepm=2dyAcmZOUv8VsgWKiSRjjF1X0oRNecna94+nwTbyoGTQ@mail.gmail.com
Lists: pgsql-hackers

On Thu, Oct 18, 2018 at 2:36 PM Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Larry's REL_10_STABLE failure logs are interesting:
>
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=peripatus&dt=2018-10-17%2020%3A42%3A17
>
> 2018-10-17 15:48:08.849 CDT [55240:7] LOG: dynamic shared memory control segment is corrupt
> 2018-10-17 15:48:08.849 CDT [55240:8] LOG: sem_destroy failed: Invalid argument
> 2018-10-17 15:48:08.850 CDT [55240:9] LOG: sem_destroy failed: Invalid argument
> 2018-10-17 15:48:08.850 CDT [55240:10] LOG: sem_destroy failed: Invalid argument
> 2018-10-17 15:48:08.850 CDT [55240:11] LOG: sem_destroy failed: Invalid argument
> ... lots more ...
> 2018-10-17 15:48:08.862 CDT [55240:122] LOG: sem_destroy failed: Invalid argument
> 2018-10-17 15:48:08.862 CDT [55240:123] LOG: sem_destroy failed: Invalid argument
> TRAP: FailedAssertion("!(dsm_control_mapped_size == 0)", File: "dsm.c", Line: 182)
>
> So at least in this case, not only did we lose the DSM segment but also
> all of our semaphores. Is it conceivable that Python somehow destroyed
> those objects, rather than stomping on the contents of the DSM segment?
> If not, how do we explain this log?

One idea: In the backend I'm looking at, there is a contiguous run of
read/write mappings from the location of the semaphore array
through to the DSM control segment. That means that a single runaway
loop/memcpy/memset etc. could overwrite both of those. Eventually it
would run off the end of the contiguously mapped space and SEGV, and we do
indeed see a segfault from that Python code before the trouble begins.
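
To make the contiguous-overwrite idea concrete, here is a minimal sketch in
C. It is not PostgreSQL code: one heap buffer stands in for the adjacent
mappings, and REGION_SIZE, DEMO_MAGIC, overrun_result and simulate_overrun()
are all hypothetical names invented for the illustration. The layout
(a scratch buffer, then a "semaphore array", then a "control segment") is
an assumption made to mirror the ordering described above. The write is
clamped to the buffer, so unlike the real case it never faults.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define REGION_SIZE 4096u        /* hypothetical size of each region */
#define DEMO_MAGIC  0x5ca1ab1eu  /* hypothetical "object is valid" marker */

typedef struct {
    int sems_intact;     /* 1 if every fake semaphore still has its marker */
    int control_intact;  /* 1 if the fake control header still has its marker */
} overrun_result;

/*
 * Simulate a runaway memset that starts at the beginning of a scratch
 * buffer and keeps going for bytes_written bytes.  The three regions are
 * adjacent, standing in for contiguous read/write mappings:
 *
 *   [ scratch ][ fake semaphore array ][ fake control segment ]
 *
 * Returns which of the two later regions survived.  If allocation fails,
 * both fields are reported as 0.
 */
overrun_result simulate_overrun(size_t bytes_written)
{
    overrun_result r = {0, 0};
    size_t total = 3 * (size_t) REGION_SIZE;
    size_t nsems = REGION_SIZE / sizeof(uint32_t);
    unsigned char *base;
    uint32_t *sems;
    uint32_t *control;
    size_t i;

    assert(REGION_SIZE % sizeof(uint32_t) == 0);

    base = malloc(total);
    if (base == NULL)
        return r;

    sems = (uint32_t *) (base + REGION_SIZE);
    control = (uint32_t *) (base + 2 * (size_t) REGION_SIZE);

    /* Mark every fake semaphore and the fake control header as valid. */
    for (i = 0; i < nsems; i++)
        sems[i] = DEMO_MAGIC;
    control[0] = DEMO_MAGIC;

    /* Clamp so the demo stays inside its own allocation; a real runaway
     * write would continue until it hit unmapped memory and faulted. */
    if (bytes_written > total)
        bytes_written = total;

    /* The "bug": a write meant for the scratch region that does not stop. */
    memset(base, 0, bytes_written);

    r.sems_intact = 1;
    for (i = 0; i < nsems; i++)
    {
        if (sems[i] != DEMO_MAGIC)
        {
            r.sems_intact = 0;
            break;
        }
    }
    r.control_intact = (control[0] == DEMO_MAGIC);

    free(base);
    return r;
}
```

Calling simulate_overrun(REGION_SIZE) leaves both markers in place, while
simulate_overrun(3 * REGION_SIZE) wipes the semaphore markers and the
control header in one pass, which is the shape of failure the log would be
consistent with: invalid semaphores plus a control segment that no longer
looks sane.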

> Also, why is there branch-specific variation? The fact that v11 and HEAD
> aren't whinging about lost semaphores is not hard to understand --- we
> stopped using SysV semas. But why don't the older branches look like v10
> here?

I think v10 is where we switched to POSIX unnamed semaphores
(= sem_destroy()), so it's 10, 11 and master that should be the same in
this respect, no?

--
Thomas Munro
http://www.enterprisedb.com
