Re: segmentation fault when cassert enabled

From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Jehan-Guillaume de Rorthais <jgdr(at)dalibo(dot)com>
Cc: Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Peter Geoghegan <pg(at)bowt(dot)ie>
Subject: Re: segmentation fault when cassert enabled
Date: 2019-12-06 12:00:01
Message-ID: CAA4eK1+9vQb34faKLJbY6KD62HZOJ5Jm9PzoZGiK_9J7cvyDdw@mail.gmail.com
Lists: pgsql-hackers

On Mon, Nov 25, 2019 at 8:25 PM Jehan-Guillaume de Rorthais
<jgdr(at)dalibo(dot)com> wrote:
>
> On Wed, 6 Nov 2019 14:34:38 +0100
> Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com> wrote:
>
> > On 2019-11-05 17:29, Jehan-Guillaume de Rorthais wrote:
> > > My best bet so far is that logicalrep_relmap_invalidate_cb is not called
> > > after the DDL on the subscriber so the relmap cache is not invalidated. So
> > > we end up with slot->tts_tupleDescriptor->natts greater than
> > > rel->remoterel->natts in slot_store_cstrings, leading to the overflow on
> > > attrmap and the SIGSEGV.
> >
> > It looks like something like that is happening. But it shouldn't.
> > Different table schemas on publisher and subscriber are well supported,
> > so this must be an edge case of some kind. Please continue investigating.
>
> I've been able to find the origin of the crash, but it was a long journey.
>
> <debugger hard life>
>
> I was unable to debug using gdb record because of this famous error:
>
> Process record does not support instruction 0xc5 at address 0x1482758a4b30.
>
> Program stopped.
> __memset_avx2_unaligned_erms ()
> at ../sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S:168
> 168 ../sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S: No such
> file or directory.
>
> Trying with rr, I constantly hit "stack depth limit exceeded", even with an
> unlimited stack limit. Is it worth opening a discussion or a wiki page about
> these tools? Peter, it looks like you have some experience with rr, any feedback?
>
> Finally, Julien Rouhaud spent some time with me after work hours, answering
> my questions about some parts of the code, and pointed me to the excellent
> backtrace_functions GUC added a few days ago. This finally did the trick to
> find out what was happening. Many thanks Julien!
>
> </debugger hard life>
>
> Back to the bug itself. Consider a working logical replication setup with
> constant update/insert activity, e.g. pgbench running against the provider.
>
> Now, on the subscriber side, a session issues an "ALTER TABLE ADD
> COLUMN" on a subscribed table, e.g. pgbench_branches. A cache invalidation
> message is then pending for this table.
>
> In the meantime, the logical replication worker receives an UPDATE to apply. It
> opens the local relation using "logicalrep_rel_open". It finds that the related
> entry in LogicalRepRelMap is valid, so it does not update its attrmap
> and directly opens and locks the local relation:
>

- /* Try to find and lock the relation by name. */
+ /* Try to find the relation by name */
relid = RangeVarGetRelid(makeRangeVar(remoterel->nspname,
remoterel->relname, -1),
- lockmode, true);
+ NoLock, true);

I think we can't do this because it could lead to locking the wrong
reloid. See RangeVarGetRelidExtended. It ensures that, after locking
the relation (which includes accepting invalidation messages), the
reloid is still correct. Changing the code in the way you are
suggesting can lead to locking an incorrect reloid.
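
To illustrate the concern, here is a minimal, self-contained sketch of the
"look up, lock, re-check" pattern. It is only a toy model and not the actual
RangeVarGetRelidExtended code: ToyCatalog, toy_lookup, toy_lock, toy_unlock,
lookup_then_lock and lookup_lock_recheck are all hypothetical names, and the
assumption that taking the lock is the point where a pending change becomes
visible stands in for accepting invalidation messages.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

typedef unsigned int Oid;
#define InvalidOid 0u

/* Hypothetical toy catalog holding a single relation name. */
typedef struct ToyCatalog
{
    const char *relname;     /* the only name this catalog knows */
    Oid         current_oid; /* what a fresh lookup returns */
    Oid         pending_oid; /* concurrent change not yet visible, or InvalidOid */
    Oid         locked_oid;  /* OID we currently hold a lock on, or InvalidOid */
} ToyCatalog;

/* Resolve a name to an OID without taking any lock. */
Oid
toy_lookup(const ToyCatalog *cat, const char *name)
{
    return strcmp(cat->relname, name) == 0 ? cat->current_oid : InvalidOid;
}

/*
 * Lock an OID.  Assumption of this model: acquiring the lock is where the
 * pending change becomes visible, so the name may resolve differently
 * afterwards.
 */
void
toy_lock(ToyCatalog *cat, Oid oid)
{
    cat->locked_oid = oid;
    if (cat->pending_oid != InvalidOid)
    {
        cat->current_oid = cat->pending_oid;
        cat->pending_oid = InvalidOid;
    }
}

/* Release the lock if we hold it on this OID. */
void
toy_unlock(ToyCatalog *cat, Oid oid)
{
    if (cat->locked_oid == oid)
        cat->locked_oid = InvalidOid;
}

/* Unsafe variant: resolve with no lock, lock afterwards, never re-check. */
Oid
lookup_then_lock(ToyCatalog *cat, const char *name)
{
    Oid oid = toy_lookup(cat, name);

    if (oid != InvalidOid)
        toy_lock(cat, oid);
    return oid;             /* may no longer be what the name resolves to */
}

/* Safer variant: lock, resolve again, and retry until the answer is stable. */
Oid
lookup_lock_recheck(ToyCatalog *cat, const char *name)
{
    for (;;)
    {
        Oid oid = toy_lookup(cat, name);

        if (oid == InvalidOid)
            return InvalidOid;
        toy_lock(cat, oid);
        if (toy_lookup(cat, name) == oid)
            return oid;     /* name still maps to the OID we locked */
        toy_unlock(cat, oid);   /* stale: drop the lock and try again */
    }
}
```

In this model, when a change is pending, the unsafe variant ends up holding a
lock on an OID the name no longer resolves to, while the retry loop ends up
holding the lock on the OID the name currently maps to.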

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
