Re: segmentation fault when cassert enabled

From: vignesh C <vignesh21(at)gmail(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Jehan-Guillaume de Rorthais <jgdr(at)dalibo(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Peter Geoghegan <pg(at)bowt(dot)ie>
Subject: Re: segmentation fault when cassert enabled
Date: 2019-12-13 06:40:07
Message-ID: CALDaNm0kVT3+ku99M6=Wy9X=3=-H8jp_Yt7WVuTraPuREZugxQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Fri, Dec 6, 2019 at 5:30 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>
> On Mon, Nov 25, 2019 at 8:25 PM Jehan-Guillaume de Rorthais
> <jgdr(at)dalibo(dot)com> wrote:
> >
> > On Wed, 6 Nov 2019 14:34:38 +0100
> > Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com> wrote:
> >
> > > On 2019-11-05 17:29, Jehan-Guillaume de Rorthais wrote:
> > > > My best bet so far is that logicalrep_relmap_invalidate_cb is not called
> > > > after the DDL on the subscriber so the relmap cache is not invalidated. So
> > > > we end up with slot->tts_tupleDescriptor->natts superior than
> > > > rel->remoterel->natts in slot_store_cstrings, leading to the overflow on
> > > > attrmap and the sigsev.
> > >
> > > It looks like something like that is happening. But it shouldn't.
> > > Different table schemas on publisher and subscriber are well supported,
> > > so this must be an edge case of some kind. Please continue investigating.
> >
> > I've been able to find the origin of the crash, but it was a long journey.
> >
> > <debugger hard life>
> >
> > I was unable to debug using gdb record because of this famous error:
> >
> > Process record does not support instruction 0xc5 at address 0x1482758a4b30.
> >
> > Program stopped.
> > __memset_avx2_unaligned_erms ()
> > at ../sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S:168
> > 168 ../sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S: No such
> > file or directory.
> >
> > Trying with rr, I had constant "stack depth limit exceeded", even with
> > unlimited one. Does it worth opening a discussion or a wiki page about these
> > tools? Peter, it looks like you have some experience with rr, any feedback?
> >
> > Finally, Julien Rouhaud spend some time with me after work hours, a,swering
> > my questions about some parts of the code and pointed me to the excellent
> > backtrace_functions GUC addition few days ago. This finally did the trick to
> > find out what was happening. Many thanks Julien!
> >
> > </debugger hard life>
> >
> > Back to the bug itself. Consider a working logical replication with constant
> > update/insert activity, eg. pgbench running against provider.
> >
> > Now, on the subscriber side, a session issue an "ALTER TABLE ADD
> > COLUMN" on a subscribed table, eg. pgbench_branches. A cache invalidation
> > message is then pending for this table.
> >
> > In the meantime, the logical replication worker receive an UPDATE to apply. It
> > opens the local relation using "logicalrep_rel_open". It finds the related
> > entry in LogicalRepRelMap is valid, so it does not update its attrmap
> > and directly opens and locks the local relation:
> >
>
> - /* Try to find and lock the relation by name. */
> + /* Try to find the relation by name */
> relid = RangeVarGetRelid(makeRangeVar(remoterel->nspname,\
> remoterel->relname, -1),
> - lockmode, true);
> + NoLock, true);
>
> I think we can't do this because it could lead to locking the wrong
> reloid. See RangeVarGetRelidExtended. It ensures that after locking
> the relation (which includes accepting invalidation messages), that
> the reloid is correct. I think changing the code in the way you are
> suggesting can lead to locking incorrect reloid.
>

I have made changes to fix the comment provided. The patch for the
same is attached. Could not add a test case for this scenario is based
on timing issue.
Thoughts?

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com

Attachment Content-Type Size
0001-Fix-subscriber-invalid-memory-access-on-DDL.patch application/x-patch 3.6 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Amit Kapila 2019-12-13 06:49:58 Re: [HACKERS] Block level parallel vacuum
Previous Message Tom Lane 2019-12-13 06:26:11 Re: Errors "failed to construct the join relation" and "failed to build any 2-way joins"