Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From: Dilip Kumar <dilipbalaut(at)gmail(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Erik Rijkers <er(at)xs4all(dot)nl>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Kuntal Ghosh <kuntalghosh(dot)2007(at)gmail(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
Date: 2020-09-14 11:37:38
Message-ID: CAFiTN-uGxgo5258hZy2QJoz=s7_Cs7v9=b8Z2GgFV7qmQUOwxw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Mon, Sep 14, 2020 at 4:50 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>
> On Mon, Sep 14, 2020 at 1:23 PM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
> >
> > On Mon, Sep 14, 2020 at 8:48 AM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> > >
> > >
> > > Yeah, this is right, and here is some initial analysis. It seems to be
> > > failing in below code:
> > > rel_sync_cache_relation_cb(){ ...list_free(entry->streamed_txns);..}
> > >
> > > This list can have elements only in 'streaming' mode (need to enable
> > > 'streaming' with Create Subscription command) whereas none of the
> > > tests in 010_truncate.pl is using 'streaming', so this list should be
> > > empty (NULL). The two different assertion failures shown in BF reports
> > > in list_free code are as below:
> > > Assert(list->length > 0);
> > > Assert(list->length <= list->max_length);
> > >
> > > It seems to me that this list is not initialized properly when it is
> > > not used or maybe that is true in some special circumstances because
> > > we initialize it in get_rel_sync_entry(). I am not sure if CCI build
> > > is impacting this in some way.
> >
> >
> > Even I have analyzed this but did not find any reason why the
> > streamed_txns list should be anything other than NULL. The only thing
> > is we are initializing the entry->streamed_txns to NULL and the list
> > free is checking "if (list == NIL)" then return. However IMHO, that
> > should not be an issue becase NIL is defined as (List*) NULL.
> >
>
> Yeah, that is not the issue but it is better to initialize it with NIL
> for the sake of consistency. The basic issue here was we were trying
> to open/lock the relation(s) before initializing this list. Now, when
> we process the invalidations during open relation, we try to access
> this list in rel_sync_cache_relation_cb and that leads to assertion
> failure. I have reproduced the exact scenario of 010_truncate.pl via
> debugger. Basically, the backend on publisher has sent the
> invalidation after truncating the relation 'tab1' and while processing
> the truncate message if WALSender receives that message exactly after
> creating the RelSyncEntry for 'tab1', the Assertion shown in BF can be
> reproduced.

Yeah, this is an issue and I am also able to reproduce this manually
using gdb. Basically, I have inserted some data in publication table
and after that, I stopped in get_rel_sync_entry after creating the
reentry and before calling GetRelationPublications. Meanwhile, I have
truncated this table and then it hit the same issue you pointed here.

> The attached patch will fix the issue. What do you think?

The patch looks good to me and fixing the reported issue.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Anastasia Lubennikova 2020-09-14 11:38:56 Re: [PATCH] Automatic HASH and LIST partition creation
Previous Message David Rowley 2020-09-14 11:34:53 Re: Division in dynahash.c due to HASH_FFACTOR