Re: logical replication worker accesses catalogs in error context callback

From: Andres Freund <andres(at)anarazel(dot)de>
To: Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Zhihong Yu <zyu(at)yugabyte(dot)com>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Peter Eisentraut <peter_e(at)gmx(dot)net>, Petr Jelinek <petr(at)2ndquadrant(dot)com>, Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>
Subject: Re: logical replication worker accesses catalogs in error context callback
Date: 2021-02-04 19:54:59
Message-ID: 20210204195459.ve7wdhmc75vy6df7@alap3.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

On 2021-02-04 15:08:42 +0530, Bharath Rupireddy wrote:
> On Thu, Feb 4, 2021 at 5:16 AM Andres Freund <andres(at)anarazel(dot)de> wrote:
> > > > > About 0001, have we tried to reproduce the actual bug here which means
> > > > > when the error_callback is called we should face some problem? I feel
> > > > > with the correct testcase we should hit the Assert
> > > > > (Assert(IsTransactionState());) in SearchCatCacheInternal because
> > > > > there we expect the transaction to be in a valid state. I understand
> > > > > that the transaction is in a broken state at that time but having a
> > > > > testcase to hit the actual bug makes it easy to test the fix.
> > > >
> > > > I think it's easy to hit - add a trivial DEBUG message to XLogWrite(),
> > > > and run anything involving logical rep using a low enough log
> > > > level. There might even be enough messages with a low enough level
> > > > already somewhere in the relevant paths.
> > >
> > > I'm not sure how adding a DEBUG message to XLogWrite() would ensure
> > > the logical replication worker on the subscriber system enters
> > > slot_store_error_callback and hits the Assert(IsTransactionState());?
> > > Could you please elaborate? Or I may be missing something here to
> > > understand.
> >
> > XLogWrite() is in a critical section, the DEBUG messages triggers error
> > context callbacks to be called, the callbacks allocate memory, which
> > triggers an assertion.
>
> I see that XLogWrite() can be called from logical replication
> worker(apply_handle_commit->apply_handle_commit_internal->CommitTransactionCommand->CommitTransaction->RecordTransactionCommit->XLogFlush->XLogWrite
> --> by now the txn state is not TRANS_INPROGRESS, so
> IsTransactionState() can return false. Adding a DEBUG message there
> can call the error context callback.
>
> But the error context callback slot_store_error_callback, gets used in
> slot_store_data and slot_modify_data. In both places we just register
> the callback before an attribute processing for loop and deregister
> after it. So, when we are in those for loops, we expect to be in
> TRANS_INPROGRESS, see ensure_transaction before slot_store_data and
> the same is true for slot_modify_data. So, to really hit the assert
> failure in the catalogue access from slot_store_error_callback, first,
> we shouldn't be in TRANS_INPROGRESS in those for loops and have a
> DEBUG message. I'm not exactly sure how we achieve this.

I think you're focussing too much on IsTransactionState(). It's easier
to hit the critical section issues.

What makes the issue a bit hard to hit is that to hit it you need a
syscache miss while slot_store_error_callback is set up, and that cache
miss then needs to trigger a read from disk (to potentially trigger a
WAL flush), and then the slot_store_error_callback() needs to actually
do something.

With the below patch

$ cat /tmp/tmp_config.conf
log_min_messages=debug5
shared_buffers=1MB

make TEMP_CONFIG=/tmp/extra_config.conf -C src/test/subscription/ check

triggers an assertion most of the time. The catcache.c changes aren't
necessary for this test, but we should have them...

Regards,

Andres

diff --git c/src/backend/access/transam/xlog.c i/src/backend/access/transam/xlog.c
index f03bd473e2b..7f422b4f5f2 100644
--- c/src/backend/access/transam/xlog.c
+++ i/src/backend/access/transam/xlog.c
@@ -2432,6 +2432,8 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
int startidx;
uint32 startoffset;

+ elog(DEBUG1, "writing!");
+
/* We should always be inside a critical section here */
Assert(CritSectionCount > 0);

@@ -2885,6 +2887,8 @@ XLogFlush(XLogRecPtr record)

START_CRIT_SECTION();

+ elog(DEBUG1, "checking for flushing");
+
/*
* Since fsync is usually a horribly expensive operation, we try to
* piggyback as much data as we can on each fsync: if we see any more data
diff --git c/src/backend/replication/logical/worker.c i/src/backend/replication/logical/worker.c
index eb7db89cef7..19aa76af889 100644
--- c/src/backend/replication/logical/worker.c
+++ i/src/backend/replication/logical/worker.c
@@ -435,6 +435,8 @@ slot_store_error_callback(void *arg)
Oid remotetypoid,
localtypoid;

+ Assert(IsTransactionState() && CritSectionCount == 0);
+
/* Nothing to do if remote attribute number is not set */
if (errarg->remote_attnum < 0)
return;
diff --git c/src/backend/utils/cache/catcache.c i/src/backend/utils/cache/catcache.c
index fa2b49c676e..c7335c08570 100644
--- c/src/backend/utils/cache/catcache.c
+++ i/src/backend/utils/cache/catcache.c
@@ -1210,7 +1210,7 @@ SearchCatCacheInternal(CatCache *cache,
CatCTup *ct;

/* Make sure we're in an xact, even if this ends up being a cache hit */
- Assert(IsTransactionState());
+ Assert(IsTransactionState() && CritSectionCount == 0);

Assert(cache->cc_nkeys == nkeys);

@@ -1523,6 +1523,9 @@ SearchCatCacheList(CatCache *cache,
MemoryContext oldcxt;
int i;

+ /* Make sure we're in an xact, even if this ends up being a cache hit */
+ Assert(IsTransactionState() && CritSectionCount == 0);
+
/*
* one-time startup overhead for each cache
*/

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Robert Haas 2021-02-04 21:04:10 Re: new heapcheck contrib module
Previous Message Heikki Linnakangas 2021-02-04 19:37:54 Re: Bug in COPY FROM backslash escaping multi-byte chars