Re: BUG #15808: ERROR: subtransaction logged without previous top-level txn record (SQLSTATE XX000)

From: Dave Cramer <davecramer(at)gmail(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: mansour(at)oxplot(dot)com, pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: BUG #15808: ERROR: subtransaction logged without previous top-level txn record (SQLSTATE XX000)
Date: 2019-09-06 20:39:15
Message-ID: CADK3HHL97Z3ZsDp0WUPWjjZzFZsyP3Po1LJ4xcjC=JjgtUiZOQ@mail.gmail.com
Lists: pgsql-bugs

On Thu, 16 May 2019 at 13:04, Andres Freund <andres(at)anarazel(dot)de> wrote:

> Hi,
>
> On 2019-05-16 04:56:15 +0000, PG Bug reporting form wrote:
> > The following bug has been logged on the website:
> >
> > Bug reference: 15808
> > Logged by: Mansour Behabadi
> > Email address: mansour(at)oxplot(dot)com
> > PostgreSQL version: 10.6
> > Operating system: Amazon RDS
> > Description:
> >
> > We have some custom logical replication client that makes
> > pg_logical_slot_get_changes() calls in SQL. E.g.:
>
> Unrelated to the bug: You really should use the streaming
> interface. It's much, much, much more efficient.
>
> https://www.postgresql.org/docs/current/logicaldecoding-walsender.html
>
>
> > Once every few thousand calls, we get the following error:
> >
> > ERROR: subtransaction logged without previous top-level txn record (SQLSTATE XX000)
> >
> > which will persist on all subsequent calls, essentially forcing us to drop
> > the slot and create a new one.
>
> That obviously shouldn't happen.
>
>
> > We had little success looking for solutions online and the only lead is that
> > of a recent commit
> > (https://github.com/postgres/postgres/commit/f49a80c481f74fa81407dce8e51dea6956cb64f8)
> > whose commit message seems to correlate with the error we're getting. Below is
> > the relevant excerpt:
> >
> > The second issue concerns SnapBuilder snapshots and subtransactions.
> > SnapBuildDistributeNewCatalogSnapshot never assigned a snapshot to a
> > transaction that is known to be a subtxn, which is good in the common
> > case that the top-level transaction already has one (no point in doing
> > so), but a bug otherwise. To fix, arrange to transfer the snapshot from
> > the subtxn to its top-level txn as soon as the kinship gets known.
> > test_decoding's snapshot_transfer verifies this.
>
> That seems unrelated to the error message you're getting.
>
>
> > We're not sure if this is a fix to our problem and whether upgrading to
> > Postgres 11 (which has this change in it) will solve the issue.
>
> Note that this change isn't just in 11:
>
> Author: Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>
> Branch: master Release: REL_11_BR [f49a80c48] 2018-06-26 16:48:10 -0400
> Branch: REL_10_STABLE Release: REL_10_5 [b767b3f2e] 2018-06-26 16:38:34 -0400
> Branch: REL9_6_STABLE Release: REL9_6_10 [da10d6a8a] 2018-06-26 16:38:34 -0400
> Branch: REL9_5_STABLE Release: REL9_5_14 [4cb6f7837] 2018-06-26 16:38:34 -0400
> Branch: REL9_4_STABLE Release: REL9_4_19 [962313558] 2018-06-26 16:38:34 -0400
>
>
> > Please let me know if any more info is needed.
>
> The easiest way to progress here would be a recipe to reproduce the
> problem. As long as the problem is on RDS, we unfortunately can't really
> debug this - neither can we modify the source to emit more debugging
> information, nor can we inspect the WAL files ourselves (I think).
>
> It's possible that trying to reproduce this on RDS with the debug level
> set to very high (debug5) would allow for a bit more insight. But I'm
> somewhat doubtful.
>
>
Andres,

I may have someone who would be able to run this in a non-RDS
environment.

It's unlikely we have a reproducible test case, but it's likely we can
modify the code on their boxes for debugging and/or get WAL files for
inspection.

This is on 9.6.14, so the above fix should be in it.
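
As a quick sanity check on that version reasoning, here is a minimal Python sketch that compares a server version against the first minor release carrying the fix on each back branch, using the release tags from the commit listing quoted above. The names `has_fix` and `FIRST_FIXED_MINOR` are made up for illustration, and treating branches that are not in the listing as lacking the fix is an assumption.

```python
# First minor release on each back branch that contains the fix, read off
# the commit listing quoted above (REL_10_5, REL9_6_10, REL9_5_14, REL9_4_19).
FIRST_FIXED_MINOR = {
    (10,): 5,
    (9, 6): 10,
    (9, 5): 14,
    (9, 4): 19,
}


def has_fix(version: str) -> bool:
    """Return True if the given server version should contain the fix."""
    parts = tuple(int(p) for p in version.split("."))
    # From 10 on the major version is one component ("10.6");
    # before that it is two components ("9.6.14").
    width = 1 if parts[0] >= 10 else 2
    major = parts[:width]
    minor = parts[width] if len(parts) > width else 0
    if major[0] >= 11:
        # The commit is tagged REL_11_BR, so every 11.x release has it.
        return True
    first_fixed = FIRST_FIXED_MINOR.get(major)
    if first_fixed is None:
        # Assumption: a branch missing from the listing never got the fix.
        return False
    return minor >= first_fixed


print(has_fix("9.6.14"))  # version mentioned in this reply
print(has_fix("10.6"))    # version in the original report
```

Both calls print True, so neither the 9.6.14 installation nor the originally reported 10.6 predates that commit.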

I'm willing to facilitate if you can provide some direction.

Dave
