Re: Some pgq table rewrite incompatibility with logical decoding?

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: Some pgq table rewrite incompatibility with logical decoding?
Date: 2018-08-09 17:43:58
Message-ID: 19de930f-044f-2ee8-44b5-503c47450a35@2ndquadrant.com
Lists: pgsql-hackers

On 06/25/2018 07:48 PM, Jeremy Finzel wrote:
>
>
> On Mon, Jun 25, 2018 at 12:41 PM, Andres Freund <andres(at)anarazel(dot)de
> <mailto:andres(at)anarazel(dot)de>> wrote:
>
> Hi,
>
> On 2018-06-25 10:37:18 -0500, Jeremy Finzel wrote:
> > I am hoping someone here can shed some light on this issue - I apologize if
> > this isn't the right place to ask this but I'm almost sure some of you were
> > involved in pgq's dev and might be able to answer this.
> >
> > We are actually running 2 replication technologies on a few of our dbs,
> > skytools and pglogical.  Although we are moving towards only using logical
> > decoding-based replication, right now we have both for different purposes.
> >
> > There seems to be a table rewrite happening on table pgq.event_58_1 that
> > has happened twice, and it ends up in the decoding stream, resulting in the
> > following error:
> >
> > ERROR,XX000,"could not map filenode ""base/16418/1173394526"" to relation
> > OID"
> >
> > In retracing what happened, we discovered that this relfilenode was
> > rewritten.  But somehow, it is ending up in the logical decoding stream and
> > is "undecodable".  This is pretty disastrous because the only way to fix it
> > really is to advance the replication slot and lose data.
> >
> > The only obvious table rewrite I can find in the pgq codebase is a truncate
> > in pgq.maint_rotate_tables.sql.  But there isn't anything surprising
> > there.  If anyone has any ideas as to what might cause this so that we
> > could somehow mitigate the possibility of this happening again until we
> > move off pgq, that would be much appreciated.
>
> I suspect the issue might be that pgq does some updates to catalog
> tables. Is that indeed the case?
>
>
> I also suspected this.  The only case I found of this is that it is
> doing deletes and inserts to pg_autovacuum.  I could not find anything
> quickly otherwise but I'm not sure if I'm missing something in some of
> the C code.
>

I don't think catalog updates by pgq are the cause, for two reasons.

Firstly, I don't think pgq updates catalogs directly; it simply
truncates the queue tables when rotating them (which updates the
relfilenode in pg_class, of course).
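
To illustrate the relfilenode change, here is a minimal sketch on a
scratch table. The table name queue_demo is made up for the example and
has nothing to do with pgq's actual queue tables; each statement is
assumed to run in its own transaction (autocommit).

```sql
-- Hypothetical scratch table, for illustration only.
CREATE TABLE queue_demo (id bigint, payload text);
INSERT INTO queue_demo VALUES (1, 'x');

-- Note the filenode currently backing the table.
SELECT pg_relation_filenode('queue_demo');

-- TRUNCATE gives the table new storage, so pg_class.relfilenode changes.
TRUNCATE queue_demo;

-- Expect a different value here than in the first query.
SELECT pg_relation_filenode('queue_demo');
```

WAL records written before the TRUNCATE still refer to the old filenode,
which is the kind of value that shows up in the "could not map filenode"
error.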

Secondly, we're occasionally seeing this on systems that do not use pgq,
but that do VACUUM FULL on custom "queue" tables. The symptoms are
exactly the same ("ERROR: could not map filenode"). It's pretty damn
rare and we don't have direct access to the systems, so investigation is
difficult :-( Our current hypothesis is that it's somewhat related to
subtransactions (because of triggers with exception blocks).
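
For reference, the general shape of such a setup looks roughly like the
sketch below. All object names are invented for illustration, and this
is not a known reproducer; it only shows where the subtransaction and
the table rewrite come from.

```sql
-- Hypothetical names throughout; not a confirmed reproducer.
CREATE TABLE custom_queue (id bigserial PRIMARY KEY, payload text);
CREATE TABLE custom_queue_log (queue_id bigint, logged_at timestamptz);

CREATE FUNCTION custom_queue_log_fn() RETURNS trigger
LANGUAGE plpgsql AS $$
BEGIN
    BEGIN
        INSERT INTO custom_queue_log VALUES (NEW.id, now());
    EXCEPTION WHEN OTHERS THEN
        -- The EXCEPTION clause makes this inner block run in a
        -- subtransaction.
        RAISE NOTICE 'logging failed: %', SQLERRM;
    END;
    RETURN NEW;
END;
$$;

CREATE TRIGGER custom_queue_log_trg
    AFTER INSERT ON custom_queue
    FOR EACH ROW EXECUTE PROCEDURE custom_queue_log_fn();

-- Regular queue traffic fires the trigger (and its subtransaction).
INSERT INTO custom_queue (payload) VALUES ('event');

-- Periodic maintenance rewrites the table, assigning a new relfilenode.
VACUUM FULL custom_queue;
```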

Jeremy, are you able to reproduce the issue locally, using pgq? That
would be very valuable.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
