Re: [Incident report]Backend process crashed when executing 2pc transaction

From: Marco Slot <marco(at)citusdata(dot)com>
To: Amit Langote <amitlangote09(at)gmail(dot)com>
Cc: LIANGBO <liangboa(at)suning(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: [Incident report]Backend process crashed when executing 2pc transaction
Date: 2019-11-28 08:01:55
Message-ID: CANNhMLAjdTUzdwL50f8LX09je1jh+bZ6C4i=iZh8hgDEH0i0QA@mail.gmail.com
Lists: pgsql-hackers

On Thu, Nov 28, 2019 at 6:18 AM Amit Langote <amitlangote09(at)gmail(dot)com> wrote:
> Interesting. Still, I think you'd be in better position than anyone
> else to come up with reproduction steps for vanilla PostgreSQL by
> analyzing the stack trace if and when the crash next occurs (or using
> the existing core dump). It's hard to tell by only guessing what may
> have gone wrong when there is external code involved, especially
> something like Citus that hooks into many points within vanilla
> PostgreSQL.

To clarify: In a Citus cluster you typically have a coordinator which
contains the "distributed tables" and one or more workers which
contain the data. All are PostgreSQL servers with the citus extension.
The coordinator uses every available hook in PostgreSQL to make the
distributed tables behave like regular tables. Any crash on the
coordinator is likely to be attributable to Citus, because most of the
code that is exercised is Citus code. The workers are used as regular
PostgreSQL servers with the coordinator acting as a regular client. On
the worker, the ProcessUtility hook will just pass on the arguments to
standard_ProcessUtility without any processing. The crash happened on
a worker.

One interesting thing is the prepared transaction name generated by
the coordinator, which follows the form: citus_<coordinator node
id>_<pid>_<server-wide transaction number>_<prepared transaction
number in session>. The server-wide transaction number is a 64-bit
counter that is kept in shared memory and starts at 1. That means that
over 4 billion (4207001212) transactions happened on the coordinator
since the server started, which quite possibly resulted in 4 billion
prepared transactions on this particular server. I'm wondering if some
counter is overflowing.
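
As a rough illustration of the name format and the overflow question
above, here is a minimal Python sketch. It splits a name of that form
into its fields and reports where the transaction number sits relative
to the 32-bit limits. The node id (0), pid (12345) and per-session
number (0) in the sample name are made up; only the transaction number
4207001212 comes from this thread. The helper names are hypothetical
and none of this is Citus or PostgreSQL code.

```python
from typing import NamedTuple

INT32_MAX = 2**31 - 1    # 2147483647
UINT32_MAX = 2**32 - 1   # 4294967295


class CitusGid(NamedTuple):
    """Fields of a name shaped like citus_<node id>_<pid>_<txn no>_<n>."""
    node_id: int
    pid: int
    transaction_number: int
    prepared_number: int


def parse_citus_gid(gid: str) -> CitusGid:
    """Split a prepared transaction name of the form described above."""
    parts = gid.split("_")
    if len(parts) != 5 or parts[0] != "citus":
        raise ValueError(f"unexpected prepared transaction name: {gid!r}")
    return CitusGid(*(int(p) for p in parts[1:]))


def describe_counter(value: int) -> str:
    """Say which 32-bit ranges a counter value still fits in."""
    if value > UINT32_MAX:
        return "exceeds unsigned 32-bit range"
    if value > INT32_MAX:
        return "exceeds signed 32-bit range, fits in unsigned 32-bit"
    return "fits in signed 32-bit range"


# Hypothetical name: only the third field is taken from the thread.
sample = parse_citus_gid("citus_0_12345_4207001212_0")
print(describe_counter(sample.transaction_number))
print("headroom before 2^32:", UINT32_MAX - sample.transaction_number)
```

The value is already past what a signed 32-bit integer can hold and is
within roughly 88 million of the unsigned 32-bit limit, so any 32-bit
counter that tracks it closely would be a plausible suspect. Whether
such a counter is actually involved is not established here.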

cheers,
Marco
