Re: [Incident report]Backend process crashed when executing 2pc transaction

From: Amit Langote <amitlangote09(at)gmail(dot)com>
To: Marco Slot <marco(at)citusdata(dot)com>
Cc: LIANGBO <liangboa(at)suning(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: [Incident report]Backend process crashed when executing 2pc transaction
Date: 2019-11-28 09:01:18
Message-ID: CA+HiwqGxmSxu8e07sNLEmKJqFm7-69QhidjA+huA1ifm0n1CnA@mail.gmail.com
Lists: pgsql-hackers

Hi Marco,

On Thu, Nov 28, 2019 at 5:02 PM Marco Slot <marco(at)citusdata(dot)com> wrote:
>
> On Thu, Nov 28, 2019 at 6:18 AM Amit Langote <amitlangote09(at)gmail(dot)com> wrote:
> > Interesting. Still, I think you'd be in a better position than anyone
> > else to come up with reproduction steps for vanilla PostgreSQL by
> > analyzing the stack trace if and when the crash next occurs (or using
> > the existing core dump). It's hard to tell by only guessing what may
> > have gone wrong when there is external code involved, especially
> > something like Citus that hooks into many points within vanilla
> > PostgreSQL.
>
> To clarify: In a Citus cluster you typically have a coordinator which
> contains the "distributed tables" and one or more workers which
> contain the data. All are PostgreSQL servers with the citus extension.
> The coordinator uses every available hook in PostgreSQL to make the
> distributed tables behave like regular tables. Any crash on the
> coordinator is likely to be attributable to Citus, because most of the
> code that is exercised is Citus code. The workers are used as regular
> PostgreSQL servers with the coordinator acting as a regular client. On
> the worker, the ProcessUtility hook will just pass on the arguments to
> standard_ProcessUtility without any processing. The crash happened on
> a worker.

Thanks for clarifying.

> One interesting thing is the prepared transaction name generated by
> the coordinator, which follows the form: citus_<coordinator node
> id>_<pid>_<server-wide transaction number>_<prepared transaction
> number in session>. The server-wide transaction number is a 64-bit
> counter that is kept in shared memory and starts at 1. That means that
> over 4 billion (4207001212) transactions happened on the coordinator
> since the server started, which quite possibly resulted in 4 billion
> prepared transactions on this particular server. I'm wondering if some
> counter is overflowing.

Interesting. This does get us somewhat closer to figuring out what
might have gone wrong, but it's hard to tell without the core dump at hand.
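
To make the overflow question concrete, here is a rough Python sketch that
splits a name of the form described above and checks the reported number
against the 32-bit limits. The example name and the helper names are
hypothetical; only the value 4207001212 comes from this thread. Whether any
code path actually stores that number in a 32-bit field is an assumption to
be checked against the core dump, not something established here.

```python
INT32_MAX = 2**31 - 1    # 2147483647
UINT32_MAX = 2**32 - 1   # 4294967295


def parse_citus_gid(gid):
    """Split citus_<node id>_<pid>_<txn number>_<prepared txn number>."""
    parts = gid.split("_")
    if len(parts) != 5 or parts[0] != "citus":
        raise ValueError("unexpected prepared transaction name: %r" % gid)
    node_id, pid, txn_number, prepared_number = (int(p) for p in parts[1:])
    return {
        "node_id": node_id,
        "pid": pid,
        "txn_number": txn_number,
        "prepared_number": prepared_number,
    }


def as_int32(value):
    """Reinterpret the low 32 bits of value as a signed 32-bit integer."""
    value &= 0xFFFFFFFF
    return value - 2**32 if value > INT32_MAX else value


# Hypothetical name: node id, pid and the trailing number are made up.
info = parse_citus_gid("citus_0_12345_4207001212_0")
txn = info["txn_number"]

print("fits in signed 32-bit:  ", txn <= INT32_MAX)
print("fits in unsigned 32-bit:", txn <= UINT32_MAX)
print("value if read as int32: ", as_int32(txn))
```

The reported number is above the signed 32-bit maximum but still below the
unsigned one, so a signed 32-bit counter or field would be the first place
to look if an overflow is involved.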

Thanks,
Amit
